Reaves.dev

v0.1.0

built using

Phoenix v1.7.12

DB Normalization

Stephen M. Reaves

::

2023-07-12

Notes about Lesson 10 of CS-6400

Summary

Whats it all about?

Given a relation and a set of functional dependencies like this:

RegularUser
EmailInterestSinceAgeBirthYearCurrentCitySalary
[email protected]Music101985Seattle27,000
[email protected]Reading51985Seattle27,000
[email protected]Tennis141985Seattle27,000
[email protected]NULLNULL1988Las Vegas24,000
[email protected]DIY181988San Fransciso24,000

Given an email, we know BirthYear, CurrentCity and Salary

EmailBirthYear,CurrentCity,SalaryEmail \rightarrow BirthYear, CurrentCity, Salary

Given an Email and Interest we know the SinceAge

Email,InterestSinceAgeEmail, Interest \rightarrow SinceAge

Given a BirthYear we know the Salary

BirthYearSalaryBirthYear \rightarrow Salary

How do we decompose this relation into smaller relations without losing information and so that the functional dependencies can still be enforced?

Rules

  1. No redundancy of facts
  2. No cluttering of facts
  3. Must preserve information
  4. Must preserve functional dependencies

Multi-Values

It’s easy to have data in non-first normal form.

Like in the table above, each user has an email, but can have multiple Interests.

There are some problems with this.

Redundancy

The first problem is with redundancy.

For each Email, the same BirthYear, CurrentCity, and Salary are repeated

For each BirthYear, the same Salary is repeated

RegularUser
EmailInterestSinceAgeBirthYearCurrentCitySalary

[email protected]

Music10

1985

Seattle

27,000

[email protected]

Reading5

1985

Seattle

27,000

[email protected]

Tennis14

1985

Seattle

27,000

[email protected]NULLNULL

1988

Las Vegas

24,000

[email protected]DIY18

1988

San Fransciso

24,000


Redundancy can lead to inconsistency

Imagine a situation where user1 moves from Seattle to Atlanta. You need to make sure you change their CurrentCity in every instance, not just one.

Inserton Anomalies

Another issue can arrive when we insert data. If we insert a new user without any Interest, we need to insert NULL values for Interest and SinceAge. If we insert a pair of BirthYear and Salary then we need to insert NULL values for Email, Interest, SinceAge, and CurrentCity.

Deletion Anomalies

If we delete a user that is the only use born in a certain year, then we lose the information about what’s the average salary for that year.

Update Anomolies

These are similar to the previous anomalies.

Information Loss

If we decompose RegularUser into two relations, we could get too many rows back when we recombine them.

RegularUser
EmailInterestSinceAgeBirthYearCurrentCitySalary
[email protected]Music101985Seattle27,000
[email protected]Reading51985Seattle27,000
[email protected]Tennis141985Seattle27,000
[email protected]NULLNULL1988Las Vegas24,000
[email protected]DIY181988San Fransciso24,000

Can become

EmailInterestSinceAgeBirthYearCurrentCity
[email protected]Music101985Seattle
[email protected]Reading51985Seattle
[email protected]Tennis141985Seattle
[email protected]NULLNULL1988Las Vegas
[email protected]DIY181988San Fransciso

And

CurrentCitySalary
Seattle27,000
Seattle28,000
Las Vegas24,000
San Fransciso24,000

It’s possible to combine the last two tables and have multiple (different) values for a given CurrentCity. Adding more rows will mask what the actual value is, so we say this is information loss.

Dependency Loss

The decomposition above also causes us to lose the ability to enforce functional dependencies. We see this again with muliple salaries for a single user.

Perfect Decomposition

Instead we can decompose the original table into

EmailInterestSinceAge
[email protected]Music10
[email protected]Reading5
[email protected]Tennis14
[email protected]DIY18

EmailBirthYearCurrentCity
[email protected]1985Seattle
[email protected]1988Las Vegas
[email protected]1988San Fransciso

BirthYearSalary
198527,000
198824,000

Now we can throw away the main table and act only on the sub tables. When we want to see everything, we can join.

Functional Dependencies

Let X and Y be sets of attributes in R. Y is functionally dependent on X in R iff for each xR.X x \in R.X there is preciesly one yR.Y y \in R.Y

Full Function Dependencies

Let X and Y be sets of attributes in R. Y is fully functionally dependent on X iff Y is functionally on X AND Y is not functionally dependent on any proper subset of X.

Functional Dependencies and Keys

We use keys to enforce full functional dependencies.

Normal Forms

The set of all Data Structures is called Non-First Normal Form (NF2 \text{NF}^2 ). Most of these are not “relations” as we know them.

The set of First Normal Form (1NF) relations is a subset oft NF2 \text{NF}^2 . The set of Second Normal Form (2NF) relations is a subset of 1NF, and so on. Fourth Normal Form (4NF) is also called Boyce-Codd Normal Form (BCNF). BCNF is the goal.

Definitions

Here are some definitions:

“All attributes must depend on the key (1NF), the whole key (2NF), and nothing but the key (3NF), so help me Codd!”

Armstrong’s Rules

How to compute with functional dependencies (meaning).

Reflexivity := If Y is part of X, then X \rightarrow Y

Augmentation := If X \rightarrow Y, then WX \rightarrow WY

Transitivity := If X \rightarrow Y and Y \rightarrow Z, then X \rightarrow Z

How to guarantee lossless joins

The join field must be a key in at least one of the relations.