DB Normalization
Summary
- Whats it all about?
- Rules
- Multi-Values
- Information Loss
- Dependency Loss
- Perfect Decomposition
- Functional Dependencies
- Normal Forms
- Armstrong’s Rules
- How to guarantee lossless joins
Whats it all about?
Given a relation and a set of functional dependencies like this:
RegularUser |
---|
Interest | SinceAge | BirthYear | CurrentCity | Salary | |
---|---|---|---|---|---|
[email protected] | Music | 10 | 1985 | Seattle | 27,000 |
[email protected] | Reading | 5 | 1985 | Seattle | 27,000 |
[email protected] | Tennis | 14 | 1985 | Seattle | 27,000 |
[email protected] | NULL | NULL | 1988 | Las Vegas | 24,000 |
[email protected] | DIY | 18 | 1988 | San Fransciso | 24,000 |
Given an email, we know BirthYear, CurrentCity and Salary
Given an Email and Interest we know the SinceAge
Given a BirthYear we know the Salary
How do we decompose this relation into smaller relations without losing information and so that the functional dependencies can still be enforced?
Rules
- No redundancy of facts
- No cluttering of facts
- Must preserve information
- Must preserve functional dependencies
Multi-Values
It’s easy to have data in non-first normal form.
Like in the table above, each user has an email, but can have multiple Interests.
There are some problems with this.
Redundancy
The first problem is with redundancy.
For each Email, the same BirthYear, CurrentCity, and Salary are repeated
For each BirthYear, the same Salary is repeated
RegularUser |
---|
Interest | SinceAge | BirthYear | CurrentCity | Salary | |
---|---|---|---|---|---|
Music | 10 | 1985 | Seattle | 27,000 | |
Reading | 5 | 1985 | Seattle | 27,000 | |
Tennis | 14 | 1985 | Seattle | 27,000 | |
[email protected] | NULL | NULL | 1988 | Las Vegas | 24,000 |
[email protected] | DIY | 18 | 1988 | San Fransciso | 24,000 |
Redundancy can lead to inconsistency
Imagine a situation where user1 moves from Seattle to Atlanta. You need to make sure you change their CurrentCity in every instance, not just one.
Inserton Anomalies
Another issue can arrive when we insert data. If we insert a new user without any Interest, we need to insert NULL values for Interest and SinceAge. If we insert a pair of BirthYear and Salary then we need to insert NULL values for Email, Interest, SinceAge, and CurrentCity.
Deletion Anomalies
If we delete a user that is the only use born in a certain year, then we lose the information about what’s the average salary for that year.
Update Anomolies
These are similar to the previous anomalies.
Information Loss
If we decompose RegularUser into two relations, we could get too many rows back when we recombine them.
RegularUser |
---|
Interest | SinceAge | BirthYear | CurrentCity | Salary | |
---|---|---|---|---|---|
[email protected] | Music | 10 | 1985 | Seattle | 27,000 |
[email protected] | Reading | 5 | 1985 | Seattle | 27,000 |
[email protected] | Tennis | 14 | 1985 | Seattle | 27,000 |
[email protected] | NULL | NULL | 1988 | Las Vegas | 24,000 |
[email protected] | DIY | 18 | 1988 | San Fransciso | 24,000 |
Can become
Interest | SinceAge | BirthYear | CurrentCity | |
---|---|---|---|---|
[email protected] | Music | 10 | 1985 | Seattle |
[email protected] | Reading | 5 | 1985 | Seattle |
[email protected] | Tennis | 14 | 1985 | Seattle |
[email protected] | NULL | NULL | 1988 | Las Vegas |
[email protected] | DIY | 18 | 1988 | San Fransciso |
And
CurrentCity | Salary |
---|---|
Seattle | 27,000 |
Seattle | 28,000 |
Las Vegas | 24,000 |
San Fransciso | 24,000 |
It’s possible to combine the last two tables and have multiple (different) values for a given CurrentCity. Adding more rows will mask what the actual value is, so we say this is information loss.
Dependency Loss
The decomposition above also causes us to lose the ability to enforce functional dependencies. We see this again with muliple salaries for a single user.
Perfect Decomposition
Instead we can decompose the original table into
Interest | SinceAge | |
---|---|---|
[email protected] | Music | 10 |
[email protected] | Reading | 5 |
[email protected] | Tennis | 14 |
[email protected] | DIY | 18 |
BirthYear | CurrentCity | |
---|---|---|
[email protected] | 1985 | Seattle |
[email protected] | 1988 | Las Vegas |
[email protected] | 1988 | San Fransciso |
BirthYear | Salary |
---|---|
1985 | 27,000 |
1988 | 24,000 |
Now we can throw away the main table and act only on the sub tables. When we want to see everything, we can join.
Functional Dependencies
Let X and Y be sets of attributes in R. Y is functionally dependent
on X in
R iff for each there is preciesly one
Full Function Dependencies
Let X and Y be sets of attributes in R. Y is fully functionally dependent
on
X iff Y is functionally on X AND Y is not functionally dependent on any proper
subset of X.
Functional Dependencies and Keys
We use keys to enforce full functional dependencies.
Normal Forms
The set of all Data Structures is called Non-First Normal Form
(). Most of these are not “relations” as we know them.
The set of First Normal Form (1NF) relations is a subset oft . The set of Second Normal Form (2NF) relations is a subset of 1NF, and so on. Fourth Normal Form (4NF) is also called Boyce-Codd Normal Form (BCNF). BCNF is the goal.
Definitions
Here are some definitions:
- 1NF
- R is in 1NF iff all domain values are atomic
- All relations are 1NF by definition
- R is in 1NF iff all domain values are atomic
- 2NF
- R is in 2NF iff R is in 1NF and every non-key attribute is fully dependent on the key.
- 3NF
- R is in 3NF iff R is in 2NF and every non-key attribute is non-transitively dependent on the key.
- BCNF
- R is in BCNF iff every determinant is a candidate key.
- Determinant
- a set of attributes on which some other attribute is fully functionally dependent.
“All attributes must depend on the key (1NF), the whole key (2NF), and nothing but the key (3NF), so help me Codd!”
Armstrong’s Rules
How to compute with functional dependencies (meaning).
Reflexivity := If Y is part of X, then X Y
Augmentation := If X Y, then WX WY
Transitivity := If X Y and Y Z, then X Z
How to guarantee lossless joins
The join field must be a key in at least one of the relations.