Introduction to Normalization CPSC 356 Database Ellen Walker Hiram College.

Introduction to Normalization

CPSC 356 Database

Ellen Walker

Hiram College

Building a Schema

• Start with a list of all attributes, considered as if you had a giant flat database (one relation) with all possible information in one place

• Divide the attributes into multiple relations– Intuitively– According to formal rules (normalization)

• This is a formalized alternative to the algorithms we learned before

What Makes A Good Schema?

1. Each relation should have clear semantics, i.e. can be easily described in a few words

2. Try to avoid redundancy (to minimize storage space, but also to avoid anomalies)

3. Avoid a design that encourages too many NULL values in a relation. NULL can be ambiguous: N/A vs. unknown vs. not-yet-entered, etc.

4. Don’t split related attributes so that the relationship between them is lost (e.g. make sure LastName and UserID are both in the same relation)

Tracking Real Estate Staff

• Consider a single relation for real estate• It contains branch name, branch number,

staff name, staff number, staff salary, etc.– One entry for each staff member of each branch– Branch information is repeated for different staff

(REDUNDANCY!)– Staff information is repeated if they work in

multiple branches (REDUNDANCY!)

• This is an example of what NOT to do

Redundancy-caused Anomalies

• Insertion Anomalies– A branch with no staff has many NULLs– Entering a new staff member has NULL branch info

• But branch number and staff number are both part of primary key! (Why)

• Deletion Anomalies– When the last staff member at a branch is deleted, the

branch info is lost

• Update Anomalies– If we make a change in branch info once, it must be

changed in all copies (for all staff).

Solving Redundancy Problems

• Decompose the relation into multiple relations• Use Foreign Keys so the complete relation can be

reconstructed through a join– Branch: has branch number & branch info– Staff: has staff number, staff info & branch number as

foreign key

• Foreign Keys are exactly the attributes that are in the primary key of the other relation

• Insertion, deletion & update anomalies are gone! – Consider: add branch with no staff, remove last staff

member, update branch info

How to Decompose?

• Decompositions are not (always) intuitively obvious• Codd discovered mathematical properties (called

Normal Forms) that describe “goodness” of decomposition

• First, Second, Third normal forms decrease redundancy without loss of information

• BCNF, Fourth and Fifth normal forms potentially introduce information loss (we will see…)

• To understand normal forms, start with functional dependencies

Functional Dependencies

• If A and B are attributes, and every value of A is associated with exactly one value of B (so knowing A predicts B), then B is functionally dependent on A (We write this as: A->B)

• Functional dependency is based on the semantics (meaning) of the attributes.

• A->B and B->A are two different constraints– Email -> first name is a valid dependency– First name -> Email is not a valid dependency

Examples of Functional Dependencies

• US Zip Code -> State• US Area Code -> State• Email -> Firstname, lastname• HotelNo, RoomNo -> Price• JobTitle, ServiceLength -> Salary

What are the dependencies?

• item place customer-name

• ring Kay jewelers prince charming• ring walmart miss piggy

• Place -> item?• Item -> place?

• oil walmart tin man

Finding Dependencies in Data

• If a value of attribute A is associated with two or more values of B, then it is not true that A->B.

• If a value of attribute A is associated with exactly one value of B, then it might be true that A->B.

• Only when every possible value of attribute A is associated with exactly one value of B is it true that A->B.

Characteristics of Functional Dependencies for Normalization

• For any given values of the attributes on the left, there is exactly one possible attribute on the right

• No future data will ever invalidate the dependency

• Dependency is nontrivial -- no attributes from the left are repeated on the right

Keys & Functional Dependency

• Remember, a candidate key is a subset of attributes that is (guaranteed) unique for every tuple

• Therefore, a valid candidate key determines all other attributes in the tuple

• Therefore, there is a functional dependency from the candidate key to all other non-key attributes of the relation.

• (Since the primary key is a candidate key, these arguments can also be made for primary keys)

Manipulating Functional Dependencies

• Given a set of dependencies, derive more dependencies using inference rules

• The closure X+ of a set of dependencies is the set of all possible dependencies that can be derived from it.

Armstrong’s Inference Rules for Manipulating Dependencies

1. if Y is a subset of X, then X -> Y

Alternatively: X,Y -> X (Reflexive)

1. If X->Y then X,Z->Y,Z(Augmentation)

2. If X->Y and Y->Z then X->Z(Transitive)

Additional Inference Rules

4. A->A (Self-determination)

5. If A->B,C then A->B and A->C (Decomposition)

6. If A->B and A->C then A->B,C (Union)

7. If A->B and C->D then A,C -> B,D (Composition)

When are two sets of FDs equivalent?

• When we can use inference rules to transform A to B , then A and B are equivalent

• Problem: it might take a long time to find the right set of inference rules

• What we need is a “standard form” of FD’s - then we can just compare

Finding the Closure

• F is a set of functional dependencies (e.g. the obvious ones from primary keys) We want to find X+, which is the set of all attributes that are dependent on X (based on F).X+ = Xrepeat for each dependency Y->Z in F do if Y is a subset of X+ then X+ = X+ union

Zuntil no more can be added to X+

Closure Example

• F is the following set of dependencies:A->B,C C->D A,D -> F

• What is A+ (all attributes that can be derived from A)?– Initialize A+ = A– Because A->B,C add B,C to A+– Because C is in A+ and C->D, add D to A+– Because A and D are in A+, add F to A+– Therefore A+ is A,B,C,D,F

Equivalence Test

• Are the following sets of FDs equivalent?– AB->C, D->E, AE->G, GD->H, ID->J– ABD->C, ABE->G, GD->EH, IE->J

• Compute closures for each, if any two are different, they are not equivalent– You will need to consider every left side…

Finding a Key

• Given a relation with attributes ABCDEFGHIJ and the following FDs, find a candidate key for the relation– AB->C, D->E, AE->G, GD->H, ID->J

• A candidate key is a subset of attributes that has the entire set of attributes as its closure.– Let’s try ABD…

What is Normalization?

• Formal technique for analyzing relations based on primary key (or candidate keys) and functional dependencies

• Series of tests (normal forms), each of which is harder to “pass”– Normal forms 1NF, 2NF, 3NF, BCNF depend on functional

dependencies– Higher forms (4NF, 5NF) based on other dependencies

• To avoid update anomalies without loss, normalize to 3NF.

Introduction to Normalization CPSC 356 Database Ellen Walker Hiram College.

Documents

Transcript of Introduction to Normalization CPSC 356 Database Ellen Walker Hiram College.