Introduction to Normalization CPSC 356 Database Ellen Walker Hiram College.

Introduction to Normalization

CPSC 356 Database

Ellen Walker

Hiram College

Building a Schema

• Start with a list of all attributes, considered as if you had a giant flat database (one relation) with all possible information in one place

• Divide the attributes into multiple relations
  – Intuitively
  – According to formal rules (normalization)

• This is a formalized alternative to the algorithms we learned before

What Makes A Good Schema?

1. Each relation should have clear semantics, i.e. can be easily described in a few words

2. Try to avoid redundancy (to minimize storage space, but also to avoid anomalies)

3. Avoid a design that encourages too many NULL values in a relation. NULL can be ambiguous: N/A vs. unknown vs. not-yet-entered, etc.

4. Don’t split related attributes so that the relationship between them is lost (e.g. make sure LastName and UserID are both in the same relation)

Tracking Real Estate Staff

• Consider a single relation for real estate

• It contains branch name, branch number, staff name, staff number, staff salary, etc.
  – One entry for each staff member of each branch
  – Branch information is repeated for different staff (REDUNDANCY!)
  – Staff information is repeated if they work in multiple branches (REDUNDANCY!)

• This is an example of what NOT to do

Redundancy-caused Anomalies

• Insertion Anomalies
  – A branch with no staff has many NULLs
  – Entering a new staff member has NULL branch info
  – But branch number and staff number are both part of primary key! (Why?)

• Deletion Anomalies
  – When the last staff member at a branch is deleted, the branch info is lost

• Update Anomalies
  – If we make a change in branch info once, it must be changed in all copies (for all staff).

Solving Redundancy Problems

• Decompose the relation into multiple relations

• Use Foreign Keys so the complete relation can be reconstructed through a join
  – Branch: has branch number & branch info
  – Staff: has staff number, staff info & branch number as foreign key

• Foreign Keys are exactly the attributes that are in the primary key of the other relation

• Insertion, deletion & update anomalies are gone!
  – Consider: add branch with no staff, remove last staff member, update branch info
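
The Branch/Staff decomposition can be tried out with Python's built-in sqlite3 module. This is only a sketch: the column names (city, name, salary) and the sample rows are assumptions made up for illustration, not part of the original example.

```python
import sqlite3

# Hypothetical column names and sample values, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
    CREATE TABLE branch (
        branch_no TEXT PRIMARY KEY,
        city      TEXT NOT NULL
    );
    CREATE TABLE staff (
        staff_no  TEXT PRIMARY KEY,
        name      TEXT NOT NULL,
        salary    INTEGER NOT NULL,
        branch_no TEXT NOT NULL REFERENCES branch(branch_no)
    );
""")

# Insertion: a branch with no staff needs no NULL placeholders.
conn.execute("INSERT INTO branch VALUES ('B1', 'Hiram')")
conn.execute("INSERT INTO branch VALUES ('B2', 'Cleveland')")
conn.execute("INSERT INTO staff VALUES ('S1', 'Ann', 30000, 'B1')")

# The join reconstructs the rows of the original flat relation.
flat = conn.execute("""
    SELECT s.staff_no, s.name, s.salary, b.branch_no, b.city
    FROM staff AS s JOIN branch AS b ON s.branch_no = b.branch_no
""").fetchall()

# Update: branch info is stored once, so exactly one row changes.
updated = conn.execute(
    "UPDATE branch SET city = 'Garrettsville' WHERE branch_no = 'B1'"
).rowcount

# Deletion: removing the last staff member leaves the branch in place.
conn.execute("DELETE FROM staff WHERE staff_no = 'S1'")
branches_left = conn.execute("SELECT COUNT(*) FROM branch").fetchone()[0]

print(flat, updated, branches_left)
```

Each of the three "Consider" cases maps to one statement above: the B2 insert, the single-row update, and the delete that leaves both branches in place.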

How to Decompose?

• Decompositions are not (always) intuitively obvious

• Codd discovered mathematical properties (called Normal Forms) that describe “goodness” of decomposition

• First, Second, Third normal forms decrease redundancy without loss of information

• BCNF, Fourth and Fifth normal forms reduce redundancy further, but a decomposition may fail to preserve some dependencies, a form of information loss (we will see…)

• To understand normal forms, start with functional dependencies

Functional Dependencies

• If A and B are attributes, and every value of A is associated with exactly one value of B (so knowing A predicts B), then B is functionally dependent on A (We write this as: A->B)

• Functional dependency is based on the semantics (meaning) of the attributes.

• A->B and B->A are two different constraints
  – Email -> first name is a valid dependency
  – First name -> Email is not a valid dependency

Examples of Functional Dependencies

• US Zip Code -> State

• US Area Code -> State

• Email -> FirstName, LastName

• HotelNo, RoomNo -> Price

• JobTitle, ServiceLength -> Salary

What are the dependencies?

item    place           customer-name
ring    Kay jewelers    prince charming
ring    walmart         miss piggy
oil     walmart         tin man

• Place -> item?

• Item -> place?

Finding Dependencies in Data

• If a value of attribute A is associated with two or more values of B, then it is not true that A->B.

• If a value of attribute A is associated with exactly one value of B, then it might be true that A->B.

• Only when every possible value of attribute A is associated with exactly one value of B is it true that A->B.
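
These three cases can be checked mechanically against sample data. The sketch below reuses the item/place/customer rows from the earlier example; the function name holds_in_data and the dictionary layout are assumptions made for illustration. A True result only means the data has not (yet) contradicted the dependency, which is the "might be true" case.

```python
def holds_in_data(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # setdefault stores the first right-hand value seen for this left-hand value
        if seen.setdefault(left, right) != right:
            return False  # one left value, two different right values
    return True


purchases = [
    {"item": "ring", "place": "Kay jewelers", "customer": "prince charming"},
    {"item": "ring", "place": "walmart", "customer": "miss piggy"},
]

# With two rows, place -> item is not contradicted, but item -> place already is.
place_item_two_rows = holds_in_data(purchases, ["place"], ["item"])
item_place_two_rows = holds_in_data(purchases, ["item"], ["place"])

# A third row shows that "not contradicted so far" was not a guarantee.
purchases.append({"item": "oil", "place": "walmart", "customer": "tin man"})
place_item_three_rows = holds_in_data(purchases, ["place"], ["item"])

print(place_item_two_rows, item_place_two_rows, place_item_three_rows)
# True False False
```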

Characteristics of Functional Dependencies for Normalization

• For any given values of the attributes on the left, there is exactly one possible value for each attribute on the right

• No future data will ever invalidate the dependency

• Dependency is nontrivial -- no attributes from the left are repeated on the right

Keys & Functional Dependency

• Remember, a candidate key is a subset of attributes that is (guaranteed) unique for every tuple

• Therefore, a valid candidate key determines all other attributes in the tuple

• Therefore, there is a functional dependency from the candidate key to all other non-key attributes of the relation.

• (Since the primary key is a candidate key, these arguments can also be made for primary keys)

Manipulating Functional Dependencies

• Given a set of dependencies, derive more dependencies using inference rules

• The closure F+ of a set of dependencies F is the set of all possible dependencies that can be derived from it.

Armstrong’s Inference Rules for Manipulating Dependencies

1. If Y is a subset of X, then X -> Y

Alternatively: X,Y -> X (Reflexive)

2. If X->Y then X,Z -> Y,Z (Augmentation)

3. If X->Y and Y->Z then X->Z (Transitive)

Additional Inference Rules

4. A->A (Self-determination)

5. If A->B,C then A->B and A->C (Decomposition)

6. If A->B and A->C then A->B,C (Union)

7. If A->B and C->D then A,C -> B,D (Composition)
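
The additional rules follow from Armstrong's three rules. As one worked case, the Union rule can be derived using only augmentation and transitivity:

```latex
\begin{align*}
A   &\to B     && \text{given} \\
A   &\to A,B   && \text{augment } A \to B \text{ with } A \\
A   &\to C     && \text{given} \\
A,B &\to B,C   && \text{augment } A \to C \text{ with } B \\
A   &\to B,C   && \text{transitivity on } A \to A,B \text{ and } A,B \to B,C
\end{align*}
```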

When are two sets of FDs equivalent?

• When we can use inference rules to derive every dependency in B from A, and every dependency in A from B, then A and B are equivalent

• Problem: it might take a long time to find the right set of inference rules

• What we need is a “standard form” of FD’s - then we can just compare

Finding the Closure

• F is a set of functional dependencies (e.g. the obvious ones from primary keys)

• We want to find X+, which is the set of all attributes that are dependent on X (based on F).

    X+ = X
    repeat
        for each dependency Y->Z in F do
            if Y is a subset of X+ then X+ = X+ union Z
    until no more can be added to X+

Closure Example

• F is the following set of dependencies: A->B,C    C->D    A,D->F

• What is A+ (all attributes that can be derived from A)?
  – Initialize A+ = A
  – Because A->B,C add B,C to A+
  – Because C is in A+ and C->D, add D to A+
  – Because A and D are in A+, add F to A+
  – Therefore A+ is A,B,C,D,F
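
The closure computation can be sketched in Python and run on this example. The representation is an assumption made for brevity: each attribute is a single letter, and each dependency is a (left side, right side) pair of strings.

```python
def closure(attrs, fds):
    """Attribute closure: every attribute determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:  # repeat ... until no more can be added
        changed = False
        for lhs, rhs in fds:
            # Fire the dependency if its left side is covered and it adds something new.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# A->B,C   C->D   A,D->F  (single-letter attribute names assumed)
F = [("A", "BC"), ("C", "D"), ("AD", "F")]
a_plus = closure("A", F)
print("".join(sorted(a_plus)))  # ABCDF
```

The outer loop is needed because a dependency skipped on one pass (here A,D->F, if it were listed first) can become applicable after a later dependency adds D.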

Equivalence Test

• Are the following sets of FDs equivalent?
  – AB->C, D->E, AE->G, GD->H, ID->J
  – ABD->C, ABE->G, GD->EH, IE->J

• Compute closures for each; if any two are different, they are not equivalent
  – You will need to consider every left side…
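
One way to automate the test is sketched below: a set F covers a set G if, for every dependency in G, the right side lies inside the closure of the left side computed under F, and the two sets are equivalent when each covers the other. The helper names covers and equivalent are assumptions made for illustration, as is the single-letter attribute encoding.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under the dependencies in fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def covers(f, g):
    """True if every dependency in g can be derived from f."""
    return all(set(rhs) <= closure(lhs, f) for lhs, rhs in g)


def equivalent(f, g):
    """True if f and g derive exactly the same dependencies."""
    return covers(f, g) and covers(g, f)


F1 = [("AB", "C"), ("D", "E"), ("AE", "G"), ("GD", "H"), ("ID", "J")]
F2 = [("ABD", "C"), ("ABE", "G"), ("GD", "EH"), ("IE", "J")]
print(equivalent(F1, F2))
```

Checking every left side this way shows, for instance, that IE->J cannot be derived from the first set, because the closure of I,E under it contains only I and E.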

Finding a Key

• Given a relation with attributes ABCDEFGHIJ and the following FDs, find a candidate key for the relation
  – AB->C, D->E, AE->G, GD->H, ID->J

• A candidate key is a subset of attributes that has the entire set of attributes as its closure.
  – Let’s try ABD…
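
A brute-force search makes the definition concrete: try attribute subsets from smallest to largest and keep those whose closure is the whole relation, skipping supersets of keys already found. This is a sketch for small examples, not an efficient algorithm; the function name candidate_keys and the single-letter encoding are assumptions.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under the dependencies in fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """All minimal attribute sets whose closure is the full relation."""
    all_attrs = set(attributes)
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == all_attrs:
                keys.append(candidate)
    return keys


R = "ABCDEFGHIJ"
F = [("AB", "C"), ("D", "E"), ("AE", "G"), ("GD", "H"), ("ID", "J")]

abd_plus = closure("ABD", F)
keys = candidate_keys(R, F)
print("".join(sorted(abd_plus)))             # ABCDEGH
print(["".join(sorted(k)) for k in keys])    # ['ABDFI']
```

The closure of ABD leaves out F, I and J, so ABD alone is not a key. Since A, B, D, F and I never appear on a right side, every key must contain all five of them.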

What is Normalization?

• Formal technique for analyzing relations based on primary key (or candidate keys) and functional dependencies

• Series of tests (normal forms), each of which is harder to “pass”
  – Normal forms 1NF, 2NF, 3NF, BCNF depend on functional dependencies
  – Higher forms (4NF, 5NF) based on other dependencies

• To avoid update anomalies without loss, normalize to 3NF.