16 Normalization

CS 338: Computer Applications in Business: Databases (Fall 2014)

©1992-2014 by Addison Wesley & Pearson Education, Inc., McGraw Hill, Cengage Learning Slides adapted and modified from Fundamentals of Database Systems (5/6) (Elmasri et al.), Database System Concepts (5/6) (Silberschatz et al.), Database Systems (Coronel et al.), Database Systems (4/5) (Connolly et al. ), Database Systems: Complete Book (Garcia-Molina et al.)

CS 338: Computer Applications in Business: Databases

Basics of Functional Dependencies and Normalization for Relational Databases

Rice University Data Center

Fall 2014

Chapter 15

Overview

Database design may be performed using two approaches: bottom-up or top-down

Bottom-Up Approach

• Considers basic relationships among individual attributes as the starting point and uses those to construct relation schemas

• Not popular: it requires collecting a large number of binary relationships among attributes as the starting point

Top-Down Approach

• Starts with a number of groupings of attributes into relations that exist together

• Then, the relations are analyzed individually and collectively, leading to further decomposition until all desirable properties are met

Overview

Implicit goals of the design activity

1. Information preservation

• Maintaining all concepts, including attribute types, entity types, relationship types as well as generalization/specialization relationships

• Relational design must preserve all of these concepts originally captured in the conceptual design after the conceptual to logical design mapping

2. Minimum redundancy

• Minimize redundant storage of the same information and reduce the need for multiple updates to maintain consistency across multiple copies of the same information

Informal Design Guidelines for Relation Schemas

Four informal guidelines that can serve as measures of the quality of a relation schema design

1. Making sure attribute semantics are clear

2. Reducing redundant information in tuples

3. Reducing the NULL values in tuples

4. Disallowing possibility of generating spurious tuples

1. Making sure attribute semantics are clear

Semantics of a relation

• Whenever we group attributes to form a relation schema, we assume that

• attributes belonging to one relation have a certain real-world meaning and

• a proper interpretation associated with them

• Semantics of a relation refers to its meaning resulting from the interpretation of attribute values in a tuple

• Recall: a relation can be interpreted as a set of facts

• If conceptual design is done carefully and the mapping procedure is followed systematically, the relation schema design should have a clear meaning

1. Making sure attribute semantics are clear

Easier to explain semantics of relation (indicates better design)

Each tuple represents an employee with values for the employee’s name (Ename), SSN (Ssn), birth date (Bdate), address (Address), and department number (Dnumber)

Dnumber is a foreign key that implicitly represents a relationship between EMPLOYEE and DEPARTMENT

Each tuple in DEPT_LOCATIONS gives Department number (Dnumber) and one of the locations of the department (Dlocation) (multivalued attribute)

Each tuple in WORKS_ON gives an employee Ssn, the project number of one of the projects that the employee works on (Pnumber), and the number of hours per week (Hours)

Guideline 1

• Design relation schema so that it is easy to explain its meaning

• Do not combine attributes from multiple entity types and relationship types into a single relation

• Although there is nothing logically wrong with the EMP_DEPT and EMP_PROJ relations, they violate Guideline 1 by mixing attributes from distinct real-world entities

• EMP_DEPT mixes attributes of employees and departments

• EMP_PROJ mixes attributes of employees and projects and the WORKS_ON relationship

2. Reducing redundant information in tuples

• Major aim of relational database design is to group attributes into relations to minimize data redundancy.

• Significant effect on storage space

In the EMPLOYEE relation, only the department number (Dnumber) is repeated, as a foreign key, for each employee who works in that department

2. Reducing redundant information in tuples

Example:

• EMP_DEPT is the result of applying the NATURAL JOIN operation to EMPLOYEE and DEPARTMENT

• Attribute values pertaining to a particular department (Dnumber, Dname, Dmgr_ssn) are repeated for every employee who works for that department
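
The repetition described above can be reproduced with a small sketch. The tuples below are hypothetical sample data, and `natural_join` is an illustrative helper written for this example, not part of any library.

```python
# Hypothetical sample data; attribute names follow the slides.
EMPLOYEE = [
    {"Ename": "Smith", "Ssn": "123456789", "Dnumber": 5},
    {"Ename": "Wong", "Ssn": "333445555", "Dnumber": 5},
    {"Ename": "Zelaya", "Ssn": "999887777", "Dnumber": 4},
]
DEPARTMENT = [
    {"Dnumber": 5, "Dname": "Research", "Dmgr_ssn": "333445555"},
    {"Dnumber": 4, "Dname": "Administration", "Dmgr_ssn": "987654321"},
]


def natural_join(r, s):
    """Join two lists of dict-tuples on every attribute name they share."""
    result = []
    for t in r:
        for u in s:
            common = set(t) & set(u)
            if all(t[a] == u[a] for a in common):
                result.append({**t, **u})
    return result


EMP_DEPT = natural_join(EMPLOYEE, DEPARTMENT)

# The Research department's name and manager are stored once per employee.
research_copies = sum(1 for t in EMP_DEPT if t["Dname"] == "Research")
print(len(EMP_DEPT), research_copies)  # 3 2
```

In the base relations the Research department is described once; in EMP_DEPT it is described once for each of its employees.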

2. Reducing redundant information in tuples

• Potential benefits include:

• Updates to the data stored in the database are achieved with a minimal number of operations thus reducing the opportunities for data inconsistencies.

• Reduction in the file storage space required by the base relations thus minimizing costs.

• Problems associated with data redundancy are illustrated by comparing the Staff and Branch relations with the StaffBranch relation

2. Reducing redundant information in tuples

Example

2. Reducing redundant information in tuples: Update Anomalies

• StaffBranch relation has redundant data; the details of a branch are repeated for every member of staff.

• In contrast, the branch information appears only once for each branch in the Branch relation and only the branch number (branchNo) is repeated in the Staff relation, to represent where each member of staff is located.

• Storing natural joins of base relations leads to an additional problem referred to as update anomalies: data inconsistencies that result from data redundancy combined with an insert, delete, or modify operation

• Types of update anomalies include

• Insertion

• Deletion

• Modification

2. Reducing redundant information in tuples: Update Anomalies

Insertion Anomalies

1. To insert a new employee tuple into EMP_DEPT, we must include either the attribute values for the department that the employee works for, or NULLs

• Use NULLs if the employee does not work for a department as yet

• Otherwise, we must enter all the attribute values of the department so that they are consistent with the corresponding values for that department in other tuples

2. Reducing redundant information in tuples: Update Anomalies

Insertion Anomalies

• However, with separate EMPLOYEE and DEPARTMENT relations we do not have to worry about this consistency problem

• We enter only the department number in the employee tuple; all attribute values of a department are recorded only once in the database, as a single tuple in the DEPARTMENT relation

2. It is difficult to insert a new department that has no employees as yet in the EMP_DEPT relation

• We would have to enter NULLs in the employee attributes, which violates entity integrity for EMP_DEPT because Ssn is its primary key
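
A minimal sketch of the second insertion anomaly, using Python's built-in `sqlite3` module. The column types and the 'Marketing' department are assumptions made for illustration. NOT NULL is written out explicitly because SQLite does not enforce it on non-INTEGER primary keys by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Combined relation: Ssn is the primary key, so it may not be NULL.
conn.execute("""
    CREATE TABLE EMP_DEPT (
        Ssn      TEXT NOT NULL PRIMARY KEY,
        Ename    TEXT,
        Dnumber  INTEGER,
        Dname    TEXT,
        Dmgr_ssn TEXT
    )
""")

# Try to record a new department (hypothetical) that has no employees yet.
try:
    conn.execute(
        "INSERT INTO EMP_DEPT (Ssn, Ename, Dnumber, Dname, Dmgr_ssn) "
        "VALUES (NULL, NULL, 6, 'Marketing', NULL)"
    )
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

# With a separate DEPARTMENT relation the same fact is easy to store.
conn.execute(
    "CREATE TABLE DEPARTMENT ("
    "Dnumber INTEGER PRIMARY KEY, Dname TEXT, Dmgr_ssn TEXT)"
)
conn.execute("INSERT INTO DEPARTMENT VALUES (6, 'Marketing', NULL)")
dept_count = conn.execute("SELECT COUNT(*) FROM DEPARTMENT").fetchone()[0]

print(insert_rejected, dept_count)  # True 1
```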

2. Reducing redundant information in tuples: Update Anomalies

Deletion Anomalies

• Related to the second insertion anomaly situation:

• If we delete from EMP_DEPT an employee tuple that happens to represent the last employee working for a particular department, the information concerning that department is lost from the database

• This does not occur when DEPARTMENT tuples are stored separately in their own relation
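
The deletion anomaly can be sketched in a few lines. The tuples are hypothetical, and `known_departments` is an illustrative helper.

```python
# Hypothetical EMP_DEPT instance; Zelaya is the only employee of department 4.
EMP_DEPT = [
    {"Ssn": "123456789", "Ename": "Smith", "Dnumber": 5, "Dname": "Research"},
    {"Ssn": "333445555", "Ename": "Wong", "Dnumber": 5, "Dname": "Research"},
    {"Ssn": "999887777", "Ename": "Zelaya", "Dnumber": 4, "Dname": "Administration"},
]


def known_departments(rows):
    """Department names that can still be recovered from the relation."""
    return {t["Dname"] for t in rows}


before = known_departments(EMP_DEPT)

# Delete the last employee of department 4.
EMP_DEPT = [t for t in EMP_DEPT if t["Ssn"] != "999887777"]

after = known_departments(EMP_DEPT)
lost = before - after
print(lost)  # {'Administration'}
```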

2. Reducing redundant information in tuples: Update Anomalies

Modification Anomalies

• In EMP_DEPT, if we change the value of one of the attributes of a particular department (e.g. manager of department 5), we must update the tuples of all employees who work in that department

• If we fail to update some tuples, the same department will be shown to have two different values for manager in different employee tuples, which would be wrong
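
A short sketch of the modification anomaly, assuming a hypothetical EMP_DEPT instance and a faulty program that updates only one of the tuples for department 5.

```python
# Hypothetical EMP_DEPT instance: three employees of department 5.
EMP_DEPT = [
    {"Ssn": "123456789", "Dnumber": 5, "Dmgr_ssn": "333445555"},
    {"Ssn": "333445555", "Dnumber": 5, "Dmgr_ssn": "333445555"},
    {"Ssn": "666884444", "Dnumber": 5, "Dmgr_ssn": "333445555"},
]

# Faulty update: the new manager is written to the first tuple only.
EMP_DEPT[0]["Dmgr_ssn"] = "123456789"


def managers_of(rows, dnumber):
    """All manager values recorded for one department."""
    return {t["Dmgr_ssn"] for t in rows if t["Dnumber"] == dnumber}


managers = managers_of(EMP_DEPT, 5)
print(len(managers))  # 2
```

Department 5 now appears to have two managers, which is exactly the inconsistency the slide warns about.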

Guideline 2

• Design base relation schemas so that no update anomalies are present in the relations

• If any anomalies are present:

• Note them clearly

• Make sure that the programs that update the database will operate correctly

3. Reducing the NULL values in tuples

• May group many attributes together into a “fat” relation

• Can end up with many NULLs

• Problems with NULLs

• Wasted storage space

• Problems understanding meaning

• Problem with NULLs: how to account for them when applying aggregate functions (e.g., COUNT, SUM)
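
To make the aggregate issue concrete, here is a small sketch using Python's built-in `sqlite3` module. The Salary column and the three rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (Ssn TEXT PRIMARY KEY, Salary REAL, Office_number TEXT)"
)
# Hypothetical rows: one salary and two office numbers are unknown (NULL).
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [
        ("111", 30000.0, "A1"),
        ("222", 40000.0, None),
        ("333", None, None),
    ],
)

# COUNT(*) counts rows; COUNT(column) and AVG(column) skip NULL values.
count_all, count_salary, count_office, avg_salary = conn.execute(
    "SELECT COUNT(*), COUNT(Salary), COUNT(Office_number), AVG(Salary) "
    "FROM EMPLOYEE"
).fetchone()

print(count_all, count_salary, count_office, avg_salary)  # 3 2 1 35000.0
```

The average is taken over two salaries, not three, so a reader who assumes every employee is included will misread the result.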

Guideline 3

• Avoid placing attributes in a base relation whose values may frequently be NULL

• If NULLs are unavoidable:

• Make sure that they apply in exceptional cases only, not to a majority of tuples

• Example: If only 15% of employees have individual offices, there is little justification for including an attribute Office_number in the EMPLOYEE relation

• Possible Solution: create a separate relation EMP_OFFICES(Essn, Office_number) to include tuples for only the employees with individual offices

4. Disallowing possibility of generating spurious tuples

If we perform a NATURAL JOIN

• The result produces many more tuples than the original set of tuples in EMP_PROJ

• These are called spurious tuples: they represent spurious information that is not valid
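
A minimal sketch of how spurious tuples arise. It assumes a hypothetical EMP_PROJ instance that is split into two relations, here called EMP_LOCS and EMP_PROJ1, whose only shared attribute (Plocation) is not a key of either one.

```python
# Hypothetical EMP_PROJ instance: (Ssn, Ename, Pnumber, Plocation)
EMP_PROJ = {
    ("123456789", "Smith", 1, "Bellaire"),
    ("453453453", "English", 2, "Sugarland"),
    ("333445555", "Wong", 3, "Sugarland"),
}

# Poor decomposition: project each tuple onto two attribute groups.
EMP_LOCS = {(ename, ploc) for (_, ename, _, ploc) in EMP_PROJ}
EMP_PROJ1 = {(ssn, pnum, ploc) for (ssn, _, pnum, ploc) in EMP_PROJ}

# Natural join of the two parts on their shared attribute, Plocation.
rejoined = {
    (ssn, ename, pnum, ploc)
    for (ename, ploc) in EMP_LOCS
    for (ssn, pnum, ploc2) in EMP_PROJ1
    if ploc == ploc2
}

# Tuples that the join produced but that were never in the original.
spurious = rejoined - EMP_PROJ
print(len(EMP_PROJ), len(rejoined), len(spurious))  # 3 5 2
```

Every original tuple is still present after the join, but two extra tuples pair an employee with a project that person never worked on.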

Guideline 4

• Design relation schemas to be joined with equality conditions on attributes that are appropriately related

• Guarantees that no spurious tuples are generated

• Avoid relations that contain matching attributes that are not (foreign key, primary key) combinations

Data Redundancy: Summary

• Major aim of relational database design is to group attributes into relations to minimize data redundancy

• Of course, relational databases also rely on the existence of a certain amount of data redundancy

• This redundancy is in the form of copies of primary keys (or candidate keys) acting as foreign keys in related relations

• This helps us model relationships between data

Functional Dependencies

• Formal tool for analysis of relational schemas

• Enables us to detect and describe some of the previously-mentioned problems in precise terms

• Theory of functional dependency

Definition of Functional Dependency

• Functional dependency describes relationship between attributes.

• For example, if A and B are attributes of relation R, B is functionally dependent on A (denoted A → B) if each value of A in R is associated with exactly one value of B in R.

• Property of the meaning or semantics of the attributes in a relation.
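
The definition above is often written formally as follows. This is a standard formulation rather than one taken from the slides: r(R) stands for any legal instance of R, and t[A] for the value of tuple t on attribute (or attribute set) A.

```latex
A \rightarrow B \text{ holds in } R
\quad\Longleftrightarrow\quad
\forall\, t_1, t_2 \in r(R):\;
t_1[A] = t_2[A] \;\Rightarrow\; t_1[B] = t_2[B]
```

In words: two tuples that agree on A must also agree on B, in every legal instance and not only in the current data.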

Definition of Functional Dependency

• Diagrammatic representation.

• The determinant of a functional dependency refers to the attribute or group of attributes on the left-hand side of the arrow.

• We may say that there is a functional dependency from A to B, or that B is functionally dependent on A

• Abbreviation for functional dependency is FD or f.d.

• The attribute set A is called the left-hand side of the FD, and B is called the right-hand side

Example 1

Consider the values shown in staffNo and sName attributes of the Staff relation

Based on sample data, the following functional dependencies appear to hold.

staffNo → sName

sName → staffNo

When identifying functional dependencies between attributes, it is important to distinguish clearly between the values held by an attribute at a given point in time and the set of all possible values that an attribute may hold at different times

Example 1

If the values shown in Staff relation simply represent a set of values for staffNo and sName attributes at a given moment in time, then:

• staffNo uniquely identifies each member

• sName holds the name of staff members

• Using a staff number (staffNo) we can determine the name of the member of staff (sName)

• It is possible for the sName attribute to hold duplicate values for members of staff with the same name

staffNo → sName

This functional dependency remains true for all possible values of the staffNo and sName attributes
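
The difference between a dependency that merely appears to hold in sample data and one that holds for all time can be sketched as follows. The Staff rows are hypothetical, and `fd_holds` is an illustrative helper that only tests one instance.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs is satisfied by this particular instance."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # setdefault keeps the first right-hand value seen for this left-hand value.
        if seen.setdefault(left, right) != right:
            return False
    return True


# Hypothetical Staff sample; the values are illustrative only.
staff = [
    {"staffNo": "SL21", "sName": "John White"},
    {"staffNo": "SG37", "sName": "Ann Beech"},
    {"staffNo": "SG14", "sName": "David Ford"},
]

holds_now = (
    fd_holds(staff, ["staffNo"], ["sName"]),
    fd_holds(staff, ["sName"], ["staffNo"]),
)

# Later, a second member of staff who happens to share a name is hired.
staff.append({"staffNo": "SA9", "sName": "John White"})

holds_later = (
    fd_holds(staff, ["staffNo"], ["sName"]),
    fd_holds(staff, ["sName"], ["staffNo"]),
)

print(holds_now)    # (True, True)
print(holds_later)  # (True, False)
```

A check like this can refute a dependency but can never prove it; whether a dependency holds for all time follows from the meaning of the attributes, not from any one snapshot.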

Characteristics of Functional Dependencies

• Main characteristics of functional dependencies :

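• There is a one-to-one relationship between the attribute(s) on the left-hand side (determinant) and those on the right-hand side of a functional dependency, read from left to right: each determinant value is associated with exactly one right-hand-side value (the reverse direction may be one-to-many).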

• Holds for all time.

• The determinant has the minimal number of attributes necessary to maintain the dependency with the attribute(s) on the right hand-side.

Example 2

• With sufficient information available, we identify the functional dependencies for the StaffBranch relation as:

staffNo → sName, position, salary, branchNo, bAddress

branchNo → bAddress

bAddress → branchNo

branchNo, position → salary

bAddress, position → salary

Example 3

• Consider the data for attributes denoted A, B, C, D, and E in the Sample relation

A → C

C → A

B → D

A, B → E

Identifying the Primary Key for a Relation using Functional Dependencies

• StaffBranch relation has five functional dependencies

• Determinants are: staffNo, branchNo, bAddress, (branchNo, position), and (bAddress, position)

• To identify all candidate key(s), identify the attribute (or group of attributes) that uniquely identifies each tuple in this relation.

• All attributes that are not part of a candidate key should be functionally dependent on the key.

• The only candidate key, and therefore the primary key, for the StaffBranch relation is staffNo, as all other attributes of the relation are functionally dependent on staffNo.
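
One way to check this conclusion mechanically is to compute attribute closures, a standard technique that the slides do not name. The sketch below uses the five StaffBranch dependencies from Example 2; `closure` and `candidate_keys` are illustrative helpers, and the brute-force search is only practical for small relations.

```python
from itertools import combinations

ATTRS = {"staffNo", "sName", "position", "salary", "branchNo", "bAddress"}

# The five functional dependencies listed for StaffBranch.
FDS = [
    ({"staffNo"}, {"sName", "position", "salary", "branchNo", "bAddress"}),
    ({"branchNo"}, {"bAddress"}),
    ({"bAddress"}, {"branchNo"}),
    ({"branchNo", "position"}, {"salary"}),
    ({"bAddress", "position"}, {"salary"}),
]


def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(all_attrs, fds):
    """Minimal attribute sets whose closure covers the whole relation."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == all_attrs:
                keys.append(candidate)
    return keys


keys = candidate_keys(ATTRS, FDS)
print(keys)  # [{'staffNo'}]
```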

Normalization

• Normalization is a formal technique for analyzing relations based on their primary key (or candidate keys) and functional dependencies

• Includes a series of rules that can be used to test individual relations so that a database can be normalized to any degree

• When a requirement is not met, the relation violating the requirement must be decomposed into relations that individually meet the requirements of normalization

• As normalization proceeds, the relations become progressively more restricted (stronger) in format and also less vulnerable to update anomalies.

Normalization

• Normal Form of a relation refers to the highest normal form condition that it meets, and hence indicates the degree to which it has been normalized

• Denormalization is the process of storing the join of higher normal form relations as a base relation (which is in a lower normal form)

• Different normal forms:

• 1NF, 2NF, 3NF

• 4NF, 5NF

• BCNF

We won’t worry about other normal forms in this class

Boyce-Codd Normal Form (BCNF)

• Based on functional dependencies that take into account all candidate keys in a relation; however, BCNF also has additional constraints compared with 3NF

• Boyce–Codd normal form (BCNF)

• A relation is in BCNF if and only if every determinant is a candidate key.

• Violation of BCNF is quite rare.

• The potential to violate BCNF may occur in a relation that:

• contains two (or more) composite candidate keys, and

• has candidate keys that overlap, that is, have at least one attribute in common.
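
The test "every determinant is a candidate key" can be sketched by checking, for each nontrivial dependency, whether the closure of its determinant covers the whole relation. This sketch reuses the StaffBranch dependencies from Example 2; `closure` and `bcnf_violations` are illustrative helpers, and only the listed dependencies are examined.

```python
ATTRS = {"staffNo", "sName", "position", "salary", "branchNo", "bAddress"}

FDS = [
    ({"staffNo"}, {"sName", "position", "salary", "branchNo", "bAddress"}),
    ({"branchNo"}, {"bAddress"}),
    ({"bAddress"}, {"branchNo"}),
    ({"branchNo", "position"}, {"salary"}),
    ({"bAddress", "position"}, {"salary"}),
]


def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """Nontrivial dependencies whose determinant is not a superkey."""
    violations = []
    for lhs, rhs in fds:
        nontrivial = not rhs <= lhs
        if nontrivial and closure(lhs, fds) != all_attrs:
            violations.append((lhs, rhs))
    return violations


violations = bcnf_violations(ATTRS, FDS)
bad_determinants = [sorted(lhs) for lhs, _ in violations]
print(bad_determinants)
# [['branchNo'], ['bAddress'], ['branchNo', 'position'], ['bAddress', 'position']]
```

Only staffNo passes the test, so StaffBranch as given is not in BCNF; the four failing determinants are the ones responsible for the repeated branch details.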

Boyce-Codd Normal Form (BCNF)

• A relation schema R is in Boyce-Codd Normal Form (BCNF) if whenever a nontrivial functional dependency X → A holds in R, then X is a superkey of R

• i.e., if X → A and A ∉ X, then X → R

• Relation schemas in BCNF avoid the problems of redundancy

• Examples

• Dnumber → {Dname, Dmgr_ssn} but Dnumber ↛ Ename

• Pnumber → {Pname, Plocation} but Dnumber ↛ Ssn

• Ssn → Ename but Ssn ↛ Pnumber