COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 4 Normalization.

Today’s Agenda
  Motivation
  Making it precise: Functional Dependency
  Making it precise: Normal Form
  Making it precise: Schema Decomposition

Schema Design Process
The relational schema is best obtained by starting with a conceptual design (e.g. in the ER model)
  This can be converted to a relational schema
However, the relational schema may arise in other ways
  E.g. start from data in a spreadsheet: this typically gives one wide table!
  E.g. choose tables by raw intuition
We should evaluate the schema, and improve it if necessary

Evaluation of a design
The most important requirement is adequacy: the design should allow representing all the important facts about the application domain
  Make sure every important process can be done using the data in the database, by joining tables as needed
  E.g. “can we find out which driver made a particular delivery?”
If a design is adequate, then we seek to avoid redundancy in the data

Example
Redundant data is where the same information is repeated in several places in the database

Hourly_Employee relation
Ssn   Name         Lot  Rating  Hourly_wage  Hours_worked
3666  ‘Attishoo’   48   8       10           40
5368  ‘Smiley’     22   8       10           30
3650  ‘Smethurst’  35   8       10           30
3751  ‘Guldu’      35   5       7            32
4134  ‘Madayan’    35   5       7            40

Example (cont’d)
Redundant information in the Hourly_Employee relation: the same (Rating, Hourly_wage) combination is stored in several rows. The pair (8, 10) appears in three rows and the pair (5, 7) in two.

Domain knowledge
Whether a design keeps redundant data depends on the application domain’s rules
  It is not enough just to see that the same values occur several times in some instance
  You must check with users to be certain
  E.g. ask “is it always the case that two hourly employees with the same rating have the same hourly wage?”
(The Hourly_Employee relation from the Example slide is shown again here.)

Evils of Redundancy
Redundancy is at the root of several problems associated with relational schemas:
  Update anomaly: If we change the hourly wage in one employee’s record, we may end up with an inconsistent database (where the domain constraint is not true)
  Insertion anomaly: We cannot insert a record for an employee unless we know the hourly wage for the employee’s rating value.
  Another insertion anomaly: We can’t record the hourly wage for a rating value unless there is an employee with that rating.
  Deletion anomaly: If we delete all records with a given rating value, we lose the association between that rating value and its hourly-wage value!
It is the anomalies with modification that are the serious concern, not the extra space used in storage

Normal Forms
There is a theory that allows one to see whether or not a particular schema risks having redundancy anomalies
This theory looks at the design and also at constraints on the application, expressed in a mathematical form as functional dependencies
If the design and its dependencies have certain properties, we say that the design is in a particular normal form
  There are several normal forms in use; we focus on BCNF
Our goal is to use schemas which are in BCNF

Decomposition
Decomposition changes a schema by replacing one relation with other relations
  Between them, the decomposed relations have all the attributes of the original
  The data in each decomposed table is a projection of the data in the original table

Ssn   Name         Lot  Rating  Hours_worked
3666  ‘Attishoo’   48   8       40
5368  ‘Smiley’     22   8       30
3650  ‘Smethurst’  35   8       30
3751  ‘Guldu’      35   5       32
4134  ‘Madayan’    35   5       40

Rating  Hourly_wage
8       10
5       7
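
As a rough illustration of the decomposition above, the following Python sketch projects the Hourly_Employee instance onto the two attribute sets. The `project` helper and the tuple layout are assumptions made for this example; they are not part of the lecture.

```python
# Hourly_Employee instance from the slides, one tuple per row:
# (ssn, name, lot, rating, hourly_wage, hours_worked)
ATTRS = ("ssn", "name", "lot", "rating", "hourly_wage", "hours_worked")
HOURLY_EMPLOYEE = [
    (3666, "Attishoo", 48, 8, 10, 40),
    (5368, "Smiley", 22, 8, 10, 30),
    (3650, "Smethurst", 35, 8, 10, 30),
    (3751, "Guldu", 35, 5, 7, 32),
    (4134, "Madayan", 35, 5, 7, 40),
]


def project(rows, attrs, onto):
    """Relational projection: keep the columns in `onto` and drop duplicate rows."""
    idx = [attrs.index(a) for a in onto]
    return {tuple(row[i] for i in idx) for row in rows}


employees = project(HOURLY_EMPLOYEE, ATTRS, ("ssn", "name", "lot", "rating", "hours_worked"))
wages = project(HOURLY_EMPLOYEE, ATTRS, ("rating", "hourly_wage"))

print(len(employees))  # 5: one row per employee, as before
print(sorted(wages))   # [(5, 7), (8, 10)]: each rating/wage pair is stored once
```

Because projection removes duplicates, the wage for each rating is recorded in exactly one place after the decomposition.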

Evaluating a Decomposition
A proposal to use decomposition on a schema should be evaluated
  Does it allow all the original information to be captured properly?
  Does it allow all the original application domain constraints to be captured properly?
These issues are formalized as properties of a decomposition
  It should be lossless-join
  It should be dependency-preserving

Overall design process
Consider a proposed schema
Find out application domain properties expressed as functional dependencies
See whether the schema is in BCNF
If not, try to find a relation which has a decomposition that is lossless-join and dependency-preserving
Repeat the evaluation for the altered schema, with the original relation replaced by its decomposed tables

Functional Dependency (FD)
Intuitively: “If two tuples of a relation R agree on values in X, then they must also agree on the Y values.”
Formally: Given a relation with schema R, and two sets of attributes X = {X1, …, Xm} ⊆ R and Y = {Y1, …, Yn} ⊆ R.
A functional dependency (FD) X → Y holds over relation R if, for every allowable instance of R and all tuples r, s in that instance:
  r.X = s.X ⇒ r.Y = s.Y
We say
  “X (functionally) determines Y”
  “Y is functionally dependent on X”

Example: Functional Dependencies
Consider a relation Orders:
  Orders (date, cid, pid, descr, price, weight, amount)
on which we can assert the following FDs:
  date, cid, pid → amount
    Each customer can only make one order for a given product each day
  pid → descr, price, weight
    pid determines product details
The following FDs do not hold:
  cid, pid → amount
    A customer could order the same product on several days
  pid → date
    The same product can be ordered on several dates
Remember: all constraints must be checked with users

Functional Dependencies (cont’d)
An FD is an assertion about the schema of a relation, not about a particular instance.
  It must be fulfilled by all allowable relations.
If we look at an instance, we cannot tell for sure that an FD holds
  The most insight we can gain by looking at a “typical” instance are “hints”…
  We can however check if the instance violates some FD
FDs must be identified based on the semantics of an application.
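
The point that an instance can refute an FD but never prove it can be sketched in Python as follows. The function name `violates_fd` and the three sample order rows are hypothetical, chosen only to illustrate the check.

```python
from itertools import combinations


def violates_fd(rows, lhs, rhs):
    """Return a pair of rows that agree on `lhs` but differ on `rhs`, or None."""
    for r, s in combinations(rows, 2):
        if all(r[a] == s[a] for a in lhs) and any(r[a] != s[a] for a in rhs):
            return r, s
    return None


# Hypothetical instance of Orders, restricted to four attributes.
orders = [
    {"date": "03.05.", "cid": 1001, "pid": 1, "amount": 100},
    {"date": "04.05.", "cid": 1001, "pid": 1, "amount": 20},
    {"date": "04.05.", "cid": 1002, "pid": 6, "amount": 50},
]

# No counterexample for date, cid, pid -> amount in this instance.
print(violates_fd(orders, ["date", "cid", "pid"], ["amount"]))  # None

# cid, pid -> amount is refuted: same customer and product, different amounts.
print(violates_fd(orders, ["cid", "pid"], ["amount"]) is not None)  # True
```

A result of `None` only means this particular instance contains no counterexample; whether the FD really holds still has to be decided from the semantics of the application.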

Approach to Schema Refinement
FDs can be used to identify schemas with problems and to suggest refinements.
  Represent FDs in the form of primary key constraints.
Main refinement technique: decomposition
  Orders (date, cid, pid, descr, price, weight, amount) with two FDs
    date, cid, pid → amount
    pid → descr, price, weight (pid determines product details)
  Two relations:
    orders2 (date, cid, pid, amount)
    product (pid, descr, price, weight)

Reasoning about FDs
Given some FDs, we can usually infer additional FDs:
  Example: cid → points, points → lot implies cid → lot
An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold.
  F+ : the closure of F is the set of all FDs that are implied by F
Armstrong’s Axioms (X, Y, Z are sets of attributes):
  1. Reflexivity rule: If X ⊆ Y, then Y → X
  2. Augmentation rule: If X → Y, then XZ → YZ for any Z
  3. Transitivity rule: If X → Y and Y → Z, then X → Z

Reasoning about FDs (cont’d)
Armstrong’s Axioms are
  Sound: they generate only FDs in F+ when applied to a set F of FDs
  Complete: repeated application of these rules will generate all FDs in the closure F+
A couple of additional rules (that follow from Armstrong’s Axioms):
  4. Union rule: If X → Y and X → Z, then X → YZ
  5. Decomposition rule: If X → YZ, then X → Y and X → Z
  6. Pseudotransitivity rule: If X → Y and SY → Z, then XS → Z
Example: Orders (date, cid, pid, descr, price, weight, amount) and FDs {date, cid, pid → amount; pid → descr, price, weight}. It follows:
  date, cid, pid → pid (reflexivity rule)
  descr, price, weight → descr (reflexivity rule)
  pid → descr (decomposition rule)
  date, cid, pid → descr (transitivity rule)

Proving the Additional Inference Rules
Rule 4: Union
  If X → Y and X → Z then X → YZ
Proof:
  X → Y implies (rule 2, augmented with X): X → XY
  X → Z implies (rule 2, augmented with Y): XY → YZ
  (transitivity rule): X → XY and XY → YZ implies X → YZ
Rule 5: Decomposition
  similar

Example
R = (A, B, C, G, H, I)
F = { A → B
      A → C
      CG → H
      CG → I
      B → H }
Some members of F+:
  A → H
    by transitivity from A → B and B → H
  AG → I
    by augmenting A → C with G, to get AG → CG, and then transitivity with CG → I
  CG → HI
    from CG → H and CG → I: the “union rule” can be inferred from the definition of functional dependencies, or by augmentation of CG → I to infer CG → CGI, augmentation of CG → H to infer CGI → HI, and then transitivity

FD Attribute Closure
Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in # attrs!)
Typically, we just want to check if a given FD X → Y is in the closure of a set of FDs F.
An efficient check is:
  Compute the FD attribute closure of X (denoted X+) wrt F:
    that is, the set of all attributes A such that X → A is in F+
    There is a linear time algorithm to compute this.
  Check if Y is in X+
E.g. does F = {A → B, B → C, CD → E} imply A → E?
  i.e., is A → E in the closure F+? Equivalently, is E in A+?

Computing the Closure of Attributes
Starting with a given set of attributes, one repeatedly expands the set by adding the right sides of FDs as soon as their left side is contained in the set:
  1. Initialise X with the given set of attributes: X = {A1, …, An}
  2. Search for some FD B1 B2 … Bm → C such that all B1, …, Bm are already in the set of attributes X, but C is not. Add C to the set X.
  3. Repeat step 2 until no more attributes can be added to X
  4. The set X is then the closure {A1, …, An}+
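
The four steps above can be sketched in Python as follows. This is a simple fixpoint loop rather than the linear-time algorithm mentioned earlier; the function name `closure` and the encoding of an FD as a `(lhs, rhs)` pair of strings of single-letter attributes are assumptions made for this illustration.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)                      # step 1: start with the given attributes
    changed = True
    while changed:                           # step 3: repeat until nothing can be added
        changed = False
        for lhs, rhs in fds:
            # step 2: the left side is covered but the right side adds something new
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result                            # step 4: the closure


# F from the example with R = (A, B, C, G, H, I)
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(sorted(closure("AG", F)))  # ['A', 'B', 'C', 'G', 'H', 'I']
print(sorted(closure("A", F)))   # ['A', 'B', 'C', 'H']
```

To test whether X → Y is implied by F, it is then enough to check `set(Y) <= closure(X, F)`.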

From FDs to Keys
Functional dependencies can be used to find keys:
Definition: Superkey
  Given a relation with schema R and a set F of functional dependencies. A set of attributes K ⊆ R is a superkey for R if K → R ∈ F+
  Note that K → R does not require K to be minimal!
  K is a candidate key if no proper subset of K has the above property.
This definition of keys using functional dependencies is consistent with our previous definition. It allows us to automatically determine keys as soon as the FDs have been found.

Example
R = (A, B, C, G, H, I)
F = { A → B
      A → C
      CG → H
      CG → I
      B → H }
(AG)+
  1. result = AG
  2. result = ABCG (A → C and A → B)
  3. result = ABCGH (CG → H and CG ⊆ AGBC)
  4. result = ABCGHI (CG → I and CG ⊆ AGBCH)
Is AG a candidate key?
  1. Is AG a superkey?
     Does AG → R? == Is (AG)+ ⊇ R
  2. Is any proper subset of AG a superkey?
     Does A → R? == Is (A)+ ⊇ R
     Does G → R? == Is (G)+ ⊇ R
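
The two questions above can be answered mechanically with the attribute closure. The following Python sketch is one way to do so; the helper names `closure`, `is_superkey` and `is_candidate_key`, and the string encoding of attribute sets, are assumptions made for this example.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_superkey(k, schema, fds):
    """K is a superkey if K+ contains every attribute of the schema."""
    return closure(k, fds) >= set(schema)


def is_candidate_key(k, schema, fds):
    """A superkey that stops being a superkey when any one attribute is removed."""
    k = set(k)
    if not is_superkey(k, schema, fds):
        return False
    # Dropping a single attribute is enough to test: every proper subset of K
    # is contained in some K - {a}, and supersets of superkeys are superkeys.
    return not any(is_superkey(k - {a}, schema, fds) for a in k)


R = "ABCGHI"
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(is_superkey("AG", R, F))       # True
print(is_superkey("A", R, F))        # False
print(is_superkey("G", R, F))        # False
print(is_candidate_key("AG", R, F))  # True
```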
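
Find Primary Key
R = {A, B, C, D}
Look for FDs whose left side determines the whole relation R, trying small attribute sets first:
  Single attributes: does A → ABCD hold? Then A is a key. Likewise for B, C, D
  Pairs: does AB → ABCD hold? Then AB may be a candidate key
    Check that neither A nor B alone determines R: if A → ABCD or B → ABCD, then AB is not a candidate key (it is not minimal)
    Likewise for AC, AD, BC, BD, CD
  Triples: does ABC → ABCD hold? Then ABC may be a candidate key
    Likewise for ABD, ACD, BCD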

Normal Forms
Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed!
If a relation is in a certain normal form (such as BCNF), it is known that certain kinds of problems are avoided/minimised. This can be used to help us decide whether decomposing the relation will help.
Role of FDs in detecting redundancy:
  Consider a relation R with 3 attributes, ABC.
  No FDs hold: There is no redundancy here.
  Given A → B: Several tuples could have the same A value, and if so, they’ll all have the same B value!

BCNF
Be careful with the use of mathematics and logic! Definitions are precise
A relation with schema R and FDs F is in BCNF (Boyce-Codd Normal Form) if, for all X → A in F+:
  A ∈ X (called a trivial FD), or
  X is a superkey.
The definition says: check each FD that applies to the relation (including those deduced from the ones you were given)
  If the rhs is contained in the lhs: ok (but uninteresting)
  If the lhs is a superkey (that is, it contains a candidate key) for the whole relation: ok
If every FD is ok, the relation is in BCNF
  In fact, you only need to check the ones you are given!
A schema is in BCNF if every relation in it is in BCNF
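
The check described above can be sketched in Python as follows, testing only the given FDs as the slide suggests. The names `closure` and `bcnf_violations` and the string encoding of single-letter attributes are assumptions made for this illustration; the sample relation R = (A, B, C) with F = {A → B, B → C} is the one used in the BCNF example.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Given FDs X -> Y that are non-trivial and whose X is not a superkey."""
    schema = set(schema)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs)             # rhs not contained in lhs
        and not closure(lhs, fds) >= schema     # lhs is not a superkey
    ]


print(bcnf_violations("ABC", [("A", "B"), ("B", "C")]))  # [('B', 'C')]
print(bcnf_violations("AB", [("A", "B")]))               # []
print(bcnf_violations("BC", [("B", "C")]))               # []
```

An empty list means no given FD violates BCNF. Note that this shortcut applies to a relation together with its own FDs; for a relation produced by decomposition, the FDs projected onto it have to be considered.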

BCNF Example
Example: Order (date, cid, pid, amount) with F = {date, cid, pid → amount} and Product (pid, descr, price, weight) with F = {pid → descr, price, weight} are in BCNF.
R = (A, B, C) with F = {A → B, B → C}
  R is not in BCNF, because B → C does not satisfy either alternative (C is not part of B; B is not a candidate key)
  Decomposition: R1 = (A, B), R2 = (B, C)
  R1 and R2 are in BCNF

Other normal forms
Design theory has provided other definitions, each with its own uses
  First normal form (1NF)
  Second normal form (2NF)
  Third normal form (3NF)
  Fourth normal form (4NF)
  etc.
But we concentrate on BCNF!

First Normal Form
A relation schema R is in first normal form (1NF) if the domains of all attributes of R are atomic.
  This is built in to the relational model as we defined it!
A domain is atomic if its elements are considered to be indivisible units
  Examples of non-atomic domains: set of names, composite attributes
Non-atomic values complicate storage and encourage redundant (repeated) storage of data

SID   CourseID       Date   Name
1235  312, 333, 454  10/10  James Albert
1412  234, 454       10/11  Nemo Davis
1311  213            10/01  Tasos West

Second Normal Form
A relation is in second normal form (2NF) if no non-key attribute is functionally dependent on just a part of the key.

Composite key: (SID, CourseID)
SID   CourseID  Grade  FName
1235  312       A      James
1412  232       F      Nemo
1235  515       A      James
1412  460       C      Nemo

(The Hourly_Employee relation from the Example slide is shown again here.)

Third normal form (3NF)
A relation R with FDs F is in 3NF if, for all X → A in F+:
  A ∈ X (called a trivial FD), or
  X is a superkey, or
  A is part of some candidate key for R.

Example
R = (J, K, L)
F = {JK → L, L → K}
Two candidate keys: JK and JL
R is in 3NF
  JK → L: JK is a superkey
  L → K: K is contained in a candidate key
There is some redundancy in this schema
Example: Advisor-schema = (Client, Problem, Consultant)
  Consultant → Problem
  Client, Problem → Consultant

Client  Consultant  Problem
James   Gomez       Marketing
James   Jay         Production
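
The 3NF test for R = (J, K, L) above can be reproduced with a small Python sketch. The candidate keys are found by brute force over attribute subsets, which is only practical for small schemas; the helper names and the string encoding of attribute sets are assumptions made for this illustration, and only the given FDs are checked.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure covers the schema (brute force)."""
    attrs = sorted(set(schema))
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            k = set(combo)
            if any(key <= k for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(k, fds) >= set(attrs):
                keys.append(k)
    return keys


def is_3nf(schema, fds):
    """Each given FD must be trivial, have a superkey lhs, or add only key attributes."""
    prime = set().union(*candidate_keys(schema, fds))  # attributes of some candidate key
    for lhs, rhs in fds:
        if closure(lhs, fds) >= set(schema):
            continue  # lhs is a superkey
        if any(a not in prime for a in set(rhs) - set(lhs)):
            return False
    return True


F = [("JK", "L"), ("L", "K")]
print([sorted(k) for k in candidate_keys("JKL", F)])  # [['J', 'K'], ['J', 'L']]
print(is_3nf("JKL", F))                               # True
print(is_3nf("ABC", [("A", "B"), ("B", "C")]))        # False
```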

BCNF and 3NF
If R is in BCNF, it is obviously also in 3NF.
If R is in 3NF, it may not be in BCNF.
Lectures (course, time, room) with F = {course → room; time, room → course} is in 3NF
  Candidate keys: {course, time} and {time, room}
but not in BCNF.
  The FD course → room violates the BCNF requirements.
Example instance:

lectures
course    time        room
Comp5138  Mo, 6-8 pm  CAR273
Comp5138  Mo, 8-9 pm  CAR273
Info3005  Tue, 9-11   Physics 1
Info3005  Thu, 12-14  Physics 1

Decomposition of Relational Schema
Suppose that relation R contains attributes A1 ... An. A decomposition of R consists of replacing R by two or more relations such that:
  Each new relation scheme contains a subset of the attributes of R (and no attributes that do not appear in R), and
  Every attribute of R appears as an attribute of one or more of the new relations.
E.g., we can decompose OrderPlus (date, cid, pid, descr, price, weight, amount) into Order (date, cid, pid, amount) and Product (pid, descr, price, weight).
Decomposing R means we will store instances which are projected from instances of R.
  Each instance in the new schema has a subset of the columns in the old instance
  Notation: π_X(R)

Decomposition and the Data
Suppose we replace R by a decomposition into R1 and R2
The information we have about the world in the new schema is what we can deduce from the instance of R1 and the instance of R2
These are all the facts which are produced by joining the two tables
  That is, take a tuple from R1 and a tuple from R2 which agree in the values for the common attributes
  Notation: R1 ⋈ R2
How does this compare to the information in the original relation, before decomposition?
  Answer: it always has the information in the original, but it may have introduced extra (spurious) information

Lossless-Join Decomposition
Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance R that satisfies F:
  π_X(R) ⋈ π_Y(R) = R
It is always true that R ⊆ π_X(R) ⋈ π_Y(R)
  In general, the other direction does not hold! If it does, the decomposition is lossless-join.
The definition is extended to decomposition into 3 or more relations in a straightforward way.
It is essential that all decompositions used to deal with redundancy be lossless, so that we are not losing knowledge by having a database which implies spurious (incorrect) statements!

Example for Lossy-Join Decomp.

Orders
date    cid   pid  descr  price  weight  amount
03.05.  1001  1    Paper  20.0   2.0     100
03.05.  1001  6    Disks  5.0    0.4     50

Ordered
date    cid   pid  amount
03.05.  1001  1    100
03.05.  1001  6    50

Product
date    descr  price  weight
03.05.  Paper  20.0   2.0
03.05.  Disks  5.0    0.4

Ordered ⋈ Product
date    cid   pid  descr  price  weight  amount
03.05.  1001  1    Paper  20.0   2.0     100
03.05.  1001  6    Disks  5.0    0.4     50
03.05.  1001  1    Disks  5.0    0.4     100
03.05.  1001  6    Paper  20.0   2.0     50
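
The spurious tuples in this example can be reproduced with a few lines of Python. This is only a sketch: the helpers `project` and `natural_join` and the representation of a tuple as a frozenset of (attribute, value) pairs are assumptions made for the illustration.

```python
def project(rows, attrs):
    """Projection onto `attrs`; each tuple becomes a frozenset of (attribute, value) pairs."""
    return {frozenset((a, row[a]) for a in attrs) for row in rows}


def natural_join(left, right):
    """Combine every pair of tuples that agree on their common attributes."""
    out = set()
    for l in left:
        for r in right:
            dl, dr = dict(l), dict(r)
            if all(dl[a] == dr[a] for a in dl.keys() & dr.keys()):
                out.add(frozenset({**dl, **dr}.items()))
    return out


orders = [
    {"date": "03.05.", "cid": 1001, "pid": 1, "descr": "Paper",
     "price": 20.0, "weight": 2.0, "amount": 100},
    {"date": "03.05.", "cid": 1001, "pid": 6, "descr": "Disks",
     "price": 5.0, "weight": 0.4, "amount": 50},
]
original = project(orders, list(orders[0]))
ordered = project(orders, ["date", "cid", "pid", "amount"])

# Lossy: the two parts share only `date`, which identifies neither side.
product_by_date = project(orders, ["date", "descr", "price", "weight"])
lossy = natural_join(ordered, product_by_date)
print(len(original), len(lossy))  # 2 4
print(original < lossy)           # True: two spurious tuples were added

# Lossless: the two parts share `pid`, which determines the product details.
product_by_pid = project(orders, ["pid", "descr", "price", "weight"])
print(natural_join(ordered, product_by_pid) == original)  # True
```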

More on Lossless-Join Decomposition
Important Theorem: The decomposition of R into X and Y is lossless-join wrt F if and only if the closure of F contains:
  X ∩ Y → X, or
  X ∩ Y → Y
C.f. the anti-example on the earlier slide:
  F = {date, cid, pid → amount; pid → descr, price, weight}
  F+ does not contain date → descr, price, weight
  F+ does not contain date → cid, pid, amount
L12L12 NormalizationNormalizationDependency-Preserving
Decomposition
Consider CSJDPQV, C is key, JP C and SD P.therefore JSD C so,BCNF decomposition: CJSDQV and SDPProblem: Checking JP C requires a join!
Intuitive:If R is decomposed into S and T, and we enforce the FDs that hold on S and on T, then all FDs that were given to hold on R must also hold.
Projection of set of FDs F: If a relational schema R is decomposed into S,T …, the projection of F onto S (denoted FS ) is the set of FDs U V in F+ (closure of F ) such that U, V are in S.

Dependency Preserving Decomposition Def
Given a relation with schema R and a set of FDs F. A decomposition of R into S and T is a dependency preserving decomposition if
  (F_S ∪ F_T)+ = F+
i.e., if we consider only dependencies in the closure F+ that can be checked in S without considering T, and in T without considering S, these imply all dependencies in F+.
It is important to consider F+, not F, in this definition:
  ABC, F = {A → B, B → C, A → C}, decomposed into AB and BC.
  Is this dependency preserving? Is A → C preserved?
  If F_AB = {A → B} and F_BC = {B → C}, then (F_AB ∪ F_BC) = {A → B, B → C}
  But F = {A → B, B → C, A → C}, hence F ≠ (F_AB ∪ F_BC)
  However, A → C follows from A → B and B → C by transitivity, so (F_AB ∪ F_BC)+ = F+ and the decomposition is dependency preserving.
Dependency preserving does not imply lossless join, and vice versa!
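
Computing the projections F_S and F_T explicitly can be avoided. The Python sketch below uses a commonly taught test: for each given FD X → Y, grow X using only closures restricted to one part at a time, and check whether Y is reached. The function names and the encoding of FDs as pairs of attribute-name sets are assumptions made for this illustration.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def preserves_dependencies(parts, fds):
    """True if every FD in `fds` follows from the FDs projected onto the parts."""
    parts = [set(p) for p in parts]
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for p in parts:
                # Attributes derivable inside part p from what we already have.
                gained = closure(result & p, fds) & p
                if not gained <= result:
                    result |= gained
                    changed = True
        if not set(rhs) <= result:
            return False
    return True


# ABC with F = {A -> B, B -> C, A -> C}, decomposed into AB and BC.
F_abc = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"A"}, {"C"})]
print(preserves_dependencies([{"A", "B"}, {"B", "C"}], F_abc))  # True

# Lectures decomposed into Rooms(course, room) and Times(course, time).
F_lect = [({"course"}, {"room"}), ({"time", "room"}, {"course"})]
print(preserves_dependencies([{"course", "room"}, {"course", "time"}], F_lect))  # False
```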

Examples for Dependency-Preserving
The decomposition of Orders (date, cid, pid, descr, price, weight, amount) with F = {date, cid, pid → amount; pid → descr, price, weight} into
  Ordered (date, cid, pid, amount) with F1 = {date, cid, pid → amount} and
  Product (pid, descr, price, weight) with F2 = {pid → descr, price, weight}
is dependency-preserving.
The decomposition of Lectures (course, time, room) with F = {course → room; time, room → course} into
  Rooms (course, room) with F1 = {course → room} and
  Times (course, time) with F2 = {}
is not dependency-preserving.

Central Ideas of Schema Refinement
Every relation R with FDs F has a decomposition into 3NF relations which is both lossless-join and dependency-preserving.
Every relation R with a set of FDs F has a decomposition into BCNF relations which is lossless-join.
Unfortunately, there may be no dependency-preserving decomposition into a collection of BCNF relational schemas

Algorithm for Schema Refinement
In general, several dependencies may cause violation of BCNF…
Consider relation R with FDs F. If X → Y violates BCNF, decompose R into R − Y and XY.
Repeated application of this idea will give us a collection of relations that are in BCNF; the result is a lossless-join decomposition, and the process is guaranteed to terminate.
The order in which we “deal with” the violating dependencies could lead to very different sets of relations!
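
The repeated decomposition can be sketched in Python as follows. This is a brute-force variant under stated assumptions, not necessarily the procedure intended in the lecture: violations are searched over all attribute subsets (exponential, so only suitable for small schemas), and each split uses the full set of attributes determined by X, which is one particular choice of Y. The function names and the sample schema encoding are assumptions made for the illustration.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def find_violation(rel, fds):
    """Return (X, Y) where X -> Y is non-trivial on `rel` and X is not a superkey of `rel`."""
    for size in range(1, len(rel)):
        for combo in combinations(sorted(rel), size):
            x = set(combo)
            determined = closure(x, fds) & rel   # what X determines inside rel
            if determined != x and determined != rel:
                return x, determined - x
    return None


def bcnf_decompose(rel, fds):
    """Split `rel` into XY and rel - Y until no BCNF violation remains."""
    rel = set(rel)
    violation = find_violation(rel, fds)
    if violation is None:
        return [rel]
    x, y = violation
    return bcnf_decompose(x | y, fds) + bcnf_decompose(rel - y, fds)


course_schema = {"cid", "title", "cpoints", "workload", "lecturer", "sid", "grade"}
F = [
    ({"cid"}, {"title", "cpoints", "lecturer"}),
    ({"cpoints"}, {"workload"}),
    ({"cid", "sid"}, {"grade"}),
]
for r in bcnf_decompose(course_schema, F):
    print(sorted(r))
```

On the course schema this sketch yields three relations over (cpoints, workload), (cid, title, cpoints, lecturer) and (cid, sid, grade). A different search order or a different choice of Y can produce a different result, as noted above.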

Example: BCNF Decomposition
Given the following schema:
  course-schema (cid, title, cpoints, workload, lecturer, sid, grade)
  F = { cid → title, cpoints, lecturer; cpoints → workload; cid, sid → grade }
  {cid}+ = { cid, title, cpoints, lecturer, workload }
  {cpoints}+ = { cpoints, workload }
  {cid, sid}+ = { cid, sid, title, cpoints, lecturer, workload, grade }
Candidate Key: { cid, sid }

Example: BCNF Decomposition (cont’d)
Course-schema is not in BCNF, because, e.g., the first and second FDs violate BCNF
Decompose Course-schema along cpoints → workload first!
  Workloads (cpoints, workload)
  Course-schema’ (cid, title, cpoints, lecturer, sid, grade)
Course-schema’ is still not in BCNF; further decompose along cid → title, cpoints, lecturer
  Workloads (cpoints, workload) with F2 = {cpoints → workload}
  Courses (cid, title, cpoints, lecturer) with F1 = {cid → title, cpoints, lecturer}
  Enrolled (cid, sid, grade) with F3 = {cid, sid → grade}
Is this decomposition lossless-join & dependency-preserving?

Example: BCNF Decomposition (cont’d)
If you decompose along cid → title, cpoints, lecturer first, there is a different schema!
  Courses (cid, title, cpoints, lecturer), with F1 = {cid → title, cpoints, lecturer}
  Workload_enroll (cid, workload, sid, grade), with F3 = {cid, sid → grade}
Can we decompose any of the above schemas further? How about the FD cpoints → workload?

Wrap-Up
Motivation: avoid redundancy
Functional Dependency
Normal Forms
  BCNF
Decomposition