COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 4 Normalization.


Today’s Agenda

Motivation
Making it precise: Functional Dependency
Making it precise: Normal Form
Making it precise: Schema Decomposition


Schema Design Process

The relational schema is best obtained by starting with a conceptual design (e.g. in the ER model). This can be converted to a relational schema.

However, the relational schema may arise in other ways:
  E.g. start from data in a spreadsheet. This typically gives one wide table!
  E.g. choose tables by raw intuition.

We should evaluate the schema, and improve it if necessary.


Evaluation of a design

The most important requirement is adequacy: the design should allow representing all the important facts about the application domain.
  Make sure every important process can be done using the data in the database, by joining tables as needed.
  E.g. “Can we find out which driver made a particular delivery?”

If a design is adequate, then we seek to avoid redundancy in the data.


Example

Redundant data is where the same information is repeated in several places in the database.

Hourly_Employee relation

Ssn    Name          Lot   Rating   Hourly_wage   Hours_worked
3666   ‘Attishoo’    48    8        10            40
5368   ‘Smiley’      22    8        10            30
3650   ‘Smethurst’   35    8        10            30
3751   ‘Guldu’       35    5        7             32
4134   ‘Madayan’     35    5        7             40


Example (cont’d)

[Hourly_Employee table repeated, with the annotation “Redundant Information”]


Domain knowledge

Whether a design keeps redundant data depends on the application domain’s rules.

It is not enough just to see that the same values occur several times in some instance. You must check with users to be certain.
  E.g. ask: “Is it always the case that two hourly employees with the same rating have the same hourly wage?”

[Hourly_Employee table repeated]

Evils of Redundancy

Redundancy is at the root of several problems associated with relational schemas:
  Update anomaly: If we change the hourly wage in one employee’s record, we may end up with an inconsistent database (where the domain constraint is not true).
  Insertion anomaly: We cannot insert a record for an employee unless we know the hourly wage for the employee’s rating value.
  Another insertion anomaly: We cannot record the hourly wage for a rating value unless there is an employee with that rating.
  Deletion anomaly: If we delete all records with a given rating value, we lose the association between that rating value and its hourly-wage value!

It is the anomalies with modification that are the serious concern, not the extra space used in storage.

Normal Forms

There is a theory that allows one to see whether or not a particular schema risks having redundancy anomalies.
This theory looks at the design and also at constraints on the application, expressed in a mathematical form as functional dependencies.
If the design and its dependencies have certain properties, we say that the design is in a particular normal form.
  There are several normal forms in use; we focus on BCNF.

Our goal is to use schemas that are in BCNF.

Decomposition

Decomposition changes a schema by replacing one relation with other relations.
  Between them, the decomposed relations have all the attributes of the original.
  The data in each decomposed table is a projection of the data in the original table.

Ssn    Name          Lot   Rating   Hours_worked
3666   ‘Attishoo’    48    8        40
5368   ‘Smiley’      22    8        30
3650   ‘Smethurst’   35    8        30
3751   ‘Guldu’       35    5        32
4134   ‘Madayan’     35    5        40

Rating   Hourly_wage
8        10
5        7
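The projection step can be illustrated with a minimal Python sketch. The names `project`, `employees`, `wages` and `rejoined` are hypothetical helpers chosen for this illustration; the rows are the Hourly_Employee values from the slides.

```python
# Hourly_Employee rows: (ssn, name, lot, rating, hourly_wage, hours_worked)
COLUMNS = ("ssn", "name", "lot", "rating", "hourly_wage", "hours_worked")
hourly_employee = [
    (3666, "Attishoo", 48, 8, 10, 40),
    (5368, "Smiley", 22, 8, 10, 30),
    (3650, "Smethurst", 35, 8, 10, 30),
    (3751, "Guldu", 35, 5, 7, 32),
    (4134, "Madayan", 35, 5, 7, 40),
]


def project(rows, columns, keep):
    """Project rows onto the columns named in `keep`, dropping duplicates."""
    idx = [columns.index(c) for c in keep]
    return {tuple(row[i] for i in idx) for row in rows}


# The two decomposed tables are projections of the original table.
employees = project(hourly_employee, COLUMNS,
                    ("ssn", "name", "lot", "rating", "hours_worked"))
wages = project(hourly_employee, COLUMNS, ("rating", "hourly_wage"))

# Joining the two tables on rating rebuilds rows in the original column order.
rejoined = {
    e[:4] + (w[1],) + e[4:]
    for e in employees
    for w in wages
    if e[3] == w[0]
}
```

In this instance the wage for each rating is stored once in `wages`, and `rejoined` contains exactly the original five rows.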

Evaluating a Decomposition

A proposal to use decomposition on a schema should be evaluated:
  Does it allow all the original information to be captured properly?
  Does it allow all the original application domain constraints to be captured properly?

These issues are formalized as properties of a decomposition:
  It should be lossless-join.
  It should be dependency-preserving.

Overall design process

1. Consider a proposed schema.
2. Find out application domain properties expressed as functional dependencies.
3. See whether the schema is in BCNF.
4. If not, try to find a relation which has a decomposition that is lossless-join and dependency-preserving.
5. Repeat the evaluation for the altered schema, with the original relation replaced by its decomposed tables.




Functional Dependency (FD)

Intuitively: “If two tuples of a relation R agree on values in X, then they must also agree on the Y values.”

Formally: Given a relation R with schema R, and two sets of attributes X = {X1, …, Xm} ⊆ R and Y = {Y1, …, Yn} ⊆ R.
A functional dependency (FD) X → Y holds over relation R if, for every allowable instance R of R:

  ∀ r, s ∈ R : r.X = s.X ⇒ r.Y = s.Y

We say:
  “X (functionally) determines Y”
  “Y is functionally dependent on X”
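The formal condition can be turned into a small check on one instance. This is only a sketch: `violates_fd` is a hypothetical helper, tuples are modelled as Python dicts, and the sample rows reuse values from the Hourly_Employee example.

```python
from itertools import combinations


def violates_fd(rows, lhs, rhs):
    """Return True if some pair of rows agrees on lhs but differs on rhs."""
    for r, s in combinations(rows, 2):
        agree_lhs = all(r[a] == s[a] for a in lhs)
        agree_rhs = all(r[a] == s[a] for a in rhs)
        if agree_lhs and not agree_rhs:
            return True
    return False


rows = [
    {"rating": 8, "hourly_wage": 10, "hours_worked": 40},
    {"rating": 8, "hourly_wage": 10, "hours_worked": 30},
    {"rating": 5, "hourly_wage": 7, "hours_worked": 32},
]

wage_violated = violates_fd(rows, ["rating"], ["hourly_wage"])
hours_violated = violates_fd(rows, ["rating"], ["hours_worked"])
```

A result of `False` only says that this particular instance does not contradict the FD; it does not establish that the FD holds for every allowable instance.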

Example: Functional Dependencies

Consider a relation Orders:
  Orders ( date, cid, pid, descr, price, weight, amount )

on which we can assert the following FDs:
  date cid pid → amount
    Each customer can only make one order for a given product each day.
  pid → descr price weight
    pid determines product details.

The following FDs do not hold:
  cid pid → amount
    A customer could order the same product on several days.
  pid → date
    The same product can be ordered on several dates.

Remember: all constraints must be checked with users.


Functional Dependencies (cont’d)

An FD is an assertion about the schema of a relation, not about a particular instance.
  It must be fulfilled by all allowable instances of the relation.

If we look at an instance, we cannot tell for sure that an FD holds.
  The most we can gain by looking at a “typical” instance are “hints”…
  We can, however, check whether the instance violates some FD.

FDs must be identified based on the semantics of an application.

Approach to Schema Refinement

FDs can be used to identify schemas with problems and to suggest refinements.
  Represent FDs in the form of primary key constraints.

Main refinement technique: decomposition

Orders ( date, cid, pid, descr, price, weight, amount ) with two FDs:
  date cid pid → amount
  pid → descr price weight (pid determines product details)

Two relations:
  orders2 ( date, cid, pid, amount )
  product ( pid, descr, price, weight )

Reasoning about FDs

Given some FDs, we can usually infer additional FDs:
  Example: cid → points, points → lot implies cid → lot

An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold.
  F+ (the closure of F) is the set of all FDs that are implied by F.

Armstrong’s Axioms (X, Y, Z are sets of attributes):
1. Reflexivity rule: If X ⊆ Y, then Y → X
2. Augmentation rule: If X → Y, then XZ → YZ for any Z
3. Transitivity rule: If X → Y and Y → Z, then X → Z

Reasoning about FDs (cont’d)

Armstrong’s Axioms are
  Sound: they generate only FDs in F+ when applied to a set F of FDs.
  Complete: repeated application of these rules will generate all FDs in the closure F+.

A couple of additional rules (that follow from Armstrong’s Axioms):
4. Union rule: If X → Y and X → Z, then X → YZ
5. Decomposition rule: If X → YZ, then X → Y and X → Z
6. Pseudotransitivity rule: If X → Y and SY → Z, then XS → Z

Example: Orders ( date, cid, pid, descr, price, weight, amount ) and FDs { date cid pid → amount, pid → descr price weight }. It follows:
  date, cid, pid → pid (reflexivity rule)
  descr, price, weight → descr (reflexivity rule)
  pid → descr (decomposition rule)
  date, cid, pid → descr (transitivity rule)

Proving the Additional Inference Rules

Rule 4: Union
If X → Y and X → Z then X → YZ

Proof:
  X → Y implies (rule 2, augmented with X): X → XY
  X → Z implies (rule 2, augmented with Y): XY → YZ
  X → XY and XY → YZ implies (transitivity rule): X → YZ

Rule 5: Decomposition
Similar: by reflexivity, YZ → Y and YZ → Z; transitivity with X → YZ then gives X → Y and X → Z.

Example

R = (A, B, C, G, H, I)
F = { A → B, A → C, CG → H, CG → I, B → H }

Some members of F+:
  A → H
    by transitivity from A → B and B → H
  AG → I
    by augmenting A → C with G, to get AG → CG, and then transitivity with CG → I
  CG → HI
    from CG → H and CG → I: the “union rule” can be inferred from the definition of functional dependencies, or by augmentation of CG → I to infer CG → CGI, augmentation of CG → H to infer CGI → HI, and then transitivity

FD Attribute Closure

Computing the closure of a set of FDs can be expensive. (The size of the closure is exponential in the number of attributes!)

Typically, we just want to check if a given FD X → Y is in the closure of a set of FDs F.

An efficient check is:
  Compute the attribute closure of X (denoted X+) wrt F:
    the set of all attributes A such that X → A is in F+.
    There is a linear time algorithm to compute this.
  Check if Y is in X+.

E.g. does F = { A → B, B → C, C D → E } imply A → E?
  I.e., is A → E in the closure F+? Equivalently, is E in A+?

Computing the Closure of Attributes

Starting with a given set of attributes, one repeatedly expands the set by adding the right sides of FDs as soon as their left side is added:
1. Initialise X with the given set of attributes: X = {A1, …, An}
2. Search for some FD B1 B2 … Bm → C such that all of B1, …, Bm are already in the set of attributes X, but C is not. Add C to the set X.
3. Repeat step 2 until no more attributes can be added to X.
4. The set X is then the closure {A1, …, An}+ of the given attributes.
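The steps above can be sketched in Python as follows. This is a simple fixpoint loop, not the linear-time algorithm mentioned earlier; the function name `attribute_closure` and the encoding of an FD as a `(lhs, rhs)` pair of strings of single-letter attributes are assumptions made for this illustration.

```python
def attribute_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute collections."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD once its whole left side is in the closure.
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


# F = {A -> B, B -> C, CD -> E}; each string is read as a set of letters.
F = [("A", "B"), ("B", "C"), ("CD", "E")]
a_plus = attribute_closure("A", F)
```

For this F, `a_plus` contains A, B and C but not E (CD → E never fires because D is missing), so A → E is not implied by F.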

From FDs to Keys

Functional dependencies can be used to find keys:

Definition: Superkey
Given a relation R with schema R and a set F of functional dependencies. A set of attributes K ⊆ R is a superkey for R if K → R ∈ F+.

Note that K → R does not require K to be minimal!
K is a candidate key if no proper subset of K has the above property.

This definition of keys using functional dependencies is consistent with our previous definition. It allows us to automatically determine keys as soon as the FDs have been found.


Example

R = (A, B, C, G, H, I)
F = { A → B, A → C, CG → H, CG → I, B → H }

(AG)+
1. result = AG
2. result = ABCG (A → C and A → B)
3. result = ABCGH (CG → H and CG ⊆ AGBC)
4. result = ABCGHI (CG → I and CG ⊆ AGBCH)

Is AG a candidate key?
1. Is AG a superkey?
     Does AG → R? == Is (AG)+ ⊇ R?
2. Is any proper subset of AG a superkey?
     Does A → R? == Is (A)+ ⊇ R?
     Does G → R? == Is (G)+ ⊇ R?
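The two questions can be answered mechanically with attribute closures. The sketch below is an illustration only: `attribute_closure`, `is_superkey` and `is_candidate_key` are hypothetical helper names, and attribute sets are written as strings of single-letter attributes.

```python
def attribute_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute collections."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def is_superkey(attrs, relation, fds):
    """K is a superkey if its closure covers every attribute of the relation."""
    return attribute_closure(attrs, fds) >= set(relation)


def is_candidate_key(attrs, relation, fds):
    """A candidate key is a superkey that stops being one when any attribute is removed."""
    attrs = set(attrs)
    if not is_superkey(attrs, relation, fds):
        return False
    return not any(is_superkey(attrs - {a}, relation, fds) for a in attrs)


R = "ABCGHI"
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]
ag_is_candidate_key = is_candidate_key("AG", R, F)
```

Removing one attribute at a time is enough for the minimality test, because any superset of a superkey is again a superkey. Here (A)+ = ABCH and (G)+ = G, so neither A nor G alone is a superkey and AG is a candidate key.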

Find Primary Key

R = { A, B, C, D }

Find FDs whose right-hand side covers the whole relation R:
  A → ABCD: is A a primary key?
    Likewise check B, C, D.
  AB → ABCD: is AB a candidate key?
    Check whether the FDs of A and of B are independent: if A → ABCD or B → ABCD, then AB is not a key.
    Likewise check AC, AD, BC, BD, CD.
  ABC → ABCD: is ABC a candidate key?
    Likewise check ABD, ACD, BCD.




Normal Forms

Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed!

If a relation is in a certain normal form (such as BCNF), it is known that certain kinds of problems are avoided/minimised. This can be used to help us decide whether decomposing the relation will help.

Role of FDs in detecting redundancy:
  Consider a relation R with 3 attributes, ABC.
    No FDs hold: There is no redundancy here.
    Given A → B: Several tuples could have the same A value, and if so, they’ll all have the same B value!

BCNF

Be careful with the use of mathematics and logic! Definitions are precise.

A relation with schema R and FDs F is in BCNF (Boyce-Codd Normal Form) if, for all X → A in F+:
  A ∈ X (called a trivial FD), or
  X is a superkey.

The definition says: check each FD that applies to the relation (including those deduced from the ones you were given):
  If the rhs is contained in the lhs: ok (but uninteresting).
  If the lhs is a superkey for the whole relation: ok.
  If every FD is ok, the relation is in BCNF.
  In fact, you only need to check the ones you are given!

A schema is in BCNF if every relation in it is in BCNF.

BCNF Example

Example: Order ( date, cid, pid, amount ) with F = { date cid pid → amount } and Product ( pid, descr, price, weight ) with F = { pid → descr price weight } are in BCNF.

R = (A, B, C) with F = { A → B, B → C }
  R is not in BCNF, because B → C satisfies neither alternative (C is not part of B; B is not a superkey).
  Decomposition: R1 = (A, B), R2 = (B, C)
  R1 and R2 are in BCNF.
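The BCNF test on the given FDs can be sketched in Python. The helper names `attribute_closure` and `bcnf_violations` are hypothetical, and FDs are encoded as `(lhs, rhs)` strings of single-letter attributes. The sketch assumes that the FDs passed in are the ones that hold on the relation being tested.

```python
def attribute_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute collections."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def bcnf_violations(relation, fds):
    """Return the given FDs that are non-trivial and whose lhs is not a superkey."""
    relation = set(relation)
    bad = []
    for lhs, rhs in fds:
        trivial = set(rhs) <= set(lhs)
        superkey = attribute_closure(lhs, fds) >= relation
        if not trivial and not superkey:
            bad.append((lhs, rhs))
    return bad


violations_r = bcnf_violations("ABC", [("A", "B"), ("B", "C")])
violations_r1 = bcnf_violations("AB", [("A", "B")])
violations_r2 = bcnf_violations("BC", [("B", "C")])
```

For R = (A, B, C) the only violation reported is B → C, while R1 and R2 report none, matching the example.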

Other normal forms

Design theory has provided other definitions, each with its own uses:
  First normal form (1NF)
  Second normal form (2NF)
  Third normal form (3NF)
  Fourth normal form (4NF)
  etc.

But we concentrate on BCNF!

First Normal Form

A relation R is in first normal form (1NF) if the domains of all attributes of R are atomic.
  This is built in to the relational model as we defined it!

A domain is atomic if its elements are considered to be indivisible units.
  Examples of non-atomic domains: sets of names, composite attributes

Non-atomic values complicate storage and encourage redundant (repeated) storage of data.

SID    CourseID        Date    Name
1235   312, 333, 454   10/10   James Albert
1412   234, 454        10/11   Nemo Davis
1311   213             10/01   Tasos West

Second Normal Form

Second Normal Form (2NF): a relation is in 2NF if no nonkey attribute is functionally dependent on just a part of the key.

Composite key: SID, CourseID

SID    CourseID   Grade   FName
1235   312        A       James
1412   232        F       Nemo
1235   515        A       James
1412   460        C       Nemo

[Hourly_Employee table repeated]

Third normal form (3NF)

Relation R with FDs F is in 3NF if, for all X → A in F+:
  A ∈ X (called a trivial FD), or
  X is a superkey, or
  A is part of some candidate key for R.


Example

R = (J, K, L)
F = { JK → L, L → K }
Two candidate keys: JK and JL
R is in 3NF:
  JK → L: JK is a superkey
  L → K: K is contained in a candidate key

There is some redundancy in this schema.
Example: Advisor-schema = (Client, Problem, Consultant)
  Consultant → Problem
  Client, Problem → Consultant

Client   Consultant   Problem
James    Gomez        Marketing
James    Jay          Production

BCNF and 3NF

If R is in BCNF, it is obviously in 3NF.
If R is in 3NF, it may not be in BCNF.

Lectures ( course, time, room ) with F = { course → room, time room → course } is in 3NF
  Candidate keys: {course, time} and {time, room}
but not in BCNF.
  FD course → room violates the BCNF requirements.

Example instance:

Lectures
course     time         room
Comp5138   Mo, 6-8 pm   CAR273
Comp5138   Mo, 8-9 pm   CAR273
Info3005   Tue, 9-11    Physics 1
Info3005   Thu, 12-14   Physics 1
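The difference between the two normal forms on the Lectures schema can be checked with a short Python sketch. All helper names (`attribute_closure`, `candidate_keys`, `violations`) are hypothetical; candidate keys are found by brute force over attribute subsets, which is only practical for small schemas, and only the given FDs are examined.

```python
from itertools import combinations


def attribute_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute sets."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def candidate_keys(relation, fds):
    """All minimal attribute sets whose closure covers the relation."""
    relation = frozenset(relation)
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            combo = frozenset(combo)
            if any(k <= combo for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if attribute_closure(combo, fds) >= relation:
                keys.append(combo)
    return keys


def violations(relation, fds, form):
    """Pairs (lhs, attribute) from the given FDs that break `form` ("BCNF" or "3NF")."""
    relation = set(relation)
    prime = set().union(*candidate_keys(relation, fds))
    bad = []
    for lhs, rhs in fds:
        for a in sorted(set(rhs) - set(lhs)):  # skip the trivial part
            if attribute_closure(lhs, fds) >= relation:
                continue  # lhs is a superkey
            if form == "3NF" and a in prime:
                continue  # attribute is part of some candidate key
            bad.append((lhs, a))
    return bad


LECTURES = {"course", "time", "room"}
F = [({"course"}, {"room"}), ({"time", "room"}, {"course"})]
keys = candidate_keys(LECTURES, F)
```

With these inputs `keys` holds {course, time} and {room, time}; the 3NF check reports nothing, while the BCNF check reports course → room.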




Decomposition of Relational Schema

Suppose that relation R contains attributes A1 ... An. A decomposition of R consists of replacing R by two or more relations such that:
  Each new relation schema contains a subset of the attributes of R (and no attributes that do not appear in R), and
  Every attribute of R appears as an attribute of one or more of the new relations.

E.g., we can decompose OrderPlus ( date, cid, pid, descr, price, weight, amount ) into Order ( date, cid, pid, amount ) and Product ( pid, descr, price, weight ).

Decomposing R means we will store instances which are projected from instances of R.
  Each instance in the new schema has a subset of the columns in the old instance.
  Notation: π_X(R)

Decomposition and the Data

Suppose we replace R by a decomposition into R1 and R2.
The information we have about the world in the new schema is what we can deduce from the instance of R1 and the instance of R2.
These are all the facts which are produced by joining the two tables.
  That is, take a tuple from R1 and a tuple from R2 which agree in the values for the common attributes.
  Notation: R1 ⋈ R2

How does this compare to the information in the original relation, before decomposition?
  Answer: it always has the information in the original, but it may have introduced extra (spurious) information.

Lossless-Join Decomposition

Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance R that satisfies F:

  π_X(R) ⋈ π_Y(R) = R

It is always true that R ⊆ π_X(R) ⋈ π_Y(R).
  In general, the other direction does not hold! If it does, the decomposition is lossless-join.

The definition is extended to decomposition into 3 or more relations in a straightforward way.

It is essential that all decompositions used to deal with redundancy be lossless, so that we are not losing knowledge by having a database which implies spurious (incorrect) statements!

Example for Lossy-Join Decomp.

Orders
date     cid    pid   descr   price   weight   amount
03.05.   1001   1     Paper   20.0    2.0      100
03.05.   1001   6     Disks   5.0     0.4      50

Ordered
date     cid    pid   amount
03.05.   1001   1     100
03.05.   1001   6     50

Product
descr   price   weight   date
Paper   20.0    2.0      03.05.
Disks   5.0     0.4      03.05.

Ordered ⋈ Product
date     cid    pid   descr   price   weight   amount
03.05.   1001   1     Paper   20.0    2.0      100
03.05.   1001   6     Disks   5.0     0.4      50
03.05.   1001   1     Disks   5.0     0.4      100
03.05.   1001   6     Paper   20.0    2.0      50

More on Lossless-Join Decomposition

Important Theorem: The decomposition of R into X and Y is lossless-join wrt F if and only if the closure of F contains:
  X ∩ Y → X, or
  X ∩ Y → Y

Cf. the anti-example on the earlier slide:
  F = { date cid pid → amount, pid → descr price weight }
  F+ does not contain date → descr price weight
  F+ does not contain date → cid pid amount
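The theorem gives a direct test based on the attribute closure of X ∩ Y. The following Python sketch applies it to the Orders example; `attribute_closure` and `is_lossless` are hypothetical helper names and FDs are encoded as pairs of attribute sets.

```python
def attribute_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute sets."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def is_lossless(x, y, fds):
    """Binary lossless-join test: (X ∩ Y)+ must contain all of X or all of Y."""
    x, y = set(x), set(y)
    common_closure = attribute_closure(x & y, fds)
    return x <= common_closure or y <= common_closure


F = [
    ({"date", "cid", "pid"}, {"amount"}),
    ({"pid"}, {"descr", "price", "weight"}),
]

# Decomposition from the lossy example: the two tables share only `date`.
lossy = is_lossless({"date", "cid", "pid", "amount"},
                    {"date", "descr", "price", "weight"}, F)

# Decomposition sharing `pid`, which determines every Product attribute.
lossless = is_lossless({"date", "cid", "pid", "amount"},
                       {"pid", "descr", "price", "weight"}, F)
```

The first call returns `False` because {date}+ is just {date}; the second returns `True` because {pid}+ covers all attributes of Product.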

Dependency-Preserving Decomposition

Consider CSJDPQV, where C is the key, JP → C and SD → P.
  Therefore JSD → C, so a BCNF decomposition is: CJSDQV and SDP.
  Problem: Checking JP → C requires a join!

Intuitively: If R is decomposed into S and T, and we enforce the FDs that hold on S and on T, then all FDs that were given to hold on R must also hold.

Projection of a set of FDs F: If a relational schema R is decomposed into S, T, …, the projection of F onto S (denoted F_S) is the set of FDs U → V in F+ (closure of F) such that U, V are in S.

Dependency Preserving Decomposition: Definition

Given a relation R with schema R and a set of FDs F. A decomposition of R into S and T is a dependency-preserving decomposition if (F_S ∪ F_T)+ = F+
  i.e., if we consider only dependencies in the closure F+ that can be checked in S without considering T, and in T without considering S, these imply all dependencies in F+.

Important to consider F+, not F, in this definition:
  ABC, F = { A → B, B → C, A → C }, decomposed into AB and BC.
  Is this dependency preserving? Is A → C preserved?
  If F_AB = { A → B } and F_BC = { B → C }, then (F_AB ∪ F_BC) = { A → B, B → C }.
  But F = { A → B, B → C, A → C }, hence F ≠ (F_AB ∪ F_BC). Nevertheless, A → C follows from A → B and B → C by transitivity, so (F_AB ∪ F_BC)+ = F+ and A → C is preserved.

Dependency preserving does not imply lossless join, and vice versa!

Examples for Dependency-Preserving

The decomposition of Orders ( date, cid, pid, descr, price, weight, amount ) with F = { date cid pid → amount, pid → descr price weight } into
  Ordered ( date, cid, pid, amount ) with F1 = { date cid pid → amount } and
  Product ( pid, descr, price, weight ) with F2 = { pid → descr price weight }
is dependency-preserving.

The decomposition of Lectures ( course, time, room ) with F = { course → room, time room → course } into
  Rooms ( course, room ) with F1 = { course → room } and
  Times ( course, time ) with F2 = {}
is not dependency-preserving.
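Dependency preservation can be tested without materialising the projected FD sets. The Python sketch below uses a standard textbook procedure (not stated on the slides): for each given FD X → Y it grows X using only what can be derived inside one decomposed relation at a time, and then checks whether Y was reached. The names `attribute_closure` and `preserves_dependencies` are hypothetical.

```python
def attribute_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute collections."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def preserves_dependencies(parts, fds):
    """True if every FD in `fds` is implied by the FDs projected onto `parts`."""
    parts = [set(p) for p in parts]
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for part in parts:
                # Attributes derivable from result using FDs checkable within `part`.
                gained = attribute_closure(result & part, fds) & part
                if not gained <= result:
                    result |= gained
                    changed = True
        if not set(rhs) <= result:
            return False
    return True


abc_ok = preserves_dependencies(
    ["AB", "BC"], [("A", "B"), ("B", "C"), ("A", "C")])

lectures_ok = preserves_dependencies(
    [{"course", "room"}, {"course", "time"}],
    [({"course"}, {"room"}), ({"time", "room"}, {"course"})])
```

Here `abc_ok` is `True` (A → C is recovered through B), while `lectures_ok` is `False` because time room → course cannot be checked inside Rooms or Times alone.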

Central Ideas of Schema Refinement

Every relation R with FDs F can be decomposed into 3NF relations in a way that is both lossless-join and dependency-preserving.

Every relation R with set of FDs F can be decomposed into BCNF relations in a way that is lossless-join.

Unfortunately, there may be no dependency-preserving decomposition into a collection of BCNF relation schemas.

Algorithm for Schema Refinement

In general, several dependencies may cause violation of BCNF…

Consider relation R with FDs F. If X → Y violates BCNF, decompose R into R - Y and XY.

Repeated application of this idea will give us a collection of relations that are in BCNF; the result is a lossless-join decomposition, and the process is guaranteed to terminate.

The order in which we “deal with” the violating FDs could lead to very different sets of relations!
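The decomposition step can be sketched recursively in Python. This is an illustration under stated assumptions, not a definitive implementation: `find_violation` and `bcnf_decompose` are hypothetical names, violations are searched by brute force over attribute subsets (exponential, fine for small schemas), and the search order is alphabetical, so a different order could yield a different result. The example schema is the course-schema used in the BCNF decomposition example.

```python
from itertools import combinations


def attribute_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute sets."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def find_violation(relation, fds):
    """Return (X, Y) with X -> Y violating BCNF inside `relation`, or None."""
    attrs = sorted(relation)
    for size in range(1, len(attrs)):
        for combo in combinations(attrs, size):
            x = set(combo)
            determined = attribute_closure(x, fds) & relation
            # Non-trivial (adds attributes) and X is not a superkey of `relation`.
            if determined != x and determined != relation:
                return x, determined - x
    return None


def bcnf_decompose(relation, fds):
    """Split `relation` into R - Y and XY until no violation remains."""
    relation = frozenset(relation)
    violation = find_violation(relation, fds)
    if violation is None:
        return [relation]
    x, y = violation
    return bcnf_decompose(relation - y, fds) + bcnf_decompose(x | y, fds)


COURSE = {"cid", "title", "cpoints", "workload", "lecturer", "sid", "grade"}
F = [
    ({"cid"}, {"title", "cpoints", "lecturer"}),
    ({"cpoints"}, {"workload"}),
    ({"cid", "sid"}, {"grade"}),
]
parts = bcnf_decompose(COURSE, F)
```

For this input the sketch ends with three relations: (cid, sid, grade), (cid, title, cpoints, lecturer) and (cpoints, workload).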

Example: BCNF Decomposition

Given the following schema:
  course-schema ( cid, title, cpoints, workload, lecturer, sid, grade )
  F = { cid → title cpoints lecturer, cpoints → workload, cid sid → grade }

  {cid}+ = { cid, title, cpoints, lecturer, workload }
  {cpoints}+ = { cpoints, workload }
  {cid, sid}+ = { cid, sid, title, cpoints, lecturer, workload, grade }

Candidate Key: { cid, sid }


Example: BCNF Decomposition (cont’d)

Course-schema is not in BCNF because, e.g., the first and second FDs violate BCNF.

Decompose Course-schema along cpoints → workload first!
  Workloads ( cpoints, workload )
  Course-schema’ ( cid, title, cpoints, lecturer, sid, grade )

Course-schema’ is still not in BCNF; further decompose along cid → title cpoints lecturer:
  Workloads ( cpoints, workload ) with F2 = { cpoints → workload }
  Courses ( cid, title, cpoints, lecturer ) with F1 = { cid → title cpoints lecturer }
  Enrolled ( cid, sid, grade ) with F3 = { cid sid → grade }

Is this decomposition lossless-join & dependency-preserving?


Example: BCNF Decomposition (cont’d)

If you decompose along cid → title cpoints lecturer first, there is a different schema!
  Courses ( cid, title, cpoints, lecturer ), with F1 = { cid → title cpoints lecturer }
  Workload_enroll ( cid, workload, sid, grade ), with F3 = { cid sid → grade }

Can we decompose any of the above schemas further? How about FD cpoints → workload?

Wrap-Up

Motivation: avoid redundancy
Functional Dependency
Normal Forms
  BCNF
Decomposition
