DATABASE DESIGN I - Uppsala · PDF fileDATABASE DESIGN I -1DL300 Spring 2011 ... Boyce-Codd...

Post on 13-Mar-2018

227 views 2 download

Transcript of DATABASE DESIGN I - Uppsala · PDF fileDATABASE DESIGN I -1DL300 Spring 2011 ... Boyce-Codd...

DATABASE DESIGN I - 1DL300

Spring 2011

An introductory course on database systems

http://www.it.uu.se/edu/course/homepage/dbastekn/vt11/

2011-02-16 1Manivsakan Sabesan - UDBL - IT - UU

Manivasakan SabesanUppsala Database Laboratory

Department of Information Technology, Uppsala University, Uppsala, Sweden

Introduction to Normalization

Elmasri/Navathe ch 14Padron-McCarthy/Risch ch 11

2011-02-16 2Manivasakan Sabesan - UDBL - IT - UU

Manivasakan Sabesan

Department of Information Technology

Uppsala University, Uppsala, Sweden

Normalization and problems in schema design

• Normalization - relational database design– Normalization of relational schemas or “we want to have good relations . . .”

– Functional dependencies

– Normal forms

• Problems in schema design

2011-02-16 3Manivsakan Sabesan - UDBL - IT - UU

• Problems in schema design – Unclear semantics

– Redundancy• Modification problems (updates, insertions deletions)

– Null values

– Spurious tuples

– Multi-valued dependencies

Clear Semantics of Relations

• Semantics of a relation – Meaning resulting from interpretation of attribute values in a tuple

• Easier to explain semantics of relation– Indicates better schema design

2011-02-16 4Manivsakan Sabesan - UDBL - IT - UU

– Indicates better schema design

Guideline 1

• Design relation schema so that it is easy to explain its meaning

• Do not combine attributes from multiple entity types and relationship types into a single relation

2011-02-16 5Manivsakan Sabesan - UDBL - IT - UU

Guideline 1 (cont’d.)

Ename Ssn Bdate Address Dnumber Dname Dmgr_ssn

EMP_DEPT

EMP_PROJ

2011-02-16 6Manivsakan Sabesan - UDBL - IT - UU

Ssn Pnumber Hours Ename Pname Plocation

EMP_PROJ

Example: design of a relational database

• To design a relational database we start from an E-R model and design a number of relational schemas as e.g.:

• emps(ename, salary, dept, dname, dept#, mgr)• supp_info(sname, iname, saddr, price)

2011-02-16 7Manivsakan Sabesan - UDBL - IT - UU

• supp_info(sname, iname, saddr, price)• items(iname, item#, dept)• customers(cname, addr, balance, o#, date)

• includes(o#, iname, quantity)

Ex. continues - bad design causes problems

• We have here chosen to combine the schemas for SUPPLIES and SUPPLIERS and exchange them with one schema that contain all information about suppliers:

SUPPLIERS(SNAME, SADDR)SUPPLIES(SNAME, INAME, PRICE)SUPP_INFO(SNAME, INAME, SADDR, PRICE)

• Due to this decision we will run into several problems with:

2011-02-16 8Manivsakan Sabesan - UDBL - IT - UU

• Due to this decision we will run into several problems with:– Unclear semantics– Redundancy– Modification anomalies

• Update anomalies• Insertion anomalies• Deletion anomalies

RedundancySUPP_INFO(SNAME, INAME, SADDR, PRICE)

• Redundancy:– When each supplier supplies several products, the address will be repeated for each

product.

2011-02-16 9Manivsakan Sabesan - UDBL - IT - UU

– Storing natural joins of base relations leads to update anomalies

Modification anomalies SUPP_INFO(SNAME, INAME, SADDR, PRICE)

• Update anomaly - inconsistency– When you have redundant data you risk, for example, when updating the address in the

table, to miss updating all occurrences of the redundant values (here several occurrences of address values for the same supplier).

– The result will be that we get inconsistencies among data in the database.

• Insertion anomaly– We can not insert the address for a supplier that do not currently supply any product.

2011-02-16 10Manivsakan Sabesan - UDBL - IT - UU

– We can not insert the address for a supplier that do not currently supply any product.• One can of course assign null values INAME and PRICE, but the one must remember to

remove these null values later when additional information exist. Hence, no tractable solution!

– Furthermore, sinceINAME is part of the key for this relation this is not a good solution. It is probably not even allowed to assign null values to key attributes.

• Deletion anomaly– If all tuples are removed, that corresponds to items that a specific supplier provide,

then all information about that supplier will disappear.

How to avoid these problems?

• In order to avoid all these problems, we divide the relation (table) in the following manner:

SUPPLIERS(SNAME, SADDR)SUPPLIES(SNAME, INAME, PRICE)

• Disadvantages: – To retrieve addresses for suppliers that provide bananas, we must perform an

2011-02-16 11Manivsakan Sabesan - UDBL - IT - UU

– To retrieve addresses for suppliers that provide bananas, we must perform an expensive join operation, SUPPLIERS * SUPPLIES, that we could get away with if we instead had the relation:SUPP_INFO(SNAME, INAME, SADDR, PRICE)

– The operations SELECT and PROJECT that is much cheaper will do the job in the latter case.

Guideline 2

• Design base relation schemas so that no update anomalies are present in the relations

• If any anomalies are present:– Note them clearly

– Make sure that the programs that update the database will operate correctly

2011-02-16 12Manivsakan Sabesan - UDBL - IT - UU

– Make sure that the programs that update the database will operate correctly

NULL Values in Tuples

• many attributes can end up with many NULLs in a tuple

• Problems with NULLs– Wasted storage space

– Problems understanding meaning

2011-02-16 13Manivsakan Sabesan - UDBL - IT - UU

– Problems understanding meaning

Guideline 3

• Avoid placing attributes in a base relation whose values may frequently be NULL

• If NULLs are unavoidable:– Make sure that they apply in exceptional cases only, not to a majority of tuples

2011-02-16 14Manivsakan Sabesan - UDBL - IT - UU

Ename Plocation

Smith, John Uppsala

Ssn Pnumber Hours Ename Pname Plocation

123456789 1 32.5 Smith, John SSPI Uppsala

123456789 2 7.5 Smith, John Vortex Luleå

453453453 1 20.0 English, Joyce SSPI Uppsala

EMP_PROJ

EMP_LOCS

Spurious Tuples

2011-02-16 15Manivsakan Sabesan - UDBL - IT - UU

Smith, John Uppsala

Smith, John Luleå

English, Joyce Uppsala

Ssn Pnumber Hours Pname Plocation

123456789 1 32.5 SSPI Uppsala

123456789 2 7.5 Vortex Luleå

453453453 1 20.0 SSPI Uppsala

EMP_PROJ1

EMP_LOCS

Ssn Pnumber Hours Pname Plocation

123456789 1 32.5 SSPI Uppsala

123456789 2 7.5 Vortex Luleå

453453453 1 20.0 SSPI Uppsala

EMP_PROJ1

Ename Plocation

Smith, John Uppsala

Smith, John Luleå

English, Joyce Uppsala

2011-02-16 16Manivsakan Sabesan - UDBL - IT - UU

Ssn Pnumber Hours Ename Pname Plocation

123456789 1 32.5 Smith, John SSPI Uppsala

453453453 1 20.0 Smith, John SSPI Uppsala

123456789 1 32.5 English, Joyce SSPI Uppsala

453453453 1 20.0 English, Joyce SSPI Uppsala

123456789 2 7.5 Smith, John Vortex Luleå

EMP_PROJ

453453453 1 20.0 SSPI Uppsala

Guideline 4

• Design relation schemas to be applied natural join that are appropriately related – Guarantees that no spurious tuples are generated

• Avoid relations that contain matching attributes that are not (foreign key, primary key) combinations

2011-02-16 17Manivsakan Sabesan - UDBL - IT - UU

key) combinations

Normalization

• What principles do we have to follow to systematically end up with a good schema design.– In 1972, Ted Codd defined a set of conditions that put various requirements on a

relation.

– To fulfill these conditions, the relation schema is in several steps divided into smaller schemas.

2011-02-16 18Manivsakan Sabesan - UDBL - IT - UU

smaller schemas.

• This process is called normalization.

• We need to study the following concepts for performing the normalization: – Functional dependencies

– Normal forms for relations

Functional dependency (FD)

• Let R be a relation schema with attributes A1, ..., An and let X and Y be subsets of {A1, ..., An}.

• Let r(R) determine a relation (instance) of the schema R.

We say that X functionally determines Y, and we write

X ⇒ Y

2011-02-16 19Manivsakan Sabesan - UDBL - IT - UU

• The attribute set Y is said to be dependent of the attribute set X.

• There is a functional dependencybetween X and Y.

X ⇒ Y

if for every pair of tuples t1, t2 ∈ r(R) and for all r(R) the following hold:

If t1[X] = t2[X] it holds that t1[Y] = t2[Y]

Functional dependency - properties

• If X is a candidate key for R we have:– X ⇒ Y for all Y ⊆ {A1, ..., An}.– X ∩ Y do not need to be the empty set.

– Assume that R represents an N:1 relation between the entity types E1 and E2.

– It is further assumed that X ⊆ {A1, ..., An} constitutes a key for E1 and Y constitutes a key for E2,

2011-02-16 20Manivsakan Sabesan - UDBL - IT - UU

– Then we have the dependency X ⇒ Y but not Y ⇒ X.

– If R would be a 1:1 relation, then it also holds that Y ⇒ X.

• A functional dependency is a property of R and not of any relation r(R). The semantic for the participating attributes in a relation decides if there exists a FD or not.

Armstrong’s axioms(inference rules)

• 1. If X ⊇ Y, then X → Y (reflexive rule )

• 2. If X → Y, then XZ → YZ (augmentation rule)

• 3. If X → Y and Y → Z, then X → Z (transitive rule)

2011-02-16 21Manivsakan Sabesan - UDBL - IT - UU

Additional rules:

• 4. If X → YZ, then X → Y (decomposition, or projection, rule)

• 5. If X → Y and X → Z, then X → YZ (union, or additive, rule)

• 6. If X → Y and WY → Z, then WX → Z (psuedotransitive rule)

Example

• The simplest form of FD is between the key attributes and the other attributes in a schema:

{SNAME} ⇒ {SADDR} SUPPLIERS{SNAME, INAME} ⇒ {PRICE} SUPPLIES{CNAME} ⇒ {CADDR, BALANCE} CUSTOMERS{SNAME} ⇒ {SNAME}

2011-02-16 22Manivsakan Sabesan - UDBL - IT - UU

{SNAME} ⇒ {SNAME}

• A slightly more complex FD:{SNAME, INAME} ⇒ {SADDR, PRICE}

– This dependency is the combination of the first two above. We are able to combine these two because we understand the concepts supplier, item, address and price and the relationships among them.

Full functional dependency - FFD

• A functional dependency X ⇒ Y is termed FFD if there is no attribute A ∈ X such that (X- {A}) ⇒ Y holds.

• In other words, a FFD is a dependency that do not contain any unnecessary attributes in its determinant (i.e. the left-hand side of the dependency).

SUPP_INFO(SNAME, INAME, SADDR, PRICE)

2011-02-16 23Manivsakan Sabesan - UDBL - IT - UU

{SNAME, INAME} ⇒ {PRICE} FFD{SNAME, INAME} ⇒ {SADDR} Not FFD{SNAME} ⇒ {SADDR} FFDcompare: f(x,y) = 3x + 2 Not FFD

f(x) = 3x + 2 FFD

Prime attribute

• Definition: an attribute that is a member in any of the candidate keys is called a prime attribute of R.

• For example:– In the schema R(CITY,STREET,ZIPCODE) all attributes are prime attribute,

since we have the dependencies CITY,STREET⇒ ZIPCODE and ZIPCODE⇒CITY and both CITY,STREET and STREET,ZIPCODE are keys.

2011-02-16 24Manivsakan Sabesan - UDBL - IT - UU

CITY and both CITY,STREET and STREET,ZIPCODE are keys.

– In the schema R(A,B,C,D) with the dependencies AB ⇒ C, B ⇒ D and BC ⇒A, are both AB and BC candidate keys. Therefore are A, B and C prime attribute but not D.

• Attributes that is not part of any candidate key is called non-prime or non-key attributes.

First normal form - 1NF

• Only atomic values are allowed as attribute values in the relational model.– The relations that fulfill this condition are said to be in 1NF

Ssn Ename Pnumber Hours

1234 Smith 1 12

2011-02-16 25Manivsakan Sabesan - UDBL - IT - UU

1234 Smith 1 12

2 7

4534 Wong 3 40

2 26

Second normal form - 2NF

• A relation schema R is in 2NF if:– It is in 1NF

– Every non-key attribute A in R is FFD of each candidate key in R.

2011-02-16 26Manivsakan Sabesan - UDBL - IT - UU

Third normal form - 3NF

• A relation schema R is in 3NF if:– It is in 2NF

– No non-key attribute A in R is allowed to be FFD of any other non-key attribute.

2011-02-16 27Manivsakan Sabesan - UDBL - IT - UU

Example

2011-02-16 28Manivsakan Sabesan - UDBL - IT - UU

Example

2011-02-16 29Manivsakan Sabesan - UDBL - IT - UU

Boyce -Codd normal form - BCNF

• A relation schema R is in BCNF if:– It is in 1NF

– Every determinant X is a candidate key.

• The difference between BCNF and 3NF is that in BCNF, a prime attribute can not be FFD of a non-key (or non-prime) attribute or of a prime attribute (i.e. a partial

2011-02-16 30Manivsakan Sabesan - UDBL - IT - UU

key). Therefore BCNF is a more strict condition.

Example_BCNF

2011-02-16 31Manivsakan Sabesan - UDBL - IT - UU

Additional normal forms

• There are a number of other normal forms available

• These explores additional data dependencies such as:– multi-valued dependencies (4NF)

– join dependencies (5NF)

2011-02-16 32Manivsakan Sabesan - UDBL - IT - UU

Multi-valued dependencies

• Consider the relation classes(course, teacher, book)

course

databasedatabasedatabase

teacher

ToreTore

Martin

book

ElmasriUllmanElmasri

2011-02-16 33Manivsakan Sabesan - UDBL - IT - UU

• The book attribute specifies required literature independent of teacher.

• The relation fulfills BCNF, but still every teacher must be supplied two times.

database database

operating systemoperating system

Martin Martin MartinMartin

ElmasriUllman

SilberschatzShaw

Lossless join

• We decompose a schema R with functional dependencies D into schemas R1 and R2

• R1 and R2 is called a lossless-join decomposition (in relation to D) if R1 * R2 contain the same information as R.– no spurious tuples are generated when we perform the join between the decomposed

relations.

2011-02-16 34Manivsakan Sabesan - UDBL - IT - UU

relations.

– no information is lost when the relation is decomposed.

3NF/BCNF

• BCNF is a strong condition. It is not always possible to transform (through decomposition) a schema to BCNF and keep the dependencies.

• 3NF has the most of BCNF's advantages and can still be fulfilled without giving up dependencies or the lossless-join property for the relation schema.

• Observe that 3NF admits some sort of redundance that BCNF do not allow. BCNF

2011-02-16 35Manivsakan Sabesan - UDBL - IT - UU

• Observe that 3NF admits some sort of redundance that BCNF do not allow. BCNF requires that one eliminates redundance caused by functional dependencies.

Normal forms and normalization• We assume we have an enterprise that buys products from different supplying

companies, and we would like to keep track of our data by means of a database.

• We would like to keep track of what kind of products (e.g. cars) we buy, and from which supplier (e.g. Volvo) we buy them. We can buy each product from several suppliers.

• Further, we want to know the price of the products. The price of a product is naturally dependent on which supplier we bought it from.

2011-02-16 36Manivsakan Sabesan - UDBL - IT - UU

• We would also like to store the address of each supplier, i.e. the city where the supplier is located. Here, we assume that each supplier is only located at one place.

• Perhaps we suddenly need to order a large number of cars from Volvo. It can then be good to know how many people that live in the city where Volvo is located. Thus, we can know if Volvo can employ a sufficient amount of people to produce the cars. Hence we also store the population for each city.