Databases 2

Databases 2 Storage optimization: functional dependencies, normal forms, data warehouses


Transcript of Databases 2

Page 1: Databases  2

Databases 2

Storage optimization: functional dependencies, normal forms, data warehouses

Page 2: Databases  2

Functional Dependencies

• X -> A is an assertion about a relation R that whenever two tuples of R agree on all the attributes of X, then they must also agree on the attribute A.
  – Say “X -> A holds in R.”
  – Notice the convention: …, X, Y, Z represent sets of attributes; A, B, C, … represent single attributes.
  – Convention: no set formers in sets of attributes, just ABC, rather than {A,B,C}.

Page 3: Databases  2

Example

• Drinkers(name, addr, beersLiked, manf, favBeer).

• Reasonable FD’s to assert:
  1. name -> addr (the name determines the address)
  2. name -> favBeer (the name determines the favourite beer)
  3. beersLiked -> manf (every beer has only one manufacturer)

Page 4: Databases  2

Example

name     addr        beersLiked  manf    favBeer
Janeway  Voyager     Bud         A.B.    WickedAle
Janeway  Voyager     WickedAle   Pete’s  WickedAle
Spock    Enterprise  Bud         A.B.    Bud

name -> addr
name -> favBeer
beersLiked -> manf
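The definition of an FD can be checked mechanically against a single instance. The sketch below is illustrative only: the helper name fd_holds and the dictionary encoding of the tuples are assumptions, not part of the slides. Passing the check on one instance does not prove that the FD holds in every instance of the relation; failing it does show a violation.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """Return True if every pair of rows that agrees on lhs also agrees on rhs."""
    for r1, r2 in combinations(rows, 2):
        agree_left = all(r1[a] == r2[a] for a in lhs)
        differ_right = any(r1[a] != r2[a] for a in rhs)
        if agree_left and differ_right:
            return False
    return True

# The three Drinkers tuples from the slide.
drinkers = [
    {"name": "Janeway", "addr": "Voyager", "beersLiked": "Bud",
     "manf": "A.B.", "favBeer": "WickedAle"},
    {"name": "Janeway", "addr": "Voyager", "beersLiked": "WickedAle",
     "manf": "Pete's", "favBeer": "WickedAle"},
    {"name": "Spock", "addr": "Enterprise", "beersLiked": "Bud",
     "manf": "A.B.", "favBeer": "Bud"},
]

print(fd_holds(drinkers, ["name"], ["addr"]))        # True
print(fd_holds(drinkers, ["beersLiked"], ["manf"]))  # True
print(fd_holds(drinkers, ["name"], ["manf"]))        # False
```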

Page 5: Databases  2

FD’s With Multiple Attributes

• No need for FD’s with more than one attribute on right.
  – But sometimes convenient to combine FD’s as a shorthand.
  – Example: name -> addr and name -> favBeer become name -> addr favBeer
• More than one attribute on left may be essential.
  – Example: bar beer -> price

Page 6: Databases  2

Keys of Relations

• K is a key for relation R if:
  1. Set K functionally determines all attributes of R.
  2. For no proper subset of K is (1) true.

If K satisfies (1), but perhaps not (2), then K is a superkey.

Consequence: there are no two tuples having the same value in every attribute of the key.

Note E/R keys have no requirement for minimality, as in (2) for relational keys.

Page 7: Databases  2

Example

• Consider relation Drinkers(name, addr, beersLiked, manf, favBeer).

• {name, beersLiked} is a superkey because together these attributes determine all the other attributes.
  – name -> addr favBeer
  – beersLiked -> manf

Page 8: Databases  2

Example

name     addr        beersLiked  manf    favBeer
Janeway  Voyager     Bud         A.B.    WickedAle
Janeway  Voyager     WickedAle   Pete’s  WickedAle
Spock    Enterprise  Bud         A.B.    Bud

Every (name, beersLiked) pair is different => rest of the attributes are determined

Page 9: Databases  2

Example, Cont.

• {name, beersLiked} is a key because neither {name} nor {beersLiked} is a superkey.
  – name doesn’t -> manf; beersLiked doesn’t -> addr.
• In this example, there are no other keys, but lots of superkeys.
  – Any superset of {name, beersLiked}.

Page 10: Databases  2

Example

name     addr        beersLiked  manf    favBeer
Janeway  Voyager     Bud         A.B.    WickedAle
Janeway  Voyager     WickedAle   Pete’s  WickedAle
Spock    Enterprise  Bud         A.B.    Bud

name doesn’t -> manf
beersLiked doesn’t -> addr

Page 11: Databases  2

E/R and Relational Keys

• Keys in E/R are properties of entities.
• Keys in relations are properties of tuples.
• Usually, one tuple corresponds to one entity, so the ideas are the same.
• But in poor relational designs, one entity can become several tuples, so E/R keys and relational keys are different.

Page 12: Databases  2

Example

name     addr        beersLiked  manf    favBeer
Janeway  Voyager     Bud         A.B.    WickedAle
Janeway  Voyager     WickedAle   Pete’s  WickedAle
Spock    Enterprise  Bud         A.B.    Bud

Drinkers
name     addr
Janeway  Voyager
Spock    Enterprise

Beers
beer       manf
Bud        A.B.
WickedAle  Pete’s

{name, beersLiked} is the relational key.

In E/R, name is a key for Drinkers, and beersLiked is a key for Beers.

Page 13: Databases  2

Where Do Keys Come From?

1. We could simply assert a key K. Then the only FD’s are K -> A for all attributes A, and K turns out to be the only key obtainable from the FD’s.

2. We could assert FD’s and deduce the keys by systematic exploration.

E/R gives us FD’s from entity-set keys and many-one relationships.

13

Page 14: Databases  2

Armstrong’s axioms

• Let X, Y, Z ⊆ R (i.e. X, Y, Z are attribute sets of R)

Armstrong’s axioms:
• Reflexivity: if Y ⊆ X then X -> Y
• Augmentation: if X -> Y then for every Z: XZ -> YZ
• Transitivity: if X -> Y and Y -> Z then X -> Z

Page 15: Databases  2

FD’s From “Physics”

• While most FD’s come from E/R keyness and many-one relationships, some are really physical laws.

• Example: “no two courses can meet in the same room at the same time” tells us: hour room -> course.

15

Page 16: Databases  2

Inferring FD’s: Motivation

• In order to design relation schemas well, we often need to tell what FD’s hold in a relation.

• We are given FD’s X1 -> A1, X2 -> A2, …, Xn -> An, and we want to know whether an FD Y -> B must hold in any relation that satisfies the given FD’s.
  – Example: If A -> B and B -> C hold, surely A -> C holds, even if we don’t say so.

Page 17: Databases  2

Inference Test

• To test if Y -> B, start assuming two tuples agree in all attributes of Y.

• Use the given FD’s to infer that these tuples must also agree in certain other attributes.

• If B is eventually found to be one of these attributes, then Y -> B is true; otherwise, the two tuples, with any forced equalities, form a two-tuple relation that proves Y -> B does not follow from the given FD’s.

17

Page 18: Databases  2

Closure Test

• An easier way to test is to compute the closure of Y, denoted Y +.

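• Basis: Y + = Y.
• Induction: Look for an FD’s left side X that is a subset of the current Y +. If the FD is X -> A, add A to Y +.
• Stop when no FD adds a new attribute. Then Y -> B follows from the given FD’s exactly when B is in Y +.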
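The closure test translates almost directly into code. The following is a minimal sketch, assuming FD’s are encoded as (left side, right side) pairs of Python sets; the function names closure and follows are illustrative choices, not notation from the slides.

```python
def closure(attrs, fds):
    """Compute the closure of the attribute set attrs under the FD's in fds.

    fds is a list of (lhs, rhs) pairs, each a set of attribute names.
    """
    result = set(attrs)          # Basis: Y+ = Y
    changed = True
    while changed:               # Induction: repeat until nothing new is added
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def follows(fds, lhs, rhs):
    """Return True if lhs -> rhs follows from fds (rhs is inside lhs+)."""
    return set(rhs) <= closure(lhs, fds)

# FD's AB -> C, C -> D, D -> A over the schema ABCD.
fds = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"D"}, {"A"})]

print(sorted(closure({"C"}, fds)))   # ['A', 'C', 'D']
print(follows(fds, {"C"}, {"A"}))    # True
print(follows(fds, {"A"}, {"B"}))    # False
```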

Page 19: Databases  2

[Figure: an FD X -> A whose left side X lies inside the current Y +; adding A gives the new Y +.]

Page 20: Databases  2

Finding All Implied FD’s

• Motivation: “normalization,” the process where we break a relation schema into two or more schemas.

• Example: ABCD with FD’s AB -> C, C -> D, and D -> A.
  – Decompose into ABC, AD. What FD’s hold in ABC?
  – Not only AB -> C, but also C -> A!

Page 21: Databases  2

Basic Idea

• To know what FD’s hold in a projection, we start with given FD’s and find all FD’s that follow from given ones.

• Then, restrict to those FD’s that involve only attributes of the projected schema.

21

Page 22: Databases  2

Simple, Exponential Algorithm

1. For each set of attributes X, compute X +.
2. Add X -> A for all A in X + - X.
3. However, drop XY -> A whenever we discover X -> A.
   – Because XY -> A follows from X -> A.
4. Finally, use only FD’s involving projected attributes.

Page 23: Databases  2

A Few Tricks

• Never need to compute the closure of the empty set or of the set of all attributes:
  – ∅ + = ∅
  – R + = R
• If we find X + = all attributes, don’t bother computing the closure of any supersets of X:
  – X + = R and X ⊆ Y => Y + = R

Page 24: Databases  2

Example

• ABC with FD’s A -> B and B -> C. Project onto AC.
  – A + = ABC; yields A -> B, A -> C.
    • We do not need to compute AB + or AC +.
  – B + = BC; yields B -> C.
  – C + = C; yields nothing.
  – BC + = BC; yields nothing.

Page 25: Databases  2

Example, Continued

• Resulting FD’s: A -> B, A -> C, and B -> C.
• Projection onto AC: A -> C.
  – Only FD that involves a subset of {A,C}.
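The projection of FD’s can be sketched in code as follows. This is an illustration under stated assumptions, not the slides’ exact procedure: it enumerates only left sides drawn from the projected attributes (a shortcut, since any projected FD must have its left side there), and the names closure and project_fds are hypothetical.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def project_fds(fds, target):
    """Return FD's (lhs, attribute) that hold on target, with minimal left sides."""
    target = set(target)
    found = []
    # Smaller left sides are tried first, so X -> A is seen before XY -> A.
    for size in range(1, len(target) + 1):
        for combo in combinations(sorted(target), size):
            lhs = frozenset(combo)
            for a in sorted((closure(lhs, fds) & target) - lhs):
                # Drop XY -> A whenever X -> A was already discovered.
                if not any(x <= lhs and b == a for x, b in found):
                    found.append((lhs, a))
    return found

# ABC with A -> B and B -> C, projected onto AC.
print(project_fds([({"A"}, {"B"}), ({"B"}, {"C"})], "AC"))
```

On the ABCD example with AB -> C, C -> D, and D -> A, calling project_fds with target "ABC" returns both AB -> C and C -> A, matching the earlier slide.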

Page 26: Databases  2

A Geometric View of FD’s

• Imagine the set of all instances of a particular relation.

• That is, all finite sets of tuples that have the proper number of components.

• Each instance is a point in this space.

26

Page 27: Databases  2

Example: R(A,B)

[Figure: the space of instances of R(A,B), with sample points {(1,2), (3,4)}, {}, {(1,2), (3,4), (1,3)}, and {(5,1)}.]

Page 28: Databases  2

An FD is a Subset of Instances

• For each FD X -> A there is a subset of all instances that satisfy the FD.

• We can represent an FD by a region in the space.

• Trivial FD : an FD that is represented by the entire space.

– Example: A -> A.

28

Page 29: Databases  2

Example: A -> B for R(A,B)

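[Figure: the same four instances of R(A,B), with a region labelled A -> B. The instances {(1,2), (3,4)}, {}, and {(5,1)} satisfy A -> B; the instance {(1,2), (3,4), (1,3)} does not, because two of its tuples agree on A but differ on B.]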

Page 30: Databases  2

Representing Sets of FD’s

• If each FD is a set of relation instances, then a collection of FD’s corresponds to the intersection of those sets.
  – Intersection = all instances that satisfy all of the FD’s.

Page 31: Databases  2

Example

[Figure: three overlapping regions labelled A -> B, B -> C, and CD -> A. Their intersection is the set of instances satisfying A -> B, B -> C, and CD -> A.]

Page 32: Databases  2

Implication of FD’s

• If an FD Y -> B follows from FD’s X1 -> A1, …, Xn -> An, then the region in the space of instances for Y -> B must include the intersection of the regions for the FD’s Xi -> Ai.
  – That is, every instance satisfying all the FD’s Xi -> Ai surely satisfies Y -> B.
  – But an instance could satisfy Y -> B, yet not be in this intersection.

Page 33: Databases  2

Example

[Figure: regions labelled A -> B and B -> C; their intersection lies inside the region labelled A -> C.]

Page 34: Databases  2

Normalization: Anomalies

• Goal of relational schema design is to avoid anomalies and redundancy.
  – Update anomaly: one occurrence of a fact is changed, but not all occurrences.
  – Deletion anomaly: valid fact is lost when a tuple is deleted.

Page 35: Databases  2

Example of Bad Design

Drinkers(name, addr, beersLiked, manf, favBeer)

name     addr        beersLiked  manf    favBeer
Janeway  Voyager     Bud         A.B.    WickedAle
Janeway  ???         WickedAle   Pete’s  ???
Spock    Enterprise  Bud         ???     Bud

Data is redundant, because each of the ???’s can be figured out by using the FD’s name -> addr favBeer and beersLiked -> manf.

Page 36: Databases  2

This Bad Design Also Exhibits Anomalies

name     addr        beersLiked  manf    favBeer
Janeway  Voyager     Bud         A.B.    WickedAle
Janeway  Voyager     WickedAle   Pete’s  WickedAle
Spock    Enterprise  Bud         A.B.    Bud

• Update anomaly: if Janeway is transferred to Intrepid, will we remember to change each of her tuples?
• Deletion anomaly: if nobody likes Bud, we lose track of the fact that Anheuser-Busch manufactures Bud.

Page 37: Databases  2

Boyce-Codd Normal Form

• We say a relation R is in BCNF if whenever X -> A is a nontrivial FD that holds in R, X is a superkey.
  – Remember: nontrivial means A is not a member of set X.
  – Remember, a superkey is any superset of a key (not necessarily a proper superset).

Page 38: Databases  2

Example

• Drinkers(name, addr, beersLiked, manf, favBeer)
• FD’s: name -> addr favBeer, beersLiked -> manf
• Only key is {name, beersLiked}.
• In each FD, the left side is not a superkey.
• Any one of these FD’s shows Drinkers is not in BCNF.

Page 39: Databases  2

Another Example

• Beers(name, manf, manfAddr)
• FD’s: name -> manf, manf -> manfAddr
• Only key is {name}.
• name -> manf does not violate BCNF, but manf -> manfAddr does.

Page 40: Databases  2

Decomposition into BCNF

• Given: relation R with FD’s F.
• Look among the given FD’s for a BCNF violation X -> B.
  – If any FD following from F violates BCNF, then there will surely be an FD in F itself that violates BCNF.
• Compute X +.
  – Not all attributes, or else X is a superkey.

Page 41: Databases  2

Decompose R Using X -> B

• Replace R by relations with schemas:
  1. R1 = X +.
  2. R2 = (R – X +) ∪ X.
• Project given FD’s F onto the two new relations.
  1. Compute the closure of F = all nontrivial FD’s that follow from F.
  2. Use only those FD’s whose attributes are all in R1 or all in R2.

Page 42: Databases  2

Decomposition Picture

[Figure: R drawn as three adjacent regions R - X +, X, and X + - X. R2 spans R - X + and X; R1 spans X and X + - X.]

Page 43: Databases  2

Example

• Drinkers(name, addr, beersLiked, manf, favBeer)
• F = name -> addr, name -> favBeer, beersLiked -> manf
• Pick BCNF violation name -> addr.
• Close the left side: {name}+ = {name, addr, favBeer}.
• Decomposed relations:
  1. Drinkers1(name, addr, favBeer)
  2. Drinkers2(name, beersLiked, manf)

Page 44: Databases  2

44

Example, Continued

• We are not done; we need to check Drinkers1 and Drinkers2 for BCNF.

• Projecting FD’s is complex in general, easy here.

• For Drinkers1(name, addr, favBeer), relevant FD’s are name->addr and name->favBeer.– Thus, name is the only key and Drinkers1 is in BCNF.

Page 45: Databases  2

Example, Continued

• For Drinkers2(name, beersLiked, manf), the only FD is beersLiked -> manf, and the only key is {name, beersLiked}.
  – Violation of BCNF.
• beersLiked+ = {beersLiked, manf}, so we decompose Drinkers2 into:
  1. Drinkers3(beersLiked, manf)
  2. Drinkers4(name, beersLiked)

Page 46: Databases  2

Example, Concluded

• The resulting decomposition of Drinkers:
  1. Drinkers1(name, addr, favBeer)
  2. Drinkers3(beersLiked, manf)
  3. Drinkers4(name, beersLiked)

Notice: Drinkers1 tells us about drinkers, Drinkers3 tells us about beers, and Drinkers4 tells us the relationship between drinkers and the beers they like.
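The whole decomposition can be sketched as a short recursive procedure. This is an illustration, not the slides’ exact method: instead of projecting FD’s explicitly, the hypothetical helper find_violation tests every candidate left side inside the sub-relation by intersecting its closure with that sub-relation, which is exponential in the number of attributes. It may pick violations in a different order than the worked example, yet it arrives at the same three relations here.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def find_violation(rel, fds):
    """Return a left side X inside rel that violates BCNF in rel, or None."""
    for size in range(1, len(rel)):
        for combo in combinations(sorted(rel), size):
            x = set(combo)
            cx = closure(x, fds) & rel
            # Nontrivial (X determines something new) and X is not a superkey.
            if cx != x and cx != rel:
                return x
    return None

def bcnf_decompose(rel, fds):
    """Split rel into BCNF relations using R1 = X+ and R2 = (R - X+) U X."""
    rel = set(rel)
    x = find_violation(rel, fds)
    if x is None:
        return [rel]
    r1 = closure(x, fds) & rel
    r2 = (rel - r1) | x
    return bcnf_decompose(r1, fds) + bcnf_decompose(r2, fds)

drinkers = {"name", "addr", "beersLiked", "manf", "favBeer"}
fds = [({"name"}, {"addr"}), ({"name"}, {"favBeer"}), ({"beersLiked"}, {"manf"})]
for schema in bcnf_decompose(drinkers, fds):
    print(sorted(schema))
```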

Page 47: Databases  2

Third Normal Form - Motivation

• There is one structure of FD’s that causes trouble when we decompose.
• AB -> C and C -> B.
  – Example: A = street address, B = city, C = zip code.
• There are two keys, {A,B} and {A,C}.
• C -> B is a BCNF violation, so we must decompose into AC, BC.

Page 48: Databases  2

48

We Cannot Enforce FD’s

• The problem is that if we use AC and BC as our database schema, we cannot enforce the FD AB ->C by checking FD’s in these decomposed relations.

• Example with A = street, B = city, and C = zip on the next slide.

Page 49: Databases  2

An Unenforceable FD

street        zip
545 Tech Sq.  02138
545 Tech Sq.  02139

city       zip
Cambridge  02138
Cambridge  02139

Join tuples with equal zip codes.

street        city       zip
545 Tech Sq.  Cambridge  02138
545 Tech Sq.  Cambridge  02139

Although no FD’s were violated in the decomposed relations, FD street city -> zip is violated by the database as a whole.
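The same situation can be reproduced in a few lines. In this sketch the tuple encoding and the helper fd_holds (which takes column positions) are assumptions made for illustration; the data are the tuples from the slide.

```python
def fd_holds(rows, lhs, rhs):
    """Check an FD on a set of tuples; lhs and rhs are lists of column positions."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        val = tuple(row[i] for i in rhs)
        # setdefault stores the first right side seen for this left side.
        if seen.setdefault(key, val) != val:
            return False
    return True

street_zip = {("545 Tech Sq.", "02138"), ("545 Tech Sq.", "02139")}
city_zip = {("Cambridge", "02138"), ("Cambridge", "02139")}

# Join tuples with equal zip codes into (street, city, zip).
joined = {(s, c, z) for (s, z) in street_zip for (c, z2) in city_zip if z == z2}

print(fd_holds(city_zip, [1], [0]))   # zip -> city holds in the BC relation: True
print(fd_holds(joined, [0, 1], [2]))  # street city -> zip fails after the join: False
```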

Page 50: Databases  2

50

3NF Lets Us Avoid This Problem

• 3rd Normal Form (3NF) modifies the BCNF condition so we do not have to decompose in this problem situation.

• An attribute is prime if it is a member of any key.

• X ->A violates 3NF if and only if X is not a superkey, and also A is not prime.

Page 51: Databases  2

Example

• In our problem situation with FD’s AB -> C and C -> B, we have keys AB and AC.
• Thus A, B, and C are each prime.
• Although C -> B violates BCNF, it does not violate 3NF.
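The difference between the two conditions can be checked mechanically. The sketch below finds all keys by brute force (exponential, fine for a three-attribute example) and then applies the BCNF and 3NF tests to a single FD; the function names keys, violates_bcnf, and violates_3nf are illustrative assumptions.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def keys(rel, fds):
    """Return all minimal attribute sets whose closure covers rel."""
    rel = set(rel)
    found = []
    for size in range(1, len(rel) + 1):
        for combo in combinations(sorted(rel), size):
            k = set(combo)
            # Skip supersets of keys already found: they are superkeys, not keys.
            if closure(k, fds) >= rel and not any(f <= k for f in found):
                found.append(k)
    return found

def violates_bcnf(lhs, rel, fds):
    """A nontrivial FD lhs -> A violates BCNF when lhs is not a superkey."""
    return not closure(lhs, fds) >= set(rel)

def violates_3nf(lhs, a, rel, fds):
    """lhs -> a violates 3NF when lhs is not a superkey and a is not prime."""
    prime = set().union(*keys(rel, fds))
    return violates_bcnf(lhs, rel, fds) and a not in prime

fds = [({"A", "B"}, {"C"}), ({"C"}, {"B"})]
print([sorted(k) for k in keys("ABC", fds)])   # [['A', 'B'], ['A', 'C']]
print(violates_bcnf({"C"}, "ABC", fds))        # True
print(violates_3nf({"C"}, "B", "ABC", fds))    # False
```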

Page 52: Databases  2

52

What 3NF and BCNF Give You

• There are two important properties of a decomposition:

1. Recovery : it should be possible to project the original relations onto the decomposed schema, and then reconstruct the original.

2. Dependency preservation : it should be possible to check in the projected relations whether all the given FD’s are satisfied.

Page 53: Databases  2

53

3NF and BCNF, Continued

• We can get (1) with a BCNF decomposition.
  – Explanation needs to wait for relational algebra.
• We can get both (1) and (2) with a 3NF decomposition.
• But we can’t always get (1) and (2) with a BCNF decomposition.
  – street-city-zip is an example.

Page 54: Databases  2

54

A New Form of Redundancy

• Multivalued dependencies (MVD’s) express a condition among tuples of a relation that exists when the relation is trying to represent more than one many-many relationship.

• Then certain attributes become independent of one another, and their values must appear in all combinations.

Page 55: Databases  2

55

Example

Drinkers(name, addr, phones, beersLiked)

• A drinker’s phones are independent of the beers they like.
• Thus, each of a drinker’s phones appears with each of the beers they like in all combinations.
• This repetition is unlike redundancy due to FD’s, of which name -> addr is the only one.

Page 56: Databases  2

56

Tuples Implied by Independence

If we have tuples:

name  addr  phones  beersLiked
Sue   a     p1      b1
Sue   a     p2      b2

Then these tuples must also be in the relation:

name  addr  phones  beersLiked
Sue   a     p1      b2
Sue   a     p2      b1

Page 57: Databases  2

57

Definition of MVD

• A multivalued dependency (MVD) X ->->Y is an assertion that if two tuples of a relation agree on all the attributes of X, then their components in the set of attributes Y may be swapped, and the result will be two tuples that are also in the relation.

Page 58: Databases  2

58

Example

• The name-addr-phones-beersLiked example illustrated the MVD name ->-> phones and the MVD name ->-> beersLiked.

Page 59: Databases  2

59

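Picture of MVD X ->-> Y

[Figure: two tuples drawn over the columns X, Y, and others. Their X components are equal, and their Y components are exchanged.]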

Page 60: Databases  2

60

MVD Rules

• Every FD is an MVD.
  – If X -> Y is an FD, then swapping Y’s between two tuples that agree on X doesn’t change the tuples.
  – Therefore, the “new” tuples are surely in the relation, and we know X ->-> Y.
• Complementation: If X ->-> Y, and Z is all the other attributes, then X ->-> Z.

Page 61: Databases  2

61

Splitting Doesn’t Hold

• Like FD’s, we cannot generally split the left side of an MVD.

• But unlike FD’s, we cannot split the right side either --- sometimes you have to leave several attributes on the right side.

Page 62: Databases  2

62

Example

• Consider a drinkers relation:
  Drinkers(name, areaCode, phone, beersLiked, manf)
• A drinker can have several phones, with the number divided between areaCode and phone (last 7 digits).
• A drinker can like several beers, each with its own manufacturer.

Page 63: Databases  2

63

Example, Continued

• Since the areaCode-phone combinations for a drinker are independent of the beersLiked-manf combinations, we expect that the following MVD’s hold:

name ->-> areaCode phone
name ->-> beersLiked manf

Page 64: Databases  2

64

Example Data

Here is possible data satisfying these MVD’s:

name  areaCode  phone     beersLiked  manf
Sue   650       555-1111  Bud         A.B.
Sue   650       555-1111  WickedAle   Pete’s
Sue   415       555-9999  Bud         A.B.
Sue   415       555-9999  WickedAle   Pete’s

But we cannot swap area codes or phones by themselves. That is, neither name ->-> areaCode nor name ->-> phone holds for this relation.
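The swap definition of an MVD can be tested directly on this instance. The following is a minimal sketch; the function name mvd_holds and the tuple encoding are assumptions made for illustration. For every pair of tuples that agree on X, it builds the tuple that takes its Y components from one and everything else from the other, and checks that this tuple is present.

```python
def mvd_holds(rows, attrs, x, y):
    """Check X ->-> Y on a relation instance.

    rows: iterable of tuples; attrs: attribute names in column order;
    x, y: sets of attribute names.
    """
    idx = {a: i for i, a in enumerate(attrs)}
    rows = set(rows)
    for t in rows:
        for u in rows:
            if all(t[idx[a]] == u[idx[a]] for a in x):
                # Take the Y components from u and all other components from t.
                swapped = tuple(u[i] if a in y else t[i] for i, a in enumerate(attrs))
                if swapped not in rows:
                    return False
    return True

attrs = ("name", "areaCode", "phone", "beersLiked", "manf")
rows = [
    ("Sue", "650", "555-1111", "Bud", "A.B."),
    ("Sue", "650", "555-1111", "WickedAle", "Pete's"),
    ("Sue", "415", "555-9999", "Bud", "A.B."),
    ("Sue", "415", "555-9999", "WickedAle", "Pete's"),
]

print(mvd_holds(rows, attrs, {"name"}, {"areaCode", "phone"}))   # True
print(mvd_holds(rows, attrs, {"name"}, {"areaCode"}))            # False
```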

Page 65: Databases  2

65

Fourth Normal Form

• The redundancy that comes from MVD’s is not removable by putting the database schema in BCNF.

• There is a stronger normal form, called 4NF, that (intuitively) treats MVD’s as FD’s when it comes to decomposition, but not when determining keys of the relation.

Page 66: Databases  2

66

4NF Definition

• A relation R is in 4NF if whenever X ->->Y is a nontrivial MVD, then X is a superkey.

– “Nontrivial” means that:
  1. Y is not a subset of X, and
  2. X and Y are not, together, all the attributes.

– Note that the definition of “superkey” still depends on FD’s only.

Page 67: Databases  2

67

BCNF Versus 4NF

• Remember that every FD X ->Y is also an MVD, X ->->Y.

• Thus, if R is in 4NF, it is certainly in BCNF.
  – Because any BCNF violation is a 4NF violation.

• But R could be in BCNF and not 4NF, because MVD’s are “invisible” to BCNF.

Page 68: Databases  2

68

Decomposition and 4NF

• If X ->->Y is a 4NF violation for relation R, we can decompose R using the same technique as for BCNF.

1. XY is one of the decomposed relations.
2. All but Y – X is the other.

Page 69: Databases  2

69

Example

Drinkers(name, addr, phones, beersLiked)

FD: name -> addr
MVD’s: name ->-> phones
       name ->-> beersLiked

• Key is {name, phones, beersLiked}.
• All dependencies violate 4NF.

Page 70: Databases  2

70

Example, Continued

• Decompose using name -> addr:
  1. Drinkers1(name, addr)
     – In 4NF, only dependency is name -> addr.
  2. Drinkers2(name, phones, beersLiked)
     – Not in 4NF. MVD’s name ->-> phones and name ->-> beersLiked apply. No FD’s, so all three attributes form the key.

Page 71: Databases  2

Example: Decompose Drinkers2

• Either MVD name ->-> phones or name ->-> beersLiked tells us to decompose to:
  – Drinkers3(name, phones)
  – Drinkers4(name, beersLiked)

Page 72: Databases  2

On-Line Analytical Processing

• Warehousing
• Data Cubes
• Data Mining

Page 73: Databases  2

73

Overview

• Traditional database systems are tuned to many, small, simple queries.

• Some new applications use fewer, more time-consuming, complex queries.

• New architectures have been developed to handle complex “analytic” queries efficiently.

Page 74: Databases  2

74

The Data Warehouse

• The most common form of data integration.
  – Copy sources into a single DB (warehouse) and try to keep it up-to-date.
  – Usual method: periodic reconstruction of the warehouse, perhaps overnight.
  – Frequently essential for analytic queries.

Page 75: Databases  2

75

OLTP

• Most database operations involve On-Line Transaction Processing (OLTP).
  – Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
  – Examples:
    • Answering queries from a Web interface
    • Sales at cash registers
    • Selling airline tickets

Page 76: Databases  2

76

OLAP

• Of increasing importance are On-Line Analytical Processing (OLAP) queries.
  – Few, but complex queries, which may run for hours.
  – Queries do not depend on having an absolutely up-to-date database.

• Sometimes called Data Mining.

Page 77: Databases  2

77

OLAP Examples

1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.

2. Analysts at Wal-Mart look for items with increasing sales in some region.

Page 78: Databases  2

78

Common Architecture

• Databases at store branches handle OLTP.
• Local store databases are copied to a central warehouse overnight.
• Analysts use the warehouse for OLAP.

Page 79: Databases  2

79

Star Schemas

• A star schema is a common organization for data at a warehouse. It consists of:

1. Fact table : a very large accumulation of facts such as sales. Often “insert-only.”

2. Dimension tables : smaller, generally static information about the entities involved in the facts.

Page 80: Databases  2

80

Example: Star Schema

• Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.

• The fact table is a relation:
  Sales(bar, beer, drinker, day, time, price)

Page 81: Databases  2

81

Example, Continued

• The dimension tables include information about the bar, beer, and drinker “dimensions”:

  Bars(bar, addr, license)
  Beers(beer, manf)
  Drinkers(drinker, addr, phone)

Page 82: Databases  2

82

Dimensions and Dependent Attributes

• Two classes of fact-table attributes:
  1. Dimension attributes: the key of a dimension table.
  2. Dependent attributes: a value determined by the dimension attributes of the tuple.

Page 83: Databases  2

83

Example: Dependent Attribute

• price is the dependent attribute of our example Sales relation.

• It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time attributes).

Page 84: Databases  2

84

Approaches to Building Warehouses

1. ROLAP = “relational OLAP”: Tune a relational DBMS to support star schemas.

2. MOLAP = “multidimensional OLAP”: Use a specialized DBMS with a model such as the “data cube.”

Page 85: Databases  2

85

ROLAP Techniques

1. Bitmap indexes : For each key value of a dimension table (e.g., each beer for relation Beers) create a bit-vector telling which tuples of the fact table have that value.

2. Materialized views : Store the answers to several useful queries (views) in the warehouse itself.

Page 86: Databases  2

86

Typical OLAP Queries

• Often, OLAP queries begin with a “star join”: the natural join of the fact table with all or most of the dimension tables.

• Example:

SELECT *
FROM Sales, Bars, Beers, Drinkers
WHERE Sales.bar = Bars.bar AND
      Sales.beer = Beers.beer AND
      Sales.drinker = Drinkers.drinker;

Page 87: Databases  2

87

Typical OLAP Queries --- 2

• The typical OLAP query will:
  1. Start with a star join.
  2. Select for interesting tuples, based on dimension data.
  3. Group by one or more dimensions.
  4. Aggregate certain attributes of the result.

Page 88: Databases  2

88

Example: OLAP Query

• For each bar in Palo Alto, find the total sale of each beer manufactured by Anheuser-Busch.

1. Star join: Sales with the dimension tables Bars and Beers.
2. Filter: addr = “Palo Alto” and manf = “Anheuser-Busch”.
3. Grouping: by bar and beer.
4. Aggregation: Sum of price.

Page 89: Databases  2

89

Example: In SQL

SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars
     NATURAL JOIN Beers
WHERE addr = 'Palo Alto' AND
      manf = 'Anheuser-Busch'
GROUP BY bar, beer;

Page 90: Databases  2

90

Using Materialized Views

• A direct execution of this query from Sales and the dimension tables could take too long.

• If we create a materialized view that contains enough information, we may be able to answer our query much faster.

Page 91: Databases  2

91

Example: Materialized View

• Which views could help with our query?
• Key issues:
  1. It must join Sales, Bars, and Beers, at least.
  2. It must group by at least bar and beer.
  3. It must not select out Palo-Alto bars or Anheuser-Busch beers.
  4. It must not project out addr or manf.

Page 92: Databases  2

92

Example --- Continued

• Here is a materialized view that could help:

CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS
SELECT bar, addr, beer, manf, SUM(price) sales
FROM Sales NATURAL JOIN Bars
     NATURAL JOIN Beers
GROUP BY bar, addr, beer, manf;

• Since bar -> addr and beer -> manf, there is no real grouping. We need addr and manf in the SELECT.

Page 93: Databases  2

93

Example --- Concluded

• Here’s our query using the materialized view BABMS:

SELECT bar, beer, sales
FROM BABMS
WHERE addr = 'Palo Alto' AND
      manf = 'Anheuser-Busch';
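To see that the view-based query returns the same answer as the direct query, both can be run on a toy warehouse. The sketch below uses Python’s sqlite3 module. The sample rows are invented for illustration, and because SQLite has no materialized views, an ordinary table built with CREATE TABLE ... AS stands in for BABMS; a production system would use whatever materialized-view feature its DBMS provides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Sales(bar TEXT, beer TEXT, drinker TEXT, day TEXT, time TEXT, price REAL);

-- Invented sample data.
INSERT INTO Bars VALUES ('Joe''s Bar', 'Palo Alto', 'full'),
                        ('Sue''s Bar', 'Menlo Park', 'beer');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch'), ('WickedAle', 'Pete''s');
INSERT INTO Sales VALUES
  ('Joe''s Bar', 'Bud',       'Janeway', '2024-01-01', '18:00', 3.0),
  ('Joe''s Bar', 'Bud',       'Spock',   '2024-01-01', '19:00', 3.5),
  ('Joe''s Bar', 'WickedAle', 'Janeway', '2024-01-02', '18:00', 5.0),
  ('Sue''s Bar', 'Bud',       'Spock',   '2024-01-02', '20:00', 2.5);

-- A plain table stands in for the materialized view BABMS.
CREATE TABLE BABMS AS
  SELECT bar, addr, beer, manf, SUM(price) AS sales
  FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
  GROUP BY bar, addr, beer, manf;
""")

direct = con.execute("""
  SELECT bar, beer, SUM(price)
  FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
  WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
  GROUP BY bar, beer
""").fetchall()

via_view = con.execute("""
  SELECT bar, beer, sales FROM BABMS
  WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
""").fetchall()

print(direct)
print(via_view)
```

The stand-in table is a snapshot, so it would need to be rebuilt whenever Sales changes, which fits the overnight-reconstruction model described for warehouses.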

Page 94: Databases  2

94

MOLAP and Data Cubes

• Keys of dimension tables are the dimensions of a hypercube.
  – Example: for the Sales data, the four dimensions are bars, beers, drinkers, and time.

• Dependent attributes (e.g., price) appear at the points of the cube.

Page 95: Databases  2

95

Marginals

• The data cube also includes aggregation (typically SUM) along the margins of the cube.

• The marginals include aggregations over one dimension, two dimensions,…

Page 96: Databases  2

96

Example: Marginals

• Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days).

• It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples,…

Page 97: Databases  2

97

Structure of the Cube

• Think of each dimension as having an additional value *.

• A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s.

• Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s.
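The * convention can be made concrete with a small in-memory cube. The sketch below is an illustration with invented sample facts, and it sums price as the dependent attribute. Each fact is added to every cell obtained by replacing any subset of its four coordinates with *, so the cube holds the base points and all marginals; the function name build_cube is hypothetical.

```python
from collections import defaultdict
from itertools import product

def build_cube(facts):
    """Build a cube from (bar, beer, drinker, day, price) facts.

    Returns a dict keyed by 4-tuples in which '*' means "aggregated over".
    """
    cube = defaultdict(float)
    for *dims, price in facts:
        # Each fact contributes to 2^4 cells: keep or star out every dimension.
        for mask in product((False, True), repeat=len(dims)):
            cell = tuple("*" if star else d for d, star in zip(dims, mask))
            cube[cell] += price
    return cube

facts = [
    ("Joe's Bar", "Bud", "Janeway", "Mon", 3.0),
    ("Joe's Bar", "Bud", "Spock", "Tue", 3.5),
    ("Joe's Bar", "WickedAle", "Janeway", "Mon", 5.0),
    ("Sue's Bar", "Bud", "Spock", "Tue", 2.5),
]
cube = build_cube(facts)

print(cube[("Joe's Bar", "Bud", "*", "*")])   # 6.5
print(cube[("*", "Bud", "Spock", "*")])       # 6.0
print(cube[("*", "*", "*", "*")])             # 14.0
```

Storing every marginal this way costs up to 16 cells per fact in four dimensions, which is why real systems choose which aggregates to materialize.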

Page 98: Databases  2

98

Drill-Down

• Drill-down = “de-aggregate” = break an aggregate into its constituents.

• Example: having determined that Joe’s Bar sells very few Anheuser-Busch beers, break down his sales by particular A.-B. beer.

Page 99: Databases  2

99

Roll-Up

• Roll-up = aggregate along one or more dimensions.

• Example: given a table of how much Bud each drinker consumes at each bar, roll it up into a table giving total amount of Bud consumed for each drinker.

Page 100: Databases  2

100

Materialized Data-Cube Views

• Data cubes invite materialized views that are aggregations in one or more dimensions.

• Dimensions may not be completely aggregated --- an option is to group by an attribute of the dimension table.

Page 101: Databases  2

101

Example

• A materialized view for our Sales data cube might:

1. Aggregate by drinker completely.
2. Not aggregate at all by beer.
3. Aggregate by time according to the week.
4. Aggregate according to the city of the bar.

Page 102: Databases  2

102

Data Mining

• Data mining is a popular term for queries that summarize big data sets in useful ways.

• Examples:
  1. Clustering all Web pages by topic.
  2. Finding characteristics of fraudulent credit-card use.

Page 103: Databases  2

103

Market-Basket Data

• An important form of mining from relational data involves market baskets = sets of “items” that are purchased together as a customer leaves a store.

• Summary of basket data is frequent itemsets = sets of items that often appear together in baskets.

Page 104: Databases  2

104

Example: Market Baskets

• If people often buy hamburger and ketchup together, the store can:

1. Put hamburger and ketchup near each other and put potato chips between.

2. Run a sale on hamburger and raise the price of ketchup.

Page 105: Databases  2

105

Finding Frequent Pairs

• The simplest case is when we only want to find “frequent pairs” of items.

• Assume data is in a relation Baskets(basket, item).

• The support threshold s is the minimum number of baskets in which a pair appears before we are interested.

Page 106: Databases  2

106

Frequent Pairs in SQL

SELECT b1.item, b2.item
FROM Baskets b1, Baskets b2
WHERE b1.basket = b2.basket
  AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= s;

• WHERE: look for two Baskets tuples with the same basket and different items. First item must precede second, so we don’t count the same pair twice.
• GROUP BY: create a group for each pair of items that appears in at least one basket.
• HAVING: throw away pairs of items that do not appear at least s times.
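The query can be tried on a toy Baskets relation with Python’s sqlite3 module. The sample baskets and the threshold s = 2 are invented for illustration; the threshold is passed as a query parameter, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Baskets(basket INTEGER, item TEXT)")
con.executemany(
    "INSERT INTO Baskets VALUES (?, ?)",
    [
        (1, "hamburger"), (1, "ketchup"), (1, "chips"),
        (2, "hamburger"), (2, "ketchup"),
        (3, "hamburger"), (3, "chips"),
        (4, "ketchup"),
    ],
)

s = 2  # support threshold: minimum number of baskets containing the pair
pairs = con.execute(
    """
    SELECT b1.item, b2.item
    FROM Baskets b1, Baskets b2
    WHERE b1.basket = b2.basket
      AND b1.item < b2.item
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= ?
    ORDER BY b1.item, b2.item
    """,
    (s,),
).fetchall()

print(pairs)   # [('chips', 'hamburger'), ('hamburger', 'ketchup')]
```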

Page 107: Databases  2

107

A-Priori Trick --- 1

• Straightforward implementation involves a join of a huge Baskets relation with itself.

• The a-priori algorithm speeds the query by recognizing that a pair of items {i,j } cannot have support s unless both {i } and {j } do.

Page 108: Databases  2

108

A-Priori Trick --- 2

• Use a materialized view to hold only information about frequent items.

INSERT INTO Baskets1(basket, item)
SELECT * FROM Baskets
WHERE item IN (
    SELECT item FROM Baskets
    GROUP BY item
    HAVING COUNT(*) >= s
);

• The subquery selects the items that appear in at least s baskets.

Page 109: Databases  2

109

A-Priori Algorithm

1. Materialize the view Baskets1.
2. Run the obvious query, but on Baskets1 instead of Baskets.

• Baskets1 is cheap, since it doesn’t involve a join.
• Baskets1 probably has many fewer tuples than Baskets.
  – Running time shrinks with the square of the number of tuples involved in the join.
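Both steps can be run end to end on a small example. The sketch below again uses sqlite3 with invented baskets and s = 2. A plain table plays the role of the materialized view Baskets1, and the helper frequent_pairs is a hypothetical wrapper that runs the frequent-pairs query against whichever table it is given.

```python
import sqlite3

PAIRS_SQL = """
SELECT b1.item, b2.item
FROM {t} b1, {t} b2
WHERE b1.basket = b2.basket AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= ?
ORDER BY b1.item, b2.item
"""

def frequent_pairs(con, table, s):
    """Run the frequent-pairs query on the named table (trusted name only)."""
    return con.execute(PAIRS_SQL.format(t=table), (s,)).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Baskets(basket INTEGER, item TEXT)")
con.executemany(
    "INSERT INTO Baskets VALUES (?, ?)",
    [
        (1, "hamburger"), (1, "ketchup"), (1, "mustard"),
        (2, "hamburger"), (2, "ketchup"), (2, "relish"),
        (3, "hamburger"), (3, "chips"),
        (4, "ketchup"), (4, "napkins"),
    ],
)
s = 2

# Step 1: keep only tuples whose item is frequent on its own.
con.execute("CREATE TABLE Baskets1(basket INTEGER, item TEXT)")
con.execute(
    """
    INSERT INTO Baskets1(basket, item)
    SELECT * FROM Baskets
    WHERE item IN (
        SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= ?
    )
    """,
    (s,),
)

# Step 2: run the pair query on the smaller relation.
print(frequent_pairs(con, "Baskets", s))
print(frequent_pairs(con, "Baskets1", s))
print(con.execute("SELECT COUNT(*) FROM Baskets1").fetchone()[0])
```

Both calls return the same pairs, while the self-join in the second one ranges over 6 tuples instead of 10.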

Page 110: Databases  2

110

Example: A-Priori

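• Suppose:
  1. A supermarket sells 10,000 items.
  2. The average basket has 10 items.
  3. The support threshold is 1% of the baskets.
• At most 1/10 of the items can be frequent: the baskets hold 10 item occurrences each on average, and every frequent item accounts for at least 1% of the baskets, so at most 10 / 0.01 = 1,000 of the 10,000 items can be frequent.
• Probably, the minority of items in one basket are frequent -> factor 4 speedup, because halving the tuples in the join cuts the quadratic running time to a quarter.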