
Relative Information Capacity of Simple Relational Database Schemata

Paper by: Richard Hull
Presented by: Jose Picado

Outline

• Problem: Data relativism and information capacity
  – Definition
  – Examples
  – Importance
• Hierarchy of dominance measures
• Basic results
• Discussion


Data relativism

• Represent the same data in different ways
• Represent the same data under different schemas

Schema 1: Person(name, sex, spouse)
Schema 2: Female(name), Male(name), Marriage(husband, wife)

Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.


Relative information capacity

• Expressiveness of a schema
• Different schemas representing the same data may have different information capacity

Schema 1: Person(name, sex, spouse)
Schema 2: Female(name), Male(name), Marriage(husband, wife)

Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.

Relative information capacity

• Expressiveness of a schema
• Different schemas representing the same data may have different information capacity

Person(name, sex, spouse)
Female(name), Male(name), Marriage(husband, wife)

Schema 1:
• Does not require that the spouse attribute of a man goes to a woman.
• Does not require that for each spouse attribute in one direction there is a corresponding spouse attribute in the other direction.

Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.

Relative information capacity

• Expressiveness of a schema
• Different schemas representing the same data may have different information capacity

Person(name, sex, spouse)
Female(name), Male(name), Marriage(husband, wife)

Schema 2:
• Allows unmarried people to be represented in the database.

Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.


Relative information capacity

• Possible solution:
  – Transform existing schema to new schema by structural manipulations
  – Information capacity preserving?

Person(name, sex, spouse)  --transformation-->  Female(name), Male(name), Marriage(husband, wife)

Importance

• Schema evolution
  – None of the information stored in the initial database is lost

Person(name, sex, spouse)
Female(name), Male(name), Marriage(husband, wife)

Importance

• Data integration
  – All information in one of the component databases is reflected in the integrated database

Component database 1: City(name, state), State(name, capital)
Component database 2: City(name, isCapital, country), Country(name, language, currency)
Integrated database: City(name, place), Country(name, language, currency, capital), State(name, capital)

Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.

Importance

• Database normalization theory
• User view construction
• Schema simplification
• Translation between data models

Hull’s paper

• Introduces theoretical tools for studying measures of relative information capacity
  – Theoretical frameworks at the time were complex
  – There was no clear definition of the concept
  – Hull introduced nice ways of comparing schemata and their information capacity
• Defines a hierarchy of measures to compare the information capacity of schemata

Hull’s paper

• Gives some basic results concerning the previous measures
• Considers only non-keyed relations

            Non-keyed           Keyed
Relations:  Person(id, name)    Person(id, name)
Instances:  123 John            123 John
            123 Mary            123 Mary

Definitions

• Schema P is a set of relations
• Relations are composed of attributes, which may be of different basic types
• Basic types are domain designators (each has a fixed domain of possible values)
• I(P) is the set of instances of P, usually infinite

Schema P: Person(id, name)

Instances I(P):
  {(111, John), (222, Mary)}
  {(123, Anne), (234, Joe)}
  {(aaa, Jack), (bbb, Ted)}


Transformation

• P and Q are relational schemata
• A transformation from P to Q is a map σ: I(P) → I(Q)

P: Person(id, name), Birth(id, date)
Q: PersonInfo(id, name, bdate)

PersonInfo(x,y,z) :- Person(x,y), Birth(x,z).
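The rule above can be read as a function from instances of P to instances of Q. A minimal Python sketch follows, assuming relations are modeled as sets of tuples and an instance as a dict of relations; the function name `person_info_map` and the sample rows are invented for illustration and are not from the paper.

```python
def person_info_map(instance):
    """PersonInfo(x, y, z) :- Person(x, y), Birth(x, z)."""
    person_info = {
        (pid, name, date)
        for (pid, name) in instance["Person"]
        for (bid, date) in instance["Birth"]
        if pid == bid  # join on the shared id
    }
    return {"PersonInfo": person_info}


# Hypothetical instance of P (values made up for this example).
p_instance = {
    "Person": {(1, "John"), (2, "Mary")},
    "Birth": {(1, "1980-01-01"), (2, "1985-06-15")},
}
q_instance = person_info_map(p_instance)
```

Each instance of P is sent to exactly one instance of Q, which is all that the definition of a transformation asks for.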

Dominance

• P and Q are relational schemata
• Q dominates P via (σ, τ), where σ: I(P) → I(Q) and τ: I(Q) → I(P), if the composition of σ followed by τ is the identity on I(P)

Dominance

P: Person(name, sex, spouse)
Q: Female(name), Male(name), Marriage(husband, wife)

Dominance

1. Take an instance of P, i.e. an element of I(P)

Person
John male Mary
Mary female John
Anne female Joe
Joe male Anne

Dominance

2. Apply σ to I(P)

Male(x) :- Person(x,y,z), y="male".
Female(x) :- Person(x,y,z), y="female".
Marriage(x,y) :- Person(x,u,y), Person(y,v,x), u="male", v="female".

I(P):
Person
John male Mary
Mary female John
Anne female Joe
Joe male Anne

σ(I(P)):
Male
John
Joe

Female
Mary
Anne

Marriage
John Mary
Joe Anne

Dominance

3. Apply τ to σ(I(P))

Person(x,"male",z) :- Male(x), Marriage(x,z).
Person(x,"female",z) :- Female(x), Marriage(z,x).

σ(I(P)):
Male
John
Joe

Female
Mary
Anne

Marriage
John Mary
Joe Anne

τ(σ(I(P))):
Person
John male Mary
Mary female John
Anne female Joe
Joe male Anne

Dominance

4. Compare I(P) and τ(σ(I(P)))

I(P):
Person
John male Mary
Mary female John
Anne female Joe
Joe male Anne

τ(σ(I(P))):
Person
John male Mary
Mary female John
Anne female Joe
Joe male Anne
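The four steps can be replayed in code. The sketch below is a hedged Python rendering of the two sets of rules, assuming relations are sets of tuples; the names `to_q`, `to_p`, and the `one_way` instance are invented here. The rule for wives reads the husband from the first column of Marriage, since Marriage stores (husband, wife).

```python
def to_q(person):
    """Person -> Male, Female, Marriage (the rules applied in step 2)."""
    male = {(x,) for (x, sex, _) in person if sex == "male"}
    female = {(x,) for (x, sex, _) in person if sex == "female"}
    marriage = {
        (x, y)
        for (x, u, y) in person
        if u == "male" and (y, "female", x) in person
    }
    return {"Male": male, "Female": female, "Marriage": marriage}


def to_p(q):
    """Male, Female, Marriage -> Person (the rules applied in step 3)."""
    husbands = {(h, "male", w) for (h, w) in q["Marriage"] if (h,) in q["Male"]}
    wives = {(w, "female", h) for (h, w) in q["Marriage"] if (w,) in q["Female"]}
    return husbands | wives


person = {
    ("John", "male", "Mary"),
    ("Mary", "female", "John"),
    ("Anne", "female", "Joe"),
    ("Joe", "male", "Anne"),
}
round_trip_ok = to_p(to_q(person)) == person

# Hypothetical instance with a spouse attribute in one direction only.
one_way = {("Al", "male", "Sue")}
one_way_ok = to_p(to_q(one_way)) == one_way
```

The round trip returns the original rows for the instance shown in the slides. For the one-directional instance the Marriage relation comes out empty and the row is lost, so this particular pair of maps is the identity only on instances whose spouse attributes are symmetric. Checking one instance illustrates the definition; it does not prove dominance, which quantifies over all of I(P).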

Dominance

• P and Q are relational schemata
• Q dominates P via (σ, τ) if the composition of σ followed by τ is the identity on I(P)

Q has at least as much capacity for storing information as P

Information structured according to P can be restructured to “fit” into Q, and then restructured again to “fit” back into P

Equivalence

• P and Q are equivalent (xxx) if they have equivalent information capacity
• P and Q are equivalent (xxx) if
  – Q dominates P (xxx) and
  – P dominates Q (xxx)

Here xxx stands for one of the dominance measures: calc, gen, int, or abs.

Information dominance measures

1. Calculous dominance (more restrictive)
2. Generic dominance
3. Internal dominance
4. Absolute dominance (less restrictive)

Types of equivalency

1. P and Q are equivalent (calc) (more restrictive)
2. P and Q are equivalent (gen)
3. P and Q are equivalent (int)
4. P and Q are equivalent (abs) (less restrictive)

Level 1: Calculous dominance

• Only allow transformations to be relational calculus expressions
• Relational calculus:
  – First order logic or predicate calculus
  – Predicates: atom, …
  – Each query Q(x1, …, xn) is a predicate P

Level 1: Calculous dominance

• Only allow transformations to be relational calculus expressions
• The transformations from P to Q and back (σ and τ) are relational calculus expressions ⇒ Q dominates P calculously


Level 2: Generic dominance

• Only allow transformations that treat domain elements as “essentially uninterpreted objects”

• Treat all elements as equals except some set of constants

• Property of all query languages, such as SQL and Datalog

Level 2: Generic dominance

• Only allow transformations that treat domain elements as “essentially uninterpreted objects”
• The transformations from P to Q and back (σ and τ) treat all elements as equals ⇒ Q dominates P generically
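One way to make "treats all elements as equals" concrete is to test whether a transformation commutes with a renaming of domain values. The Python sketch below is an illustration under that reading, not Hull's formal definition; `rename`, `commutes_with`, and the two sample transformations are hypothetical names introduced here.

```python
def rename(relation, mapping):
    """Apply a renaming of domain values to every tuple of a relation."""
    return {tuple(mapping.get(v, v) for v in row) for row in relation}


def commutes_with(transform, relation, mapping):
    """True if renaming then transforming equals transforming then renaming."""
    return transform(rename(relation, mapping)) == rename(transform(relation), mapping)


def swap_columns(relation):
    # Never inspects individual values, only their positions.
    return {(b, a) for (a, b) in relation}


def keep_john(relation):
    # Singles out the value "John", so it does not treat all elements alike.
    return {row for row in relation if row[0] == "John"}


pairs = {("John", "Mary"), ("Joe", "Anne")}
swap_names = {"John": "Joe", "Joe": "John"}  # a permutation of two values

swap_is_generic_here = commutes_with(swap_columns, pairs, swap_names)
filter_is_generic_here = commutes_with(keep_john, pairs, swap_names)
```

The column swap commutes with the permutation, while the filter on "John" does not. The filter would pass only if "John" belonged to the fixed set of constants that the permutation must leave unchanged.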

Level 3: Internal dominance

• Only allow transformations that do not invent any data
• Invent data: numerical computations or string manipulations

(player, goals, games) → (player, performance), where performance = goals/games
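A rough check for invented data is to compare the values that appear in the output with those that appear in the input. The Python sketch below assumes this simple reading (no fixed set of allowed constants); `active_domain`, `invents_data`, and the sample rows are made up for illustration.

```python
def active_domain(relation):
    """All values that occur anywhere in the relation."""
    return {value for row in relation for value in row}


def invents_data(input_relation, output_relation):
    """True if the output contains a value that the input does not."""
    return not active_domain(output_relation) <= active_domain(input_relation)


# Hypothetical (player, goals, games) rows.
stats = {("Ann", 7, 2), ("Bob", 3, 4)}

# performance = goals / games creates values such as 3.5 and 0.75.
performance = {(player, goals / games) for (player, goals, games) in stats}

# A projection only rearranges values that are already present.
projection = {(player, goals) for (player, goals, _) in stats}

performance_invents = invents_data(stats, performance)
projection_invents = invents_data(stats, projection)
```

The division produces numbers absent from the input, so it counts as inventing data, whereas the projection does not.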

Level 3: Internal dominance

• Only allow transformations that do not invent any data
• The transformations from P to Q and back (σ and τ) do not invent data ⇒ Q dominates P internally

Level 4: Absolute dominance

• Some finite set of values Y
• I_Y(P): instances of P that contain only values in Y
• |I_Y(P)|: cardinality of instances of P containing only values in Y
• If |I_Y(P)| ≤ |I_Y(Q)| then Q dominates P absolutely
• Easy to compute: based on counting of instances, instead of transformations
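The counting can be sketched directly under simplifying assumptions that are mine rather than the slide's: a single basic type, non-keyed relations, and every set of tuples counted as a legal relation instance. A relation of arity k over n values then has n^k possible tuples and 2^(n^k) possible instances. The function name `count_instances` and the example arities are hypothetical.

```python
def count_instances(arities, n):
    """Number of schema instances using only n values.

    `arities` lists the arity of each relation in the schema. Assumes a
    single basic type and non-keyed relations (any set of tuples is valid).
    """
    return 2 ** sum(n ** k for k in arities)


# Hypothetical schemata: P has two binary relations, Q one ternary relation.
p_arities = [2, 2]
q_arities = [3]

comparison = {
    n: count_instances(p_arities, n) <= count_instances(q_arities, n)
    for n in range(1, 6)
}
```

With these arities the inequality fails for a one-value set (4 instances against 2), holds with equality for two values, and holds strictly from three values on, which suggests why such comparisons are usually stated for sufficiently large value sets.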

Basic results

• Q dominates P calculously
  ⇒ Q dominates P generically
  ⇒ Q dominates P internally
  ⇒ Q dominates P absolutely

Basic results

• Sometimes absolute and internal dominance hold, but generic and calculous dominance don’t

P: one relation over (A, B)
Q: two relations, over (A, A) and (B, B)

Q dominates P (abs, int)
• The counting condition holds (abs) and the transformation (int) does not invent data

Q does not dominate P (gen, calc)
• There is no transformation (gen, calc) that takes instances of P to Q and then back to P

Basic results

• Absolute dominance is useful for verifying calculous (non-)dominance

P: two relations, over (A, B) and (A, C)
Q: one relation over (A, B, C)

• Q dominates P calculously ⇒ Q dominates P absolutely
• P does not dominate Q absolutely ⇒ P does not dominate Q calculously

*under certain constraints

Basic results

• Dominance is preserved by re-namings of basic types (homomorphism)
  – h(P): image of P under a homomorphism h
  – If Q dominates P then h(Q) dominates h(P), for any measure of dominance (calc, gen, int, abs)


Basic results

• Calculous dominance does not accurately measure the presence of “semantic correspondence”

P:
  NAME NUMBER NUMBER     R1(name, goals, minutes)     R2(title, pages, edition)
  NAME NAME NUMBER       S1(name, position, goals)    S2(title, publisher, pages)

Q:
  NAME NAME NUMBER NUMBER     T

Q dominates P (calc), but there is no semantic mapping from P to Q

Basic results

• For non-keyed relational schemata over a single basic type, all four types of dominance are equivalent

Theorem: Let P and Q be non-keyed relational schemata over a single basic type B. Then the following are equivalent:
a. Q dominates P (calc)
b. Q dominates P (gen)
c. Q dominates P (int)
d. Q dominates P (abs)


Basic results

• With any reasonable measure of relative information capacity, two non-keyed relational schemata are equivalent iff they are identical

• In the relational model (non-keyed), there is essentially at most one way to represent a given data set

Discussion

• Strong points:
  – ???


Discussion

• Strong points:
  1. Provides a theory to study relative information capacity
  2. Data relativism is important as it arises in many areas
  3. Defines a hierarchy of dominance measures
  4. Gives important results about the relational model

Discussion

• Weak points:
  – ???

Discussion

• Weak points:
  1. Does not support dependencies/constraints
     • Hierarchy of dominance measures
     • Basic results

Discussion

• Functional dependency (FD): Given sets of attributes X and Y in relation R, the functional dependency X → Y means that all tuples in R that agree on the attributes in X must also agree on the attributes in Y.

id   name  address
123  John  21 Kings St.
234  Mary  31 Kings St.
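The definition translates into a short check. The Python sketch below assumes rows are dicts keyed by attribute name; `satisfies_fd` is a hypothetical helper, and the violating row is invented to show a failure.

```python
def satisfies_fd(rows, lhs, rhs):
    """True if rows that agree on `lhs` attributes also agree on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key.
        if seen.setdefault(key, value) != value:
            return False
    return True


people = [
    {"id": 123, "name": "John", "address": "21 Kings St."},
    {"id": 234, "name": "Mary", "address": "31 Kings St."},
]
fd_holds = satisfies_fd(people, ["id"], ["name", "address"])

# Made-up row that reuses id 123 with a different name.
with_clash = people + [{"id": 123, "name": "Mary", "address": "21 Kings St."}]
fd_holds_with_clash = satisfies_fd(with_clash, ["id"], ["name"])
```

The table from the slide satisfies id → name, address. Adding a second row with id 123 and a different name breaks it.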

Discussion

• Multivalued dependency (MVD): For MVD X →→ Y, if two tuples of R agree on all the attributes of X, then their components in Y may be swapped, and the result will be two tuples that are also in the relation.

course                   book                 lecturer
Machine Learning         Pattern Recognition  John
Artificial Intelligence  AIMA                 Mary
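The swapping condition can be tested by brute force over pairs of tuples. The Python sketch below assumes rows are dicts keyed by attribute name; `satisfies_mvd` and the extra rows (a second book and a second lecturer) are hypothetical additions for illustration.

```python
def satisfies_mvd(rows, x, y):
    """True if, for tuples agreeing on `x`, swapping `y` values stays in the relation."""
    relation = {frozenset(row.items()) for row in rows}
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in x):
                swapped = dict(t1)
                swapped.update({a: t2[a] for a in y})  # take Y from t2
                if frozenset(swapped.items()) not in relation:
                    return False
    return True


def row(course, book, lecturer):
    return {"course": course, "book": book, "lecturer": lecturer}


# Made-up rows: one course with two books and two lecturers.
partial = [
    row("Machine Learning", "Pattern Recognition", "John"),
    row("Machine Learning", "Bishop", "Ann"),
]
complete = partial + [
    row("Machine Learning", "Bishop", "John"),
    row("Machine Learning", "Pattern Recognition", "Ann"),
]

mvd_partial = satisfies_mvd(partial, ["course"], ["book"])
mvd_complete = satisfies_mvd(complete, ["course"], ["book"])
```

With only two rows the swapped combinations are missing, so course →→ book fails; once every book is paired with every lecturer of the course, it holds.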

Discussion

• Inclusion dependency (IND): For R1[X] ⊆ R2[Y], for any tuple t1 in R1, there must exist a tuple t2 in R2 such that t1[X] = t2[Y].

Book
id   title
111  Pattern Recognition
222  AIMA

Order
bookid  customer
111     John
222     Mary
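For the Book and Order tables, the dependency reads Order[bookid] ⊆ Book[id]. A minimal Python sketch of the check follows, assuming rows are dicts keyed by attribute name; `satisfies_ind` and the dangling order row are hypothetical.

```python
def satisfies_ind(r1, attrs1, r2, attrs2):
    """True if every r1 tuple projected on attrs1 appears in r2 projected on attrs2."""
    targets = {tuple(t[a] for a in attrs2) for t in r2}
    return all(tuple(t[a] for a in attrs1) in targets for t in r1)


books = [
    {"id": 111, "title": "Pattern Recognition"},
    {"id": 222, "title": "AIMA"},
]
orders = [
    {"bookid": 111, "customer": "John"},
    {"bookid": 222, "customer": "Mary"},
]
ind_holds = satisfies_ind(orders, ["bookid"], books, ["id"])

# Made-up order that refers to a book id not present in Book.
dangling = orders + [{"bookid": 333, "customer": "Joe"}]
ind_holds_dangling = satisfies_ind(dangling, ["bookid"], books, ["id"])
```

Every order in the slide refers to an existing book, so the dependency holds; the order for id 333 has no matching book and violates it.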

Discussion

• Weak points:
  1. Does not support dependencies/constraints
     • Hierarchy of dominance measures
     • Basic results

Dependencies change the final result of the paper

Discussion

• Weak points:
  1. Does not support dependencies/constraints
     • Hierarchy of dominance measures
     • Basic results
  2. Open questions:
     • Does absolute dominance imply internal dominance?
     • Does generic dominance imply calculous dominance?
     • Is there a measure for “semantic correspondence”?


Thank you

Quiz

• What are the four formal measures of relative information capacity defined by Hull? Write them in order from most restrictive to least restrictive.