Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto.

Post on 19-Dec-2015


Well-designed XML Data

Marcelo Arenas and Leonid Libkin

University of Toronto

Outline

Part 1 - Database Normalization from the 1970s and 1980s.

Part 2 - Classical theory revisited: normalizing XML documents.

Part 3 - Classical theory re-done: new justifications for normalization.

2

Part 1: Classical Normalization

Design: decide how to represent the information in a particular data model.

• Even for simple application domains there is a large number of ways of representing the data of interest.

We have to design the schema of the database.

• Set of relations.

• Set of attributes for each relation.

• Set of data dependencies.

3

Designing a Database: An Example

Attributes: number, title, section, room.

Data dependency: every course number is associated with only one title.

Relational Schema:

BAD alternative:

R(number, title, section, room), number → title

GOOD alternative:

S(number, title), number → title

T(number, section, room), ∅ (no dependencies)

4

Problems with BAD: Update Anomaly

number   title                      section   room
CSC258   Computer Organization      1         LP266
CSC258   Computer Organization      2         GB258
CSC258   Computer Organization      3         GB248
CSC434   Database Systems           1         GB248

Title of CSC258 is changed to Computer Organization I.

number   title                      section   room
CSC258   Computer Organization I    1         LP266
CSC258   Computer Organization I    2         GB258
CSC258   Computer Organization I    3         GB248
CSC434   Database Systems           1         GB248

The title has to be changed in every CSC258 row: the instance stores redundant information.

5

Deletion Anomaly

number   title                      section   room
CSC258   Computer Organization I    1         LP266
CSC258   Computer Organization I    2         GB258
CSC258   Computer Organization I    3         GB248
CSC434   Database Systems           1         GB248

CSC434 is not given in this term.

number   title                      section   room
CSC258   Computer Organization I    1         LP266
CSC258   Computer Organization I    2         GB258
CSC258   Computer Organization I    3         GB248

Additional effect: all the information about CSC434 was deleted.

6

Insertion Anomaly

number   title                      section   room
CSC258   Computer Organization I    1         LP266
CSC258   Computer Organization I    2         GB258
CSC258   Computer Organization I    3         GB248

A new course is created: (CSC336, Numerical Methods)

number   title                      section   room
CSC258   Computer Organization I    1         LP266
CSC258   Computer Organization I    2         GB258
CSC258   Computer Organization I    3         GB248
CSC336   Numerical Methods          ?         ?

The new row needs section and room values that do not exist yet: the instance stores attributes that are not directly related.

7

Avoiding Update Anomalies

number   title
CSC258   Computer Organization
CSC434   Database Systems

number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
CSC434   1         GB248

Title of CSC258 is changed to Computer Organization I.

• Only one row of the first relation changes. The instance does not store redundant information.

CSC434 is not given in this term.

• Its row is removed from the second relation only. The title of CSC434 is not removed from the instance.

A new course is created: (CSC336, Numerical Methods)

• A row is added to the first relation only. No information about sections has to be provided.

Resulting instance:

number   title
CSC258   Computer Organization I
CSC434   Database Systems
CSC336   Numerical Methods

number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248

Each relation stores attributes that are directly related.

8
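The three operations on the GOOD alternative can be sketched in a few lines of Python. This is only an illustration under assumptions: the class `CourseDB` and its method names are hypothetical, S(number, title) is held as a dictionary and T(number, section, room) as a set of tuples.

```python
class CourseDB:
    """Hypothetical in-memory version of S(number, title) and T(number, section, room)."""

    def __init__(self):
        self.titles = {}       # S: number -> title, so number -> title holds by construction
        self.sections = set()  # T: (number, section, room)

    def add_course(self, number, title):
        # Insertion: no section or room is needed to record a new course.
        self.titles[number] = title

    def add_section(self, number, section, room):
        if number not in self.titles:
            raise KeyError(f"unknown course {number}")
        self.sections.add((number, section, room))

    def rename_course(self, number, new_title):
        # Update: the title is stored once, so one assignment is enough.
        self.titles[number] = new_title

    def cancel_sections(self, number):
        # Deletion: only T changes, the title stays in S.
        self.sections = {row for row in self.sections if row[0] != number}

    def flat_view(self):
        # Join of S and T, i.e. the rows of the BAD relation R.
        return sorted((n, self.titles[n], s, r) for (n, s, r) in self.sections)


db = CourseDB()
db.add_course("CSC258", "Computer Organization")
db.add_course("CSC434", "Database Systems")
for number, section, room in [("CSC258", 1, "LP266"), ("CSC258", 2, "GB258"),
                              ("CSC258", 3, "GB248"), ("CSC434", 1, "GB248")]:
    db.add_section(number, section, room)

db.rename_course("CSC258", "Computer Organization I")
db.cancel_sections("CSC434")
db.add_course("CSC336", "Numerical Methods")
```

After these calls `db.titles` still contains CSC434 and already contains CSC336, while `db.flat_view()` returns only the three CSC258 rows.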

Normalization Theory

Main idea: a normal form defines a condition that a well designed database should satisfy.

Normal form: syntactic condition on the database schema.

• Defined for a class of data dependencies.

Main problems:

• How to test whether a database schema is in a particular normal form.

• How to transform a database schema into an equivalent one satisfying a particular normal form.
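For functional dependencies and BCNF, the first problem has a standard solution based on attribute closure. The sketch below is an illustration rather than part of the talk: the function names `closure` and `is_bcnf` are hypothetical, and FDs are assumed to be given as (left-hand side, right-hand side) pairs of Python sets.

```python
def closure(attrs, fds):
    """Closure of a set of attributes under FDs given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire the FD if its left-hand side is covered and it adds something new.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_bcnf(attributes, fds):
    """True if every non-trivial FD in `fds` has a superkey as its left-hand side."""
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial FD
        if closure(lhs, fds) != set(attributes):
            return False
    return True


bad = is_bcnf({"number", "title", "section", "room"}, [({"number"}, {"title"})])
good_s = is_bcnf({"number", "title"}, [({"number"}, {"title"})])
good_t = is_bcnf({"number", "section", "room"}, [])
```

With the course example, `bad` is `False` (number is not a key of R), while `good_s` and `good_t` are both `True`.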

9

Normalization Theory Today

Normalization theory for relational databases was developed in the 70s and 80s.

Why do we need normalization theory today?

• New data models have emerged: XML.

• XML documents can contain redundant information.

Redundant information in XML documents:

• Can be discovered if the user provides semantic information.

• Can be eliminated.

10

Part 2: XML and Normalization

XML Document:

<courses>
  <course cno="CSC258">
    <taken_by>
      <student sno="st1">
        <name> Fox </name>
        <grade> B+ </grade>
      </student>
    </taken_by>
  </course>
  <course cno="CSC434"> ... </course>
</courses>

DTD:

courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno
student → name, grade
name → #PCDATA
grade → #PCDATA

11

XML Databases

XML Schema: (D, Σ)

D:

courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno
student → name, grade

Σ: Two students with the same @sno value must have the same name.

12

Redundancy in XML

[Figure: XML tree of the courses document. The student with @sno “st1” and name “Fox” appears under both course “CSC258” and course “CSC434”, with grades “B+” and “A+”, so the pair (“st1”, “Fox”) is stored twice. The figure also contains an info branch holding “st1” and “Fox”.]

13

XML Database Normalization

Data dependency:

Two students with the same @sno value must have the same name.

DTD:

courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno
student → name, grade

Normalized DTD:

courses → course*, info*
course → @cno
course → taken_by
taken_by → student*
student → @sno
student → grade
info → @sno
info → name

@sno is the identifier of info elements.

14
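The move from the original DTD to the normalized one can be sketched with Python's standard `xml.etree.ElementTree` module. This is a hedged illustration, not the authors' algorithm: the function name `normalize` is hypothetical, and the sample document assumes that student st1 got B+ in CSC258 and A+ in CSC434.

```python
import xml.etree.ElementTree as ET

# Sample document (assumed data): the name "Fox" is stored once per course taken.
SOURCE = """
<courses>
  <course cno="CSC258">
    <taken_by>
      <student sno="st1"><name>Fox</name><grade>B+</grade></student>
    </taken_by>
  </course>
  <course cno="CSC434">
    <taken_by>
      <student sno="st1"><name>Fox</name><grade>A+</grade></student>
    </taken_by>
  </course>
</courses>
"""


def normalize(root):
    """Move every student name into a single info element per @sno (in place)."""
    names = {}
    # Copy the list first, because name children are removed during the loop.
    for student in list(root.iter("student")):
        sno = student.get("sno")
        name = student.find("name")
        text = name.text.strip()
        # The data dependency: one @sno value, one name.
        if names.setdefault(sno, text) != text:
            raise ValueError(f"two different names for sno={sno}")
        student.remove(name)
    for sno, text in names.items():
        info = ET.SubElement(root, "info", {"sno": sno})
        ET.SubElement(info, "name").text = text
    return root


root = normalize(ET.fromstring(SOURCE))
```

After the call, each student element keeps only its @sno and grade, and the root has one info child for st1 that holds the name Fox.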

A “Non-relational” Example

[Figure: DBLP tree. The root DBLP has conf children; a conf has a title (“ICDT”) and issue children; an issue has article children; an article has a title, author children (“Dong”, “Jarke”) and a @year attribute. The @year values “1999” and “2001” are repeated for the articles of an issue.]

15

XNF: XML Normal Form

It eliminates two types of anomalies.

It was defined for XML functional dependencies:

DBLP.conf.@title → DBLP.conf

DBLP.conf.issue → DBLP.conf.issue.article.@year

16

Problems to Address

Functional dependencies for XML.

Normal form for XML documents (XNF).

• Generalizes BCNF.

Algorithm for normalizing XML documents.

• Implication problem for functional dependencies.

17

Framework: Paths in DTDs

Paths(D): all paths in a DTD D

courses.course
courses.course.@cno
courses.course.student.name
courses.course.student.name.S

We distinguish three kinds of elements: attributes (@), strings (S) and element types.

FDs are defined by means of a relational representation of XML documents.

18

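Framework: XML Trees

[Figure: XML tree with vertices v0, …, v7. The root v0 (courses) has a course child v1 with @cno “cs100”. Course v1 has two student children: v2 with @sno “123”, name v3 (“Fox”) and grade v4 (“B+”), and v5 with @sno “456”, name v6 (“Smith”) and grade v7 (“A-”). The strings are attached to the name and grade vertices through S. Further course children are elided.]

19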

Tree Tuples

Relational representation: tree tuples - mappings

t : Paths(D) → Vertices ∪ Strings ∪ {⊥}

A tree tuple represents an XML tree:

[Figure: tree with root v0 (courses) and course child v1, which has @cno “cs100” and a student child v2.]

t(courses) = v0
t(courses.course) = v1
t(courses.course.@cno) = “cs100”
t(courses.course.student) = v2
t(p) = ⊥, for the remaining paths

20

XML Tree: set of Tree Tuples

[Figure: the XML tree with vertices v0, …, v7 shown next to two of its tree tuples. The first tuple contains v0, v1 (@cno “cs100”) and student v2 with @sno “123”, name “Fox” and grade “B+”; the second contains v0, v1 and student v5 with @sno “456”, name “Smith” and grade “A-”.]

21

Functional Dependencies for XML

Expressions of the form: X → Y

defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D).

XML tree T can be tested for satisfaction of X → Y if:

X ∪ Y ⊆ Paths(T) ⊆ Paths(D)

T ⊨ X → Y if for every pair u, v of tree tuples in T:

u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y
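The satisfaction condition can be checked directly once the tree tuples are available. The sketch below assumes they are given as Python dictionaries from paths to values, with a missing key (or `None`) playing the role of the null value; the names `satisfies` and `make_tuple` and the third tuple (student “Wolf” in course “cs200”) are hypothetical.

```python
COURSE = "courses.course"
SNO = "courses.course.student.@sno"
NAME_S = "courses.course.student.name.S"
GRADE_S = "courses.course.student.grade.S"


def make_tuple(course, student, sno, name, grade):
    """Build one tree tuple as a dict from paths to vertices or strings."""
    return {"courses": "v0", COURSE: course, "courses.course.student": student,
            SNO: sno, NAME_S: name, GRADE_S: grade}


def satisfies(tree_tuples, lhs, rhs):
    """Check X -> Y: equal and non-null values on X force equal values on Y."""
    for u in tree_tuples:
        if any(u.get(p) is None for p in lhs):
            continue  # u.X contains a null, so u imposes nothing
        for v in tree_tuples:
            if all(u.get(p) == v.get(p) for p in lhs):
                if any(u.get(p) != v.get(p) for p in rhs):
                    return False
    return True


tuples = [make_tuple("v1", "v2", "123", "Fox", "B+"),
          make_tuple("v1", "v5", "456", "Smith", "A-")]
ok = satisfies(tuples, [SNO], [NAME_S])

# Hypothetical extra tuple: same @sno as Fox but a different name, in another course.
broken = tuples + [make_tuple("v8", "v9", "123", "Wolf", "C")]
name_fd_holds = satisfies(broken, [SNO], [NAME_S])
grade_fd_holds = satisfies(broken, [COURSE, SNO], [GRADE_S])
```

Here `ok` is `True`, `name_fd_holds` is `False` because @sno “123” now has two names, and `grade_fd_holds` is still `True` because the two tuples with @sno “123” belong to different course vertices.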

22

FD: Examples

University DTD:

courses → course*
course → @cno, student*
student → @sno, name, grade

Two students with the same @sno value must have the same name:

courses.course.student.@sno → courses.course.student.name.S

Every student can have at most one grade in every course:

{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S

23

Implication Problem for FD

Given a DTD D and a set of functional dependencies Σ ∪ {φ}:

(D, Σ) ⊢ φ if for any XML tree T conforming to D and satisfying Σ, it is the case that T ⊨ φ

(D, Σ)+ = { φ | (D, Σ) ⊢ φ }

Functional dependency φ is trivial if it is implied by the DTD alone: (D, ∅) ⊢ φ

24

XNF: XML Normal Form

XML specification: a DTD D and a set of functional dependencies Σ.

A relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key.

(D, Σ) is in XNF if:

For each non-trivial FD X → p.@l or X → p.S in (D, Σ)+, X → p is in (D, Σ)+.

25

Back to DBLP

DBLP is not in XNF:

DBLP.conf.issue → DBLP.conf.issue.article.@year ∈ (D,Σ)+

DBLP.conf.issue → DBLP.conf.issue.article ∉ (D,Σ)+

Proposed solution is in XNF.

26

Normalization Algorithm

The algorithm applies two transformations until the schema is in XNF.

If there is an anomalous FD of the form:

DBLP.conf.issue → DBLP.conf.issue.article.@year

then apply the “DBLP example rule”.

Otherwise: choose a minimal anomalous FD and apply the “University example rule”.

27

Normalizing XML Documents

Theorem The decomposition algorithm terminates and outputs a specification in XNF.

Furthermore, it does not lose information:

[Figure: an unnormalized XML document and a normalized XML document, connected by queries Q1 and Q2.]

Q1, Q2 are XQuery core queries.

28

Part 3: What was Missing? Justification!

What is a good database design?

• Well-known solutions: BCNF, 4NF, …

But what is it that makes a database design good?

• Elimination of update anomalies.

• Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.

Previous work was specific for the relational model.

• Classical problems have to be revisited in the XML context.

29

Justification of Normal Forms

Problematic to evaluate XML normal forms.

• No XML update language has been standardized.

• No XML query language yet has the same “yardstick” status as relational algebra.

• We do not even know if implication of XML FDs is decidable!

We need a different approach.

• It must be based on some intrinsic characteristics of the data.

• It must be applicable to new data models.

• It must be independent of query/update/constraint issues.

Our approach is based on information theory.

30

Information Theory

Entropy measures the amount of information provided by a certain event.

Assume that an event can have n different outcomes with probabilities p1, …, pn.

Amount of information gained by knowing that event i occurred:

$\log \frac{1}{p_i}$

Average amount of information gained (entropy):

$\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$

Entropy is maximal if each p_i = 1/n:

$\log n$

31
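A short Python sketch of the entropy formula, assuming logarithms in base 2 (the numeric values quoted later, such as log 5 ≈ 2.322, match that base). The function name `entropy` is only illustrative.

```python
import math


def entropy(probs):
    """Average information in bits; outcomes with probability 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


uniform = entropy([1 / 7] * 7)   # maximal case: equals log2(7)
certain = entropy([1.0, 0.0])    # a single possible outcome carries no information
```

For a uniform distribution over n outcomes the result is log n, and for a distribution concentrated on one outcome it is 0.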

Entropy and Redundancies

Database schema: R(A,B,C), A → B

Instance I:

A   B   C
1   2   3
1   2   4

Position of the value 3 (attribute C, first tuple):

Pick a domain properly containing adom(I): {1, …, 6}

• Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4

• Entropy: log 5 ≈ 2.322

Position of the value 2 (attribute B, first tuple):

Pick a domain properly containing adom(I): {1, …, 6}

• Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2

• Entropy: log 1 = 0

32
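The two entropies above can be reproduced by brute force. The sketch assumes that a value is allowed at a position when the resulting instance satisfies A → B and still consists of distinct tuples; the second condition is inferred from P(4) = 0 on the slide and is an assumption. The function names are hypothetical.

```python
import math


def satisfies_fd(rows):
    """Check A -> B on rows of the form (A, B, C)."""
    seen = {}
    for a, b, _ in rows:
        if seen.setdefault(a, b) != b:
            return False
    return True


def position_entropy(rows, row, col, domain):
    """Entropy of the uniform distribution over the values allowed at (row, col)."""
    allowed = 0
    for value in domain:
        candidate = [list(r) for r in rows]
        candidate[row][col] = value
        tuples = [tuple(r) for r in candidate]
        # Assumed rule: the result must keep distinct tuples and satisfy the FD.
        if len(set(tuples)) == len(tuples) and satisfies_fd(tuples):
            allowed += 1
    return math.log2(allowed)


instance = [(1, 2, 3), (1, 2, 4)]
domain = range(1, 7)                                   # {1, ..., 6}
c_entropy = position_entropy(instance, 0, 2, domain)   # position of the value 3
b_entropy = position_entropy(instance, 0, 1, domain)   # position of the value 2
```

With these inputs `c_entropy` equals log 5 and `b_entropy` equals 0, matching the two cases on the slide.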

Entropy and Normal Forms

Let Σ be a set of FDs over a schema S.

Theorem (S,Σ) is in BCNF if and only if for every instance I of (S,Σ) and for every domain properly containing adom(I), each position carries non-zero amount of information (entropy > 0).

A similar result holds for 4NF and MVDs.

This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...

33

Problems with the Measure

The measure cannot distinguish between different types of data dependencies.

It cannot distinguish between different instances of the same schema:

R(A,B,C), A → B

A   B     C
1   2     3
1   2     4
1   [2]   5

entropy = 0

A   B     C
1   2     3
1   [2]   4

entropy = 0

([2] marks the position being measured.)

34

A General Measure

Instance I of schema R(A,B,C), A → B:

A   B   C
1   2   3
1   2   4

Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7; here p is the position of B in the first tuple.

Computation: for every X ⊆ Pos(I) – {p}, compute probability distribution P(a | X), a ∈ {1, …, k}.

Example: X keeps the value 3 in the first tuple and the values 1 and 2 in the second tuple (_ marks a position outside X):

A   B   C
_   p   3
1   2   _

P(2 | X) = 48/(48 + 6 · 42) = 0.16

For a ≠ 2, P(a | X) = 42/(48 + 6 · 42) = 0.14

Here 48 counts the ways of filling the positions outside X with values from {1, …, 7} that yield a valid instance satisfying A → B when p = 2, and 42 is the same count for each a ≠ 2.

Entropy ≈ 2.8057 (log 7 ≈ 2.8073)

Value: we consider the average over all sets X ⊆ Pos(I) – {p}.

• Average: 2.4558 < log 7 (maximal entropy)

• It corresponds to conditional entropy.

• It depends on the value of k ...

35
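The counts 48 and 42 for the example set X can be checked by enumeration. The sketch assumes that a completion is counted when the two tuples stay distinct and A → B holds; the distinctness rule is inferred from the count 48 = 49 − 1 and is an assumption, as are the function names and the (row, column) encoding of positions.

```python
import math
from itertools import product


def satisfies_fd(rows):
    """Check A -> B on rows of the form (A, B, C)."""
    seen = {}
    for a, b, _ in rows:
        if seen.setdefault(a, b) != b:
            return False
    return True


def distribution(known, p, k, n_rows=2, n_cols=3):
    """Counts and probabilities P(a | X) for a = 1..k.

    `known` maps the positions in X to their values, `p` is the measured position.
    """
    domain = range(1, k + 1)
    free = [(r, c) for r in range(n_rows) for c in range(n_cols)
            if (r, c) != p and (r, c) not in known]
    counts = []
    for a in domain:
        count = 0
        for fill in product(domain, repeat=len(free)):
            cells = dict(known)
            cells[p] = a
            cells.update(zip(free, fill))
            rows = [tuple(cells[(r, c)] for c in range(n_cols)) for r in range(n_rows)]
            # Assumed rule: count completions with distinct tuples that satisfy the FD.
            if len(set(rows)) == n_rows and satisfies_fd(rows):
                count += 1
        counts.append(count)
    total = sum(counts)
    return counts, [c / total for c in counts]


# X: C of the first tuple is 3, A and B of the second tuple are 1 and 2.
X = {(0, 2): 3, (1, 0): 1, (1, 1): 2}
counts, probs = distribution(X, p=(0, 1), k=7)
h = sum(q * math.log2(1 / q) for q in probs if q > 0)
```

The enumeration gives 48 completions for a = 2 and 42 for every other value, so the probabilities are 0.16 and 0.14 and `h` is approximately 2.8057.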

A General Measure

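Previous value: $\mathrm{Inf}_I^k(p \mid \Sigma)$

For each k, we consider the ratio: $\mathrm{Inf}_I^k(p \mid \Sigma) / \log k$

• How close the given position p is to having the maximum possible information content.

General measure:

$\mathrm{Inf}_I(p \mid \Sigma) = \lim_{k \to \infty} \frac{\mathrm{Inf}_I^k(p \mid \Sigma)}{\log k}$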

36

Basic Properties

The measure is well defined:

For every set Σ of first order constraints defined over a schema S, every I ∈ inst(S,Σ), and every p ∈ Pos(I): $\mathrm{Inf}_I(p \mid \Sigma)$ exists.

Bounds:

$0 \le \mathrm{Inf}_I(p \mid \Sigma) \le 1$

37

Basic Properties

The measure does not depend on a particular representation of constraints. If Σ1 and Σ2 are equivalent:

$\mathrm{Inf}_I(p \mid \Sigma_1) = \mathrm{Inf}_I(p \mid \Sigma_2)$

It overcomes the limitations of the simple measure: R(A,B,C), A → B

A   B     C
1   2     3
1   2     4
1   [2]   5

Value at the marked position: 0.781

A   B     C
1   2     3
1   [2]   4

Value at the marked position: 0.875

38

Well-Designed Databases

Definition A database specification (S,Σ) is well-designed if for every I ∈ inst(S,Σ) and every p ∈ Pos(I), $\mathrm{Inf}_I(p \mid \Sigma) = 1$.

In other words, every position in every instance carries the maximum possible amount of information.

We would like to test this definition in the relational world ...

39

Relational Databases

Σ is a set of data dependencies over a schema S:

Σ = ∅: (S,Σ) is well-designed.

Σ is a set of FDs: (S,Σ) is well-designed if and only if (S,Σ) is in BCNF.

Σ is a set of FDs and MVDs: (S,Σ) is well-designed if and only if (S,Σ) is in 4NF.

Σ is a set of FDs and JDs:

• If (S,Σ) is in PJ/NF or in 5NFR, then (S,Σ) is well-designed. The converse is not true.

• A syntactic characterization of being well-designed is given in [AL03].

40

Relational Databases

The problem of verifying whether a relational schema is well-designed is undecidable.

If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable.

Now we would like to apply our definition in the XML world ...

41

XML Databases

XML schema: (D,Σ).

• D is a DTD.

• Σ is a set of data dependencies over D.

We would like to evaluate XML normal forms.

The notion of being well-designed extends from relations to XML.

• The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).

42

Positions in an XML Tree

[Figure: the DBLP tree from the earlier example (conf, title, issue, article, author, @year), shown with its positions singled out: the values “ICDT”, “1999”, “2001”, “Dong”, “Jarke” and the elided strings “. . .”.]

43

Well-Designed XML Data

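We consider k such that adom(T) ⊆ {1, …, k}.

For each k: $\mathrm{Inf}_T^k(p \mid \Sigma)$

We consider the ratio: $\mathrm{Inf}_T^k(p \mid \Sigma) / \log k$

General measure:

$\mathrm{Inf}_T(p \mid \Sigma) = \lim_{k \to \infty} \frac{\mathrm{Inf}_T^k(p \mid \Sigma)}{\log k}$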

44

XNF: XML Normal Form

For arbitrary XML data dependencies:

Definition An XML specification (D,Σ) is well-designed if for every T ∈ inst(D,Σ) and every p ∈ Pos(T), $\mathrm{Inf}_T(p \mid \Sigma) = 1$.

For functional dependencies:

Theorem An XML specification (D,Σ) is in XNF if and only if (D,Σ) is well-designed.

45

Normalization Algorithms

The information-theoretic measure can also be used for reasoning about normalization algorithms.

For BCNF and XNF decomposition algorithms:

Theorem After each step of these decomposition algorithms, the amount of information in each position does not decrease.

46

Future Work

We would like to consider more complex XML constraints and characterize good designs they give rise to.

We would like to characterize 3NF by using the measure developed in this paper.

• In general, we would like to characterize “non-perfect” normal forms.

We would like to develop better characterizations of normalization algorithms using our measure.

• Why is the “usual” BCNF decomposition algorithm good? Why does it always stop?

47