A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science...

24
A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto

Transcript of A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science...

Page 1: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

A Normal Form for XML Documents

Marcelo Arenas Leonid Libkin

Department of Computer ScienceUniversity of Toronto

Page 2: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

2

Motivating Example

courses

coursecourse info

@cno @cnostudent student

@snoname gradegrade name@sno

student

name@sno. . .

“123” “123” “A+”“B+”

“cs100” “cs225”

“Fox”“Fox”

“123” “Fox”

Page 3: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

3

Our Goal

courses course*course @cno, student*student @sno, name, gradename Sgrade S

DTD: Integrity Constraints:

, info*

info @sno, name

two students with the same @sno value must have the same name.

@sno is the key of info.

Page 4: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

4

A “Non-relational” Example

DBLP

conf conf

title issueissue

article articlearticle

@yeartitle title @year

@year

“ICDT”

@year

author @yeartitleauthor“1999”

“1999”

“1999”“Dong” “2001”“Jarke”

“2001”

“. . .” “. . .” “. . .”

. . .

Page 5: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

5

Problems to Address

Functional dependencies for XML.

Normal form for XML documents (XNF). Generalizes BCNF.

Algorithm for normalizing XML documents. Implication problem for functional

dependencies.

Page 6: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

6

Framework: DTDs

We do not consider mixed content, IDs and IDREFs.

Paths(D): all paths in a DTD Dcourses.course courses.course.@cnocourses.course.student.namecourses.course.student.name.S

We distinguish three kinds of elements: attributes (@), strings (S) and element types.

Page 7: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

7

Framework: XML Trees

v1

v2

v3 v4

v5

v6 v7

v0

. . .

courses

coursecourse

@cno

“cs100”

@sno name grade @sno name grade

student student

“123” “456”

“Fox” “B+” “Smith” “A-”

S S S S

Page 8: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

8

Towards FDs for XML

We know how to handle FDs in relational DBs.

We need a relational representation of XML.

We do it by considering tree tuples: a tree tuple in a DTD D is a mapping

t : Paths(D) Vertices Strings {}

consistent with D: could be extracted from an XML tree conforming to D.

Page 9: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

9

Tree Tuples

v1

v2

v0

courses

course

@cno student

“cs100”

t(courses) = v0

t(courses.course) = v1

t(courses.course.@cno) = “cs100”t(courses.course.student) = v2

t(p) = , for the remaining paths

A tree tuple represents an XML tree:

Page 10: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

10

XML Tree: set of Tree Tuples

v1

v2

v3 v4

v5

v6 v7

v0

. . .

courses

coursecourse

@cno

“cs100”

@sno name grade @sno name grade

student student

“123” “456”

“Fox” “B+” “Smith” “A-”

S S S S

v1

v2

courses

course

@cno

“cs100”

student

v0

v3 v4

@sno name grade

“123”

“Fox” “B+”

S S

v5

v6 v7

@sno name grade

student

“456”

“Smith” “A-”

S S

. . .

course

Page 11: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

11

Functional Dependencies

Expressions of the form: X Y

defined over a DTD D, where X, Y are finitenon-empty subsets of Paths(D).

XML tree T can be tested for satisfaction of X Y

if:

X Y Paths(T) Paths(D)

T X Y if for every pair u, v of tree tuples in T:

u.X = v.X and u.X ≠ implies u.Y = v.Y

Page 12: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

12

FD: Examples

University DTD: courses course*course @cno, student*student @sno, name, grade

Two students with the same @sno value must have the same name:

courses.course.student.@sno courses.course.student.name.S

Every student can have at most one grade in every course:

{ courses.course, courses.course.student.@sno }

courses.course.student.grade.S

Page 13: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

13

Checking FD Satisfaction

v1

v2

v3 v4

v6

v7 v8

v0

courses

coursecourse

@cno

“cs100”@sno name grade @sno name grade

student

“123” “123”

“Fox” “B+” “Fox” “A+”

S S S S

v5

@cno

“cs225”

studentv1

v2

v3 v4

v0

courses

course

@cno

“cs100”@sno name grade

student

“123”

“Fox” “B+”

S S

v6

v7 v8

course

@sno name grade

“123”

“Fox” “A+”

S S

v5

@cno

“cs225”

student

{ courses.course, courses.course.student.@sno } courses.course.student.grade.S

Page 14: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

14

Checking FD Satisfaction

v1

v2

v3 v4

v5

v6 v7

v0

courses

course

@cno

“cs100”@sno name grade @sno name grade

student

“123” “123”

“Fox” “B+” “Fox” “A+”

S S S S

studentv1

v2

v3 v4

v0

courses

course

@cno

“cs100”@sno name grade

student

“123”

“Fox” “B+”

S S

v5

v6 v7

@sno name grade

“123”

“Fox” “A+”

S S

student

{ courses.course, courses.course.student.@sno } courses.course.student.grade.S

Page 15: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

15

Implication Problem for FD

Given a DTD D and a set of functional dependencies {}:

(D, ) if for any XML tree T conforming to D and satisfying , it is

the case that T

(D, )+ = { | (D, ) }

Functional dependency is trivial if it is implied by the DTD alone: (D, )

Page 16: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

16

XNF: XML Normal Form

XML specification: a DTD D and a set of functional dependencies .

A Relational DB is in BCNF if for every non-trivial functional dependency X Y in the specification, X is a key.

(D, ) is in XNF if:

For each non-trivial FD X p.@l or X p.S in (D, )+, X p is in (D, )+.

Page 17: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

17

Back to DBLP

DBLP is not in XNF:

DBLP.conf.issue DBLP.conf.issue.article.@year (D,)+

DBLP.conf.issue DBLP.conf.issue.article

(D,)+

Proposed solution is in XNF.

Page 18: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

18

Normalization Algorithm

The algorithm applies two transformations until

the schema is in XNF.

If there is an anomalous FD of the form:

DBLP.conf.issue DBLP.conf.issue.article.@year

then apply the “DBLP example rule”.

Otherwise: choose a minimal anomalous FD and apply the “University example rule”.

Page 19: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

19

Normalizing XML Documents

Theorem The decomposition algorithm terminates and outputs a specification in XNF.

It does not lose information:

Unnormalized NormalizedXML document XML Document

Q1, Q2 are XQuery core queries.

It works even if we cannot compute (D,)+. If we know (D,)+, the output is better.

Q1

Q2

Page 20: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

20

Implication Problem (cont’d)

Typically, regular expressions used in DTDs are rather simple.

Trivial regular expression: s1, s2, …, sn

Each si is either one of ai, ai?, ai+ or ai*

For i ≠ j, ai ≠ aj

D is a simple DTD if all its productions use “permutations” of trivial regular expressions.Example: (a | b)* is a permutation of a*, b*

Page 21: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

21

Implication Problem (cont’d)

Theorem For simple DTDs: The implication problem for FDs is solvable

in O(n2). Testing if a specification is in XNF can be

done in O(n3).

Other results: There is a larger class of DTDs for which

these problems are tractable. There is an even larger class of DTDs for

which they are coNP-complete.

Page 22: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

22

Final Comments

The normalization algorithm can be improved in various ways.

Why XNF? Theorem XNF generalizes BCNF and a normal form

for nested relations (called NNF) when those are coded as XML documents.

A complete classification of the complexity of the implication problem for FDs remains open.

Implementation.

Page 23: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

Backup Slides

Page 24: A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto.

24

Normalization Algorithm

We consider FDs of the form:

{q, p1.@l1, …, pn.@ln } p

where n 0 and q ends with an element type.

The algorithm applies two transformations until the schema is in XNF. If there is an anomalous FD with n = 0:

DBLP.conf.issue DBLP.conf.issue.article.@year

then apply the “DBLP example rule”.

Otherwise: choose a minimal anomalous FD and apply the “University example rule”.