Well-designed XML Data

Marcelo Arenas and Leonid Libkin

University of Toronto

Outline

Part 1 - Database Normalization from the 1970s and 1980s.

Part 2 - Classical theory revisited: normalizing XML documents.

Part 3 - Classical theory re-done: new justifications for normalization.

Part 1: Classical Normalization

Design: decide how to represent the information in a particular data model.

• Even for simple application domains there is a large number of ways of representing the data of interest.

We have to design the schema of the database.

• Set of relations.

• Set of attributes for each relation.

• Set of data dependencies.

Designing a Database: An Example

Attributes: number, title, section, room.

Data dependency: every course number is associated with only one title.

Relational schema:

BAD alternative:

R(number, title, section, room), number → title

GOOD alternative:

S(number, title), number → title

T(number, section, room)

Problems with BAD: Update Anomaly

number  title                  section  room
CSC258  Computer Organization  1        LP266
CSC258  Computer Organization  2        GB258
CSC258  Computer Organization  3        GB248
CSC434  Database Systems       1        GB248

Title of CSC258 is changed to Computer Organization I.

number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
CSC434  Database Systems         1        GB248

The instance stores redundant information.

Deletion Anomaly

number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
CSC434  Database Systems         1        GB248

CSC434 is not given in this term.

number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248

Additional effect: all the information about CSC434 was deleted.

Insertion Anomaly

number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248

A new course is created: (CSC336, Numerical Methods)

number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
CSC336  Numerical Methods        ?        ?

The instance stores attributes that are not directly related.

Avoiding Update Anomalies

S:

number  title
CSC258  Computer Organization
CSC434  Database Systems

T:

number  section  room
CSC258  1        LP266
CSC258  2        GB258
CSC258  3        GB248
CSC434  1        GB248

Title of CSC258 is changed to Computer Organization I.

• The instance does not store redundant information.

CSC434 is not given in this term.

• The title of CSC434 is not removed from the instance.

A new course is created: (CSC336, Numerical Methods)

• No information about sections has to be provided.

• Each relation stores attributes that are directly related.

Resulting instance:

S:

number  title
CSC258  Computer Organization I
CSC336  Numerical Methods
CSC434  Database Systems

T:

number  section  room
CSC258  1        LP266
CSC258  2        GB258
CSC258  3        GB248
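The three operations above can be replayed on a small in-memory copy of the example. This is only a sketch: the tuples come from the slides, while the helper names `decompose` and `join` and the use of Python sets are assumptions made for illustration.

```python
# Hypothetical in-memory copy of the BAD relation R(number, title, section, room).
R = [
    ("CSC258", "Computer Organization", 1, "LP266"),
    ("CSC258", "Computer Organization", 2, "GB258"),
    ("CSC258", "Computer Organization", 3, "GB248"),
    ("CSC434", "Database Systems", 1, "GB248"),
]


def decompose(r):
    """Project R onto S(number, title) and T(number, section, room)."""
    s = {(number, title) for number, title, _, _ in r}
    t = {(number, section, room) for number, _, section, room in r}
    return s, t


def join(s, t):
    """Natural join of S and T on number (recovers R because number -> title)."""
    titles = dict(s)
    return {(n, titles[n], section, room) for n, section, room in t}


S, T = decompose(R)

# Update: renaming CSC258 changes exactly one tuple of S.
S = {(n, "Computer Organization I" if n == "CSC258" else title) for n, title in S}

# Deletion: dropping every section of CSC434 leaves its title in S.
T = {row for row in T if row[0] != "CSC434"}

# Insertion: a new course needs no section or room.
S.add(("CSC336", "Numerical Methods"))
```

Joining the freshly decomposed relations gives back the original tuples, which is the lossless property one would want from the GOOD alternative.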

Normalization Theory

Main idea: a normal form defines a condition that a well-designed database should satisfy.

Normal form: syntactic condition on the database schema.

• Defined for a class of data dependencies.

Main problems:

• How to test whether a database schema is in a particular normal form.

• How to transform a database schema into an equivalent one satisfying a particular normal form.
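For functional dependencies and BCNF, the first problem can be sketched with the textbook attribute-closure test. The code below is an illustration, not part of the talk: the function names are hypothetical, and FDs are assumed to be given as pairs of frozensets.

```python
def closure(attrs, fds):
    """Closure of a set of attributes under FDs given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_bcnf(attributes, fds):
    """True if the left-hand side of every non-trivial given FD is a superkey."""
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial FD
        if closure(lhs, fds) != set(attributes):
            return False
    return True


NUMBER_TITLE = (frozenset({"number"}), frozenset({"title"}))

R = {"number", "title", "section", "room"}
S = {"number", "title"}
T = {"number", "section", "room"}

print(is_bcnf(R, [NUMBER_TITLE]))  # False: number is not a key of R
print(is_bcnf(S, [NUMBER_TITLE]))  # True
print(is_bcnf(T, []))              # True
```

Checking only the given FDs is enough for a single relation schema; the subschemas S and T are tested here with the FDs that apply to them.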

Normalization Theory Today

Normalization theory for relational databases was developed in the 1970s and 1980s.

Why do we need normalization theory today?

• New data models have emerged: XML.

• XML documents can contain redundant information.

Redundant information in XML documents:

• Can be discovered if the user provides semantic information.

• Can be eliminated.

Part 2: XML and Normalization

XML Document:

  <courses>
    <course cno="CSC258">
      <taken_by>
        <student sno="st1">
          <name> Fox </name>
          <grade> B+ </grade>
        </student>
      </taken_by>
    </course>
    <course cno="CSC434">
      . . .
    </course>
  </courses>

DTD:

  courses ⇒ course*
  course ⇒ @cno
  course ⇒ taken_by
  taken_by ⇒ student*
  student ⇒ @sno
  student ⇒ name, grade
  name ⇒ #PCDATA
  grade ⇒ #PCDATA

XML Databases

XML Schema: (D, Σ)

D:

  courses ⇒ course*
  course ⇒ @cno
  course ⇒ taken_by
  taken_by ⇒ student*
  student ⇒ @sno
  student ⇒ name, grade

Σ:

  Two students with the same @sno value must have the same name.

Redundancy in XML

courses
  course
    @cno = “CSC258”
    taken_by
      student
        @sno = “st1”
        name = “Fox”
        grade = “B+”
      . . .
  course
    @cno = “CSC434”
    taken_by
      student
        @sno = “st1”
        name = “Fox”
        grade = “A+”
      . . .
  info
    @sno = “st1”
    name = “Fox”

XML Database Normalization

Data dependency:

Two students with the same @sno value must have the same name.

DTD:

  courses ⇒ course*
  course ⇒ @cno
  course ⇒ taken_by
  taken_by ⇒ student*
  student ⇒ @sno
  student ⇒ name, grade

DTD after normalization:

  courses ⇒ course*, info*
  course ⇒ @cno
  course ⇒ taken_by
  taken_by ⇒ student*
  student ⇒ @sno
  student ⇒ grade
  info ⇒ @sno
  info ⇒ name

@sno is the identifier of info elements.
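A rough sketch of how a document could be restructured to fit the DTD with info elements, using Python's standard `xml.etree.ElementTree`. The function name `normalize` and the sample document are assumptions for illustration; the sketch also assumes the input already satisfies the dependency (one name per @sno).

```python
import xml.etree.ElementTree as ET

# Illustrative input: student st1 appears under two courses with the same name.
DOC = """
<courses>
  <course cno="CSC258">
    <taken_by>
      <student sno="st1"><name>Fox</name><grade>B+</grade></student>
    </taken_by>
  </course>
  <course cno="CSC434">
    <taken_by>
      <student sno="st1"><name>Fox</name><grade>A+</grade></student>
    </taken_by>
  </course>
</courses>
"""


def normalize(root):
    """Move every student name into one info element per @sno value."""
    names = {}
    # Materialize the list first so that removing children is safe.
    for student in list(root.iter("student")):
        name = student.find("name")
        if name is not None:
            # Assumes the FD holds, so the first name seen for an @sno is the name.
            names.setdefault(student.get("sno"), name.text)
            student.remove(name)
    for sno, text in names.items():
        info = ET.SubElement(root, "info", {"sno": sno})
        ET.SubElement(info, "name").text = text
    return root


root = normalize(ET.fromstring(DOC))
print(ET.tostring(root, encoding="unicode"))
```

After the transformation the name of st1 is stored once, while each grade stays with its course.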

A “Non-relational” Example

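[Figure: XML tree for DBLP. The root DBLP has conf children. The conf with title “ICDT” has two issue children: the first contains two article elements, both with @year “1999”; the second contains an article with @year “2001”. Articles also have title and author children (authors shown: “Dong”, “Jarke”). The figure also shows @year attached to each issue, with values “1999” and “2001”.]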

XNF: XML Normal Form

It eliminates two types of anomalies.

It was defined for XML functional dependencies:

DBLP.conf.@title → DBLP.conf

DBLP.conf.issue → DBLP.conf.issue.article.@year

Problems to Address

Functional dependencies for XML.

Normal form for XML documents (XNF).

• Generalizes BCNF.

Algorithm for normalizing XML documents.

• Implication problem for functional dependencies.

Framework: Paths in DTDs

Paths(D): all paths in a DTD D, for example:

  courses.course
  courses.course.@cno
  courses.course.student.name
  courses.course.student.name.S

EPaths(D): all paths in a DTD D that end with an element, for example, courses.course.

We distinguish three kinds of elements: attributes (@), strings (S) and element types.

FDs are defined by means of a relational representation of XML documents.
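Paths(D) and EPaths(D) can be enumerated mechanically for a non-recursive DTD. The sketch below assumes a hypothetical dictionary encoding of the university DTD (attributes, child elements, and whether the element has string content); neither the encoding nor the function names come from the talk.

```python
# Hypothetical encoding of the university DTD.
DTD = {
    "courses": {"attrs": [], "children": ["course"], "text": False},
    "course": {"attrs": ["cno"], "children": ["student"], "text": False},
    "student": {"attrs": ["sno"], "children": ["name", "grade"], "text": False},
    "name": {"attrs": [], "children": [], "text": True},
    "grade": {"attrs": [], "children": [], "text": True},
}


def paths(dtd, root):
    """All paths of a non-recursive DTD, written with '.' as separator."""
    result = []

    def walk(element, prefix):
        path = f"{prefix}.{element}" if prefix else element
        result.append(path)
        rule = dtd[element]
        for attr in rule["attrs"]:
            result.append(f"{path}.@{attr}")
        if rule["text"]:
            result.append(f"{path}.S")
        for child in rule["children"]:
            walk(child, path)

    walk(root, "")
    return result


def epaths(all_paths):
    """Paths that end with an element type (not an attribute or string)."""
    selected = []
    for p in all_paths:
        last = p.rsplit(".", 1)[-1]
        if last != "S" and not last.startswith("@"):
            selected.append(p)
    return selected


P = paths(DTD, "courses")
E = epaths(P)
```

The recursion terminates only because this DTD is non-recursive; a recursive DTD has infinitely many paths and would need a different treatment.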

Framework: XML Trees

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v2)
      @sno = “123”
      name (v3), S = “Fox”
      grade (v4), S = “B+”
    student (v5)
      @sno = “456”
      name (v6), S = “Smith”
      grade (v7), S = “A-”
  course
    . . .

Tree Tuples

Relational representation: tree tuples, which are mappings

t : Paths(D) → Vertices ∪ Strings ∪ {⊥}

A tree tuple represents an XML tree:

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v2)

t(courses) = v0
t(courses.course) = v1
t(courses.course.@cno) = “cs100”
t(courses.course.student) = v2
t(p) = ⊥, for the remaining paths

We consider tuples containing a minimal amount of ⊥ values.

XML Tree: set of Tree Tuples

XML tree:

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v2)
      @sno = “123”
      name (v3), S = “Fox”
      grade (v4), S = “B+”
    student (v5)
      @sno = “456”
      name (v6), S = “Smith”
      grade (v7), S = “A-”
  course
    . . .

Tree tuples:

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v2)
      @sno = “123”
      name (v3), S = “Fox”
      grade (v4), S = “B+”

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v5)
      @sno = “456”
      name (v6), S = “Smith”
      grade (v7), S = “A-”

. . .
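One way to compute the tree tuples of a small XML tree is to pick, at every node, one child per element type and combine the choices. The sketch below is an illustration under assumptions: the `Node` class and `tree_tuples` function are hypothetical, and a path missing from a dictionary stands for ⊥.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Optional


@dataclass
class Node:
    vertex: str                      # vertex identifier, e.g. "v1"
    label: str                       # element type
    attrs: dict = field(default_factory=dict)
    text: Optional[str] = None       # string content (the S child), if any
    children: list = field(default_factory=list)


def tree_tuples(node, prefix=""):
    """Tree tuples of the subtree at node, as dicts from paths to values."""
    path = f"{prefix}.{node.label}" if prefix else node.label
    base = {path: node.vertex}
    for name, value in node.attrs.items():
        base[f"{path}.@{name}"] = value
    if node.text is not None:
        base[f"{path}.S"] = node.text
    # Group children by element type; a tuple keeps one child per type.
    groups = {}
    for child in node.children:
        groups.setdefault(child.label, []).append(child)
    options = [
        [t for child in same_label for t in tree_tuples(child, path)]
        for same_label in groups.values()
    ]
    result = []
    for combination in product(*options):
        merged = dict(base)
        for part in combination:
            merged.update(part)
        result.append(merged)
    return result


tree = Node("v0", "courses", children=[
    Node("v1", "course", {"cno": "cs100"}, children=[
        Node("v2", "student", {"sno": "123"}, children=[
            Node("v3", "name", text="Fox"),
            Node("v4", "grade", text="B+"),
        ]),
        Node("v5", "student", {"sno": "456"}, children=[
            Node("v6", "name", text="Smith"),
            Node("v7", "grade", text="A-"),
        ]),
    ]),
])

tuples = tree_tuples(tree)
```

For the cs100 course this yields two tuples, one per student, both sharing the values for courses and courses.course.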

Functional Dependencies for XML

Expressions of the form: X → Y

defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D).

XML tree T can be tested for satisfaction of X → Y if:

X ∪ Y ⊆ Paths(T) ⊆ Paths(D)

T |= X → Y if for every pair u, v of tree tuples in T:

u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y

FD: Examples

University DTD:

  courses ⇒ course*
  course ⇒ @cno, student*
  student ⇒ @sno, name, grade

Two students with the same @sno value must have the same name:

courses.course.student.@sno → courses.course.student.name.S

Every student can have at most one grade in every course:

{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S
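Given the tree tuples of a document, satisfaction of an FD can be checked pairwise, following the definition of T |= X → Y. The sketch below is illustrative only: tree tuples are assumed to be dictionaries from paths to values, a missing path stands for ⊥, "u.X ≠ ⊥" is read as "no path in X is ⊥", and the helper `student_tuple` and the vertex names are hypothetical.

```python
BOTTOM = None  # stands for ⊥


def satisfies(tuples, lhs, rhs):
    """Tuples that agree on lhs (with no ⊥ among the lhs values) must agree on rhs."""
    for i, u in enumerate(tuples):
        for v in tuples[i + 1:]:
            ux = [u.get(p, BOTTOM) for p in lhs]
            vx = [v.get(p, BOTTOM) for p in lhs]
            if BOTTOM in ux or ux != vx:
                continue
            uy = [u.get(p, BOTTOM) for p in rhs]
            vy = [v.get(p, BOTTOM) for p in rhs]
            if uy != vy:
                return False
    return True


def student_tuple(course, student, grade):
    """Hypothetical tree tuple for student 123 (Fox) in the given course vertex."""
    return {
        "courses": "v0",
        "courses.course": course,
        "courses.course.student": student,
        "courses.course.student.@sno": "123",
        "courses.course.student.name.S": "Fox",
        "courses.course.student.grade.S": grade,
    }


NAME_FD = (["courses.course.student.@sno"], ["courses.course.student.name.S"])
GRADE_FD = (
    ["courses.course", "courses.course.student.@sno"],
    ["courses.course.student.grade.S"],
)

# The same student graded in two different courses, and twice in one course.
two_courses = [student_tuple("v1", "v2", "B+"), student_tuple("v5", "v6", "A+")]
one_course = [student_tuple("v1", "v2", "B+"), student_tuple("v1", "v5", "A+")]

print(satisfies(two_courses, *GRADE_FD))  # True: the tuples differ on courses.course
print(satisfies(one_course, *GRADE_FD))   # False: same course and @sno, two grades
print(satisfies(one_course, *NAME_FD))    # True: the name is "Fox" in both tuples
```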

Implication Problem for FD

Given a DTD D and a set of functional dependencies Σ ∪ {ϕ}:

(D, Σ) |- ϕ ((D, Σ) implies ϕ) if, for every XML tree T conforming to D and satisfying Σ, it is the case that T |= ϕ.

(D, Σ)+ = { ϕ | (D, Σ) |- ϕ }

Functional dependency ϕ is trivial if it is implied by the DTD alone.

Checking FD Satisfaction

XML tree:

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v2)
      @sno = “123”
      name (v3), S = “Fox”
      grade (v4), S = “B+”
  course (v5)
    @cno = “cs225”
    student (v6)
      @sno = “123”
      name (v7), S = “Fox”
      grade (v8), S = “A+”

Tree tuples:

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v2)
      @sno = “123”
      name (v3), S = “Fox”
      grade (v4), S = “B+”

courses (v0)
  course (v5)
    @cno = “cs225”
    student (v6)
      @sno = “123”
      name (v7), S = “Fox”
      grade (v8), S = “A+”

{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S

The two tree tuples differ on courses.course (v1 and v5), so the FD is satisfied.

Checking FD Satisfaction

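XML tree:

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v2)
      @sno = “123”
      name (v3), S = “Fox”
      grade (v4), S = “B+”
    student (v5)
      @sno = “123”
      name (v6), S = “Fox”
      grade (v7), S = “A+”

Tree tuples:

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v2)
      @sno = “123”
      name (v3), S = “Fox”
      grade (v4), S = “B+”

courses (v0)
  course (v1)
    @cno = “cs100”
    student (v5)
      @sno = “123”
      name (v6), S = “Fox”
      grade (v7), S = “A+”

{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S

The two tree tuples agree on courses.course (v1) and on @sno (“123”) but have different grades, so the FD is violated.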

XNF: XML Normal Form

XML specification: a DTD D and a set of functional dependencies Σ.

A relational database is in BCNF if, for every non-trivial functional dependency X → Y in the specification, X is a key.

(D, Σ) is in XNF if:

For each non-trivial FD X → p.@l or X → p.S in (D, Σ)+, X → p is in (D, Σ)+.

Back to DBLP

DBLP is not in XNF:

DBLP.conf.issue → DBLP.conf.issue.article.@year ∈ (D,Σ)+

DBLP.conf.issue → DBLP.conf.issue.article ∉ (D,Σ)+

Proposed solution is in XNF.

Normalization Algorithm

The algorithm applies two transformations until the schema is in XNF.

If there is an anomalous FD of the form:

DBLP.conf.issue → DBLP.conf.issue.article.@year

then apply the “DBLP example rule”.

Otherwise: choose a minimal anomalous FD and apply the “University example rule”.

Normalizing XML Documents

Remember:

DBLP.conf.issue[q] → DBLP.conf.issue.article.[p]@year

Reasoning About FDs

Part 3: What was Missing? Justification!

What is a good database design?

• Well-known solutions: BCNF, 4NF, …

But what is it that makes a database design good?

• Elimination of update anomalies.

• Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.

Previous work was specific for the relational model.

• Classical problems have to be revisited in the XML context.

Justification of Normal Forms

Problematic to evaluate XML normal forms.

• No XML update language has been standardized.

• No XML query language yet has the same “yardstick” status as relational algebra.

• We do not even know if implication of XML FDs is decidable!

We need a different approach.

• It must be based on some intrinsic characteristics of the data.

• It must be applicable to new data models.

• It must be independent of query/update/constraint issues.

Our approach is based on information theory.

Information Theory

Entropy measures the amount of information provided by a certain event.

Assume that an event can have n different outcomes with probabilities $p_1, \dots, p_n$.

Amount of information gained by knowing that event i occurred: $\log \frac{1}{p_i}$

Average amount of information gained (entropy): $\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$

Entropy is maximal if each $p_i = 1/n$: $\log n$
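The formulas can be evaluated directly. The sketch below assumes base-2 logarithms, which matches the value log 5 ≈ 2.322 used later in the talk; the function names are chosen for illustration.

```python
import math


def information(p):
    """Information gained when an outcome of probability p occurs: log(1/p)."""
    return math.log2(1 / p)


def entropy(probabilities):
    """Average information: sum of p_i * log(1/p_i), skipping zero probabilities."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


uniform = [1 / 5] * 5
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]

print(entropy(uniform))  # equals log2(5), the maximum for n = 5
print(entropy(skewed))   # strictly smaller than log2(5)
print(entropy([1.0]))    # a certain outcome carries no information
```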

Entropy and Redundancies

Database schema: R(A,B,C), A → B

Instance I:

A  B  C
1  2  3
1  2  4

Position of the value 3 (attribute C, first tuple). Pick a domain properly containing adom(I): {1, …, 6}

• Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4

• Entropy: log 5 ≈ 2.322

Position of the value 2 (attribute B, first tuple). Pick a domain properly containing adom(I): {1, …, 6}

• Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2

• Entropy: log 1 = 0
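The two distributions above can be reproduced with a simplified computation: a value is admissible for a position if substituting it keeps the instance a set of distinct tuples that satisfies the FDs, and the admissible values are taken as equally likely. This is a sketch of the example only, not the general measure; the function names and the encoding of FDs as pairs of column indexes are assumptions.

```python
import math


def satisfies_fd(rows, lhs, rhs):
    """Check one FD, given as tuples of column indexes, on a list of rows."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        value = tuple(row[i] for i in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


def position_entropy(rows, row, col, domain, fds):
    """Entropy of the uniform distribution over admissible values of one position."""
    admissible = 0
    for a in domain:
        candidate = [list(r) for r in rows]
        candidate[row][col] = a
        as_tuples = [tuple(r) for r in candidate]
        if len(set(as_tuples)) < len(as_tuples):
            continue  # the value would duplicate an existing tuple
        if all(satisfies_fd(as_tuples, lhs, rhs) for lhs, rhs in fds):
            admissible += 1
    # The original value is always admissible when it belongs to the domain.
    return math.log2(admissible)


I = [(1, 2, 3), (1, 2, 4)]   # columns A, B, C
A_TO_B = ((0,), (1,))        # the FD A -> B
DOMAIN = range(1, 7)         # {1, ..., 6}

c_entropy = position_entropy(I, 0, 2, DOMAIN, [A_TO_B])  # position of the value 3
b_entropy = position_entropy(I, 0, 1, DOMAIN, [A_TO_B])  # position of the value 2
```

The C position admits five values (all but 4), giving log 5, while the B position is forced to 2 by A → B and carries no information.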

Entropy and Normal Forms

Let Σ  be a set of FDs over a schema S.

Theorem: (S,Σ) is in BCNF if and only if, for every instance I of (S,Σ) and every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0).

A similar result holds for 4NF and MVDs.

This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...

Problems with the Measure

The measure cannot distinguish between different types of data dependencies.

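It cannot distinguish between different instances of the same schema:

R(A,B,C), A → B

A  B    C          A  B    C
1  2    3          1  2    3
1  [2]  4          1  2    4
                   1  [2]  5

entropy = 0        entropy = 0

The bracketed position is the one being measured; its value is forced by A → B in both instances.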