Management of XML and Semistructured Data

38
Management of XML and Semistructured Data Lecture 7: XML-QL, Structural Recursion Monday, April 23, 2001

description

Management of XML and Semistructured Data. Lecture 7: XML-QL, Structural Recursion Monday, April 23, 2001. XML-QL. First declarative language for XML How to obtain a query language for XML fast ? Assume OEM as data model Use features from UnQL and StruQL Patterns Templates - PowerPoint PPT Presentation

Transcript of Management of XML and Semistructured Data

Page 1: Management of XML and Semistructured Data

Management of XML and Semistructured Data

Lecture 7: XML-QL, Structural Recursion

Monday, April 23, 2001

Page 2: Management of XML and Semistructured Data

XML-QL

• First declarative language for XML• How to obtain a query language for XML fast ?

– Assume OEM as data model

– Use features from UnQL and StruQL• Patterns

• Templates

• Skolem functions

– Design XML-like syntax

Page 3: Management of XML and Semistructured Data

Patterns in XML-QL

WHERE <book language=“french”> <publisher> <name> Morgan Kaufmann </> </> <author> $A </> </book> in “www.a.b.c/bib.xml”CONSTRUCT <author> $A </>

WHERE <book language=“french”> <publisher> <name> Morgan Kaufmann </> </> <author> $A </> </book> in “www.a.b.c/bib.xml”CONSTRUCT <author> $A </>

Find all authors who published in Morgan Kaufmann:

Abbreviation: </> closes any tag.

Page 4: Management of XML and Semistructured Data

Patterns in XML-QL

where <book language=$X> <author> $A </author>

</book> in “www.a.b.c/bib.xml” <book>

<author> $A </author> <author> Jones </author>

</book> in “www.a.b.c/bib.xml”construct <result> $X </>

where <book language=$X> <author> $A </author>

</book> in “www.a.b.c/bib.xml” <book>

<author> $A </author> <author> Jones </author>

</book> in “www.a.b.c/bib.xml”construct <result> $X </>

Find all languages in which Jones’ coauthors have published:

There is a join here…

Page 5: Management of XML and Semistructured Data

Constructors in XML-QL

where <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml”construct <result> <author> $A </> <lang> $L </> </>

where <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml”construct <result> <author> $A </> <lang> $L </> </>

Result is:<result> <author>Smith</author> <lang>English </lang> </result><result> <author>Smith</author> <lang>Mandarin</lang> </result><result> <author>Doe </author> <lang>English </lang> </result>. . . .

Find all authors and the languages in which they published:

Page 6: Management of XML and Semistructured Data

Nested Queries in XML-QL

WHERE <book.author> $A </> in “www.a.b.c/bib.xml”CONSTRUCT <result> <author> $A </> WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml” CONSTRUCT <lang> $L </> </>

WHERE <book.author> $A </> in “www.a.b.c/bib.xml”CONSTRUCT <result> <author> $A </> WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml” CONSTRUCT <lang> $L </> </>

Find all authors and the languages in which they published; group by authors:

Note: book.author is a (regular) path expression

Page 7: Management of XML and Semistructured Data

<result> <author>Smith</author> <lang>English</lang> <lang>Mandarin</lang> <lang>…</lang> …</result><result> <author>Doe</author> <lang>English</lang> …</result>

Result is:

Page 8: Management of XML and Semistructured Data

Skolem Functions in XML-QL

WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result id=F($A)> <author> $A</> <lang> $L </> </>

WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result id=F($A)> <author> $A</> <lang> $L </> </>

Same query, with Skolem functions

Assumptions:• the ID attribute is always id• default Skolem function for author is G($A), for lang is H($A, $L) (why ?)

Page 9: Management of XML and Semistructured Data

Skolem Functions in XML-QL

{ WHERE <book> <author> $A </> <title> $T </> </> in “www.a.b.c/bib.xml” CONSTRUCT <person id=F($A)> <name id=G($A)> $A </> <booktitle> $T</> /* implicit Skolem function H($A, $T) */ </>}{ WHERE <paper> <author> $A </> <title> $T </> <journal> $J </> </> in “www.d.e.f/papers.xml” CONSTRUCT <person id=F($A)> <name id=G($A)> $A </> <papertitle> $T</> /* implicit Skolem function J($A, $T) */ <journaltitle> $J</> /* implicit Skolem function K($A, $T) */ </>}

{ WHERE <book> <author> $A </> <title> $T </> </> in “www.a.b.c/bib.xml” CONSTRUCT <person id=F($A)> <name id=G($A)> $A </> <booktitle> $T</> /* implicit Skolem function H($A, $T) */ </>}{ WHERE <paper> <author> $A </> <title> $T </> <journal> $J </> </> in “www.d.e.f/papers.xml” CONSTRUCT <person id=F($A)> <name id=G($A)> $A </> <papertitle> $T</> /* implicit Skolem function J($A, $T) */ <journaltitle> $J</> /* implicit Skolem function K($A, $T) */ </>}

Object fusion with Skolem functions and block structure- Compile a complete list of authors, from two sources

Page 10: Management of XML and Semistructured Data

Result:

(some have only books,Others only papers,Others have both)

<person> <name>Smith</name> <booktitle>Book1</booktitle > <booktitle>Book2</booktitle > </result><person> <name>Jones</name> <booktitle>Book3</booktitle > <papertitle>paper1</papertitle > <journaltitle>journal1</journaltitle > </result><person> <name>Mark</name> <papertitle>paper2</papertitle > <journaltitle>journal3</journaltitle > </result>…

Page 11: Management of XML and Semistructured Data

Skolem Functions in XML-QL

WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result id=F($A)> <author id=G($A)> $A</> <lang id=H($A)> $L </> </>

WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result id=F($A)> <author id=G($A)> $A</> <lang id=H($A)> $L </> </>

“Wrong” query number 1:

What is “wrong” here ?

Page 12: Management of XML and Semistructured Data

Skolem Functions in XML-QL

WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result id=F($A,$L)> <author id=G($A)> $A</> <lang id=H($A,$L)> $L </> </>

WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result id=F($A,$L)> <author id=G($A)> $A</> <lang id=H($A,$L)> $L </> </>

“Wrong” query number 2:

What is “wrong” here ?

Page 13: Management of XML and Semistructured Data

Skolem Functions in XML-QL

{ WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml” CONSTRUCT <author id=F($A)> <lang id=H($A,$L)> $L </> </> }{ WHERE <person> <city> $C </> <fluent-in> $X </> </> in “www.a.b.c/bib.xml” CONSTRUCT <location id=G($C)> <lang id=H($C,$L)> $L </> </> }

{ WHERE <book language = $L> <author> $A </> </> in “www.a.b.c/bib.xml” CONSTRUCT <author id=F($A)> <lang id=H($A,$L)> $L </> </> }{ WHERE <person> <city> $C </> <fluent-in> $X </> </> in “www.a.b.c/bib.xml” CONSTRUCT <location id=G($C)> <lang id=H($C,$L)> $L </> </> }

“Wrong” query number 3:

Page 14: Management of XML and Semistructured Data

Three Rules to Construct Only Trees

Rule 1: nested elements must have Skolem functions that are… [how ??]

Rule 2: an element that has an atomic content must have a Skolem function that is… [how ??]

Rule 3: if a Skolem function occurs in two different places than the following condition must hold… [which ??]

CONSTRUCT <tag1 id=F([args1])> <tag2 id=G([args2])> …</> </>

CONSTRUCT <tag1 id=F([args1])> <tag2 id=G([args2])> …</> </>

CONSTRUCT … <tag id=F([args])> $X </> …CONSTRUCT … <tag id=F([args])> $X </> …

{ CONSTRUCT <tag1 id=G([args1])> <tag id=F([args])> …</> </> }{ CONSTRUCT <tag1 id=H([args2])> <tag id=F([args])> …</> </> }

{ CONSTRUCT <tag1 id=G([args1])> <tag id=F([args])> …</> </> }{ CONSTRUCT <tag1 id=H([args2])> <tag id=F([args])> …</> </> }

Page 15: Management of XML and Semistructured Data

XML-QL v.s. XQuery

• Xquery (=Quilt) v.s. XML-QL+ faithful XML data model

+ Xpath sublanguage

+ aggregate functions (like in SQL)

+ some features from XQL- Patterns- Skolem functions

Page 16: Management of XML and Semistructured Data

A Different Paradigm:Structural Recursion

Data as sets with a union operator:

{a:3, a:{b:”one”, c:5}, b:4} =

{a:3} U {a:{b:”one”,c:5}} U {b:4}

Page 17: Management of XML and Semistructured Data

Structural Recursion

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = f($T)f({}) = {}f($V) = if isInt($V) then {result: $V} else {}

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = f($T)f({}) = {}f($V) = if isInt($V) then {result: $V} else {}

Example: retrieve all integers in the data

a a b

b c3

“one” 5

4result result result

3 5 4

Page 18: Management of XML and Semistructured Data

Structural Recursion

What does this do ?

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L=a then {b:f($T)} else {$L:f($T)}f({}) = {}f($V) = $V

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L=a then {b:f($T)} else {$L:f($T)}f({}) = {}f($V) = $V

Page 19: Management of XML and Semistructured Data

Structural Recursion

What does this do ?

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = {$L:{$L:f($T)}}f({}) = {}f($V) = $V

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = {$L:{$L:f($T)}}f({}) = {}f($V) = $V

Input = tree with n nodesOutput = ???

Page 20: Management of XML and Semistructured Data

Structural Recursion

Example: increase all engine prices by 10%

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L= engine then {$L: g($T)} else {$L: f($T)}f({}) = {}f($V) = $V

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L= engine then {$L: g($T)} else {$L: f($T)}f({}) = {}f($V) = $V

g($T1 U $T2) = g($T1) U g($T2)g({$L: $T}) = if $L= price then {$L:1.1*$T} else {$L: g($T)}g({}) = {}g($V) = $V

g($T1 U $T2) = g($T1) U g($T2)g({$L: $T}) = if $L= price then {$L:1.1*$T} else {$L: g($T)}g({}) = {}g($V) = $V

engine body

part price

price price

part price

100

1000

100

1000

engine body

part price

price price

part price

110

1100

100

1000

Page 21: Management of XML and Semistructured Data

Structural Recursion

Retrieve all subtrees reachable by (a.b)*.aa

b

a

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L= a then g($T} U $T else { }f({}) = { }f($V) = { }

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L= a then g($T} U $T else { }f({}) = { }f($V) = { }

g($T1 U $T2) = g($T1) U g($T2)g({$L: $T}) = if $L= b then f($T) else { }g({}) = { }g($V) = { }

g($T1 U $T2) = g($T1) U g($T2)g({$L: $T}) = if $L= b then f($T) else { }g({}) = { }g($V) = { }

Page 22: Management of XML and Semistructured Data

Structural Recursion: General Form

f1($T1 U $T2) = f1($T1) U f1($T2)f1({$L: $T}) = E1($L, f1($T),...,fk($T), $T)f1({}) = { }f1($V) = { }

f1($T1 U $T2) = f1($T1) U f1($T2)f1({$L: $T}) = E1($L, f1($T),...,fk($T), $T)f1({}) = { }f1($V) = { }

fk($T1 U $T2) = fk($T1) U fk($T2)fk({$L: $T}) = Ek($L, f1($T),...,fk($T), $T)fk({}) = { }fk($V) = { }

fk($T1 U $T2) = fk($T1) U fk($T2)fk({$L: $T}) = Ek($L, f1($T),...,fk($T), $T)fk({}) = { }fk($V) = { }

. . . .

Each of E1, ..., Ek consists only of {_ : _}, U, if_then_else_

Page 23: Management of XML and Semistructured Data

Evaluating Structural Recursion

Recursive Evaluation:

• Compute the functions recursively, starting with f1 at the root

Termination is guaranteed.

How efficiently can we evaluate this ?

Page 24: Management of XML and Semistructured Data

Structural Recursion

Consider this:

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = {$L:f($T)}, $L:f($T)}f({}) = {}f($V) = $V

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = {$L:f($T)}, $L:f($T)}f({}) = {}f($V) = $V

Page 25: Management of XML and Semistructured Data

Naive Recursive Evaluation

a

b

c

d

a a

b b b b

c c c c c c c c

Input tree = n nodesOutput tree = 2n+1 – 1 nodes

Page 26: Management of XML and Semistructured Data

Efficient Recursive EvaluationRecursive Evaluation with function memorization.PTIME complexity.

a

b

c

d

a

b

c

d

a

b

c

d

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = {$L:f($T)}, $L:f($T)}f({}) = {}f($V) = $V

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = {$L:f($T)}, $L:f($T)}f({}) = {}f($V) = $V

Alternatively: apply the function in parallel to each input edge Bulk Evaluation

Page 27: Management of XML and Semistructured Data

Bulk EvaluationSometimes f doesn’t return anything use edges

a

b

c

d

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L=c then $T else f($T)f({}) = {}f($V) = $V

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L=c then $T else f($T)f({}) = {}f($V) = $V

d

d

Page 28: Management of XML and Semistructured Data

Epsilon Edges

Meaning of edges:

=

a b

c d

a b

c d

dc

Page 29: Management of XML and Semistructured Data

Epsilon Edges

Note: union becomes easy to draw with edges:

Example:a b c d

eU =a b c d e

=

a b c d e

T1 T2U =T1 T2

Page 30: Management of XML and Semistructured Data

Bulk Evaluation

Idea: “apply” E1, ..., Ek independently on each edge, then connect with edges PTIME

f1($T1 U $T2) = f1($T1) U f1($T2)f1({$L: $T}) = E1($L, f1($T),...,fk($T), $T)f1({}) = { }f1($V) = { }

f1($T1 U $T2) = f1($T1) U f1($T2)f1({$L: $T}) = E1($L, f1($T),...,fk($T), $T)f1({}) = { }f1($V) = { }

fk($T1 U $T2) = fk($T1) U fk($T2)fk({$L: $T}) = Ek($L, f1($T),...,fk($T), $T)fk({}) = { }fk($V) = { }

fk($T1 U $T2) = fk($T1) U fk($T2)fk({$L: $T}) = Ek($L, f1($T),...,fk($T), $T)fk({}) = { }fk($V) = { }

. . . .

Page 31: Management of XML and Semistructured Data

Bulk Evaluation

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L= a then g($T} U $T else { }f({}) = { }f($V) = { }

f($T1 U $T2) = f($T1) U f($T2)f({$L: $T}) = if $L= a then g($T} U $T else { }f({}) = { }f($V) = { }

g($T1 U $T2) = g($T1) U g($T2)g({$L: $T}) = if $L= b then f($T) else { }g({}) = { }g($V) = { }

g($T1 U $T2) = g($T1) U g($T2)g({$L: $T}) = if $L= b then f($T) else { }g({}) = { }g($V) = { }

Recall (a.b)*.a:

a

b

a

a

a

b

b

a

c

ba

aa

b

b

a

c b

b

a

c

a

d

d

d

d

bc

Page 32: Management of XML and Semistructured Data

Structural Recursion

• Can evaluate in two ways:– Recursively: memorize functions’ results– Bulk: apply all functions on all edges, in

parallel, connect, eliminate what is useless

• Complexity: PTIME– More precisely: NLOGSPACE

• Works on graphs with cycles too !

Page 33: Management of XML and Semistructured Data

XSL

• two W3C drafts: XSLT and XPATH– http://www.w3.org/TR/xpath, 11/99– http://www.w3.org/TR/WD-xslt, 11/99

• in commercial products (e.g. IE5.0)

• purpose: stylesheet specification language:– stylesheet: XML -> HTML– in general: XML -> XML

Page 34: Management of XML and Semistructured Data

XSL Templates and Rules

• query = collection of template rules

• template rule = match pattern + template

<xsl:template> <xsl:apply-templates/> </xsl:template>

<xsl:template match = “/bib/*/title”> <result> <xsl:value-of/> </result></xsl:template>

<xsl:template> <xsl:apply-templates/> </xsl:template>

<xsl:template match = “/bib/*/title”> <result> <xsl:value-of/> </result></xsl:template>

Retrieve all book titles:

Page 35: Management of XML and Semistructured Data

Flow Control in XSL

<xsl:template> <xsl:apply-templates/> </xsl:template>

<xsl:template match=“a”> <A><xsl:apply-templates/></A></xsl:template>

<xsl:template match=“b”> <B><xsl:apply-templates/></B></xsl:template>

<xsl:template match=“c”> <C><xsl:value-of/></C></xsl:template>

<xsl:template> <xsl:apply-templates/> </xsl:template>

<xsl:template match=“a”> <A><xsl:apply-templates/></A></xsl:template>

<xsl:template match=“b”> <B><xsl:apply-templates/></B></xsl:template>

<xsl:template match=“c”> <C><xsl:value-of/></C></xsl:template>

Page 36: Management of XML and Semistructured Data

<a> <e> <b> <c> 1 </c>

<c> 2 </c>

</b>

<a> <c> 3 </c>

</a>

</e>

<c> 4 </c>

</a>

<A> <B> <C> 1 </C>

<C> 2 </C>

</B>

<A> <C> 3 </C>

</A>

<C> 4 </C>

</A>

Page 37: Management of XML and Semistructured Data

XSL is Structural Recursion

Equivalent to:

f(T1 U T2) = f(T1) U f(T2)f({L: T}) = if L= c then {C: t} else L= b then {B: f(t)} else L= a then {A: f(t)} else f(t)f({}) = {}f(V) = V

f(T1 U T2) = f(T1) U f(T2)f({L: T}) = if L= c then {C: t} else L= b then {B: f(t)} else L= a then {A: f(t)} else f(t)f({}) = {}f(V) = V

XSL query = single functionXSL query with modes = multiple function

Page 38: Management of XML and Semistructured Data

XSL and Structural Recursion

XSL:• trees only• may loop

Structural Recursion:• arbitrary graphs• always terminates

<xsl:template match = “e”> <xsl:apply-patterns select=“/”/></xsl:template>

<xsl:template match = “e”> <xsl:apply-patterns select=“/”/></xsl:template>

stack overflow on IE 5.0

add the following rule: