Management of XML and Semistructured Data


Lecture 7: XML-QL, Structural Recursion

Monday, April 23, 2001

XML-QL

• First declarative language for XML
• How to obtain a query language for XML fast?
  – Assume OEM as data model
  – Use features from UnQL and StruQL
    • Patterns
    • Templates
    • Skolem functions
  – Design XML-like syntax

Patterns in XML-QL

Find all authors who published with Morgan Kaufmann:

    WHERE <book language="french">
            <publisher> <name> Morgan Kaufmann </> </>
            <author> $A </>
          </book> in "www.a.b.c/bib.xml"
    CONSTRUCT <author> $A </>

Abbreviation: </> closes any tag.

Patterns in XML-QL

Find all languages in which Jones' coauthors have published:

    where <book language=$X>
            <author> $A </author>
          </book> in "www.a.b.c/bib.xml",
          <book>
            <author> $A </author>
            <author> Jones </author>
          </book> in "www.a.b.c/bib.xml"
    construct <result> $X </>

There is a join here…

Constructors in XML-QL

Find all authors and the languages in which they published:

    where <book language = $L> <author> $A </> </> in "www.a.b.c/bib.xml"
    construct <result> <author> $A </> <lang> $L </> </>

Result is:

    <result> <author>Smith</author> <lang>English</lang> </result>
    <result> <author>Smith</author> <lang>Mandarin</lang> </result>
    <result> <author>Doe</author> <lang>English</lang> </result>
    . . . .

Nested Queries in XML-QL

Find all authors and the languages in which they published; group by authors:

    WHERE <book.author> $A </> in "www.a.b.c/bib.xml"
    CONSTRUCT <result>
                <author> $A </>
                WHERE <book language = $L> <author> $A </> </> in "www.a.b.c/bib.xml"
                CONSTRUCT <lang> $L </>
              </>

Note: book.author is a (regular) path expression

Result is:

    <result> <author>Smith</author> <lang>English</lang> <lang>Mandarin</lang> <lang>…</lang> … </result>
    <result> <author>Doe</author> <lang>English</lang> … </result>
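The grouping performed by the nested query can be imitated in ordinary code. The following is only a sketch in Python, not XML-QL itself: the inline document BIB and the wrapper element "results" are made up for illustration, and the distinct-author loop stands in for the outer WHERE clause.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for www.a.b.c/bib.xml
BIB = """<bib>
  <book language="English"><author>Smith</author></book>
  <book language="Mandarin"><author>Smith</author></book>
  <book language="English"><author>Doe</author></book>
</bib>"""

root = ET.fromstring(BIB)

# Outer WHERE: one binding of $A per distinct author, in document order
authors = []
for a in root.findall("book/author"):
    if a.text not in authors:
        authors.append(a.text)

out = ET.Element("results")  # hypothetical wrapper for the <result> elements
for name in authors:
    result = ET.SubElement(out, "result")
    ET.SubElement(result, "author").text = name
    # Inner WHERE: every language of a book written by this author
    langs = []
    for book in root.findall("book"):
        if any(x.text == name for x in book.findall("author")):
            lang = book.get("language")
            if lang not in langs:
                langs.append(lang)
                ET.SubElement(result, "lang").text = lang

grouped = {r.find("author").text: [l.text for l in r.findall("lang")] for r in out}
```

Each author ends up in exactly one result element, with one lang child per distinct language.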

Skolem Functions in XML-QL

Same query, with Skolem functions:

    WHERE <book language = $L> <author> $A </> </> in "www.a.b.c/bib.xml"
    CONSTRUCT <result id=F($A)> <author> $A </> <lang> $L </> </>

Assumptions:
• the ID attribute is always id
• default Skolem function for author is G($A), for lang is H($A, $L) (why?)

Object fusion with Skolem functions and block structure
- Compile a complete list of authors, from two sources

    { WHERE <book> <author> $A </> <title> $T </> </> in "www.a.b.c/bib.xml"
      CONSTRUCT <person id=F($A)>
                  <name id=G($A)> $A </>
                  <booktitle> $T </>      /* implicit Skolem function H($A, $T) */
                </>
    }
    { WHERE <paper> <author> $A </> <title> $T </> <journal> $J </> </> in "www.d.e.f/papers.xml"
      CONSTRUCT <person id=F($A)>
                  <name id=G($A)> $A </>
                  <papertitle> $T </>     /* implicit Skolem function J($A, $T) */
                  <journaltitle> $J </>   /* implicit Skolem function K($A, $T) */
                </>
    }

Result (some persons have only books, others only papers, others have both):

    <person> <name>Smith</name>
             <booktitle>Book1</booktitle>
             <booktitle>Book2</booktitle> </person>
    <person> <name>Jones</name>
             <booktitle>Book3</booktitle>
             <papertitle>paper1</papertitle>
             <journaltitle>journal1</journaltitle> </person>
    <person> <name>Mark</name>
             <papertitle>paper2</papertitle>
             <journaltitle>journal3</journaltitle> </person>
    …
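One way to picture object fusion is to treat each Skolem term as a key into a table of nodes: the same function applied to the same arguments always yields the same node. The sketch below is in Python and is only an illustration; the helper names skolem and attach, the dictionary layout, and the sample bindings (copied from the result above) are assumptions, not part of XML-QL.

```python
nodes = {}  # (Skolem function name, argument tuple) -> node

def skolem(name, args, tag, text=None):
    """Return the unique node for name(args), creating it on first use."""
    key = (name, args)
    if key not in nodes:
        nodes[key] = {"tag": tag, "text": text, "children": []}
    return nodes[key]

def attach(parent, child):
    """Add child under parent once; repeated bindings do not duplicate it."""
    if not any(c is child for c in parent["children"]):
        parent["children"].append(child)

# Hypothetical variable bindings produced by the two WHERE clauses
books = [("Smith", "Book1"), ("Smith", "Book2"), ("Jones", "Book3")]
papers = [("Jones", "paper1", "journal1"), ("Mark", "paper2", "journal3")]

for a, t in books:                       # first block
    person = skolem("F", (a,), "person")
    attach(person, skolem("G", (a,), "name", a))
    attach(person, skolem("H", (a, t), "booktitle", t))

for a, t, j in papers:                   # second block
    person = skolem("F", (a,), "person")  # same F($A): fuses with the book data
    attach(person, skolem("G", (a,), "name", a))
    attach(person, skolem("J", (a, t), "papertitle", t))
    attach(person, skolem("K", (a, t), "journaltitle", j))

# Summary: person name -> list of (tag, text) for the remaining children
people = {
    node["children"][0]["text"]: [(c["tag"], c["text"]) for c in node["children"][1:]]
    for (name, _), node in nodes.items() if name == "F"
}
```

Because both blocks use F($A) for person, Jones receives children from both sources while Smith and Mark receive children from only one.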

"Wrong" query number 1:

    WHERE <book language = $L> <author> $A </> </> in "www.a.b.c/bib.xml"
    CONSTRUCT <result id=F($A)>
                <author id=G($A)> $A </>
                <lang id=H($A)> $L </>
              </>

What is "wrong" here?

    WHERE <book language = $L> <author> $A </> </> in "www.a.b.c/bib.xml"
    CONSTRUCT <result id=F($A,$L)>
                <author id=G($A)> $A </>
                <lang id=H($A,$L)> $L </>
              </>

What is "wrong" here?

    { WHERE <book language = $L> <author> $A </> </> in "www.a.b.c/bib.xml"
      CONSTRUCT <author id=F($A)> <lang id=H($A,$L)> $L </> </>
    }
    { WHERE <person> <city> $C </> <fluent-in> $X </> </> in "www.a.b.c/bib.xml"
      CONSTRUCT <location id=G($C)> <lang id=H($C,$L)> $L </> </>
    }

Three Rules to Construct Only Trees

Rule 1: nested elements must have Skolem functions that are… [how ??]

    CONSTRUCT <tag1 id=F([args1])> <tag2 id=G([args2])> … </> </>

Rule 2: an element that has an atomic content must have a Skolem function that is… [how ??]

    CONSTRUCT … <tag id=F([args])> $X </> …

Rule 3: if a Skolem function occurs in two different places then the following condition must hold… [which ??]

    { CONSTRUCT <tag1 id=G([args1])> <tag id=F([args])> … </> </> }
    { CONSTRUCT <tag1 id=H([args2])> <tag id=F([args])> … </> </> }

XML-QL vs. XQuery

• XQuery (= Quilt) vs. XML-QL
  + faithful XML data model
  + XPath sublanguage
  + aggregate functions (as in SQL)
  + some features from XQL
  - Patterns
  - Skolem functions

A Different Paradigm: Structural Recursion

Data as sets with a union operator:

    {a:3, a:{b:"one", c:5}, b:4} =

    {a:3} U {a:{b:"one", c:5}} U {b:4}

Structural Recursion

Example: retrieve all integers in the data

    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T})  = f($T)
    f({})        = {}
    f($V)        = if isInt($V) then {result: $V} else {}

[Figure: the input tree {a:3, a:{b:"one", c:5}, b:4} and the output {result:3, result:5, result:4}]
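The four equations translate almost line by line into a recursive function. The sketch below uses Python and assumes a particular encoding that is not in the slides: a tree is a list of (label, subtree) pairs, a leaf is a plain Python value, and union is list concatenation.

```python
def f(t):
    """Collect every integer leaf as a {result: V} edge."""
    if isinstance(t, list):
        # f(T1 U T2) = f(T1) U f(T2), and f({}) = {} for the empty list
        out = []
        for _label, sub in t:
            out.extend(f(sub))  # f({L: T}) = f(T): the label is dropped
        return out
    # f(V) = if isInt(V) then {result: V} else {}
    if isinstance(t, int) and not isinstance(t, bool):
        return [("result", t)]
    return []

# The tree {a:3, a:{b:"one", c:5}, b:4} in the assumed encoding
data = [("a", 3), ("a", [("b", "one"), ("c", 5)]), ("b", 4)]
result = f(data)
```

The recursion follows the shape of the data only, which is why it must terminate on any finite tree.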


What does this do?

    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T})  = if $L = a then {b: f($T)} else {$L: f($T)}
    f({})        = {}
    f($V)        = $V


What does this do?

    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T})  = {$L: {$L: f($T)}}
    f({})        = {}
    f($V)        = $V

Input = tree with n nodes
Output = ???


Example: increase all engine prices by 10%

    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T})  = if $L = engine then {$L: g($T)} else {$L: f($T)}
    f({})        = {}
    f($V)        = $V

    g($T1 U $T2) = g($T1) U g($T2)
    g({$L: $T})  = if $L = price then {$L: 1.1*$T} else {$L: g($T)}
    g({})        = {}
    g($V)        = $V

[Figure: input tree with an engine subtree and a body subtree, each containing part and price edges with values 100 and 1000; in the output the prices under engine become 110 and 1100, while the prices under body stay 100 and 1000]
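The two mutually dependent functions can be sketched in Python as follows. The encoding (a tree as a list of (label, subtree) pairs) and the sample tree are assumptions; the tree is only one plausible reading of the figure. The call to round is an addition to hide floating-point noise in 1.1*$T.

```python
def f(t):
    """Copy the tree; switch to g below every engine edge."""
    if not isinstance(t, list):
        return t  # f(V) = V
    return [(label, g(sub) if label == "engine" else f(sub)) for label, sub in t]

def g(t):
    """Copy the tree; multiply the value under every price edge by 1.1."""
    if not isinstance(t, list):
        return t  # g(V) = V
    # Assumes the subtree under price is a number, as in the example
    return [(label, round(1.1 * sub, 2) if label == "price" else g(sub))
            for label, sub in t]

# Hypothetical input resembling the figure
data = [
    ("engine", [("part", [("price", 100)]), ("price", 1000)]),
    ("body",   [("part", [("price", 100)]), ("price", 1000)]),
]
updated = f(data)
```

Here f acts as the state "outside an engine" and g as the state "inside an engine", so only prices below an engine edge change.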


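Retrieve all subtrees reachable by (a.b)*.a

[Figure: two-state automaton for (a.b)*.a, with an a transition from the first state to the second and a b transition back]

    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T})  = if $L = a then g($T) U $T else { }
    f({})        = { }
    f($V)        = { }

    g($T1 U $T2) = g($T1) U g($T2)
    g({$L: $T})  = if $L = b then f($T) else { }
    g({})        = { }
    g($V)        = { }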
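The pair f, g behaves like the two states of an automaton for (a.b)*.a. A minimal Python sketch follows, with the same assumed encoding as before (a tree is a list of (label, subtree) pairs). One further assumption: an atomic leaf reached by the path contributes nothing to the union, since the slides do not say how a value is unioned with a set of edges.

```python
def edges(t):
    """View a tree as its list of edges; an atomic value has none (assumption)."""
    return t if isinstance(t, list) else []

def f(t):
    """State expecting an a edge: emit the subtree below it, then continue with g."""
    out = []
    for label, sub in edges(t):
        if label == "a":
            out += g(sub) + edges(sub)  # g(T) U T
    return out

def g(t):
    """State expecting a b edge: go back to f below it."""
    out = []
    for label, sub in edges(t):
        if label == "b":
            out += f(sub)
    return out

# Hypothetical input: the subtrees at paths a and a.b.a are reachable
data = [
    ("a", [("b", [("a", [("c", 1)])]), ("d", 2)]),
    ("b", [("a", [("e", 3)])]),
]
reachable = f(data)
```

The top-level b edge is ignored because the path expression must start with a.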


Structural Recursion: General Form

    f1($T1 U $T2) = f1($T1) U f1($T2)
    f1({$L: $T})  = E1($L, f1($T), ..., fk($T), $T)
    f1({})        = { }
    f1($V)        = { }

    . . . .

    fk($T1 U $T2) = fk($T1) U fk($T2)
    fk({$L: $T})  = Ek($L, f1($T), ..., fk($T), $T)
    fk({})        = { }
    fk($V)        = { }

Each of E1, ..., Ek consists only of {_ : _}, U, if_then_else_

Evaluating Structural Recursion

Recursive Evaluation:

• Compute the functions recursively, starting with f1 at the root

Termination is guaranteed.

How efficiently can we evaluate this ?


Consider this:

    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T})  = {$L: f($T), $L: f($T)}
    f({})        = {}
    f($V)        = $V


Naive Recursive Evaluation

[Figure: the input chain a, b, c, d and the unfolded output tree, with 2 a edges, 4 b edges, 8 c edges, and so on]

Input tree = n nodes
Output tree = 2^(n+1) - 1 nodes

Efficient Recursive Evaluation

Recursive evaluation with function memoization.
PTIME complexity.

[Figure: the input chain a, b, c, d and its output drawn as a graph with shared subtrees]
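The difference between naive and memoized evaluation can be made concrete by counting function calls on the doubling example. This is a sketch in Python under assumptions not found in the slides: trees are lists of (label, subtree) pairs, the cache is keyed by the identity of the input subtree, and the one-element lists used as counters exist only for the demonstration.

```python
def f_naive(t, counter):
    """f({L: T}) = {L: f(T), L: f(T)}, recomputing f(T) for each copy."""
    counter[0] += 1
    if not isinstance(t, list):
        return t
    out = []
    for label, sub in t:
        out.append((label, f_naive(sub, counter)))
        out.append((label, f_naive(sub, counter)))
    return out

def f_memo(t, counter, cache):
    """Same function, but each input subtree is processed once and shared."""
    if not isinstance(t, list):
        return t
    key = id(t)  # identity of the input node; valid while the input is alive
    if key in cache:
        return cache[key]
    counter[0] += 1
    out = []
    for label, sub in t:
        out.append((label, f_memo(sub, counter, cache)))
        out.append((label, f_memo(sub, counter, cache)))
    cache[key] = out
    return out

# The chain a -> b -> c -> d ending in an empty tree
chain = [("a", [("b", [("c", [("d", [])])])])]

naive_calls, memo_calls = [0], [0]
naive_out = f_naive(chain, naive_calls)
memo_out = f_memo(chain, memo_calls, {})
```

On this chain the naive version makes 31 calls and the memoized version evaluates 5 distinct subtrees. Both outputs denote the same tree, but the memoized one is a graph in which the two copies at each level are the same object.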



Alternatively: apply the function in parallel to each input edge → Bulk Evaluation

Bulk Evaluation

Sometimes f doesn't return anything → use ε edges

    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T})  = if $L = c then $T else f($T)
    f({})        = {}
    f($V)        = $V

[Figure: the input chain a, b, c, d and the output, in which only the d edge remains, connected through ε edges]

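Epsilon Edges

Meaning of ε edges:

[Figure: a node with edges a, b and an ε edge to a node with edges c, d is equal to a single node with edges a, b, c, d]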

Epsilon Edges

Note: union becomes easy to draw with ε edges:

[Figure: T1 U T2 drawn as a new root with one ε edge to T1 and one ε edge to T2. Example: {a, b} U {c, d, e} = {a, b, c, d, e}]

Bulk Evaluation

Idea: "apply" E1, ..., Ek independently on each edge, then connect with ε edges → PTIME

Bulk Evaluation

Recall (a.b)*.a:

[Figure: bulk evaluation of (a.b)*.a on an example graph with edges labeled a, b, c, d]


• Can evaluate in two ways:
  – Recursively: memoize the functions' results
  – Bulk: apply all functions on all edges in parallel, connect, eliminate what is useless
• Complexity: PTIME
  – More precisely: NLOGSPACE
• Works on graphs with cycles too!

XSL

• two W3C drafts: XSLT and XPath
  – http://www.w3.org/TR/xpath, 11/99
  – http://www.w3.org/TR/WD-xslt, 11/99
• in commercial products (e.g. IE5.0)
• purpose: stylesheet specification language:
  – stylesheet: XML -> HTML
  – in general: XML -> XML

XSL Templates and Rules

• query = collection of template rules

• template rule = match pattern + template

Retrieve all book titles:

    <xsl:template>
      <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="/bib/*/title">
      <result> <xsl:value-of/> </result>
    </xsl:template>
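The way the two template rules cooperate can be imitated with a short recursive walk. This is a sketch in Python using xml.etree.ElementTree, not an XSLT processor; the sample document, the wrapper element "output", and the function name apply_templates are illustrative assumptions, and text nodes outside matched titles are simply ignored.

```python
import xml.etree.ElementTree as ET

def apply_templates(node, path, out):
    """Use the rule for /bib/*/title if it matches, else the default rule."""
    path = path + [node.tag]
    if len(path) == 3 and path[0] == "bib" and path[2] == "title":
        # Rule with match="/bib/*/title": emit <result> with the node's value
        ET.SubElement(out, "result").text = node.text
    else:
        # Default rule: just continue with the children
        for child in node:
            apply_templates(child, path, out)

# Hypothetical input document
doc = ET.fromstring(
    "<bib>"
    "<book><title>T1</title><author>A1</author></book>"
    "<paper><title>T2</title></paper>"
    "</bib>"
)
out = ET.Element("output")  # hypothetical container for the results
apply_templates(doc, [], out)
titles = [r.text for r in out]
```

The pattern /bib/*/title matches a title under any child of bib, so the title of the paper element is returned as well.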

Flow Control in XSL


    <xsl:template match="a">
      <A><xsl:apply-templates/></A>
    </xsl:template>

    <xsl:template match="b">
      <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="c">
      <C><xsl:value-of/></C>
    </xsl:template>

XSL is Structural Recursion

Equivalent to:

    f(T1 U T2) = f(T1) U f(T2)
    f({L: T})  = if L = c then {C: T}
                 else if L = b then {B: f(T)}
                 else if L = a then {A: f(T)}
                 else f(T)
    f({})      = {}
    f(V)       = V

XSL query = single function
XSL query with modes = multiple functions

XSL and Structural Recursion

XSL:
• trees only
• may loop

Structural Recursion:
• arbitrary graphs
• always terminates

Add the following rule:

    <xsl:template match="e">
      <xsl:apply-templates select="/"/>
    </xsl:template>

Result: stack overflow on IE 5.0
