XML Typing and Query Evaluation. Plan We will put some formal model underlying XML Trees and queries...

65
XML Typing and Query Evaluation

Transcript of XML Typing and Query Evaluation. Plan We will put some formal model underlying XML Trees and queries...

XML Typing and Query Evaluation

Plan• We will put some formal model underlying XML Trees and

queries on them– Keeping in mind the practical aspects but abstracting details

away– This will allow us to solve problems that are important in

practice, in a foundational way

• Important issues– Types: defining a language of trees

• Will be useful for verifying validity of a tree• And for optimizations of queries

– Query evaluation on a document– Query evaluation on a type (type inference / checking)

XML typing

• Not compulsory• Simplify writing software for XML

– Improve interoperability between programs

• Improve storage and performance• Simplify data protection

– Reject illegal update

Improve performance

Bib

paper book

yearjournal

title

int string string

addressauthor

title

zip city street

lastname

firstname

string string string string string

string

\\adress\\zip[@zip=“12345”]\\adress\\zip[@zip=“12345”]

\\adress\zip[@zip=“12345”]\\adress\zip[@zip=“12345”]

Typing semistructured data

Improve storage

Root

Company Employee

string

company

person

works-for

c.e.o.

address

name

managed-by

name

o i d n a m e a d d r e s s c . e . o .… … … …… … … …

Company

o i d n a m e m a n a g e d - b y w o r k s - f o r… … … …… … … …

Employee

Store rest in XML(file)

“Well-behaved” parts can go to relational DBs

Typing semistructured data

Type checking

• Who checks– XML editor: check that the data conforms to its type– XML exchange, e.g., with Web service

• Server when delivering the data• Client/application: when receiving it

• Dynamic verification: after the data is produced• Static verification: verification of the program that

generates the data

Static verification

• Input: input type T and code of function f– f is Xquery, Xpath, etc.

• Verification of T’– Is it true that d T, f(d) T’ ?╞ ╞

• Type inference– Find the smallest T’ such that d T, f(d) T’╞ ╞

• A type is a language of trees

ExampleF= for $p in doc("parts.xml“)//part[color=“red"]return <part>

<name>$p/name/text()</name><desc>$p/desc/node()</desc>

</part>

Result type (part (name (string) desc (any) )*

If the type of parts.xml//part/desc is string(part (name (string) desc (string) )*

DifficultySemantics: for $X in Input, $Y in the input do { output ( <b/> }Can be written in XQueryInput: <a/> <a/> Result: <b/> <b/> <b/> <b/> Problem: { bi i=n2 for n ≥ 0 } cannot be described in XML

schemaThere is no « best » result

– b*– + b2 b*

– + b2 + b4b*

– + b2 + b4 + b9b*

– …

Why tree automata?

• XML = unranked trees• No theory for XML • Rich theory for strings: Automata• Extend to

rich theory for ranked trees: Tree automata – Nice algorithms– Nice theorems– Can this carry to unranked trees and XML?

• Yes!

Automata

Automata on words

Typing semistructured data

Finite state automata on words

),,,,( 0 FqQ

Alphabet

State

Initial state Accepting states

Transitions

Qq 0 QF

)(: QPQ

Typing semistructured data

q0

Nondeterministic automaton: Example

33

32

21

01

100

,

,

,

,

,,

qq

qq

qq

qqb

qqqa

2

3210 ,,,

,

qF

qqqqQ

ba

a b a a b- a b a- q0

q1

q0 q0

q1

q0

q1

q0 q0

q1

q0 q0 q2

q1

q0

KO OK

• Deterministic– No transition– No alternative transitions such as

• Determinization – It is possible to obtain an equivalent deterministic automaton– State of new automaton = set of states of the original one– Possible exponential blow-up

• Minimization• Limitations – cannot do

– Context-free languages• Essential tool – e.g., lexical analysis

Reminder

Ν, nba nn

100 ,, qqqa 0, qq

Reminder (2)• L(A) = set of words accepted by automata A• Regular languages• Can be described by regular expressions, e.g. a(b+c)*d• Closed under complement

• Closed under union, intersection

– Product automata with states (s,s’) where s is from A and s’ is from A’

)(* AL

)()(

)()(

BLAL

BLAL

Automata on words versus trees

a b b a

a

b

b a

b

b

a b

a

Left to right

Right to left

No difference

Botto

m

up

Top

down

Differences

Automata

Automata on ranked trees

Typing semistructured data

Binary tree automata

• Parallel evaluation

• For leaves:

• For other nodes:

),,,( FQ

)(: QP

)(: QPQQ

a

b

b a

b

a b

a

Botto

m

up

q q’

bq”

q1q”

q2

qqq’

Typing semistructured data

Bottom-up tree automata

• Bottom-up: if a node labeled a has its children in states q, q’ then the node moves nondeterministically to state r or r’

• Accepts is the root is in some state in F

• Not deterministic if alternatives or -transitions:

',',, rrqqa

}',{',, rrqqa ', rr

Example: deterministic bottom-up

1102012112

0002

0102012002

1112

,,,,,,,,

,,

,,,,,,,,

,,

qqqqqqq

qqq

qqqqqqq

qqq

1

10 ,

,,1,0

qF

qqQ

11

01

1

0

q

q

1102

1012

1112

0002

0102

0012

0002

1112

,,

,,

,,

,,

,,

,,

,,

,,

qqq

qqq

qqq

qqq

qqq

qqq

qqq

qqq

Boolean circuit evaluation

v

v

v

1v v1

10

v

0

11

11

01

1

0

q

q

0q 1q 0q

1q1q

1q1q1q

1q

1q

1q

1q

1q

OK

Regular tree language = set of trees accepted by a bottom-up tree

automaton

Typing semistructured data

Regular tree languages

Theorem: the following are equivalent– L is a regular tree language– L is accepted by a nondeterministic bottom-up

automaton– L is accepted by a deterministic bottom-up

automaton– L is accepted by a nondeterministic top-down

automaton

Deterministic top-down is weaker

Top-down tree automata

• Top-down: if a node labeled a is in state q”, then its left child moves to state q, right to q’

• Accepts is all leaves are in states in F• Not deterministic if

',", qqqa

',,',", rrqqqa

Why deterministic top-down is weaker?

• Consider the language– L = { <r> <a\>,<b\> <\r>, <r> <b\>,<a\><\r>) }

• It can be accepted by a bottom-up TA– Exercise: write a BUTA A such that L = L(A)

• Suppose that B is a deterministic top-down TA that accepts both trees in L– Exercise: Show that B also accepts <r> <a\><a\> <\r> – A contradiction

Fact: No deterministic top-down tree automata accepts exactly L

Ranked trees automata: Properties

• Like for words• Determinization • Minimization• Closed under

– Complement– Intersection– Union

But…

• XML documents are unranked:book (intro,section*,conclusion)

Automata

Automata on unranked tree

Typing semistructured data

Unranked tree automata

...,,,,,,

...,,,,,

...,,,,,

...,,,,,,

222

222

222

222

fffffffff

ttftfttt

ftffftff

ttttttttt

Issue: represent an infinite set of transitionsSolution: a regular language

• Rule:• Meaning: if the states of the children of some

node labeled a form a word in L(Q), this node moves to some state in {r1,…,rm}

Unranked tree automata (2)

mrrQLa ,...,)(, 1

fOrwherefOr

fttftOrwheretOr

ftfftAndwherefAnd

tAndwheretAnd

00,

*)(*)(11,

*)(*)(00,

11,

2

2

2

2

Building on ranked trees

a

b

b

b

b

a b

a b

a

b

b

b

b

a b

a b

Ranked tree: FirstChild-NextSibling

F: encoding into a ranked treeF is a bijectionF-1: decoding

Building on bottom-up ranked trees (2)

• For each Unranked TA A, there is a Ranked TA accepting F(L(A))

• For each Ranked TA A, there is an unranked TA accepting F-1(L(A))

• Both are easy to construct

Consequence: Unranked TA are closed under union, intersection, complement

Determinaztaion also possible, a bit more tricky

Tree Automata and

monadic second-order logic

Typing semistructured data

Monadic second-order logic

•Representation of a tree as a logical structure

E(1,2), E(1,3)… E(3,9)S(2,3), S(3,4), S(4,5)…S(8,9)

a(1), a(4), a(8)b(2), b(3), b(5), b(6), b(7), b(9)

a

b

b

b

b

a b

a b

1

6

3 42

7 8 9

5

XxXX

x

xayxSyxEyx

)(

...)(),(),(::

Monadic second-order logic

E(1,2), E(1,3)… E(3,9)S(2,3), S(3,4), S(4,5)…S(8,9)

a(1), a(4), a(8)b(2), b(3), b(5), b(6), b(7), b(9)

MSO syntax

Set variable

Quantification over a set variable

Example of MSO

•Each a node has a b-descendant•This corresponds to the formula

For each node x labeled a: each set X that ()contains x and that () is closed under descendant, X contains

some y labeled b

))()((

))()(),((

)(

)(

ybyXy

zXyXzyEzy

xX

whereXxax

Bridge

Theorem: for a set L of trees, the following are equivalent

.1L = L(A) for some bottom-up tree automata Ai.e. L is definable with bottom-tree automata

.2L = {T | T satisfies } for some MSO formula i.e. L is definable in MSO

XML typing

DTDs

Typing semistructured data

DTD

•Describe the children of a node of a label a by a regular expression

•Bizarre syntax!<ELEMENT populationdata (continent*)>

!<ELEMENT continent (name, country*)>

!<ELEMENT country (name, province*)>

!<ELEMENT province (name, city*)>

!<ELEMENT city (name, pop)>

!<ELEMENT name (#PCDATA)>

!<ELEMENT pop (#PCDATA) >

DTD and deterministism

•Regular expressions in DTD should be deterministic

–Complicated definition

•Intuition: the corresponding automata should be deterministic

–(a+b*)a is not–When reading <a>, one cannot tell whether it is an a

from (a+b) or if it is the a of the end–)b*a()b*a *)is an equivalent expression that is

deterministic

Very efficient validation

•It suffices to verify for each node a that the word formed by the labels of its children is

accepted by the finite state automata Aa

•Possible to type check the document while scanning it

Very efficient validation (2)!<ELEMENT a ( b c )>

!<ELEMENT b ( d+ )>

a

b c

d d

s t ub c

Aa

s’ t’d

dAb

<a<<b<<d/<<d/<</b<<c/<</a<

s’

st

t’

Acceptu

Warning

•The previous example can be checked with a simple automata on words

•But not the following one

!<ELEMENT part ( part* )>

•The stack is needed for accepting<a>…<a></a>…</a>

n <a> n </a<

Some bad news for DTD

•Not closed under union DTD1…

!<ELEMENT used( ad*)>

!<ELEMENT ad ( year, brand )>

DTD2…

!<ELEMENT new( ad*)>

!<ELEMENT ad ( brand )>

•L(DTD1) L(DTD2) cannot be described by a DTD but can be described easily by a tree automata

–Problem with the type of ad that depends of its parent

•Also not closed under complement•Limited expressive power

Car example continued

•The best DTD we can choose does not distinguish between ads for used and new cars

–!<ELEMENT ad (year?, brand)<

Car

Used New

Brand Year Brand

“Renault” “2008” “BMW”

Xpath Query Evaluation

Goal

• Evaluating an Xpath query against a given document– To find all matches

• We will also consider Type Checking– Given an Xpath query Q, input DTD T and output

DTD T’, Does it hold that for every document D satisfying T, Q(D) satisfies T’

• Complexity is important– Huge Documents

Data complexity vs. Combined Complexity

• Two inputs to the query evaluation problem– Data (XML document) of size |D|– Query (Xpath expression) of size |Q|– Usually |Q| << |D|

• Polynomial data complexity– Complexity that is polynomial in |D|, possibly exponential in |Q|

• Polynomial combined complexity– Complexity that is polynomial in |D| and |Q|

• Fixed Parameter Tractable complexity

Xpath Query Evaluation

• Input: XML Document D, Xpath query Q

• Output: A subset of the nodes of D, as defined by Q

• We will follow Efficient Algorithms for Processing Xpath Queries /Gottlob, Koch, Pichler 2003

Simple algorithm

process-location-step(n,Q) { S:-= Apply Q.first to n; If |Q|> 1 For each node n’ in s do process-location-step(n’,Q.next)}

Complexity

• Worst case: in each step of Q the axis is “following”

• So we apply the query in each step on O(|D|) nodes

• And we get Time(|Q|)= |D|*Time(|Q|-1)

• I.e. the complexity is O(|D|^|Q|)

Polynomial data complexity

• Sometimes considered good even if exponential in the query size

• But can we have polynomial combined complexity?

• Yes!

Xpath query parse tree

descendant::b/following-sibling::* [position() != last()]

Bottom-up vs. Top-down evaluation

• We will discuss two kinds of query evaluation algorithms:– Bottom-up means that the query parse tree is

processed from the leaves up to the root– Top-down means that the parse tree is processed

from the root to the leaves

• When processing we will fill in a Context-value table

Bottom-up evaluation

• Main idea: compute the value for each leaf for every possible context

• Propagate upwards until the root

• Dynamic programming algorithm to avoid re-evaluation of queries in the same context

An equivalent semantics to XPath

• The domain of contexts is C= dom X {<k,n> | 1<k<n< |dom|} A context is c=<x,k,n> where x is a context node k is context position n is the context size

Context-value Table

• Given a query sub-expression e, the context-value table of e specifies all combinations of context c and value v, such that computing e on the context c results in v

• Bottom-up algorithm follows: compute the context-value table in a bottom-up fashion with respect to the query

Bottom-up algorithm

Example

Complexity

• O(|D|^3*|Q|) space ignoring strings and numbers– O(|Q|) tables, with 3 columns, each including values

in 1…|D| thus O(|D|^3*|Q|)– An extra O(|D|*|Q|) multiplicative factor for strings

and numbers

• O(|D|^5*|Q|) time ignoring strings and numbers– It can take O(|D|^2) to combine two nodesets– Extra O(|Q|) in case of strings and numbers

Optimization

• Represent contexts as pairs of current and previous node

• Allows to get the time complexity down to O(|D|^4* |Q|^2)

• Space complexity can be brought down to O(|D|^2*|Q|^2) via more optimizations

Top-down evaluation

• Similar idea

• But allows to compute only values for contexts that are needed

• Same worst-case bounds

Top-down or bottom-up?

• General question in processing XML trees• The tradeoff:

– Usually easier to combine results computed in children to obtain the result at the parent

• So bottom-up traversal is usually easier to design

– On the other hand, some of the computation is redundant since we don’t know if it will become relevant

• So top-down traversal may be more efficient

Linear-time fragment• Core Xpath includes only navigation

– \ and \\

• Core Xpath can be evaluated in O(|D|*|Q|)

• Observtion: no need to consider the entire triple, only current context node

• Top-down or bottom-up evaluation with essentially the same algorithm

• But smaller tables (for every query node, all document nodes and values of evaluation)