XML Typing and Query Evaluation. Plan We will put some formal model underlying XML Trees and queries...
-
Upload
emerald-small -
Category
Documents
-
view
215 -
download
1
Transcript of XML Typing and Query Evaluation. Plan We will put some formal model underlying XML Trees and queries...
Plan• We will put some formal model underlying XML Trees and
queries on them– Keeping in mind the practical aspects but abstracting details
away– This will allow us to solve problems that are important in
practice, in a foundational way
• Important issues– Types: defining a language of trees
• Will be useful for verifying validity of a tree• And for optimizations of queries
– Query evaluation on a document– Query evaluation on a type (type inference / checking)
XML typing
• Not compulsory• Simplify writing software for XML
– Improve interoperability between programs
• Improve storage and performance• Simplify data protection
– Reject illegal update
Improve performance
Bib
paper book
yearjournal
title
int string string
addressauthor
title
zip city street
lastname
firstname
string string string string string
string
\\adress\\zip[@zip=“12345”]\\adress\\zip[@zip=“12345”]
\\adress\zip[@zip=“12345”]\\adress\zip[@zip=“12345”]
Typing semistructured data
Improve storage
Root
Company Employee
string
company
person
works-for
c.e.o.
address
name
managed-by
name
o i d n a m e a d d r e s s c . e . o .… … … …… … … …
Company
o i d n a m e m a n a g e d - b y w o r k s - f o r… … … …… … … …
Employee
Store rest in XML(file)
“Well-behaved” parts can go to relational DBs
Typing semistructured data
Type checking
• Who checks– XML editor: check that the data conforms to its type– XML exchange, e.g., with Web service
• Server when delivering the data• Client/application: when receiving it
• Dynamic verification: after the data is produced• Static verification: verification of the program that
generates the data
Static verification
• Input: input type T and code of function f– f is Xquery, Xpath, etc.
• Verification of T’– Is it true that d T, f(d) T’ ?╞ ╞
• Type inference– Find the smallest T’ such that d T, f(d) T’╞ ╞
• A type is a language of trees
ExampleF= for $p in doc("parts.xml“)//part[color=“red"]return <part>
<name>$p/name/text()</name><desc>$p/desc/node()</desc>
</part>
Result type (part (name (string) desc (any) )*
If the type of parts.xml//part/desc is string(part (name (string) desc (string) )*
DifficultySemantics: for $X in Input, $Y in the input do { output ( <b/> }Can be written in XQueryInput: <a/> <a/> Result: <b/> <b/> <b/> <b/> Problem: { bi i=n2 for n ≥ 0 } cannot be described in XML
schemaThere is no « best » result
– b*– + b2 b*
– + b2 + b4b*
– + b2 + b4 + b9b*
– …
Why tree automata?
• XML = unranked trees• No theory for XML • Rich theory for strings: Automata• Extend to
rich theory for ranked trees: Tree automata – Nice algorithms– Nice theorems– Can this carry to unranked trees and XML?
• Yes!
Finite state automata on words
),,,,( 0 FqQ
Alphabet
State
Initial state Accepting states
Transitions
Qq 0 QF
)(: QPQ
Typing semistructured data
q0
Nondeterministic automaton: Example
33
32
21
01
100
,
,
,
,
,,
qqb
qqqa
2
3210 ,,,
,
qF
qqqqQ
ba
a b a a b- a b a- q0
q1
q0 q0
q1
q0
q1
q0 q0
q1
q0 q0 q2
q1
q0
KO OK
• Deterministic– No transition– No alternative transitions such as
• Determinization – It is possible to obtain an equivalent deterministic automaton– State of new automaton = set of states of the original one– Possible exponential blow-up
• Minimization• Limitations – cannot do
– Context-free languages• Essential tool – e.g., lexical analysis
Reminder
Ν, nba nn
100 ,, qqqa 0, qq
Reminder (2)• L(A) = set of words accepted by automata A• Regular languages• Can be described by regular expressions, e.g. a(b+c)*d• Closed under complement
• Closed under union, intersection
– Product automata with states (s,s’) where s is from A and s’ is from A’
)(* AL
)()(
)()(
BLAL
BLAL
Automata on words versus trees
a b b a
a
b
b a
b
b
a b
a
Left to right
Right to left
No difference
Botto
m
up
Top
down
Differences
Binary tree automata
• Parallel evaluation
• For leaves:
• For other nodes:
),,,( FQ
)(: QP
)(: QPQQ
a
b
b a
b
a b
a
Botto
m
up
q q’
bq”
q1q”
q2
qqq’
Typing semistructured data
Bottom-up tree automata
• Bottom-up: if a node labeled a has its children in states q, q’ then the node moves nondeterministically to state r or r’
• Accepts is the root is in some state in F
• Not deterministic if alternatives or -transitions:
',',, rrqqa
}',{',, rrqqa ', rr
Example: deterministic bottom-up
1102012112
0002
0102012002
1112
,,,,,,,,
,,
,,,,,,,,
,,
qqqqqqq
qqq
qqqqqqq
qqq
1
10 ,
,,1,0
qF
qqQ
11
01
1
0
q
q
1102
1012
1112
0002
0102
0012
0002
1112
,,
,,
,,
,,
,,
,,
,,
,,
qqq
qqq
qqq
qqq
qqq
qqq
qqq
qqq
Boolean circuit evaluation
v
v
v
1v v1
10
v
0
11
11
01
1
0
q
q
0q 1q 0q
1q1q
1q1q1q
1q
1q
1q
1q
1q
OK
Regular tree language = set of trees accepted by a bottom-up tree
automaton
Typing semistructured data
Regular tree languages
Theorem: the following are equivalent– L is a regular tree language– L is accepted by a nondeterministic bottom-up
automaton– L is accepted by a deterministic bottom-up
automaton– L is accepted by a nondeterministic top-down
automaton
Deterministic top-down is weaker
Top-down tree automata
• Top-down: if a node labeled a is in state q”, then its left child moves to state q, right to q’
• Accepts is all leaves are in states in F• Not deterministic if
',", qqqa
',,',", rrqqqa
Why deterministic top-down is weaker?
• Consider the language– L = { <r> <a\>,<b\> <\r>, <r> <b\>,<a\><\r>) }
• It can be accepted by a bottom-up TA– Exercise: write a BUTA A such that L = L(A)
• Suppose that B is a deterministic top-down TA that accepts both trees in L– Exercise: Show that B also accepts <r> <a\><a\> <\r> – A contradiction
Fact: No deterministic top-down tree automata accepts exactly L
Ranked trees automata: Properties
• Like for words• Determinization • Minimization• Closed under
– Complement– Intersection– Union
Unranked tree automata
...,,,,,,
...,,,,,
...,,,,,
...,,,,,,
222
222
222
222
fffffffff
ttftfttt
ftffftff
ttttttttt
Issue: represent an infinite set of transitionsSolution: a regular language
• Rule:• Meaning: if the states of the children of some
node labeled a form a word in L(Q), this node moves to some state in {r1,…,rm}
Unranked tree automata (2)
mrrQLa ,...,)(, 1
fOrwherefOr
fttftOrwheretOr
ftfftAndwherefAnd
tAndwheretAnd
00,
*)(*)(11,
*)(*)(00,
11,
2
2
2
2
Building on ranked trees
a
b
b
b
b
a b
a b
a
b
b
b
b
a b
a b
Ranked tree: FirstChild-NextSibling
F: encoding into a ranked treeF is a bijectionF-1: decoding
Building on bottom-up ranked trees (2)
• For each Unranked TA A, there is a Ranked TA accepting F(L(A))
• For each Ranked TA A, there is an unranked TA accepting F-1(L(A))
• Both are easy to construct
Consequence: Unranked TA are closed under union, intersection, complement
Determinaztaion also possible, a bit more tricky
Monadic second-order logic
•Representation of a tree as a logical structure
E(1,2), E(1,3)… E(3,9)S(2,3), S(3,4), S(4,5)…S(8,9)
a(1), a(4), a(8)b(2), b(3), b(5), b(6), b(7), b(9)
a
b
b
b
b
a b
a b
1
6
3 42
7 8 9
5
XxXX
x
xayxSyxEyx
)(
...)(),(),(::
Monadic second-order logic
E(1,2), E(1,3)… E(3,9)S(2,3), S(3,4), S(4,5)…S(8,9)
a(1), a(4), a(8)b(2), b(3), b(5), b(6), b(7), b(9)
MSO syntax
Set variable
Quantification over a set variable
Example of MSO
•Each a node has a b-descendant•This corresponds to the formula
For each node x labeled a: each set X that ()contains x and that () is closed under descendant, X contains
some y labeled b
))()((
))()(),((
)(
)(
ybyXy
zXyXzyEzy
xX
whereXxax
Bridge
Theorem: for a set L of trees, the following are equivalent
.1L = L(A) for some bottom-up tree automata Ai.e. L is definable with bottom-tree automata
.2L = {T | T satisfies } for some MSO formula i.e. L is definable in MSO
DTD
•Describe the children of a node of a label a by a regular expression
•Bizarre syntax!<ELEMENT populationdata (continent*)>
!<ELEMENT continent (name, country*)>
!<ELEMENT country (name, province*)>
!<ELEMENT province (name, city*)>
!<ELEMENT city (name, pop)>
!<ELEMENT name (#PCDATA)>
!<ELEMENT pop (#PCDATA) >
DTD and deterministism
•Regular expressions in DTD should be deterministic
–Complicated definition
•Intuition: the corresponding automata should be deterministic
–(a+b*)a is not–When reading <a>, one cannot tell whether it is an a
from (a+b) or if it is the a of the end–)b*a()b*a *)is an equivalent expression that is
deterministic
Very efficient validation
•It suffices to verify for each node a that the word formed by the labels of its children is
accepted by the finite state automata Aa
•Possible to type check the document while scanning it
Very efficient validation (2)!<ELEMENT a ( b c )>
!<ELEMENT b ( d+ )>
a
b c
d d
s t ub c
Aa
s’ t’d
dAb
<a<<b<<d/<<d/<</b<<c/<</a<
s’
st
t’
Acceptu
Warning
•The previous example can be checked with a simple automata on words
•But not the following one
!<ELEMENT part ( part* )>
•The stack is needed for accepting<a>…<a></a>…</a>
n <a> n </a<
Some bad news for DTD
•Not closed under union DTD1…
!<ELEMENT used( ad*)>
!<ELEMENT ad ( year, brand )>
DTD2…
!<ELEMENT new( ad*)>
!<ELEMENT ad ( brand )>
•L(DTD1) L(DTD2) cannot be described by a DTD but can be described easily by a tree automata
–Problem with the type of ad that depends of its parent
•Also not closed under complement•Limited expressive power
Car example continued
•The best DTD we can choose does not distinguish between ads for used and new cars
–!<ELEMENT ad (year?, brand)<
Car
Used New
Brand Year Brand
“Renault” “2008” “BMW”
Goal
• Evaluating an Xpath query against a given document– To find all matches
• We will also consider Type Checking– Given an Xpath query Q, input DTD T and output
DTD T’, Does it hold that for every document D satisfying T, Q(D) satisfies T’
• Complexity is important– Huge Documents
Data complexity vs. Combined Complexity
• Two inputs to the query evaluation problem– Data (XML document) of size |D|– Query (Xpath expression) of size |Q|– Usually |Q| << |D|
• Polynomial data complexity– Complexity that is polynomial in |D|, possibly exponential in |Q|
• Polynomial combined complexity– Complexity that is polynomial in |D| and |Q|
• Fixed Parameter Tractable complexity
Xpath Query Evaluation
• Input: XML Document D, Xpath query Q
• Output: A subset of the nodes of D, as defined by Q
• We will follow Efficient Algorithms for Processing Xpath Queries /Gottlob, Koch, Pichler 2003
Simple algorithm
process-location-step(n,Q) { S:-= Apply Q.first to n; If |Q|> 1 For each node n’ in s do process-location-step(n’,Q.next)}
Complexity
• Worst case: in each step of Q the axis is “following”
• So we apply the query in each step on O(|D|) nodes
• And we get Time(|Q|)= |D|*Time(|Q|-1)
• I.e. the complexity is O(|D|^|Q|)
Polynomial data complexity
• Sometimes considered good even if exponential in the query size
• But can we have polynomial combined complexity?
• Yes!
Bottom-up vs. Top-down evaluation
• We will discuss two kinds of query evaluation algorithms:– Bottom-up means that the query parse tree is
processed from the leaves up to the root– Top-down means that the parse tree is processed
from the root to the leaves
• When processing we will fill in a Context-value table
Bottom-up evaluation
• Main idea: compute the value for each leaf for every possible context
• Propagate upwards until the root
• Dynamic programming algorithm to avoid re-evaluation of queries in the same context
An equivalent semantics to XPath
• The domain of contexts is C= dom X {<k,n> | 1<k<n< |dom|} A context is c=<x,k,n> where x is a context node k is context position n is the context size
Context-value Table
• Given a query sub-expression e, the context-value table of e specifies all combinations of context c and value v, such that computing e on the context c results in v
• Bottom-up algorithm follows: compute the context-value table in a bottom-up fashion with respect to the query
Complexity
• O(|D|^3*|Q|) space ignoring strings and numbers– O(|Q|) tables, with 3 columns, each including values
in 1…|D| thus O(|D|^3*|Q|)– An extra O(|D|*|Q|) multiplicative factor for strings
and numbers
• O(|D|^5*|Q|) time ignoring strings and numbers– It can take O(|D|^2) to combine two nodesets– Extra O(|Q|) in case of strings and numbers
Optimization
• Represent contexts as pairs of current and previous node
• Allows to get the time complexity down to O(|D|^4* |Q|^2)
• Space complexity can be brought down to O(|D|^2*|Q|^2) via more optimizations
Top-down evaluation
• Similar idea
• But allows to compute only values for contexts that are needed
• Same worst-case bounds
Top-down or bottom-up?
• General question in processing XML trees• The tradeoff:
– Usually easier to combine results computed in children to obtain the result at the parent
• So bottom-up traversal is usually easier to design
– On the other hand, some of the computation is redundant since we don’t know if it will become relevant
• So top-down traversal may be more efficient
Linear-time fragment• Core Xpath includes only navigation
– \ and \\
• Core Xpath can be evaluated in O(|D|*|Q|)
• Observtion: no need to consider the entire triple, only current context node
• Top-down or bottom-up evaluation with essentially the same algorithm
• But smaller tables (for every query node, all document nodes and values of evaluation)