Containment and Equivalence for an XPath Fragment Authors:Gerome Miklau Dan Suciu Presented by:...

  • date post

    21-Dec-2015

Containment and Equivalence for an XPath Fragment

Authors: Gerome Miklau, Dan Suciu

Presented by: Shnaiderman Lila

Presentation Outline

Introduction
Final Destination
Definitions and background
Canonical models and Match Sets
Exponential time containment algorithm (complete)
Homomorphism
Polynomial time containment algorithm (incomplete)
co-NP hardness of containment
Additional topics of interest
Conclusion

Introduction

XPath is a simple language for navigating XML documents and selecting a set of nodes.

With XPath we can query XML, describe key constraints, express transformations, and reference elements in remote documents.

XPath's influence can be found in other XML query languages and standards such as XQuery, XSLT, XML Schema, XLink, XPointer, and more.

Introduction (continued)

This article deals with a simple XPath fragment that consists of:
node tests
child axes (/)
descendant axes (//)
wildcards (*)
predicates ([…])

This class of queries is called XP{[],*,//}.

Example: a//*[b//d][c]

[Figure: tree pattern for the example expression, with nodes a, *, b, c, d and distinguished node x.]

Final Destination

Show that the containment problem for XP{[],*,//} is co-NP complete (surprising!).

Present an efficient, sound algorithm that is complete in some cases (this algorithm always runs in PTIME).

Present a sound and complete algorithm that is efficient in some cases (its worst-case running time is exponential).

Definitions and background

NP stands for "Nondeterministic Polynomial".

P: the class of decision problems that can be solved in polynomial time.

NP: the class of decision problems whose "yes" answers can be verified in polynomial time (equivalently, problems solvable by a nondeterministic polynomial-time algorithm). It is widely believed, but not proven, that some NP problems cannot be solved in polynomial time.

NP-hard problem: a problem to which every NP problem can be reduced in polynomial time (so it is at least as hard as every problem in NP).

NP-complete problem: a problem that belongs to NP and is also NP-hard.

coNP: the class of problems whose complement is in NP. If L is a coNP problem, there exists a polynomial-time nondeterministic algorithm M such that:
If x ∈ L, then M(x) = "yes" for all computation paths.
If x ∉ L, then M(x) = "no" for some computation path.

Definitions and background (continued)

Embedding: given a tree pattern p and a tree t, an embedding from p to t is a function e: NODES(p) → NODES(t) that satisfies the following conditions:

Root-preserving: e(ROOT(p)) = ROOT(t)
Label-preserving: for each x ∈ NODES(p), LABEL(x) = * or LABEL(x) = LABEL(e(x))
Child-edge-preserving: for each (x,y) ∈ EDGES_/(p), (e(x), e(y)) ∈ EDGES(t)
Descendant-edge-preserving: for each (x,y) ∈ EDGES_//(p), (e(x), e(y)) ∈ EDGES^+(t)

(EDGES^+(t) means that the two nodes are connected by a path of at least one edge.)
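The four embedding conditions can be checked recursively. The following Python sketch is illustrative only and is not taken from the paper. It assumes a hypothetical encoding in which a node is a `(label, children)` pair and every child entry is an `(edge, node)` pair with edge `"/"` or `"//"` (trees use only `"/"`). The helper names `node`, `descendants`, and `embeds` are made up for this example, and the search is naive, with no attempt at efficiency.

```python
def node(label, *kids):
    """Build a node; each kid is an (edge, node) pair, edge being "/" or "//"."""
    return (label, list(kids))


def descendants(t):
    """Yield every proper descendant of tree node t."""
    for _, child in t[1]:
        yield child
        yield from descendants(child)


def embeds(p, t):
    """True if some embedding maps pattern node p to tree node t."""
    p_label, p_kids = p
    # Label-preserving: a wildcard matches anything, otherwise labels must agree.
    if p_label != "*" and p_label != t[0]:
        return False
    for edge, p_child in p_kids:
        if edge == "/":
            # Child-edge-preserving: the image must be a child of t.
            candidates = [c for _, c in t[1]]
        else:
            # Descendant-edge-preserving: the image must lie strictly below t.
            candidates = descendants(t)
        if not any(embeds(p_child, c) for c in candidates):
            return False
    return True


# Boolean version of the pattern a//*[b//d][c].
d_branch = node("b", ("//", node("d")))
p = node("a", ("//", node("*", ("/", d_branch), ("/", node("c")))))

# A tree in which the wildcard can be mapped to the node labeled x.
b_sub = node("b", ("/", node("e", ("/", node("d")))))
t_yes = node("a", ("/", node("x", ("/", b_sub), ("/", node("c")))))

# The same shape without any d node below b.
t_no = node("a", ("/", node("x", ("/", node("b")), ("/", node("c")))))

print(embeds(p, t_yes), embeds(p, t_no))
```

Calling `embeds` on the two roots enforces the root-preserving condition. Because an embedding does not have to be injective, each pattern child can be matched independently of its siblings.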

Definitions and background (continued)

Example

[Figure: a tree instance t with nodes labeled a, b, c, d, and the pattern p = a[a]//*[b]//c.]

Definitions and background (continued)

From XPath to tree patterns: every XPath expression can be translated into a tree pattern of arity 1, and vice versa, while preserving semantics. From now on we consider tree patterns only: P{[],*,//} and its fragments.

Boolean patterns: patterns with arity 0.

Definition: if p is Boolean then p(t) = ∅ (false) or p(t) = {()} (true).

Containment means implication: p ⊆ p' iff ∀t. p(t) ⇒ p'(t).

Proposition 1: Let s_1,…,s_k be k labels that are not in Σ. There is a translation of k-ary patterns over the alphabet Σ to Boolean patterns over the alphabet Σ ∪ {s_1,…,s_k}, such that for any k-ary patterns p, p' and their translations p_0, p_0', we have p ⊆ p' iff p_0 ⊆ p_0'.

Definitions and background (continued)

Example: a tree pattern of arity 3, with distinguished nodes x_1, x_2, x_3, and its translation to a Boolean pattern p_0, as used in Proposition 1. p_0 has three extra nodes labeled s_1, s_2, s_3.

[Figure: the arity-3 pattern with distinguished nodes x1, x2, x3, and its Boolean translation with the extra nodes s1, s2, s3.]

In the rest of this article we assume all tree patterns to be Boolean, unless otherwise stated.

Definitions and background (continued)

Mutual reducibility of containment and equivalence: the containment and equivalence problems are mutually reducible in polynomial time. Equivalence is simply two-way containment. We discuss only containment in the remainder of this article.

Tree pattern evaluation: there is an algorithm that decides, for any tree pattern p and input tree t, whether p(t) is true, and that runs in time O(|p||t|). Here |p| and |t| are the sizes of p and t, meaning their numbers of nodes.

p(t) is true means that there is an embedding from p to t.

Canonical models and Match Sets

Model of a Boolean pattern p: a model of p is a tree t ∈ T_Σ on which p evaluates to true.

Mod(p), the set of models: Mod(p) = {t ∈ T_Σ | p(t) is true}. Then p ⊆ p' iff Mod(p) ⊆ Mod(p').

Witness: a tree t such that p(t) is true and p'(t) is false; a witness exists iff p ⊈ p'. Finding a witness means searching an infinite set of trees, so we need to restrict the search.

Canonical models:
First step: eliminate all descendant edges by replacing each // edge with a sequence of wildcards */*/…/*.
Second step: replace each wildcard with a symbol z.

Formally (first step): p has d descendant edges, EDGES_//(p) = {r_1,…,r_d}. Given d numbers û = (u_1,…,u_d) with u_1 ≥ 0,…,u_d ≥ 0, p[û] is the pattern obtained by replacing each descendant edge r_i with a chain of u_i nodes labeled *, connected by child edges.

Distance: if r_i = (x,y), then in p[û] d(x,y) = u_i + 1.

Canonical models and Match Sets (continued)

Example

[Figure: a tree pattern p (nodes a, b, *, c, a) and its extension p[0,2], with the extension nodes marked.]

Lemma: Let e: p → t be an embedding from the tree pattern p to the tree t. There exist a unique extension p[û] and a unique embedding e': p[û] → t such that ∀x ∈ NODES(p), e(x) = e'(x).

Proof: For each i = 1,…,d, e maps the descendant edge r_i = (x_i,y_i) ∈ EDGES_//(p) to a pair of nodes (e(x_i),e(y_i)) ∈ EDGES^+(t). Define u_i = d(e(x_i),e(y_i)) − 1 (d is the distance in t), and let û = (u_1,…,u_d). Extend e to e': p[û] → t by mapping the extension nodes between x_i and y_i to the nodes connecting e(x_i) to e(y_i).

Canonical models and Match Sets (continued)

Formally (second step): replace the *'s with the symbol z. s_z(p) is the tree obtained from the pattern p by replacing each * in p with z.

Set of canonical models: mod_z(p) = {s_z(p[û]) | û = (u_1,…,u_d), u_1 ≥ 0,…,u_d ≥ 0}. This set is infinite whenever p has at least one descendant edge.

Set of bounded canonical models, for n ≥ 0: mod_z^n(p) = {s_z(p[û]) | û = (u_1,…,u_d), 0 ≤ u_1 ≤ n,…,0 ≤ u_d ≤ n}. This set is always finite.

Star length w of a pattern q: the largest number of nodes labeled * that are connected by child edges.

Need to show: to search for a witness for p ⊈ p' it is enough to check the finite set mod_z^n(p), where z does not occur in p' and n depends only on p'.
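The two-step construction of canonical models can be sketched in a few lines of Python. This is an illustration under assumptions, not code from the paper: a pattern node is a hypothetical `(label, children)` pair whose child entries are `(edge, node)` pairs, the descendant edges r_1,…,r_d are numbered in preorder, and the helper names are invented for this example.

```python
from itertools import product


def node(label, *kids):
    """Build a node; each kid is an (edge, node) pair, edge being "/" or "//"."""
    return (label, list(kids))


def count_descendant_edges(p):
    """Number d of // edges in pattern p."""
    return sum((edge == "//") + count_descendant_edges(c) for edge, c in p[1])


def canonical_model(p, lengths):
    """Build s_z(p[u]); `lengths` iterates over u_1..u_d in preorder of the // edges."""
    label = "z" if p[0] == "*" else p[0]      # second step: every * becomes z
    kids = []
    for edge, child in p[1]:
        u = next(lengths) if edge == "//" else 0
        sub = canonical_model(child, lengths)
        for _ in range(u):                     # first step: chain of u extension nodes
            sub = ("z", [("/", sub)])
        kids.append(("/", sub))
    return (label, kids)


def bounded_canonical_models(p, n):
    """Yield every tree of mod_z^n(p), one per vector u with 0 <= u_i <= n."""
    d = count_descendant_edges(p)
    for u in product(range(n + 1), repeat=d):
        yield canonical_model(p, iter(u))


# a//b has one descendant edge, so mod_z^2 holds one tree per u_1 in {0, 1, 2}.
models = list(bounded_canonical_models(node("a", ("//", node("b"))), 2))
for m in models:
    print(m)
```

The bounded set has (n+1)^d members, which is why the number of descendant edges d ends up in the exponent of the running time.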

Canonical models and Match Sets (continued)

Proposition: Let p and p' be two Boolean tree patterns, z ∈ Σ a symbol that does not appear in p', and w the star length of p'. Then the following are equivalent:
(1) p ⊆ p'
(2) mod_z(p) ⊆ Mod(p')
(3) mod_z^n(p) ⊆ Mod(p'), where n = w + 1.

Proof: (1) ⇒ (2) ⇒ (3) is obvious (p ⊆ p' is equivalent to Mod(p) ⊆ Mod(p'), and mod_z^n(p) ⊆ mod_z(p) ⊆ Mod(p)).

This leaves (3) ⇒ (1). Suppose p ⊈ p', and let t be a witness (p(t) is true and p'(t) is false).

p(t) is true ⇒ there exists an embedding e: p → t ⇒ there exists e': p[û] → t that agrees with e on the nodes of p (by the Lemma).

Claim: t_1 = s_z(p[û]) ∈ mod_z(p) is still a witness, that is, p'(t_1) is false. Suppose p'(t_1) were true. Then there exists an embedding e_1: p' → t_1. Define f: NODES(p') → NODES(t) by composing e_1: p' → t_1 with e': p[û] → t (possible because NODES(t_1) = NODES(p[û])). Since z does not appear in p', f is an embedding from p' to t, so p'(t) is true, a contradiction. Hence p'(t_1) is false while p(t_1) is true, so t_1 ∈ mod_z(p) is a witness.

Canonical models and Match Sets (continued)

We now construct a canonical model t_2 ∈ mod_z^n(p) that is still a witness. This follows directly from the next lemma:

Lemma: Let p and p' be two Boolean tree patterns, z ∈ Σ a symbol that does not appear in p', and w' the star length of p'. Let t_1 = s_z(p[û]) be a canonical model such that p'(t_1) is false. Define v = (v_1,…,v_d) by v_i = min(u_i, n) for i = 1,…,d, where n = w' + 1, and let t_2 = s_z(p[v]). Then p'(t_2) is false.

Intuition: if p'(t_2) were true, then we could stretch the chains of extra nodes in t_2 to obtain t_1, and we would still have p'(t_1) true.

Remark: the n from part (3) depends only on p': n = w' + 1 (w' is the star length of p').

This concludes the proof: if p ⊈ p', then t_2 is a witness in mod_z^n(p), so mod_z^n(p) ⊆ Mod(p') ⇒ p ⊆ p'.

Canonical models and Match Sets (continued)

Match sets: for a tree t (or a pattern p), each node and each edge defines a subtree.
x ∈ NODES(t) defines t_x, which consists of the node x and its subtree (ROOT(t_x) = x; t_ROOT(t) = t).
(x,y) ∈ EDGES(t) defines t_x,y, which consists of t_y, the node x, and the edge (x,y).
S(t) is the set of all subtrees of nodes and edges.

[Figure: a pattern p' with node x labeled a, node y labeled b, node z labeled c, and node u labeled *, together with its subpatterns p'_y, p'_y,z, p'_y,u, p'_z, p'_u, and p' = p'_x = p'_x,y.]

Canonical models and Match Sets (continued)

q* is the pattern obtained by replacing the root of q with *.

The match set of a tree t collects the subpatterns of p' that match t:

ms(t) = {p'_x | x ∈ NODES(p'), p'_x(t) = true}
∪ {p'_x,y | (x,y) ∈ EDGES_/(p'), p'_x,y(t) = true}
∪ {p'_x,y | (x,y) ∈ EDGES_//(p'), (p'_x,y)*(t) = true}

MS[p] = {ms(t) | t ∈ mod_z(p)}

[Figure: for the pattern p' with nodes x (a), y (b), z (c), u (*), two canonical models of p: t1 = /a/b/c with ms(t1) = {p'_x, p'_x,y, p'_y,u, p'_u}, and t2 = /a/b/z/c with ms(t2) = {p'_y,u, p'_u}.]

Exponential time containment algorithm

Naive algorithm: to decide whether p ⊆ p', iterate over all t ∈ mod_z^(w'+1)(p) and check p'(t) (each check requires O(|t||p'|) steps).

Total time: O(|p||p'|(w'+2)^(d+1)) (based on the size of s_z(p[û]) and the fact that d ≤ |p|).

Problem: the naive algorithm is not practical, since much of the work in evaluating p'(t) is repeated across the various canonical models t.
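The naive algorithm can be written down directly from the definitions. The Python sketch below is an illustration and not the authors' code. It bundles compact helpers so that it runs on its own, and it rests on several assumptions: patterns are hypothetical `(label, children)` pairs with `(edge, node)` child entries, the letter `"z"` does not occur in p', star length is read as the longest downward chain of * nodes joined by child edges, and the evaluation of p'(t) is a plain recursive search rather than the O(|t||p'|) procedure.

```python
from itertools import product


def node(label, *kids):
    return (label, list(kids))


def descendants(t):
    for _, c in t[1]:
        yield c
        yield from descendants(c)


def embeds(p, t):
    """Naive check that pattern node p can be embedded at tree node t."""
    if p[0] != "*" and p[0] != t[0]:
        return False
    for edge, pc in p[1]:
        cands = [c for _, c in t[1]] if edge == "/" else descendants(t)
        if not any(embeds(pc, c) for c in cands):
            return False
    return True


def count_descendant_edges(p):
    return sum((e == "//") + count_descendant_edges(c) for e, c in p[1])


def canonical_model(p, lengths):
    """s_z(p[u]) for the vector u supplied by the iterator `lengths`."""
    kids = []
    for edge, child in p[1]:
        u = next(lengths) if edge == "//" else 0
        sub = canonical_model(child, lengths)
        for _ in range(u):
            sub = ("z", [("/", sub)])
        kids.append(("/", sub))
    return ("z" if p[0] == "*" else p[0], kids)


def star_length(q):
    """Longest chain of * nodes joined by child edges (assumed reading of w')."""
    def walk(x):
        down, best = 0, 0
        for edge, c in x[1]:
            c_down, c_best = walk(c)
            best = max(best, c_best)
            if edge == "/":
                down = max(down, c_down)
        here = down + 1 if x[0] == "*" else 0
        return here, max(best, here)
    return walk(q)[1]


def contained(p, q):
    """Decide p ⊆ q by testing q on every tree of mod_z^(w'+1)(p)."""
    n = star_length(q) + 1
    d = count_descendant_edges(p)
    return all(embeds(q, canonical_model(p, iter(u)))
               for u in product(range(n + 1), repeat=d))


child = node("a", ("/", node("b")))                       # a/b
desc = node("a", ("//", node("b")))                       # a//b
left = node("a", ("//", node("*", ("/", node("b")))))     # a//*/b
right = node("a", ("/", node("*", ("//", node("b")))))    # a/*//b
print(contained(child, desc), contained(desc, child))
print(contained(left, right), contained(right, left))
```

Every canonical model is evaluated from scratch here, which is exactly the repeated work that the slide identifies as the weakness of the naive approach.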

Main idea of the Match Set algorithm: p ⊈ p' iff there exists a canonical tree t ∈ mod_z(p) such that p'(t) is false. So it suffices to compute ms(t) for some t and to check whether p'_ROOT(p') ∉ ms(t).

Problem: we do not know for which tree t to compute ms(t).

Solution: compute the set of all match sets, MS[p]. Then it suffices to check the condition ∃ms ∈ MS[p], p'_ROOT(p') ∉ ms to determine that p ⊈ p'.

Exponential time containment algorithm (continued)

Remark: MS[p] has at most as many elements as there are canonical trees in mod_z^(w'+1)(p) (w' is the star length of p'). In many cases it is much smaller, because many canonical trees give the same match set (as in the example above). This is why the Match Set algorithm is better than the naive one.

The full algorithm to check whether p ⊆ p' (complete):
Compute MS[p].
Check whether ∃ms ∈ MS[p], p'_ROOT(p') ∉ ms.
If such a match set exists, return p ⊈ p'.
If it does not, return p ⊆ p'.

Example:

[Figure: tree pattern p (nodes a, b, c) and tree pattern p' (node x labeled a, node y labeled b, node z labeled c, node u labeled *).]

MS[p] = { {p'_x, p'_x,y, p'_y,u, p'_u}, {p'_y,u, p'_u} }

p ⊈ p', because ∃ms ∈ MS[p] with p'_ROOT(p') ∉ ms: p'_x ∉ {p'_y,u, p'_u}.

Exponential time containment algorithm (continued)

Running time: O(|p||p'|(w'+2)^d) (based on the size of s_z(p[û]) and the fact that d ≤ |p|).

This algorithm is sound and complete, and in some cases it runs in exponential time.

In the following example, one match set is {p'_x, p'_x,y1,…,p'_x,yn}, and the other match sets are subsets of {p'_x,y1,…,p'_x,yn}. The algorithm therefore answers false (p ⊈ p'), but it takes exponential time to decide this, because there are 2^n match sets to check.

[Figure: tree pattern p with nodes a, b, c1, c2, …, cn, and tree pattern p' with root x labeled a and nodes y1, y2, y3, … labeled b.]

Homomorphism

A homomorphism h: p' → p between two tree patterns p, p' is a function h: NODES(p') → NODES(p) that satisfies the conditions of an embedding (root-preserving, label-preserving, child-edge-preserving, descendant-edge-preserving), with the following strengthening of the child-edge-preservation condition:

∀(x,y) ∈ EDGES_/(p'): (h(x),h(y)) ∈ EDGES_/(p) (and not EDGES_//(p)).

Example:

[Figure: two tree patterns P and P' illustrating a homomorphism from P' to P.]
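A homomorphism test looks almost like the embedding test, except that both arguments are patterns. The Python sketch below is a hedged illustration rather than the paper's procedure. It assumes the same hypothetical `(label, children)` encoding with `(edge, node)` child entries, treats a descendant edge of p' as satisfied by any downward path of at least one edge in p, and uses invented helper names. The deck uses the existence of a homomorphism from p' to p as a sufficient test for p ⊆ p'.

```python
def node(label, *kids):
    """Build a pattern node; each kid is an (edge, node) pair."""
    return (label, list(kids))


def below(p):
    """Yield every node strictly below pattern node p, through / or // edges."""
    for _, child in p[1]:
        yield child
        yield from below(child)


def homomorphic(q, p):
    """True if some homomorphism maps node q (of p') to node p (of p)."""
    # Label-preserving: a * in p' maps anywhere, a letter only to the same letter.
    if q[0] != "*" and q[0] != p[0]:
        return False
    for edge, q_child in q[1]:
        if edge == "/":
            # Strengthened condition: a child edge must map to a child edge of p.
            candidates = [c for e, c in p[1] if e == "/"]
        else:
            candidates = below(p)
        if not any(homomorphic(q_child, c) for c in candidates):
            return False
    return True


child = node("a", ("/", node("b")))                       # a/b
desc = node("a", ("//", node("b")))                       # a//b
left = node("a", ("//", node("*", ("/", node("b")))))     # a//*/b
right = node("a", ("/", node("*", ("//", node("b")))))    # a/*//b

print(homomorphic(desc, child))   # homomorphism from a//b to a/b
print(homomorphic(child, desc))   # none in the other direction
print(homomorphic(right, left))   # none, although a//*/b and a/*//b are equivalent
```

The last call shows why the test is only sufficient: the two patterns select the same trees, yet the child edge below the root of a/*//b has no child edge to map to in a//*/b.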

Homomorphism (continued)

Problem: homomorphism fails in the following case for P{//,*}:

[Figure: patterns P and P' (nodes a, *, b) with no homomorphism between them, and P next to the adorned pattern P'' (nodes a, b, adornment 1).]

Solution: adornment, combining // with *:
// → //^0
//^m * / → //^(m+1)
/ * //^n → //^(n+1)
//^m * //^n → //^(m+n+1)

Only * nodes with unique children may be eliminated this way.

In a homomorphism with adornment, d(h(x),h(y)) ≥ d(x,y), where d is the distance function.

Example: p' = a//*/*/b/*/c//d is rewritten to a//^2 b/*/c//^0 d.

Homomorphism (continued)

Problem: in the following case there is no homomorphism:

[Figure: tree pattern p (nodes b, *, c) and tree pattern p' (nodes b, *). The * leaf of p' has no outgoing edge, so it cannot be eliminated by adornment.]

Shadowing: for every leaf node in both p and p', add a shadow leaf with a label that does not occur in p or p', connected to the original leaf by a descendant edge.

[Figure: tree pattern p and tree pattern p' after shadow leaves are added.]

Polynomial time containment algorithm

The algorithm:
Add shadow leaf symbols to p and p'.
Apply the rewriting rules (adornment) to p' to get p''.
Find a homomorphism from p'' to p.
If one is found, return true; otherwise return false.

Properties of the algorithm:
The algorithm is sound.
Its running time is polynomial, O(|p||p'|), dominated by the check for the existence of a homomorphism.
The algorithm is not complete in general.
The algorithm is complete in the following 4 cases:
p ∈ P{[],*}
p' ∈ P{[],*}
p' ∈ P{[],//}
p' ∈ P{*,//}

The proof is given in the paper.

Polynomial time containment algorithm (continued)

Example of an incomplete case:

[Figure: tree pattern p and adorned tree pattern p' (nodes a, b, c, d, *); every attempted homomorphism gets stuck ("no more options").]

The algorithm fails although p ⊆ p' (this can be shown by case analysis).

co-NP hardness of containment

First we show that the problem "given p, p' ∈ P{[],*,//}, decide whether p ⊆ p'" is in co-NP.

Reminder: to show that p ⊈ p' we have to find t ∈ mod_z^n(p) and show that there is no embedding from p' to t.

To prove that the problem is in co-NP, we present a nondeterministic algorithm that checks p ⊈ p': guess d numbers u_1,…,u_d, each u_i ≤ w'+1, where w' is the star length of p', construct the canonical model t = s_z(p[u_1,…,u_d]), and then check in polynomial time that p'(t) is false. ⇒ The problem is in co-NP.

Another definition of containment: containment of a Boolean pattern p in a union of patterns is defined as follows: p ⊆ p_1 ∪ … ∪ p_k holds if, for all trees t, p(t) ⇒ p_1(t) ∨ p_2(t) ∨ … ∨ p_k(t).

Lemma: Given patterns p, p_1, p_2,…, p_k in P{[],*,//}, there exist patterns q, q' in P{[],*,//} such that p ⊆ p_1 ∪ … ∪ p_k iff q ⊆ q'.
The sizes of q and q' are polynomial in the sizes of p, p_1, p_2,…, p_k.
q and q' have no more wildcards than those present in p, p_1, p_2,…, p_k.

co-NP hardness of containment

Proof: to prove the lemma we use the following construction:

[Figure: pattern q, with root r, a spine of c nodes, the subtree p, and k−1 copies of V above p and k−1 copies of V below p; pattern q', with root r and a spine of k nodes labeled c that carry the subtrees p_1, p_2, …, p_k.]

V has no * and no //.
V ⊆ p_j for every j: V is obtained by fusing the (common) roots of the p_i subtrees, replacing each * in p_i with some letter a and each // with /.

The canonical models of q are completely determined by a choice of canonical model for q's subtree p: for each t ∈ mod_z(q), t_p ∈ mod_z(p) is the subtree corresponding to p.

Presented by Shnaiderman LilaPresented by Shnaiderman Lila 2929

co-NP hardness of containment (continued)

Returning to the lemma. First direction: p ⊆ p_1 ∪ … ∪ p_k ⇒ q ⊆ q’ (that is, for every t ∈ mod_z(q), q’(t) is true):

For t ∈ mod_z(q), p(t_p) is true ⇒ p_i(t_p) is true for some i ∈ {1,…,k} ⇒ q’(t) is true under the following embedding e: q’ → t. e maps the root of q’ to the root of t, e maps the subpattern p_i to t_p, and e maps every other p_j to a corresponding copy of V (there are enough copies of V below and above p to make this possible).

Second direction: q ⊆ q’ ⇒ p ⊆ p_1 ∪ … ∪ p_k (it suffices to show that for every t_p ∈ mod_z(p), p_1(t_p) ∨ p_2(t_p) ∨ … ∨ p_k(t_p) is true):

Take t_p ∈ mod_z(p) and let t ∈ mod_z(q) be the extension of t_p obtained by adding the spine and k-1 copies of V above and below t_p. q(t) is true ⇒ q’(t) is true ⇒ there exists an embedding e: q’ → t. This embedding must map the spine in q’ to the spine in t. Let x be the spine node in t that is right above t_p. At least one spine node in q’ must be mapped to x, because there are only k-1 nodes above or below x, and the spine in q’ has only k nodes and no descendant edges.

So some node y in q’ is mapped to x ⇒ the p_i attached to y satisfies p_i(t_p) ⇒ p_1(t_p) ∨ p_2(t_p) ∨ … ∨ p_k(t_p) is true.
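The counting step of the second direction can be checked mechanically. The sketch below is an illustration, not part of the original proof: it assumes the spine of t has 2k-1 nodes with x exactly in the middle and that the k spine nodes of q’ land on k consecutive spine nodes of t (there are no descendant edges inside the spine of q’). The function name is hypothetical.

```python
def windows_covering_middle(k):
    """For a spine of 2k-1 nodes, list every placement of a k-node window.

    Each entry is (start, i): `start` is the 0-based offset of the window and
    `i` is the 1-based index of the window node that lands on the middle node x.
    """
    length = 2 * k - 1
    middle = k - 1                      # 0-based position of x, right above t_p
    hits = []
    for start in range(length - k + 1):
        # Only k-1 nodes lie on either side of x, so a k-node window cannot miss it.
        assert start <= middle < start + k
        hits.append((start, middle - start + 1))
    return hits
```

For k = 3 the placements are (0, 3), (1, 2), (2, 1): whichever way the spine of q’ is placed, some p_i ends up matched against t_p.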


co-NP hardness of containment (continued)

Now we are ready to prove co-NP hardness, by reduction from the unsatisfiability problem for 3-CNF formulas. Let ψ be a 3-CNF formula with n propositional variables y_1, y_2, …, y_n and k clauses c_1, c_2, …, c_k. We construct patterns A, C_1, …, C_k such that ψ is not satisfiable iff A ⊆ C_1 ∪ … ∪ C_k. The tree pattern A is constructed so that its canonical models, mod_z(A), encode truth assignments to the n variables of ψ. Tree pattern C_i is constructed so that the following property holds:

(*) For every t ∈ mod_z(A), C_i(t) is true iff the truth assignment encoded by t makes the clause c_i false.

Property (*) is sufficient to prove co-NP hardness because of the following equivalences and the last lemma:

(A ⊆ C_1 ∪ … ∪ C_k)
⇔ (for every t ∈ mod_z(A) there exists i such that C_i(t) is true)
⇔ (for every truth assignment there exists i such that c_i is false under that assignment)
⇔ (ψ is not satisfiable).

Let’s show how to construct A, C_1, …, C_k so that property (*) is satisfied.
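The last two equivalences in the chain are purely propositional and can be checked by brute force on small formulas. The sketch below is only an illustration: the encoding of literals as signed integers and all function names are assumptions made for this example, and it works on truth assignments directly rather than on tree patterns.

```python
from itertools import product


def falsifies(assignment, clause):
    """True when every literal in the clause is false under the assignment.

    A literal is a nonzero int: +i stands for y_i and -i for its negation.
    """
    return all(assignment[abs(lit)] != (lit > 0) for lit in clause)


def assignments(n):
    """All truth assignments to y_1..y_n, as dicts {index: bool}."""
    for bits in product([False, True], repeat=n):
        yield dict(enumerate(bits, start=1))


def every_assignment_falsifies_some_clause(clauses, n):
    """Third line of the chain: each assignment makes some clause false."""
    return all(any(falsifies(a, c) for c in clauses) for a in assignments(n))


def satisfiable(clauses, n):
    """Some assignment makes no clause false."""
    return any(not any(falsifies(a, c) for c in clauses) for a in assignments(n))
```

On any input the two functions return opposite answers, which is exactly the final step "every assignment falsifies some clause ⇔ ψ is not satisfiable".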


co-NP hardness of containment (continued)

For t ∈ mod_z(Y_i), if t consists only of a_i followed by b, it corresponds to a truth assignment making y_i true. If t contains one or more added nodes between a_i and b, it corresponds to a truth assignment making y_i false.

We define a tree pattern C_i for each clause of ψ by an example. For c_i = (¬y_j ∨ y_k ∨ ¬y_l):

[Figure: the variable subpattern Y_i and the test patterns T(y_i) and F(y_i), built from nodes a_i, b and *; tree pattern A, with one subpattern per variable; tree pattern C_i, with root r and subpatterns T(y_j), F(y_k), T(y_l).]
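Property (*) can be checked on this encoding with a few lines of code. This is a sketch under assumptions, not the construction from the slides: a canonical model of A is abstracted to the number of nodes added between a_i and b for each variable (the "gap"), T(y_i) is assumed to be a_i/b and F(y_i) is assumed to be a_i/*//b, and literals are written as signed integers. All function names are hypothetical.

```python
def matches_T(gap):
    """T(y_i), assumed to be a_i/b: b must be a direct child of a_i."""
    return gap == 0


def matches_F(gap):
    """F(y_i), assumed to be a_i/*//b: at least one node sits between a_i and b."""
    return gap >= 1


def clause_pattern_holds(clause, gaps):
    """C_i holds iff the subpattern chosen for every literal matches.

    A positive literal y_j contributes F(y_j); a negated literal contributes T(y_j).
    `gaps` maps each variable index to the number of nodes added between a_i and b.
    """
    return all((matches_F if lit > 0 else matches_T)(gaps[abs(lit)]) for lit in clause)


def clause_is_false(clause, gaps):
    """Evaluate the clause under the assignment encoded by the gaps (gap 0 means true)."""
    truth = {i: gap == 0 for i, gap in gaps.items()}
    return all(truth[abs(lit)] != (lit > 0) for lit in clause)
```

With the clause (-1, 2, -3), standing for (¬y_1 ∨ y_2 ∨ ¬y_3), the two functions agree on every choice of gaps, which is what property (*) requires.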


co-NP hardness of containment (continued)

In case of some arbitrary bounds on the number of occurrences of //, or *, or []:

• For //: the containment problem p ⊆ p’ remains in PTIME if we bound the number of // edges by some d ≥ 0. We have shown that at the beginning of the lecture, when we worked on bounded canonical models.
• For *: the containment problem p ⊆ p’ remains co-NP hard even if we allow at most two *. Won’t be proved now.
• For []: the containment problem p ⊆ p’ remains co-NP hard even if we allow at most five [] in p and at most three [] in p’. Won’t be proved now.
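The // case comes down to counting. The sketch below assumes, for illustration only, that each of the d descendant edges is expanded with between 0 and `bound` added z-nodes, where `bound` is a parameter standing in for the limit used by the bounded canonical models; the function names are hypothetical.

```python
from itertools import product


def bounded_models(d, bound):
    """One tuple per bounded canonical model.

    Entry j is the number of z-nodes inserted on the j-th // edge, from 0 to `bound`.
    """
    return product(range(bound + 1), repeat=d)


def count_models(d, bound):
    """Number of bounded canonical models: (bound + 1) ** d."""
    return (bound + 1) ** d
```

For a fixed d the count (bound + 1)^d is polynomial in the bound, so checking every bounded canonical model stays polynomial. When d is allowed to grow with the input, the same count becomes exponential.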


Additional topics of interest

Disjunction:
• Containment for P^{[],|} patterns is already co-NP complete.
• It can be shown that containment for P^{//,*,[],|} is also in co-NP.
• Given expressions p, p’ ∈ XP^{|}, deciding containment is co-NP hard, and so it is also co-NP hard for XP^{//,*,[],|}.

Finite alphabet:
• This article’s results do not hold for a finite alphabet whose size is not two.
• In another article (Neven & Schwentick) it is shown that, for a finite alphabet, containment is in PSPACE for P^{//,*,[],|} and PSPACE-complete for P^{[],|}.

Evaluation on graphs:
• All results in this article apply directly to an extension of Boolean patterns evaluated on graphs (in our article we deal with trees).

Application to CTL (computation tree logic):
• All co-NP completeness results in this article apply to a fragment of CTL (ECTL) as well.


Conclusion

We have studied the complexity of containment and equivalence for an important core fragment of XPath. Many XML applications benefit from a practical decision procedure for containment of such expressions.

Our results provide intuition into the factors that contribute to its high complexity. Nevertheless, we show that in some significant special cases, containment can be decided efficiently, and we provide an algorithm which does so.

One direction for future work is to expand this fragment of XPath with additional features, although it is clear that it will be even more challenging to prove efficient special cases of the problem. Another direction is to study containment of XPath expressions over sets of documents conforming to constraints or schema restrictions. Preliminary work shows that sufficiently expressive constraints make this problem intractable for XPath fragments that otherwise have efficient containment problems.

THE END!