Query Containment for Conjunctive Queries With Regular Expressions

Daniela Florescu, Alon Levy, Dan Suciu. PODS 1998. Slides by Gala Yadgar.


Page 1: Query Containment for Conjunctive Queries With Regular Expressions

Query Containment for Conjunctive Queries With Regular Expressions

Daniela Florescu, Alon Levy, Dan Suciu. PODS 1998

Slides by Gala Yadgar.

Page 2: Query Containment for Conjunctive Queries With Regular Expressions

Outline

- Semi-structured data and conjunctive queries
- Query containment for different query classes
- StruQL0 and the data model for it
- Substitutions and canonical databases
- Semantic criteria for query containment
- Query mappings
- Syntactic criteria for query containment
- Containment for simple StruQL0 queries

Page 3: Query Containment for Conjunctive Queries With Regular Expressions

Semi-Structured Data

Data is irregular:
- Attributes may be missing
- The type and cardinality of an attribute may not be known
- The set of attributes may not be known in advance
- The schema is unknown in advance

This is an example of a data model where graphs represent databases:

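[Figure: a family graph whose nodes are Abraham, Isaac, Jacob, Rachel, Ishmael, Leah, Rebecca, and Hagar, with edges labeled son, wife, o. brother (older brother), and y. brother (younger brother).]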

Page 4: Query Containment for Conjunctive Queries With Regular Expressions

Languages

Relational calculus: Datalog
  Ancestor(X,Y) :- Father(X,Y)
  Ancestor(X,Y) :- Ancestor(X,Z), Ancestor(Z,Y)
- Notice we have union and recursion. Negation is also possible.

Conjunctive queries:
  Brother(X,Y) :- Son(Z,X), Son(Z,Y)
- No union (one rule only), no recursion, no negation

StruQL: runs on graphs, and the result is a graph. Query Q:
  where Person{X}, X ("paper"|"publication") Y
  Collect Page{PersonPage(X), PaperPage(Y)}
  Link RootPage() "person" PersonPage(X),
       PersonPage(X) "paper" PaperPage(Y)
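The two Ancestor rules on this slide can be evaluated bottom-up by applying them repeatedly until no new fact appears. The following is a minimal Python sketch of that naive fixpoint; the function name `ancestors` and the two `Father` facts are illustrative assumptions and do not come from the slides.

```python
def ancestors(father):
    """Naive bottom-up evaluation of the two Ancestor rules."""
    # Ancestor(X,Y) :- Father(X,Y)
    ancestor = set(father)
    while True:
        # Ancestor(X,Y) :- Ancestor(X,Z), Ancestor(Z,Y)
        derived = {(x, y)
                   for (x, z) in ancestor
                   for (z2, y) in ancestor
                   if z == z2}
        if derived <= ancestor:      # fixpoint reached: nothing new was derived
            return ancestor
        ancestor |= derived


# Hypothetical facts, chosen only to exercise the recursion.
father = {("Abraham", "Isaac"), ("Isaac", "Jacob")}
result = ancestors(father)
```

The loop is where recursion shows up: a conjunctive query corresponds to a single rule applied once, with no loop at all.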

Page 5: Query Containment for Conjunctive Queries With Regular Expressions

Query Containment

Find out whether the results of one query are contained in the results of another query
- For all databases
- Formal definition will be given shortly

Good for:
- Finding redundant subgoals in a query
- Testing whether two formulations of a query are equivalent
- Determining independence of database updates
- Rewriting queries using views

Page 6: Query Containment for Conjunctive Queries With Regular Expressions

Known Results

Query containment for first-order conjunctive queries is decidable (and NP-complete):
  Brother(X,Y) :- Son(Z,X), Son(Z,Y)
  OlderBrother(X,Y) :- Son(Z,X), Son(Z,Y), Older(X,Y)

Queries in StruQL can be translated into datalog:
  where Person{X}, X ("paper"|"publication") Y
  Collect Page{PersonPage(X), PaperPage(Y)}
  Link RootPage() "person" PersonPage(X),
       PersonPage(X) "paper" PaperPage(Y)

  PaperPage(Y) :- Person(X), WrotePaper(X,Y)
  PersonPage(X) :- Person(X), WrotePaper(X,Y)

Containment of datalog programs is undecidable
- All positive results for containment so far are restricted to the case when one of the programs is non-recursive
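A standard way to decide containment of plain conjunctive queries is to search for a containment mapping: a mapping from the variables of the containing query to those of the contained query that preserves the head and sends every body atom to a body atom. The brute-force Python sketch below applies this to the Brother and OlderBrother queries. The tuple encoding of queries and the name `contained_in` are assumptions made for illustration, and the sketch assumes queries without constants.

```python
from itertools import product


def contained_in(q1, q2):
    """Return True if q1 is contained in q2.

    A query is (head_vars, body), where body is a list of
    (predicate, argument_tuple) atoms over variables only.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    vars1 = sorted(set(head1) | {v for _, args in body1 for v in args})
    vars2 = sorted(set(head2) | {v for _, args in body2 for v in args})
    atoms1 = set(body1)
    # Try every mapping from the variables of q2 to the variables of q1.
    for image in product(vars1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        if tuple(h[v] for v in head2) != tuple(head1):
            continue                      # the head must be preserved
        if all((pred, tuple(h[v] for v in args)) in atoms1
               for pred, args in body2):
            return True                   # every atom of q2 lands in q1
    return False


brother = (("X", "Y"), [("Son", ("Z", "X")), ("Son", ("Z", "Y"))])
older_brother = (("X", "Y"), [("Son", ("Z", "X")), ("Son", ("Z", "Y")),
                              ("Older", ("X", "Y"))])
```

Here `contained_in(older_brother, brother)` holds through the identity mapping, while the converse fails because the `Older` atom has no image in the body of `brother`. The exhaustive search over mappings is exponential, in line with the NP-completeness stated on the slide.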

Page 7: Query Containment for Conjunctive Queries With Regular Expressions

New Results

Define StruQL0 as a subset of StruQL
- Leaving out restructuring capabilities
- Similar to conjunctive queries for relational calculus

Give semantic and syntactic criteria for query containment
- StruQL0 identifies a subset of datalog for which containment is decidable

Show that query containment for a fragment of StruQL0 is NP-complete

Page 8: Query Containment for Conjunctive Queries With Regular Expressions

The Data Model

Labeled directed graphs
- Nodes correspond to objects
- Labels on the edges correspond to attributes

Formally:
- A universe of constants D
- A universe of object identifiers I (I ∩ D = ∅)
- A database DB is a pair (V,E) with V ⊆ I and E ⊆ V × D × V

In the example:
- D = {a,b,c,d}
- V = I = {u1,u2,u3,u4,u5,u6}
- E = {(u1,c,u6), (u1,a,u5),…}

[Figure: the example database, a directed graph on nodes u1 to u6 with edges labeled a, b, c, and d.]
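The formal definition translates directly into sets of labels, identifiers, and labeled edges. The Python sketch below checks the conditions I ∩ D = ∅, V ⊆ I, and E ⊆ V × D × V. The function name `is_database` is hypothetical, and only the two edges that the slide spells out are included; the edges elided by "…" are not guessed.

```python
def is_database(V, E, I, D):
    """Check that (V, E) is a database over identifiers I and constants D."""
    return (not (I & D)                       # I and D are disjoint
            and V <= I                        # V is a subset of I
            and all(x in V and label in D and y in V
                    for (x, label, y) in E))  # E is a subset of V x D x V


D = {"a", "b", "c", "d"}
I = {"u1", "u2", "u3", "u4", "u5", "u6"}
V = set(I)
# Only the edges listed explicitly on the slide; the rest are elided there.
E = {("u1", "c", "u6"), ("u1", "a", "u5")}

ok = is_database(V, E, I, D)
```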

Page 9: Query Containment for Conjunctive Queries With Regular Expressions

A StruQL0 Query

Queries are allowed to include regular path expressions over the attributes
- They give the ability to deal with lack of schema
- R := ε | a | _ | L | (R1.R2) | (R1|R2) | R*

Q1 : q1(X,Z) :– X L+ Z, Y a Z, X (a+|(a.b*)) Z

- The relation RQ(X,Y,Z,L) has arity 4
- RQ contains 4 tuples: {(u1,u1,u5,c),(u1,u1,u5,a),(u2,u1,u5,d),(u2,u2,u3,a)}
- Q(DB) is the projection of RQ on X and Z: {(u1,u5),(u2,u5),(u2,u3)}

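[Figure: the query Q1 annotated with the labels R1, R2, and R3, next to the example database graph on nodes u1 to u6 with edges labeled a, b, c, and d.]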
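The step from RQ to Q(DB) is an ordinary projection with duplicate elimination. A minimal Python sketch, using the four tuples given on this slide; the helper name `project` is an assumption.

```python
# Tuples of RQ(X, Y, Z, L) as listed on the slide.
RQ = {
    ("u1", "u1", "u5", "c"),
    ("u1", "u1", "u5", "a"),
    ("u2", "u1", "u5", "d"),
    ("u2", "u2", "u3", "a"),
}


def project(relation, positions):
    """Keep only the given column positions; the set removes duplicates."""
    return {tuple(row[i] for i in positions) for row in relation}


# X is column 0 and Z is column 2.
answer = project(RQ, (0, 2))
```

Two of the four tuples agree on X and Z, which is why the answer has three tuples rather than four.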

Page 10: Query Containment for Conjunctive Queries With Regular Expressions

A StruQL0 Query

Q1 : q1(X,Z) :– X L+ Z, Y a Z, X (a+|(a.b*)) Z

Formally:
- Regular variables range over the nodes in the graph. Denoted by capital letters
- Arc variables range over the labels of edges in the graph. Denoted by L or Li
- A regular path expression is defined by the grammar:
  R := ε | a | _ | L | (R1.R2) | (R1|R2) | R*
  - ε is the empty string
  - a is a label constant
  - _ denotes any label
  - L is a label variable
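The grammar can be read as a recipe for computing, for each expression, the set of node pairs joined by a matching path. The Python sketch below does this by structural recursion over a small tuple-based syntax tree. The tree encoding, the names `compose` and `eval_path`, and the three-edge graph are all assumptions made for illustration; arc variables are resolved through an explicit `binding` from variable names to labels.

```python
def compose(r, s):
    """Pairs (x, z) such that (x, y) is in r and (y, z) is in s."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}


def eval_path(expr, nodes, edges, binding=None):
    """Return all (source, target) pairs joined by a path matching expr."""
    binding = binding or {}
    kind = expr[0]
    if kind == "eps":                                   # empty string
        return {(v, v) for v in nodes}
    if kind == "const":                                 # label constant a
        return {(x, y) for (x, label, y) in edges if label == expr[1]}
    if kind == "any":                                   # the wildcard _
        return {(x, y) for (x, _, y) in edges}
    if kind == "var":                                   # arc variable L
        return {(x, y) for (x, label, y) in edges
                if label == binding[expr[1]]}
    if kind == "cat":                                   # (R1.R2)
        return compose(eval_path(expr[1], nodes, edges, binding),
                       eval_path(expr[2], nodes, edges, binding))
    if kind == "alt":                                   # (R1|R2)
        return (eval_path(expr[1], nodes, edges, binding)
                | eval_path(expr[2], nodes, edges, binding))
    if kind == "star":                                  # R*
        step = eval_path(expr[1], nodes, edges, binding)
        result = {(v, v) for v in nodes}
        while True:
            bigger = result | compose(result, step)
            if bigger == result:
                return result
            result = bigger
    raise ValueError(f"unknown expression kind: {kind!r}")


# Hypothetical graph, not the one drawn on the slides.
nodes = {"u1", "u2", "u3", "u4"}
edges = {("u1", "a", "u2"), ("u2", "a", "u3"), ("u2", "b", "u4")}

a, b = ("const", "a"), ("const", "b")
a_plus = ("cat", a, ("star", a))                        # a+
expr = ("alt", a_plus, ("cat", a, ("star", b)))         # (a+|(a.b*))
pairs = eval_path(expr, nodes, edges)
```

With the binding `{"L": "a"}`, the expression L+ evaluates to the same pairs as a+, which reflects that an arc variable stands for one label throughout a substitution.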

Page 11: Query Containment for Conjunctive Queries With Regular Expressions

A StruQL0 Query - Components

Q : q(X) :– Y1R1Z1,…, YnRnZn

- nvar(Q) ≡ {Y1,…,Yn,Z1,…,Zn} (node variables). Need not be distinct
- Regular path expressions: {R1,…,Rn}
- avar(Q) ≡ the set of arc variables occurring in R1,…,Rn
- var(Q) ≡ nvar(Q) ∪ avar(Q)
- X ⊆ nvar(Q) (head variables)
- Atoms(Q) ≡ the set of constants occurring in R1,…,Rn
- YiRiZi, i=1,…,n, are conjuncts

Page 12: Query Containment for Conjunctive Queries With Regular Expressions

A StruQL0 Query - Semantics

Q : q(X) :– Y1R1Z1,…, YnRnZn

Semantics: a substitution is a function φ : var(Q) → I ∪ D
- Node variables are mapped to I
- Arc variables are mapped to D
- Denoted φ : Q → DB
- φ(YiRiZi) is the path in DB corresponding to the conjunct (YiRiZi)

Each substitution defines a tuple in the relation RQ

The answer to Q is the projection of RQ on the variables in X

The result of applying Q to a database is Q(DB)

Page 13: Query Containment for Conjunctive Queries With Regular Expressions

A StruQL0 Query

Notice the advantages for semi-structured data:
- Regular path expressions
- Arc variables

For example:

Q2 : q2(X,Y) :– X L Y
- Query for first-degree relatives
- L can be older brother, younger brother, son, wife, and maybe more (first wife? ex-wife?)

Q3 : q3(X,Y) :– X ("son"|"daughter")+ (ε|L) Y
- Query for descendants and their relatives

[Figure: the family graph with nodes Abraham, Isaac, Jacob, Rachel, Ishmael, and Leah, and edges labeled son, wife, o. brother, and y. brother.]

Page 14: Query Containment for Conjunctive Queries With Regular Expressions

Containment

A query Q1 is contained in a query Q2, written Q1 ⊆ Q2, if Q1(DB) ⊆ Q2(DB) for all databases DB

The queries Q1 and Q2 are equivalent, written Q1≡Q2, if Q1 ⊆ Q2 and Q2 ⊆ Q1

Example:

Q1 : q1(X,Z) :– X L+ Z, Y a Z, X (a+|(a.b*)) Z

Q2 : q2(X,Z) :– X a+ Z

Q1(DB) = {(u1,u5),(u2,u5),(u2,u3)}

Q2(DB) = {(u1,u5),(u2,u3)}

[Figure: the example database graph on nodes u1 to u6 with edges labeled a, b, c, and d.]
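Because containment quantifies over all databases, a single database can refute it but never establish it. The short Python sketch below compares the two answer sets given on this slide; the variable names are assumptions.

```python
# Answer sets for the example database, as given on the slide.
q1_db = {("u1", "u5"), ("u2", "u5"), ("u2", "u3")}
q2_db = {("u1", "u5"), ("u2", "u3")}

# Tuples returned by Q1 but not by Q2 on this database.
witnesses = q1_db - q2_db

q1_in_q2_here = q1_db <= q2_db   # False: this database refutes Q1 ⊆ Q2
q2_in_q1_here = q2_db <= q1_db   # True here, which alone proves nothing
```

The tuple (u2,u5) is a counterexample to Q1 ⊆ Q2. The inclusion Q2(DB) ⊆ Q1(DB) on this one database is consistent with Q2 ⊆ Q1 but does not prove it.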

Page 15: Query Containment for Conjunctive Queries With Regular Expressions

Canonical Databases - Intuition

A canonical database for Q is a pair (DB,ξ)
- ξ is a substitution
- A bifurcation node for each node variable
- A corresponding internal path for each conjunct

Q: q(X1,X2) :– X1 (a.L.(_)*) X2, X2 (b.c)* Y, X2 (a|L)* Z, Y (c|d) X1

How many canonical databases for a query?

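[Figure: a canonical database for Q. The bifurcation nodes X1, X2, Y, and Z are joined by internal paths through internal nodes, with edges labeled a, b, c, d, e, f, and L. Callouts mark a bifurcation node, an internal node, and an internal path.]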

Page 16: Query Containment for Conjunctive Queries With Regular Expressions

Canonical Databases – Formal definition

Q: q(X1,X2) :– X1 (a.L.(_)*) X2, X2 (b.c)* Y, X2 (a|L)* Z, Y (c|d) X1

- Each internal node belongs to one internal path, with one outgoing and one incoming edge
- The mapping of node variables to bifurcation nodes is surjective
- Each arc variable L is mapped to itself
- For each conjunct YiRiZi, the path ξ(YiRiZi) is internal and the mapping is one to one

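[Figure: a canonical database for Q. The bifurcation nodes X1, X2, Y, and Z are joined by internal paths through internal nodes, with edges labeled a, b, c, d, e, f, and L. Callouts mark a bifurcation node, an internal node, and an internal path.]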

Page 17: Query Containment for Conjunctive Queries With Regular Expressions

Semantic Criteria for Query Containment

Query Q has head variables X1,…,Xn and canonical database (DB, ξ)
- (ξ(X1),…,ξ(Xn)) is the canonical tuple

Proposition 1. Given two queries Q, Q’: Q ⊆ Q’ if and only if, for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’

Page 18: Query Containment for Conjunctive Queries With Regular Expressions

Proposition 1 Proof

If Q is contained in Q’ then for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’

1. StruQL0 queries are generic: if Q is contained in Q’ for databases over the universe D, then it is also contained in Q’ for databases over D’, where D’ ≡ D ∪ avar(Q)

2. (DB, ξ) contains constants in D, with the addition of the arc variables of Q, so all of its labels are in D’

3. If Q is contained in Q’ then its canonical tuple for each DB over D is contained in Q’(DB)

4. According to 1, the canonical tuple of Q is contained in Q’(DB’) over D’

Page 19: Query Containment for Conjunctive Queries With Regular Expressions

Proposition 1 Proof

If Q is contained in Q’ then for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’

Example:

Q1 : q1(X,Z) :– X L+ Z, Y a Z, X (a+|(a.b*)) Z

Q2 : q2(X,Z) :– X a+ Z

Q1(DB) = {(u1,u5),(u2,u5),(u2,u3)}

Q2(DB) = {(u1,u5),(u2,u3)}

D’ ≡ D ∪ avar(Q) = {a,b,c,d,L}

Canonical database for Q2:

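[Figure: the example database graph on nodes u1 to u6 (edges labeled a, b, c, and d) and a canonical database over the nodes X and Z with edge labels a and L, annotated with the tuple (X, Z) and Q1(DB).]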

Page 20: Query Containment for Conjunctive Queries With Regular Expressions

Proposition 1 Proof

If for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’, then Q is contained in Q’

- Assume the contrary: Q is not contained in Q’
- There exists some database DB and some tuple of nodes and/or label constants u=(u1,…,uk) in DB, such that u is in Q(DB) but not in Q’(DB)
- We will construct a canonical database which will contradict the assumption

Page 21: Query Containment for Conjunctive Queries With Regular Expressions

Proposition 1 Proof

- There exists a substitution φ : Q → DB so that φ(X)=u
- We construct (DB0,ξ)
  - The bifurcation nodes are {φ(X) | X is in nvar(Q)}
  - Define ξ(X) = φ(X) for all X in nvar(Q)
  - So the mapping of node variables is the same in both databases
- For each conjunct YRZ we consider the path φ(YRZ) in DB
  - This path is not necessarily simple
  - It may contain bifurcation nodes
  - This is because DB is not canonical

Example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L.L.L X3

φ(X1,X2,X3,L)=(A,B,C,a)

Bifurcation nodes in DB0: A,B,C

[Figure: the database DB on nodes A, B, and C, with three edges labeled a and one edge labeled b.]

Page 22: Query Containment for Conjunctive Queries With Regular Expressions

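Proposition 1 Proof

- Introduce a fresh internal node for every occurrence of a node on the path φ(YRZ)
- This results in a simple path

In the example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L.L.L X3

[Figure: the graph on nodes A, B, and C before unfolding (three edges labeled a, one labeled b) and after unfolding (four edges labeled a, one labeled b).]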

Page 23: Query Containment for Conjunctive Queries With Regular Expressions

Proposition 1 Proof

Now replace some labels:
- Let A be some non-deterministic automaton equivalent to R, where arc variables are viewed as constants
- By definition, the labels on ξ(YRZ) are accepted by A
- Replace each label that causes an arc-variable transition in the run of A on ξ(YRZ) with the corresponding arc variable L

In the example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L.L.L X3

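[Figure: the graph on nodes A, B, and C after relabeling (one edge labeled a, one labeled b, three labeled L) and before relabeling (four edges labeled a, one labeled b).]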
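The two construction steps, unfolding each conjunct's path into a simple internal path and then restoring arc variables as labels, can be sketched together in Python. The function name `build_canonical_edges`, the input encoding, and the internal node names `n1`, `n2`, … are assumptions made for illustration. The sketch also assumes every conjunct matched a non-empty path, and it only needs the labels read along φ(YRZ), not the intermediate nodes of DB.

```python
from itertools import count


def build_canonical_edges(conjunct_paths):
    """Build the edges of DB0.

    conjunct_paths is a list of (start, steps, end), one entry per conjunct.
    steps is a non-empty list of (label, arc_variable_or_None) pairs read
    along the path in DB; start and end are the bifurcation nodes.
    """
    fresh = count(1)
    edges = []
    for start, steps, end in conjunct_paths:
        current = start
        for i, (label, arc_var) in enumerate(steps):
            is_last = i == len(steps) - 1
            # Every intermediate position gets its own fresh internal node.
            target = end if is_last else f"n{next(fresh)}"
            # A label matched by an arc variable is replaced by that variable.
            edges.append((current, arc_var if arc_var is not None else label, target))
            current = target
    return edges


# The example from the slides: phi(X1, X2, X3, L) = (A, B, C, a).
paths = [
    ("A", [("a", None)], "B"),        # X1 a X2
    ("B", [("b", None)], "C"),        # X2 b X3
    ("A", [("a", "L")] * 3, "C"),     # X1 L.L.L X3, where L was bound to a
]
db0 = build_canonical_edges(paths)
```

The result has one a edge, one b edge, and three L edges, which matches the label counts in the relabeled graph described above.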

Page 24: Query Containment for Conjunctive Queries With Regular Expressions

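Proposition 1 Proof

Example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L.L.L X3

φ(X1,X2,X3,L)=(A,B,C,a)

Bifurcation nodes in DB0: A,B,C

[Figure: DB and DB0 side by side on nodes A, B, and C. One graph has edges labeled a, a, a, and b; the other has edges labeled a, b, and three labeled L. Arrows show the substitution φ’ : Q’ → DB0 and the graph morphism ψ : DB0 → DB.]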

Page 25: Query Containment for Conjunctive Queries With Regular Expressions

Proposition 1 Proof

- DB0 is a canonical database
- We have a graph morphism ψ : DB0 → DB
  - Bifurcation nodes are sent to themselves
  - Internal nodes are sent to their originating nodes
- We assumed that Q is not contained in Q’ even though the canonical tuple for (DB0,ξ) is in the answer Q’(DB0)
- So we must have a substitution φ’ : Q’ → DB0
- Compose φ’ with ψ and get a substitution ψ ∘ φ’ : Q’ → DB0 → DB
- This implies that u is in the answer of Q’ too, contradicting the assumption □

Page 26: Query Containment for Conjunctive Queries With Regular Expressions

Decidability of containment

We still have an infinite number of canonical databases:
- The internal paths can be of any length, e.g. Q: q(X,Y) :- X L* Y
- The number of substitutions can be infinite, e.g. Q: q(X,Y) :- X _ Y

It is sufficient to examine only databases whose internal paths are no longer than some N which depends only on Q and Q’

Only a set of n × N constants is sufficient, with N from above and n the number of conjuncts in Q. (Only the constants in DQ,Q’ ∪ avar(Q))

The resulting algorithm for containment uses triple exponential space
- But it shows decidability

Page 27: Query Containment for Conjunctive Queries With Regular Expressions

Path Length is bounded

- Remember Ai is the non-deterministic automaton equivalent to Ri, for each conjunct YiRiZi
- The path between ξ(Y) and ξ(Z) represents a run of Ai
- Its length is bounded by N = |nvar(Q)| × |states(Ai)| + 2
- If a variable appears in the path ξ(YiRiZi) more than |states(Ai)| times, it can be cut short, and still satisfy Ri

Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L+ X3

- We must check all the runs of automata in Q’ on paths in the canonical DB of Q
- Proof in Appendix A of the full version of the paper

[Figure: a canonical database on nodes A, B, and C with one edge labeled a, one labeled b, and three labeled L.]
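As a worked instance of the bound N = |nvar(Q)| × |states(Ai)| + 2, the Python sketch below plugs in the example query, which has three node variables (X1, X2, X3). The two-state automaton for L+ is an assumption, since the slides do not fix a particular automaton, and the function name `path_length_bound` is hypothetical.

```python
def path_length_bound(num_node_vars, num_states):
    """N = |nvar(Q)| * |states(Ai)| + 2, as stated on the slide."""
    return num_node_vars * num_states + 2


# nvar(Q) = {X1, X2, X3}; assume a 2-state automaton for the conjunct X1 L+ X3.
bound = path_length_bound(num_node_vars=3, num_states=2)
```

Under these assumptions the bound is 3 × 2 + 2 = 8, so only internal paths of length at most 8 need to be examined for that conjunct.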

Page 28: Query Containment for Conjunctive Queries With Regular Expressions

Containment by Query mapping

- A query mapping f : Q’ → DB sends conjuncts in Q’ to some path in the canonical database of Q
  - There exist only finitely many mappings
  - They can be encoded in polynomial space
- A query mapping f : Q’ → Q can ‘cover’ a canonical database DB for Q
- Q ⊆ Q’ iff all query mappings together cover all canonical databases
  - All canonical DBs for a query can be described in a regular language WQ
  - For each mapping f, there is a regular expression for all databases covered by it, Wf
  - Q ⊆ Q’ iff WQ ⊆ ∪f Wf
  - Exponential space

[Figure: a query mapping f sends a conjunct Y’ A’ Z’ of Q’ to a path of points p1, p2, p3, …, pn-1, pn in Q.]

Page 29: Query Containment for Conjunctive Queries With Regular Expressions

Simple StruQL0 queries

- The regular expressions Ri in Q are of the form r1.r2. … .rn, where each ri is either * or a label constant
- Examples:
  - a.*.b.* and *.*.a.a.* are simple regular expressions
  - a*.b and _._ are not simple regular expressions
- Given two regular expressions, their containment can be checked in polynomial space
- The containment problem of two simple queries is NP-complete
  - By reduction to conjunctive queries
- First subset including recursion for which deciding containment is no harder than for conjunctive queries
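To make the shape of simple regular expressions concrete, the Python sketch below matches a sequence of labels against one. It reads the standalone `*` as a wildcard that absorbs any sequence of labels, including the empty one; that reading is an assumption about the slide's notation, and the list encoding and the name `matches` are hypothetical. It tests membership of a single word, not containment between two expressions.

```python
def matches(simple, word):
    """Return True if the label sequence word matches the simple expression.

    simple is a list whose items are label constants or the wildcard "*".
    """
    # reachable[i] is True when the first i labels of word can be consumed
    # by the components of simple processed so far.
    reachable = [True] + [False] * len(word)
    for component in simple:
        if component == "*":
            # A wildcard extends every reachable position to all later ones.
            seen = False
            advanced = []
            for ok in reachable:
                seen = seen or ok
                advanced.append(seen)
        else:
            # A label constant consumes exactly one matching label.
            advanced = [False] + [reachable[i] and word[i] == component
                                  for i in range(len(word))]
        reachable = advanced
    return reachable[len(word)]


first = ["a", "*", "b", "*"]          # a.*.b.*
second = ["*", "*", "a", "a", "*"]    # *.*.a.a.*
```

The scan runs in time proportional to the product of the two lengths, which fits the intuition that simple expressions are much easier to handle than general ones.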

Page 30: Query Containment for Conjunctive Queries With Regular Expressions

Summary

- StruQL0: conjunctive queries with regular expressions
- Canonical databases: semantic criteria for query containment
  - Containment is decidable
  - But in triple exponential space
- Query mappings: syntactic criteria for query containment
  - Exponential space
- Simple StruQL0 queries: a subset for which containment is NP-complete

Page 31: Query Containment for Conjunctive Queries With Regular Expressions

Backup slides:

Containment by Query Mapping

We will show that Q is contained in Q’ iff a certain condition holds on all query mappings from Q’ to Q

- Q is a query with n conjuncts: Q: q(X) :- Y1R1Z1,…, YnRnZn
- nvar(Q) = {Y1,Z1,…,Yn,Zn}
- Ai is a fixed non-deterministic automaton for each regular expression Ri
- A point in Q is either
  - A node variable (variable-point)
  - A pair (Ai,s) where s is a state in Ai (automaton-point)
- points(Q) is the set of points in Q

Page 32: Query Containment for Conjunctive Queries With Regular Expressions


Canonical DB and query points

Nodes in a canonical database DB for Q correspond to points in Q

Several internal nodes in DB may correspond to the same automaton-point

Bifurcation nodes in DB correspond both to variable-points and to automaton points (Ai,s) where s is an initial or terminal state

Page 33: Query Containment for Conjunctive Queries With Regular Expressions

Path in a Query

Given a query Q, a path of points in Q is a sequence p1,…,pn, n≥2
- p2,…,pn-1 are all variable-points (p1 and pn can be automaton-points)
- Any two adjacent points are connected in Q:
  - If pj, pj+1 are variable-points, there is a conjunct YiRiZi in Q with pj=Yi and pj+1=Zi
  - If p1 is an automaton-point (p2 is a variable-point), there exists a conjunct YiRiZi in Q so that Ai is the automaton associated with Ri, and p2 = Zi
  - If pn is an automaton-point (pn-1 is a variable-point), there exists a conjunct YiRiZi in Q so that Ai is the automaton associated with Ri, and pn-1 = Yi
  - If n=2 and both p1 and p2 are automaton-points, they refer to the same automaton

Page 34: Query Containment for Conjunctive Queries With Regular Expressions


Canonical DB and query path

Let U = u1,u2,…,um be a path in a canonical database DB for Q.

Suppose we drop all internal nodes from u2,…,um-1

Let u1=ui1,ui2,…,uin-1,uin=um be the resulting subsequence

We say that U corresponds to the path of points p1,…,pn iff each uik corresponds to pk, for k=1,…,n

Paths of points rephrase paths in canonical databases

Page 35: Query Containment for Conjunctive Queries With Regular Expressions

Query mapping

- Consider some other query Q’
- Ai’ is a nondeterministic automaton for each Ri’ in Q’
- Let X, X’ be head variables in Q, Q’ respectively

A query mapping f : Q’ → Q consists of:

1. Two mappings, f : nvar(Q’) → points(Q) and f : avar(Q’) → DQ,Q’ ∪ avar(Q), so that f(X’) = X

2. A mapping from conjuncts Yi’Ri’Zi’ in Q’ to paths of points in Q, f(Yi’Ri’Zi’) = p1,…,pn, so that n ≤ |nvar(Q)| × |states(Ai)| + 2, f(Yi’) = p1, and f(Zi’) = pn

3. For each conjunct YiRiZi in Q, a total preorder on those variables Z’ in nvar(Q’) for which f(Z’) is an automaton-point corresponding to Ai
   - Whenever X’≤Y’ and Y’≤X’ then f(X’)=f(Y’)

Page 36: Query Containment for Conjunctive Queries With Regular Expressions

Query mapping

For some canonical database (DB,ξ), a substitution φ : Q’ → DB is canonical if φ(X’) is the canonical tuple in DB

Condition 1:
- A substitution now sends conjuncts in Q’ to some path in the canonical database, and not variables to nodes and arc variables to arcs

[Figure: a query mapping f sends a conjunct Y’ A’ Z’ of Q’ to a path of points p1, p2, p3, …, pn-1, pn in Q.]

Page 37: Query Containment for Conjunctive Queries With Regular Expressions

Path Length is bounded

Condition 2:
- The path of points p1,…,pn may have cycles
- Its length is bounded by |nvar(Q)| × |states(Ai)| + 2
- If a variable appears in the path f(Y’R’Z’) more than |states(Ai)| times, it can be cut short, and still satisfy R’

[Figure: a query mapping f sends a conjunct Y’ A’ Z’ of Q’ to a path of points p1, p2, p3, …, pn-1, pn in Q.]

Page 38: Query Containment for Conjunctive Queries With Regular Expressions

Preorder

Condition 3: The preorder defines:
- Equivalence classes on the variables (X’≤Y’ and Y’≤X’ ⟹ X’≡Y’)
- A total order on the equivalence classes

The query mapping imposes such an order on all variables sent by f to points on the same automaton (A,s1), (A,s2), (A,s3)…
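A total preorder can be pictured as assigning each variable a position, with ties allowed. The Python sketch below recovers the ordered equivalence classes from such an assignment. Representing the preorder by numeric positions, the name `ordered_classes`, and the sample variables are assumptions made for illustration and are not the encoding used in the paper.

```python
from itertools import groupby


def ordered_classes(variables, position):
    """Group variables into equivalence classes, listed in increasing order.

    position maps each variable to a number; X <= Y in the preorder
    exactly when position[X] <= position[Y].
    """
    ranked = sorted(variables, key=position.get)
    return [sorted(group) for _, group in groupby(ranked, key=position.get)]


# Hypothetical variables of Q' mapped onto the same automaton.
position = {"X'": 0, "Y'": 0, "Z'": 2}
classes = ordered_classes(position.keys(), position)
```

Variables with equal positions fall into one class (X’ ≡ Y’ here), and the classes themselves come out totally ordered.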

Page 39: Query Containment for Conjunctive Queries With Regular Expressions

Substitutions and mappings

A canonical substitution φ : Q’ → DB corresponds to a query mapping f : Q’ → Q if:

1. For each conjunct Y’R’Z’ in Q’ the path φ(Y’R’Z’) corresponds to the path of points f(Y’R’Z’)

2. For any internal path in DB corresponding to YRZ, the preorder on all variables mapped by φ onto that path coincides with the preorder given by f

- There is always a query mapping between two queries
- For given Q, Q’, there exist only finitely many mappings
- Each mapping can be encoded in polynomial space

Page 40: Query Containment for Conjunctive Queries With Regular Expressions

Containment

- A query mapping f : Q’ → Q covers a canonical database DB for Q if there is some canonical substitution φ : Q’ → DB which corresponds to f
  - Some query mappings don’t cover any canonical database
- Q is contained in Q’ iff all query mappings together cover all canonical databases
  - All canonical databases for a query can be described in a regular language WQ
  - For each mapping f, there is a regular expression for all databases covered by it, Wf
  - Q ⊆ Q’ iff WQ ⊆ ∪f Wf
  - This can be computed in exponential space

Page 41: Query Containment for Conjunctive Queries With Regular Expressions

The connection between the syntactic and semantic criteria

- Q ⊆ Q’ iff all query mappings together cover all canonical databases
- If a query mapping covers a canonical database for Q, then the canonical tuple in the database is in the answer of Q’
- This is implied by the definitions of canonical substitution, of correspondence between a mapping and a substitution, and of “covering” a database
- Both criteria (syntactic and semantic) rely on Proposition 1, but present different algorithms to check containment of two queries

Page 42: Query Containment for Conjunctive Queries With Regular Expressions

Known results for regular expressions

- Containment of regular expressions is PSPACE-complete
  - L.J. Stockmeyer and A.R. Meyer. Word problems requiring exponential time. In 5th STOC, pages 1–9. ACM, 1973.
- Containment of simple regular expressions is in PTIME
  - Tova Milo and Dan Suciu. Index structures for path expressions. In 7th ICDT, pages 277–295. Springer-Verlag, 1999.