Containment of Conjunctive Queries on Annotated Relations
TJ Green University of Pennsylvania
Symposium on Database Provenance, University of Edinburgh
May 21, 2008
Provenance and Query Optimization
• Many kinds of semiring-based provenance annotations to choose from:
  – lineage
  – why-provenance
  – minimal witness why-provenance
  – provenance polynomials
  – ...
• These seem to keep track of more/less information
• A fundamental question: how does this affect query optimization?
Conjunctive Queries on K-Relations
• Datalog-style syntax for conjunctive queries (CQs):
    Q(x,y) :- R(x,z), R(z,y)
• Semantics of applying the CQ to a K-relation $R : D \times D \to K$:
    $Q(a,b) = \sum_{z \in D} R(a,z) \cdot R(z,b)$
• The number of repetitions of an atom in the body matters
• For unions of conjunctive queries (UCQs) (equivalent to positive RA), sum over CQs:
    P(x,y) :- R(x,z), R(z,y)
    P(x,y) :- R(x,w), R(y,w)
• Semantics of the UCQ applied to R, a sum over CQs:
    $P(a,b) = \sum_{z \in D} R(a,z) \cdot R(z,b) + \sum_{w \in D} R(a,w) \cdot R(b,w)$
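A minimal sketch of the CQ semantics above, assuming a K-relation is stored as a Python dict from tuples to annotations and a semiring is passed as a small record. The names `Semiring`, `NAT`, `BOOL`, and `eval_path_query` are illustrative and not part of the talk.

```python
from collections import namedtuple

# A semiring is described by its zero, its one, and its two operations.
Semiring = namedtuple("Semiring", "zero one add mul")

NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)             # bag semantics
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)  # set semantics


def eval_path_query(R, domain, K):
    """Evaluate Q(x,y) :- R(x,z), R(z,y): Q(a,b) is the K-sum over z of R(a,z) * R(z,b)."""
    out = {}
    for a in domain:
        for b in domain:
            total = K.zero
            for z in domain:
                # Tuples absent from R carry the annotation K.zero.
                total = K.add(total, K.mul(R.get((a, z), K.zero), R.get((z, b), K.zero)))
            if total != K.zero:
                out[(a, b)] = total
    return out


domain = ["a", "b", "c"]
R_bag = {("a", "b"): 2, ("b", "c"): 3, ("a", "c"): 1}
R_set = {t: True for t in R_bag}

bag_result = eval_path_query(R_bag, domain, NAT)   # multiplicities multiply along the join
set_result = eval_path_query(R_set, domain, BOOL)  # only presence or absence survives
```

The same evaluator runs unchanged for both semirings; only the record passed as `K` differs.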
Choice of K Affects Query Optimization
K = N (bag semantics) differs from K = B (set semantics), e.g., the conjunctive queries
    Q1(x) :- R(x,y), R(x,z)
    Q2(u) :- R(u,v)
are set-equivalent, but not bag-equivalent
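The difference can be checked on a tiny instance. The sketch below assumes a bag relation is a `Counter` from tuples to multiplicities; the helper names `q1` and `q2` are illustrative. Set answers are read off as the support of the bag answers, which is valid because mapping nonzero multiplicities to true is a semiring homomorphism from N to B.

```python
from collections import Counter


def q1(R):
    """Q1(x) :- R(x,y), R(x,z) under bag semantics: multiplicities of joined tuples multiply."""
    out = Counter()
    for (x1, _y), m1 in R.items():
        for (x2, _z), m2 in R.items():
            if x1 == x2:
                out[x1] += m1 * m2
    return out


def q2(R):
    """Q2(u) :- R(u,v) under bag semantics: multiplicities of matching tuples add up."""
    out = Counter()
    for (u, _v), m in R.items():
        out[u] += m
    return out


R = Counter({("a", "b"): 1, ("a", "c"): 1})

bag_q1, bag_q2 = q1(R), q2(R)
set_q1, set_q2 = set(bag_q1), set(bag_q2)  # support of the bag answers
```

On this instance the two queries return the same set of answers but different multiplicities for `a`.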
                                              Conjunctive Queries (CQs)                  Unions of Conjunctive Queries (UCQs)
Bag Semantics Containment ($\sqsubseteq_N$)   ? ($\Pi_2^p$-hard) [Chaudhuri&Vardi 93]    undecidable [Ioannidis&Ramakrishnan 95]
Bag Semantics Equivalence ($\equiv_N$)        isomorphism ($\cong$) [CV 93]              ?
Our Contributions
• We make a systematic study of query containment and query equivalence for various provenance models
• We show that K-containment and K-equivalence of CQs and UCQs are decidable for lineage, why-provenance, and the provenance polynomials N[X], as well as a new model, B[X]
• The decision procedures are based on interesting variations of containment mappings
• We analyze the complexity in each case
Our Contributions
• As a corollary of the decidability result for N[X]-equivalence of UCQs, we also fill in a gap in the chart for bag semantics:
                                              Conjunctive Queries (CQs)                  Unions of Conjunctive Queries (UCQs)
Bag Semantics Containment ($\sqsubseteq_N$)   ? ($\Pi_2^p$-hard) [Chaudhuri&Vardi 93]    undecidable [Ioannidis&Ramakrishnan 95]
Bag Semantics Equivalence ($\equiv_N$)        isomorphism ($\cong$) [CV 93]              isomorphism ($\cong$)
K-Containment for Queries
• For a semiring K, define $a \le_K b \iff \exists c.\ a + c = b$. If $\le_K$ is a partial order, it is called the natural order, and K is said to be naturally ordered
• B, N, lineage, why-provenance, B[X], and N[X] are all naturally-ordered
• We define K-containment using the natural order:
    $Q_1 \sqsubseteq_K Q_2 \iff \forall I\ \forall t\ \ Q_1(I)(t) \le_K Q_2(I)(t)$
    $Q_1 \equiv_K Q_2 \iff \forall I\ \forall t\ \ Q_1(I)(t) = Q_2(I)(t)$
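To make the natural order concrete, here is a small sketch that tests $a \le_K b$ by searching for a witness c with a + c = b. The function `leq_natural` and the bounded candidate lists are assumptions made for illustration; the search is only exhaustive over the candidates supplied. For N this recovers the usual order on numbers, and for a powerset with union as addition it recovers set inclusion.

```python
from itertools import chain, combinations


def leq_natural(a, b, add, candidates):
    """Return True if some c among `candidates` satisfies add(a, c) == b."""
    return any(add(a, c) == b for c in candidates)


# K = N with ordinary addition; witnesses are searched in 0..9 only.
nat_leq = leq_natural(2, 5, lambda x, y: x + y, range(10))      # witness c = 3
nat_not_leq = leq_natural(5, 2, lambda x, y: x + y, range(10))  # no witness exists

# Subsets of X with union as addition (the additive structure of set-based annotations).
X = ["p", "r", "s"]
subsets = [frozenset(s) for s in chain.from_iterable(combinations(X, k) for k in range(len(X) + 1))]
union = lambda x, y: x | y

set_leq = leq_natural(frozenset("p"), frozenset("pr"), union, subsets)      # {p} is a subset of {p,r}
set_not_leq = leq_natural(frozenset("pr"), frozenset("p"), union, subsets)  # union can only grow a set
```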
A Hierarchy of Semiring Provenance (1)
• Provenance polynomials $(N[X], +, \cdot, 0, 1)$: tracks calculations abstractly; most general
    e.g., $2p^2r + 3ps + ps^3$
• Drop coefficients to get $(B[X], +, \cdot, 0, 1)$
    $p^2r + ps + ps^3$
• Drop exponents to get why-prov. $(\mathcal{P}(\mathcal{P}(X)), \cup, \Cup, \emptyset, \{\emptyset\})$
    $\{\{p,r\}, \{p,s\}\}$
• Flatten set-of-sets to get lineage $(\mathcal{P}(X), +, \cdot, \bot, \emptyset)$
    $\{p,r,s\}$
• Drop, flatten, etc. correspond to surjective semiring homomorphisms
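The three steps can be traced on the slide's example polynomial. This is a sketch under an assumed encoding: a polynomial in N[X] is a dict from monomials (tuples of (variable, exponent) pairs) to coefficients, and the function names are illustrative. The sketch does not model the lineage zero element $\bot$, so it should only be applied to nonzero annotations.

```python
# 2p^2r + 3ps + ps^3, encoded as monomial -> coefficient
poly = {
    (("p", 2), ("r", 1)): 2,
    (("p", 1), ("s", 1)): 3,
    (("p", 1), ("s", 3)): 1,
}


def drop_coefficients(poly):
    """N[X] -> B[X]: keep the set of monomials that have a nonzero coefficient."""
    return frozenset(m for m, c in poly.items() if c != 0)


def drop_exponents(bpoly):
    """B[X] -> why-provenance: each monomial becomes the set of variables it mentions."""
    return frozenset(frozenset(v for v, _e in m) for m in bpoly)


def flatten(why):
    """Why-provenance -> lineage: take the union of all witness sets."""
    return frozenset().union(*why)


bx = drop_coefficients(poly)   # p^2r + ps + ps^3
why = drop_exponents(bx)       # ps and ps^3 collapse into the same witness {p,s}
lineage = flatten(why)
```

Each step forgets information: coefficients first, then exponents, then the grouping of variables into witnesses.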
A Hierarchy of Semiring Provenance (2)
• Suppose $h : K_1 \to K_2$ is a semiring homomorphism. Then $a \le_{K_1} b$ implies $h(a) \le_{K_2} h(b)$. If h is also surjective, then $h(a) \le_{K_2} h(b)$ implies $a \le_{K_1} b$.
• Definition: $K_1 \preceq K_2$ means $P \sqsubseteq_{K_2} Q$ implies $P \sqsubseteq_{K_1} Q$
• Proposition: for any positive K, B $\preceq$ K $\preceq$ N[X]
  (All those we consider are positive.) Moreover:
• Proposition (Provenance Hierarchy):
    B $\preceq$ lineage $\preceq$ Why-Prov. $\preceq$ B[X] $\preceq$ N[X]
Containment Mappings
• A containment mapping from CQ Q to CQ P is a function $h : Vars(Q) \to Vars(P)$ such that
  – head of Q is mapped to head of P
  – every atom in body of Q is mapped to an atom in body of P
• Theorem [CM77]: For CQs P, Q we have $P \sqsubseteq_B Q$ iff there is a containment mapping from Q to P
  – e.g. Q1(x) :- R(x,y), R(x,z)    Q2(u) :- R(u,v)
  – h which sends $u \mapsto x$ and $v \mapsto y$ is a containment mapping
• Checking for existence of containment mapping is NP-complete
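A brute-force sketch of the definition, assuming a CQ is encoded as a pair (head variables, list of body atoms) with each atom a (relation, argument tuple) pair and all terms variables. The function `containment_mappings` is illustrative; it enumerates every candidate function, so its running time is exponential in the number of variables of Q, in line with the NP-completeness noted above.

```python
from itertools import product


def containment_mappings(Q, P):
    """Yield every containment mapping h : Vars(Q) -> Vars(P) as a dict."""
    q_head, q_body = Q
    p_head, p_body = P
    q_vars = sorted(set(q_head) | {v for _, args in q_body for v in args})
    p_vars = sorted(set(p_head) | {v for _, args in p_body for v in args})
    p_atoms = set(p_body)
    for image in product(p_vars, repeat=len(q_vars)):
        h = dict(zip(q_vars, image))
        # The head of Q must be mapped to the head of P.
        if tuple(h[v] for v in q_head) != tuple(p_head):
            continue
        # Every body atom of Q must land on some body atom of P.
        if all((rel, tuple(h[v] for v in args)) in p_atoms for rel, args in q_body):
            yield h


Q1 = (("x",), [("R", ("x", "y")), ("R", ("x", "z"))])
Q2 = (("u",), [("R", ("u", "v"))])

q2_to_q1 = list(containment_mappings(Q2, Q1))  # witnesses that Q1 is set-contained in Q2
q1_to_q2 = list(containment_mappings(Q1, Q2))  # witnesses that Q2 is set-contained in Q1
```

Mappings exist in both directions, which is the set-equivalence of Q1 and Q2 under the Chandra-Merlin theorem.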
Canonical Databases
• Take body of CQ, “freeze” into database instance [CM77], and tag each tuple with a “tuple id”
• We’ll denote by $\mathrm{can}_K(Q)$ the canonical database for Q with abstract tags from K
• e.g., Q(w) :- R(u,v), R(v,w)
    $\mathrm{can}_{N[X]}(Q) = \mathrm{can}_{B[X]}(Q)$ = R:
        u  v  | x1
        v  w  | x2
    $\mathrm{can}_{lin}(Q)$ = R:
        u  v  | {x1}
        v  w  | {x2}
    $\mathrm{can}_{why}(Q)$ = R:
        u  v  | {{x1}}
        v  w  | {{x2}}
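The construction can be sketched as follows, assuming a CQ body is a list of (relation, argument tuple) atoms with distinct atoms, frozen variables are kept as strings, and the tag of the i-th atom is the token `x{i}`. The function `canonical_db` and the model labels are illustrative names.

```python
def canonical_db(body, model):
    """Freeze a CQ body into a database and tag the i-th tuple with x_i in the chosen model."""
    wrap = {
        "N[X]": lambda x: x,                            # a single variable, as a polynomial
        "B[X]": lambda x: x,
        "lin": lambda x: frozenset({x}),                # {x_i}
        "why": lambda x: frozenset({frozenset({x})}),   # {{x_i}}
    }[model]
    db = {}
    for i, (rel, args) in enumerate(body, start=1):
        # Assumes body atoms are distinct; a repeated atom would overwrite its earlier tag.
        db.setdefault(rel, {})[args] = wrap(f"x{i}")
    return db


body = [("R", ("u", "v")), ("R", ("v", "w"))]  # Q(w) :- R(u,v), R(v,w)

can_nx = canonical_db(body, "N[X]")
can_lin = canonical_db(body, "lin")
can_why = canonical_db(body, "why")
```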
Lineage-Containment of CQs
• Covering set of containment mappings: for every atom A in the body of P there is a containment mapping $h : Q \to P$ with A in the image of h
• Theorem: For CQs P, Q the following are equivalent:
  1. $P \sqsubseteq_{lin} Q$
  2. $P(\mathrm{can}_{lin}(P)) \subseteq_{lin} Q(\mathrm{can}_{lin}(P))$
  3. there is a covering set of containment mappings from Q to P
• Note: covering sets of containment mappings were identified in [CV 93] as a necessary (but not sufficient) condition for bag-containment of CQs
Why-Containment of CQs
• A containment mapping is onto if it induces a surjection on atoms
• Theorem: For CQs P, Q the following are equivalent:
  1. $P \sqsubseteq_{why} Q$
  2. $P(\mathrm{can}_{why}(P)) \subseteq_{why} Q(\mathrm{can}_{why}(P))$
  3. there is an onto containment mapping $h : Q \to P$
• Note: onto containment mappings were identified in [CV 93] as a sufficient (but not necessary) condition for bag-containment of CQs
B[X], N[X]-containment of CQs
• A containment mapping is exact if it induces a bijection on atoms
• Theorem: For CQs P, Q and for $K \in \{B[X], N[X]\}$ the following are equivalent:
  1. $P \sqsubseteq_K Q$
  2. $P(\mathrm{can}_K(P)) \subseteq_K Q(\mathrm{can}_K(P))$
  3. there is an exact containment mapping $h : Q \to P$
• Another way to think of exact containment mappings: by unifying variables in Q, you get a query isomorphic to P
So Far
• K-containment of CQs is decidable for all the provenance models in the hierarchy
• Next, we indicate which steps in the hierarchy are strict, and which collapse:
    B $\prec$ lineage $\prec$ Why-Prov. $\prec$ B[X] $\approx$ N[X]
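Separating the Models for $\sqsubseteq$ of CQs
• B $\prec$ lineage:
    Q1(x,y) :- R(x,y), R(x,z)    Q2(x,y) :- R(x,y)
    $Q_1 \sqsubseteq_B Q_2$ but $Q_1 \not\sqsubseteq_{lin} Q_2$
• lineage $\prec$ why:
    Q1(x) :- R(x,y), R(x,z)    Q2(x) :- R(x,y)
    $Q_1 \sqsubseteq_{lin} Q_2$ but $Q_1 \not\sqsubseteq_{why} Q_2$
• why $\prec$ B[X]:
    Q1(x,y) :- R(x,y)    Q2(x,y) :- R(x,y), R(x,z)
    $Q_1 \sqsubseteq_{why} Q_2$ but $Q_1 \not\sqsubseteq_{B[X]} Q_2$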
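The three separations can be checked mechanically with the mapping criteria from the preceding theorems. The sketch below assumes a CQ is a pair (head variables, list of body atoms), treats "onto" as equality of atom sets and "exact" as equality of atom multisets, and uses brute-force enumeration; the names `mapping_images` and `contained` are illustrative.

```python
from collections import Counter
from itertools import product


def mapping_images(Q, P):
    """For each containment mapping Q -> P, collect the list of images of Q's body atoms."""
    (q_head, q_body), (p_head, p_body) = Q, P
    q_vars = sorted(set(q_head) | {v for _, args in q_body for v in args})
    p_vars = sorted(set(p_head) | {v for _, args in p_body for v in args})
    images = []
    for target in product(p_vars, repeat=len(q_vars)):
        h = dict(zip(q_vars, target))
        image = [(rel, tuple(h[v] for v in args)) for rel, args in q_body]
        if tuple(h[v] for v in q_head) == tuple(p_head) and set(image) <= set(p_body):
            images.append(image)
    return images


def contained(P, Q, kind):
    """Decide whether P is contained in Q using the criterion for `kind`."""
    images = mapping_images(Q, P)
    p_body = P[1]
    if kind == "set":    # some containment mapping exists
        return bool(images)
    if kind == "lin":    # the mappings together cover every atom of P
        return bool(images) and set().union(*map(set, images)) == set(p_body)
    if kind == "why":    # one mapping is onto the atoms of P
        return any(set(img) == set(p_body) for img in images)
    if kind == "exact":  # one mapping is a bijection on atoms (B[X] and N[X])
        return any(Counter(img) == Counter(p_body) for img in images)
    raise ValueError(kind)


R = lambda *args: ("R", args)
A1, A2 = (("x", "y"), [R("x", "y"), R("x", "z")]), (("x", "y"), [R("x", "y")])
B1, B2 = (("x",), [R("x", "y"), R("x", "z")]), (("x",), [R("x", "y")])
C1, C2 = (("x", "y"), [R("x", "y")]), (("x", "y"), [R("x", "y"), R("x", "z")])

sep_b_lin = (contained(A1, A2, "set"), contained(A1, A2, "lin"))
sep_lin_why = (contained(B1, B2, "lin"), contained(B1, B2, "why"))
sep_why_bx = (contained(C1, C2, "why"), contained(C1, C2, "exact"))
```

Each pair holds for the weaker model and fails for the stronger one, matching the three bullets.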
From Containment to Equivalence
• Having {onto|exact} containment mappings in both directions implies that the CQs are isomorphic, so equivalence for why-provenance, B[X], and N[X] collapses to isomorphism:
    $P \equiv_{why} Q \iff P \equiv_{B[X]} Q \iff P \equiv_{N[X]} Q \iff P \cong Q$
• In contrast, for lineage, having covering sets of containment mappings in both directions does not imply isomorphism (but equivalence is still decidable)
From CQs to UCQs
• For idempotent semirings (where + is idempotent) this is easy. B, PosBool(B), lineage, why-provenance, and B[X] are idempotent; N[X] is not (omitted)
• Proposition [after SY80]: If K is idempotent, then for UCQs $\bar{P}, \bar{Q}$ we have $\bar{P} \sqsubseteq_K \bar{Q}$ iff for every CQ $P$ in $\bar{P}$ there is a CQ $Q$ in $\bar{Q}$ such that $P \sqsubseteq_K Q$
• Corollary: For idempotent K, the problems of checking K-equivalence of CQs and K-equivalence of UCQs are polynomially equivalent
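The proposition reduces UCQ containment to a for-all/exists check over pairs of CQs. A minimal sketch, assuming some CQ-level decision procedure is available as a callable: here `oracle` is a hypothetical stand-in backed by a hand-written set of facts, not a real containment test.

```python
def ucq_contained(P_union, Q_union, cq_contained):
    """For idempotent K: every CQ of P_union must be K-contained in some CQ of Q_union."""
    return all(any(cq_contained(p, q) for q in Q_union) for p in P_union)


# Hypothetical CQ-level containment facts, standing in for a real decision procedure.
facts = {("p1", "q1"), ("p2", "q2")}
oracle = lambda p, q: (p, q) in facts

ok = ucq_contained(["p1", "p2"], ["q1", "q2"], oracle)      # each p_i has a matching q_i
not_ok = ucq_contained(["p1", "p3"], ["q1", "q2"], oracle)  # p3 is contained in no CQ of Q
```

The check makes at most |P| x |Q| calls to the CQ-level procedure, so it adds only polynomial overhead on top of that procedure.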
N[X]- and Bag-Equivalence of UCQs
• As with CQs, N[X]-equivalence of UCQs turns out to be the same as isomorphism:
    Theorem: For UCQs P, Q, $P \equiv_{N[X]} Q$ iff $P \cong Q$
• But, it turns out that N[X]-equivalence and N-equivalence of UCQs are intimately related:
    Theorem: For UCQs P, Q, $P \equiv_{N[X]} Q$ iff $P \equiv_N Q$
• Thus:
    Corollary: For UCQs P, Q, $P \equiv_N Q$ iff $P \cong Q$
Summary: Complexity Results
                        B            PosBool(B)     N                             Lineage   Why-Pr.   B[X]      N[X]
CQs   $\sqsubseteq_K$   NP [CM 77]   NP [PODS 07]   ? ($\Pi_2^p$-hard) [CV 93]    NP-ct     NP-ct     NP-ct     NP-ct
CQs   $\equiv_K$        NP ibid.     NP ibid.       $\cong$ ibid.                 NP-ct     $\cong$   $\cong$   $\cong$
UCQs  $\sqsubseteq_K$   NP [SY 80]   NP ibid.       undec [IR 95]                 NP-ct     NP-ct     NP-ct     PSPACE
UCQs  $\equiv_K$        NP ibid.     NP ibid.       $\cong$                       NP-ct     NP-ct     NP-ct     $\cong$
($\cong$: equivalence coincides with query isomorphism)
• Theorem: checking for {covering set of|onto|exact} containment mappings is NP-complete
• Checking for query isomorphism: believed to be neither in P nor NP-complete
Summary: Provenance Hierarchy
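CQs   $\sqsubseteq_K$:  B $\approx$ PosB.(B) $\prec$ Lineage $\prec$ N $\prec$ Why-Pr. $\prec$ B[X] $\approx$ N[X]
CQs   $\equiv_K$:       B $\approx$ PosB.(B) $\prec$ Lineage $\prec$ N $\approx$ Why-Pr. $\approx$ B[X] $\approx$ N[X]
UCQs  $\sqsubseteq_K$:  B $\approx$ PosB.(B) $\prec$ Lineage $\prec$ Why-Pr. $\prec$ B[X] $\prec$ N[X]
UCQs  $\equiv_K$:       B $\approx$ PosB.(B) $\prec$ Lineage $\prec$ Why-Pr. $\prec$ B[X] $\prec$ N[X]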
Related Work
• Already mentioned
  – Set-cont. and equiv. of CQs [Chandra&Merlin 77]
  – Set-cont. and equiv. of UCQs [Sagiv&Yannakakis 80]
  – Bag-cont. of UCQs [Ioannidis&Ramakrishnan 95]
  – Bag-equiv. of CQs [Chaudhuri&Vardi 93]
• Containment of CQs with where-provenance [Tan 03]
• Bag-set semantics [CV 93], combined semantics [Cohen 06]
  – For K-relations: support operator of [Geerts&Poggi 08] generalizes duplicate elimination
• Bag-containment of CQs [Jayram+ 06]
Future Work
• Loose ends:
  – Lower bound for N[X]-containment of UCQs (we gave only a PSPACE upper bound)
  – Generalize results for specific semirings to semirings with certain properties?
• Beyond UCQs: Datalog
  – Is K-containment of Datalog programs the same as set-containment when K is a distributive lattice?
  – Is bag-equivalence/N[X]-equivalence undecidable for Datalog?
• Could the semiring framework give any insight into bag-containment of CQs?
• Query optimization for annotated XML