EDBT 2009 - Provenance for Nested Subqueries
Provenance for Nested Subqueries
Boris Glavic
Database Technology Group
Department of Informatics, University of Zurich
Gustavo Alonso
Systems Group
Department of Computer Science, ETH Zurich

Overview
1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion

1. Introduction
Which input data item(s) of a query influenced which output data item(s)?
- Granularity
  - Tuple
  - Attribute value
  - ...
- Contribution semantics
  - Influence (Lineage / Why)
  - Copy (Where)
  - ...

1. Introduction
Most application domains that benefit from provenance use complex queries
- Subqueries
  - Correlated
  - Nested
- Not supported by existing systems
  - Semantics not clear
  - Complex computation

1. Introduction
Steps to solve this problem:
  1. Establish sound semantics for the provenance of subqueries
  2. Develop algorithms for subquery provenance computation
  3. Integrate the algorithms into a provenance management system (Perm)

1. Introduction
Definition of contribution semantics
- Why/Influence-provenance
  - Introduced in [Cui, Widom ICDE '00]
  - Provenance represented as a list of subsets of the input relations
  - Defined for a single algebra operator and a single result tuple

1. Introduction
Definition 1: For a single algebra operator $op$ with input relations $T_1, \ldots, T_n$, a list $(T_1^*, \ldots, T_n^*)$ of maximal subsets of the input relations is the provenance of a tuple $t$ from the result of $op$ iff:
- $op(T_1^*, \ldots, T_n^*) = \{t\}$
- For all $i$ and $t^*$ with $t^* \in T_i^*$: $op(T_1^*, \ldots, T_{i-1}^*, \{t^*\}, T_{i+1}^*, \ldots, T_n^*) \neq \emptyset$
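
The two conditions of Definition 1 can be checked mechanically for a candidate list of subsets. The following is a minimal Python sketch under set semantics, reading condition 1 as op(T1*, ..., Tn*) = {t}. The names `satisfies_definition_1` and `equi_join` are illustrative and not part of Perm, and the maximality requirement is not checked.

```python
from typing import Callable, Hashable, List, Set

Relation = Set[Hashable]


def satisfies_definition_1(op: Callable[..., Relation],
                           subsets: List[Relation],
                           t: Hashable) -> bool:
    """Check conditions 1 and 2 of Definition 1 (maximality is not checked)."""
    # Condition 1: the candidate subsets produce exactly the result tuple t.
    if op(*subsets) != {t}:
        return False
    # Condition 2: every single tuple t* of every subset still yields a
    # non-empty result when it replaces its own subset Ti*.
    for i, subset in enumerate(subsets):
        for t_star in subset:
            candidate = list(subsets)
            candidate[i] = {t_star}
            if not op(*candidate):
                return False
    return True


def equi_join(r: Relation, s: Relation) -> Relation:
    """Hypothetical example operator: join R and S on equal values."""
    return {(x, y) for x in r for y in s if x == y}


ok = satisfies_definition_1(equi_join, [{1}, {1}], (1, 1))
too_big = satisfies_definition_1(equi_join, [{1, 2}, {1}], (1, 1))
print(ok, too_big)  # True False
```

The second candidate fails condition 2: the tuple 2 alone joins with nothing, so it did not contribute to (1, 1).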

1. Introduction
Perm: Provenance Extension of the Relational Model
- Provenance Management System (PMS)
- "Pure" relational representation of provenance
- Provenance computation through algebraic query rewrite
- Implemented as an extension of PostgreSQL

1. Introduction
Provenance representation
[Figure: relations 1, 2, ..., n are the inputs of Query, which produces Original Result]
Schema of the provenance result:
Original Attributes   Relation 1 Attributes   ...   Relation n Attributes

1. Introduction
Provenance representation
[Figure: relations R = {r1, r2} and S = {s1} are the inputs of Query, which produces Original Result = {t1}]
Original Attributes   Relation R Attributes   Relation S Attributes
t1                    r1                      s1
t1                    r2                      s1
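
One way to read the table above: the list of subsets (R*, S*) of a result tuple is flattened into one row per combination of contributing tuples. A small Python sketch of that reading follows. The function name `flatten_provenance` is hypothetical, and the sketch assumes every subset is non-empty.

```python
from itertools import product


def flatten_provenance(result_tuple, subsets):
    """Return one row per combination of contributing input tuples,
    each row prefixed with the original result tuple."""
    return [(result_tuple, *combo) for combo in product(*subsets)]


# Result tuple t1 with provenance R* = {r1, r2} and S* = {s1}.
rows = flatten_provenance("t1", [["r1", "r2"], ["s1"]])
for row in rows:
    print(row)
```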

1. Introduction
Provenance computation through query rewrite:
- Given a query q, generate a query q+ that computes the provenance of q
  - Representation as defined before
- Rewrites operate on the algebraic representation of a query
  - Rewrite rules for each operator op that transform op into an algebra statement that propagates the provenance

1. Introduction
Rewrite rules example:
SELECT agg, G
FROM T
GROUP BY G

becomes

SELECT agg, G, prov(T)
FROM
  (SELECT agg, G FROM T GROUP BY G) AS agg
  LEFT OUTER JOIN
  (SELECT G AS G', prov(T) FROM T+) AS prov
  ON G = G'

1. Introduction
Rewrite rules example:
SELECT sum(revenue) AS sum, shop
FROM sales
GROUP BY shop

sales
shop    month  revenue
Migros  Jan    100
Migros  Feb    10
Migros  Mar    10
Coop    Jan    25
Coop    Feb    25

result
sum  shop
120  Migros
50   Coop

1. Introduction
SELECT sum, shop, pShop, pMonth, pRevenue
FROM
  (SELECT sum(revenue) AS sum, shop FROM sales GROUP BY shop) AS agg
  LEFT OUTER JOIN
  (SELECT shop AS shop', pShop, pMonth, pRevenue FROM sales+) AS prov
  ON shop = shop'

sum  shop    pShop   pMonth  pRevenue
120  Migros  Migros  Jan     100
120  Migros  Migros  Feb     10
120  Migros  Migros  Mar     10
50   Coop    Coop    Jan     25
50   Coop    Coop    Feb     25
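
The rewritten aggregation can be tried out on the sales example. Below is a sketch using Python's sqlite3 module rather than Perm or PostgreSQL. It assumes that the rewritten base relation sales+ simply duplicates each attribute as a provenance attribute (pShop, pMonth, pRevenue), and it writes shop' as `shop2` because a quote is not valid in an unquoted identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Migros", "Jan", 100), ("Migros", "Feb", 10), ("Migros", "Mar", 10),
    ("Coop", "Jan", 25), ("Coop", "Feb", 25),
])

# The inner subquery "prov" plays the role of sales+ (assumption: provenance
# attributes are plain copies of the base relation's attributes).
rewritten = """
SELECT agg.sum, agg.shop, prov.pShop, prov.pMonth, prov.pRevenue
FROM
  (SELECT sum(revenue) AS sum, shop FROM sales GROUP BY shop) AS agg
  LEFT OUTER JOIN
  (SELECT shop AS shop2, shop AS pShop, month AS pMonth, revenue AS pRevenue
   FROM sales) AS prov
  ON agg.shop = prov.shop2
"""
rows = conn.execute(rewritten).fetchall()
for row in sorted(rows):
    print(row)
```

Each original result tuple is repeated once per sales tuple of its group, which matches the result table of the slide.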

2. The Provenance of Subqueries
- Sublinks: subqueries in e.g. the SELECT clause
    $\sigma_{a \,\mathrm{IN}\, \sigma_{b=3}(S)}(R)$
- Correlated: the sublink references attributes from outside the sublink query
    $\sigma_{a \,\mathrm{IN}\, \sigma_{b=a}(S)}(R)$
- Nested: a sublink that contains sublinks
    $\sigma_{a \,\mathrm{IN}\, \sigma_{b = \mathrm{ANY}(T)}(S)}(R)$
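
For readers who prefer SQL, the three algebra expressions correspond roughly to the queries below. This is a sketch run through Python's sqlite3 module. The table contents and the column name `x` of T are made up for illustration, and `b = ANY (T)` is written as `b IN (SELECT x FROM T)` because SQLite has no ANY operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER);
CREATE TABLE T (x INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (1), (3), (4);
INSERT INTO T VALUES (1), (4);
""")

queries = {
    # Plain sublink: the subquery does not depend on the outer tuple.
    "sublink": "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = 3)",
    # Correlated sublink: the subquery references the outer attribute R.a.
    "correlated": "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = R.a)",
    # Nested sublink: the subquery itself contains a sublink.
    "nested": "SELECT a FROM R WHERE a IN "
              "(SELECT b FROM S WHERE b IN (SELECT x FROM T))",
}

results = {name: sorted(row[0] for row in conn.execute(sql))
           for name, sql in queries.items()}
print(results)
```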

2. The Provenance of Subqueries
What is the provenance of a sublink according to Definition 1?
- Sublinks can be used in different contexts
  - Selection
  - Projection
  - ...
- A sublink either
  - produces exactly one value, or
  - produces a boolean value

2. The Provenance of Subqueries
- Single uncorrelated ANY-sublinks in selection conditions
- For other types of sublinks: READ THE PAPER!
  - Correlated sublinks
  - Nested sublinks

2. The Provenance of Subqueries
Single uncorrelated ANY-sublinks in selection conditions
- The result of the sublink query is fixed
- For a given input tuple t, the sublink condition is either true or false
$\sigma_{a = \mathrm{ANY}\, \sigma_{b=3}(S)}(R)$

2. The Provenance of Subqueries
Some terminology
- The query of a sublink: $T_{sub}$
- The conditional expression of a sublink: $C_{sub}$
Example: $q = \sigma_{a = \mathrm{ANY}\, \Pi_b(S)}(R)$
- $T_{sub} = \Pi_b(S)$
- $C_{sub} = (a = \mathrm{ANY}\, \Pi_b(S))$

2. The Provenance of Subqueries
The sublink condition $C_{sub}$ can play different roles in a condition C of a selection (for one input tuple t):
- Reqtrue: the selection condition is true iff $C_{sub}$ is true
- Reqfalse: the selection condition is true iff $C_{sub}$ is false
- Ind: the selection condition is true independent of the result of $C_{sub}$
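
The three roles can be told apart by evaluating the selection condition twice for the same input tuple, once with the sublink forced to true and once forced to false. A minimal Python sketch of this idea follows. The function `sublink_role` and the three example conditions are hypothetical, and how Perm determines the role internally is not stated on the slides.

```python
def sublink_role(condition, t):
    """Classify the role of the sublink in `condition` for input tuple t.

    `condition(t, csub)` evaluates the selection condition C for tuple t,
    given the boolean result csub of the sublink condition.
    """
    when_true = condition(t, True)
    when_false = condition(t, False)
    if when_true and when_false:
        return "ind"
    if when_true:
        return "reqtrue"
    if when_false:
        return "reqfalse"
    return None  # t is not in the result, whatever the sublink returns


def only_sublink(t, csub):      # C = Csub
    return csub


def negated_sublink(t, csub):   # C = NOT Csub
    return not csub


def disjunction(t, csub):       # C = (a > 2) OR Csub
    return t["a"] > 2 or csub


roles = [
    sublink_role(only_sublink, {"a": 1}),
    sublink_role(negated_sublink, {"a": 1}),
    sublink_role(disjunction, {"a": 3}),
    sublink_role(disjunction, {"a": 1}),
]
print(roles)  # ['reqtrue', 'reqfalse', 'ind', 'reqtrue']
```

The last two calls show that the role depends on the input tuple and not only on the shape of the condition.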

2. The Provenance of Subqueries
Some more terminology
- $T_{sub}^{true}(t)$: all tuples from the sublink query that fulfill the "unquantified" sublink condition
- $T_{sub}^{false}(t)$: all tuples from the sublink query that do not fulfill the "unquantified" sublink condition
Example:
- $C_{sub} = (a = \mathrm{ANY}\, \sigma_{b=3}(S))$
- $C_{sub}^{\circ} = (a = b)$

2. The Provenance of Subqueries
Back to ANY-sublinks in selections
Proposition:
$$T_{sub}^{*}(t) = \begin{cases} T_{sub}^{true}(t) & \text{reqtrue} \\ T_{sub} & \text{reqfalse, ind} \end{cases}$$

2. The Provenance of Subqueries
Example: $q = \sigma_{a = \mathrm{ANY}\, \Pi_b(S)}(R)$
R    S          Result
a    b  c       a
1    1  100     1
2    2  10      2
3    4  24
Compute provenance for $t = (1)$

2. The Provenance of Subqueries
$q = \sigma_{a = \mathrm{ANY}\, \Pi_b(S)}(R)$
- $T_{sub} = \Pi_b(S)$
- $C_{sub}^{\circ} = (a = b)$
- $T_{sub}^{true}(t) = \{(1)\}$
- $C_{sub}$ is reqtrue
- $T_{sub}^{*} = T_{sub}^{true}$

2. The Provenance of Subqueries
$q = \sigma_{a = \mathrm{ANY}\, \Pi_b(S)}(R)$
Compute provenance for $t = (1)$
- $C_{sub}^{\circ} = (a = b)$
- $T_{sub}^{true}(t) = \{(1)\}$
R    Tsub
a    b
1    1
2    2
3    4

2. The Provenance of Subqueries
$q = \sigma_{a = \mathrm{ANY}\, \Pi_b(S)}(R)$
Compute provenance for $t = (1)$
R    S          Tsub   Result   R*   Tsub*
a    b  c       b      a        a    b
1    1  100     1      1        1    1
2    2  10      2      2
3    4  24      4
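
The worked example can be replayed in a few lines of Python. This is a sketch of the Proposition applied to this one query, with relations held as plain lists. The helper name `sublink_provenance` is hypothetical, and the role is passed in as "reqtrue" because the selection condition consists of the sublink alone.

```python
R = [1, 2, 3]                      # attribute a
S = [(1, 100), (2, 10), (4, 24)]   # attributes (b, c)


def sublink_provenance(a, t_sub, role):
    """Tsub*(t) according to the Proposition for an ANY-sublink."""
    # Tsub^true(t): tuples fulfilling the unquantified condition a = b.
    t_sub_true = [b for b in t_sub if a == b]
    if role == "reqtrue":
        return t_sub_true
    return list(t_sub)             # reqfalse and ind: all of Tsub


t_sub = [b for (b, _c) in S]                            # Tsub = Π_b(S)
result = [a for a in R if any(a == b for b in t_sub)]   # q evaluated on R

a = 1                                                   # t = (1)
r_star = [x for x in R if x == a]                       # R*
t_sub_star = sublink_provenance(a, t_sub, "reqtrue")    # Tsub*
print(result, r_star, t_sub_star)  # [1, 2] [1] [1]
```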

2. The Provenance of Subqueries
Definition 1 is ambiguous for queries with more than one sublink!
$q = \sigma_{C_1 \vee C_2}(U)$
$C_1 = (a = \mathrm{ANY}\, R)$
$C_2 = (a > \mathrm{ALL}\, S)$
$t = (5)$
U    R      S    Result
a    b      c    a
5    1      1    5
     2      5
     100
For $t$: $C_1$ is true, $C_2$ is false

2. The Provenance of Subqueries
$q = \sigma_{C_1 \vee C_2}(U)$
$C_1 = (a = \mathrm{ANY}\, R)$
$C_2 = (a > \mathrm{ALL}\, S)$
$t = (5)$
            U*    R*         S*       C1     C2
Solution 1  {5}   {5}        {1, 5}   true   false
Solution 2  {5}   {1, 100}   {1}      false  true

2. The Provenance of Subqueries
Reasons for this ambiguity:
- The definition requires the provenance to produce the same result
- But not to produce the same results for the sublinks
-> Definition 1 produces false positives

2. The Provenance of Subqueries
Solution: extend Definition 1
- Add a third condition, for each sublink:
  - If computed for one result tuple t, any single tuple from the provenance of the sublink produces the same sublink result as in the original query
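
The effect of the third condition can be illustrated on the two-sublink example with C1 = (a = ANY R) and C2 = (a > ALL S). The Python sketch below takes the original sublink results for t = (5) from the slide annotations (C1 true, C2 false) instead of recomputing them, and tests candidate sets R* and S* tuple by tuple. All function names are hypothetical, and the check is one reading of the condition, not the paper's formal definition.

```python
def c1(a, r_rel):
    """C1 = (a = ANY R)."""
    return any(a == b for b in r_rel)


def c2(a, s_rel):
    """C2 = (a > ALL S)."""
    return all(a > c for c in s_rel)


def preserves_sublink_result(sublink, a, provenance, original_result):
    """Third condition for one sublink: every single provenance tuple
    reproduces the sublink result of the original query."""
    return all(sublink(a, {x}) == original_result for x in provenance)


a = 5                                    # t = (5)
ORIGINAL_C1, ORIGINAL_C2 = True, False   # as annotated on the slides


def satisfies_third_condition(r_star, s_star):
    return (preserves_sublink_result(c1, a, r_star, ORIGINAL_C1)
            and preserves_sublink_result(c2, a, s_star, ORIGINAL_C2))


solution_1 = satisfies_third_condition({5}, {1, 5})
solution_2 = satisfies_third_condition({1, 100}, {1})
restricted = satisfies_third_condition({5}, {5})
print(solution_1, solution_2, restricted)  # False False True
```

Under this reading Solution 2 is rejected because it flips both sublink results, and Solution 1 only passes once the tuple 1 is dropped from S*, since 5 > 1 would make C2 true.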

2. The Provenance of Subqueries
$q = \sigma_{C_1 \vee C_2}(U)$
$C_1 = (a = \mathrm{ANY}\, R)$
$C_2 = (a > \mathrm{ALL}\, S)$
$t = (5)$
            U*    R*         S*
Solution 1  {5}   {5}        {5}
Solution 2  {5}   {1, 100}   {1}

2. The Provenance of Subqueries
How to compute the provenance according to the extended definition?
- Use query rewrite
  - Generic strategy (Gen)
  - Specialized strategies
    - Use un-nesting
    - Check: does not change the provenance

2. The Provenance of Subqueries
Gen-strategy
- For queries we cannot un-nest
  1. Join the original query with all possible provenance tuples (base relations)
  2. Rewrite the sublink query
  3. Introduce additional correlation to simulate a join between 1) and 2)
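
To give a feeling for steps 1 and 3, here is a deliberately simplified sketch for the reqtrue ANY-sublink example q = σ a = ANY Π_b(S) (R), run through Python's sqlite3 module. It is not the Gen rewrite rule from the paper: the query text, the aliases `prov` and `sub`, and the provenance column names are assumptions made for illustration, and the reqfalse and ind cases are not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER, c INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (1, 100), (2, 10), (4, 24);
""")

sketch = """
SELECT R.a, R.a AS prov_R_a, prov.b AS prov_S_b, prov.c AS prov_S_c
FROM R CROSS JOIN S AS prov            -- step 1: all possible provenance tuples
WHERE R.a IN (SELECT b FROM S)         -- original selection condition
  AND EXISTS (SELECT 1 FROM S AS sub   -- step 3: correlation that keeps only
              WHERE sub.b = R.a        --   tuples fulfilling a = b and ties
                AND sub.b = prov.b     --   them to the joined S tuple
                AND sub.c = prov.c)
ORDER BY R.a
"""
rows = conn.execute(sketch).fetchall()
for row in rows:
    print(row)
```

For the result tuple (1) this keeps only the S tuple (1, 100), in line with Tsub* = {(1)} from the worked example.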

3. Experimental Results
TPC-H benchmark (10 MB size)

3. Experimental Results
TPC-H benchmark (1 GB size)

4. Conclusion
- Definition 1 fails in the presence of sublinks
  - Can be extended to deal with sublinks
- Provenance computation for sublinks
  - By using query rewrites
  - Implemented in Perm
- Future work
  - Physical provenance-aware operators
Questions
? ? ?