EDBT 2009 - Provenance for Nested Subqueries

Provenance for Nested Subqueries. Boris Glavic, Database Technology Group, Department of Informatics, University of Zurich, [email protected]. Gustavo Alonso, Systems Group, Department of Computer Science, ETH Zurich, [email protected]

description

Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation and/or user defined functions. Without support for these constructs, a provenance management system is of limited use. In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.

Transcript of EDBT 2009 - Provenance for Nested Subqueries

Page 1: EDBT 2009 - Provenance for Nested Subqueries

Provenance for Nested Subqueries

Boris Glavic

Database Technology Group

Department of Informatics University of Zurich

[email protected]


Gustavo Alonso

Systems Group

Department of Computer Science ETH Zurich

[email protected]

Page 2: EDBT 2009 - Provenance for Nested Subqueries

Overview

1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion

Page 3: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Query: Which input data item(s) influenced which output data item(s)?

- Granularity
  - Tuple
  - Attribute Value
  - ...
- Contribution semantics
  - Influence (Lineage / Why)
  - Copy (Where)
  - ...

Page 4: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Most application domains that benefit from provenance use complex queries

- Subqueries
  - Correlated
  - Nested
- Not supported by existing systems
  - Semantics not clear
  - Complex computation

Page 5: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Steps to solve this problem

1. Establish sound semantics for provenance of subqueries
2. Algorithms for subquery provenance computation
3. Integrate algorithms into a Provenance Management system (Perm)

Page 7: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Definition of contribution semantics

- Why/Influence-provenance
- Introduced in [Cui, Widom ICDE '00]
- Provenance represented as list of subsets of the input relations
- Defined for a single algebra operator and a single result tuple

Page 8: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Definition 1: For a single algebra operator op with input relations $T_1, \ldots, T_n$, a list $(T_1^*, \ldots, T_n^*)$ of maximal subsets of the input relations is the provenance of a tuple t from the result of op iff:

- $op(T_1^*, \ldots, T_n^*) = \{t\}$
- For all i and $t^*$ with $t^* \in T_i^*$: $op(T_1^*, \ldots, T_{i-1}^*, \{t^*\}, T_{i+1}^*, \ldots, T_n^*) \neq \emptyset$
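Definition 1 can be checked by brute force on small inputs. The sketch below is only an illustration of the definition, not how Perm computes provenance: it enumerates all subsets of the inputs, keeps those meeting both conditions, and returns the maximal ones. The `join` operator, the relations `R` and `S`, and all function names are hypothetical.

```python
from itertools import chain, combinations, product

def subsets(rel):
    """All subsets of a relation (a frozenset of tuples)."""
    items = list(rel)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))]

def candidates(op, inputs, t):
    """Lists (T1*, ..., Tn*) meeting both conditions of Definition 1."""
    found = []
    for cand in product(*(subsets(rel) for rel in inputs)):
        if op(*cand) != {t}:
            continue  # condition 1: op(T1*, ..., Tn*) = {t}
        # condition 2: each single tuple t* still yields a non-empty result
        ok = all(op(*cand[:i], frozenset([ts]), *cand[i + 1:])
                 for i, part in enumerate(cand) for ts in part)
        if ok:
            found.append(cand)
    return found

def maximal(cands):
    """Keep candidates not strictly contained, component-wise, in another."""
    def below(x, y):
        return x != y and all(a <= b for a, b in zip(x, y))
    return [c for c in cands if not any(below(c, d) for d in cands)]

def join(r, s):
    """Equi-join on the first attribute (hypothetical example operator)."""
    return {x + y for x in r for y in s if x[0] == y[0]}

R = frozenset({(1, "r1"), (2, "r2")})
S = frozenset({(1, "s1"), (1, "s2"), (3, "s3")})
result = maximal(candidates(join, [R, S], (1, "r1", 1, "s1")))
```

For this join, the only maximal candidate pairs the one R tuple and the one S tuple that were joined to form t. The enumeration is exponential in the input size, so it is useful only for checking small examples.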

Page 9: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Perm: Provenance Extension of the Relational Model

- Provenance Management System (PMS)
- "Pure" relational representation of provenance
- Provenance computation through algebraic query rewrite
- Implemented as extension of PostgreSQL

Page 10: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Provenance representation

[Figure: a query over input relations 1, 2, ..., n. Each row of the provenance result consists of the original result attributes, followed by the attributes of relation 1 through relation n.]

Page 11: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Provenance representation

[Figure: a query over input relations R = {r1, r2} and S = {s1} with original result {t1}.]

Original Attributes    Relation R Attributes    Relation S Attributes
t1                     r1                       s1
t1                     r2                       s1

Page 12: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Provenance computation through query rewrite:

- Given query q, generate query q+ that computes the provenance of q
  - Representation as defined before
- Rewrites operate on the algebraic representation of a query
  - Rewrite rules for each operator op that transform op into an algebra statement that propagates the provenance

Page 13: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Rewrite rules example:

    SELECT agg, G
    FROM T
    GROUP BY G

is rewritten into

    SELECT agg, G, prov(T)
    FROM
      (SELECT agg, G FROM T GROUP BY G) AS agg
      LEFT OUTER JOIN
      (SELECT G AS G', prov(T) FROM T+) AS prov
      ON G = G'

Page 14: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

Rewrite rules example:

    SELECT sum(revenue) AS sum, shop
    FROM sales
    GROUP BY shop

sales

shop      month    revenue
Migros    Jan      100
Migros    Feb      10
Migros    Mar      10
Coop      Jan      25
Coop      Feb      25

result

sum    shop
120    Migros
50     Coop

Page 15: EDBT 2009 - Provenance for Nested Subqueries

1. Introduction

    SELECT sum, shop, pShop, pMonth, pRevenue
    FROM
      (SELECT sum(revenue) AS sum, shop FROM sales GROUP BY shop) AS agg
      LEFT OUTER JOIN
      (SELECT shop AS shop', pShop, pMonth, pRevenue FROM sales+) AS prov
      ON shop = shop'

sum    shop      pShop     pMonth    pRevenue
120    Migros    Migros    Jan       100
120    Migros    Migros    Feb       10
120    Migros    Migros    Mar       10
50     Coop      Coop      Jan       25
50     Coop      Coop      Feb       25
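The rewritten aggregation can be tried out in any SQL engine. The following is a minimal sketch using SQLite from Python rather than Perm itself. It assumes that sales+ simply exposes each sales tuple under the renamed attributes pShop, pMonth, and pRevenue, and it uses `shop_p` in place of shop' because a prime is not a valid identifier character.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Migros", "Jan", 100), ("Migros", "Feb", 10), ("Migros", "Mar", 10),
    ("Coop", "Jan", 25), ("Coop", "Feb", 25),
])

# Original aggregation joined with the (renamed) input tuples of each group.
rewritten = """
SELECT agg.sum, agg.shop, prov.pShop, prov.pMonth, prov.pRevenue
FROM
  (SELECT sum(revenue) AS sum, shop FROM sales GROUP BY shop) AS agg
  LEFT OUTER JOIN
  (SELECT shop AS shop_p, shop AS pShop, month AS pMonth, revenue AS pRevenue
   FROM sales) AS prov
  ON agg.shop = prov.shop_p
"""
rows = conn.execute(rewritten).fetchall()
```

Each result row repeats the aggregate value for its group and appends one contributing sales tuple, which matches the table on the slide up to row order.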

Page 16: EDBT 2009 - Provenance for Nested Subqueries

Overview

1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion

Page 17: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

- Sublinks: subqueries in e.g. SELECT-clause
  $\sigma_{a\ \mathrm{IN}\ \sigma_{b=3}(S)}(R)$
- Correlated: references outside attributes
  $\sigma_{a\ \mathrm{IN}\ \sigma_{b=a}(S)}(R)$
- Nested: sublink that contains sublinks
  $\sigma_{a\ \mathrm{IN}\ \sigma_{b = \mathrm{ANY}(T)}(S)}(R)$
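The three kinds of sublinks correspond to familiar SQL forms. The sketch below runs one query of each kind in SQLite. The tables R, S, T and their contents are hypothetical, and `IN` stands in for `= ANY` because SQLite has no quantified comparison operators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER);
CREATE TABLE T (x INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (2), (3), (4);
INSERT INTO T VALUES (3);
""")

queries = {
    # sublink without references to the outer query
    "uncorrelated": "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = 3)",
    # sublink referencing the outer attribute R.a
    "correlated": "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = R.a)",
    # sublink whose own condition contains another sublink
    "nested": "SELECT a FROM R WHERE a IN "
              "(SELECT b FROM S WHERE b IN (SELECT x FROM T))",
}
results = {name: sorted(row[0] for row in conn.execute(sql))
           for name, sql in queries.items()}
```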

Page 18: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

What is the provenance of a sublink according to Definition 1?

- Sublinks can be used in different contexts
  - Selection
  - Projection
  - ...
- A sublink either
  - produces exactly one value
  - or produces a boolean value

Page 19: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

- Single uncorrelated ANY-sublinks in selection conditions
- For other types of sublinks
  - Correlated sublinks
  - Nested sublinks

Page 20: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

- For other types of sublinks
  - Correlated sublinks
  - Nested sublinks

READ THE PAPER!

Page 21: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Single uncorrelated ANY-sublinks in selection conditions

- The result of the sublink query is fixed
- For a given input tuple t the sublink condition is either true or false

$\sigma_{a = \mathrm{ANY}\ \sigma_{b=3}(S)}(R)$

Page 22: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Some terminology

- $T_{sub}$: the query of a sublink
- $C_{sub}$: the conditional expression of a sublink

Example: $q = \sigma_{a = \mathrm{ANY}\ \Pi_b(S)}(R)$

- $T_{sub} = \Pi_b(S)$
- $C_{sub} = (a = \mathrm{ANY}\ \Pi_b(S))$

Page 23: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

The sublink condition $C_{sub}$ can play different roles in a condition C of a selection (for one input tuple t):

- reqtrue: the selection condition is true iff $C_{sub}$ is true
- reqfalse: the selection condition is true iff $C_{sub}$ is false
- ind: the selection condition is true independent of the result of $C_{sub}$
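One way to read the three roles is to evaluate the selection condition twice for the same input tuple, once with the sublink condition forced to true and once forced to false. The sketch below follows that reading. The function name and the example conditions are hypothetical, and conditions are modelled as Python callables that receive the tuple and the truth value of $C_{sub}$.

```python
def sublink_role(condition, t):
    """Classify C_sub for tuple t by forcing its truth value both ways."""
    if_true = condition(t, True)    # C evaluated with C_sub = true
    if_false = condition(t, False)  # C evaluated with C_sub = false
    if if_true and if_false:
        return "ind"
    if if_true:
        return "reqtrue"
    if if_false:
        return "reqfalse"
    return None  # t fails the selection either way, so it is not in the result

# Hypothetical selection conditions over a tuple t = {"a": ...}
c_plain = lambda t, c_sub: c_sub                  # C = C_sub
c_negated = lambda t, c_sub: not c_sub            # C = NOT C_sub
c_or = lambda t, c_sub: t["a"] > 10 or c_sub      # C = (a > 10) OR C_sub

roles = [
    sublink_role(c_plain, {"a": 1}),
    sublink_role(c_negated, {"a": 1}),
    sublink_role(c_or, {"a": 50}),
    sublink_role(c_or, {"a": 1}),
]
```

The last two calls show that the role depends on the input tuple: the same disjunction makes the sublink independent for a = 50 but required true for a = 1.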

Page 24: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Some more terminology

- $T_{sub}^{true}(t)$: all tuples from the sublink query that fulfill the "unquantified" sublink condition
- $T_{sub}^{false}(t)$: all tuples from the sublink query that do not fulfill the "unquantified" sublink condition

Example: $C_{sub} = (a = \mathrm{ANY}\ \sigma_{b=3}(S))$ has the unquantified condition $C_{sub}^{\circ} = (a = b)$

Page 25: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Back to ANY-sublinks in selections

Proposition:

$$T_{sub}^{*}(t) = \begin{cases} T_{sub}^{true}(t) & \text{if } C_{sub} \text{ is reqtrue} \\ T_{sub} & \text{if } C_{sub} \text{ is reqfalse or ind} \end{cases}$$

Page 26: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Example: $q = \sigma_{a = \mathrm{ANY}\ \Pi_b(S)}(R)$

R

a
1
2
3

S

b    c
1    100
2    10
4    24

Result

a
1
2

Compute provenance for t = (1)

Page 27: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

$q = \sigma_{a = \mathrm{ANY}\ \Pi_b(S)}(R)$

- $T_{sub} = \Pi_b(S)$
- $C_{sub}^{\circ} = (a = b)$
- $T_{sub}^{true}(t) = \{(1)\}$
- $C_{sub}$ is reqtrue
- $T_{sub}^{*} = T_{sub}^{true}$

Page 28: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

$q = \sigma_{a = \mathrm{ANY}\ \Pi_b(S)}(R)$, $C_{sub}^{\circ} = (a = b)$

Compute provenance for t = (1)

- R(a) = {(1), (2), (3)}
- $T_{sub}$(b) = {(1), (2), (4)}
- $T_{sub}^{true}(t) = \{(1)\}$

Page 29: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

$q = \sigma_{a = \mathrm{ANY}\ \Pi_b(S)}(R)$

Compute provenance for t = (1)

- R(a) = {(1), (2), (3)}
- S(b, c) = {(1, 100), (2, 10), (4, 24)}
- $T_{sub}$(b) = {(1), (2), (4)}
- Result(a) = {(1), (2)}
- R*(a) = {(1)}
- $T_{sub}^{*}$(b) = {(1)}
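The proposition and the worked example for t = (1) can be replayed in a few lines. This is a minimal sketch with plain Python sets, not the query rewrite Perm performs. The function name and the representation of single-attribute tuples as bare integers are assumptions made for brevity.

```python
def any_sublink_provenance(t_a, t_sub, role):
    """T_sub*(t) for a condition `a = ANY T_sub`, following the proposition."""
    if role == "reqtrue":
        # T_sub^true(t): tuples fulfilling the unquantified condition a = b
        return {b for b in t_sub if b == t_a}
    return set(t_sub)  # reqfalse or ind: the whole sublink result

R = [1, 2, 3]
S = [(1, 100), (2, 10), (4, 24)]

t_sub = {b for b, _c in S}               # T_sub = Pi_b(S)
result = [a for a in R if a in t_sub]    # q = sigma_{a = ANY Pi_b(S)}(R)

# The selection condition consists of the sublink alone, so C_sub is reqtrue.
prov = any_sublink_provenance(1, t_sub, "reqtrue")
```

Here `result` holds the two result tuples from the slide and `prov` holds the single sublink tuple that makes the condition true for t = (1).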

Page 30: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Definition 1 is ambiguous for queries with more than one sublink!

$q = \sigma_{C_1 \vee C_2}(U)$, $C_1 = (a = \mathrm{ANY}\ R)$, $C_2 = (a > \mathrm{ALL}\ S)$, t = (5)

- S(b) = {(1), (2), (100)}
- R(c) = {(1), (5)}
- U(a) = {(5)}
- Result(a) = {(5)}

Page 31: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Definition 1 is ambiguous for queries with more than one sublink!

$q = \sigma_{C_1 \vee C_2}(U)$, $C_1 = (a = \mathrm{ANY}\ R)$, $C_2 = (a > \mathrm{ALL}\ S)$, t = (5)

- S(b) = {(1), (2), (100)}
- R(c) = {(1), (5)}
- U(a) = {(5)}
- Result(a) = {(5)}

For t = (5): $C_1$ is true, $C_2$ is false

Page 32: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

$q = \sigma_{C_1 \vee C_2}(U)$, $C_1 = (a = \mathrm{ANY}\ R)$, $C_2 = (a > \mathrm{ALL}\ S)$, t = (5)

[Figure: two candidate provenance solutions (Solution 1 and Solution 2), each giving subsets U*, R*, and S* of the input relations; in both, U*(a) = {(5)}.]

Page 33: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

$q = \sigma_{C_1 \vee C_2}(U)$, $C_1 = (a = \mathrm{ANY}\ R)$, $C_2 = (a > \mathrm{ALL}\ S)$, t = (5)

[Figure: Solution 1 and Solution 2 (subsets U*, R*, S*) as before.]

Sublink results for one solution: $C_1$ is true, $C_2$ is false

Page 34: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

$q = \sigma_{C_1 \vee C_2}(U)$, $C_1 = (a = \mathrm{ANY}\ R)$, $C_2 = (a > \mathrm{ALL}\ S)$, t = (5)

[Figure: Solution 1 and Solution 2 (subsets U*, R*, S*) as before.]

Sublink results for the other solution: $C_1$ is false, $C_2$ is true

Page 35: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Reasons for this ambiguity:

- The definition requires the provenance to produce the same result
- But not to produce the same results for the sublinks

-> Definition 1 produces false positives

Page 36: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Solution: Extend Definition 1

- Add a third condition, for each sublink:
  - If computed for one result tuple t, one tuple from the provenance of the sublink
  - produces the same sublink result as in the original query
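The ambiguity and the effect of a third condition can be reproduced by brute force. The sketch below rests on two assumptions that the slides do not confirm: the relation contents are read as U = {5}, R = {1, 5}, S = {1, 2, 100}, and the third condition is interpreted as "each single tuple of a sublink's provenance, taken alone, yields the same sublink result as the full relation did". It is an enumeration for illustration, not the rewrite-based computation used in Perm.

```python
from itertools import chain, combinations, product

def subsets(values):
    return [frozenset(c) for c in chain.from_iterable(
        combinations(sorted(values), k) for k in range(len(values) + 1))]

def c1(a, r):            # C1 = (a = ANY R)
    return any(a == x for x in r)

def c2(a, s):            # C2 = (a > ALL S)
    return all(a > x for x in s)

def q(u, r, s):          # q = sigma_{C1 or C2}(U)
    return {a for a in u if c1(a, r) or c2(a, s)}

def solutions(u, r, s, t, extended=False):
    found = []
    for cand in product(subsets(u), subsets(r), subsets(s)):
        us, rs, ss = cand
        if q(us, rs, ss) != {t}:
            continue  # condition 1: same result tuple
        singles_ok = (all(q({x}, rs, ss) for x in us)
                      and all(q(us, {x}, ss) for x in rs)
                      and all(q(us, rs, {x}) for x in ss))
        if not singles_ok:
            continue  # condition 2: every single tuple contributes
        if extended:
            # assumed third condition: same sublink results as originally
            same = (all(c1(t, {x}) == c1(t, r) for x in rs)
                    and all(c2(t, {x}) == c2(t, s) for x in ss))
            if not same:
                continue
        found.append(cand)

    def below(x, y):
        return x != y and all(p <= p2 for p, p2 in zip(x, y))
    return [c for c in found if not any(below(c, d) for d in found)]

U, R, S = {5}, {1, 5}, {1, 2, 100}
by_definition_1 = solutions(U, R, S, 5)
by_extended = solutions(U, R, S, 5, extended=True)
```

Under these assumptions the first call returns two incomparable maximal candidates, while the second returns a single one in which the sublink results agree with the original query.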

Page 37: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

$q = \sigma_{C_1 \vee C_2}(U)$, $C_1 = (a = \mathrm{ANY}\ R)$, $C_2 = (a > \mathrm{ALL}\ S)$, t = (5)

[Figure: the candidate solutions (Solution 1 and Solution 2, subsets U*, R*, S*) revisited under the extended definition.]

Page 38: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

How to compute the provenance according to the extended definition?

- Use query rewrite
  - Generic strategy (Gen)
  - Specialized strategies
    - Use un-nesting
    - Check: does not change the provenance

Page 39: EDBT 2009 - Provenance for Nested Subqueries

2. The Provenance of Subqueries

Gen-strategy

- For queries we cannot un-nest

1. Join original query with all possible provenance tuples (base relations)
2. Rewrite the sublink query
3. Introduce additional correlation to simulate a join between 1) and 2)

Page 40: EDBT 2009 - Provenance for Nested Subqueries

Overview

1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion

Page 41: EDBT 2009 - Provenance for Nested Subqueries

3. Experimental Results

TPC-H benchmark (10 MB size)

Page 42: EDBT 2009 - Provenance for Nested Subqueries

3. Experimental Results

TPC-H benchmark (1 GB size)

Page 43: EDBT 2009 - Provenance for Nested Subqueries

Overview

1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion

Page 44: EDBT 2009 - Provenance for Nested Subqueries

4. Conclusion

- Definition 1 fails in the presence of sublinks
  - Can be extended to deal with sublinks
- Provenance computation for sublinks
  - By using query rewrites
  - Implemented in Perm
- Future Work
  - Physical provenance-aware operators

Page 45: EDBT 2009 - Provenance for Nested Subqueries

Questions?