Database Queries that Explain their Work

James Cheney, University of Edinburgh, jcheney@inf.ed.ac.uk

Amal Ahmed, Northeastern University, amal@ccs.neu.edu

Umut A. Acar, Carnegie Mellon University & INRIA-Rocquencourt, [email protected]

Abstract  Provenance for database queries or scientific workflows is often motivated as providing explanation, increasing understanding of the underlying data sources and processes used to compute the query, and reproducibility, the capability to recompute the results on different inputs, possibly specialized to a part of the output. Many provenance systems claim to provide such capabilities; however, most lack formal definitions or guarantees of these properties, while others provide formal guarantees only for relatively limited classes of changes. Building on recent work on provenance traces and slicing for functional programming languages, we introduce a detailed tracing model of provenance for multiset-valued Nested Relational Calculus, define trace slicing algorithms that extract subtraces needed to explain or recompute specific parts of the output, and define query slicing and differencing techniques that support explanation. We state and prove correctness properties for these techniques and present a proof-of-concept implementation in Haskell.

Categories and Subject Descriptors  D.3.0 [Programming Languages]: General; D.3.3 [Programming Languages]: Language Constructs and Features

Keywords provenance, database queries, slicing

1. Introduction
Over the past decade, the use of complex computer systems in science has increased dramatically: databases, scientific workflow systems, clusters, and cloud computing based on frameworks such as MapReduce [17] or PigLatin [27] are now routinely used for scientific data analysis. With this shift to computational science based on (often) unreliable components and noisy data comes decreased transparency, and an increased need to understand the results of complex computations by auditing the underlying processes.

This need has motivated work on provenance in databases, scientific workflow systems, and many other settings [7, 26]. There is now a great deal of research on extending such systems with rich provenance-tracking features. Generally, these systems aim to provide high-level explanations intended to aid the user in understanding how a computation was performed, by recording and presenting additional "trace" information.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PPDP '14, September 08–10, 2014, Canterbury, United Kingdom.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2947-7/14/09...$15.00.
http://dx.doi.org/10.1145/2643135.2643143

Over time, two distinct approaches to provenance have emerged: (1) the use of annotations propagated through database queries to illustrate where-provenance linking results to source data [7], lineage or why-provenance linking result records to sets of witnessing input records [16], or how-provenance describing how results were produced via algebraic expressions [21], and (2) the use of graphical provenance traces to illustrate how workflow computations construct final results from inputs and configuration parameters [5, 22, 30]. However, to date few systems formally specify the semantics of provenance or give formal guarantees characterizing how provenance "explains" results.

For example, scientists often conduct parameter sweeps to search for interesting results. The provenance trace of such a computation may be large and difficult to navigate. Once the most promising results have been identified, a scientist may want to extract just the information that is needed to explain the result, without showing all of the intermediate search steps or uninteresting results. Conversely, if the results are counterintuitive, the scientist may want to identify the underlying data that contributed to the anomalous result. Missier et al. [25] introduced the idea of a "golden trail", or a subset of the provenance trace that explains, or allows reproduction of, a high-value part of the output. They proposed techniques for extracting "golden trails" using recursive Datalog queries over provenance graphs; however, they did not propose definitions of correctness or reproducibility.

It is a natural question to ask how we know when a proposed solution, such as Missier et al.'s "golden trail" queries, correctly explains or can be used to correctly reproduce the behavior of the original computation. It seems to have been taken for granted that simple graph traversals suffice to at least overapproximate the desired subset of the graph. As far as we know, it is still an open question how to define and prove such correctness properties for most provenance techniques. In fact, these properties might be defined and formalized in a number of ways, reflecting different modeling choices or requirements. In any case, in the absence of clear statements and proofs of correctness, claims that different forms of provenance "explain" or allow "reproducibility" are difficult to evaluate objectively.

The main contribution of this paper is to formalize and prove the correctness of an approach to fine-grained provenance for database queries. We build on our approach developed in prior work, which we briefly recapitulate. Our approach is based on analogies between the goals of provenance tracking for databases and workflows, and those of classical techniques for program comprehension and analysis, particularly program slicing [32] and information flow [29]. Both program slicing and information flow rely critically on notions of dependence, such as the familiar control-flow and data-flow dependences in programming languages.

We previously introduced a provenance model for NRC (including difference and aggregation operations) called dependency provenance [12], and showed how it can be used to compute data slices, that is, subsets of the input to the query that include all of the information relevant to a selected part of the output. Some other forms of provenance for database query languages, such as how-provenance [21], satisfy similar formal guarantees that can be used to predict how the output would change under certain classes of input changes, specifically those expressible by semiring homomorphisms. For example, Amsterdamer et al.'s system [3] is based on the semiring provenance model, so the effects of deletions on parts of the output can be predicted by inspecting their provenance, but other kinds of changes are not supported.

More recently, we proposed an approach to provenance called self-explaining computation [11] and explored it in the context of a general-purpose functional programming language [1, 28]. In this approach, detailed execution traces are used as a form of provenance. Traces explain results in the sense that they can be replayed to recompute the results, and they can be sliced to obtain smaller traces that provide more concise explanations of parts of the output. Trace slicing also produces a slice of the input showing what was needed by the trace to compute the output. Moreover, other forms of provenance can be extracted from traces (or slices), and we also showed that traces can be used to compute program slices efficiently through lazy evaluation. Finally, we showed how traces support differential slicing techniques that can highlight the differences between program runs in order to explain and precisely localize bugs in the program or errors in the input data.

Our long-term vision is to develop self-explaining computation techniques covering all components used in day-to-day scientific practice. Databases are probably the single most important such component. Since our previous work already applies to a general-purpose programming language, one way to proceed would be to simply implement an interpreter for NRC in this language, and inherit the slicing behavior from that. However, without some further inlining or optimization, this naive strategy would yield traces that record both the behavior of the NRC query and its interpreter, along with internal data structures and representation choices whose details are (intuitively) irrelevant to understanding the high-level behavior of the query.

1.1 Technical overview
In this paper, we develop a tracing semantics and trace slicing techniques tailored to NRC (over a multiset semantics). This semantics evaluates a query Q over an input database (i.e. an environment γ mapping relation names to table values), yielding the usual result value v as well as a trace T. Traces are typically large and difficult to decipher, so we consider a scenario where a user has run Q, inspected the results v, and requests an explanation for a part of the result, such as a field value of a single record. As in our previous work for functional programs, we use partial values with "holes" □ to describe parts of the output that are to be explained. For example, if the result of a program is just a pair (1, 2) then the pattern (1, □) can be used to request an explanation for just the first component. Given a partial value p matching the output, our approach computes a "slice" consisting of a partial trace and a partial input environment, where components not necessary for recomputing the explained output part p have been deleted.

The main technical contribution of this paper over our previous work [1, 28] is its treatment of tracing and slicing for collections. There are two underlying technical challenges; we illustrate both (and our solutions) via a simple example query Q = σ_{A<B}(R) ∪ ρ_{A↦B,B↦A}(σ_{A≥B}(R)) over a table R with attributes A, B. Here, σ_φ is relational selection of all tuples satisfying a predicate φ and ρ_{A↦B,B↦A} is renaming. Thus, Q simply swaps the fields of records where A ≥ B, and leaves other records alone.

The first challenge is how to address elements of multisets reliably across different executions and support propagation of addresses in the output backwards towards the input. Our solution is to use a mildly enriched semantics in which multiset elements carry explicit labels; that is, we view multisets of elements from X as functions I → X from some index set I to X. For example, if R is labeled as follows and we use this enriched semantics to evaluate the above query Q on R, we get a result:

R =
  id     A  B  C
  [r1]   1  2  7
  [r2]   2  3  8
  [r3]   4  3  9

Q(R) =
  id        A  B  C
  [1, r1]   1  2  7
  [1, r2]   2  3  8
  [2, r3]   3  4  9

where in each case the id column contains a distinct index r_i. In Q(R), the first '1' or '2' in each index indicates whether the row was generated by the left or right subexpression of the union (∪). In general, we use sequences of natural numbers [i1, . . . , in] ∈ ℕ* as indices, and we maintain a stronger invariant: the set of indexes used in a multiset must form a prefix code. We define a semantics for NRC expressions over such collections that is fully deterministic and does not resort to generation of fresh intermediate labels; in particular, the union operation adjusts the labels to maintain distinctness.
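The labeled semantics above can be sketched executably. The following Python fragment (our own encoding, not the paper's Haskell implementation) represents labeled multisets as dictionaries from index sequences (tuples) to rows, encodes the opaque ids r1, r2, r3 as the integers 1, 2, 3, and implements the label-adjusting union together with the example query Q.

```python
# Labeled multisets: dicts mapping label tuples (index sequences in N*) to rows.
# Rows are (A, B, C) triples; the ids r1, r2, r3 are encoded as 1, 2, 3.

def union_labeled(v1, v2):
    """Multiset union that adjusts labels deterministically:
    prepend 1 to labels from the left operand, 2 to those from the right."""
    out = {(1,) + l: row for l, row in v1.items()}
    out.update({(2,) + l: row for l, row in v2.items()})
    return out

def select(pred, v):
    """Relational selection; surviving elements keep their labels."""
    return {l: row for l, row in v.items() if pred(row)}

def swap_ab(v):
    """Renaming rho_{A->B, B->A}: swap the A and B fields of each row."""
    return {l: (b, a, c) for l, (a, b, c) in v.items()}

# The example table R.
R = {(1,): (1, 2, 7), (2,): (2, 3, 8), (3,): (4, 3, 9)}

# Q = select(A < B, R)  union  swap(select(A >= B, R))
QR = union_labeled(select(lambda r: r[0] < r[1], R),
                   swap_ab(select(lambda r: r[0] >= r[1], R)))
```

Here `QR` evaluates to `{(1, 1): (1, 2, 7), (1, 2): (2, 3, 8), (2, 3): (3, 4, 9)}`, mirroring the Q(R) table above.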

The second technical challenge involves extending patterns to partial collections. In our previous work, any subexpression of a value can be replaced by a hole. This works well in a conventional functional language, where typical values (such as lists and trees) are essentially initial algebras built up by structural induction. However, when we consider unordered collections such as bags, our previous approach becomes awkward.

For example, if we want to use a pattern to focus on only the B field of the second record in query result Q(R), we can only do this by deleting the other record values, and the smallest such pattern is {[1, r1].□, [1, r2].⟨A:□, B:3, C:□⟩, [2, r3].□}. This is tolerable if there are only a few elements, but if there are hundreds or millions of elements and we are only interested in one, this is a significant overhead. Therefore, we introduce enriched patterns that allow us to replace entire subsets with holes, for example, p′ = {[1, r2].⟨B:3; □⟩} ∪̇ □. Enriched patterns are not just a convenience; we show experimentally that they allow traces and slices to be smaller (and computed faster) by an order of magnitude or more.
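To see why enriched patterns help, consider a rough Python encoding of patterns (ours; the paper's pattern languages are defined in Sections 3 and 4, and this sketch only mimics their shape). A hole is the marker `HOLE`, and the special key `"..."` elides all remaining fields or elements at once.

```python
HOLE = object()  # a hole: a part of the value the user is not asking about

# Plain pattern: every element of the collection must be listed, with
# uninteresting elements replaced by holes one at a time.
plain = {(1, 1): HOLE,
         (1, 2): {"A": HOLE, "B": 3, "C": HOLE},
         (2, 3): HOLE}

# Enriched pattern: one "..." hole elides the rest of the record and
# another elides every other element of the collection.
enriched = {(1, 2): {"B": 3, "...": HOLE}, "...": HOLE}

def listed_elements(pattern):
    """How many collection elements the pattern spells out explicitly."""
    return sum(1 for key in pattern if key != "...")
```

With three rows the plain pattern already lists 3 elements while the enriched one lists 1; with millions of rows, that gap is what makes enriched slices smaller by an order of magnitude or more.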

1.2 Outline
The rest of this paper is structured as follows. Section 2 reviews the (multiset-valued) Nested Relational Calculus and presents our tracing semantics and deterministic labeling scheme. Section 3 presents trace slicing, including a simple form of patterns. Section 4 shows how to enrich our pattern language to allow for partial record and partial set patterns, which considerably increase the expressiveness of the pattern language, leading to smaller slices. Section 5 presents the query slicing algorithm and shows how to compute differential slices. Section 6 presents additional examples and discussion. Section 7 presents our implementation demonstrating the benefits of laziness and enriched patterns for trace slicing. Section 8 discusses related and future work and Section 9 concludes.

Due to space limitations, and in order to make room for examples and high-level discussion, some (mostly routine) formal details and proofs are relegated to a companion technical report [13].

2. Traced Evaluation for NRC
The nested relational calculus [9] is a simply-typed core language with collection types that can express queries on nested data similar to those of SQL on flat relations, but has simpler syntax and cleaner semantics. In this section, we show how to extend the ideas and machinery developed in our previous work on traces and slicing for functional languages [28] to NRC. Developing formal foundations


Operations      f ::= + | − | * | / | = | < | ≤ | ···
Expressions     e ::= c | f(e1, . . . , en) | x | let x = e in e′
                    | ⟨A1 : e1, . . . , An : en⟩ | e.A | if(e, e′, e″)
                    | ∅ | {e} | e1 ∪ e2 | ⋃{e′ | x ∈ e}
                    | empty e | sum e | ···
Labels          ℓ ::= ℓ · ℓ′ | ε | n
Values          v ::= c | ⟨A1 : v1, . . . , An : vn⟩
                    | ∅ | {ℓ1.v1, . . . , ℓn.vn}
Environments    γ ::= [x1 ↦ v1, . . . , xn ↦ vn]
Traces          T ::= · · · | if(T, e′, e″) ▷true T | if(T, e′, e″) ▷false T
                    | ⋃{e | x ∈ T} ▷ Θ
Trace Sets      Θ ::= {ℓ1.T1, . . . , ℓn.Tn}
Types           τ ::= int | bool | ⟨A1 : τ1, . . . , An : τn⟩ | {τ}
Type Contexts   Γ ::= x1 : τ1, . . . , xn : τn

Figure 1. NRC expressions, values, traces, and types.

for tracking provenance in the presence of unordered collections presents a number of challenges not encountered in the functional programming setting, as we will explain.

2.1 Syntax and Dynamic Semantics
Figure 1 presents the abstract syntax of NRC expressions, values, and traces. The expression ∅ denotes the empty collection, {e} constructs a singleton collection, and e1 ∪ e2 takes the (multiset) union of two collections. The operation sum e computes the sum of a collection of integers, while the predicate empty e tests whether the collection denoted by e is empty. Additional aggregation operations such as count, maximum and average can easily be accommodated. Finally, the comprehension operation ⋃{e′ | x ∈ e} iterates over the collection obtained by evaluating e, evaluating e′(x) with x bound to each element of the collection in turn, and returning a collection containing the union of all of the results. We sometimes consider pairs (e1, e2), a special case of records ⟨#1 : e1, #2 : e2⟩ using two designated field names #1 and #2. Many trace forms are similar to those for expressions; only the differences are shown.

Labels are sequences ℓ = [i1, . . . , in] ∈ ℕ*, possibly empty. The empty sequence is written ε, and labels can be concatenated ℓ · ℓ′; concatenation is associative. Record field names are written A, B, A1, A2, . . ..

Values in NRC include constants c, which we assume include at least booleans and integers. Record values are essentially partial functions from field names to values, written ⟨A1 : v1, . . . , An : vn⟩. Collection values are essentially partial, finite-domain functions from labels in ℕ* to values, which we write {ℓ1.v1, . . . , ℓn.vn}. Since they denote functions, collections and records are identified up to reordering of their elements, and their field names or labels are always distinct. We write ℓ · v for the operation that prepends ℓ to each of the labels in a set v, that is,

  ℓ · {ℓ1.v1, . . . , ℓn.vn} = {ℓ · ℓ1.v1, . . . , ℓ · ℓn.vn}.

Other operations on labels and labeled collections will be introduced in due course.
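In the dictionary encoding of labeled collections used in our earlier sketches (labels as tuples of naturals; our convention, not the paper's), the prepend operation is a single comprehension:

```python
def prepend(label, coll):
    """l . {l1.v1, ..., ln.vn} = {(l.l1).v1, ..., (l.ln).vn}:
    concatenate `label` onto the front of every element label in `coll`."""
    return {label + li: vi for li, vi in coll.items()}
```

Prepending the empty label ε (the empty tuple) leaves every label unchanged, matching ε being the unit of concatenation.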

The labels on the elements of a collection provide us with a persistent address for a particular element of the collection. This capability is essential when asking and answering provenance queries about parts of the source or output data, and when tracking fine-grained dependencies.

Both expressions and traces are subject to a type system. NRC types include collection types {τ}, which are often taken to be sets, bags (multisets), or lists, though in this paper we consider multiset collections only. However, types do not play a significant role in this paper, so the typing rules are omitted. For expressions, the typing judgment Γ ⊢ e : τ is standard, and the typing rules for trace well-formedness Γ ⊢ T : τ are presented in the companion technical report [13].

Traced evaluation.  NRC traces include a trace form corresponding to each of the expressions described above. The structure of the traces is best understood by inspecting the traced evaluation rules (Figure 2), which define a judgment γ, e ⇓ v, T indicating that evaluating an expression e in environment γ yields a value v and a trace T. We assume an environment Σ associating constants and function symbols with their types, and write f̂ or +̂ for the semantic operations corresponding to f or +, and so on. In most cases, the trace form is similar to the expression form; for example, the trace of a constant or variable is a constant trace c, the trace of a primitive operation f(e1, . . . , en) is a primitive operation trace f(T1, . . . , Tn) applied to the traces Ti of the arguments ei, the trace of a record expression is a trace record constructor ⟨A1 : T1, . . . , An : Tn⟩, and the trace of a field projection e.A is a trace T.A. Also, the trace of a let-binding is a let-binding trace let x = T1 in T2, where x is bound in T2. In these cases, the traces mimic the expression structure.

The traced evaluation rules for conditionals illustrate that traces differ from expressions in recording control flow decisions. The trace of a conditional is a conditional trace if(T, e1, e2) ▷b T′, where T is the trace of the conditional test, b is the Boolean value of the test, and T′ is the trace of the taken branch. The expressions e1 and e2 are not strictly necessary but are retained to preserve structural similarity to the original expression.

The trace of ∅ is a constant trace ∅. To evaluate a singleton-collection constructor {e}, we evaluate e to obtain a value v and return the singleton {ε.v} with empty label ε. We return the singleton trace {T} recording the trace for the evaluation of the element. To evaluate the union of two expressions, we evaluate each one and take the semantic union (written ⊎) of the resulting collections, with a '1' or '2' concatenated onto the beginning of each label to reflect whether each element came from the first or second part of the union; the union trace T1 ∪ T2 records the traces for the evaluation of the two subexpressions. For sum e, evaluating e yields a collection of numbers whose sum we return, together with a sum trace sum T recording the trace for evaluation of e. Evaluation of emptiness tests empty e is analogous, yielding a trace empty T.

To evaluate a comprehension ⋃{e′ | x ∈ e}, we first evaluate e, which yields a collection v and trace T, and then (using the auxiliary judgment γ, x ∈ v, e′ ⇓* v′, Θ) evaluate e′ repeatedly with x bound to each element vi of the collection v to get resulting values v′i and corresponding traces T′i. We return a new collection v′ = {ℓ1 · v′1, . . . , ℓn · v′n}; similarly, we return a labeled set of traces Θ = {ℓ1.T1, . . . , ℓn.Tn}. (Analogously to values, trace sets are essentially finite partial functions from labels to traces.) For each of these collections, we prepend the appropriate label ℓi of the corresponding input element.
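Ignoring traces, the value part of the comprehension rule can be sketched as follows (a simplification in our tuple-labeled dictionary encoding; `body` stands for the abstraction x ↦ e′):

```python
def comprehend(body, coll):
    """Evaluate U{ body(x) | x in coll } over a labeled collection:
    run body on each element l_i . v_i, prepend l_i to every label of the
    resulting sub-collection, and take the (disjoint) union of the results.
    Output labels are thus computed deterministically from input labels."""
    out = {}
    for li, vi in coll.items():
        for lj, vj in body(vi).items():
            out[li + lj] = vj
    return out
```

For instance, a selection expressed as `comprehend(lambda x: ({(): x} if x > 1 else {}), coll)` keeps each surviving element's original label, since prepending the empty label is the identity.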

A technical point of note is that the resulting trace Ti may contain free occurrences of x. As in our trace semantics for functional programs, these variables serve as markers in Ti that will be critical for the trace replay semantics. The comprehension trace records, using the notation ⋃{e′ | x ∈ T} ▷ {ℓ1.T1, . . . , ℓn.Tn}, that the trace T was used to compute a multiset v, and that x was bound to each element ℓi.vi of v in turn, with trace Ti showing how the corresponding subset of the result was computed. The comprehension trace also records the expression e′ and bound variable x, which are again not strictly necessary but preserve the structural similarity to the original expression.

At this point it is useful to provide some informal motivation for the labeling semantics, compared for example to other semantics that use annotations or labels as a form of provenance. We do


γ, e ⇓ v, T

  ───────────
  γ, c ⇓ c, c

  γ, e1 ⇓ c1, T1   ···   γ, en ⇓ cn, Tn
  ──────────────────────────────────────────────────────────
  γ, f(e1, . . . , en) ⇓ f̂(c1, . . . , cn), f(T1, . . . , Tn)

  ────────────────
  γ, x ⇓ γ(x), x

  γ, e1 ⇓ v1, T1    γ[x ↦ v1], e2 ⇓ v2, T2
  ──────────────────────────────────────────
  γ, let x = e1 in e2 ⇓ v2, let x = T1 in T2

  γ, e1 ⇓ v1, T1   ···   γ, en ⇓ vn, Tn
  ───────────────────────────────────────────────────────────────────────────
  γ, ⟨A1:e1, . . . , An:en⟩ ⇓ ⟨A1:v1, . . . , An:vn⟩, ⟨A1:T1, . . . , An:Tn⟩

  γ, e ⇓ ⟨A1:v1, . . . , An:vn⟩, T
  ────────────────────────────────
  γ, e.Ai ⇓ vi, T.Ai

  γ, e ⇓ true, T    γ, e1 ⇓ v1, T1
  ─────────────────────────────────────────────
  γ, if(e, e1, e2) ⇓ v1, if(T, e1, e2) ▷true T1

  γ, e ⇓ false, T    γ, e2 ⇓ v2, T2
  ──────────────────────────────────────────────
  γ, if(e, e1, e2) ⇓ v2, if(T, e1, e2) ▷false T2

  ─────────────
  γ, ∅ ⇓ ∅, ∅

  γ, e ⇓ v, T
  ─────────────────────
  γ, {e} ⇓ {ε.v}, {T}

  γ, e1 ⇓ v1, T1    γ, e2 ⇓ v2, T2
  ───────────────────────────────────────
  γ, e1 ∪ e2 ⇓ 1 · v1 ⊎ 2 · v2, T1 ∪ T2

  γ, e ⇓ v, T    γ, x ∈ v, e′ ⇓* v′, Θ
  ─────────────────────────────────────────────
  γ, ⋃{e′ | x ∈ e} ⇓ v′, ⋃{e′ | x ∈ T} ▷ Θ

  γ, e ⇓ {ℓ1.v1, . . . , ℓn.vn}, T
  ───────────────────────────────────
  γ, sum e ⇓ v1 +̂ · · · +̂ vn, sum T

  γ, e ⇓ v, T    v = ∅
  ────────────────────────────
  γ, empty e ⇓ true, empty T

  γ, e ⇓ v, T    v ≠ ∅
  ─────────────────────────────
  γ, empty e ⇓ false, empty T

γ, x ∈ v, e ⇓* v′, Θ

  ─────────────────────
  γ, x ∈ ∅, e ⇓* ∅, ∅

  γ, x ∈ v1, e ⇓* v′1, Θ1    γ, x ∈ v2, e ⇓* v′2, Θ2
  ───────────────────────────────────────────────────
  γ, x ∈ v1 ⊎ v2, e ⇓* v′1 ⊎ v′2, Θ1 ⊎ Θ2

  γ[x ↦ v], e ⇓ v′, T
  ──────────────────────────────────
  γ, x ∈ {ℓ.v}, e ⇓* ℓ · v′, {ℓ.T}

Figure 2. Traced evaluation.

not view the labels themselves as provenance; instead, they provide a useful infrastructure for traces, which do capture a form of provenance. Moreover, by calculating the label of each part of an intermediate or final result deterministically (given the labels on the input), we provide a way to reliably refer to parts of the output, which otherwise may be unaddressable in a multiset-valued semantics. This is essential for supporting compositional slicing for operations such as let-binding or comprehension, where the output of one subexpression becomes a part of the input for another.

A central point of our semantics is that evaluation preserves the property that labels uniquely identify the elements of each multiset. This naturally assumes that the labels on the input collections are distinct. In fact, a stronger property is required: evaluation preserves the property that set labels form a prefix code. In the following, we write x ⊑ y to indicate that sequence x is a prefix of sequence y.

Definition 2.1. A prefix code over Σ is a set of sequences L ⊆ Σ* such that for every x, y ∈ L, if x ⊑ y then x = y. A sub-prefix code of a prefix code L is a prefix code L′ such that for all x ∈ L there exists y ∈ L′ such that y ⊑ x. We write L′ ⊑ L to indicate that L′ is a sub-prefix code of L. We say that L and L′ are prefix-disjoint when no element of L is a prefix of an element of L′ and vice versa.

Let v be a collection v = {ℓ1.v1, …, ℓn.vn}. We define the domain of v to be dom(v) = {ℓ1, …, ℓn}. We say that a value or value environment is prefix-labeled if for every collection v occurring in it, the labels ℓ1, …, ℓn are distinct and dom(v) is a prefix code. Similarly, we say that a trace is prefix-labeled if every labeled trace set Θ = {ℓ1.T1, …, ℓn.Tn} is prefix-labeled.

Theorem 2.2. If γ is prefix-labeled and γ, e ⇓ v, T then v and T are both prefix-labeled. Moreover, if γ and v are prefix-labeled and γ, x ∈ v, e ⇓* v′, Θ then v′ and Θ are prefix-labeled, and in addition dom(Θ) = dom(v) ⊑ dom(v′).

Proof. By induction on derivations. The key cases are those for union, which is straightforward, and for comprehensions. For the latter case we need the second part, to show that whenever γ and v are prefix-labeled, if γ, x ∈ v, e ⇓* v′, Θ then v′ and Θ are prefix-labeled, and in addition dom(Θ) = dom(v) and dom(v) is a sub-prefix code of dom(v′).

The prefix code property is needed later in the slicing algorithms, where we will use it to match elements of collections produced by comprehensions with corresponding elements of the trace set Θ. From now on, we assume that all values, environments, and traces are prefix-labeled, so any labeled set is assumed to have the prefix code property.

We will use the following query as a running example.

Q = ⋃{ if(x.B = 3, {⟨A:x.A, B:x.C⟩}, {}) | x ∈ R }

This is a simple selection query; it identifies records in R that have a B value of 3, and returns a record ⟨A:x.A, B:x.C⟩ containing x's A value and its C value renamed to B. The result of Q on the input R in the introduction is Q(R) = {[r2].⟨A:2, B:8⟩, [r3].⟨A:4, B:9⟩}, and the trace is:

T = ⋃{ _ | x ∈ R } ▸ { [r1].if(x.B = 3, _, _) ▸false {},
                       [r2].if(x.B = 3, _, _) ▸true { _ },
                       [r3].if(x.B = 3, _, _) ▸true { _ } }

where _ indicates omitted (easily inferable) subexpressions.

Replay   We introduce a judgment γ, T ↷ v for replaying a trace on a (possibly different) environment γ. The rules for replaying NRC traces are presented in Figure 3. Many of the rules are straightforward or analogous to the corresponding evaluation rules. Here we only discuss the replay rules for conditional and comprehension traces.

For the conditional rules, the basic idea is as follows. If replaying the trace T of the test yields the same boolean value b as recorded in the trace, we replay the trace of the taken branch. If the test yields a different value, then replay fails.

To replay a comprehension trace ⋃{e | x ∈ T} ▸ Θ, rule RCOMP first replays trace T to update the set of elements v over which we will iterate. We define a separate judgment γ, x ∈ v, Θ ↷* v′ to iterate over set v and replay traces on the corresponding elements. For elements ℓi ∈ dom(Θ), we replay the corresponding trace Θ(ℓi). Replay fails if v contains any labels not present in Θ.

Replaying a trace can fail if either the branch taken in a conditional test differs from that recorded in the trace, or the intermediate set obtained from rerunning a comprehension trace includes values whose labels are not present in the trace. This means, in particular, that changes to the input can be replayed if they only change base


Judgment γ, T ↷ v:

γ, c ↷ c

γ, T1 ↷ c1  ⋯  γ, Tn ↷ cn  ⟹  γ, f(T1, …, Tn) ↷ f̂(c1, …, cn)

γ, x ↷ γ(x)

γ, T1 ↷ v1    γ[x ↦ v1], T2 ↷ v2  ⟹  γ, let x = T1 in T2 ↷ v2

γ, T1 ↷ v1  ⋯  γ, Tn ↷ vn  ⟹  γ, ⟨A1:T1, …, An:Tn⟩ ↷ ⟨A1:v1, …, An:vn⟩

γ, T ↷ ⟨A1:v1, …, An:vn⟩  ⟹  γ, T.Ai ↷ vi

γ, T ↷ true    γ, T1 ↷ v1  ⟹  γ, if(T, e1, e2) ▸true T1 ↷ v1

γ, T ↷ false    γ, T2 ↷ v2  ⟹  γ, if(T, e1, e2) ▸false T2 ↷ v2

γ, ∅ ↷ ∅

γ, T ↷ v  ⟹  γ, {T} ↷ {ε.v}

γ, T1 ↷ v1    γ, T2 ↷ v2  ⟹  γ, T1 ∪ T2 ↷ 1·v1 ⊎ 2·v2

γ, T ↷ {ℓ1.v1, …, ℓn.vn}  ⟹  γ, sum T ↷ v1 +̂ ⋯ +̂ vn

γ, T ↷ v    v = ∅  ⟹  γ, empty T ↷ true

γ, T ↷ v    v ≠ ∅  ⟹  γ, empty T ↷ false

γ, T ↷ v    γ, x ∈ v, Θ ↷* v′  ⟹  γ, ⋃{e | x ∈ T} ▸ Θ ↷ v′

Judgment γ, x ∈ v, Θ ↷* v′:

γ, x ∈ ∅, Θ ↷* ∅

γ, x ∈ v1, Θ ↷* v′1    γ, x ∈ v2, Θ ↷* v′2  ⟹  γ, x ∈ v1 ⊎ v2, Θ ↷* v′1 ⊎ v′2

ℓi ∈ dom(Θ)    γ[x ↦ vi], Θ(ℓi) ↷ v′i  ⟹  γ, x ∈ {ℓi.vi}, Θ ↷* ℓi · v′i

Figure 3. Trace replay.

values or delete set elements, but changes leading to additions of new labels to sets involved in comprehensions typically cannot be replayed.

Returning to the running example, suppose we change field B of row r1 of R from 2 to 5. This change has no effect on the control flow choice taken in Q, and replaying the trace T succeeds. Likewise, changing the A or C field of any row of R has no effect, since these values do not affect the control flow choices in T. However, changing the B field of a row to 3 (or changing it from 3 to something else) means that replay will fail.

2.2 Key properties

Before moving on to consider trace slicing, we identify some properties that formalize the intuition that traces are consistent with and faithfully record execution.

Evaluation and traced evaluation are deterministic:

Proposition 2.3 (Determinacy). If γ, e ⇓ v, T and γ, e ⇓ v′, T′ then v = v′ and T = T′. If γ, T ↷ v and γ, T ↷ v′ then v = v′.

When the trace is irrelevant, we write γ, e ⇓ v to indicate that γ, e ⇓ v, T for some T.

Traces can be represented using pointers to share common subexpressions; using this DAG representation, traces can be stored in space polynomial in the input. (This sharing happens automatically in our implementation in Haskell.)

Proposition 2.4. For a fixed e, if γ, e ⇓ v, T then the sizes of v and of the DAG representation of T are at most polynomial in |γ|.

Proof. Most cases are straightforward. The only non-trivial case is for comprehensions, where we need a stronger induction hypothesis: if γ, x ∈ v, e ⇓* v′, Θ then the sizes of v′ and of the DAG representation of Θ are at most polynomial in |γ|.
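The DAG sharing behind Proposition 2.4 can be modeled with hash-consing: structurally equal subtraces are stored once and reused. This is a small Python sketch of the idea (the `Interner` class is our own illustration; the paper only notes that Haskell's lazy sharing provides this automatically).

```python
class Interner:
    """Hash-consing pool: structurally equal nodes share one representative,
    so repeated subtraces take constant extra space in the DAG."""
    def __init__(self):
        self.table = {}

    def node(self, *parts):
        key = parts
        if key not in self.table:
            self.table[key] = key   # first occurrence becomes canonical
        return self.table[key]

pool = Interner()
# Two textually identical subtraces, e.g. from iterating the same
# conditional over two rows, become one shared node.
t1 = pool.node("if", "x.B = 3", True)
t2 = pool.node("if", "x.B = 3", True)
shared = t1 is t2
```

Because `t1 is t2` holds, a comprehension trace over n rows that repeats the same test subtrace stores it once rather than n times.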

Furthermore, traced evaluation produces a trace that replays to the same value as the original expression run on the original environment. We call this property consistency.

Proposition 2.5 (Consistency). If γ, e ⇓ v, T then γ, T ↷ v.

Finally, trace replay is faithful to ordinary evaluation in the following sense: if T is generated by running e in γ and we successfully replay T on γ′ then we obtain the same value (and same trace) as if we had rerun e from scratch in γ′, and vice versa:

Proposition 2.6 (Fidelity). If γ, e ⇓ v, T, then for any γ′, v′ we have γ′, e ⇓ v′, T if and only if γ′, T ↷ v′.

Observe that consistency is a special case of fidelity (with γ′ = γ and v′ = v). Moreover, the "if" direction of fidelity holds even though replay can fail, because we require that γ′, e ⇓ v′, T holds for the same trace T. If γ′, e ⇓ v′, T′ is derivable but only for a different trace T′, then replay fails.

3. Trace Slicing

The goal of the trace slicing algorithm we consider is to remove information from a trace and input that is not needed to recompute a part of the output. To accommodate these requirements, we introduce traces and values with holes and, more generally, we consider patterns that represent relations on values capturing possible changes. In this section, we limit attention to pairs, and consider slicing for a class of simple patterns. We consider records and more expressive enriched patterns in the next section.

We extend traces with holes □ and define a subtrace relation ⊑ that is essentially a syntactic precongruence on traces and trace sets such that □ ⊑ T holds and Θ ⊆ Θ′ implies Θ ⊑ Θ′. (The definition of ⊑ is shown in full in the companion technical report.) Intuitively, holes denote parts of traces we do not care about, and we can think of a trace T with holes as standing for a set of possible complete traces {T′ | T ⊑ T′} representing different ways of filling in the holes.

The syntax of simple patterns p, set patterns sp, and pattern environments ρ is:

p ::= □ | ◇ | c | (p1, p2) | sp
sp ::= ∅ | {ℓ1.p1, …, ℓn.pn}
ρ ::= [x1 ↦ p1, …, xn ↦ pn]

Essentially, a pattern is a value with holes in some positions. The meaning of each pattern is defined through a relation ≅_p that says when two values are equivalent with respect to a pattern. This relation is defined in Figure 5. A hole □ indicates that that part of the value is unimportant; that is, for slicing purposes we don't care about that part of the result. Its associated relation ≅_□ relates any two values. An identity pattern ◇ is similar to a hole: it says that the value is important but its exact value is not specified, and its associated relation ≅_◇ is the identity relation on values. Complete set patterns {ℓ1.p1, …, ℓn.pn} specify the labels and patterns for all of the elements of a set; that is, such a pattern relates only sets that have exactly the labeled elements specified and whose corresponding values match according to the corresponding patterns.

We define the union of two set patterns as {ℓi.pi} ⊎ {ℓ′i.p′i} = {ℓi.pi, ℓ′i.p′i}, provided their domains ℓ⃗i and ℓ⃗′i are prefix-disjoint. We define a (partial) least upper bound operation p ⊔ p′ such that for any v, v′ we have v ≅_{p ⊔ p′} v′ if and only if v ≅_p v′ and v ≅_{p′} v′; the full definition is shown in Figure 4. We define


□ ⊔ p = p ⊔ □ = p
◇ ⊔ p = p ⊔ ◇ = p[◇/□]
c ⊔ c = c
(p1, p2) ⊔ (p′1, p′2) = (p1 ⊔ p′1, p2 ⊔ p′2)
{ℓi.pi} ⊔ {ℓi.p′i} = {ℓi.pi ⊔ p′i}

where:

◇[◇/□] = □[◇/□] = ◇
c[◇/□] = c
(p1, p2)[◇/□] = (p1[◇/□], p2[◇/□])
{ℓi.pi}[◇/□] = {ℓi.pi[◇/□]}

Figure 4. Least upper bound for simple patterns.

Judgment v ≅_p v′:

v ≅_□ v′        v ≅_◇ v        c ≅_c c

v1 ≅_{p1} v′1    v2 ≅_{p2} v′2  ⟹  (v1, v2) ≅_{(p1, p2)} (v′1, v′2)

vi ≅_{pi} v′i (i ∈ {1, …, n})  ⟹  {ℓ1.v1, …, ℓn.vn} ≅_{{ℓ1.p1, …, ℓn.pn}} {ℓ1.v′1, …, ℓn.v′n}

γ ≅_ρ γ′  ⟺  ∀x ∈ dom(ρ). γ(x) ≅_{ρ(x)} γ′(x)

Figure 5. Simple pattern equivalence.

the partial ordering p ⊑ p′ as p ⊔ p′ = p′. We say that a value v matches pattern p if p ⊑ v. Observe that this implies v ≅_p v. We extend the ⊔ and ⊑ operations to pattern environments ρ pointwise, that is, (ρ ⊔ ρ′)(x) = ρ(x) ⊔ ρ′(x).
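The least upper bound on simple patterns is straightforward to compute structurally. Below is a small Python sketch of ⊔ from Figure 4 (our own encoding, not the paper's Haskell code): `None` models a hole □, a sentinel object models ◇, tuples model pairs, and dicts keyed by label model complete set patterns.

```python
HOLE = None       # models the hole pattern □
DIA = object()    # models the identity pattern ◇

def subst_dia(p):
    """p[◇/□]: replace every hole in p by ◇."""
    if p is HOLE or p is DIA:
        return DIA
    if isinstance(p, tuple):
        return tuple(subst_dia(q) for q in p)
    if isinstance(p, dict):
        return {l: subst_dia(q) for l, q in p.items()}
    return p   # constants are unchanged

def lub(p, q):
    """Least upper bound p ⊔ q of simple patterns (partial operation)."""
    if p is HOLE:
        return q
    if q is HOLE:
        return p
    if p is DIA:
        return subst_dia(q)
    if q is DIA:
        return subst_dia(p)
    if isinstance(p, tuple) and isinstance(q, tuple) and len(p) == len(q):
        return tuple(lub(a, b) for a, b in zip(p, q))
    if isinstance(p, dict) and isinstance(q, dict) and p.keys() == q.keys():
        return {l: lub(p[l], q[l]) for l in p}
    if p == q:
        return p   # c ⊔ c = c
    raise ValueError("no upper bound")
```

For instance, `lub((HOLE, 5), (3, HOLE))` yields `(3, 5)`, and taking the bound with ◇ promotes remaining holes to ◇, matching the p[◇/□] clause.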

We define several additional operations on patterns that are needed for the slicing algorithm. Consider the following singleton extraction operation p.ε and label projection operation p[ℓ]:

({ε.p}).ε = p    □.ε = □    ◇.ε = ◇
sp[ℓ] = {ℓ′.p | ℓ.ℓ′.p ∈ sp}    □[ℓ] = □    ◇[ℓ] = ◇

These operations are only used for the above cases; they have no effect on constant or pair patterns. For sets, p[ℓ] extracts the subset of p whose labels start with ℓ, truncating the initial prefix ℓ, while if p is □ or ◇ then again p[ℓ] returns the same kind of hole. Moreover, dom(p[ℓ]) is a prefix code if dom(p) is.

Suppose L is a prefix code. We define restriction of a set pattern p to L as follows:

sp|_L = {ℓ.ℓ′.p ∈ sp | ℓ ∈ L}    □|_L = □    ◇|_L = ◇

It is easy to see that dom(p|_L) ⊆ dom(p), so dom(p|_L) is a prefix code if dom(p) is, and this operation is well-defined on collections:

Lemma 3.1. If sp is prefix-labeled and L is a prefix code then sp[ℓ] and sp|_L are prefix-labeled.
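Label projection and restriction are the two set-pattern operations the slicing rules lean on (SUNION uses p[1] and p[2]; SUNION* uses p|dom(Θ)). A Python sketch, under our own encoding of set patterns as dicts from label sequences (tuples) to sub-patterns:

```python
def project(sp, l):
    """p[ℓ]: keep entries whose label sequence starts with ℓ, stripping ℓ."""
    return {lab[1:]: p for lab, p in sp.items() if lab[:1] == (l,)}

def restrict(sp, L):
    """p|_L: keep entries whose label sequence extends some ℓ ∈ L
    (L is assumed to be a prefix code)."""
    return {lab: p for lab, p in sp.items()
            if any(lab[:len(l)] == tuple(l) for l in L)}

# A pattern over a union value: labels 1.* came from the left operand,
# labels 2.* from the right.
sp = {(1, "r1"): "p1", (1, "r2"): "p2", (2, "r3"): "p3"}
```

Here `project(sp, 1)` recovers the pattern for the left operand of a union with the leading 1 removed, and `restrict(sp, [(1,)])` keeps only the entries under prefix 1 without re-rooting them.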

Backward Slicing   The rules for backward slicing are given in Figure 6. The judgment p, T ↘ ρ, S slices trace T with respect to a pattern p to yield the trace slice S and sliced input environment ρ. The sliced input environment records what parts of the input are needed to produce p; this is needed for slicing operations such as let-binding or comprehensions. The main new ideas are in the rules for collection operations, particularly comprehensions. The slicing rules SCONST, SPRIM, SVAR, SLET, SPAIR, SPROJ_i, and SIF follow essentially the same idea as in our previous work [1, 28]. We focus discussion on the new cases, but we review the key ideas for these operations here in order to make the presentation self-contained.

The rules for collections use labels and set pattern operations to effectively undo the evaluation of the set pattern. The rule SEMPTY is essentially the same as the constant rule. The rule SSNG uses the singleton extraction operation p.ε to obtain a pattern describing the single element of a singleton set value matching p. The rule SUNION uses the two projections p[1] and p[2] to obtain the patterns describing the subsets obtained from the first and second subtraces in the union pattern, respectively.

The rule SCOMP uses a similar idea to let-binding. We slice the trace set Θ using an auxiliary judgment p, x.Θ ↘* ρ, Θ′, p′, obtaining a sliced input environment, sliced trace set Θ′, and pattern p′ describing the set of values to which x was bound. We then use p′ to slice backwards through the subtrace T that constructed the set. The auxiliary slicing judgment for trace sets has three rules: a trivial rule SEMPTY* when the set is empty, rule SSNG* that uses label projection p[ℓ] to handle a singleton trace set, and a rule SUNION* handling larger trace sets by decomposing them into subsets. This is essentially a structural recursion over the trace set, and is deterministic even though the rules can be used to decompose the trace sets in many different ways, because ⊎ is associative. Rule SUNION* also requires that we restrict the set pattern to match the domains of the corresponding trace patterns.

The slicing rules SSUM and SEMPTYP follow the same idea as for primitive operations at base type: we require that the whole set value be preserved exactly. As discussed by Perera et al. [28] for primitive operations, this is a potential source of overapproximation, since (for example) for an emptiness test, all we really need is to preserve the number of elements in the set, not their values. The last rule shows how to slice pair patterns when the pattern is ◇: we slice both of the subtraces by ◇ and combine the results.

It is straightforward to show that the slicing algorithm is well-defined for consistent traces. That is, if γ, T ↷ v and p ⊑ v then there exist ρ, S such that p, T ↘ ρ, S holds, where ρ ⊑ γ and S ⊑ T. We defer the correctness theorem for slicing using simple patterns to the end of the next section, since it is a special case of correctness for slicing using enriched patterns.

Continuing our running example, consider the pattern p = {[r2].⟨A:□, B:8⟩, [r3].□}. The slice of T with respect to this pattern is of the form:

T′ = ⋃{ _ | x ∈ R } ▸ { [r1].□,
                        [r2].if(x.B = 3, _, _) ▸true { _ },
                        [r3].□ }

and the slice of R is R′ = {[r1].□, [r2].⟨A:□, B:3, C:8⟩, [r3].□}. (That is, R′ is the value of ρ(R), where ρ is the pattern environment produced by slicing T with respect to p.) Observe that the value of A is not needed, but the value of B must remain 3 in order to preserve the control flow behavior of the trace on r2. The holes indicate that changes to r1 and r3 in the input cannot affect r2. However, a trace matching T′ cannot be replayed if any of r1, r2, r3 are deleted from the input or if the A field is removed from an input record, because the replay rules require all of the labels mentioned in collections or records in T′ to be present. We now turn our attention to enriched patterns, which mitigate these drawbacks.

4. Enriched Patterns

So far we have considered only complete set patterns of the form {ℓ1.p1, …, ℓn.pn}. These patterns relate pairs of values that have exactly n elements labeled ℓ1, …, ℓn, each of which matches p1, …, pn respectively.

This is awkward, as we can already observe in our running example above: to obtain a slice describing how one record was computed, we need to use a set pattern that lists all of the indexes in the output. Moreover, the slice with respect to such a pattern may also include labeled subtraces explaining why the other elements exist in the output (and no others). This information seems intuitively


Judgment p, T ↘ ρ, S:

SHOLE:    □, T ↘ [], □

SCONST:   p, c ↘ [], c

SPRIM:    ◇, T1 ↘ ρ1, S1  ⋯  ◇, Tn ↘ ρn, Sn
          ⟹  p, f(T1, …, Tn) ↘ ρ1 ⊔ ⋯ ⊔ ρn, f(S1, …, Sn)

SVAR:     p, x ↘ [x ↦ p], x

SLET:     p2, T2 ↘ ρ2[x ↦ p1], S2    p1, T1 ↘ ρ1, S1
          ⟹  p2, let x = T1 in T2 ↘ ρ1 ⊔ ρ2, let x = S1 in S2

SPAIR:    p1, T1 ↘ ρ1, S1    p2, T2 ↘ ρ2, S2
          ⟹  (p1, p2), (T1, T2) ↘ ρ1 ⊔ ρ2, (S1, S2)

SPROJ1:   (p, □), T ↘ ρ, S  ⟹  p, T.#1 ↘ ρ, S.#1

SPROJ2:   (□, p), T ↘ ρ, S  ⟹  p, T.#2 ↘ ρ, S.#2

SIF:      p, T′ ↘ ρ′, S′    b, T ↘ ρ, S
          ⟹  p, if(T, e1, e2) ▸b T′ ↘ ρ′ ⊔ ρ, if(S, e1, e2) ▸b S′

SEMPTY:   p, ∅ ↘ [], ∅

SSNG:     p.ε, T ↘ ρ, S  ⟹  p, {T} ↘ ρ, {S}

SUNION:   p[1], T1 ↘ ρ1, S1    p[2], T2 ↘ ρ2, S2
          ⟹  p, T1 ∪ T2 ↘ ρ1 ⊔ ρ2, S1 ∪ S2

SCOMP:    p, x.Θ ↘* ρ′, Θ′, p′    p′, T ↘ ρ, S
          ⟹  p, ⋃{e | x ∈ T} ▸ Θ ↘ ρ ⊔ ρ′, ⋃{e | x ∈ S} ▸ Θ′

SSUM:     ◇, T ↘ ρ, S  ⟹  p, sum T ↘ ρ, sum S

SEMPTYP:  ◇, T ↘ ρ, S  ⟹  p, empty T ↘ ρ, empty S

SDIAMOND: ◇, T1 ↘ ρ1, S1    ◇, T2 ↘ ρ2, S2
          ⟹  ◇, (T1, T2) ↘ ρ1 ⊔ ρ2, (S1, S2)

Judgment p, x.Θ ↘* ρ, Θ′, p′:

SEMPTY*:  ∅, x.∅ ↘* [], ∅, ∅

SSNG*:    p[ℓ], T ↘ ρ[x ↦ p′], S  ⟹  p, x.{ℓ.T} ↘* ρ, {ℓ.S}, {ℓ.p′}

SUNION*:  p|dom(Θ1), x.Θ1 ↘* ρ1, Θ′1, p1    p|dom(Θ2), x.Θ2 ↘* ρ2, Θ′2, p2
          ⟹  p, x.Θ1 ⊎ Θ2 ↘* ρ1 ⊔ ρ2, Θ′1 ⊎ Θ′2, p1 ⊎ p2

Figure 6. Backward trace slicing.

irrelevant to the value at ℓ1, and can be a major overhead if the collection is large. We have also considered only binary pairs; it would be more convenient to support record patterns directly.

In this section we sketch how to enrich the language of patterns to allow for partial set and partial record patterns, as follows:

p ::= □ | ◇ | c | sp | rp
rp ::= ⟨⟩ | ⟨Ai: pi⟩ | ⟨Ai: pi; □⟩ | ⟨Ai: pi; ◇⟩
sp ::= ∅ | {ℓi.pi} | {ℓi.pi} ∪̇ □ | {ℓi.pi} ∪̇ ◇

Record patterns are of the form ⟨Ai: pi⟩, listing the fields and the patterns they must match, possibly followed by □ or ◇, which the remainder of the record must match. The pattern {ℓi.pi} ∪̇ □ stands for a set with labeled elements matching patterns p1, …, pn, plus some additional elements whose values we don't care about. For example, we can use the pattern {ℓ1.p1} ∪̇ □ to express interest in why element ℓ1 matches p1, when we don't care about the rest of the set. The second partial pattern, {ℓi.pi} ∪̇ ◇, has similar behavior, but it says that the rest of the sets being considered must be equal. For example, {ℓ.□} ∪̇ ◇ says that the two sets are equal except possibly at label ℓ. This pattern is needed mainly in order to ensure that we can define a least upper bound on enriched patterns, since we cannot express ({ℓ1.p1, …, ℓn.pn} ∪̇ □) ⊔ ◇ otherwise. We define dom({ℓi.pi} ∪̇ □) = dom({ℓi.pi} ∪̇ ◇) = {ℓ1, …, ℓn}. Disjoint union of enriched patterns sp ⊎ sp′ is defined only if the domains are prefix-disjoint, so that the labels of the result still form a prefix code; this operation is defined in Figure 7.

We now extend the definitions of p[◇/□] and ⊔ to account for extended patterns. We extend the [◇/□] substitution operation as follows:

⟨Ai: pi⟩[◇/□] = ⟨Ai: pi[◇/□]⟩
⟨Ai: pi; □⟩[◇/□] = ⟨Ai: pi[◇/□]; ◇⟩
⟨Ai: pi; ◇⟩[◇/□] = ⟨Ai: pi[◇/□]; ◇⟩
({ℓi.pi} ∪̇ □)[◇/□] = {ℓi.pi[◇/□]} ∪̇ ◇
({ℓi.pi} ∪̇ ◇)[◇/□] = {ℓi.pi[◇/□]} ∪̇ ◇

We handle the additional cases of the ⊔ operation in Figure 8, and extend the ≅_p relation as shown in Figure 9. Note that taking the least upper bound of a partial pattern with a complete pattern yields a complete pattern, while taking the least upper bound of

◇ ⊎ □ = □ ⊎ ◇ = □ ⊎ □ = □
◇ ⊎ ◇ = ◇
sp ⊎ ∅ = ∅ ⊎ sp = sp
{ℓi.pi} ⊎ □ = □ ⊎ {ℓi.pi} = {ℓi.pi} ∪̇ □
{ℓi.pi} ⊎ ◇ = ◇ ⊎ {ℓi.pi} = {ℓi.pi} ∪̇ ◇
{ℓi.pi} ⊎ ({ℓ′j.p′j} ∪̇ □) = ({ℓi.pi} ⊎ {ℓ′j.p′j}) ∪̇ □
{ℓi.pi} ⊎ ({ℓ′j.p′j} ∪̇ ◇) = ({ℓi.pi} ⊎ {ℓ′j.p′j}) ∪̇ ◇
({ℓi.pi} ∪̇ □) ⊎ ({ℓ′j.p′j} ∪̇ □) = ({ℓi.pi} ⊎ {ℓ′j.p′j}) ∪̇ □
({ℓi.pi} ∪̇ ◇) ⊎ ({ℓ′j.p′j} ∪̇ □) = ({ℓi.pi} ⊎ {ℓ′j.p′j}) ∪̇ □
({ℓi.pi} ∪̇ □) ⊎ ({ℓ′j.p′j} ∪̇ ◇) = ({ℓi.pi} ⊎ {ℓ′j.p′j}) ∪̇ □
({ℓi.pi} ∪̇ ◇) ⊎ ({ℓ′j.p′j} ∪̇ ◇) = ({ℓi.pi} ⊎ {ℓ′j.p′j}) ∪̇ ◇

Figure 7. Enriched pattern union.

partial patterns involving ◇ again relies on the [◇/□] substitution operation.

We extend the singleton extraction p.ε, label projection p[ℓ], and restriction p|_L operations on set patterns as follows:

({ε.p} ∪̇ ◇).ε = ({ε.p} ∪̇ □).ε = p
ℓ · ({ℓi.pi} ∪̇ □) = (ℓ · {ℓi.pi}) ⊎ □
ℓ · ({ℓi.pi} ∪̇ ◇) = (ℓ · {ℓi.pi}) ⊎ ◇
({ℓi.pi} ∪̇ □)[ℓ] = ({ℓi.pi}[ℓ]) ⊎ □
({ℓi.pi} ∪̇ ◇)[ℓ] = ({ℓi.pi}[ℓ]) ⊎ ◇
({ℓi.pi} ∪̇ □)|_L = ({ℓi.pi}|_L) ⊎ □
({ℓi.pi} ∪̇ ◇)|_L = ({ℓi.pi}|_L) ⊎ ◇

Note that in many cases, we use the disjoint union operation ⊎ on the right-hand side; this ensures, for example, that we never produce results of the form ∅ ∪̇ □ or ∅ ∪̇ ◇; these are normalized to □ and ◇ respectively, and this normalization reduces the number of corner cases in the slicing algorithm.
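The normalization step can be made concrete with a small Python sketch (our own encoding, with hypothetical names): an enriched set pattern is a pair `(entries, rest)` where `rest` is `None` for a complete pattern, `"HOLE"` for ∪̇ □, and `"DIA"` for ∪̇ ◇.

```python
def norm(entries, rest):
    """Normalize: ∅ ∪̇ □ collapses to □ and ∅ ∪̇ ◇ collapses to ◇."""
    if not entries and rest is not None:
        return rest   # the bare hole/diamond pattern
    return (entries, rest)

def union(sp1, sp2):
    """Disjoint union sp ⊎ sp' of enriched set patterns; on the 'rest'
    component, □ absorbs ◇ (as in Figure 7's mixed cases)."""
    (d1, r1), (d2, r2) = sp1, sp2
    assert not (d1.keys() & d2.keys()), "domains must be disjoint"
    if r1 is None and r2 is None:
        rest = None
    elif "HOLE" in (r1, r2):
        rest = "HOLE"
    else:
        rest = "DIA"
    return norm({**d1, **d2}, rest)
```

For example, uniting a complete pattern with a ∪̇ ◇ pattern keeps the ◇ tail, while two empty entry maps with a hole tail normalize back to a plain hole.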

We define a record pattern projection operation p.A as follows:

⟨A1:p1, …, An:pn⟩.Ai = pi
⟨A1:p1, …, An:pn; □⟩.Ai = pi
⟨A1:p1, …, An:pn; ◇⟩.Ai = pi
□.A = □        ◇.A = ◇
⟨A1:p1, …, An:pn; □⟩.B = □   (B ∉ {A1, …, An})
⟨A1:p1, …, An:pn; ◇⟩.B = ◇   (B ∉ {A1, …, An})

We extend the slicing judgment to accommodate these new patterns in Figure 10. The rules SREC and SPROJ_A are similar to


{ℓi.pi, ℓ′j.qj} ⊔ ({ℓi.p′i} ∪̇ □) = {ℓi.pi ⊔ p′i, ℓ′j.qj}

{ℓi.pi, ℓ′j.qj} ⊔ ({ℓi.p′i} ∪̇ ◇) = {ℓi.pi ⊔ p′i, ℓ′j.qj[◇/□]}

({ℓi.pi, ℓ′j.qj} ∪̇ □) ⊔ ({ℓi.p′i, ℓ″k.rk} ∪̇ □) = {ℓi.pi ⊔ p′i, ℓ′j.qj, ℓ″k.rk} ∪̇ □

({ℓi.pi, ℓ′j.qj} ∪̇ □) ⊔ ({ℓi.p′i, ℓ″k.rk} ∪̇ ◇) = {ℓi.pi ⊔ p′i, ℓ′j.qj[◇/□], ℓ″k.rk} ∪̇ ◇

({ℓi.pi, ℓ′j.qj} ∪̇ ◇) ⊔ ({ℓi.p′i, ℓ″k.rk} ∪̇ ◇) = {ℓi.pi ⊔ p′i, ℓ′j.qj[◇/□], ℓ″k.rk[◇/□]} ∪̇ ◇

⟨Ai: pi, Bj: qj⟩ ⊔ ⟨Ai: p′i; □⟩ = ⟨Ai: pi ⊔ p′i, Bj: qj⟩

⟨Ai: pi, Bj: qj⟩ ⊔ ⟨Ai: p′i; ◇⟩ = ⟨Ai: pi ⊔ p′i, Bj: qj[◇/□]⟩

⟨Ai: pi, Bj: qj; □⟩ ⊔ ⟨Ai: p′i, Ck: rk; □⟩ = ⟨Ai: pi ⊔ p′i, Bj: qj, Ck: rk; □⟩

⟨Ai: pi, Bj: qj; □⟩ ⊔ ⟨Ai: p′i, Ck: rk; ◇⟩ = ⟨Ai: pi ⊔ p′i, Bj: qj[◇/□], Ck: rk; ◇⟩

⟨Ai: pi, Bj: qj; ◇⟩ ⊔ ⟨Ai: p′i, Ck: rk; ◇⟩ = ⟨Ai: pi ⊔ p′i, Bj: qj[◇/□], Ck: rk[◇/□]; ◇⟩

Figure 8. Least upper bound for enriched patterns (excluding some symmetric cases).

v1 ≅_{p1} v′1  ⋯  vn ≅_{pn} v′n  ⟹  ⟨Ai: vi⟩ ≅_{⟨Ai: pi⟩} ⟨Ai: v′i⟩

v1 ≅_{p1} v′1  ⋯  vn ≅_{pn} v′n  ⟹  ⟨Ai: vi, Bj: wj⟩ ≅_{⟨Ai: pi; □⟩} ⟨Ai: v′i, Ck: w′k⟩

v1 ≅_{p1} v′1  ⋯  vn ≅_{pn} v′n  ⟹  ⟨Ai: vi, Bj: wj⟩ ≅_{⟨Ai: pi; ◇⟩} ⟨Ai: v′i, Bj: wj⟩

v1 ≅_{p1} v′1  ⋯  vn ≅_{pn} v′n  ⟹  {ℓi.vi} ⊎ v ≅_{{ℓi.pi} ∪̇ □} {ℓi.v′i} ⊎ v′

v1 ≅_{p1} v′1  ⋯  vn ≅_{pn} v′n  ⟹  {ℓi.vi} ⊎ v ≅_{{ℓi.pi} ∪̇ ◇} {ℓi.v′i} ⊎ v

Figure 9. Enriched pattern equivalence.

Judgment p, T ↘ ρ, S (additional rules):

SPROJ_A:  ⟨A: p; □⟩, T ↘ ρ, S  ⟹  p, T.A ↘ ρ, S.A

SREC:     p.A1, T1 ↘ ρ1, S1  ⋯  p.An, Tn ↘ ρn, Sn
          ⟹  p, ⟨A1:T1, …, An:Tn⟩ ↘ ρ1 ⊔ ⋯ ⊔ ρn, ⟨A1:S1, …, An:Sn⟩

Judgment p, x.Θ ↘* ρ, Θ′, p′ (additional rules):

SHOLE*:     □, x.Θ ↘* [], ∅, □

SDIAMOND*:  ◇, x.∅ ↘* [], ∅, ∅

Figure 10. Backward trace slicing over enriched patterns.

those for pairs, except that we use the field projection operation in the case for a record trace, and we use partial record patterns ⟨A: p; □⟩ in the case for a field projection trace. The added rules SHOLE* and SDIAMOND* handle the possibility that a partial pattern reduces to □ or ◇ through projection; we did not need to handle this case earlier because a simple set pattern is either a hole (which could be handled by the rule □, T ↘ [], □) or a complete set pattern showing all of the labels of the result.

Both simple and extended patterns satisfy a number of lemmas that are required to prove the correctness of trace slicing.

Lemma 4.1 (Properties of union and restriction).

1. If p1 ⊑ v1 and p2 ⊑ v2 and v1 ≅_{p1} v′1 and v2 ≅_{p2} v′2 then v1 ⊎ v2 ≅_{p1 ⊎ p2} v′1 ⊎ v′2, provided all of these disjoint unions are defined.
2. If p ⊑ v1 ⊎ v2 and L1 ⊑ dom(v1) and L2 ⊑ dom(v2) and L1, L2 are prefix-disjoint, then p|_{L1} ⊑ v1 and p|_{L2} ⊑ v2.

Lemma 4.2 (Projection and ⊑).

1. If p ⊑ {ε.v} then p.ε ⊑ v.
2. If p ⊑ 1·v1 ⊎ 2·v2 then p[1] ⊑ v1 and p[2] ⊑ v2.
3. If p ⊑ ℓ·v then p[ℓ] ⊑ v.
4. If p ⊑ ⟨Ai: vi⟩ then p.Ai ⊑ vi.

Lemma 4.3 (Projection and ≅_p).

1. If p ⊑ {ε.v} and v ≅_{p.ε} v′ then {ε.v} ≅_p {ε.v′}.
2. If p ⊑ 1·v1 ⊎ 2·v2 and v1 ≅_{p[1]} v′1 and v2 ≅_{p[2]} v′2 then 1·v1 ⊎ 2·v2 ≅_p 1·v′1 ⊎ 2·v′2.
3. If p ⊑ ℓ·v and v ≅_{p[ℓ]} v′ then ℓ·v ≅_p ℓ·v′.
4. If p ⊑ ⟨Ai: vi⟩ and v1 ≅_{p.A1} v′1, …, vn ≅_{p.An} v′n then ⟨Ai: vi⟩ ≅_p ⟨Ai: v′i⟩.

We now state the key correctness property for slicing. Intuitively, it says that if we slice T with respect to output pattern p, obtaining a slice ρ and S, then p will be reproduced on recomputation under any change to the input and trace that is consistent with the slice; formally, this means that the changed trace T′ must match the sliced trace S, and the changed input γ′ must match γ modulo ρ.

Theorem 4.4 (Correctness of Slicing). Suppose γ, T ↷ v and p ⊑ v and p, T ↘ ρ, S. Then for all γ′ ≅_ρ γ and T′ ⊒ S such that γ′, T′ ↷ v′ we have v′ ≅_p v.

Returning to our running example, we can use the enriched pattern p′ = {[r2].⟨B:8; □⟩} ∪̇ □ to indicate interest in the B field of r2, without naming the other fields of the row or the other row indexes. Slicing with respect to this pattern yields the following slice:

T″ = ⋃{ _ | x ∈ R } ▸ { [r2].if(x.B = 3, _, _) ▸true { _ } }

and the slice of R is R″ = {[r2].⟨B:3, C:8; □⟩} ∪̇ □. This indicates that (as before) the values of r1 and r3 and of the A field of r2 are irrelevant to r2 in the result; unlike T′, however, we can also potentially replay if r1 and r3 have been deleted from R. Likewise, we can replay if the A field has been removed from a record, or if some other field such as D is added. This illustrates that enriched


Judgment p, T ↘↘ ρ, e:

□, T ↘↘ [], □

p1, T1 ↘↘ ρ1, e′1    true, T ↘↘ ρ, e′
⟹  p1, if(T, e1, e2) ▸true T1 ↘↘ ρ1 ⊔ ρ, if(e′, e′1, □)

p2, T2 ↘↘ ρ2, e′2    false, T ↘↘ ρ, e′
⟹  p2, if(T, e1, e2) ▸false T2 ↘↘ ρ2 ⊔ ρ, if(e′, □, e′2)

p, x.Θ ↘↘* ρ′, e′, p′    p′, T ↘↘ ρ, e″
⟹  p, ⋃{e | x ∈ T} ▸ Θ ↘↘ ρ ⊔ ρ′, ⋃{e′ | x ∈ e″}

Judgment p, x.Θ ↘↘* ρ, e, p′:

□, x.Θ ↘↘* [], □, □

∅, x.∅ ↘↘* [], □, ∅

◇, x.∅ ↘↘* [], □, ∅

p[ℓ], T ↘↘ ρ[x ↦ p′], e  ⟹  p, x.{ℓ.T} ↘↘* ρ, e, {ℓ.p′}

p|dom(Θ1), x.Θ1 ↘↘* ρ1, e1, p1    p|dom(Θ2), x.Θ2 ↘↘* ρ2, e2, p2
⟹  p, x.Θ1 ⊎ Θ2 ↘↘* ρ1 ⊔ ρ2, e1 ⊔ e2, p1 ⊎ p2

Figure 11. Unevaluation (selected rules).

patterns allow for smaller slices than simple patterns, with greater flexibility concerning possible updates to the input.

A natural question is whether slicing computes the (or a) smallest possible answer. Because our definition of correct slices is based on recomputation, minimal slices are not computable, by a straightforward reduction from the undecidability of minimizing dependency provenance [1, 12].

5. Query and Differential Slicing

We now adapt slicing techniques to provide explanations in terms of query expressions, and show how to use differences between slices to provide precise explanations for parts of the output.

5.1 Query slicing

Our previous work [28] gave an algorithm for extracting a program slice from a trace. We now adapt this idea to queries. A trace slice shows the parts of the trace that need to be replayed in order to compute the desired part of the output; similarly, a query slice shows the part of the query expression that is needed in order to ensure that the desired part of the output is recomputed. As with traces, we allow holes □ in programs to allow deleting subexpressions, and define ⊑ as a syntactic precongruence such that □ ⊑ e. We also define a least upper bound operation e ⊔ e′ on partial query expressions in the obvious way, so that □ ⊔ e = e.

We define a judgment p, T ↘↘ ρ, e that traverses T and "unevaluates" p, yielding a partial input environment ρ and partial program e. The rules are illustrated in Figure 11. Many of the rules are similar to those for trace slicing; the main differences arise in the cases for conditionals and comprehensions, where we collapse the sliced expressions back into expressions, possibly inserting holes or merging sliced expressions obtained by running the same code in different ways (as in a comprehension that contains a conditional).
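The merging step relies on the least upper bound on partial expressions, with □ ⊔ e = e. The following Python sketch (our own encoding: expressions as nested tuples, `None` as a hole) shows how two slices of the same conditional, obtained from iterations that took different branches, merge into one expression slice.

```python
def expr_lub(e1, e2):
    """Least upper bound of partial expressions: a hole yields to the other
    side; structurally equal expressions merge pointwise."""
    if e1 is None:
        return e2
    if e2 is None:
        return e1
    if isinstance(e1, tuple) and isinstance(e2, tuple) and len(e1) == len(e2):
        return tuple(expr_lub(a, b) for a, b in zip(e1, e2))
    if e1 == e2:
        return e1
    raise ValueError("incompatible slices")

# One iteration needed the true branch, another the false branch;
# each slice leaves the untaken branch as a hole.
run_true  = ("if", ("=", "x.B", 3), ("singleton", "..."), None)
run_false = ("if", ("=", "x.B", 3), None, ("empty",))
merged = expr_lub(run_true, run_false)
```

The merged slice retains both branches, exactly as when a comprehension's trace set contains iterations through both sides of a conditional.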

Again, it is straightforward to show that if γ, e ⇓ v, T and p ⊑ v then there exist ρ, e′ such that p, T ↘↘ ρ, e′, where ρ ⊑ γ and e′ ⊑ e. The essential correctness property for query slices is similar to that for trace slices: again, we require that rerunning any sufficiently similar query on a sufficiently similar input produces a result that matches p.

Theorem 5.1 (Correctness of Query Slicing). Suppose γ, T ↷ v and p ⊑ v and p, T ↘↘ ρ, e. Then for all γ′ ≅_ρ γ and e′ ⊒ e such that γ′, e′ ⇓ v′ we have v′ ≅_p v.

Combining with the consistency property (Proposition 2.5), we have:

Corollary 5.2. Suppose γ, e ⇓ v, T and p ⊑ v and p, T ↘↘ ρ, e′. Then for all γ′ ≅_ρ γ and e″ ⊒ e′ such that γ′, e″ ⇓ v′, we have v ≅_p v′.

Continuing our running example, the query slice for the pattern p′ considered above is

Q′ = ⋃{ if(x.B = 3, {⟨A:□, B:x.C⟩}, {}) | x ∈ R }

since only the computation of the A field in the output is irrelevant to p′.

5.2 Differential slicing

We consider a pattern difference to be a pair of patterns (p, p′) where p ⊑ p′. Intuitively, a pattern difference selects the part of a value present in the outer component p′ and not in the inner component p. For example, the pattern difference (⟨B:□; □⟩, ⟨B:8; □⟩) selects the value 8 located in the B component of a record. We can also write this difference as ⟨B:⌊8⌋; □⟩, using ⌊·⌋ to highlight the boundary between the inner and outer pattern. Trace and query pattern differences are defined analogously.

It is straightforward to show by induction that slicing is monotonic in both arguments:

Lemma 5.3 (Monotonicity). If p ⊑ p′ and T ⊑ T′ and p′, T′ ↘ ρ′, S′ then there exist ρ, S such that p, T ↘ ρ, S and ρ ⊑ ρ′ and S ⊑ S′. In addition, p, S′ ↘ ρ, S.

This implies that given a pattern difference and a trace, we can compute a trace difference using the following rule:

    p₂, T ↘ ρ₂, S₂    p₁, S₂ ↘ ρ₁, S₁
    ─────────────────────────────────
    (p₁, p₂), T ↘ (ρ₁, ρ₂), (S₁, S₂)

It follows from monotonicity that ρ₁ ⊑ ρ₂ and S₁ ⊑ S₂; thus, pattern differences yield trace differences. Furthermore, the second part of monotonicity implies that we can compute the smaller slice S₁ from the larger slice S₂, rather than re-traverse T. It is also possible to define a simultaneous differential slicing judgment, as an optimization to ensure we only traverse the trace once.

Query slicing is also monotone, so differential query slices can be obtained in exactly the same way. Revisiting our running example one last time, consider the differential pattern {[r₂].⟨B:⌈8⌉;□⟩} ⊎̇ □. The differential query slice for the pattern p′ considered above is

    Q″ = ⋃{if(x.B = 3, {⟨A:□, B:⌈x.C⌉⟩}, {}) | x ∈ R}

6. Examples and Discussion

In this section we present some more complex examples to illustrate key points.

Renaming. Recall the swapping query from the introduction, written in NRC as

    Q₁ = ⋃{{if(x.A > x.B, ⟨A:x.B, B:x.A⟩, x)} | x ∈ R}

This query illustrates a key difference between our approach and the how-provenance model of Green et al. [21]. As discussed in [14], renaming operations are ignored by how-provenance, so the how-provenance annotations of the results of Q₁ are the same as for a query that simply returns R. In other words, the choice to swap the fields when A > B is not reflected in the how-provenance,


which shows that it is impossible to extract where-provenance (or traces) from how-provenance. Extracting where-provenance from traces appears straightforward, extending our previous work [1].

This example also illustrates how traces and slices can be used for partial recomputation. The slice for output pattern {[1, r₁].⟨B:□⟩} ⊎̇ □, for example, will show that this record was produced because the A component of ⟨A:1, B:2, C:7⟩ at index [r₁] in the input was less than or equal to the B component. Thus, we can replay after any change that preserves this ordering information.

Union. Consider query

    Q₂ = ⋃{{⟨B:x.B⟩} | x ∈ R} ∪ {⟨B:3⟩}

that projects the B fields of elements of R and adds another copy of ⟨B:3⟩ to the result. This yields

    Q₂(R) = {[1, r₁].⟨B:2⟩, [1, r₂].⟨B:3⟩, [1, r₃].⟨B:3⟩, [2].⟨B:3⟩}

This illustrates that the indexes may not all have the same length, but still form a prefix code. If we slice with respect to {[1, r₂].⟨B:3⟩} ⊎̇ □ then the query slice is:

    Q₂′ = ⋃{{⟨B:x.B⟩} | x ∈ R} ∪ □

and R′ = {[r₂].⟨B:3;□⟩} ⊎̇ □, whereas if we slice with respect to {[2].⟨B:3⟩} ⊎̇ □ then the query slice is Q₂″ = □ ∪ {⟨B:3⟩} and R″ = □, indicating that this part of the result has no dependence on the input.

A related point: one may wonder whether it makes sense to select a particular copy of ⟨B:3⟩ in the output, since in a conventional multiset, multiple copies of the same value are indistinguishable. We believe it is important to be able to distinguish different copies of a value, which may have different explanations. Treating n copies of a value as a single value with multiplicity n would obscure this distinction and force us to compute the slices of all of the copies even if only a single explanation is required. This is why we have chosen to work with indexed sets, rather than pure multisets.
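The prefix-code property of indexes mentioned above is easy to state executably. The following Haskell fragment is our own illustrative helper (not taken from the paper's code): it checks that no index is a proper prefix of another, using the Q₂(R) indexes as data.

```haskell
import Data.List (isPrefixOf)

-- Indexes are sequences of label components; a set of indexes forms a
-- prefix code when no index is a proper prefix of another, so every
-- element of the result has an unambiguous address.
type Index = [String]

isPrefixCode :: [Index] -> Bool
isPrefixCode ixs = and [ not (i `isPrefixOf` j) | i <- ixs, j <- ixs, i /= j ]

-- the indexes of Q2(R) above: lengths differ, yet no index is a
-- prefix of another
q2Indexes :: [Index]
q2Indexes = [["1","r1"], ["1","r2"], ["1","r3"], ["2"]]
```

Here `isPrefixCode q2Indexes` holds, while a set such as `[["1"], ["1","r1"]]` would be rejected.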

Joins. So far all examples have involved a single table R. Consider a simple join query

    Q₃ = {⟨A:x.A, B:y.C⟩ | x ∈ R, y ∈ S, x.B = y.B}

and consider the following table S, and the result Q₃(R, S).

    S =
      id    B  C
      [s₁]  2  4
      [s₂]  3  4
      [s₃]  4  5

    Q₃(R,S) =
      id        A  B
      [r₁, s₁]  1  4
      [r₂, s₂]  2  4
      [r₃, s₂]  4  5

The full trace of this query execution is as follows:

    T₃ = ⋃{… | x ∈ R} ▹ {
           [r₁]. ⋃{… | y ∈ S} ▹ {
                   [s₁]. if(x.B = y.B, …, …) ▹_true {…},
                   [s₂]. if(x.B = y.B, …, …) ▹_false {},
                   [s₃]. if(x.B = y.B, …, …) ▹_false {}},
           [r₂]. ⋃{… | y ∈ S} ▹ {
                   [s₁]. if(x.B = y.B, …, …) ▹_false {},
                   [s₂]. if(x.B = y.B, …, …) ▹_true {…},
                   [s₃]. if(x.B = y.B, …, …) ▹_false {}},
           [r₃]. ⋃{… | y ∈ S} ▹ {
                   [s₁]. if(x.B = y.B, …, …) ▹_false {},
                   [s₂]. if(x.B = y.B, …, …) ▹_true {…},
                   [s₃]. if(x.B = y.B, …, …) ▹_false {}}}

Slicing with respect to {[r₁, s₁].⟨A:1;□⟩, [r₂, s₂].⟨B:4;□⟩} ⊎̇ □ yields trace slice

    T₃′ = ⋃{… | x ∈ R} ▹ {
            [r₁]. ⋃{… | y ∈ S} ▹ {[s₁]. if(x.B = y.B, …, …) ▹_true {…}},
            [r₂]. ⋃{… | y ∈ S} ▹ {[s₂]. if(x.B = y.B, …, …) ▹_true {…}}}

and input slice R₃′ = {[r₁].⟨A:1, B:2;□⟩, [r₂].⟨B:3;□⟩} and S₃′ = {[s₁].⟨B:2;□⟩, [s₂].⟨B:3; C:4⟩}.

Workflows. NRC expressions can be used to represent workflows, if primitive operations are added representing the workflow steps [2, 22]. To illustrate query slicing and differential slicing for a workflow-style query, consider the following more complex query:

    Q₄ = {f(x, y) | x ∈ T, y ∈ T, z ∈ U, p(x, y), q(x, y, z)}

where f computes some function of x and y and p and q are selection criteria. Here, we assume T and U are collections of data files and f, p, q are additional primitive operations on them. This query exercises most of the distinctive features of our approach; we can of course translate it to the NRC core calculus used in the rest of the paper. If dom(T) = {t₁, . . . , t₁₀} and dom(U) = {u₁, . . . , u₁₀} then we might obtain result {[t₃, t₄, u₅].v₁, [t₆, t₈, u₁₀].v₂}. If we focus on the value v₁ using the pattern {[t₃, t₄, u₅].v₁} ⊎̇ □, then the program slice we obtain is Q₄ itself, while the data slice might be T′ = {[t₃].✓, [t₄].✓} ⊎̇ □, U′ = {[u₅].✓} ⊎̇ □, indicating that if the values at t₃, t₄, u₅ are held fixed then the end result will still be v₁. The trace slice is similar, and shows that v₁ was computed by applying f with x bound to the value at t₃ in T, y bound to t₄, and z bound to u₅, and that p(x, y) and q(x, y, z) succeeded for these values.

If we consider a differential slice using pattern difference {[t₃, t₄, u₅].⌈v₁⌉} ⊎̇ □ then we obtain the following program difference:

    Q₄ = ⋃{⌈f(x, y)⌉ | x ∈ T, y ∈ T, z ∈ U, p(x, y), q(x, y, z)}

This shows that most of the query is needed to ensure that the result at [t₃, t₄, u₅] is produced, but the subterm f(x, y) is only needed to compute the value v₁. This can be viewed as a query-based explanation for this part of the result.

If we consider a differential slice using pattern difference{[t

3

, t

4

, u

5

]. v

1

} ˙[ 2 then we obtain the following program dif-ference:

Q

4

=

[{ f(x, y) | x 2 T, y 2 T, z 2 U, p(x, y), q(x, y, z)}

This shows that most of the query is needed to ensure that theresult at [t

3

, t

4

, u

5

] is produced, but the subterm f(x, y) is onlyneeded to compute the value v

1

. This can be viewed as a query-based explanation for this part of the result.

7. Implementation

To validate our design and experiment with larger examples, we extended our Haskell implementation Slicer of program slicing for functional programs [28] with the traces and slicing techniques presented in this paper. We call the resulting system NRCSlicer; it supports a free combination of NRC and general-purpose functional programming features. NRCSlicer interprets expressions in-memory without optimization. As reported previously for Slicer, we have experimented with several alternative tracing and slicing strategies, which use Haskell's lazy evaluation strategy in different ways. The alternatives we consider here are:

• eager: the trace is fully computed during evaluation.
• lazy: the value is computed eagerly, but the trace is computed lazily using Haskell's default lazy evaluation strategy.
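The difference between the two strategies can be illustrated with a toy traced evaluator; this is our own sketch, not NRCSlicer's internals. In Haskell the trace built below consists of thunks, so by default it is only materialized if something inspects it (the lazy strategy); forcing it during evaluation recovers the eager strategy.

```haskell
-- toy expressions and their traces
data TExpr = TLit Int | TAdd TExpr TExpr deriving Show
data Trace = TrLit Int | TrAdd Trace Trace deriving Show

-- traced evaluation: the value is demanded right away by the caller,
-- while the trace components remain unevaluated thunks ("lazy")
eval :: TExpr -> (Int, Trace)
eval (TLit n)   = (n, TrLit n)
eval (TAdd a b) =
  let (va, ta) = eval a
      (vb, tb) = eval b
  in (va + vb, TrAdd ta tb)

-- forcing the whole trace during evaluation gives the "eager" strategy
forceTrace :: Trace -> ()
forceTrace (TrLit _)   = ()
forceTrace (TrAdd a b) = forceTrace a `seq` forceTrace b

evalEager :: TExpr -> (Int, Trace)
evalEager e = let (v, t) = eval e in forceTrace t `seq` (v, t)
```

Both variants compute the same value; they differ only in when the trace is materialized in memory.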

To evaluate the effectiveness of enriched patterns, we measured the time needed for the eager and lazy techniques to trace and slice the workflow example Q₄ in the previous section. We considered an instantiation of the workflow where the data values are simply integers and with input tables T, U = {1, . . . , 50}, and defined the operations f(x, y) as x * y, p(x, y) as x < y, and q(x, y, z) as x² + y² = z². This is not a realistic workflow, and we expect that the time to evaluate the basic operations of a realistic workflow following this pattern would be much larger. However, the overheads of tracing and slicing do not depend on the execution time of primitive operations, so we can still draw some conclusions from this simplistic example.

The comprehension iterates over 50³ = 125,000 triples, producing 20 results. We considered simple and enriched patterns selecting a single element of the result. We measured evaluation time, and the overhead of tracing, trace slicing, and query slicing. The experiments were conducted on a MacBook Pro with 2GB RAM and a 2.8GHz Intel Core Duo, using GHC version 7.4.
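For reference, the measured instantiation of Q₄ can be written directly as a Haskell list comprehension; this is our own re-creation of the benchmark query, not the NRCSlicer encoding used in the experiments.

```haskell
-- f(x,y) = x*y, p(x,y) = x < y, q(x,y,z) = x^2 + y^2 = z^2,
-- with T = U = {1,...,50}; the comprehension ranges over 50^3 triples
q4 :: [Int]
q4 = [ x * y
     | x <- [1..50], y <- [1..50], z <- [1..50]
     , x < y
     , x*x + y*y == z*z ]
```

Evaluating `length q4` yields the 20 results mentioned above, one per Pythagorean triple with components at most 50 and x < y.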


                      eval  trace  slice  qslice
    eager-simple       0.5    1.5    2.5     1.6
    eager-enriched     0.5    1.5   <0.1    <0.1
    lazy-simple        0.5    0.7    1.3     1.7
    lazy-enriched      0.5    0.7   <0.1    <0.1

The times are in seconds. The “eval” column shows the time needed to compute the result without tracing. The “trace”, “slice”, and “qslice” columns show the added time needed to trace and compute slices. The full traces in each of these runs have over 2.1 million nodes; the simple pattern slices are almost as large, while the enriched pattern slices are only 95 nodes. For this example, slicing is over an order of magnitude faster using enriched patterns. The lazy tracing approach required less total time both for tracing and slicing (particularly for simple patterns). Thus, Haskell's built-in lazy evaluation strategy offers advantages by avoiding explicitly constructing the full trace in memory when it is not needed; however, there is still room for improvement. Again, however, for an actual workflow involving images or large data files, the evaluation time would be much larger, dwarfing the time for tracing or slicing.

Our implementation is a proof-of-concept that evaluates queries in-memory via interpretation, rather than compilation; further work would be needed to adapt our approach to support fine-grained provenance for conventional database systems. Nevertheless, our experimental results do suggest that the lazy tracing strategy and use of enriched patterns can effectively decrease the overhead of tracing, making it feasible for in-memory execution of workflows represented in NRC.

8. Related and future work

Program slicing has been studied extensively [18, 31, 32], as has the use of execution traces, for example in dynamic slicing. Our work contrasts with much of this work in that we regard the trace and underlying data as being of interest, not just the program. Some of our previous work [12] identified analogies between program slicing and provenance, but to our knowledge, there is no other prior work on slicing in databases.

Lineage and why-provenance were motivated semantically in terms of identifying witnesses, or parts of the input needed to ensure that a given part of the output is produced by a query. Early work on lineage in relational algebra [16] associates each output record with a witness. Buneman et al. studied a more general notion called why-provenance that maps an output part to a collection of witnesses [7, 8]. This idea was generalized further to the how-provenance or semiring model [19, 21], based on using algebraic expressions as annotations; this approach has been extended to handle some forms of negation and aggregation [4, 20]. Semiring homomorphisms commute with query evaluation; thus, homomorphic changes to the input can be performed directly on the output without re-running the query. However, this approach only applies to changes describable as semiring homomorphisms, such as deletion.

Where-provenance was also introduced by Buneman et al. [7, 8]. Although the idea of tracking where input data was copied from is natural, it is nontrivial to characterize semantically, because where-provenance does not always respect semantic equivalence. In later work, Buneman et al. [6] studied where-provenance for the pure NRC and characterized its expressiveness for queries and updates. It would be interesting to see whether their notion of expressive completeness for where-provenance could be extended to richer provenance models, such as traces, possibly leading to an implementation strategy via translation to plain NRC.

Provenance has been studied extensively for scientific workflow systems [5, 30], but there has been little formal work on the semantics of workflow provenance. The closest work to ours is that of Hidders et al. [22], who model workflows by extending the NRC with nondeterministic, external function calls. They sketch an operational semantics that records runs that contain essentially all of the information in a derivation tree, represented as a set of triples. They also suggest ways of extracting subruns from runs, but their treatment is partial and lacks strong formal guarantees analogous to our results.

There have been some attempts to reconcile the database and workflow views of provenance; Hidders et al. [22] argued for the use of Nested Relational Calculus (NRC) as a unifying formalism for both workflow and database operations, and subsequently Kwasnikowska and Van den Bussche [23] showed how to map this model to the Open Provenance Model. Acar et al. [2] later formalized a graph model of provenance for NRC. The most advanced work in this direction appears to be that of Amsterdamer et al. [3], who combined workflow and database styles of provenance in the context of the PigLatin system (a MapReduce variant based on nested relational queries). Lipstick allows analyzing the impact of restricted hypothetical changes (such as deletion) on parts of the output, but to our knowledge no previous work provides a formal guarantee about the impact of changes other than deletion.

In our previous work [12], we introduced dependency provenance, which conservatively over-approximates the changes that can take place in the output if the input is changed. We developed definitions and techniques for dependency provenance in full NRC including nonmonotone operations (empty, sum) and primitive functions. Dependency provenance cannot predict exactly how the output will be affected by a general modification to the source, but it can guarantee that some parts of the output will not change if certain parts of the input are fixed. Our notion of equivalence modulo a pattern is a generalization of the equal-except-at relation used in that work. Motivated by dependency provenance, an earlier technical report [10] presented a model of traced evaluation for NRC and proved elementary properties such as fidelity. However, it did not investigate slicing techniques, and used nondeterministic label generation instead of our deterministic scheme; our deterministic approach greatly simplifies several aspects of the system, particularly for slicing.

There are several intriguing directions for future work, including developing more efficient techniques for traced evaluation and slicing that build upon existing database query optimization capabilities. It appears possible to translate multiset queries so as to make the labels explicit, since a fixed given query increases the label depth by at most a constant. Thus, it may be possible to evaluate queries with label information but without tracing first, then gradually build the trace by slicing backwards through the query, re-evaluating subexpressions as necessary. Other interesting directions include the use of slicing techniques for security, to hide confidential input information while disclosing enough about the trace to permit recomputation, and the possibility of extracting other forms of provenance from traces, as explored in the context of functional programs in prior work [1].

9. Conclusion

The importance of provenance for transparency and reproducibility is widely recognized, yet there has been little explicit discussion of correctness properties formalizing intuitions about how provenance is to provide reproducibility. In self-explaining computation, traces are considered to be explanations of a computation in the sense that the trace can be used to recompute (parts of) the output under hypothetical changes to the input. This paper develops the foundations of self-explaining computation for database queries, by defining a tracing semantics for NRC, proposing a formal definition of correctness for tracing (fidelity) and slicing, and defining a correct (though potentially overapproximate) algorithm for trace


slicing. Trace slicing can be used to obtain smaller “golden trail” traces that explain only a part of the input or output, and explore the impact of changes in hypothetical scenarios similar to the original run. At a technical level, the main contributions are the careful use of prefix codes to label multiset elements, and the development of enriched patterns that allow more precise slices. Our design is validated by a proof-of-concept implementation that shows that laziness and enriched patterns can significantly improve performance for small (in-memory) examples.

In the near term, we plan to combine our work on self-explaining functional programs [28] and database queries (this paper) to obtain slicing and provenance models for programming languages with query primitives, such as F# [15] or Links [24]. Ultimately, our aim is to extend self-explaining computation to programs that combine several execution models, including workflows, databases, conventional programming languages, Web interaction, or cloud computing.

Acknowledgments. We are grateful to Peter Buneman, Jan Van den Bussche, and Roly Perera for comments on this work and to the anonymous reviewers for detailed suggestions. Effort sponsored by the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant number FA8655-13-1-3006. The U.S. Government and University of Edinburgh are authorized to reproduce and distribute reprints for their purposes notwithstanding any copyright notation thereon. Cheney is supported by a Royal Society University Research Fellowship, by the EU FP7 DIACHRON project, and EPSRC grant EP/K020218/1. Acar is partially supported by an EU ERC grant (2012-StG 308246-DeepSea) and an NSF grant (CCF-1320563).

References

[1] U. A. Acar, A. Ahmed, J. Cheney, and R. Perera. A core calculus for provenance. Journal of Computer Security, 21:919–969, 2013. Full version of a POST 2012 paper.
[2] U. A. Acar, P. Buneman, J. Cheney, N. Kwasnikowska, J. Van den Bussche, and S. Vansummeren. A graph model of data and workflow provenance. In TAPP, 2010.
[3] Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting lipstick on pig: Enabling database-style workflow provenance. PVLDB, 5(4):346–357, 2011.
[4] Y. Amsterdamer, D. Deutch, and V. Tannen. Provenance for aggregate queries. In PODS, pages 153–164. ACM, 2011.
[5] R. Bose and J. Frew. Lineage retrieval for scientific data processing: a survey. ACM Comput. Surv., 37(1):1–28, 2005.
[6] P. Buneman, J. Cheney, and S. Vansummeren. On the expressiveness of implicit provenance in query and update languages. ACM Transactions on Database Systems, 33(4):28, November 2008.
[7] P. Buneman, S. Khanna, and W. Tan. Why and where: A characterization of data provenance. In ICDT, number 1973 in LNCS, pages 316–330. Springer, 2001.
[8] P. Buneman, S. Khanna, and W. Tan. On propagation of deletions and annotations through views. In PODS, pages 150–158, 2002.
[9] P. Buneman, S. A. Naqvi, V. Tannen, and L. Wong. Principles of programming with complex objects and collection types. Theor. Comp. Sci., 149(1):3–48, 1995.
[10] J. Cheney, U. A. Acar, and A. Ahmed. Provenance traces. CoRR, arXiv.org/abs/0812.0564, 2008.
[11] J. Cheney, U. A. Acar, and R. Perera. Toward a theory of self-explaining computation. In In search of elegance in the theory and practice of computation: a Festschrift in honour of Peter Buneman, number 8000 in LNCS, pages 193–216. Springer, 2013.
[12] J. Cheney, A. Ahmed, and U. A. Acar. Provenance as dependency analysis. Mathematical Structures in Computer Science, 21(6):1301–1337, 2011.
[13] J. Cheney, A. Ahmed, and U. A. Acar. Database queries that explain their work (extended version). Technical report, arXiv.org, 2014. http://arxiv.org/abs/1408.1675.
[14] J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.
[15] J. Cheney, S. Lindley, and P. Wadler. A practical theory of language-integrated query. In ICFP, pages 403–416, New York, NY, USA, 2013. ACM.
[16] Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst., 25(2):179–227, 2000.
[17] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[18] J. Field and F. Tip. Dynamic dependence in term rewriting systems and its application to program slicing. Information and Software Technology, 40(11–12):609–636, 1998.
[19] J. N. Foster, T. J. Green, and V. Tannen. Annotated XML: queries and provenance. In PODS, pages 271–280, 2008.
[20] F. Geerts and A. Poggi. On database query languages for K-relations. J. Applied Logic, 8(2):173–185, 2010.
[21] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
[22] J. Hidders, N. Kwasnikowska, J. Sroka, J. Tyszkiewicz, and J. Van den Bussche. A formal model of dataflow repositories. In DILS, 2007.
[23] N. Kwasnikowska and J. Van den Bussche. Mapping the NRC dataflow model to the open provenance model. In IPAW, pages 3–16, 2008.
[24] S. Lindley and J. Cheney. Row-based effect types for database integration. In TLDI, pages 91–102. ACM Press, 2012.
[25] P. Missier, B. Ludascher, S. Dey, M. Wang, T. McPhillips, S. Bowers, M. Agun, and I. Altintas. Golden trail: Retrieving the data history that matters from a comprehensive provenance repository. International Journal of Digital Curation, 7(1):139–150, 2011.
[26] L. Moreau. The foundations for provenance on the web. Foundations and Trends in Web Science, 2(2–3), 2010.
[27] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099–1110. ACM, 2008.
[28] R. Perera, U. A. Acar, J. Cheney, and P. B. Levy. Functional programs that explain their work. In ICFP, pages 365–376. ACM, 2012.
[29] A. Sabelfeld and A. Myers. Language-based information-flow security. IEEE Journal on Selected Areas in Communications, 21(1):5–19, 2003.
[30] Y. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3):31–36, 2005.
[31] F. Tip. A survey of program slicing techniques. J. Prog. Lang., 3(3), 1995.
[32] M. Weiser. Program slicing. In ICSE, pages 439–449. IEEE Press, 1981.