PANDA: A System for Provenance and Data
Jennifer Widom — Stanford University
Example: Sales Prediction Workflow
[Workflow diagram. Input data sets: CustList1, CustList2, …, CustListn-1, CustListn, ItemSales, CatalogItems, BuyingPatterns. Processing nodes: Dedup, Split, Europe, USA, Union, Predict, ItemAgg.]
Example: Sales Prediction Workflow
[Workflow diagram: the sales prediction workflow, with question marks on the output record below and "Backward Tracing" arrows leading from it back through the workflow.]
Backward Tracing

| Item | Demand |
| --- | --- |
| Cowboy Hat | high |

| Name | Item | Prob |
| --- | --- | --- |
| Amelie | Cowboy Hat | .98 |
| Pierre | Cowboy Hat | .98 |
| Isabelle | Cowboy Hat | .98 |

| Name | Address |
| --- | --- |
| Amelie | … Paris, Texas |
| Pierre | … Paris, Texas |
| Isabelle | … Paris, Texas |
Example: Sales Prediction Workflow
[Workflow diagram: the sales prediction workflow with the USA node highlighted; "Backward Tracing" continues from the records at USA back to CustListn.]
Backward Tracing

Records at the USA node:

| Name | Address |
| --- | --- |
| Amelie | … Paris, Texas |
| Pierre | … Paris, Texas |
| Isabelle | … Paris, Texas |

Records in CustListn:

| Name | Address |
| --- | --- |
| Amelie | 65, quai d'Orsay, Paris |
| Pierre | 39, rue de Bretagne, Paris |
| Isabelle | 20, rue d'Orsel, Paris |
Example: Sales Prediction Workflow
[Workflow diagram: the sales prediction workflow with CustListn, Split, and Europe highlighted, labeled "Backward Tracing" and "Forward Propagation".]

Records in CustListn:

| Name | Address |
| --- | --- |
| Amelie | 65, quai d'Orsay, Paris |
| Pierre | 39, rue de Bretagne, Paris |
| Isabelle | 20, rue d'Orsel, Paris |

Records at the Europe node:

| Name | Address |
| --- | --- |
| Amelie | … Paris, France |
| Pierre | … Paris, France |
| Isabelle | … Paris, France |

Output after forward propagation:

| Item | Demand |
| --- | --- |
| Beret | high |
Provenance
Where data came from
How it was derived, manipulated,
combined, processed, …
How it has evolved over time
Uses for Provenance
Explanation: Sources and evolution of data; deeper understanding
Debugging and Verification: Buggy or stale source data? Buggy processing? Error propagation paths. Auditing.
Recomputation: Propagate changes to affected “downstream” data
Some Application Domains
Sales prediction workflows
Scientific-data workflows
Including human-curated data
Including evolving versions of data
Any analytic pipeline
“Extract-transform-load” (ETL) processes
Information-extraction pipelines
Third Time’s a Charm
Isn’t provenance the same thing as lineage? Pretty much.
Haven’t you worked on it before? Yes.
1. Data Warehousing project (long ago)
Lineage of relational views in warehouse: formal foundations, system/caching issues
Lineage in ETL pipelines: foundations & algorithms
2. “Trio” project (recently)
Data + Uncertainty + Lineage
Lineage primarily in support of uncertainty
Panda’s Ambitions
Previous provenance work tends to be…
Either data-based or process-based
Either fine-grained or coarse-grained
Focused on modeling and capturing provenance
Geared to specific functions or domains
Panda will…
Capture both: “data-oriented workflows”
Cover the spectrum in a unified fashion
Also support provenance operators and queries
End with a general-purpose open-source system
Remainder of Talk
Fundamentals
  Data-oriented workflows
  Provenance model
Capturing provenance
Exploiting provenance
  Backward tracing & forward tracing
  Forward propagation & refresh
  Ad-hoc queries
Concrete progress and results
What's next
Data-Oriented Workflows
Graph of processing nodes; data sets on edges
Assume (for now):
Statically-defined; batch execution; acyclic
Don't assume (for now):
Specific types of data sets or processing nodes
Processing Nodes
Goal
Exploit knowledge and properties when present
Provide fallback when processing is opaque
Sample properties
Known relational operator or query
Monotonic
One-many or many-one
Map function or Reduce function
Processing Nodes
General principle
Stronger properties ⇒ finer-grained input-output data relationships; more useful and efficient provenance
Processing Nodes: Example
[Workflow diagram: the sales prediction workflow, with the processing nodes Dedup, Split, Europe, USA, Union, Predict, and ItemAgg annotated by the property labels below.]
Known relational operators
Many-one, nonmonotonic
One-one, monotonic
Opaque
Provenance Model
Ultimate goals:
Support provenance at spectrum of granularities
Mesh data-oriented and process-oriented provenance
Composability/transitivity
Understandability
For now, simple underlying model:
Mappings between input and output data elements
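As a rough illustration of the simple underlying model, the sketch below stores provenance for each processing node as a set of (input element, output element) pairs and composes two such mappings transitively. The element IDs and the helper names `backward`, `forward`, and `compose` are hypothetical and are not taken from Panda itself.

```python
from typing import Hashable, Iterable, Set, Tuple

# One provenance mapping per processing node: (input element, output element).
Mapping = Set[Tuple[Hashable, Hashable]]

def backward(mapping: Mapping, outputs: Iterable[Hashable]) -> set:
    """Input elements that the selected output elements map back to."""
    wanted = set(outputs)
    return {i for (i, o) in mapping if o in wanted}

def forward(mapping: Mapping, inputs: Iterable[Hashable]) -> set:
    """Output elements that the selected input elements contributed to."""
    wanted = set(inputs)
    return {o for (i, o) in mapping if i in wanted}

def compose(first: Mapping, second: Mapping) -> Mapping:
    """Transitive mapping across two consecutive processing nodes."""
    return {(i, o) for (i, mid) in first for (mid2, o) in second if mid == mid2}

# Hypothetical element IDs: customers c*, deduplicated records d*, predictions p*.
dedup = {("c1", "d1"), ("c2", "d1"), ("c3", "d2")}
predict = {("d1", "p1"), ("d2", "p2")}
end_to_end = compose(dedup, predict)
```

Here `backward(end_to_end, ["p1"])` traces a prediction to its source customers without touching the intermediate data set, which is one way to read the composability/transitivity goal.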
Provenance Capture
Processing nodes provide provenance
information along with output
Eager — generated at data-processing time
versus
Lazy — “tracing procedure”
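The eager/lazy distinction can be sketched as follows. This is an illustration only: `dedup_eager` records provenance while it processes, whereas `trace_to_caps_lazy` is a tracing procedure run on demand for a one-one node. Both functions and the sample data are hypothetical.

```python
def dedup_eager(records):
    """Dedup by lower-cased name, recording provenance at processing time."""
    groups = {}
    for idx, name in enumerate(records):
        groups.setdefault(name.lower(), []).append(idx)
    # Each output element carries the input positions it was merged from.
    return [(key, idxs) for key, idxs in groups.items()]

def to_caps(name):
    """A one-one processing node."""
    return name.upper()

def trace_to_caps_lazy(output, inputs):
    """Lazy tracing procedure: find the contributing inputs only when asked."""
    return [idx for idx, name in enumerate(inputs) if to_caps(name) == output]

customers = ["Amelie", "amelie", "Pierre"]
eager_out = dedup_eager(customers)
lazy_prov = trace_to_caps_lazy("PIERRE", customers)
```

Eager capture costs extra space on every run; the lazy procedure stores nothing but must rescan the input for each trace.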
Provenance Capture: Example
[Workflow diagram: the sales prediction workflow, with input data sets CustList1 … CustListn, ItemSales, CatalogItems, and BuyingPatterns, and processing nodes Dedup, Split, Europe, USA, Union, Predict, and ItemAgg.]
Relational operators: automatic, previous work, eager or lazy
Dedup: eager easy, lazy hard
One-ones: eager or lazy easy
Predict: it depends
Worst case: no access to fine-grained provenance
Provenance Operations: Basic
[Workflow diagram: the sales prediction workflow, with input data sets CustList1 … CustListn, ItemSales, CatalogItems, and BuyingPatterns, and processing nodes Dedup, Split, Europe, USA, Union, Predict, and ItemAgg.]
Backward tracing: Where did the Cowboy Hat record come from?
Forward tracing: Which sales predictions did Amelie contribute to?
Additional Functionality
[Workflow diagram: the sales prediction workflow, as on the earlier slides.]
Forward propagation: Update all affected predictions after customers move from Texas to France
Additional Functionality
[Workflow diagram: the sales prediction workflow, as on the earlier slides.]
Refresh: Get latest prediction for Cowboy Hat sales (only) based on modified buying patterns
Refresh ≈ Backward tracing + Forward propagation
Provenance Queries
[Workflow diagram: the sales prediction workflow, as on the earlier slides.]
How many people from each country contributed to the Cowboy Hat prediction?
Which customer list contributed the most to the top 100 predicted items?
Provenance Queries
[Workflow diagram: the sales prediction workflow, as on the earlier slides.]
For a specific customer list, which items have higher demand than for the entire customer set?
Which customers have more duplication: those processed by USA or by Europe?
Provenance Queries
Query language goals
Declarative ad-hoc queries à la database systems
Seamlessly combine provenance and data
Amenable to optimization
Concrete Progress and Results
1. Provenance predicates
Motivated by making refresh problem concrete
Drove initial Panda prototype
2. Attribute mappings
3. Generalized map and reduce workflows
Provenance capture, backward tracing, forward tracing, forward propagation, refresh
Ad-hoc queries, optimizations
Provenance Predicates
[Diagram: a processing node takes input set I and produces output set O containing element o.]
Provenance of output $o$ is $\sigma_p(I)$
Provenance Predicates
[Diagram: a processing node takes input set I and produces the output below.]
Output with predicates: $\{[o_1, p_1], [o_2, p_2], \dots, [o_m, p_m]\}$
Provenance of output $o_i$ is $\sigma_{p_i}(I)$
Worst case: $p_i = \text{TRUE}$
Think: formalism to be instantiated
• Predicates can have compact representations
• Predicates can sometimes be generated automatically
Natural recursive definition
Extends to multiple inputs/outputs
Captures most existing provenance definitions
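A minimal sketch of the predicate formalism, assuming an ItemAgg-like node that counts predicted buyers per item. The function `item_agg_with_predicates`, the `buyers` attribute, and the sample rows are hypothetical; only the idea that each output $o_i$ carries a predicate $p_i$ with provenance $\sigma_{p_i}(I)$ comes from the slides.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Predicate = Callable[[Row], bool]

def select(pred: Predicate, rows: List[Row]) -> List[Row]:
    """Relational selection sigma_p(I)."""
    return [r for r in rows if pred(r)]

def item_agg_with_predicates(rows: List[Row]) -> List[Tuple[Row, Predicate]]:
    """Hypothetical node: count buyers per item, attach a provenance predicate."""
    counts: Dict[object, int] = {}
    for r in rows:
        counts[r["item"]] = counts.get(r["item"], 0) + 1
    output = []
    for item, n in counts.items():
        # Bind `item` as a default argument so each predicate keeps its own value.
        output.append(({"item": item, "buyers": n},
                       lambda r, item=item: r["item"] == item))
    return output

predictions = [
    {"name": "Amelie", "item": "Cowboy Hat"},
    {"name": "Pierre", "item": "Cowboy Hat"},
    {"name": "Isabelle", "item": "Beret"},
]
output = item_agg_with_predicates(predictions)
hat_row, hat_pred = output[0]
hat_provenance = select(hat_pred, predictions)       # backward trace
worst_case = select(lambda r: True, predictions)     # p = TRUE: whole input
```

The predicate `item = 'Cowboy Hat'` is compact and was generated by the node itself, matching the two bullet points above; the TRUE predicate shows the fallback for an opaque node.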
Selective Refresh Problem
Exploit provenance to efficiently compute the
up-to-date value of selected output elements
after the input (or processing nodes) may have changed
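Selective Refresh Problem
Refreshing $o_i$ through one processing node $P$
[Diagram: P maps input I to output $\{[o_1, p_1], [o_2, p_2], \dots, [o_m, p_m]\}$; the input has since changed to $I_{new}$.]
1) Backward trace: $I^* = \sigma_{p_i}(I_{new})$
2) Forward propagate: $o_{new} = P(I^*)$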
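The two steps can be sketched in a few lines. The node `demand_level`, its thresholds, and the sample data are hypothetical stand-ins chosen so that the result moves from "high" to "medium"; only the backward-trace and forward-propagate steps come from the slide.

```python
def demand_level(rows):
    """Hypothetical ItemAgg-like node: label demand by number of predicted buyers."""
    counts = {}
    for _, item in rows:
        counts[item] = counts.get(item, 0) + 1

    def label(n):
        return "high" if n >= 3 else "medium" if n == 2 else "low"

    return [(item, label(n)) for item, n in counts.items()]

def refresh(node, predicate, new_input):
    """Selective refresh of one output element through a single node."""
    i_star = [row for row in new_input if predicate(row)]  # 1) backward trace
    return node(i_star)                                     # 2) forward propagate

old_input = [("Amelie", "Beret"), ("Pierre", "Beret"),
             ("Isabelle", "Beret"), ("Marie", "Cowboy Hat")]
new_input = [("Amelie", "Beret"), ("Pierre", "Beret"),
             ("Isabelle", "Cowboy Hat"), ("Marie", "Cowboy Hat")]

def is_beret(row):
    """Provenance predicate item = 'Beret'."""
    return row[1] == "Beret"

before = refresh(demand_level, is_beret, old_input)
after = refresh(demand_level, is_beret, new_input)
```

Only the rows selected by the predicate are reprocessed, so the cost of a refresh depends on the size of $I^*$ rather than on the full input.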
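Selective Refresh Problem
Refreshing $o_i$ through one processing node $P$
[Diagram: P = ItemAgg, with input I and output $\{[o_1, p_1], [o_2, p_2], \dots, [o_m, p_m]\}$; the input changes to $I_{new}$.]
Before refresh: [ (Beret, high), item=‘Beret’ ]
Backward trace: $\sigma_{item=\text{‘Beret’}}(I_{new})$
After refresh: [ (Beret, medium), item=‘Beret’ ]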
Selective Refresh Problem
Refreshing $o_i$ through one processing node $P$
1) $I^* = \sigma_{p_i}(I_{new})$
2) $o_{new} = P(I^*)$
Does this always “work”? Does it make sense?
It depends on properties of processing nodes and their provenance.
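Selective Refresh Problem
Refreshing $o_i$ through entire workflow
[Diagram: a workflow with inputs $I_1$, $I_2$ and output $\{\dots [o_i, p_i] \dots\}$.]
1) Backward trace recursively
2) Forward propagate through workflow
Does this always “work”? Does it make sense? Is it efficient?
It depends on properties of processing nodes and their provenance, plus properties of the workflow.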
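Selective Refresh Example
[Workflow: (Person, City, SalesE) → Euros2$ → (Person, City, SalesD) → CitySum → (City, Total). Each derived record is shown with its provenance predicate.]

Original run

| Person | City | SalesE |
| --- | --- | --- |
| Amelie | Paris | 10 |
| Pierre | Paris | 10 |

| Person | City | SalesD | Predicate |
| --- | --- | --- | --- |
| Amelie | Paris | 13 | Person='Amelie' |
| Pierre | Paris | 13 | Person='Pierre' |

| City | Total | Predicate |
| --- | --- | --- |
| Paris | 26 | City='Paris' |

Input modified (Amelie's SalesE becomes 20)

| Person | City | SalesE |
| --- | --- | --- |
| Amelie | Paris | 20 |
| Pierre | Paris | 10 |

| Person | City | SalesD | Predicate |
| --- | --- | --- | --- |
| Amelie | Paris | 26 | Person='Amelie' |
| Pierre | Paris | 13 | Person='Pierre' |

| City | Total | Predicate |
| --- | --- | --- |
| Paris | 39 | City='Paris' |

Input extended (Marie added)

| Person | City | SalesE |
| --- | --- | --- |
| Amelie | Paris | 20 |
| Pierre | Paris | 10 |
| Marie | Paris | 30 |

| Person | City | SalesD | Predicate |
| --- | --- | --- | --- |
| Amelie | Paris | 26 | Person='Amelie' |
| Pierre | Paris | 13 | Person='Pierre' |
| Marie | Paris | 39 | Person='Marie' |

| City | Total | Predicate |
| --- | --- | --- |
| Paris | 39 → 78 | City='Paris' |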
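The example can be replayed in code. This is a sketch under stated assumptions: the exchange rate 13/10 is inferred from the slide's numbers, and `refresh_city_total` is a hypothetical helper that chains the stored predicates (City, then Person) back to the new input before recomputing.

```python
def euros_to_dollars(rows):
    """Euros2$ node; 13/10 is the rate implied by the slide's numbers."""
    return [(person, city, sales * 13 // 10) for (person, city, sales) in rows]

def city_sum(rows):
    """CitySum node: total sales per city."""
    totals = {}
    for _, city, sales in rows:
        totals[city] = totals.get(city, 0) + sales
    return sorted(totals.items())

def refresh_city_total(city, old_sales_e, new_sales_e):
    """Refresh one (City, Total) record by recursive backward tracing."""
    # Backward trace through CitySum: predicate City = city on the old SalesD.
    old_sales_d = euros_to_dollars(old_sales_e)
    persons = {p for (p, c, _) in old_sales_d if c == city}
    # Backward trace through Euros2$: predicates Person = p on the NEW input.
    i_star = [row for row in new_sales_e if row[0] in persons]
    # Forward propagate through the whole workflow.
    return city_sum(euros_to_dollars(i_star))

run1 = [("Amelie", "Paris", 10), ("Pierre", "Paris", 10)]
run2 = [("Amelie", "Paris", 20), ("Pierre", "Paris", 10)]
run3 = run2 + [("Marie", "Paris", 30)]

after_update = refresh_city_total("Paris", run1, run2)   # modification
after_insert = refresh_city_total("Paris", run2, run3)   # insertion
full_recompute = city_sum(euros_to_dollars(run3))
```

For the modification the refreshed total matches a full recomputation (39). For the insertion the chained predicates never select Marie, so the refresh still yields 39 while a full recomputation gives 78, which is presumably why the talk turns to required properties next.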
Required Properties
Panda System (version 0.1)
[Architecture diagram, shown in six builds. Only the highlighted operation changes between builds.]
Components: Graphical Interface, Command-line Client, Panda Layer, SQLite, File System
Stored objects:
Workflow Table (Panda)
Provenance Predicate Tables (Panda)
Forward Filter Tables (Panda)
Data Tables (user)
SQL Transformations (user)
Python Transformations (user)
Operations highlighted in successive builds: CreateTable, CreateSQLTransformation, CreatePythonTransformation, BackwardTrace, ForwardTrace, Refresh
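Attribute Mappings
[Diagram: processing node P with input I (A, …) and output O (B, …).]
Attribute mapping: $I.A \leftrightarrow O.B$
Provenance of output $o \in O$ is: $\sigma_{I.A = o.B}(I)$
More generally: $\forall x: \sigma_{B=x}(O) = P(\sigma_{A=x}(I))$
Example: ItemAgg with I (cust, item, prob) and O (item, sales): $I.item \leftrightarrow O.item$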
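A small sketch of tracing with an attribute mapping, assuming a hypothetical ItemAgg that counts customers with probability above 0.5 as expected sales. The aggregation rule and the sample rows are assumptions; the mapping on the `item` attribute follows the slide's example.

```python
def item_agg(rows):
    """Hypothetical ItemAgg: I(cust, item, prob) -> O(item, sales)."""
    sales = {}
    for _, item, prob in rows:
        sales[item] = sales.get(item, 0) + (1 if prob > 0.5 else 0)
    return sorted(sales.items())

def backward_trace(output_row, inputs):
    """sigma_{I.item = o.item}(I): provenance via the attribute mapping."""
    return [row for row in inputs if row[1] == output_row[0]]

def mapping_holds(node, inputs):
    """Check sigma_{B=x}(O) == P(sigma_{A=x}(I)) for every item value x."""
    outputs = node(inputs)
    values = {row[1] for row in inputs} | {o[0] for o in outputs}
    return all(
        [o for o in outputs if o[0] == x] == node([r for r in inputs if r[1] == x])
        for x in values
    )

I = [("Amelie", "Cowboy Hat", 0.98),
     ("Pierre", "Cowboy Hat", 0.98),
     ("Isabelle", "Beret", 0.40)]
O = item_agg(I)
hat_provenance = backward_trace(("Cowboy Hat", 2), I)
holds = mapping_holds(item_agg, I)
```

Because the mapping is just a pair of attribute names, it needs no per-record storage, and the trace is a plain selection on the input.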
Attribute Mappings
Can generate automatically in many cases (e.g., SQL)
Worst case: $\{\,\} \leftrightarrow \{\,\}$
Generalize to Datalog-like rules
I(__, item, __) :- O(item, __)
I(__, item, prob) :- O1(item, __), prob > .95
I(__, item, prob) :- O2(item, __), prob ≤ .95
Allow functions
$I.name \leftrightarrow ToCaps(O.name)$
Attribute Mappings
Rules for AMs: combining, splitting, transitivity
Soundness and completeness
“Strongest possible mapping”
Provenance Operations
Backward and forward tracing
Forward propagation and refresh
Key challenge: “broken chains”
Proofs of correctness and minimality
Generalized Map and Reduce Workflows
What if every transformation was a Map or Reduce function?
Very specific properties
Provenance easier to define, capture, and exploit
Automatic wrapping, doesn't interfere with parallelism
[Diagram: a workflow composed of Map (M) and Reduce (R) nodes.]
Map and Reduce Provenance
Map functions
$M(I) = \bigcup_{i \in I} M(\{i\})$
Provenance of $o \in O$ is $i \in I$ such that $o \in M(\{i\})$
Reduce functions
$R(I) = \bigcup_{1 \le k \le n} R(I_k)$, where $I_1, \dots, I_n$ partition $I$ on reduce-key
Provenance of $o \in O$ is $I_k \subseteq I$ such that $o \in R(I_k)$
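These definitions suggest wrapping map and reduce functions so that every output carries the IDs of the inputs it came from. The sketch below is a plain single-process illustration, not Hadoop code; the wrapper names are hypothetical and list positions stand in for element IDs.

```python
from collections import defaultdict

def map_with_provenance(map_fn, inputs):
    """Apply map_fn to one element at a time; tag outputs with that element's index."""
    tagged = []
    for idx, elem in enumerate(inputs):
        for out in map_fn(elem):
            tagged.append((out, [idx]))
    return tagged

def reduce_with_provenance(reduce_fn, inputs, key_fn):
    """Partition inputs on the reduce key; tag outputs with their whole group."""
    groups = defaultdict(list)
    for idx, elem in enumerate(inputs):
        groups[key_fn(elem)].append(idx)
    tagged = []
    for key, idxs in groups.items():
        for out in reduce_fn(key, [inputs[i] for i in idxs]):
            tagged.append((out, idxs))
    return tagged

# Hypothetical word-count workflow: one map followed by one reduce.
lines = ["a b", "b"]
mapped = map_with_provenance(lambda line: [(w, 1) for w in line.split()], lines)
pairs = [out for out, _ in mapped]
reduced = reduce_with_provenance(
    lambda key, group: [(key, sum(v for _, v in group))],
    pairs,
    key_fn=lambda kv: kv[0],
)
```

The wrappers never inspect what `map_fn` or `reduce_fn` compute, which is why this style of capture can be applied automatically.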
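Recursive MR Provenance
Intuitive recursive definition
Workflow $W$ with inputs $I_1, \dots, I_n$; output element $o$
$P_W(o) = (I_1^*, \dots, I_n^*)$, where $I_1^* \subseteq I_1, \dots, I_n^* \subseteq I_n$
Desirable property
$o \in W(I_1^*, \dots, I_n^*)$
Usually holds, but not always
[Diagram: a workflow composed of Map (M) and Reduce (R) nodes.]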
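Counterexample
[Workflow: Twitter Posts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating]

Twitter Posts and Inferred Movie Ratings:

| Tweet | Movie | Rating |
| --- | --- | --- |
| “Avatar was great” | Avatar | 8 |
| “I hated Twilight” | Twilight | 0 |
| “Twilight was pretty bad” | Twilight | 2 |
| “I enjoyed Avatar” | Avatar | 7 |
| “I loved Twilight” | Twilight | 9 |
| “Avatar was okay” | Avatar | 4 |

Rating Medians:

| Movie | Median |
| --- | --- |
| Avatar | 7 |
| Twilight | 2 |

#Movies Per Rating:

| Median | #Movies |
| --- | --- |
| 2 | 1 |
| 7 | 1 |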
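Counterexample
[Workflow: Twitter Posts → TweetScan (M, one-many function) → Inferred Movie Ratings → Summarize (R, nonmonotonic reduce) → Rating Medians → Count (R, nonmonotonic reduce) → #Movies Per Rating]

Twitter Posts and Inferred Movie Ratings, with one tweet now mentioning both movies:

| Tweet | Movie | Rating |
| --- | --- | --- |
| “Avatar was great” | Avatar | 8 |
| “I hated Twilight” | Twilight | 0 |
| “Twilight was pretty bad” | Twilight | 2 |
| “I enjoyed Avatar and Twilight too” | Avatar | 7 |
| “I enjoyed Avatar and Twilight too” | Twilight | 7 |
| “Avatar was okay” | Avatar | 4 |

Rating Medians:

| Movie | Median |
| --- | --- |
| Avatar | 7 |
| Twilight | 2 |

#Movies Per Rating:

| Median | #Movies |
| --- | --- |
| 2 | 1 |
| 7 | 1 |

Rerunning the workflow on only the provenance of (7, 1) produces (7, 2) instead: the traced tweets include “I enjoyed Avatar and Twilight too”, so Twilight's only remaining rating is 7 and its median becomes 7 as well.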
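The counterexample can be checked by replaying it. In this sketch `RATINGS` is a hard-coded stand-in for TweetScan (the real rating inference is not described on the slides), and the provenance of (7, 1) is obtained by following the recursive definition by hand: it reduces to the tweets that mention Avatar.

```python
from statistics import median

# Hypothetical stand-in for TweetScan: ratings copied from the slide.
RATINGS = {
    "Avatar was great": [("Avatar", 8)],
    "I hated Twilight": [("Twilight", 0)],
    "Twilight was pretty bad": [("Twilight", 2)],
    "I enjoyed Avatar and Twilight too": [("Avatar", 7), ("Twilight", 7)],
    "Avatar was okay": [("Avatar", 4)],
}

def tweet_scan(tweets):
    """Map (one-many): each tweet yields zero or more (movie, rating) pairs."""
    return [pair for tweet in tweets for pair in RATINGS[tweet]]

def summarize(ratings):
    """Reduce on movie: median rating per movie."""
    by_movie = {}
    for movie, rating in ratings:
        by_movie.setdefault(movie, []).append(rating)
    return [(movie, median(rs)) for movie, rs in by_movie.items()]

def count(medians):
    """Reduce on median: number of movies per median rating."""
    per_median = {}
    for _, med in medians:
        per_median[med] = per_median.get(med, 0) + 1
    return sorted(per_median.items())

def workflow(tweets):
    return count(summarize(tweet_scan(tweets)))

all_tweets = list(RATINGS)
full = workflow(all_tweets)

# Provenance of (7, 1): Count group 7 -> (Avatar, 7) -> Avatar's ratings
# -> the tweets that produced them.
provenance = [t for t in all_tweets
              if any(movie == "Avatar" for movie, _ in RATINGS[t])]
replayed = workflow(provenance)
```

The full run contains (7, 1), but the replay on its provenance contains (7, 2), so the desirable property $o \in W(I_1^*, \dots, I_n^*)$ fails for this workflow.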
RAMP System
Built on top of Hadoop, experiments on EC2
Proof of concept: very preliminary!
Wrap Map and Reduce functions (automatically) to capture provenance
Add (file, offset) as IDs to output sets
Backward-tracing bias
Alternative schemes, indexing
Overhead on example: 111% time, 45% space
Straightforward backward-tracing
Seconds response time on 1.2GB workflow
What’s Next
Unify what we have so far: Predicates, Attribute Mappings, GMRWs
Enhance system(s)
Extend provenance model
Fine-grained to coarse-grained
Data-based and process-based
Time/versioning
Extensions to capture, tracing, propagation
What’s Next
Ad-hoc queries
Language
Execution
Optimization
Query-driven provenance capture
What’s Next
Computation and storage optimizations
Eager vs. lazy provenance capture
• Space-time & query-update tradeoffs
• Processing-node dependent
Ex: Dedup, ItemAgg
Retain intermediate data sets?
Extreme case
• Workflow run once, never updated
• Provenance traced frequently
Compute transitive provenance eagerly, discard intermediate data
What’s Next
Provenance optimizations
Fine-grained vs. coarse-grained
Approximate provenance
PANDA: A System for Provenance and Data
“stanford panda”