Provenance in scientific workflows in e-Science
1
A Survey of Data Provenance in e-Science
Simmhan, Plale & Gannon [et al]
UC San Diego, CSE 294
February 5, 2010
Barry Demchak
Pompeii – AD 62
2
[et al]
[Simmhan2005] Y.L. Simmhan, B. Plale, D. Gannon. A Survey of Data Provenance in e-Science. SIGMOD Record, Vol 34, No. 3, Sept 2005.
[Buneman2000] P. Buneman, S. Khanna, W.C. Tan. Data Provenance: Some Basic Issues. Lecture Notes in Computer Science. Vol 1974, pp 87-, 2000.
[Tan2004] W.C. Tan. Research Problems in Data Provenance. IEEE Data Eng. Bull. 27(4):45-52, 2004.
[Tan2007] W.C. Tan. Provenance in Databases: Past, Current, and Future. IEEE Data Eng. Bull. 30(4):3-12, 2007.
[Buneman2007] P. Buneman, W.C. Tan. Provenance in Databases (Tutorial Outline). SIGMOD ’07, Beijing, China, 2007.
[Rajbhandari2004] S. Rajbhandari & D. Walker. Support for Provenance in Service-based Computing Grid. Cardiff University, 2004. http://www.wesc.ac.uk/resources/presentations/AHM04/194.pdf
[IBM2003] IBM Corporation. Assured Data Provenance. 2003. http://priorartdatabase.com/IPCOM/000010757/
[Komatsoulis2004] G. Komatsoulis. Toward a Functional Model of Data Provenance. cancer Biomedical Informatics Grid. 2004. https://cabig.nci.nih.gov/workspaces/Architecture/Meetings/Architecture_Workspace/f2f-meetings/ARCH-VCDE-F2F/Day 1 F2F Presentations 10_25/Data Provenance
[Plale2005] B. Plale & Y. Simmhan. Metadata, Ontologies, and Provenance: Towards Extended Forms of Data Management. Indiana University. Spring 2005. http://vw.indiana.edu/talks-spring05/plale.ppt
[WLIA1994] Wisconsin Land Information Association. WLIA Standard. August 1994. http://www.wlia.org/resources/standard4.pdf
3
Agenda
What is Data Provenance?
Why is it important?
What are its dimensions?
Examples of data provenance techniques
What tools exist?
What are the research frontiers?
4
Data Provenance
Information that helps determine the derivation history of a data product, starting from its original sources
Includes process of derivation [Buneman2000]
Materials and transformations [Lanter1991]
Workflows, annotations, notes [Greenwood 2003]
5
Data Provenance (Others) [Tan2007]
Workflow: interconnection of computation steps and human-machine interaction steps
Workflow Provenance: record of the entire history of the derivation of the final output of the workflow
[Komatsoulis2004]
6 W’s Plus: Who, What, When, Where, Why, How
Chain of Custody
7
[Near] Historical Context
Embedded in earliest GIS standards (as lineage) {suitability for use}
Materials engineering (as pedigree) {system failure and audit}
Life sciences (as transformation history) {attribution, suitability for use}
Intellectual property {for patents}
Databases {veracity, quality}
Lots of purposes and uses → taxonomy
8
Spatial Data Transfer Standard (SDTS)
Description of how to apply parameters in a calculation process (e.g., photo rectification)
Data quality: positional accuracy, attribute accuracy, logical consistency, completeness
Wisconsin Land Information Association (WLIA) Standard [WLIA1994]
http://www.fgdc.gov/standards/projects/FGDC-standards-projects/SDTS/sdts_pt5/srpe0299.pdf
9
Bioinformatics Data Flow
Curation (Lit): classification, annotation, error correction (by humans)
Provenance recognizes value of curation, collaboration, uniqueness, etc
[Buneman2000]
In 2004, molecular biology had over 500 databases. Most contained data derived from other databases. [Tan2004]
10
Use Cases [Tan2004]
Gauging trustworthiness of data
How many generations of derivation?
What’s original vs curated?
Sharing knowledge about data via annotation
Verifying data that is itself subject to update
11
Immediate Rationale
NSF Grant General Conditions (GC-1), January 5, 2009
§38a. Sharing of Findings, Data, and Other Research Products
NSF expects significant findings from research and education activities it supports to be promptly submitted for publication, with authorship that accurately reflects the contributions of those involved. It expects investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work. It also encourages grantees to share software and inventions or otherwise act to make the innovations they embody widely useful and usable.
Adjustments and, where essential, exceptions may be allowed to safeguard the rights of individuals and subjects, the validity of results, or the integrity of collections or to accommodate legitimate interests of investigators.
13
Taxonomy
Applications, Subjects, Representations, Storage, Dissemination
Survey of 5 systems
4 systems: workflows for scientific experiments and simulations
1 system: transformations via database queries
14
Application of Provenance
Data Quality: proof statements on derivations, estimations of quality and reliability
Audit Trail: resource usage, error detection
Replication Recipes: maintain data currency, repeat experiments, cross-site transfers, replication cost estimates
Attribution: copyright, data ownership, citations, liability
Informational: interpretation context
Commercial: accountability (credit reports)
15
Subjects of Provenance
Data-oriented model (explicit)
Process-oriented model (indirect)
Granularity = resolution of objects tracked
Note: [Buneman2007] defines coarse granularity as workflow tracking, and fine granularity as tracking individual tuples
Cost of collecting and storing provenance is roughly inversely proportional to granularity: C(provenance) ~ 1/granularity
16
Representation of Provenance
Annotations
Notes and descriptions of source data and processes (incl. parameters, versions)
Eager form: tags the data and propagates with the data
Inversions
Identify the data on which a transformation depends
Query or notes
Less accurate
17
Query Inversion Example [Buneman2000]
Objective: {Query → tuple | tuple → result} per some minimal derivation
Invariant under query rewriting, composition
Prefer strong inverses:
select salary from x where salary < 1000
Avoid weak inverses:
select salary from x where salary < (select avg(salary) from x)
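The contrast between the two queries can be sketched with Python's sqlite3 module. This is one reading of the slide, not the formal definitions in [Buneman2000]: the table contents are made up, and `strong_inverse` and `weak_inverse` are hypothetical helper names. For the plain selection, the source rows of an output value can be recovered from that value alone. For the avg query, the threshold depends on every row, so the sketch reports the whole table as contributing.

```python
import sqlite3

# Hypothetical table x with made-up salaries.
conn = sqlite3.connect(":memory:")
conn.execute("create table x (id integer primary key, salary integer)")
conn.executemany("insert into x (id, salary) values (?, ?)",
                 [(1, 500), (2, 900), (3, 1500), (4, 4000)])

def strong_inverse(value):
    """Source rows behind one output value of:
    select salary from x where salary < 1000
    Only rows holding that value are needed."""
    rows = conn.execute(
        "select id from x where salary = ? and salary < 1000", (value,))
    return {r[0] for r in rows}

def weak_inverse(value):
    """Source rows behind one output value of the avg(salary) query.
    The matching row is a witness, but the threshold is computed from
    every row, so the whole table is reported as contributing."""
    witnesses = conn.execute(
        "select id from x where salary = ? "
        "and salary < (select avg(salary) from x)", (value,)).fetchall()
    if not witnesses:
        return set()
    return {r[0] for r in conn.execute("select id from x")}

print(strong_inverse(900))
print(weak_inverse(900))
```

The design point is that the first inverse stays small and local, while the second grows with the table, which is one reason to prefer queries of the first kind when provenance is computed by inversion.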
18
Annotation Example [Tan2004]
DBNotes embeds attributes and automatically propagates them with query results
Q1: select distinct Desc from SWISSPROT propagate default
union
select distinct Desc from GENBANK propagate default
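A rough sketch of what default propagation means for Q1, written in plain Python rather than DBNotes itself. The rows, the annotation texts, and the helper names `select_distinct` and `union` are all assumptions for illustration, and the real system tracks annotations in more detail than a set per value. The idea shown is only that annotations travel with the values they are attached to and are pooled when equal values are merged.

```python
from collections import defaultdict

# Each source row is (Desc value, set of annotations on that value).
# Rows and annotation texts are made up for illustration.
SWISSPROT = [("kinase", {"curated by hand"}), ("protease", {"from SWISSPROT"})]
GENBANK = [("kinase", {"imported 2004"}), ("ligase", set())]

def select_distinct(rows):
    """Collapse duplicate values, keeping the union of their annotations."""
    merged = defaultdict(set)
    for value, notes in rows:
        merged[value] |= notes
    return dict(merged)

def union(left, right):
    """Union of two annotated results; equal values pool their annotations."""
    out = {value: set(notes) for value, notes in left.items()}
    for value, notes in right.items():
        out.setdefault(value, set()).update(notes)
    return out

q1 = union(select_distinct(SWISSPROT), select_distinct(GENBANK))
for value in sorted(q1):
    print(value, sorted(q1[value]))
```

A value that appears in both sources ends up carrying the annotations of both, so a reader of the result can see where each description came from.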
19
Provenance Storage
Scalability(Annotations) < Scalability(Inversions)
Annotations: Embedded vs separate
Embedded easier to maintain
Separate easier to search and publish
Immutability adds to trust
Automatic collection: more complete, less insightful
20
Provenance Dissemination
Derivation graphs
Metadata search
Provenance retrieval APIs
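A derivation graph and a small retrieval API over it might look like the following sketch. The file names and the functions `lineage` and `original_sources` are hypothetical and not taken from any of the surveyed systems. The graph maps each data product to the products it was derived from, and the API answers "what does this product depend on?".

```python
# Hypothetical derivation graph: product -> products it was derived from.
DERIVED_FROM = {
    "report.pdf": ["summary.csv"],
    "summary.csv": ["cleaned.csv", "params.json"],
    "cleaned.csv": ["raw.csv"],
    "raw.csv": [],
    "params.json": [],
}

def lineage(product, graph=DERIVED_FROM):
    """Return every ancestor of `product`, found by depth-first traversal."""
    seen = set()
    stack = list(graph.get(product, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

def original_sources(product, graph=DERIVED_FROM):
    """Ancestors that were not themselves derived from anything."""
    return {n for n in lineage(product, graph) if not graph.get(n)}

print(sorted(lineage("report.pdf")))
print(sorted(original_sources("report.pdf")))
```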
21
Taxonomy of Provenance
23
System Survey
24
System Survey - Chimera
Workflows specified via Virtual Data Language (DAG) that relates datasets and transformations
VDL maps to SQL using schemas
VDL repository can be searched
25
System Survey - myGrid
Services: resource discovery, workflow enactment, provenance management
XScufl language/Taverna engine
Provenance: services invoked, params, start/end time, data used, inversions (automatically recorded)
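Automatic recording of the items listed above can be sketched as a wrapper around each workflow step. This is not myGrid or Taverna code: the `recorded` decorator, the in-memory `PROVENANCE_LOG`, and the toy `normalize` step are assumptions made for illustration. Each call appends the service name, parameters, start and end time, and the output.

```python
import functools
import time

PROVENANCE_LOG = []  # hypothetical in-memory store; a real system would persist this

def recorded(service):
    """Wrap a workflow step so each invocation appends a provenance record."""
    @functools.wraps(service)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = service(*args, **kwargs)
        PROVENANCE_LOG.append({
            "service": service.__name__,
            "params": {"args": args, "kwargs": kwargs},
            "start": start,
            "end": time.time(),
            "output": result,
        })
        return result
    return wrapper

@recorded
def normalize(values, scale=1.0):
    """Toy workflow step: rescale values so the maximum equals `scale`."""
    top = max(values)
    return [scale * v / top for v in values]

normalize([2, 4, 8], scale=100.0)
print(PROVENANCE_LOG[0]["service"], PROVENANCE_LOG[0]["output"])
```

Because the record is produced by the wrapper and not by the step author, it is complete for every call, which matches the "more complete, less insightful" trade-off of automatic collection.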
26
System Survey - CMCS
Semantically oriented manual annotation language
Data updates triggered by provenance information (like a Makefile)
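The Makefile analogy can be sketched as a staleness check over provenance records. The product names, timestamps, and the functions `is_stale` and `stale_products` are hypothetical and do not describe the CMCS implementation. A product needs regeneration when any input is newer than it, or when an input is itself stale.

```python
# Hypothetical provenance records: when each product was made, and from what.
PRODUCED_AT = {"raw.dat": 10, "calibration.dat": 25,
               "spectrum.dat": 20, "figure.png": 30}
INPUTS = {"spectrum.dat": ["raw.dat", "calibration.dat"],
          "figure.png": ["spectrum.dat"]}

def is_stale(product):
    """True if any input is newer than the product or is itself stale."""
    for source in INPUTS.get(product, []):
        if is_stale(source) or PRODUCED_AT[source] > PRODUCED_AT[product]:
            return True
    return False

def stale_products():
    """Derived products that should be regenerated, in name order."""
    return sorted(p for p in INPUTS if is_stale(p))

print(stale_products())
```

Here the calibration file was updated after the spectrum was computed, so the spectrum and the figure built from it are both flagged.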
27
System Survey - Earth System Science Workbench
Detects errors in derived data products and assesses quality of datasets
Script writer fills/stores lineage templates
ESSW executes script, builds DAG
DAG viewable through browser
28
System Survey - Trio
Stores source tuples and the inverse query
29
Implications for PALMS/CitiCORE
Provenance tracking reflects requirements of stakeholders – discussions are appropriate
End-to-end use cases should be explicitly generated
Provenance implementations may be responsive to a policy approach
Careful attention must be paid to lifecycles of data, metadata, calculations, and other resources
31
Provenance Servers
[Rajbhandari2004]
[IBM2003] proposes provenance server that signs provenance and returns it to database
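The idea of a server that signs provenance can be illustrated with a keyed hash. This sketch is an assumption throughout: the slide does not say which signature scheme [IBM2003] uses, so an HMAC-SHA256 with a server-held key stands in for it, and `sign_provenance`, `verify_provenance`, and the record fields are hypothetical names.

```python
import hashlib
import hmac
import json

# Assumption: a secret held only by the provenance server.
SERVER_KEY = b"demo-key-not-for-production"

def sign_provenance(record):
    """Return the record with an HMAC-SHA256 signature over its canonical JSON."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    signature = hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest()
    return {"record": record, "signature": signature}

def verify_provenance(signed):
    """Recompute the signature and compare it in constant time."""
    payload = json.dumps(signed["record"], sort_keys=True).encode("utf-8")
    expected = hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])

signed = sign_provenance({"tuple_id": 42, "source": "lab-A", "step": "calibrate"})
tampered = {"record": dict(signed["record"], source="lab-B"),
            "signature": signed["signature"]}
print(verify_provenance(signed), verify_provenance(tampered))
```

The database stores the signed record alongside the data; any later change to the provenance fields makes verification fail.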
32
Digital Data Provenance [Plale2005]
Karma
Toolkit for collecting provenance uniformly from grid and web service workflows
Phala
Case-based reasoning recommender system
34
Future Research
Create provenance metadata standards
Seamlessly represent provenance from workflow and database models
Store provenance about missing or deleted data (phantom lineage)
Federate provenance information across organizations
35
Future Research [Buneman2000]
Archiving/restoring provenance-referenced data
Release successor versions of database as separate documents
Maintain delta chains and timestamps
Package all tuples as “self-aware” … containing own provenance
How to articulate inversions … XPath? ID attribute keys?
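The delta-chain option can be sketched as a base snapshot plus timestamped changes, from which any earlier version is rebuilt on demand. The keys, values, timestamps, and the function `version_at` are made up for illustration and are not from [Buneman2000].

```python
# Hypothetical base snapshot and timestamped deltas, oldest first.
BASE = {"p1": "kinase", "p2": "ligase"}
DELTAS = [
    (100, {"set": {"p3": "protease"}, "delete": []}),
    (200, {"set": {"p1": "kinase (revised)"}, "delete": ["p2"]}),
]

def version_at(timestamp):
    """Rebuild the database as it stood at `timestamp` by replaying deltas."""
    state = dict(BASE)
    for stamp, delta in DELTAS:
        if stamp > timestamp:
            break  # later changes are not part of this version
        state.update(delta["set"])
        for key in delta["delete"]:
            state.pop(key, None)
    return state

print(version_at(100))
print(version_at(250))
```

A provenance reference then only needs a key and a timestamp to name the exact value it was derived from, even after that value has been revised or deleted.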
36
Future Research [Tan2007]
Extend provenance to workflows represented by web services
Generate provenance of data generated by “black box”
Efficiently archive versions of databases whose schemas change over time
Recover a version of a database that undergoes updates over time
37
Future Research [Komatsoulis2004]
Measurement of Data Reliability via provenance
Assertions have unique identifiers
Tracked attributes: generating source, immediate source, number of transformations, transformation type, reference, evidence code
38
Future Research [Plale2005]
Quality metrics from: attributes of data set, attributes of generating process, ancestral datasets
Compare datasets
Search for datasets
Community feedback