Provenance in scientific workflows in e-Science

Page 1

A Survey of Data Provenance in e-Science [et al]
Simmhan, Plale & Gannon

UC San Diego, CSE 294
February 5, 2010
Barry Demchak

Pompeii – AD 62

Page 2

[et al]

[Simmhan2005] Y.L. Simmhan, B. Plale, D. Gannon. A Survey of Data Provenance in e-Science. SIGMOD Record, Vol 34, No. 3, Sept 2005.

[Buneman2000] P. Buneman, S. Khanna, W.C. Tan. Data Provenance: Some Basic Issues. Lecture Notes in Computer Science. Vol 1974, pp 87-, 2000.

[Tan2004] W.C. Tan. Research Problems in Data Provenance. IEEE Data Eng. Bull. 27(4):45-52, 2004.

[Tan2007] W.C. Tan. Provenance in Databases: Past, Current, and Future. IEEE Data Eng. Bull. 30(4):3-12, 2007.

[Buneman2007] P. Buneman, W.C. Tan. Provenance in Databases (Tutorial Outline). SIGMOD ’07, Beijing, China, 2007.

[Rajbhandari2004] S. Rajbhandari & D. Walker. Support for Provenance in Service-based Computing Grid. Cardiff University, 2004. http://www.wesc.ac.uk/resources/presentations/AHM04/194.pdf

[IBM2003] IBM Corporation. Assured Data Provenance. 2003. http://priorartdatabase.com/IPCOM/000010757/

[Komatsoulis2004] G. Komatsoulis. Toward a Functional Model of Data Provenance. cancer Biomedical Informatics Grid. 2004. https://cabig.nci.nih.gov/workspaces/Architecture/Meetings/Architecture_Workspace/f2f-meetings/ARCH-VCDE-F2F/Day 1 F2F Presentations 10_25/Data Provenance

[Plale2005] B. Plale & Y. Simmhan. Metadata, Ontologies, and Provenance: Towards Extended Forms of Data Management. Indiana University. Spring 2005. http://vw.indiana.edu/talks-spring05/plale.ppt

[WLIA1994] Wisconsin Land Information Association. WLIA Standard. August 1994. http://www.wlia.org/resources/standard4.pdf

Page 3

Agenda

What is Data Provenance?
Why is it important?
What are its dimensions?
Examples of data provenance techniques
What tools exist?
What are the research frontiers?

Page 4

Data Provenance

Information that helps determine the derivation history of a data product, starting from its original sources

Includes process of derivation [Buneman2000]
Materials and transformations [Lanter1991]
Workflows, annotations, notes [Greenwood 2003]

Page 5

Data Provenance (Others)

[Tan2007]
  Workflow: interconnection of computation steps and human-machine interaction steps
  Workflow provenance: record of the entire history of the derivation of the final output of the workflow

[Komatsoulis2004]
  6 W’s Plus
    Who, What, When, Where, Why, How
    Chain of Custody
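The "6 W's plus chain of custody" idea can be pictured as a small record type. The following is only an illustrative sketch, not a model taken from [Komatsoulis2004]: the class names `CustodyEvent` and `TrackedProduct`, the field layout, and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class CustodyEvent:
    """One provenance entry answering the six questions (hypothetical layout)."""
    who: str        # person or service that handled the data
    what: str       # data product affected
    when: datetime  # time of the event
    where: str      # site or system where it happened
    why: str        # purpose of the step
    how: str        # method or tool applied


@dataclass
class TrackedProduct:
    """A data product plus its chain of custody, oldest event first."""
    name: str
    chain: list = field(default_factory=list)

    def record(self, event: CustodyEvent) -> None:
        """Append one event to the end of the chain."""
        self.chain.append(event)

    def custodians(self) -> list:
        """Return the handlers in the order they touched the product."""
        return [event.who for event in self.chain]


# Sample data invented for illustration.
product = TrackedProduct("survey_results")
product.record(CustodyEvent("field_team", "survey_results",
                            datetime(2010, 1, 4, tzinfo=timezone.utc),
                            "site A", "initial collection", "GPS logger"))
product.record(CustodyEvent("analyst", "survey_results",
                            datetime(2010, 1, 9, tzinfo=timezone.utc),
                            "lab server", "error correction", "manual review"))
print(product.custodians())
```

Making each event immutable (`frozen=True`) mirrors the idea that a custody entry, once written, should not be edited afterwards.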

Page 7

[Near] Historical Context

Embedded in earliest GIS standards (as lineage) {suitability for use}
Materials engineering (as pedigree) {system failure and audit}
Life sciences (as transformation history) {attribution, suitability for use}
Intellectual property {for patents}
Databases {veracity, quality}

Lots of purposes and uses → taxonomy

Page 8

Spatial Data Transfer Standard (SDTS)
  Description of how to apply parameters in a calculation process (e.g., photo rectification)
  Data quality
    Positional accuracy
    Attribute accuracy
    Logical consistency
    Completeness

Wisconsin Land Information Association (WLIA) Standard [WLIA1994]

http://www.fgdc.gov/standards/projects/FGDC-standards-projects/SDTS/sdts_pt5/srpe0299.pdf

Page 9

Bioinformatics Data Flow

Curation (Lit): classification, annotation, error correction (by humans)

Provenance recognizes value of curation, collaboration, uniqueness, etc. [Buneman2000]

In 2004, molecular biology had over 500 databases. Most contained data derived from other databases. [Tan2004]

Page 10

Use Cases [Tan2004]

Gauging trustworthiness of data
  How many generations of derivation?
  What’s original vs curated?
Sharing knowledge about data via annotation
Verifying data that is itself subject to update

Page 11

Immediate Rationale

NSF Grant General Conditions (GC-1), January 5, 2009

§38a. Sharing of Findings, Data, and Other Research Products

NSF expects significant findings from research and education activities it supports to be promptly submitted for publication, with authorship that accurately reflects the contributions of those involved. It expects investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work. It also encourages grantees to share software and inventions or otherwise act to make the innovations they embody widely useful and usable.

Adjustments and, where essential, exceptions may be allowed to safeguard the rights of individuals and subjects, the validity of results, or the integrity of collections or to accommodate legitimate interests of investigators.

Page 13

Taxonomy

Applications
Subjects
Representations
Storage
Dissemination

Survey of 5 systems
  4 systems: workflows for scientific experiments and simulations
  1 system: transformations via database queries

Page 14

Application of Provenance

Data Quality
  Proof statements on derivations
  Estimations of quality and reliability
Audit Trail
  Resource usage
  Error detection
Replication Recipes
  Maintain data currency
  Repeat experiments
  Cross-site transfers
  Replication cost estimates
Attribution
  Copyright
  Data ownership
  Citations
  Liability
Informational
  Interpretation context
Commercial
  Accountability (credit reports)

Page 15

Subjects of Provenance

Data-oriented model (explicit)
Process-oriented model (indirect)

Granularity = resolution of objects tracked

Note: [Buneman2007] defines coarse granularity as workflow tracking, and fine granularity as tracking individual tuples

C(provenance) ~ 1/granularity

Page 16

Representation of Provenance

Annotations
  Notes and descriptions of source data and processes (incl. parameters, versions)
  Eager form tags data and propagates w/ data
Inversions
  Identifies data on which transformation depends
  Query or notes
  Less accurate

Page 17

Query Inversion Example [Buneman2000]

Objective: {Query → tuple | tuple → result} per some minimal derivation

Invariant under query rewriting, composition

Prefer strong inverses:
  select salary from x where salary < 1000

Avoid weak inverses:
  select salary from x
  where salary < (select avg(salary) from x)
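One way to read the two queries above is in terms of how many source rows each output depends on. The sketch below uses Python's standard `sqlite3` module and an invented four-row table `x`; the function names and sample salaries are assumptions for illustration. It also assumes that the first query is easy to invert because each output traces to exactly one row, while the second is hard because the `avg` subquery makes every output depend on the whole table. This is an interpretation of the slide, not the formal definition in [Buneman2000].

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [("ann", 500), ("bob", 900), ("cy", 1500), ("dee", 4000)],  # invented rows
)


def lineage_simple(conn):
    """Pair each output salary with the single source rowid it came from."""
    rows = conn.execute(
        "SELECT rowid, salary FROM x WHERE salary < 1000 ORDER BY rowid"
    ).fetchall()
    return [(salary, [rowid]) for rowid, salary in rows]


def lineage_avg(conn):
    """Pair each output salary with every rowid, since avg() reads them all."""
    all_ids = [r[0] for r in conn.execute("SELECT rowid FROM x ORDER BY rowid")]
    rows = conn.execute(
        "SELECT rowid, salary FROM x "
        "WHERE salary < (SELECT avg(salary) FROM x) ORDER BY rowid"
    ).fetchall()
    return [(salary, list(all_ids)) for _rowid, salary in rows]


print(lineage_simple(conn))
print(lineage_avg(conn))
```

With the sample data the average salary is 1725, so the second query returns three rows, and changing any row in `x` (even one that is not returned) could change that result.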

Page 18

Annotation Example [Tan2004]

DBNotes embeds attributes and automatically propagates them with query results

Q1: select distinct Desc from SWISSPROT propagate default
    union
    select distinct Desc from GENBANK propagate default
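The effect of Q1 can be imitated in a few lines of Python. This is a rough sketch of eager annotation propagation, not DBNotes itself: it assumes each cell is stored as a `(value, annotations)` pair and that duplicate values merged by `distinct` or `union` keep the union of their annotations. The helper names and the sample rows are invented.

```python
def select_distinct(table, column):
    """Project one column, merging the annotations of duplicate values."""
    result = {}
    for row in table:
        value, notes = row[column]
        result.setdefault(value, set()).update(notes)
    return result


def union(left, right):
    """Combine two projected results, again merging annotations per value."""
    result = {value: set(notes) for value, notes in left.items()}
    for value, notes in right.items():
        result.setdefault(value, set()).update(notes)
    return result


# Each cell is (value, set of annotations); contents are made up.
swissprot = [
    {"Desc": ("kinase", {"curated by SWISS-PROT"})},
    {"Desc": ("protease", {"checked 2004"})},
]
genbank = [
    {"Desc": ("kinase", {"from GenBank submission"})},
]

q1 = union(select_distinct(swissprot, "Desc"), select_distinct(genbank, "Desc"))
for value in sorted(q1):
    print(value, sorted(q1[value]))
```

The value `kinase` appears in both sources, so its single output row carries the annotations from both.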

Page 19

Provenance Storage

Scalability(Annotations) < Scalability(Inversions)

Annotations: Embedded vs separate
  Embedded easier to maintain
  Separate easier to search and publish
  Immutability adds to trust

Automatic collection
  More complete
  Less insightful

Page 20

Provenance Dissemination

Derivation graphs
Metadata search
Provenance retrieval APIs
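A derivation graph together with a retrieval call might look like the sketch below. The graph encoding (a dictionary from each product to its direct inputs), the function name `ancestors`, and the dataset names are assumptions for illustration and do not come from any of the surveyed systems.

```python
from collections import deque


def ancestors(derived_from, product):
    """Return every upstream input of `product`, nearest first (breadth-first)."""
    found = []
    visited = {product}
    queue = deque([product])
    while queue:
        current = queue.popleft()
        for source in derived_from.get(current, []):
            if source not in visited:
                visited.add(source)
                found.append(source)
                queue.append(source)
    return found


# product -> direct inputs (hypothetical datasets)
graph = {
    "plot": ["table"],
    "table": ["raw_a", "raw_b"],
}
print(ancestors(graph, "plot"))
```

The `visited` set keeps the walk from listing a shared input twice when two intermediate products derive from the same source.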

Page 21

Taxonomy of Provenance

Page 23

System Survey

Page 24

System Survey - Chimera

Workflows specified via Virtual Data Language (DAG) that relates datasets and transformations
VDL maps to SQL using schemas
VDL repository can be searched

Page 25

System Survey - myGrid

Services: resource discovery, workflow enactment, provenance management
XScufl language/Taverna engine
Provenance: services invoked, params, start/end time, data used, inversions (automatically recorded)

Page 26

System Survey - CMCS

Semantically oriented manual annotation language

Data updates triggered on provenance information (like a Makefile)
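The Makefile analogy can be made concrete with a small staleness check. This is a sketch of the general Make rule only (a product is out of date if any input is newer or is itself out of date), not of how CMCS implements it. The function name, the integer timestamps, and the dataset names are assumptions, and the dependency graph is assumed to be acyclic.

```python
def stale_products(inputs_of, modified_at):
    """Return, sorted by name, the derived products that need regenerating."""
    cache = {}

    def is_stale(name):
        if name in cache:
            return cache[name]
        cache[name] = False  # placeholder while recursing; graph assumed acyclic
        result = any(
            is_stale(source) or modified_at[source] > modified_at[name]
            for source in inputs_of.get(name, [])
        )
        cache[name] = result
        return result

    return sorted(name for name in inputs_of if is_stale(name))


# Provenance says: clean was derived from raw, report from clean (hypothetical).
inputs_of = {"clean": ["raw"], "report": ["clean"]}
modified_at = {"raw": 5, "clean": 3, "report": 4}
print(stale_products(inputs_of, modified_at))
```

Here `raw` changed after `clean` was produced, so `clean` is stale, and `report` is stale because it depends on `clean`.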

Page 27

System Survey - Earth System Science Workbench

Detects errors in derived data products and assesses quality of datasets
Script writer fills/stores lineage templates
ESSW executes script, builds DAG
DAG viewable through browser

Page 28

System Survey - Trio

Stores source tuples and inverse queries

Page 29

Implications for PALMS/CitiCORE

Provenance tracking reflects requirements of stakeholders – discussions are appropriate

End-to-end use cases should be explicitly generated

Provenance implementations may be responsive to a policy approach

Careful attention must be paid to lifecycles of data, metadata, calculations, and other resources

Page 31

Provenance Servers

[Rajbhandari2004]

[IBM2003] proposes provenance server that signs provenance and returns it to database
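The signing step can be illustrated with Python's standard `hmac`, `hashlib`, and `json` modules. This is a stand-in under stated assumptions: the slide does not say which signature scheme [IBM2003] uses, so an HMAC with a key held by the server is used here purely for illustration (a real server would more likely use a public-key signature). The function names, the record fields, and the key are invented.

```python
import hashlib
import hmac
import json


def sign_provenance(record, key):
    """Serialize the record deterministically and return its HMAC-SHA256 tag."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_provenance(record, signature, key):
    """Check that the record still matches the signature issued for it."""
    expected = sign_provenance(record, key)
    return hmac.compare_digest(expected, signature)


server_key = b"demo-key-not-for-production"  # hypothetical server secret
record = {"product": "table_7", "derived_from": ["raw_a"], "step": "normalize"}
signature = sign_provenance(record, server_key)

tampered = dict(record, step="smooth")
print(verify_provenance(record, signature, server_key))
print(verify_provenance(tampered, signature, server_key))
```

Sorting the keys before serializing matters: the same record must always produce the same bytes, or a valid signature would fail to verify.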

Page 32

Digital Data Provenance [Plale2005]

Karma
  Toolkit for collecting provenance uniformly from grid and web service workflows
Phala
  Case-based reasoning recommender system

Page 34

Future Research

Create provenance metadata standards
Seamlessly represent provenance from workflow and database models
Store provenance about missing or deleted data (phantom lineage)
Federate provenance information across organizations

Page 35

Future Research [Buneman2000]

Archiving/restoring provenance-referenced data
  Release successor versions of database as separate documents
  Maintain delta chains and timestamps
Package all tuples as “self-aware” … containing own provenance
How to articulate inversions … XPath? ID attribute keys?
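The "delta chains and timestamps" option can be sketched as a base snapshot plus a list of timestamped changes that is replayed up to the requested time. The data layout (dictionaries keyed by tuple ID, with `None` marking a deletion), the function name `version_at`, and the sample values are assumptions for illustration, not a design from [Buneman2000].

```python
def version_at(base, deltas, timestamp):
    """Rebuild the database state as of `timestamp` from a base and its deltas."""
    state = dict(base)
    for delta_time, changes in sorted(deltas, key=lambda delta: delta[0]):
        if delta_time > timestamp:
            break  # later deltas are not part of the requested version
        for key, value in changes.items():
            if value is None:
                state.pop(key, None)  # None marks a deleted tuple
            else:
                state[key] = value
    return state


base = {"p1": "kinase", "p2": "protease"}
deltas = [
    (10, {"p2": "serine protease"}),  # update
    (20, {"p3": "ligase"}),           # insert
    (30, {"p1": None}),               # delete
]
print(version_at(base, deltas, 25))
```

Because older versions stay reconstructible, provenance that refers to a tuple as it stood at time 10 can still be resolved after the tuple is later changed or deleted.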

Page 36

Future Research [Tan2007]

Extend provenance to workflows represented by web services

Generate provenance of data generated by “black box”

Efficiently archive versions of databases whose schemas change over time

Recover a version of a database that undergoes updates over time

Page 37

Future Research [Komatsoulis2004]

Measurement of Data Reliability via provenance
Assertions have unique identifiers
Tracked attributes
  Generating source
  Immediate source
  Number of transformations
  Transformation type
  Reference
  Evidence code

Page 38

Future Research [Plale2005]

Quality metrics from
  Attributes of data set
  Attributes of generating process
  Ancestral datasets
Compare datasets
Search for datasets
Community feedback