CIS 602: Provenance & Scientific Data Management ...

26
CIS 602, Fall 2014 CIS 602: Provenance & Scientific Data Management Provenance Standards Dr. David Koop

Transcript of CIS 602: Provenance & Scientific Data Management ...

Page 1: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

CIS 602: Provenance & Scientific Data Management

Provenance Standards

Dr. David Koop

Page 2: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Reminders• Project Emails:

- Almost everyone has now received my feedback on their projects- If you do not email me back, I will assume that you concur with

my understanding of your project- That means, what I wrote is what I expect to see in the final project!

• Project Progress Reports Due November 13- Can serve as the start to your final project report

2

Page 3: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Related Work and Citations• Related work is meant to provide context and background as well

as give credit for past work• Citations point to other papers that a reader can look at for more

information- Citations should immediately the use of the concept, summary,

argument, or position in the paper- Citations that reference examples can be as simple as

(e.g. [21, 41]) or (e.g. (Smith, 1994), (Jones, 2001a))• Direct quotes are rare in computer science research papers

- Reuse an exact definition- Point to a position that you are concerned about misrepresenting

3

Page 4: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Reading Quiz

4

Page 5: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

History of Provenance (in Computer Science)• Data Derivation & Provenance Workshop, 2002• International Provenance & Annotation Workshop (IPAW)

- Meeting for community to discuss provenance and provenance-specific issues

- Published proceedings, papers, and posters- Held every other year, 2006-present

• Workshop on the Theory and Practice of Provenance (TaPP)- Developed to focus on new ideas, short papers on new ideas,

vision statements- Held every year, 2009-present

• This year IPAW and TaPP were co-located and integrated into a “Provenance Week” event

5

Page 6: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Provenance Challenges (2006-2009)• Developed to “understand the different representations used for

provenance, its common aspects, and the reasons for its differences” [http://twiki.ipaw.info/bin/view/Challenge/]

• Led to development of Open Provenance Model (OPM) and eventually PROV.

6

Page 7: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

First Provenance Challenge (2006)• Focus on representations, capabilities, and scope of provenance-

aware systems• Developed a simple example workflow from Functional MRI• 17 Teams contributed:

- Representations of the workflow in their system- Representations of provenance for the example workflow- Representations of the result of the core (and other) queries- For each query, whether (1) the query can be answered by the

system, (2) the system cannot answer the query now but considers it relevant, (3) the query is not relevant to the project.

7

Page 8: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

First Provenance Challenge

8

1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is.

2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.

3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.

4. Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model that ran on a Monday.

5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095.

6. Find all output averaged images of softmean procedures where the warped images taken as input were align_warped using a 12th order nonlinear 1365 parameter model.

7. A user has run the workflow twice, in the second instance replacing each procedures (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs.

8. A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.

9. A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.

Page 9: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

TH

EFIR

ST

PR

OV

EN

AN

CE

CH

ALLE

NG

E7

Figure

2.Sum

mary

ofC

ontributions

Copyrig

ht

c�2000

John

Wiley

&Sons,

Ltd

.Concu

rrency

Com

puta

t.:Pra

ct.Exper.

2000;00:1

–7

Prepa

redusin

gcp

eauth

.cls

First Provenance Challenge Results

9

Page 10: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Second Provenance Challenge (2007)

10

• First challenge helped identify common characteristics as well as differences

• Want to better understand how the systems relate• Focus on interoperability• Teams share provenance of different components of workflow, run

queries over compositions of provenance data (cross-model provenance queries)

• Same workflow and queries as in first challenge• Goals:

- Understand where data in one model is translatable to or has no parallel in another model.

- Understand how the provenance of data can be traced across multiple systems, so adding value to all those systems.

Page 11: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Second Provenance Challenge Results• Interoperability was fairly straightforward: teams could understand

each other’s provenance without much interaction• Issues:

- Naming: identifying inputs/outputs across different systems’ provenance can be challenging without any agreed-upon identification scheme• Artificial problem?• Could compare data by checksums/hashes• Defer to actual names used in executions

- Extraneous Information: other systems capture provenance information that isn’t used by another system’s query system

• Different levels of abstraction for provenance• General agreement on annotated causality graph representation

11

Page 12: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Open Provenance Model (2007)• http://twiki.ipaw.info/bin/view/Challenge/OPM• Requirements:

- To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model.

- To allow developers to build and share tools that operate on such provenance model.

- To define the model in a precise, technology-agnostic manner. - To support a digital representation of provenance for any “thing",

whether produced by computer systems or not. - To define a core set of rules that identify the valid inferences that

can be made on provenance graphs.

12

[L. Moreau et al.]

Page 13: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Open Provenance Model• Non-requirements

- It is not the purpose of this document to specify the internal representations that systems have to adopt to store and manipulate provenance internally; systems remain free to adopt internal representations that are fit for their purpose.

- It is not the purpose of this document to define a computer-parseable syntax for this model; model implementations in XML, RDF or others will be specified in separate documents, in the future.

- We do not specify protocols to store such provenance information in provenance repositories.

- We do not specify protocols to query provenance repositories.

13

[L. Moreau et al.]

Page 14: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

OPM Components• Artifact: Immutable piece of state, which may have a physical

embodiment in a physical object, or a digital representation in a computer system

• Process: Action or series of actions performed on or caused by artifacts, and resulting in new artifacts

• Agent: Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution

14

Artifact Process Agent[L. Moreau et al.]

Page 15: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

A1 A2

P1 P2wasTriggeredBy

wasDerivedFrom

A Pused(R)

APwasGeneratedBy(R)

Ag PwasControlledBy(R)

OPM Edges

15

[L. Moreau et al.]

Page 16: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

100gFlour

100gSugar

2 eggs

Bake Cake

John

100gButter

wasGeneratedBy(cake)

used(flo

ur)

used(sugar)

used(egg)

used(butter)wasControlledBy(cook)

OPM Cake Example

16

[L. Moreau et al.]

Page 17: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

A1 A2P1 P2used(R3)

wasGene-ratedBy(R2)used(R1)

wasTriggeredBy

wasDerivedFrom(asserted)

Acc1 Acc3Acc2

mayHaveBeenDerivedFrom(inferred)

OPM Inferences

17

[L. Moreau et al.]

Page 18: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Other OPM Notes

18

• Model is timeless but may have time annotations• Has overlapping and hierarchical descriptions (Accounts)• Roles are mandatory in certain edges• Has support for collections

Page 19: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Third Provenance Challenge (2009)• Main Goal: test the efficacy of OPM in representing causality of real

world applications and, importantly, its usefulness for interoperating across systems

• Other goals:- Build momentum around the proposed standard, integrate

support for OPM within existing provenance systems- Spawn novel OPM-based tools for the community

19

[Y. Simmhan et al.]

Page 20: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

in the specification. There were two sets of queries: core queries that were required to be attempted by all teams and optional queries that could be suggested by the participants to highlight specific features or pitfalls.

Participants were asked to share their workflows, exported OPM graphs, and results of running the queries on provenance that they collect and OPM documents they import on the PC3 wiki site [12] and present their results at the challenge workshop.

3.2 The Pan-STARRS Load Workflow Panoramic Sky Survey and Rapid Response System (Pan-STARRS) [13] is a next generation sky survey project that continuously scans the visible sky once a week and builds a time series of data that catalogs the solar system and can detect Earth impacting objects. Pan-STARRS uses several workflows for its data loading pipeline [14] that convert raw image data from the digital telescope into queryable observations and metadata that are stored in relational databases. One of these data loading workflows was selected and adapted for the provenance challenge. Figure 1 shows the data and control flows in the Pan-STAR load workflow used in the challenge.

The load workflow imports observations of space objects arriving in the form of Comma Separated Value (CSV) files into tables in a relational database. There are two inputs to the workflow: (1) the directory location of a set of related CSV files from a single telescope image that need to be loaded into a new database, and (2) a JobID that indicates the name of the database to be created to hold the data. The input directory contains a “CSV  Ready”  manifest file, listing the CSV files to be loaded, and three CSV files – detection, frame meta, and image meta – that each correspond to a table in the database. The workflow then performs several tasks that verify the existence of the input CSV files, create a database to load the files, iterate through each input CSV file to validate and load it into a distinct table in the, update computed columns in the table, perform post load validations, and finally compact the database once all files are successfully loaded. Tasks that fail to validate input data cause the workflow to halt. The load workflow tasks are listed in Table 1.

In Figure 1, solid lines denote the data passed from one task to another. These may be string paths and URIs that reference files, tables, and databases, or simple parameters used by the task. The C# and Java implementations use class objects and collections to represent some of these parameters, though all of them can be formatted as strings, basic types and their

Figure 1. Pan-STARRS load workflow used in the provenance challenge

Third Provenance Challenge Workflow & Queries• Pan-STARRS sky survey project• Loads CSV files and converts raw

image data into query-able observations that are stored in a relational database

• Queries:- For a given detection, which CSV

files contributed to it?- The user considers a table to contain

unexpected values. Was the range check performed for this table?

- Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?

20

Page 21: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Table 2. Details of Teams Participating in the Challenge.

Team Workflow/

Provenance

System

Native

Provenance

Format

OPM Binding

Format

Native

Provenance

Storage

PC3 Query

Format

Teams Interoperated With (Query Success Status)

NCSA Java/Tupelo RDF XML, RDF Java API NCSA (Success)

Univ of Chicago Swift Relational XML DBMS UChicago (Partial)

Microsoft

Research

Trident Relational XML DBMS/MS SQL Server

SQL MSR (Success), Soton (Fail), UoM (Fail), UCD (Partial), IU (Success), UUtah (Fail)

UC-Davis COMAD-Kepler Relational XML DBMS/ MySQL SQL UCD (Success), Soton, RPI, KCL, Harvard

Univ of Soton, USC-ISI

Java XML XML, RDF File Java API Soton (Success)

Univ of

Manchester

Taverna RDF XML, RDF SPARQL UoM, UCD, Soton, NCSA

RPI/Tetherless Java/ProtoProv RDF XML Jena SPARQL RPI (Success)

UvA/VL-e WS-VLAM/ PLIER

Relational XML DBMS SQL UvA (Success)

SDSC Kepler XML XQuery SDSC (Success)

Univ of Utah VisTrails, Python

XML XML XQuery UUtah (Partial), SDSC (Partial)

Kings College,

London

Java XML KCL (Success)

Harvard Univ Bash, Java/ PASS

Path Query Language Graphs

XML PQL DB Store PQL Harvard (Success)

Indiana Univ ODE/Karma Relational XML DBMS/ MySQL SQL IU (Success)

UTEP WDO-It! Proof Markup Language

File

There were several interesting approaches taken by the teams to address challenges posed by the interoperability problem. One was in dealing with control-flows in the workflow, which was not seen in the previous challenges. Some teams modeled the control-flow activities as processes that trigger subsequent processes, while others modeled them as explicit data-flows that pass the boolean value. Teams that used pure dataflow-based systems did not even support tracking of control-flows. Many teams also unrolled the iterations in the workflow, optionally keeping track of the iteration step count. Validation errors (e.g. IsMatchtableCount) were also handled differently, with certain teams able to halt workflow execution as expected while others let them proceed but use an error flag to suppress the application logic for subsequent workflow activities from executing. All of these techniques have a corresponding impact on the provenance collected and its interoperability.

Teams also encountered difficulties due to different approaches taken for recording process and artifact identifiers in OPM since the model does not prescribe any naming scheme. Also, OPM makes a distinction between the artifact instance at the moment it is used or generated, and the logical or physical identifier for the actual artifact, which caused some challenge queries to require additional examination. Naming was particularly a problem for database artifacts like tables and tuples since several teams did not support tracking of database artifacts. Different teams were able to track database and file entities at various levels of granularities.

Challenge queries that combined provenance information with additional metadata or data values also posed difficulties for teams. Additional ad hoc annotations were used by certain teams to capture part of the metadata within the OPM model itself, with particular teams even encoding the entire artifact value as part of the provenance annotation.

Given that this was the first time the OPM specification was put to use, the initial interoperability challenge for teams was in mapping from their native provenance model to OPM. Most teams were successful in achieving this to

Third Provenance Challenge Results

21

Page 22: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

PROV

22

• At Third Provenance Challenge Workshop, connections made with W3C and PROV incubator was created

• Some similarities to OPM, some of the same people involved, but also tried to involve a wider group

Page 23: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

PROV Model Primer

Y. Gil, S. Miles (eds.). K. Belhajjame, H. Deus, D. Garijo, G. Klyne, P. Missier, S. Soiland-Reyes, S. Zednik (contributers)

Presented by: Srinidhi Pesara

Page 24: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

PROV Tutorial

L. Moreau, P. Groth, T. D. Huynh

Page 25: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

PROV Discussion• What are the benefits of PROV?• What are potential problems with PROV?• PROV to promote “inter-operable interchange”, not necessarily the

primary storage of provenance for each system?• Extensions of the core model for specific systems and/or uses

- ProvONE: http://vcvcomputing.com/provone/provone.html- Wf4ever provenance: http://wf4ever.github.io/ro/

• Other examples:- Extracting PROV from Wikipedia: [P. Missier & Z. Chen, 2013]- Git2Prov: [T. de Nies et al., 2013]

25

Page 26: CIS 602: Provenance & Scientific Data Management ...

CIS 602, Fall 2014

Reminders• Project Emails:

- Almost everyone has now received my feedback on their projects- If you do not email me back, I will assume that you concur with

my understanding of your project- That means, what I wrote is what I expect to see in the final project!

• Project Progress Reports Due November 13- Can serve as the start to your final project report

• Next Class: Visualization & Provenance

26