CIS 602: Provenance & Scientific Data Management
Provenance Standards
Dr. David Koop
CIS 602, Fall 2014
Reminders
• Project Emails:
- Almost everyone has now received my feedback on their projects
- If you do not email me back, I will assume that you concur with my understanding of your project
- That means what I wrote is what I expect to see in the final project!
• Project Progress Reports Due November 13
- Can serve as the start to your final project report
Related Work and Citations
• Related work is meant to provide context and background as well as give credit for past work
• Citations point to other papers that a reader can look at for more information
- Citations should immediately follow the use of the concept, summary, argument, or position in the paper
- Citations that reference examples can be as simple as (e.g. [21, 41]) or (e.g. (Smith, 1994), (Jones, 2001a))
• Direct quotes are rare in computer science research papers
- Reuse an exact definition
- Point to a position that you are concerned about misrepresenting
Reading Quiz
History of Provenance (in Computer Science)
• Data Derivation & Provenance Workshop, 2002
• International Provenance & Annotation Workshop (IPAW)
- Meeting for the community to discuss provenance and provenance-specific issues
- Published proceedings, papers, and posters
- Held every other year, 2006-present
• Workshop on the Theory and Practice of Provenance (TaPP)
- Developed to focus on short papers on new ideas and vision statements
- Held every year, 2009-present
• This year IPAW and TaPP were co-located and integrated into a “Provenance Week” event
Provenance Challenges (2006-2009)
• Developed to “understand the different representations used for provenance, its common aspects, and the reasons for its differences” [http://twiki.ipaw.info/bin/view/Challenge/]
• Led to the development of the Open Provenance Model (OPM) and eventually PROV
First Provenance Challenge (2006)
• Focus on representations, capabilities, and scope of provenance-aware systems
• Developed a simple example workflow from Functional MRI
• 17 teams contributed:
- Representations of the workflow in their system
- Representations of provenance for the example workflow
- Representations of the results of the core (and other) queries
- For each query, whether (1) the query can be answered by the system, (2) the system cannot answer the query now but considers it relevant, or (3) the query is not relevant to the project
First Provenance Challenge
1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is.
2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
4. Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model that ran on a Monday.
5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095.
6. Find all output averaged images of softmean procedures where the warped images taken as input were align_warped using a 12th order nonlinear 1365 parameter model.
7. A user has run the workflow twice, in the second instance replacing each procedures (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs.
8. A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.
9. A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
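Queries 1 and 2 above amount to walking a provenance graph backward over its causal edges. A minimal sketch of that traversal, using a hypothetical, simplified version of the fMRI workflow's dependency graph (the node names and edges below are illustrative, not any team's actual representation):

```python
from collections import deque

# Hypothetical provenance graph: each node maps to the set of nodes that
# immediately caused it (collapsing used / wasGeneratedBy edges).
caused_by = {
    "AtlasXGraphic": {"convert"},
    "convert": {"AtlasXImage"},
    "AtlasXImage": {"slicer"},
    "slicer": {"AtlasImage"},
    "AtlasImage": {"softmean"},
    "softmean": {"ReslicedImage1", "ReslicedImage2"},
    "ReslicedImage1": {"reslice1"},
    "ReslicedImage2": {"reslice2"},
    "reslice1": {"align_warp1"},
    "reslice2": {"align_warp2"},
    "align_warp1": {"AnatomyImage1"},
    "align_warp2": {"AnatomyImage2"},
    "AnatomyImage1": set(),
    "AnatomyImage2": set(),
}

def lineage(graph, node):
    """Breadth-first walk over causal edges: everything that led to `node`."""
    seen, queue = set(), deque([node])
    while queue:
        for cause in graph.get(queue.popleft(), ()):
            if cause not in seen:
                seen.add(cause)
                queue.append(cause)
    return seen

# Query 1: everything that caused Atlas X Graphic to be as it is
print(sorted(lineage(caused_by, "AtlasXGraphic")))
```

Query 2 is the same walk, except the traversal stops when it reaches the softmean process instead of continuing into its inputs.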
First Provenance Challenge Results
[Figure 2. Summary of Contributions, from the First Provenance Challenge paper, Concurrency Computat.: Pract. Exper., John Wiley & Sons]
Second Provenance Challenge (2007)
• First challenge helped identify common characteristics as well as differences
• Want to better understand how the systems relate
• Focus on interoperability
• Teams share provenance of different components of the workflow, run queries over compositions of provenance data (cross-model provenance queries)
• Same workflow and queries as in the first challenge
• Goals:
- Understand where data in one model is translatable to or has no parallel in another model
- Understand how the provenance of data can be traced across multiple systems, so adding value to all those systems
Second Provenance Challenge Results
• Interoperability was fairly straightforward: teams could understand each other’s provenance without much interaction
• Issues:
- Naming: identifying inputs/outputs across different systems’ provenance can be challenging without any agreed-upon identification scheme
• Artificial problem?
• Could compare data by checksums/hashes
• Defer to actual names used in executions
- Extraneous Information: one system may capture provenance information that isn’t used by another system’s query system
• Different levels of abstraction for provenance
• General agreement on an annotated causality graph representation
Open Provenance Model (2007)
• http://twiki.ipaw.info/bin/view/Challenge/OPM
• Requirements:
- To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model
- To allow developers to build and share tools that operate on such a provenance model
- To define the model in a precise, technology-agnostic manner
- To support a digital representation of provenance for any “thing”, whether produced by computer systems or not
- To define a core set of rules that identify the valid inferences that can be made on provenance graphs
[L. Moreau et al.]
Open Provenance Model
• Non-requirements:
- It is not the purpose of this document to specify the internal representations that systems have to adopt to store and manipulate provenance internally; systems remain free to adopt internal representations that are fit for their purpose
- It is not the purpose of this document to define a computer-parseable syntax for this model; model implementations in XML, RDF, or others will be specified in separate documents, in the future
- We do not specify protocols to store such provenance information in provenance repositories
- We do not specify protocols to query provenance repositories
[L. Moreau et al.]
OPM Components
• Artifact: Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system
• Process: Action or series of actions performed on or caused by artifacts, and resulting in new artifacts
• Agent: Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, or affecting its execution
[L. Moreau et al.]
OPM Edges
• used(R): Process → Artifact
• wasGeneratedBy(R): Artifact → Process
• wasControlledBy(R): Process → Agent
• wasTriggeredBy: Process (P2) → Process (P1)
• wasDerivedFrom: Artifact (A2) → Artifact (A1)
[L. Moreau et al.]
OPM Cake Example
• Process Bake Cake used(flour) 100g Flour, used(sugar) 100g Sugar, used(egg) 2 eggs, used(butter) 100g Butter
• Bake Cake wasControlledBy(cook) John
• Cake wasGeneratedBy(cake) Bake Cake
[L. Moreau et al.]
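The cake example can be written down as a small typed graph. A minimal sketch, assuming a plain in-memory representation (the `Edge` record and field names are illustrative, not an official OPM serialization):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    kind: str    # used / wasGeneratedBy / wasControlledBy
    source: str  # effect end of the edge
    target: str  # cause end of the edge
    role: str    # OPM role annotation, e.g. "flour"

artifacts = {"100g Flour", "100g Sugar", "2 eggs", "100g Butter", "Cake"}
processes = {"Bake Cake"}
agents = {"John"}

edges = [
    Edge("used", "Bake Cake", "100g Flour", "flour"),
    Edge("used", "Bake Cake", "100g Sugar", "sugar"),
    Edge("used", "Bake Cake", "2 eggs", "egg"),
    Edge("used", "Bake Cake", "100g Butter", "butter"),
    Edge("wasGeneratedBy", "Cake", "Bake Cake", "cake"),
    Edge("wasControlledBy", "Bake Cake", "John", "cook"),
]

# e.g. list what the baking process consumed
inputs = [e.target for e in edges if e.kind == "used" and e.source == "Bake Cake"]
print(inputs)
```

Note that every edge points from effect to cause, so the graph reads backward in time, as in the OPM diagrams.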
OPM Inferences
[Figure: artifacts A1, A2 and processes P1, P2 linked by used(R1), used(R3), wasGeneratedBy(R2), and wasTriggeredBy edges, grouped into accounts Acc1, Acc2, Acc3; wasDerivedFrom is asserted, while mayHaveBeenDerivedFrom is inferred]
[L. Moreau et al.]
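The inference in the figure can be sketched as a rule: if an artifact was generated by a process and that process used another artifact, the first artifact may have been derived from the second. A minimal sketch (the tuple encoding and names are illustrative, not OPM's full formal semantics):

```python
def may_have_been_derived_from(used, was_generated_by):
    """used: set of (process, artifact) pairs;
    was_generated_by: set of (artifact, process) pairs.
    Returns inferred (artifact, artifact) mayHaveBeenDerivedFrom pairs."""
    inferred = set()
    for a2, p in was_generated_by:
        for p2, a1 in used:
            if p2 == p:  # same process links generation to use
                inferred.add((a2, a1))
    return inferred

used = {("P2", "A1"), ("P1", "A1")}
was_generated_by = {("A2", "P2")}
print(may_have_been_derived_from(used, was_generated_by))  # {('A2', 'A1')}
```

The "may have" hedge is the point: without an asserted wasDerivedFrom edge, the graph only licenses the weaker, inferred relationship.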
Other OPM Notes
• Model is timeless but may have time annotations
• Has overlapping and hierarchical descriptions (Accounts)
• Roles are mandatory in certain edges
• Has support for collections
Third Provenance Challenge (2009)
• Main Goal: test the efficacy of OPM in representing causality of real-world applications and, importantly, its usefulness for interoperating across systems
• Other goals:
- Build momentum around the proposed standard, integrate support for OPM within existing provenance systems
- Spawn novel OPM-based tools for the community
[Y. Simmhan et al.]
There were two sets of queries: core queries that were required to be attempted by all teams, and optional queries that could be suggested by the participants to highlight specific features or pitfalls.
Participants were asked to share their workflows, exported OPM graphs, and results of running the queries on provenance that they collect and OPM documents they import on the PC3 wiki site [12] and present their results at the challenge workshop.
3.2 The Pan-STARRS Load Workflow
The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) [13] is a next-generation sky survey project that continuously scans the visible sky once a week and builds a time series of data that catalogs the solar system and can detect Earth-impacting objects. Pan-STARRS uses several workflows for its data loading pipeline [14] that convert raw image data from the digital telescope into queryable observations and metadata that are stored in relational databases. One of these data loading workflows was selected and adapted for the provenance challenge. Figure 1 shows the data and control flows in the Pan-STARRS load workflow used in the challenge.
The load workflow imports observations of space objects arriving in the form of Comma Separated Value (CSV) files into tables in a relational database. There are two inputs to the workflow: (1) the directory location of a set of related CSV files from a single telescope image that need to be loaded into a new database, and (2) a JobID that indicates the name of the database to be created to hold the data. The input directory contains a “CSV Ready” manifest file, listing the CSV files to be loaded, and three CSV files – detection, frame meta, and image meta – that each correspond to a table in the database. The workflow then performs several tasks that verify the existence of the input CSV files, create a database to load the files, iterate through each input CSV file to validate and load it into a distinct table in the database, update computed columns in the table, perform post-load validations, and finally compact the database once all files are successfully loaded. Tasks that fail to validate input data cause the workflow to halt. The load workflow tasks are listed in Table 1.
In Figure 1, solid lines denote the data passed from one task to another. These may be string paths and URIs that reference files, tables, and databases, or simple parameters used by the task. The C# and Java implementations use class objects and collections to represent some of these parameters, though all of them can be formatted as strings, basic types and their
Figure 1. Pan-STARRS load workflow used in the provenance challenge
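The control flow described above can be sketched in miniature. Everything below — the in-memory "database", the validation rules, and the computed column — is a hypothetical simplification for illustration, not the PC3 teams' actual implementation:

```python
def load_workflow(manifest, job_id):
    """Hypothetical sketch of the load pipeline.
    manifest: {csv_name: list of row dicts}; returns the loaded 'database'
    or None when a validation failure halts the workflow."""
    db = {"name": job_id, "tables": {}}
    for name, rows in manifest.items():
        if not rows:                        # validation failure halts the workflow
            return None
        db["tables"][name] = list(rows)     # one distinct table per CSV file
        for row in db["tables"][name]:      # update computed columns (toy rule)
            row["computed"] = len(row)
    if len(db["tables"]) != len(manifest):  # post-load validation
        return None
    return db                               # (database compaction omitted here)

manifest = {
    "detection": [{"id": 1}],
    "frame_meta": [{"id": 2}],
    "image_meta": [{"id": 3}],
}
db = load_workflow(manifest, "Job42")
print(sorted(db["tables"]))  # ['detection', 'frame_meta', 'image_meta']
```

The halt-on-validation-failure behavior is the part that mattered for the challenge, since (as discussed below) different systems recorded it in provenance quite differently.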
Third Provenance Challenge Workflow & Queries
• Pan-STARRS sky survey project
• Loads CSV files and converts raw image data into query-able observations that are stored in a relational database
• Queries:
- For a given detection, which CSV files contributed to it?
- The user considers a table to contain unexpected values. Was the range check performed for this table?
- Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
Table 2. Details of Teams Participating in the Challenge.

Team | Workflow/Provenance System | Native Provenance Format | OPM Binding Format | Native Provenance Storage | PC3 Query Format | Teams Interoperated With (Query Success Status)
NCSA | Java/Tupelo | RDF | XML, RDF | | Java API | NCSA (Success)
Univ of Chicago | Swift | Relational | XML | DBMS | | UChicago (Partial)
Microsoft Research | Trident | Relational | XML | DBMS/MS SQL Server | SQL | MSR (Success), Soton (Fail), UoM (Fail), UCD (Partial), IU (Success), UUtah (Fail)
UC-Davis | COMAD-Kepler | Relational | XML | DBMS/MySQL | SQL | UCD (Success), Soton, RPI, KCL, Harvard
Univ of Soton, USC-ISI | Java | XML | XML, RDF | File | Java API | Soton (Success)
Univ of Manchester | Taverna | RDF | XML, RDF | | SPARQL | UoM, UCD, Soton, NCSA
RPI/Tetherless | Java/ProtoProv | RDF | XML | Jena | SPARQL | RPI (Success)
UvA/VL-e | WS-VLAM/PLIER | Relational | XML | DBMS | SQL | UvA (Success)
SDSC | Kepler | XML | | | XQuery | SDSC (Success)
Univ of Utah | VisTrails, Python | XML | XML | | XQuery | UUtah (Partial), SDSC (Partial)
Kings College, London | Java | XML | | | | KCL (Success)
Harvard Univ | Bash, Java/PASS | Path Query Language Graphs | XML | PQL DB Store | PQL | Harvard (Success)
Indiana Univ | ODE/Karma | Relational | XML | DBMS/MySQL | SQL | IU (Success)
UTEP | WDO-It! | Proof Markup Language | | File | |
There were several interesting approaches taken by the teams to address challenges posed by the interoperability problem. One was in dealing with control flows in the workflow, which were not seen in the previous challenges. Some teams modeled the control-flow activities as processes that trigger subsequent processes, while others modeled them as explicit data flows that pass the boolean value. Teams that used pure dataflow-based systems did not support tracking of control flows at all. Many teams also unrolled the iterations in the workflow, optionally keeping track of the iteration step count. Validation errors (e.g. IsMatchtableCount) were also handled differently: certain teams were able to halt workflow execution as expected, while others let it proceed but used an error flag to prevent the application logic of subsequent workflow activities from executing. All of these techniques have a corresponding impact on the provenance collected and its interoperability.
Teams also encountered difficulties due to different approaches taken for recording process and artifact identifiers in OPM since the model does not prescribe any naming scheme. Also, OPM makes a distinction between the artifact instance at the moment it is used or generated, and the logical or physical identifier for the actual artifact, which caused some challenge queries to require additional examination. Naming was particularly a problem for database artifacts like tables and tuples since several teams did not support tracking of database artifacts. Different teams were able to track database and file entities at various levels of granularities.
Challenge queries that combined provenance information with additional metadata or data values also posed difficulties for teams. Additional ad hoc annotations were used by certain teams to capture part of the metadata within the OPM model itself, with particular teams even encoding the entire artifact value as part of the provenance annotation.
Given that this was the first time the OPM specification was put to use, the initial interoperability challenge for teams was in mapping from their native provenance model to OPM. Most teams were successful in achieving this to
Third Provenance Challenge Results
PROV
• At the Third Provenance Challenge Workshop, connections were made with the W3C, and a PROV incubator group was created
• Some similarities to OPM, and some of the same people were involved, but the effort also tried to involve a wider group
PROV Model Primer
Y. Gil, S. Miles (eds.). K. Belhajjame, H. Deus, D. Garijo, G. Klyne, P. Missier, S. Soiland-Reyes, S. Zednik (contributors)
Presented by: Srinidhi Pesara
PROV Tutorial
L. Moreau, P. Groth, T. D. Huynh
PROV Discussion
• What are the benefits of PROV?
• What are potential problems with PROV?
• PROV to promote “inter-operable interchange”, not necessarily the primary storage of provenance for each system?
• Extensions of the core model for specific systems and/or uses
- ProvONE: http://vcvcomputing.com/provone/provone.html
- Wf4ever provenance: http://wf4ever.github.io/ro/
• Other examples:
- Extracting PROV from Wikipedia: [P. Missier & Z. Chen, 2013]
- Git2Prov: [T. de Nies et al., 2013]
Reminders
• Project Emails:
- Almost everyone has now received my feedback on their projects
- If you do not email me back, I will assume that you concur with my understanding of your project
- That means what I wrote is what I expect to see in the final project!
• Project Progress Reports Due November 13
- Can serve as the start to your final project report
• Next Class: Visualization & Provenance