Post on 12-Aug-2015
First Keystone Summer School – Malta, July 2015 – P. Missier
Provenance and the W3C PROV model (in the Big Data context)
Paolo Missier
School of Computing Science
Newcastle University, UK
Tutorial
First Keystone Summer School,
Malta, July 2015
Some of the slides courtesy of Luc Moreau – thanks!
Topical research dissemination events
Lecture goals and outline
• What is provenance, and why does it matter?
  • Definitions and case studies
• The W3C PROV standard in a nutshell
  • PROV-O: the Provenance Ontology and examples of its usage
• Provenance and Big Data: what’s the connection?
  • Opportunities and challenges
• Provenance tools [from Southampton]
One recent book
http://www.morganclaypool.com/doi/abs/10.2200/S00528ED1V01Y201308WBE007
1- Reproducibility and dissemination in Science
Independent validation of scientific claims is a cornerstone of experimental science
• Scientific claims are supported by experiments
• How do I express my “materials and methods” so that you can independently verify my results?
• How do I document my results to promote their understanding and reuse?
Provenance is the equivalent of a logbook
• Capture all steps involved in the derivation of a result
• Replay, validate the execution, compare it with others
To what extent can these be formalised and automated in data-intensive science?
2- Explaining the outcome of a complex decision process
• Which process was used to derive a diagnosis?
• How did the process use the input data?
• How were the steps configured?
• Which decisions were made by human experts (clinicians)?
Clinical diagnosis of genetic diseases
3- Understanding the results of a computation
• Why has my [very complicated algorithm] produced this particular result?
• Why is my predictive analytics model suggesting that it will rain tomorrow?
• Why is this record part of the result of my database query?
  • Database provenance
• Why is this record included in the result of my keyword search?
4- Content reuse on the Social Web
Open Data, Data Journalism
• A consume-select-curate-share workflow, not only professional
• Ethos: to expose the data and methods used to produce news items
• But: data wrangling can introduce errors
  • Is the data I am using valid? What is its primary source? What are the transformation steps?
NowNews publishes an article based on the latest employment data published by GovStat
PolicyOrg compiles a report including the NowNews article
:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.
• a record of the passage of an item through its various owners
Magna Carta (‘the Great Charter’) was agreed between King John and his barons on 15 June 1215.
What is provenance?
Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
Provenance is a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world
Provenance on the Web
Tim Berners-Lee’s “Oh Yeah” button:
• A browser button by which users can express their uncertainty about a document being displayed: “so how do I know I can trust this information?”
• Upon activation of the button, the software retrieves metadata about the document, listing assumptions on which trust can be based.
http://users.ugent.be/~tdenies/OhYeah/
Easy Access to Provenance: an Essential Step Towards Trust on the Web, Procs. METHOD 2013: The 2nd IEEE International Workshop on Methods for Establishing Trust with Open Data, held in conjunction with COMPSAC, the IEEE Signature Conference on Computers, Software & Applications, July 22-26, 2013, Kyoto, Japan
http://dx.doi.org/10.1109/COMPSACW.2013.29
Provenance in the Semantic Web Stack
Use cases on the Social Web
Open Data, Data Journalism
NowNews publishes an article based on the latest employment data published by GovStat
PolicyOrg compiles a report including the NowNews article
Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master
Derivation - Timeliness
Derivation:
• Charts, graphs and visualizations are all based on multiple data sets
• E.g. Bob’s article on employment that appeared in NowNews
• Which data was a figure based upon?
Is the report based on the most up-to-date data?
Derivation - Trusted sources
Derivation:
• Is this content derived from data coming from a reliable source?
• The chart within Bob’s article is based on GovStat data
• However, that information is hidden:
  • the chart was produced by a complex process performed by Alice
Policy rule:
“data supplied by the government is reliable”
Tracing the source of errors
Derivation, attribution:
• When did this error occur?
• Who was responsible for the chart?
Nick discovers an error in the chart included in Bob’s article
Ensuring policy compliance
Process inspection:
• Which process steps led to publication?
• Was an editorial check part of it?
Policy rule:
“posts are to be checked by an editor prior to publication”
Ensuring credit and acknowledgement
NowNews relies on multiple contributors
Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master
Attribution and responsibility:
• How do we ensure that all relevant contributors are acknowledged?
Reproducibility
Documenting the data generation process:
• How do we ensure that the figures can be reproduced using the new versions of the data?
NowNews must ensure that the article figures reflect the most recent data
So, why does provenance matter?
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To enable process analysis for debugging, improvement, evolution
• To enable reproducibility of processes (e.g. in science, data journalism…)
See also:
ACM Journal of Data and Information Quality (JDIQ), Special Issue on Provenance, Data and Information Quality, Paolo Missier, Paolo Papotti, Eds. Volume 5, Issue 3, February 2015. DOI: 10.1145/2692312
http://dl.acm.org/citation.cfm?id=2700413
http://jdiq.acm.org/archive.cfm?id=2698232
The W3C Working Group on Provenance
2009-2010: W3C Incubator Group on provenance
Chair: Yolanda Gil, ISI, USC
Main output: “Provenance XG Final Report”, http://www.w3.org/2005/Incubator/prov/XGR-prov/
- provides an overview of the various existing approaches, vocabularies
- proposes the creation of a dedicated W3C Working Group
April 2011: W3C Working Group approved
Chairs: Luc Moreau, Paul Groth
April 2013: Proposed Recommendations finalised
- prov-dm: Data Model
- prov-o: OWL ontology, RDF encoding
- prov-n: prov notation
- prov-constraints
...plus a number of non-prescriptive Notes
http://www.w3.org/2011/prov/wiki/
PROV: scope and structure
source: http://www.w3.org/TR/prov-overview/
Recommendation track
See also:
Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures on the Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129. doi:10.2200/S00528ED1V01Y201308WBE007.
PROV Core Elements (graph depiction)
An entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.
An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
Generation, Usage
Generation is the completion of production of a new entity by an activity. This entity did not exist before generation and becomes available for usage after this generation.
Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had not begun to utilize this entity.
PROV is based on a notion of instantaneous events that mark transitions in the world
- generation, usage (and others)
Ordering constraints amongst events:
“generation of e must precede each of its usages”
“a can only use / generate e after it has started and before it has ended”
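The two ordering constraints quoted above can be sketched as a small check. This is a minimal illustration, not the PROV-CONSTRAINTS algorithm: the function name, the integer timestamps and the dictionary layout are all assumptions made for the example.

```python
def ordering_violations(activities, generations, usages):
    """Check the two ordering constraints on a toy set of events.

    activities:  {activity: (start_time, end_time)}
    generations: {entity: (activity, time)}
    usages:      list of (activity, entity, time)
    Returns a list of human-readable violations (empty if none).
    """
    problems = []
    # An activity can only generate an entity between its start and its end.
    for entity, (act, t) in generations.items():
        start, end = activities[act]
        if not start <= t <= end:
            problems.append(f"{act} generated {entity} outside its lifetime")
    for act, entity, t in usages:
        # An activity can only use an entity between its start and its end.
        start, end = activities[act]
        if not start <= t <= end:
            problems.append(f"{act} used {entity} outside its lifetime")
        # Generation of an entity must precede each of its usages.
        if entity in generations and t < generations[entity][1]:
            problems.append(f"{entity} used before it was generated")
    return problems


# Hypothetical timeline: drafting runs from 1 to 5, commenting from 6 to 9.
activities = {"drafting": (1, 5), "commenting": (6, 9)}
generations = {"draftV1": ("drafting", 4)}
ok = ordering_violations(activities, generations, [("commenting", "draftV1", 7)])
bad = ordering_violations(activities, generations, [("commenting", "draftV1", 3)])
```

In the second call the usage at time 3 breaks both constraints, so two violations are reported.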
Concepts and relations
Generation of “draft v1” expressed as relation:
wasGeneratedBy(“draft v1”, ...)
Usage of “draft v1” by “commenting” expressed as relation:
used(“commenting”, “draft v1”, ...)
PROV notation
document
  prefix prov <http://www.w3.org/ns/prov#>
  prefix ex <http://www.example.com/>

  entity(ex:draftComments)
  entity(ex:draftV1, [ ex:distr='internal', ex:status = "draft"])
  entity(ex:paper1)
  entity(ex:paper2)

  activity(ex:commenting)
  activity(ex:drafting)
  wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00)
  used(ex:commenting, ex:draftV1, -)
  wasGeneratedBy(ex:draftV1, ex:drafting, -)
  used(ex:drafting, ex:paper1, -)
  used(ex:drafting, ex:paper2, -)
endDocument
Same example: PROV-O notation (RDF/N3)
:draftComments a prov:Entity ; :distr "internal"^^xsd:string ; prov:wasGeneratedBy :commenting .
:commenting a prov:Activity ; prov:used :draftV1 .
:draftV1 a prov:Entity ; :distr "internal"^^xsd:string ; :status "draft"^^xsd:string ; :version "0.1"^^xsd:string ; prov:wasGeneratedBy :drafting .
:drafting a prov:Activity ; prov:used :paper1, :paper2 .
:paper1 a prov:Entity, "reference"^^xsd:string .
:paper2 a prov:Entity, "reference"^^xsd:string .
Association, Attribution, Delegation: who did what?
An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity.
Attribution is the ascribing of an entity to an agent.
entity(ex:draftComments, [ ex:distr='internal' ])
activity(ex:commenting)
agent(ex:Bob, [prov:type = "mainEditor"])
agent(ex:Alice, [prov:type = "srEditor"])
wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role = "editor"])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)
Same example: PROV-O notation (RDF/N3)
:Alice a prov:Agent, "ex:chiefEditor"; :firstName "Alice"; :lastName "Cooper".
:Bob a prov:Agent, "ex:seniorEditor"; :firstName "Robert"; :lastName "Thompson"; prov:actedOnBehalfOf :Alice .
:draftComments prov:wasAttributedTo :Bob .
:drafting a prov:Activity ; prov:wasAssociatedWith :Bob .
Association and Attribution
Q.: what is the relationship between attribution and association?
This is defined as an inference rule in the PROV-CONSTR document
entity(e)
agent(Ag)
activity(a)
wasAttributedTo(e, Ag)
wasGeneratedBy(e, a, -)
wasAssociatedWith(a, Ag, -)
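One way to read the inference, sketched below: from each attribution we may posit some activity that generated the entity and was associated with the agent. The direction (attribution implies generation plus association) is how the rule is recalled from PROV-CONSTRAINTS and should be checked against that document; the function name and the `_:a0` style identifiers for the unnamed activities are assumptions made for the example.

```python
def infer_from_attribution(attributions):
    """For each wasAttributedTo(e, ag), posit a fresh activity a such that
    wasGeneratedBy(e, a) and wasAssociatedWith(a, ag) both hold."""
    generated, associated = [], []
    for i, (entity, agent) in enumerate(attributions):
        activity = f"_:a{i}"  # placeholder for an activity we cannot name
        generated.append((entity, activity))
        associated.append((activity, agent))
    return generated, associated


gen, assoc = infer_from_attribution([("ex:draftComments", "ex:Bob")])
```

The inferred activity is existential: the rule says that one exists, not which one it is, so it is not identified with any named activity such as ex:commenting.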
Communication amongst activities
Communication is the exchange of some unspecified entity by two activities, one activity using some entity generated by the other.
activity(ex:commenting)
activity(ex:drafting)
wasInformedBy(ex:commenting, ex:drafting)
:drafting a prov:Activity .
:commenting a prov:Activity ; prov:wasInformedBy :drafting .
Communication, generation, usage
activity(ex:commenting)
activity(ex:drafting)
entity(e)
wasInformedBy(ex:commenting, ex:drafting)
wasGeneratedBy(e, ex:drafting, -)
used(ex:commenting, e, -)
Q.: what is the relationship between communication, generation, and usage?
These are inference rules 5 and 6 in the PROV-CONSTR document
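A minimal sketch of one direction of this relationship: whenever an activity used an entity that another activity generated, a wasInformedBy statement can be derived. The set-of-tuples representation and the function name are assumptions made for the example; the converse direction (communication implies that some exchanged entity exists) is not shown.

```python
def infer_communication(was_generated_by, used):
    """Derive wasInformedBy(consumer, producer) pairs.

    was_generated_by: set of (entity, activity) pairs
    used:             set of (activity, entity) pairs
    """
    informed = set()
    for entity, producer in was_generated_by:
        for consumer, used_entity in used:
            if used_entity == entity:
                informed.add((consumer, producer))
    return informed


informed = infer_communication(
    {("e", "ex:drafting")},
    {("ex:commenting", "e")},
)
```

With the statements from the slide, the only derived pair is (ex:commenting, ex:drafting).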
Summary of the PROV Core model
Derivation amongst entities
A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity.
entity(ex:draftV1)
entity(ex:draftComments)
wasDerivedFrom(ex:draftComments, ex:draftV1)
Q.: what is the relationship between derivation, generation, and usage?
:draftComments a prov:Entity ; prov:wasDerivedFrom :draftV1 .
:draftV1 a prov:Entity .
Provenance and Big Data: what’s the connection?
opportunities and challenges
Provenance {as,of} Big Data
1. BigProv: Provenance as big data
• High volume provenance
• What kind of analytics are interesting on big provenance?
2. Provenance of analytics processes
• “Prediction provenance”
  • Train a model → provenance of the model as a record of the training process and data involved
  • Use the model to make predictions → provenance of the prediction
3. Provenance of a search
• What is the provenance of a keyword search?
• Why would it be interesting? What can we learn from it?
Recent research on Provenance as Big Data
Chen, Peng; Plale, Beth A., "Big Data Provenance Analysis and Visualization," 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 797-800, 4-7 May 2015. doi: 10.1109/CCGrid.2015.85
Chen, Peng; Plale, Beth A., "ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model," 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 525-534, 4-7 May 2015. doi: 10.1109/CCGrid.2015.86
Peter Macko and Margo Seltzer, Harvard University. Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs. Procs. TAPP’11, 2011, Crete, Greece
Provenance from Log Files: a BigData Problem, Devarshi Ghoshal and Beth Plale, Procs. BigProv workshop, EDBT, Genova, Italy, 2013
Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop, Edinburgh, 2015 http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf
ProvGen
• A Provenance Generator tool for experimenting with provenance at scale
• Why generate synthetic provenance?
  • Synthetic PROV graphs can be a valuable complement to emerging natural provenance collections
  • … provided their structural properties reflect specific provenance patterns
    • control over their repetition and variability
    • varying scales
• Useful for benchmarking emerging provenance management systems
• Useful to test analytics algorithms that operate on large provenance collections
Firth, Hugo, and Paolo Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014. http://arxiv.org/pdf/1406.2495
What does ProvGen do?
• Accept a seed PROV graph
• Grow the graph
  • Add nodes and relationships following the seed graph structure
  • … with constraints on how to grow
document
  entity(e1, [type="Document", version="original"])
  entity(e2, [type="Document"])
  entity(e3, [type="Document"])
  activity(a1, [type="create"])
  activity(a2, [type="edit"])
  activity(a3, [type="edit"])
  agent(ag, [type="Person"])
  used(a2, e1)
  used(a3, e2)
  wasGeneratedBy(e2, a2, [fct="save"])
  wasGeneratedBy(e1, a1, [fct="publish"])
  wasGeneratedBy(e3, a3, [fct="save"])
  wasAssociatedWith(a3, ag, [role="contributor"])
  wasAssociatedWith(a2, ag, [role="contributor"])
  wasAssociatedWith(a1, ag, [role="creator"])
  wasDerivedFrom(e2, e1)
  wasDerivedFrom(e3, e2)
endDocument
ProvGen constraints
an Entity must have relationship "WasDerivedFrom" exactly 2 times unless it has property("version"="original");
the Entity(e1) must not have relationship "WasDerivedFrom" with the Entity(e2) unless e1 has relationship "Used" with the Activity(a) and e2 has the relationship "WasGeneratedBy" with the Activity(a);
an Entity must have relationship "WasGeneratedBy" exactly 1 times;
an Entity must have property("version"="original") with probability 0.05;
an Entity must have out degree at most 2;
an Activity must have relationship "Used" at most 1 times;
an Activity must have property("type"="create") with probability 0.01;
an Activity must have relationship "WasAssociatedWith" exactly 1 times;
an Activity must have relationship "Used" exactly 1 times unless it has property("type"="create");
an Activity must have relationship "WasGeneratedBy" exactly 1 times;
an Agent must have relationship "WasAssociatedWith" with probability 0.1;
an Agent must have relationship "WasAssociatedWith" between 1, 120 times with distribution gamma(1.3, 2.4);
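To make the "exactly N times" rules concrete, here is a small sketch that checks three of them against the seed graph from the previous slide. It is not ProvGen's implementation: the helper name is hypothetical, and counting only outgoing edges is an assumption about how such a rule is read.

```python
from collections import Counter

# Edges of the seed graph, written as (relation, source, target).
SEED_EDGES = [
    ("used", "a2", "e1"), ("used", "a3", "e2"),
    ("wasGeneratedBy", "e2", "a2"), ("wasGeneratedBy", "e1", "a1"),
    ("wasGeneratedBy", "e3", "a3"),
    ("wasAssociatedWith", "a3", "ag"), ("wasAssociatedWith", "a2", "ag"),
    ("wasAssociatedWith", "a1", "ag"),
    ("wasDerivedFrom", "e2", "e1"), ("wasDerivedFrom", "e3", "e2"),
]


def violates_exactly(edges, nodes, relation, expected):
    """Return the nodes whose count of outgoing `relation` edges is not `expected`."""
    counts = Counter(src for rel, src, _ in edges if rel == relation)
    return [n for n in nodes if counts[n] != expected]


entities = ["e1", "e2", "e3"]
activities = ["a1", "a2", "a3"]
bad_gen = violates_exactly(SEED_EDGES, entities, "wasGeneratedBy", 1)
bad_assoc = violates_exactly(SEED_EDGES, activities, "wasAssociatedWith", 1)
bad_used = violates_exactly(SEED_EDGES, activities, "used", 1)
```

Only a1 fails the "Used exactly 1 times" check, and a1 has type "create", which is exactly the case the "unless" clause of that rule exempts.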
Some test queries
Generated graph loaded into the Neo4j graph DBMS
Queries expressed using the Cypher graph query language
Transitive closure over Derivation: return all the derivation chains, along with their length
MATCH (a)-[r:`WASDERIVEDFROM`*]->(b) RETURN a, b, length(r)
Return the 50 longest derivation chains (of length greater than 10)
MATCH (a)-[r:`WASDERIVEDFROM`*]->(b) WHERE length(r) > 10 RETURN a, b, length(r) ORDER BY length(r) DESC LIMIT 50
All agents and their associated activities
MATCH (a)-[:`WASASSOCIATEDWITH`]->(b) RETURN b AS Agent, a AS Activity
All agents who created new documents
MATCH (a {type:'create'})-[:`WASASSOCIATEDWITH`]->(b) RETURN a, b LIMIT 25
All agents who edited a document that was derived from an original
MATCH (doc1 {version:'original'})<-[:`WASDERIVEDFROM`]-(doc2)-[:`WASGENERATEDBY`]->(act)-[:`WASASSOCIATEDWITH`]->(agent) RETURN agent LIMIT 25
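For readers without a Neo4j instance, the first two queries can be approximated in plain Python. This is a sketch under assumptions: the function name and the adjacency-dictionary input are made up for the example, and the derivation graph is assumed to be acyclic.

```python
def derivation_chains(derived_from):
    """Enumerate every derivation path, like the variable-length Cypher match.

    derived_from: {entity: [entities it was directly derived from]}
    Returns a list of (start, end, length) triples, one per path.
    """
    chains = []

    def walk(start, node, length):
        for parent in derived_from.get(node, []):
            chains.append((start, parent, length + 1))
            walk(start, parent, length + 1)

    for entity in derived_from:
        walk(entity, entity, 0)
    return chains


# Derivations from the ProvGen seed graph: e3 <- e2 <- e1.
chains = derivation_chains({"e3": ["e2"], "e2": ["e1"]})
# Longest chains first, capped at 50 as in the second query.
longest = sorted(chains, key=lambda c: c[2], reverse=True)[:50]
```

On the seed graph this yields three chains, the longest being e3 back to e1 with length 2.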
Provenance of Big Data
Provenance of analytics processes:
“Prediction provenance”
• Train a model → provenance of the model as a record of the training process and data involved
• Use the model to make predictions → provenance of the prediction
Relations may be given identifiers
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -)
used(use1; ex:commenting, ex:draftV1, -)
gen1 denotes a generation event
use1 denotes a usage event
General derivation relation:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Relation IDs make it possible to refer to relations in other relations
Rendering N-ary relations in PROV-O
RDF is for binary relations: N-ary relations require reification
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, 2013-03-18T10:00:01)
used(use1; ex:commenting, ex:draftV1, -)
:draftComments a prov:Entity ; prov:qualifiedGeneration :gen1 .
:gen1 a prov:Generation ; prov:activity :commenting; prov:atTime "2013-03-18T10:00:01+09:00".
:commenting a prov:Activity ; prov:qualifiedUsage :use1 .
:use1 a prov:Usage ; :note "found comments useful"; prov:atTime "2013-03-21T10:00:01+09:00"; prov:entity :draftV1.
“Qualified relation” RDF pattern
:draftComments a prov:Entity ; prov:qualifiedGeneration :gen1 .
:gen1 a prov:Generation ; prov:activity :commenting; prov:atTime "2013-03-18T10:00:01+09:00".
:commenting a prov:Activity ; prov:qualifiedUsage :use1 .
:use1 a prov:Usage ; :note "found comments useful"; prov:atTime "2013-03-21T10:00:01+09:00"; prov:entity :draftV1.
Plans: why was something done?
Most relation types have two arguments, each of which is an Entity, an Activity, or an Agent
Derivation is one exception:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Two other notable exceptions:
- Associations with a plan
- Delegation with an activity scope
wasAssociatedWith(id; a, ag, pl, attrs)
A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goal
Association with a plan
A plan plays a role in an association
Plans are typed entities
activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
agent(ex:_aJVM, [prov:type = 'JVM-6.0'])
entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label='Program 1'])
wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role='defaultRuntime', ex:accessPath="webapp" ])
A plan is an entity having prov:type = 'prov:Plan'
Plan pattern as PROV-O
:_aProgramExecution a prov:Activity ; :execTime "22.5sec"; prov:qualifiedAssociation [ a prov:Association ; :accessPath "webapp"; prov:agent :_aJVM ; prov:hadPlan :myCleverProgram ; prov:hadRole "defaultRuntime" ] .
:_aJVM a prov:Agent, "Java-6.0".
:myCleverProgram a prov:Entity, prov:Plan.
activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
agent(ex:_aJVM, [prov:type = 'JVM-6.0'])
entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label='Program 1'])
wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role='defaultRuntime', ex:accessPath='webapp' ])
Delegation within an activity scope
Real-world artifacts vs provenance entities
ref: http://www.w3.org/2001/sw/wiki/PROV-FAQ#Examples_of_Provenance
“What do I know about the car I see in this Cambridge street today?”
• It was bought by Joe in 2011
• Joe drove it to Boston on March 16th, 2013. The car has now got 10,000 miles on it
• Joe drove it to Cambridge on March 18th, 2013.
“Same” car, but different provenance at each stage of its evolution
Alternate-specialization pattern
Two alternate entities present aspects of the same thing. These aspects may be the same or different, and the alternate entities may or may not overlap in time.
An entity that is a specialization of another shares all aspects of the latter, and additionally presents more specific aspects of the same thing as the latter.
...But, this is still that car!
Semantic notes:
1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2).
2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1).
3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).
differing in their location
same owner, added location
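The three semantic notes can be applied mechanically. The sketch below does so on a hypothetical version of the car example; the entity names and the function name are assumptions, and only the three rules listed on the slide are implemented.

```python
def close(specializations, alternates=()):
    """Apply the three semantic notes to sets of (e1, e2) pairs."""
    spec = set(specializations)
    # Note 3: specialization is transitive; iterate until nothing new is added.
    changed = True
    while changed:
        changed = False
        for a, b in list(spec):
            for c, d in list(spec):
                if b == c and (a, d) not in spec:
                    spec.add((a, d))
                    changed = True
    alt = set(alternates) | spec        # Note 1: specialization implies alternate
    alt |= {(b, a) for a, b in alt}     # Note 2: alternate is symmetric
    return spec, alt


spec, alt = close([
    ("car_in_boston", "joes_car"),  # same owner, added location
    ("joes_car", "car"),            # the car, now with an owner
])
```

Here the car seen in Boston is inferred to be a specialization of the car itself, and all three entities become alternates of one another in both directions.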
Reserved attributes and types
A small set of reserved attributes, with some usage restrictions
Bundles, provenance of provenance
A bundle is a named set of provenance descriptions, and is itself an entity, thus allowing provenance of provenance to be expressed.
bundle pm:bundle1
  entity(ex:draftComments)
  entity(ex:draftV1)
  activity(ex:commenting)
  wasGeneratedBy(ex:draftComments, ex:commenting, -)
  used(ex:commenting, ex:draftV1, -)
endBundle
...
entity(pm:bundle1, [ prov:type='prov:Bundle' ])
agent(ex:Bob)
wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00)
wasAttributedTo(pm:bundle1, ex:Bob)
Bundles in PROV-O
Bundle definition (an RDF named graph):
ex:bundle1 {
  :draftComments a prov:Entity ; :status "blah"; prov:wasGeneratedBy :commenting .
  :commenting a prov:Activity ; prov:used :draftV1 .
  :draftV1 a prov:Entity .
}
Bundle usage:
ex:bundle1 a prov:Entity, "prov:Bundle"; prov:qualifiedGeneration [ a prov:Generation ; prov:atTime "2013-03-20T10:30:00+09:00" ]; prov:wasAttributedTo :Bob .
Component Structure for PROV
Core vs Extended
Responsibility View
Data Flow View
Process View
Core
Extended
Time, Events
- Provenance statements are combined by different systems
- An application may not be able to align the times involved to a single global timeline
Therefore, PROV minimizes assumptions about time
Instead, the PROV data model is implicitly based on a notion of instantaneous events that mark transitions in the world (*)
Events:
- activity start, activity end
- entity generation, entity usage, entity invalidation
wasStartedBy(id; a2, e, a1, t, attrs)
wasEndedBy(id; a2, e, a1, t, attrs)
(*) PROV-CONSTR http://www.w3.org/TR/prov-constraints/#events (non-normative)
From “scruffy” provenance to “valid” provenance
- Are all possible temporal partial orderings of events equally acceptable?
- How can we specify the set of all valid orderings?
More generally, how do we formally define what it means for a set of provenance statements to be valid?
PROV defines a set of temporal constraints that ensure consistency of a provenance graph
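One simplified way to picture this consistency check: collect the "must precede" pairs implied by the statements and test whether any ordering of the events satisfies them all, i.e. whether the precedence graph has no cycle. The sketch below treats every pair as strict, which is an assumption and a simplification of the actual PROV-CONSTRAINTS definition; the function name and event labels are hypothetical.

```python
from collections import defaultdict, deque


def has_consistent_order(precedes):
    """precedes: iterable of (earlier_event, later_event) pairs.

    Returns True if the events can be laid out on one timeline that respects
    every pair, which holds exactly when the precedence graph is acyclic.
    """
    successors = defaultdict(set)
    indegree = defaultdict(int)
    events = set()
    for before, after in precedes:
        events.update((before, after))
        if after not in successors[before]:
            successors[before].add(after)
            indegree[after] += 1
    # Kahn's algorithm: repeatedly remove events that nothing must precede.
    queue = deque(e for e in events if indegree[e] == 0)
    visited = 0
    while queue:
        event = queue.popleft()
        visited += 1
        for nxt in successors[event]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return visited == len(events)


pairs = [
    ("start(drafting)", "gen(draftV1)"),
    ("gen(draftV1)", "use(draftV1)"),
    ("use(draftV1)", "end(commenting)"),
]
valid = has_consistent_order(pairs)
invalid = has_consistent_order(pairs + [("use(draftV1)", "gen(draftV1)")])
```

Adding a statement that places the usage before the generation creates a cycle, so no valid ordering exists for the second set.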
Summary
• Motivation for collecting provenance of data and information
  • In Science
  • In the Social Web
• The W3C PROV Recommendation (2013)
  • PROV-DM: The PROV data model
  • PROV-O: the Provenance Ontology
  • (PROV-CONSTRAINTS)
• Provenance as Big Data
  • High volume provenance
  • Storage, analytics, visualisation
• Provenance of analytics
  • How can I explain my predictions?
• The ProvGen tool
Selected bibliography
Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell, et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012. http://www.w3.org/TR/prov-dm/
Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012. http://www.w3.org/TR/prov-constraints/
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web Semantics: Science, Services and Agents on the World Wide Web (April 2015). doi:10.1016/j.websem.2015.04.001. http://www.sciencedirect.com/science/article/pii/S1570826815000177
Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz, Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530. http://dx.doi.org/10.1002/cpe.1870.
Firth, Hugo, and Paolo Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014. http://arxiv.org/pdf/1406.2495
Missier, P., J. Bryans, C. Gamble, V. Curcin, and R. Danger. “ProvAbs: Model, Policy, and Tooling for Abstracting PROV Graphs.” In Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014. http://arxiv.org/pdf/1406.1998
De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh, Scotland: USENIX Association, 2015. https://www.usenix.org/conference/tapp15/workshop-program/presentation/de-oliveira.