Keystone summer school 2015 paolo-missier-provenance

First Keystone Summer School – Malta July 2015 – P. Missier. Provenance and the W3C PROV model (in the Big Data context). Paolo Missier, School of Computing Science, Newcastle University, UK. Tutorial, First Keystone Summer School, Malta, July 2015. Some of the slides courtesy of Luc Moreau – thanks!

Provenance and the W3C PROV model (in the Big Data context)

Paolo Missier

School of Computing Science

Newcastle University, UK

Tutorial

First Keystone Summer School,

Malta, July 2015

Some of the slides courtesy of Luc Moreau – thanks!

Topical research dissemination events

Lecture goals and outline

• What is provenance, and why does it matter?
  • Definitions and case studies

• The W3C PROV standard in a nutshell
  • PROV-O: the Provenance Ontology and examples of its usage

• Provenance and Big Data: what’s the connection?
  • Opportunities and challenges

• Provenance tools [from Southampton]

1- Reproducibility and dissemination in Science

Independent validation of scientific claims is a cornerstone of experimental science

• Scientific claims are supported by experiments

• How do I express my “materials and methods” so that you can independently verify my results?

• How do I document my results to promote their understanding and reuse?

Provenance is the equivalent of a logbook
• Capture all steps involved in the derivation of a result
• Replay and validate the execution, compare it with others

To what extent can these be formalised and automated in data-intensive science?

2- Explaining the outcome of a complex decision process

• Which process was used to derive a diagnosis?

• How did the process use the input data?

• How were the steps configured?

• Which decisions were made by human experts (clinicians)?

Clinical diagnosis of genetic diseases

3- Understanding the results of a computation

• Why has my [very complicated algorithm] produced this particular result?

• Why is my predictive analytics model suggesting that it will rain tomorrow?

• Why is this record part of the result of my database query?
  • Database provenance

• Why is this record included in the result of my keyword search?

4- Content reuse on the Social Web

Open Data, Data Journalism

• A consume-select-curate-share workflow, not only professional

• Ethos: to expose the data and methods used to produce news items

• But: Data wrangling can introduce errors
  • Is the data I am using valid? What is its primary source? What are the transformation steps?

NowNews publishes an article based on the latest employment data published by GovStat

PolicyOrg compiles a report including NowNews article

:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide

What is provenance?

Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.
• a record of the passage of an item through its various owners

Magna Carta (‘the Great Charter’) was agreed between King John and his barons on 15 June 1215.

What is provenance?

Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)

Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)

Provenance is a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world

Provenance on the Web

Tim Berners-Lee’s “Oh Yeah” button:

• A browser button by which the user can express their uncertainty about a document being displayed “so how do I know I can trust this information?”.

• Upon activation of the button, the software then retrieves metadata about the document, listing assumptions on which trust can be based.

http://users.ugent.be/~tdenies/OhYeah/

Easy Access to Provenance: an Essential Step Towards Trust on the Web. Procs. METHOD 2013: The 2nd IEEE International Workshop on Methods for Establishing Trust with Open Data, held in conjunction with COMPSAC, the IEEE Signature Conference on Computers, Software & Applications, July 22-26, 2013, Kyoto, Japan. http://dx.doi.org/10.1109/COMPSACW.2013.29

Provenance in the Semantic Web Stack

:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide

Use cases on the Social Web

Open Data, Data Journalism

NowNews publishes an article based on the latest employment data published by GovStat

PolicyOrg compiles a report including NowNews article

Bob: Journalist
Alice: Data Cruncher
Tom: Editor
Nick: Web Master

:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide

Derivation - Timeliness

Derivation:
• Charts, graphs and visualizations are all based on multiple data sets
• E.g. Bob’s article on employment that appeared in NowNews
• Which data was a figure based upon?

Is the report based on the most up-to-date data?

Bob: Journalist
Alice: Data Cruncher
Tom: Editor
Nick: Web Master

:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide

Derivation - Trusted sources

Derivation:
• Is this content derived from data coming from a reliable source?
• The chart within Bob’s article is based on GovStat data
• However, that information is hidden: the chart was produced by a complex process performed by Alice

Policy rule:

“data supplied by the government is reliable”

Bob: Journalist
Alice: Data Cruncher
Tom: Editor
Nick: Web Master

:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide

Tracing the source of errors

Derivation, attribution:
• When did this error occur?
• Who was responsible for the chart?

Nick discovers an error in the chart included in Bob’s article

:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide

Ensuring policy compliance

Process inspection:
• Which process steps led to publication?
• Was an editorial check part of it?

Policy rule:

“posts are to be checked by an editor prior to publication”

:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide

Ensuring credit and acknowledgement

NowNews relies on multiple contributors

Bob: Journalist
Alice: Data Cruncher
Tom: Editor
Nick: Web Master

Attribution and responsibility:
• How do we ensure that all relevant contributors are acknowledged?

:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide

Reproducibility

Documenting the data generation process:
• How do we ensure that the figures can be reproduced using the new versions of the data?

NowNews must ensure that the article figures reflect the most recent data

Bob: Journalist
Alice: Data Cruncher
Tom: Editor
Nick: Web Master

:L-Moreau a Agent.
:original-slide a Entity; wasAttributedTo L-Moreau.
:this-slide a Entity; wasDerivedFrom original-slide

So, why does provenance matter?

• To establish quality, relevance, trust

• To track information attribution through complex transformations

• To enable process analysis for debugging, improvement, evolution

• To enable reproducibility of processes (eg in science, data journalism…)

See also:

ACM Journal of Data and Information Quality (JDIQ), Special Issue on Provenance, Data and Information Quality, Paolo Missier, Paolo Papotti, Eds. Volume 5, Issue 3, February 2015. DOI: 10.1145/2692312
http://dl.acm.org/citation.cfm?id=2700413
http://jdiq.acm.org/archive.cfm?id=2698232

The W3C Working Group on Provenance

2009-2010: W3C Incubator Group on provenance
Chair: Yolanda Gil, ISI, USC
Main output: “Provenance XG Final Report” http://www.w3.org/2005/Incubator/prov/XGR-prov/
- provides an overview of the various existing approaches, vocabularies
- proposes the creation of a dedicated W3C Working Group

April 2011: W3C Working Group approved
Chairs: Luc Moreau, Paul Groth

April 2013: Proposed Recommendations finalised
- prov-dm: Data Model
- prov-o: OWL ontology, RDF encoding
- prov-n: PROV notation
- prov-constraints

...plus a number of non-prescriptive Notes

http://www.w3.org/2011/prov/wiki/

PROV: scope and structure

source: http://www.w3.org/TR/prov-overview/

Recommendation track

See also:

Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures on the Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129. doi:10.2200/S00528ED1V01Y201308WBE007.

PROV Core Elements (graph depiction)

An entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.

An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.

An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.

Generation, Usage

Generation is the completion of production of a new entity by an activity. This entity did not exist before generation and becomes available for usage after this generation.

Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had not begun to utilize this entity

PROV is based on a notion of instantaneous events, that mark transitions in the world

- generation, usage (and others)

Ordering constraints amongst events:

“generation of e must precede each of its usages”

“a can only use / generate e after it has started and before it has ended”
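The two ordering constraints above can be checked mechanically once events carry comparable timestamps. The following is a minimal Python sketch, not part of PROV itself: it assumes a single shared timeline (which PROV deliberately does not require), and the names `Event` and `check_ordering` are hypothetical.

```python
from typing import NamedTuple

class Event(NamedTuple):
    kind: str      # "start", "end", "generation" or "usage"
    activity: str  # activity the event belongs to
    entity: str    # entity concerned ("" for start/end events)
    time: int      # position on an assumed single timeline

def check_ordering(events):
    """Return a list of human-readable violations of the two constraints."""
    violations = []
    start = {e.activity: e.time for e in events if e.kind == "start"}
    end = {e.activity: e.time for e in events if e.kind == "end"}
    generated = {e.entity: e.time for e in events if e.kind == "generation"}
    for e in events:
        if e.kind not in ("generation", "usage"):
            continue
        # An activity can only use/generate after its start and before its end.
        if e.activity in start and e.time < start[e.activity]:
            violations.append(f"{e.kind} of {e.entity} before start of {e.activity}")
        if e.activity in end and e.time > end[e.activity]:
            violations.append(f"{e.kind} of {e.entity} after end of {e.activity}")
        # Generation of an entity must precede each of its usages.
        if e.kind == "usage" and e.entity in generated and e.time < generated[e.entity]:
            violations.append(f"usage of {e.entity} before its generation")
    return violations

valid = [
    Event("start", "drafting", "", 1),
    Event("generation", "drafting", "draftV1", 2),
    Event("end", "drafting", "", 3),
    Event("start", "commenting", "", 4),
    Event("usage", "commenting", "draftV1", 5),
]
# Same events, but the usage is moved before both the generation and the start.
broken = valid[:4] + [Event("usage", "commenting", "draftV1", 1)]
```

Calling `check_ordering(valid)` reports no violations, while `check_ordering(broken)` reports both a usage before generation and a usage before the start of the using activity.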

Concepts and relations

Generation of “draft v1” expressed as relation:

wasGeneratedBy(“draft v1”, ...)

Usage of “draft v1” by “commenting” expressed as relation:

used(“commenting”, “draft v1”, ...)

PROV notation

document
  prefix prov <http://www.w3.org/ns/prov#>
  prefix ex <http://www.example.com/>

  entity(ex:draftComments)
  entity(ex:draftV1, [ ex:distr='internal', ex:status = "draft"])
  entity(ex:paper1)
  entity(ex:paper2)

  activity(ex:commenting)
  activity(ex:drafting)
  wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00)
  used(ex:commenting, ex:draftV1, -)
  wasGeneratedBy(ex:draftV1, ex:drafting, -)
  used(ex:drafting, ex:paper1, -)
  used(ex:drafting, ex:paper2, -)
endDocument

Same example: PROV-O notation (RDF/N3)

:draftComments a prov:Entity ;
    :distr "internal"^^xsd:string ;
    prov:wasGeneratedBy :commenting .

:commenting a prov:Activity ;
    prov:used :draftV1 .

:draftV1 a prov:Entity ;
    :distr "internal"^^xsd:string ;
    :status "draft"^^xsd:string ;
    :version "0.1"^^xsd:string ;
    prov:wasGeneratedBy :drafting .

:drafting a prov:Activity ;
    prov:used :paper1, :paper2 .

:paper1 a prov:Entity, "reference"^^xsd:string .

:paper2 a prov:Entity, "reference"^^xsd:string .

Association, Attribution, Delegation: who did what?

An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity.

Attribution is the ascribing of an entity to an agent.

entity(ex:draftComments, [ ex:distr='internal' ])
activity(ex:commenting)
agent(ex:Bob, [prov:type = "mainEditor"])
agent(ex:Alice, [prov:type = "srEditor"])

wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role = "editor"])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)

Same example: PROV-O notation (RDF/N3)

:Alice a prov:Agent, "ex:chiefEditor" ;
    :firstName "Alice" ;
    :lastName "Cooper" .

:Bob a prov:Agent, "ex:seniorEditor" ;
    :firstName "Robert" ;
    :lastName "Thompson" ;
    prov:actedOnBehalfOf :Alice .

:draftComments prov:wasAttributedTo :Bob .

:drafting a prov:Activity ;
    prov:wasAssociatedWith :Bob .

Association and Attribution

Q.: what is the relationship between attribution and association?

This is defined as an inference rule in the PROV-CONSTR document

entity(e)
agent(Ag)
activity(a)

wasAttributedTo(e, Ag)
wasGeneratedBy(e, a, -)
wasAssociatedWith(a, Ag, -)
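One way to read the inference above: every attribution is explained by some activity that generated the entity and was associated with the agent. The Python sketch below illustrates this on plain sets of tuples. It is a simplification made for this example: PROV-CONSTRAINTS always introduces a fresh existential activity, whereas this sketch reuses a known activity when one already fits. The function name and the `_a1` placeholder identifiers are hypothetical.

```python
from itertools import count

def apply_attribution_inference(attributed, generated, associated):
    """Expand wasAttributedTo facts into generation + association facts.

    attributed: set of (entity, agent)
    generated:  set of (entity, activity)
    associated: set of (activity, agent)
    Returns new (generated, associated) sets; the inputs are not modified.
    """
    generated, associated = set(generated), set(associated)
    fresh = count(1)
    for entity, agent in sorted(attributed):
        witnesses = [a for (e, a) in generated
                     if e == entity and (a, agent) in associated]
        if not witnesses:
            # No known activity explains the attribution: introduce a
            # placeholder (existential) activity for it.
            activity = f"_a{next(fresh)}"
            generated.add((entity, activity))
            associated.add((activity, agent))
    return generated, associated

gen, assoc = apply_attribution_inference(
    attributed={("ex:draftComments", "ex:Bob")},
    generated=set(),
    associated=set(),
)
```

Here `gen` and `assoc` each gain one fact about the placeholder activity `_a1`, which stands for "some activity" that a later statement may identify.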

Communication amongst activities

Communication is the exchange of some unspecified entity by two activities, one activity using some entity generated by the other.

activity(ex:commenting)
activity(ex:drafting)

wasInformedBy(ex:commenting, ex:drafting)

:drafting a prov:Activity .

:commenting a prov:Activity ; prov:wasInformedBy :drafting .

Communication, generation, usage

activity(ex:commenting)
activity(ex:drafting)
entity(e)
wasInformedBy(ex:commenting, ex:drafting)
wasGeneratedBy(e, ex:drafting, -)
used(ex:commenting, e, -)

Q.: what is the relationship between communication, generation, and usage?

These are inference rules 5 and 6 in the PROV-CONSTR document
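The direction from generation and usage to communication can be sketched in a few lines of Python: whenever one activity used an entity that another activity generated, the first was informed by the second. This is an illustrative sketch over plain tuples, with a hypothetical function name; the converse direction (from communication to an unnamed exchanged entity) is not shown.

```python
def infer_communication(generated, used):
    """Derive wasInformedBy pairs from generation and usage facts.

    generated: set of (entity, activity)
    used:      set of (activity, entity)
    Returns a set of (informed_activity, informant_activity).
    """
    producers = {}
    for entity, activity in generated:
        producers.setdefault(entity, set()).add(activity)
    return {(consumer, producer)
            for consumer, entity in used
            for producer in producers.get(entity, ())}

informed = infer_communication(
    generated={("e", "ex:drafting")},
    used={("ex:commenting", "e")},
)
```

With the statements from this slide, `informed` contains the single pair `("ex:commenting", "ex:drafting")`, matching `wasInformedBy(ex:commenting, ex:drafting)`.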

Three Views of Provenance

Summary of the PROV Core model

Derivation amongst entities

A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity.

entity(ex:draftV1)
entity(ex:draftComments)
wasDerivedFrom(ex:draftComments, ex:draftV1)

Q.: what is the relationship between derivation, generation, and usage?

:draftComments a prov:Entity ; prov:wasDerivedFrom :draftV1 .

:draftV1 a prov:Entity .

Provenance and Big Data: what’s the connection?

opportunities and challenges

Provenance {as,of} Big Data

1. BigProv: Provenance as big data
• High volume provenance
• What kind of analytics are interesting on big provenance?

2. Provenance of analytics processes
• “Prediction provenance”
• Train a model → provenance of the model as a record of the training process and data involved
• Use the model to make predictions → provenance of the prediction

3. Provenance of a search
• What is the provenance of a keyword search?
• Why would it be interesting? What can we learn from it?

Recent research on Provenance as Big Data

Chen, Peng; Plale, Beth A., "Big Data Provenance Analysis and Visualization," 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 797-800, 4-7 May 2015. doi: 10.1109/CCGrid.2015.85

Chen, Peng; Plale, Beth A., "ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model," 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 525-534, 4-7 May 2015. doi: 10.1109/CCGrid.2015.86

Macko, Peter; Seltzer, Margo (Harvard University), "Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs," Procs. TAPP’11, 2011, Crete, Greece

Ghoshal, Devarshi; Plale, Beth, "Provenance from Log Files: a BigData Problem," Procs. BigProv workshop, EDBT, Genova, Italy, 2013

Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop, Edinburgh, 2015 http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf

ProvGen

• A Provenance Generator tool for experimenting with provenance at scale

• Why generate synthetic provenance?

• Synthetic PROV graphs can be a valuable complement to emerging natural provenance collections

• … provided their structural properties reflect specific provenance patterns

• control over their repetition and variability

• varying scales

• Useful for benchmarking emerging provenance management systems

• Useful to test analytics algorithms that operate on large provenance collections

Firth, Hugo, and Paolo Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014. http://arxiv.org/pdf/1406.2495

What does ProvGen do?

• Accept a seed PROV graph

• Grow the graph
  • Add nodes and relationships following the seed graph structure
  • … with constraints on how to grow

document
  entity(e1, [type="Document", version="original"])
  entity(e2, [type="Document"])
  entity(e3, [type="Document"])
  activity(a1, [type="create"])
  activity(a2, [type="edit"])
  activity(a3, [type="edit"])
  agent(ag, [type="Person"])
  used(a2, e1)
  used(a3, e2)
  wasGeneratedBy(e2, a2, [fct="save"])
  wasGeneratedBy(e1, a1, [fct="publish"])
  wasGeneratedBy(e3, a3, [fct="save"])
  wasAssociatedWith(a3, ag, [role="contributor"])
  wasAssociatedWith(a2, ag, [role="contributor"])
  wasAssociatedWith(a1, ag, [role="creator"])
  wasDerivedFrom(e2, e1)
  wasDerivedFrom(e3, e2)
endDocument

ProvGen constraints

an Entity must have relationship "WasDerivedFrom" exactly 2 times unless it has property("version"="original");

the Entity(e1) must not have relationship "WasDerivedFrom" with the Entity(e2) unless e1 has relationship "Used" with the Activity(a) and e2 has the relationship "WasGeneratedBy" with the Activity(a);

an Entity must have relationship "WasGeneratedBy" exactly 1 times;

an Entity must have property("version"="original") with probability 0.05;

an Entity must have out degree at most 2;

an Activity must have relationship "Used" at most 1 times;

an Activity must have property("type"="create") with probability 0.01;

an Activity must have relationship "WasAssociatedWith" exactly 1 times;

an Activity must have relationship "Used" exactly 1 times unless it has property("type"="create");

an Activity must have relationship "WasGeneratedBy" exactly 1 times;

an Agent must have relationship "WasAssociatedWith" with probability 0.1;

an Agent must have relationship "WasAssociatedWith" between 1, 120 times with distribution gamma(1.3, 2.4);
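To make the flavour of these rules concrete, the Python sketch below checks two of them ("WasGeneratedBy" exactly 1 times per Entity, "Used" at most 1 times per Activity) against the seed graph from the "What does ProvGen do?" slide. This is not ProvGen's implementation: it assumes, for illustration only, that "have relationship" counts outgoing edges, and the helper names are hypothetical.

```python
from collections import Counter

# Seed graph edges, written as (source, relation, target).
edges = [
    ("a2", "Used", "e1"), ("a3", "Used", "e2"),
    ("e1", "WasGeneratedBy", "a1"), ("e2", "WasGeneratedBy", "a2"),
    ("e3", "WasGeneratedBy", "a3"),
    ("e2", "WasDerivedFrom", "e1"), ("e3", "WasDerivedFrom", "e2"),
]
entities = ["e1", "e2", "e3"]
activities = ["a1", "a2", "a3"]

def out_count(relation):
    """Number of outgoing edges of the given relation, per source node."""
    return Counter(src for src, rel, _ in edges if rel == relation)

def violations():
    """Return (node, relation, observed count) for each broken rule."""
    found = []
    generated = out_count("WasGeneratedBy")
    used = out_count("Used")
    for e in entities:
        if generated[e] != 1:   # "WasGeneratedBy" exactly 1 times
            found.append((e, "WasGeneratedBy", generated[e]))
    for a in activities:
        if used[a] > 1:         # "Used" at most 1 times
            found.append((a, "Used", used[a]))
    return found
```

Under these assumptions the seed graph satisfies both rules, so `violations()` returns an empty list.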

Some test queries

Generated graph loaded into the Neo4j graph DBMS
Queries expressed using the Cypher graph query language

Transitive closure over Derivation: return all the derivation chains, along with their length

MATCH p = (a)-[:`WASDERIVEDFROM`*]->(b) RETURN a, b, length(p)

Return the 50 longest derivation chains (of length greater than 10)

MATCH p = (a)-[:`WASDERIVEDFROM`*]->(b) WHERE length(p) > 10 RETURN a, b, length(p) ORDER BY length(p) DESC LIMIT 50

All agents and their associated activities

MATCH (a)-[:`WASASSOCIATEDWITH`]->(b) RETURN b AS Agent, a AS Activity

All agents who created new documents

MATCH (a {type:'create'})-[:`WASASSOCIATEDWITH`]->(b) RETURN a, b LIMIT 25

All agents who edited a document that was derived from an original

MATCH (doc1 {version:'original'})<-[:`WASDERIVEDFROM`]-(doc2)-[:`WASGENERATEDBY`]->(act)-[:`WASASSOCIATEDWITH`]->(agent) RETURN agent LIMIT 25
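For readers without a graph database at hand, the transitive-closure query over derivation can be mimicked in plain Python. The sketch below is only an illustration of what the query computes, not of how Neo4j evaluates it; it assumes the derivation graph is acyclic, and the function name is hypothetical.

```python
def derivation_chains(derived_from):
    """Enumerate every derivation path as (start, end, length).

    derived_from maps an entity to the list of entities it was derived from.
    Assumes the graph has no cycles.
    """
    chains = []

    def walk(start, node, length):
        for parent in derived_from.get(node, ()):
            chains.append((start, parent, length + 1))
            walk(start, parent, length + 1)

    for entity in derived_from:
        walk(entity, entity, 0)
    return chains

# Derivation edges of the ProvGen seed graph: e3 -> e2 -> e1.
seed = {"e2": ["e1"], "e3": ["e2"]}
# Longest chains first, as in the ORDER BY ... DESC query.
chains = sorted(derivation_chains(seed), key=lambda c: -c[2])
```

On the seed graph this yields three chains, the longest being from `e3` back to `e1` with length 2.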

Provenance of Big Data

Provenance of analytics processes:

“Prediction provenance”
• Train a model → provenance of the model as a record of the training process and data involved
• Use the model to make predictions → provenance of the prediction

Relations may be given identifiers

entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -)
used(use1; ex:commenting, ex:draftV1, -)

gen1 denotes a generation event

use1 denotes a usage event

General derivation relation:

wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)

Relation IDs make it possible to refer to relations in other relations

Rendering N-ary relations in PROV-O

RDF natively expresses binary relations only; N-ary relations require reification

entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, 2013-03-18T10:00:01)
used(use1; ex:commenting, ex:draftV1, -)

:draftComments a prov:Entity ;
    prov:qualifiedGeneration :gen1 .

:gen1 a prov:Generation ;
    prov:activity :commenting ;
    prov:atTime "2013-03-18T10:00:01+09:00" .

:commenting a prov:Activity ;
    prov:qualifiedUsage :use1 .

:use1 a prov:Usage ;
    :note "found comments useful" ;
    prov:atTime "2013-03-21T10:00:01+09:00" ;
    prov:entity :draftV1 .

“Qualified relation” RDF pattern

:draftComments a prov:Entity ;
    prov:qualifiedGeneration :gen1 .

:gen1 a prov:Generation ;
    prov:activity :commenting ;
    prov:atTime "2013-03-18T10:00:01+09:00" .

:commenting a prov:Activity ;
    prov:qualifiedUsage :use1 .

:use1 a prov:Usage ;
    :note "found comments useful" ;
    prov:atTime "2013-03-21T10:00:01+09:00" ;
    prov:entity :draftV1 .

Plans: why was something done?

Most relation types have two arguments, each of which is an Entity, an Activity, or an Agent

Derivation is one exception:

wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)

Two other notable exceptions:
- Associations with a plan
- Delegation with an activity scope

wasAssociatedWith(id; a, ag, pl, attrs)

A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goal

Association with a plan

A plan plays a role in an association

Plans are typed entities

activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
agent(ex:_aJVM, [prov:type = 'JVM-6.0'])
entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label='Program 1'])

wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role='defaultRuntime', ex:accessPath="webapp" ])

A plan is an entity having prov:type = 'prov:Plan'

Plan pattern as PROV-O

:_aProgramExecution a prov:Activity ;
    :execTime "22.5sec" ;
    prov:qualifiedAssociation [
        a prov:Association ;
        :accessPath "webapp" ;
        prov:agent :_aJVM ;
        prov:hadPlan :myCleverProgram ;
        prov:hadRole "defaultRuntime"
    ] .

:_aJVM a prov:Agent, "Java-6.0" .

:myCleverProgram a prov:Entity, prov:Plan .

activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
agent(ex:_aJVM, [prov:type = 'JVM-6.0'])
entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label='Program 1'])

wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role='defaultRuntime', ex:accessPath='webapp' ])

Plan pattern as PROV-O

:_aProgramExecution a prov:Activity ;
    :execTime "22.5sec" ;
    prov:qualifiedAssociation [
        a prov:Association ;
        :accessPath "webapp" ;
        prov:agent :_aJVM ;
        prov:hadPlan :myCleverProgram ;
        prov:hadRole "defaultRuntime"
    ] .

:_aJVM a prov:Agent, "Java-6.0" .

:myCleverProgram a prov:Entity, prov:Plan .

Delegation within an activity scope

Real-world artifacts vs provenance entities

ref: http://www.w3.org/2001/sw/wiki/PROV-FAQ#Examples_of_Provenance

“What do I know about the car I see in this Cambridge street today?”

• It was bought by Joe in 2011

• Joe drove it to Boston on March 16th, 2013. The car has now got 10,000 miles on it

• Joe drove it to Cambridge on March 18th, 2013.

“Same” car, but different provenance at each stage of its evolution

Alternate-specialization pattern

Two alternate entities present aspects of the same thing. These aspects may be the same or different, and the alternate entities may or may not overlap in time.

An entity that is a specialization of another shares all aspects of the latter, and additionally presents more specific aspects of the same thing as the latter.

...But, this is still that car!

Semantic notes:
1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2).
2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1).
3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).

differing in their location

same owner, added location
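The three semantic notes above can be applied repeatedly until no new statement appears. The Python sketch below does exactly that on sets of pairs; it implements only the three rules listed on this slide, and the function and entity names (`close`, `ex:car`, and so on) are hypothetical.

```python
def close(specialization, alternate):
    """Apply the three semantic notes until a fixpoint is reached.

    Both arguments are sets of (e1, e2) pairs; new sets are returned.
    """
    spec, alt = set(specialization), set(alternate)
    changed = True
    while changed:
        before = (len(spec), len(alt))
        # 3. Specialization is transitive.
        spec |= {(a, c) for (a, b) in spec for (b2, c) in spec if b == b2}
        # 1. Specialization implies alternate.
        alt |= spec
        # 2. Alternate is symmetric.
        alt |= {(b, a) for (a, b) in alt}
        changed = (len(spec), len(alt)) != before
    return spec, alt

spec, alt = close(
    specialization={("ex:carInBoston", "ex:joesCar"), ("ex:joesCar", "ex:car")},
    alternate=set(),
)
```

From two specialization statements the sketch derives a third by transitivity, plus the alternate statements in both directions for each of the three pairs.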

Reserved attributes and types

A small set of reserved attributes, with some usage restrictions

Bundles, provenance of provenance

A bundle is a named set of provenance descriptions, and is itself an entity, thus allowing provenance of provenance to be expressed.

bundle pm:bundle1
  entity(ex:draftComments)
  entity(ex:draftV1)
  activity(ex:commenting)
  wasGeneratedBy(ex:draftComments, ex:commenting, -)
  used(ex:commenting, ex:draftV1, -)
endBundle
...
entity(pm:bundle1, [ prov:type='prov:Bundle' ])
agent(ex:Bob)
wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00)
wasAttributedTo(pm:bundle1, ex:Bob)

Bundles in PROV-O

Bundle definition (an RDF named graph):

ex:bundle1 {
    :draftComments a prov:Entity ;
        :status "blah" ;
        prov:wasGeneratedBy :commenting .

    :commenting a prov:Activity ;
        prov:used :draftV1 .

    :draftV1 a prov:Entity .
}

Bundle usage:

ex:bundle1 a prov:Entity, "prov:Bundle" ;
    prov:qualifiedGeneration [
        a prov:Generation ;
        prov:atTime "2013-03-20T10:30:00+09:00"
    ] ;
    prov:wasAttributedTo :Bob .

Component Structure for PROV

Core vs Extended

Responsibility View

Data Flow View

Process View

Core, Extended

Time, Events

- Provenance statements are combined by different systems
- An application may not be able to align the times involved to a single global timeline

Therefore, PROV minimizes assumptions about time

Instead, the PROV data model is implicitly based on a notion of instantaneous events, that mark transitions in the world (*)

Events:
- activity start, activity end
- entity generation, entity usage, entity invalidation

wasStartedBy(id; a2, e, a1, t, attrs)

wasEndedBy(id; a2, e, a1, t, attrs)

(*) PROV-CONSTR http://www.w3.org/TR/prov-constraints/#events (non-normative)

From “scruffy” provenance to “valid” provenance

- Are all possible temporal partial orderings of events equally acceptable?
- How can we specify the set of all valid orderings?

More generally, how do we formally define what it means for a set of provenance statements to be valid?

PROV defines a set of temporal constraints that ensure consistency of a provenance graph

Summary

• Motivation for collecting provenance of data and information
  • In Science
  • In the Social Web

• The W3C PROV Recommendation (2013)
  • PROV-DM: The PROV data model
  • PROV-O: the Provenance Ontology
  • (PROV-CONSTRAINTS)

• Provenance as Big Data
  • High volume provenance
  • Storage, analytics, visualisation

• Provenance of analytics
  • How can I explain my predictions?

• The ProvGen tool

Selected bibliography

Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell, et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012. http://www.w3.org/TR/prov-dm/

Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012. http://www.w3.org/TR/prov-constraints/

Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web Semantics: Science, Services and Agents on the World Wide Web (April 2015). doi:10.1016/j.websem.2015.04.001. http://www.sciencedirect.com/science/article/pii/S1570826815000177

Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz, Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530. http://dx.doi.org/10.1002/cpe.1870.

Firth, H., and P. Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014. http://arxiv.org/pdf/1406.2495

Missier, P., J. Bryans, C. Gamble, V. Curcin, and R. Danger. “ProvAbs: Model, Policy, and Tooling for Abstracting PROV Graphs.” In Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014. http://arxiv.org/pdf/1406.1998

De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh, Scotland: USENIX Association, 2015. https://www.usenix.org/conference/tapp15/workshop-program/presentation/de-oliveira.

provenance.ecs.soton.ac.uk

ProvValidator

ProvTranslator

ProvStore

ProvToolbox

ProvPy