Managing Provenance in Scientific workflows: a …

Sarah Cohen-Boulakia, Université Paris Sud, LRI CNRS UMR 8623. On leave at INRIA Virtual Plants & Zenith, Inst. of Comput. Biology, Montpellier

Repositories queried (IR-style) with workflow docum.

Open question: Query languages for repositories
◦ Given a high-level description of an (integration) task – a sketch
◦ Given an input and/or an output format/type
◦ Given a workflow – find similar workflows
◦ Search across workflow models (Galaxy, Taverna…) …
◦ Querying runs [KSB10, MPB10,…15]

Core of the problem: Workflow similarity
◦ Clear view of the state-of-the-art [SCB+14]
◦ Need to design hybrid and efficient solutions
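To make the notion of workflow similarity concrete, here is a minimal sketch of one very simple measure: the Jaccard similarity between the sets of module labels of two workflows. It is an illustration only, not the approach surveyed in [SCB+14]; the workflow names, module labels, and the helpers `module_set_similarity` and `rank_similar` are hypothetical.

```python
def module_set_similarity(wf_a, wf_b):
    """Jaccard similarity between the module-label sets of two workflows."""
    labels_a, labels_b = set(wf_a), set(wf_b)
    if not labels_a and not labels_b:
        return 1.0  # two empty workflows are treated as identical
    return len(labels_a & labels_b) / len(labels_a | labels_b)


def rank_similar(query, repository, top_k=3):
    """Return the top_k (name, score) pairs most similar to the query."""
    scored = [
        (module_set_similarity(query, modules), name)
        for name, modules in repository.items()
    ]
    # Highest score first; ties broken by workflow name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical repository: workflow name -> list of module labels.
repository = {
    "wf_align": ["fetch_sequence", "blast", "parse_hits"],
    "wf_tree": ["fetch_sequence", "align", "build_tree"],
    "wf_plot": ["load_table", "plot"],
}
query = ["fetch_sequence", "blast", "filter_hits"]
ranking = rank_similar(query, repository)
```

This measure ignores the graph structure entirely, which is one reason hybrid solutions (labels plus topology) are called for.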

Becomes a practical topic only now
◦ Large repositories are available + Smaller provenance repositories

Relationships with business workflows: BPQL, BPMN-Q [AS10], BP-QL [BEKM08], … are starting to consider logs

Reuse can be improved by providing citation

Discovering reused workflows in existing repositories
◦ Detecting graph patterns

Various techniques exist, again graph-based problems

Subgraph isomorphism [Ull76] or graph simulation [FLM+10]
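As an illustration of the second technique, the following is a naive fixpoint sketch of graph simulation between a small pattern and a workflow graph. It is not the optimized algorithm of [FLM+10]; the node names, labels, and the function `graph_simulation` are assumptions made for the example.

```python
def graph_simulation(pattern_edges, pattern_labels, graph_edges, graph_labels):
    """For each pattern node, return the set of graph nodes that simulate it.

    A graph node v simulates a pattern node u if both carry the same label
    and every pattern edge u -> u' is mirrored by some graph edge v -> v'
    where v' simulates u'.
    """
    def successors(edges):
        succ = {}
        for src, dst in edges:
            succ.setdefault(src, set()).add(dst)
        return succ

    p_succ = successors(pattern_edges)
    g_succ = successors(graph_edges)

    # Start from all label-compatible candidates, then prune to a fixpoint.
    sim = {
        u: {v for v, label in graph_labels.items() if label == pattern_labels[u]}
        for u in pattern_labels
    }
    changed = True
    while changed:
        changed = False
        for u, u_children in p_succ.items():
            for u_child in u_children:
                keep = {v for v in sim[u] if g_succ.get(v, set()) & sim[u_child]}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    return sim


# Hypothetical pattern: a "fetch" step followed by a "blast" step.
pattern_labels = {"p1": "fetch", "p2": "blast"}
pattern_edges = [("p1", "p2")]
graph_labels = {"a": "fetch", "b": "blast", "c": "fetch", "d": "plot"}
graph_edges = [("a", "b"), ("c", "d")]

sim = graph_simulation(pattern_edges, pattern_labels, graph_edges, graph_labels)
# The pattern is considered matched only if no pattern node has an empty set.
matched = all(sim.values())
```

Unlike subgraph isomorphism, simulation does not require a one-to-one mapping, which is what makes it computable in polynomial time.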

Constructing workflow citations
◦ Techniques to track copy-paste operations when designing workflows

◦ Workflows as citable objects

◦ Storing/indexing workflows (graphs)

Illustration

[Figure: workflows (nodes) and workflow interconnections by ≥ 3 mutual processors (avg 11.4 proc / swf)]
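The interconnection shown in the illustration (two workflows are linked when they share at least 3 processors) can be sketched as follows. This is a minimal sketch assuming each workflow is reduced to a list of processor names; the sample data and the function `interconnections` are hypothetical.

```python
from itertools import combinations


def interconnections(workflows, min_shared=3):
    """Link every pair of workflows sharing at least min_shared processors."""
    links = {}
    for name_a, name_b in combinations(sorted(workflows), 2):
        shared = set(workflows[name_a]) & set(workflows[name_b])
        if len(shared) >= min_shared:
            links[(name_a, name_b)] = shared
    return links


# Hypothetical workflows: name -> processor names.
workflows = {
    "wf1": ["fetch", "blast", "parse", "plot"],
    "wf2": ["fetch", "blast", "parse", "export"],
    "wf3": ["fetch", "align"],
}
links = interconnections(workflows)
```

Comparing all pairs is quadratic in the number of workflows; for a large repository an inverted index from processor to workflows would be a natural refinement.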

Workflows and provenance help in reproducibility

But tasks may require certain software to be pre-installed

Open question: Make SWFS infrastructure-aware
◦ Problem is well studied in operating systems / middleware
◦ SWFS need to communicate with operating system

New approaches are emerging, possibly combined with workflows (virtual environments): Docker, ReproZip, …
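A first step towards infrastructure awareness is simply recording the environment in which a task ran. The sketch below captures a small snapshot using only the Python standard library; it is not how Docker or ReproZip work, and the function `environment_snapshot` and the package name are hypothetical.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot(packages=()):
    """Record interpreter, OS and package versions next to a workflow run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # the task would fail on this machine
    return {
        "python": sys.version.split()[0],
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": versions,
    }


# Hypothetical requirement that is (deliberately) not installed.
snapshot = environment_snapshot(["definitely-not-installed-package-xyz"])
```

Storing such a snapshot with the provenance of a run makes a missing pre-installed dependency visible before re-execution is attempted.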

Reproducible papers
◦ Web-based interactive computational environment
◦ Combination of code execution, text, mathematics, plots and rich media into a single document
◦ Some systems export workflows as executable IPython papers

To be formalized

Many bioinformatics analyses are performed using scripts (instead of workflows)

Provenance of a script execution?
◦ noWorkflow [MBC+14], YesWorkflow [MSK+15]
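To give an idea of what provenance of a script execution can look like, here is a minimal sketch that records each function call (activity, inputs, output, timestamps) through a decorator. This is not the API of noWorkflow or YesWorkflow; `record_provenance`, `PROVENANCE_LOG`, and the two sample functions are assumptions made for illustration.

```python
import functools
import time

# In-memory provenance trace: one entry per function call.
PROVENANCE_LOG = []


def record_provenance(func):
    """Decorator that logs inputs, output and timing of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.time()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "started": started,
            "ended": time.time(),
        })
        return result
    return wrapper


@record_provenance
def normalize(values):
    total = sum(values)
    return [v / total for v in values]


@record_provenance
def top(values, k=1):
    return sorted(values, reverse=True)[:k]


best = top(normalize([1, 1, 2]), k=1)
```

The output of `normalize` reappears as an input of `top` in the log, which is the data dependency a workflow system would have made explicit as an edge.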

Equivalence between scripts and workflows?
◦ Provenance-equivalence [CBC+14]? Other kind of equivalence?

Aim
◦ Optimization of workflows (using ZOOM*userviews, DistillFlow…)
◦ Optimization of scripts (refactoring, …)

On-the-fly solutions have to be designed
◦ Data is too volatile to be updated as in data warehouses

One size cannot fit all
◦ Combining ranking criteria or consensus ranking?

Exploiting alternative paths
◦ Tuning PageRank… ?
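One simple way to obtain a consensus ranking from several ranking criteria is a Borda count, sketched below. It is only one of many possible aggregation schemes and is not claimed to be the one intended here; the source names and the function `borda_consensus` are hypothetical.

```python
def borda_consensus(rankings):
    """Aggregate several rankings of the same items with a Borda count.

    Each ranking lists items from best to worst; an item in position p of
    a ranking of n items receives n - 1 - p points.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    # Highest total first; ties broken alphabetically.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings of three data sources under three criteria.
rankings = [
    ["src_a", "src_b", "src_c"],
    ["src_b", "src_a", "src_c"],
    ["src_a", "src_c", "src_b"],
]
consensus = borda_consensus(rankings)
```

Here `src_a` collects 5 points, `src_b` 3 and `src_c` 1, so the consensus order is `src_a`, `src_b`, `src_c`.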

Organizing challenges & providing gold standards to evaluate solutions

Many research opportunities …

…. with big impact on large communities of users!

Data Integration in the Life Sciences (DILS) is more important than ever

Faced with the growing volume of data, the growing number of sources and analytic tools, and the increasing complexity of analysis pipelines, the challenges are numerous

Scientific workflows play a crucial role by their ability to combine analysis and integration and enhance reproducibility

Ranking is necessary to help prioritize research

New developments in Databases and Graphs (algorithmics) will have major impact in DILS…

… and new algorithms from the DILS community may be reused by other communities!