2017-11-03 Scientific Workflow systems

Scientific Workflow Systems. Stian Soiland-Reyes, eScience Lab, The University of Manchester. 2017-11-03, Aix-en-Provence. CESAB workshop: Reproducible Workflows. orcid.org/0000-0001-9842-9718 @soilandreyes. This work is licensed under a Creative Commons Attribution 4.0 International License.

Transcript of 2017-11-03 Scientific Workflow systems

Page 1: 2017-11-03 Scientific Workflow systems

Partners Funding

bioexcel.eu

Scientific Workflow Systems

Stian Soiland-Reyes

eScience Lab, The University of Manchester

2017-11-03, Aix-en-Provence

CESAB workshop: Reproducible Workflows

orcid.org/0000-0001-9842-9718 @soilandreyes

This work is licensed under a Creative Commons Attribution 4.0 International License.

Page 2: 2017-11-03 Scientific Workflow systems

What is a Workflow?

Orchestrating computational tasks

Managing the control and data flow

Homogeneous or heterogeneous tasks:

– Local / remote

– Own / third party

– White, grey or black boxes

– Reliable / fragile

– Reserved / dynamic

– Various underpinning infrastructure

– Various access controls
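A minimal sketch of "orchestrating tasks and managing the data flow", assuming Python 3.9+ for the standard-library `graphlib`. The three tasks and their names are hypothetical stand-ins; a real workflow system would add scheduling, remote execution and error handling on top of this ordering step.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each receives the results of its upstream tasks.
def fetch(_inputs):
    return [3, 1, 2]

def clean(inputs):
    return sorted(inputs["fetch"])

def report(inputs):
    values = inputs["clean"]
    return f"{len(values)} values, max {values[-1]}"

TASKS = {"fetch": fetch, "clean": clean, "report": report}

# Data dependencies: task name -> names of the tasks it needs first.
DEPENDS_ON = {"fetch": set(), "clean": {"fetch"}, "report": {"clean"}}

def run_workflow(tasks, depends_on):
    """Run every task once, after all of its upstream tasks have finished."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        upstream = {dep: results[dep] for dep in depends_on[name]}
        results[name] = tasks[name](upstream)
    return results

if __name__ == "__main__":
    print(run_workflow(TASKS, DEPENDS_ON)["report"])
```

The control flow (execution order) is derived from the data flow (who needs whose output), which is the core idea most of the systems on the following slides share.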

BioExcel: Biomolecular recognition

Page 3: 2017-11-03 Scientific Workflow systems

Not on the agenda: Business workflows

Control flow of who has responsibility for what

BPM

Business workflows + computational workflows

IBISBA

Page 4: 2017-11-03 Scientific Workflow systems

Why use workflows?

Automation

– Automate computational aspects

– Repetitive pipelines, sweep campaigns

Scaling – compute cycles

– Make use of computational infrastructure & handle large data

Abstraction – people cycles

– Shield complexity and incompatibilities

– Report, re-use, evolve, share, compare

– Repeat – Tweak – Repeat

– First class commodities

Provenance – reporting

– Capture, report and utilize log and data lineage

– Auto-documentation

– Traceable evolution, audit, transparency

– Compare

Findable

Accessible

Interoperable

Reusable

(Reproducible)

Adapted from Bertram Ludäscher at WORKS2015 https://www.slideshare.net/ludaesch/works-2015provenancemileage
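A sketch of the "sweep campaign" idea under Automation: the same pipeline run once per combination of parameter values. The `simulate` function and the parameter names are hypothetical placeholders for a real job submission.

```python
from itertools import product

def simulate(temperature, ph):
    """Stand-in for one pipeline run; a real sweep would launch a job here."""
    return temperature * 0.1 + ph

def sweep(grid):
    """Run `simulate` once for every combination of parameter values."""
    names = sorted(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        runs.append((params, simulate(**params)))
    return runs

if __name__ == "__main__":
    for params, result in sweep({"temperature": [280, 300], "ph": [6.5, 7.0]}):
        print(params, result)
```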

Page 5: 2017-11-03 Scientific Workflow systems

The humble Makefile

https://github.com/vak/makefile2dot
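The rule behind a Makefile can be sketched in a few lines of Python: a target is rebuilt when it is missing or older than any of its dependencies. This is a simplified illustration of the timestamp comparison, not how `make` itself is implemented; the function name is an assumption.

```python
import os

def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)
```

Applied over a dependency graph, this check is what lets a re-run skip the steps whose outputs are already up to date.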

Page 6: 2017-11-03 Scientific Workflow systems

Laser Interferometer Gravitational-Wave Observatory

First detection of gravitational waves from colliding black holes

https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/

https://pegasus.isi.edu/

Page 7: 2017-11-03 Scientific Workflow systems

Workflow Environment Ecosystem

Page 8: 2017-11-03 Scientific Workflow systems

https://s.apache.org/existing-workflow-systems

Page 9: 2017-11-03 Scientific Workflow systems

https://taverna.incubator.apache.org/

Page 10: 2017-11-03 Scientific Workflow systems

https://www.knime.org/

https://www.openphacts.org/

Pharmacological queries: target, compound and pathway data

https://doi.org/10.1371/journal.pone.0115460

http://www.myexperiment.org/workflows/4292

Page 11: 2017-11-03 Scientific Workflow systems

https://usegalaxy.org/

Page 12: 2017-11-03 Scientific Workflow systems

Stop Press!

GUIs not essential!

GUI: Canvas, drag-drop blocks, arrows, run button, data visualization

Script: Textual, command line, view data externally. Script easily run from other apps.

Scripts can be workflows!

Workflow systems ⇆ Scripts

Scripts on ASAP meter:

Automation: ★ ★ ★ ★ ★

Scaling: ★ ★

Abstraction: ★

Provenance: ★ ★

Page 13: 2017-11-03 Scientific Workflow systems

https://www.nextflow.io/

Script-like, define flow as channels

Streaming

Automatic Parallelism

Checkpoints

Virtualization and packaging

Portable

Reproducibility
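The "flow as channels" and "streaming" ideas can be loosely illustrated with Python generators: each stage consumes items from the previous stage as they arrive instead of waiting for the whole dataset. This is an analogy only, not Nextflow's language or API, and the stage names are hypothetical.

```python
def read_samples(names):
    """Source channel: emit one sample name at a time."""
    for name in names:
        yield name

def align(samples):
    """Stage 1: consume samples as they arrive, emit an output file name."""
    for sample in samples:
        yield f"{sample}.bam"

def count(alignments):
    """Stage 2: consume alignments as they arrive, emit (name, length)."""
    for bam in alignments:
        yield (bam, len(bam))

def pipeline(names):
    """Connect the stages; nothing runs until the result is iterated."""
    return count(align(read_samples(names)))

if __name__ == "__main__":
    for item in pipeline(["sampleA", "sampleB"]):
        print(item)
```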

Page 14: 2017-11-03 Scientific Workflow systems

Snakemake

Makefile + Python ⇝ Snakemake

Filename patterns

Shell commands

Inline Python, R

Scalable to grid/cloud

https://snakemake.readthedocs.io/
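A sketch of how "filename patterns" can drive a rule: a requested output file is matched against a pattern such as `sorted/{sample}.bam`, and the captured wildcard is reused to name the input. This is a plain-Python illustration of the idea, not Snakemake's own matching code; the helper names and paths are assumptions.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'sorted/{sample}.bam' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)  # literal, name, literal, name, ...
    regex = ""
    for index, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if index % 2 else re.escape(part)
    return re.compile(regex)

def match_wildcards(pattern, filename):
    """Return the wildcard values if `filename` fits `pattern`, else None."""
    match = pattern_to_regex(pattern).fullmatch(filename)
    return match.groupdict() if match else None

if __name__ == "__main__":
    wildcards = match_wildcards("sorted/{sample}.bam", "sorted/A.bam")
    print(wildcards)
    print("mapped/{sample}.bam".format(**wildcards))
```

The sketch assumes each wildcard name appears only once per pattern.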

Page 15: 2017-11-03 Scientific Workflow systems

YesWorkflow

Declare workflow steps as #annotations in existing scripts

Graphical visualization of workflow

http://yesworkflow.org/
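A sketch of what annotating an existing script looks like, using the `@begin`, `@in`, `@out` and `@end` comment keywords. The small extractor below is a simplified illustration written for this example (it ignores nested blocks and the tool's other keywords); it is not the YesWorkflow tool itself, and the step and file names are hypothetical.

```python
import re

# An annotated script, held as text so the example stays self-contained.
SCRIPT = """\
# @begin clean_data
# @in raw_file
# @out clean_file
rows = [line.strip() for line in open("raw.csv")]
# @end clean_data

# @begin plot_data
# @in clean_file
# @out figure
# @end plot_data
"""

ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")

def extract_steps(source):
    """Collect the declared inputs and outputs of each annotated step."""
    steps = {}
    current = None
    for line in source.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue
        keyword, name = match.groups()
        if keyword == "begin":
            current = steps.setdefault(name, {"in": [], "out": []})
        elif keyword == "end":
            current = None
        elif current is not None:
            current[keyword].append(name)
    return steps

if __name__ == "__main__":
    print(extract_steps(SCRIPT))
```

Because the annotations live in comments, the script still runs unchanged; the workflow graph is derived from the declarations, not from executing the code.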

Page 16: 2017-11-03 Scientific Workflow systems

https://github.com/chapmanb/bcbio-nextgen

Distributed workflows for Next-Gen Sequencing analysis

Domain-specific language

Focus on parameters, algorithms

Workflow fixed – no command lines!

https://bcbio-nextgen.readthedocs.org

Page 17: 2017-11-03 Scientific Workflow systems

http://commonwl.org/

Workflow interoperability

Common workflow format

Community based standards effort

Designed for clusters & clouds

Use containers (e.g. Docker)

Textual YAML files

(GUIs available)

Workflow: Steps with data dependencies

Step: command line or inline scripts

Scatter/gather on steps

Rich annotations
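A sketch of what a single command-line step looks like as a CWL document, built here as a Python dictionary and serialised as JSON (which CWL accepts alongside YAML). The `echo` tool and the `message` input are illustrative choices; the field names follow CWL v1.0 as I understand it and should be checked against the specification.

```python
import json

def echo_tool():
    """Describe a minimal CommandLineTool that wraps `echo`."""
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        # One string input, placed as the first command-line argument.
        "inputs": {
            "message": {"type": "string", "inputBinding": {"position": 1}}
        },
        "outputs": [],
    }

def to_cwl_text(document):
    """Serialise the description; JSON output is also valid YAML."""
    return json.dumps(document, indent=2)

if __name__ == "__main__":
    print(to_cwl_text(echo_tool()))
```

A workflow document then lists such tools as steps and wires their inputs and outputs together, which is where the data dependencies on the slide come from.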

Page 18: 2017-11-03 Scientific Workflow systems

http://www.commonwl.org/

Page 19: 2017-11-03 Scientific Workflow systems

Containers

Linux Container technology

... light-weight "virtual" virtual machine

A container is started from an image

Images downloaded from Docker Hub

Dockerfile: Layer-based recipe

Philosophy: One service, one image → microservices

Cloud's best friend: scalable, reproducible, customizable
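A sketch of how a workflow step might start a container from an image: the function below only assembles the `docker run` command line, so it can be inspected without Docker installed. The function name, the `/data` mount point and the example image tag are assumptions for illustration.

```python
def docker_run_command(image, command, workdir=None):
    """Build the argument list for running `command` in a fresh container."""
    argv = ["docker", "run", "--rm"]  # --rm: remove the container afterwards
    if workdir is not None:
        # Mount a host directory and make it the working directory inside.
        argv += ["--volume", f"{workdir}:/data", "--workdir", "/data"]
    return argv + [image] + list(command)

if __name__ == "__main__":
    print(" ".join(docker_run_command("ubuntu:16.04", ["echo", "hello"])))
```

Passing the resulting list to `subprocess.run` would execute it, provided Docker is installed and the image can be pulled.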

Page 20: 2017-11-03 Scientific Workflow systems

Publish your own container images

https://hub.docker.com/r/openphacts/

Dockerfile

Page 21: 2017-11-03 Scientific Workflow systems

Find and Share

http://www.myexperiment.org

Page 22: 2017-11-03 Scientific Workflow systems

https://view.commonwl.org/

http://doi.org/10.7490/f1000research.1114375.1

Page 23: 2017-11-03 Scientific Workflow systems

Running workflows, tracking provenance

Page 24: 2017-11-03 Scientific Workflow systems

Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved.

http://www.w3.org/TR/prov-overview/

Provenance

W3C standard: PROV

But multiple formats

Multiple styles

Multiple extensions

Best practice for Workflow Provenance?

wfprov (Research Object, Taverna): https://w3id.org/ro/2016-01-28/wfprov/

OPMW/P-Plan (WINGS): http://www.opmw.org

ProvONE (DataONE): http://vcvcomputing.com/provone/provone.html
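A sketch of the core PROV pattern for one workflow step: an activity that used some input entities and generated some output entities, written out in PROV-N notation. The `ex` prefix, the example namespace and the step and file names are placeholders; the workflow-specific vocabularies listed above add further terms on top of this core.

```python
def prov_n(activity, inputs, outputs, prefix="ex", namespace="http://example.org/"):
    """Describe one step run as a small PROV-N document (returned as text)."""
    lines = [
        "document",
        f"  prefix {prefix} <{namespace}>",
        f"  activity({prefix}:{activity})",
    ]
    for name in inputs:
        lines.append(f"  entity({prefix}:{name})")
        lines.append(f"  used({prefix}:{activity}, {prefix}:{name}, -)")
    for name in outputs:
        lines.append(f"  entity({prefix}:{name})")
        lines.append(f"  wasGeneratedBy({prefix}:{name}, {prefix}:{activity}, -)")
    lines.append("endDocument")
    return "\n".join(lines)

if __name__ == "__main__":
    print(prov_n("align", inputs=["reads"], outputs=["alignment"]))
```

The `-` marks an omitted timestamp; a workflow engine would normally fill in start, end and generation times.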

Page 25: 2017-11-03 Scientific Workflow systems

https://twitter.com/ianholmes/status/288689712636493824

Page 26: 2017-11-03 Scientific Workflow systems

https://doi.org/10.1016/j.websem.2015.01.003

application/vnd.wf4ever.robundle+zip

Research Object Bundle

http://www.researchobject.org/
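A sketch of packaging files as a Research Object Bundle: a ZIP archive whose first entry is an uncompressed `mimetype` file carrying the media type above, plus a manifest listing the aggregated resources. The manifest path (`.ro/manifest.json`), its fields and the context URL reflect my reading of the bundle specification and should be verified against it; the file names are hypothetical.

```python
import json
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"

def write_bundle(path, files):
    """Write `files` (archive path -> bytes) into a bundle at `path`."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(files)],
    }
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The media type goes first and is stored without compression.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name in sorted(files):
            bundle.writestr(name, files[name])
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))

if __name__ == "__main__":
    write_bundle("example.bundle.zip", {"workflow.cwl": b"cwlVersion: v1.0\n"})
```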

Page 27: 2017-11-03 Scientific Workflow systems

Acknowledgements

Carole Goble

Michael R. Crusoe

Apache Taverna

BioExcel

Common Workflow Language

Research Object