2016-10-20 BioExcel: Advances in Scientific Workflow Environments

2016-10-20 BioExcel Workflow Training, BSC, Barcelona
Advances in Scientific Workflow Environments
Carole Goble, Stian Soiland-Reyes, The University of Manchester
http://esciencelab.org.uk/
@soilandreyes http://orcid.org/0000-0001-9842-9718


What is a Workflow?
• Orchestrating multiple computational tasks
• Managing the control and data flow between them
• In a world that is homogeneous or heterogeneous
• Tasks
  – Local / remote
  – Own / third party
  – White, grey or black boxes
  – Reliable / fragile
  – Reserved / dynamic
  – Various underpinning infrastructure
  – Various access controls

BioExcel: Biomolecular recognition

What is a Workflow?
Automation
  – Automate computational aspects
  – Repetitive pipelines, sweep campaigns
Scaling – compute cycles
  – Make use of computational infrastructure & handle large data
Abstraction – people cycles
  – Shield complexity and incompatibilities
  – Report, re-use, evolve, share, compare
  – Repeat – Tweak – Repeat
  – First class commodities
Provenance – reporting
  – Capture, report and utilize log and data lineage auto-documentation
  – Traceable evolution, audit, transparency
  – Compare

With thanks to Bertram Ludäscher: WORKS 2015

Findable, Accessible, Interoperable, Reusable (Reproducible)

The humble Makefile

[Figure: dependency graph of Makefile targets. Node labels: default, clean, all, thesis.pdf, thesis.aux, thesis.bbl, thesis.tex, thesis.bib]

https://github.com/vak/makefile2dot
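The Makefile slide treats a build as a small workflow: targets are tasks and prerequisites are the data flow between them. A minimal sketch of that idea in Python (3.9+ for `graphlib`) follows. The target names come from the slide, but the edges between them are assumed for illustration, since the original diagram is not recoverable here; `RULES` and `build_order` are hypothetical names.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: target -> prerequisites. The edges are assumed for
# illustration; the slide only names the targets.
RULES = {
    "default": ["all"],
    "all": ["thesis.pdf"],
    "thesis.pdf": ["thesis.tex", "thesis.aux", "thesis.bbl"],
    "thesis.aux": ["thesis.tex"],
    "thesis.bbl": ["thesis.aux", "thesis.bib"],
}


def build_order(goal, rules):
    """Return goal and everything it depends on, prerequisites first."""
    needed = {}
    stack = [goal]
    while stack:
        target = stack.pop()
        if target in needed:
            continue
        # Source files have no rule, so they get an empty prerequisite list.
        needed[target] = rules.get(target, [])
        stack.extend(needed[target])
    # TopologicalSorter expects node -> predecessors, which is what we have.
    return list(TopologicalSorter(needed).static_order())


order = build_order("thesis.pdf", RULES)
print(order)
```

Only the targets reachable from the requested goal are scheduled, which is the same pruning `make` performs when asked for a single target.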

https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/

Laser Interferometer Gravitational-Wave Observatory – first detection of gravitational waves from colliding black holes

Workflow Environment Ecosystem

https://s.apache.org/existing-workflow-systems

Morphological, hemodynamic and structural analyses linked to aneurysm genesis, growth and rupture.

[Susheel Varma] http://www.vph-share.eu/

https://taverna.incubator.apache.org/

Apache Taverna

https://www.knime.org/ https://www.openphacts.org/

Pharmacological queries: target, compound and pathway data

doi:10.1371/journal.pone.0115460

http://www.myexperiment.org/workflows/4292

Galaxy https://usegalaxy.org/

doi:10.1186/s13742-016-0115-8


Science Workflows

Data wrangling & analytics + Simulations + Instrument pipelines

http://tpeterka.github.io/maui-project/

The Future of Scientific Workflows, Report of DOE Workshop 2015, http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pdf

What happens inside workflows?

Garijo et al: Common Motifs in Scientific Workflows: An Empirical Analysis doi:10.1016/j.future.2013.09.018

Stop Press!

GUIs not essential!
• Canvas, drag-drop blocks, arrows, run button
• Command-line & embedding in developer or user applications

Scripts can be workflows!
• WMS <-> Scripts
• Script vs Workflows/ASAP:
  – Automation: *****
  – Scaling: **
  – Abstraction: *
  – Provenance: **

Copernicus workflow engine for parallel adaptive molecular dynamics

• Peer-to-peer distributed computing platform
  – high-level parallelization of statistical sampling problems
• Consolidation of heterogeneous compute resources
• Automatic resource matching of jobs against compute resources
• Automatic fault tolerance of distributed work
• Workflow execution engine to define a problem (reporting) and trace its results live (provenance)
• Flexible plugin facilities
  – programs to be integrated to the workflow execution engine

Free Energy Workflow using GROMACS

http://copernicus-computing.org/

COMPSs/PyCOMPSs: Programmer Productivity framework
• Sequential programming
  – Parallelisation and distribution heavy-lifting
  – Dependency detection
• Infrastructure unaware
  – Abstract application from underlying infrastructure
  – Portability
• Standard Programming Languages
  – Java, Python, C/C++
• No (or few!) APIs
  – Standard Java

http://compss.bsc.es/
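The "sequential programming" idea above can be illustrated with the Python standard library alone. The sketch below is not the PyCOMPSs API: it uses `concurrent.futures` with explicit futures, whereas the slide describes dependency detection being done for the programmer. The functions `simulate`, `combine` and `run` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(seed):
    """Stand-in for an expensive, independent task."""
    return seed * seed


def combine(values):
    """Stand-in for a step that depends on all simulate() results."""
    return sum(values)


def run(seeds):
    # The body reads top to bottom like sequential code. The executor runs the
    # independent simulate() calls concurrently, and combine() blocks on each
    # result, which is where the data dependency is enforced.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(simulate, s) for s in seeds]
        return combine(f.result() for f in futures)


total = run([1, 2, 3, 4])
print(total)
```

Swapping the thread pool for a different executor changes where the tasks run without touching `simulate` or `combine`, which is a small-scale version of the "infrastructure unaware" point.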

https://www.nextflow.io/

Streaming
Parallelism
Checkpoints
Pluggable executors
Reproducibility

https://github.com/chapmanb/bcbio-nextgen

Distributed workflows for NGS
Domain-specific language

https://bcbio-nextgen.readthedocs.org

GUIs --> increase user uptake

http://www.myexperiment.org Find and Share

Running workflows, tracking provenance

ASAP
• common, interoperable provenance recording
  – W3C PROV

ASAP
• YesWorkflow.org
  – Annotations in script yield workflow view

ASAP
• Library profilers
  – noWorkflow
• runtime provenance recorders
  – Sumatra, RDataTracker
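To make "annotations in script yield workflow view" concrete, here is a small sketch of an ordinary Python script marked up with comment annotations in the style of YesWorkflow. It assumes the comment keywords `@begin`, `@end`, `@in` and `@out`; the block names, functions and data are hypothetical. The script runs unchanged with or without a tool reading the comments.

```python
# @begin clean_and_average
# @in readings
# @out average

# @begin drop_invalid
# @in readings
# @out valid
def drop_invalid(readings):
    """Keep only non-negative readings."""
    return [r for r in readings if r >= 0]
# @end drop_invalid


# @begin compute_mean
# @in valid
# @out average
def compute_mean(valid):
    """Arithmetic mean of a non-empty list."""
    return sum(valid) / len(valid)
# @end compute_mean

# @end clean_and_average

average = compute_mean(drop_invalid([4.0, -1.0, 6.0]))
print(average)
```

The shared name `valid` on an `@out` and an `@in` is what lets an annotation reader draw the edge between the two blocks.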

Provenance

W3C PROV Standard

Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. 

http://www.w3.org/TR/prov-overview/

Activities

What happened? When? Who?

What was used and generated?

Why was this workflow started?

Which workflow ran? Where?

Why do I need this?

i. To see which analysis was performed

ii. To find out who did what

iii. What was the metagenome used for?

iv. To understand the whole process: “make me a Methods section”

v. To track down inconsistencies

[Figure: PROV graph. Node labels: Sample, Sequencing, Metagenome, Workflow run, Workflow definition, Workflow server, Results, Alice, Lab technician. Edge labels: used, wasGeneratedBy, wasStartedAt ("2012-06-21"), wasAssociatedWith, wasInformedBy, wasStartedBy, hadPlan, hadRole]
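The questions on this slide ("which analysis was performed", "who did what") become simple graph queries once provenance is recorded. The sketch below stores PROV-style statements as plain triples and walks them backwards. The node and relation names are taken from the slide, but which nodes each relation connects is an assumption, because the diagram's edges are not recoverable from the transcript; `lineage` and `agents` are hypothetical helpers, not part of any PROV library.

```python
# Statements as (subject, relation, object) triples. The node and relation
# names come from the slide; which nodes each relation connects is assumed.
STATEMENTS = [
    ("Sequencing", "used", "Sample"),
    ("Metagenome", "wasGeneratedBy", "Sequencing"),
    ("Sequencing", "wasAssociatedWith", "Alice"),
    ("Workflow run", "used", "Metagenome"),
    ("Workflow run", "wasInformedBy", "Sequencing"),
    ("Workflow run", "wasAssociatedWith", "Workflow server"),
    ("Results", "wasGeneratedBy", "Workflow run"),
]


def lineage(item, statements):
    """Walk used / wasGeneratedBy links back from item, nearest first."""
    seen, queue = [], [item]
    while queue:
        current = queue.pop(0)
        for subject, relation, obj in statements:
            if subject == current and relation in ("used", "wasGeneratedBy"):
                if obj not in seen:
                    seen.append(obj)
                    queue.append(obj)
    return seen


def agents(item, statements):
    """Agents associated with any activity in the lineage of item."""
    nodes = set(lineage(item, statements))
    return sorted(
        obj
        for subject, relation, obj in statements
        if relation == "wasAssociatedWith" and subject in nodes
    )


print(lineage("Results", STATEMENTS))
print(agents("Results", STATEMENTS))
```

Reading the lineage list in reverse gives the raw material for a Methods section: sample, sequencing, metagenome, workflow run, results.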

Dependency Management

Codes Behaviours & Reliability

https://twitter.com/ianholmes/status/288689712636493824

Research Object Bundle
http://researchobject.org/

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects doi:10.1016/j.websem.2015.01.003

application/vnd.wf4ever.robundle+zip
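A Research Object Bundle is packaged as a ZIP file identified by the media type above. The sketch below builds a minimal one in memory with Python's `zipfile`. The layout used here (a `mimetype` entry stored first and uncompressed, and a JSON manifest at `.ro/manifest.json` with `@context` and `aggregates` keys) reflects my reading of the bundle specification and should be checked against it; `make_bundle` and the example file name are hypothetical.

```python
import io
import json
import zipfile

MEDIA_TYPE = "application/vnd.wf4ever.robundle+zip"


def make_bundle(files):
    """Return the bytes of a minimal bundle holding files (path -> bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # The media type goes first and uncompressed so it can be sniffed.
        bundle.writestr("mimetype", MEDIA_TYPE, compress_type=zipfile.ZIP_STORED)
        # Assumed manifest shape: a list of aggregated resources by path.
        manifest = {
            "@context": ["https://w3id.org/bundle/context"],
            "aggregates": [{"uri": "/" + path} for path in sorted(files)],
        }
        bundle.writestr(
            ".ro/manifest.json",
            json.dumps(manifest, indent=2),
            compress_type=zipfile.ZIP_DEFLATED,
        )
        for path, data in sorted(files.items()):
            bundle.writestr(path, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


# Hypothetical payload: one workflow definition file.
bundle_bytes = make_bundle({"workflow.cwl": b"cwlVersion: v1.0\n"})
with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
    names = bundle.namelist()
    media_type = bundle.read("mimetype").decode("ascii")
    manifest = json.loads(bundle.read(".ro/manifest.json"))
print(names)
```

Because the result is an ordinary ZIP archive, any unzip tool can open it; the manifest is what turns a pile of files into a described research object.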

Workflow Interoperability
• Common format for bioinformatics tool & workflow execution
• Community based standards effort
• Designed for clusters & clouds
• Supports the use of containers (e.g. Docker)
• Specify data dependencies between steps
• Scatter/gather on steps
• Nest workflows in steps

• Develop your pipeline on your local computer (optionally with Docker)
• Execute on your research cluster or in the cloud
• Deliver to users via workbenches

EDAM ontology (ELIXIR-DK) to specify file formats and reason about them: “FASTQ Sanger” encoding is a type of FASTQ file

http://commonwl.org/

http://www.commonwl.org/
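The EDAM point above (a “FASTQ Sanger” file is a kind of FASTQ file) amounts to a subsumption check when one step's output is wired to another step's input. A minimal sketch of that reasoning in Python follows. It uses plain labels rather than real EDAM identifiers; only the first hierarchy entry is from the slide, the second is an assumed extra, and `is_a` and `check_link` are hypothetical helpers rather than anything defined by CWL.

```python
# Format hierarchy as child -> parent. Only the first entry is from the slide;
# the second is an assumed extra for illustration.
PARENT = {
    "FASTQ Sanger": "FASTQ",
    "FASTQ Illumina": "FASTQ",
}


def is_a(fmt, expected, parent=PARENT):
    """True if fmt equals expected or is a descendant of it."""
    while fmt is not None:
        if fmt == expected:
            return True
        fmt = parent.get(fmt)  # None once we run past the top of the tree
    return False


def check_link(output_format, input_format):
    """An output can feed an input that accepts the same or a broader format."""
    return is_a(output_format, input_format)


print(check_link("FASTQ Sanger", "FASTQ"))
print(check_link("FASTQ", "FASTQ Sanger"))
```

The check is deliberately one-directional: a tool that needs the specific Sanger encoding cannot safely accept an arbitrary FASTQ file.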

Application Building Blocks
BioExcel Virtualised Software Library
“transversal workflow units”, higher level operations

● Task-specific “mini-workflow” fragments
  – e.g. using Gromacs, CPMD, HADDOCK
● Packaged
  – EGI VM images and Docker containers
● Backed by existing registries
  – ELIXIR’s bio.tools and EGI App DB
● Instantiated as cloud instances
  – private (Open Nebula, Open Stack)
  – public (e.g. Amazon AWS)

But which workflow system should I use..?