2016-10-20 BioExcel: Advances in Scientific Workflow Environments
2016-10-20 BioExcel Workflow Training, BSC, Barcelona
Advances in Scientific Workflow Environments
Carole Goble, Stian Soiland-Reyes
The University of Manchester
http://esciencelab.org.uk/
@soilandreyes
http://orcid.org/0000-0001-9842-9718
What is a Workflow?
• Orchestrating multiple computational tasks
• Managing the control and data flow between them
• In a world that is homogeneous or heterogeneous
• Tasks
– Local / remote
– Own / third party
– White, grey or black boxes
– Reliable / fragile
– Reserved / dynamic
– Various underpinning infrastructure
– Various access controls
BioExcel: Biomolecular recognition
What is a Workflow?
Automation
– Automate computational aspects
– Repetitive pipelines, sweep campaigns
Scaling – compute cycles
– Make use of computational infrastructure & handle large data
Abstraction – people cycles
– Shield complexity and incompatibilities
– Report, re-use, evolve, share, compare
– Repeat – Tweak – Repeat
– First class commodities
Provenance – reporting
– Capture, report and utilize log and data lineage auto-documentation
– Traceable evolution, audit, transparency
– Compare
With thanks to Bertram Ludascher: WORKS 2015
Findable, Accessible, Interoperable, Reusable (Reproducible)
The humble Makefile
[Figure: Makefile dependency graph with nodes default, clean, all, thesis.pdf, thesis.aux, thesis.bbl, thesis.tex, thesis.bib]
https://github.com/vak/makefile2dot
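A Makefile is a small workflow in that targets are built in dependency order. The sketch below illustrates that idea in Python using the file names from the slide. The edges between them are assumptions made for illustration (the slide's graph edges are not recoverable), and `build_order` is a hypothetical helper, not part of make or makefile2dot.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency edges: target -> set of prerequisites.
# Node names come from the slide; the edges themselves are assumed.
DEPENDENCIES = {
    "thesis.pdf": {"thesis.tex", "thesis.bbl"},
    "thesis.bbl": {"thesis.aux", "thesis.bib"},
    "thesis.aux": {"thesis.tex"},
    "thesis.tex": set(),
    "thesis.bib": set(),
}


def build_order(dependencies):
    """Return all targets in an order where prerequisites come first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for target in build_order(DEPENDENCIES):
        print("build", target)
```

Source files come out before the files derived from them, which is the ordering guarantee a workflow engine also has to provide.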
https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/
Laser Interferometer Gravitational-Wave Observatory – first detection of gravitational waves from colliding black holes
Morphological, hemodynamic and structural analyses linked to aneurysm genesis, growth and rupture.
[Susheel Varma] http://www.vph-share.eu/
https://taverna.incubator.apache.org/
Apache Taverna
https://www.knime.org/ https://www.openphacts.org/
Pharmacological queries
target, compound and pathway data
doi:10.1371/journal.pone.0115460
http://www.myexperiment.org/workflows/4292
Galaxy https://usegalaxy.org/
doi:10.1186/s13742-016-0115-8
Science Workflows
Data wrangling & analytics + Simulations + Instrument pipelines
http://tpeterka.github.io/maui-project/
The Future of Scientific Workflows, Report of DOE Workshop 2015, http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pdf
What happens inside workflows?
Garijo et al: Common Motifs in Scientific Workflows: An Empirical Analysis doi:10.1016/j.future.2013.09.018
Stop Press! GUIs not essential!
• Canvas, drag-drop blocks, arrows, run button
• Command-line & embedding in developer or user applications
Scripts can be workflows!
• WMS <-> Scripts
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
Copernicus workflow engine for parallel adaptive molecular dynamics
• Peer-to-peer distributed computing platform
– high-level parallelization of statistical sampling problems
• Consolidation of heterogeneous compute resources
• Automatic resource matching of jobs against compute resources
• Automatic fault tolerance of distributed work
• Workflow execution engine to define a problem (reporting) and trace its results live (provenance)
• Flexible plugin facilities
– programs to be integrated to the workflow execution engine
Free Energy Workflow using GROMACS
http://copernicus-computing.org/
COMPSs/PyCOMPSs: Programmer Productivity framework
• Sequential programming
– Parallelisation and distribution heavy-lifting
– Dependency detection
• Infrastructure unaware
– Abstract application from underlying infrastructure
– Portability
• Standard Programming Languages
– Java, Python, C/C++
• No (or few!) APIs
– Standard Java
http://compss.bsc.es/
https://www.nextflow.io/
• Streaming
• Parallelism
• Checkpoints
• Pluggable executors
• Reproducibility
https://github.com/chapmanb/bcbio-nextgen
Distributed workflows for NGS
Domain-specific language
https://bcbio-nextgen.readthedocs.org
Running workflows, tracking provenance
ASAP
• common, interoperable provenance recording
– W3C PROV
• YesWorkflow.org
– Annotations in script yield workflow view
• Library profilers
– noWorkflow
• runtime provenance recorders
– Sumatra, RDataTracker
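To make "annotations in script yield workflow view" concrete, here is a minimal sketch of an ordinary Python script marked up with YesWorkflow-style comment annotations (`@begin`, `@in`, `@out`, `@end`). The function, the step names and the data are hypothetical. The annotations are plain comments, so the script runs unchanged with or without any tooling; how a given YesWorkflow release renders them should be checked against its documentation.

```python
# @begin summarise_readings
# @in raw_readings
# @out mean_value


def summarise(raw_readings):
    """Drop invalid readings, then average the rest (hypothetical example)."""
    # @begin drop_invalid
    # @in raw_readings
    # @out valid_readings
    valid_readings = [r for r in raw_readings if r is not None and r >= 0]
    # @end drop_invalid

    # @begin compute_mean
    # @in valid_readings
    # @out mean_value
    mean_value = sum(valid_readings) / len(valid_readings)
    # @end compute_mean
    return mean_value


# @end summarise_readings

if __name__ == "__main__":
    print(summarise([1.0, None, -2.0, 3.0]))
```

The nested blocks describe a two-step dataflow (drop_invalid feeding compute_mean) without changing how the script executes, which is what lets a script gain some of the Abstraction and Provenance of a workflow system.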
Provenance
W3C PROV Standard
Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved.
http://www.w3.org/TR/prov-overview/
Activities
What happened? When? Who?
What was used and generated?
Why was this workflow started?
Which workflow ran? Where?
Why do I need this?
i. To see which analysis was performed
ii. To find out who did what
iii. What was the metagenome used for?
iv. To understand the whole process: “make me a Methods section”
v. To track down inconsistencies
[Figure: PROV graph. Node labels: Sample, Sequencing, Metagenome, Workflow run, Workflow definition, Workflow server, Results, Alice, Lab technician. Relation labels: used, wasGeneratedBy, wasAssociatedWith, wasInformedBy, wasStartedBy, hadPlan, hadRole, wasStartedAt "2012-06-21".]
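Questions such as "what was used and generated?" amount to walking PROV relations backwards. The sketch below stores a few statements as plain tuples and traces the lineage of an entity by alternating `wasGeneratedBy` and `used`. The node names are taken from the figure, but which nodes each relation connects is an assumption made for illustration, and `lineage` is a hypothetical helper rather than part of any PROV library.

```python
# Hypothetical PROV-style statements: (subject, relation, object).
# Relation names are from W3C PROV; the specific edges are assumed.
STATEMENTS = [
    ("Sequencing", "used", "Sample"),
    ("Metagenome", "wasGeneratedBy", "Sequencing"),
    ("Workflow run", "used", "Metagenome"),
    ("Results", "wasGeneratedBy", "Workflow run"),
    ("Sequencing", "wasAssociatedWith", "Alice"),
    ("Workflow run", "wasAssociatedWith", "Workflow server"),
]


def lineage(entity, statements):
    """List the activities and entities that `entity` depends on, nearest first."""
    generated_by = {s: o for s, rel, o in statements if rel == "wasGeneratedBy"}
    used = {}
    for s, rel, o in statements:
        if rel == "used":
            used.setdefault(s, []).append(o)

    trail = []
    seen = {entity}
    queue = [entity]
    while queue:
        current = queue.pop(0)
        activity = generated_by.get(current)
        if activity is None or activity in seen:
            continue  # a primary input, or an activity already visited
        seen.add(activity)
        trail.append(activity)
        for source in used.get(activity, []):
            if source not in seen:
                seen.add(source)
                trail.append(source)
                queue.append(source)
    return trail


if __name__ == "__main__":
    print(lineage("Results", STATEMENTS))
```

Reading the trail in reverse gives the raw material for a "Methods section": sample, sequencing, metagenome, workflow run.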
Dependency Management
Codes Behaviours & Reliability
https://twitter.com/ianholmes/status/288689712636493824
Research Object Bundle
http://researchobject.org/
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects doi:10.1016/j.websem.2015.01.003
application/vnd.wf4ever.robundle+zip
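As a rough illustration of what a bundle with this media type can look like, the sketch below packs files into a ZIP archive in memory. It assumes the layout follows the common ZIP-container convention of a leading, uncompressed `mimetype` entry plus a JSON manifest under `.ro/manifest.json`; the manifest fields shown are illustrative assumptions and should be checked against the Research Object Bundle specification. `make_bundle` is a hypothetical helper.

```python
import io
import json
import zipfile

MEDIA_TYPE = "application/vnd.wf4ever.robundle+zip"  # media type from the slide


def make_bundle(files):
    """Pack `files` (archive path -> bytes) into an in-memory ZIP bundle."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        # Assumed convention: 'mimetype' is the first entry, stored uncompressed.
        bundle.writestr("mimetype", MEDIA_TYPE, compress_type=zipfile.ZIP_STORED)
        for path, content in files.items():
            bundle.writestr(path, content)
        # Assumed manifest location and fields; illustrative only.
        manifest = {
            "@context": ["https://w3id.org/bundle/context"],
            "id": "/",
            "aggregates": [{"uri": "/" + path} for path in files],
        }
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


if __name__ == "__main__":
    data = make_bundle({"workflow.txt": b"step1 -> step2"})
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        print(bundle.namelist())
```

Keeping the workflow, its inputs and a manifest in one archive is what lets the bundle travel as a single shareable object.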
Workflow Interoperability
• Common format for bioinformatics tool & workflow execution
• Community based standards effort
• Designed for clusters & clouds
• Supports the use of containers (e.g. Docker)
• Specify data dependencies between steps
• Scatter/gather on steps
• Nest workflows in steps
• Develop your pipeline on your local computer (optionally with Docker)
• Execute on your research cluster or in the cloud
• Deliver to users via workbenches
• EDAM ontology (ELIXIR-DK) to specify file formats and reason about them: “FASTQ Sanger” encoding is a type of FASTQ file
http://commonwl.org/
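The "scatter/gather on steps" idea can be sketched in a few lines of plain Python: run the same step once per input (scatter), then collect the outputs in input order (gather). This is only an illustration of the pattern, not the workflow format itself; `count_gc` and `scatter_gather` are hypothetical names, and a local thread pool stands in for a cluster or cloud backend.

```python
from concurrent.futures import ThreadPoolExecutor


def count_gc(sequence):
    """One 'step': count G and C bases in a single sequence."""
    return sum(1 for base in sequence.upper() if base in "GC")


def scatter_gather(step, inputs, max_workers=4):
    """Scatter: run `step` once per input. Gather: return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(step, inputs))


if __name__ == "__main__":
    per_sequence = scatter_gather(count_gc, ["ACGT", "GGCC", "ATAT"])
    print(per_sequence, sum(per_sequence))
```

Because the step only sees one input at a time, the same definition can run sequentially on a laptop or fan out across many workers without being rewritten.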
Application Building Blocks
BioExcel Virtualised Software Library
“transversal workflow units”, higher level operations
● Task-specific “mini-workflow” fragments
– e.g. using Gromacs, CPMD, HADDOCK
● Packaged
– EGI VM images and Docker containers
● Backed by existing registries
– ELIXIR’s bio.tools and EGI App DB
● Instantiated as cloud instances
– private (Open Nebula, Open Stack)
– public (e.g. Amazon AWS)