ENABLING REPRODUCIBLE IN-SILICO DATA ANALYSES WITH...
![Page 1: ENABLING REPRODUCIBLE IN-SILICO DATA ANALISES WITH …genoweb.toulouse.inra.fr/~formation/8_Galaxy_Admin/2019/... · 2019. 9. 26. · Platform Amazon Linux Debian Linux Mac OSX Number](https://reader035.fdocuments.in/reader035/viewer/2022071105/5fdf22df9ed16c20750b8181/html5/thumbnails/1.jpg)
ENABLING REPRODUCIBLE IN-SILICO DATA ANALYSES WITH NEXTFLOW
Paolo Di Tommaso, CRG
Wellcome Trust Sanger Institute, 1 May 2018, Cambridge
WHO IS THIS CHAP?
@PaoloDiTommaso
Research software engineer
Comparative Bioinformatics, Notredame Lab
Center for Genomic Regulation (CRG)
Author of the Nextflow project
AGENDA
• The challenges with computational workflows
• Nextflow main principles
• Handling parallelisation and portability
• Deployment scenarios
• Comparison with other tools
• Future plans
GENOMIC WORKFLOWS
• Data analysis applications that extract information from (large) genomic datasets
• Embarrassingly parallel: can spawn from 100s to 100k jobs over a distributed cluster
• Mash-up of many different tools and scripts
• Complex dependency trees and configuration → very fragile ecosystem
Steinbiss et al., Companion parasite genome annotation pipeline, DOI: 10.1093/nar/gkw292
Reproducing the result of a typical computational biology paper requires 280 hours: ≈1.7 months!
THE SAME APPLICATION DEPLOYED IN DIFFERENT ENVIRONMENTS PRODUCES DIFFERENT RESULTS (!)
| Platform | Amazon Linux | Debian Linux | Mac OSX |
|---|---|---|---|
| Number of chromosomes | 36 | 36 | 36 |
| Overall length (bp) | 32,032,223 | 32,032,223 | 32,032,223 |
| Number of genes | 7,781 | 7,783 | 7,771 |
| Gene density | 236.64 | 236.64 | 236.32 |
| Number of coding genes | 7,580 | 7,580 | 7,570 |
| Average coding length (bp) | 1,764 | 1,764 | 1,762 |
| Number of genes with multiple CDS | 113 | 113 | 111 |
| Number of genes with known function | 4,147 | 4,147 | 4,142 |
| Number of t-RNAs | 88 | 90 | 88 |

Comparison of the Companion pipeline annotation of the Leishmania infantum genome executed across different platforms *
* Di Tommaso P, et al., Nextflow enables computational reproducibility, Nature Biotech, 2017
CHALLENGES
• Reproducibility, i.e. replicate results over time
• Portability, i.e. run across different platforms
• Scalability, i.e. deploy big distributed workloads
• Usability, i.e. streamline execution and deployment of complex workloads: remove complexity instead of adding more
• Consistency, i.e. track changes and revisions consistently for code, config files and binary dependencies
PUSH-THE-BUTTON PIPELINES
HOW?
• Fast prototyping ⇒ custom DSL that enables task composition and simplifies most use cases + general purpose programming language for corner cases
• Easy parallelisation ⇒ declarative reactive programming model based on the dataflow paradigm, implicit portable parallelism
• Self-contained ⇒ functional approach: a task execution is idempotent, i.e. it cannot modify the state of other tasks + dependencies isolated with containers
• Portable deployments ⇒ executor abstraction layer + deployment configuration separated from implementation logic
Orchestration & Parallelisation
Scalability & Portability
Deployment & Reproducibility
containers
Git GitHub
TASK EXAMPLE

    bwa mem reference.fa sample.fq \
      | samtools sort -o sample.bam

    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam
        """
    }
TASKS COMPOSITION

    process index_sample {
        input:
        file 'sample.bam' from bam_ch

        output:
        file 'sample.bai' into bai_ch

        script:
        """
        samtools index sample.bam
        """
    }

    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam
        """
    }
DATAFLOW
• Declarative computational model for parallel process executions
• Processes wait for data; when an input set is ready, the process is executed
• They communicate using dataflow variables, i.e. async FIFO queues called channels
• Parallelisation and task dependencies are implicitly defined by process in/out declarations
HOW PARALLELISATION WORKS

    samples_ch = Channel.fromPath('data/sample.fastq')

    process FASTQC {
        input:
        file reads from samples_ch

        output:
        file 'fastqc_logs' into fastqc_ch

        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
    }

HOW PARALLELISATION WORKS

    samples_ch = Channel.fromPath('data/*.fastq')

    process FASTQC {
        input:
        file reads from samples_ch

        output:
        file 'fastqc_logs' into fastqc_ch

        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
    }
IMPLICIT PARALLELISM
[Diagram: Channel.fromPath("data/*.fastq") feeding the FASTQC process]
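How many FASTQC tasks run at the same time can be capped from the configuration file, without touching the pipeline script. The fragment below is a minimal sketch, assuming a Nextflow version that supports the `withName` process selector; the numeric limits are illustrative values, not figures from the talk.

```groovy
// nextflow.config (illustrative values, not taken from the slides)
executor {
    queueSize = 100          // upper bound on the tasks the executor handles at once
}

process {
    withName: FASTQC {
        maxForks = 4         // run at most 4 FASTQC tasks in parallel
    }
}
```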
SUPPORTED PLATFORMS
DEPLOYMENT SCENARIOS
LOCAL EXECUTION
• Common development scenario
• Dependencies can be managed using a container runtime
• Parallelisation is managed by spawning POSIX processes
• Can scale vertically using a fat server / shared-memory machine
[Diagram: nextflow on a laptop / workstation, with the OS, local storage and docker/singularity]
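A local setup of this kind could be described with a configuration fragment along the following lines. This is a sketch under assumptions: `user/image` is the placeholder image name used elsewhere in the slides, and the CPU and memory limits are made-up values for a hypothetical workstation.

```groovy
// nextflow.config for a laptop / workstation (illustrative values)
process {
    executor  = 'local'        // the default executor, spelled out for clarity
    container = 'user/image'   // placeholder image name
}

docker {
    enabled = true             // run each task inside the container
}

executor {
    cpus   = 8                 // total CPUs the local executor may use
    memory = '32 GB'           // total memory the local executor may use
}
```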
CENTRALISED ORCHESTRATION
• Nextflow orchestrates workflow execution by submitting jobs to a compute cluster, e.g. SLURM
• It can run on the head node or on a compute node
• Requires shared storage to exchange data between tasks
• Ideal for coarse-grained parallelism
[Diagram: nextflow submits jobs to the cluster nodes of a computer cluster, which share NFS/Lustre storage]
DISTRIBUTED ORCHESTRATION
[Diagram labels: login node, job request, launcher wrapper, HPC cluster with cluster nodes and NFS/Lustre storage, nextflow cluster made of one nextflow driver and several nextflow workers]
• A single job request allocates the desired compute nodes
• Nextflow deploys its own embedded compute cluster
• The main instance orchestrates the workflow execution
• The worker instances execute workflow jobs (work stealing approach)
KUBERNETES
• Next-generation cloud-native clustering for containerised workloads
• Workflow orchestration is still needed
• The latest NF version includes a new command that streamlines workflow deployment to K8s
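On the configuration side, targeting Kubernetes might look like the fragment below. It is only a sketch: the namespace, claim name and mount path are hypothetical placeholders, and a shared persistent volume claim is assumed to exist in the cluster already.

```groovy
// nextflow.config (all names are placeholders)
process {
    executor  = 'k8s'
    container = 'user/image'
}

k8s {
    namespace        = 'default'
    storageClaimName = 'my-claim'      // persistent volume claim shared by the tasks
    storageMountPath = '/workspace'    // where the claim is mounted inside each pod
}
```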
K8S DEPLOYMENT
PORTABILITY

    process {
        executor = 'slurm'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }
PORTABILITY

    process {
        executor = 'awsbatch'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }
CONFIGURATION DECOUPLING IS THE KEY TO PORTABLE DEPLOYMENTS
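One way to keep both targets in a single file is to use configuration profiles. The sketch below reuses the settings from the two PORTABILITY slides; the profile names `cluster` and `cloud` are hypothetical and chosen only for illustration.

```groovy
// nextflow.config
// Settings shared by every deployment target
process {
    memory    = '8 GB'
    cpus      = 4
    container = 'user/image'
}

// Target-specific settings, one profile per environment
profiles {
    cluster {
        process.executor = 'slurm'
        process.queue    = 'my-queue'
    }
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'my-queue'
    }
}
```

A profile would then be selected at launch time with the `-profile` option (for example `-profile cluster`), leaving the pipeline script itself unchanged.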
DEMO!
A QUICK COMPARISON
GALAXY vs. NEXTFLOW

| GALAXY | NEXTFLOW |
|---|---|
| Web based platform | Command line oriented tool |
| Built-in integration with many tools and datasets | Can incorporate any tool w/o any extra adapter |
| Little control over task parallelisation | Fine control over task parallelisation |
| Scalability 10⇒1K jobs | Scalability 100⇒1M jobs |
| Complex installation and maintenance | One-liner installer |
| Suited for training + inexperienced bioinformaticians | Suited for production workflows + experienced bioinformaticians |
SNAKEMAKE vs. NEXTFLOW

| SNAKEMAKE | NEXTFLOW |
|---|---|
| Command line oriented tool | Command line oriented tool |
| Pull model | Push model |
| Rules defined using file name patterns | Can manage any data structure |
| Compute DAG ahead | Compute DAG at runtime |
| Built-in support for Singularity | All major container runtimes |
| Custom scripts for cluster deployments | Built-in support for clusters and cloud |
| Support for sub-workflows | No support (yet) for sub-workflows |
| No support for source code management systems | Built-in support for Git/GitHub, etc., to manage pipeline revisions |
| Python based | Groovy/JVM based |
CWL vs. NEXTFLOW

| CWL | NEXTFLOW |
|---|---|
| Language specification | Language + app. runtime |
| Declarative meta-language (YAML/JSON) | DSL on top of a general purpose programming lang. |
| Verbose | Concise, fluent (at least tries to be!) |
| Committee driven | Community driven |
| Many vendors/implementations (and specification versions) | Single implementation, quick iterations |
CONTAINERISATION
• Nextflow envisioned the use of software containers to fix computational reproducibility
• Mar 2014 (ver 0.7), support for Docker
• Dec 2016 (ver 0.23), support for Singularity
[Diagram: Nextflow launching three jobs]
SINGULARITY FEATURES
Kurtzer et al. Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459
BENCHMARK*
* Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273 https://dx.doi.org/10.7717/peerj.1273
Container execution can have an impact on short-running tasks, i.e. < 1 min
SINGULARITY BENCHMARK
https://github.com/wresch/python_import_problem
The Singularity image format speeds up the execution of Python scripts with many imports from a shared file system!
WHEN TO USE CONTAINERS?
ALWAYS!
BEST PRACTICES
• Containers help to isolate dependencies from the dev or local deployment environment
• They provide a reproducible sandbox for third party users
• Binary images protect against software decay
• Make it transparent, i.e. always include the Dockerfile
• The Docker image format is the de-facto standard; it can be executed by different runtimes, e.g. Singularity, Shifter, uDocker, etc.
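The last point can be illustrated with a configuration that runs the same image through two different runtimes. This is a sketch rather than a recommended setup: the profile names are hypothetical, `user/image` is a placeholder, and the explicit `docker://` prefix assumes the image is published in a Docker registry that Singularity can pull from.

```groovy
// nextflow.config
process.container = 'user/image'      // placeholder Docker image name

profiles {
    docker {
        docker.enabled = true         // run tasks with the Docker runtime
    }
    singularity {
        singularity.enabled = true    // run tasks with the Singularity runtime
        // assumption: pull the same Docker image through the docker:// scheme
        process.container = 'docker://user/image'
    }
}
```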
ERROR RECOVERY
• Each task's outputs are saved in a separate directory
• This allows to safely record interrupted executions discarding
• Dramatically simplifies debugging!
• Computing resources can be defined in a *dynamic* manner, so that a failing task can be automatically re-executed with more memory, a longer timeout, etc.
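Dynamic resources of this kind can be expressed as closures that depend on the attempt number. The fragment below is a minimal sketch; the base values (8 GB, 1 hour) and the retry limit are arbitrary choices for illustration.

```groovy
// nextflow.config (illustrative values)
process {
    errorStrategy = 'retry'                      // resubmit a task when it fails
    maxRetries    = 3                            // give up after three resubmissions
    memory        = { 8.GB * task.attempt }      // 8 GB on the first attempt, 16 GB on the second, and so on
    time          = { 1.hour * task.attempt }    // the time limit grows in the same way
}
```

Because `task.attempt` starts at 1 and increases with every resubmission, each retry requests proportionally more memory and time than the previous one.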
EXECUTION REPORT
EXECUTION TIMELINE
DAG VISUALISATION
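The execution report, timeline and DAG rendering can be switched on from the configuration file. The fragment below is a sketch in which the output paths are hypothetical.

```groovy
// nextflow.config (output paths are placeholders)
report {
    enabled = true
    file    = 'results/report.html'      // execution report
}

timeline {
    enabled = true
    file    = 'results/timeline.html'    // execution timeline
}

dag {
    enabled = true
    file    = 'results/dag.html'         // workflow DAG rendering
}
```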
EDITORS !
WHAT'S NEXT
IMPROVEMENTS
• Built-in support for Bioconda recipes
• Better meta-data and provenance handling
• Workflow composition, aka sub-workflows
• More cloud support, i.e. Azure and GCP
APACHE SPARK
• Native support for Apache Spark clusters and execution model
• Allow hybrid Nextflow and Spark applications
• Mix the best of the two worlds: Nextflow for legacy tools/coarse-grained parallelisation and Spark for fine-grained/distributed execution, e.g. GATK4
• Participate in the Cloud Work Stream working group
• TES: Task Execution API (working prototype)
• WES: Workflow Execution API
• Enable interoperability with GA4GH-compliant platforms, e.g. Cancer Genomics Cloud and Broad FireCloud
WHO IS USING NEXTFLOW?
• Community effort to collect production-ready analysis pipelines built with Nextflow
• Initially supported by SciLifeLab, QBiC and A*Star Genome Institute Singapore
• https://nf-core.github.io
Alexander Peltzer, Phil Ewels, Andreas Wilm
CONCLUSION
• Data analysis reproducibility is hard and it's often underestimated.
• Nextflow does not provide a magic solution, but it enables best practices and provides support for community and industry standards.
• It strictly separates the application logic from the configuration and deployment logic, enabling self-contained workflows.
• Applications can be easily deployed across different environments in a reproducible manner with a single command.
• The functional/reactive model allows applications to scale to millions of jobs with ease.
ACKNOWLEDGMENT
Evan Floden
Emilio Palumbo
Cedric Notredame
Notredame Lab, CRG
http://nextflow.io