Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
FGCS Forum, Roma, April 24, 2016. P. Missier
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Paolo Missier, Jacek Cała, Yaobo Xu, Eldarina Wijaya
School of Computing Science and Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, UK
The challenge
• Port an existing WES/WGS pipeline
• from HPC to a (public) cloud
• while achieving more flexibility and better abstraction
• with better performance than the equivalent HPC deployment
Scripted NGS data processing pipeline
• Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner
• Cleaning and duplicate elimination (Picard tools)
• Recalibration: corrects for systematic bias in the quality scores assigned by the sequencer (GATK)
• Coverage: computes the coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce the false-positive rate from the caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annovar functional annotations (e.g. MAF, synonymy, SNPs…) followed by in-house annotations
The original implementation

echo "Preparing directories $PICARD_OUTDIR and $PICARD_TEMP"
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP

echo "Starting PICARD to clean BAM files..."
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED

echo "Starting PICARD to remove duplicates..."
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
    METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true

echo "Adding read group information to bam file..."
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS \
    RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
    RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"

echo "Indexing bam files..."
samtools index $SORTED_BAM_FILE_NODUPS
• Pros
  • simplicity: 50-100 lines of bash code
  • the flexibility of the bash language
• Cons
  • embedded dependencies between steps
  • low-level configuration
Problem scale
Data stats per sample:
• 4 files per sample (2-lane, pair-end reads)
• ≈15 GB of compressed text data (gz)
• ≈40 GB of uncompressed text data (FASTQ)

Usually 30-40 input samples: 0.45-0.6 TB of compressed data, 1.2-1.6 TB uncompressed.
Most steps use 8-10 GB of reference data.
A small 6-sample run takes about 30 hours on the IGM HPC machine (Stage 1+2).
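The batch-level figures follow directly from the per-sample figures; a quick shell check, using only the sample counts and per-sample sizes quoted on this slide:

```shell
# Per-sample sizes from the slide: ~15 GB compressed (gz), ~40 GB FASTQ.
per_sample_gz=15
per_sample_fastq=40

# A batch is usually 30-40 samples.
min_gz=$((30 * per_sample_gz));    max_gz=$((40 * per_sample_gz))
min_raw=$((30 * per_sample_fastq)); max_raw=$((40 * per_sample_fastq))

echo "compressed:   ${min_gz}-${max_gz} GB   (i.e. 0.45-0.6 TB)"
echo "uncompressed: ${min_raw}-${max_raw} GB (i.e. 1.2-1.6 TB)"
```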
Scripts to workflow - Design
Design Cloud Deployment Execution Analysis
Theoretical advantages of using a workflow programming model:
• Better abstraction: easier to understand, share, maintain
• Better exploitation of data parallelism
• Extensible by wrapping new tools
Workflow Design

The original bash script (shown earlier) maps onto the workflow as "wrapper" blocks, which invoke the external tools, supported by utility blocks.
Workflow design
Conceptual vs. actual:
• 11 workflows
• 101 blocks overall, 28 of them tool blocks
Anatomy of a complex parallel dataflow
e-Science Central: a simple dataflow model…
Sample-split: parallel processing of the samples in a batch
Anatomy of a complex parallel dataflow… with hierarchical structure
Cloud Deployment
Design Cloud Deployment Execution Analysis
Scalability:
• exploiting data parallelism
• fewer installation/deployment requirements and staff hours required
• automated dependency management and packaging
• configurable to make the most efficient use of a cluster
Parallelism in the pipeline
[Diagram: three-stage parallel pipeline. Stage I: a sample split runs align-clean-recalibrate-coverage for each sample (Sample 1 … Sample N) in parallel. Stage II: a chromosome split parallelises variant calling and recalibration per chromosome (Chr1, Chr2, …, ChrM). Stage III: variant filtering and annotation run per sample again, yielding the annotated variants.]
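The sample-split and chromosome-split patterns can be sketched with plain bash job control. The stub functions below merely stand in for the real sub-workflows; all names and messages are illustrative, not part of the actual pipeline:

```shell
# Stubs standing in for the real sub-workflows (illustrative only).
stage1() { echo "align-clean-recalibrate-coverage: $1"; }  # per sample
stage2() { echo "call+recalibrate variants: $1"; }         # per chromosome
stage3() { echo "filter+annotate: $1"; }                   # per sample

samples="sample1 sample2 sample3"
chromosomes="chr1 chr2 chrM"

# Stage I: sample split - every sample can be processed concurrently.
for s in $samples; do stage1 "$s" & done; wait

# Stage II: chromosome split - variant calling runs per chromosome,
# across the whole batch of samples.
for c in $chromosomes; do stage2 "$c" & done; wait

# Stage III: back to per-sample parallelism for filtering and annotation.
for s in $samples; do stage3 "$s" & done; wait
```

The `wait` after each loop is the synchronisation barrier between stages: Stage II cannot start until every per-sample Stage I job has finished.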
Workflow on Azure Cloud – modular configuration
[Diagram: e-Science Central on Azure. An <<Azure VM>> runs the e-Science Central main server (JMS queue, REST API, Web UI), backed by the e-SC db backend and an Azure Blob store holding e-SC control data, workflow data and the e-SC blob store. Users submit workflow invocations from a web browser or a rich client app. Workflow engines run as <<worker role>> instances; the module configuration used here is 3 nodes, 24 cores.]

The modular architecture scales out by adding engine nodes.
Workflow and sub-workflows execution
[Diagram: the same deployment, with a workflow invocation executing on one engine (fragment). Executable blocks of the running workflow submit sub-workflow invocations back to the e-SC queue.]
Scripts to workflow
Design Cloud Deployment Execution Analysis
3. Execution
• Runtime monitoring
• Provenance collection
Performance
Configurations for the 3-VM experiments:
• HPC cluster (dedicated nodes): 3x 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
• Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, running Ubuntu 14.04
[Chart: response time [hh:mm] (00:00-72:00) against the number of samples (0-24), for 3 eng (24 cores), 6 eng (48 cores) and 12 eng (96 cores).]
Comparison with HPC
[Charts: (left) response time [hours] (0-72) against the number of input samples (0-24); (right) system throughput [GiB/hr] (0-6) against the size of the sample cohort [GiB] (50-400). Both compare HPC (3 compute nodes), Azure (3xD13 – SSD) – sync, and Azure (3xD13 – SSD) – chained.]
Scalability
There is little incentive to grow the VM pool beyond 6 engines
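One way to see why returns diminish: if some fraction of the pipeline does not parallelise, speedup saturates as engines are added. The sketch below applies Amdahl's law with a purely hypothetical serial fraction of 0.15, which is not a measured value from these experiments:

```shell
# Amdahl's law: speedup(n) = 1 / (s + (1-s)/n), with a hypothetical
# serial fraction s = 0.15 (illustrative only, not measured).
speedup() { awk -v n="$1" 'BEGIN { s = 0.15; printf "%.2f", 1/(s + (1-s)/n) }'; }

for n in 3 6 12 24; do
  echo "$n engines -> $(speedup "$n")x speedup"
done
```

Under this assumption, doubling from 6 to 12 engines raises the speedup only from about 3.4x to about 4.5x, while doubling the VM cost.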
Cost
Again, a 6 engine configuration achieves near-optimal cost/sample
[Charts: cost per sample [£] against the number of samples (0-24), and cost per GiB [£] against the size of the input data [GiB] (0-350), for the 3 eng (24 cores) configuration.]
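The cost-per-sample metric is simply total VM-hours times the hourly rate, divided by the batch size. The numbers below are hypothetical placeholders chosen for illustration (the deck reports only the charts, not the underlying prices):

```shell
# cost/sample = (number of VMs x hourly rate x run time) / number of samples.
# All inputs below are hypothetical: 6 engines at 0.50 GBP/VM-hour,
# an 8-hour run over a 24-sample batch.
cost_per_sample() {
  awk -v vms="$1" -v rate="$2" -v hours="$3" -v n="$4" \
    'BEGIN { printf "%.2f", vms * rate * hours / n }'
}

echo "cost/sample: £$(cost_per_sample 6 0.50 8 24)"
```

This also shows why cost/sample flattens with batch size: the per-run fixed VM-hours are amortised over more samples.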
Lessons learnt
Design Cloud Deployment Execution Analysis
Design:
• Better abstraction: easier to understand, share, maintain
• Better exploits data parallelism
• Extensible by wrapping new tools

Cloud deployment:
• Scalability
• Fewer installation/deployment requirements, staff hours required
• Automated dependency management, packaging
• Configurable to make the most efficient use of a cluster

Execution:
• Runtime monitoring
• Provenance collection

Analysis:
• Reproducibility
• Accountability