Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
FGCS Forum, Roma, April 24, 2016. P. Missier
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Paolo Missier, Jacek Cała, Yaobo Xu, Eldarina Wijaya
School of Computing Science and Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, UK
The challenge
• Port an existing WES/WGS pipeline
• from HPC to a (public) cloud
• while achieving more flexibility and better abstraction
• with better performance than the equivalent HPC deployment
Scripted NGS data processing pipeline
• Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner
• Cleaning and duplicate elimination (Picard tools)
• Recalibration: corrects for systematic bias in the quality scores assigned by the sequencer (GATK)
• Coverage: computes the coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce the false-positive rate from the caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annovar functional annotations (e.g. MAF, synonymy, SNPs…) followed by in-house annotations
The original implementation

echo "Preparing directories $PICARD_OUTDIR and $PICARD_TEMP"
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP

echo "Starting PICARD to clean BAM files..."
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED

echo "Starting PICARD to remove duplicates..."
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
    METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true

echo "Adding read group information to bam file..."
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS \
    RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
    RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"

echo "Indexing bam files..."
samtools index $SORTED_BAM_FILE_NODUPS
• Pros
  • simplicity: 50-100 lines of bash code
  • the flexibility of the bash language
• Cons
  • embedded dependencies between steps
  • low-level configuration
Problem scale
Data stats per sample:
• 4 files per sample (2-lane, pair-end reads)
• ≈15 GB of compressed text data (gz)
• ≈40 GB of uncompressed text data (FASTQ)

Usually 30-40 input samples: 0.45-0.6 TB of compressed data, 1.2-1.6 TB uncompressed.
Most steps use 8-10 GB of reference data.
A small 6-sample run takes about 30 hours on the IGM HPC machine (Stage 1+2).
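The batch-level figures follow directly from the per-sample figures; a quick shell check, using only the sample counts and per-sample sizes quoted on this slide:

```shell
# Per-sample sizes from the slide: ~15 GB compressed (gz), ~40 GB FASTQ.
per_sample_gz=15
per_sample_fastq=40

# A batch is usually 30-40 samples.
min_gz=$((30 * per_sample_gz));    max_gz=$((40 * per_sample_gz))
min_raw=$((30 * per_sample_fastq)); max_raw=$((40 * per_sample_fastq))

echo "compressed:   ${min_gz}-${max_gz} GB   (i.e. 0.45-0.6 TB)"
echo "uncompressed: ${min_raw}-${max_raw} GB (i.e. 1.2-1.6 TB)"
```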
Scripts to workflow - Design
Design Cloud Deployment Execution Analysis
Theoretical advantages of using a workflow programming model:
• Better abstraction: easier to understand, share, maintain
• Better exploitation of data parallelism
• Extensible by wrapping new tools
Workflow Design

The original bash script (shown earlier) maps onto the workflow as "wrapper" blocks, which invoke the external tools, supported by utility blocks.
Workflow design
Conceptual vs. actual:
• 11 workflows
• 101 blocks overall, 28 of them tool blocks
Anatomy of a complex parallel dataflow
e-Science Central: a simple dataflow model…
Sample-split: parallel processing of the samples in a batch
Anatomy of a complex parallel dataflow… with hierarchical structure
Cloud Deployment
Design Cloud Deployment Execution Analysis
Scalability:
• exploiting data parallelism
• fewer installation/deployment requirements and staff hours required
• automated dependency management and packaging
• configurable to make the most efficient use of a cluster
Parallelism in the pipeline
[Diagram: three-stage parallel pipeline. Stage I: a sample split runs align-clean-recalibrate-coverage for each sample (Sample 1 … Sample N) in parallel. Stage II: a chromosome split parallelises variant calling and recalibration per chromosome (Chr1, Chr2, …, ChrM). Stage III: variant filtering and annotation run per sample again, yielding the annotated variants.]
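The sample-split and chromosome-split patterns can be sketched with plain bash job control. The stub functions below merely stand in for the real sub-workflows; all names and messages are illustrative, not part of the actual pipeline:

```shell
# Stubs standing in for the real sub-workflows (illustrative only).
stage1() { echo "align-clean-recalibrate-coverage: $1"; }  # per sample
stage2() { echo "call+recalibrate variants: $1"; }         # per chromosome
stage3() { echo "filter+annotate: $1"; }                   # per sample

samples="sample1 sample2 sample3"
chromosomes="chr1 chr2 chrM"

# Stage I: sample split - every sample can be processed concurrently.
for s in $samples; do stage1 "$s" & done; wait

# Stage II: chromosome split - variant calling runs per chromosome,
# across the whole batch of samples.
for c in $chromosomes; do stage2 "$c" & done; wait

# Stage III: back to per-sample parallelism for filtering and annotation.
for s in $samples; do stage3 "$s" & done; wait
```

The `wait` after each loop is the synchronisation barrier between stages: Stage II cannot start until every per-sample Stage I job has finished.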
Workflow on Azure Cloud – modular configuration
[Diagram: e-Science Central on Azure. An <<Azure VM>> runs the e-Science Central main server (JMS queue, REST API, Web UI), backed by the e-SC db backend and an Azure Blob store holding e-SC control data, workflow data and the e-SC blob store. Users submit workflow invocations from a web browser or a rich client app. Workflow engines run as <<worker role>> instances; the module configuration used here is 3 nodes, 24 cores.]

The modular architecture scales out by adding engine nodes.
Workflow and sub-workflows execution
[Diagram: the same deployment, with a workflow invocation executing on one engine (fragment). Executable blocks of the running workflow submit sub-workflow invocations back to the e-SC queue.]
Scripts to workflow
Design Cloud Deployment Execution Analysis
3. Execution
• Runtime monitoring
• Provenance collection
Performance
Configurations for the 3-VM experiments:
• HPC cluster (dedicated nodes): 3x 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
• Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, running Ubuntu 14.04
[Chart: response time [hh:mm] (00:00-72:00) against the number of samples (0-24), for 3 eng (24 cores), 6 eng (48 cores) and 12 eng (96 cores).]
Comparison with HPC
[Charts: (left) response time [hours] (0-72) against the number of input samples (0-24); (right) system throughput [GiB/hr] (0-6) against the size of the sample cohort [GiB] (50-400). Both compare HPC (3 compute nodes), Azure (3xD13 – SSD) – sync, and Azure (3xD13 – SSD) – chained.]
Scalability
There is little incentive to grow the VM pool beyond 6 engines
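One way to see why returns diminish: if some fraction of the pipeline does not parallelise, speedup saturates as engines are added. The sketch below applies Amdahl's law with a purely hypothetical serial fraction of 0.15, which is not a measured value from these experiments:

```shell
# Amdahl's law: speedup(n) = 1 / (s + (1-s)/n), with a hypothetical
# serial fraction s = 0.15 (illustrative only, not measured).
speedup() { awk -v n="$1" 'BEGIN { s = 0.15; printf "%.2f", 1/(s + (1-s)/n) }'; }

for n in 3 6 12 24; do
  echo "$n engines -> $(speedup "$n")x speedup"
done
```

Under this assumption, doubling from 6 to 12 engines raises the speedup only from about 3.4x to about 4.5x, while doubling the VM cost.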
Cost
Again, a 6 engine configuration achieves near-optimal cost/sample
[Charts: cost per sample [£] against the number of samples (0-24), and cost per GiB [£] against the size of the input data [GiB] (0-350), for the 3 eng (24 cores) configuration.]
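The cost-per-sample metric is simply total VM-hours times the hourly rate, divided by the batch size. The numbers below are hypothetical placeholders chosen for illustration (the deck reports only the charts, not the underlying prices):

```shell
# cost/sample = (number of VMs x hourly rate x run time) / number of samples.
# All inputs below are hypothetical: 6 engines at 0.50 GBP/VM-hour,
# an 8-hour run over a 24-sample batch.
cost_per_sample() {
  awk -v vms="$1" -v rate="$2" -v hours="$3" -v n="$4" \
    'BEGIN { printf "%.2f", vms * rate * hours / n }'
}

echo "cost/sample: £$(cost_per_sample 6 0.50 8 24)"
```

This also shows why cost/sample flattens with batch size: the per-run fixed VM-hours are amortised over more samples.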
Lessons learnt
Design Cloud Deployment Execution Analysis
Design:
• Better abstraction: easier to understand, share, maintain
• Better exploits data parallelism
• Extensible by wrapping new tools

Cloud deployment:
• Scalability
• Fewer installation/deployment requirements, staff hours required
• Automated dependency management, packaging
• Configurable to make the most efficient use of a cluster

Execution:
• Runtime monitoring
• Provenance collection

Analysis:
• Reproducibility
• Accountability