Initial steps towards a production platform for DNA sequence analysis on the grid

Initial steps towards a production platform for DNA sequence analysis on the grid ISMB/ECCB conference 18 July 2011 Barbera van Schaik, Angela Luyf, Michel de Vries, Frank Baas, Antoine van Kampen and Silvia Olabarriaga [email protected]

description

Presented at the ISMB/ECCB 2011 conference. https://www.iscb.org/cms_addon/conferences/ismbeccb2011/highlights.php#HL13

Transcript of Initial steps towards a production platform for DNA sequence analysis on the grid

Page 1: Initial steps towards a production platform for DNA sequence analysis on the grid

Initial steps towards a production platform for DNA sequence analysis on the grid

ISMB/ECCB conference – 18 July 2011

Barbera van Schaik, Angela Luyf, Michel de Vries, Frank Baas, Antoine van Kampen and Silvia Olabarriaga

[email protected]

Page 2: Initial steps towards a production platform for DNA sequence analysis on the grid

Overview

Grid computing and workflow technology

Example: Virus discovery

Analysis of larger data sets

Example: Genome of the Netherlands

Challenges and summary

Page 4: Initial steps towards a production platform for DNA sequence analysis on the grid

What are the options?

Local cluster

Desktop grid

Supercomputer

Hadoop cluster

GPU cluster

Cloud computing

(Inter)national grid

DNA computing

National computing facilities

Each system has its own interface

Need to learn how they all work

Page 5: Initial steps towards a production platform for DNA sequence analysis on the grid

Grids

Distributed resources

Computing

Data storage

Open protocols

It's all about sharing

Resources

Methods

Collaborations

Page 6: Initial steps towards a production platform for DNA sequence analysis on the grid

Dutch grid (resources)


http://www.biggrid.nl/

Page 7: Initial steps towards a production platform for DNA sequence analysis on the grid

People, resources and data flow

My role

grid

Sequence facility

Research laboratories

Bioinformatics NGS team

e-BioScience team

Page 8: Initial steps towards a production platform for DNA sequence analysis on the grid

Example: Virus discovery

Virus discovery unit

VIDISCA method

GenBank - NR


Goal: Identify known and discover new viruses in samples

Michel de Vries et al. (2011) PLoS ONE

Page 9: Initial steps towards a production platform for DNA sequence analysis on the grid

BLAST analysis workflow

Input: sequence reads

Conversion step (sff to fasta)

BLAST

Output: BLAST results

Page 10: Initial steps towards a production platform for DNA sequence analysis on the grid

Workflow description (XML)

Implementation of workflow components: Component 1 (XML), Component 2 (XML)

Executable/script: sff2fasta.pl

In: sequences (sff)

Out: sequences (fasta)

Executable/script: BLAST

In: sequences (fasta)

In: database (fasta)

Out: blast result (txt)

X

Tristan Glatard (2008) Future Generation Computer Systems

http://gwendia.i3s.unice.fr/doku.php?id=gwendia
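The component descriptions on this slide (an executable plus typed inputs and outputs) can be mirrored in a few lines of Python. This is only an illustrative sketch, not the GWENDIA XML schema: the class names `Port` and `Component` and the helper `can_connect` are hypothetical, and only the executables and the port names and formats are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """A named input or output with a file format (hypothetical model)."""
    name: str
    fmt: str


@dataclass(frozen=True)
class Component:
    """A workflow component: an executable with typed inputs and outputs."""
    executable: str
    inputs: tuple
    outputs: tuple


# Port names and formats as listed on the slide.
SFF2FASTA = Component(
    executable="sff2fasta.pl",
    inputs=(Port("sequences", "sff"),),
    outputs=(Port("sequences", "fasta"),),
)
BLAST = Component(
    executable="BLAST",
    inputs=(Port("sequences", "fasta"), Port("database", "fasta")),
    outputs=(Port("blast result", "txt"),),
)


def can_connect(producer, consumer, input_name):
    """Return True if any output of `producer` has the format that the
    input `input_name` of `consumer` expects."""
    target = next((p for p in consumer.inputs if p.name == input_name), None)
    if target is None:
        return False
    return any(out.fmt == target.fmt for out in producer.outputs)


if __name__ == "__main__":
    # The conversion step can feed the BLAST "sequences" input.
    print(can_connect(SFF2FASTA, BLAST, "sequences"))
    # BLAST text output cannot feed the converter, which expects sff.
    print(can_connect(BLAST, SFF2FASTA, "sequences"))
```

Describing components by their ports rather than by their position in a script is what makes them reusable in other workflows.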

Page 11: Initial steps towards a production platform for DNA sequence analysis on the grid

Run workflow on the grid

Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology in Biomedicine

Tristan Glatard (2008) International Journal of High Performance Computing Applications

Page 12: Initial steps towards a production platform for DNA sequence analysis on the grid

Graphical user interface: VBrowser

http://www.vl-e.nl/vbrowser

Page 13: Initial steps towards a production platform for DNA sequence analysis on the grid

Workflow monitoring

Page 14: Initial steps towards a production platform for DNA sequence analysis on the grid

Speed up

BLAST

15 experiments, 722 samples

2 databases: human ribosomal, viruses

Total CPU time: 413 hrs (~17 days)

Elapsed time workflow: 13.7 hrs

= 30x speed up

Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
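The speed-up figure on this slide follows directly from the two reported times. A minimal check, assuming that a sequential run would take as long as the total CPU time (the variable names are illustrative):

```python
# Figures reported on the slide.
total_cpu_hours = 413.0   # summed CPU time of all grid jobs
elapsed_hours = 13.7      # wall-clock time of the whole workflow

# Speed-up relative to running every job one after another,
# under the assumption that sequential time equals total CPU time.
speedup = total_cpu_hours / elapsed_hours
cpu_days = total_cpu_hours / 24

print(f"speed-up: {speedup:.1f}x")       # speed-up: 30.1x
print(f"CPU time: {cpu_days:.1f} days")  # CPU time: 17.2 days
```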

Page 15: Initial steps towards a production platform for DNA sequence analysis on the grid

Benefits workflow technology

Agile development

Re-use of components

Iteration strategy

Knowledge about analysis steps captured in workflow
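An iteration strategy tells the workflow engine how to pair up items arriving on different inputs of a component. The sketch below shows two common strategies, dot product and cross product, in plain Python. It is an illustration only and not how the workflow engine implements them. The sample and database counts come from the virus discovery example; the item names, and the assumption that every sample is searched against both databases, are hypothetical.

```python
from itertools import product


def dot(left, right):
    """Dot product: pair the i-th item of one input with the i-th of the other."""
    if len(left) != len(right):
        raise ValueError("dot product needs inputs of equal length")
    return list(zip(left, right))


def cross(left, right):
    """Cross product: pair every item of one input with every item of the other."""
    return list(product(left, right))


# 722 samples and 2 databases, as in the virus discovery example.
samples = [f"sample_{i:03d}.fasta" for i in range(1, 723)]
databases = ["human_ribosomal", "viruses"]

# Assumption: each sample is searched against each database.
blast_tasks = cross(samples, databases)
print(len(blast_tasks))  # 1444 independent BLAST tasks

# Dot product example: each read file paired with its own mate file.
pairs = dot(["a_1.fastq", "b_1.fastq"], ["a_2.fastq", "b_2.fastq"])
print(pairs)
```

Because the tasks produced by a cross product are independent, they can be submitted to the grid in parallel.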

Page 16: Initial steps towards a production platform for DNA sequence analysis on the grid

Analysis of larger data sets

Genome of the Netherlands (GoNL)

Whole genome sequencing of 250 trios

Enrich biobanks

Reference set for disease studies

http://www.bbmri.nl/

http://www.nlgenome.nl/

770 samples

45 TB raw data

Many partners (data sharing)

Analysis on distributed sites

Page 17: Initial steps towards a production platform for DNA sequence analysis on the grid

GoNL alignment pipeline

BWA aln, sampe, sam-to-bam, sort bam, index

Picard mark duplicates

GATK realignment

GATK recalibration

Picard fix mates
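The first stage listed above (BWA aln, sampe, sam-to-bam, sort bam, index) can be written out as a sequence of command lines. The sketch below only builds the command strings and does not run them. It assumes samtools performs the sam-to-bam, sort and index steps, which the slide does not state, and it assumes the reference has already been indexed with `bwa index`. The function name and the output file naming are hypothetical and are not taken from the GoNL implementation.

```python
def bwa_alignment_commands(reference, fastq1, fastq2, prefix):
    """Build shell command lines for paired-end alignment with BWA
    (aln + sampe), followed by SAM-to-BAM conversion, sorting and indexing."""
    sai1 = f"{prefix}_1.sai"
    sai2 = f"{prefix}_2.sai"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Find suffix-array coordinates for each read file separately.
        f"bwa aln {reference} {fastq1} > {sai1}",
        f"bwa aln {reference} {fastq2} > {sai2}",
        # Combine both ends into paired-end alignments in SAM format.
        f"bwa sampe {reference} {sai1} {sai2} {fastq1} {fastq2} > {sam}",
        # Convert SAM to BAM.
        f"samtools view -b -S -o {bam} {sam}",
        # Sort by coordinate ("-o" is the samtools 1.x syntax; the 0.1.x
        # releases used "samtools sort in.bam out_prefix" instead).
        f"samtools sort -o {sorted_bam} {bam}",
        # Index the sorted BAM file.
        f"samtools index {sorted_bam}",
    ]


if __name__ == "__main__":
    for command in bwa_alignment_commands(
        "reference.fa", "Pair1.fastq", "Pair2.fastq", "sample"
    ):
        print(command)
```

Keeping command construction separate from execution makes it easier to wrap each step as a grid job with its own input and output files.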

Input: Pair1.fastq, Pair2.fastq, reference genome

Output: Result.bam

Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)

160 samples (478 lanes) are currently analyzed on the Dutch grid

Development and small tests:

Nov 22, 2010 - now

Analysis:

Mar 25, 2011 - Jul 15, 2011

Jobs: 13,981

Total CPU time: 5.5 years

Disk space used: 315 TB
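The reported figures allow a rough back-of-the-envelope estimate of the average job length and of how many cores were busy on average. This is a sketch under stated assumptions: the 5.5 CPU years is a rounded figure, a year is taken as 365 days, and the analysis period is treated as continuous. The derived numbers are estimates and were not reported in the talk.

```python
from datetime import date

# Figures reported on the slide.
analysis_start = date(2011, 3, 25)
analysis_end = date(2011, 7, 15)
jobs = 13981
cpu_years = 5.5

HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year

wall_days = (analysis_end - analysis_start).days
cpu_hours = cpu_years * HOURS_PER_YEAR

# Average CPU time per job, and average number of cores in use
# if the load had been spread evenly over the whole period.
avg_hours_per_job = cpu_hours / jobs
avg_busy_cores = cpu_hours / (wall_days * 24)

print(f"analysis period: {wall_days} days")                # 112 days
print(f"average CPU time per job: {avg_hours_per_job:.1f} h")  # 3.4 h
print(f"average cores in use: {avg_busy_cores:.1f}")       # 17.9
```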

Page 18: Initial steps towards a production platform for DNA sequence analysis on the grid

Challenges

• Error handling

• Data management

• Data protection

• Provenance tracking

• Transparent addition of other resources

Page 19: Initial steps towards a production platform for DNA sequence analysis on the grid

Summary

More research and development needed in e-bioscience

Latest IT infrastructures needed for scaling up NGS data analysis (grids, clouds, big clusters)

Workflow technology assists agile implementation of bioinformatics software

Separate workflow development from IT infrastructure for easier migration and expansion (middleware)

Page 20: Initial steps towards a production platform for DNA sequence analysis on the grid

Acknowledgements

Genome of the Netherlands, NL

Cisca Wijmenga

Morris Swertz

All project partners

Virus discovery unit, AMC

Lia van der Hoek

Michel de Vries

Department of genome analysis, AMC

Frank Baas

Ted Bradley

Marja Jakobs

Bioinformatics Laboratory, AMC

Antoine van Kampen

NGS bioinformatics team

Aldo Jongejan

Marcel Willemsen

e-Bioscience team

Silvia Olabarriaga

Angela Luyf

Mark Santcroos

Shayan Shahand

University of Amsterdam

Piter de Boer

BiG Grid

Jan Just Keijser

Tom Visser

Grid support

Modalis, France

Johan Montagnat

Creatis, France

Tristan Glatard

http://www.bioinformaticslaboratory.nl/

Page 21: Initial steps towards a production platform for DNA sequence analysis on the grid
Page 22: Initial steps towards a production platform for DNA sequence analysis on the grid


BWA on grid – component description

Page 23: Initial steps towards a production platform for DNA sequence analysis on the grid


BWA on grid – component description

Page 24: Initial steps towards a production platform for DNA sequence analysis on the grid


BWA on grid – workflow description

Page 25: Initial steps towards a production platform for DNA sequence analysis on the grid

e-BioInfra gateway

No grid certificate needed

Data upload via sFTP (intranet)

Synced with grid storage

Workflows are started from web page

http://orange.ebioscience.amc.nl/ebioinfragateway/

Page 26: Initial steps towards a production platform for DNA sequence analysis on the grid

Implemented workflow components for next generation sequencing

Existing software

• BLAST

• BLAT

• BWA

• Annovar

• Varscan

• Newbler

• FastQC

• Roche software

• GATK

• Picard

• Samtools

In-house software

• Data format converters

• Quality trimming

• Alternative splice product detection

• CDR3 detection (T- and B-cell variation)

• Genome comparison (small genomes)