Initial steps towards a production platform for DNA sequence analysis on the grid

Post on 10-May-2015


Presented at the ISMB/ECCB 2011 conference. https://www.iscb.org/cms_addon/conferences/ismbeccb2011/highlights.php#HL13


Initial steps towards a production platform for DNA sequence analysis on the grid

ISMB/ECCB conference – 18 July 2011

Barbera van Schaik, Angela Luyf, Michel de Vries, Frank Baas, Antoine van Kampen and Silvia Olabarriaga

b.d.vanschaik@amc.uva.nl

Overview

Grid computing and workflow technology

Example: Virus discovery

Analysis of larger data sets

Example: Genome of the Netherlands

Challenges and summary

What are the options?

Local cluster

Desktop grid

Supercomputer

Hadoop cluster

GPU cluster

Cloud computing

(Inter)national grid

DNA computing

National computing facilities

Each system has its own interface

Need to learn how they all work

Grids

Distributed resources

Computing

Data storage

Open protocols

It's all about sharing

Resources

Methods

Collaborations

Dutch grid (resources)


http://www.biggrid.nl/

People, resources and data flow

My role

grid

Sequence facility

Research laboratories

Bioinformatics NGS team

e-BioScience team

Example: Virus discovery

Virus discovery unit

VIDISCA method

GenBank - NR

[Figure: experiment labels exp1, exp2, exp3, exp6]

Goal: Identify known and discover new viruses in samples

Michel de Vries et al. (2011) PLoS ONE

BLAST analysis workflow

Input: sequence reads

Conversion step (sff to fasta)

BLAST

Output: BLAST results

Workflow description (XML)

Component 1 (XML) Component 2 (XML)

Implementation of workflow components

Executable/script: BLAST

In: sequences (fasta)

In: database (fasta)

Out: blast result (txt)

Executable/script: sff2fasta.pl

In: sequences (sff)

Out: sequences (fasta)

X

Tristan Glatard (2008) Future generation computer systems

http://gwendia.i3s.unice.fr/doku.php?id=gwendia
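The two workflow components described on this slide can be modelled as small records with named, typed ports. The sketch below is only an illustration in Python, not GWENDIA syntax: the `Port`, `Component` and `can_link` names are hypothetical, and the only information taken from the slide is the executable names and the file formats of their inputs and outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    fmt: str  # file format carried by the port, e.g. "sff", "fasta", "txt"


@dataclass(frozen=True)
class Component:
    executable: str
    inputs: tuple
    outputs: tuple


# Port names and formats follow the slide; everything else is illustrative.
SFF2FASTA = Component(
    executable="sff2fasta.pl",
    inputs=(Port("sequences", "sff"),),
    outputs=(Port("sequences", "fasta"),),
)

BLAST = Component(
    executable="BLAST",
    inputs=(Port("sequences", "fasta"), Port("database", "fasta")),
    outputs=(Port("blast result", "txt"),),
)


def can_link(producer, out_name, consumer, in_name):
    """Return True when the named output and input ports share a format."""
    out = next(p for p in producer.outputs if p.name == out_name)
    inp = next(p for p in consumer.inputs if p.name == in_name)
    return out.fmt == inp.fmt


# The converter's fasta output can feed the BLAST "sequences" input.
LINK_OK = can_link(SFF2FASTA, "sequences", BLAST, "sequences")
print(LINK_OK)
```

Keeping the port formats explicit is what allows a component such as the converter to be reused in other workflows without touching the BLAST step.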

Run workflow on the grid

Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology in Biomedicine

Tristan Glatard (2008) International Journal of High Performance Computing Applications

Graphical user interface: VBrowser

http://www.vl-e.nl/vbrowser

Workflow monitoring

Speed up

BLAST

15 experiments, 722 samples

2 databases: human ribosomal, viruses

Total CPU time: 413 hrs (~17 days)

Elapsed time workflow: 13.7 hrs

= 30x speed up

Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
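The speed-up figure on this slide follows from dividing the total CPU time by the elapsed time of the workflow. A minimal check of that arithmetic, assuming the 413 CPU hours represent what a single core would need if all jobs ran one after another (the function name is illustrative):

```python
def speed_up(total_cpu_hours, elapsed_hours):
    """Ratio of sequential CPU time to wall-clock time of the parallel run."""
    return total_cpu_hours / elapsed_hours


CPU_HOURS = 413.0      # total CPU time reported on the slide
ELAPSED_HOURS = 13.7   # elapsed time of the workflow reported on the slide

cpu_days = CPU_HOURS / 24
ratio = speed_up(CPU_HOURS, ELAPSED_HOURS)

print(f"CPU time in days: {cpu_days:.1f}")
print(f"Speed-up: {ratio:.1f}x")
```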

Benefits workflow technology

Agile development

Re-use of components

Iteration strategy

Knowledge about analysis steps captured in workflow
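An iteration strategy describes how a component is applied when it receives lists of inputs. The sketch below contrasts two common strategies, a cross product (every item of one list combined with every item of the other) and a dot product (items paired by position). It assumes that the BLAST workflow pairs every sample with every database; the file and database names are hypothetical placeholders.

```python
from itertools import product


def cross_product(left, right):
    """Combine every item of `left` with every item of `right`."""
    return list(product(left, right))


def dot_product(left, right):
    """Pair items by position; both lists must have the same length."""
    if len(left) != len(right):
        raise ValueError("dot product needs lists of equal length")
    return list(zip(left, right))


# Hypothetical inputs, for illustration only.
samples = ["sample_a.fasta", "sample_b.fasta", "sample_c.fasta"]
databases = ["human_ribosomal", "viruses"]

tasks = cross_product(samples, databases)
for sample, database in tasks:
    print(f"BLAST {sample} against {database}")
```

Under the same assumption, 722 samples and 2 databases would give 722 × 2 = 1444 sample/database combinations, each of which can run independently on the grid.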

Analysis of larger data sets

Genome of the Netherlands (GoNL)

Whole genome sequencing of 250 trios

Enrich biobanks

Reference set for disease studies

http://www.bbmri.nl/

http://www.nlgenome.nl/

770 samples, 45 TB raw data

Many partners (data sharing)

Analysis on distributed sites

GoNL alignment pipeline

BWA aln, sampe, sam-to-bam, sort bam, index

Picard mark duplicates

GATK realignment

GATK recalibration

Picard fix mates

Pair1.fastq

Pair2.fastq

Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)

Reference genome

Result.bam
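The first stage of this pipeline (BWA aln, sampe, sam-to-bam, sort, index) can be sketched as a list of command lines. This is an illustration only and not the GoNL implementation: the intermediate file names and the `bwa_stage_commands` helper are hypothetical, the commands are built but not executed, and `samtools sort -o` follows the syntax of samtools 1.x, whereas the versions available in 2011 took an output prefix instead. The Picard and GATK stages are left out.

```python
import shlex


def bwa_stage_commands(ref, fq1, fq2, prefix):
    """Build the command lines for the BWA/samtools stage of one lane."""
    sai1, sai2 = f"{prefix}_1.sai", f"{prefix}_2.sai"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["bwa", "aln", "-f", sai1, ref, fq1],                  # align read 1
        ["bwa", "aln", "-f", sai2, ref, fq2],                  # align read 2
        ["bwa", "sampe", "-f", sam, ref, sai1, sai2, fq1, fq2],  # pair the ends
        ["samtools", "view", "-b", "-S", "-o", bam, sam],      # SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],           # coordinate sort
        ["samtools", "index", sorted_bam],                     # BAM index
    ]


commands = bwa_stage_commands("reference.fa", "Pair1.fastq", "Pair2.fastq", "Result")
for cmd in commands:
    print(shlex.join(cmd))
```

Returning argument lists rather than shell strings keeps file names with spaces safe if the commands are later passed to `subprocess.run`.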

160 samples (478 lanes) are currently analyzed on the Dutch grid

Development and small tests:

Nov 22, 2010 - now

Analysis:

Mar 25, 2011 - Jul 15, 2011

Jobs: 13,981

Total CPU time: 5.5 years

Disk space used: 315 TB
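The totals above can be turned into rough per-job and per-sample averages. The sketch below assumes a year of 365.25 days and treats every job as part of the analysis; the reported job count may also include tests or resubmitted jobs, so these are indicative values only.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average calendar year


def run_averages(jobs, cpu_years, samples, lanes):
    """Derive rough averages from the totals reported on the slide."""
    cpu_hours = cpu_years * HOURS_PER_YEAR
    return {
        "cpu_hours_per_job": cpu_hours / jobs,
        "lanes_per_sample": lanes / samples,
        "jobs_per_lane": jobs / lanes,
    }


averages = run_averages(jobs=13_981, cpu_years=5.5, samples=160, lanes=478)
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```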

Challenges

• Error handling

• Data management

• Data protection

• Provenance tracking

• Transparent addition of other resources

Summary

More research and development needed in e-bioscience

Latest IT infrastructures needed for scaling up NGS data analysis (grids, clouds, big clusters)

Workflow technology assists agile implementation of bioinformatics software

Separate workflow development from IT infrastructure for easier migration and expansion (middleware)

Acknowledgements

Genome of the Netherlands, NL

Cisca Wijmenga

Morris Swertz

All project partners

Virus discovery unit, AMC

Lia van der Hoek

Michel de Vries

Department of genome analysis, AMC

Frank Baas

Ted Bradley

Marja Jakobs

Bioinformatics Laboratory, AMC

Antoine van Kampen

NGS bioinformatics team

Aldo Jongejan

Marcel Willemsen

e-Bioscience team

Silvia Olabarriaga

Angela Luyf

Mark Santcroos

Shayan Shahand

University of Amsterdam

Piter de Boer

BiG Grid

Jan Just Keijser

Tom Visser

Grid support

Modalis, France

Johan Montagnat

Creatis, France

Tristan Glatard

http://www.bioinformaticslaboratory.nl/


BWA on grid – component description


BWA on grid – component description


BWA on grid – workflow description

e-BioInfra gateway

No grid certificate needed

Data upload via sFTP (intranet)

Synced with grid storage

Workflows are started from web page

http://orange.ebioscience.amc.nl/ebioinfragateway/

Implemented workflow components for next generation sequencing

Existing software

• BLAST

• BLAT

• BWA

• Annovar

• Varscan

• Newbler

• FastQC

• Roche software

• GATK

• Picard

• Samtools

In-house software

• Data format converters

• Quality trimming

• Alternative splice product detection

• CDR3 detection (T- and B-cell variation)

• Genome comparison (small genomes)