Initial steps towards a production platform for DNA sequence analysis on the grid

ISMB/ECCB conference – 18 July 2011

Barbera van Schaik, Angela Luyf, Michel de Vries, Frank Baas, Antoine van Kampen and Silvia Olabarriaga

b.d.vanschaik@amc.uva.nl

Overview

Grid computing and workflow technology

Example: Virus discovery

Analysis of larger data sets

Example: Genome of the Netherlands

Challenges and summary

Sequencing, Moore’s law and personnel

http://www.politigenomics.com/2009/02/the-scale-up.html

Acceleration

Note: Only slope is meaningful in this graph

What are the options?

Local cluster

Desktop grid

Super computer

Hadoop cluster

GPU cluster

Cloud computing

(Inter)national grid

DNA computing

National computing facilities

Each system has its own interface

Need to learn how they all work

Distributed resources

Computing

Data storage

Open protocols

It's all about sharing

Resources

Methods

Collaborations

Dutch grid (resources)

http://www.biggrid.nl/

People, resources and data flow

My role

Sequence facility

Research laboratories

Bioinformatics NGS team

e-BioScience team

Example: Virus discovery

Virus discovery unit

VIDISCA method

GenBank - NR

Goal: Identify known and discover new viruses in samples

Michel de Vries et al (2011) PLoS ONE

BLAST analysis workflow

Input: sequence reads

Conversion step (sff to fasta)

Output: BLAST results

Workflow description (XML)

Component 1 (XML) Component 2 (XML)

Implementation of workflow components

Executable/script:

sff2fasta.pl

In: sequences(fasta)

In: database(fasta)

Out: blast result

In: sequences(sff)

Out: sequences(fasta)

Tristan Glatard (2008) Future generation computer systems

http://gwendia.i3s.unice.fr/doku.php?id=gwendia
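
The two components above, each with declared inputs and outputs, can be sketched in Python as follows. This is only an illustration of the idea, not the XML format used on the grid: the `Component` class, the port names and the file names are hypothetical, and both functions are stand-ins that derive file names without running sff2fasta.pl or BLAST.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Component:
    """One workflow component: named inputs, named outputs, and a callable."""
    name: str
    inputs: List[str]
    outputs: List[str]
    run: Callable[[Dict[str, str]], Dict[str, str]]


def convert_sff(data):
    # Stand-in for the sff2fasta.pl script; only the output file name is derived.
    return {"sequences_fasta": data["sequences_sff"].replace(".sff", ".fasta")}


def run_blast(data):
    # Stand-in for the BLAST executable; no search is actually performed.
    return {"blast_result": data["sequences_fasta"] + ".blast"}


WORKFLOW = [
    Component("sff2fasta", ["sequences_sff"], ["sequences_fasta"], convert_sff),
    Component("blast", ["sequences_fasta", "database_fasta"], ["blast_result"], run_blast),
]


def run_workflow(components, data):
    """Run components in order, checking that each has its declared inputs."""
    data = dict(data)
    for comp in components:
        missing = [port for port in comp.inputs if port not in data]
        if missing:
            raise KeyError(f"{comp.name} is missing inputs: {missing}")
        data.update(comp.run(data))
    return data


result = run_workflow(
    WORKFLOW, {"sequences_sff": "exp1.sff", "database_fasta": "viruses.fasta"}
)
print(result["blast_result"])
```

Because each component only names its inputs and outputs, the conversion step can be reused in other workflows without change.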

Run workflow on the grid

Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology in Biomedicine

Tristan Glatard (2008) International Journal of High Performance Computing Applications

Graphical user interface: VBrowser

Workflow monitoring

Speed up

15 experiments, 722 samples

2 databases: Human ribosomal, Viruses

Total CPU time: 413 hrs (~17 days)

Elapsed time workflow: 13.7 hrs

= 30x speed up
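
The speed-up figure follows from dividing the total CPU time by the elapsed workflow time. A small check using the two numbers reported on this slide (the variable names are illustrative):

```python
# Figures reported for the virus discovery BLAST workflow.
total_cpu_hours = 413.0   # summed CPU time over all grid jobs
elapsed_hours = 13.7      # wall-clock time of the whole workflow

speed_up = total_cpu_hours / elapsed_hours
cpu_days = total_cpu_hours / 24

print(f"{speed_up:.1f}x speed-up, {cpu_days:.1f} CPU days")
```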

Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics

Benefits workflow technology

Agile development

Re-use of components

Iteration strategy

Knowledge about analysis steps captured in workflow
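
An iteration strategy describes how lists of inputs are combined into individual component invocations. Two common strategies are a cross product (every combination) and a dot product (pairwise). The sketch below shows both with Python's standard library; the file names are hypothetical, and the slides do not state which strategy was used for a given workflow.

```python
from itertools import product

# Hypothetical input lists.
samples = ["sample1.fasta", "sample2.fasta", "sample3.fasta"]
databases = ["human_ribosomal.fasta", "viruses.fasta"]
labels = ["run_a", "run_b", "run_c"]

# Cross product: every sample is combined with every database.
cross_jobs = list(product(samples, databases))

# Dot product: the i-th sample is paired with the i-th label.
dot_jobs = list(zip(samples, labels))

print(len(cross_jobs), "cross-product jobs")
print(len(dot_jobs), "dot-product jobs")
```

With a cross product, adding a database to the list adds one job per sample without any change to the components themselves.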

Analysis of larger data sets

Genome of the Netherlands (GoNL)

Whole genome sequencing of 250 trios

Enrich biobanks

Reference set for disease studies

http://www.bbmri.nl/

http://www.nlgenome.nl/

770 samples, 45 TB raw data

Many partners (data sharing)

Analysis on distributed sites

GoNL alignment pipeline

BWA aln, sampe, sam-to-bam, sort bam, index

Picard mark duplicates

GATK realignment

GATK recalibration

Picard fix mates

Pair1.fastq

Pair2.fastq

Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)

Reference genome

Result.bam
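
The first pipeline step (BWA aln, sampe, sam-to-bam, sort, index) can be sketched as a list of command lines. This is an assumption-laden illustration, not the GoNL implementation: the file names are hypothetical, samtools 1.x option syntax is assumed, the reference is assumed to be indexed with `bwa index` already, and the Picard and GATK steps are left out because their exact invocations are not given in the slides. The commands are only built and printed, not executed.

```python
def bwa_alignment_commands(ref, fq1, fq2, prefix):
    """Return (argv, stdout_path) pairs for the BWA/samtools part of the pipeline.

    stdout_path is the file that should capture standard output, or None
    when the tool writes its own output file.
    """
    sai1, sai2 = f"{prefix}_1.sai", f"{prefix}_2.sai"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        (["bwa", "aln", ref, fq1], sai1),                       # align first read of each pair
        (["bwa", "aln", ref, fq2], sai2),                       # align second read of each pair
        (["bwa", "sampe", ref, sai1, sai2, fq1, fq2], sam),     # combine into paired-end SAM
        (["samtools", "view", "-b", "-o", bam, sam], None),     # SAM to BAM
        (["samtools", "sort", "-o", sorted_bam, bam], None),    # coordinate sort
        (["samtools", "index", sorted_bam], None),              # index the sorted BAM
    ]


commands = bwa_alignment_commands("reference.fa", "Pair1.fastq", "Pair2.fastq", "sample")
for argv, stdout_path in commands:
    redirect = f" > {stdout_path}" if stdout_path else ""
    print(" ".join(argv) + redirect)
```

Keeping the commands as data makes it straightforward to wrap each one in a separate grid job or workflow component.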

160 samples (478 lanes) are currently analyzed on the Dutch grid

Development and small tests:

Nov 22, 2010 - now

Analysis:

Mar 25, 2011 - Jul 15, 2011

Jobs: 13,981

Total CPU time: 5.5 years

Disk space used: 315 TB
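
For a sense of scale, the reported totals can be turned into rough averages. The calculation below assumes a 365-day year for the CPU time figure and treats jobs and lanes as evenly distributed, which the slides do not claim.

```python
# Totals reported for the GoNL analysis on the Dutch grid.
jobs = 13_981
cpu_years = 5.5
samples = 160
lanes = 478

cpu_hours = cpu_years * 365 * 24          # assumes a 365-day year
avg_cpu_hours_per_job = cpu_hours / jobs
lanes_per_sample = lanes / samples

print(f"about {avg_cpu_hours_per_job:.1f} CPU hours per job")
print(f"about {lanes_per_sample:.1f} lanes per sample")
```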

Challenges

• Error handling

• Data management

• Data protection

• Provenance tracking

• Transparent addition of other resources

Summary

More research and development needed in e-bioscience

Latest IT infrastructures needed for scaling up NGS data analysis (grids, clouds, big clusters)

Workflow technology assists agile implementation of bioinformatics software

Separate workflow development from IT infrastructure for easier migration and expansion (middleware)

Acknowledgements

Genome of the Netherlands, NL

Cisca Wijmenga

Morris Swertz

All project partners

Virus discovery unit, AMC

Lia van der Hoek

Michel de Vries

Department of genome analysis, AMC

Frank Baas

Ted Bradley

Marja Jakobs

Bioinformatics Laboratory, AMC

Antoine van Kampen

NGS bioinformatics team

Aldo Jongejan

Marcel Willemsen

e-Bioscience team

Silvia Olabarriaga

Angela Luyf

Mark Santcroos

Shayan Shahand

University of Amsterdam

Piter de Boer

BiG Grid

Jan Just Keijser

Tom Visser

Grid support

Modalis, France

Johan Montagnat

Creatis, France

Tristan Glatard

http://www.bioinformaticslaboratory.nl/

BWA on grid – component description

BWA on grid – workflow description

e-BioInfra gateway

No grid certificate needed

Data upload via sFTP (intranet)

Synced with grid storage

Workflows are started from web page

Implemented workflow components for next generation sequencing

Existing software

• BLAST

• BLAT

• BWA

• Annovar

• Varscan

• Newbler

• FastQC

• Roche software

• GATK

• Picard

• Samtools

In-house software

• Data format converters

• Quality trimming

• Alternative splice product detection

• CDR3 detection (T- and B-cell variation)

• Genome comparison (small genomes)
