Commodity Compung in Genomics Research -...

CommodityCompu+nginGenomicsResearch

MichaelSchatzandMihaiPop

CenterforBioinforma-csandComputa-onalBiology,UniversityofMaryland,CollegePark,MD,USA

h;p://www.cbcb.umd.edu/research/cloud/

RecentadvancesinDNAsequencingtechnologyfromIllumina,454LifeSciences,ABI,andHelicos,are enabling next genera-on sequencing instruments to sequence the equivalent of the humangenome (~3 billion bp) in few days and at low cost. In contrast, the sequencing for the humangenomeproject of the late 90's andearly '00s required years ofworkonhundredsofmachineswith sequencing costs measured in hundreds of millions of dollars. This drama-c increase inefficiencyhasspurredtremendousgrowthinapplica-onsforDNAsequencing.

Forexample,whereasthehumangenomeprojectsoughttosequencethegenomeofasmallgroupofindividuals,the1000genomesprojectaimstocatalogthegenomesof1000individualsfromallregionsoftheglobeinjustthreeyears.Relatedprojectsaimtocatalogallofthebiologicallyac-vetranscribed regionsof the genomeover awide varietyof environmental anddisease condi-ons.Similarstudiesarealsounderwayformodelorganismssuchasmouse,rat,chicken,rice,andyeast,andotherorganismsofinterest.

Cheapandfastsequencingtechnologiesarealsoprovidingscien-stswiththetoolstoanalyzethelargelyunknownmicrobialbiosphere.Themajorityofmicrobesinhabi-ngourworldandourbodiesareunknownandcannotbeeasilymanipulated inthe laboratory. Inrecentyearsanewscien-ficfield has emerged ‐ metagenomics ‐ that aims to characterize en-re microbial communi-es bysequencingtheDNAdirectlyextractedfromanenvironment.Severalstudieshavealreadytargeteda rangeofnaturalenvironments (ocean, soil,minedrainage)aswellas thecommensalmicrobesinhabi-ngthebodiesofhumansandotheranimalsandinsects.Thela[erarethetargetofanewNIHini-a-ve–theHumanMicrobiomeProject‐anefforttocharacterizethediversityofhuman‐associatedmicrobialcommuni-esandtounderstandtheircontribu-onstohumanhealth.

The raw data generated by the new sequencing instruments o^en exceed 1 terabyte and arestraining the computa-onal infrastructure typically available in an average research lab.Furthermore,biologicaldatasetsareonlyincreasinginsize,asdataformoreindividualsandmoreenvironments are collected, further complica-ng computa-onal analyses. Even seemingly simpletasks,suchasmappingacollec-onofsequencingreadstooneof thehumanreferencegenome,can require days of computa-on, andde novo assembly of an en-re human genomeusing newgenera-on sequence data has only been possible with specialized compute resources. The onlylong‐termsolu-ontothechallengesposedbythemassivebiologicaldata‐setsbeinggeneratedistocombine computa-onalbiology researchwithadvances fromhighperformance compu-ng.Hereweexplore theuseofcommoditycompu-ngwithinacloudcompu-ngparadigmto tackle theseproblems.

Abstract

CloudBurst:HighlySensi+veShortReadMapping

CloudBurstisanewopen‐sourceread‐mappingalgorithm,foruseinavarietyofbiologicalanalysesincludingSNPdiscovery,genotyping,andpersonal genomics. It is modeled a^er the short read mappingprogramRMAP,butusesHadooptocomputealignments inparallel.CloudBurst's running -me scales linearly with the number of readsmapped,andwithnear linearspeedupasthenumberofprocessorsincreases.Inalargeremotecomputecloudwith96cores,CloudBurstachieves over a 100‐fold speedup over a serial execu-on of RMAP,reducing the running -me from hours to mere minutes to mapmillionsofshortreadstothehumangenome.

Schatz,MC(2009)CloudBurst:HighlySensi+veReadMappingwithMapReduce.Bioinforma-cs.25(11):1363‐1369.h[p://www.cbcb.umd.edu/so^ware/cloudburst

GenomeAssemblywithMapReduceCrossbow:SearchingforSNPswithCloudCompu+ng

Mapstepisshortreadalignment.ManyinstancesofBow+eruninparallelacrossthe cluster. Input tuples are reads andoutputtuplesarealignments.

A clustermay consist of any number ofnodes. Hadoop handles the details ofrou-ng data, distribu-ng and invokingprograms,providingfaulttolerance,etc.

The user first uploads reads to afilesystemvisible to theHadoop cluster.If the Hadoop cluster is in EC2, thefilesystemmightbeanS3bucket. IftheHadoop cluster is local, the filesystemmightbeanNFSshare.

Sort step bins alignments according toprimarykey (genomechromosome)andsorts according to a secondary key(offset into chromosome). This ishandledefficientlybyHadoop.

Reduce step calls SNPs for eachreference par--on. Many instances ofSOAPsnp run in parallel across thecluster. Input tuples are sortedalignments for a par--on and outputtuplesareSNPcalls.

Crossbowisacloud‐compu-ngso^waresystemthatcombinesthespeedoftheshortreadalignerBow-e and the accuracy of the SOAPsnp consensus and SNP caller. Execu-ng these tools inparallelwiththeHadoop implementa-onofMapReduce,CrossbowalignsreadsandmakesSNPcallswith>99%accuracyfromadatasetcomprising38‐foldcoverageofthehumangenomeinlessthanonedayona clusterof40 computer cores, and in less than threehoursusinga320‐coreclusterrentedfromacommercialcloudcompu-ngservice.Crossbow’sabilitytoruninthecloudsmeansthatusersneednotownoroperateanexpensivecomputerclustertorunCrossbow.

Langmead, B, Schatz, MC, Lin, J, Pop, M, Salzberg, SL (2009) Searching for SNPs with CloudCompu+ng.SubmiCedforpublica-on.h[p://bow-e‐bio.sf.net/crossbow

Denovo assemblyofhuman‐sizedgenomes fromshort readsonasinglemachine isnot feasiblewith a current assembler such as Velvet, EULER‐USR, or ALLPATHS, as those algorithms wouldrequire >10TB of RAM to execute. However,MapReduce enables the de Bruijn graph assemblyalgorithmstoscaletotheselargedatasetsasoutlinedbelow.

1.DeBruijnGraphConstruc+onandCompressionConstruc-on of the de Bruijn graph is naturally implemented inMapReduce. Themap func-onemitskeyvaluepairs(ki,ki+1)forconsecu-vek‐mersinthereads,whicharethengloballyshuffledandreducedtobuildanadjacencylistforallk‐mersinthereads.Regionsofthegenomebetweenrepeat boundaries formnon‐branching simple paths, and are efficiently compressed inO(log(S))MapReduceroundsusingaparallellistrankingalgorithm.

ACTG ATCT CTGA CTGG CTGC GACT GCTG GGCT TCTG TGAC TGCA TGGC

DeBruijnGraph

ATC TCT CTG TGC GCA

CompressedGraph

2.ErrorCorrec+onErrorsinthereadsdistortthegraphstructurecrea-ngdead‐ends(le^)orbubbles(middle).Thesegraph structures are recognized and resolved in a single MapReduce cycle crea-ng addi-onalsimplepaths(right).

B’A C B*A C

3.Graphcontrac+onAddi-onal simplifica-on techniques suchas x‐cut (le^) and cycle tree compac-on (right) furthersimplifythegraphstructure,andcreatemoreopportuni-esforsimplepathcompression.

4.ScaffoldingFinallymate‐pairs, ifavailable,areanalyzedtofurtherresolveambigui-es.MapReduce isusedtoiden-fy the connected components of the assembly graph, which are then separately butconcurrentlyanalyzedusingthescaffoldingcomponentofanassemblersuchasVelvetortheCeleraAssembler.Thefinal sequences resolves larger regionsof thegenome, revealingnewbiologynotaccessiblethroughpurelycompara-vetechniques.

A C DR B R R

Humanchromosome1

map shuffle

reduce

Read 1, Chromosome 1, 12345-12365

Read 2, Chromosome 1, 12350-12370

Simulated DataCrossbowsensi+vity

Crossbowprecision

Nodes Time/Cost

HumanChr.2240Mreads1.8GB

99.01% 99.14% 1 master + 2 worder

0h30m$2.40

HumanChr.X172Mreads

5.6GB98.97% 99.64% 1 master +

8 workers 0h26m$7.20

Genuine DataAutosomalagreement

Chr.Xagreement

Nodes Time/Cost

Wholehuman,versusIllumina1MBeadChip

2.7Breads103GB

99.5% 99.6%1master+40workers

2h53m$98.40

AcknowledgementsWewouldliketothankUMDstudentsandfacultyBenLangmead,JimmyLin,StevenSalzberg,andDanSommerfortheircontribu-ons.

ThisworkissupportedbytheNSFClusterExploratory(CluE)programgrantIIS‐0844494.

Commodity Compung in Genomics Research -...

Documents

Transcript of Commodity Compung in Genomics Research -...

Assembly of Large Genomes using Cloud Computing Michael …schatzlab.cshl.edu/presentations/2010-07-23.Illumina.pdf · 7/23/2010 · Parallel Algorithms Spectrum Loosely Coupled

Program Schedule Online - karunya.edu Program Schedule.pdfDr. Paul D. Manuel Kuwait University Key note talk on: Conive Compung in Key note talk on: Big Data Analcs in Fog Compung

Functional genomics approaches to disease genomics

Google Genomics Documentation - Read the Docsmedia.readthedocs.org/pdf/google-genomics/latest/google-genomics… · Google Genomics Documentation, Release v1 The 1000 Genomes Project

Assembly concepts and methods - schatzlab.cshl.edu

Cellular’Networks’and’Mobile’ Compung’ COMS…lierranli/coms6998-11Fall2012/lectures/lec5... · Cellular’Networks’and’Mobile’ Compung ... FACH: ’Low’Power’State’

Cellular’Networks’and’Mobile’ Compung’ …coms6998-8/lectures/lec9-cell...PreviousStudies • Cellular’IP’dynamics’ – Measured’cellular’IP’dynamics’attwo’locaons’

Genomics in Medicine: An Introduction to Genomics: Breast Cancer

Data Analysis using the R Project for Stascal Compung · R Project for Stascal Compung ... 3.1.Hierarchical clustering ... • Recycle your Python codes: ...

Compung Distance Histograms Efficiently in Scienfic Databasestuy/pub/talks/ICDE09-poster.pdf · Compung Distance Histograms Efficiently in Scienfic Databases ... Running me of DM‐SDH

Management,)Security)and) Sustainabilityfor Cloud)Compung )westphal/WorldComp2013-USA-tutorial.pdf · Management,)Security)and) Sustainabilityfor Cloud)Compung )!!!! ... JASIG! CAS!

SMART Genomics API - Massachusetts Institute of … Genomics API Yishen Chen Mentor - Dr. Gil Alterovitz PRIMES Conference May 18, 2014 Status of Genomics Status of Genomics Influx

· Web viewGenetics and Genomics. Genetics and Genomics. Genetics and Genomics. Genetics and Genomics. Cell cycle, divisions, gametogenesis. Mutations and polymorphisms. Cytogenetics.

Genomics in Society: Genomics, Preventive Medicine, and Society

Error Correction and Assembly of Oxford Nanopore …schatzlab.cshl.edu/presentations/2015/2015.02.28.AGBT Nanocorr... · Post Correction Identity . ONT vs Illumina Assembly Oxford

Genome and transcriptome of the regeneration …schatzlab.cshl.edu/publications/posters/2015/2015.GI.Macro.pdfGenome and transcriptome of the regeneration-competent flatworm, Macrostomum

Varia%on(&(RNA-seq(Services( - Schatzlab - …schatzlab.cshl.edu/presentations/2014-01-14.PAG.Jnomics...Scalpel:'Haplotype'microassembly' DNA sequence micro-assembly pipeline for accurate

23andMe, Pathway Genomics, Complete Genomics, Counsyl | Company Showdown

Genomics of Major Crops and Model Plant Speciesdownloads.hindawi.com/journals/ijpg/2008/171928.pdf · genomics, functional genomics, proteomics, metabolomics, and comparative genomics.

Distributed Systems Mobile & IoT Compung · Distributed Systems Mobile & IoT Compung Rik Sarkar University of Edinburgh Fall 2016 Distributed Systems, Edinburgh, 2016 . Mobile and