Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly...

ParalleldenovoAssemblyofComplex(Meta)GenomesviaHipMer

AydınBuluçComputationalResearchDivision,LBNL

May23,2016InvitedTalkatHiCOMB 2016

Jointwork(alphabetical)withChaitanyaAluru,JarrodChapman,RobEgan,Evangelos Georganas,StevenHofmeyr,LennyOliker,DanRokhsar,KathyYelick

1. ParallelDeBruijn GraphConstructionandTraversalfordenovoGenomeAssembly,SC’14

2. Awhole-genomeshotgunapproachforassemblingandanchoringthehexaploid breadwheatgenome.GenomeBiology,16(26),2015.

3. meraligner:Afullyparallelsequencealigner,IPDPS’154. HipMer:AnExtreme-ScaleDeNovoGenomeAssembler,SC’15

OutlineandAcknowledgments

Workisfundedby

DenovoGenomeAssembly

1.Threecopiesofthesamenovel.

For all men tragically great are made so through a certain morbidness… all mortal greatness is but disease.

2.Sometextfromthenovel.Allpageswillberandomlycutintostripsofcharacters.Randomtypos(errors) throughouteachnovel.

all men tragically g

ally great

great are made so

3.Afewstripsofcharactersfromonepage.

4.Allofthestripsofcharactersfromthe3novels.

5.Everystripmustbeassembledasshownheretocreateasinglecopyofthenovel.

For all men tragically great are made so

For a ally great

great are made so all men tragically g

1.ThreecopiesofthesameDNA.

2.SomepartoftheDNAsequence.Itwillbereadintostrips.Therearerandomerrors throughoutthesequence.

ACCGTAGCAAAACCGGGTAGTCATACTACTACGTACTCATCT

3.Thesequenceisreadintosmallerpieces(reads).CannotreadwholeDNAsequenceinonego.

ACCGTAGCAA

AAACCGGGTA TAGTCATACT

AAACCGGGTA

ACTACGTACT

4.Allreads

5.ReconstructoriginalDNAsequencefromthereadset.

ACCGTAGCAAAACCGGGTAGTCATACTACTACGTACTCATCT

ACCGTAGCAA

AAACCGGGTA

GTAGTCATACT

CTACTACGTAC

CGTACTCATCT

• There is no genome reference! – In principle we want to reconstruct unknown genome sequence.

• Reads are significantly shorter than whole genome.– Reads consist of 20 to 30K bases– Genomes vary in length and complexity – up to 30G bases

• Reads include errors.

• Genomes have repetitive regions.– Repetitive regions increase genome complexity.

DenovoGenomeAssemblyishard

• Switchgrass:1.4Giga-base pairs(Gbp)• Maize:2.4Gbp• Miscanthus:2.5Gbp• Humangenome:3Gbp• Barleygenome:7Gbp• Wheatgenome:17Gbp• Pinegenome:20Gbp• Salamander:20-30Gbp

Genomesvaryinsize

contigs

scaffolds

k-mers

Input:Readsthatmaycontainerrors

Chopreadsintok-mers,processk-mers toexcludeerrors

Construct&traversedeBruijn graphofk-mers,generatecontigs

Leveragereadinformationtolinkcontigs andgeneratescaffolds.

GenomeAssemblyalaMeraculous

I/O,bandwidth,andmemoryintensive

Latencybound(irregularaccesses)

ComputeandI/Ointensive

MeraculousparallelizationwithUPC

• UPC is a PGAS (Partitioned Global Address Space) parallel language with one-sided communication.

• Core graph algorithms implemented in UPC (graph ó hash table).

• We need the notion of a huge global distributed hash table.• Irregular access pattern in the the distributed hash table

– One-sided communication is handy!• Portable implementation: can run any machine without change!• Result of this work: Complete assembly of human genome in

8.4 minutes using 15K cores

• Original code required 2 days and a large memory machine.

Parallelk-mer analysis:pass1

Parsetok-mers

Hyperloglog

k-mer cardinalityestimation

G=Globalcardinalityestimation

ParallelReduction

BloomFilterofsize(G/n)

CreatelocalBloomfilters

Alsoidentifyhigh-frequencyk-mers

atthisstep

Parsetok-mers

Hashk-mers

Hashk-mers &findowners

Receivedk-mers

All-to-allcommunication

ofk-mers

BloomFilter

Localset

Storeak-mer inthelocalsetonlyifitwas

seenbefore

Thelocalsetdoestheactualcounting

ofthek-mers

Bloomfilterisaprobabilistic datastructureusedformembershipqueries• Givenabloomfilter,wecanask:“Haveweseenthisk-mer before?”• Nofalsenegatives.• Mayhavefalsepositives(inpractice5%falsepositiverate)

k-mers withfrequency=1areuseless(eithererrororcannotbedistinguishedfromerror),andcansafelybeeliminated.

WhyuseaBloomfilter?

Parsetok-mers

Hashk-mers

Findextensionsofk-mers,hashthem&findtheirowners

Receivedk-mers &extensions

All-to-allcommunicationofk-mers andextensions

Keeptrackofthenumberofoccurrencesofeachextensionfor

eachk-mer

Localset

ACCCACTCTTAGCFAACCTTGCGCATXA

AGGCAATGGTAGFFAAAATTGCCCATXX

TTCCAGTTTTGCCAAACTTGGCTTTTCA

Useathreshold&findthehigh

qualityextensionsofk-mers

High-frequencyk-mers

Long-taileddistributionforgenomeswithrepetitivecontent:- Themaximumcountforanyk-mer inthewheatdatasetis451million- Ouroriginalscheme(SC’14)was“ownercounts”,afteranall-to-all- Countinganitemw/451millionoccurrencesaloneisloadimbalanced

Solution:Quicklyidentifyhigh-frequencyk-mers usingminimalcommunicationduringthe“cardinalityestimation”stepandtreatthemspeciallybyusinglocalcounters.

• InMeraculous,thedeBruijn graphisrepresentedasahashtable

• K-mers arebothnodesinthegraph&keysinthehashtable

• Anedgeinthegraphconnectstwonodesthatoverlapink-1bases

• Theedgesinthegraphareputinthehashtablebystoringtheextensionsofthek-mers astheircorrespondingvalues

GAT ATC TCT CTG TGA

ParallelDeBruijn GraphConstruction

• Challenge 1: The hash table that represents the de Bruijngraph is large (100s of GBs up to 10s of TBs !)– Solution: Distribute the graph over multiple processors.

The global address space of UPC is handy!

• Challenge 2: Parallel hash tables construction introduces communication and synchronization costs– Solution: Aggregate messages to reduce number of

messages and synchronization à 10x-20x performance improvement.

LocalbufferforP0

LocalbufferforP1

LocalbufferforPn

BufferlocaltoP0

DistributedHashtable

LocaltoP0

Aggregatingstoresoptimization

P0 storesthek-mers &extensionsinitslocalbucketsinalock-free&communication-freefashion

Goal:• TraversethedeBruijn graphandfindUUcontigs (chainsofUU

nodes),oralternatively• findtheconnectedcomponentswhichconsistoftheUUcontigs.

• Mainidea:– Pickaseed– Iterativelyextenditbyconsecutivelookupsinthedistributedhash

GAT ATC TCT CTG TGA

Contig 1:GATCTGAContig 2:AACCG

Contig 3:AATGC

ParallelDeBruijn GraphTraversal

CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT

Assumeone oftheUUcontigs tobeassembledis:

ParallelDeBruijn GraphTraversal

ProcessorPi picksarandomk-mer fromthedistributedhashtableasseed:

Pi knowsthatforwardextensionisA

Pi usesthelastk-1basesandtheforwardextensionandforms:CAACGTATCA

Pi doesalookupinthedistributedhashtableforCAACGTATCA

Pi iteratesthisprocessuntilitreachesthe“right”endpointoftheUUcontig

Pi alsoiteratesthisprocessbackwardsuntilitreachesthe“left”endpointoftheUUcontig

MultipleprocessorsonthesameUUcontig

However,processorsPi,Pj andPt mighthavepickedinitialseedsfromthesameUUcontig

Pi Pj Pt

• ProcessorsPi,Pj andPt havetocollaborateandconcatenatesubcontigsinordertoavoidredundantwork.

• Solution:lightweightsynchronizationschemebasedonastatemachine

contigs

scaffolds

k-mers

3PARALLELIZINGSCAFFOLDING

AlignmentforDenovο GenomeAssembly

Tobereusedformetagenomicsw/modifications

Contig 101 Contig 61

Read ID start-pos end-pos Contig ID start-pos end-pos--------------------------------------------------------------------------------------------------Read 42 1 4 Contig 101 152 155Read 42 130 150 Contig 61 1 21Read 90 1 150 Contig 500 101 250

Contig 500

Queries(Reads)

Targets(Contigs)

• Aqueryandatargetshouldmatchinatleastkbasesinordertobealigned• Wecallseed asubstringofasequence(queryortarget)withlengthequaltok

Aligningqueriestotargetsequences

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Seed: ACT

Seed Index

Buildingseedindex

Seed: ACT

�Seed: CTG

Seed Index

Buildingseedindex

Seed: ACT

�Seed: CTG

�Seed: TGG

Seed Index

Buildingseedindex

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

Seed Index

Buildingseedindex

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

�Seed: GCA

Seed Index

Buildingseedindex

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

�Seed: GCA

Seed Index

query sequence : G C T G

query’s seed

Lookupseedindex

SeedGCTisnotfoundinseedindex

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

�Seed: GCA

Seed Index

query sequence : G C T G

query’s seed

Lookupseedindex

• Indenovoassembly,billionsofreadsmustbealignedtocontigs• Firstalignertoparallelizetheseedindexconstruction(“fully”parallel)

Evangelos Georganas, Aydın Buluç,JarrodChapman,LeonidOliker,DanielRokhsar,andKatherineYelick.meraligner:Afullyparallelsequencealigner.InProceedingsoftheIPDPS,2015.

480 960 1920 3840 7680 15360

Number of Cores

merAligner-wheatideal-wheatmerAligner-humanideal-humanBWAmem-humanBowtie2-human

ParallelGenomeAlignmentforDenovoAssembly

Scaffoldingbeyondalignment

• High-level idea: Leverage read-to-contig information to generate links among contigs.• Distributed hash tables to index the link information.

• Formacontig graphbyusingthelinksandtraverseittoformscaffolds.

contig i contig j

LINK : < contig i ! contig j , link info >

contig 1 contig 2 contig 3 contig 4

tie 1!2 tie 2!3 tie 3!4

Scaffold Link 1!2 Link 2!3 Link 3!4

Scaffoldingbeyondalignment

• Computingcontig depthsandterminationstates• Contig bubbleidentification(fordiploidgenomes)• Orderingandorientationofcontigs(inherentlyserialasimplemented)• Insertsizeestimation• Locatingsplints andspans• Contig linkgeneration• GapclosingConceptually:Mini/localassembly– embarrassinglyparallelizable-

Toolsemployed:- Distributedhashtables- Speculativeexecution

read contig k

contig m

SPLINT

read 1 read 2

contig i

contig j

SPAN (a) (b)

Gap 1 Gap 2 Gap 3

Task 1 Task 2 Task 3

Gap closing

Assembled genome

Strongscaling(wheatgenome)onCrayXC30

960 1920 3840 7680 15360

Number of Cores

overall timekmer analysis

contig generationscaffolding

ideal overall time

• Completeassemblyofwheatgenomein39minutes(15Kcores).• OriginalMeraculous wouldrequire(projectedtime)aweek(~300x

slower) andasharedmemorymachinewith1TBmemory.

Strongscaling(humangenome)onCrayXC30@SC’15

• Completeassemblyofhumangenomein8.4minutes(15Kcores).• 350xspeedupoveroriginalMeraculous(took2,880minutesanda

largesharedmemorymachine).

480 960 1920 3840 7680 15360

Number of Cores

overall timekmer analysis

ideal overall time

Strongscaling(humangenome)onCrayXC30@Now

• Completeassemblyofhumangenomein4 minutesusing23Kcores.• 700xspeedupoveroriginalMeraculous (took2,880minutesandalarge

sharedmemorymachine).

1920 3840 7680 15360 23040

Number of Cores

ideal IO cached overall No IO cached

overall IO cachedkmer analysis

• Insteadofmultiplecopiesofthesamebook,howyouhavethewholelibrarytoassemble!

Howaboutmetagenomes?

• 1984isthe“dominantspecies”inthissample.• Tooeasy: therearealsopreviouslyunknownbooksinthemix.

Option1:Binthereadstogenomes;runsinglegenomeassemblyL Readsaretooshorttocontainenoughinformationforbinning.Option2:Generatecontigs andbinthecontigsL Howdoyoueliminateerrorsforlow-coverageorganisms?

Metagenomechallenges

- Whydoesmetagenomeassemblybenefitsfromexpliciterrorcorrectionwhensinglegenomeassemblydoesnot?+Unevensequencingdepth:Errorsinhigh-depthregionsmightbemorefrequentthantruek-mers inlow-depthregions.- Whocaresaboutlow-depthregions?+Infact,that’swhatmattersmost.

Singlegenome Metagenome

AdaptingHipMertometagenomes

persistent

contiggeneration

alignment scaffolding

Datasizes

gapclosing

k-meranalysis Increasek

anditerate

Inputs:(A) Shortreadsand(F)longinsert(matepaired)librariesIntermediate:(B)Error-freek-mers w/extensions(a.k.a.UFX),(C)contigs…Output:(G)finalgenomescaffolds

Secretmetagenomesauce#1

DEGASBerkeleymeeting

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

Iterativecontig generation

DEGASBerkeleymeeting

k-mer analysis

Local Assembly

bubble

k-mer analysis

Local Assembly

bubble

k-mer analysis

Local Assembly

bubble

k-mer analysis

Local Assembly

bubble

k-mer analysis

Local Assembly

bubble

k-mer analysis

Local Assembly

bubble

HipMeronaMOCKcommunity

– Mixof25bacteriawithdifferentabundancies (JGIdataset)– Datasetsize~100GBytes

Statistics metaHipMer metaSPAdes

#contigs 2,670 5,100

Totallength> 10kbp 92,245,198 92,152,728

Totallength>50kbp 69,493,125 77,292,121

Misassemblies 58 134

Mismatchesper100kbp 3.48 77.05

Genomesfraction(%) 92.10 91.10

Summary

• HipMer’s corealgorithmsscaletotensofthousandsofcoresandyieldperformanceimprovementsfromdays/weeksdowntominutes.

• HipMerbreaksthehardwarelimitationsbyenablingdistributed-memoryscaling

• Useofdenovoassemblyintimesensitiveapplicationslikeprecisionmedicineisnomoreformidable!

• SourcereleaseofHipMer:https://sourceforge.net/projects/hipmer/

• Ongoingwork:AdaptHipMerformetagenomicanalysis(somehintsandpreliminaryresultsgiveninthistalk).

Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly...

Documents

Transcript of Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly...

FINAL!PROGRAM 28th!0!29 !September,!2017! · De novo assembly of Mycobacterium tuberculosis genomes from long read sequencing technologies Miguel Blanco Fuertes IBV-CSIC P. 8 Functional

High-Quality De Novo Insect Genome Assemblies …i5k.github.io/webinar_slides/i5K_webinar_PacBio_Oct5...-Genomes of 'Candidatus Liberibacter solanacearum' Haplotype A from New Zealand

De Novo Assembly and Structural Variation Discovery … · De Novo Assembly and Structural Variation Discovery in Complex Genomes Using ... Repeat Copy Number Diversity in Humans

2014-01-14.PAG.Single Molecule Assembly - Schatzlabschatzlab.cshl.edu/presentations/2014-01-14.PAG... · 1/14/2014 · De novo assembly of complex genomes using single molecule sequencing

De novo identification of repeat families in large genomes Alkes L. Price, Neil C. Jones and Pavel A. Pevzner June 28, 2005.

BB30055: Genes and genomes Genomes - Dr. MV Hejmadi (bssmvh)

Eukaryotic Genomes The Organization and Control of Eukaryotic Genomes.

Bacterial genomes

Virus Genomes

17 Genomes. Figure 17.7 Synthetic Cells 17 Genomes 17.1 How Are Genomes Sequenced? 17.2 What Have We Learned from Sequencing Prokaryotic Genomes? 17.3.

Genomes summary

Seminario "Efficient de novo assembly of single-cell bacterial genomes from short-read data sets"

Quick Reference Card - Equipment and Materials …...De Novo – HiFi Low DNA De Novo Seq – Input Sequencing De Novo Length – hotgun Meta M icrobial Variant Detection Structural

Rapid diversification of five Oryza AA genomes associated ...Here, we fully sequenced and assembled de novo the five biogeographically representative AA genomes. In comparison to the

The Changing Face of Sequencing Strategies for de novo sequencing of complex genomes.

Organelle genomes

De novo identification of repeat families in large genomes

Modelling genomes

2012-01-15.PAG.De novo assembly of complex genomes

Microbial Genomes