Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly...

Post on 18-Mar-2020

4 views 0 download

Transcript of Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly...

ParalleldenovoAssemblyofComplex(Meta)GenomesviaHipMer

AydınBuluçComputationalResearchDivision,LBNL

May23,2016InvitedTalkatHiCOMB 2016

Jointwork(alphabetical)withChaitanyaAluru,JarrodChapman,RobEgan,Evangelos Georganas,StevenHofmeyr,LennyOliker,DanRokhsar,KathyYelick

1. ParallelDeBruijn GraphConstructionandTraversalfordenovoGenomeAssembly,SC’14

2. Awhole-genomeshotgunapproachforassemblingandanchoringthehexaploid breadwheatgenome.GenomeBiology,16(26),2015.

3. meraligner:Afullyparallelsequencealigner,IPDPS’154. HipMer:AnExtreme-ScaleDeNovoGenomeAssembler,SC’15

OutlineandAcknowledgments

Workisfundedby

DenovoGenomeAssembly

1.Threecopiesofthesamenovel.

For all men tragically great are made so through a certain morbidness… all mortal greatness is but disease.

2.Sometextfromthenovel.Allpageswillberandomlycutintostripsofcharacters.Randomtypos(errors) throughouteachnovel.

For a

all men tragically g

ally great

great are made so

3.Afewstripsofcharactersfromonepage.

4.Allofthestripsofcharactersfromthe3novels.

5.Everystripmustbeassembledasshownheretocreateasinglecopyofthenovel.

For all men tragically great are made so

For a ally great

great are made so all men tragically g

1.ThreecopiesofthesameDNA.

2.SomepartoftheDNAsequence.Itwillbereadintostrips.Therearerandomerrors throughoutthesequence.

ACCGTAGCAAAACCGGGTAGTCATACTACTACGTACTCATCT

3.Thesequenceisreadintosmallerpieces(reads).CannotreadwholeDNAsequenceinonego.

ACCGTAGCAA

AAACCGGGTA TAGTCATACT

AAACCGGGTA

ACTACGTACT

4.Allreads

5.ReconstructoriginalDNAsequencefromthereadset.

ACCGTAGCAAAACCGGGTAGTCATACTACTACGTACTCATCT

ACCGTAGCAA

AAACCGGGTA

GTAGTCATACT

CTACTACGTAC

CGTACTCATCT

• There is no genome reference! – In principle we want to reconstruct unknown genome sequence.

• Reads are significantly shorter than whole genome.– Reads consist of 20 to 30K bases– Genomes vary in length and complexity – up to 30G bases

• Reads include errors.

• Genomes have repetitive regions.– Repetitive regions increase genome complexity.

DenovoGenomeAssemblyishard

• Switchgrass:1.4Giga-base pairs(Gbp)• Maize:2.4Gbp• Miscanthus:2.5Gbp• Humangenome:3Gbp• Barleygenome:7Gbp• Wheatgenome:17Gbp• Pinegenome:20Gbp• Salamander:20-30Gbp

Genomesvaryinsize

reads

contigs

scaffolds

k-mers

1

2

3

Input:Readsthatmaycontainerrors

Chopreadsintok-mers,processk-mers toexcludeerrors

Construct&traversedeBruijn graphofk-mers,generatecontigs

Leveragereadinformationtolinkcontigs andgeneratescaffolds.

GenomeAssemblyalaMeraculous

I/O,bandwidth,andmemoryintensive

Latencybound(irregularaccesses)

ComputeandI/Ointensive

MeraculousparallelizationwithUPC

• UPC is a PGAS (Partitioned Global Address Space) parallel language with one-sided communication.

• Core graph algorithms implemented in UPC (graph ó hash table).

• We need the notion of a huge global distributed hash table.• Irregular access pattern in the the distributed hash table

– One-sided communication is handy!• Portable implementation: can run any machine without change!• Result of this work: Complete assembly of human genome in

8.4 minutes using 15K cores

• Original code required 2 days and a large memory machine.

Reads

Parallelk-mer analysis:pass1

P0

P1

Pn

Parsetok-mers

Hyperloglog

Hyperloglog

Hyperloglog

k-mer cardinalityestimation

G=Globalcardinalityestimation

ParallelReduction

BloomFilterofsize(G/n)

BloomFilterofsize(G/n)

BloomFilterofsize(G/n)

CreatelocalBloomfilters

Alsoidentifyhigh-frequencyk-mers

atthisstep

Reads

Parallelk-mer analysis:pass2

P0

P1

Pn

Parsetok-mers

Hashk-mers

Hashk-mers

Hashk-mers

Hashk-mers &findowners

Receivedk-mers

Receivedk-mers

Receivedk-mers

All-to-allcommunication

ofk-mers

BloomFilter

BloomFilter

BloomFilter

Localset

Localset

Localset

Storeak-mer inthelocalsetonlyifitwas

seenbefore

Thelocalsetdoestheactualcounting

ofthek-mers

Bloomfilterisaprobabilistic datastructureusedformembershipqueries• Givenabloomfilter,wecanask:“Haveweseenthisk-mer before?”• Nofalsenegatives.• Mayhavefalsepositives(inpractice5%falsepositiverate)

k-mers withfrequency=1areuseless(eithererrororcannotbedistinguishedfromerror),andcansafelybeeliminated.

WhyuseaBloomfilter?

Reads

Parallelk-mer analysis:pass3

P0

P1

Pn

Parsetok-mers

Hashk-mers

Hashk-mers

Hashk-mers

Findextensionsofk-mers,hashthem&findtheirowners

Receivedk-mers &extensions

Receivedk-mers &extensions

Receivedk-mers &extensions

All-to-allcommunicationofk-mers andextensions

Keeptrackofthenumberofoccurrencesofeachextensionfor

eachk-mer

Localset

Localset

Localset

ACCCACTCTTAGCFAACCTTGCGCATXA

AGGCAATGGTAGFFAAAATTGCCCATXX

TTCCAGTTTTGCCAAACTTGGCTTTTCA

Useathreshold&findthehigh

qualityextensionsofk-mers

High-frequencyk-mers

Long-taileddistributionforgenomeswithrepetitivecontent:- Themaximumcountforanyk-mer inthewheatdatasetis451million- Ouroriginalscheme(SC’14)was“ownercounts”,afteranall-to-all- Countinganitemw/451millionoccurrencesaloneisloadimbalanced

Solution:Quicklyidentifyhigh-frequencyk-mers usingminimalcommunicationduringthe“cardinalityestimation”stepandtreatthemspeciallybyusinglocalcounters.

• InMeraculous,thedeBruijn graphisrepresentedasahashtable

• K-mers arebothnodesinthegraph&keysinthehashtable

• Anedgeinthegraphconnectstwonodesthatoverlapink-1bases

• Theedgesinthegraphareputinthehashtablebystoringtheextensionsofthek-mers astheircorrespondingvalues

GAT ATC TCT CTG TGA

AAC

ACC

CCG

AAT

ATG

TGC

ParallelDeBruijn GraphConstruction

ParallelDeBruijn GraphConstruction

• Challenge 1: The hash table that represents the de Bruijngraph is large (100s of GBs up to 10s of TBs !)– Solution: Distribute the graph over multiple processors.

The global address space of UPC is handy!

• Challenge 2: Parallel hash tables construction introduces communication and synchronization costs– Solution: Aggregate messages to reduce number of

messages and synchronization à 10x-20x performance improvement.

Pi

LocalbufferforP0

LocalbufferforP1

LocalbufferforPn

BufferlocaltoP0

DistributedHashtable

LocaltoP0

LocaltoP0

LocaltoP0

LocaltoP0

Aggregatingstoresoptimization

P0

P0 storesthek-mers &extensionsinitslocalbucketsinalock-free&communication-freefashion

Goal:• TraversethedeBruijn graphandfindUUcontigs (chainsofUU

nodes),oralternatively• findtheconnectedcomponentswhichconsistoftheUUcontigs.

• Mainidea:– Pickaseed– Iterativelyextenditbyconsecutivelookupsinthedistributedhash

table

GAT ATC TCT CTG TGA

AAC

ACC

CCG

AAT

ATG

TGC

Contig 1:GATCTGAContig 2:AACCG

Contig 3:AATGC

ParallelDeBruijn GraphTraversal

CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT

Assumeone oftheUUcontigs tobeassembledis:

ParallelDeBruijn GraphTraversal

ParallelDeBruijn GraphTraversal

CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT

ProcessorPi picksarandomk-mer fromthedistributedhashtableasseed:

Pi knowsthatforwardextensionisA

Pi usesthelastk-1basesandtheforwardextensionandforms:CAACGTATCA

Pi doesalookupinthedistributedhashtableforCAACGTATCA

Pi iteratesthisprocessuntilitreachesthe“right”endpointoftheUUcontig

Pi alsoiteratesthisprocessbackwardsuntilitreachesthe“left”endpointoftheUUcontig

MultipleprocessorsonthesameUUcontig

CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT

However,processorsPi,Pj andPt mighthavepickedinitialseedsfromthesameUUcontig

Pi Pj Pt

• ProcessorsPi,Pj andPt havetocollaborateandconcatenatesubcontigsinordertoavoidredundantwork.

• Solution:lightweightsynchronizationschemebasedonastatemachine

reads

contigs

scaffolds

k-mers

1

2

3PARALLELIZINGSCAFFOLDING

AlignmentforDenovο GenomeAssembly

Tobereusedformetagenomicsw/modifications

Read 42

Contig 101 Contig 61

Read ID start-pos end-pos Contig ID start-pos end-pos--------------------------------------------------------------------------------------------------Read 42 1 4 Contig 101 152 155Read 42 130 150 Contig 61 1 21Read 90 1 150 Contig 500 101 250

Read 90

Contig 500

Queries(Reads)

Targets(Contigs)

• Aqueryandatargetshouldmatchinatleastkbasesinordertobealigned• Wecallseed asubstringofasequence(queryortarget)withlengthequaltok

Aligningqueriestotargetsequences

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Seed: ACT

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Seed: ACT

�Seed: CTG

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Seed: ACT

�Seed: CTG

�Seed: TGG

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

�Seed: GCA

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

�Seed: GCA

A C T G G G G C A Target 0: Target 1:

Seed Index

query sequence : G C T G

query’s seed

Lookupseedindex

SeedGCTisnotfoundinseedindex

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

�Seed: GCA

A C T G G G G C A Target 0: Target 1:

Seed Index

query sequence : G C T G

query’s seed

Lookupseedindex

• Indenovoassembly,billionsofreadsmustbealignedtocontigs• Firstalignertoparallelizetheseedindexconstruction(“fully”parallel)

Evangelos Georganas, Aydın Buluç,JarrodChapman,LeonidOliker,DanielRokhsar,andKatherineYelick.meraligner:Afullyparallelsequencealigner.InProceedingsoftheIPDPS,2015.

128

256

512

1024

2048

4096

8192

16384

480 960 1920 3840 7680 15360

Seco

nds

Number of Cores

merAligner-wheatideal-wheatmerAligner-humanideal-humanBWAmem-humanBowtie2-human

ParallelGenomeAlignmentforDenovoAssembly

Scaffoldingbeyondalignment

• High-level idea: Leverage read-to-contig information to generate links among contigs.• Distributed hash tables to index the link information.

• Formacontig graphbyusingthelinksandtraverseittoformscaffolds.

contig i contig j

LINK : < contig i ! contig j , link info >

contig 1 contig 2 contig 3 contig 4

tie 1!2 tie 2!3 tie 3!4

Scaffold Link 1!2 Link 2!3 Link 3!4

Scaffoldingbeyondalignment

• Computingcontig depthsandterminationstates• Contig bubbleidentification(fordiploidgenomes)• Orderingandorientationofcontigs(inherentlyserialasimplemented)• Insertsizeestimation• Locatingsplints andspans• Contig linkgeneration• GapclosingConceptually:Mini/localassembly– embarrassinglyparallelizable-

Toolsemployed:- Distributedhashtables- Speculativeexecution

read contig k

contig m

SPLINT

read 1 read 2

contig i

contig j

SPAN (a) (b)

Gap 1 Gap 2 Gap 3

Task 1 Task 2 Task 3

Gap closing

Assembled genome

Strongscaling(wheatgenome)onCrayXC30

32

64

128

256

512

1024

2048

4096

8192

16384

960 1920 3840 7680 15360

Seco

nds

Number of Cores

overall timekmer analysis

contig generationscaffolding

ideal overall time

• Completeassemblyofwheatgenomein39minutes(15Kcores).• OriginalMeraculous wouldrequire(projectedtime)aweek(~300x

slower) andasharedmemorymachinewith1TBmemory.

Strongscaling(humangenome)onCrayXC30@SC’15

• Completeassemblyofhumangenomein8.4minutes(15Kcores).• 350xspeedupoveroriginalMeraculous(took2,880minutesanda

largesharedmemorymachine).

32

64

128

256

512

1024

2048

4096

8192

480 960 1920 3840 7680 15360

Seco

nds

Number of Cores

overall timekmer analysis

contig generationscaffolding

ideal overall time

Strongscaling(humangenome)onCrayXC30@Now

• Completeassemblyofhumangenomein4 minutesusing23Kcores.• 700xspeedupoveroriginalMeraculous (took2,880minutesandalarge

sharedmemorymachine).

16

32

64

128

256

512

1024

2048

4096

8192

1920 3840 7680 15360 23040

Seco

nds

Number of Cores

ideal IO cached overall No IO cached

overall IO cachedkmer analysis

contig generationscaffolding

IO

• Insteadofmultiplecopiesofthesamebook,howyouhavethewholelibrarytoassemble!

Howaboutmetagenomes?

• 1984isthe“dominantspecies”inthissample.• Tooeasy: therearealsopreviouslyunknownbooksinthemix.

Option1:Binthereadstogenomes;runsinglegenomeassemblyL Readsaretooshorttocontainenoughinformationforbinning.Option2:Generatecontigs andbinthecontigsL Howdoyoueliminateerrorsforlow-coverageorganisms?

Metagenomechallenges

- Whydoesmetagenomeassemblybenefitsfromexpliciterrorcorrectionwhensinglegenomeassemblydoesnot?+Unevensequencingdepth:Errorsinhigh-depthregionsmightbemorefrequentthantruek-mers inlow-depthregions.- Whocaresaboutlow-depthregions?+Infact,that’swhatmattersmost.

Singlegenome Metagenome

AdaptingHipMertometagenomes

persistent

contiggeneration

alignment scaffolding

B

Datasizes

gapclosing

B

E

G

E

k-meranalysis Increasek

anditerate

A10TB

1TB

100GB

A

CCC

D

C

A

C

D

F

D

Inputs:(A) Shortreadsand(F)longinsert(matepaired)librariesIntermediate:(B)Error-freek-mers w/extensions(a.k.a.UFX),(C)contigs…Output:(G)finalgenomescaffolds

Secretmetagenomesauce#1

DEGASBerkeleymeeting

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

DEGASBerkeleymeeting

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

HipMeronaMOCKcommunity

– Mixof25bacteriawithdifferentabundancies (JGIdataset)– Datasetsize~100GBytes

Statistics metaHipMer metaSPAdes

#contigs 2,670 5,100

Totallength> 10kbp 92,245,198 92,152,728

Totallength>50kbp 69,493,125 77,292,121

Misassemblies 58 134

Mismatchesper100kbp 3.48 77.05

Genomesfraction(%) 92.10 91.10

Summary

• HipMer’s corealgorithmsscaletotensofthousandsofcoresandyieldperformanceimprovementsfromdays/weeksdowntominutes.

• HipMerbreaksthehardwarelimitationsbyenablingdistributed-memoryscaling

• Useofdenovoassemblyintimesensitiveapplicationslikeprecisionmedicineisnomoreformidable!

• SourcereleaseofHipMer:https://sourceforge.net/projects/hipmer/

• Ongoingwork:AdaptHipMerformetagenomicanalysis(somehintsandpreliminaryresultsgiveninthistalk).