short read genome assembly - Brown...
Short read genome assembly
Sorin Istrail
CSCI1820
Short-read genome assembly algorithms
3/6/2014
Genomathica Assembler
• Mathematica notebook for genome assembly simulation
• Assembler can be found at: http://cs.brown.edu/courses/csci1820/software/minimal_assembler.nb
• Sample FASTA genome phix174.fasta can be found in HW5 Biology: http://cs.brown.edu/courses/csci1820/software/phix174.fasta
• Remember to
– Change the input genome to your FASTA file's location
– Evaluate each cell initially, then you only need to evaluate the last two cells to re-run the assembly and display the results, respectively
– Mathematica can be downloaded here: http://www.brown.edu/information-technology/software/
coverage=1
• Sequence reads are in black
• Contiguous strings of assembled DNA (contigs) are in red
coverage=2
coverage=3
coverage=4
coverage=5
coverage=2, paired ends
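The coverage slides above can be imitated with a small simulation. This is a minimal sketch and not the Genomathica notebook: it assumes fixed-length reads whose start positions are uniformly random, and it treats two reads as belonging to the same contig when they overlap by at least `min_overlap` bases. The names `sample_reads` and `merge_intervals` and all parameter values are illustrative choices.

```python
import random

def sample_reads(genome_len, read_len, coverage, rng):
    """Return sorted half-open (start, end) read intervals at the given coverage."""
    n_reads = int(coverage * genome_len / read_len)
    starts = (rng.randrange(genome_len - read_len + 1) for _ in range(n_reads))
    return sorted((s, s + read_len) for s in starts)

def merge_intervals(intervals, min_overlap=1):
    """Merge reads that overlap by at least min_overlap bases into contigs."""
    contigs = []
    for start, end in sorted(intervals):
        if contigs and start <= contigs[-1][1] - min_overlap:
            # Read overlaps the current contig enough: extend it.
            contigs[-1] = (contigs[-1][0], max(contigs[-1][1], end))
        else:
            # Gap or overlap too short: start a new contig.
            contigs.append((start, end))
    return contigs

if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed so runs are repeatable
    genome_len, read_len = 5000, 100  # assumed toy sizes
    for coverage in (1, 2, 3, 4, 5):
        reads = sample_reads(genome_len, read_len, coverage, rng)
        contigs = merge_intervals(reads, min_overlap=20)
        covered = sum(end - start for start, end in contigs)
        print(f"coverage={coverage}: {len(contigs)} contigs, "
              f"{covered / genome_len:.0%} of genome in contigs")
```

Raising `coverage` in this toy model should generally reduce the number of contigs and increase the fraction of the genome they span, which is the trend the slides illustrate.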
Raw Sequence Reads
Sample prep
Sequence data
• wet-lab experimental methods to isolate, prepare, and sequence the DNA
• results in a number of large FASTQ files
• FASTQC can be used to check basic statistics of the files
– http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• many tools available for QC
– e.g. http://hannonlab.cshl.edu/fastx_toolkit/
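To make "basic statistics" concrete, here is a minimal sketch of the kind of summary one might compute from a FASTQ file. It is not how FASTQC works internally; it assumes plain four-line records and Phred+33 quality encoding, and `fastq_stats` is a hypothetical helper name.

```python
import io

def fastq_stats(handle, phred_offset=33):
    """Summarize a FASTQ stream: read count, mean read length, mean base quality.

    Assumes four lines per record (header, sequence, '+', qualities)
    and Phred+33 encoded quality characters.
    """
    n_reads = total_bases = total_qual = 0
    while True:
        header = handle.readline()
        if not header.strip():      # end of file (or trailing blank line)
            break
        seq = handle.readline().strip()
        handle.readline()           # '+' separator line, ignored
        qual = handle.readline().strip()
        n_reads += 1
        total_bases += len(seq)
        total_qual += sum(ord(c) - phred_offset for c in qual)
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads if n_reads else 0.0,
        "mean_quality": total_qual / total_bases if total_bases else 0.0,
    }

if __name__ == "__main__":
    example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n")
    print(fastq_stats(example))
```

For a real file, pass an open file handle instead of the `io.StringIO` example.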
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.
http://www.ncbi.nlm.nih.gov/Traces/sra/
Genome Assembly Software
• Overlap-layout-consensus
– Celera: http://wgs-assembler.sourceforge.net/
• K-mer based
– Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
– SOAP-denovo: http://soap.genomics.org.cn/soapdenovo.html
– ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
– IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/
Two graph models
• A first graph model
– Nodes (vertices) are contiguous sequences of k characters (k-mers)
– Directed edge from v_i to v_j if v_i[2..k] = v_j[1..k-1]
Example (k=3): A C G T T C
ACG → CGT → GTT → TTC
Two graph models
• De Bruijn graph
– Nodes (vertices) are contiguous sequences of k-1 characters ((k-1)-mers)
– Directed edge from v_i to v_j if v_i[1..k-1] + v_j[k-1] is a valid k-mer
Example (k=3): A C G T T C
Nodes: AC → CG → GT → TT → TC
Edges: ACG, CGT, GTT, TTC
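Both models can be built in a few lines. The following is a sketch under the definitions on the two slides above, using the slides' example sequence ACGTTC with k=3; the function names are illustrative.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_graph(seq, k):
    """First model: nodes are k-mers; edge u -> v when u[2..k] == v[1..k-1]."""
    nodes = sorted(set(kmers(seq, k)))
    return [(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]]

def de_bruijn_graph(seq, k):
    """De Bruijn model: nodes are (k-1)-mers; every k-mer in seq is one edge."""
    return sorted({(kmer[:-1], kmer[1:]) for kmer in kmers(seq, k)})

if __name__ == "__main__":
    print(kmer_graph("ACGTTC", 3))
    # [('ACG', 'CGT'), ('CGT', 'GTT'), ('GTT', 'TTC')]
    print(de_bruijn_graph("ACGTTC", 3))
    # [('AC', 'CG'), ('CG', 'GT'), ('GT', 'TT'), ('TT', 'TC')]
```

Note that `kmer_graph` connects every pair of k-mers with a matching overlap, whether or not the two k-mers were ever adjacent in the input, whereas `de_bruijn_graph` only creates edges for k-mers that actually occur.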
Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Note edges that are not reflected in the input!
Genome Assembly
• Building the k-mer graph
– nodes as k-mers, edges (k-1) overlap
Genome assembly
Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT
k=4: GACG → ACGT (1), ACGT → CGTA (1)
k=3: GAC → ACG (1), ACG → CGT (1), CGT → GTA (1)
Genome assembly
Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT
k=4: GACG → ACGT (1), ACGT → CGTA (1), CGTA → GTAC (1), GTAC → TACG (1)
k=3: GAC → ACG (1), ACG → CGT (1), CGT → GTA (2), GTA → TAC (1), TAC → ACG (1)
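Genome assembly
Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT
k=4: GACG → ACGT (1), ACGT → CGTA (1), CGTA → GTAC (1), GTAC → TACG (1), TACG → ACGT (1), ACGT → CGTT (1)
k=3: GAC → ACG (1), ACG → CGT (2), CGT → GTA (2), GTA → TAC (1), TAC → ACG (2), CGT → GTT (1)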
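The edge weights in the example above can be reproduced by counting how often two k-mers appear consecutively in a read. This is a minimal sketch; splitting the slide's read string into the three 6-base reads GACGTA, CGTACG, and TACGTT is an assumption that reproduces the k-mers shown, and `weighted_kmer_graph` is an illustrative name.

```python
from collections import Counter

def weighted_kmer_graph(reads, k):
    """Nodes are k-mers; an edge's weight counts how many times
    its two k-mers occur consecutively in some read."""
    weights = Counter()
    for read in reads:
        for i in range(len(read) - k):
            weights[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return weights

if __name__ == "__main__":
    reads = ["GACGTA", "CGTACG", "TACGTT"]  # assumed split of the slide's reads
    for k in (4, 3):
        print(f"k={k}")
        for (u, v), w in sorted(weighted_kmer_graph(reads, k).items()):
            print(f"  {u} -> {v} ({w})")
```

Edges with weight 2 are supported by two reads; in this example they come from the repeated ACGT in the genome and from overlapping reads.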
Genome Assembly
• Building the k-mer graph
– nodes as k-mers, edges (k-1) overlap
– nodes as (k-1)-mers, edges form k-mers
Genome assembly
Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT
k=4: nodes GAC, ACG, CGT, GTA
k=3: nodes GA, AC, CG, GT, TA; edges GAC, ACG, CGT, GTA
All edge weights are 1.
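Genome assembly
Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT
k=4: nodes GAC, ACG, CGT, GTA, TAC
k=3: nodes GA, AC, CG, GT, TA; edges GAC, ACG, CGT, GTA, TAC
[Figure: edge weights 1 1, 1 3 2, 2 1, 2, 1, 1; graph layout not recoverable]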
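Genome assembly
Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT
k=4: nodes GAC, ACG, CGT, GTA, TAC, GTT
k=3: nodes GA, AC, CG, GT, TA, TT; edges GAC, ACG, CGT, GTA, TAC, GTT
[Figure: edge weights 1 2, 1 4 2, 2 1, 2, 2, 1, 1, 2, 1; graph layout not recoverable]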
Genome Assembly
• Building the k-mer graph
– G(k): nodes as k-mers, edges (k-1) overlap
– H(k): nodes as (k-1)-mers, edges form k-mers
• H(k) = G(k-1)
– So it really does not matter which you choose to implement
• Where does the complexity come from?
– Sequencing errors, repeats, uneven coverage, contamination from other organisms, ploidy, unsequenced regions
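The identity H(k) = G(k-1) can be checked directly on the running example. This sketch makes one assumption that matters: an edge of G is drawn only when its two k-mers are actually adjacent in some read. If G instead connected every pair of k-mers with a matching overlap, G(k-1) could contain extra edges that H(k) lacks. The function names and the three-read split are illustrative.

```python
def g_edges(reads, k):
    """G(k): nodes are k-mers; edge between k-mers adjacent in a read."""
    return {(r[i:i + k], r[i + 1:i + k + 1])
            for r in reads for i in range(len(r) - k)}

def h_edges(reads, k):
    """H(k): nodes are (k-1)-mers; each k-mer in a read is one edge
    from its (k-1)-prefix to its (k-1)-suffix."""
    return {(r[i:i + k - 1], r[i + 1:i + k])
            for r in reads for i in range(len(r) - k + 1)}

if __name__ == "__main__":
    reads = ["GACGTA", "CGTACG", "TACGTT"]  # assumed split of the slide's reads
    for k in (3, 4):
        print(f"H({k}) == G({k - 1}):", h_edges(reads, k) == g_edges(reads, k - 1))
```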
Popping bubbles
Error occurs in the middle of a read and is propagated to many k-mers.
Trimming tips
Error creates an erroneous ending k-mer
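A rough idea of tip trimming, as a sketch rather than any particular assembler's method: here a tip is assumed to be a dead-end chain of at most `max_tip_len` nodes hanging off a node that has another outgoing edge. Only dead ends at the end of a chain are handled, not dangling starts, and `trim_tips` and `max_tip_len` are hypothetical names.

```python
from collections import defaultdict

def trim_tips(edges, max_tip_len=2):
    """Remove short dead-end chains (tips) from a directed graph of (u, v) edges."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        succ, pred = defaultdict(set), defaultdict(set)
        for u, v in edges:
            succ[u].add(v)
            pred[v].add(u)
        nodes = set(succ) | set(pred)
        dead_ends = sorted(n for n in nodes if not succ[n])
        for end in dead_ends:
            chain, node = [], end
            # Walk backward along an unbranched chain of limited length.
            while len(chain) < max_tip_len and len(pred[node]) == 1:
                chain.append(node)
                (parent,) = pred[node]
                if len(succ[parent]) > 1:
                    # Reached a branch point: the chain is a tip, drop its edges.
                    edges = {(u, v) for u, v in edges
                             if u not in chain and v not in chain}
                    changed = True
                    break
                node = parent
            if changed:
                break  # rebuild adjacency before looking for more tips
    return edges

if __name__ == "__main__":
    main_path = {("AAC", "ACG"), ("ACG", "CGT"), ("CGT", "GTT"),
                 ("GTT", "TTC"), ("TTC", "TCA")}
    tip = {("CGT", "GTA")}  # erroneous ending k-mer
    print(sorted(trim_tips(main_path | tip)))
```

The length limit is what keeps the true end of the sequence from being removed: it is also a dead end, but it sits at the end of a long chain.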
Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Chimeric extensions
Errors connect two nodes in the graph which do not correspond to a valid extension in the genome sequence
Repetitive regions
• Satellites, SINEs, LINEs
• Homologous genes
– Ortholog: descended from the same ancestral sequence and separated by speciation
– Paralog: genes created by a duplication event
Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Velvet assembler
• Four stages
– Hashing reads into k-mers
– Constructing the de Bruijn graph (not all 4^k k-mers, only those that exist in input)
– Correct errors
– Resolve repeats
• But what after?
– Paper gives very little information on this...
The Chinese postman problem (CPP)
• Compute a closed tour of minimum length that visits each edge at least once
– Similar to what we want except we may want to visit edges more than once due to repeats
• How do we deal with repeats?
– Also, the starting and ending vertices are distinct in genome assembly
• How can we convert the closed tour to an open one?
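One simple way to approach both questions is to replace the CPP with an open Eulerian path on a multigraph: each edge is duplicated according to an assumed multiplicity (repeats), and the walk runs from the one node with an extra outgoing edge to the one node with an extra incoming edge. This is a sketch using Hierholzer's algorithm, not the Edmonds-Johnson matching method; it assumes the multiplicities are already known (here they are taken from the genome GACGTACGTT itself, with k=3), which is the hard part in practice. The function names are illustrative.

```python
from collections import Counter, defaultdict

def eulerian_path(edge_counts):
    """Open Eulerian path on a directed multigraph.

    edge_counts maps (u, v) to the number of times that edge must be used.
    """
    out_edges = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree per node
    for (u, v), count in edge_counts.items():
        out_edges[u].extend([v] * count)
        balance[u] += count
        balance[v] -= count
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    if len(starts) != 1 or len(ends) != 1 or any(abs(b) > 1 for b in balance.values()):
        raise ValueError("graph has no open Eulerian path")
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [starts[0]], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != sum(edge_counts.values()) + 1:
        raise ValueError("graph is not connected")
    return path

def spell(path):
    """Spell the sequence given by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])

if __name__ == "__main__":
    # 3-mers of GACGTACGTT as edges between 2-mers, with their counts.
    edge_counts = {("GA", "AC"): 1, ("AC", "CG"): 2, ("CG", "GT"): 2,
                   ("GT", "TA"): 1, ("TA", "AC"): 1, ("GT", "TT"): 1}
    print(spell(eulerian_path(edge_counts)))  # GACGTACGTT
```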
Your homework
• You are not required to implement section 4 of http://web.eecs.umich.edu/~pettie/matching/Edmonds-Johnson-chinese-postman.pdf
• You are not even required to model genome assembly as CPP
• But you do have to build the k-mer graph, correct errors, resolve repeats, and compute a CPP or Eulerian-like tour.
Evaluating assembly
• The Assemblathon 2 study lists 102 measures for evaluating assembly quality.
• Bradnam et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
1. NG50 scaffold length: the largest length x such that all scaffolds of length x or longer together make up at least 50% of the genome size
2. NG50 contig length: the largest length x such that all contigs of length x or longer together make up at least 50% of the genome size
3. Amount of gene-sized scaffolds (>25 kbp). Useful for gene finding.
4. CEGMA: Number of 458 core genes mapped
5. Fosmid coverage: How many validated fosmid regions were captured in the assembly
6. Fosmid validity: Percentage of the assembly validated by validated fosmid regions
7. Validated fosmid region tag scaffold summary score: number of validated fosmid region tag pairs that match the same scaffold, multiplied by the percentage of uniquely mapping tag pairs that map with the correct distance. Rewards short-range accuracy.
8. and 9. Using local and global alignments of optical map data, how well the assembly is ordered.
10. REAPR summary score: a tool that evaluates the accuracy of the assembly using paired reads
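The NG50 definition in items 1 and 2 translates into a short computation. This is a minimal sketch; `ng50` is an illustrative name, and the genome size is assumed to be known or estimated separately.

```python
def ng50(lengths, genome_size):
    """Largest length x such that sequences of length >= x
    together cover at least 50% of genome_size."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0  # the assembly covers less than half of the genome

if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]
    print(ng50(contigs, genome_size=300))           # 70
    print(ng50(contigs, genome_size=400))           # 50
    print(ng50(contigs, genome_size=sum(contigs)))  # 70 (the ordinary N50)
```

Passing the total assembly length as `genome_size` gives the ordinary N50, which is why NG50 is the better choice for comparing assemblies of different total size.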