MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th,...
![Page 1: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/1.jpg)
MGM Workshop
Assembly Tutorial
Matthew Blow
DOE Joint Genome Institute,
Walnut Creek, CA
Sep 26th, 2011
![Page 2: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/2.jpg)
Contents
• Introduction to short-read genome sequencing
• Short read genome assembly theory
• Factors affecting short read genome assembly
• Short read genome assembly of microbes at JGI
![Page 3: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/3.jpg)
![Page 4: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/4.jpg)
Why sequence genomes using short reads?
               Sanger    454       Illumina HiSeq
Mb/day         1         1,000     20,000
Cost / Mb      $400      $15       $0.1
Read length    650bp     450bp     150bp
Traditional genome sequencing technology
Short-read
We have to figure out how to sequence microbial genomes using only Illumina data!
![Page 5: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/5.jpg)
Short read genome sequencing
Genomic DNA
-> Random fragmentation (molecular biology): 300bp fragments and 3-10 kb fragments
-> Sequencing (Illumina)
-> Paired-end short insert reads (10’s of millions) and paired-end long insert reads (10’s of millions)
How do we convert this data back into a genome?
![Page 6: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/6.jpg)
Contents
• Introduction to short-read genome sequencing
• Short read genome assembly theory
• Factors affecting short read genome assembly
• Short read genome assembly of microbes at JGI
![Page 7: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/7.jpg)
Why assemble?
            Unassembled reads    Assemblies
Data        Gb / Tb              Mb (data reduction)
Insights    Minimal              Genes / operons (high quality consensus); pathways; reference sequences
![Page 8: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/8.jpg)
Short read assembly strategy
Reads (~10^7 + ~10^7) -> Assembly algorithms (e.g. Allpaths, Velvet, Meraculous) -> Contigs -> Scaffolds -> ‘Finished’
![Page 9: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/9.jpg)
Short read assembly strategy
Reads (~10^7 + ~10^7) -> ‘De Bruijn’ assembly -> Contigs -> Scaffolds -> ‘Finished’
![Page 10: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/10.jpg)
De Bruijn Graph example
“It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness, it was the
epoch of belief, it was the epoch of incredulity, ...”
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman & Hall
Velvet example courtesy of J. Leipzig 2010
![Page 11: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/11.jpg)
De Bruijn example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
Generate random ‘reads’
How do we assemble?
Traditional all-vs-all comparisons of datasets this size require immense computational resources.
De Bruijn solution: represent the data as a graph
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof
stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie
eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor
ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof
esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit
stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft
ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi
astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe
ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth
ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo
ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof
astheworst sitwasthew theageoffo eepochofbe theageofwi foolishnes incredulit ofbeliefit chofincred beliefitwa beliefitwa wisdomitwa eageoffool
eoffoolish itwastheag mesitwasth epochofinc ssitwasthe itwastheep astheageof stheageoff sitwasthee thebestoft oolishness heepochofb
ochofbelie wastheepoc bestoftime mesitwasth ebestoftim pochofincr
…etc. to 10’s of millions of reads
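The ‘reads’ above can be reproduced with a few lines of Python. This is a minimal sketch, assuming error-free reads of length 10 drawn at uniformly random positions; the function name `random_reads` and the read count are illustrative choices, not part of the original tutorial.

```python
import random

def random_reads(sequence, read_length, n_reads, seed=0):
    """Sample n_reads substrings of read_length at random start positions."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    last_start = len(sequence) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [sequence[s:s + read_length] for s in starts]

# The 'genome' from the slide, without the trailing ellipsis.
text = ("itwasthebestoftimesitwastheworstoftimes"
        "itwastheageofwisdomitwastheageoffoolishness")

reads = random_reads(text, read_length=10, n_reads=200)
print(reads[:5])
```

Real sequencing reads also carry errors and come from both strands, which this toy sampler ignores.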
![Page 12: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/12.jpg)
De Bruijn example
Step 1: “Kmerize” the data
Reads:        Kmers (k=3):
theageofwi    the hea eag age geo eof ofw fwi
sthebestof    sth the heb ebe bes est sto tof
astheageof    ast sth the hea eag age geo eof
worstoftim    wor ors rst sto tof oft fti tim
imesitwast    ime mes esi sit itw twa was ast
…..etc for all reads in the dataset
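Step 1 can be sketched in Python as a sliding window over each read. The function name `kmerize` is a hypothetical helper chosen for this illustration.

```python
def kmerize(read, k):
    """Return every length-k substring of read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]

for read in reads:
    print(read, " ".join(kmerize(read, 3)))
```

A read of length L yields L - k + 1 kmers, so each 10-letter read here gives eight 3-mers.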
![Page 13: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/13.jpg)
De Bruijn example
Step 2: Represent the ‘kmers’ in a graph
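Look for k-1 overlaps
the -> hea -> eag -> age -> geo -> eof -> ofw -> fwi
sth -> the -> heb -> ebe -> bes -> est -> sto -> tof
ast -> sth -> the -> hea -> eag -> age -> geo -> eof
wor -> ors -> rst -> sto -> tof -> oft -> fti -> tim
ime -> mes -> esi -> sit -> itw -> twa -> was -> ast
…..etc for all ‘kmers’ in the dataset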
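Step 2 can be sketched as a dictionary that maps each distinct kmer to the set of kmers that follow it. This is a simplified illustration, assuming that only kmers adjacent within a read are linked (adjacent kmers overlap by k-1 letters by construction); `build_graph` is a hypothetical name, and real assemblers also track coverage and reverse complements.

```python
def build_graph(reads, k):
    """Map each distinct kmer to the set of kmers that directly follow it."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # identical kmers share one node
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # left[1:] == right[:-1], a k-1 overlap
    return graph

reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]
graph = build_graph(reads, 3)

print(len(graph), "distinct kmers from", len(reads) * 8, "kmer occurrences")
print("the ->", sorted(graph["the"]))
```

Because identical kmers collapse into a single node, the five reads contribute fewer nodes than kmer occurrences, and `the` becomes a branch point with two successors.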
![Page 14: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/14.jpg)
De Bruijn example
Step 3: Simplify the graph as much as possible
-> A De Bruijn graph
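One common form of simplification is to merge every unbranched chain of kmers into a single sequence. The sketch below is an illustration only, assuming the kmer-to-successors dictionary representation used here; `build_graph` and `simplify` are hypothetical names, isolated cycles are ignored, and production assemblers additionally remove tips and bubbles.

```python
def build_graph(reads, k):
    """Map each distinct kmer to the set of kmers that directly follow it."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph

def simplify(graph):
    """Collapse unbranched chains of kmers into contig strings."""
    preds = {node: set() for node in graph}
    for node, succs in graph.items():
        for succ in succs:
            preds[succ].add(node)

    def sole_link(a, b):
        # True when a -> b is the only edge out of a and the only edge into b.
        return graph[a] == {b} and preds[b] == {a}

    contigs = []
    for node in graph:
        # Nodes that merge into a predecessor are not chain starts.
        if any(sole_link(p, node) for p in preds[node]):
            continue
        contig, current = node, node
        while len(graph[current]) == 1:
            (nxt,) = graph[current]
            if not sole_link(current, nxt):
                break
            contig += nxt[-1]  # each merged kmer adds one new letter
            current = nxt
        contigs.append(contig)
    return contigs

contigs = simplify(build_graph(["sthebestof", "theageofwi"], 3))
print(sorted(contigs))
```

In this two-read example the chain stops at `the`, which has two possible continuations, so the graph yields three sequences rather than one.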
![Page 15: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/15.jpg)
Strengths of De Bruijn approach
• Computationally efficient
• Overlaps are implicit in graph
Size of graph (and therefore computational memory requirement) is a function of genome size, not number of reads
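The last point can be checked with a small simulation. This is a sketch under stated assumptions: a random 2,000-letter ‘genome’, error-free 50-letter reads, and k = 21, all chosen arbitrarily for illustration. However many reads are added, the number of distinct kmers (graph nodes) cannot exceed the number of kmers in the genome.

```python
import random

def distinct_kmers(reads, k):
    """Set of all distinct kmers seen in the reads (the graph's nodes)."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}

rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(2000))
k, read_length = 21, 50
upper_bound = len(genome) - k + 1  # kmer positions in the genome

all_reads = []
for _ in range(10000):
    start = rng.randint(0, len(genome) - read_length)
    all_reads.append(genome[start:start + read_length])

# Count graph nodes for growing prefixes of the same read set.
counts = {n: len(distinct_kmers(all_reads[:n], k)) for n in (100, 1000, 10000)}
for n, count in counts.items():
    print(f"{n:>6} reads -> {count} distinct kmers (genome has {upper_bound})")
```

With real data, sequencing errors add spurious kmers, so the graph does grow somewhat with read number until erroneous kmers are filtered out.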
![Page 16: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/16.jpg)
Drawback of De Bruijn approach
Information is lost where repeats are ‘collapsed’
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
A ‘collapsed’ repeat
![Page 17: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/17.jpg)
Drawback of De Bruijn approach
Connectivity is lost when extracting the assembly
No single solution! Break the graph to give the final assembly
![Page 18: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/18.jpg)
De Bruijn example
The final assembly (k=3)
wor times itwasthe foolishness
incredulity age epoch be
st wisdom
of belief
Repeat with a longer “kmer” length
A better assembly (k=20)
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Why not always use longest ‘k’ possible?
Sequencing errors (read with one error: sthebentof):
k=3: sth the heb ebe ben ent nto tof (mostly unaffected kmers)
k=10: sthebentof (100% wrong kmer)
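The trade-off can be quantified for the example read. The sketch below compares the kmers of the correct read `sthebestof` with those of the erroneous read `sthebentof`; the helper names are hypothetical.

```python
def kmerize(read, k):
    """Return every length-k substring of read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def wrong_kmer_fraction(true_read, observed_read, k):
    """Fraction of kmers in the observed read that differ from the truth."""
    pairs = list(zip(kmerize(true_read, k), kmerize(observed_read, k)))
    wrong = sum(1 for true_kmer, seen_kmer in pairs if true_kmer != seen_kmer)
    return wrong / len(pairs)

for k in (3, 10):
    fraction = wrong_kmer_fraction("sthebestof", "sthebentof", k)
    print(f"k={k}: {fraction:.1%} of kmers are wrong")
```

A single substitution corrupts up to k kmers, so with k=3 only three of eight kmers are affected, while with k=10 the only kmer in the read is wrong.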
![Page 19: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/19.jpg)
Does the De Bruijn approach work for assembly of microbe genomes?
Simulate short read genome assembly from six microbes with known genomes
![Page 20: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/20.jpg)
1. What fraction of the genome should we be able to assemble?
Simulated De Bruijn assembly for six ‘known’ microbial genomes
[Figure: % genome in unique ‘kmers’ (0-100) versus ‘kmer’ length (0-150 bp), one curve per bacterium: A. haemolyticum, B. murdochii, C. flavigena, S. smaragdinae, H. turkmenica]
Kmer = 30: most of the genome CAN be assembled
97%
Kmer = 150: a small fraction of the genome remains ‘unassemblable’
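The quantity plotted, the percentage of the genome in unique kmers, can be sketched as follows. This is an assumption-laden illustration rather than the simulation used for the slide: it counts kmers on one strand only and uses the Dickens text as a stand-in genome; `percent_unique` is a hypothetical name.

```python
from collections import Counter

def percent_unique(genome, k):
    """Percentage of kmer positions whose kmer occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return 100.0 * unique / len(kmers)

text = "itwasthebestoftimesitwastheworstoftimes"
for k in (3, 5, 10, 20):
    print(f"k={k}: {percent_unique(text, k):.0f}% of positions in unique kmers")
```

Once k exceeds the longest repeat in the sequence, every kmer is unique, which is why the curves on the slide rise with kmer length.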
![Page 21: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/21.jpg)
2. How fragmented do we expect microbial assemblies to be?
Simulated De Bruijn assembly for six ‘known’ microbial genomes
[Figure: predicted number of fragments in the assembly (0-250) for each microbe: A. haemolyticum, B. murdochii, C. flavigena, C. woesei, H. turkmenica, S. smaragdinae]
How do we use short read data to improve this?
![Page 22: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/22.jpg)
Short read assembly strategy
Reads (~10^7 + ~10^7) -> ‘De Bruijn’ assembly -> Contigs -> scaffolding -> Scaffolds -> ‘Finished’
![Page 23: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/23.jpg)
Scaffolding using paired-end information
Contigs from De Bruijn assembly
-> Align reads from short insert or long insert library
-> Join contigs using evidence from paired end data
-> Scaffold
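The evidence-gathering step can be sketched as counting read pairs whose two ends align to different contigs. This is a deliberately minimal illustration: the input format (one tuple of contig names per aligned pair), the `min_links` threshold, and the function name are all assumptions, and real scaffolders also use read orientation and insert size to order contigs and estimate gaps.

```python
from collections import Counter

def scaffold_links(pair_hits, min_links=2):
    """Count pairs bridging two contigs; keep joins with enough support."""
    links = Counter()
    for contig_a, contig_b in pair_hits:
        if contig_a != contig_b:  # pairs inside one contig carry no join evidence
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}

# Hypothetical alignments: (contig hit by read 1, contig hit by read 2).
pair_hits = [
    ("c1", "c1"), ("c1", "c2"), ("c2", "c1"),
    ("c2", "c3"), ("c2", "c3"), ("c1", "c3"),
]
joins = scaffold_links(pair_hits)
print(joins)
```

Requiring more than one supporting pair guards against chimeric pairs creating false joins, here discarding the single c1-c3 link.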
![Page 24: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/24.jpg)
Predicted improvement of microbial genome assemblies by scaffolding
[Figure: predicted number of fragments in the assembly (0-250) for A. haemolyticum, B. murdochii, C. flavigena, C. woesei, H. turkmenica, S. smaragdinae]
No scaffolding: ~100 fragments
Scaffolding with short insert library: ~30 fragments
Scaffolding with long insert library: 1-7 fragments
![Page 25: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/25.jpg)
Summary of short read genome assembly theory
• De Bruijn graphs can efficiently assemble massive short read datasets
• Pairing information from short reads substantially improves contiguity of assemblies
• Theoretically, complete or near-complete microbe genomes can be generated using only short-read data
![Page 26: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/26.jpg)
Contents
• Introduction to short-read genome sequencing
• Short read genome assembly theory
• Factors affecting short read genome assembly in practice
• Short read genome assembly of microbes at JGI
![Page 27: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/27.jpg)
Effect of genome properties on assembly results
Biased sequence composition
ACTGTCTAGTCAGCGCGCGCGCGCGCGCCCGCGCGCGCGGGCGGCGGCGCGGGCGGGCGCATGTAGTGATC
RESULT: incomplete / fragmented assembly
High repeat content (repeat copies labelled r in the diagram)
RESULT: misassemblies / collapsed assemblies
Polyploidy (copies labelled a and a’ in the diagram)
RESULT: fragmented assembly
![Page 28: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/28.jpg)
Effect of data properties on assembly results
Bias – non-uniform sampling of gDNA due to gDNA prep, sample prep or sequencing. RESULT: incomplete / fragmented assembly
Read quality – sequence error, inhomogeneity. RESULT: fragmented assembly
Contamination – fragments containing vector, adapter, linker, stuffer or undesirable gDNA. RESULT: incorrect and inflated assembly
Chimeric reads – distinct genomic locations artificially connected in a read. RESULT: mis-assembly
![Page 29: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/29.jpg)
How to get a good assembly
• Use the best available sequence data and informatics tools
– JGI: Constant improvement of sequencing chemistry, molecular biology and software
• Quality control of sequence data is essential
– JGI: Automated QC pipeline to detect and filter out known problems
![Page 30: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/30.jpg)
Contents
• Introduction to short-read genome sequencing
• Short read genome assembly theory
• Factors affecting short read genome assembly
• Progress in short read genome assembly of microbes at JGI
![Page 31: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/31.jpg)
Short read assemblies are improving over time
[Figure: results by quarter for Q4 2009 (n = 64), Q1 2010 (n = 90), Q2 2010 (n = 31), Q3 2010 (n = 68), Q4 2010 (n = 94), Q1 2011 (n = 43). Panels: genome size (Mb), genome GC (%), read number (millions), high quality reads (%), longest contig (Kb), contig N50 (Kb)]
Sequenced genome properties remain constant
But Illumina sequence quantity and quality are increasing…
…resulting in better microbial genome assemblies
Average results from sub-optimal “QC” assemblies
![Page 32: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/32.jpg)
How close are we to obtaining “finished” genomes using only short-read genome sequencing?
Analyze real short read genome assembly from six microbes with known genomes
![Page 33: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/33.jpg)
Comparison of short-read assemblies with simulated ‘best possible’ results
[Figure: number of fragments in the assembly, real versus simulated, for B. murdochii, C. flavigena, C. woesei, H. turkmenica, S. smaragdinae. Panels: short insert library only (axis 0-140); short + long insert libraries (axis 0-8)]
Short read assemblies are approaching predicted ‘best possible’ results
![Page 34: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/34.jpg)
Comparison of short-read assemblies with reference genome annotation
                                 Average number of fragments    Average % known genes identified
Short insert library only        85                             97.3%
Short + long insert libraries    3                              97.4%
?                                1                              100%
Short read assemblies include the vast majority of known genes
![Page 35: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/35.jpg)
Pacific Biosciences Sequencer
Long reads from “3rd generation” Pacific Biosciences sequencer hold promise for improving short-read based assemblies
[Figure: histogram of number of reads (0-14,000) by read length (1-7 Kb)]
Mean read length = 1080bp
Maximum read length >4kb
![Page 36: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/36.jpg)
Conclusion
• High quality genome sequencing using only short-reads is within reach
• Existing short-read microbial genome assemblies are minimally fragmented and contain the vast majority of known genes
• Third-generation sequencing may provide an inexpensive path to finished genomes
![Page 37: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/37.jpg)
Metagenome assembly is an ongoing challenge
- All challenges of isolate genome assembly remain
- Extra challenges from diversity and different abundance of constituent genomes
- The same strategies as isolate assembly can be used, but many heuristics fail for metagenomes
![Page 38: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/38.jpg)
De Bruijn art
![Page 39: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/39.jpg)
De Bruijn art
![Page 40: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/40.jpg)
END
![Page 41: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/41.jpg)
Useful Reviews
• Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010 Jun;95(6):315-27.
• Pop M. Genome assembly reborn: recent computational challenges. Brief Bioinform. 2009;10(4):354-366.
![Page 42: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/42.jpg)
Illumina data quality Syntrophorhabdus aromaticivorans
PASS
Read Quality
Genome Properties
Library Quality
Run Quality
![Page 43: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/43.jpg)
Illumina data quality Opitutaceae bacterium TAV2
Genome Properties
Library Quality
Run Quality
Read Quality
FAIL
![Page 44: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/44.jpg)
Metagenomes are harder to assemble
![Page 45: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/45.jpg)
Velvet gaps: Fibrobacter succinogenes & Ignisphaera aggregans
velvet gap size distribution of aligned contig shreds
[Figure: number of gaps (0-120) by gap size (bp): < 100, 100-999, 1000-1999, 2000-2999, 3000-3999, > 4000. Series: 4085750 std (trimmed to Q20); 4085750 jumping (trimmed to Q20); 4085750 std trimmed + jumping trimmed; 4086221 std; 4086221 jumping; 4086221 std + jumping]
![Page 46: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/46.jpg)
Features of Assemblers
J.R. Miller et al. Genomics 95 (2010)
Algorithm feature | OLC assemblers | DBG assemblers

Read features
Base substitutions | - | Euler, AllPaths, SOAP
Homopolymer miscount | CABOG | -
Concentrated error in 3′ end | - | Euler
Flow space | Newbler | -

Removal of erroneous reads
Based on K-mer frequencies | - | Euler, Velvet, AllPaths
Based on K-mer freq and QV | - | AllPaths
For multiple values of K | - | AllPaths
By alignment to other reads | CABOG | -

Base error correction
Based on K-mer frequencies | - | Euler, SOAP
Based on K-mer freq and QV | - | AllPaths
Based on alignments | CABOG | -

Graph construction
Reads as graph nodes | CABOG, Newbler, Edena | -
K-mers as graph nodes | - | Euler, Velvet, ABySS, SOAP
Simple paths as graph nodes | - | AllPaths

Graph reduction
Collapse simple paths | CABOG, Newbler | Euler, Velvet, SOAP
Erosion of spurs | CABOG, Edena | Euler, Velvet, AllPaths, SOAP
Bubble smoothing | Edena | Euler, Velvet, SOAP
Bubble detection | - | AllPaths
Reads separate tangled paths | - | Euler, SOAP
Break at low coverage | - | Velvet, SOAP
Break at high coverage | CABOG | Euler
High coverage indicates repeat | CABOG | Velvet

Graph partitions
Partition by K-mers | - | ABySS
Partition by scaffolds | - | AllPaths

Mate pairs
Constrain path searches | - | Euler, Velvet, AllPaths
Guide path selection | - | Euler, AllPaths
Merge contigs or fill gaps | CABOG, Shorty | Velvet, ABySS, SOAP
Transitive link reduction | CABOG | SOAP
Detect, avoid repeat contigs | CABOG | Velvet, SOAP
Create scaffolds | CABOG, Shorty | Euler, Velvet, AllPaths, SOAP
![Page 47: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/47.jpg)
Greedy overlap
(A) Overlap between two reads (agreement within the overlapping region need not be perfect); (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D; (C) Incorrect assembly produced by the greedy approach; (D) Disagreement between two reads (thin lines) that could extend a contig (thick line), indicating a potential repeat boundary. Contig extension must be terminated in order to avoid misassembly.
Pop M. Brief Bioinform 2009;10:354-366
CORRECT
INCORRECT
![Page 48: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/48.jpg)
Overlap graph of a genome containing a two-copy repeat (B).
Overlap-Layout-Consensus (OLC)
![Page 49: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/49.jpg)
Comparing Overlap and de Bruijn Graphs
Schatz et al., Genome Res. (2010)
![Page 50: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/50.jpg)
Iterative kmer evaluation: IDBA
Y Peng, et al. IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler (2010)
![Page 51: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/51.jpg)
Jeremy Leipzig (jerdemo.blogspot.com/2009/11/using-vmatch-to-combine-assemblies.html)
![Page 52: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/52.jpg)
N50
The N50 size of a set of entities (e.g., contigs or scaffolds) represents the largest entity E such that at least half of the total size of the entities is contained in entities of size E or larger.
For example, given a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20 kb), the N50 length is 4 kb because we can cover at least 10 kb with contigs of 4 kb or larger (7 + 4 = 11 kb). (http://www.cbcb.umd.edu/research/castats.shtml)
N50 length is the length ‘x’ such that 50% of the sequence is contained in contigs of length x or greater.
(Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)
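The definition translates directly into code. This is a minimal sketch using the "length x or greater" form of the definition; the function name `n50` is illustrative.

```python
def n50(lengths):
    """Largest length x such that contigs of length >= x hold half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # at least 50% of all bases reached
            return length
    raise ValueError("n50 requires at least one contig")

contig_sizes_kb = [7, 4, 3, 2, 2, 1, 1]
print(n50(contig_sizes_kb))
```

For the example above, the two largest contigs (7 + 4 = 11 kb) are the first to reach half of the 20 kb total, so the function returns 4.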
![Page 53: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/53.jpg)
Theoretical performance
Cahill et al., PLoS ONE (2010)
Assessing performance of a range of read lengths
Repeat-induced gaps
![Page 54: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/54.jpg)
Supplement: Assembler flowcharts
![Page 55: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/55.jpg)
Phrap
Bamidele-Abegunde T. 2010 http://library2.usask.ca
![Page 56: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/56.jpg)
CAP3 & PCAP
Bamidele-Abegunde T. 2010 http://library2.usask.ca
![Page 57: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/57.jpg)
MIRA
Bamidele-Abegunde T. 2010 http://library2.usask.ca
![Page 58: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/58.jpg)
Velvet
Bamidele-Abegunde T. 2010 http://library2.usask.ca
![Page 59: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/59.jpg)
Supplement: Genome Improvement
![Page 60: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/60.jpg)
Typical Microbial project
Sequencing -> Draft assembly -> FINISHING -> Annotation -> Public release
Goals:
Completely restore genome
Produce high quality consensus
![Page 61: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/61.jpg)
Metagenomic assembly and Finishing
• The size of a metagenomic sequencing project is typically very large
• Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members
• Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies
• Chimeric contigs are produced by co-assembly of sequencing reads originating from different species
• Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly
• No assemblers have been developed for metagenomic data sets
The whole-genome shotgun sequencing approach has been used for a number of
microbial community projects; however, useful quality control and assembly
of these data require reassessing methods developed to handle relatively
uniform sequences derived from isolate microbes.
![Page 62: MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th, 2011.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e3c5503460f94b2f18d/html5/thumbnails/62.jpg)
QC: Annotation of poor quality sequence
To avoid this:
make sure you use high quality sequence
choose proper assembler