Design Goals Crash Course: Reference-guided Assembly.
Date posted: 19-Dec-2015
Next-Gen Sequence Lengths
[Figure: sequence length (bp) by sequencing technology, showing max, mean, and min. First panel: Capillary (Sanger) and Roche 454 FLX, axis 0 to 1600 bp. Second panel: Illumina, AB SOLiD, and Helicos, axis 0 to 80 bp.]
Unique Genome Coverage (H. sapiens)
[Figure: unique genome coverage (0% to 100%) versus sequence length (3 to 33).]
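The idea behind a unique-coverage curve can be sketched on a toy sequence: for each sequence length k, count the fraction of start positions whose k-mer occurs exactly once. This is only an illustration under simplifying assumptions (forward strand only, exact matches, an in-memory string); the function name `unique_fraction` and the toy sequence are hypothetical and not taken from the slides.

```python
from collections import Counter


def unique_fraction(genome: str, k: int) -> float:
    """Fraction of start positions whose k-mer occurs exactly once in genome.

    Assumption: forward strand only, exact matching.
    """
    n = len(genome) - k + 1
    if n <= 0:
        return 0.0
    # Count every k-mer, then count positions whose k-mer is a singleton.
    counts = Counter(genome[i:i + k] for i in range(n))
    unique = sum(1 for i in range(n) if counts[genome[i:i + k]] == 1)
    return unique / n


if __name__ == "__main__":
    toy = "ACGTACGTTTGACCA"  # hypothetical toy sequence
    for k in (2, 4, 6):
        print(k, round(unique_fraction(toy, k), 2))
```

Longer sequences are less likely to repeat, so the fraction tends to rise with k, which is the shape the coverage plot describes.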
Mixing It Up: Paired-end Reads
[Figure: histogram of read pairs (count, 0 to 1800) versus fragment length (bp, 0 to 350).]
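A fragment-length histogram like this one can be built from the aligned positions of each read pair. The sketch below assumes both mates have the same length, align to the same reference, and are reported by their 0-based leftmost aligned base; the function names and the bin size are hypothetical choices for illustration.

```python
from collections import Counter


def fragment_length(pos_a: int, pos_b: int, read_len: int) -> int:
    """Fragment length spanned by two mates of equal length.

    Assumption: positions are 0-based leftmost aligned bases.
    """
    left, right = sorted((pos_a, pos_b))
    return right + read_len - left


def histogram(pairs, read_len: int, bin_size: int = 50) -> dict:
    """Count read pairs per fragment-length bin (keyed by bin start)."""
    bins = Counter()
    for a, b in pairs:
        length = fragment_length(a, b, read_len)
        bins[(length // bin_size) * bin_size] += 1
    return dict(sorted(bins.items()))


if __name__ == "__main__":
    pairs = [(100, 265), (0, 165), (10, 300)]  # hypothetical mate positions
    print(histogram(pairs, read_len=35))
```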
C. elegans: a case for INDELs
SPEED
100 million Illumina reads
Alignment time: 93 min (17,800 reads/s)
Assembly time: 100 min
INDELS
INDEL validation rate: 89.3 % (216)
SNP validation rate: 97.8 % (229)
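The throughput figure is the read count divided by the alignment time. A minimal sketch using the rounded numbers from the slide; the function name is hypothetical, and because the inputs are rounded the result only approximates the quoted 17,800 reads/s.

```python
def reads_per_second(n_reads: int, minutes: float) -> float:
    """Average alignment throughput in reads per second."""
    return n_reads / (minutes * 60.0)


if __name__ == "__main__":
    # Rounded inputs from the slide: 100 million reads in 93 minutes.
    rate = reads_per_second(100_000_000, 93)
    print(f"{rate:,.0f} reads/s")
```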
Scaling Up
[Figure: reference sequence length (bp, log scale from 10,000 to 10,000,000,000) versus project date (Dec-05 to Jun-08). Labeled projects: C. elegans, M. musculus, H. sapiens, P. stipitis, M. musculus mtDNA, H. sapiens CAPON region, D. melanogaster, H. sapiens ENCODE region.]
Aligners: Feature Set
ELAND | MAQ | Newbler | SHRiMP | SOAP
Sequencing Platforms: Illumina, 454, SOLiD, capillary | Illumina | Illumina, SOLiD | 454 | Illumina, SOLiD | Illumina
Alignment Algorithm: Smith-Waterman | Hash-based | Hash-based | FlowMapper | Smith-Waterman | Hash-based
Co-assembly Creation: ?
Gapped Alignments: ?
Paired-end Reads
Platform Binaries: Windows, Mac, Linux, Sun, iPhone | Mac, Linux | Linux | Mac, Linux | Mac, Linux
Performance: Aligner
Illumina 35 bp (X Chromosome)
program    aligned reads/s
MOSAIK     180 - 16,658
ELAND      7,716
SOAP       1,637
MAQ        1,376
SHRiMP     39
[Figure: bar chart of aligned reads/s (0 to 16,000) for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP, and SHRiMP.]
Performance: Aligner
Roche 454 FLX ~250 bp
program              aligned reads/s
Roche 454 Newbler    1,176
MOSAIK               317 - 616
Using P. stipitis (15.4 Mbp) 454 FLX data set. 932,565 reads basecalled by PyroBayes†.
† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)
Accuracy: Classification
[Figure: stacked bars (0% to 100%) of unique reads and non-unique reads for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, and SOAP.]
Accuracy: Unique Read Alignment
[Figure: accuracy (0% to 100%) for reads, INDELs, and SNPs across MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, and SOAP.]
Reasons to use ?
• FAST
• Accurate
• Multiprocessor (OpenMP)
• Co-assemblies
• Gapped alignments
• Widely used
“One tool, many technologies, many applications”
(Near) Future Development
• All technologies
  – Pacific Biosciences
  – Helicos
• All application areas
  – Adapter trimming
  – Coverage graphs
• Optimization
• Improved paired-end read support
• File format standardization (SAF & SRF)
1000 Genomes Project
• Many samples with light coverage (1000 dg)
  – 100 samples from 10 populations at 2x coverage
  – Find 90% of the 1% frequency variants per population
• Trios with moderate coverage (990 dg)
  – 30 trios at 11x coverage
• If you’re looking for SNPs, are your tools and methods robust?
Scaling Up: Disk Footprint
• Current situation: files created by MOSAIK are not optimized for speed or size
  – Assembly can take a long time (slow disk speed)
• Hypothetical solution
  – Optimize the file formats
  – Ditch the built-in index
  – Keep data sorted by aligned location
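Keeping alignments sorted by aligned location means a region can be found by binary search, without a separate index. The sketch below illustrates that idea in memory only; the `Alignment` record, its fields, and both function names are hypothetical and do not describe MOSAIK's actual file format.

```python
from bisect import bisect_left
from typing import List, NamedTuple


class Alignment(NamedTuple):
    ref_index: int   # which reference sequence the read aligned to
    position: int    # 0-based leftmost aligned base
    read_name: str


def sort_by_location(alignments: List[Alignment]) -> List[Alignment]:
    """Sort by reference, then position (tuple ordering of the fields)."""
    return sorted(alignments)


def reads_in_region(sorted_alns: List[Alignment], ref_index: int,
                    start: int, end: int) -> List[Alignment]:
    """Alignments starting in [start, end) on one reference, via binary search."""
    lo = bisect_left(sorted_alns, (ref_index, start, ""))
    hi = bisect_left(sorted_alns, (ref_index, end, ""))
    return sorted_alns[lo:hi]


if __name__ == "__main__":
    alns = sort_by_location([
        Alignment(1, 50, "r3"), Alignment(0, 200, "r2"),
        Alignment(0, 10, "r1"), Alignment(0, 120, "r4"),
    ])
    print([a.read_name for a in reads_in_region(alns, 0, 100, 250)])
```

The same ordering also lets an assembler stream through the data once, from the start of each reference to its end, rather than seeking around on disk.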
Scaling Up: Memory Footprint
• Current situation: the entire human genome is stored with all associated hash locations
  – Optimized hash table ≈ 55 GB RAM
  – File-based hash table (BerkeleyDB)
    • User selects how much RAM to use
    • Dreadfully slow performance
    • Large disk footprint ≈ 65 GB file
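To see why the footprint grows with the genome, consider a toy hash index that maps every fixed-size word of the reference to all positions where it occurs. This is a sketch of the general technique only, assuming exact forward-strand words held in a Python dictionary; the function names are hypothetical and this is not MOSAIK's implementation.

```python
from collections import defaultdict
from typing import Dict, List


def build_hash_index(reference: str, hash_size: int) -> Dict[str, List[int]]:
    """Map each word of length hash_size to every position where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        index[reference[pos:pos + hash_size]].append(pos)
    return index


def stored_positions(index: Dict[str, List[int]]) -> int:
    """Total number of positions held in the index."""
    return sum(len(positions) for positions in index.values())


if __name__ == "__main__":
    idx = build_hash_index("ACGTACGT", 4)
    print(dict(idx), stored_positions(idx))
```

Because one position is stored for nearly every base of the reference, the table scales with genome length, which is the pressure behind the RAM and disk figures quoted above.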
Scaling Up: Memory Footprint
[Figure: JumpDB Memory Usage (Human Genome). Memory used (GB RAM, 0 to 70) versus hash size (bp, 9 to 18) for JumpDB and the MOSAIK hash table.]
[Figure: Alignment Performance with 35bp human reads. Reads/s (0 to 20) for Berkeley (all positions in database), Berkeley (1 position in database), Jump (all positions in file-based database), and the Mosaik hash table.]
Scaling Up: Speed & Sensitivity
• Current situation: as the hash size increases, speed increases but sensitivity decreases
• Hypothetical solution: use small hash sizes and require a clustering of a predefined length.
• Status: Implemented but not tested.
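One way to read the clustering idea: keep the hash size small for sensitivity, but accept a candidate location only when neighbouring hash hits on the same diagonal (reference position minus read offset) chain into a run of at least a predefined length. The sketch below is an interpretation under that assumption, with exact matching on the forward strand; all names and the chaining rule are hypothetical and not taken from MOSAIK.

```python
from collections import defaultdict
from typing import List, Tuple


def seed_hits(read: str, reference: str, hash_size: int) -> List[Tuple[int, int]]:
    """Return (diagonal, read_offset) for every exact word match."""
    index = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        index[reference[pos:pos + hash_size]].append(pos)
    hits = []
    for offset in range(len(read) - hash_size + 1):
        for ref_pos in index.get(read[offset:offset + hash_size], ()):
            hits.append((ref_pos - offset, offset))
    return hits


def clustered_candidates(read: str, reference: str, hash_size: int,
                         min_cluster_len: int) -> List[int]:
    """Diagonals whose overlapping or adjacent hits span >= min_cluster_len bases."""
    by_diag = defaultdict(list)
    for diag, offset in seed_hits(read, reference, hash_size):
        by_diag[diag].append(offset)
    candidates = []
    for diag, offsets in by_diag.items():
        offsets.sort()
        run_start = prev = offsets[0]
        best = hash_size  # a single hit covers hash_size bases
        for off in offsets[1:]:
            if off - prev > hash_size:  # gap between hits: start a new run
                run_start = off
            prev = off
            best = max(best, prev + hash_size - run_start)
        if best >= min_cluster_len:
            candidates.append(diag)
    return sorted(candidates)


if __name__ == "__main__":
    # Hypothetical toy data: the read occurs in full once, and its first
    # four bases occur a second time further along the reference.
    read = "ACGTTGCAAT"
    reference = "CCCCC" + read + "CCCCCC" + "ACGT" + "CCCCC"
    print(clustered_candidates(read, reference, hash_size=4, min_cluster_len=8))
    print(clustered_candidates(read, reference, hash_size=4, min_cluster_len=4))
```

With a cluster requirement of 8 bases the isolated 4-base hit is discarded, while lowering the requirement to the hash size lets it through, which shows the filter trading spurious candidates against sensitivity.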