de novo assembly course - cbs.dtu.dk · The dog assembly tracked closely to the theoretical...
27626 - Next Generation Sequencing Analysis
de novo assembly
Preprocessing and SNP calling
Natasja S. Ehlers, PhD student, Center for Biological Sequence Analysis, Functional Human Variation Group
Josef K Vogt
Slides by: Simon Rasmussen
2017
Generalized NGS analysis
Raw reads
Pre-processing
Assembly: alignment / de novo
Application specific: variant calling, count matrix, ...
Compare samples / methods
Answer?
Question
Data size
What is de novo assembly?
Merge small DNA fragments together so they form a previously unknown sequence
Merge millions of reads together so they form previously unknown sequences
de novo assembly
• Assemble reads into longer fragments
• Find overlap between reads
• Many approaches
reads > contigs > scaffolds
Let's try to assemble some reads!
• Rules:
• a minimum of 7-bp overlap
• overlap must not include any N bases
• same orientation so that the sequence can be read left to right
• there may be 1-bp differences
• simplified - no double stranded DNA
Which are valid?

..NNNNGGACTATGATTCG
            |||||||
            TGATTCGAGGCTAANN..

..NNNNNNNNCGATTCTGATCCGA
          |||||||
     GTCCTCGATTCTNNNNNNNN..

..NNNNCGGACTATGATT
            ||||||
            ATGATTCGAGGCTAANN..

..NNNNNNNNCGCTACTGATCCGA
          || | |||
     GTCCTCGATTCTGNNNNNNN..
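The overlap rules above can be turned into a small checker. This is a minimal sketch, not the course's reference solution. The function name `valid_overlap` and its parameters are hypothetical. It tests only the suffix of one read against the prefix of the next (same orientation, single strand), and it assumes that "there may be 1-bp differences" means at most one mismatch inside the overlap.

```python
def valid_overlap(left, right, min_len=7, max_mismatches=1):
    """Return the longest valid suffix/prefix overlap length, or 0 if none.

    Assumptions (not from the slides): `left` ends where `right` begins,
    and any N inside the compared region invalidates that overlap.
    """
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        a, b = left[-n:], right[:n]
        if "N" in a or "N" in b:
            continue  # rule: overlap must not include N bases
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatches:
            return n
    return 0


# First example above: 7-bp exact overlap (TGATTCG), Ns lie outside it.
print(valid_overlap("NNNNGGACTATGATTCG", "TGATTCGAGGCTAANN"))
# Third example above: only 6 matching bases (ATGATT), below the minimum.
print(valid_overlap("NNNNCGGACTATGATT", "ATGATTCGAGGCTAANN"))
```

Checking the longest candidate first means the function reports the largest overlap that satisfies all rules, which is the quantity a greedy assembler would rank on.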
Which approaches?
• Greedy (“Simple” approach)
• Overlap-Layout-Consensus (OLC)
• de Bruijn graphs
Simple approach - Greedy
• Pseudo code:
1. Pairwise alignment of all reads
2. Identify the fragments that have the largest overlap
3. Merge these
4. Repeat until all overlaps are used
• Can only resolve repeats smaller than the read length
• High computational cost with increasing number of reads
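The pseudo code above can be sketched in a few lines. This is a toy illustration under simplifying assumptions: overlaps are exact suffix/prefix matches rather than real alignments, there is a single strand and no sequencing errors, and the names `overlap` and `greedy_assemble` and the four example reads are made up for this sketch.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(reads)
    while True:
        best_n, best_i, best_j = 0, None, None
        # Step 1-2: all-vs-all comparison, keep the largest overlap.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # Step 4: no overlaps left
        # Step 3: merge the best pair into one fragment.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))
```

The nested loop inside the `while` loop is the all-vs-all comparison, which is where the high computational cost with many reads comes from.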
Reads > Contigs > Scaffolds
• Overlap Layout Consensus and de Bruijn use a similar general approach.
1. Try to correct sequence errors in reads with high coverage
2. Assemble reads to contiguous sequence fragments “contigs”
3. Identify repeat contigs
4. Combine and order contigs to “scaffolds”, with gaps representing regions of uncertainty
Overlap-Layout-Consensus
• Create overlap graph by all-vs-all alignment (Overlap)
• Build graph where each node is a read, edges are overlaps between reads (Layout)
• Example
separate paths. Short repeats of this type can be resolved, but they require additional processing and therefore additional time.
Another potential drawback of the de Bruijn approach is that the de Bruijn graph can require an enormous amount of computer space (random access memory, or RAM). Unlike conventional overlap computations, which can be easily partitioned into multiple jobs with distinct batches of reads, the construction and analysis of a de Bruijn graph is not easily parallelized. As a result, de Bruijn assemblers such as Velvet and ALLPATHS, which have been used successfully on bacterial genomes, do not scale to large genomes. For a human-sized genome, these programs would require several terabytes of RAM to store their de Bruijn graphs, which is far more memory than is available on most computers.
To date, only two de Bruijn graph assemblers have been shown to have the ability to assemble a mammalian-sized genome. ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total). SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010). Although these types of computing resources are not widely available, they are within reach for large-scale scientific centers.
In theory, the size of the de Bruijn graph depends only on the size of the genome, including polymorphic alleles, and should be independent of the number of reads. However, because sequencing errors create their own graph nodes, increasing the number of reads inevitably increases the size of the de Bruijn graph. In the de novo assembly of human from short reads, SOAPdenovo reduced the number of 25-mers from 14.6 billion to 5.0 billion by correcting errors before constructing the de Bruijn graph (Li et al. 2010). Its error correction method first counts the number of occurrences of all k-mers in the reads and replaces any k-mers occurring less than three times with the highest frequency alternative k-mer.
Choice of assembler and sequencing strategy
Only de Bruijn graph assemblers have demonstrated the ability to successfully assemble very short reads (<50 bp). For longer reads (>100 bp), overlap graph assemblers have been quite successful and have a much better track record overall. A de Bruijn graph assembler should function with longer reads as well, but a large difference between the read length and the k-mer length will result in many more branching nodes than in the simplified overlap graph. The precise conditions under which one assembly method is superior to the other remain an open question, and the answer may ultimately depend on the specific assembler and genome characteristics.
As Figure 3 illustrates, there is a direct and dramatic tradeoff among read length, coverage, and expected contig length in a genome assembly. The figure shows the theoretical expected contig length, based on the Lander-Waterman model (Lander and Waterman 1988), in an assembly where all overlaps have been detected perfectly. This model, which was widely applied for predicting assembly quality in the Sanger sequencing era, predicts that under ideal conditions, 710-bp reads should require 3× coverage to produce 4-kbp average contig sizes, while 30-bp reads would require 28× coverage. In practice, the model is inadequate for modeling very short reads: The figure also shows the actual contig sizes for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with 52-bp reads. The dog assembly tracked closely to the theoretical prediction, while the panda assembly has contig sizes that are many times lower than predicted by the model. The large discrepancy between predicted and observed assembly quality results from the fact that
Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure.
Figure 3. Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.
Schatz et al., Genome Res, 2010
Overlap-Layout-Consensus
• Create consensus sequence
• We need to use graph theory to solve the graph
• Walk the Hamiltonian path
• I.e., visit each node exactly once
Imagine trying to solve this for a graph of hundreds of thousands of nodes (= reads): finding a Hamiltonian path is an NP-complete problem
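To make the cost of the Hamiltonian-path formulation concrete, here is a brute-force sketch of overlap, layout, and consensus on four toy reads. It is illustrative only. All function names and the reads are hypothetical, overlaps are exact suffix/prefix matches, reads are assumed to be distinct, and the search tries every ordering of the reads, which is exactly what stops scaling beyond a handful of nodes.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Overlap step: nodes are reads, directed edges carry overlap lengths."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n > 0:
                    graph[(a, b)] = n
    return graph


def hamiltonian_layout(reads, min_len=3):
    """Layout step: try every read order, keep the best path that visits
    each read exactly once along existing edges (factorial cost)."""
    graph = build_overlap_graph(reads, min_len)
    best_order, best_score = None, -1
    for order in permutations(reads):
        pairs = list(zip(order, order[1:]))
        if all(p in graph for p in pairs):
            score = sum(graph[p] for p in pairs)
            if score > best_score:
                best_order, best_score = order, score
    return best_order


def consensus(order, min_len=3):
    """Consensus step (simplified): spell the sequence along the path."""
    seq = order[0]
    for prev, nxt in zip(order, order[1:]):
        seq += nxt[overlap(prev, nxt, min_len):]
    return seq


reads = ["CCTGCCGGAA", "ATTAGACCTG", "GCCGGAATAC", "AGACCTGCCG"]
order = hamiltonian_layout(reads)
print(order)
print(consensus(order))
```

With 4 reads there are 24 orderings to test, and with 20 reads there would already be more than 10^18, so real OLC assemblers rely on heuristics and graph simplification instead of exhaustive search.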
Overlap-Layout-Consensus
• Relatively high assembly error rate
• Very hard for high coverage and large genomes (all-vs-all)
• With short read lengths, hard to resolve repeats
• Good for small genomes and large read lengths (454, Ion Torrent, Pac Bio)
de Bruijn graph
• Directed graph of overlapping items (here DNA sequences)
• Instead of comparing reads, decompose reads into k-mers
• Graph is created by mapping the k-mers to the graph
• Each k-mer only exists once in the graph
• Problem is reduced to walking an Eulerian path (visiting each edge once), which is a solvable problem
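The Eulerian-path idea can be sketched as follows. This uses one common formulation in which each k-mer is an edge between its (k−1)-mer prefix and suffix. The figures in these slides draw k-mers as nodes instead, so treat the layout here as an assumption. The function names and the toy sequence are hypothetical, the input is assumed error-free, and an Eulerian path is assumed to exist.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def eulerian_assemble(kmer_list):
    """Spell a sequence that uses every k-mer (edge) exactly once."""
    graph = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree per node
    for kmer in kmer_list:
        left, right = kmer[:-1], kmer[1:]
        graph[left].append(right)
        balance[left] += 1
        balance[right] -= 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n, b in balance.items() if b == 1), kmer_list[0][:-1])
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


genome = "ATGGCGTGCA"
contig = eulerian_assemble(kmers(genome, 3))
print(contig)
# The contig uses exactly the same 3-mers as the input sequence.
print(sorted(kmers(contig, 3)) == sorted(kmers(genome, 3)))
```

The walk runs in time proportional to the number of edges. Because this toy sequence contains repeated 2-mers (TG and GC), more than one Eulerian path exists, so the spelled contig may differ from the original sequence while still using the same k-mers. That is the repeat problem in miniature.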
Drawbacks ...
• Lots of RAM required (1-1000 GB!)
• Optimal k cannot be identified a priori; it must be tested experimentally for each dataset
• Small k: very complex graph; large k: limited overlap in low-coverage areas
• Iterative approach to find best assembly
How is the graph constructed?
• Same 10 reads, extract k-mers from reads and map onto graph, k = 3:
GAC ACC CCT CTA TAC ACA CAA AAG AGT
GTT
TTA
TAG
GTC
TCC
CCG
No alignment is used!
Different assemblers use different modifications of the de Bruijn graph
Complicated graphs
CTA TAC ACA CAA AAG AGT
GTT
TTA
TAG
GTC
TCC
CCG
Large genomes with many repeats/errors create very large graphs
GAC ACC CCT
GTC TCC CCT
GAC
ACCT
G to T
AAGACTCCGACTGGGACTTT
Create the de Bruijn graph of this genome using k=3
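One way to check a hand-drawn answer to this exercise is to let a few lines of code list the nodes and edges. The sketch below follows the convention used in these slides (each distinct k-mer is a node, and an edge joins successive k-mers). The function name `de_bruijn_nodes` is made up for this example, and only the forward strand is considered.

```python
from collections import Counter, defaultdict


def de_bruijn_nodes(seq, k):
    """Return (k-mer counts, edges between successive k-mers)."""
    counts = Counter()
    edges = defaultdict(set)
    prev = None
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] += 1  # each k-mer is one node; count = multiplicity
        if prev is not None:
            edges[prev].add(kmer)  # successive k-mers overlap by k-1 bases
        prev = kmer
    return counts, edges


counts, edges = de_bruijn_nodes("AAGACTCCGACTGGGACTTT", 3)
for kmer, n in counts.items():
    print(kmer, n, "->", sorted(edges.get(kmer, [])))
```

The repeat GACT shows up as nodes GAC and ACT with multiplicity 3, and ACT has three outgoing edges, which is the fork that makes the path through the graph ambiguous.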
After building: Simplify
• Clip tips (sequencing error at a read end)
• Remove low-coverage links
• Pinch bubbles (sequencing error in the middle of a read, or SNP)
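The coverage-based part of this clean-up can be illustrated with a k-mer count threshold. This is a deliberately simplified sketch. Real assemblers also use the graph topology to find tips and bubbles, while this example only drops rare k-mers. The function names, the toy reads, and the threshold `min_cov=2` are assumptions chosen for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Coverage of every k-mer over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_cov=2):
    """Keep only k-mers seen at least min_cov times."""
    return {kmer: c for kmer, c in counts.items() if c >= min_cov}


# Three error-free reads plus one read with a single substitution (A -> T
# at the second base), which creates the extra k-mers GTC and TCC.
reads = ["GACCTACA"] * 3 + ["GTCCTACA"]
counts = count_kmers(reads, 3)
solid = solid_kmers(counts, min_cov=2)
print(sorted(counts.items()))
print(sorted(solid))
```

A single substitution creates up to k new k-mers, each with coverage 1, so they stand out against the true k-mers as long as the real coverage is well above the threshold.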
Create contigs and scaffolds
• Cut graph at repeat boundaries to create contigs
• Use paired end information to resolve repeats and combine to scaffolds
• Fill potential gaps using PE reads
The assembly is done
[Figure: contigs C1, C2, C3, C4 around a repeat, combined into scaffolds S1 and S2]
Iterate parameters
• Re-run with different k-sizes, find optimum
• Compare assembly statistics such as assembly length, N50, and number of contigs
• Assembly refinement
• Break contigs not supported by PE/MP reads
• Analyze assembly using REAPR or QUAST
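The statistics named above are easy to compute from a list of contig lengths. The sketch below uses the usual N50 definition (the length of the contig at which the running total, taken from the largest contig downwards, reaches at least half of the assembly length). The function names and the example lengths are illustrative, and dedicated tools such as QUAST report many more metrics.

```python
def n50(contig_lengths):
    """Length of the contig at which half of the assembly is covered."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:  # integer test for "at least 50%"
            return length
    return 0  # empty assembly


def assembly_stats(contig_lengths):
    """Summary used to compare runs with different k-sizes."""
    return {
        "contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "N50": n50(contig_lengths),
    }


print(assembly_stats([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

When comparing runs, a higher N50 is only meaningful if the total assembly length stays close to the expected genome size, because misjoined contigs also inflate N50.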
Application Note: Sequencing
Illumina, Inc., Pub. No. 770-2011-009, current as of 8 August 2011
Figure 2: Circos Plot of MiSeq E. coli Assembly
Circos plot showing E. coli MG1655 assembly from MiSeq data. Genomic position is indicated on the white outermost ring. Genes are indicated in red (numbers are in thousands). Coverage from the de novo assembly is represented in orange, aligned to the complete reference using BLAST and filtered for 99% identity. Percent GC content is indicated in green. Grey lines show repetitive sequences and their relationships across the genome.
Figure 3: De novo Assembly Contigs and N50 Length
Series: IT 314 MIRA 3.2.1, IT 314 Ray 1.3.0, MiSeq MIRA 3.0.0, MiSeq Ray 1.3.0
A. Max contig length in bp for MiSeq assemblies (purple) using MIRA v3.0.0 and Ray v1.3.0 (green) compared to Ion Torrent assemblies (blue) using MIRA v3.2 and Ray v1.3.0 (grey). B. N50 is the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly.
analyzed using CASAVA 1.8a5, and de novo assembly was completed using Velvet. For the de novo assembly comparison between MiSeq and Ion Torrent, the open access assemblers MIRA [1] and Ray [2] were used on MiSeq data down-sampled to 50× and compared to the entire data set from Ion Torrent reads [3]. These open source assembly tools are reported to work well with both Illumina and Ion Torrent data [3], and produced results comparable to Velvet for the MiSeq data.
Results and Data Analysis
Data generated from the MiSeq and HiSeq systems showed similar cluster density and numbers of clusters passing filter. De novo assembly metrics from the HiSeq and MiSeq reads are very similar (Table 1). Comparison of HiSeq and MiSeq data with the reference sequence illustrates equivalent coverage over a range of GC content (Figure 1). Data from the MiSeq assembly overlaying the E. coli reference sequence are shown in a Circos plot (Figure 2), demonstrating excellent coverage over the entire genome.
De novo assembly data from the 2 × 150 bp MiSeq run was compared with Ion Torrent data [3]. To make an equal comparison, MiSeq data was down-sampled to 50× coverage, comprising 231 Mb, or approximately 1/7th of the data. Both the max contig length and N50 values were vastly superior in the down-sampled MiSeq data compared to the entire Ion Torrent data set (Figures 3A and B).
Conclusions
Using the same library preparation from bacterial DNA, sequencing on MiSeq was shown to be very comparable to HiSeq; both platforms yield high-quality data with > 85% bases above Q30 with even GC coverage. De novo assembly with these data also produces similar results, with excellent coverage of the reference sequence. Sequencing results generated on the MiSeq System are highly predictive of those delivered by the high-throughput HiSeq 2000 sequencing platform, making MiSeq ideal for piloting larger studies or performing independent experiments requiring speed and accuracy. For de novo assembly, the importance of high-quality, paired-end MiSeq reads is readily apparent compared to Ion Torrent. The high-quality assembly produced from MiSeq paired-end reads shows that better data give a more accurate picture of the genome.
References
1. http://www.chevreux.org/projects_mira.html
2. http://sourceforge.net/projects/denovoassembler/files
3. http://pathogenomics.bham.ac.uk/blog/2011/05/first-look-at-ion-torrent-data-de-novo-assembly
Successful de novo assembly
• Success is a factor of:
• Genome size, genomic repeats(!), ploidy
• High coverage, long read lengths, PE/MP libraries
Repeats in E. coli
Improving de novo assemblies
• Paired end & Mate pair for long range continuity
• Hybrid approaches (combine Illumina with PacBio/Oxford Nanopore)
• Synthetic long reads: Illumina Synthetic Reads (Moleculo) or 10X Genomics
• Hi-C contact maps
Two bacterial genomes de Bruijn graphs
only a handful of near-identical repeats longer than 200 bp (Fig. 3), whereas complex genomes, such as the human, usually have their repeat length determined by whether there has been an active LINE or SINE transposable element (usually around 4 kb in length for the former and between 500 bp and 1 kb for the lat-ter). As the ability to produce longer read pairs (also referred to as ‘mate pairs’ to distinguish them from the shorter read pairs) has only recently been optimized for next-generation technologies, assemblies of complex genomes have been rare.
The other main barrier for large, complex genome assem-blies is the memory overhead for these methods. Although the de Bruijn data structure is compressed, all the methods use some sort of adjunct data structures in addition to the core de Bruijn graph to map the reads to the graph. These adjunct structures are critical for leveraging additional information required for accurate assemblies, such as read pair information.
sequence length from a reference assembly. The read lengths need only be over the k-mer length to generate a reasonable assembly (in theory, k must be over 15 bp, though in practice 19 is the lowest sensible k-mer, and larger k-mers are always better, although at the expense of having to generate more coverage to support these large k-mer sizes).
The first assembler to exploit this technology was Roche’s 454 assembler, Newbler, which adapted the scheme specifically to handle the main source of error in 454 sequencing—namely, ambiguity in the length of homopolymer runs. In late 2007 and early 2008, sev-eral second-generation de Bruijn graph assemblers were released for very short reads, compatible with the Solexa technology, includ-ing SHARCGS27, VCAKE28, VELVET29, EULER-SR30, EDENA31, ABySS32 and ALLPATHS33. Some of these methods, such as VELVET, EULER-SR and ABySS, explicitly use de Bruijn graphs, whereas other methods implicitly explore a de Bruijn graph—for example, constrained by read-pair behavior, as in ALLPATHS. The methods differ in how they treat errors and to what extent they use read-pair information. Read pairs are defined as two short DNA sequence reads generated from different ends of a longer DNA molecule—for example, 35-bp reads generated from both ends of a 500 bp frag-ment. One does not know the identity of the sequence between the read pairs, but one usually has an estimate of the length of the inter-vening sequence. As it is only marginally more expensive to generate short reads in read-pair format than as single reads, extremely high coverage of read pairs is routinely available. The more advanced de Bruijn graph assemblers29,30,32,33 can use read pairs to provide long assemblies. A particular challenge has been the two-base-encoding ‘color space’ of ABI SOLiD technology. In this two-base encoding, a single error produces a systematic translation error on all subse-quent decoding of the bases for the rest of the read. In the context of an alignment, such an encoding scheme can be integrated into the alignment routine, and there is an argument that the double base encoding provides better discrimination between errors and observed differences. In de novo assembly, however, there is no ref-erence. 
The solution has been to perform the assembly directly in color space and then ‘key’ the resulting color space assembly to one of the four feasible base-pair assemblies using either a small amount of traditional sequence or the presence of a known base at the start of each SOLiD read.
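The effect of a single color error, and the four feasible decodings of a color-space sequence, can be illustrated with a small sketch. It assumes the usual two-bit convention (A=0, C=1, G=2, T=3) in which a color is the XOR of two adjacent base codes. The function names, the example sequence and the position of the error are all made up for illustration, and the model is simplified (no primer base, no quality values).

```python
# Two-bit codes for bases; a color is the XOR of two adjacent base codes.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(seq):
    """Return (first base, list of colors) for a base-space sequence."""
    codes = [BASE_TO_CODE[b] for b in seq]
    colors = [a ^ b for a, b in zip(codes, codes[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Decode colors back to bases, starting from a known first base."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # next base code = previous base code XOR color
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


seq = "ATGGCATTC"  # hypothetical example sequence
first, colors = encode_colors(seq)
assert decode_colors(first, colors) == seq

# Introduce a single color error at index 3: bases from index 4 onward all change.
bad = list(colors)
bad[3] ^= 1
corrupted = decode_colors(first, bad)
mismatches = [i for i, (x, y) in enumerate(zip(seq, corrupted)) if x != y]

# Without a known first base there are four feasible base-space decodings.
candidates = [decode_colors(b, colors) for b in "ACGT"]
```

Only one of the four `candidates` matches the true sequence, which is why a known starting base (or a small amount of conventional sequence) is needed to pick the right one.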
Whichever sequencing technology and assembly method are used, the ability to provide long assemblies critically requires that at least a proportion of the read pairs are longer than the longest common near-identical repeat in the genome. This varies considerably between genomes. Bacterial genomes often have
[Figure 3, panels a and b. Step labels: 1. Sequencing (for example, Solexa or 454); 2. Hashing; 3. Simplification of linear stretches; 4. Error (tip and bubble) removal. Marked features: linear stretches, tips, bubble.]
Figure 3 | Constructing and visualizing a de Bruijn graph of a DNA sequence. (a) An example de Bruijn graph assembly for a short genomic sequence without polymorphism. Sequence at top represents the genome, which is then sampled using shotgun sequencing in base space with 7-bp reads (step 1). Some of the reads have errors (red). In step 2, the k-mers in the reads (4-mers in this example) are collected into nodes and the coverage at each node is recorded. There are continuous linear stretches within the graph, and the sequencing errors create distinctive, low-coverage features throughout the graph. In step 3, the graph is simplified to combine nodes that are associated with the continuous linear stretches into single, larger nodes of various k-mer sizes. In step 4, error correction removes the tips and bubbles that result from sequencing errors and creates a final graph structure that accurately and completely describes the original genome sequence. (b) A full de Bruijn graph from a bacterial genome that shows the general lack of repetitive structure within the entire genome.
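The hashing step (step 2) can be sketched in a few lines of Python. This is a minimal illustration, not how any particular assembler implements it: the genome, the reads and the function names are invented, k-mers are used directly as nodes, and reverse complements are ignored.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_graph(reads, k):
    """Hash reads into k-mer nodes; return (coverage per node, edges)."""
    coverage = Counter()
    edges = defaultdict(set)
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            coverage[kmer] += 1
            if previous is not None:
                edges[previous].add(kmer)  # consecutive k-mers overlap by k-1 bases
            previous = kmer
    return coverage, edges


# Hypothetical toy data: error-free 7-bp reads tiled over a made-up genome...
genome = "ACGTTGCATGA"
reads = [genome[i:i + 7] for i in range(len(genome) - 6)]
# ...plus one copy of the first read with a sequencing error (T -> A at index 4).
reads.append("ACGTAGC")

coverage, edges = build_graph(reads, k=4)

# k-mers that do not occur in the genome come from the erroneous read.
true_kmers = set(kmers(genome, 4))
error_kmers = sorted(set(coverage) - true_kmers)
```

In this toy case the node `ACGT` has two successors, one on the true path and one leading into the three error k-mers, each seen only once. The error path ends without a successor, which is the low-coverage "tip" shape that step 4 removes.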
S10 | VOL.6 NO.11s | NOVEMBER 2009 | NATURE METHODS SUPPLEMENT
Figure 2.9: Graphical representation of the de Bruijn graph of the Streptococcus suis genome
In this representation, node sequences are represented as curves, which connect at their tips.
2.7.2 Local reference based visualisation
It quickly became clear that troubleshooting some of the algorithms described in the following chapters would require some adequate visualisation techniques. Given that most tests were done on species with a known reference, it was possible to use this sequence as a guide to the graph’s complexity.
The first solution consisted in following the path of the reference sequence through the graph, recording the properties of the nodes being traversed. Figure 2.10 represents the length and multiplicity of successive nodes on the reference path.
In this diagram, long contigs are interrupted by two types of breaks. Sometimes, two long contigs are separated by a very short, isolated fea-
Flicek & Birney, Nat. Methods 2009; Zerbino, 2009
Few repeats “more” repeats
N50: Assembly quality
• Calculate the total length of the assembly
• Order contigs by size
• Sum contig lengths, starting with the largest
• When the running sum reaches half the total, N50 is the length of the contig just added
N50: What is the smallest piece in the largest half of the assembly?
N50 example
5 scaffolds, calculate N50:
200kb
180kb
150kb
125kb
35kb
Sum: 200 + 180 + 150 + 125 + 35 = 690kb
Half: 690 / 2 = 345kb
Start adding:
200kb < 345kb
200kb + 180kb = 380kb
380kb > 345kb => N50: 180kb
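The N50 procedure can also be written as a short Python function. This is a sketch only; the names `n50` and `scaffolds_kb` are chosen here for illustration. For the five scaffolds listed above, half of the 690 kb total is 345 kb, and the running sum passes it when the second-largest scaffold is added.

```python
def n50(lengths):
    """Return the length of the contig at which the running sum, taken from
    the largest contig down, first reaches half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # same as running >= total / 2, without floats
            return length
    raise ValueError("no contigs given")


scaffolds_kb = [200, 180, 150, 125, 35]
print(n50(scaffolds_kb))  # prints 180
```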
Some assemblers
• OLC: Newbler (the only one I really use)
• de Bruijn: Allpaths-LG, SPAdes, Velvet, SOAPdenovo, Megahit
• other: MIRA, SGA