By: Harold Arlen and Edgar Harburg.
MUSCLE (Edgar 2004a,b)
[0] k–mer distance estimation for unaligned sequences
[1] distance (UPGMA) guide tree generated
[2] pairwise global alignment down tree
[a] consensus (profile) constructed
[b] insertions propagated up tree
[3] K2P distances calculated
[4] back to [1] (once)
[5] pairwise global alignment down tree (like [2])
=> sum of pairs used to accept/reject realignment
pairwise global alignment
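The accept/reject step can be sketched as a sum-of-pairs (SP) comparison between the old and the realigned alignment. This is a minimal illustration, not MUSCLE's implementation: the match, mismatch, and gap values are hypothetical, and a simple per-indel penalty stands in for the affine penalty that MUSCLE uses. Columns in which both sequences of a pair have an indel are skipped.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows; scoring values are illustrative assumptions."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue              # column discarded for this pair
        if x == "-" or y == "-":
            score += gap          # simplified linear penalty, not affine
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

def sp_score(alignment, **scoring):
    """Sum of pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(alignment, 2))

def accept_realignment(old, new, **scoring):
    """Keep the new alignment only if its SP score is strictly higher."""
    return sp_score(new, **scoring) > sp_score(old, **scoring)
```

For example, `accept_realignment(old, new)` returns True only when the realigned profiles raise the SP score, which mirrors the rule on the slide.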
BMC Bioinformatics 2004, 5:113 http://www.biomedcentral.com/1471-2105/5/113
number of changed nodes has not decreased, the process of improving the tree is considered to have converged and iteration terminates.
Progressive alignment
A new progressive alignment is built. The existing alignment is retained of each subtree for which the branching order is unchanged; new alignments are created for the (possibly empty) set of changed nodes. When the alignment at the root is completed, the algorithm may terminate, return to step 2.1 or go to Stage 3.
Stage 3: refinement
The third stage performs iterative refinement using a variant of tree-dependent restricted partitioning [12].
Choice of bipartition
An edge is deleted from the tree, dividing the sequences into two disjoint subsets (a bipartition). Edges are visited in order of decreasing distance from the root.
Profile extraction
The profile (multiple alignment) of each subset is extracted from the current multiple alignment. Columns containing no residues (i.e., indels only) are discarded.
Re-alignment
The two profiles obtained in step 3.2 are re-aligned to each other using profile-profile alignment.
Accept/reject
The SP score of the multiple alignment implied by the new profile-profile alignment is computed. If the score increases, the new alignment is retained, otherwise it is discarded. If all edges have been visited without a change being retained, or if a user-defined maximum number of iterations has been reached, the algorithm is terminated, otherwise it returns to step 3.1. Visiting edges in order of decreasing distance from the root has the effect of first re-aligning individual sequences, then closely related groups
Algorithm elements
In the following, we describe the elements of the MUSCLE algorithm. In several cases, alternative versions of these elements were implemented in order to investigate their relative performance and to offer different trade-offs between accuracy, speed and memory use. Most of these alternatives are made available to the user via command-line options. Four benchmark datasets have been used to evaluate options and parameters in MUSCLE: BAliBASE [10,11], SABmark [15], SMART [16-18] and our own benchmark, PREFAB [2].
Objective score
In its refinement stage, MUSCLE seeks to maximize an objective score, i.e. a function that maps a multiple sequence alignment to a real number which is designed to give larger values to better alignments. MUSCLE uses the sum-of-pairs (SP) score, defined to be the sum over pairs of sequences of their alignment scores. The alignment score of a pair of sequences is computed as the sum of substitution matrix scores for each aligned pair of residues, plus gap penalties. Gaps require special consideration (Figure 3). We use the term indel for the symbol that indicates a gap in a column (typically a dash '-'), reserving the term gap for a maximal contiguous series of indels. The gap penalty contribution to SP for a pair of sequences is computed by discarding all columns in which both sequences have an indel, then applying an affine penalty g + λe for each remaining gap where g is the per-gap penalty, λ is the
Figure 1. Progressive alignment. Sequences are assigned to the leaves of a binary tree. At each internal (i.e., non-leaf) node, the two child profiles are aligned using profile-profile alignment (see Figure 2). Indels introduced at each node are indicated by shaded background.
Leaves: MQTIF, LHIW, LQSW, LSF. Internal nodes: MQTIF / LH-IW and LQSW / L-SF. Root: MQTIF / LH-IW / LQS-W / L-S-F.
Figure 2. Profile-profile alignment. Two profiles (multiple sequence alignments) X and Y are aligned to each other such that columns from X and Y are preserved in the result. Columns of indels (gray background) are inserted as needed in order to align the columns to each other. The score for aligning a pair of columns is determined by the profile function, which should assign a high score to pairs of columns containing similar amino acids.
X: MQTF / LHTW / LQSW. Y: LTIF / MTIW. Result: MQT-F / LHT-W / LQS-W / L-TIF / M-TIW.
Edgar (2004)
Its space complexity is basically O(N^2) + O(L^2) + O(NL). When the sequence length exceeds the threshold (set as 10 000 residues at present), FFT-NS-2 automatically switches the DP algorithm to a memory saving one [54] and the space complexity becomes O(N^2) + O(NL). On a current desktop computer, this method can be applied to an MSA consisting of up to ~10 000 sequences. The maximum length depends on the similarity level: ~10 000 residues for distantly related sequences or ~500 000 residues for closely related sequences with global homology.
The progressive method has a drawback in that once a gap is incorrectly introduced at a step, the gap is never removed in later steps. To overcome this drawback, there are two types of solutions, the iterative refinement method [55–61] and the consistency-based method [25, 62–64]. These two procedures are quite different: the former tries to correct mistakes in the initial alignment, whereas the latter tries to avoid mistakes in advance, but both work well to improve the alignment accuracy.
Iterative refinement method with the WSP score: FFT-NS-i
In the iterative refinement method, an objective function that represents the ‘goodness’ of the MSA is explicitly defined. An initial MSA, calculated by the progressive or another method, is subjected to an iterative process and is gradually modified so that the objective function is maximized, as shown in Figure 1B. Various combinations of objective functions and optimization strategies have been proposed to date [55–61]. Among them, Gotoh’s iterative refinement method, PRRN [16], is the most successful one, and it forms the basis of recent methods, including MAFFT, MUSCLE [23, 65] and PRIME [66]. The iterative alignment option of MAFFT, called FFT–NS–i, uses the weighted sum–of–pairs (WSP) objective function [24]. As shown in Figure 1B, an MSA is partitioned into two groups, which are then realigned using an approximate group-to-group alignment algorithm [20]. The new MSA replaces the old one if it has a higher score. This process is repeated until no more improvements
Figure 1: Calculation procedures of the progressive method (A) and the iterative refinement method (B). Panel A labels: unaligned sequences, all-to-all comparisons, distance matrix 1, tree 1, group-to-group alignment, alignment 1, distance matrix 2, tree 2, alignment 2. Panel B labels: initial alignment, tree-dependent partitioning, group-to-group alignment, replace if better.
Katoh and Toh (2008)
MAFFT (Katoh and Toh 2008)
(too) many different algorithms available
uses variants of sum of pairs or COFFEE scoring
can use local or global alignment
can use structural pairwise alignments
good for low similarity sequences
‘program’ is really a large shell script that dispatches to a variety of special purpose programs
restricts access to some algorithms by alignment size
can be overridden by modifying the script
Nearest Alignment Space Termination (NAST)
DeSantis et al. (2006), Caporaso et al. (2010)
builds a multiple sequence alignment from a template
for each new sequence:
BLAST (etc.) to find most similar template sequence
pairwise alignment of template and new sequence
insert into template without introducing insertions
can cause local mis–alignments (or worse)
primarily used for identification (DNA barcoding, etc.)
other better options (i.e. identification algorithms)
translatorX (Abascal et al. 2010)
[1] translates nucleotides to amino acids (standard tables)
[2] aligns amino acids using an external program
can be manually edited
can be aligned using an ‘unsupported’ program
[3] reverse translates back to the original nucleotides
removes incomplete codons from the ends
has difficulty with long strings of ambiguous nucleotides
useful for difficult to align coding regions
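Step [3] can be sketched as threading the original codons back onto the aligned amino acids. This is a minimal illustration under stated assumptions, not translatorX's code: the function name is hypothetical, the reading frame is assumed to start at the first nucleotide, and only a trailing incomplete codon is dropped.

```python
def back_translate(aligned_protein, nucleotides):
    """Map an aligned protein (gaps as '-') back onto its coding sequence."""
    usable = len(nucleotides) - len(nucleotides) % 3   # drop trailing incomplete codon
    codons = [nucleotides[i:i + 3] for i in range(0, usable, 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons):
        raise ValueError("protein and nucleotide lengths do not agree")
    out, k = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")      # one amino acid gap becomes three nucleotide gaps
        else:
            out.append(codons[k])  # reuse the original codon, not a re-encoded one
            k += 1
    return "".join(out)
```

Because the original codons are reused, synonymous differences among sequences survive the round trip, which is the point of aligning at the protein level and reporting at the nucleotide level.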
sequence quality
base–by–base error probability for base–calling programs
reflects assay bias (e.g. detection chemistry, algorithms)
allows for more efficient sequence editing and assembly
allows for ‘poorly supervised’ automation
base calling: PHRED...
Ewing et al. (1998), Ewing and Green (1998)
the ‘standard’ open base–caller for ABI BigDye chemistry
works with other chemistries also
more ABI training data => best for ABI
ABI’s KB base–caller is good (better), but closed source
other base–callers for other chemistries (e.g. LifeTrace)
most algorithmic differences among programs are minor
differences are mostly a result of different training data
algorithms are empirically derived (i.e. kluges)
...base calling: PHRED...
[1] calculate ideal peak locations
assumes relatively even spacing
chromatograms converted from log to linear
[2] locate peaks in trace data
[3] compare ideal and actual peaks (align)
merge and split peaks based on ideal peaks
call bases using signal intensity
[4] call ambiguous bases
near equal signal for multiple bases
PHRED: ideal peak locations...
[1a] preliminary peaks for each dye color
the maximum value between a pair of inflection points
midpoint is used if there is no maximum
must be 10% above previous peak (background)
[1b] synthetic trace of preliminary peaks (all dye colors)
height = 1, width = 1/4 local peak–to–peak distance
[1c] sliding window: each peak ± 200 scans
calculate mean scaled standard deviation of peak–to–peak distance
<0.45 == good spacing
...PHRED: ideal peak locations
[1d] select starting point
window of lowest mean scaled standard deviation
work right to end, then left to start
[i] construct a ‘damped’ synthetic trace at the current position
Fourier transform the synthetic trace
i.e. fit to a sine wave function
[ii] if mean scaled standard deviation >0.45 => force average spacing; else: modify fit based on direction (left or right) and other kluges
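The spacing test used in [1c] and [ii] can be sketched as follows. The slides do not define "mean scaled standard deviation" exactly; this sketch assumes it means the standard deviation of peak-to-peak distances divided by their mean, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def scaled_sd(peak_positions):
    """Standard deviation of peak-to-peak distances divided by their mean (assumed definition)."""
    gaps = [b - a for a, b in zip(peak_positions, peak_positions[1:])]
    return pstdev(gaps) / mean(gaps)

def good_spacing(peak_positions, threshold=0.45):
    """True when spacing in the window is regular enough (threshold from the slide)."""
    return scaled_sd(peak_positions) < threshold
```

Evenly spaced peaks give a value of 0, while a window with merged or missing peaks gives a large value and would fall back to forced average spacing.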
PHRED: locate peaks
[2a] for each dye color search original trace for ‘concave’ regions
sum fluorescence signal for each scan to estimate peak ‘area’ (area under the curve)
[2b] accept peaks that are at least 10% bigger than the average of the 10 previous peaks and 5% larger than the previous peak
peak location == geometric center
PHRED: ideal vs. actual peaks
[3a] align ideal and actual peaks
similar to a sequence alignment algorithm
[3b] call exact matches (highest intensity signal)
[3c] call large shifted peaks (>0.2 relative average area)
[3d] call small shifted peaks (>0.1 relative average area)
[3e] remaining uncalled peaks are either called, or saved as ‘best uncalled peak’
if no signal predominates, called as ‘N’
PHRED: call ambiguous bases
[4] any peaks not assignable to a predicted peak are called provided:
[4a] it is the strongest signal at a given scan
[4b] >10% above background
[4c] is unsplit (i.e. is just one peak)
[4d] is flanked by called peaks
[4e] adding the peak improves local peak spacing
error probabilities: PHRED
calculates probabilities using a local window
able to distinguish between ‘good’ and ‘bad’ regions
not able to distinguish overall ‘good’ from ‘bad’
outputs log probabilities
e.g. q = -10 • log10(p) [p = 0.001; q = 30]
predicts quality by measuring peak properties
similar to linear discriminant analysis
without assumption of normality (data are not normal)
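The log transformation on this slide is easy to check numerically. The two helper names below are hypothetical; the formula is the one given above, q = -10 • log10(p), and its inverse.

```python
import math

def phred_quality(error_probability):
    """q = -10 * log10(p)"""
    return -10 * math.log10(error_probability)

def error_probability(quality):
    """Inverse transformation: p = 10 ** (-q / 10)"""
    return 10 ** (-quality / 10)
```

With p = 0.001 this gives q = 30, matching the example on the slide; q = 20 corresponds to one expected error in 100 bases.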
sequence error probabilities
PHRED: signs of error
(a) peak spacing (7–peak window)
(b) height of largest uncalled peak relative to smallest called peak (7–peak window)
(c) height of largest uncalled peak relative to smallest called peak (3–peak window)
(d) distance from the nearest unresolved base (•-1)
PHRED: threshold values
need training set (i.e. resequence known regions)
usually calculated from plasmid sequences
not directly comparable to PCR products
produce a lookup table for q = 1–50
compute empirical error rate for each parameter
new sequence versus known sequence
can be generated for any sequencing technology
Illumina base calling
model–based: AYB (Massingham and Goldman 2012), Bustard (Illumina default), BayesCall (Kao and Song 2009), naiveBayescall (Kao and Song 2011), Onlinecall (Das and Vikalo 2012), Rolexa (Ledergerber and Dessimoz 2011), Softy (Das and Vikalo 2013), Swift (Whiteford et al. 2009), etc.
(supervised) machine learning: Altacyclic (Erlich et al. 2008), freeIbis (Renaud et al. 2013), Ibis (Kircher et al. 2009), etc.
Illumina base calling
important parameters:
cross–talk among dyes
phasing (i.e. secondary signals) as a function of cycle
signal decay as a function of cycle
intensity of the previous cycle
intensity of the current cycle
intensity of the next cycle
Illumina base calling
thymine retention (Kircher et al., 2009). The reads produced by both versions were aligned back to the φX174 genome, and the number of sequences mapped and average edit distance was computed. We observed that LIBOCAS outperforms the previous SVM library for both metrics.
Because the introduction of incorrectly labelled training examples could influence the quality of the SVM model, we sought to evaluate whether our masking procedure would have an effect on the number of mapped reads. The mapping statistics confirmed that masking divergent bases on the φX genome improves the final sequence accuracy (170572 sequences mapped) compared with not masking any bases (170220) or masking random bases (170225).
We tested freeIbis on a recent paired-end GAIIx run from mid-2011 from our own sequencing centre with 2 × 126 cycles and a single index of seven nucleotides. This multiplexed run had both human DNA as target, and φX174 as control and was basecalled using the previous version, Ibis, and the current one, freeIbis as well as naiveBayesCall (v. 0.3) and All your base (AYB, v2.08). We compared how each performed in terms of sequence accuracy, the number of sequences mapped and edit distance to the reference, as well as runtime (Table 1). We showed that freeIbis provides more high-quality base calls, leading to an increased number of reads being mapped to the reference with a lower edit distance than is the case for other basecallers. The predicted versus observed quality scores were plotted for Bustard and for freeIbis (Fig. 1). The sequences for the two GA runs used for comparison were produced using Bustard Off-Line Basecaller (OLB v.1.9.3). Our results show that freeIbis offers an improved accuracy and calibrated quality scores for these sequencing runs (including one on a HiSeq and another on a MiSeq) and outperforms Bustard on runs with unusually high error rates (see Supplementary Data).
Using the genotype calls from the same sequencing data but using three different basecallers (Ibis, freeIbis and Bustard) to compare with calls from Sanger sequences, we determined that freeIbis offers improved genotyping accuracy (see Supplementary Data).
4 CONCLUSION
FreeIbis provides substantial improvements in sequence accuracy, quality score calibration and genotyping accuracy over Bustard, and is more computationally efficient than equally accurate model-based methods such as AYB.
ACKNOWLEDGEMENTS
We would like to thank the Bioinformatics Group, the Sequencing and the Population Genetics Group at the Max Planck Institute for Evolutionary Anthropology for providing data and feedback. We are also indebted to Vojtech Franc, Yun Song, Hazel Marsden and Tim Massingham who provided support for use of their software.
Funding: Work was funded by the Max Planck Society
Conflict of Interest: none declared.
REFERENCES
Das,S. and Vikalo,H. (2012) Onlinecall: fast online parameter estimation and basecalling for illumina’s next-generation sequencing. Bioinformatics, 28, 1677–1683.
Erlich,Y. et al. (2008) Alta-cyclic: a self-optimizing base caller for next-generation sequencing. Nat. Methods, 5, 679–682.
Franc,V. and Sonnenburg,S. (2009) Optimized cutting plane algorithm for large-scale risk minimization. J. Mach. Learn. Res., 10, 2157–2192.
Kao,W. et al. (2009) Bayescall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res., 19, 1884.
Kircher,M. et al. (2009) Improved base calling for the illumina genome analyzer using machine learning strategies. Genome Biol., 10, R83.
Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25, 1754–1760.
Massingham,T. and Goldman,N. (2012) All your base: a fast and accurate probabilistic approach to base calling. Genome Biol., 13, R13.
McKenna,A. et al. (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.
Whiteford,N. et al. (2009) Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics, 25, 2194–2199.
Table 1. Accuracy for each basecaller on an Illumina GAIIx dataset (2 × 126 cycles with 366 135 257 clusters)
Basecaller       Training time, calling time   Mapped (%)a             Edit distance
Bustard          -                             583 348 201 (83.93%)    1.379
naiveBayesCall   591 h, 658 h                  578 957 145 (83.34%)    1.496
AYB              394 h (single value)          593 183 967 (85.52%)    1.076
Ibis             19.4 h, 13.2 h                592 929 953 (85.31%)    1.167
freeIbis         21.3 h, 12.2 h                594 095 219 (85.48%)    1.145
The human sequences were mapped to the hg19 version of the human genome. The number of mapped sequences and the average number of mismatches for those were tallied for each method. Time trials were conducted on a machine with 74 GB of RAM and using 8 of the 12 Intel Xeon cores running at 2.27 GHz. aPercentage relative to sequences assigned to the read group of interest.
Fig. 1. Plot of the predicted versus the observed base quality score for control reads. Ideally the base qualities should follow the diagonal line. The root mean square error (RMSE) shows that quality scores predicted using freeIbis have a greater correlation to their observed error rates
(Renaud et al. 2013)
Das and Vikalo BMC Bioinformatics 2013, 14:129 http://www.biomedcentral.com/1471-2105/14/129
where sj can take values of unit vectors comprising three zeros and one non-zero entry equal to 1, and 1 ≤ j ≤ 4. Base probabilities P(Si = sj | Y, λ, Θ) can be calculated from the state probabilities of the trellis that we defined in the parameter estimation section, e.g.,
\[ P(S_i^{T} = [1\ 0\ 0\ 0]) = \sum_{j=1}^{4} P(T_i = t_j \mid Y, \lambda, \Theta), \tag{15} \]
and so on. Note that these probabilities are also the ‘quality score’ assigned to the given basecall (more on quality scores in the next section). Clearly, we need to find posteriori probabilities P(Ti = tj | Y, λ, Θ). For this, we again turn to the soft-output Viterbi and forward-backward algorithms that we described in the previous section.
Note that the value of λ used for base calling in window l is approximated by the value of λ which maximizes the log-likelihood function formed using Θ and Si from the previous window, l − 1 (except in window l = 1 where we use Si provided by Bustard). It is straightforward to show that this maximization entails solving the quadratic equation in λ
\[ \sum_{i=(l-1)W+1}^{lW} \left[ 4\lambda^{2} + \frac{(K_i X_i)^{T} \Sigma_i^{-1} Y_i}{\left(\prod_{j=2}^{i}(1-d_j)\right) \lVert X_i \rVert^{2}}\,\lambda - \frac{Y_i^{T} \Sigma_i^{-1} Y_i}{\left(\prod_{j=2}^{i}(1-d_j)\right)^{2} \lVert X_i \rVert^{2}} \right] = 0, \tag{16} \]
and choosing the positive solution as the value of λ.
Quality scores
Performance of various base calling algorithms can be compared by evaluating error rates that they achieve when applied to determining the order of nucleotides in a known sequence. In practical applications, where the sequence being analyzed is not known, we need to assess the confidence of a base calling procedure. To this end, quality scores provide information as to how reliable the corresponding base calls are. The quality scores that we assign to base calls are the posterior probabilities of the bases computed by the forward-backward/SOVA schemes. In particular, we use the posteriori probabilities of the bases computed according to (15) as the quality scores. In order to assess the ‘goodness’ of quality scores, we consider their discrimination ability [10,11]. The discrimination ability for a given error rate is obtained by sorting all bases according to their quality scores in descending order and finding the number of bases called before the error rate exceeds the predefined threshold.
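The discrimination ability described above can be sketched directly from its definition. This is an illustration only: the function name and the `(quality, is_correct)` input format are assumptions, and the running error rate is taken as errors divided by bases called so far.

```python
def discrimination_ability(calls, max_error_rate):
    """Count bases called, best quality first, before the running error rate
    first exceeds max_error_rate. calls: iterable of (quality, is_correct)."""
    ranked = sorted(calls, key=lambda call: call[0], reverse=True)
    errors = 0
    for n, (_, is_correct) in enumerate(ranked, start=1):
        if not is_correct:
            errors += 1
        if errors / n > max_error_rate:
            return n - 1          # bases accepted before the threshold was crossed
    return len(ranked)
```

A base caller with well-ranked quality scores pushes its errors to the end of the sorted list, so it returns a larger count at the same error threshold.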
Results
GAII
Performance of the forward-backward algorithm and SOVA is verified on a full lane data obtained by sequencing phiX174 (EMBL/NCBI accession number J02482) bacteriophage using Illumina’s Genome Analyzer II which generates reads of length 76. After basecalling the lane by Bustard, naiveBayesCall, Rolexa, Ibis, forward-backward and SOVA, the calls were mapped onto the known reference sequence comprising 5386 bases. The optimal alignment is found using a Hamming distance metric. Reads that map with less than 30% errors are retained while reads having more errors are removed to ensure that there is no ambiguity in the alignment. This results in approximately 7 million reads and 550 million bases which are used to compare the performance of the considered basecalling schemes. Average error rates computed over the entire lane are compared in Table 1. Figure 2 shows the by tile error rates, by cycle error rates and the discrimination abilities of the different basecallers. Forward-backward algorithm and SOVA outperform all other schemes in terms of error rates and discrimination abilities.
HiSeq
Performance of the forward-backward algorithm and SOVA is verified on reads from E. coli (EMBL/NCBI accession number NC007779) using Illumina’s HiSeq2000 comprising of 100 cycle paired end data. The error rates for both pairs of reads are shown as a function of cycle number in Figure 3. Average error rates are compared in Table 2 for both SOVA and FB schemes. As can be seen, we improve on Bustard’s calls by 12.3 and 9.6% for the first and second pair respectively.
Discussion
Computational complexity
For each read, the most computationally expensive Bustard’s step is its correction of phasing effects. For both forward-backward algorithm and SOVA, we need to evaluate 16 objective functions for the states at each stage
Table 1 Comparison of error rates and speed for GAII
Decoding strategy   Error rate   Running time (min)
FB                  0.0128       400
SOVA                0.0129       300
OnlineCall          0.0137       30
naiveBayesCall      0.0139       1500
Ibis                0.0147       480
Bustard             0.0154       40
Rolexa              0.0171       720
A comparison of error rates and running times (per lane) for different basecallers (note that Bustard’s running time is underestimated since it does not account for the parameter estimation step).
(Das and Vikalo 2013)
Illumina base calling: Ibis
process image file to extract sample data
create SVM models for each cycle
train/test data from known genome sequence
intensity values of current cycle + previous and next
model outputs base call (classification)
PHRED (like) error probability (based on classification probability)
SVM
project data into a space where a hyperplane separates the classes
search for useful hyperplane(s)
related to discriminant functions
https://commons.wikimedia.org/w/index.php?curid=73710028
Illumina error probabilities
model–based:
P_A = I_A / (I_A + I_C + I_G + I_T) (Whiteford et al. 2009)
likelihood of the base call (Das and Vikalo 2012)
(supervised) machine learning:
SVM assignment scores converted to error probabilities using piecewise linear regression (Renaud et al. 2013)
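The intensity-ratio estimate attributed to Whiteford et al. (2009) can be written out as a short sketch. The function names are hypothetical, and the intensities are assumed to be non-negative and already corrected for cross-talk and phasing.

```python
def base_probabilities(intensities):
    """P_X = I_X / (I_A + I_C + I_G + I_T) for each base X."""
    total = sum(intensities.values())
    return {base: value / total for base, value in intensities.items()}

def call_base(intensities):
    """Return the most probable base and its error probability (1 - P)."""
    probabilities = base_probabilities(intensities)
    best = max(probabilities, key=probabilities.get)
    return best, 1 - probabilities[best]
```

For intensities of 80, 10, 5, and 5 for A, C, G, and T, the call is A with an error probability of about 0.2, which corresponds to a PHRED-like score of roughly 7.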
sequence contigs
•an assembly of two or more sequencing reads
•usually from different primers or library fragments
•[1] confirm sequence interpretation
•disagreement among reads must be resolved
•ambiguous bases, contradictory bases
•resolved (as best as possible) based on quality
•unresolvable coded as IUPAC polymorphism
•[2] make consensus (compromise)
•usually larger than individual reads
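Resolving one disputed position can be sketched as a quality-weighted vote that falls back to an IUPAC ambiguity code. The decision rule here is a hypothetical one chosen for illustration (summed quality per base, with a minimum margin of 10), not the rule of any particular assembler; the ambiguity codes themselves are the standard IUPAC ones.

```python
# Standard IUPAC ambiguity codes for sets of two or more bases.
IUPAC = {frozenset(bases): code for bases, code in {
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}.items()}

def resolve(calls, min_margin=10):
    """calls: list of (base, quality) from overlapping reads at one position.
    min_margin is an illustrative assumption, not a published threshold."""
    totals = {}
    for base, quality in calls:
        totals[base] = totals.get(base, 0) + quality
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= min_margin:
        return ranked[0][0]                      # one base clearly wins
    close = frozenset(b for b, t in ranked if ranked[0][1] - t < min_margin)
    return IUPAC[close]                          # unresolvable: ambiguity code
```

The sketch assumes the inputs are unambiguous A, C, G, or T calls.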
super contigs and scaffolds
•an assembly of two or more contigs
•often produced with a secondary assembler
•consensus
•a major source of error in ‘draft’ genomes
•scaffolds
•contain regions of (approximately) known size, but unknown sequence
•often represented as a uniform size (e.g. 100 Ns)
assembly quality
•N50: length–weighted median assembly length (contigs of this length or longer contain at least half of the assembled bases)
•longer is generally better
•gene content:
•count the number of reference genes found among the assemblies (BLAST, hmmer, etc.)
•Benchmarking set of Universal Single–Copy Orthologs (BUSCO; Seppey et al. 2019)
•Core Eukaryotic Genes (CEG; Parra et al. 2007)
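N50 is usually computed as the length L such that contigs of length L or longer contain at least half of the assembled bases, which makes it a length-weighted median rather than a plain median of contig lengths. A minimal sketch, with a hypothetical function name:

```python
def n50(lengths):
    """Smallest contig length L such that contigs >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")
```

For contig lengths 80, 70, 50, 40, 30, and 20 the N50 is 70, while the plain median is 45, so the two statistics can differ substantially.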
sequence trimming
•window size
•usually 20 bp
•allowable ‘error’ threshold
•ambiguous bases
•e.g. no more than 2 bases
•confidence
•e.g. no more than 2 bases with QV < 20
•[1] read from end
•[2] trim at the first window whose error is below the threshold
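The trimming rule can be sketched with the example values from the slide (20 bp window, no more than 2 bases with QV < 20). The sketch assumes that "read from end" means scanning inward from the 3′ end; the function name is hypothetical.

```python
def trim_3prime(qualities, window=20, min_quality=20, max_low=2):
    """Return how many bases to keep, scanning windows inward from the 3' end."""
    for end in range(len(qualities), window - 1, -1):
        low = sum(q < min_quality for q in qualities[end - window:end])
        if low <= max_low:
            return end            # first acceptable window: keep everything up to it
    return 0                      # no window passes: discard the read
```

Usage would look like `keep = trim_3prime(quals); read = read[:keep]`. Note that the retained read can still end with up to `max_low` low-quality bases.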
sequence ‘correction’…
•remove ‘systematic’ errors from sequencing reads
•rare bits of sequence (k–mers) that are similar to common bits of sequence (k–mers) are corrected to the more common variant
•often improves the sequence quality
•can improve the assembly size
•decreases the assembly size more often than not
•can introduce errors into the sequence
Heydari et al. BMC Bioinformatics (2017) 18:374
Table 4 NGA50 of respectively contigs (top) and scaffolds (bottom) assembled by SPAdes before and after error correction
Tools         D1          D2         D3          D4          D5          D6         D7        D8
Contig NGA50
Uncorrected   397 392     92 570     119 253     231 409     264 881     8 559      6 429     50 484
ACE           397 392 =   92 570 =   125 608 ↑   231 409 =   264 881 =   8 771 ↑    3 143 ⇓   28 679 ⇓
BayesHammer   397 392 =   92 344 ↓   132 564 ⇑   231 409 =   264 881 =   9 075 ↑    6 540 ↑   53 534 ↑
BFC           397 392 =   92 570 =   132 876 ⇑   231 409 =   264 881 =   9 375 ↑    6 389 ↓   49 185 ↓
BLESS 2       397 392 =   92 570 =   119 265 ↑   231 409 =   264 881 =   7 975 ↓    3 047 ⇓   23 814 ⇓
Blue          397 392 =   92 708 ↑   132 876 ⇑   231 409 =   289 353 ↑   7 628 ⇓    6 191 ↓   50 486 ↑
Fiona         397 392 =   92 611 ↑   119 253 =   231 409 =   264 881 =   9 224 ↑    5 346 ⇓   45 472 ↓
Karect        397 392 =   92 611 ↑   132 876 ⇑   231 409 =   264 881 =   9 865 ⇑    6 392 ↓   54 132 ↑
Lighter       397 392 =   92 570 =   132 564 ⇑   231 409 =   289 353 ↑   9 609 ⇑    6 423 ↓   50 440 ↓
Musket        397 392 =   92 566 ↓   132 876 ⇑   231 409 =   264 881 =   9 293 ↑    6 170 ↓   46 377 ↓
RACER         397 392 =   92 523 ↓   112 393 ↓   231 409 =   264 881 =   7 336 ⇓    3 244 ⇓   21 538 ⇓
SGA-EC        397 392 =   92 344 ↓   119 255 ↑   231 409 =   264 881 =   9 296 ↑    6 435 ↑   52 105 ↑
Trowel        397 392 =   92 344 ↓   119 335 ↑   231 409 =   264 881 =   7 808 ↓    6 389 ↓   48 357 ↓
Scaffold NGA50
Uncorrected   397 392     97 353     132 876     231 409     289 353     8 829      6 472     60 554
ACE           397 392 =   97 353 =   133 713 ↑   231 409 =   264 881 ↓   9 190 ↑    3 158 ⇓   35 392 ⇓
BayesHammer   397 392 =   97 353 =   133 309 ↑   231 409 =   264 881 ↓   9 443 ↑    6 576 ↑   58 570 ↓
BFC           397 392 =   97 353 =   133 088 ↑   231 409 =   264 881 ↓   9 664 ↑    6 419 ↓   59 613 ↓
BLESS 2       397 392 =   97 353 =   132 876 =   231 409 =   264 881 ↓   8 441 ↓    3 073 ⇓   35 638 ⇓
Blue          397 392 =   97 288 ↓   133 309 ↑   231 409 =   289 353 =   7 841 ⇓    6 183 ↓   61 289 ↑
Fiona         397 392 =   97 353 =   132 876 =   231 409 =   264 881 ↓   9 491 ↑    5 385 ⇓   54 188 ⇓
Karect        397 392 =   97 353 =   133 058 ↑   231 409 =   264 881 ↓   10 302 ⇑   6 446 ↓   62 304 ↑
Lighter       397 392 =   97 353 =   133 309 ↑   231 409 =   289 353 =   9 955 ⇑    6 468 ↓   59 697 ↓
Musket        397 392 =   97 353 =   133 088 ↑   231 409 =   264 881 ↓   9 502 ↑    6 219 ↓   55 842 ↓
RACER         397 392 =   97 353 =   132 876 =   231 409 =   264 881 ↓   7 603 ⇓    3 266 ⇓   23 783 ⇓
SGA-EC        397 392 =   97 353 =   132 876 =   231 409 =   264 881 ↓   9 640 ↑    6 483 ↑   60 636 ↑
Trowel        397 392 =   97 353 =   132 876 =   231 409 =   264 881 ↓   8 107 ↓    6 435 ↓   57 078 ↓
Arrows in the table are based on their value relative to the NGA50 value obtained from uncorrected data as follows: ⇓ < -10% < ↓ < 0% < ↑ < +10% < ⇑
Fig. 3, the breakpoints marked as ‘A’ and ‘B’ each occur in four cases.
In order to identify the mechanisms that cause breakpoints, the k-mer spectrum of both corrected and uncorrected data along the two contigs was examined. In this section, k = 21 is used throughout, as it corresponds to the smallest k-mer size that is used to establish overlap between individual reads by the multi-k SPAdes assembler. In Fig. 3, black bars visualize the locations of ‘lost true 21-mers’, i.e., 21-mers that do exist in the reference sequence (hence ‘true’) and also do exist in the uncorrected data but that are no longer present in the corrected data (hence ‘lost’). Lost true k-mers hence refer to those k-mers that were systematically, but erroneously removed during error correction. In many cases, lost true 21-mers
Fig. 2 SPAdes assemblies. SPAdes assembly results for D. melanogaster for (un)corrected data. Scaffolds with length NGAx or larger contain x% of the genome. Axes: x (0 to 100) against scaffold length NGAx (Kbp, 0 to 500), with an inset around x = 45 to 55. Curves: Uncorrected, ACE, BayesHammer, BFC, BLESS 2, Blue, Fiona, Karect, Lighter, Musket, RACER, SGA-EC, Trowel.
(Heydari et al. 2017)
…sequence ‘correction’
•k–mer
•e.g. BLESS (Heo et al. 2014), Hammer (Medvedev et al. 2011), HiTEC (Ilie et al. 2011), Musket (Liu et al. 2013), Quake (Kelley et al. 2010), RACER (Ilie and Molnar 2013)
•multiple sequence alignment (MSA)
•e.g. Coral (Salmela and Schröder 2011), ECHO (Kao et al. 2011), Karect (Allam et al. 2015)
Quake k–mer ‘correction’
•[1] count (or estimate) k–mer frequency
•[2] determine common/rare threshold from the data
•[3] model common k–mers with the Gaussian and/or zeta distribution
•[4] model rare k–mers with the gamma distribution
•[5] for each read remove rare k–mers:
•[a] by trimming from the 3′ end
•[b] change bases with low quality scores to more common bases
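Steps [1] and [2] can be sketched with a plain k-mer count and a fixed cutoff. This is a simplification: Quake derives the common/rare threshold from distributions fitted to the data, whereas the threshold here is simply passed in, and the function names are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def rare_kmers(counts, threshold):
    """k-mers seen fewer than `threshold` times are treated as likely errors."""
    return {kmer for kmer, n in counts.items() if n < threshold}
```

A single substitution in one read creates up to k rare k-mers, which is why rare k-mers cluster around sequencing errors and can guide trimming or base changes in step [5].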
Karect MSA ‘correction’
•[1] for each read:
•[a] global pairwise align reads to references having at least x k–mers (indels are permitted) in common
•[b] store alignment if > y alignment overlap and edit distance < z
•[2] for each read:
•[a] extract the shortest stored alignment
•[b] change bases to the modal base for each alignment position
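Step [2b] can be sketched as a per-column majority vote. This is a substitution-only simplification with hypothetical names: the candidate reads are assumed to be already aligned and padded to the reference length with '-', and the reference base takes part in the vote and wins ties. Karect itself stores alignments in a partial order graph and also handles indels.

```python
from collections import Counter

def modal_correction(reference, aligned_reads):
    """Replace each reference base with the most frequent base in its column."""
    corrected = []
    for i, ref_base in enumerate(reference):
        column = [ref_base]                       # listed first so it wins ties
        column += [read[i] for read in aligned_reads if read[i] != "-"]
        corrected.append(Counter(column).most_common(1)[0][0])
    return "".join(corrected)
```

For example, a reference read ACGTAC supported by three overlapping reads that all carry C at the third position is corrected to ACCTAC.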
Cri to reads that share with r exactly k-mer ri (Fig. 1a). Type (b): If fewer than m type (a) reads are found (m is a user-defined constraint; default m = max(30, min(150, 0.6 × estimated coverage))), Karect adds to Cri reads that may contain up to d mismatches/indels in the l-prefix or l-suffix of k-mer ri (Fig. 1b), where d is a user-defined parameter (default d = 2). To count the mismatches or indels, the Hamming or edit distance is used, respectively. Type (c): If |Cri| < m, Karect generates two smaller k′-mers of ri, where k′ = 2l and searches for exact k′-mer matches (Fig. 1c). Type (d): If |Cri| < m, Karect searches for reads that contain up to d mismatches/indels in the l-prefix or l-suffix of the k′-mer (Fig. 1d).
To reduce the effect of bias towards specific k-mers, Cri is allowed to include at most m reads sharing the same k-mer or k′-mer. Cri reads are added to Cr, and the process is repeated for other k-mers of r. For more details refer to the Supplementary Document.
2.2 Alignment and normalization of candidate readsOur goal is to correct reference read r. Karect aligns each read c in
the candidate set Cr, against r (line 5 in Algorithm CORRECTERRORS).
The result includes the start and end of c or r (semi-global align-
ment) to allow the alignment of overlaps. We use a variant of the
Needleman and Wunsch (1970) algorithm; refer to the
Supplementary Document for details.
To exclude candidate reads sequenced from different genome
regions, an alignment is considered valid only if the overlap exceeds
a threshold s1 (default s1 = max(min(0.7 × avgReadLen, 35), 0.2 × refReadLen))
and the number of mismatches/indels within
the overlap does not exceed a threshold s2 (default s2 = 25% of the
overlap). This rudimentary filter may still accept some reads
from irrelevant genome regions. To further minimize this problem,
Karect assigns a weight wc to each read (refer to Section 2.5).
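The validity filter can be written down directly from the two default thresholds. The sketch below is an illustration only; the function and parameter names are hypothetical, and the weighting from Section 2.5 is not included.

```python
def default_s1(avg_read_len, ref_read_len):
    # Default quoted in the text:
    # s1 = max(min(0.7 * avgReadLen, 35), 0.2 * refReadLen).
    return max(min(0.7 * avg_read_len, 35), 0.2 * ref_read_len)

def is_valid_alignment(overlap, num_errors, s1, s2_fraction=0.25):
    # Valid only if the overlap exceeds s1 and the number of
    # mismatches/indels does not exceed s2 (25% of the overlap by default).
    return overlap > s1 and num_errors <= s2_fraction * overlap
```

For 100-base reads the default s1 is 35, so a 50-base overlap with 10 errors passes (10 is at most 12.5), while the same overlap with 13 errors is rejected.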
Consider reference read r = CAA and candidate read c1 = GAAA.
r can be transformed to c1 by substituting C with G at position 1 and
inserting A at position 4. Substitutions are modeled as deletions followed
by insertions. Therefore, the alignment corresponds to
del(C,1); ins(G,1); ins(A,4). Now consider another candidate read
c2 = AAA. r can be transformed to c2 by the following operations:
del(C,1); ins(A,1). Observe that inserting an A at position 1 generates
the same string as inserting A at position 4. Therefore, an
equivalent representation for the alignment is del(C,1); ins(A,4). We
call this the normalized form of the alignment (line 6 in Algorithm
CORRECTERRORS), where normalization means that operations are
shifted as far as possible to the right. Normalization allows better
grouping of operations of a set of candidate reads, which enables
Karect to correct reference reads with high accuracy. In the previous
example, after normalization it is revealed that, to correct r, we
must insert an A at position 4, with high probability. The concept of
normalization is also used in DAGCon (Chin et al., 2013), but the
resulting representation is suboptimal; the details are explained in
the Supplementary Document. Note that normalization is not
required if the sequencing technology generates only substitution
errors.
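The CAA example can be reproduced with a small Python sketch. The position convention used here is an assumption made so that the example works out: ins(b, p) places b before the first surviving reference base at position p or later (1-based), and an insertion may move right past a surviving base equal to b. Karect's actual normalization is defined in the Supplementary Document and may differ; this sketch also handles one insertion at a time and ignores interactions between several insertions.

```python
def apply_ops(r, deleted, insertions):
    # Apply deletions (a set of 1-based positions) and insertions
    # (a list of (base, position) pairs) to the reference r.
    out = []
    for p in range(1, len(r) + 2):
        out.extend(b for b, q in insertions if q == p)
        if p <= len(r) and p not in deleted:
            out.append(r[p - 1])
    return "".join(out)

def normalize_insertion(r, base, pos, deleted=frozenset()):
    # Shift a single insertion as far right as possible without
    # changing the resulting string.
    p = pos
    while True:
        q = p
        while q <= len(r) and q in deleted:  # skip deleted reference bases
            q += 1
        if q <= len(r) and r[q - 1] == base:
            p = q + 1                        # step past an identical base
        else:
            return p
```

Under this convention, `normalize_insertion("CAA", "A", 1, {1})` returns 4 and `normalize_insertion("CAA", "G", 1, {1})` returns 1, matching del(C,1); ins(A,4) for c2 and del(C,1); ins(G,1) for c1. `apply_ops` confirms that ins(A,1) and ins(A,4) both yield AAA once C is deleted.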
2.3 Storing alignments in the POG
Each normalized alignment is stored in a POG Gr associated with
the reference read r (line 7 in Algorithm CORRECTERRORS). Initially,
Gr represents only r. The candidate read alignments are then added
incrementally in Gr in a manner similar to DAGCon, with the difference
that similar out-nodes (i.e. nodes connected by edges coming
out from the same node) are merged instantly; this saves time and
space. Also, in contrast to DAGCon, similar in-nodes (i.e. nodes
connected by edges going to the same node) are not merged, since
this is not required by our extraction algorithm; this also saves
computational time. Figure 2 illustrates an example of aligning four
candidate reads c1, ..., c4 to reference read r. The value on each edge
corresponds to the number of alignments passing through that edge.
We are going to modify these values in Section 2.5.
For sequencing technologies that generate only substitution
errors, instead of a POG we use an array of size |r| to accumulate
alignment weights.
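For the substitution-only case, an array of size |r| holding per-base weights could look like the Python sketch below. The data layout, the function names, and the rule of picking the heaviest base at each position are assumptions for illustration; the text only states that an array of size |r| accumulates alignment weights.

```python
def accumulate(r, aligned_reads):
    # aligned_reads: list of (offset, sequence, weight), where offset is the
    # 0-based start of the read on r. With substitutions only, each read base
    # maps to exactly one position of r.
    weights = [dict() for _ in r]
    for offset, seq, w in aligned_reads:
        for j, base in enumerate(seq):
            i = offset + j
            if 0 <= i < len(r):
                weights[i][base] = weights[i].get(base, 0.0) + w
    return weights

def heaviest_bases(r, weights):
    # Assumed extraction rule: take the base with the largest accumulated
    # weight at each position, falling back to r where nothing was recorded.
    return "".join(
        max(col, key=col.get) if col else r[i] for i, col in enumerate(weights)
    )
```

For r = ACGT with r itself and two reads supporting C at the third position, all with weight 1, the third column holds G: 1 and C: 2, so the result is ACCT.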
2.4 Extracting corrected read from the POG
Given POG Gr for a reference read r, the corrected read r' corresponds
to a path within Gr (line 8 in Algorithm CORRECTERRORS).
There are many ways to select such a path. For instance, it can be
the path that maximizes the sum of edge scores, but the quality of
error correction is expected to be low, because the heuristic favors
longer paths. As another example, DAGCon assigns each node a
score based on the weights of the out-edges and local coverage and
selects the path that maximizes the sum of node scores.
We propose a novel approach. First, we normalize all edge
weights such that the sum of the out-edge weights of any node is 1
(Fig. 3). The rationale is that, after normalization, edge weights will
reflect the transition probability between nodes. Then, the problem
is mapped to the classic problem of finding the most reliable path in
a network (Petrovic and Jovanovic, 1979), which is the path that
maximizes the product of edge weights. Since POGs are directed
Fig. 2. Example POG. The first row shows the initial POG for reference read r.
In the second row, c1 introduces an insertion and a substitution. Next, c2
includes a deletion, an insertion and a substitution and so on. At each row, the
newly introduced changes are shown in bold
Fig. 3. Normalized POG of Figure 2. The extracted path is shown in bold
(Allam et al. 2015)
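The path selection described in Section 2.4 (normalize out-edge weights so they sum to 1 per node, then pick the path that maximizes the product of edge weights) can be sketched in Python as follows. This is a generic most-reliable-path computation on a directed acyclic graph, not Karect's code: the dictionary-based graph, the node names, and the function names are hypothetical, and all edge weights are assumed to be strictly positive so that logarithms can be used.

```python
import math
from graphlib import TopologicalSorter  # Python 3.9+

def normalize(edges):
    # edges: {node: {successor: weight}}. Rescale so that the out-edge
    # weights of every node sum to 1.
    return {
        u: {v: w / sum(out.values()) for v, w in out.items()}
        for u, out in edges.items() if out
    }

def most_reliable_path(edges, source, sink):
    probs = normalize(edges)
    # TopologicalSorter expects a mapping from node to its predecessors.
    preds = {}
    for u, out in probs.items():
        preds.setdefault(u, set())
        for v in out:
            preds.setdefault(v, set()).add(u)
    best = {source: 0.0}  # best log-probability found so far per node
    back = {}
    for u in TopologicalSorter(preds).static_order():
        if u not in best:
            continue  # not reachable from source
        for v, p in probs.get(u, {}).items():
            cand = best[u] + math.log(p)  # sum of logs = log of product
            if v not in best or cand > best[v]:
                best[v], back[v] = cand, u
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return path[::-1], math.exp(best[sink])
```

With edges `{"s": {"a": 3, "b": 1}, "a": {"t": 3, "c": 1}, "b": {"t": 1}, "c": {"t": 1}}`, the normalized path s, a, t has reliability 0.75 × 0.75 = 0.5625, which beats s, b, t (0.25) and s, a, c, t (0.1875). Working with products of normalized weights rather than sums of raw weights avoids the bias toward longer paths mentioned in the text.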