Supplementary Materials
Wajid B and Serpedin E. Review of general algorithmic features for genome assemblers for next-generation sequencers.
1. ABySS
Resources are a very important aspect of genome assembly. The term `resources' in the field of computational algorithms pertains to the use of space (memory) and computational time, see Section 33. Different algorithms that try to solve the same problem are compared using three benchmarks: space, time, and the results generated by the programs. If the outputs of two programs are comparable, then the one that uses less time and space is the better choice.
In order to handle the large amounts of data generated by NGS platforms, and also to speed up the process, ABySS (1) uses distributed computing on a cluster of systems to build a distributed representation of a de Bruijn graph. This parallelizes the assembly of billions of short reads and makes memory-cost-effective assembly of genomes of virtually any size possible, see Figure S1. Key concepts for the construction of the de Bruijn graph and its simplification are derived from EULER-SR, VELVET and EDENA.
1.1. Construction of a distributed de Bruijn graph
ABySS starts by discarding all reads with unknown bases (‘N’) and then proceeds with the construction of a distributed graph. Every graph is defined by a set of nodes and edges; a distributed graph is defined by a set of distributed nodes and distributed edges. We know from earlier considerations that each k-mer acts as a node. Here each read of length l is converted into (l - k + 1) overlapping k-mers. As with a hash function within a hash table, the location where each k-mer is placed should be as unique as possible; any collisions that arise can be resolved by further processing of the data. The location of each node (k-mer) in memory is therefore determined deterministically. Each base {A, C, G, T} translates into a numerical value {0, 1, 2, 3}, so each k-mer and its reverse complement admit a base-4 representation. These two base-4 representations are converted into two individual hash keys, the hash keys are combined with a bitwise XOR operation, and the result is used to index the k-mer in memory. Because XOR is symmetric, a k-mer and its reverse complement receive the same index.
As for the distributed edges, each k-mer has 4 plausible extensions on either side (indegree and outdegree), one per base {A, C, G, T}. For each k-mer in the sequence collection, a message is sent to its eight possible neighbors. If adjacency exists, then a (k - 1)-base overlap between the concerned k-mers is recorded. The adjacencies are represented as 8 bits per k-mer, each bit recording the presence or absence of one of the eight possible edges. ABySS uses the MPI (Message Passing Interface) protocol for communication between nodes; for the internal hash tables it uses the Google sparse hash library (http://code.google.com/p/google-sparsehash/) and for hashing it employs the functions at http://burtleburtle.net/bob/c/lookup3.c.
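The indexing scheme can be sketched in a few lines. This is a hypothetical illustration, not ABySS's actual code: Python's built-in `hash` stands in for the lookup3 functions, and the table size is an arbitrary choice.

```python
# Toy sketch of ABySS-style k-mer indexing: encode the k-mer and its
# reverse complement in base 4, hash both, and XOR the hashes. Because
# XOR is symmetric, both strand representations map to the same bucket.

VAL = {"A": 0, "C": 1, "G": 2, "T": 3}
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def base4(kmer):
    """Read the k-mer as a base-4 number: A=0, C=1, G=2, T=3."""
    n = 0
    for b in kmer:
        n = n * 4 + VAL[b]
    return n

def revcomp(kmer):
    """Reverse complement of the k-mer."""
    return "".join(COMP[b] for b in reversed(kmer))

def kmer_index(kmer, table_size):
    """Index shared by a k-mer and its reverse complement."""
    return (hash(base4(kmer)) ^ hash(base4(revcomp(kmer)))) % table_size
```

For example, `kmer_index("AACG", 1024)` equals `kmer_index("CGTT", 1024)`, since `CGTT` is the reverse complement of `AACG`.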
1.2 Graph simplification and scaffolding
Graph simplification is done by removing tips (Figure 3I) and bubbles (Figure 7). Eventually all nodes that are linked via unambiguous edges are collapsed to form one node, which becomes a contig. The reads are aligned to the initial contigs to create a set of linked contigs, and links that were made in error due to misassemblies are removed at this point. Two contigs are joined if they share at least q links. For each contig, a list of the contigs its read pairs link it to is generated, and a graph search is performed for a single unique path that starts from the contig and visits each contig in that list only once. To decide whether two contigs link together, the distance between each pair of contigs is calculated via maximum likelihood (http://genome.cshlp.org/content/suppl/2009/04/27/gr.089532.108.DC1.html), while the relative orientation and the order in which the two contigs are to be connected are inferred from the read-pair information. The process is repeated for each contig and ultimately all consistent paths are connected to generate the global assembly.
2. Adjacency table and adjacency matrix
If two reads overlap with one another with an overlap size that hints that they are neighbors of one another in the actual genome, we say that the two reads are adjacent to one another. Reads that are neighbors share unique k-mers, as many as the length of their overlap. The adjacency table therefore assigns one link per pair of overlapping reads regardless of the number of k-mers shared. Figure S2 illustrates the usage of the adjacency table and the adjacency matrix as two alternatives to represent the connectivity between two nodes in a graph (http://en.wikipedia.org/wiki/Adjacency_table).
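For a concrete picture, the two representations can be built side by side. This is a generic illustration (nodes here are plain integers rather than reads):

```python
# Build both representations for an undirected graph on 4 nodes
# with edges 0-1, 1-2 and 2-3.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

# Adjacency matrix: matrix[u][v] is 1 iff an edge joins u and v.
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = matrix[v][u] = 1

# Adjacency table: one neighbour list per node; a single entry per
# pair of adjacent nodes, regardless of how many k-mers they share.
table = {u: [] for u in range(n)}
for u, v in edges:
    table[u].append(v)
    table[v].append(u)
```

The matrix costs quadratic space even when few edges exist, while the table grows only with the number of edges, which is why assemblers handling millions of reads favor list-style storage.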
3. Alignment-layout-consensus
Comparative assembly uses the alignment-layout-consensus paradigm, see Figure S3. In the alignment phase, reads are aligned to the reference sequence to determine their relative placement with respect to one another. The alignment of all the reads produces the `layout'. Forming the consensus follows an approach similar to that explained in Section 27.
4. Base-ratio
A base ratio of, say, 0.6 means that among the bases that cast a vote as to which base should win and subsequently become part of the target genome at that location, at least 60% must vote in favor of the winning base.
5. Bidirected graph
A bidirected graph (2) is an umbrella term that includes all types of graphs. A bidirected graph is especially interesting because the orientations allowed on its edges make it very flexible. We know that any graph has a set of nodes and a set of edges. In a bidirected graph, edges are referred to as links, loops or lobes, as depicted in Figure S4.
In Figure S2, the adjacency matrix was shown, where each entry records the interaction between a pair of nodes. Here, rather than having a node-node incidence matrix, the matrix defined is a node-edge incidence matrix, with a row for each node and a column for each edge. Each element can take on one of the values {-2, -1, 0, 1, 2}, depending on the number of heads or tails connecting the edge to the node (Figure S5).
6. Breadth first search
A breadth first search (BFS) is a graph search algorithm that begins at the root node and explores all its neighboring nodes, which lie at the same depth from the root. Then, for each of these nearest nodes, it explores their unexplored neighbors one by one, all of which again lie at the same distance from the root. The search continues until it finds what it was looking for. The term `breadth first' reflects the fact that the algorithm explores all nodes at distance k from the root before it starts exploring nodes at distance k + 1. In other words, the search keeps track of the breadth of each node from the root, and all nodes at the same breadth are searched before going any deeper into the tree. BFS does not adopt any heuristic and exhaustively searches the entire graph until it finds its goal.
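A minimal sketch of BFS over a graph stored as neighbour lists (generic code, not tied to any assembler), returning the hop distance from the root to a goal node:

```python
from collections import deque

def bfs(graph, root, goal):
    """Visit all nodes at distance k before any node at distance k + 1;
    return the number of edges on a shortest root-to-goal path, or None."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None  # goal unreachable from root
```

The FIFO queue is what enforces the level-by-level exploration order.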
7. Chimeric reads
Chimeric reads are reads with one end belonging to one part of the genome and the other end belonging to a non-adjacent, possibly far apart, part of the genome.
8. Chinese restaurant process
A Chinese restaurant (so named because of the immense population of China) is assumed to contain an infinite number of tables, each with infinite capacity. In the Chinese restaurant process (CRP) (http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf), each customer can either sit at an unused table or at one of the previously occupied tables. The CRP is used to randomly partition N customers (here, N reads) onto tables (here, contigs). With concentration parameter α, the tables are chosen according to the following random process:
The first customer chooses the first table.
The n-th customer chooses the first unoccupied table with probability α / (n - 1 + α), and an occupied table with probability c / (n - 1 + α), where c is the number of people already sitting at that table.
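The seating rule can be simulated directly. The sketch below is illustrative only (the concentration parameter `alpha` and the seed are arbitrary choices, not values from the text):

```python
import random

def crp_tables(n_customers, alpha, seed=0):
    """Seat customers one at a time: the next customer opens a new table
    with probability alpha / (i + alpha), or joins an occupied table with
    probability c / (i + alpha), where i customers are already seated
    and c is that table's occupancy."""
    rng = random.Random(seed)
    tables = []  # occupancy count per occupied table
    for i in range(n_customers):
        r = rng.uniform(0, i + alpha)
        acc, chosen = 0.0, len(tables)  # default: open a new table
        for t, c in enumerate(tables):
            acc += c
            if r <= acc:
                chosen = t
                break
        if chosen == len(tables):
            tables.append(1)
        else:
            tables[chosen] += 1
    return tables
```

Since the occupancies sum to i, the leftover probability mass alpha / (i + alpha) is exactly the chance of opening a fresh table.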
9. Composite vertex
In making the graph, similar vertices were collapsed to form one vertex; the set of original vertices collapsed into a given vertex is associated with that vertex. A vertex is called composite if it contains two positions in the sequence that are close to one another (at a distance smaller than some threshold). Every composite vertex is a plausible source of a whirl. Whirls are caused by ambiguities in pair-wise alignment. All composite vertices are converted to non-composite vertices by splitting, producing at least one non-composite vertex. In other words, whirls are removed by inserting gaps or removing some matches in the alignment; this is inherently what is meant by splitting the collapsed composite vertex. Bulges are more difficult to remove. They can be resolved by eliminating any of their edges, but how does one choose, or ensure, that the removed edge was the right one? This is achieved by adopting the maximum weighted spanning tree strategy. The approach amounts to either adding or removing edges provided that a user-defined condition is satisfied: an edge is added if and only if the new edge does not form a cycle shorter than the user-defined threshold; otherwise, the edge is removed. Iterating this process over the entire graph ensures that the misbehaving edges are removed. (This explanation follows the paper on de novo assembly with A-Bruijn graphs.)
10. Coverage
A coverage of 20 means that at least 20 bases are present at a particular location (i.e., 20 reads overlap that location) to cast votes as to which base should be the consensus base at that location in the target genome.
11. de Bruijn graph
A de Bruijn graph is a directed graph in which an edge connecting one node to another represents an overlap between the two connected nodes. In a de Bruijn graph, each node is a string of l symbols drawn from an alphabet of m distinct symbols, so there are m^l nodes in total. A directed edge from a parent node to a child node exists if the last l - 1 symbols of the parent coincide, or overlap, with the first l - 1 symbols of the child. Therefore, for every node: indegree = outdegree = m (http://en.wikipedia.org/wiki/De_Bruijn_graph#cite_note-Bruijn1946-0).
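In assembly practice only the k-mers that actually occur in the reads are instantiated. A toy sketch of such a read-derived de Bruijn graph (illustrative code, not a particular assembler's), with k-mers as nodes and edges on (k - 1)-symbol overlaps:

```python
def de_bruijn(reads, k):
    """Nodes are the k-mers occurring in the reads; a directed edge joins
    u to v when the last k - 1 symbols of u equal the first k - 1 of v."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    # all-pairs overlap test: fine for a sketch, too slow for real data
    return {u: sorted(v for v in kmers if u[1:] == v[:-1]) for u in kmers}
```

For the single read `ACGTAC` with k = 3, the k-mers ACG, CGT, GTA, TAC form the cycle ACG → CGT → GTA → TAC → ACG.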
12. Depth first search
Depth first search (DFS) is a search tool for graph structures. It starts by picking one branch of the tree and explores all the children of that branch until it reaches a leaf, thereby exploring one branch all the way to its depth before exploring the next one, see Figure S6.
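An iterative sketch (generic code) that records the order in which DFS visits the nodes of a tree stored as child lists:

```python
def dfs_order(tree, root):
    """Follow one branch down to its leaf before backtracking to the
    next; return nodes in the order they are first visited."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # push children in reverse so the leftmost child is popped first
        stack.extend(reversed(tree.get(node, [])))
    return order
```

Swapping the stack here for a FIFO queue would turn this into the breadth first search of Section 6.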
13. Detection and estimation
A standard detection problem amounts to finding the consensus base for a set of bases aligned at the same point of the genome, e.g., Figure S3C. There, simply the frequency, or probability, was used as the tool to choose which base (each base being a concerned hypothesis) is to be chosen among the other bases, or hypotheses. In statistical decision theory, a situation in which a receiver receives a noisy version of a signal and decides which hypothesis is true among M possible hypotheses is referred to as a detection problem. In a 4-ary case, the receiver has to decide among four hypotheses; in genome assembly the hypotheses are {A, T, G, C}. In estimation theory, once the receiver has made a decision in favor of the true hypothesis, it estimates some unknown parameter associated with the signal in an optimum fashion based on a finite number of samples of the signal (3).
14. Dijkstra's algorithm
Dijkstra's algorithm solves the single-source shortest paths problem. In a shortest-path problem, the data is provided as a directed weighted graph. The aim is to find a path with the smallest total weight between two nodes; the path with the smallest weight is the shortest path. In Section 6, the breadth-first search (BFS) algorithm was explained; BFS is a shortest-path algorithm for graphs in which each edge has unit weight. In a single-source shortest path problem, the aim is to find the shortest paths from an arbitrary source node to all the other nodes in the graph (4).
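A compact sketch of Dijkstra's algorithm using a binary heap (standard textbook form; the graph layout `{u: [(v, weight), ...]}` is an arbitrary choice for the example):

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths on a directed weighted graph with
    non-negative weights; returns {node: shortest distance from source}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry: a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

With unit weights the heap order coincides with BFS order, matching the remark that BFS handles the unit-weight case.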
15. Distributed computing
The data generated by NGS platforms for genome assembly is usually very large, and handling and processing such a large amount of data takes a lot of time. Although the processing speed of today's systems is very high and more and more memory is available for data processing, considerable time and space are still required to handle this copious amount of data, especially for larger genomes. Whenever such issues arise, computer scientists and engineers come up with solutions: one can buy a bigger, more powerful machine, or use a cluster of systems to solve the same problem. A cluster of systems uses what is called distributed computing, in which many smaller systems interact with one another by sending and receiving messages.
In distributed computing, one giant problem is divided into a large group of tasks, where each task is performed by an individual system. Each task may itself be divided into a number of sub-tasks performed in parallel by individual systems, causing the work to be done much more quickly. It is very important in distributed computing to keep account of where the individual data is placed and where it is being used. Usually, distributed computing is cheaper than buying a much more powerful single system with comparable performance. Two popular mechanisms that enable distributed computing are MPI and OpenMP.
16. Eulerian path
An Eulerian path is a path in which all edges, and not necessarily all nodes, are traversed exactly once. An Eulerian path may traverse a given node more than once.
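Hierholzer's algorithm is the standard way to extract such a path (a generic sketch, not from any assembler; it assumes an Eulerian path exists in the directed graph):

```python
from collections import defaultdict

def eulerian_path(graph):
    """Traverse every directed edge of `graph` ({u: [successors]})
    exactly once, possibly revisiting nodes along the way."""
    out = {u: list(vs) for u, vs in graph.items()}  # edges left to use
    indeg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            indeg[v] += 1
    # start where out-degree exceeds in-degree by one; otherwise the
    # graph has an Eulerian cycle and any listed node will do
    start = next((u for u in out if len(out[u]) - indeg[u] == 1),
                 next(iter(out)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out.get(u):
            stack.append(out[u].pop())  # keep walking along unused edges
        else:
            path.append(stack.pop())    # dead end: emit and backtrack
    return path[::-1]
```

For the graph `{"a": ["b", "b"], "b": ["a", "c"], "c": []}` the returned path a → b → a → b → c uses each of the four edges once while visiting node b twice, as the definition allows.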
17. Eulerian super-path problem
A directed graph is Eulerian if it contains a cycle that traverses every directed edge exactly once (5). EULER conducts genome assembly by finding an Eulerian super-path in the Eulerian graph: given the graph together with a set of paths, the aim is to find an Eulerian path that contains each of the given paths as a sub-path.
The aim is to convert an Eulerian super-path problem, defined by a graph and a set of paths, into an Eulerian path problem on a transformed graph whose associated paths are each a single edge. Such a conversion is achieved by a series of transformations. These transformations are equivalence-preserving, in the sense that a solution to the Eulerian path problem in the transformed graph is equivalent to a solution to the Eulerian super-path problem in the original graph.
18. Exhaustive assembler
The exhaustive assembler (6) came at a time that marked the transition from assemblers for Sanger technology towards assemblers for NGS, see Figure S1. It therefore assembles DNA from read data in which the read length is very large, as opposed to NGS, where the read length is small. An exhaustive scheme for genome assembly essentially compares all reads, given that there are n reads, with one another in order to identify the best overlapping pairs of reads. This process takes time quadratic in n. An intelligent use of a k-mer library and an adjacency table collapses that cost.
k-mer library: It is a collection of k-mers (a string of characters of size k) that is produced by
chopping up the read data into many k-mers and cataloging them along with the reads from
which they were produced.
Adjacency Table: If two reads overlap with one another with an overlap size which hints that
they are neighbors of one another in the actual genome, the two reads are considered to be
adjacent to one another. Reads that are neighbors share unique k-mers, as many as the length of
their overlap. Building the k-mer library and identifying which reads they come from indirectly
identifies the neighboring reads. The adjacency table therefore assigns one link per pair of
overlapping reads regardless of the number of k-mers shared.
Identifying Contigs: In graph theory, disjoint graphs are sets of nodes and edges that form non-overlapping sets. The adjacency table acts as a set of disjoint undirected cyclical graphs. A breadth first search (BFS) is an exhaustive graph search algorithm that begins at the root node and explores all its neighboring nodes at the same depth from the root. It continues the search keeping track of the breadth of each node from the root, and all nodes at the same breadth are searched before going any deeper into the tree, see Section 6. Because the adjacency table is a set of disjoint undirected cyclical graphs, performing BFS on each of these individual graphs is equivalent to doing a joint BFS on all graphs at the same time. Each disjoint graph identified in this manner represents a single contig. The set of all contigs identifies the genome.
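The whole pipeline of this section can be condensed into a toy sketch (hypothetical code: reads are indexed by integer, and the connected components found by BFS stand in for contigs):

```python
from collections import deque

def kmer_library(reads, k):
    """Map each k-mer to the set of reads that contain it."""
    lib = {}
    for r, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            lib.setdefault(read[i:i + k], set()).add(r)
    return lib

def adjacency_table(reads, k):
    """One link per pair of reads sharing at least one k-mer."""
    adj = {r: set() for r in range(len(reads))}
    for owners in kmer_library(reads, k).values():
        for r in owners:
            adj[r] |= owners - {r}
    return adj

def contigs(reads, k):
    """Each BFS-discovered component of the adjacency table is a contig."""
    adj, seen, comps = adjacency_table(reads, k), set(), []
    for r in range(len(reads)):
        if r in seen:
            continue
        comp, queue = set(), deque([r])
        seen.add(r)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for v in adj[u] - seen:
                seen.add(v)
                queue.append(v)
        comps.append(comp)
    return comps
```

Building the library is linear in the total read length, which is the source of the speed-up over the all-pairs comparison.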
19. Genovo
Genovo [7] is a de novo assembler with a qualitative probabilistic framework that uses a Chinese restaurant process prior to account for the unknown number of genomes in the sample, see Figure S1. Genovo does not require any prior information, and it performs read alignment, read denoising and de novo assembly. Genovo adopts a modular approach to provide further flexibility within its probabilistic framework: one can adjust the noise model employed without affecting the rest of the model, and one can replace the uniform prior with a prior based on a reference sequence for better results, see Section 39. Normally distributed probabilistic models are corroborated by the law of large numbers and the central limit theorem, see Section 24.
19.1 Probabilistic model
In genome assembly, one can never predict beforehand the number of contigs one will obtain from the reads. However, one can safely say that, for a genome of a given length, the number of contigs cannot exceed that length, and neither can the size of any one contig. For mathematical simplicity, one can nevertheless assume that there could be an infinite number of contigs in the assembly, where each contig is allowed to contain an infinite number of bases. The bases are sampled uniformly, i.e., each base of each contig is drawn uniformly from {A, C, G, T}.
Generally, the data produced by NGS platforms is very large. Let the number of reads produced be N. Using the same scheme as above and assuming infinitely many contigs, where each read either seeds or extends a contig, a Chinese restaurant process (CRP) (http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf) is justified as a model, see Section 8. The CRP is used to randomly partition the reads onto the possible set of contigs. The contigs are chosen according to the following random process:
The first read seeds the first contig.
The n-th read chooses to seed a new contig with probability α / (n - 1 + α), or to extend an existing contig with probability c / (n - 1 + α), where c is the number of reads already forming that contig.
Once every read is allocated to a particular contig, the question remains where the read is placed within the contig. For an empty contig it does not make sense to ask to which location within the contig the read belongs, since the contig is empty. For a contig that is not empty, however, a bias is afforded in choosing a location nearer the start of the contig as opposed to its end. Notice that both the start and the end positions can still be taken up, since the position is chosen randomly, see Figure S7. A geometric distribution is used to assign the read's starting location, and each read is assigned a length drawn from an arbitrary distribution. The individual bases of the read are copied from the alignment according to the noise model of the sequencing technology. Every alignment has matches (match), mismatches (mis), insertions (ins) and deletions (del). The probability of aligning any read with any alignment depends on each of these phenomena, and the weight assigned to each should be proportional to the number of its occurrences. Denote the probability of incorrectly copying a base, the probability of deletion, the probability of insertion and the number of hits in the alignment by p_mis, p_del, p_ins and h, respectively. In the resulting likelihood the mismatch term is normalized over three bases rather than four [59], simply because a mismatching base can align only to the three bases other than the true one. Since these per-position probabilities multiply, taking the logarithm turns the likelihood into a sum, and in the end the log likelihood of the model reduces to a sum of these per-read alignment terms.
19.2 Assembly algorithm
The algorithm is a modified version of the Iterated Conditional Modes (ICM) algorithm [60]. It maximizes the local conditional probabilities sequentially until convergence is achieved, and it outputs the most probable assembly.
Consensus Sequence: Here, ICM updates each base of each contig, setting it to the consensus of the reads that align onto that location.
Read Mapping: Taking the probabilistic model into consideration, this part maps each read to its best plausible alignment. For each read it finds the contig to which the read belongs, the particular location within that contig, and the alignment. The sampler updates each read by first removing it completely from the assembly, choosing a new location, evaluating the alignment score of each candidate, and choosing the alignment that best fits; the read's contig and its location within the contig are then sampled accordingly.
Global Moves: Notice that all the above ICM steps are local, since all the parameters were updated on a case by case basis by exhausting all the cases. However, the same steps evolve into a global approach by adopting the following moves:
1) Propose Insertions/Deletions: If most of the reads aligned to a location in a contig
present an insertion, then put that insertion in the contig and realign the reads. If the
likelihood is improved, then accept the change. Similarly if most reads that align to a
location on the contig present a deletion, then delete that location in the contig and
realign the reads. If the likelihood is improved, then accept the change.
2) Merge: Merge two contigs if their ends overlap and provided that the likelihood
increases.
Chimeric Reads: These are reads whose two ends match different locations on the genome. They are normally present at the edges of contigs. Therefore, after every five iterations, the edge reads are removed, thereby allowing correct reads or contigs to patch the edges and continue increasing the likelihood. If a removed read is not chimeric, it will again be merged at the edge.
20. Graph structure
Graph theory has been used extensively for genome assembly. A graph is a data structure consisting of a set of nodes and a set of edges. A graph is directed if all the edges are oriented, and undirected if the edges do not assume any orientation. Genome assembly can be looked upon as a graph structure with known nodes and unknown edges: the nodes are the reads, and how they are connected is denoted by the graph edges (9). A node in a graph is called balanced if the number of edges entering it is equal to the number of edges exiting it.
21. Hamiltonian path
A Hamiltonian path is a path in which each node of the graph is traversed exactly once.
22. Hash table
A hash table is a data structure that uses a hash function to map a ‘key’ to an index of an array (‘bucket’), where the key's associated value is located. Due to their fast search operation, hash tables are very widely used in databases.
23. K-mer numbering
This is an equivalent of a graph structure in which k-mer numbers are the nodes and the edges between them represent adjacencies between two k-mer numbers, as illustrated in Figure S8. (Reference explanation to ALLPATHS.)
24. Law of large numbers and central limit theorem
The law of large numbers states that the average of the results obtained from an experiment (the sample average) repeated over a number of trials comes closer and closer to the true average as the number of trials increases without bound. There are at least two forms of the law of large numbers: a strong and a weak version. The strong law of large numbers states that the sample average converges almost surely to the expected value:
X̄_n → μ almost surely as n → ∞.
In probability theory, an event that happens almost surely happens with probability 1. The weak law of large numbers states that the sample average converges in probability to the expected (true) value:
X̄_n → μ in probability as n → ∞,
meaning that if the number of samples is large, the sample average will be close to the true expected value.
Closely related to the law of large numbers is the central limit theorem. If S_n = X_1 + ... + X_n denotes a sum of n identically distributed and independent random variables (one random variable does not affect the others), then (S_n - nμ) / (σ√n) converges in distribution to a Gaussian with zero mean and unit variance, N(0, 1). The parameters μ and σ are the mean and standard deviation of the distribution from which the X_i are sampled. This is what justifies the use of the Gaussian distribution.
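Both statements are easy to check numerically. The sketch below (an illustration; the sample sizes and seed are arbitrary) uses uniform [0, 1) variables, which have mean 1/2 and variance 1/12:

```python
import math
import random

rng = random.Random(42)

# Law of large numbers: the sample average approaches the mean 0.5.
n = 50_000
sample_avg = sum(rng.random() for _ in range(n)) / n
assert abs(sample_avg - 0.5) < 0.01

# Central limit theorem: (S_m - m*mu) / (sigma * sqrt(m)) is close to
# standard normal, so roughly 68% of draws fall inside [-1, 1].
mu, sigma = 0.5, math.sqrt(1 / 12)

def normalized_sum(m=500):
    s = sum(rng.random() for _ in range(m))
    return (s - m * mu) / (sigma * math.sqrt(m))

inside = sum(1 for _ in range(1000) if abs(normalized_sum()) <= 1)
# inside / 1000 lands near 0.68, the standard-normal mass of [-1, 1]
```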
25. Maximum likelihood genome assembly
Among the many ways of estimating unknown parameters, one of the simplest and most robust methods is the maximum likelihood approach, see Figure S1. Consider estimating a parameter θ from a set of observations x. The likelihood function is the conditional density function of x viewed as a function of the parameter:
L(θ) = f(x; θ).
The maximum likelihood estimate (MLE) is the value of θ that maximizes the likelihood function:
θ̂ = argmax_θ L(θ).
The MLE of θ can be obtained by solving the equation dL(θ)/dθ = 0. In [57], MLE has been used for genome assembly.
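As a minimal worked example (a coin-flip model, not the assembly model itself): for N Bernoulli observations with s successes, setting the derivative of the log-likelihood s·log p + (N - s)·log(1 - p) to zero gives the analytic MLE p̂ = s/N; a grid search confirms this numerically.

```python
import math

observations = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # s = 7 successes, N = 10
s, N = sum(observations), len(observations)

def log_likelihood(p):
    """Log-likelihood of Bernoulli(p) for the observations above."""
    return s * math.log(p) + (N - s) * math.log(1 - p)

# Maximize over a fine grid of p values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)
assert abs(p_hat - s / N) < 1e-9  # matches the analytic MLE 7/10
```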
1) Using Maximum Likelihood: For any circular genome, let L be its length. NGS platforms provide a set of reads of a fixed length. For the i-th read, let f_i be the frequency with which it appears in the genome. Assume the output of the NGS platform contains a total of n reads. Probabilistically, one can say that the reads are generated via n independent trials: in each trial, a position is uniformly sampled from the genome and the read begins at that position, so the probability of producing the i-th read is p_i = f_i / L. If each read has length ℓ and each position can be either of the 4 bases {A, C, G, T}, then each possible read is itself a variable and there are 4^ℓ such variables. Let d_i be the number of trials whose outcome is the i-th read. When the variables d_i are considered independently, each follows a binomial distribution; taken together, their joint distribution is multinomial. This joint distribution is the global read-count likelihood, i.e., the likelihood of the parameters given the counts d_i. Maximizing the global read-count likelihood is equivalent to minimizing its negative logarithm. However, since the multinomial distribution assumes the constraint that the d_i sum to n, this cannot be performed directly. Since the number of trials is very large (it might be assumed to go to infinity), the d_i become approximately independent and the joint multinomial distribution can be expressed as the product of the individual binomial distributions of each d_i. In this binomial approximation, the length of the genome L is a constant, independent of each d_i. Therefore, the approximate length of the actual genome can be ascertained through the Expectation-Maximization (EM) algorithm. Assuming that the genome size is known, the likelihood admits an approximate expression that decomposes, up to a positive constant independent of the d_i, into a sum of per-read terms.
2) Graph Construction: A modified version of the de Bruijn graph is built (in contrast to the ‘overlap-layout-consensus’ paradigm). The de Bruijn graph is formed by connecting vertices with an edge if the corresponding reads overlap. Employing de Bruijn graphs reduces the fragment assembly problem to finding a path visiting every edge of the graph exactly once, i.e., an Eulerian path problem [31]. Graph correction comes after graph construction via simplification, multiplicity, erosion, whirl removal, bulge removal, and straightening of the graph [37]. Although only slight improvement past 75x coverage is achieved, the methodology explained above provided the first exact polynomial-time assembly algorithm that deals with double-stranded genomes [57].
26. Minimal extension and subsumptions
Using the data obtained, ALLPATHS finds all minimal extensions and subsumptions of each read in the spectrum, see Figure S9. These are then used to perform a depth first search (DFS), see Section 12. The result is stored as a graph structure in which the nodes are reads and an edge denotes the fact that one read is the minimal extension of another. The nodes also store the offset from the start of the DFS. (Reference explanation to ALLPATHS.)
27. Minimus
Minimus uses AMOS (A Modular Open-Source assembler), a collection of open-source modules and libraries useful in developing genome assemblers (http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOS), see Figure S1. These AMOS modules, or stages, interact with one another via an AMOS data structure called a “bank”. Minimus [33] is built using 3 AMOS modules, namely Overlapper, Unitigger and Consensus, following the overlap-layout-consensus paradigm. Initially, all the reads are loaded into the AMOS bank. Minimus computes all pairwise alignments between the reads (a step conducted by the Overlapper) and uses the resulting information to generate an overlap graph in which each read is a node and an edge connects two nodes if the reads overlap (a step conducted by the Unitigger).
Like all assemblers that exploit graph theory to solve the genome assembly problem, Minimus inherently employs an extensive graph simplification process in order to simplify identifying the contigs and subsequently to implement scaffolding. Two methods adopted by the Unitigger for graph simplification are explained in Figures 3A and B of the main text.
After simplifying the graph, the consensus sequence is formed by progressive multiple alignment of the reads within each unitig. Quality values (Q-values) are used during the consensus stage to trim the poor-quality areas of each read. Q-values are sequencing-platform specific; therefore, the trimming criterion depends on which NGS platform was used for sequencing.
28. Overlap-layout-consensus
The overlap-layout-consensus paradigm, as the name suggests, consists of three steps, see Figure S10. In the first step, an overlap graph is created by joining all the reads to their respective best overlapping reads; this step is similar to the greedy approach. The layout stage, ideally, is responsible for finding one single path that starts at the beginning of the genome, traverses all the reads exactly once, and reaches the end of the sequenced genome. The consensus sequence is formed by looking at all the bases occurring at each location and identifying the one with the highest number of votes.
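The consensus step can be sketched as a simple vote per column, folding in the coverage and base-ratio thresholds of Sections 4 and 10 (illustrative code; the padding character and threshold values are arbitrary choices):

```python
from collections import Counter

def consensus(layout, min_cov=2, min_ratio=0.6):
    """Call one base per column of a padded layout ('-' marks a read
    absent at that column); emit 'N' when coverage or ratio is too low."""
    called = []
    for column in zip(*layout):
        votes = Counter(b for b in column if b != "-")
        total = sum(votes.values())
        if total < min_cov:
            called.append("N")  # not enough reads cover this column
            continue
        base, count = votes.most_common(1)[0]
        called.append(base if count / total >= min_ratio else "N")
    return "".join(called)
```

For the layout `["ACGT-", "ACGTA", "ACTTA", "-CGTA"]` the call is `ACGTA`: the third column is decided 3-to-1 in favor of G, clearing the 0.6 base-ratio threshold.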
29. Parallel construction of bidirected string graphs
This assembler uses a bidirected string graph, with a set of nodes and a set of edges, to represent the genome, see Figure S1. A bidirected string graph is derived from bidirected de Bruijn graphs [40], [22]. This is justified by the following issue inherent in the data: although reads are sequenced from end to end, it is not known to which of the two strands of the double-stranded DNA a read belongs. Therefore, a read can have either of two orientations. This issue paves the way for using bidirected de Bruijn graphs, as illustrated in Figure S11.
An edge in the graph is defined using a tuple recording its two end nodes together with an orientation for each end, indicating whether the edge points outwards or inwards at that node. A path in the graph is an ordered list of such tuples in which consecutive edges agree at their shared nodes, as depicted in Figure S12.
Converting a bidirected de Bruijn graph into a bidirected string graph is done via expanding
the tuple to include two more parameters. The new tuple is .
Here is the character obtained by walking from to , while is the character
obtained while walking from to . Notice in Figure S13 that although the suffix of and
the prefix of overlap, yet the characters defining the edge are . Here the first is the
first base in the k-mer of yet the second base is not the first character in the k-mer . It
is actually the first base of the k-mer , which is the character obtained while walking from
.
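Ignoring the bidirected orientations for a moment, the two edge characters can be illustrated for a pair of plain (forward-strand) k-mers that overlap by k-1 bases: walking forward gains the last base of the second k-mer, and walking back gains the first base of the first one. A minimal sketch, with k-mers invented for illustration:

```python
def edge_characters(u, v):
    """For two k-mers where the (k-1)-suffix of u equals the (k-1)-prefix
    of v, return the pair of characters labelling the string-graph edge:
    the character gained walking u -> v and the one gained walking v -> u.
    (Simplified: orientations and reverse complements are ignored.)"""
    k = len(u)
    assert u[1:] == v[:k - 1], "k-mers do not overlap by k-1 characters"
    return v[-1], u[0]

print(edge_characters("ATTCG", "TTCGA"))  # -> ('A', 'A')
```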
1) Making the graph: The first step in generating the graph is to create the nodes, which are
made by pairing up each k-mer with its reverse complement. Then +ve is assigned to the k-mer that
is lexicographically larger than its counterpart. Once all k-mers and their counterparts are
identified, all unique k-mers are sorted in parallel (15). For a genome of length L, about
n/p nodes are sent to each processor after being sorted in parallel, where n is the number of unique k-mers and p stands for the
number of processors. The elements that are passed are already sorted. Each k-mer receives a unique
identifier from 0 to n-1. Since the
numbering is done on a sorted list, the process is trivial. Let N be the combined length of all the
k-mers. Since the process is conducted in parallel by p processors, the whole step takes
O(N/p) time to execute.
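The node-creation step above (pairing each k-mer with its reverse complement, keeping the lexicographically larger one as the +ve representative, then sorting and trivially numbering the unique set) can be sketched serially in Python. The parallel distribution over p processors is omitted, and the helper names are invented for this sketch:

```python
def revcomp(s):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(s))

def number_kmers(reads, k):
    """Keep the lexicographically larger of each k-mer / reverse-complement
    pair, sort the unique representatives, and assign identifiers 0..n-1
    (trivial on a sorted list)."""
    canon = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            canon.add(max(kmer, revcomp(kmer)))
    return {kmer: idx for idx, kmer in enumerate(sorted(canon))}

ids = number_kmers(["ATTCGA"], 3)
print(ids)  # -> {'ATT': 0, 'TCG': 1, 'TTC': 2}
```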
Once each processor has been provided with its share of sorted nodes, we need to identify the
edges between the nodes to complete the graph. In order to create edges between any two given
nodes, it should be realized that any k-mer/node can overlap with only a limited number of other nodes. Edges
are found by passing messages from one node to another. A message has the format
(s, r, t, c). Here s stands for the node which sends the message, r denotes
the node which receives it, t is the type of the message (there are three message
types), and c represents the character associated with the edge between s and r, see
Figures S11-S14. A message first needs to be created, and this step takes O(n/p)
time per processor. Once the messages are created, they are sorted locally in each processor using the radix sort
algorithm in O(n/p) time. The messages are then sent and the edges are created in O(n/p)
time per processor and then sorted locally using radix sort in another O(n/p) time.
As shown above, the entire graph construction process (sorting all unique k-mers, message
creation, sorting the messages, edge creation, and finally sorting the individual edges) is
done in parallel using p processors, with each sub-task itself executing in O(n/p) time.
Although the entire process therefore takes O(cn/p) time for some small constant c, in algorithmic terminology the constant of proportionality is
omitted and only O(n/p) is presented as the algorithm's
time complexity.
2) Larger genomes: For a genome of length L, the total number of unique k-mers cannot
be larger than L. Therefore, even if the coverage is about 100x or the assembly data runs
into terabytes, one only needs to know the individual unique k-mers and the number of times
each k-mer occurs, also called its copy count. Since the graph formation depends only on the
sorted array of unique k-mers, one can consider the read data in stages: one can
populate the array in stages (once an entry for a unique k-mer is made, all that is left to do is
increase its copy count whenever the same k-mer reappears in the data), and sort the array in stages. This
means that no matter how large the data is, the whole process can be done in stages. Assembly of the data is therefore taken one stage at a time, and the entire process of graph formation is laid out
and implemented as before. Of course, the sorted array gets bigger at every stage, but the process itself
remains the same.
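The staged copy-count idea can be sketched as follows. The stages are simulated here with plain Python lists of reads; in a real assembler each stage would be a chunk of reads streamed from disk, so memory stays bounded by the number of unique k-mers rather than the size of the data:

```python
from collections import Counter

def staged_copy_counts(read_batches, k):
    """Process the read data one stage (batch) at a time: a new unique
    k-mer adds an entry, a repeated one only increments its copy count."""
    counts = Counter()
    for batch in read_batches:          # each stage fits in memory
        for read in batch:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
    return dict(counts)

print(staged_copy_counts([["ATAT"], ["TATA"]], 2))  # -> {'AT': 3, 'TA': 3}
```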
For larger genomes, the number of messages sent would otherwise be far greater than the number of nodes, since any k-mer may
send a message to another k-mer that may not even exist in the data. Since two overlapping nodes share
k-1 characters, for each node only one prefix message and one suffix
message needs to be sent. This reduces the number of messages that are sent for larger genomes. Note that,
due to the staged process, for larger genomes the actual graph construction is computationally
less expensive than the analysis of the data and the piling up of the data at the right
processor.
30. Prefix tree
A prefix tree (also known as a 'trie') is a tree structure that has strings or alphabets as its keys, see
Figure S15.
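A minimal sketch of such a prefix tree, using nested dictionaries (an assumption of this sketch, not how SSAKE/VCAKE actually store it); '$' is an invented end-of-key marker:

```python
def trie_insert(root, word):
    """Insert a string into a prefix tree stored as nested dicts;
    '$' marks the end of a complete key."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True

def trie_contains(root, word):
    """Walk the tree one character at a time; fail on a missing branch."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = {}
for read in ["ATTC", "ATTG", "GGCA"]:
    trie_insert(trie, read)
print(trie_contains(trie, "ATTG"))  # -> True
```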
31. Quality Values
Quality (Q-) values are probabilistic data provided by the NGS platforms identifying the
quality of each base in each read. Q-values are evaluated differently for different platforms. For
example, for Solexa (later on acquired by Illumina) the formula is:

Q_solexa = -10 log10( p / (1 - p) )

where p is the probability of identifying a base incorrectly. To convert Q-values into a
probability use:

p = 10^(-Q_solexa/10) / (1 + 10^(-Q_solexa/10))

Similarly, for Sanger and other platforms the formula is:

Q_sanger = -10 log10(p)

The conversion of Q-values into a probability is done via:

p = 10^(-Q_sanger/10)

Q_solexa and Q_sanger can be converted to one another using the formula:

Q_sanger = 10 log10( 10^(Q_solexa/10) + 1 )

Here p is the probability that the base-call is incorrect; therefore, the probability of a correct base-call is
1 - p. The Q-values essentially provide
an integer mapping to the error probability of calling a base.
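Assuming the standard Phred/Sanger and Solexa definitions (Q = -10 log10 p and Q = -10 log10(p / (1 - p)), respectively), the conversions can be written directly:

```python
import math

def phred_to_prob(q):
    """Sanger/Phred: Q = -10*log10(p)  =>  p = 10**(-Q/10)."""
    return 10 ** (-q / 10)

def solexa_to_prob(q):
    """Solexa: Q = -10*log10(p / (1 - p))  =>  p = odds / (1 + odds)."""
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)

def solexa_to_phred(q):
    """Convert a Solexa Q-value to its Sanger/Phred equivalent."""
    return -10 * math.log10(solexa_to_prob(q))
```

For high Q-values the two scales nearly coincide, since p/(1 - p) approaches p as p gets small.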
32. Radix sort
Radix sort is an algorithm that sorts a list of L numbers, each composed of n digits, by first sorting all
numbers looking only at their least significant digit. It then sorts the entire list looking only
at the second-least significant digit. It proceeds iteratively until the entire list has been sorted on
its most significant digit. Figure S16 depicts the radix sorting algorithm graphically (4).
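A least-significant-digit radix sort over base-10 digits can be sketched as below; because bucketing is stable, the ordering established by earlier passes survives later ones, which is exactly the property the figure illustrates:

```python
def radix_sort(nums, digits):
    """LSD radix sort: repeatedly bucket the list by one digit, starting
    from the least significant; stable bucketing preserves earlier passes."""
    for d in range(digits):
        buckets = [[] for _ in range(10)]
        for x in nums:
            buckets[(x // 10 ** d) % 10].append(x)
        nums = [x for bucket in buckets for x in bucket]
    return nums

print(radix_sort([390, 300, 45, 7], 3))  # -> [7, 45, 300, 390]
```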
33. Similarity matrix
A similarity matrix is a representation similar to an adjacency matrix (Figure S2): it
identifies whether two nodes or points are adjacent to one another, and it also tells how close or
how far apart they are from one another. In other words, a similarity matrix provides a weight
for every edge of the graph. For strings, a similarity matrix is very close to the concept of a distance
matrix (Figure S17) or a substitution matrix. A de-facto standard among substitution matrices is
BLOSUM62 (16). This matrix assigns a score to the substitution of one amino acid by another
on purely statistical grounds, with no direct reference to the amino acid's
biochemistry or its structure (16).
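For strings, the related distance-matrix idea (see Figure S17) can be sketched with Hamming distances; only the upper triangle is computed and then mirrored, since the matrix is symmetric:

```python
def hamming(a, b):
    """Minimum substitutions to turn one equal-length string into the other."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(strings):
    """Symmetric matrix of pairwise Hamming distances; only n*(n-1)/2
    entries actually need computing for n strings."""
    n = len(strings)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = hamming(strings[i], strings[j])
    return m

print(distance_matrix(["AT", "AC", "GC"]))  # -> [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```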
34. Speed and time complexity
One very important aspect of genome assembly is the use of resources. The term 'resources' in
the field of computational algorithms pertains to the use of space (memory) and computational
time. The use of these two resources is quantified in terms of the big-O notation. Usually these
two resources compete with one another: an algorithm may take very little time, but at the
expense of memory, whereas another algorithm may use less space at the cost of more time to
execute. A true advance in the field of algorithms and problem solving comes when a new
algorithm outperforms another algorithm in solving a particular problem. But how does one
define 'outperform'? The three benchmarks that are compared are space, time, and the results
generated by the programs. If the output of two programs is comparable, then the one that uses
less time and space is definitely the better choice. The word 'comparable' is used because two
programs that try to solve the same problem may not give exactly the same results, and their
results may diverge more if the problem happens to be a complicated one.
35. Scaffolding
The process of placing contigs relative to one another in order to form the
genome is called scaffolding. Two contigs are adjacent to one another if two or more mate-pairs
link them together. Because of the very small coverage, the mate-pair information may not suffice to
identify the individual bases of the missing links, yet it is good enough to identify which
contig comes first and which contig comes next.
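The adjacency rule above (two or more mate-pair links) can be sketched as follows; the contig names are invented for illustration:

```python
from collections import Counter

def adjacent_contigs(mate_pairs, min_links=2):
    """Declare contig pairs adjacent if at least `min_links` mate-pairs
    link them. `mate_pairs` is a list of (contigA, contigB) pairs, one per
    mate-pair whose two reads landed in different contigs."""
    # Sort each pair so (c1, c2) and (c2, c1) count as the same link.
    links = Counter(tuple(sorted(pair)) for pair in mate_pairs)
    return [pair for pair, n in links.items() if n >= min_links]

pairs = [("c1", "c2"), ("c2", "c1"), ("c1", "c3")]
print(adjacent_contigs(pairs))  # -> [('c1', 'c2')]
```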
36. Singletons
Singletons are reads that remain unused in the assembly process.
37. Spectrum
The collection of a set of reads and their reverse complements is called the
spectrum. The reverse complements of the reads are also referred to as their mate-pair data.
38. Suffix Tree and Suffix Arrays
A suffix tree is different from a prefix tree because it uses the suffixes of a string to build the tree, whereas
the latter uses prefixes. Suffix trees have been widely used in a large number of biological
applications, ranging from exact string matching and recognizing DNA contamination to
finding common sub-strings and minimum-length decoding of DNA (17); see Figure S18
(http://sary.sourceforge.net/docs/suffix-array.html). For suffix arrays see Figure S19.
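A naive suffix-array construction and binary-search lookup (quadratic construction, acceptable only as a sketch; real implementations build the array in linear or near-linear time) might look like:

```python
import bisect

def suffix_array(s):
    """Starting indices of all suffixes of s (with terminator '$'),
    in lexicographically sorted order."""
    s += "$"
    return sorted(range(len(s)), key=lambda i: s[i:])

def find(s, pattern):
    """Binary-search the suffix array for an occurrence of `pattern`;
    return its starting position in s, or -1 if absent."""
    sa = suffix_array(s)
    t = s + "$"
    suffixes = [t[i:] for i in sa]          # materialized only for clarity
    lo = bisect.bisect_left(suffixes, pattern)
    if lo < len(sa) and suffixes[lo].startswith(pattern):
        return sa[lo]
    return -1

print(find("BWABBAS", "BAS"))  # -> 4
```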
39. Tandem repeats
Repeats are portions of the DNA that occur more than once. Tandem repeats are areas
of the genome where a particular sequence repeats itself over and over again, with the copies adjacent (in tandem) to
one another. One example would be
ATTCG-ATTCG-ATTCG-ATTCG, where the hyphens
are only placed for readability. Here the tandem repeat sequence is ATTCG.
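A simple check for whether a whole string is a tandem repeat of some unit can be sketched as follows (a toy detector; real tandem-repeat finders also handle partial and inexact copies):

```python
def tandem_repeat_unit(s):
    """Return the shortest unit that, repeated, reproduces s exactly,
    or None if s is not a tandem repeat."""
    n = len(s)
    for size in range(1, n // 2 + 1):
        if n % size == 0 and s == s[:size] * (n // size):
            return s[:size]
    return None

print(tandem_repeat_unit("ATTCGATTCGATTCG"))  # -> ATTCG
```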
40. Uniform distribution
In genomes, in general, we know that A-T and G-C form base pairs. Therefore, the number of A's
is equal to the number of T's, and the same is the case for G's and C's. Also, A, T, G and C are
present in almost equal proportions, though some organisms may have a higher G-C content due to
the stability of G-C base pairs. Therefore, if a base X is drawn at random from the genome, then
P(X = A) = P(X = T) = P(X = G) = P(X = C) = 1/4. Here we use the uniform
distribution since all outcomes are equally probable.
41. Unipath
A unipath is the equivalent of a graph structure in which k-mer numbers are the nodes and the edges between
them represent adjacencies between two k-mer numbers; the terminology is used by ALLPATHS.
References
1. Simpson, J.T. et al. 2009. ABySS: a parallel assembler for short read sequence data. Genome Res, 19(6):1117.
2. Edmonds, J. and Johnson, E. 2003. Matching: a well-solved class of integer linear programs. Combinatorial Optimization - Eureka, You Shrink! (eds. Giovanni et al.), pages 27-30.
3. Barkat, M. 2005. Signal detection and estimation. Artech House.
4. Cormen, T.H. et al. 1990. Introduction to algorithms. MIT Press and McGraw-Hill Book Company.
5. Pevzner, P. 2000. Computational molecular biology: an algorithmic approach, volume 1. MIT Press, Cambridge, MA.
6. Shah, M.K. et al. 2004. An exhaustive genome assembly algorithm using k-mers to indirectly perform n-squared comparisons in O(n). Proc IEEE Comput Syst Bioinform Conf (CSB'04).
7. Laserson, J. et al. 2010. De novo assembly for metagenomes. Res Comput Mol Biol, pages 341-356. Springer.
8. Besag, J. 1986. On the statistical analysis of dirty pictures. J R Stat Soc Series B Stat Methodol, 48(3):259-302.
9. Koller, D. and Friedman, N. 2009. Probabilistic graphical models: principles and techniques. The MIT Press.
10. Medvedev, P. and Brudno, M. 2009. Maximum likelihood genome assembly. J Comput Biol, 16(8):1101-1116.
11. Pevzner, P. et al. 2004. De novo repeat classification and fragment assembly. Genome Res, 14(9):1786.
12. Hernandez, D. et al. 2008. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res, 18(5):802.
13. Sommer, D.D. et al. 2007. Minimus: a fast, lightweight genome assembler. BMC Bioinformatics, 8(1):64.
14. Jackson, B.G. and Aluru, S. 2008. Parallel construction of bidirected string graphs for genome assembly. ICPP, pages 346-353.
15. Helman, D.R. et al. 1998. A new deterministic parallel sorting algorithm with an experimental evaluation. Journal of Experimental Algorithmics (eds. Italiano, G.F. et al.), 3(4).
16. Eddy, S.R. 2004. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol, 22(8):1035-1036.
17. Pop, M. 2009. Genome assembly reborn: recent computational challenges. Brief Bioinform, 10(4):354.
Figure Legends
Figure S1 Schemes and their associated algorithms
The figure depicts the most fundamental schemes adopted by assembly algorithms. The algorithms have been listed in order to clarify fundamental concepts; however, the same algorithm can be categorized into more than one approach. For instance, all Eulerian path approach algorithms could be categorized under graph based schemes. However, assisted assembly can be categorized under both comparative assembly and the overlap-layout-consensus approach since it uses concepts from both.
Figure S2 Adjacency table and adjacency matrix
A. An undirected cyclic graph. Equivalent representations are (B) an adjacency table and (C) an adjacency matrix. The advantage of using (B) over (C) is that it uses less memory. Also, listing the nodes adjacent to a given node in (B) is an O(p) operation, as opposed to (C), where it is an O(n) operation for a graph of n nodes. However, for checking whether there is an edge between two specific nodes, for (C) it is an O(1) operation while for (B) it is an O(min(p, q)) operation, where p and q stand for the number of edges incident to the two nodes, respectively.
Figure S3 Alignment-Layout-Consensus
A. Alignment: In this stage all reads (in yellow color) are aligned with a closely related reference sequence (shown in green). The alignment process may allow one or more mismatches between each individual read and the reference sequence depending on the user. The alignment of all the reads creates a Layout (B), beyond which the reference sequence is not used any more. C. Consensus: The layout helps in producing a consensus sequence (shown in green), where each base in the sequence is identified by simple majority amongst the bases at that position or via some probabilistic approach (shown in red and brown colors). The end result is the novel genome. D. The reference genome and the novel genome produced by the alignment-layout-consensus are placed side by side. The red regions mark the area in which both sequences differ. The novel genome may even be longer or shorter than the reference sequence.
Figure S4 Bidirected graphs
In a bidirected graph, if an edge connects two distinct nodes, it is referred to as a link. If an edge connects a node to itself, it is called a loop. However, if an edge points to an empty node, or a null node, or has only one end, then it is called a lobe.
Figure S5 Node-edge incidence matrix values
In the incidence matrix each element can take on the values -2, -1, 0, 1, 2, depending on
whether the edge has, with the corresponding node, 2 heads, 1 head, no connection, 1 tail or 2 tails,
respectively.
Figure S6 Depth First Search
This is an exhaustive search scheme which starts from the root (A) and picks one branch, say the left child (B). It then explores all the way until it reaches the leaf of that branch (D), before back-tracking. It then backtracks, and once one side of a branch has been explored it searches the other branch until it reaches its leaf. The sequence in which this tree is searched is A, B, C, D, E, F, G, H, I, J, K and lastly L. The search continues until the goal is reached or the tree is exhausted, whichever comes first.
Figure S7 Allocating a read to a particular contig
This is done in two steps. First, the length of the region within the contig from which reads are
generated is chosen randomly using a given distribution. Then the actual location around this region is chosen using a symmetric variant of a geometric distribution.
Figure S8 k-mer numbering
As one moves along any sequence, it can be divided into k-mers, which are encoded as 101, 102, 103, 104 so on and so forth, changing the numbering only when it hits a k-mer that has appeared before. This movement in one direction or path can then be identified as the closed interval [101,104].
Figure S9 Minimal extension and subsume
A. The upper sequence C completely subsumes the lower sequence D, as they align perfectly and C overhangs D on either end. B. E is the minimal extension of F such that E is not an extension of any other extension of F.
Figure S10 Overlap-Layout-Consensus
A. Overlap: Two reads overlap at the region shown in green. B. Layout: Overlap of a number of reads at the same time becomes a layout. C. Consensus: The layout helps in producing a
consensus sequence. One contiguous consensus sequence is called a contig (shown in red). Many contigs connected together form the DNA sequence in question.
Figure S11 Bidirected de Bruijn graphs
In a bidirected de Bruijn graph, the two nodes can interact with one another in four possible ways. The arrows above show the orientation of the two reads with relation to one another. The de Bruijn tuple underneath represents the equivalent representation of the individual figures. Panel C and D are equivalent. Although there are four ways in which the two nodes can interact, since C and D are equivalent, there are just three types of messages, as opposed to four, that need to be sent from one node to another to evaluate in which orientation they overlap.
Figure S12 Message passing
The figure shows the messages that are passed between two nodes based upon the orientation in which the reads overlap. It also shows the equivalent tuple representation of the de Bruijn and String graphs. A. Every node has two segments. One segment contains a k-mer and the other
segment its reverse complement. The segment shown is the k-mer which is lexicographically
larger than its reverse complement, together with its prefix and suffix. B, C, and D each show the equivalent representations of the de Bruijn graph, the String graph and the message that is shared to form an edge between the two nodes.
Figure S13 Interaction between nodes
A. The two nodes interact as shown. Each of the two segments within each node
is defined as +ve or -ve based upon its lexicographic order: the k-mer that is lexicographically larger than its reverse complement is defined +ve. The edge is defined by
two characters; in the case above, they are A/A. The first base 'A' is the first character in the k-mer of the second node. The second base 'A' is the first character in the k-mer of the first node's reverse complement. Panels B and C show the interactions between the individual segments of the nodes in A, in which case they are both equivalent. Within each segment, the underlined bases overlap, and the bases shown enlarged in B and
C end up being the identifier of the edge shared between the two nodes.
Figure S14 Types of messages
The figure displays the different types of messages that are sent between two nodes i and j. A. A Type 1
message is sent from i to j. B. A Type 1 message is sent from j to i. C. A Type 2 message is sent
from i to j if i is lexicographically larger than j. However, if j is lexicographically larger
than i, then a Type 2 message is sent from j to i. D. If i is lexicographically equal to j, then i = j,
and we end up in a loop with a message of Type 2. E. A Type 3 message is sent from i to j if
i is lexicographically larger than j. However, if j is lexicographically larger than i, then a
Type 3 message is sent from j to i. F. If i is lexicographically equal to j, then i = j, and we end up in a loop with a message of Type 3.
Figure S15 SSAKE/VCAKE: Prefix tree
It is a tree structure well suited for holding strings. Reads are composed of the bases {A, T, G, C}, so while searching for any particular read, each transition from parent to child divides the search space by
4. If there are m reads, each of size n, then searching for a particular read is an O(n) or O(log_4 m)
operation, since n steps are required from the root to reach the leaf. Every node shares a common prefix with its parents right up to the root, which is a NULL node. The spectrum is organized by the prefixes of the bases from their 5' end. VCAKE, on the other hand, uses the first 11 bases of the 5' end of the reads and their reverse complements as prefixes.
Figure S16 Radix sort
The input is the unordered list of numbers. Radix sort looks from the least significant digit to the most significant digit to sort the list. Notice in the unordered list 390 comes before 300. In the second list 390 still comes before 300 in the list. That is because while sorting radix sort only looks at digits with the same significance level and does not look at anything else. Therefore, the relative order of the numbers with the same digit at the significance level which is being looked upon remains the same after one complete iteration.
Figure S17 Distance matrix
The Hamming distance between two strings of equal length is the minimum number of substitutions required to change one string into the other. The string A requires one substitution to be converted into B. The matrix shows the distance required to convert one string into another. Notice that the matrix is symmetric; therefore, only the upper triangle or the lower triangle of the matrix needs
to be filled in. Generally, for an n x n matrix, the number of distances that need to be calculated
is n(n-1)/2. Herein, with n = 4, the number of distances that need to be evaluated is 6.
Figure S18 Making a suffix tree
The figure above represents, step by step, how to build a suffix tree for the string “BWABBAS$”. The suffix tree is made by assigning a number to every letter of the string B (1), W (2), A (3), B (4), B (5), A (6), S (7), $ (8). The tree starts building by taking suffixes one by one, moving towards left of the string. It first takes $, then S$, then AS$, and so on and so forth.
Figure S19 Building a suffix array
In the first step every letter of the word is assigned an index based upon its position in the word. In this example BWABBAS$ is indexed as B(1), W(2), A(3), B(4), B(5), A(6), S(7), $(8). A. Next, the suffixes of the word are placed in an array, indexed by the position of the first letter of every suffix. B. Finally, all the suffixes are sorted in alphabetical order and the final indexing of the array is shown. C. Here a simple binary search shows that it is very efficient to find a particular suffix. In this example S$ is searched using a binary search. The search starts from the middle of the array (1). Here index 5 contains BAS$. Since S$ comes after BAS$ alphabetically, the search looks ahead within the remaining half of the array. (2) Here index 1 contains BWABBAS$. Since S$ comes after BWABBAS$ alphabetically, the search looks ahead within the remaining quarter of the array. (3) The search has found what it was looking for, since index 7 contains S$.