
Supplementary Materials
Wajid B and Serpedin E. Review of general algorithmic features for genome assemblers for next-generation sequencers.

1. ABySS
Resources are a very important aspect of genome assembly. The term 'resources' in the field of computational algorithms pertains to the use of space (memory) and computational time, see Section 34. Different algorithms that try to solve the same problem are compared using three benchmarks: space, time, and the results generated by the programs. If the outputs of two programs are comparable, then the one that uses less time and space is definitely the better choice. In order to handle the large amounts of data generated by NGS platforms, and also to speed up the process, ABySS (1) uses distributed computing on a cluster of systems to generate a distributed representation of a de Bruijn graph. This parallelizes the assembly of billions of short reads and makes memory use cost effective, allowing genomes of virtually any size to be assembled, see Figure S1. Key concepts for the construction of the de Bruijn graph and its simplification are derived from EULER-SR, VELVET and EDENA.

1.1. Construction of a distributed de Bruijn graph

ABySS starts by discarding all reads with unknown bases ('N') and then proceeds with the construction of a distributed graph. Every graph is defined by a set of nodes and edges; a distributed graph is defined by a set of distributed nodes and distributed edges. We know from earlier considerations that each k-mer acts as a node: each read of length l is converted into (l - k + 1) overlapping k-mers. However, as with a hash function within a hash table, the location where each k-mer is placed should be as unique as possible; any conflicts that do arise can be resolved by further processing of the data. The location of each node (k-mer) in memory is therefore determined deterministically. Each base {A, C, G, T} translates into a numerical value {0, 1, 2, 3}, so each k-mer and its reverse complement admit a base-4 representation derived from their sequences. These two base-4 representations are converted into two individual hash keys, the hash keys are combined with a bitwise XOR operation, and the result is used to index the k-mer in memory. Because XOR is symmetric, a k-mer and its reverse complement map to the same location.
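As a rough illustration of this indexing idea, the following Python sketch encodes a k-mer and its reverse complement in base 4, hashes both values and XORs the results. It is only a simplified stand-in: Python's built-in hash replaces the lookup3 functions that ABySS actually uses, and the table size is arbitrary.

BASE_VALUE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def base4_value(kmer):
    # Interpret the k-mer as a base-4 number: A=0, C=1, G=2, T=3.
    value = 0
    for base in kmer:
        value = value * 4 + BASE_VALUE[base]
    return value

def reverse_complement(kmer):
    return ''.join(COMPLEMENT[b] for b in reversed(kmer))

def kmer_index(kmer, table_size):
    # XOR of the two hash values is symmetric, so a k-mer and its
    # reverse complement always map to the same bucket.
    forward = hash(base4_value(kmer))
    reverse = hash(base4_value(reverse_complement(kmer)))
    return (forward ^ reverse) % table_size

print(kmer_index("GATTACA", 1024) == kmer_index(reverse_complement("GATTACA"), 1024))  # True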


As for each distributed edge, we know that each k-mer has four plausible extensions on either side (contributing to its in-degree and out-degree), one per base {A, C, G, T}. For each k-mer in the sequence collection, a message is sent to its eight possible neighbors. If adjacency exists, then a (k-1)-base overlap between the concerned k-mers is recorded. The adjacencies are represented as 8 bits per k-mer, each bit recording the presence or absence of one of the eight possible edges. ABySS uses the MPI (Message Passing Interface) protocol for communication between nodes; for the internal hash tables it uses the Google sparse hash library (http://code.google.com/p/google-sparsehash/), and for hashing it employs the functions from http://burtleburtle.net/bob/c/lookup3.c.
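A hedged sketch of the 8-bit adjacency idea is shown below; the particular bit layout (right extensions in the low four bits, left extensions in the high four bits) is an assumption chosen for illustration, not necessarily the encoding ABySS uses.

BASES = "ACGT"

def adjacency_byte(kmer, kmer_set):
    # Bits 0-3: right (suffix) extensions A, C, G, T; bits 4-7: left
    # (prefix) extensions A, C, G, T.
    flags = 0
    for i, base in enumerate(BASES):
        if kmer[1:] + base in kmer_set:    # k-1 overlap on the right
            flags |= 1 << i
        if base + kmer[:-1] in kmer_set:   # k-1 overlap on the left
            flags |= 1 << (4 + i)
    return flags

kmers = {"ACG", "CGT", "TAC"}
print(bin(adjacency_byte("ACG", kmers)))   # 0b10001000: right neighbour CGT, left neighbour TAC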

1.2 Graph simplification and scaffolding

Graph simplification is done by removing tips (Figure 3I) and bubbles (Figure 7). Eventually all nodes that are linked via unambiguous edges are collapsed into a single node, which forms a contig. The reads are then aligned to these initial contigs to create a set of linked contigs, and links that were made in error due to mis-assemblies are removed at this point. Two contigs are joined if they share at least q links. For each contig, a list of the contigs paired with it is generated, and a graph search looks for a single unique path that starts from that contig and visits each contig in the list exactly once. To decide whether two contigs link together, the distance between each pair of contigs is calculated via maximum likelihood (http://genome.cshlp.org/content/suppl/2009/04/27/gr.089532.108.DC1.html), and the relative orientation and the order in which the two contigs are to be connected are inferred from the read-pair information. The process is repeated for each contig, and ultimately all consistent paths are connected to generate the global assembly.

2. Adjacency table and adjacency matrix

If two reads overlap with one another with an overlap size that hints they are neighbors in the actual genome, we say that the two reads are adjacent to one another. Reads that are neighbors share unique k-mers, as many as the length of their overlap. The adjacency table therefore assigns one link per pair of overlapping reads, regardless of the number of k-mers shared. Figure S2 illustrates the usage of the adjacency table and the adjacency matrix as two alternative representations of the connectivity between nodes in a graph (http://en.wikipedia.org/wiki/Adjacency_table).
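The small Python sketch below stores the same toy graph both ways; it is only meant to illustrate the memory and lookup trade-off discussed above and in the legend of Figure S2.

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
n = 4

adjacency_table = {v: [] for v in range(n)}          # dict of neighbour lists
adjacency_matrix = [[0] * n for _ in range(n)]       # n x n matrix of 0/1 flags
for u, v in edges:
    adjacency_table[u].append(v)
    adjacency_table[v].append(u)
    adjacency_matrix[u][v] = adjacency_matrix[v][u] = 1

print(adjacency_table[2])      # neighbours of node 2: [1, 0, 3]
print(adjacency_matrix[0][3])  # 0: nodes 0 and 3 share no edge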

3. Alignment-layout-consensus
Comparative assembly uses the alignment-layout-consensus paradigm, see Figure S3. In the alignment phase, reads are aligned to the reference sequence to determine their relative placement with respect to one another. The alignment of all the reads produces the 'layout'. Forming the consensus follows an approach similar to that explained in Section 27.

4. Base-ratio
A base ratio of, say, 0.6 means that, among the bases casting a vote on which base should be the winner (and subsequently part of the target genome) at a given position, at least 60% of the bases must vote in favor of the winning base.
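A minimal sketch of this voting rule, assuming a simple per-column majority with a configurable threshold:

from collections import Counter

def consensus_base(column, base_ratio=0.6):
    # 'column' holds the bases that all overlapping reads place at one position.
    counts = Counter(column)
    base, votes = counts.most_common(1)[0]
    return base if votes / len(column) >= base_ratio else None  # None = no winner

print(consensus_base(["A", "A", "A", "C", "A"]))  # 'A' (4/5 = 0.8 >= 0.6)
print(consensus_base(["A", "C", "A", "C"]))       # None (2/4 = 0.5 < 0.6)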

5. Bidirected graph
A bidirected graph (2) is an umbrella term that includes both directed and undirected graphs. A bidirected graph is especially interesting because the orientations allowed on its edges make it very flexible. We know that any graph has a set of nodes and a set of edges. In a bidirected graph, edges are referred to as links, loops or lobes, as depicted in Figure S4.

In Figure S2, the adjacency matrix was shown, where each entry recorded the interaction between two nodes. Here, rather than a node-node incidence matrix, the matrix defined is a node-edge incidence matrix: it has a row for each node and a column for each edge. Each element can take on one of the values {-2, -1, 0, 1, 2}, depending on the number of heads or tails connecting the edge to the node (Figure S5).

6. Breadth first search
A breadth first search (BFS) is a graph search algorithm that begins at the root node and explores all of its neighboring nodes, which lie at the same depth from the root. Then, for each of those nearest nodes, it explores their unexplored neighbors one by one, all of which lie at the same distance from the root node. The search continues until it finds what it was looking for. The term 'breadth first' comes from the fact that the algorithm explores all nodes at distance k from the root before it starts exploring nodes at distance k + 1. In other words, the search keeps track of the breadth of each node from the root, and all nodes at the same breadth are searched before going any deeper in the tree. BFS does not adopt any heuristic and exhaustively searches the entire graph until it finds its goal.
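A minimal BFS sketch over an adjacency-table graph, illustrating the level-by-level exploration described above:

from collections import deque

def bfs(graph, start, goal):
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()           # nodes closest to the root come out first
        if node == goal:
            return True
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return False

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(graph, "A", "D"))  # True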

7. Chimeric reads
Chimeric reads are reads with one end belonging to one part of the genome and the other end belonging to a non-adjacent, possibly far apart, part of the genome.

8. Chinese restaurant process
A Chinese restaurant (because of the immense population of China) can be assumed to contain an infinite number of tables, each with infinite capacity. In the Chinese restaurant process (CRP) (http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf), each customer either sits at an unused table or sits at one of the previously occupied tables. The CRP is used to randomize the partitioning of N customers (here N reads) onto tables (here contigs). With concentration parameter alpha, the tables are chosen in accordance with the following random process:

The first customer chooses the first table.

The i-th customer chooses the first unoccupied table with probability alpha / (i - 1 + alpha), and an occupied table with probability c / (i - 1 + alpha), where c is the number of people already sitting at that table.
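The following sketch samples such a partition; the concentration parameter alpha and the sequential sampling loop are standard CRP ingredients, not details taken from any particular assembler.

import random

def sample_crp(n_customers, alpha, seed=0):
    rng = random.Random(seed)
    tables = []        # tables[t] = number of customers already at table t
    assignment = []    # table chosen by each customer in turn
    for i in range(1, n_customers + 1):
        weights = tables + [alpha]       # occupied tables, then a new table
        r = rng.uniform(0, i - 1 + alpha)
        cumulative = 0.0
        for t, w in enumerate(weights):
            cumulative += w
            if r <= cumulative:
                break
        if t == len(tables):
            tables.append(1)             # open a new table (seed a new contig)
        else:
            tables[t] += 1               # join an occupied table (extend a contig)
        assignment.append(t)
    return assignment, tables

assignment, tables = sample_crp(10, alpha=1.0)
print(assignment, tables)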

9. Composite vertex

In making the graph, similar vertices were collapsed to form one vertex. For such a vertex, consider the set of sequence positions that were collapsed to form it. The vertex is called composite if that set contains two positions that lie close to one another (at a distance smaller than some threshold). Every composite vertex is a plausible source of a whirl. Whirls are caused by ambiguities in pair-wise alignment. All composite vertices are converted into non-composite vertices by splitting them so as to produce at least one non-composite vertex; in other words, whirls are removed by inserting gaps or removing some matches in the alignment, which is inherently what is meant by splitting the collapsed composite vertex. Bulges are more difficult to remove. They can be resolved by eliminating any one of their edges, but how does one choose, or ensure, that the removed edge was the right one? This is achieved by adopting a maximum weighted spanning tree strategy. The approach amounts to either adding or removing edges provided that a user-defined condition is satisfied: an edge is added only if the new edge does not form a cycle shorter than the user-defined threshold; otherwise, the edge is removed. Iterating this process over the entire graph ensures that the misbehaving edges are removed. (These explanations refer to de novo assembly with A-Bruijn graphs.)

10. Coverage

A coverage of 20 means that at least 20 bases should be present at any particular location (i.e., 20 reads should overlap at that location) to cast a vote on which base should be the consensus base at that location in the target genome.

11. de Bruijn graph
A de Bruijn graph is a directed graph in which an edge connecting one node to another represents an overlap between the two connected nodes. In a de Bruijn graph, each node is a string of dimension n over m distinct symbols; therefore, there are m^n nodes in total. A directed edge from a parent node to a child node exists if the last n - 1 symbols of the parent coincide (overlap) with the first n - 1 symbols of the child node. Therefore, for every node: indegree = outdegree = m (http://en.wikipedia.org/wiki/De_Bruijn_graph#cite_note-Bruijn1946-0).
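As an illustration only (assuming the node-per-k-mer convention used throughout this review), the sketch below builds such a graph from a couple of toy reads:

from collections import defaultdict

def de_bruijn_graph(reads, k):
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in kmers:
        for base in "ACGT":
            successor = kmer[1:] + base        # candidate (k-1)-overlap neighbour
            if successor in kmers:
                graph[kmer].append(successor)
    return graph

graph = de_bruijn_graph(["ACGTAC", "CGTACG"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))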

12. Depth first search
Depth first search (DFS) is a search tool for graph structures. It starts by picking one branch of the tree and explores all the children of that branch until it reaches a leaf, thereby exploring one branch all the way to its full depth before exploring the next one, see Figure S6.
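A minimal recursive DFS sketch matching this description:

def dfs(graph, node, goal, visited=None):
    # Follow one branch to its leaf before backtracking to try the next one.
    if visited is None:
        visited = set()
    if node == goal:
        return True
    visited.add(node)
    for child in graph.get(node, []):
        if child not in visited and dfs(graph, child, goal, visited):
            return True
    return False

tree = {"A": ["B", "E"], "B": ["C", "D"], "E": ["F"]}
print(dfs(tree, "A", "F"))  # True: B's subtree is exhausted first, then E is tried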

13. Detection and estimation
A standard detection problem amounts to finding the consensus base for a set of bases aligned at the same point of the genome, e.g. Figure S3C. There, simply the frequency or probability was used as the tool to choose which base (each base being a hypothesis) is selected among the other bases or hypotheses. In statistical decision theory, a situation in which a receiver receives a noisy version of a signal and decides which hypothesis is true among M possible hypotheses is referred to as a detection problem. In a 4-ary case, the receiver has to decide among four hypotheses; in genome assembly the hypotheses are {A, T, G, C}. In estimation theory, once the receiver has made a decision in favor of the true hypothesis, it estimates, in an optimum fashion and based on a finite number of samples of the signal, some parameter associated with the signal that may not be known (3).

14. Dijkstra's algorithm
Dijkstra's algorithm solves the single-source shortest paths problem. In a shortest-path problem, the data is provided as a directed weighted graph. The aim is to find the path with the smallest total weight between two nodes; the path with the smallest weight happens to be the shortest path. In Section 6, the breadth-first search (BFS) algorithm was explained; BFS is a shortest-path algorithm operating on graphs in which each edge has unit weight. In the single-source shortest path problem, the aim is to find the shortest paths between an arbitrary source node and all the other nodes in the graph (4).
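A minimal sketch of Dijkstra's algorithm using a binary heap; the toy graph and its edge weights are made up purely for illustration.

import heapq

def dijkstra(graph, source):
    distances = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > distances.get(node, float("inf")):
            continue                              # stale heap entry, skip it
        for neighbour, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return distances

graph = {"s": [("a", 4), ("b", 1)], "b": [("a", 2)], "a": [("t", 3)]}
print(dijkstra(graph, "s"))  # {'s': 0, 'a': 3, 'b': 1, 't': 6}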

15. Distributed computing
The data generated by NGS platforms for genome assembly is usually very large, and handling and processing such a large amount of data takes a lot of time. Although today's systems are fast and offer ever more memory for data processing, considerable time and space are still consumed in handling this copious amount of data, especially for larger genomes. Whenever such issues arise, computer scientists and engineers come up with solutions: one can buy a bigger, more powerful machine, or one can use a cluster of systems to solve the same problem. A cluster of systems uses what is called distributed computing, in which many smaller systems interact with one another by sending and receiving messages.

In distributed computing, one giant problem is divided into a large group of tasks, where each task is performed by an individual system. Each task may itself be divided into a number of sub-tasks, with each sub-task executed in parallel by an individual system, causing it to be completed much more quickly. It is very important in distributed computing to keep account of where the individual data is placed and where it is being used. Usually, distributed computing is cheaper than buying a much more powerful system with comparable performance. Two powerful mechanisms that allow distributed computing are MPI and OpenMP.

16. Eulerian path
An Eulerian path is a path in which all edges, rather than all nodes, are traversed exactly once. An Eulerian path may traverse a given node more than once.

17. Eulerian super-path problem

A directed graph is Eulerian if it contains a cycle that traverses every directed edge exactly once (5). EULER conducts genome assembly by solving the Eulerian super-path problem: given a graph and a collection of paths (one per read), find an Eulerian path that contains each of the given paths as a sub-path.

The aim is to convert the Eulerian super-path problem, defined on a graph and a set of paths, into an Eulerian path problem on a transformed graph and set of paths in which each path is a single edge. Such a conversion is achieved by using a series of transformations that are equivalent, in the sense that a solution to the Eulerian path problem in the transformed graph is equivalent to a solution to the Eulerian super-path problem in the original graph.

18. Exhaustive assembler
The exhaustive assembler (6) came at a time that marked the transition from assemblers targeting Sanger technology towards assemblers targeting NGS, see Figure S1. It therefore assembles DNA from read data in which the read length is very large, as opposed to NGS, where the read length is small. An exhaustive scheme for genome assembly essentially compares all n reads with one another in order to identify the best overlapping pairs of reads. This process takes O(n^2) time. An intelligent use of a k-mer library and an adjacency table collapses that time to O(n).

k-mer library: A collection of k-mers (strings of k characters) produced by chopping up the read data into many k-mers and cataloging them along with the reads from which they were produced.

Adjacency Table: If two reads overlap with one another with an overlap size which hints that

they are neighbors of one another in the actual genome, the two reads are considered to be

adjacent to one another. Reads that are neighbors share unique k-mers, as many as the length of

their overlap. Building the k-mer library and identifying which reads they come from indirectly

identifies the neighboring reads. The adjacency table therefore assigns one link per pair of

overlapping reads regardless of the number of k-mers shared.

Identifying Contigs: In graph theory, disjoint graphs are a set of nodes and edges, which

together form non-overlapping sets. The adjacency table acts as a set of disjoint undirected

cyclical graphs. A breadth first search (BFS) is an exhaustive graph search algorithm that begins

at the root node and explores all its neighboring nodes that are present at the same depth from the

root node. It continues the search keeping in mind the breadth of each node from its root, and all nodes that are at the same breadth are searched before going any deeper in the tree (6). Because

the adjacency table is a set of disjoint undirected cyclical graphs, performing BFS on all these

individual graphs is equivalent to doing a joint BFS on all graphs at the same time. Each disjoint

graph identified in this manner represents a single contig. The set of all contigs identifies the

genome.
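The sketch below is a strongly simplified illustration of this pipeline (k-mer library, adjacency table, BFS over the disjoint components); it is not the published exhaustive assembler, and reporting read indices instead of assembled sequences is a deliberate simplification.

from collections import defaultdict, deque

def contigs_from_reads(reads, k):
    kmer_library = defaultdict(set)        # k-mer -> indices of reads containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer_library[read[i:i + k]].add(idx)

    adjacency = defaultdict(set)           # one link per pair of k-mer-sharing reads
    for read_ids in kmer_library.values():
        for a in read_ids:
            for b in read_ids:
                if a != b:
                    adjacency[a].add(b)

    contigs, seen = [], set()
    for start in range(len(reads)):        # BFS over each disjoint component
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        contigs.append(component)
    return contigs

reads = ["ACGTAC", "GTACGG", "TTTTAA", "TTAACC"]
print(contigs_from_reads(reads, k=4))  # [[0, 1], [2, 3]]: two disjoint contigs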

19. Genovo
Genovo [7] is a de novo assembler built on a probabilistic framework that uses a Chinese restaurant process prior to account for the unknown number of genomes in the sample, see Figure S1. Genovo does not require any prior information, and it performs read alignment, read denoising and de novo assembly. Genovo adopts a modular approach to provide further flexibility within its probabilistic framework: one can adjust the noise model employed without affecting the rest of the model, and one can replace the uniform prior with a prior based on a reference sequence for better results, see Section 40. Normally distributed probabilistic models are corroborated by the law of large numbers and the central limit theorem, see Section 24.

19.1 Probabilistic model

In genome assembly, one can never predict beforehand the number of contigs one will obtain from the reads. However, one can safely say that for a genome of finite length, neither the number of contigs nor the size of any one contig can exceed that length. For mathematical simplicity, one can nevertheless assume that there could be an infinite number of contigs in the assembly, where each contig is allowed to contain an infinite number of bases. The bases at each location of each contig are sampled uniformly from {A, C, G, T}.

Generally, the data produced by NGS platforms is very large; let the number of reads produced be N. Using the same scheme as above and assuming infinitely many contigs, where each read either seeds or extends a contig, a Chinese restaurant process (CRP) (http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf) is justified as a model, see Section 8. The CRP is used to randomize the partitioning of the N reads onto the possible set of contigs. The contigs are chosen in accordance with the following random process:

The first read chooses the first contig.

The i-th read chooses to seed a new contig with probability alpha / (i - 1 + alpha), or to extend an existing contig with probability c / (i - 1 + alpha), where c is the number of reads already forming that contig.

Once every read is allocated to a particular contig, the question remains where the read is placed within that contig. For an empty contig it does not make sense to ask to which location a read belongs, since the contig is empty. For a contig that is not empty, however, a bias is applied that favors locations nearer the start of the contig over locations nearer its end. Notice that both the start and the end position can still be taken up, since the position is chosen randomly, see Figure S7. A geometric distribution is used to assign the read to a location centered around the chosen region. Each read is also assigned a length drawn from an arbitrary distribution, and the individual bases of the read are copied from the contig it is aligned to, in accordance with the noise model of the sequencing technology.

Every alignment contains matches (match), mismatches (mis), insertions (ins) and deletions (del). The probability of a read given its alignment depends on each of these phenomena (ins, del, match, mis), and the weight assigned to each should be proportional to the number of its occurrences: writing the probability of incorrectly copying a base, the probability of a deletion, the probability of an insertion and the number of matching bases in the alignment, the read probability is the product of these terms raised to the corresponding counts. Note that the mismatch probability is divided by three [59]: for a base to mismatch, it must align to one of the other bases and not to the true base, and since there are four bases and one of them is excluded, three possibilities remain. Taking the logarithm of this expression yields the alignment score, and the log-likelihood of the complete model reduces to the sum of such terms over all reads and contigs.

19.2 Assembly algorithm

The algorithm is a modified version of the Iterated Conditional Modes (ICM) algorithm [60]. It maximizes the local conditional probabilities sequentially until convergence is achieved, and it outputs the most probable assembly.

Consensus Sequence: Here ICM updates each base of each contig, at every location, to the value best supported by the reads aligned onto that location, i.e., the consensus of the aligned reads.

Read Mapping: Taking the probabilistic model into consideration, this part maps each read to its best plausible alignment. For each read it determines the contig the read belongs to, the particular location within that contig, and the alignment itself. Each read is first removed completely from the assembly; candidate placements are then evaluated and given an alignment score under the model, the best-fitting alignment is chosen, and the read's contig and location are sampled accordingly.

Global Moves: Notice that all the above steps of ICM are local, since all the parameters were updated on a case-by-case basis by exhausting all the cases. However, the procedure becomes global by adopting the following steps:


1) Propose Insertions/Deletions: If most of the reads aligned to a location in a contig

present an insertion, then put that insertion in the contig and realign the reads. If the

likelihood is improved, then accept the change. Similarly if most reads that align to a

location on the contig present a deletion, then delete that location in the contig and

realign the reads. If the likelihood is improved, then accept the change.

2) Merge: Merge two contigs if their ends overlap and provided that the likelihood

increases.

Chimeric Reads: Reads whose two ends match non-adjacent locations on the genome. They are normally present at the edges of contigs. Therefore, after every five iterations the edge reads are removed, thereby allowing correct reads or contigs to patch the edges and continue increasing the likelihood. If a removed read is not chimeric, it will simply be merged back at the edge.

20. Graph structure

Graph theory has been used extensively for genome assembly. A graph is a data structure consisting of a set of nodes and a set of edges. A graph is directed if all its edges are oriented, and undirected if its edges carry no orientation. Genome assembly can be looked upon as a graph structure with known nodes and unknown edges: the nodes are the reads, and how they are connected is denoted by the graph edges (9). A node in a graph is called balanced if the number of edges entering it is equal to the number of edges exiting it.

21. Hamiltonian path
A Hamiltonian path is a path in which each node of the graph is traversed exactly once.

22. Hash table

A hash table is a data structure that uses a hash function to map a 'key' of any type to an index of an array (a 'bucket'), where the key's associated value is located. Owing to their fast search operation, hash tables are very widely used in databases.

23. K-mer numbering
K-mer numbering is an equivalent of a graph structure in which k-mer numbers are the nodes and the edges between them represent adjacencies between two k-mer numbers, as illustrated in Figure S8. (This explanation refers to ALLPATHS.)

24. Law of large numbers and central limit theorem
The law of large numbers states that the average result of an experiment repeated over a number of trials (called the sample average) comes closer and closer to the true average as the number of trials increases without bound. There are at least two forms of the law of large numbers: a strong and a weak version. The strong law of large numbers states that the sample average converges almost surely to the expected value:

Pr( lim_{n -> infinity} X_bar_n = mu ) = 1.

In probability theory, an event that happens almost surely happens with probability 1. The weak law of large numbers states that the sample average converges in probability to the expected (true) value:

for every epsilon > 0, Pr( |X_bar_n - mu| > epsilon ) -> 0 as n -> infinity,

meaning that if the number of samples is large, the sample average will be close to the true expected value.

An inevitable companion of the law of large numbers is the central limit theorem. If S_n = X_1 + ... + X_n denotes a sum of n identically distributed and independent random variables (one random variable does not affect the others), then (S_n - n*mu) / (sigma * sqrt(n)) converges in distribution to a Gaussian with zero mean and unit variance, N(0, 1). The parameters mu and sigma are the mean and standard deviation of the distribution from which the X_i are sampled. This is what justifies the use of the Gaussian distribution.

25. Maximum likelihood genome assembly
Amongst the many ways of estimating unknown parameters, one of the simplest and most robust methods is the maximum likelihood approach, see Figure S1. Consider estimating a parameter theta from a set of observations x. The likelihood function is the conditional density of the observations, viewed as a function of the parameter:

L(theta) = p(x | theta).

The Maximum Likelihood Estimate (MLE) is the value of theta that maximizes the likelihood function:

theta_hat = argmax_theta L(theta),

and it can be obtained by solving the equation d log L(theta) / d theta = 0.

In [57], MLE has been used for genome assembly.

1) Using Maximum Likelihood: For any circular genome, let L be the length of that genome. An NGS platform provides a set of reads of length l sampled from the genome. Assume the output of the NGS platform contains a total of N reads. Probabilistically, one can say that the reads are generated via N independent trials: in each trial, a position is uniformly sampled from the genome and the read begins at that position. If each read has length l and each position can take any of the 4 bases {A, C, G, T}, then each possible read is itself a variable and there are 4^l such variables. Let d_i be the number of trials whose outcome is the i-th read. When the variables are considered individually, each follows a binomial distribution; taken together, their joint distribution is multinomial, and it defines the global read-count likelihood, i.e., the likelihood of the parameters given the observed counts. Maximizing the global read-count likelihood is equivalent to minimizing its negative logarithm. However, since the multinomial distribution imposes the constraint that the counts sum to N, this minimization cannot be performed directly. Since the number of trials is very large (it may be assumed to go to infinity), the d_i become independent and the joint multinomial distribution can be expressed as the product of the individual binomial distributions of each d_i. Under this binomial approximation the length of the genome appears only as a constant, independent of each d_i; therefore, the approximate length of the actual genome can be ascertained through the Expectation-Maximization (EM) algorithm. Assuming that the genome size is known, the likelihood admits an approximate expression that factors out a positive constant independent of all the d_i, times terms that depend on the individual counts.


2) Graph Construction: A modified version of the de Bruijn graph is made by applying the 'overlap-layout-consensus' paradigm. The de Bruijn graph is formed by connecting vertices by

an edge if the corresponding reads overlap. Employing de Bruijn graphs reduces the fragment

assembly problem to finding a path visiting every edge of the graph exactly once, i.e., an

Eulerian path problem [31]. Graph correction comes after graph construction via simplification,

multiplicity, erosion, whirl removal, removing bulges, and straightening the graph [37].

Although only a slight improvement is achieved past 75x coverage, the methodology explained above provided the first exact polynomial-time assembly algorithm that deals with double-stranded genomes [57].

26. Minimal extension and subsumptions
Using the data obtained, ALLPATHS finds all minimal extensions and subsumptions of each read in the spectrum, see Figure S9. This information is then used to perform a depth first search (DFS), see Section 12, and is stored as a graph structure in which the nodes are reads and an edge denotes the fact that one read is the minimal extension of another. The nodes also store the offset from the start of the DFS. (This explanation refers to ALLPATHS.)

27. Minimus
Minimus uses AMOS (A Modular Open-Source assembler), a collection of open-source modules and libraries useful for developing genome assemblers (http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOS), see Figure S1. These AMOS modules, or stages, interact with one another via an AMOS data structure called a "bank". Minimus [33] is built using three AMOS modules, namely Overlapper, Unitigger and Consensus, following the overlap-layout-consensus paradigm. Initially all the reads are loaded into the AMOS bank. Minimus computes all pairwise alignments between the reads (a step conducted by the Overlapper) and uses the resulting information to generate an overlap graph in which each read is


a node and an edge connects two nodes if the reads overlap (a step conducted by the

Unitigger).

Like all assemblers that exploit graph theory to solve the genome assembly problem, Minimus employs an extensive graph simplification process in order to simplify the identification of contigs and the subsequent scaffolding. Two methods adopted by the Unitigger for graph simplification are explained in Figures 3A and 3B of the main text.

After simplifying the graph, the consensus sequence is formed by progressive multiple

alignment of the reads aligned within each unitig. Here Quality-values (Q-values) are used

during the consensus stage to trim the poor quality areas of each read. Q-values are sequencing

platform specific. Therefore, the trimming criterion is subject to which NGS platform was used

for sequencing.

28. Overlap-layout-consensus
The overlap-layout-consensus paradigm, as the name suggests, consists of three steps, see Figure S10. In the first step an overlap graph is created by joining each read to its best overlapping reads; this step is similar to the greedy approach. The layout stage, ideally, is responsible for finding a single path that starts at the beginning of the genome, traverses all the reads exactly once and reaches the end of the sequenced genome. A consensus sequence is then formed by looking at all the bases that occur at each location and identifying the one with the highest call.
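As an illustration of the overlap step alone, the sketch below finds the longest suffix-prefix overlap between two reads; the minimum overlap length used here is an arbitrary assumption.

def longest_overlap(a, b, min_length=3):
    # Longest suffix of read a that equals a prefix of read b.
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

print(longest_overlap("ACGTTGCA", "TGCAGGTC"))  # 4 ("TGCA")
print(longest_overlap("ACGTTGCA", "GGGGGGGG"))  # 0 (no overlap of at least 3 bases)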

29. Parallel construction of bidirected string graphs

This assembler uses a bidirected string graph, with a set of nodes and a set of edges, to represent the genome, see Figure S1. A bidirected string graph is derived from bidirected de Bruijn graphs [40], [22]. This choice is justified by the following issue inherent in assembly data: although each read is sequenced from one end to the other, it is not known to which of the two strands of the double-stranded DNA the read belongs. A read can therefore have either of two orientations, which paves the way for using bidirected de Bruijn graphs, as illustrated in Figure S11.


An edge in the graph is defined using a tuple that records the two nodes it connects, together with an orientation for each node indicating whether the edge points outwards from or inwards into that node. A path in the graph is an ordered list of such tuples in which consecutive tuples share a node, as depicted in Figure S12.

Converting a bidirected de Bruijn graph into a bidirected string graph is done by expanding the tuple to include two more parameters: the character obtained by walking from the first node to the second, and the character obtained by walking from the second node back to the first. Notice in Figure S13 that although the suffix of one k-mer overlaps the prefix of the other, the two characters defining the edge are the first base of the first node's k-mer and the character obtained while walking back from the second node towards the first.

1) Making the graph: The first step in generating the graph is to create the nodes, which are made by pairing up each k-mer with its reverse complement; the k-mer that is lexicographically larger than its counterpart is assigned the +ve orientation. Once all k-mers and their counterparts have been identified, all unique k-mers are sorted in parallel [41], and roughly equal shares of the sorted nodes are sent to each of the p processors. Each k-mer receives a unique identifier, and since the numbering is done on a sorted list, this step is trivial. Let n be the combined length of all k-mers. Since the process is conducted in parallel by p processors, the whole sorting step takes O(n/p) time to execute.

Once each processor has been provided with its share of sorted nodes, the edges between the nodes must be identified to complete the graph. In order to create edges between any two given nodes, it should be realized that any k-mer/node can overlap with only a limited number of other nodes. Edge creation is achieved by passing messages from one node to another. Each message is a tuple containing the node which sends the message, the node which receives the message, the type of message (there are three message types) and the character associated with the edge between the two nodes, see Figures S11-S14. The messages must first be created, then sorted locally in each processor using the radix sort algorithm, then sent, after which the edges are created and sorted locally using radix sort once more; each of these steps takes O(n/p) time per processor. As shown above, the entire graph construction process, from sorting all unique k-mers through message creation, message sorting and edge creation to the final sorting of the individual edges, is done in parallel using p processors, with each sub-task itself executing in O(n/p) time. In algorithmic terminology the constant of proportionality is omitted, and O(n/p) is presented as the algorithm's time complexity.

2) Larger genomes: For a genome of length L, the total number of unique k-mers cannot be larger than L. Therefore, even if the coverage is around 100x or the assembly data runs into terabytes, one only needs to know the individual unique k-mers and the number of times each k-mer occurs, also called its copy count. Since the graph formation depends only on the sorted array of unique k-mers, and since the read data can be considered in stages, one can populate the array in stages (once an entry for a unique k-mer exists, all that is left to do is increase its copy count whenever the same k-mer appears again in the data) and sort the array in stages. This means that no matter how large the data is, the whole process can be done in stages: the assembly data is taken one stage at a time and the entire graph-formation process is laid out and implemented. Of course, each time the sorted array gets bigger and bigger, while the process remains the same.

For larger genomes, the number of messages sent would otherwise be far greater than necessary, since any k-mer may send a message to another k-mer that does not even exist in the data. Since two adjacent nodes overlap with one another by k - 1 characters, only one message in each direction needs to be sent for each node, which reduces the number of messages sent for larger genomes. Note that, due to the staged process, for larger genomes the actual graph construction is computationally less expensive than the analysis of the data and the piling up of the data at the right processor.

30. Prefix tree
A prefix tree (also known as a 'trie') is a tree structure that has strings or alphabet characters as its keys, see Figure S15.
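A minimal trie sketch for DNA reads, using nested dictionaries and a '$' terminal marker (both implementation choices made here purely for illustration):

def trie_insert(root, read):
    node = root
    for base in read:
        node = node.setdefault(base, {})   # one child dict per base
    node["$"] = True                       # terminal marker for a complete read

def trie_contains(root, read):
    node = root
    for base in read:
        if base not in node:
            return False
        node = node[base]
    return "$" in node

root = {}
for read in ["ACGT", "ACGA", "TTGC"]:
    trie_insert(root, read)
print(trie_contains(root, "ACGA"), trie_contains(root, "ACGC"))  # True False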

31. Quality Values
The Quality (Q-) values are the probabilistic data provided by NGS platforms identifying the quality of each base in each read. Q-values are computed differently for different platforms. For example, for Solexa (later acquired by Illumina) the formula is:

Q_solexa = -10 * log10( p / (1 - p) ).

To convert a Solexa Q-value back into a probability, use:

p = 10^(-Q_solexa/10) / ( 1 + 10^(-Q_solexa/10) ).

Similarly, for Sanger (Phred) and other platforms the formula is:

Q_phred = -10 * log10( p ),

and the conversion of Q-values into a probability is done via:

p = 10^(-Q_phred/10).

The two scales can be converted into one another using the formula:

Q_phred = 10 * log10( 10^(Q_solexa/10) + 1 ).

Here p is the probability that the base-call is incorrect; therefore, the probability of a correct base-call is 1 - p. The Q-values essentially provide an integer mapping of the error probability of calling a base.
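The conversions above are easy to check numerically; the sketch below implements them directly, assuming the standard Phred/Solexa definitions quoted above.

import math

def phred_to_error_prob(q):
    return 10 ** (-q / 10)

def solexa_to_error_prob(q):
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)

def solexa_to_phred(q):
    return 10 * math.log10(10 ** (q / 10) + 1)

print(round(phred_to_error_prob(20), 4))   # 0.01
print(round(solexa_to_error_prob(20), 4))  # 0.0099
print(round(solexa_to_phred(20), 2))       # 20.04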

32. Radix sort
Radix sort is an algorithm that sorts a list of L numbers composed of n digits by first ordering all the numbers by their least significant digit alone. It then re-orders the entire list by the second-least significant digit, and continues iteratively until the list has been ordered by its most significant digit, at which point it is fully sorted. Figure S16 depicts the radix sorting algorithm graphically (4).
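A minimal least-significant-digit radix sort sketch for non-negative integers, mirroring the digit-by-digit passes described above:

def radix_sort(numbers):
    # Least-significant-digit radix sort for non-negative integers.
    if not numbers:
        return numbers
    digit = 1
    while max(numbers) // digit > 0:
        buckets = [[] for _ in range(10)]
        for number in numbers:
            buckets[(number // digit) % 10].append(number)
        numbers = [n for bucket in buckets for n in bucket]  # stable re-merge
        digit *= 10
    return numbers

print(radix_sort([390, 300, 45, 7, 512]))  # [7, 45, 300, 390, 512]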

33. Similarity matrix
A similarity matrix is a representation similar to an adjacency matrix (Figure S2): it identifies whether two nodes or points are adjacent to one another, and it also tells how close or how far apart they are from one another. In other words, a similarity matrix provides a weight for every edge of the graph. For strings, a similarity matrix is very close to the concept of a distance matrix (Figure S17) or a substitution matrix. A de facto standard among substitution matrices is BLOSUM62 (16). This matrix assigns scores to the substitution of one amino acid with another on purely statistical grounds, with no direct reference to the amino acid's biochemistry or structure (16).

34. Speed and time complexity
One very important aspect of genome assembly is the use of resources. The term 'resources' in the field of computational algorithms pertains to the use of space (memory) and computational time. The use of these two resources is quantified in terms of the big-O notation. Usually these two resources compete with one another: an algorithm may take very little time, but at the expense of memory, whereas another may take up less space at the cost of more execution time. A true advance in the field of algorithms and problem solving comes when a new algorithm outperforms another algorithm in solving a particular problem. But how does one define 'outperform'? The three benchmarks that are compared are space, time, and the results generated by the programs. If the outputs of two programs are comparable, then the one that uses less time and space is definitely the better choice. The word 'comparable' is used because two programs that try to solve the same problem may not give exactly the same results, and their results may diverge further if the problem happens to be a complicated one.

35. Scaffolding
The process of placing contigs relative to one another in order to form the genome is called scaffolding. Two contigs are adjacent to one another if two or more mate-pairs link them together. Because of very low coverage, the mate-pair information may not be sufficient to identify the individual bases of the missing links, yet it is good enough to identify which contig comes first and which contig comes next.

36. Singletons
Singletons are reads that remain unused in the assembly process.


37. Spectrum

The collection of a set of reads and their reverse complements is called the spectrum. The reverse complements of the reads are also referred to as their mate-pair data.

38. Suffix Tree and Suffix Arrays
A suffix tree differs from a prefix tree in that it is built from the suffixes of a string, whereas the latter is built from prefixes. Suffix trees have been widely used in a large number of biological applications, ranging from exact string matching and recognizing DNA contamination to finding common sub-strings and minimum-length decoding of DNA (17), see Figure S18 (http://sary.sourceforge.net/docs/suffix-array.html). For suffix arrays see Figure S19.
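As a simple illustration of the suffix-array idea (built here by naive sorting of all suffixes, which is adequate for short strings but not the construction real tools use), the sketch below indexes the example string from Figures S18-S19 and locates a query by binary search.

def build_suffix_array(text):
    # Naive construction: sort the starting positions of all suffixes.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, suffix_array, query):
    lo, hi = 0, len(suffix_array)
    while lo < hi:                         # binary search over the sorted suffixes
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(suffix_array) and text[suffix_array[lo]:].startswith(query):
        return suffix_array[lo]            # start position of one occurrence
    return -1

text = "BWABBAS$"
suffix_array = build_suffix_array(text)
print(suffix_array)                     # [7, 2, 5, 4, 3, 0, 6, 1]
print(find(text, suffix_array, "BAS"))  # 4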

39. Tandem repeats
Repeats are portions of the DNA that occur more than once. Tandem repeats are areas of the genome where a particular sequence repeats itself over and over again, with the copies adjacent (in tandem) to one another. One example would be a sequence such as ACGGT ACGGT ACGGT ACGGT, where the spaces are only placed for readability; here the tandem repeat sequence is ACGGT.

40. Uniform distribution
In genomes, we know in general that A-T and G-C form base pairs; therefore, the number of A's equals the number of T's, and the same holds for G's and C's. Moreover, A, T, G and C are present in almost equal proportions, although some organisms may have a higher G-C content due to the stability of G-C base pairs. Therefore, if a base is drawn at random from the genome, then P(A) = P(C) = P(G) = P(T) = 1/4. Here we use the uniform distribution since all outcomes are equally probable.


41. Unipath
A unipath is an equivalent of a graph structure in which k-mer numbers are the nodes and the edges between them represent adjacencies between two k-mer numbers; the terminology is used by ALLPATHS.

References

1. Simpson, J.T. et al. 2009. ABySS: a parallel assembler for short read sequence data. Genome Res, 19(6):1117.
2. Edmonds, J. and Johnson, E. 2003. Matching: a well-solved class of integer linear programs. Combinatorial Optimization -- Eureka, You Shrink! (eds. Giovanni, et al.), pages 27-30.
3. Barkat, M. 2005. Signal detection and estimation. Artech House.
4. Cormen, T.H., et al. 1976. Introduction to algorithms. MIT Press and McGraw-Hill Book Company, vol. 7, pp. 1162-1171.
5. Pevzner, P. 2000. Computational molecular biology: an algorithmic approach, volume 1. MIT Press, Cambridge, MA.
6. Shah, M.K. et al. 2004. An exhaustive genome assembly algorithm using k-mers to indirectly perform n-squared comparisons in O(n). Proc IEEE Comput Syst Bioinform Conf (CSB'04).
7. Laserson, J. et al. 2010. De novo assembly for metagenomes. Res Comput Mol Biol, pages 341-356. Springer.
8. Besag, J. 1986. On the statistical analysis of dirty pictures. J R Stat Soc Series B Stat Methodol, 48(3):259-302.
9. Koller, D. and Friedman, N. 2009. Probabilistic graphical models: principles and techniques. The MIT Press.
10. Medvedev, P. and Brudno, M. 2009. Maximum likelihood genome assembly. J Comput Biol, 16(8):1101-1116.
11. Pevzner, P. et al. 2004. De novo repeat classification and fragment assembly. Genome Res, 14(9):1786.
12. Hernandez, D. et al. 2008. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res, 18(5):802.
13. Sommer, D.D. et al. 2007. Minimus: a fast, lightweight genome assembler. BMC Bioinformatics, 8(1):64.
14. Jackson, B.G. and Aluru, S. 2008. Parallel construction of bidirected string graphs for genome assembly. ICPP, pages 346-353.
15. Helman, D.R. et al. 1998. A new deterministic parallel sorting algorithm with an experimental evaluation. Journal of Experimental Algorithmics (eds. Italiano, G.F. et al.), 3(4).
16. Eddy, S.R. 2004. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol, 22(8):1035-1036.
17. Pop, M. 2009. Genome assembly reborn: recent computational challenges. Brief Bioinform, 10(4):354.


Figure Legends

Figure S1 Schemes and their associated algorithms

The figure depicts the most fundamental schemes adopted by assembly algorithms. The algorithms have been listed in order to clarify fundamental concepts; however, the same algorithm can be categorized into more than one approach. For instance, all Eulerian path approach algorithms could be categorized under graph based schemes. However, assisted assembly can be categorized under both comparative assembly and the overlap-layout-consensus approach since it uses concepts from both.

Figure S2 Adjacency table and adjacency matrix

A. An undirected cyclic graph. Equivalent representations are (B) an adjacency table and (C) an adjacency matrix. The advantage of using (B) over (C) is that it uses less memory. Also, finding the nodes adjacent to a given node in (B) only requires reading that node's list, as opposed to (C), where an entire row must be scanned. However, when checking whether two specific nodes share an edge, (C) requires a single lookup while (B) requires searching the lists of the two nodes, whose cost depends on p and q, the number of edges entering the two nodes, respectively.

Figure S3 Alignment-Layout-Consensus

A. Alignment: In this stage all reads (in yellow color) are aligned with a closely related reference sequence (shown in green). The alignment process may allow one or more mismatches between each individual read and the reference sequence depending on the user. The alignment of all the reads creates a Layout (B), beyond which the reference sequence is not used any more. C. Consensus: The layout helps in producing a consensus sequence (shown in green), where each base in the sequence is identified by simple majority amongst the bases at that position or via some probabilistic approach (shown in red and brown colors). The end result is the novel genome. D. The reference genome and the novel genome produced by the alignment-layout-consensus are placed side by side. The red regions mark the area in which both sequences differ. The novel genome may even be longer or shorter than the reference sequence.

Figure S4 Bidirected graphs

In a bidirected graph, if an edge connects two distinct nodes, it is referred to as a link. If an edge connects a node to itself, it is called a loop. However, if an edge points to an empty node, or a null node, or has only one end, then it is called a lobe.


Figure S5 Node-edge incidence matrix values

In the incidence matrix, each element can take on the values -2, -1, 0, 1 or 2, depending on whether the corresponding edge connects to the corresponding node with 2 heads, 1 head, no connection, 1 tail or 2 tails, respectively.

Figure S6 Depth First Search

This is an exhaustive search scheme which starts from the root (A) and picks one branch, say the left child (B). It then explores all the way until it reaches the leaf of that branch, (D), before back-tracking. Once one side of a branch has been explored, it searches the other branch until it reaches its leaf. The sequence in which this tree is searched is A, B, C, D, E, F, G, H, I, J, K and lastly L. The search continues until the goal is reached or the tree is exhausted, whichever comes first.

Figure S7 Allocating a read to a particular contig

This is done in two steps. First, the length of the region within the contig from which reads are generated is chosen randomly from a suitable distribution. Then the actual location around this region is chosen using a symmetric variant of a geometric distribution.

Figure S8 k-mer numbering

As one moves along any sequence, it can be divided into k-mers, which are encoded as 101, 102, 103, 104 and so on, the numbering changing only when a k-mer that has appeared before is encountered. This movement in one direction, or path, can then be identified as the closed interval [101, 104].

Figure S9 Minimal extension and subsumption

A. The upper sequence C completely subsumes the lower sequence D: they align perfectly and C overhangs D at both ends. B. E is the minimal extension of F, i.e., E is an extension of F that is not an extension of any other extension of F.

Figure S10 Overlap-Layout-Consensus

A. Overlap: Two reads overlap at the region shown in green. B. Layout: The overlap of a number of reads at the same time becomes a layout. C. Consensus: The layout helps in producing a consensus sequence. One contiguous consensus sequence is called a contig (shown in red). Many contigs connected together form the DNA sequence in question.

Figure S11 Bidirected de Bruijn graphs

In a bidirected de Bruijn graph, two nodes can interact with one another in four possible ways. The arrows above show the orientation of the two reads relative to one another. The de Bruijn tuple underneath gives the equivalent representation of each individual panel. Panels C and D are equivalent. Although there are four ways in which the two nodes can interact, since C and D are equivalent there are just three types of messages, rather than four, that need to be sent from one node to another to evaluate in which orientation they overlap.

Figure S12 Message passing

The figure shows the messages that are passed between two nodes based upon the orientation in which the reads overlap, together with the equivalent tuple representations of the de Bruijn and string graphs. A. Every node has two segments: one segment contains a k-mer and the other its reverse complement. The segment shown is the k-mer that is lexicographically larger than its reverse complement; its prefix and suffix are marked. B, C and D each show the equivalent representations of the de Bruijn graph, the string graph and the message that is shared to form an edge between the two nodes.

Figure S13 Interaction between nodes

A. The two nodes present the interaction shown. Each of the two segments within each node is defined as +ve or -ve based upon lexicographic order: the k-mer that is lexicographically larger than its reverse complement is defined +ve. The edge is defined by two characters; in the case above, they are A/A. The first base 'A' is the first character of one node's k-mer, while the second base 'A' is the character obtained while walking from the other node back towards the first. Panels B and C show the interactions between the individual segments of the nodes in A, which are equivalent. Within each segment, the underlined bases overlap, and the bases shown enlarged in B and C end up being the identifier of the edge shared between the two nodes.

Figure S14 Types of messages

The figure displays the different types of messages that are sent between two nodes. A and B: a Type 1 message is sent from one node to the other, in either direction. C: a Type 2 message is sent between the two nodes, with its direction determined by which node is lexicographically larger. D: if the two nodes are lexicographically equal, the edge becomes a loop carrying a Type 2 message. E: a Type 3 message is sent between the two nodes, again with its direction determined by which node is lexicographically larger. F: if the two nodes are lexicographically equal, the edge becomes a loop carrying a Type 3 message.

Figure S15 SSAKE/VCAKE: Prefix tree

It is a tree structure well suited to holding strings. Reads are composed of the bases {A, T, G, C}, so while searching for any particular read, each transition from parent to child divides the search space by 4. If there are m reads, each of size n, then searching for a particular read is an O(n) operation, since n steps are required to go from the root to a leaf. Every node shares a common prefix with its parents right up to the root, which is a NULL node. The spectrum is organized by the prefixes of the reads from their 5' end. VCAKE, on the other hand, uses the first 11 bases of the 5' end of the reads and of their reverse complements as the prefix.

Figure S16 Radix sort

The input is the unordered list of numbers. Radix sort proceeds from the least significant digit to the most significant digit to sort the list. Notice that in the unordered list 390 comes before 300, and in the second list 390 still comes before 300. That is because, while sorting, radix sort only looks at digits of the same significance level and at nothing else. Therefore, the relative order of numbers sharing the same digit at the significance level being examined remains unchanged after one complete pass; in other words, the sort is stable.

Figure S17 Distance matrix

The Hamming distance between two strings of equal length is the minimum number of substitutions required to change one string into the other. The string A requires one substitution to be converted into B. The matrix shows the distance required to convert one string into another. Notice that the matrix is symmetric; therefore, only the upper triangle or the lower triangle of the matrix needs to be filled in. Generally, for an n x n matrix, the number of distances that needs to be calculated is n(n - 1)/2. Herein, the number of distances that need to be evaluated is 6.

Figure S18 Making a suffix tree


The figure represents, step by step, how to build a suffix tree for the string "BWABBAS$". The suffix tree is made by assigning a number to every letter of the string: B (1), W (2), A (3), B (4), B (5), A (6), S (7), $ (8). The tree is then built by taking the suffixes one by one, moving towards the left of the string: it first takes $, then S$, then AS$, and so on.

Figure S19 Building a suffix array

In the first step every letter of the word is assigned an index based upon its position in the word. In this example BWABBAS$ is indexed as B(1), W(2), A(3), B(4), B(5), A(6), S(7), $(8). A. Next, the suffixes of the word are placed in an array and indexed based on the first letter of every suffix. B. Finally, all the suffixes are sorted in alphabetical order and the final indexing of the array is shown. C. A simple binary search then shows that it is very efficient to find a particular suffix. In this example S$ is searched using a binary search. The search starts from the middle of the array (1): index 5 contains BAS$, and since S$ comes after BAS$ alphabetically, the search looks ahead within the remaining half of the array. (2) Index 1 contains BWABBAS$, and since S$ comes after BWABBAS$ alphabetically, the search looks ahead within the remaining quarter of the array. (3) The search finds what it was looking for, since index 7 contains S$.