OLIGONUCLEOTIDE PROBE DESIGN FOR LARGE GENOMES
USING MULTIPLE SPACED SEEDS
BY
HAMID MOHAMADI
A Thesis
Submitted to the Department of Computing and Software
and the School of Graduate Studies
of McMaster University
in Partial Fulfilment of the Requirements
for the Degree of
Master of Science
© Copyright by Hamid Mohamadi, March 2012
All Rights Reserved
Master of Science (2012) McMaster University
(Computer Science) Hamilton, Ontario, Canada
TITLE: Oligonucleotide Probe Design for Large Genomes using
Multiple Spaced Seeds
AUTHOR: Hamid Mohamadi
SUPERVISORS: Dr. William F. Smyth
Dr. G. Brian Golding
Dr. Lucian Ilie
NUMBER OF PAGES: xii, 95
To my parents
Abstract
An oligonucleotide is a short fragment of DNA or RNA that is designed to hybridize
with a unique region of a target sequence. Oligonucleotides have a wide range of
applications in molecular biology and medicine. In medicine they are used as probes
to screen for diseases and viral infections; in molecular biology they are used in
DNA microarray design, polymerase chain reaction (PCR) amplification, and gene
identification.
The major computational challenge in designing oligonucleotide probes is finding
the optimal probe for each target sequence. Each probe must be specific to its target
sequence and sensitive enough to detect it, and the set of oligonucleotides must be
uniform under the same experimental conditions. Many algorithms and software
programs have been created for this problem; however, none solves it satisfactorily.
We introduce a new method for oligonucleotide design that employs sensitive multiple
spaced seeds, and show that our algorithm computes unique and more efficient
oligonucleotides and executes orders of magnitude faster than the other algorithms
that have been proposed for the same task.
Acknowledgements
I would like to thank my supervisors, Dr. Smyth, Dr. Golding, and Dr. Ilie for their
guidance and support throughout my thesis.
I am also very grateful to Shima Khoshraftar and Anahita Mansouri for their kind
help.
Contents
Abstract
Acknowledgements
1 Introduction
2 Preliminaries
2.1 Molecular biology primer
2.1.1 Organisms and cells
2.1.2 DNA, RNA, and protein
2.1.3 Genome, chromosome, and gene
2.1.4 Oligonucleotides
2.1.5 Thermodynamics of DNA
2.2 Related computer science notation
2.2.1 Sequence alignments
2.2.2 Seeds for homology search
2.2.3 Suffix tree and suffix array
3 Related work
3.1 ArrayOligoSelector
3.2 GoArray
3.3 OligoArray
3.4 OligoPicker
3.5 OligoWiz
3.6 PICKY
3.7 ProbeSel
3.8 ProbeSelect
3.9 ProDesign
3.10 ProMide
3.11 ROSO
3.12 YODA
4 Our proposed algorithm
4.1 Motivation
4.2 General description of the problem
4.3 The outline of our algorithm
4.4 Encoding the input sequences
4.5 Cross-hybridization assessment
4.5.1 Multiple spaced seeds for homology search
4.5.2 Overlap complexity
4.5.3 Fast homology search
4.5.4 Intensive homology search
4.6 GC-content evaluation
4.7 Melting temperature management
4.8 Secondary structure assessment
5 Experimental results and evaluation
5.1 Experimental results
5.2 Results from other algorithms
5.3 Evaluation and comparison
6 Summary and conclusion
List of Figures
2.1 A eukaryotic cell. (Farabee, 2007)
2.2 Structure of DNA. (from National Human Genome Research Institute)
2.3 Structure of RNA. (from National Human Genome Research Institute)
2.4 Structure of Protein (Villarreal, 2008)
2.5 Splicing. (Alberts et al., 2003)
2.6 The Standard Genetic Code (Godfrey-Smith and Sterelny, 2008)
2.7 Central dogma of biology (Horspool, 2008)
2.8 Two oligos hybridized with their target DNA sequences
2.9 Melting temperature of DNA
2.10 Dot plot of two sequences.
2.11 Needleman-Wunsch alignment of two sequences.
2.12 Smith-Waterman alignment of two sequences.
2.13 Example of a hit by a spaced seed.
2.14 Suffix tree for x = TCGTAACGACC.
2.15 Suffix array for x = TCGTAACGACC.
4.1 General description of the problem.
4.2 Possibility of overlapping hits for a. consecutive seed. b. spaced seed.
4.3 Example of computing overlap complexity for two spaced seeds.
4.4 The set of multiple spaced seeds used in the fast homology search phase.
4.5 Fast homology search.
4.6 The set of multiple spaced seeds used in intensive homology search phase.
4.7 GC-content evaluation process.
4.8 Examples of forming secondary structures a. hairpin b. dimer.
4.9 The OligoDesign algorithm.
5.1 Seeds used for fast homology search (s8w10) and for intensive homology search (s8w9).
5.2 Seeds used for fast homology search (s8w10) and for intensive homology search (s8w8).
5.3 Seeds used for fast homology search (s8w10) and for intensive homology search (s16w9).
5.4 Seeds used for double fast homology search (s8w10) and (s8w8).
5.5 Seeds used for fast homology search (s8w10).
5.6 Seeds used for fast homology search (s8w11) and for intensive homology search (s8w9).
5.7 Seeds used for fast homology search (s8w11) and for intensive homology search (s16w9).
5.8 Eight spaced seeds of weight six used in the evaluation program.
List of Tables
4.1 Summary of the ways that related algorithms assess the parameters involved in oligo design.
4.2 Encoding of input DNA sequences
4.3 Nearest-Neighbor parameters for DNA/DNA duplexes (SantaLucia, 1998).
5.1 Input data sets used for experimental results.
5.2 Results by employing s8w10-s8w9.
5.3 Results by employing s8w10-s8w8.
5.4 Results by employing s8w10-s16w9.
5.5 Results by employing s8w10-s8w8.
5.6 Results by employing s8w10.
5.7 Results by employing s8w11-s8w9.
5.8 Results by employing s8w11-s16w9.
5.9 Description of the oligo design software programs used in comparison.
5.10 Results for ArrayOligoSelector.
5.11 Results for OligoArray.
5.12 Results for OligoPicker.
5.13 Results for OligoWiz.
5.14 Results for PICKY.
5.15 Results for YODA.
5.16 Evaluation for mousenervous.
5.17 Evaluation for ecoli.
5.18 Evaluation for bee.
5.19 Evaluation for yeast.
5.20 Evaluation for plasmodium.
5.21 Evaluation for zebrafish.
5.22 Evaluation for drosophila.
5.23 Evaluation for chicken.
5.24 Evaluation for celegans.
5.25 Evaluation for arabidopsis.
5.26 Evaluation for maize.
5.27 Evaluation for mouse.
5.28 Evaluation for human.
5.29 Evaluation for mouserna.
5.30 Evaluation for rice.
5.31 Total comparison of the algorithms for all data sets.
Chapter 1
Introduction
An oligonucleotide is a short fragment of a nucleic acid polymer, DNA or RNA, that
is designed to hybridize with a unique complementary region in a target sequence.
(The nucleotides A and C are complementary to T and G, respectively.) Here,
hybridization is the process in which two complementary DNA sequences bind into a
single double-stranded DNA molecule; binding occurs through hydrogen bonds between
base pairs. The oligonucleotide can therefore serve as a probe that uniquely
identifies the target sequence: a biologist detects the probe’s complementary
fragment in a larger DNA sequence by hybridizing the probe to the DNA of interest.
Oligonucleotides are generally classified by length as either short, between 20 and
30 nucleotides, or long, between 50 and 70 nucleotides. They have a wide range of
applications in medicine and molecular biology. In medicine they are used as probes
to screen for diseases and viral infections; in molecular biology they are used in
DNA microarray design, polymerase chain reaction (PCR) amplification, and gene
identification.
M.Sc. Thesis - Hamid Mohamadi McMaster - Computer Science
The major computational challenge in designing oligonucleotide probes is finding the
optimum probe for each target sequence in a set of DNA/RNA sequences. An optimum
oligonucleotide must discriminate well between its target sequence and all
non-target sequences: it must be unique to its target, so that it hybridizes only
with its complementary region in the target sequence and does not cross-hybridize
with non-target sequences. More precisely, each probe must be specific to its target
sequence and sensitive enough to detect it, and the set of oligonucleotides must be
uniform under the same experimental conditions.
To design the optimum set of oligonucleotides, experts in this area (Lockhart et al.,
1996; Kane et al., 2000) have proposed several criteria and parameters, such as:
• Similarity of an oligonucleotide to non-target sequences
• Maximum consecutive match of an oligonucleotide to non-target sequences
• Melting temperature
• GC-content
• Secondary structure
• Low complexity regions
Of these, the first two are the most important criteria for designing the optimum
set of oligonucleotides.
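The second criterion can be made concrete with a small sketch. It assumes, for simplicity, that a "consecutive match" means the longest stretch of identical characters shared by the oligonucleotide and a non-target sequence; the function name `max_consecutive_match` and the example strings are hypothetical and are not taken from any of the tools discussed in this thesis.

```python
def max_consecutive_match(oligo: str, other: str) -> int:
    """Length of the longest substring shared by `oligo` and `other`."""
    best = 0
    # prev[j] holds the length of the common run ending at
    # oligo[i-2] and other[j-1] (the previous row of the DP table).
    prev = [0] * (len(other) + 1)
    for i in range(1, len(oligo) + 1):
        curr = [0] * (len(other) + 1)
        for j in range(1, len(other) + 1):
            if oligo[i - 1] == other[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the diagonal run
                best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Toy sequences, chosen only for illustration.
    print(max_consecutive_match("GATTACA", "CCTTACG"))
```

This quadratic-time table is only meant to illustrate the quantity being measured; it would be far too slow for whole genomes, which is one motivation for the index-based and seed-based methods discussed later.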
There are various algorithms and programs for designing oligonucleotide probes,
such as ArrayOligoSelector (Bozdech et al., 2003), GoArrays (Rimour et al., 2005),
OligoArray (Rouillard et al., 2003), OligoPicker (Wang and Seed, 2003), OligoWiz
(Nielsen et al., 2003), PICKY (Chou et al., 2004), ProbeSel (Kaderali and Schliep,
2002), ProbeSelect (Li and Stormo, 2001), ProDesign (Feng and Tillier, 2007),
ProMide (Rahmann, 2003), ROSO (Reymond et al., 2004), and YODA (Nordberg,
2004). These programs differ in the parameters and criteria they consider and in how
they identify and select the optimal oligonucleotides. For low-complexity
management, most of the algorithms apply a filter or mask to nucleotide repeats, while
others use prohibited regions defined by the user, lossless compression calculations,
the properties of the suffix array structure, or a custom complexity score.
To achieve maximum uniformity for the set of oligonucleotides, the melting
temperatures of the oligonucleotides must differ only slightly. Several approaches are
available for computing the melting temperature. The most commonly used approach
is the Nearest Neighbour model with the parameters from either SantaLucia
(1998) or Rychlik et al. (1990). GC-content, which is closely related to the melting
temperature, is evaluated by letting the user define a fixed range or threshold and
filtering out oligonucleotides that fall outside it. Secondary structure can be
assessed in two ways: by self-complementarity checking, in which the oligonucleotide
is aligned with its reverse-complement sequence, or by thermodynamic calculations
that determine the stability of potential secondary structures. The core procedure of
these algorithms for cross-hybridization assessment is based on sequence similarity
search tools such as suffix trees, suffix arrays, and seeds. Suffix trees and suffix
arrays are well-known data structures for exact pattern matching and text searching
(Manber and Myers, 1990; Smyth, 2003; Puglisi et al., 2007), but they do not work well
for approximate text searching, so heuristic approaches such as BLAST, which is based
on seeds, have been proposed for this task.
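The self-complementarity check mentioned above can be sketched as follows. This is a deliberately simplified stand-in for aligning an oligonucleotide with its reverse complement: it only asks whether some k-mer of the oligonucleotide occurs together with its reverse complement, ignoring loop length and mismatches. The function names, the threshold `k`, and the example sequences are assumptions made for illustration.

```python
# Translation table for Watson-Crick complements of DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_self_complementary_stem(oligo: str, k: int) -> bool:
    """True if some k-mer of `oligo` and its reverse complement both occur.

    Simplification: palindromic k-mers (equal to their own reverse
    complement) also count, and the two copies may overlap.
    """
    kmers = {oligo[i:i + k] for i in range(len(oligo) - k + 1)}
    return any(reverse_complement(kmer) in kmers for kmer in kmers)


if __name__ == "__main__":
    hairpin_like = "CCCGGAAAATCCGGG"  # hypothetical oligo with a 5' / 3' stem
    print(has_self_complementary_stem(hairpin_like, 5))
    print(has_self_complementary_stem("AAAAAAAAAA", 3))
```

A thermodynamic assessment would instead score the stability of the candidate structure; the k-mer test above only flags sequences that could fold back on themselves.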
Seeds were first defined and used in BLAST (Altschul et al., 1990), the Basic Local
Alignment Search Tool, which is the most widely used algorithm in bioinformatics
for homology search. Instead of running the quadratic-time dynamic programming
algorithm of Smith and Waterman (1981), which is infeasible for long sequences,
BLAST searches for a seed of 11 contiguous matches between the sequences as an
indicator of potential local similarity. PatternHunter (Ma et al., 2002), in contrast,
uses a spaced seed; that is, the 11 matches are not consecutive but separated
by don’t care positions. The sensitivity of this spaced seed is significantly higher
than that of BLAST’s contiguous seed. Some software programs use a combination
of several spaced seeds, which further increases the chance of finding similarities. In
fact, multiple spaced seeds quickly became the state of the art in similarity search in
biological applications due to their efficiency and flexibility.
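To illustrate the difference between a contiguous and a spaced seed, the following sketch reports all position pairs at which two sequences agree on every match position of a seed. The seed is written as a string of 1s (match required) and 0s (don't care). The short seeds and sequences used here are toy values chosen for readability; they are not the seeds used by BLAST, PatternHunter, or our algorithm, and the function names are hypothetical.

```python
from collections import defaultdict


def seed_key(seq: str, pos: int, seed: str) -> str:
    """Characters of `seq` under the '1' positions of `seed`, starting at `pos`."""
    return "".join(seq[pos + k] for k, c in enumerate(seed) if c == "1")


def seed_hits(seed: str, s: str, t: str) -> list[tuple[int, int]]:
    """All (i, j) such that s[i:] and t[j:] agree on every '1' of the seed."""
    span = len(seed)
    # Index every window of s by its spaced key.
    index = defaultdict(list)
    for i in range(len(s) - span + 1):
        index[seed_key(s, i, seed)].append(i)
    # Look up every window of t in the index.
    hits = []
    for j in range(len(t) - span + 1):
        for i in index.get(seed_key(t, j, seed), []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    s, t = "ACGT", "ACTT"           # differ only at the third position
    print(seed_hits("1111", s, t))  # contiguous seed of weight 4
    print(seed_hits("1101", s, t))  # spaced seed that ignores the third position
```

In this toy case the spaced seed tolerates the single mismatch because it falls on a don't care position, which gives an intuition for why spaced seeds can be more sensitive than contiguous seeds of comparable weight.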
In this thesis, we present a new algorithm to design the optimum set of
oligonucleotide probes for a given set of target sequences. The proposed algorithm
employs multiple spaced seeds as the heart of the homology search procedure for
cross-hybridization management. To reach maximum uniformity for the set of obtained
oligonucleotides, our algorithm first calculates the melting temperature of each
candidate by applying the Nearest Neighbour model with the parameters from
SantaLucia (1998), then computes the average melting temperature of all
oligonucleotide candidates, and finally keeps only the candidates within a predefined
fixed range of the average melting temperature. GC-content evaluation is carried out
by setting up a fixed range and filtering out oligonucleotide candidates that fall
outside it.
The secondary structure assessment and low-complexity region assessment are
implicitly included in the cross-hybridization management step. Finally, we show that
our algorithm discovers more unique and useful oligonucleotides and executes
orders of magnitude faster than the other algorithms that have been proposed for the
same task.
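The uniformity step described above (average the melting temperatures, then keep the candidates close to the average) can be sketched as follows. The function name `filter_by_tm`, the tolerance parameter, and the temperature values are hypothetical; the melting temperatures are assumed to have been computed beforehand.

```python
from statistics import mean


def filter_by_tm(tms: dict[str, float], max_deviation: float) -> dict[str, float]:
    """Keep candidates whose Tm lies within `max_deviation` of the mean Tm."""
    average = mean(tms.values())
    return {
        name: tm
        for name, tm in tms.items()
        if abs(tm - average) <= max_deviation
    }


if __name__ == "__main__":
    # Hypothetical candidate names and melting temperatures in degrees Celsius.
    candidates = {"a": 70.0, "b": 72.0, "c": 74.0, "d": 90.0}
    print(filter_by_tm(candidates, max_deviation=5.0))
```

Note that in this toy input a single outlier shifts the mean enough to exclude an otherwise typical candidate, which is a known property of filtering around an average.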
The thesis is organized in six chapters. Chapter 2 introduces the preliminaries and
basic notation from molecular biology, such as genome, gene, DNA, and
oligonucleotides, as well as the related concepts from computer science, such as
alignment and homology search using seeds. Related work and current algorithms for
designing oligonucleotides are briefly described in Chapter 3. In Chapter 4 we explain
our proposed algorithm in detail. Chapter 5 presents the experimental results and an
evaluation and comparison of our algorithm with the most well-known algorithms
available for oligonucleotide design. The evaluation and comparison are carried out
for all algorithms using a separate program that we wrote. The thesis
concludes in Chapter 6 with a few remarks about the importance of our contribution
and further research that can be done.
Chapter 2
Preliminaries
In this chapter, we present the concepts and definitions that will be used
in this thesis. The chapter has two sections: the first introduces the molecular
biology terminology and concepts, and the second explains concepts and terms
from computer science.
2.1 Molecular biology primer
Understanding this thesis requires some basic concepts from biology, so we provide
a short introduction to the biological background.
2.1.1 Organisms and cells
All organisms are composed of small cells. A cell is the fundamental working unit of
every living system; it is capable of independent functioning and includes several
building blocks that are surrounded by a cell membrane. Organisms can be classified
as unicellular (consisting of only one cell), including most bacteria, or multicellular,
including most but not all fungi, plants, and animals.
There are two types of cells: prokaryotic and eukaryotic. While prokaryotic cells
are usually independent, eukaryotic cells are often found in multicellular organisms.
Most familiar organisms, such as flowers, trees, worms, flies, mice, and humans, are
eukaryotes. Prokaryotes are simpler and smaller than eukaryotes. They also lack a
nucleus and most of the other organelles of eukaryotes. The main difference between
eukaryotes and prokaryotes is that eukaryotic cells include membrane-bound
compartments in which specific metabolic activities occur. Cells make decisions
through complex networks of chemical reactions called pathways.
Figure 2.1: A eukaryotic cell. (Farabee, 2007)
2.1.2 DNA, RNA, and protein
The chemical components of a cell are water, which constitutes 70% of the cell’s
weight; small molecules (salts, lipids, amino acids, and nucleotides), which
constitute 7%; and macromolecules (proteins, DNA, and RNA), which constitute 23%
(Alberts et al., 2003).
Figure 2.2: Structure of DNA. (from National Human Genome Research Institute)
DNA, deoxyribonucleic acid, is a nucleic acid that is the major information carrier
molecule in a cell. It encodes the genetic material that determines what an organism
will develop into and what its functions are. A DNA molecule, which is
called a polynucleotide, is a chain of small molecules called nucleotides. There are
four different nucleotides grouped into two types: the purines, adenine and guanine,
and the pyrimidines, cytosine and thymine. They are usually referred to as bases and
denoted by their initial letters, A, C, G, and T. A and T are complementary, as are C
and G. As shown in Fig. 2.2, the two DNA strands are held together in the shape of
a double helix by hydrogen bonds between complementary bases.
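The base-pairing rule (A with T, C with G) translates directly into a small sketch for computing the partner strand of a DNA sequence. Because the two strands of a double helix run in opposite directions, the partner strand read in its own direction is the reverse complement. The function names below are illustrative assumptions.

```python
# Watson-Crick pairing for DNA bases: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Base-by-base complement, read in the same left-to-right order."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Partner strand read in its own direction (antiparallel)."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))
    print(reverse_complement("ATGC"))
```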
Figure 2.3: Structure of RNA. (from National Human Genome Research Institute)
RNA, ribonucleic acid, is chemically similar to DNA and is also constructed
from nucleotides. However, it is usually single-stranded, and instead of thymine (T)
it contains uracil (U), which is generally not found in DNA (Fig. 2.3). RNA usually
does not form a long, stable double helix as DNA does, but it can form
secondary structures by pairing up with itself. Several types of RNA exist, with
various functions in a cell: mRNA, or messenger RNA, carries a gene’s
message out of the nucleus; tRNA, or transfer RNA, translates the genetic information
in mRNA into an amino acid sequence; rRNA, or ribosomal RNA, is a part of the
ribosome, which is involved in translation.
Proteins, which make up almost 20% of a eukaryotic cell, are the second major
component of the cell after water and serve as its building blocks and functional
molecules. They can be classified into the following groups:
• structural proteins, which can be considered the basic building blocks of an
organism, as in bones and connective tissues;
• enzymes, which carry out biochemical reactions such as altering, joining
together, or chopping up other molecules; these reactions and the pathways they
form are called metabolism;
• transmembrane proteins, which are key to the maintenance of the cellular
environment: regulating cell volume, extracting and concentrating small molecules
from the extracellular environment, and generating the ionic gradients essential
for muscle and nerve cell function.
There are also four levels of protein structure (Fig. 2.4):
1. primary structure, in which a protein is a chain of amino acids of 20 different
types that can be joined together in any linear order; such chains are sometimes
called polypeptide chains and can be represented as strings over 20 different symbols;
2. secondary structure, which arises as the sequence of amino acids affects the
folding of the chain and usually takes the form of three substructures: two
common substructures, alpha-helices and beta-strands,
which are typically joined by a third, less regular structure called loops;
3. tertiary structure, in which a relatively stable three-dimensional structure
is formed because parts of the protein chain come into contact
with each other due to various repulsive or attractive forces, such as hydrogen
bonds, attractions between positive and negative charges, disulfide bridges, and
hydrophobic and hydrophilic forces;
4. quaternary structure, which is formed when more than one chain of amino acids
forms the protein.
Figure 2.4: Structure of Protein (Villarreal, 2008)
2.1.3 Genome, chromosome, and gene
There may be many long DNA molecules in a cell, organized as chromosomes.
DNA in eukaryotic chromosomes winds around complex structures
called histones. Mitochondria, which are membrane-enclosed organelles found in most
eukaryotic cells, also contain DNA, but the amount is very small in comparison to
chromosomal DNA. Mitochondrial and chromosomal DNA together make up the genome of
the organism. The genome, which every organism has, encodes all the hereditary
information of the organism. Chromosomes of eukaryotes are located in the nucleus,
where the nuclear membrane separates them from the mitochondrial genomes.
Because cells arise through DNA replication at each cell division, all cells in an
organism have identical genomes. A gene is often a continuous stretch of DNA that
encodes instructions on how to create proteins. A complex molecular process reads
this information, which is encoded as a string of A, C, G, and T, from genes and forms
a specific protein or a few different proteins. This process, which is called
protein synthesis, has three main phases: transcription, splicing, and translation.
In the transcription phase, one strand of a double-stranded DNA molecule unwinds
in the nucleus and its information is copied into a molecule of mRNA. Then, the
mRNA exits from the cell nucleus.
In the splicing phase, some segments of the mRNA, called introns, are removed
and the remaining segments, called exons, are joined together. Introns must be
removed because of the way eukaryotic genomes are organized: the DNA segment
that corresponds to the coding region of a gene is not continuous, but consists of
exons and introns. Exons are the segments of the gene that are retained in the mature
mRNA, whether or not they code for protein; they are interspersed with spliceosomal
introns, which are generally removed by splicing.
Prokaryotic genes do not include introns, so there is no splicing phase for them. The
result of splicing is the final, edited mRNA (Fig. 2.5).
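As a rough sketch of the first two phases, transcription can be modelled as copying the coding strand with T replaced by U, and splicing as concatenating the exon intervals of the transcript. The function names, the toy sequence, and the exon coordinates below are assumptions made for illustration only; real splicing depends on splice-site signals that are not modelled here.

```python
def transcribe(coding_strand: str) -> str:
    """Pre-mRNA with the same sequence as the coding strand, with U for T."""
    return coding_strand.replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon intervals (start inclusive, end exclusive), dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


if __name__ == "__main__":
    gene = "ATGGTAAGTCCCTAG"   # hypothetical gene: exon, intron, exon
    exons = [(0, 3), (9, 15)]  # hypothetical exon coordinates
    pre_mrna = transcribe(gene)
    print(pre_mrna)
    print(splice(pre_mrna, exons))
```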
Figure 2.5: Splicing. (Alberts et al., 2003)
In translation, proteins are made by joining amino acids together in the order
encoded in the mRNA. The mRNA sequence is read as a sequence
of triplets, called codons, that map to amino acids according to the standard genetic
code. In this process, ribosomes, the components of cells in the cytoplasm that
synthesize protein chains, synthesize proteins using the mature mRNA transcript
produced by the preceding stages. There are 64 codons and only 20 amino acids, so some
codons are redundant; for instance, lysine is encoded by both AAA and AAG. Figure 2.6
shows the standard genetic code. The three-letter abbreviations, such as “Ser” and
“Asn”, denote amino acids. Each amino acid is carried to the ribosome by a
tRNA molecule that specifically recognizes one or more codons on the mRNA. The
amino acids are then added to the nascent protein. Translation is the
final step of gene expression, and its result is a protein that corresponds to the
chain encoded by the mRNA.
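The codon-to-amino-acid mapping can be sketched as a lookup table. The sketch below builds the standard genetic code from a compact 64-letter string of one-letter amino acid codes (with `*` for stop codons), ordered with the bases U, C, A, G; it assumes the reading frame starts at the first base and stops at the first stop codon. The function name and the example mRNA are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated as UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Both lysine codons mentioned in the text map to K.
    print(CODON_TABLE["AAA"], CODON_TABLE["AAG"])
    print(translate("AUGAAAAAGUAA"))  # hypothetical mRNA: start, Lys, Lys, stop
```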
Figure 2.6: The Standard Genetic Code (Godfrey-Smith and Sterelny, 2008)
Figure 2.7 summarizes what we have covered so far in this section and gives an
overview of the central dogma of molecular biology, with the usual flows of
information shown as solid arrows and the unusual flows as dashed arrows.
Figure 2.7: Central dogma of biology (Horspool, 2008)
2.1.4 Oligonucleotides
Oligonucleotides, often abbreviated as oligos, are short pieces of single-stranded DNA
or RNA molecules that are designed to bind with unique positions in target sequences.
The way that an oligo binds to its target strand allows scientists to employ oligos
as research tools. An oligo can be used to bind or find its matching target sequence
even in a complex pool of millions of unrelated pieces of DNA or RNA. Using this
interesting and unique fact, researchers are able to decode and study the genetic
makeup of any living organism ranging from bacteria to humans. Figure 2.8 shows
two oligos hybridized to their target DNA sequences.
Oligo1:                 GATACCGAGGTGATGAAATGCATCGTTGAGGTCATCTCCGACACACTTTC
                        ||||||||||||||||||||||||||||||||||||||||||||||||||
Target1: GGACGATCTTTACGGCTATGGCTCCACTACTTTACGTAGCAACTCCAGTAGAGGCTGTGTGAAAGTCAGGTCTCT
Oligo2:            GAGAGCGGATCGGGGAGCATTTGCGGATCGGTCACTTTTTCCTC
                   ||||||  ||||||||||||||||||||||||||||  | ||||
Target2: TAGTGGTGGCCTCTCGTTTAGCCCCTCGTAAACGCCTAGCCAGTGACGACGGAGAACATTGCACGT
Figure 2.8: Two oligos hybridized with their target DNA sequences
Oligos are usually classified by their length, which can be either short, from 20 to
30 nucleotide bases, or long, from 50 to 70 nucleotide bases, and have a wide range of
applications in medicine and molecular biology. They can be used as probes to screen
for diseases and viral infections in medicine, as well as in DNA microarray design,
polymerase chain reaction (PCR) amplification, gene identification, northern blots,
and southern blots in molecular biology. Moreover, oligos can be used in diagnostic
tests for genetic diseases, like breast cancer or cystic fibrosis, or for infectious
diseases, like hepatitis or AIDS. They can also be utilized in research to discover
new drugs or treatments for a variety of diseases, or to produce safer and more
plentiful agricultural products.
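The pairing shown in Figure 2.8 can be reproduced with a short sketch that slides an oligo along a target and marks each position where the two bases are complementary. It follows the layout of the figure, in which the strands are compared position by position, and uses the sequences printed there; the function names `match_line` and `best_offset` are hypothetical.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

OLIGO1 = "GATACCGAGGTGATGAAATGCATCGTTGAGGTCATCTCCGACACACTTTC"
TARGET1 = ("GGACGATCTTTACGGCTATGGCTCCACTACTTTACGTAGCAACTCCAGTAG"
           "AGGCTGTGTGAAAGTCAGGTCTCT")
OLIGO2 = "GAGAGCGGATCGGGGAGCATTTGCGGATCGGTCACTTTTTCCTC"
TARGET2 = "TAGTGGTGGCCTCTCGTTTAGCCCCTCGTAAACGCCTAGCCAGTGACGACGGAGAACATTGCACGT"


def match_line(oligo: str, target: str, offset: int) -> str:
    """'|' where the oligo base pairs with the target base, ' ' otherwise."""
    window = target[offset:offset + len(oligo)]
    return "".join("|" if PAIR[o] == t else " " for o, t in zip(oligo, window))


def best_offset(oligo: str, target: str) -> int:
    """Offset in the target with the largest number of paired bases."""
    offsets = range(len(target) - len(oligo) + 1)
    return max(offsets, key=lambda off: match_line(oligo, target, off).count("|"))


if __name__ == "__main__":
    for oligo, target in ((OLIGO1, TARGET1), (OLIGO2, TARGET2)):
        off = best_offset(oligo, target)
        print(" " * off + oligo)
        print(" " * off + match_line(oligo, target, off))
        print(target)
```

The first oligo pairs perfectly with its target, while the second pairs with a few mismatches, which is the kind of imperfect hybridization that cross-hybridization assessment has to take into account.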
The major computational challenge in designing oligos is finding the optimum
oligo for each target sequence in a set of DNA/RNA sequences. An optimum oligo
must discriminate well between its target sequence and all non-target sequences:
it must be unique to its target, so that it hybridizes only with its complementary
region in the target sequence and does not cross-hybridize with non-target sequences.
More precisely, each oligo probe must be specific to its target sequence, which is
referred to as the specificity of oligos; oligos must be sensitive in order to detect
the target sequences, which is known as the sensitivity of oligos; and the set of
oligonucleotides must be uniform under the same experimental conditions, for example
in melting temperature, which is called the uniformity of the oligo set.
2.1.5 Thermodynamics of DNA
One of the important parameters for designing oligos is melting temperature, Tm. The
melting temperature of an oligo duplex is the temperature at which the oligo is 50%
annealed to its complement. This means that 50% of the molecules are single-stranded
while 50% of the molecules are in the double-stranded form (Figure 2.9). Inaccurate
prediction of Tm will increase the probability of failed assay design. The
melting temperature of an oligo depends on three major factors (Owczarzy et al.,
2008): oligo concentration (high DNA concentrations favor duplex formation); salt
concentration (higher ionic concentrations of the solvent lead to a higher Tm);
and oligo sequence (generally, sequences with a higher fraction of GC base pairs
have a higher Tm than AT-rich sequences).
Figure 2.9: Melting temperature of DNA
Several approaches are available for computing the melting temperature. The
most commonly used approach is applying the Nearest Neighbour model with either
the parameters from SantaLucia (1998) or the parameters from Rychlik et al. (1990).
If the concentration of the oligo is much higher than the concentration of the DNA
target, the following thermodynamic relationship can be used to predict Tm:
$$T_m = \frac{\Delta H}{\Delta S + R \ln(C/4)} - 273.15 \tag{2.1}$$
where ∆H (kcal/mol), the enthalpy change, is the total energy exchange between the
system and its surrounding environment, ∆S (cal/(mol·K)), the entropy change,
denotes the energy spent by the system to organize itself, R is the ideal gas
constant, 1.987 cal/(mol·K), and C is the molar concentration of the oligo
(Owczarzy et al., 2008).
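As an illustration of equation 2.1, the following minimal sketch evaluates Tm for one duplex. The function name and the numerical values of ∆H, ∆S and C are hypothetical. The sketch also assumes that ∆H must be converted from kcal/mol to cal/mol so that its units agree with those of ∆S and R.

```python
import math

R_GAS = 1.987  # ideal gas constant in cal/(mol*K), as given in the text


def melting_temperature(delta_h_kcal, delta_s_cal, oligo_molar):
    """Equation 2.1, returning Tm in degrees Celsius."""
    # Assumption: dH is converted from kcal/mol to cal/mol so that its
    # units agree with dS and R.
    delta_h_cal = 1000.0 * delta_h_kcal
    denominator = delta_s_cal + R_GAS * math.log(oligo_molar / 4.0)
    return delta_h_cal / denominator - 273.15


# Hypothetical duplex: dH = -160 kcal/mol, dS = -440 cal/(mol*K), C = 1 uM.
tm = melting_temperature(-160.0, -440.0, 1e-6)
print(round(tm, 1))
```

With these inputs, raising the oligo concentration raises the computed Tm, in line with the observation that high DNA concentrations favor duplex formation.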
Another major factor in designing oligos is hybridization free energy, ∆G. Hy-
bridization of a single-stranded (SS) oligo and another single-stranded sequence is a
chemical reaction where two single-stranded sequences come together to form a du-
plex. The transition from one state to another state contributes to a change in energy
of the system and can be summarized in equation 2.2:
[Oligo(SS)] + [Sequence (SS)]⇐⇒ [Duplex] (2.2)
where [Oligo(SS)] and [Sequence (SS)] denote the concentration of the oligo and the
sequence respectively in the system, and [Duplex] denotes the concentration of the
Duplex. ∆G is the change in Gibbs Free Energy (k.cal/mol) and is the net exchange
of energy between the system and its environment. It determines the stability of
the binding between an oligo and a sequence. Oligos with low binding free energy
with the target sequence and high binding free energy with non-target sequences are
required because they tend to bind more stably to their targets than to
non-targets, which reduces cross-hybridization. For the task of oligo
selection, a threshold may be set for the binding free energy so that oligos with
free energy lower than the threshold are picked. ∆G can be obtained from equation 2.3:
∆G = ∆H − T ×∆S (2.3)
where ∆H and ∆S are the changes in enthalpy and entropy, respectively, associated
with duplex formation, and T (Kelvin) represents the absolute temperature of the
system.
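Equation 2.3 and the threshold rule described above can be sketched as follows. The function names, the example values and the threshold of -20 kcal/mol are hypothetical. ∆S is assumed to be given in cal/(mol·K) and is converted to kcal/(mol·K).

```python
def gibbs_free_energy(delta_h_kcal, delta_s_cal, temp_celsius):
    """Equation 2.3, returning dG in kcal/mol."""
    temp_kelvin = temp_celsius + 273.15
    # Assumption: dS is given in cal/(mol*K), so divide by 1000 to get kcal.
    return delta_h_kcal - temp_kelvin * delta_s_cal / 1000.0


def passes_threshold(delta_g, threshold):
    """An oligo is picked when its binding free energy is below the threshold."""
    return delta_g < threshold


# Hypothetical duplex at 37 degrees Celsius.
dg = gibbs_free_energy(-160.0, -440.0, 37.0)
print(round(dg, 3), passes_threshold(dg, -20.0))
```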
Hybridization free energy, ∆G, and melting temperature, Tm, values are derived
differently and have no correlative relationship even though they share basic compo-
nents enthalpy, ∆H, and entropy, ∆S. The only way to relate a given ∆G to a given
Tm value is to explicitly know the value of ∆H and ∆S from which they are derived.
2.2 Related computer science notation
In this section, we provide some preliminary and basic definitions and notation from
computer science used in this thesis.
2.2.1 Sequence alignments
The comparison of strings can be done in many different ways. An alignment is a
way of arranging strings to determine regions of close similarity between them.
One of the essential problems in biology is determining whether two or more
genomic or protein sequences are related and, consequently, whether sequences show
similarity by chance or because of common ancestry. In biology, the term similarity
applies to sequences that are in some sense alike and has no evolutionary
connotations, while the term homology describes sequences that are evolutionarily
related and stem from a common ancestor (Golding et al., 2011). So, generally,
alignment in biology is an arrangement of two evolutionarily related sequences to
identify their regions of similarity, which may indicate functional and structural
relationships.
The most straightforward method for performing comparison between sequences
is a dot plot. Using a dot plot, it is easy to identify regions of similarity, such
as repeats and rearrangements, in sequences. A dot plot can be created by arranging
one of the sequences along the vertical axis of a matrix and the other one along the
horizontal axis. Then, a dot is placed in all of the entries of the matrix where the
letter along both axes is the same (Figure 2.10).
A long sequence of dots on a diagonal represents regions of similarity between the
two sequences. To reduce the noise of a dot plot matrix, it is useful to filter it by
defining window sizes, w, and stringencies, s. By applying these parameters to a dot
plot matrix, a dot will be placed when there are at least s matches in the w closest
entries on the same diagonal. While dot plots provide a useful way to visualize the
sequences being compared, they are not so useful in performing an actual alignment
between two sequences. To do this, other methods are required.
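The filtered dot plot described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name is hypothetical, and the window of w entries is assumed to start at the cell being tested and to run down the diagonal.

```python
def dot_plot(x, y, w=1, s=1):
    """Return the set of (row, column) cells that receive a dot.

    Rows index y (vertical axis) and columns index x (horizontal axis).
    A dot is placed at (i, j) when at least s of the w diagonal entries
    (i, j), (i+1, j+1), ..., (i+w-1, j+w-1) are matches.
    """
    dots = set()
    for i in range(len(y) - w + 1):
        for j in range(len(x) - w + 1):
            matches = sum(y[i + k] == x[j + k] for k in range(w))
            if matches >= s:
                dots.add((i, j))
    return dots


# Sequences taken from the axes of Figure 2.10.
x, y = "CAGACTGTAA", "CTGACTGG"
print(sorted(dot_plot(x, y, w=4, s=4)))
```

With w = 1 and s = 1 this reduces to the unfiltered dot plot, in which every pair of equal letters receives a dot.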
Figure 2.10: Dot plot of two sequences (horizontal axis: CAGACTGTAA; vertical axis: CTGACTGG).
Formally, an alignment of two sequences X of length n and Y of length m is a
pair of sequences X′ and Y′ of the same length that may differ from the original
sequences X and Y only by having space characters inserted. There may be several
alignments for two sequences, so an alignment score can be defined as follows to
evaluate an alignment:
$$S = \sum_{i=1}^{L} s(X'[i], Y'[i]) \tag{2.4}$$
where X′[i] and Y′[i] represent the ith characters of X′ and Y′ respectively, s(X′[i],
Y′[i]) is the score of aligning the two characters X′[i] and Y′[i], and L is the common
length of X′ and Y′. An optimal alignment of two sequences is an alignment with
the maximum alignment score.
There are two types of alignment: global alignment and local alignment. A global
alignment is an alignment in which all of the characters in both sequences are involved
in the alignment while a local alignment compares regions of all possible lengths in-
stead of taking into account the total sequence, and identifies similar regions between
two sequences.
The first global alignment algorithm is the Needleman-Wunsch algorithm that was
developed in 1970 (Needleman and Wunsch, 1970). It is an application of the dynamic
programming approach to find the optimal alignment, which we now explain. The
idea behind the algorithm is motivated by the observation that any sub-path ending
at a position within the optimal path must itself be optimal at that position. So, the
optimal path can be identified by extending the optimal sub-path. The algorithm is
an elegant and simple way to obtain an alignment which maximizes a specific score.
The Needleman-Wunsch algorithm works as follows: First a matrix M is created
like the dot plot matrix and the entries are filled up using predefined scores for
matches, mismatches and gaps. Then, using the following relation, the score of each
entry M(i, j) is determined:
$$M(i,j) = \max\bigl(M(i-1,j-1) + s(X_i, Y_j),\; M(i-1,j) + s_{gap},\; M(i,j-1) + s_{gap}\bigr) \tag{2.5}$$
Finally, the optimal path is identified by starting from the rightmost entry in the
bottom row of the matrix and walking left and up through the matrix until the
topmost row or leftmost column is reached; any remaining characters are then
aligned with gaps. Both the time complexity and the space complexity of the
Needleman-Wunsch algorithm are O(mn).
As an example, suppose that we want to perform the Needleman-Wunsch algo-
rithm on the two given sequences X = ACTGATTCA and Y = ACGCATCA with
smatch = 2, smismatch = −3, and sgap = −2. All steps mentioned above are summarized
in Figure 2.11 which results in the following alignment:
A C T G - A T T C A
| |   |   | |   | |
A C - G C A T - C A
        A   C   T   G   A   T   T   C   A
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
A  -2   2   0  -2  -4  -6  -8 -10 -12 -14
C  -4   0   4   2   0  -2  -4  -6  -8 -10
G  -6  -2   2   1   4   2   0  -2  -4  -6
C  -8  -4   0  -1   2   1  -1  -3   0  -2
A -10  -6  -2  -3   0   4   2   0  -2   2
T -12  -8  -4   0  -2   2   6   4   2   0
C -14 -10  -6  -2  -3   0   4   3   6   4
A -16 -12  -8  -4  -5  -1   2   1   4   8
Figure 2.11: Needleman-Wunsch alignment of two sequences.
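The procedure can be sketched directly from relation (2.5). This is a minimal illustration with a hypothetical function name. Rows of the matrix correspond to Y and columns to X, as in the figure. When several optimal alignments exist, the traceback below prefers the diagonal move, so it may return a different optimal alignment from the one printed above.

```python
def needleman_wunsch(x, y, match=2, mismatch=-3, gap=-2):
    """Global alignment score and one optimal alignment of x and y."""
    n, m = len(x), len(y)

    def score(a, b):
        return match if a == b else mismatch

    # M[i][j] is the best score for aligning y[:i] with x[:j].
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + score(x[j - 1], y[i - 1]),
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)

    # Traceback from the bottom-right entry to the top-left entry.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(x[j - 1], y[i - 1]):
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif j > 0 and M[i][j] == M[i][j - 1] + gap:
            ax.append(x[j - 1]); ay.append("-"); j -= 1
        else:
            ax.append("-"); ay.append(y[i - 1]); i -= 1
    return M[m][n], "".join(reversed(ax)), "".join(reversed(ay))


best, ax, ay = needleman_wunsch("ACTGATTCA", "ACGCATCA")
print(best)
print(ax)
print(ay)
```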
In the Needleman-Wunsch algorithm, short, highly similar regions may be
missed because the rest of the sequence can outweigh them. Therefore, it makes sense
to look for a local alignment. The Smith-Waterman algorithm (Smith and Waterman,
1981) finds the pair of subsequences with the maximum degree of similarity
between the two original sequences. This means that the sequences need not be
aligned over their entire length. The Smith-Waterman algorithm is obtained by a
minor modification of the Needleman-Wunsch algorithm: an alignment path is not
required to reach the boundary entries in the last rows or columns of the
alignment matrix, but can begin and end in internal entries. To do this, zero must
be the minimum score placed in the matrix. So, the score function in relation (2.5)
is changed to the following relation:
$$M(i,j) = \max\bigl(M(i-1,j-1) + s(X_i, Y_j),\; M(i-1,j) + s_{gap},\; M(i,j-1) + s_{gap},\; 0\bigr) \tag{2.6}$$
Figure 2.12 illustrates an example of a local alignment for the given sequences X =
ATGCATCCCATGAC and Y = TCTATATCCGT using the Smith-Waterman algo-
rithm which results in the following alignment:
A T C C
| | | |
A T C C
    A T G C A T C C C A T G A C
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 2 0 0 0 2 0 0 0 0 2 0 0 0
C 0 0 0 0 2 0 0 4 2 2 0 0 0 0 2
T 0 0 2 0 0 0 0 2 1 0 0 2 0 0 0
A 0 2 0 0 0 2 0 0 0 0 2 0 0 2 0
T 0 0 4 2 0 0 2 0 0 0 0 4 2 0 0
A 0 2 0 0 0 2 0 0 0 0 2 0 0 2 0
T 0 0 4 2 0 0 4 2 0 0 0 4 0 0 0
C 0 0 2 0 4 0 0 6 4 2 0 0 0 0 2
C 0 0 0 0 2 0 0 4 8 6 4 2 0 0 2
G 0 0 0 2 0 0 0 2 6 5 3 1 4 2 0
T 0 0 2 0 0 0 2 0 4 3 2 5 3 1 0
Figure 2.12: Smith-Waterman alignment of two sequences.
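Relation (2.6) can be sketched in the same way. This is a minimal illustration with a hypothetical function name. The scoring values are not stated for this example, so the sketch assumes the same values as in the Needleman-Wunsch example (match 2, mismatch -3, gap -2).

```python
def smith_waterman(x, y, match=2, mismatch=-3, gap=-2):
    """Best local alignment score and one optimal local alignment."""
    n, m = len(x), len(y)

    def score(a, b):
        return match if a == b else mismatch

    M = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + score(x[j - 1], y[i - 1]),
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap,
                          0)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Traceback from the maximum entry until an entry with score 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + score(x[j - 1], y[i - 1]):
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i][j - 1] + gap:
            ax.append(x[j - 1]); ay.append("-"); j -= 1
        else:
            ax.append("-"); ay.append(y[i - 1]); i -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("ATGCATCCCATGAC", "TCTATATCCGT"))
```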
Creating the alignment matrix and backtracking in it to find the optimal alignment
takes O(nm) time and space. Because of the quadratic time complexity of the dynamic
programming algorithm, seeds are used in many alignment applications in practice.
Further information about seeds and how they work is given in the next section.
2.2.2 Seeds for homology search
The quadratic time complexity of the Smith-Waterman dynamic programming
algorithm makes it impractical for large sequences. Hence, approaches that
can quickly identify similarity between sequences are needed. BLAST (Altschul
et al., 1990), the Basic Local Alignment Search Tool, is the most widely used
algorithm for similarity search. It is based on the realization that two sequences
having a long subsequence in common may share significant similarity and,
consequently, a local alignment. The default length for the common subsequence is
11. In other words, the two sequences being aligned need to have 11 consecutive
positions which are identical, or simply 11 matches, and this is represented as the
seed 11111111111, where a 1 stands for a match. The main idea is that it is easier
to search for exact matches than for approximate ones. However, the authors of
PatternHunter (Ma et al., 2002) showed that searching for 11 matches that are not
consecutive has a higher probability of finding actual alignments.
Their seed is 111*1**1*1**11*111; a 1 stands for a match and * stands for a don’t
care position. It means, when searching for a potential alignment, only positions
corresponding to 1’s are checked, whereas those corresponding to *’s are ignored.
Such a seed is called a spaced seed as opposed to the contiguous seed of BLAST. The
new seed, implemented in the PatternHunter algorithm, is much more sensitive, in
the sense that it has a much higher chance of finding actual alignments. The number
of 1’s in a seed is called the weight of the seed whereas the total number of symbols
is the length.
The sensitivity of a seed is a measure of the seed’s ability to detect similar regions
between sequences. A formal model is required in order to precisely define the sensi-
tivity. First, given two sequences, we define the similarity level, p, as the probability
of having a match between the two sequences. Then we model the alignment as a
Bernoulli random sequence R of 0’s and 1’s so that the probability of a 1 is p, the
similarity level. We shall denote the length of R by N . Given a seed s, we say that
s hits R at position k if aligning the end of s with position k in R causes all 1’s in s
to align with 1’s in R. An example of a hit is shown in Figure 2.13.
11101001111100101101101110010
       111*1**1*1**11*111
Figure 2.13: Example of a hit by a spaced seed.
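The definition of a hit can be checked mechanically. In the following minimal sketch, the function name is hypothetical and offsets are 0-based start positions of the seed within the region; the end position k used in the definition above is the start offset plus the seed length minus one.

```python
def seed_hits(seed, region):
    """Return the 0-based start offsets at which `seed` hits `region`."""
    ones = [k for k, c in enumerate(seed) if c == "1"]
    return [offset
            for offset in range(len(region) - len(seed) + 1)
            if all(region[offset + k] == "1" for k in ones)]


SEED = "111*1**1*1**11*111"
REGION = "11101001111100101101101110010"

print("weight:", SEED.count("1"), "length:", len(SEED))
print("hits at offsets:", seed_hits(SEED, REGION))
```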
The sensitivity of the seed s is formally defined as the probability that s hits R. In
addition to s, this probability depends on the similarity p and length of the random
region, N . There is a dynamic programming approach for computing the sensitivity
of a given seed (Li et al., 2004).
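For very short regions, the sensitivity can also be computed directly from its definition by enumerating all 2^N regions. The following sketch does exactly that. It is a brute-force illustration with a hypothetical function name, not the dynamic programming approach of Li et al. (2004), and it is only practical for small N.

```python
from itertools import product


def sensitivity_bruteforce(seed, n, p):
    """Probability that `seed` hits a Bernoulli(p) region of length n."""
    ones = [k for k, c in enumerate(seed) if c == "1"]
    total = 0.0
    for bits in product((0, 1), repeat=n):
        hit = any(all(bits[offset + k] for k in ones)
                  for offset in range(n - len(seed) + 1))
        if hit:
            matches = sum(bits)
            total += p ** matches * (1 - p) ** (n - matches)
    return total


print(sensitivity_bruteforce("11", 4, 0.5))
print(sensitivity_bruteforce("1*1", 3, 0.5))
```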
Another version of spaced seeds is the transition-constrained seed, a biologically
motivated variant that was first introduced and used in the YASS
program (Noe and Kucherov, 2005). In addition to 1’s for matches and *’s for don’t
cares, the transition-constrained seed contains the new character @, which stands for
either a match or a transition; that is, a substitution A←→ G or C←→ T. The
biological motivation for this is that transitions are more common than transversions,
that is, substitutions A/G ←→ C/T. The seed used in YASS is 1@1**11**1*11@1.
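The three seed symbols can be interpreted on a pair of equal-length DNA windows as in the following minimal sketch. The function name is hypothetical, and this is not the YASS implementation.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_matches(seed, a, b):
    """True if the equal-length windows a and b agree under `seed`.

    '1' requires a match, '@' accepts a match or a transition,
    and '*' is a don't care position.
    """
    for symbol, u, v in zip(seed, a, b):
        if symbol == "1" and u != v:
            return False
        if symbol == "@" and u != v and (u, v) not in TRANSITIONS:
            return False
    return True


print(seed_matches("1@1", "AAC", "AGC"))  # A/G at the @ position is a transition
print(seed_matches("1@1", "AAC", "ACC"))  # A/C at the @ position is a transversion
```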
Employing several seeds together for homology search can greatly increase the
sensitivity. This set of several different seeds is called multiple spaced seeds. The
multiple spaced seeds were first used in PatternHunter II (Li et al., 2004), which uses
16 spaced seeds, each of which has 11 matches. When at least one of the seeds
in a set of multiple spaced seeds hits a random region R, we say that the multiple
spaced seeds hit R. Hence, the definition of sensitivity, and also the dynamic
programming algorithm which computes it, can be extended to the case of multiple
spaced seeds.
2.2.3 Suffix tree and suffix array
Suffix trees and suffix arrays are two important data structures for text searching and
indexing that are widely used in computational biology and bioinformatics applica-
tions (Manber and Myers, 1990; Smyth, 2003; Puglisi et al., 2007).
For a given string x of length n, there are two special kinds of substrings x[i..j]
which are very important. For any integer j ∈ 0 . . . n, we say x[1..j] is a prefix of x.
For any integer i ∈ 1 . . . n+1, we say x[i..n] is a suffix of x. The suffix tree of a
given string x is a tree whose leaves denote the suffixes of x. In other words, each
suffix of x is represented by a path from the root to the corresponding leaf in the
suffix tree. The suffix array of a given string x is an array of integers that gives the
starting positions of the suffixes of x sorted in lexicographical order. The longest
common prefix, denoted by LCP, of two strings is the longest string which is a prefix
of both. The LCP array stores, alongside each entry of the suffix array, how many
characters a particular suffix has in common with the suffix directly above it in
sorted order, starting at the beginning of both suffixes. The LCP is useful in making
some string operations more efficient. For example, it can be used to avoid comparing
characters that are already known to be the same when searching through the list of
suffixes. Figures 2.14 and 2.15 show examples of the suffix tree, suffix array and LCP
array for the string:
index: 1 2 3 4 5 6 7 8 9 10 11
x : T C G T A A C G A C C
Figure 2.14: Suffix tree for x = TCGTAACGACC.
 i  SA[i]  Suffix x[SA[i]..n]  LCP[i]
 1    5    AACGACC               0
 2    9    ACC                   1
 3    6    ACGACC                2
 4   11    C                     0
 5   10    CC                    1
 6    7    CGACC                 1
 7    2    CGTAACGACC            2
 8    8    GACC                  0
 9    3    GTAACGACC             1
10    4    TAACGACC              0
11    1    TCGTAACGACC           1
Figure 2.15: Suffix array for x = TCGTAACGACC.
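The suffix array and LCP array of the example can be reproduced with a naive construction. The following sketch sorts the suffixes directly, which is simple but far slower than the specialized constructions used in practice. The function names are hypothetical and positions are 1-based, as in the text.

```python
def suffix_array(x):
    """1-based start positions of the suffixes of x in lexicographic order."""
    return sorted(range(1, len(x) + 1), key=lambda i: x[i - 1:])


def lcp_array(x, sa):
    """LCP[r] = length of the common prefix of suffixes sa[r-1] and sa[r]."""
    def lcp(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k

    return [0] + [lcp(x[sa[r - 1] - 1:], x[sa[r] - 1:])
                  for r in range(1, len(sa))]


x = "TCGTAACGACC"
sa = suffix_array(x)
print(sa)
print(lcp_array(x, sa))
```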
Chapter 3
Related work
Several studies have been performed on oligonucleotide probes, and many algorithms
and programs for designing and selecting oligonucleotides have been proposed in the
literature. In this chapter, we present the best available algorithms and programs for
designing oligonucleotides.
3.1 ArrayOligoSelector
ArrayOligoSelector (Bozdech et al., 2003) designs optimized oligonucleotide probes by
considering some parameters such as uniqueness in the genome, sequence complexity,
lack of self-binding, GC-content and proximity to the 3′ end, or the right end, of
the gene. It consists of two main steps; in the first step, the program computes
scores of uniqueness, sequence complexity, lack of self-binding and GC-content for
each candidate oligo. The details of scoring for the mentioned parameters are as
follows:
• Uniqueness: The binding free energy of a candidate oligo to its most homologous
sequence is considered as the uniqueness score. BLASTN, BLAT, or gfclient
are used to locate the most homologous sequence followed by a calculation of
the theoretical binding energy. The binding free energy is computed using the
nearest-neighbor model with the thermodynamic parameters from SantaLucia
(SantaLucia, 1998).
• Sequence complexity: The program employs the LZW compression algorithm
(Ziv and Lempel, 1977) to compute the sequence complexity score which is the
difference in bytes between the oligo sequence and its compressed version.
• Self-annealing: Using the Smith-Waterman local alignment algorithm (Smith
and Waterman, 1981), ArrayOligoSelector computes the alignment score of the
optimum local alignment between the oligo sequence and its reverse complement
as a measurement of the secondary structure created by the self annealing of
an oligo.
• GC-content: The score is computed as the GC percentage of the oligo sequence.
After the first step, the obtained scores are used in the second step to select the
oligos that are unique for the target sequences, have a low level of internal repeats,
have a low tendency for self-annealing, and are within a narrow range of the target
GC percentage specified by the user.
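Two of the scores above are simple enough to sketch. The function names below are hypothetical and this is not the ArrayOligoSelector code. In particular, the complexity score uses `zlib` (DEFLATE) from the Python standard library as a stand-in for the compression algorithm named above, so the absolute values will differ from those of the original program. The reverse complement is the input that the self-annealing alignment would use.

```python
import zlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(oligo):
    """GC-content score: percentage of G and C bases."""
    return 100.0 * sum(base in "GC" for base in oligo) / len(oligo)


def reverse_complement(oligo):
    """Sequence against which the oligo is aligned to assess self-annealing."""
    return oligo.translate(COMPLEMENT)[::-1]


def complexity_score(oligo):
    """Bytes saved by compression; larger values mean a more repetitive oligo.

    Assumption: zlib is used here in place of the original compressor.
    """
    raw = oligo.encode("ascii")
    return len(raw) - len(zlib.compress(raw))


oligo = "GATACCGAGGTGATGAAATGCATCGTTGAGGTCATCTCCGACACACTTTC"
print(gc_percent(oligo), reverse_complement(oligo), complexity_score(oligo))
```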
3.2 GoArray
GoArray (Rimour et al., 2005) has been developed to overcome the problems of
classical approaches to designing oligonucleotides, such as the lack of adaptability to
complex biological systems. The essential factor regarding adaptability is
cross-hybridization, which could result in misinterpretation of biological results. In
order to assess the specificity of an oligonucleotide, GoArray applies the sequence
similarity and maximum consecutive match parameters as follows (Kane et al., 2000):
• The oligo must not have more than 75% similarity with a non-target sequence.
• The oligo must not have a stretch of more than 15 identical bases with a non-
target sequence.
The outcomes of in silico tests revealed that, in a complex biological
system, long oligos, from 50 to 70 bases, do not have good specificity, while short
oligos, 20 to 30 bases, have higher specificity and so seem to be better adapted, but
perhaps with low sensitivity. Hence, the algorithm creates the oligos by concatenating
two specific short sequences to achieve both specificity and sensitivity. To do so, GoArray:
• Reads each sequence from the 3′ end to find the first specific sequence
• Checks specificity of the sequence using BLAST. If the sequence shows an align-
ment with 75% or more similarity or contains a stretch of 15 identical bases with
non-target sequence it is considered as non-specific.
• Looks for the second specific sequence.
• Merges two sequences using randomly chosen bases, called linker, and checks
the specificity of the whole oligo.
• Checks for melting temperature, secondary structure, and prohibited sequences.
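The two specificity criteria can be sketched for the simplest case of an ungapped comparison between an oligo and an equal-length window of a non-target sequence. GoArray itself relies on BLAST alignments, so the functions below are a hypothetical simplification with invented names.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    return 100.0 * sum(u == v for u, v in zip(a, b)) / len(a)


def longest_identical_stretch(a, b):
    """Length of the longest run of consecutive identical positions."""
    best = run = 0
    for u, v in zip(a, b):
        run = run + 1 if u == v else 0
        best = max(best, run)
    return best


def is_specific(oligo, nontarget_window, max_identity=75.0, max_stretch=15):
    """Apply both criteria to one ungapped oligo/non-target comparison."""
    return (percent_identity(oligo, nontarget_window) <= max_identity
            and longest_identical_stretch(oligo, nontarget_window) <= max_stretch)


oligo = "A" * 20
window = "AAT" * 6 + "AA"  # differs from the oligo at 6 of 20 positions
print(percent_identity(oligo, window), longest_identical_stretch(oligo, window))
print(is_specific(oligo, window))
```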
3.3 OligoArray
OligoArray (Rouillard et al., 2003) takes into account several criteria for designing
and selecting oligonucleotide probes. In addition to the percentage and length of
sequence similarities, the specificity of an oligo is computed from the thermodynamic
properties of hybridization, which makes it possible to account for short regions
with high GC content that result in stable cross-hybridization at temperatures
commonly used during hybridization. By allowing the sequence length to be adjusted
by one or a few bases, OligoArray aims for a narrow melting temperature distribution
rather than a uniform oligo length, in order to achieve better uniformity during
hybridization. The outline of this algorithm is as follows:
• Sequences are masked for the existence of prohibited patterns, such as regions
of same bases (GGGGG, CCCCC, TTTTT, AAAAA) or di- and tri-nucleotide
repeats extending over more than 10 bases. All positions corresponding to these
prohibited regions are filtered and automatically masked.
• Melting temperature is calculated for each oligo candidate by applying the Nearest
Neighbour model with parameters from SantaLucia (1998); the user specifies
the acceptable range.
• The oligo candidates are checked for the absence of strong secondary structures
at the hybridization temperature.
• The specificity of oligos is examined by considering all possible cross-hybridizations
of the oligo with similar sequences using BLAST and also by performing ther-
modynamic calculations. The oligo is considered to be specific for its target
sequence, if there is no possible cross-hybridization with melting temperature
above the specificity threshold set by the user.
• The GC-content range and the position in transcript are checked for the re-
maining oligo candidates.
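The masking step in the first item of the outline can be sketched with a regular expression. The function name, the mask character and the exact thresholds are assumptions: homopolymer runs of five or more bases are masked, and "more than 10 bases" is approximated here by at least six dinucleotide units or four trinucleotide units (12 bases in both cases).

```python
import re

PROHIBITED = re.compile(
    r"A{5,}|C{5,}|G{5,}|T{5,}"   # runs of five or more identical bases
    r"|([ACGT]{2})\1{5,}"        # dinucleotide repeat of at least 12 bases
    r"|([ACGT]{3})\2{3,}"        # trinucleotide repeat of at least 12 bases
)


def mask_prohibited(seq, mask_char="N"):
    """Replace every prohibited region of `seq` by mask characters."""
    return PROHIBITED.sub(lambda m: mask_char * len(m.group(0)), seq)


print(mask_prohibited("ACGTAAAAAGT"))
print(mask_prohibited("GG" + "AC" * 6 + "TT"))
```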
3.4 OligoPicker
OligoPicker (Wang and Seed, 2003) helps in selecting oligo probes for each of the tar-
get DNA sequences given for microarray spotting. The algorithm takes into account
the following criteria for designing and selecting the oligos:
• Location in the sequence: adjacency to the 5′ (left) end or the 3′ (right) end,
according to random or oligo dT priming. In general, the preferred oligos for
optimal sensitivity in a random primed labeling will be placed as close to the
5′ end as possible.
• Melting temperature uniformity: OligoPicker calculates the Tm for all oligo
candidates using the following formula and then determines the median Tm for
oligos:
$$T_m = 64.9 + 41 \times \frac{\mathit{gcCount}}{\mathit{oligoLength}} - \frac{600}{\mathit{oligoLength}} \tag{3.1}$$
where gcCount is the number of all G’s and C’s in an oligo. An oligo candidate
is ignored if its Tm is not within 5 °C of the median Tm.
• Probe accessibility: The probability of forming secondary structures is very
high in regions of significant self-complementarity. Hence, oligo candidates are
checked for homology to the complementary strand of their cognate sequences
using BLAST.
• Reduced cross-hybridization: Because contiguous base pairing is the single most
important determinant of cross-hybridization for OligoPicker, it considers the
rejection of contiguous sequence identity as the primary filter in the oligo selec-
tion scheme. To do so, OligoPicker uses a hash table to quickly search for the
common stretch between sequences.
• Evasion of non-coding RNA and low complexity regions: When total RNA is
used as the starting material, interference of RNA other than mRNA with array
hybridization may be a practical concern. To address this concern, sequence
regions similar to rRNA or snRNA (small nuclear RNA) are skipped during
the oligo selection procedure by using both contiguous base match screening
and BLAST. Low-complexity regions may also result in cross-hybridization so
they are identified by the DUST program (Hancock and Armstrong, 1994) and
avoided when selecting oligos.
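The melting temperature criterion, formula (3.1) followed by the median filter, can be sketched as follows. The function names and the example oligos are hypothetical.

```python
from statistics import median


def oligopicker_tm(oligo):
    """Formula (3.1): Tm estimated from GC count and oligo length."""
    gc_count = sum(base in "GC" for base in oligo)
    return 64.9 + 41.0 * gc_count / len(oligo) - 600.0 / len(oligo)


def filter_by_median_tm(oligos, tolerance=5.0):
    """Keep the oligos whose Tm lies within `tolerance` of the median Tm."""
    tms = {oligo: oligopicker_tm(oligo) for oligo in oligos}
    median_tm = median(tms.values())
    return [oligo for oligo in oligos
            if abs(tms[oligo] - median_tm) <= tolerance]


candidates = ["GC" * 5 + "AT" * 5, "AT" * 5 + "GC" * 5, "AT" * 10]
print([round(oligopicker_tm(o), 1) for o in candidates])
print(filter_by_median_tm(candidates))
```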
3.5 OligoWiz
OligoWiz (Nielsen et al., 2003) defines some parameters for designing and selecting
oligos and for each parameter it calculates a score between 0 and 1. The final score of
an oligo candidate would be the weighted sum of its parameter scores that is between
0 and 1 as well. The set of parameters that are considered in OligoWiz is:
• Cross-hybridization: Assessment of the cross-hybridization degree is carried out
by calculating the homology score for each oligo sequence using BLAST.
• Melting temperature difference: It is necessary that all oligos perform well
under similar hybridization conditions. An oligo parameter that strongly affects
its hybridization properties is the melting temperature, and a minimal
difference between the Tm of all oligos is preferred. OligoWiz employs the
Nearest Neighbour model with parameters from SantaLucia and uses the
following formula for Tm calculation:
$$T_m = \frac{1000\,\Delta H}{A + \Delta S + R \ln(C_T/4)} + 16.6 \log_{10}[\mathrm{Na}^+] - 273.15 \tag{3.2}$$
where ∆H (kcal/mol) is the total energy exchange between the system and
its surrounding environment, A is a constant correcting for helix initiation,
∆S (cal/(mol·K)) denotes the energy spent by the system to organize itself, R
is the ideal gas constant, 1.987 cal/(mol·K), CT is the molar concentration of
the oligo, and [Na+] is the molar concentration of salt.
• Position within transcript: A score would be assigned to each oligo based on its
position in the target sequence. For example, oligos with target positions closer
to the starting point of reverse transcriptase are more desired.
• Low-complexity filtering: A low-complexity score is assigned to skip oligos com-
posed of very common regions in the oligo design process. To estimate the
low-complexity measure for an oligo, a list of sequence subregions with related
information content is generated for each species. Oligos that contain those par-
ticular regions are considered as low-complexity oligos and would be assigned a
low score.
• GATC-only score: Each oligo sequence that contains bases different from A, C,
G, and T will be given score 0.
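The combination of parameter scores into a final score can be sketched as follows. The function names, the parameter keys and the weights are hypothetical. Dividing by the total weight is an assumption made here so that the final score stays between 0 and 1.

```python
def gatc_only_score(oligo):
    """1.0 if the oligo contains only A, C, G and T, otherwise 0.0."""
    return 1.0 if set(oligo) <= set("ACGT") else 0.0


def total_score(scores, weights):
    """Weighted sum of per-parameter scores, each between 0 and 1."""
    # Assumption: weights are normalized by their sum.
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


scores = {"cross_hyb": 0.9, "delta_tm": 0.8, "gatc_only": gatc_only_score("ACGTTGCA")}
weights = {"cross_hyb": 3.0, "delta_tm": 2.0, "gatc_only": 1.0}
print(total_score(scores, weights))
```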
3.6 PICKY
PICKY (Chou et al., 2004) has been proposed to design computationally
optimized oligo probes. To do so, PICKY:
• Uses a generalized suffix array for both the sequence and its complement that
allows quick identification of repetitive, low complexity, self-similar and self-
complementary regions in the oligo sequence.
• Utilizes the suffix array to check the maximum consecutive match parameter,
i.e. to make sure that no oligo sequence has a stretch equal to or longer than
the maximum match length of 15 bases in common with other sequences. This
is performed by sweeping across the suffix array and checking the two neighbours
of each suffix to detect all regions on all sequences that must be omitted.
• Identifies all other sequences that are similar to each oligo sequence. Again,
using the suffix array, similar non-target sequences can be quickly identified
and their melting temperatures with oligo candidates can then be calculated.
PICKY calculates the melting temperature by the following equation:
$$T_m = \frac{\Delta H}{\Delta S + R \ln(C/4)} + 12.0 \times \log_{10}[\mathrm{Na}^+] - 273.15 \tag{3.3}$$
First, the melting temperature of an oligo candidate with its target sequence is
computed. Then, to avoid imperfectly matched cross-hybridization, the melting
temperatures between each oligo candidate and all potential non-targets are
calculated. Using the suffix array, PICKY identifies and aligns an oligo candidate
with all non-targets up to the similarity level given by the sequence similarity
parameter, so that oligo candidates violating this parameter are skipped.
• Compares target and non-target melting temperatures of all oligo candidates to
identify a subset that can specify each gene, has the minimum chance for cross-
hybridization, has a uniform range for melting temperature, and maximizes the
distance between the lowest target and the highest non-target melting temper-
atures of the chosen set.
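The maximum consecutive match check can be illustrated with sorted suffixes. In the sketch below, suffixes that share their first `max_match` characters are adjacent in sorted order, so one sweep over the sorted list finds every stretch that occurs in more than one sequence. This is a naive illustration with hypothetical names, not PICKY's implementation: it sorts plain Python strings and considers only the forward strand.

```python
from itertools import groupby


def shared_stretches(seqs, max_match=15):
    """Return {(sequence index, start)} of every window of length max_match
    that also occurs in a different sequence."""
    suffixes = sorted(
        (seq[start:], index, start)
        for index, seq in enumerate(seqs)
        for start in range(len(seq) - max_match + 1)
    )
    flagged = set()
    # Suffixes with the same first max_match characters form one group.
    for _, group in groupby(suffixes, key=lambda item: item[0][:max_match]):
        group = list(group)
        if len({index for _, index, _ in group}) > 1:
            flagged.update((index, start) for _, index, start in group)
    return flagged


common = "ACGTTGCAAGCTTAG"  # hypothetical 15-base stretch shared by both sequences
print(sorted(shared_stretches(["TTT" + common + "CCC", "GGGG" + common + "A"])))
```

Any oligo candidate overlapping a flagged window would then be omitted.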
3.7 ProbeSel
ProbeSel (Kaderali and Schliep, 2002) considers oligos which are perfectly comple-
mentary to their target sequences and are unique up to k mismatches in order to
design oligos that bind specifically to the target sequences. All steps of the algorithm
are summarized as follows:
• Creates a generalized suffix tree of all target sequences and their reverse com-
plement which helps to identify non-unique oligos.
• Removes oligo candidates that don’t satisfy the pre-specified length.
• Eliminates oligo candidates that are similar to non-target sequences more than
the allowed threshold.
• Skips oligo candidates that hybridize with their target at a melting temperature
lower than the predefined threshold.
• Aligns the remaining oligo candidates with their complementary targets and
calculates the melting temperature.
• Picks the final temperature T and one oligo for each of the target sequences. T is
a temperature threshold: the melting temperature of every oligo with its
intended target must be higher than T, whereas the melting temperatures of all
undesired cross-hybridizations must be lower than T.
The first two undesired oligo elimination steps significantly decrease the number of
alignments that must be performed between oligos and targets.
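The role of the threshold T in the final step can be sketched as follows. The data layout, the function name and the rule for choosing among several admissible oligos (largest gap between target and cross-hybridization melting temperatures) are assumptions made for illustration, and the melting temperatures are invented.

```python
def pick_oligos(candidates, threshold):
    """Pick one oligo per target for a given temperature threshold.

    `candidates` maps each target to a list of
    (oligo, tm_with_target, highest_cross_hybridization_tm) tuples.
    Returns None when some target has no admissible oligo.
    """
    chosen = {}
    for target, oligos in candidates.items():
        admissible = [o for o in oligos if o[1] > threshold > o[2]]
        if not admissible:
            return None
        # Assumption: prefer the oligo with the widest temperature gap.
        chosen[target] = max(admissible, key=lambda o: o[1] - o[2])[0]
    return chosen


candidates = {
    "gene1": [("oligoA", 70.0, 50.0), ("oligoB", 65.0, 62.0)],
    "gene2": [("oligoC", 68.0, 55.0)],
}
print(pick_oligos(candidates, 60.0))
print(pick_oligos(candidates, 69.0))
```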
3.8 ProbeSelect
ProbeSelect (Li and Stormo, 2001) concentrates on the specificity of input sequences
for designing the optimal set of oligonucleotide probes. First, a set of oligo candidates
is created from oligos that maximize the minimum number of mismatches to every
other gene in the genome. Then, it selects the optimal oligonucleotides from this
set. An oligo is picked as optimal when it hybridizes with the target sequence in an
acceptable range of hybridization free energy and maximizes the difference in free
energy with non-target sequences. The outline of ProbeSelect is as
follows:
• Creates a suffix array of the coding DNA sequences of a genome from an organ-
ism.
• Generates a “landscape” for each gene using the suffix array. A landscape of a
sequence includes the occurrence of all words of that sequence. In this case, the
reason for using the landscape is to identify low frequency words in the rest of
the genome that form the set of unique oligos.
• Selects oligo candidates using the results from testing many genes. Oligos with
the lowest frequencies at the level of subword lengths are those that occur
least in the rest of the genome, even with a few mismatches. Based on this
evidence, the algorithm considers all the subword frequencies for each oligo and
selects those with the lowest values, which are therefore expected to have the
fewest approximate matches elsewhere.
• Looks for matching sequences in the whole genome. The Myers algorithm (My-
ers, 1999) is applied to identify all match positions for the candidate oligos in
the coding sequences of the genome with four or fewer mismatches for short
oligos, 10 or fewer for oligos of length 50 bases, and 20 or fewer for oligos of
length 70 bases.
• Locates match sequence positions in all genes. A binary search in a sorted array
of all sequences gives the positions of the matches in a sequence.
• Calculates the free energy and melting temperature for each oligo and its target
sequence. The free energy is computed using the alignment of the oligo sequence
and its target because it is usually the lowest energy structure.
• Selects the oligos that have the most stable hybridization with their target
sequence, thus resulting in good discrimination from potential non-targets.
3.9 ProDesign
ProDesign (Feng and Tillier, 2007) utilizes spaced seeds for designing suitable oligos
which allows the inclusion of more mismatches between an oligo sequence and its tar-
get sequence. The algorithm can design both gene-specific and group-specific oligos.
The algorithm consists of six steps explained below:
• Creates a hash table for each spaced seed. The hash table is indexed by the words
of the seed, obtained by replacing every 1 in the seed with each possible nucleotide
A, C, G, or T, encoded by the two-bit codes 00, 01, 10, and 11, respectively.
The size of each hash table is $4^w$, where $w$ is the weight of the seed, that is,
its number of match positions.
• Checks each word in the hash table to see whether that word is specific to a
group of sequences or not. The specificity of the word is obtained by counting
its occurrences in the group sequences. Let $h_{i,j}$ mark the occurrence of the word
$x_i$ in the sequence $j$; it equals one if the word occurs in the sequence and zero
otherwise.
• Computes the clusters of words. Let $H_{i,k}$ mark the occurrence of the word $x_i$
in the group $k$. If the value of $h_{i,j}$ is 1 for all the sequences in the group, then
$H_{i,k}$ is 1. For each word in the hash table a list of groups with $H_{i,k} = 1$ is built.
A word is specific to a group if $H_{i,k} = 1$ only for that group. Some groups have
one or even more specific words and some groups may not have a word. The
latter groups are considered later to be reclustered into other groups.
• Considers the specific words for finding oligo candidates for the group. The user
can specify the length of the required oligos. If the group-specific word satisfies
the length threshold, it is returned as the group oligo. Otherwise, two or
more words need to be joined together to reach the required length. In this
case, ProDesign starts by selecting a random sequence in each group. Then,
the position of the specific word is found by scanning the sequence. The scan
is extended forward to find the second specific word. To be joined, two words
must either overlap or be separated by a gap of fewer than 3 bases. The joined
word extends from the first character of the first word to the last character of
the second word.
• Selects the modified word as the final oligo for each group only if it is specific
for all sequences of the group.
• Checks that all oligo candidates are free of low-complexity regions and satisfy
the melting temperature constraints and the %GC content requirements.
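The occurrence indicators used in the second and third steps above can be illustrated with a short sketch. This is not ProDesign's code: the function names (`seed_word`, `words_in`, `group_specific_words`) and the toy groups are assumptions made for illustration, and a Python set stands in for the hash table of size $4^w$.

```python
from collections import defaultdict

def seed_word(seq, pos, seed):
    """Bases of seq at the 1-positions of the seed placed at position pos."""
    return "".join(seq[pos + k] for k, c in enumerate(seed) if c == "1")

def words_in(seq, seed):
    """All seed words occurring in seq, i.e. the words x_i with h[i][j] = 1."""
    return {seed_word(seq, p, seed) for p in range(len(seq) - len(seed) + 1)}

def group_specific_words(groups, seed):
    """Map each group name to the words with H[i][k] = 1 for that group only."""
    groups_of_word = defaultdict(list)
    for name, seqs in groups.items():
        # H[i][k] = 1 iff the word occurs in every sequence of group k.
        common = set.intersection(*(words_in(s, seed) for s in seqs))
        for word in common:
            groups_of_word[word].append(name)
    specific = {name: set() for name in groups}
    for word, names in groups_of_word.items():
        if len(names) == 1:  # the word is present in exactly one group
            specific[names[0]].add(word)
    return specific

# Hypothetical toy input: two groups of two short sequences each.
toy_groups = {"g1": ["ACGTAC", "TACGTA"], "g2": ["GGGCCC", "GGCCCA"]}
toy_specific = group_specific_words(toy_groups, "1*1")
```

A group whose set comes back empty corresponds to the groups that, in the description above, are reclustered later.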
3.10 ProMide
ProMide (Rahmann, 2003) is a specific and fast algorithm to design and select short
oligonucleotides of length up to 30 bases. The algorithm can design oligos for large
data sets in a reasonable time. ProMide considers two parameters related to the lack
of specificity to design specific probes:
• The longest common factor lcf which is the longest common region that appears
in both an oligo and a sequence.
• The longest common factor with one mismatch lcf1 which is the longest common
region that appears in both an oligo and a sequence with at most one mismatch.
The idea behind this approach is that lcf approximates the lack of specificity
of an oligo better than the number of mismatches between the oligo and the non-target
sequences against which its specificity is checked. A long lcf between the oligo and
a non-target indicates a higher chance of undesired cross-hybridization. However,
this parameter may be too optimistic at times. For example, it does not exclude oligos
whose lcf is short even though they share a long region with non-targets once a single
mismatch is allowed. Therefore, lcf1 is also taken into account to address this issue.
The algorithm first selects a set of oligo candidates that satisfy the specified melting
temperature or length range. Then, the oligos in the set are ranked by comparing
their lcf and lcf1 vectors, which contain the lengths of the factors between each
oligo and all non-target sequences, and by incorporating some additional
sequence-specific restrictions. Using memory-efficient enhanced suffix arrays, the lcf
and lcf1 vectors can be computed very fast.
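To make the two parameters concrete, the following sketch computes them directly from their definitions. It is a quadratic-time illustration only and does not reflect how ProMide works internally, since ProMide relies on enhanced suffix arrays. The function names are hypothetical, and counting the mismatched position in the length of lcf1 is an assumption made here.

```python
def lcf(oligo, seq):
    """Length of the longest common factor (exact common substring)."""
    best = 0
    prev = [0] * (len(seq) + 1)
    for a in oligo:
        cur = [0] * (len(seq) + 1)
        for j, b in enumerate(seq, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1  # extend the match ending one step back
                best = max(best, cur[j])
        prev = cur
    return best

def lcf1(oligo, seq):
    """Length of the longest common factor with at most one mismatch."""
    best = 0
    # Each diagonal d aligns oligo[i] with seq[i + d] (ungapped alignment).
    for d in range(-(len(oligo) - 1), len(seq)):
        run0 = 0  # current run ending here with no mismatch
        run1 = 0  # current run ending here with at most one mismatch
        i = max(0, -d)
        j = i + d
        while i < len(oligo) and j < len(seq):
            if oligo[i] == seq[j]:
                run0 += 1
                run1 += 1
            else:
                run1 = run0 + 1  # the exact run plus this single mismatch
                run0 = 0
            best = max(best, run1)
            i += 1
            j += 1
    return best
```

For the hypothetical pair `oligo = "GATTACA"` and `seq = "CCATTGCAGG"`, the longest exact common factor is ATT, while allowing one mismatch aligns ATTACA with ATTGCA. This is the situation described above in which lcf alone is too optimistic.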
3.11 ROSO
ROSO (Reymond et al., 2004) employs BLAST to compute the specificity. Because
specificity analysis in the design phase using BLAST is independent of the oligo
selection phase, the user can run the time-consuming BLAST step separately in order
to select the oligos efficiently. The algorithm proceeds as follows:
• Filters the input sequences by eliminating identical genes, repetitions of bases,
and degenerate bases.
• Checks for potential cross-hybridizations using BLAST.
• Removes oligonucleotide probes that form a stable secondary structure.
• Calculates the melting temperature of each oligo candidate and chooses a set
of oligos in a way that minimizes the variability of the melting temperature.
• Selects the optimal set of oligos according to the GC-content rate, the first
and the last bases (preferably a G or a C), the number of repetitions, and the
hybridization free energy.
3.12 YODA
YODA (Nordberg, 2004) designs and verifies oligos in several steps of increasing
computational intensity, with undesired candidate oligos being eliminated at each
step. The steps are identifying average melting temperature and GC-content, finding
prohibited sequences and contiguous stretches of identities, computing Tm , looking
for potential secondary structure, searching for potential dimerization, and checking
the similarity of oligo candidates to non-targets. The outline of the algorithm is as
follows:
• Identifying average Tm and GC-content: The algorithm calculates the Tm by
employing the NN model with the parameters from SantaLucia (SantaLucia,
1998). Then, it computes the average Tm for all oligos of the specified length.
The user determines the range of acceptable Tm’s. Also, YODA computes the
average GC-content for all oligo candidates of the specified length.
• Prohibited sequences and contiguous identities: Oligo candidates are checked
for the existence of any prohibited region or any long stretch of consecutive
identities with non-target sequences. Also, the user can determine prohibited
sequences, such as Poly-X (e.g. AAAAA or TTTTT) sequences, to be avoided
in oligos.
• Filter for Tm: All oligo candidates passing the previous steps are checked for
Tm. An oligo candidate is rejected if its Tm is not in the range specified by
the user.
• Filter for secondary structure: Oligo candidates are examined for potential
stem-loop structures by searching for short stretches of complementary
sequences, called stems, that are separated by a few bases (the loop). Remaining
candidates for each sequence are stored in a temporary file for later sorting and
validation.
• Sort and validate candidates: Each target sequence can have several oligo
candidates. Selecting the final oligo, or set of oligos, for each sequence is performed
by Probe Sorter procedures. Oligo candidates selected by a Probe Sorter are
considered for final validation, which checks the oligo for potential dimerization
and sequence similarity to non-target sequences.
Chapter 4
Our proposed algorithm
In this chapter, we introduce and explain all details of our proposed algorithm for
designing and selecting the best set of oligonucleotide probes for a given set of se-
quences.
4.1 Motivation
As we have seen in the previous chapter, several algorithms and programs exist for
oligo design; they differ in the parameters and criteria they consider and in how they
identify and select the optimal oligonucleotides. To design the optimum set of
oligonucleotides, experts in this area (Lockhart et al., 1996; Kane et al., 2000) have
proposed criteria and parameters related to the specificity, sensitivity, and uniformity
of oligo probes, such as: similarity of an oligonucleotide to non-target
sequences, identity or maximum match of an oligonucleotide to non-target sequences,
low-complexity regions, GC-content, secondary structure, and melting temperature.
The first two parameters are heavily concerned with cross-hybridization evaluation
and are considered the most important of the suggested criteria for designing
the optimum specific set of oligonucleotides. Of the above parameters
and criteria, maximum consecutive match, sequence identity, GC-content, and
low-complexity regions are related to the specificity of oligos, secondary structure is
related to their sensitivity, and the melting temperature is related to the uniformity
of the set of oligos. Table 4.1 summarizes how each algorithm deals with the
parameters involved in the oligo design task.
Table 4.1: Summary of the ways that related algorithms assess the parameters involved in oligo design.
| Algorithm | Cross-hybridization | GC-content | Tm | Low-complexity | Secondary structure |
|---|---|---|---|---|---|
| ArrayOligo | Thermodynamics and BLAST | User-defined | - | LZW | Self-complementarity |
| GoArray | BLAST | - | NN | Prohibited sequence | Mfold |
| OligoArray | Thermodynamics and BLAST | User-defined | NN | Masking repeats | Mfold |
| OligoPicker | BLAST | - | G+C | DUST | Self-complementarity |
| OligoWiz | Thermodynamics and BLAST | - | NN | Custom scoring | - |
| PICKY | Suffix array and thermodynamics | User-defined | NN | Suffix array | Suffix array |
| ProbeSel | Suffix array and thermodynamics | - | NN | - | Mfold |
| ProbeSelect | Suffix array and thermodynamics | - | NN | Masking repeats | Self-complementarity |
| ProDesign | Homology search based on seeds | User-defined | NN | - | Mfold |
| ProMide | Suffix array and thermodynamics | - | NN | - | - |
| ROSO | BLAST | User-defined | NN | Masking repeats | Thermodynamics |
| YODA | Custom method | User-defined | NN | Prohibited sequence | Self-complementarity |
We are interested in designing and selecting the set of oligos so that each oligo is
specific to its target sequence to achieve the highest specificity, each oligo is sensitive
in order to detect the target sequence to reach the highest sensitivity, and the set
of oligos is uniform under the same experimental conditions to obtain the maximum
uniformity.
As shown in Table 4.1, most algorithms rely heavily on BLAST for cross-hybridization
assessment, and others rely on suffix arrays and suffix trees.
BLAST uses the simple strategy of finding short consecutive seed hits, which are then
extended into longer local alignments. This approach to similarity search has a key
tradeoff: increasing the seed size decreases sensitivity, while decreasing the seed size
increases the running time. Therefore, BLAST-based algorithms have limited sensitivity
for cross-hybridization assessment because they use consecutive seeds. On the
other hand, suffix trees and suffix arrays, as mentioned before, are well-known data
structures for exact pattern matching and text searching, but they do not work well
for approximate text searching. Therefore, algorithms and programs that use
suffix arrays or suffix trees as the core of their similarity search for cross-hybridization
assessment are neither efficient nor sensitive for this task.
Due to the limitations of current algorithms and programs and according to the
above discussion, there is a clear need for designing an efficient algorithm that is very
accurate in terms of specificity, sensitivity, and uniformity as well as running very
fast on large input data sets. This need is the most important motivation for us to
perform this research work.
4.2 General description of the problem
The general description of the problem is illustrated in Figure 4.1.
[Figure: input DNA sequences in fasta format (for example, a record titled ">ref|NC_002655.2|:190-273" followed by its bases) are passed to the oligo design algorithm, which outputs the best set of oligos, one per record (for example, "50, 81.42, ref|NC_002655.2|:1081779-1082243" followed by the oligo sequence).]
Figure 4.1: General description of the problem.
As input, we have a set of DNA or RNA sequences, generally in fasta
format. This sequence format carries a minimal amount of information. In a fasta file,
a ‘>’ sign at the beginning of a line denotes the beginning of a new sequence and is
followed by a phrase that specifies the sequence title (see Figure 4.1). The sequence
itself follows immediately. No other information is stored within a fasta
file. The output of the problem is the best set of oligonucleotide probes, which
can be in any format, including the fasta format of the input sequences. The goal
is to design an efficient algorithm for extracting, from the input DNA/RNA sequences,
the best set of oligos with maximum specificity, sensitivity, and uniformity
according to the mentioned parameters and criteria.
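The fasta layout described above can be read with a few lines of code. The sketch below is an illustration under simple assumptions (plain text lines, titles introduced by ‘>’, sequence data possibly split over several lines). The function name `read_fasta` is hypothetical and is not part of the proposed algorithm.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs from an iterable of fasta lines."""
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)  # emit the previous record
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.upper())  # sequence data may span several lines
    if title is not None:
        yield title, "".join(chunks)

# Hypothetical toy records, given here as a list of lines instead of a file.
example = [">seq1 first", "ACGT", "acgt", "", ">seq2", "GG"]
records = list(read_fasta(example))
```

Because the function accepts any iterable of lines, an open file object can be passed in the same way as the list used here.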
4.3 The outline of our algorithm
The proposed algorithm starts by encoding the input sequences. To reach the maximum
specificity for the designed oligo probes, it considers both the sequence identity
percentage and the maximum consecutive match parameters for cross-hybridization
assessment, as well as the GC-content. To perform the homology
search for cross-hybridization, our algorithm looks for similar regions in the input
sequences using multiple spaced seeds and hashing. Then, GC-content evaluation is
performed to complete the specificity assessment. Next, uniformity is obtained by
melting temperature management. After that, the sensitivity of the oligo probes
is evaluated by self-annealing and secondary structure assessment. At the end,
an intensive homology search is performed to reach the maximum specificity. The
outline of our proposed algorithm for designing and selecting the best set of oligo
probes is as follows:
• Encoding the input sequences
• Light cross-hybridization assessment
• GC-content evaluation
• Melting temperature management
• Secondary structure assessment
• Intensive cross-hybridization assessment
• Final oligo selection
Each step of the algorithm will be explained in detail in the following subsections.
4.4 Encoding the input sequences
We denote the set of input sequences by $G = \{g_1, g_2, \ldots, g_k\}$, where $g_i$ denotes the $i$th
sequence and is a string over the alphabet $\Sigma = \{A, C, G, T\}$. After reading the set
of input sequences, all sequences are merged into a single vector $V = g_1 g_2 \ldots g_k$. To
encode the input vector of all sequences, we use the following table to represent each
nucleotide base:
Table 4.2: Encoding of input DNA sequences
| Nucleotide base | Coded representation |
|---|---|
| A | 00 |
| C | 01 |
| G | 10 |
| T | 11 |
Using the above coding, the amount of memory is reduced to a quarter of the
original size, because the coded representation requires only two bits per base
whereas the regular representation uses one byte per nucleotide base. This is very
helpful when dealing with large genomes such as the human genome, whose size is
about 3 GB in the regular representation and 0.75 GB in the coded representation.
Another advantage of this encoding is that it exploits the bit-parallel nature of
computer words. We can view the vector of input sequences as blocks
of computer words of 32 or 64 bits, which correspond to 16 or 32 nucleotide bases,
respectively. Therefore, a single word operation compares $W/2$ bases at once, which
speeds up the comparison of bases by a factor of $W/2$, where $W$ is the length of the
computer word.
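The two-bit coding of Table 4.2 and the word-level comparison it enables can be sketched as follows. This is only an illustration of the bit logic: Python integers are used in place of fixed 32-bit or 64-bit machine words, so the memory saving itself is not reproduced, and the names `pack`, `unpack`, and `count_mismatches` are hypothetical.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # Table 4.2
BASES = "ACGT"

def pack(seq):
    """Pack a DNA string into one integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def unpack(value, length):
    """Recover the DNA string of the given length from a packed integer."""
    return "".join(
        BASES[(value >> (2 * (length - 1 - k))) & 0b11] for k in range(length)
    )

def count_mismatches(x, y, length):
    """Count differing bases between two packed words of `length` > 0 bases."""
    diff = x ^ y  # a base differs iff at least one of its two bits differs
    low_bits = int("01" * length, 2)  # selects the low bit of every base
    # Fold the high bit of each pair onto its low bit, then count the pairs.
    return bin((diff | (diff >> 1)) & low_bits).count("1")
```

A single XOR followed by a bit count compares all bases held in the two words at once, which is the source of the $W/2$ speed-up mentioned above.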
4.5 Cross-hybridization assessment
Generally, the cross-hybridization assessment is carried out by performing a homology
search. The goal of a homology search is to find the similar regions between sequences.
As we have seen in the second chapter, there are several ways to do a homology search
such as: applying dynamic programming approaches, e.g. Smith-Waterman (Smith
and Waterman, 1981), using heuristic approaches, e.g. FASTA (Lipman and Pearson,
1985) and BLAST (Altschul et al., 1990), or utilizing text searching and indexing data
structures, e.g. suffix trees and suffix arrays.
Because dynamic programming approaches for homology search have quadratic
time complexity, they are slow when very long sequences, such as biological
sequences, have to be compared. Exact occurrences of a pattern in a sequence or text
can be found in time linear in the size of the text, but for long sequences even this
is too much, so a search time linear in the length of the pattern is desired. To
achieve this, an index such as a suffix tree or suffix array can be built on the
long sequence or text. Constructing this index takes linear time in
the sequence size, but the index can be reused many times, which provides
very fast searching throughout the indexed sequence.
Usually, in biological applications we are interested in looking for approximate
matches between genomic or proteomic sequences rather than finding exact matches.
This is because biological mechanisms allow for errors, so
our search algorithms should tolerate errors or mismatches in order to provide
meaningful results. However, there are no approximate indexes that are as time and
space efficient as in the exact case. Therefore, suffix trees and suffix arrays may not
be appropriate choices for homology search.
Several techniques have been proposed to handle approximate indexes. Using
multiple spaced seeds is the best-known approach to construct indexes of texts and
sequences. Here, the indexes are hash tables constructed for multiple spaced seeds
that are explained in detail below and that will be considered as the heart of the
homology search procedure for oligo probe design in our proposed method.
We now explain in detail why multiple spaced seeds are so useful for approximate
sequence searching. Consider a spaced seed $s$ of length $l$ and weight $w$, which is the
number of 1’s. Also, consider a random region $R$ of length $N$ and similarity level
$p$, as in the definition of sensitivity. The expected number of hits that
$s$ has in $R$ is $(N - l + 1)p^w$, since there are $N - l + 1$ places where $s$ can hit and
each has probability $p^w$. If the weight of the spaced seed is increased by 1, then
the expected number of hits is multiplied by $p$. Assuming that the
four bases A, C, G, and T appear with equal probability, $p = 1/4$ for unrelated
sequences, so the expected number of hits is only a quarter of the previous one. Fewer
hits means fewer wasted ones, that is, fewer false positives and therefore increased
specificity. However, increasing the weight of a seed also decreases the true positives,
and therefore the sensitivity. In order to increase both, we can increase not only the
weight but also the number of seeds. It turns out that doubling the number of seeds
provides slightly better sensitivity. But doubling the number of seeds only increases
the expected hits by a factor of two, whereas increasing the weight by one reduces
them to a quarter. Essentially, this is the main reason why multiple spaced seeds are
so good. It should be mentioned that a higher number of seeds requires more memory
to store more hash tables, which imposes an upper bound on the number of seeds that
can be used.
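The arithmetic in this argument can be checked with a small sketch. The function names are hypothetical, and the region length of 64 and the second seed (the first seed with one extra 1, so that the length stays the same) are assumptions chosen only to make the numbers easy to follow.

```python
def expected_hits(seed, region_length, p):
    """Expected number of hits of one seed: (N - l + 1) * p**w."""
    length, weight = len(seed), seed.count("1")
    return (region_length - length + 1) * p ** weight

def expected_hits_multi(seeds, region_length, p):
    """Expected hits of a set of seeds (sum over seeds, by linearity)."""
    return sum(expected_hits(s, region_length, p) for s in seeds)

seed_w10 = "111*1**11*1*111"  # weight 10, length 15
seed_w11 = "111*1**11*11111"  # same length, weight increased by one

one_seed = expected_hits(seed_w10, 64, 0.25)
heavier = expected_hits(seed_w11, 64, 0.25)
two_seeds = expected_hits_multi([seed_w10, seed_w10], 64, 0.25)
```

With $p = 1/4$, `one_seed / heavier` equals 4 and `two_seeds / one_seed` equals 2. Raising the weight by one while doubling the number of seeds therefore halves the expected number of random hits overall.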
4.5.1 Multiple spaced seeds for homology search
To perform an efficient homology search, we require efficient multiple seeds but com-
puting good multiple spaced seeds is a hard problem. There are several algorithms
that compute good multiple spaced seeds but the only fast algorithm available is due
to (Ilie et al., 2011; Ilie and Ilie, 2007). Moreover, this algorithm computes the most
sensitive multiple spaced seeds.
Now, we explain in detail the mentioned algorithm. Finding optimal multiple
spaced seeds is NP-hard but even finding good ones is very difficult. Computing
optimal multiple spaced seeds by exhaustive search is infeasible because of two expo-
nential steps in this process:
• There exist exponentially many spaced seeds to be checked and evaluated based
on their sensitivity
• Computing the sensitivity of each spaced seed has exponential time complexity
as well.
In order to compute multiple spaced seeds in an acceptable amount of
time, some algorithms have been proposed that handle the exponential nature of the
steps either by reducing the number of spaced seeds to be evaluated or by
approximating the sensitivity of each spaced seed. However, the only available
polynomial-time algorithm is that of (Ilie et al., 2011; Ilie and Ilie, 2007), and the
difference in running time has remarkable consequences in practice: where other
algorithms take several days to compute multiple spaced seeds, the Ilies’ algorithm
takes only several seconds. In addition, the quality of generated multiple spaced seeds
is not affected by this reduction in time. On the contrary, the obtained multiple
spaced seeds have higher sensitivity than those of all the other algorithms. Hence,
not only was it the best choice for our homology search purpose but it also was the
only possibility to compute the high number of large sets of seeds required for the
cross-hybridization evaluations.
4.5.2 Overlap complexity
The algorithm relies on a measure called overlap complexity (OC). This
measure is well correlated with sensitivity but is much easier to compute: it takes
polynomial time instead of the exponential time required for sensitivity. Therefore,
it can be used in computations instead of sensitivity, which yields a polynomial-time
algorithm for computing multiple spaced seeds. The idea
behind overlap complexity is as follows. Some spaced seeds may have more
overlapping hits than others even though the expected number of hits of spaced seeds
of the same weight is the same. For instance, consider the following spaced seeds:
“11111111111” and “111*1**1*1**11*111”. As we see in Figure 4.2, if 11111111111
has a hit, then it will have another one shifted by one position if the next position
is a match, which happens with probability $p$, the similarity level. On the other hand,
the spaced seed 111*1**1*1**11*111 needs six new matches in order to have the same
additional hit. This occurs with probability $p^6$, which is noticeably lower. In general,
the difference is even larger for seeds of higher weight.
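The counts of one and six new matches in this example can be verified mechanically. The sketch below uses a hypothetical helper, `new_matches_for_shift`, which counts the 1-positions of the shifted seed that are not already covered by a 1 of the original hit.

```python
def new_matches_for_shift(seed, shift=1):
    """Number of additional match positions needed for a hit shifted by `shift`."""
    ones = {k for k, c in enumerate(seed) if c == "1"}
    shifted = {k + shift for k in ones}
    # Positions already matched by the first hit come for free.
    return len(shifted - ones)

consecutive = new_matches_for_shift("11111111111")
spaced = new_matches_for_shift("111*1**1*1**11*111")
```

The probability of the additional overlapping hit is then $p$ raised to the returned count, that is, $p$ for the consecutive seed and $p^6$ for the spaced seed.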
a.  CGTCAAGACTT?
    |||||||||||?
    CGTCAAGACTT?
    11111111111
     11111111111

b.  GAG?C??T?G??AC?TTC?
    |||?|??|?|??||?|||?
    GAG?C??T?G??AC?TTC?
    111*1**1*1**11*111
     111*1**1*1**11*111

Figure 4.2: Possibility of overlapping hits for a. consecutive seed. b. spaced seed.
What this means is that the hits of the seed that overlaps less will be more
uniformly distributed and hence able to hit more alignments, which means higher
sensitivity. In other words, high sensitivity is obtained by a low number of overlapping
hits. The complexity measure introduced to replace sensitivity, overlap complexity
(OC), is therefore defined as follows. Suppose we have two spaced seeds $s_1$ and $s_2$.
The overlap complexity of the two spaced seeds, denoted by $OC(s_1, s_2)$, is formally
defined as:

$$OC(s_1, s_2) = \sum_{i=1-|s_2|}^{|s_1|-1} 2^{\sigma[i]} \tag{4.1}$$

where $\sigma[i]$ denotes the number of pairs of matched 1’s between $s_1$ and $s_2$ when $s_2$ is
shifted by $i$ positions. The values of $i$ range from $1 - |s_2|$ to $|s_1| - 1$, where a
negative shift indicates that $s_2$ starts first. To compute the value of $\sigma[i]$, two padded
strings $t_1$ and $t_{2,i}$ are defined as follows:

$$t_1 = *^{|s_2|-1}\, s_1\, *^{|s_2|-1}, \qquad t_{2,i} = *^{|s_2|-1+i}\, s_2\, *^{|s_1|-i-1}, \quad \text{for } 1 - |s_2| \le i \le |s_1| - 1 \tag{4.2}$$

Then,

$$\sigma[i] = \mathrm{card}\{\, j : 1 \le j \le |s_1| + 2|s_2| - 2,\ t_1[j] = t_{2,i}[j] = 1 \,\} \tag{4.3}$$
To illustrate the overlap complexity of two spaced seeds, consider the
following example. For the two spaced seeds $s_1 = \texttt{1*11}$ and $s_2 = \texttt{1*1}$, we have
$t_1 = \texttt{**1*11**}$ and $t_{2,i} = *^{2+i}\,\texttt{1*1}\,*^{3-i}$ for $-2 \le i \le 3$. Figure 4.3 illustrates
the alignment of $s_1$ against copies of $s_2$ shifted by $i$ positions, where $i$ lies in the
mentioned range.

                      σ[i]
    t1:     **1*11**
    t2,-2:  1*1*****   1
    t2,-1:  *1*1****   0
    t2,0:   **1*1***   2
    t2,1:   ***1*1**   1
    t2,2:   ****1*1*   1
    t2,3:   *****1*1   1

Figure 4.3: Example of computing overlap complexity for two spaced seeds.
Therefore, $OC(\texttt{1*11}, \texttt{1*1}) = \sum_{i=-2}^{3} 2^{\sigma[i]} = 2^1 + 2^0 + 2^2 + 2^1 + 2^1 + 2^1 = 13$.
The OC is symmetric, which means $OC(s_1, s_2) = OC(s_2, s_1)$ for any pair of spaced
seeds $s_1$ and $s_2$. For a set of multiple spaced seeds $S = \{s_1, s_2, \ldots, s_k\}$ the overlap
complexity is defined as:

$$OC(S) = \sum_{1 \le i \le j \le k} OC(s_i, s_j) \tag{4.4}$$

where $OC(S)$ is the sum of the overlap complexities of each two seeds in the set.
Ilie and Ilie (2007) show experimentally that overlap complexity is very
well correlated with sensitivity for single spaced seeds. For multiple spaced seeds,
sensitivity and overlap complexity cannot be compared in the same way because
exhaustive search is infeasible in this case.
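Equations (4.1) to (4.4) translate directly into code. The following sketch is an illustration written from the definitions above and is not the Ilies’ implementation; the function names are hypothetical. Instead of building the padded strings $t_1$ and $t_{2,i}$, it simply ignores positions that fall outside $s_1$, which gives the same count.

```python
from itertools import combinations_with_replacement

def sigma(s1, s2, i):
    """Number of aligned pairs of 1s when s2 is shifted by i positions against s1."""
    return sum(
        1
        for j, c in enumerate(s2)
        if c == "1" and 0 <= i + j < len(s1) and s1[i + j] == "1"
    )

def overlap_complexity(s1, s2):
    """OC(s1, s2) as in Equation (4.1)."""
    return sum(2 ** sigma(s1, s2, i) for i in range(1 - len(s2), len(s1)))

def overlap_complexity_set(seeds):
    """OC(S) as in Equation (4.4): sum over all pairs with i <= j."""
    return sum(
        overlap_complexity(a, b)
        for a, b in combinations_with_replacement(seeds, 2)
    )

shifts = [sigma("1*11", "1*1", i) for i in range(-2, 4)]
example_oc = overlap_complexity("1*11", "1*1")
```

For the seeds of Figure 4.3, `shifts` reproduces the column of $\sigma[i]$ values and `example_oc` reproduces the value 13.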
The algorithm for computing the set of optimal multiple spaced seeds based on the
concept of overlap complexity handles the two exponential steps as follows:
• Sensitivity: the exponential-time computation of sensitivity is replaced by the
polynomial-time computation of overlap complexity.
• Number of spaced seeds: there are two issues regarding the number of spaced
seeds. First, there are exponentially many possible choices for the lengths of the
seeds; this is handled by guessing a set of good lengths. Second, there are
exponentially many possible seeds for the fixed lengths; this is handled by repeatedly
swapping a 1 with a * as long as the overlap complexity improves, each time greedily
selecting the swap that yields the largest improvement. The number of swaps in each
seed is bounded by the weight of the seed.
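The greedy swapping step can be sketched as follows. This is a loose illustration of the idea, not the Ilies’ algorithm: the seed lengths are taken as given, the first and last 1 of every seed are kept in place (an assumption), the bound on the number of swaps per seed is not enforced, and all function names are hypothetical. A lower overlap complexity is treated as an improvement.

```python
from itertools import combinations_with_replacement

def sigma(s1, s2, i):
    """Number of aligned pairs of 1s when s2 is shifted by i against s1."""
    return sum(1 for j, c in enumerate(s2)
               if c == "1" and 0 <= i + j < len(s1) and s1[i + j] == "1")

def oc_pair(s1, s2):
    return sum(2 ** sigma(s1, s2, i) for i in range(1 - len(s2), len(s1)))

def oc_set(seeds):
    return sum(oc_pair(a, b) for a, b in combinations_with_replacement(seeds, 2))

def swaps(seed):
    """Yield every seed obtained by moving one interior 1 onto an interior *."""
    inner = range(1, len(seed) - 1)  # assumption: both end positions stay 1
    for a in inner:
        for b in inner:
            if seed[a] == "1" and seed[b] == "*":
                chars = list(seed)
                chars[a], chars[b] = "*", "1"
                yield "".join(chars)

def greedy_improve(seeds):
    """Apply the best single swap repeatedly while the overlap complexity drops."""
    seeds = list(seeds)
    best = oc_set(seeds)
    while True:
        best_trial = None
        for k, seed in enumerate(seeds):
            for candidate in swaps(seed):
                trial = seeds[:k] + [candidate] + seeds[k + 1:]
                value = oc_set(trial)
                if value < best:  # strictly better, so the loop terminates
                    best, best_trial = value, trial
        if best_trial is None:
            return seeds, best
        seeds = best_trial

start = ["1111***1", "11111**1"]  # hypothetical starting seeds
improved, improved_oc = greedy_improve(start)
```

Each accepted swap preserves the length and weight of the seed, so only the arrangement of the 1's changes.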
Various sets of variable-length multiple spaced seeds of different weights have been
generated for our oligonucleotide probe design algorithm using the Ilies’ algorithm
discussed above.
Cross-hybridization assessment in the proposed method for oligo probe design is
performed in two homology search phases: fast and intensive. In the fast phase,
the proposed algorithm quickly eliminates as many non-candidate oligo positions
as it can. This phase is essential for reducing the work of the intensive homology
search, especially for large input data sets of target sequences such as the human
genome. The intensive phase is performed at the end of the algorithm to make sure
that the final oligo probe candidates have been fully checked and verified against all
other positions for specificity. Both phases are described in detail below.
4.5.3 Fast homology search
To perform the fast homology search phase for cross-hybridization assessment, we use
the set of eight multiple spaced seeds of weight 10 in Figure 4.4.
111*1**11*1*111
11**1*1**1***1***11*11
1111***1***1****11*1*1
11**11****1*1***1*11*1
111**1***1**1**1****111
11*1**1*******1*1**1***111
11*1***1****1*****1***1**111
11*1*1*****1*****1*****11*11
Figure 4.4: The set of multiple spaced seeds used in the fast homology search phase.
For each of the eight spaced seeds, a hash table is used that handles collisions
by linear probing. The search proceeds as follows. The algorithm starts from
the right end of the input vector and scans the vector to the left end. At each
position, it computes the integer value that corresponds to the position under the
seed model, so that finding hits is equivalent to finding equal integer values. The
algorithm then looks up this integer value in the hash table. If the value is not in
the table, it is inserted together with its position. If the value is already in the
table, which indicates a hit, the algorithm extends the two positions of the hit to
the left and to the right up to the predefined oligo length and checks the sequence
identity parameter. If the similarity level of the two extended regions exceeds the
predefined threshold, both positions are eliminated and marked as non-candidates,
and the process continues by sliding the extended regions one position to the left
or right until the threshold condition is satisfied. The default value of this
threshold is 75%. Figure 4.5 shows how the fast homology search phase works. This
process is repeated for each spaced seed.
[Figure: the seed 111*1**11*1*111 is slid over the input vector; the integer key computed at position i (990925075 in the example) is stored in a hash table with columns Key and Pos, and another position j that yields the same key is reported as a hit.]
Figure 4.5: Fast homology search.
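A simplified sketch of the fast phase for a single seed is given below. Several points are assumptions made for illustration: a Python dictionary replaces the hash table with linear probing, the input contains only A, C, G, and T, the oligo length is at least the seed length, and every oligo window that contains the seed hit is tested instead of sliding until the threshold condition holds. The function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def seed_key(seq, pos, seed):
    """Integer value of the bases under the seed's 1s, two bits per base."""
    key = 0
    for k, c in enumerate(seed):
        if c == "1":
            key = (key << 2) | CODE[seq[pos + k]]
    return key

def identity(seq, a, b, length):
    """Fraction of identical bases between the windows starting at a and b."""
    return sum(seq[a + k] == seq[b + k] for k in range(length)) / length

def fast_search(seq, seed, oligo_len, threshold=0.75):
    """Return the oligo start positions eliminated as non-candidates."""
    n = len(seq)
    last_seen = {}      # key -> most recently seen position (one per key)
    eliminated = set()
    for p in range(n - len(seed), -1, -1):  # scan from right to left
        key = seed_key(seq, p, seed)
        if key in last_seen:                # equal integer values: a hit
            shift = last_seen[key] - p      # the stored position lies to the right
            lo = max(0, p + len(seed) - oligo_len)
            hi = min(p, n - oligo_len - shift)
            for a in range(lo, hi + 1):     # oligo windows containing the hit
                if identity(seq, a, a + shift, oligo_len) > threshold:
                    eliminated.update((a, a + shift))
        last_seen[key] = p                  # keep only the last seen position
    return eliminated

# Hypothetical toy input: an 8-base region repeated after a short spacer.
repeated = "ACGTGCTA" + "TTTT" + "ACGTGCTA"
hits = fast_search(repeated, "11*1", 8)
```

In this toy input both copies of the repeated region, at positions 0 and 12, are eliminated. Running the function once per seed of Figure 4.4 and taking the union of the results corresponds to repeating the process for each spaced seed.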
To check the maximum consecutive match parameter, that is, stretches of identical
bases, the same technique is applied using a consecutive seed whose length equals the
threshold for the stretch length, which is 15 by default. All oligo positions
containing the stretches found as hits are then eliminated and marked
as non-candidates.
4.5.4 Intensive homology search
To perform the intensive homology search phase, first all hash tables are created based
on the set of eight spaced seeds of weight nine in Figure 4.6.
111*1***11*1*11
11***1**1****1***1*111
1*11**1*1***1****1**11
1*1*11****11******1*11
11*1******1***1*1***111
11***1*1*******1**1*1*11
11*1*****1**1******1****111
111**1******1****1*****1**11
Figure 4.6: The set of multiple spaced seeds used in intensive homology search phase.
The hash tables are created by starting from the right end of the input vector
and scanning to the left end, calculating the integer value for each position based on
the seed model, and inserting the integer value as a hash key, together with its
corresponding position, into the hash table. If the hash key is already in the hash
table, the algorithm only appends the corresponding position to the array of positions
for that hash key.
After creating all hash tables, the algorithm thoroughly checks the specificity of
each oligo candidate position against all other possible positions in the hash tables.
In other words, the algorithm first computes the integer value of the oligo candidate
position and then searches for this value in the hash tables. If the value is found,
all positions stored for that hash key are considered for extension and
Kane’s similarity check.
The difference between the fast and intensive phases is that for the fast phase
we keep only one position for each hash key that is the last seen position of a hit in
screening from right to left, while in the intensive phase we keep all positions and
their corresponding hash values in the hash table.
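The difference between the two phases can be illustrated with a small sketch. It assumes the 2-bit base coding used by the algorithm (A:00, C:01, G:10, T:11) and, as a simplification, packs only the bases under the seed's 1 positions into the key; the exact key layout of the thesis implementation is not given here, and the names `seed_key` and `build_index` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # A:00, C:01, G:10, T:11


def seed_key(seq, pos, seed):
    """Integer key of the bases under the seed's '1' positions, 2 bits per base.

    Returns None when the seed does not fit at pos or covers a non-ACGT base.
    """
    if pos + len(seed) > len(seq):
        return None
    key = 0
    for offset, flag in enumerate(seed):
        if flag == "1":
            code = CODE.get(seq[pos + offset].upper())
            if code is None:
                return None
            key = (key << 2) | code
    return key


def build_index(seq, seed, keep_all):
    """Scan right to left and index every position by its seed key.

    keep_all=False (fast phase): one position per key, the last one seen.
    keep_all=True (intensive phase): every position is kept for each key.
    """
    index = {}
    for pos in range(len(seq) - len(seed), -1, -1):
        key = seed_key(seq, pos, seed)
        if key is None:
            continue
        if keep_all:
            index.setdefault(key, []).append(pos)
        else:
            index[key] = pos  # overwrite with the most recently seen position
    return index
```

Keeping a single position per key makes the fast table small and its lookups constant-size, whereas the intensive table grows with the number of positions, which is consistent with the higher memory usage reported for the intensive phase.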
4.6 GC-content evaluation
GC-content is the percentage of guanine (G) and cytosine (C) in a DNA sequence. For
the GC-content evaluation, the proposed algorithm considers a user-defined range that
is determined by minimum and maximum GC-content thresholds. In the proposed
algorithm, the default values for minimum and maximum GC-content thresholds are
30% and 70%, respectively.
[Figure: flowchart applied to each oligo window of the input vector. If 30 ≤ GC-content ≤ 70, the position is kept as a candidate (✓); otherwise it is eliminated (✗).]
Figure 4.7: GC-content evaluation process.
As shown in Figure 4.7, GC-content evaluation scans the input vector from its right end to its left end, computes the percentage of G and C at each oligo candidate position, and eliminates the positions whose GC-content percentage is outside the predefined range.
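A minimal sketch of this filter is given below. It assumes a sliding window of the oligo length that is updated incrementally while scanning from right to left; the function name `gc_candidates` and its parameter names are illustrative and not taken from the thesis code.

```python
def gc_candidates(seq, oligo_len=50, min_gc=30.0, max_gc=70.0):
    """Return flags[p] = True when the oligo starting at p has a GC percentage
    within [min_gc, max_gc]."""
    n = len(seq)
    if n < oligo_len:
        return []
    is_gc = [1 if base in "GCgc" else 0 for base in seq]
    flags = [False] * (n - oligo_len + 1)
    count = sum(is_gc[n - oligo_len:])  # GC count of the rightmost window
    for p in range(n - oligo_len, -1, -1):
        if p < n - oligo_len:
            # Slide the window one base to the left.
            count += is_gc[p] - is_gc[p + oligo_len]
        percent = 100.0 * count / oligo_len
        flags[p] = min_gc <= percent <= max_gc
    return flags
```

Updating the count incrementally keeps the whole pass linear in the length of the input vector, independent of the oligo length.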
4.7 Melting temperature management
The melting temperature, Tm, of an oligo duplex is the temperature at which the
oligo is 50% bound to its complement. This means that 50% of the molecules are
single-stranded while 50% of the molecules are in the double-stranded form.
To achieve maximum uniformity for the set of oligonucleotides, the melting temperatures of the oligonucleotides must differ only slightly. Several approaches are available for computing the melting temperature. The most commonly used approach is the Nearest Neighbor model with the parameters of either SantaLucia (1998) or Rychlik et al. (1990).
The proposed algorithm employs the following formula to calculate the Tm:
$$T_m = \frac{\Delta H}{\Delta S + R \ln(C/4)} - 273.15 + 12 \log[\mathrm{Na}^+] \qquad (4.5)$$
where ∆H (kcal/mol), the enthalpy, is the total energy exchanged between the system and its surrounding environment; ∆S (cal/(mol·K)), the entropy, denotes the energy spent by the system to organize itself; R = 1.987 cal/(mol·K) is the gas constant; C is the molar concentration of the oligo; and [Na+] is the molar concentration of salt. Because ∆S and R are expressed in cal/(mol·K), ∆H must be converted from kcal/mol to cal/mol (multiplied by 1000) before it is substituted into Equation 4.5.
∆H and ∆S are obtained from the Nearest Neighbor model and the thermodynamic parameters summarized in Table 4.3. Each is the sum over all nearest-neighbor stacks (doublets) in the duplex plus the end interactions, that is:
$$\Delta H = \Delta H_{end} + \sum_{ij} N_{ij}\,\Delta H_{ij}, \qquad \Delta S = \Delta S_{end} + \sum_{ij} N_{ij}\,\Delta S_{ij} \qquad (4.6)$$
where Nij is the number of times the specific nearest-neighbor stack occurs in the
duplex sequence.
Table 4.3: Nearest-Neighbor parameters for DNA/DNA duplexes (SantaLucia, 1998).
Stack (5′3′/3′5′)      ∆H (kcal/mol)   ∆S (cal/(mol·K))
AA/TT                  -7.9            -22.2
AT/TA                  -7.2            -20.4
TA/AT                  -7.2            -21.3
CA/GT                  -8.5            -22.7
GT/CA                  -8.4            -22.4
CT/GA                  -7.8            -21.0
GA/CT                  -8.2            -22.2
CG/GC                  -10.6           -27.2
GC/CG                  -9.8            -24.4
GG/CC                  -8.0            -19.9
Init. w/term. G/C      0.1             -2.8
Init. w/term. A/T      2.3             4.1
Symmetry correction    0               -1.4
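As an illustration of Equations 4.5 and 4.6, the following sketch computes Tm from the parameters of Table 4.3. Several points are assumptions rather than statements about the thesis code: the logarithm of the salt concentration is taken to be base 10, ∆H is converted from kcal/mol to cal/mol, one initiation term is added per terminal base pair, the symmetry correction is applied to self-complementary oligos, and the names `nn_sums` and `melting_temperature` are hypothetical. The input is expected to contain only uppercase A, C, G, and T.

```python
import math

# Nearest-neighbor parameters from Table 4.3: (dH in kcal/mol, dS in cal/(mol*K)).
NN = {
    "AA": (-7.9, -22.2), "AT": (-7.2, -20.4), "TA": (-7.2, -21.3),
    "CA": (-8.5, -22.7), "GT": (-8.4, -22.4), "CT": (-7.8, -21.0),
    "GA": (-8.2, -22.2), "CG": (-10.6, -27.2), "GC": (-9.8, -24.4),
    "GG": (-8.0, -19.9),
}
INIT_GC = (0.1, -2.8)   # initiation with a terminal G/C pair
INIT_AT = (2.3, 4.1)    # initiation with a terminal A/T pair
SYMMETRY_DS = -1.4      # entropy correction for self-complementary duplexes
R = 1.987               # gas constant in cal/(mol*K)
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def nn_sums(seq):
    """Return (dH in kcal/mol, dS in cal/(mol*K)) following Equation 4.6."""
    dh = ds = 0.0
    for end in (seq[0], seq[-1]):            # end interactions
        h, s = INIT_GC if end in "GC" else INIT_AT
        dh += h
        ds += s
    for i in range(len(seq) - 1):            # nearest-neighbor stacks
        pair = seq[i:i + 2]
        # A stack missing from the table is listed under its reverse complement.
        h, s = NN[pair] if pair in NN else NN[revcomp(pair)]
        dh += h
        ds += s
    if seq == revcomp(seq):
        ds += SYMMETRY_DS
    return dh, ds


def melting_temperature(seq, oligo_conc=1e-6, salt_conc=0.075):
    """Tm in degrees Celsius following Equation 4.5 (concentrations in mol/L)."""
    dh, ds = nn_sums(seq)
    kelvin = (dh * 1000.0) / (ds + R * math.log(oligo_conc / 4.0))
    return kelvin - 273.15 + 12.0 * math.log10(salt_conc)
```

The default concentrations correspond to the 1 µM oligo and 75 mM salt settings used as defaults in the experiments.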
After calculating the melting temperature of the remaining oligo candidate positions, the proposed algorithm maximizes the uniformity of the set of oligonucleotides as follows. Given a user-defined length for the Tm interval of the oligos (10 degrees by default), the algorithm searches for the optimum interval, that is, the one in which the maximum number of target sequences are covered. To do this, the algorithm keeps a record of all possible melting temperatures of each target sequence in all possible intervals of the predefined length and then picks the interval that includes the highest number of target sequences. Finally, oligo positions whose melting temperatures are not in this interval are eliminated as non-candidate positions.
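The interval selection can be sketched with a simple brute-force search. This is not the bookkeeping described above but an equivalent illustration under one observation: an optimal closed interval can always be shifted so that its lower bound coincides with an observed Tm, so only those lower bounds need to be tried. The function name `best_tm_interval` and the input layout are assumptions.

```python
def best_tm_interval(tms_by_target, width=10.0):
    """Find the Tm interval of the given width that covers the most targets.

    tms_by_target maps a target identifier to the Tm values of its remaining
    oligo candidates. A target is covered when at least one of its candidates
    falls inside the interval. Returns (low, high, number_of_covered_targets).
    """
    lower_bounds = sorted({tm for tms in tms_by_target.values() for tm in tms})
    best = (None, None, 0)
    for low in lower_bounds:
        high = low + width
        covered = sum(
            1 for tms in tms_by_target.values()
            if any(low <= tm <= high for tm in tms)
        )
        if covered > best[2]:
            best = (low, high, covered)
    return best
```

Candidates whose Tm lies outside the returned interval would then be marked as non-candidates, as described in the paragraph above.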
4.8 Secondary structure assessment
The formation of secondary structures can affect the sensitivity of oligonucleotide probes by dramatically decreasing their ability to anneal to target sequences. Examples of secondary structures, such as hairpins and dimers, are shown in Figure 4.8.
[Figure: a. Hairpin, a single oligo folded back on itself so that a base-paired stem is closed by a loop; b. Dimer, two oligo strands base-paired with each other in antiparallel (3′ to 5′ against 5′ to 3′) orientation.]
Figure 4.8: Examples of forming secondary structures a. hairpin b. dimer.
The secondary structure assessment can be performed in two ways. The first is a self-complementarity (self-annealing) check, which aligns the oligonucleotide with its reverse-complement sequence; the second uses thermodynamic calculations to determine the stability of potential secondary structures. The proposed algorithm uses the first approach because the goal is to determine whether any secondary structure is likely to form, not to identify the best secondary structure.
Because the probability of forming secondary structures is very high in regions of significant self-complementarity, the secondary structure assessment step is used to avoid designing oligos that form stable secondary structures. Oligo candidates are examined for potential stem-loop structures by searching for short stretches of complementary sequence, called stems, that are separated by a few bases, called the loop. In addition, all oligo candidates are checked for their ability to form dimers by counting the number of base-pairing nucleotides within a window of a specific size (Figure 4.8). All parameters related to hairpins and dimers are specified by the user. The default stem length is six, and the default minimum and maximum loop lengths are one and three, respectively. For dimers, the default values for window length and stringency are 15 and 13, respectively.
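Both checks can be sketched with plain string comparisons. The code below is a simplified illustration, not the thesis implementation: the hairpin test requires a perfectly complementary stem, the dimer test is restricted to self-dimers (two copies of the same oligo in antiparallel orientation), and the names `has_hairpin` and `has_self_dimer` are hypothetical. Input is expected to contain only uppercase A, C, G, and T.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def has_hairpin(seq, stem=6, min_loop=1, max_loop=3):
    """True when two perfectly complementary stems of length `stem` are
    separated by a loop of min_loop..max_loop bases."""
    n = len(seq)
    for i in range(n - 2 * stem - min_loop + 1):
        partner = revcomp(seq[i:i + stem])  # what the second stem must read
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem <= n and seq[j:j + stem] == partner:
                return True
    return False


def has_self_dimer(seq, window=15, stringency=13):
    """True when, for some antiparallel alignment of the oligo with a copy of
    itself, a window of `window` aligned bases holds >= `stringency` pairs."""
    n = len(seq)
    rc = revcomp(seq)
    for shift in range(-(n - window), n - window + 1):
        lo, hi = max(0, -shift), min(n, n - shift)
        # seq[i] pairs with the facing base exactly when it equals rc[i + shift].
        paired = [seq[i] == rc[i + shift] for i in range(lo, hi)]
        count = sum(paired[:window])
        if count >= stringency:
            return True
        for k in range(window, len(paired)):
            count += paired[k] - paired[k - window]
            if count >= stringency:
                return True
    return False
```

For example, `has_hairpin("ACGGTCAAAGACCGT")` detects the stem ACGGTC pairing with GACCGT across the three-base loop AAA.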
After all eliminations from the previous steps are done, the proposed algorithm picks the “safest position” for each target sequence. This position is the middle of the largest interval of consecutive candidate positions and is considered first as the final oligo candidate. Alternatively, the final oligo candidate can be selected according to its position in the target sequence (proximity to the 5′ or 3′ end). Then, the secondary structure assessment is performed, and all surviving oligo candidates of each target sequence are considered for the final intensive homology search and selection. Figure 4.9 shows a summarized version of the proposed algorithm.
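Selecting the “safest position” amounts to finding the longest run of surviving candidate positions and taking its middle. A minimal sketch, in which the function name `safest_position` and the boolean-list representation of the candidate positions are assumptions:

```python
def safest_position(candidate):
    """Return the middle index of the longest run of True values,
    or None when no candidate position survives."""
    best_start, best_len = None, 0
    run_start = None
    # A trailing False acts as a sentinel that closes a run ending at the last index.
    for i, ok in enumerate(list(candidate) + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            run_len = i - run_start
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2
```

When two runs have the same length, this sketch keeps the first one; the thesis does not state how such ties are resolved.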
Algorithm OligoDesign(G)
- given: the set G of k input target sequences: G = {g1, g2, . . . , gk}
- returns: a set O of m optimal oligonucleotides: O = {o1, o2, . . . , om}
V = Merge(G)
C = Code(V)  // A:00, C:01, G:10, T:11
FastHomology(C, 1^maxMatch)  // checking for max consecutive match
for i = 1 to |Seed1| do
    FastHomology(C, Seed1[i])  // checking for sequence identity
GCcontent(C, minGC, maxGC)
MeltTemp(C, |range|)
for i = 1 to |Seed2| do
    hi = Hash(C, Seed2[i])
O = ∅
for i = 1 to k do
    oi = ε
    while (oi = ε and ∃ candidate ∈ [1..|gi|]) do
        SecStruct(candidate)
        for j = 1 to |Seed2| do
            IntensiveHomology(candidate, hj)
        O = O + (oi = candidate)
    end while
end for
return (O)
Figure 4.9: The OligoDesign algorithm.
Chapter 5
Experimental results and evaluation
This chapter presents the experimental results as well as the comparison of the pro-
posed method with other oligo probe design algorithms.
5.1 Experimental results
The proposed algorithm is implemented entirely in C/C++ and runs on many plat-
forms such as Linux/Unix, Mac OS, and Windows.
The input data sets used for the experimental results are listed in Table 5.1. All input data sets are in FASTA format, in which the beginning of a new sequence is denoted by a ‘>’ sign followed by a title that identifies the sequence. The input DNA sequences may contain ambiguity characters, denoted by N, but these will not be included in the designed oligo probes. In other words, all possible oligo probes containing ambiguity characters are masked and considered non-candidate oligo positions, so only oligo probes that consist of the characters A, C, G, and T (both uppercase and lowercase are accepted) are selected.
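The input handling described above can be sketched as follows. The helper names `read_fasta` and `mask_ambiguous` are hypothetical, and the parser is deliberately minimal: it only recognizes title lines starting with ‘>’ and concatenates the sequence lines that follow.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (title, sequence) pairs."""
    records = []
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def mask_ambiguous(seq, oligo_len=50):
    """Return flags[p] = True when the oligo starting at p contains only
    A, C, G, or T (uppercase or lowercase)."""
    n = len(seq)
    flags = [False] * max(0, n - oligo_len + 1)
    next_bad = n  # nearest non-ACGT position at or to the right of p
    for p in range(n - 1, -1, -1):
        if seq[p].upper() not in "ACGT":
            next_bad = p
        if p < len(flags):
            flags[p] = next_bad >= p + oligo_len
    return flags
```

Tracking the nearest ambiguity character during a single right-to-left pass avoids rescanning each window.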
Table 5.1: Input data sets used for experimental results.
Species model  Gene set      Size            Description
mousenervous   1,421 genes   4,354,947 bp    1,421 genes known to be involved in the development of the mouse nervous system
ecoli          5,317 genes   4,843,471 bp    Escherichia coli O157:H7 (E. coli) gene sequences
bee            11,324 genes  6,010,949 bp    Apis (bee) EST sequences
yeast          6,702 genes   9,074,997 bp    Yeast CDS sequences
plasmodium     9,518 genes   10,739,506 bp   Plasmodium falciparum sequences
zebrafish      12,238 genes  23,003,650 bp   Zebrafish mRNA sequences
drosophila     18,962 genes  32,198,758 bp   Drosophila melanogaster complete CDS sequences
chicken        26,236 genes  32,732,911 bp   Gallus gallus mRNA for DNA topoisomerase I, complete cds
celegans       30,935 genes  34,753,016 bp   C. elegans EST sequences
arabidopsis    28,952 genes  36,298,530 bp   Arabidopsis thaliana complete CDS sequences
maize          58,579 genes  38,963,590 bp   Maize EST sequences (release 15 data set)
mouse          35,284 genes  68,604,317 bp   Mouse cDNA sequences
human          28,205 genes  72,720,516 bp   Homo sapiens small muscle protein, X-linked (SMPX), mRNA
mouserna       36,598 genes  93,830,285 bp   Mus musculus transcript sequence file (release 21)
rice           66,710 genes  113,204,455 bp  Rice cDNA gene set
The default parameters used in the implementation are listed below.
• oligo length: 50 bases (b)
• maximum consecutive match: 15 b
• maximum sequence similarity: 75%
• minimum GC-content: 30%, maximum GC-content: 70%
• oligo concentration: 1 µM, salt concentration: 75 mM
• melting temperature interval length: 10 degrees
• hairpin stem: 6 b, maximum hairpin loop: 3 b, minimum hairpin loop: 1 b
• dimer window length: 15 b, stringency: 13 b
The system used for running the proposed algorithm has the following specification:
• CPU: Intel® Core™ i7-2600 CPU @ 3.40GHz
– Number of Cores = 4
– Number of Threads = 8
– Clock Speed = 3.4 GHz
– Cache Size = 8 MB
• Physical Memory: 16 GB
• OS: GNU Linux version 2.6.38.8-desktop-69mib
• Compiler: gcc version 4.4.3
The proposed algorithm has also been parallelized using OpenMP, an Application Program Interface (API) for explicitly directing multi-threaded, shared-memory parallelism.
Tables 5.2-5.8 show the results of running the proposed algorithm with the default parameters listed above and with the sets of multiple spaced seeds of different weights for fast and intensive homology search shown in Figures 5.1-5.7. For most scenarios, we employed two different sets of spaced seeds for the fast and intensive homology search steps. However, the proposed algorithm can also be run with only a fast homology search, or a double fast homology search, for cross-hybridization assessment, which results in a shorter running time but also lower oligo quality.
For the fast homology search step, three sets of spaced seeds were tested: the set of eight multiple spaced seeds of weight 10, denoted by s8w10; the set of eight multiple spaced seeds of weight 11, denoted by s8w11; and the set of eight multiple spaced seeds of weight 8, denoted by s8w8. For the intensive homology search step, three different sets of spaced seeds were also tested: the set of eight multiple spaced seeds of weight 9, denoted by s8w9; the set of eight multiple spaced seeds of weight 8, denoted by s8w8; and the set of 16 multiple spaced seeds of weight 9, denoted by s16w9.
As Tables 5.2-5.8 show, different seed models for the fast and intensive homology search steps generate different numbers of oligonucleotide probes and have different running times and memory usage. The best running time and memory usage are obtained when only the fast homology search is applied, but the quality of the generated oligo probes is not good, and some of them do not satisfy the sequence similarity parameter. The combination of eight seeds of weight 10 (s8w10) for the fast homology search step and eight seeds of weight nine (s8w9) for the intensive homology search step was selected as the best combination for the comparison with other oligo design algorithms, which is presented in the next section. Moreover, the implementation allows the algorithm to run on any system by changing the seed specification; that is, if the physical memory of the system is not sufficient for the s8w10-s8w9 combination, smaller seed sets are selected to perform the intensive homology search.
Fast homology search (s8w10): 111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11
Intensive homology search (s8w9): 111*1***11*1*11 11***1**1****1***1*111 1*11**1*1***1****1**11 1*1*11****11******1*11 11*1******1***1*1***111 11***1*1*******1**1*1*11 11*1*****1**1******1****111 111**1******1****1*****1**11
Figure 5.1: Seeds used for fast homology search (s8w10) and for intensive homology search (s8w9).
Fast homology search (s8w10): 111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11
Intensive homology search (s8w8): 1*1****1***1*****1*111 1*11*1******1**1***1*1 11***1*1*****1**1*1**1 11**1*****1****1**1****11 1*1***1******11***1****11 111******1*****1******1***11 11*1*******1********1***1*11 11***1******1*******1**1*1*1
Figure 5.2: Seeds used for fast homology search (s8w10) and for intensive homology search (s8w8).
5.2 Results from other algorithms
To compare our algorithm with the algorithms for oligo probe design, a comprehensive
survey on the available software programs has been performed and the best algorithms
Table 5.2: Results by employing s8w10-s8w9.
Data set      Target sequences  Length (bp)  Number of oligos  Time (sec)    Space (MB)
mousenervous  1,421             4,354,947    1,418             4             353
ecoli         5,317             4,843,471    4,647             4             374
bee           11,324            6,010,949    10,675            6             428
yeast         6,702             9,074,997    6,178             9             569
plasmodium    9,518             10,739,506   4,527             14            658
zebrafish     12,238            23,003,650   7,989             31            1,211
drosophila    18,962            32,198,758   11,826            34            1,636
chicken       26,236            32,732,911   16,692            38            1,661
celegans      30,935            34,753,016   21,724            44            1,755
arabidopsis   28,952            36,298,530   21,326            45            1,826
maize         58,579            38,963,590   43,614            59            1,952
mouse         35,284            68,604,317   20,200            99            3,321
human         28,205            72,720,516   18,781            109           3,511
mouserna      36,598            93,830,285   17,963            151           4,477
rice          66,710            113,204,455  28,552            215           5,375
Table 5.3: Results by employing s8w10-s8w8.
Data set      Target sequences  Length (bp)  Number of oligos  Time (sec)    Space (MB)
mousenervous  1,421             4,354,947    1,418             4             753
ecoli         5,317             4,843,471    4,647             4             753
bee           11,324            6,010,949    10,675            9             756
yeast         6,702             9,074,997    6,178             12            763
plasmodium    9,518             10,739,506   4,527             28            767
zebrafish     12,238            23,003,650   7,989             40            1,217
drosophila    18,962            32,198,758   11,826            61            1,639
chicken       26,236            32,732,911   16,692            70            1,665
celegans      30,935            34,753,016   21,724            90            1,758
arabidopsis   28,952            36,298,530   21,326            90            1,830
maize         58,579            38,963,590   43,614            134           1,961
mouse         35,284            68,604,317   20,200            218           3,335
human         28,205            72,720,516   18,781            238           3,522
mouserna      36,598            93,830,285   17,963            336           4,466
rice          66,710            113,204,455  28,552            523           5,373
Fast homology search (s8w10): 111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11
Intensive homology search (s16w9): 11**11*1****11**11 111***1**1**11*1*1 11**11****11**11*1 1*1*1*11***1**1*11 1*11*1****1*1*1*11 11*1***1*1*1***111 1*1*1**1*11****1*11 11*11*****1**1*1**11 111****1***1***1**111 1*1**1**1***11****111 111**1****1**1*****1*11 11*1*****1******1**1*111 111***1*******1***1**11*1 11**1*1********1****1**111 11*1***1***1********1*1*11 111*1****1*********1*****111
Figure 5.3: Seeds used for fast homology search (s8w10) and for intensive homology search (s16w9).
Table 5.4: Results by employing s8w10-s16w9.
Data set      Target sequences  Length (bp)  Number of oligos  Time (sec)    Space (MB)
mousenervous  1,421             4,354,947    1,418             6             697
ecoli         5,317             4,843,471    4,647             7             739
bee           11,324            6,010,949    10,675            11            844
yeast         6,702             9,074,997    6,178             14            1,119
plasmodium    9,518             10,739,506   4,527             27            1,294
zebrafish     12,238            23,003,650   7,989             53            2,375
drosophila    18,962            32,198,758   11,826            63            3,204
chicken       26,236            32,732,911   16,692            71            3,252
celegans      30,935            34,753,016   21,724            84            3,435
arabidopsis   28,952            36,298,530   21,326            85            3,574
maize         58,579            38,963,590   43,614            115           3,820
mouse         35,284            68,604,317   20,200            193           6,492
human         28,205            72,720,516   18,781            215           6,864
mouserna      36,598            93,830,285   17,963            294           8,754
rice          66,710            113,204,455  28,552            422           10,509
Fast homology search (s8w10): 111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11
Fast homology search (s8w8): 1*1*1*1****11***11 11****11***1**11*1 1*1**1**1****111*1 1**11**11******1*11 111****1**1***1*1*1 11*1**1***1**1****11 111******1***1*****1*11 11*1***1*************1*1*11
Figure 5.4: Seeds used for double fast homology search (s8w10) and (s8w8).
Table 5.5: Results by employing s8w10-s8w8 (double fast homology search).
Data set      Target sequences  Length (bp)  Number of oligos  Time (sec)    Space (MB)
mousenervous  1,421             4,354,947    1,418             2             309
ecoli         5,317             4,843,471    4,649             3             310
bee           11,324            6,010,949    10,678            4             313
yeast         6,702             9,074,997    6,198             6             319
plasmodium    9,518             10,739,506   4,554             4             323
zebrafish     12,238            23,003,650   8,052             12            349
drosophila    18,962            32,198,758   11,840            16            369
chicken       26,236            32,732,911   16,737            22            370
celegans      30,935            34,753,016   22,041            17            940
arabidopsis   28,952            36,298,530   22,171            17            943
maize         58,579            38,963,590   44,646            21            949
mouse         35,284            68,604,317   20,516            28            1,013
human         28,205            72,720,516   19,075            31            1,022
mouserna      36,598            93,830,285   18,210            41            1,067
rice          66,710            113,204,455  29,567            47            1,109
111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11
Figure 5.5: Seeds used for fast homology search (s8w10).
Table 5.6: Results by employing s8w10.
Data set      Target sequences  Length (bp)  Number of oligos  Time (sec)    Space (MB)
mousenervous  1,421             4,354,947    1,418             2             309
ecoli         5,317             4,843,471    4,649             2             310
bee           11,324            6,010,949    10,678            2             312
yeast         6,702             9,074,997    6,198             4             319
plasmodium    9,518             10,739,506   4,554             2             323
zebrafish     12,238            23,003,650   8,053             10            349
drosophila    18,962            32,198,758   11,841            12            369
chicken       26,236            32,732,911   16,738            13            936
celegans      30,935            34,753,016   22,046            13            940
arabidopsis   28,952            36,298,530   22,180            14            943
maize         58,579            38,963,590   44,656            17            949
mouse         35,284            68,604,317   20,535            23            1,013
human         28,205            72,720,516   19,081            25            1,022
mouserna      36,598            93,830,285   18,210            34            1,067
rice          66,710            113,204,455  29,604            39            1,109
Fast homology search (s8w11): 1*111*1*1***111*11 11*1***11*111**111 1111**11**1**1*1*11 111**1*1***1**1**1111 11**1*1*11*1*****1*111 111*1***1****11***1*111 111*1**1*******1***1**1*111 11*11*1*****1*******1*1**111
Intensive homology search (s8w9): 111*1***11*1*11 11***1**1****1***1*111 1*11**1*1***1****1**11 1*1*11****11******1*11 11*1******1***1*1***111 11***1*1*******1**1*1*11 11*1*****1**1******1****111 111**1******1****1*****1**11
Figure 5.6: Seeds used for fast homology search (s8w11) and for intensive homology search (s8w9).
Table 5.7: Results by employing s8w11-s8w9.
Data set      Target sequences  Length (bp)  Number of oligos  Time (sec)    Space (MB)
mousenervous  1,421             4,354,947    1,418             3             753
ecoli         5,317             4,843,471    4,646             3             754
bee           11,324            6,010,949    10,677            6             756
yeast         6,702             9,074,997    6,178             11            763
plasmodium    9,518             10,739,506   4,527             16            767
zebrafish     12,238            23,003,650   7,989             42            1,211
drosophila    18,962            32,198,758   11,826            55            1,636
chicken       26,236            32,732,911   16,692            57            1,661
celegans      30,935            34,753,016   21,724            64            1,755
arabidopsis   28,952            36,298,530   21,331            69            1,826
maize         58,579            38,963,590   43,614            94            1,952
mouse         35,284            68,604,317   20,200            149           3,321
human         28,205            72,720,516   18,781            160           3,511
mouserna      36,598            93,830,285   17,963            218           4,477
rice          66,710            113,204,455  28,556            294           5,375
Fast homology search (s8w11): 1*111*1*1***111*11 11*1***11*111**111 1111**11**1**1*1*11 111**1*1***1**1**1111 11**1*1*11*1*****1*111 111*1***1****11***1*111 111*1**1*******1***1**1*111 11*11*1*****1*******1*1**111
Intensive homology search (s16w9): 11**11*1****11**11 111***1**1**11*1*1 11**11****11**11*1 1*1*1*11***1**1*11 1*11*1****1*1*1*11 11*1***1*1*1***111 1*1*1**1*11****1*11 11*11*****1**1*1**11 111****1***1***1**111 1*1**1**1***11****111 111**1****1**1*****1*11 11*1*****1******1**1*111 111***1*******1***1**11*1 11**1*1********1****1**111 11*1***1***1********1*1*11 111*1****1*********1*****111
Figure 5.7: Seeds used for fast homology search (s8w11) and for intensive homology search (s16w9).
Table 5.8: Results by employing s8w11-s16w9.
Data set      Target sequences  Length (bp)  Number of oligos  Time (sec)    Space (MB)
mousenervous  1,421             4,354,947    1,418             6             753
ecoli         5,317             4,843,471    4,646             7             754
bee           11,324            6,010,949    10,675            10            844
yeast         6,702             9,074,997    6,178             15            1,119
plasmodium    9,518             10,739,506   4,527             28            1,294
zebrafish     12,238            23,003,650   7,989             61            2,175
drosophila    18,962            32,198,758   11,826            81            3,204
chicken       26,236            32,732,911   16,692            88            3,552
celegans      30,935            34,753,016   21,724            95            3,435
arabidopsis   28,952            36,298,530   21,331            106           3,574
maize         58,579            38,963,590   43,614            145           3,820
mouse         35,284            68,604,317   20,200            231           6,492
human         28,205            72,720,516   18,781            259           6,864
mouserna      36,598            93,830,285   17,963            352           8,754
rice          66,710            113,204,455  28,556            488           10,509
have been selected for this purpose. Of the algorithms in chapter 4, those that are freely available and executable for comparison purposes are ArrayOligoSelector, OligoArray, OligoPicker, OligoWiz, PICKY, and YODA. All programs were run with the same parameters so that the comparison is fair: the length of the generated oligo probes is set to 50 for all algorithms; the maximum consecutive match and the maximum sequence similarity percentage are set to 15 and 75, respectively; and the GC-content range is set to [30, 70]. Table 5.9 describes the software programs used in the evaluation and comparison process. The results of these algorithms are shown in Tables 5.10-5.15. All programs were run on the same machine, with the specifications given at the beginning of this chapter.
Table 5.9: Description of the oligo design software programs used in comparison.
Algorithm  Organism  Specificity bank  Availability  User interface        Platform  Programming language
AOS        No limit  fasta file        Free          Command line          L         Python
OAR        No limit  fasta file        Free          Command line and GUI  L/W/M     Java
OPR        No limit  fasta file        Free          Command line          L         Perl
OWZ        Limited   fasta file        Free          Command line and GUI  W/L/M     Perl, Java
PKY        No limit  fasta file        By request    GUI                   L/W/M     C++
YDA        No limit  fasta file        Free          Command line and GUI  W/L/M     Java
Abbreviations used for the algorithms: AOS: ArrayOligoSelector, OAR: OligoArray, OPR: OligoPicker, OWZ: OligoWiz, PKY: PICKY, YDA: YODA. In the platform column, L: Linux, W: Windows, M: Macintosh.
Table 5.10: Results for ArrayOligoSelector.
Data set      Target sequences  Length (bp)  Number of oligos  Time (h:m:s)  Space (MB)
mousenervous  1,421             4,354,947    1,421             00:12:56      NA
ecoli         5,317             4,843,471    5,161             00:20:13      NA
bee           11,324            6,010,949    11,317            00:52:12      NA
yeast         6,702             9,074,997    6,645             00:40:23      NA
plasmodium    9,518             10,739,506   8,991             01:14:50      NA
zebrafish     12,238            23,003,650   8,481             01:52:44      NA
drosophila    18,962            32,198,758   16,501            02:13:42      NA
chicken       26,236            32,732,911   26,036            02:50:22      NA
celegans      30,935            34,753,016   30,788            03:59:10      NA
arabidopsis   28,952            36,298,530   27,918            03:03:37      NA
maize         58,579            38,963,590   58,522            06:33:49      NA
mouse         35,284            68,604,317   34,491            07:44:29      NA
human         28,205            72,720,516   27,923            06:08:34      NA
mouserna      36,598            93,830,285   34,856            10:21:06      NA
rice          66,710            113,204,455  66,520            21:09:40      NA
Table 5.11: Results for OligoArray.
Data set      Target sequences  Length (bp)  Number of oligos  Time (h:m:s)  Space (MB)
mousenervous  1,421             4,354,947    1,410             00:05:55      1,300
ecoli         5,317             4,843,471    4,503             00:42:49      1,400
bee           11,324            6,010,949    10,575            01:01:02      3,200
yeast         6,702             9,074,997    6,156             01:26:00      1,300
plasmodium    9,518             10,739,506   5,370             13:39:46      3,000
zebrafish     12,238            23,003,650   7,573             09:15:09      3,800
drosophila    18,962            32,198,758   10,034            14:32:05      3,300
chicken       26,236            32,732,911   15,984            17:16:42      2,400
celegans      30,935            34,753,016   23,142            21:14:35      3,000
arabidopsis   28,952            36,298,530   19,340            14:43:50      3,200
maize         58,579            38,963,590   42,423            16:50:12      2,500
mouse         35,284            68,604,317   20,353            86:10:59      8,000
human         28,205            72,720,516   18,083            24:12:12      6,000
mouserna      36,598            93,830,285   18,343            103:49:30     7,900
rice          66,710            113,204,455  19,814            110:48:35     10,000
Table 5.12: Results for OligoPicker.
Data set      Target sequences  Length (bp)  Number of oligos  Time (h:m:s)  Space (MB)
mousenervous  1,421             4,354,947    1,421             00:01:06      NA
ecoli         5,317             4,843,471    4,672             00:04:20      NA
bee           11,324            6,010,949    10,823            00:09:54      NA
yeast         6,702             9,074,997    6,249             00:09:00      NA
plasmodium    9,518             10,739,506   5,964             00:15:25      NA
zebrafish     12,238            23,003,650   8,226             00:24:38      NA
drosophila    18,962            32,198,758   12,245            00:39:03      NA
chicken       26,236            32,732,911   17,485            00:49:55      NA
celegans      30,935            34,753,016   24,086            01:23:47      NA
arabidopsis   28,952            36,298,530   23,687            02:19:47      NA
maize         58,579            38,963,590   49,475            04:28:43      NA
mouse         35,284            68,604,317   23,779            02:46:37      NA
human         28,205            72,720,516   21,410            02:18:27      NA
mouserna      36,598            93,830,285   21,757            02:49:55      NA
rice          66,710            113,204,455  38,692            08:38:26      1,754
Table 5.13: Results for OligoWiz.
Data set      Target sequences  Length (bp)  Number of oligos  Time (h:m:s)  Space (MB)
mousenervous  1,421             4,354,947    1,421             00:54:54      NA
ecoli         5,317             4,843,471    5,317             00:32:29      NA
bee           11,324            6,010,949    11,324            01:46:36      NA
yeast         6,702             9,074,997    6,702             01:01:01      NA
plasmodium    9,518             10,739,506   9,517             11:37:08      NA
zebrafish     12,238            23,003,650   12,238            13:04:35      NA
drosophila    18,962            32,198,758   18,962            06:19:40      NA
chicken       26,236            32,732,911   26,235            05:34:33      NA
celegans      30,935            34,753,016   30,935            05:42:52      NA
arabidopsis   28,952            36,298,530   28,952            05:17:28      NA
maize         58,579            38,963,590   58,579            08:36:26      NA
mouse         35,284            68,604,317   35,283            28:53:23      NA
human         28,205            72,720,516   28,205            24:08:28      NA
mouserna      36,598            93,830,285   36,585            28:53:07      416
rice          66,710            113,204,455  66,710            69:26:16      467
Table 5.14: Results for PICKY.
Data set      Target sequences  Length (bp)  Number of oligos  Time (sec)    Space (MB)
mousenervous  1,421             4,354,947    1,421             30            101
ecoli         5,317             4,843,471    4,557             47            106
bee           11,324            6,010,949    10,442            62            133
yeast         6,702             9,074,997    5,850             121           194
plasmodium    9,518             10,739,506   4,138             43            230
zebrafish     12,238            23,003,650   7,428             345           506
drosophila    18,962            32,198,758   10,484            493           671
chicken       26,236            32,732,911   13,781            422           685
celegans      30,935            34,753,016   16,807            393           725
arabidopsis   28,952            36,298,530   18,584            436           756
maize         58,579            38,963,590   26,506            442           814
mouse         35,284            68,604,317   12,473            457           1,419
human         28,205            72,720,516   10,807            392           1,507
mouserna      36,598            93,830,285   9,483             445           1,938
rice          66,710            113,204,455  13,365            617           2,298
Table 5.15: Results for YODA.
Data set      Target sequences  Length (bp)  Number of oligos  Time (h:m:s)  Space (MB)
mousenervous  1,421             4,354,947    1,418             00:20:06      819
ecoli         5,317             4,843,471    4,590             00:33:54      918
bee           11,324            6,010,949    10,526            01:08:53      768
yeast         6,702             9,074,997    6,128             02:30:23      1,300
plasmodium    9,518             10,739,506   3,885             01:02:25      515
zebrafish     12,238            23,003,650   7,855             08:19:16      770
drosophila    18,962            32,198,758   11,613            10:32:47      1,500
chicken       26,236            32,732,911   16,306            12:20:47      1,100
celegans      30,935            34,753,016   20,941            25:25:43      1,000
arabidopsis   28,952            36,298,530   20,318            80:50:12      683
maize         58,579            38,963,590   41,198            57:03:36      1,200
mouse         35,284            68,604,317   19,399            37:40:05      933
human         28,205            72,720,516   17,997            33:58:01      1,200
mouserna      36,598            93,830,285   17,267            39:13:10      1,700
rice          66,710            113,204,455  25,891            127:10:27     1,900
5.3 Evaluation and comparison
To evaluate and compare all the algorithms mentioned above with our proposed algorithm, we have developed a separate program. This program employs the eight highly sensitive multiple spaced seeds of weight six shown in Figure 5.8.
111*1*11 11**1**111 11*1*****1*11 1*1***1****111 1*1**1**1***1*1 11***1****1**1*1 11*1******1***1*1 11******1****1**11
Figure 5.8: Eight spaced seeds of weight six used in the evaluation program.
The final set of oligo probes designed by each algorithm is the input of the evaluation program. The program works as follows. For each oligo probe, using the eight hash tables built from the seeds, it looks for regions similar to the oligo probe among all possible regions in the whole data set (that is, the vector of joined input target sequences) and then checks the sequence similarity percentage parameter, i.e., 75%. Oligo probes that have more than 75% similarity are considered bad oligos. The evaluation program is slow because the multiple spaced seeds of weight six produce many hits that must be checked against the sequence similarity parameter. All evaluations were performed on SHARCNET (http://www.sharcnet.ca). We used the following clusters for our evaluation task: kraken, orca, hound, bramble, and silky.
The comparison of the algorithms on the data sets is shown in Tables 5.16-5.30.
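The core of the evaluation step can be sketched as below. This is a simplified illustration, not the evaluation program itself: similarity is computed as ungapped percent identity over the oligo length, the seed hits are assumed to have already been converted into aligned start positions in the joined data set, and the names `percent_identity`, `is_bad_oligo`, and `percent_good` are hypothetical.

```python
def percent_identity(oligo, region):
    """Ungapped percent identity between an oligo and an equally long region."""
    matches = sum(1 for a, b in zip(oligo.upper(), region.upper()) if a == b)
    return 100.0 * matches / len(oligo)


def is_bad_oligo(oligo, own_start, data, hit_starts, max_similarity=75.0):
    """True when some region other than the oligo's own location exceeds the
    similarity threshold.

    data is the vector of joined target sequences; hit_starts are the aligned
    start positions suggested by the seed hits (assumed precomputed).
    """
    for start in hit_starts:
        if start == own_start:
            continue
        region = data[start:start + len(oligo)]
        if len(region) == len(oligo) and percent_identity(oligo, region) > max_similarity:
            return True
    return False


def percent_good(total, bad):
    """Percentage of good oligos, as reported in the evaluation tables."""
    return round(100.0 * (total - bad) / total, 2)
```

For instance, 1,421 oligos of which 174 are bad give `percent_good(1421, 174)`, which evaluates to 87.76, the value reported for ArrayOligoSelector in Table 5.16.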
Table 5.16: Evaluation for mousenervous.
Algorithm           Total oligos  Bad oligos  Good oligos  % good oligos  Time (sec)  Space (MB)
ArrayOligoSelector  1,421         174         1,247        87.76          776         NA
OligoArray          1,410         315         1,095        77.66          355         1,300
OligoPicker         1,421         17          1,404        98.80          66          NA
OligoWiz            1,421         144         1,277        89.87          3,294       NA
PICKY               1,421         36          1,385        97.47          30          101
YODA                1,418         0           1,418        100            1,206       819
Proposed method     1,418         0           1,418        100            4           353
Table 5.17: Evaluation for ecoli.
Algorithm           Total oligos  Bad oligos  Good oligos  % good oligos  Time (sec)  Space (MB)
ArrayOligoSelector  5,161         647         4,514        87.46          1,213       NA
OligoArray          4,503         1,124       3,379        75.04          2,569       1,400
OligoPicker         4,672         145         4,527        96.90          260         NA
OligoWiz            5,317         2,309       3,008        56.57          1,949       NA
PICKY               4,557         65          4,492        98.57          47          106
YODA                4,590         16          4,574        99.65          2,034       918
Proposed method     4,647         0           4,647        100            4           374
Table 5.18: Evaluation for bee.
Algorithm           Total oligos  Bad oligos  Good oligos  % good oligos  Time (sec)  Space (MB)
ArrayOligoSelector  11,317        1,359       9,958        87.99          3,132       NA
OligoArray          10,575        4,136       6,439        60.89          3,662       3,200
OligoPicker         10,823        354         10,469       96.73          594         NA
OligoWiz            11,324        1,544       9,780        86.37          6,396       NA
PICKY               10,442        110         10,332       98.95          62          133
YODA                10,526        136         10,390       98.71          4,133       768
Proposed method     10,675        0           10,675       100            6           428
Table 5.19: Evaluation for yeast.
Algorithm           Total oligos  Bad oligos  Good oligos  % good oligos  Time (sec)  Space (MB)
ArrayOligoSelector  6,645         1,922       4,723        71.08          2,423       NA
OligoArray          6,156         3,016       3,140        51.01          5,160       1,300
OligoPicker         6,249         208         6,041        96.67          540         NA
OligoWiz            6,702         875         5,827        86.94          3,661       NA
PICKY               5,850         138         5,712        97.64          121         194
YODA                6,128         34          6,094        99.45          9,023       1,300
Proposed method     6,178         0           6,178        100            9           569
Table 5.20: Evaluation for plasmodium.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector      8,991    7,455    1,536    17.08    4,490       NA
OligoArray              5,370    4,225    1,145    21.32   49,186    3,000
OligoPicker             5,964    1,671    4,293    71.98      925       NA
OligoWiz                9,517    7,811    1,706    17.93   41,828       NA
PICKY                   4,138      131    4,007    96.83       43      230
YODA                    3,885       55    3,830    98.58    3,745      515
Proposed method         4,527        0    4,527      100       14      658
Table 5.21: Evaluation for zebrafish.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector      8,481    6,215    2,266    26.72    6,764       NA
OligoArray              7,573    5,233    2,340    30.90   33,309    3,800
OligoPicker             8,226      579    7,647    92.96    1,478       NA
OligoWiz               12,238    8,810    3,428    28.01   47,075       NA
PICKY                   7,428      382    7,046    94.86      345      506
YODA                    7,855       76    7,779    99.03   29,956      770
Proposed method         7,989        0    7,989      100       31    1,211
Table 5.22: Evaluation for drosophila.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector     16,501   10,590    5,911    35.82    8,022       NA
OligoArray             10,034    5,803    4,231    42.17   52,325    3,300
OligoPicker            12,245      802   11,443    93.45    2,343       NA
OligoWiz               18,962   10,701    8,261    43.57   22,780       NA
PICKY                  10,484      304   10,180    97.10      493      671
YODA                   11,613      163   11,450    98.60   37,967    1,500
Proposed method        11,826        0   11,826      100       34    1,636
Table 5.23: Evaluation for chicken.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector     26,036   20,357    5,679    21.81   10,222       NA
OligoArray             15,984   12,136    3,848    24.07   62,202    2,400
OligoPicker            17,485    1,366   16,119    92.19    2,995       NA
OligoWiz               26,235   15,466   10,769    41.05   20,073       NA
PICKY                  13,781      315   13,466    97.71      422      685
YODA                   16,306      236   16,070    98.55   44,447    1,100
Proposed method        16,692        0   16,692      100       38    1,661
Table 5.24: Evaluation for celegans.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector     30,788   25,057    5,731    18.61   14,350       NA
OligoArray             23,142   19,503    3,639    15.72   76,475    3,000
OligoPicker            24,086    4,116   19,970    82.91    5,027       NA
OligoWiz               30,935   18,617   12,318    39.82   20,572       NA
PICKY                  16,807    1,012   15,795    93.98      393      725
YODA                   20,941      406   20,535    98.06   91,543    1,000
Proposed method        21,724        0   21,724      100       44    1,755
Table 5.25: Evaluation for arabidopsis.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector     27,918   22,641    5,277    18.90   11,017       NA
OligoArray             19,340   16,771    2,569    13.28   53,030    3,200
OligoPicker            23,687    6,107   17,580    74.22    8,387       NA
OligoWiz               28,952   16,635   12,317    42.54   19,048       NA
PICKY                  18,584    3,447   15,137    81.45      436      756
YODA                   20,318      271   20,047    98.67  291,012      683
Proposed method        21,326        0   21,326      100       45    1,826
Table 5.26: Evaluation for maize.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector     58,522   38,224   20,298    34.68   23,629       NA
OligoArray             42,423   30,736   11,687    27.55   60,612    2,500
OligoPicker            49,475    9,835   39,640    80.12   16,123       NA
OligoWiz               58,579   48,243   10,336    17.64   30,986       NA
PICKY                  26,506    1,757   24,749    93.37      442      814
YODA                   41,198    1,285   39,913    96.88  205,416    1,200
Proposed method        43,614        0   43,614      100       59    1,952
Table 5.27: Evaluation for mouse.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector     34,491   31,848    2,643     7.66   27,869       NA
OligoArray             20,353   18,164    2,189    10.76  310,259    8,000
OligoPicker            23,779    5,275   18,504    77.82    9,997       NA
OligoWiz               35,283   29,799    5,484    15.54  104,003       NA
PICKY                  12,473      631   11,842    94.94      457    1,419
YODA                   19,399      410   18,989    97.89  135,605      933
Proposed method        20,200        0   20,200      100       99    3,321
Table 5.28: Evaluation for human.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector     27,923   26,288    1,635     5.86   22,114       NA
OligoArray             18,083   16,589    1,494     8.26   87,132    6,000
OligoPicker            21,410    4,056   17,354    81.06    8,307       NA
OligoWiz               28,205   25,119    3,086    10.94   86,908       NA
PICKY                  10,807      495   10,312    95.42      392    1,507
YODA                   17,997      163   17,834    99.09  122,281    1,200
Proposed method        18,781        0   18,781      100      109    3,511
Table 5.29: Evaluation for mouserna.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector     34,856   33,529    1,327     3.81   37,266       NA
OligoArray             18,343   17,209    1,134     6.18  373,770    7,900
OligoPicker            21,757    5,428   16,329    75.05   10,195       NA
OligoWiz               36,585   34,456    2,129     5.82  103,987      416
PICKY                   9,483      406    9,077    95.72      445    1,938
YODA                   17,267      235   17,032    98.64  141,190    1,700
Proposed method        17,963        0   17,963      100      151    4,477
Table 5.30: Evaluation for rice.
Algorithm               Total      Bad     Good   % good     Time    Space
                        oligo    oligo    oligo    oligo    (sec)     (MB)
ArrayOligoSelector     66,520   62,797    3,723     5.60   76,180       NA
OligoArray             19,814   17,857    1,957     9.88  398,915   10,000
OligoPicker            38,692   14,756   23,936    61.86   31,106    1,754
OligoWiz               66,710   60,551    6,159     9.23  249,976      467
PICKY                  13,365    1,468   11,897    89.02      617    2,298
YODA                   25,891      704   25,187    97.28  457,827    1,900
Proposed method        28,552        0   28,552      100      215    5,375
Table 5.31 shows the total results of all algorithms on all data sets.
Table 5.31: Total comparison of the algorithms for all data sets.
Algorithm             Total oligo    Total    Total  % total       Total       Total
                        generated      bad     good     good  Time (sec)  Space (MB)
ArrayOligoSelector        365,571  289,103   76,468    20.92     249,467          NA
OligoArray                223,103  172,817   50,286    22.54   1,568,961      60,300
OligoPicker               269,971   54,715  215,256    79.73      98,343          NA
OligoWiz                  376,965  281,080   95,885    25.44     762,536          NA
PICKY                     166,126   10,697  155,429    93.56       4,745      12,083
YODA                      225,332    4,190  221,142    98.14   1,577,385      16,306
Proposed method           236,112        0  236,112      100         862      29,107
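Tables 5.16-5.30 and the totals in Table 5.31 show that the proposed algorithm
is the only one that produces no bad oligos on any data set, and that it has the
lowest total running time: 862 seconds, compared with 4,745 seconds for PICKY,
the next fastest program, and more than 1.5 million seconds for OligoArray and
YODA. We conclude that the proposed algorithm for designing the optimal set of
oligonucleotide probes selects more unique and efficient oligos, and runs
faster, in most cases by orders of magnitude, than the best available
algorithms proposed for the same task.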
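The percentages and running-time ratios behind this comparison can be recomputed directly from Table 5.31. The short script below is only a worked-arithmetic sketch: the figures are copied from the table, while the function names and the choice of the proposed method as the baseline are illustrative assumptions.

```python
# Totals copied from Table 5.31: (oligos generated, bad oligos, total time in seconds).
TOTALS = {
    "ArrayOligoSelector": (365_571, 289_103, 249_467),
    "OligoArray": (223_103, 172_817, 1_568_961),
    "OligoPicker": (269_971, 54_715, 98_343),
    "OligoWiz": (376_965, 281_080, 762_536),
    "PICKY": (166_126, 10_697, 4_745),
    "YODA": (225_332, 4_190, 1_577_385),
    "Proposed method": (236_112, 0, 862),
}


def percent_good(total, bad):
    """Share of generated oligos that were not flagged as bad, in percent."""
    return 100.0 * (total - bad) / total


def time_ratio(name, baseline="Proposed method"):
    """How many times longer `name` ran than the baseline, over all data sets."""
    return TOTALS[name][2] / TOTALS[baseline][2]


def summary():
    """One (name, % good, time ratio) row per algorithm."""
    return [
        (name, percent_good(total, bad), time_ratio(name))
        for name, (total, bad, _seconds) in TOTALS.items()
    ]


if __name__ == "__main__":
    for name, pct, ratio in summary():
        print(f"{name:<20}{pct:>8.2f}{ratio:>10.1f}")
```

On these totals the time ratio is about 5.5 for PICKY and above 100 for every other program.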
Chapter 6
Summary and conclusion
The algorithm proposed in this work gives researchers in biology and medicine a
free tool for the accurate, rapid, easy, and flexible design of signature
oligonucleotides. The oligonucleotides it designs can be used in a wide range of
applications in biology and medicine, such as microarray design, PCR
amplification, gene identification, diagnostic tests for genetic diseases such
as breast cancer or cystic fibrosis, diagnostic tests for infectious diseases
such as hepatitis or AIDS, and the discovery of new drugs or treatments for a
variety of diseases.
To maximize the specificity, sensitivity, and uniformity of the set of
oligonucleotides, the proposed algorithm strictly enforces the parameters that
govern each criterion. It employs the most sensitive sets of multiple spaced
seeds to perform the homology search, which is considered the most important
parameter for the specificity of the designed oligo probes. The other parameter
affecting specificity was GC-content, which was enforced by filtering out
oligonucleotides whose GC-content
was outside the predefined range. For sensitivity, the algorithm avoided
designing oligonucleotides that form stable secondary structures such as dimers
and hairpins, by checking all potential regions of an oligonucleotide for
self-complementarity. Maximum uniformity of the set of oligonucleotides was
achieved by identifying the optimal interval, of a predefined small length,
within which the melting temperatures of the oligonucleotides are allowed to lie.
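The three filters summarized above can be illustrated with a short sketch. It is not the thesis implementation: the GC range of 40% to 60%, the minimum stem length of six bases, and the reading of "optimal interval" as the fixed-width interval containing the most candidate melting temperatures are all assumptions made for the example, and melting temperatures are taken as given rather than computed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(oligo):
    """Fraction of G and C bases in the oligo."""
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def reverse_complement(oligo):
    return oligo.translate(COMPLEMENT)[::-1]


def has_self_complementary_region(oligo, min_stem=6):
    """True if a stretch of min_stem bases can pair with another part of the
    oligo (hairpin) or with a second copy of it (dimer). Assumed crude test."""
    rc = reverse_complement(oligo)
    return any(oligo[i:i + min_stem] in rc for i in range(len(oligo) - min_stem + 1))


def passes_filters(oligo, gc_min=0.40, gc_max=0.60, min_stem=6):
    """GC-content range check followed by the self-complementarity check."""
    if not gc_min <= gc_content(oligo) <= gc_max:
        return False
    return not has_self_complementary_region(oligo, min_stem)


def best_tm_interval(tms, width):
    """Return (low, count): the interval [low, low + width] that contains the
    most melting temperatures, found with a sliding window over sorted values."""
    ordered = sorted(tms)
    best_low, best_count = None, 0
    left = 0
    for right, tm in enumerate(ordered):
        while tm - ordered[left] > width:
            left += 1
        count = right - left + 1
        if count > best_count:
            best_low, best_count = ordered[left], count
    return best_low, best_count
```

For example, `best_tm_interval([60, 61, 62, 70, 71, 80], 2)` selects the interval starting at 60, which covers three of the six candidates; oligos whose melting temperature falls outside the chosen interval would then be discarded.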
The results obtained from the proposed algorithm were compared with those of the
best-known algorithms and software programs proposed for the same task. To make
the comparison independent and fair, a separate program was written that employs
highly sensitive multiple spaced seeds to identify bad oligos, that is, oligos
designed by any of the algorithms whose similarity to non-target sequences
exceeds the allowed threshold. The comparison showed that our proposed algorithm
for oligo design finds and selects unique and more efficient oligonucleotides,
and runs orders of magnitude faster than most of the other well-known
algorithms.
As future work, it would be worthwhile to focus on the thermodynamics of DNA
oligonucleotides and to find more precise formulas for predicting the melting
temperature of DNA sequences. Moreover, finding a good parameter for
low-complexity region assessment would be an interesting research direction.
Bibliography
Alberts, B., Johnson, A., Lewis, J., Raff, M., Bray, D., Hopkin, K., Roberts, K., and
Walter, P. (2003). Essential Cell Biology, Second Edition. Garland Science/Taylor
& Francis Group.
Altschul, S. F., Gish, W., Miller, W., Meyers, E. W., and Lipman, D. (1990). Basic
local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
Bozdech, Z., Zhu, J., Joachimiak, M., Cohen, F., Pulliam, B., and DeRisi, J. (2003).
Expression profiling of the schizont and trophozoite stages of Plasmodium
falciparum with a long-oligonucleotide microarray. Genome Biology, 4(2), R9.
Chou, H.-H., Hsia, A.-P., Mooney, D. L., and Schnable, P. S. (2004). Picky: oligo
microarray design for large genomes. Bioinformatics, 20(17), 2893–2902.
Farabee, M. J. (2007). On-Line Biology Book. Estrella Mountain Community College.
Feng, S. and Tillier, E. R. (2007). A fast and flexible approach to oligonucleotide
probe design for genomes and gene families. Bioinformatics, 23(10), 1195–1202.
Godfrey-Smith, P. and Sterelny, K. (2008). Biological information. In E. N. Zalta,
editor, The Stanford Encyclopedia of Philosophy. Fall 2008 edition.
Golding, B., Morton, D., and Haerty, W. (2011). Elementary Sequence Analysis.
E-Book text for the course Biology 3S03, Department of Biology, McMaster
University, http://helix.biology.mcmaster.ca/3S03.pdf.
Hancock, J. M. and Armstrong, J. S. (1994). SIMPLE34: an improved and enhanced
implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of
clustered repetitive motifs in nucleotide sequences. Computer Applications in the
Biosciences, 10(1), 67–70.
Horspool, D. (2008). An overview of the central dogma of molecular biochemistry
with all unusual flows of information included (in green). In Wikipedia, the Free
Encyclopedia.
Ilie, L. and Ilie, S. (2007). Multiple spaced seeds for homology search. Bioinformatics,
23(22), 2969–2977.
Ilie, L., Ilie, S., and Bigvand, A. M. (2011). SpEED: fast computation of sensitive
spaced seeds. Bioinformatics.
Kaderali, L. and Schliep, A. (2002). Selecting signature oligonucleotides to identify
organisms using DNA arrays. Bioinformatics, 18(10), 1340–1349.
Kane, M. D., Jatkoe, T. A., Stumpf, C. R., Lu, J., Thomas, J. D., and Madore, S. J.
(2000). Assessment of the sensitivity and specificity of oligonucleotide (50mer)
microarrays. Nucleic Acids Research, 28(22), 4552–4557.
Li, F. and Stormo, G. D. (2001). Selection of optimal DNA oligos for gene expression
arrays. Bioinformatics, 17(11), 1067–1076.
Li, M., Ma, B., Kisman, D., and Tromp, J. (2004). PatternHunter II: highly sensitive
and fast homology search. Journal of Bioinformatics and Computational Biology,
2(3), 417–439.
Lipman, D. and Pearson, W. (1985). Rapid and sensitive protein similarity searches.
Science, 227(4693), 1435–1441.
Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S.,
Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L. (1996).
Expression monitoring by hybridization to high-density oligonucleotide arrays.
Nature Biotechnology, 14, 1675–1680.
Ma, B., Tromp, J., and Li, M. (2002). PatternHunter: faster and more sensitive
homology search. Bioinformatics, 18(3), 440–445.
Manber, U. and Myers, G. (1990). Suffix arrays: a new method for on-line string
searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete
Algorithms, SODA ’90, pages 319–327, Philadelphia, PA, USA. Society for Industrial
and Applied Mathematics.
Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based
on dynamic programming. J. ACM, 46, 395–415.
Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the
search for similarities in the amino acid sequence of two proteins. Journal of
Molecular Biology, 48(3), 443–453.
Nielsen, H. B., Wernersson, R., and Knudsen, S. (2003). Design of oligonucleotides
for microarrays and perspectives for design of multi-transcriptome arrays. Nucleic
Acids Research, 31(13), 3491–3496.
Noe, L. and Kucherov, G. (2005). YASS: enhancing the sensitivity of DNA similarity
search. Nucleic Acids Research, 33(suppl 2), 540–543.
Nordberg, E. K. (2004). YODA: selecting signature oligonucleotides. Bioinformatics,
21(8), 1365–1370.
Owczarzy, R., Moreira, B. G., You, Y., Behlke, M. A., and Walder, J. A. (2008).
Predicting stability of DNA duplexes in solutions containing magnesium and
monovalent cations. Biochemistry, 47(19), 5336–5353. PMID: 18422348.
Puglisi, S. J., Smyth, W. F., and Turpin, A. H. (2007). A taxonomy of suffix array
construction algorithms. ACM Comput. Surv., 39.
Rahmann, S. (2003). Fast large scale oligonucleotide selection using the longest
common factor approach. Journal of Bioinformatics and Computational Biology, 1(2),
343–361.
Reymond, N., Charles, H., Duret, L., Calevro, F., Beslon, G., and Fayard, J.-M.
(2004). ROSO: optimizing oligonucleotide probes for microarrays. Bioinformatics,
20(2), 271–273.
Rimour, S., Hill, D., Militon, C., and Peyret, P. (2005). GoArrays: highly dynamic
and efficient microarray probe design. Bioinformatics, 21(7), 1094–1103.
Rouillard, J., Zuker, M., and Gulari, E. (2003). OligoArray 2.0: design of
oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic
Acids Research, 31(12), 3057–3062.
Rychlik, W., Spencer, W., and Rhoads, R. (1990). Optimization of the annealing
temperature for DNA amplification in vitro. Nucleic Acids Research, 18(21),
6409–6412.
SantaLucia, J. (1998). A unified view of polymer, dumbbell, and oligonucleotide
DNA nearest-neighbor thermodynamics. Proceedings of the National Academy of
Sciences, 95(4), 1460–1465.
Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular
subsequences. Journal of Molecular Biology, 147(1), 195–197.
Smyth, W. F. (2003). Computing Patterns in Strings. Pearson Addison-Wesley.
Villarreal, M. R. (2008). Main protein structures levels. In Wikipedia, the Free
Encyclopedia.
Wang, X. and Seed, B. (2003). Selection of oligonucleotide probes for protein coding
sequences. Bioinformatics, 19(7), 796–802.
Ziv, J. and Lempel, A. (1977). A universal algorithm for sequential data compression.
Information Theory, IEEE Transactions on, 23(3), 337–343.