Indexing Biological Sequence Data
Doctoral Seminar by
Mihail R. Halachev
Supervisor: Dr. N. Shiri
Dept. of Computer Science and Software Engineering, Concordia University
11/29/2004

Outline
- Introduction: From DNA to sequence data
- Basic tasks over biological sequence data
- Search techniques
- Indexing techniques for sequence data
- Applicability to bioinformatics
- Suffix Trees
- Conclusion
- Future Work

From DNA to sequence data representation
Source: National Health Museum
The two strands are complementary: A pairs with T, and C pairs with G.
A DNA segment can be encoded using the bases from only one of the strands:
S = AGTACG, Σ = {A, C, G, T}
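The complementarity rule can be illustrated with a short Python sketch. The names `complement` and `reverse_complement` are illustrative and do not come from the slides; the second function assumes the usual convention that the opposite strand is read in the reverse direction.

```python
# Base-pairing rule from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction
    (assumed convention, not stated on the slide)."""
    return complement(strand)[::-1]

S = "AGTACG"
print(complement(S))          # TCATGC
print(reverse_complement(S))  # CGTACT
```

Because one strand determines the other, storing S alone loses no information, which is why the sequence databases discussed later hold a single strand.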

From mRNA to sequence data representation
Source: Wikipedia
Each codon specifies a single amino acid.
S = ATGLRS*
|Σ’| = 20

Basic tasks over biological data
From a biological point of view:
- Given a novel DNA sequence, search the primary biological DBs for similar (already known) sequences. Similarity (alignment) is used to infer homology.
- Compare a novel protein sequence to secondary protein DBs containing motifs, signatures, protein domains, etc., to approximate the biochemical function of the query protein.
From a computational point of view:
- both tasks are essentially searching

Search techniques for biological sequence data (BLAST, Clustal W)
Basic Local Alignment Search Tool (BLAST) [Altschul ‘90, ‘97]
The NCBI BLAST family of programs includes:
blastp - an amino acid query against a protein DB
blastn - a nucleotide query against a nucleotide DB
blastx - a nucleotide query (in all reading frames) against a protein DB
tblastn - a protein query against a nucleotide DB (in all reading frames)
tblastx - the six-frame translations of a nucleotide query against the six-frame translations of a nucleotide DB

How does BLAST work?
Local pairwise alignment
• The BLAST algorithm is a heuristic search method that seeks words of length W that score at least T when aligned with the query and scored with a substitution matrix.
• Words in the database that score T or greater are extended in both directions in an attempt to find an alignment that produces an HSP (high-scoring segment pair) with a score of at least S or an E value lower than the specified threshold.
• The value of the T parameter is a trade-off between the speed and the sensitivity of the search.
Source: National Center for Biotech Info
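The seed-and-extend idea described above can be sketched in a few lines of Python. This is a heavily simplified illustration and not NCBI BLAST's implementation: it assumes a toy +1/-1 scoring scheme instead of a real substitution matrix, scans all word pairs instead of using a lookup table of neighborhood words, performs ungapped extension only, and introduces a hypothetical drop-off parameter `X`. All function names are illustrative.

```python
def pair_score(a: str, b: str) -> int:
    """Toy substitution matrix (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def find_seeds(query, db, W, T):
    """Yield (i, j) where the words query[i:i+W] and db[j:j+W] score at least T."""
    for i in range(len(query) - W + 1):
        for j in range(len(db) - W + 1):
            s = sum(pair_score(query[i + k], db[j + k]) for k in range(W))
            if s >= T:
                yield i, j

def extend(query, db, i, j, W, X=3):
    """Ungapped extension of a seed to the right, then to the left.

    Extension stops when the running score falls more than X below the best
    score seen so far. Returns (query_start, db_start, length, score).
    """
    best = sum(pair_score(query[i + k], db[j + k]) for k in range(W))
    right = left = 0
    cur, k = best, 0
    while i + W + k < len(query) and j + W + k < len(db):
        cur += pair_score(query[i + W + k], db[j + W + k])
        k += 1
        if cur > best:
            best, right = cur, k
        elif best - cur > X:
            break
    cur, k = best, 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        cur += pair_score(query[i - 1 - k], db[j - 1 - k])
        k += 1
        if cur > best:
            best, left = cur, k
        elif best - cur > X:
            break
    return i - left, j - left, left + W + right, best

def search(query, db, W=4, T=4, S=6):
    """Report the distinct extended segment pairs whose score is at least S."""
    hsps = {extend(query, db, i, j, W) for i, j in find_seeds(query, db, W, T)}
    return sorted(h for h in hsps if h[3] >= S)

print(search("GATTACA", "CCGATTACAGG"))  # [(0, 2, 7, 7)]
```

Lowering T admits more seeds, which raises sensitivity but also the number of extensions to try; this is the speed/sensitivity trade-off named on the slide.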

BLAST Case Study [Hunt ‘01]
Hardware: SUN Enterprise 450, 2 GB RAM, 4 processors, Solaris 7
Software: BLAST (with default parameter settings)
Data: 3 human chromosomes (294 Mbp, 10% of the human genome), data on local disks
Queries: 99 query sequences (predicted human genes), with lengths between 429 and 5999 bp
Results: 6559 hits, average 66 hits per query
Time: 62 hours

BLAST Observations
“BLAST:
- performs serial scan of the DB;
- is CPU intensive;
- its usefulness depends on the biologists being able to provide appropriate search parameters values.” [Hunt ‘01]
“Filtering approaches, like BLAST, are only suitable for high similarity matching, but often low similarities are biologically significant.” [Navarro ‘00a]

Clustal W [Thompson ‘94]
Dynamic programming alignment method
Based on global multiple alignment
Input: set of N sequences
Output: a multiple alignment of the N sequences (built progressively; a global optimum is not guaranteed)
Improved sensitivity (may find similar sequences which BLAST may omit)
50-100 times slower than BLAST

Motivation for Indexing?
“Many of these biological datasets are growing at exponential rates – for example, the sizes of the sequence datasets in GenBank have been doubling every sixteen months.” [Tata ‘04]
“As there is a rapid rise in both the volume of data and the demand for searches by researchers investigating functional genomics, it is worth investigating the possibility of accelerating these searches using indexes.” [Hunt ‘01]

Indexing Techniques for Sequence Data
Q-grams [Navarro ‘98]
String B-Tree [Ferragina ‘99]
Multi-D Index [Jagadish ‘00]
Suffix Tree [Weiner ‘73, McCreight ‘76] [Ukkonen ‘95] [Hunt ‘01, Giegerich ‘03, Tata ‘04]

Q-grams -- Construction
Input: T is a text over Σ, |T| = n, |Σ| = σ
Pick an integer, say q = 4 (0 < q < n; a good heuristic is q ≈ log_σ n)
Each substring of T of length q is called a “q-gram” and is stored in the index table (in lexical order) with a list of pointers to the positions (or blocks) in T where this q-gram occurs
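The construction can be sketched as follows. This is a minimal in-memory illustration, not the on-disk structure of [Navarro ‘98]; the function name `build_qgram_index` is illustrative, and keeping the shorter grams at the end of the text is an assumption made to match the example in these slides, which uses the text `combatcarontariocontractcan` with q = 3.

```python
from collections import defaultdict

def build_qgram_index(text: str, q: int):
    """Return a lexically sorted list of (q-gram, [1-based positions]) pairs.

    The last q-1 positions yield shorter grams; they are kept here
    (assumption) so that every position of the text is indexed.
    """
    table = defaultdict(list)
    for i in range(len(text)):
        table[text[i:i + q]].append(i + 1)
    return sorted(table.items())

index = build_qgram_index("combatcarontariocontractcan", 3)
for gram, positions in index:
    print(gram, positions)
```

Sorting the table lexically is what later allows a pattern piece to be located by binary search.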

Q-grams -- Searching
For a pattern P, |P| = m, find all approximate occurrences P’ of P in T, where the error ratio of each P’ is ≤ λ
- λ = k / m, where k is the edit distance of P’ to P
- Knowing m and the desired λ, compute k
- Split P into k + 1 disjoint pieces (with at most k errors, at least one piece must occur unchanged)
- For each of the k + 1 pieces, search the index table (binary search)
- The set of candidate matches is the union of all occurrences
- Verify each candidate by neighborhood search

Q-grams -- Example
T = combatcarontariocontractcan (positions 1 to 27)
Set q = 3.
Index Table:
q-gram  T[pos]    q-gram  T[pos]
act     22        n       27
an      26        nta     11
ari     13        ntr     19
aro     8         oco     16
atc     5         omb     2
bat     4         ont     10, 18
can     25        rac     21
car     7         rio     14
com     1         ron     9
con     17        tar     12
ctc     23        tca     6, 24
ioc     15        tra     20
mba     3

Q-grams -- Example
Search for P = con, k = 1 (i.e. allow only one error); split P into k + 1 pieces: P1 = c and P2 = on
• Candidate matches
P1 = c : 25, 7, 1, 17, 23
P2 = on : 10, 18
• Verification (1 error allowed): compare con against bat, can, car, com, con, ctc, ioc, omb, ont, tar
Answer: T[25], T[1], T[17] + T[9], T[17]
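The example search can be reproduced with the sketch below. It is an illustration under stated assumptions rather than the algorithm of [Navarro ‘98]: each piece is assumed to be no longer than q, verification compares P only with the window of length m at each candidate position (occurrences of a different length are ignored), and all function names are hypothetical.

```python
from bisect import bisect_left

def build_qgram_index(text, q):
    """Lexically sorted list of (q-gram, [1-based positions])."""
    table = {}
    for i in range(len(text)):
        table.setdefault(text[i:i + q], []).append(i + 1)
    return sorted(table.items())

def edit_distance(a, b):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def split_pattern(p, n):
    """Split p into n contiguous pieces; later pieces take any extra characters."""
    base, extra = divmod(len(p), n)
    pieces, start = [], 0
    for idx in range(n):
        size = base + (1 if idx >= n - extra else 0)
        pieces.append((start, p[start:start + size]))
        start += size
    return pieces

def approximate_search(text, index, p, k):
    """Return sorted 1-based start positions whose window is within k edits of p."""
    grams = [g for g, _ in index]
    hits = set()
    for offset, piece in split_pattern(p, k + 1):
        lo = bisect_left(grams, piece)          # binary search in the index table
        while lo < len(grams) and grams[lo].startswith(piece):
            for pos in index[lo][1]:
                start = pos - offset            # where P would begin in T
                if start >= 1:
                    window = text[start - 1:start - 1 + len(p)]
                    if edit_distance(p, window) <= k:
                        hits.add(start)
            lo += 1
    return sorted(hits)

T = "combatcarontariocontractcan"
print(approximate_search(T, build_qgram_index(T, 3), "con", 1))  # [1, 9, 17, 25]
```

The result corresponds to com (1), ron (9), con (17) and can (25), matching the answer on the slide; car (7) and ctc (23) are candidates that fail verification.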

String B-Tree -- Construction
Input: a set of words, S = {aid, atom, attenuate, car, patent, zoo, atlas}
Step 1. Store S consecutively on disk.
Step 2. Sort lexicographically each suffix of each word.
Lexicographic order: “aid” : S[1], “ar” : S[21], “as” : S[38], “ate” : S[16], ..., “uate” : S[15], “zoo” : S[31]
Step 3. Create leaf nodes. Each node contains pointers to the sorted suffixes.
1 21 38 16 25 35 5 10 20 3 18 27 13 2 37 8 28 14 33 7 32 24 22 39 29 17 26 12 36 6 11 15 31
Step 4. Propagate the leftmost and rightmost pointers (LMP and RMP) of each node upward, until the root is constructed.
1 16 25 10 20 27 13 8 28 7 32 39 29 12 36 31
1 10 20 8 28 39 29 31
1 8 28 31
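The pointer lists above can be recomputed with a small sketch. It assumes, from the positions shown on the slide, that the words are stored one after another with a single separator slot between them and that positions are 1-based; the fan-out of 4 and the rule that merges a short last node into its neighbor are also inferred from the figure. The function names are illustrative, and the sketch ignores the disk layout and the Patricia tries of a real String B-Tree.

```python
WORDS = ["aid", "atom", "attenuate", "car", "patent", "zoo", "atlas"]

def sorted_suffix_pointers(words):
    """1-based start positions of all word suffixes, sorted by suffix."""
    entries, pos = [], 1
    for w in words:
        entries.extend((w[k:], pos + k) for k in range(len(w)))
        pos += len(w) + 1            # one separator slot after each word (assumption)
    return [p for _, p in sorted(entries)]

def group(pointers, fanout=4):
    """Cut a pointer list into nodes; a too-short last node joins its neighbor."""
    nodes = [pointers[i:i + fanout] for i in range(0, len(pointers), fanout)]
    if len(nodes) > 1 and len(nodes[-1]) < fanout // 2:
        tail = nodes.pop()
        nodes[-1].extend(tail)
    return nodes

def propagate(nodes):
    """Leftmost and rightmost pointer of every node, in order."""
    return [p for node in nodes for p in (node[0], node[-1])]

leaves = group(sorted_suffix_pointers(WORDS))
level1 = propagate(leaves)
level2 = propagate(group(level1))
root = propagate(group(level2))
print(level1)
print(level2)
print(root)   # [1, 8, 28, 31]
```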

String B-Tree -- Construction
Searching using this index structure is inefficient, because the keys are external and multiple I/Os are required to fetch them.

String B-Tree -- Construction
Each node is implemented as a modified Patricia Trie.
[Figure: Patricia trie for the node with pointers 1 16 25 10, i.e. the keys aid, ate, atent, attenuate.]

String B-Tree -- Searching
Find all occurrences of P = te in S
Start at root (1 8 28 31): t > n and t < z, branch right
Child node (28 39 29 31): t ≥ t and t < z, branch right
Child node (29 12 36 31): te ≥ te and te < tl, branch left
Leaf node (29 17 26 12): P = te found at S[17,18], S[26,27], S[12,13]
[Figure: the Patricia tries of the visited nodes; the keys at the leaf are t (S[29]), te (S[17]), tent (S[26]) and tenuate (S[12]).]
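The outcome of this search can be checked with a simplified sketch. Instead of descending a B-tree whose nodes are Patricia tries, it runs a binary search over the flat, sorted list of suffixes, which stands in for the descent; the storage layout (1-based positions, one separator slot between words) is an assumption inferred from the slide, and the function names are illustrative.

```python
from bisect import bisect_left

WORDS = ["aid", "atom", "attenuate", "car", "patent", "zoo", "atlas"]

def sorted_suffixes(words):
    """Sorted list of (suffix, 1-based start position) over all words."""
    entries, pos = [], 1
    for w in words:
        entries.extend((w[k:], pos + k) for k in range(len(w)))
        pos += len(w) + 1            # one separator slot after each word (assumption)
    return sorted(entries)

def find_occurrences(suffixes, pattern):
    """Positions of all suffixes that start with pattern, in lexicographic order."""
    keys = [s for s, _ in suffixes]
    i = bisect_left(keys, pattern)   # first suffix >= pattern
    found = []
    while i < len(keys) and keys[i].startswith(pattern):
        found.append(suffixes[i][1])
        i += 1
    return found

print(find_occurrences(sorted_suffixes(WORDS), "te"))  # [17, 26, 12]
```

All suffixes sharing the prefix te are adjacent in sorted order, which is why a single descent to the leaf (29 17 26 12) finds every occurrence.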

Multi-D Index -- Construction
Input: a set of pairs of strings (not necessarily of the same length)
Dimension X   Dimension Y
abce          magh
abcd          makk
abqs          makk
abqs          mdbc
alaa          magz
almn          mazz
abqa          maza
abzz          mdyz
Step 1. Store the pairs of strings (separated properly) consecutively on disk:
#abce$magh#abcd$makk#abqs$makk#abqs$mdbc#alaa$magz#almn$mazz#abqa$maza#abzz$mdyz#
The figure marks the offsets 0, 5, 10, ..., 75, the positions of the separator symbols # and $.

Multi-D Index -- Construction
Step 2. Create index leaf nodes, storing pointers to the separating symbols.
Step 3. Construct internal nodes (until the root is constructed).
R-trees and MBR computation are used for building up the index.
[Figure: leaf nodes over the stored string, holding the pointers 10 20 5 35 60 50 45 75, grouped into MBR1 and MBR2.]

Multi-D Index -- Construction
Searching using this index structure is inefficient, because the keys are external and multiple I/Os are required to fetch them.
At each node, for each dimension, create an ‘Elided Trie’ (E-Trie). E-Tries are very similar to Patricia Tries.
For searches, use the E-Tries in a similar manner as the Patricia Tries (during the downward traversal of the index tree).

Multi-D Index -- Searching
Prefix search: Q1 = (abc*, makk*)
Start at root E-Tries
repeat {
  x-dim: abc* can only be in the left MBR
  y-dim: makk* can be in both MBRs
  Compute the intersection: examine only the left MBR
  ... until a leaf index node is reached ...
}
Step k (leaf page) {
  // compute candidates
  x-dim: string pair @ 0, string pair @ 10
  y-dim: string pair @ 10, string pair @ 20
  Answer to query = the intersection
}
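The semantics of the prefix query can be illustrated with a brute-force sketch that computes the per-dimension candidate sets and intersects them. It deliberately replaces the R-tree and E-Trie machinery with a linear scan, so it shows only what the query returns and not how the index prunes the search; the function names are hypothetical.

```python
PAIRS = [
    ("abce", "magh"), ("abcd", "makk"), ("abqs", "makk"), ("abqs", "mdbc"),
    ("alaa", "magz"), ("almn", "mazz"), ("abqa", "maza"), ("abzz", "mdyz"),
]

def store(pairs):
    """Concatenate the pairs as on the slide: '#' starts a pair, '$' splits it."""
    return "#" + "#".join(f"{x}${y}" for x, y in pairs) + "#"

def prefix_query(pairs, x_prefix, y_prefix):
    """Return (x candidates, y candidates, answer) as offsets of the leading '#'."""
    offsets, off = [], 0
    for x, y in pairs:
        offsets.append(off)
        off += len(x) + len(y) + 2   # two separators per stored pair
    x_cand = {o for o, (x, _) in zip(offsets, pairs) if x.startswith(x_prefix)}
    y_cand = {o for o, (_, y) in zip(offsets, pairs) if y.startswith(y_prefix)}
    return sorted(x_cand), sorted(y_cand), sorted(x_cand & y_cand)

print(store(PAIRS))
print(prefix_query(PAIRS, "abc", "makk"))  # ([0, 10], [10, 20], [10])
```

The only pair satisfying both prefixes is the one stored at offset 10, (abcd, makk).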

Suffix Tree [Gusfield ‘97]
A Suffix Tree for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m.
Each internal node (except the root) has at least 2 children and each edge is labeled with a nonempty substring of S.
No 2 edges out of a node can have edge-labels beginning with the same character.
The key feature of the Suffix Tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i, i.e., S [i..m].

Suffix Tree
Input: string S = xabxa; add $ at the end (so that no suffix of S is a prefix of another suffix).
Suffix Tree for S = xabxa$:
root
  xa
    bxa$ : leaf 1
    $ : leaf 4
  a
    bxa$ : leaf 2
    $ : leaf 5
  bxa$ : leaf 3
  $ : leaf 6

Suffix Tree -- Searching
S = x a b x a $ (positions 1 2 3 4 5 6)
Find all occurrences of P = xa in S
Matching P from the root along the edge xa reaches a node whose subtree contains leaves 1 and 4, so P occurs at positions 1 and 4.
[Figure: the suffix tree for S = xabxa$ with the path for P highlighted.]
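The search can be tried out with the following sketch. It builds the tree by naive insertion of every suffix, which takes O(m^2) time and is not one of the linear-time algorithms cited in these slides; the class and function names are illustrative, and the input is assumed to end with a unique terminator such as $.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character -> (edge label, child Node)
        self.leaf = None     # 1-based suffix start, set on leaves only

def build_suffix_tree(s):
    """Naive construction: insert each suffix, splitting edges where needed."""
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = i + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Split the edge after the first k characters.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root

def find(root, pattern):
    """Return the sorted start positions of all occurrences of pattern."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return []
        label, child = node.children[rest[0]]
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return []
        node, rest = child, rest[n:]
    found, stack = [], [node]
    while stack:                      # collect every leaf below the match point
        current = stack.pop()
        if current.leaf is not None:
            found.append(current.leaf)
        stack.extend(child for _, child in current.children.values())
    return sorted(found)

tree = build_suffix_tree("xabxa$")
print(find(tree, "xa"))  # [1, 4]
```

Once the tree exists, the walk down takes time proportional to the pattern length, plus the number of leaves reported.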

Generalized Suffix Tree
An ST can be built for more than one string.
S1 = x a b x a $ (positions 1 to 6)
S2 = b x a d $ (positions 1 to 5)
[Figure: generalized suffix tree for S1 and S2; each leaf is labeled with a pair (position, string number), from 1,1 to 6,1 for S1 and from 1,2 to 5,2 for S2.]
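One use of a structure built over several strings is finding their longest common substring. The sketch below illustrates the idea on an uncompressed suffix trie rather than a true generalized suffix tree, so it uses far more space than the structure on the slide; each trie node records which input strings pass through it. The dictionary layout and function names are assumptions made for illustration.

```python
def build_generalized_suffix_trie(strings):
    """Insert every suffix of every string; each node records the string ids seen."""
    root = {"ids": set(), "next": {}}
    for sid, s in enumerate(strings, start=1):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(sid)
    return root

def longest_common_substring(strings):
    """Deepest path whose every node is reached by all of the input strings."""
    root = build_generalized_suffix_trie(strings)
    all_ids = set(range(1, len(strings) + 1))
    best, stack = "", [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["next"].items():
            if child["ids"] == all_ids:
                candidate = path + ch
                if len(candidate) > len(best):
                    best = candidate
                stack.append((child, candidate))
    return best

print(longest_common_substring(["xabxa", "bxad"]))  # bxa
```

In a compressed generalized suffix tree the same question is answered by looking for the deepest internal node that has leaves from every string below it.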

Desired Properties of the Indexing Technique
- Relatively fast construction, reasonable amount of storage consumption (persistently stored);
- Allows huge sequences to be indexed;
- Supports versatile queries over data;
+
Supports bioinformatics applications!

Applicability to Biological Sequence Data
Data structure: suitable for bio-data indexing?
Q-grams: Yes. BLAST uses a very similar idea. Provides high similarity matching, suitable for some bioinformatics applications.
String B-Tree: Yes? A DNA sequence cannot be broken into words, but can we exploit the repeats?
Multi-D Index: Yes? Can we view promoters, genes, exons, introns, etc. as attributes in a DB?
Suffix Tree: Yes? Slow construction, limited input sequence size, size of index ≈ 10x size of input, but supports versatile queries over data.

Suffix Trees: A closer look
Suffix Trees are well known in the biological sequence processing field
Recent advances in Suffix Tree construction algorithms
Suffix Trees provide support for answering versatile biological questions

Suffix Tree (ST) Applications
REPuter [Kurtz ‘99]: The REPuter program family provides state of the art software solutions to compute and visualize repeats in whole genomes or chromosomes.
MUMmer [Delcher ‘99, ‘02, ‘04]: MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. The NUCmer program aligns contigs from a shotgun sequencing project to another set of contigs or a genome.

ST Construction Algorithms History
[Weiner ‘73] First linear time algorithm to build Suffix Tree (called Position Tree).
[McCreight ‘76] A more space efficient solution.
[Ukkonen ‘95] Presents a variation of [McCreight ‘76], but much easier to understand, to prove bounds, and to implement.
All these algorithms are in-memory algorithms. In practice, the sequences to be indexed are too large to fit in memory, and the corresponding ST is ≈ 10x bigger.

Advances in ST Construction Algorithms
[Hunt ‘01] Abandons the use of suffix links (the algorithm is no longer linear) and presents the idea of partitioning to reduce the number of disk I/Os.
[Giegerich ‘03] Proposes a space-efficient representation of the ST.
[Tata ‘04] Extends ideas in [Hunt ‘01] and [Giegerich ‘03]; focuses on the development of an efficient buffering strategy.
[Tata ‘04] builds an ST on the entire human genome (approx. 3 Gbp) in 30 hours, using a single-processor machine; even for the in-memory case, [Tata ‘04], which is O(m^2), performs better than [Ukkonen ‘95], which is O(m).

Versatile Biological Support by ST
- Exact search (with or without wild cards)
- Approximate search
- [Longest] common substring/subsequence of 2 (or more) strings: recognizing DNA contamination, alignment
- [Shortest] superstring of 2 (or more) strings: shotgun sequencing and sequence assembly
- Finding repeats in a single sequence
- Compressing DNA strings to study the information content of a string or to discriminate between exons and introns in eukaryotic DNA
- ...

Suffix Tree Representations
Suffix Array [Manber ‘93, Myers ‘94, Baeza-Yates ‘00]
LC-tries [Anderson ‘95]
Suffix Binary Search Tree [Irving ‘03]

Conclusion
BLAST Case Study
Observations on existing searching techniques
Alternative indexing techniques for sequence data and their possible application for biological sequence data
Suffix Trees

Future Work
Suffix Tree Construction
- Further improvements of the [Tata ‘04] algorithm (time/space)
- Combining two (or more) Suffix Trees
- Suffix Tree maintenance
Suffix Tree Usage
- Most of the widely known ST-based algorithms rely on the suffix links. How will the algorithms that use STs change in the absence of suffix links?
- Potential of ST for mining biodata
Alternative Index Data Structures
“Families of reiterated sequences account for about one third of the human genome.” [McConkey ‘93]

References
[Altschul ‘90] S.F. Altschul et al. “Basic local alignment search tool”. J. Mol. Biol., 215:403-10, 1990.
[Altschul ‘97] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”. Nucleic Acids Research, 25:3389-3402, 1997.
[Anderson ‘95] A. Andersson and S. Nilsson. “Efficient implementation of suffix trees”. Softw. Pract. Exp., 25(2):129-141, 1995
[Baeza-Yates ‘00] R. Baeza-Yates and G. Navarro. “A Hybrid Indexing Method for Approximate String Matching”. Journal of Discrete Algorithms, 2000.
[Delcher ‘99] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. “Alignment of Whole Genomes”. Nucleic Acids Research, 27:2369-2376, 1999.
[Ferragina ‘99] P. Ferragina and R. Grossi. “The string B-tree: a new data structure for string search in external memory and its applications”. Journal of the ACM, 46(2):236-280, 1999
[Giegerich ‘03] R. Giegerich, S. Kurtz, and J. Stoye. “Efficient implementation of lazy suffix trees”. Softw. Pract. Exper. 2003; 33:1035-1049, 2003
[Gusfield ‘97] D. Gusfield. “Algorithms on strings, trees and sequences : computer science and computational biology”. Cambridge University Press, 1997
[Hunt ‘01] E. Hunt, M.P. Atkinson, and R.W. Irving. “A Database Index to Large Biological Sequences”. In VLDB J., 7(3):139-148, 2001
[Irving ‘03] R.W. Irving and L. Love. “The Suffix Binary Search Tree and Suffix AVL Tree”. Journal of Discrete Algorithms, 1 (2003) 387–408, 2003.
[Jagadish ‘00] H.V. Jagadish, Nick Koudas, and Divesh Srivastava. “On effective multi-dimensional indexing for strings”. In ACM SIGMOD Conference on Management of Data, pages 403-414, 2000.
[Kurtz ‘99] S. Kurtz and C. Schleiermacher. “REPuter: fast computation of maximal repeats in complete genomes”. Bioinformatics, pages 426-427, 1999
[Manber ‘93] U. Manber and G. Myers. “Suffix arrays: a new method for on-line string searches”. SIAM J. Comput., 22(5):935-948, 1993.
[McConkey ‘93] E. McConkey. “Human Genetics: The Molecular Revolution”. Jones and Bartlett, Boston, MA, 1993
[McCreight ‘76] E.M. McCreight. “A Space-economical Suffix Tree Construction Algorithm”. J. ACM, 23(2):262-272, 1976
[Myers ‘94] E. W. Myers. “A sublinear algorithm for approximate key word searching”. Algorithmica,12(4/5):345-374, 1994.
[Navarro ‘98] G. Navarro and R. Baeza-Yates. “A practical q-gram index for text retrieval allowing errors”. CLEI Electronic Journal, 1(2), 1998
[Navarro ‘00a] G. Navarro. “A Guided Tour to Approximate String Matching”. ACM Computing Surveys,33:1:31-88, 2000.
[Navarro ‘00b] G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. “Indexing Text with Approximate q-grams”. In CPM2000, LNCS 1848, pages 350-365, 2000
[Tata ‘04] S. Tata, R.A. Hankins, and J. Patel. “Practical Suffix Tree Construction”. In Proc. of the 30th VLDB, 2004
[Thompson ‘94] J. D. Thompson, D. G. Higgins, and T. J. Gibson. “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice”. In Nucleic Acids Research, Vol. 22, No. 22 4673-4680, 1994
[Ukkonen ‘95] E. Ukkonen. “On-line construction of suffix-trees”. Algorithmica 14 (1995), 249-260, 1995
[Weiner ‘73] P. Weiner. “Linear Pattern Matching Algorithms”. In Proc. of the 14th Annual Symposium on Switching and Automata Theory, 1973

Thank You!