Suffix Trees, Suffix Arrays and Suffix Trays
Richard Cole
Tsvi Kopelowitz
Moshe Lewenstein
Indexing problem
Input: Text T=t1,…,tn (preprocess to DS)
Queries: Pattern P=p1,…,pm (use DS)
[Figure: the text T with positions 5, 14 and 30 marked.]
Suffix Property
P appears at location i of T iff
P is a prefix of the suffix Ti = ti,…,tn.
[Figure: the text T with positions 5, 14 and 30 marked, and the suffix T14 shown below it.]
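The suffix property can be checked directly, without any index. The following is a minimal Python sketch; the function name `occurrences` and the 1-based positions (chosen to match the slides) are illustrative assumptions, not part of the talk.

```python
def occurrences(text, pattern):
    """Return the 1-based locations i where pattern is a prefix of the suffix T_i."""
    # str.startswith(pattern, i) tests whether text[i:] begins with pattern.
    return [i + 1 for i in range(len(text)) if text.startswith(pattern, i)]


print(occurrences("mississippi", "ssi"))  # [3, 6]
```

This scan inspects every suffix on every query; the index structures in the rest of the talk exist to avoid exactly that.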
Suffix Tree
A suffix tree for string S is a compressed trie of all suffixes of S.
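Example: S = abab$
Suffixes: { $, b$, ab$, bab$, abab$ }
[Figure: suffix tree of abab$. The root has edges "ab", "b" and "$"; the nodes below "ab" and below "b" each have two leaf edges, "ab$" and "$".]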
Suffix Tree
The size of the suffix tree of S is O(|S|).
Example: S = abab$, positions 0,…,4
Suffixes: { $, b$, ab$, bab$, abab$ }
Each leaf is labeled with the starting position (0, 1, 2, 3, 4) of its suffix.
Each edge label is stored as a pair of indices [i,j] into S rather than as a string: "ab" = [2,3], "b" = [1,1], "ab$" = [2,4], "$" = [4,4].
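To make the "compressed trie with index-pair edge labels" idea concrete, here is a small Python sketch. It is a naive quadratic-time construction by recursive grouping, not one of the linear-time algorithms cited in the talk. The names `build_node`, `leaves` and `count_edges` are hypothetical, positions are 0-based as in the example, and edge labels are half-open pairs (start, end) rather than the closed intervals [i,j] of the slides. It assumes the string ends with a sentinel that occurs nowhere else.

```python
def build_node(s, starts, depth=0):
    """Build the node whose subtree holds the suffixes s[i:] for i in starts.

    All of these suffixes agree on their first `depth` characters. The node is
    returned as a list of edges (label_start, label_end, child): the edge label
    is s[label_start:label_end], and child is either an int (a leaf, holding
    the starting position of its suffix) or another list of edges.
    Assumes s ends with a sentinel such as '$' that occurs nowhere else.
    """
    groups = {}
    for i in starts:
        groups.setdefault(s[i + depth], []).append(i)
    edges = []
    for c in sorted(groups):              # children ordered by first character
        group = groups[c]
        if len(group) == 1:
            i = group[0]
            edges.append((i + depth, len(s), i))   # leaf edge: rest of the suffix
            continue
        d = depth + 1
        while len({s[i + d] for i in group}) == 1:  # extend the shared edge label
            d += 1
        first = group[0]
        edges.append((first + depth, first + d, build_node(s, group, d)))
    return edges


def leaves(node):
    """Leaf labels in left-to-right order."""
    out = []
    for _, _, child in node:
        out.extend([child] if isinstance(child, int) else leaves(child))
    return out


def count_edges(node):
    """Number of edges in the subtree, which is the number of nodes minus one."""
    return sum(1 + (0 if isinstance(child, int) else count_edges(child))
               for _, _, child in node)


s = "abab$"
root = build_node(s, range(len(s)))
print(count_edges(root))  # 7
print(leaves(root))       # [4, 2, 0, 3, 1]
```

Every edge costs two integers regardless of its label length, which is why the tree fits in O(|S|) space. Here Python's character order puts '$' before the letters.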
Indexing and Suffix Trees
Navigate from the root, following the characters of P. (Use the suffix property.)
Example: P = ssi
Time: O(|P| + occ) for a constant-size alphabet; O(|P| log|Σ| + occ) in general.
Suffix Trees
Weiner 1973 (linear time construction!)
McCreight 1975 (space efficient)
Ukkonen 1995 (online)
Farach 1997 (poly range alphabets)
Suffix Array
All suffixes:
S1 mississippi
S2 ississippi
S3 ssissippi
S4 sissippi
S5 issippi
S6 ssippi
S7 sippi
S8 ippi
S9 ppi
S10 pi
S11 i
Sorted suffixes (the POS column is the suffix array):
POS  Suffix
11   S11 i
8    S8 ippi
5    S5 issippi
2    S2 ississippi
1    S1 mississippi
10   S10 pi
9    S9 ppi
7    S7 sippi
4    S4 sissippi
6    S6 ssippi
3    S3 ssissippi
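The table above can be reproduced by literally sorting the suffixes. The sketch below does that in Python; the function name `suffix_array` is an assumption for illustration, and this is the naive method, not any of the constructions cited later in the talk.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s, in lexicographic order."""
    # Comparing whole suffixes makes this quadratic or worse in the worst case.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(suffix_array("mississippi"))  # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```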
Suffix Array
S = m i s s i s s i p p i
SA(S) = 11 8 5 2 1 10 9 7 4 6 3
P = pi
Binary search for P over SA(S): each of the O(log |S|) steps compares P with one suffix in O(|P|) time.
Time: O(|P|*log |S|)
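The binary search can be sketched as follows. This is a plain O(|P| log |S|) version under illustrative assumptions: the names `suffix_array` and `find` are hypothetical, positions are 1-based as in the slides, and the suffix array is built by naive sorting.

```python
def suffix_array(s):
    """1-based suffix array of s, built by naive sorting."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find(s, sa, p):
    """Sorted 1-based positions of all occurrences of p in s."""
    m = len(p)

    def prefix(idx):
        # First m characters of the suffix stored at sa[idx].
        start = sa[idx] - 1
        return s[start:start + m]

    lo, hi = 0, len(sa)
    while lo < hi:                 # first suffix whose m-prefix is >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                 # first suffix whose m-prefix is > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


s = "mississippi"
sa = suffix_array(s)
print(find(s, sa, "pi"))   # [10]
print(find(s, sa, "ssi"))  # [3, 6]
```

All suffixes that start with P are contiguous in the suffix array, so two boundary searches are enough to report every occurrence.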
Suffix Array
Introduced:
Manber and Myers (1993).
Gonnet, Baeza-Yates, Snider (1992) (PAT arrays).
Manber and Myers (1993):
Query time: O(|P| + log |S|)
Suffix Array Construction
Manber and Myers (1993) - O(n log n).
Kärkkäinen-Sanders (2003) - O(n) (polynomial-range alphabets)
Two other papers as well.
End of Story?
No. Lots of questions.
1. Construction time of suffix trees.
2. Query time.
3. Compressed indexing structures.
4. Indexing with errors.
5. Real-time suffix tree construction.
Query Time for Large Alphabets
Suffix Trees: O(|P|*log|Σ|) (deterministic)
Suffix Arrays: O(|P| + log |T|)
Suffix Trays: O(|P|+log|Σ|) for alphabets {1,…,|Σ|}
Query Time for Large Alphabets
Actually, it is easy to answer queries in O(|P|) time.
Store at every node of the suffix tree an array of length |Σ|, indexed by character.
Then navigation at every node takes O(1) time.
However, the time and space of suffix tree construction become O(n|Σ|).
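The trade-off can be illustrated with a single node. The sketch below contrasts a length-|Σ| child table (O(1) lookup, |Σ| cells per node) with a sorted child list searched by bisection (O(log |Σ|) lookup, space proportional to the number of children). All names are hypothetical, and the alphabet is assumed to be the integers 1,…,|Σ|.

```python
from bisect import bisect_left


def make_child_table(children, sigma):
    """Array indexed by character 1..sigma; cell c holds the child on c, or None."""
    table = [None] * (sigma + 1)   # index 0 is unused
    for c, child in children.items():
        table[c] = child
    return table


def child_by_table(table, c):
    """O(1) navigation: one array access."""
    return table[c]


def child_by_search(sorted_chars, sorted_children, c):
    """O(log sigma) navigation: binary search among the existing children."""
    k = bisect_left(sorted_chars, c)
    if k < len(sorted_chars) and sorted_chars[k] == c:
        return sorted_children[k]
    return None


sigma = 1000
children = {3: "u", 417: "v", 998: "w"}   # a node with three children
table = make_child_table(children, sigma)
chars = sorted(children)
kids = [children[c] for c in chars]
print(child_by_table(table, 417), child_by_search(chars, kids, 417))  # v v
print(len(table), len(chars))  # 1001 3
```

With O(n) nodes in the tree, the table variant needs on the order of n|Σ| cells, which is the cost the slide points out.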
Suffix Tree – Suffix Array connection
With the children of every node sorted by the first character of their edge, the left-to-right order of the leaves of the suffix tree is exactly the suffix array.
Suffix Array
Example: mississippi$ (here $ sorts after every letter)
All suffixes:
S1 mississippi$
S2 ississippi$
S3 ssissippi$
S4 sissippi$
S5 issippi$
S6 ssippi$
S7 sippi$
S8 ippi$
S9 ppi$
S10 pi$
S11 i$
S12 $
Sorted suffixes (the POS column is the suffix array):
POS  Suffix
8    S8 ippi$
5    S5 issippi$
2    S2 ississippi$
11   S11 i$
1    S1 mississippi$
10   S10 pi$
9    S9 ppi$
7    S7 sippi$
4    S4 sissippi$
6    S6 ssippi$
3    S3 ssissippi$
12   S12 $
SA(mississippi$) = 8 5 2 11 1 10 9 7 4 6 3 12
Suffix Tree – Suffix Array connection
We utilize this connection as follows:
Every node in the suffix tree corresponds to an interval in the suffix array.
Example: mississippi$
SA(mississippi$) = 8 5 2 11 1 10 9 7 4 6 3 12
For instance, the node reached by reading "s" corresponds to the interval 7 4 6 3.
Suffix Tree – Suffix Array connection
Moreover, the time to search in the suffix array on an interval I is:
O(|P| + log |I|).
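The node-to-interval correspondence can be sketched by narrowing a suffix-array interval one character at a time, which mirrors stepping from a node to one of its children. The names `char_at`, `narrow` and `node_interval` are hypothetical. The sketch uses the string without the sentinel and Python's character order, and each step costs O(log |I|), so it does not reach the O(|P| + log |I|) bound quoted above.

```python
def suffix_array(s):
    """1-based suffix array of s, built by naive sorting."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def char_at(s, sa, idx, depth):
    """Character at offset depth of the suffix sa[idx], or '' past its end."""
    pos = sa[idx] - 1 + depth
    return s[pos] if pos < len(s) else ""


def narrow(s, sa, lo, hi, depth, c):
    """Sub-interval of sa[lo:hi] whose suffixes have character c at offset depth.

    Assumes all suffixes in sa[lo:hi] share their first `depth` characters, so
    their characters at offset depth appear in sorted order.
    """
    left, right = lo, hi
    while left < right:                    # first entry with character >= c
        mid = (left + right) // 2
        if char_at(s, sa, mid, depth) < c:
            left = mid + 1
        else:
            right = mid
    first = left
    right = hi
    while left < right:                    # first entry with character > c
        mid = (left + right) // 2
        if char_at(s, sa, mid, depth) <= c:
            left = mid + 1
        else:
            right = mid
    return first, left


def node_interval(s, sa, prefix):
    """Half-open interval [lo, hi) of sa holding the suffixes that start with prefix."""
    lo, hi = 0, len(sa)
    for depth, c in enumerate(prefix):
        lo, hi = narrow(s, sa, lo, hi, depth, c)
        if lo == hi:
            break
    return lo, hi


s = "mississippi"
sa = suffix_array(s)
lo, hi = node_interval(s, sa, "s")
print((lo, hi), sa[lo:hi])  # (7, 11) [7, 4, 6, 3]
```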
Suffix Tree – Suffix Array connection
Definition: a |Σ|-leaf is a node that
(1) has at least |Σ| leaves in its subtree, and
(2) has no child with at least |Σ| leaves in its subtree.
The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|²).
Why?
It has at most |Σ| children, each with fewer than |Σ| leaves in its subtree.
Suffix Tree – Suffix Array connection
The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|²).
Hence the time to search in the suffix array within the interval of a |Σ|-leaf is:
O(|P| + log |Σ|²) = O(|P| + log |Σ|).
Example: mississippi$
SA(mississippi$) = 8 5 2 11 1 10 9 7 4 6 3 12
Suffix Tray
Idea Outline:
Navigate in the suffix tree until a |Σ|-leaf is hit, then move to the suffix array (time in the suffix array: O(|P| + log |Σ|)).
Problem:
Navigation in the suffix tree takes O(|P| log |Σ|) time.
We promised O(|P| + log |Σ|).
Suffix Tray
Recall idea:
Store at every node of the suffix tree an array of length |Σ|.
Then navigation at every node takes O(1) time.
Too expensive overall: O(n|Σ|).
But OK for O(n/|Σ|) nodes: their arrays take O(n) space in total.
Suffix Tray
Idea: Truncate the suffix tree at the |Σ|-leaves to obtain the Σ-tree.
Would be nice: size of Σ-tree = O(n/|Σ|).
However, this is not the case.
[Figure: suffix tree of S = ababababa$. Legend: nodes with fewer than |Σ| leaves in their subtree, |Σ|-leaves, and the rest.]
Suffix Tray
Alternative Idea: Define the Σ-tree as the suffix tree with every node that has fewer than |Σ| leaves in its subtree removed.
Nodes in Σ-tree:
1. Σ-leaf: a node with no children in the Σ-tree.
2. Branching-Σ-node: a node with at least 2 children in the Σ-tree.
3. Others: nodes with only one child in the Σ-tree.
Suffix Tray - Example
[Figure: the example suffix tree with its nodes marked by type. Legend: nodes with fewer than |Σ| leaves, |Σ|-leaves, branching |Σ|-nodes, and others.]
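Suffix Tray
Observation:
# of Σ-leaves = O(n/|Σ|), since their subtrees are disjoint and each contains at least |Σ| of the n leaves.
Hence, # of branching-Σ-nodes = O(n/|Σ|).
So we can afford to store a Σ-table for navigation at each of them.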
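The node classification and the counting observation can be checked on a small example. The sketch below builds a suffix tree by naive insertion of every suffix (quadratic time, not one of the cited linear-time constructions) and counts the three kinds of Σ-tree nodes. The names `Node`, `build_suffix_tree` and `classify` are hypothetical, and the code assumes that the string ends with a unique sentinel and that |Σ| ≥ 2.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start, self.end = start, end   # incoming edge label is s[start:end]
        self.children = {}                  # first character of edge -> child


def build_suffix_tree(s):
    """Naive suffix tree: insert each suffix, splitting an edge on mismatch."""
    root, n = Node(), len(s)
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(s[pos])
            if child is None:
                node.children[s[pos]] = Node(pos, n)      # new leaf
                break
            k = child.start
            while k < child.end and s[k] == s[pos]:       # walk along the edge
                k += 1
                pos += 1
            if k == child.end:                            # whole edge matched
                node = child
                continue
            mid = Node(child.start, k)                    # split the edge at k
            node.children[s[mid.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[pos]] = Node(pos, n)           # new leaf
            break
    return root


def classify(root, sigma):
    """Count Σ-leaves, branching-Σ-nodes and other nodes of the Σ-tree."""
    counts = {"sigma_leaf": 0, "branching": 0, "other": 0}

    def visit(node):
        if not node.children:
            return 1                        # a suffix tree leaf
        leaves = heavy = 0
        for child in node.children.values():
            below = visit(child)
            leaves += below
            heavy += 1 if below >= sigma else 0   # child stays in the Σ-tree
        if leaves >= sigma:
            if heavy == 0:
                counts["sigma_leaf"] += 1
            elif heavy == 1:
                counts["other"] += 1
            else:
                counts["branching"] += 1
        return leaves

    visit(root)
    return counts


s = "ababababa$"
counts = classify(build_suffix_tree(s), sigma=3)
print(counts)  # {'sigma_leaf': 2, 'branching': 1, 'other': 3}
```

On this example the two Σ-leaves and the single branching node stay within the n/|Σ| bound, while the nodes with one Σ-tree child are the ones that are not covered by that bound.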
Suffix Tray – What is Left?
Suffix Tray
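Nodes in Σ-tree with only one child.
[Figure: such a node with several child edges, only one of which leads to a Σ-tree child, drawn above the suffix array.]
The leaves below its other children (at most |Σ| children, each with fewer than |Σ| leaves) form suffix-array intervals of size less than |Σ|².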
Suffix Tray
Size of suffix tray: O(n)
Navigation:
1. Σ-leaf: jump to the suffix array.
2. Branching-Σ-node: look at the Σ-array.
3. Others: look at one character, the one leading to the Σ-tree child.
Time: O(|P| + log|Σ|)