Suffix Trees and Derived Applications Carl Bergenhem and Michael Smith.
![Page 1: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/1.jpg)
Suffix trees
![Page 2: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/2.jpg)
Trie
• A tree representing a set of strings.
[Figure: a trie storing the set { aeef, ad, bbfe, bbfg, c }]
![Page 3: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/3.jpg)
Trie (Cont)
• Assume no string is a prefix of another
Each edge is labeled by a letter, and no two edges leaving the same node carry the same letter.
Each string corresponds to a leaf.
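The definition above can be sketched in a few lines. This is a minimal illustration, assuming a trie stored as nested dictionaries; the names `build_trie` and `leaves` are hypothetical and not taken from the slides.

```python
def build_trie(words):
    """Build a trie as nested dicts; each key is the letter on one edge."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            # Reuse the edge if it exists, otherwise create a new child.
            node = node.setdefault(letter, {})
    return root


def leaves(node, prefix=""):
    """List the strings spelled by root-to-leaf paths, in sorted order."""
    if not node:
        return [prefix]
    found = []
    for letter in sorted(node):
        found.extend(leaves(node[letter], prefix + letter))
    return found


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(leaves(trie))
```

Because no string in the set is a prefix of another, every stored string ends at a leaf, so `leaves` recovers exactly the original set.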
![Page 4: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/4.jpg)
Compressed Trie
• Compress unary nodes, label edges by strings
[Figure: the trie before and after compression; the unary paths collapse into edges labeled eef and bbf]
![Page 5: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/5.jpg)
Suffix tree
Given a string s, a suffix tree of s is a compressed trie of all suffixes of s.
To make these suffixes prefix-free, we append a special character, say $, to the end of s.
![Page 6: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/6.jpg)
Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$:
{ $, b$, ab$, bab$, abab$ }
[Figure: the suffix tree of abab$]
![Page 7: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/7.jpg)
Trivial algorithm to build a Suffix tree
Put the longest suffix, abab$, in.
Put the suffix bab$ in.
![Page 8: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/8.jpg)
Put the suffix ab$ in
![Page 9: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/9.jpg)
Put the suffix b$ in
![Page 10: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/10.jpg)
Put the suffix $ in
![Page 11: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/11.jpg)
We will also label each leaf with the starting position of the corresponding suffix.
[Figure: the suffix tree of abab$ with leaves labeled 1 to 5]
![Page 12: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/12.jpg)
Analysis
Takes O(n^2) time to build.
We will see how to do it in O(n) time.
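The trivial algorithm can be sketched as follows. This is an illustrative version under stated assumptions: edge labels are stored as explicit strings for readability, the input does not contain the terminator, and the names `Node`, `build_suffix_tree`, and `leaf_suffixes` are hypothetical.

```python
class Node:
    def __init__(self, start=None):
        self.children = {}  # first letter of edge label -> (label, child)
        self.start = start  # 1-based suffix start for leaves, None otherwise


def build_suffix_tree(s, end="$"):
    """Insert the suffixes of s + end one by one, longest first (O(n^2))."""
    text = s + end
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this letter: hang a new leaf here.
                node.children[first] = (rest, Node(start=i + 1))
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # The whole edge matched: continue below the child.
                node, rest = child, rest[k:]
            else:
                # Mismatch inside the edge: split it with a new internal node.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                node, rest = mid, rest[k:]
    return root


def leaf_suffixes(node, path=""):
    """Return (suffix, start position) for every leaf below node."""
    if not node.children:
        return [(path, node.start)]
    found = []
    for label, child in node.children.values():
        found.extend(leaf_suffixes(child, path + label))
    return found


root = build_suffix_tree("abab")
print(sorted(leaf_suffixes(root)))
```

Each insertion may scan a whole suffix, which is where the quadratic bound comes from.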
![Page 13: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/13.jpg)
What can we do with it ?
Exact string matching: given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T.
We may also want to find all occurrences of P in T.
![Page 14: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/14.jpg)
Exact string matching
In preprocessing we just build a suffix tree in O(n) time.
[Figure: the suffix tree of abab$ with leaves labeled 1 to 5]
Given a pattern P = ab, we traverse the tree according to the pattern.
![Page 15: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/15.jpg)
If we did not get stuck traversing the pattern, then the pattern occurs in the text.
Each leaf in the subtree below the node we reach corresponds to an occurrence.
By traversing this subtree we get all k occurrences in O(m + k) time: O(m) to walk down along the pattern and O(k) to visit the subtree.
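The query can be illustrated with a short sketch. To keep it small it assumes an uncompressed suffix trie stored as nested dictionaries (so it uses quadratic space and does not achieve the stated bounds); the walk and the leaf collection are the same as on a real suffix tree. The names `build_suffix_trie` and `occurrences` are hypothetical.

```python
def build_suffix_trie(text, end="$"):
    """Uncompressed trie of all suffixes of text + end (simplifying assumption)."""
    text += end
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1  # leaf label: 1-based start of this suffix
    return root


def occurrences(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = root
    for ch in pattern:  # walk down along the pattern
        if ch not in node:
            return []  # got stuck: the pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:  # every leaf below the reached node is an occurrence
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab")
print(occurrences(trie, "ab"))
```

For P = ab on abab$, the walk ends above the leaves labeled 1 and 3, matching the figure.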
![Page 16: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/16.jpg)
Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
To associate each suffix with a unique string in S, append a different special character to each s.
![Page 17: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/17.jpg)
Generalized suffix tree (Example)
Let s1 = abab and s2 = aab. Here is a generalized suffix tree for s1 and s2:
{ $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }
[Figure: generalized suffix tree for abab$ and aab#, with each leaf labeled by the starting position of its suffix]
![Page 18: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/18.jpg)
So what can we do with it ?
Matching a pattern against a database of strings
![Page 19: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/19.jpg)
Longest common substring (of two strings)
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa.
[Figure: the generalized suffix tree for s1 = abab and s2 = aab]
Find such a node with the largest “string depth”.
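A compact way to see the idea is the sketch below. It assumes an uncompressed generalized suffix trie in which every node records which input strings pass through it, so no terminators are needed; this is a simplification that costs quadratic space, and the function name is hypothetical.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node reached by suffixes of both strings."""
    root = {"kids": {}, "owners": set()}
    for owner, text in enumerate((s1, s2)):
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)  # this path occurs in string `owner`
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["kids"].items():
            if len(child["owners"]) == 2:  # below it are leaves of both strings
                candidate = path + ch
                if len(candidate) > len(best):
                    best = candidate
                stack.append((child, candidate))
    return best


print(longest_common_substring("abab", "aab"))
```

The length of `path` plays the role of the string depth; only nodes shared by both strings are explored further.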
![Page 20: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/20.jpg)
Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.
![Page 21: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/21.jpg)
Why?
The LCA of two leaves represents the longest common prefix (LCP) of the two corresponding suffixes.
[Figure: the generalized suffix tree for abab$ and aab#]
![Page 22: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/22.jpg)
Finding maximal palindromes
• A palindrome: caabaac, cbaabc
• Want to find all maximal palindromes in a string s
Let s = cbaaba
The maximal palindrome with center between i-1 and i is given by the LCP of the suffix at position i of s and the suffix at position m-i+1 of s^r, the reverse of s.
![Page 23: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/23.jpg)
Maximal palindromes algorithm
Prepare a generalized suffix tree for s = cbaaba$ and s^r = abaabc#.
For every i, find the LCA of suffix i of s and suffix m-i+1 of s^r.
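The sketch below illustrates the reduction to common prefixes. It makes two assumptions that differ from the slides: positions are 0-based with n = len(s) and no terminator (so the offset into the reversed string is written `n - i`), and the LCP is computed by direct comparison instead of a constant-time LCA query, which makes it quadratic in the worst case. It only reports palindromes of even length, centered between positions i-1 and i. The function names are hypothetical.

```python
def common_prefix_length(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def maximal_even_palindromes(s):
    """Map each center i (between s[i-1] and s[i]) to its maximal palindrome."""
    n = len(s)
    rev = s[::-1]
    found = {}
    for i in range(1, n):
        # s[i:] reads rightwards from the center; rev[n - i:] reads leftwards
        # from it, starting at s[i-1].
        k = common_prefix_length(s[i:], rev[n - i:])
        if k > 0:
            found[i] = s[i - k:i + k]
    return found


print(maximal_even_palindromes("cbaaba"))
```

With LCA preprocessing on the generalized suffix tree, each call to `common_prefix_length` would be replaced by a constant-time query.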
![Page 24: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/24.jpg)
[Figure: generalized suffix tree for s = cbaaba$ and s^r = abaabc#]
![Page 25: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/25.jpg)
Analysis
O(n) time to identify all palindromes
![Page 26: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/26.jpg)
Can we construct a suffix tree in linear time ?
![Page 27: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/27.jpg)
Ukkonen’s linear time construction
[Figure sequence: the tree is built over the string ACTAATC, one character at a time]
![Page 28: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/28.jpg)
![Page 29: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/29.jpg)
![Page 30: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/30.jpg)
![Page 31: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/31.jpg)
![Page 32: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/32.jpg)
![Page 33: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/33.jpg)
![Page 34: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/34.jpg)
![Page 35: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/35.jpg)
![Page 36: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/36.jpg)
![Page 37: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/37.jpg)
![Page 38: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/38.jpg)
![Page 39: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/39.jpg)
![Page 40: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/40.jpg)
![Page 41: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/41.jpg)
![Page 42: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/42.jpg)
![Page 43: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/43.jpg)
![Page 44: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/44.jpg)
Phases & extensions
• Phase i is when we add character i
• In phase i we have i extensions of suffixes
![Page 45: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/45.jpg)
![Page 46: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/46.jpg)
![Page 47: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/47.jpg)
![Page 48: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/48.jpg)
![Page 49: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/49.jpg)
![Page 50: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/50.jpg)
![Page 51: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/51.jpg)
Extension rules
• Rule 1: The suffix ends at a leaf; you add a character on the edge entering the leaf
• Rule 2: The suffix ends internally and the extended suffix does not exist; you add a leaf and possibly an internal node
• Rule 3: The suffix ends internally and the extended suffix already exists; you do nothing
![Page 52: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/52.jpg)
![Page 53: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/53.jpg)
![Page 54: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/54.jpg)
![Page 55: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/55.jpg)
![Page 56: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/56.jpg)
![Page 57: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/57.jpg)
![Page 58: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/58.jpg)
![Page 59: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/59.jpg)
[Figure: skipping forward, the tree after the string has grown to ACTAATCAC]
![Page 60: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/60.jpg)
[Figure sequence: the following phases extend the string to ACTAATCACT and then to ACTAATCACTG]
![Page 61: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/61.jpg)
![Page 62: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/62.jpg)
![Page 63: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/63.jpg)
![Page 64: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/64.jpg)
![Page 65: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/65.jpg)
![Page 66: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/66.jpg)
![Page 67: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/67.jpg)
![Page 68: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/68.jpg)
![Page 69: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/69.jpg)
![Page 70: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/70.jpg)
![Page 71: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/71.jpg)
![Page 72: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/72.jpg)
![Page 73: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/73.jpg)
![Page 74: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/74.jpg)
![Page 75: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/75.jpg)
![Page 76: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/76.jpg)
![Page 77: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/77.jpg)
![Page 78: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/78.jpg)
![Page 79: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/79.jpg)
Observations
At the first extension we must end at a leaf, because no longer suffix exists (rule 1).
At the second extension we are still most likely to end at a leaf.
We will not end at a leaf only if the second suffix is a prefix of the first.
![Page 80: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/80.jpg)
Say at some extension we do not end at a leaf.
Then this suffix is a prefix of some other suffix (or suffixes), and we will not end at a leaf in subsequent extensions either.
Is there a way to continue using the i-th character? (Is the suffix a prefix of a suffix whose next character is the i-th character?)
If yes, rule 3 applies; if no, rule 2 applies.
![Page 81: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/81.jpg)
If we apply rule 3, then we will apply rule 3 in all subsequent extensions.
Otherwise we keep applying rule 2 until, in some subsequent extension, we apply rule 3.
![Page 82: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/82.jpg)
In terms of the rules that we apply a phase looks like:
1 1 1 1 1 1 1 2 2 2 2 3 3 3 3
We have nothing to do when applying rule 3, so once rule 3 happens we can stop
We don’t really do anything significant when we apply rule 1 (the structure of the tree does not change)
![Page 83: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/83.jpg)
Representation
• We do not really store a substring with each edge, but rather pointers to the starting position and ending position of the substring in the text
• With this representation we do not really have to do anything when rule 1 applies
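One way to picture this representation is sketched below. It assumes that every leaf edge shares a single mutable end marker, a common implementation device that the slides do not spell out; the class names `End` and `Edge` are hypothetical.

```python
class End:
    """Shared, mutable end position used by every leaf edge."""

    def __init__(self):
        self.value = 0


class Edge:
    def __init__(self, start, end):
        self.start = start  # index of the first character of the label
        self.end = end      # int for internal edges, shared End for leaf edges

    def label(self, text):
        stop = self.end.value if isinstance(self.end, End) else self.end
        return text[self.start:stop]


text = "ACTAATC"
leaf_end = End()
# Leaf edges for the suffixes starting at positions 0, 1 and 2.
edges = [Edge(0, leaf_end), Edge(1, leaf_end), Edge(2, leaf_end)]

leaf_end.value = 3  # after the phase that adds the third character
labels_phase3 = [edge.label(text) for edge in edges]

leaf_end.value = 4  # one assignment performs every rule 1 extension at once
labels_phase4 = [edge.label(text) for edge in edges]

print(labels_phase3, labels_phase4)
```

Advancing `leaf_end.value` once per phase lengthens all leaf edges together, which is why rule 1 costs nothing per leaf.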
![Page 84: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/84.jpg)
How do phases relate to each other
Suppose a phase looks like:
1 1 1 1 1 1 1 2 2 2 2 3 3 3 3
Then in the next phase we must have:
1 1 1 1 1 1 1 1 1 1 1 2/3
So we start the phase with the extension that was the first where we applied rule 3 in the previous phase.
![Page 85: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/85.jpg)
Suffix Links
[Figure sequence: suffix links drawn on the tree for ACTAATCACTG]
![Page 86: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/86.jpg)
![Page 87: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/87.jpg)
![Page 88: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/88.jpg)
![Page 89: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/89.jpg)
![Page 90: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/90.jpg)
Suffix Links
• A suffix link leads from an internal node that corresponds to the string aβ to the internal node that corresponds to β (if there is such a node), where a is a single character and β is a string
![Page 91: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/91.jpg)
• Is there such a node?
Suppose we create v, the node corresponding to aβ, by applying rule 2. Then there was a suffix aβx… and now we add aβy.
So there was a suffix βx…
![Page 92: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/92.jpg)
If there was also a suffix βz…, then a node corresponding to β is already there.
![Page 93: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/93.jpg)
Otherwise it will be created in the next extension, when we add βy.
![Page 94: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/94.jpg)
Invariant: all suffix links are there, except (possibly) that of the last internal node added.
You are at the (internal) node corresponding to the last extension.
You start a phase at the last internal node of the first extension in which you applied rule 3 in the previous phase.
Remember: we apply rule 2.
![Page 95: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/95.jpg)
1) Go up one node (if needed) to find a suffix link.
2) Traverse the suffix link.
3) If you went up in step 1 along an edge labeled δ, then go down consuming the string δ.
![Page 96: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/96.jpg)
Create the new internal node if necessary
![Page 97: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/97.jpg)
![Page 98: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/98.jpg)
![Page 99: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/99.jpg)
Create the new internal node if necessary, add the suffix, and install a suffix link if necessary.
![Page 100: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/100.jpg)
Analysis
Handling all extensions of rule 1 and all extensions of rule 3 takes O(1) time per phase, so O(n) in total.
How many times do we carry out rule 2 in all phases? O(n).
Does each application of rule 2 take constant time? No! Going up and traversing the suffix link takes constant time, but then we go down, possibly over many edges.
![Page 101: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/101.jpg)
![Page 102: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/102.jpg)
So why is it a linear time algorithm ?
How much can the depth change when we traverse a suffix link ?
It can decrease by at most 1
![Page 103: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/103.jpg)
Punch line
Each time we go up or traverse a suffix link, the depth decreases by at most 1.
When starting, the depth is 0; the final depth is at most n.
So during all applications of rule 2 together we cannot go down more than 3n times.
THM: The running time of Ukkonen’s algorithm is O(n)
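For readers who want to experiment, here is a compact sketch of an online construction in the spirit of the slides. It is one common array-based formulation, not necessarily the exact bookkeeping the slides use: it keeps an active node plus a count `pos` of pending characters, marks open leaf edges with an infinite length, and uses dictionaries for children. All names are assumptions.

```python
INF = float("inf")


class SuffixTree:
    def __init__(self, text):
        self.s = []
        self.to = [{}]       # children per node, keyed by first edge letter
        self.fpos = [0]      # start index of the edge entering each node
        self.length = [INF]  # length of that edge; INF marks an open leaf edge
        self.link = [0]      # suffix links
        self.node = 0        # active node
        self.pos = 0         # characters still waiting to be inserted
        for ch in text:
            self._add(ch)

    def _make_node(self, start, length):
        self.to.append({})
        self.fpos.append(start)
        self.length.append(length)
        self.link.append(0)
        return len(self.to) - 1

    def _add(self, c):
        s = self.s
        s.append(c)
        n = len(s)
        self.pos += 1
        last = 0  # internal node still waiting for its suffix link
        while self.pos > 0:
            while True:  # go down, skipping whole edges
                v = self.to[self.node].get(s[n - self.pos], 0)
                if v == 0 or self.pos <= self.length[v]:
                    break
                self.node = v
                self.pos -= self.length[v]
            edge = s[n - self.pos]
            v = self.to[self.node].get(edge, 0)
            if v == 0:  # rule 2 without a split: new leaf at the active node
                self.to[self.node][edge] = self._make_node(n - self.pos, INF)
                self.link[last] = self.node
                last = 0
            else:
                t = s[self.fpos[v] + self.pos - 1]
                if t == c:  # rule 3: already present, the phase ends
                    self.link[last] = self.node
                    return
                # rule 2 with a split: new internal node u plus a new leaf
                u = self._make_node(self.fpos[v], self.pos - 1)
                self.to[u][c] = self._make_node(n - 1, INF)
                self.to[u][t] = v
                self.fpos[v] += self.pos - 1
                self.length[v] -= self.pos - 1
                self.to[self.node][edge] = u
                self.link[last] = u
                last = u
            if self.node == 0:
                self.pos -= 1
            else:
                self.node = self.link[self.node]

    def edge_label(self, v):
        start = self.fpos[v]
        stop = len(self.s) if self.length[v] == INF else start + self.length[v]
        return "".join(self.s[start:stop])

    def contains(self, pattern):
        node, i = 0, 0
        while i < len(pattern):
            v = self.to[node].get(pattern[i], 0)
            if v == 0:
                return False
            label = self.edge_label(v)
            chunk = pattern[i:i + len(label)]
            if label[:len(chunk)] != chunk:
                return False
            i += len(chunk)
            node = v
        return True

    def count_distinct_substrings(self):
        n = len(self.s)
        return sum((n - self.fpos[v]) if self.length[v] == INF else self.length[v]
                   for v in range(1, len(self.to)))


tree = SuffixTree("ACTAATCACTG")
print(tree.contains("TCAC"), tree.contains("TCAG"))
```

The total length of all edge labels equals the number of distinct non-empty substrings, which gives a convenient sanity check against a brute-force enumeration.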
![Page 104: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/104.jpg)
Another application
• Suppose we have a pattern P and a text T and we want to find for each position of T the length of the longest substring of P that matches there.
• How would you do that in O(n) time ?
![Page 105: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/105.jpg)
![Page 106: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/106.jpg)
Drawbacks of suffix trees
• Suffix trees consume a lot of space
• The space is O(n), but the constant is quite big
• Notice that if we indeed want to traverse an edge in O(1) time, then we need an array of pointers of size |Σ| in each node
![Page 107: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/107.jpg)
Suffix array
• We lose some of the functionality but we save space.
Let s = abab.
Sort the suffixes lexicographically: ab, abab, b, bab
The suffix array gives the indices of the suffixes in sorted order
3 1 4 2
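The example can be reproduced with a direct sort. This is a deliberately naive sketch that sorts the suffixes themselves, so it is far from linear time; the function name `suffix_array` is an assumption, and positions are 1-based to match the slide.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s in sorted order."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


print(suffix_array("abab"))
```

The array stores only n integers, which is where the space saving over the tree comes from.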
![Page 108: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/108.jpg)
How do we build it ?
• Build a suffix tree
• Traverse the tree in DFS, picking the edges outgoing from each node in lexicographic order, and fill the suffix array.
• O(n) time
![Page 109: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/109.jpg)
How do we search for a pattern ?
• If P occurs in T then all its occurrences are consecutive in the suffix array.
• Do a binary search on the suffix array
• Takes O(m log n) time
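A plain binary search over the suffix array might look like the sketch below. It assumes 1-based positions as in the earlier example, builds the array by naive sorting, and compares the pattern with the first m characters of each suffix; the function names are hypothetical.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s in sorted order."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


def find_occurrences(text, sa, pattern):
    """Sorted 1-based positions of pattern, found with two binary searches."""
    m = len(pattern)

    def prefix(rank):
        start = sa[rank] - 1
        return text[start:start + m]  # first m characters of that suffix

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])  # the occurrences are consecutive in sa


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "issi"), find_occurrences(text, sa, "issa"))
```

Each of the O(log n) steps compares at most m characters, which gives the O(m log n) bound.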
![Page 110: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/110.jpg)
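Example
Let S = mississippi. The suffix array, with each suffix next to its starting position:
11 i
8 ippi
5 issippi
2 ississippi
1 mississippi
10 pi
9 ppi
7 sippi
4 sissippi
6 ssippi
3 ssissippi
Let P = issa. The binary search keeps a left boundary L, a right boundary R, and a midpoint M.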
![Page 111: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/111.jpg)
How do we accelerate the search ?
Maintain l = LCP(P, L).
Maintain r = LCP(P, R).
If l = r then start comparing M to P at l + 1.
![Page 112: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/112.jpg)
How do we accelerate the search ?
If l > r then:
Suppose we know LCP(L, M).
If LCP(L, M) < l we go left.
If LCP(L, M) > l we go right.
If LCP(L, M) = l we start comparing at l + 1.
![Page 113: Suffix trees](https://reader035.fdocuments.in/reader035/viewer/2022062519/5681522b550346895dc07419/html5/thumbnails/113.jpg)
Analysis of the acceleration
If we do more than a single comparison in an iteration, then max(l, r) grows by 1 for each comparison, so the search takes O(log n + m) time.
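The sketch below shows only the simpler half of the acceleration: it maintains l and r and starts every comparison at min(l, r), without the precomputed LCP(L, M) values. This variant is usually fast in practice but does not by itself guarantee the O(log n + m) bound. It assumes a 0-based suffix array, and the names `compare` and `contains` are hypothetical.

```python
def compare(pattern, text, pos, start):
    """Compare pattern with the suffix text[pos:], skipping `start` known matches.

    Returns (k, order): k is the number of matching characters; order is 0 if
    the pattern is a prefix of the suffix, -1 if it sorts before the suffix,
    and 1 if it sorts after it.
    """
    k = start
    while k < len(pattern) and pos + k < len(text) and pattern[k] == text[pos + k]:
        k += 1
    if k == len(pattern):
        return k, 0
    if pos + k == len(text):
        return k, 1  # the suffix is a proper prefix of the pattern
    return k, (-1 if pattern[k] < text[pos + k] else 1)


def contains(text, sa, pattern):
    """Decide whether pattern occurs in text, given a 0-based suffix array."""
    if not sa:
        return pattern == ""
    lo, hi = 0, len(sa) - 1
    l, order = compare(pattern, text, sa[lo], 0)
    if order == 0:
        return True
    if order < 0:
        return False  # pattern sorts before the smallest suffix
    r, order = compare(pattern, text, sa[hi], 0)
    if order == 0:
        return True
    if order > 0:
        return False  # pattern sorts after the largest suffix
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Every suffix between L and R shares min(l, r) characters with P.
        k, order = compare(pattern, text, sa[mid], min(l, r))
        if order == 0:
            return True
        if order < 0:
            hi, r = mid, k
        else:
            lo, l = mid, k
    return False


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(contains(text, sa, "issi"), contains(text, sa, "issa"))
```

Adding the LCP(L, M) case analysis from the slides removes the remaining repeated comparisons.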