Suffix Trees


Page 1: Suffix Trees

Suffix Trees

Suffix trees

• Linearized suffix trees

• Virtual suffix trees

Suffix arrays

• Enhanced suffix arrays

• Suffix cactus, suffix vectors, …

Page 2: Suffix Trees

Suffix Trees

• String … any sequence of characters.

• Substring of string S … string composed of characters i through j, i <= j of S. S = cater => ate is a substring. car is not a substring. Empty string is a substring of S.

Page 3: Suffix Trees

Subsequence

• Subsequence of string S … string composed of characters i1 < i2 < … < ik of S. S = cater => ate is a subsequence. car is a subsequence. The empty string is a subsequence.
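The two definitions can be checked directly on the cater example. This is a minimal sketch; the function names are illustrative and do not come from the slides.

```python
def is_substring(s, t):
    # t must occupy consecutive positions i..j of s; the empty string qualifies.
    return t in s

def is_subsequence(s, t):
    # t must appear at increasing positions i1 < i2 < ... < ik of s.
    remaining = iter(s)
    # "c in remaining" consumes the iterator up to and including the match,
    # so later characters of t can only match later positions of s.
    return all(c in remaining for c in t)

print(is_substring("cater", "ate"), is_substring("cater", "car"))
print(is_subsequence("cater", "ate"), is_subsequence("cater", "car"))
```

Every substring is also a subsequence, but not the other way around, as car shows.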

Page 4: Suffix Trees

String/Pattern Matching

• You are given a source string S.

• Answer queries of the form: is the string pi a substring of S?

• Knuth-Morris-Pratt (KMP) string matching. O(|S| + |pi|) time per query. O(n|S| + Σi |pi|) time for n queries.

• Suffix tree solution. O(|S| + Σi |pi|) time for n queries.
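For reference, a minimal KMP sketch showing where the O(|S| + |pi|) per-query cost comes from: the failure table is built from the pattern in O(|pi|), and the scan over S never moves backward. Indices are 0-based here and the function names are illustrative assumptions.

```python
def failure_table(p):
    # fail[i] = length of the longest proper prefix of p[:i+1]
    # that is also a suffix of p[:i+1].
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_find(s, p):
    # Returns the 0-based position of the first occurrence of p in s, or -1.
    if not p:
        return 0
    fail = failure_table(p)
    k = 0  # number of pattern characters currently matched
    for i, c in enumerate(s):
        while k > 0 and c != p[k]:
            k = fail[k - 1]
        if c == p[k]:
            k += 1
        if k == len(p):
            return i - len(p) + 1
    return -1

print(kmp_find("cater", "ate"), kmp_find("cater", "car"))
```

Because the table depends only on the pattern, every new query pays for a full scan of S, which is the n|S| term in the bound for n queries.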

Page 5: Suffix Trees

String/Pattern Matching

• KMP preprocesses the query string pi, whereas the suffix tree method preprocesses the source string S.

• An application of string matching. Genome project. Databank of strings (gene sequences). Character set is ATGC. Determine if a “new” sequence is a substring of a databank sequence.

Page 6: Suffix Trees

Definition Of Suffix Tree

• Compressed trie with edge information.

• Keys are the nonempty suffixes of a given string S.

• Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r.

Page 7: Suffix Trees

String Matching & Suffixes

• pi is a substring of S iff pi is a prefix of some suffix of S.

• Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r.

• Which of these are substrings of S? leep, eepe, pe, leap, peel
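The prefix-of-a-suffix test can be tried on the five candidates. This brute-force sketch takes quadratic time and only illustrates the property that a suffix tree exploits; the helper names are assumptions.

```python
def suffixes(s):
    # All nonempty suffixes of s, longest first.
    return [s[i:] for i in range(len(s))]

def is_prefix_of_some_suffix(s, p):
    return any(suffix.startswith(p) for suffix in suffixes(s))

for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_prefix_of_some_suffix("sleeper", p))
```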

Page 8: Suffix Trees

Last Character Of S Repeats

• When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix.

• S = creeper. Suffixes: creeper, reeper, eeper, eper, per, er, r. Here r is a proper prefix of reeper.

• When the last character of S appears more than once in S, use an end of string character # to overcome this problem.

• S = creeper#. Suffixes: creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
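A short check of the effect of the end marker, assuming # does not occur elsewhere in S. The function name is illustrative.

```python
def proper_prefix_pairs(s):
    # Pairs (a, b) of suffixes of s where a is a proper prefix of b.
    sufs = [s[i:] for i in range(len(s))]
    return [(a, b) for a in sufs for b in sufs if a != b and b.startswith(a)]

print(proper_prefix_pairs("creeper"))
print(proper_prefix_pairs("creeper#"))
```

With the marker appended, no suffix is a prefix of another, so every suffix ends at its own element node.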

Page 9: Suffix Trees

Suffix Tree For S = abbbabbbb#

[Figure: suffix tree for S = abbbabbbb#. The edges leaving the root are labeled abbb, b, and #; the branch nodes are labeled 1 through 5.]

Page 10: Suffix Trees

Suffix Tree For S = abbbabbbb#

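[Figure: the same suffix tree, with the characters of S = abbbabbbb# indexed 1 through 10 and each element node labeled with the start position of its suffix. The branch nodes are labeled 1 through 5.]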

Page 11: Suffix Trees

Suffix Tree For S = abbbabbbb#

[Figure: the same suffix tree for S = abbbabbbb# (positions 1 through 10), with additional numeric fields shown in the branch nodes.]

Page 12: Suffix Trees

Suffix Tree Construction

• See Web write up for algorithm.

• Time complexity, with |S| = n and alphabet size = r: O(nr) using array nodes. This is O(n) for r a constant (or r <= c). O(n) expected time using a hash table. O(n) time algorithm for large r in reference cited in Web write up.
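The construction algorithm itself is in the Web write up and is not reproduced here. As a stand-in, the sketch below builds the same compressed trie naively by inserting the suffixes one at a time, which takes O(n^2) time in the worst case rather than the bounds quoted above. The Node layout and all names are assumptions made for illustration; children are kept in a dict, loosely corresponding to the hash table option.

```python
class Node:
    def __init__(self, start, end, suffix=None):
        self.start, self.end = start, end  # edge into this node is s[start:end]
        self.children = {}                 # first character of edge -> child
        self.suffix = suffix               # 1-based suffix start (element nodes only)

def build_suffix_tree(s):
    assert s.count(s[-1]) == 1, "last character must be a unique end marker"
    root = Node(0, 0)
    for i in range(len(s)):                # insert suffix s[i:]
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:              # no edge to follow: new element node
                node.children[s[j]] = Node(j, len(s), i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k, j = k + 1, j + 1
            if k == child.end:             # whole edge matched: descend
                node = child
                continue
            mid = Node(child.start, k)     # mismatch inside the edge: split it
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, len(s), i + 1)
            break
    return root

def walk(node):
    # Yields every node of the tree, parents before children.
    yield node
    for c in sorted(node.children):
        yield from walk(node.children[c])

s = "abbbabbbb#"
root = build_suffix_tree(s)
root_edges = sorted(s[c.start:c.end] for c in root.children.values())
nodes = list(walk(root))
branch_count = sum(1 for n in nodes if n.children)
element_count = sum(1 for n in nodes if n.suffix is not None)
print(root_edges, branch_count, element_count)
```

Storing each edge as a pair of indices into S, instead of a copy of the label, is what keeps the tree at O(n) space.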

Page 13: Suffix Trees

Suffix Array

• Array that contains the start position of suffixes in lexicographic order.

• S = abbbabbbb#. Assume # < a < b. Then # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#. SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]

• LCP = length of longest common prefix between adjacent entries of SA. LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
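The two arrays can be reproduced with a naive sketch that sorts the suffixes directly. This is not a linear time construction; positions are 1-based to match the slide, and ASCII order happens to give # < a < b. The function names are illustrative.

```python
def suffix_array(s):
    # 1-based start positions of the suffixes, in lexicographic order.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def lcp_array(s, sa):
    def lcp(i, j):
        # Length of the longest common prefix of the suffixes starting at i and j.
        k = 0
        while i + k <= len(s) and j + k <= len(s) and s[i + k - 1] == s[j + k - 1]:
            k += 1
        return k
    # One entry per adjacent pair; the slide's trailing "-" has no counterpart.
    return [lcp(a, b) for a, b in zip(sa, sa[1:])]

s = "abbbabbbb#"
sa = suffix_array(s)
print(sa)
print(lcp_array(s, sa))
```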

Page 14: Suffix Trees

Suffix Array

• Less space than suffix tree

• Linear time construction

• Can be used to solve several of the problems solved by a suffix tree with same asymptotic complexity. Substring matching: binary search for p using SA. O(|p| log |S|).
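A sketch of the binary search, assuming a 0-based suffix array built naively by sorting. Each of the O(log |S|) steps compares at most |p| characters, which gives the O(|p| log |S|) bound. The function name is illustrative.

```python
def sa_contains(s, sa, p):
    # Find the first suffix whose first len(p) characters are >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    # p is a substring iff that suffix actually starts with p.
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(p)] == p

s = "abbbabbbb#"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive construction, 0-based
for p in ["babb", "abbba", "baba"]:
    print(p, sa_contains(s, sa, p))
```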

Page 15: Suffix Trees

O(|pi|) Time Substring Matching

[Figure: suffix tree for S = abbbabbbb# (positions 1 through 10, element nodes labeled with suffix start positions), used to trace searches for the example patterns babb, abbba, and baba.]

Page 16: Suffix Trees

Find All Occurrences Of pi

• Search suffix tree for pi.

• Suppose the search for pi is successful.

• When search terminates at an element node, pi appears exactly once in the source string S.

Page 17: Suffix Trees

Search Terminates At Element Node

[Figure: suffix tree for S = abbbabbbb# (positions 1 through 10, element nodes labeled with suffix start positions). Example pattern: abbbb#.]

Page 18: Suffix Trees

Search Terminates At Branch Node

• When the search for pi terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of pi.

Page 19: Suffix Trees

Search Terminates At Branch Node

[Figure: suffix tree for S = abbbabbbb# (positions 1 through 10, element nodes labeled with suffix start positions). Example pattern: ab.]

Page 20: Suffix Trees

Find All Occurrences Of pi

• To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment suffix tree: Link all element nodes into a chain in inorder. Each branch node keeps a pointer to the left most and right most element node in its subtree.
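One way to sketch the augmentation: keep the chain as a Python list of element nodes in left to right order, and store in every node the indices of its left most and right most element node. The tree is built with a naive quadratic-time insertion loop, not the algorithm from the Web write up, and the node fields and function names are assumptions for illustration.

```python
class Node:
    def __init__(self, start, end, suffix=None):
        self.start, self.end = start, end  # edge into this node is s[start:end]
        self.children = {}                 # first character of edge -> child
        self.suffix = suffix               # 1-based suffix start (element nodes only)

def build_suffix_tree(s):
    # Naive insertion of every suffix; s must end with a unique marker.
    assert s.count(s[-1]) == 1
    root = Node(0, 0)
    for i in range(len(s)):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:
                node.children[s[j]] = Node(j, len(s), i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k, j = k + 1, j + 1
            if k == child.end:
                node = child
                continue
            mid = Node(child.start, k)     # split the edge at the mismatch
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, len(s), i + 1)
            break
    return root

def augment(root):
    chain = []                             # element nodes, left to right
    def visit(node):
        if node.suffix is not None:
            node.first = node.last = len(chain)
            chain.append(node.suffix)
            return
        kids = [node.children[c] for c in sorted(node.children)]
        for kid in kids:
            visit(kid)
        node.first, node.last = kids[0].first, kids[-1].last
    visit(root)
    return chain

def find_all(root, chain, s, p):
    node, j = root, 0
    while j < len(p):                      # O(|p|) descent
        child = node.children.get(p[j])
        if child is None:
            return []
        for k in range(child.start, child.end):
            if j == len(p):
                break
            if s[k] != p[j]:
                return []
            j += 1
        node = child
    return chain[node.first:node.last + 1]  # one entry per occurrence

s = "abbbabbbb#"
root = build_suffix_tree(s)
chain = augment(root)
print(find_all(root, chain, s, "ab"), find_all(root, chain, s, "abbbb#"))
```

Because children are visited in sorted order, the chain here lists the suffixes in lexicographic order, so it coincides with the suffix array of S.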

Page 21: Suffix Trees

Augmented Suffix Tree

[Figure: augmented suffix tree for S = abbbabbbb# (positions 1 through 10, element nodes labeled with suffix start positions). Example pattern: b.]

Page 22: Suffix Trees

Longest Repeating Substring

• Find longest substring of S that occurs at least m > 1 times in S.

• Label branch nodes with number of element nodes in subtree.

• Find branch node with label >= m and max char# field.
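The same question can also be answered without the tree, using the suffix array and LCP array from the Suffix Array slide. That is a swapped-in technique, not the node-labeling method described above: a substring with at least m occurrences (reading the requirement as label >= m) corresponds to m consecutive SA entries, and its length is the minimum of the m - 1 LCP values between them. The sketch builds both arrays naively; the function name is illustrative.

```python
def longest_repeated(s, m):
    # Longest substring of s with at least m (m >= 2) occurrences, "" if none.
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])   # naive, 0-based suffix array

    def lcp(i, j):
        k = 0
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        return k

    lcps = [lcp(sa[r], sa[r + 1]) for r in range(n - 1)]
    best = ""
    for r in range(n - m + 1):
        # m consecutive suffixes share a prefix of this length.
        length = min(lcps[r:r + m - 1])
        if length > len(best):
            best = s[sa[r]:sa[r] + length]
    return best

print(longest_repeated("abbbabbbb#", 2), longest_repeated("abbbabbbb#", 5))
```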

Page 23: Suffix Trees

Longest Repeating Substring

[Figure: suffix tree for S = abbbabbbb# with branch nodes labeled by the number of element nodes in their subtrees (2, 3, 5, 7, and 10). The answers for m = 2 and m = 5 are marked.]

Page 24: Suffix Trees

Longest Common Substring

• Given two strings S and T.

• Find the longest common substring.

• S = carport, T = airports Longest common substring = rport Longest common subsequence = arport

• Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming.

• Longest common substring may be found in O(|S|+|T|) time using a suffix tree.

Page 25: Suffix Trees

Longest Common Substring

• Let $ be a new symbol.

• Construct the suffix tree for the string U = S$T#. U = carport$airports#. No repeating substring includes $. Find longest repeating substring that is both to left and right of $.

• Find branch node that has max char# and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
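The S$T# idea can be tried with a suffix array of U in place of the suffix tree. This is a substitution made to keep the sketch short, and the naive sort used here does not achieve the O(|S|+|T|) bound. The longest common substring shows up as the largest common prefix between two adjacent suffixes that begin on opposite sides of $. The sketch assumes $ and # occur in neither input; the function name is illustrative.

```python
def longest_common_substring(s, t):
    assert not set("$#") & set(s + t), "$ and # must be new symbols"
    u = s + "$" + t + "#"
    n, split = len(u), len(s)                    # u[split] is the $ separator
    sa = sorted(range(n), key=lambda i: u[i:])   # naive suffix array of U
    best = ""
    for a, b in zip(sa, sa[1:]):
        if (a < split) == (b < split):
            continue                             # both suffixes begin on the same side of $
        k = 0
        while a + k < n and b + k < n and u[a + k] == u[b + k]:
            k += 1
        if k > len(best):
            best = u[a:a + k]
    return best

print(longest_common_substring("carport", "airports"))
```

Since $ occurs only once in U, a common prefix of two different suffixes can never contain it, so the result always lies entirely within S and within T.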