Suffix Tree Applications
guruprasad-sridharan -
Transcript of Suffix Tree Applications
8/3/2019 Suffix Tree Applications
http://slidepdf.com/reader/full/suffix-tree-applications 1/48
Applications
• Exact string and substring matching
• Longest common substrings
• Finding and representing repeated substrings efficiently
• Applications that lead to alternative, space-efficient implementations
  – Matching statistics
  – Suffix arrays
String and substrings
• Exact string matching:
  – Input
    • Pattern P of length n
    • Text T of length m
  – Output
    • Positions of all occurrences of P in T
• Solution method
  – Preprocess to create suffix tree for T
    • O(m) time, O(m) space
  – Maximally match P in suffix tree
  – Output all leaf positions below match point
    • O(n+k) time where k is number of matches
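The match-then-collect-leaves method above can be sketched in Python. This is only an illustration: it assumes a naive, quadratic-size suffix trie in place of a linear-time suffix tree, and the names `build_suffix_trie` and `find_occurrences`, as well as the `$` terminator, are hypothetical choices not taken from the slides.

```python
def build_suffix_trie(text):
    """Naive suffix trie (stand-in for a suffix tree): dict char -> child.

    The key None at a leaf stores the 1-based start of that suffix.
    Assumes the terminator "$" does not occur in text.
    """
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root


def find_occurrences(root, pattern):
    """Maximally match pattern from the root, then report all leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []          # pattern falls off the tree: no occurrence
        node = node[ch]
    positions, stack = [], [node]
    while stack:               # collect every leaf position below the match point
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("mississippi")
print(find_occurrences(trie, "issi"))
```

Once the structure for T is built, the same lookup can be repeated for each pattern of a set, which is the exact set matching setting.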
String and substrings
• Exact set matching:
  – Input
    • Set of patterns {Pi} of total length n
    • Text T of length m
  – Output
    • Positions of all occurrences of each pattern Pi in T
• Solution method
  – Preprocess to create suffix tree for T
    • O(m) time, O(m) space
  – Maximally match each Pi in suffix tree
  – Output all leaf positions below match point
    • O(n+k) time where k is the total number of matches
Comparison with Aho-Corasick
• Aho-Corasick
  – O(n) preprocess time and space
    • to build keyword tree of set of patterns P
  – O(m+k) search time
• Suffix tree approach
  – O(m) preprocess time and space
    • to build suffix tree of T
  – O(n+k) search time
  – Using matching statistics (to be defined), can make this tradeoff similar to that of Aho-Corasick
String and substrings
• Substring problem:
  – Input
    • Set of patterns {Pi} of total length n
    • Text T of length m (m < n now)
  – Output
    • Positions of all occurrences of T in each pattern Pi
• Solution method
  – Preprocess to create generalized suffix tree for {Pi}
    • O(n) time, O(n) space
  – Maximally match T in generalized suffix tree
  – Output all leaf positions below match point
    • O(m+k) time where k is the total number of matches
Common Substrings
• Longest common substring problem:
  – Input
    • Strings S and T
  – Output
    • Longest common substring of S and T (and its position in S and T)
• Solution method
  – Preprocess to create generalized suffix tree for {S,T}
  – Mark each node by whether its subtree contains a leaf of S, of T, or of both
    • A simple postorder tree traversal does this
  – Path label of the node marked "both" with greatest string depth is the longest common substring of S and T
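The marking idea can be sketched as follows. This is a minimal illustration that assumes a naive generalized suffix trie instead of a linear-time generalized suffix tree, and it sets the marks during insertion, which yields the same marks as a postorder pass. The function name and the bit encoding (1 for S, 2 for T) are hypothetical.

```python
def longest_common_substring(s, t):
    """Deepest node of a naive generalized suffix trie that is marked by both strings."""
    root = {"mark": 0, "kids": {}}
    for bit, string in ((1, s), (2, t)):
        for start in range(len(string)):
            node = root
            for ch in string[start:]:
                node = node["kids"].setdefault(ch, {"mark": 0, "kids": {}})
                node["mark"] |= bit      # this node lies above a leaf of `string`
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["kids"].items():
            if child["mark"] == 3:       # subtree holds leaves of both S and T
                if len(label) + 1 > len(best):
                    best = label + ch
                stack.append((child, label + ch))
    return best


print(longest_common_substring("sanddollar", "sandlot"))
```

Only nodes marked "both" are expanded, since every prefix of a common substring is itself common.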
Common Substrings
• Common substrings of length k problem:
  – Input
    • Strings S and T
    • Integer k
  – Output
    • All common substrings of S and T (and their positions in S and T) of length at least k
• Solution method
  – Same as previous problem
  – Look for all nodes marked with both leaf labels whose string depth is at least k
Longest Common Substrings of >2 Strings
• Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings
• Example: {sanddollar, sandlot, handler, grand, pantry}

    j    l(j)   one string
    2    4      sand
    3    3      and
    4    3      and
    5    2      an
Problem definition and solution
• Longest common substrings of >2 strings:
  – Input
    • Strings S1, …, SK (total length n)
  – Output
    • l(j) (and pointers to substrings) for 2 <= j <= K
• Solution
  – Build a generalized suffix tree for the K strings
    • Each string has a unique end character, so each leaf shows up only once
Solution continued
– Build a generalized suffix tree for the K strings
  • Each string has a unique end character, so each leaf shows up only once
– C(v): number of distinct leaf labels in subtree rooted at node v
– Given C(v) values and string-depth values, do a simple traversal of the tree to find these K-1 values and pointers to locations in substrings
– Computing C(v) efficiently
  • The number of leaves is not correct, as some leaves may have the same label
  • Use a length-K bit vector, 1 bit per string in the set
  • OR your way up the tree
    – Each OR operation takes O(K) time, which gives O(Kn) running time
  • Can be improved to O(n) later
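The bit-vector computation of C(v) and the resulting l(j) values can be sketched in Python. This is an illustration under stated assumptions: a naive generalized suffix trie replaces the suffix tree, a Python integer plays the role of the length-K bit vector, and bits are set while inserting suffixes, which gives the same vectors as ORing up the tree. The name `l_values` is hypothetical.

```python
def l_values(strings):
    """Return {j: l(j)} for 2 <= j <= K using one bit per string at every trie node."""
    K = len(strings)
    root = {"bits": 0, "kids": {}}
    for i, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"bits": 0, "kids": {}})
                node["bits"] |= 1 << i          # string i has a leaf below this node
    # best[c] = greatest string depth among nodes with C(v) == c
    best = [0] * (K + 1)
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        c = bin(node["bits"]).count("1")        # C(v): number of distinct strings
        best[c] = max(best[c], depth)
        for child in node["kids"].values():
            stack.append((child, depth + 1))
    # l(j) asks for "at least j" strings, so take a running maximum from K down to 2
    result, running = {}, 0
    for j in range(K, 1, -1):
        running = max(running, best[j])
        result[j] = running
    return result


print(l_values(["sanddollar", "sandlot", "handler", "grand", "pantry"]))
```

On the five example strings this reproduces the l(j) table given earlier.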
Repeated substrings
• Given a single string S
• Definitions
  – A maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different from the character to the immediate left (right) of b.
    • Add unique characters to the front and end of S to include prefixes and suffixes.
  – Representation: (p1, p2, n’)
    • Starting positions and length of the maximal pair
  – R(S) is the set of all triples representing maximal pairs in S
Example
• S = xabcyiiizabcqabcyrxar
      123456789012345678901
  – (2, 10, 3) is a maximal pair
  – (10, 14, 3) is a maximal pair
  – (2, 14, 3) is not a maximal pair
    • (2, 14, 4) is a maximal pair
  – Note positions 2 and 14 are the start positions of two distinct maximal pairs
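The example triples can be checked with a brute-force enumeration of maximal pairs. This is only a reference sketch, quadratic or worse in |S|, and not the suffix-tree method. The function name and the two sentinel characters are assumptions, and the sentinels are assumed not to occur in S.

```python
def maximal_pairs(s):
    """All triples (p1, p2, length) with 1-based starts p1 < p2 that form a maximal pair."""
    t = "\x00" + s + "\x01"        # distinct sentinels in front of and behind S
    n = len(s)
    pairs = set()
    for p1 in range(1, n + 1):
        for p2 in range(p1 + 1, n + 1):
            if t[p1 - 1] == t[p2 - 1]:
                continue           # same left character: could be extended to the left
            k = 0
            # extend to the right until the first mismatch (the end sentinel forces one)
            while p2 + k < len(t) and t[p1 + k] == t[p2 + k]:
                k += 1
            if k > 0:
                pairs.add((p1, p2, k))
    return pairs


pairs = maximal_pairs("xabcyiiizabcqabcyrxar")
print((2, 10, 3) in pairs, (2, 14, 3) in pairs, (2, 14, 4) in pairs)
```

Because each pair is extended to the right as far as possible, right-maximality holds by construction, and only the left characters need an explicit test.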
More definitions
• A maximal repeat a is a substring of S that is the substring defined by a maximal pair of S
• R’(S) is the set of maximal repeats
• Previous example
  – abc and abcy are maximal repeats of S
  – abc is represented only once
  – |R’(S)| is smaller than |R(S)|, as abc shows up twice in R(S) but only once in R’(S)
Even more definitions
• A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S
• Previous example
  – abcy is a supermaximal repeat of S
  – abc is NOT a supermaximal repeat of S
Problem definition
• Maximal repeats
– Input
• String S (length n)
– Output
• R’(S)
Properties of maximal repeats
• Construct suffix tree for S
• Lemma
  – If a is a maximal repeat in S, then a is the path-label of an internal node v in the suffix tree T
    • a does not end in the middle of an edge
    • (captures that the next character after a is distinct)
• Corollary
  – There are at most n maximal repeats
    • n leaves
    • all internal nodes except the root have at least two children
    • therefore, at most n internal nodes
More properties of maximal repeats
• Definitions
  – Character S(i-1) is the left character of position i
  – The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
  – A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters
• Theorem
  – String a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
    • Captures that the character before a is different
Identifying left diverse nodes
• Bottom-up procedure
  – All nodes will have a left character label
  – Leaf node:
    • Label leaves with their left character
  – Internal node v:
    • If any child is left diverse, so is v
    • If two children have different left character labels, v is left diverse
    • Otherwise, take on the left character value of the children
• Compact representation
  – There is a compact tree T that consists only of left diverse nodes and represents all maximal repeats
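The bottom-up labeling can be sketched as follows. The sketch assumes a naive suffix trie in which the branching nodes play the role of the internal nodes of the suffix tree. The names `Node`, `DIVERSE` and `maximal_repeats` are hypothetical, as are the `$` terminator and the `^` left character of the first suffix, and both symbols are assumed not to occur in S. The traversal is recursive, so it suits short strings only.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.start = None                 # 0-based suffix start, set on leaves only


DIVERSE = object()                        # label meaning "left diverse"


def maximal_repeats(s):
    """Path labels of branching, left diverse nodes of a naive suffix trie of s."""
    text = s + "$"
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i
    found = set()

    def visit(node, label):
        if not node.children:             # leaf: label it with its left character
            return text[node.start - 1] if node.start > 0 else "^"
        labels = {visit(child, label + ch) for ch, child in node.children.items()}
        # one shared character: inherit it; otherwise (or if a child is diverse): diverse
        left = labels.pop() if len(labels) == 1 else DIVERSE
        if left is DIVERSE and label and len(node.children) >= 2:
            found.add(label)              # branching and left diverse: a maximal repeat
        return left

    visit(root, "")
    return found


repeats = maximal_repeats("xabcyiiizabcqabcyrxar")
print("abc" in repeats, "abcy" in repeats, "bc" in repeats)
```

In the running example, bc is excluded because every occurrence of bc is preceded by a, so its node is not left diverse.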
Problem definition
• Supermaximal repeats
  – Input
    • String S (length n)
  – Output
    • The set of supermaximal repeats of S
• Key property
  – A left diverse node v represents a supermaximal repeat if and only if all of v’s children are leaves, and each has a distinct left character
  – Prove this
Matching Statistics
• Setting
  – Text T of length m
  – Pattern P of length n
• Definition
  – For 1 <= i <= m, matching statistic ms(i) is the length of the longest substring beginning at T(i) that matches a substring somewhere in P
• With matching statistics, one can solve several problems with less space than a suffix tree
  – Exact matching example: P occurs at i in T if and only if ms(i) = |P|
Why study matching statistics
• With matching statistics, one can solve several problems with less space than a suffix tree
  – Exact matching example
    • We’ll show a solution with O(n) preprocessing time and O(m) search time, matching the traditional methods
    • Key observation: P matches the substring beginning at i in T if and only if ms(i) = |P|
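The key observation can be tried out with a brute-force computation of matching statistics. This sketch is a quadratic reference implementation, not the linear-time suffix-link algorithm, and the function name `matching_statistics` is hypothetical. Positions are 0-based in the code, whereas the slides count from 1.

```python
def matching_statistics(t, p):
    """ms[i] = length of the longest substring of t starting at i that occurs in p."""
    substrings = {p[a:b] for a in range(len(p)) for b in range(a + 1, len(p) + 1)}
    ms = []
    for i in range(len(t)):
        k = 0
        # grow the match one character at a time while it is still a substring of p
        while i + k < len(t) and t[i:i + k + 1] in substrings:
            k += 1
        ms.append(k)
    return ms


text, pattern = "abcxabcd", "abc"
ms = matching_statistics(text, pattern)
# P occurs at i exactly when ms(i) = |P|
occurrences = [i for i, value in enumerate(ms) if value == len(pattern)]
print(ms, occurrences)
```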
Construction Problem
• Input
– Text T of length m
– Pattern P
• Output
– Compute ms(i) for 1 <=i <= m
Solution
• Compute suffix tree of P retaining suffix links
• ms(1): match T against tree
• ms(i+1) given ms(i)
  – We are at some node v in the tree
    • If it is internal, follow the suffix link to s(v)
    • Else if it is a leaf, go up one level to parent w
      – If w is an internal node, follow the suffix link to s(w)
      – Traverse downwards using the skip/count trick until we have matched all the characters in edge label (w,v)
    • Now match against T character by character until we have a mismatch and can output ms(i+1)
Adding location of substring in P
• p(i): a location in P such that the substringat p(i) matches substring starting at T(i) for
exactly ms(i) positions• Before computing ms(i) values, mark each
node in T with the leaf number of one of itsleaves
• Simply output this value when outputtingms(i) values
Applying matching statistics to LCS problem
• Input
  – Strings S and T
• Output
  – Longest common substring of S and T
• Solution method
  – Compute suffix tree for the shorter string, say S
  – Compute ms(i) values for T
  – Maximal ms(i) value identifies LCS
Suffix Arrays
• Setting
  – Text T of length m
• Definition
  – A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
    • Add terminating character $ which is lexically smallest
• Example

    T      =  m  i  s  s  i  s  s  i  p  p  i
    i      =  1  2  3  4  5  6  7  8  9 10 11
    Pos(i) = 11  8  5  2  1 10  9  7  4  6  3
Computing Suffix Arrays
• Input
  – Text T of length m
• Output
  – Pos array
• Solution
  – Compute suffix tree of T
  – Do a lexical depth-first traversal of the suffix tree, filling in Pos(i) with the leaves in the order they are encountered
  – Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)
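For small inputs the Pos array can be obtained by sorting the suffixes directly. This is a sketch that swaps the lexical depth-first traversal of the suffix tree for a plain comparison sort, which is much slower in the worst case but yields the same order. The function name `suffix_array` is hypothetical.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text in lexicographic order."""
    # Python orders a proper prefix before any longer string, which has the same
    # effect as appending a terminator that is lexically smallest.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


print(suffix_array("mississippi"))
```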
Using Suffix Arrays
• Input
– Text T of length m
– Pattern P of length n
• Output
– All occurrences of P in T
• Solution
– Compute suffix array Pos for T
Properties of Suffix Arrays
• If P occurs in T, then all of its occurrences are grouped consecutively in Pos
• O(n log m) solution to matching problem
  – Using binary search, find the smallest index i’ such that P exactly matches the first n characters of suffix Pos(i’)
  – Similarly, find the largest index i such that P exactly matches the first n characters of suffix Pos(i)
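The two binary searches can be sketched as below. The sketch assumes Pos holds 1-based positions and is built here by plain sorting, and the name `search_suffix_array` is hypothetical. Each comparison looks at up to n characters, which gives the O(n log m) bound.

```python
def search_suffix_array(text, pos, pattern):
    """All 1-based occurrences of pattern, found with two binary searches over pos."""
    n = len(pattern)

    def prefix(k):
        # first n characters of the suffix with rank k (k is a 0-based index into pos)
        start = pos[k] - 1
        return text[start:start + n]

    lo, hi = 0, len(pos)
    while lo < hi:                       # smallest rank whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:                       # smallest rank whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])         # the consecutive block of matching suffixes


text = "mississippi"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
print(search_suffix_array(text, pos, "ssi"))
```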
Speeding up binary search
• Let L and R denote the left and right boundaries of the current search interval
  – Initialization: L = 1, R = m
• Let l and r denote the lengths of the longest prefixes of suffixes Pos(L) and Pos(R), respectively, that match a prefix of P
• Define M = ceiling((L+R)/2)
  – Define mlr = min(l,r)
  – Can begin comparison of P with suffix Pos(M) at position mlr+1
• In practice, this is sufficient to achieve O(n + log m) search time, but the worst case is still Ω(n log m)
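The mlr shortcut can be sketched as follows. This is an illustration under assumptions: indices are 0-based, the midpoint is rounded down instead of up (either choice works while R - L > 1), and the function returns the rank of the first suffix that is not lexically smaller than P. The names `lower_bound_mlr`, `extend` and `suffix_less` are hypothetical.

```python
def lower_bound_mlr(text, pos, pattern):
    """0-based rank of the first suffix >= pattern, skipping the first min(l, r) characters."""
    n = len(pattern)

    def extend(k, matched):
        # continue matching pattern against suffix pos[k] from offset `matched`
        s = pos[k] - 1
        while matched < n and s + matched < len(text) and text[s + matched] == pattern[matched]:
            matched += 1
        return matched

    def suffix_less(k, matched):
        # is suffix pos[k] lexically smaller than pattern, given `matched` agreeing characters?
        if matched == n:
            return False                      # pattern is a prefix of the suffix
        s = pos[k] - 1 + matched
        if s >= len(text):
            return True                       # suffix is a proper prefix of pattern
        return text[s] < pattern[matched]

    L, R = 0, len(pos) - 1
    l, r = extend(L, 0), extend(R, 0)
    if not suffix_less(L, l):
        return 0
    if suffix_less(R, r):
        return len(pos)
    while R - L > 1:                          # invariant: suffix L < pattern <= suffix R
        M = (L + R) // 2
        m = extend(M, min(l, r))              # the first mlr characters are known to agree
        if suffix_less(M, m):
            L, l = M, m
        else:
            R, r = M, m
    return R


text = "mississippi"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
print(lower_bound_mlr(text, pos, "ssi"))
```

Starting at mlr + 1 is safe because suffixes Pos(L) and Pos(R) both agree with P on the first mlr characters, so every suffix ranked between them does as well.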
Longest common prefixes
• Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes beginning at Pos(i) and Pos(j).
• Mississippi example
  – Pos(3) = 5 (issippi)
  – Pos(4) = 2 (ississippi)
  – Lcp(3,4) = 4
Getting to max(l,r) with Lcp’s
• L, R, M, l, r defined as before
• If l = r, compare P against suffix Pos(M) starting at position l+1 = r+1
• Suppose l > r
  – If Lcp(L,M) > l, the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and Pos(L)
  – Therefore, P agrees with suffix Pos(M) up through position l but disagrees in position l+1
  – Furthermore, suffix Pos(M) is lexically smaller than P
  – Update: L = M, l and r unchanged
Getting to max(l,r) with Lcp’s
• Suppose l > r
– If Lcp(L,M) < l, the common prefix of suffix Pos(L)
and suffix Pos(M) is shorter than the common prefix of
P and Pos(L)
– Therefore, P agrees with suffix Pos(M) up through
position Lcp(L,M).
– Character Lcp(L,M)+1 of P and of suffix Pos(L) is lexically
smaller than the corresponding character of suffix Pos(M)
– Update: R = M, r = Lcp(L,M)
Getting to max(l,r) with Lcp’s
• Suppose l > r
– If Lcp(L,M) = l, the common prefix of suffix Pos(L)
and suffix Pos(M) is equal to the common prefix of P
and Pos(L)
– Therefore, P agrees with suffix Pos(M) up through
position l and maybe even further
– Need to compare P(l+1) to corresponding position in
Pos(M)
– Update: Will update R or L according to final
determination of comparisons
O(n + log m) bound
• Since we begin at max(l,r), we never compare a matched position in P more than once
• Redundant comparisons of P are limited to at most one per binary search phase, giving us O(n + log m)
Computing Lcp values quickly
• We want to get them in O(m) time
• However, there are potentially O(m^2) different pairs (i,j) whose Lcp values could be asked for
• Crucial point
  – Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure
  – See Figure 7.7 for an example
Process for needed Lcp values
• Lcp(i,i+1): string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree from the Pos(i) leaf to the Pos(i+1) leaf
• Other Lcp values
  – Lcp(i,j) = min over k from i to j-1 of Lcp(k,k+1)
  – Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
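The relation between adjacent and general Lcp values can be checked with a small sketch. It assumes the adjacent values are computed by direct character comparison instead of from the suffix tree traversal, and the names `adjacent_lcps` and `lcp_range` are hypothetical. Ranks i and j are 1-based, as on the slides.

```python
def adjacent_lcps(text, pos):
    """adj[k-1] = Lcp(k, k+1) for ranks k = 1 .. m-1, by direct comparison."""
    def lcp(a, b):
        # a and b are 1-based start positions of two suffixes of text
        k = 0
        while a + k <= len(text) and b + k <= len(text) and text[a + k - 1] == text[b + k - 1]:
            k += 1
        return k
    return [lcp(pos[i], pos[i + 1]) for i in range(len(pos) - 1)]


def lcp_range(adj, i, j):
    """Lcp(i, j) for ranks i < j as the minimum of the adjacent values between them."""
    return min(adj[i - 1:j - 1])


text = "mississippi"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
adj = adjacent_lcps(text, pos)
print(adj, lcp_range(adj, 3, 4), lcp_range(adj, 8, 11))
```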
Lowest common ancestor
• 1-time input
– Tree T (not necessarily a suffix tree)
• Later input
– 2 nodes, v and w, of T
• Output
– lowest common ancestor of v,w in T
• Goal
– linear preprocess time
– O(1) query time
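The query interface can be illustrated with a naive parent-pointer version. This sketch does not meet the stated goal: each query costs time proportional to the depth of the tree instead of O(1), and it is meant only to pin down what lca(v,w) returns. The `parent` dictionary layout and the name `make_lca` are assumptions.

```python
def make_lca(parent):
    """parent[v] is the parent of node v; the root maps to None. Returns a query function."""
    def depth(v):
        d = 0
        while parent[v] is not None:
            v = parent[v]
            d += 1
        return d

    def lca(v, w):
        dv, dw = depth(v), depth(w)
        while dv > dw:                 # lift the deeper node to the same depth
            v, dv = parent[v], dv - 1
        while dw > dv:
            w, dw = parent[w], dw - 1
        while v != w:                  # climb in lockstep until the paths meet
            v, w = parent[v], parent[w]
        return v

    return lca


# small example tree: 0 is the root, 1 and 2 are its children
lca = make_lca({0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2})
print(lca(3, 4), lca(3, 5), lca(3, 1))
```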
Longest Common Extension
• 1-time input
– Strings S1 and S2
• Later input
– index positions i and j
• Output
– length of longest substring of S1 beginning at i that matches a substring of S2 beginning at j
• Goal
– linear preprocess time
– O(1) query time
Illustration
• Relationship to longest common substring
– Similar, but now start positions are fixed
[Figure: strings S1 and S2 with start positions i and j marked]
Solution
• Linear preprocessing
  – Create generalized suffix tree for S1 and S2
  – Compute string depth at each node
  – Process tree to allow for constant-time LCA queries
  – Establish pointers to all leaf nodes in tree
• Constant-time query processing
  – Let v be the leaf for suffix i of S1 and w the leaf for suffix j of S2
  – Find u = lca(v,w)
  – Output string depth of u
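A direct-comparison reference for the longest common extension query is handy for testing a suffix-tree implementation. This sketch takes time proportional to the answer per query, not O(1), and the function name is hypothetical. Positions i and j are 1-based, as on the slides.

```python
def longest_common_extension(s1, s2, i, j):
    """Length of the longest match of s1 starting at i with s2 starting at j (1-based)."""
    k = 0
    while i + k <= len(s1) and j + k <= len(s2) and s1[i + k - 1] == s2[j + k - 1]:
        k += 1
    return k


print(longest_common_extension("xabcy", "zabcq", 2, 2))
```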
More space-efficient solution
• Linear preprocessing (assume |S2| < |S1|)
  – Create suffix tree for S2
  – Compute matching statistics ms(i) and p(i) for S1
    • ms(i) is the length of the longest match of a substring starting at i in S1 with some substring in S2
    • p(i) is the starting point of a location in S2 where that match occurs
  – Process tree to allow for constant-time LCA queries
  – Establish pointers to all leaf nodes in tree
• Constant-time query processing
  – Find u = lca(p(v), w) in suffix tree for S2
  – Output min(ms(v), string depth of u)
    • Why is this correct?
Related Problem
• Maximal Palindromes
• Input
  – String S
• Output
  – Location of all maximal palindromes in S
• Solution
  – Longest common extensions of specific pairs of positions in S and Sr (the reverse of S)
  – O(|S|) solution
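The reduction to longest common extensions can be sketched as below. It assumes a naive character-by-character extension query, so the sketch is quadratic in the worst case; the O(|S|) bound needs constant-time queries. The names `lce` and `maximal_palindromes` are hypothetical, and results are (start, length) pairs with 1-based starts.

```python
def lce(a, b, i, j):
    """Naive longest common extension of a from i and b from j (0-based)."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def maximal_palindromes(s):
    """Maximal odd palindrome at every center and each nonempty maximal even palindrome."""
    n, r = len(s), s[::-1]
    result = []
    for c in range(n):
        # odd length, centered on s[c]: match s[c+1:] against the reverse of s[:c],
        # which begins at index n - c of the reversed string
        k = lce(s, r, c + 1, n - c)
        result.append((c - k + 1, 2 * k + 1))
    for c in range(1, n):
        # even length, centered between s[c-1] and s[c]
        k = lce(s, r, c, n - c)
        if k:
            result.append((c - k + 1, 2 * k))
    return result


print(maximal_palindromes("abaab"))
```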
Common substrings revisited
• Longest common substrings of >2 strings:
  – Input
    • Strings S1, …, SK (total length n)
  – Output
    • l(j) (and pointers to substrings) for 2 <= j <= K
• Problem with previous solution
  – O(Kn) time to compute C(v) values
  – C(v): number of distinct leaf labels in subtree rooted at node v
Definitions
• S(v): total number of leaves in v’s subtree
• U(v): number of “duplicate suffixes” from the same string that occur in v’s subtree
• C(v) = S(v) - U(v)
• ni(v) = number of leaves with identifier i in the subtree rooted at node v
• ni = total number of leaves with identifier i
Key Concepts
• Definitions
  – S(v): total number of leaves in v’s subtree
  – U(v): number of “duplicate suffixes” from the same string that occur in v’s subtree
  – ni(v) = number of leaves with identifier i in the subtree rooted at node v
  – ni = total number of leaves with identifier i
• Observations
  – U(v) = Σ_i max(ni(v) - 1, 0)
  – C(v) = S(v) - U(v)
Solution
• Computing U(v) values
  – Depth-first traversal of the tree, numbering leaves in the order that they are encountered
  – For each string label i
    • Let Li be the list of leaves with identifier i, in increasing order of their dfs numbers
    • Compute the lca of each pair of consecutive leaves in Li
    • For each node v, let h(v) denote the number of times it is the lca computed in the step above
  – Key property
    • For each identifier i with ni(v) > 0, exactly ni(v) - 1 of the lcas computed for Li lie in v’s subtree, so U(v) = Σ h(w) over all nodes w in v’s subtree
Solution
• Computing U(v) values
  – Depth-first traversal of the tree, numbering leaves in the order that they are encountered
  – Set h(v) to 0 for all nodes v
  – For each string label i
    • Compute the lca v of each pair of consecutive leaves in Li
    • Increment h(v) by 1
  – Propagate h(v) values up the tree by addition to set U(v)
  – Set C(v) = S(v) - U(v)
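The steps above can be traced on a small hand-built tree. This is a sketch under assumptions: the tree is given as a `children` dictionary plus a `leaf_id` dictionary (both hypothetical layouts, not a real suffix tree), and a naive ancestor walk stands in for the constant-time lca structure, so the sketch does not reach the O(n) bound.

```python
def count_distinct(children, leaf_id, root):
    """Return {v: C(v)} using h(v), U(v) and S(v) as defined above."""
    parent, preorder, stack = {root: None}, [], [root]
    while stack:                              # depth-first traversal, children left to right
        v = stack.pop()
        preorder.append(v)
        for c in reversed(children.get(v, [])):
            parent[c] = v
            stack.append(c)

    def lca(v, w):                            # naive: walk w upward until it meets v's ancestors
        seen = set()
        while v is not None:
            seen.add(v)
            v = parent[v]
        while w not in seen:
            w = parent[w]
        return w

    h = {v: 0 for v in preorder}
    last = {}                                 # previous leaf per identifier, in dfs order
    for v in preorder:
        if v in leaf_id:
            i = leaf_id[v]
            if i in last:
                h[lca(last[i], v)] += 1       # consecutive leaves of Li
            last[i] = v
    S = {v: 1 if v in leaf_id else 0 for v in preorder}
    U = dict(h)
    for v in reversed(preorder):              # descendants are finished before their parent
        if parent[v] is not None:
            S[parent[v]] += S[v]
            U[parent[v]] += U[v]
    return {v: S[v] - U[v] for v in preorder}


children = {"r": ["a", "b"], "a": ["x1", "x2", "x3"], "b": ["y1", "y2"]}
leaf_id = {"x1": 1, "x2": 1, "x3": 2, "y1": 1, "y2": 3}
print(count_distinct(children, leaf_id, "r"))
```

In this example node a holds leaves with identifiers 1, 1 and 2, so S(a) = 3, U(a) = 1 and C(a) = 2.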