Suffix Tree Applications


8/3/2019 Suffix Tree Applications

http://slidepdf.com/reader/full/suffix-tree-applications 1/48

Applications

• Exact string and substring matching
• Longest common substrings
• Finding and representing repeated substrings efficiently
• Applications that lead to alternative, space-efficient implementations
  – Matching statistics
  – Suffix arrays


String and substrings

• Exact string matching:
  – Input
    • Pattern P of length n
    • Text T of length m
  – Output
    • Positions of all occurrences of P in T
• Solution method
  – Preprocess to create suffix tree for T
    • O(m) time, O(m) space
  – Maximally match P in suffix tree
  – Output all leaf positions below match point
    • O(n+k) time where k is the number of matches
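
The match-then-collect-leaves idea can be sketched in Python. This is only an illustration under simplifying assumptions: it builds an uncompressed suffix trie naively in O(m^2) time and space rather than a linear-time suffix tree, it assumes '$' does not occur in the text, and the names build_suffix_trie and find_all are made up for this sketch.

```python
def build_suffix_trie(text):
    """Naive O(m^2) suffix trie of text + '$'; each leaf stores its 1-based start."""
    text += "$"  # assumption: '$' does not occur in text
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # The multi-character key "leaf" cannot clash with a one-character edge.
        node["leaf"] = start + 1
    return root


def find_all(root, pattern):
    """Match pattern from the root, then report every leaf below the match point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # pattern falls off the trie: no occurrence
        node = node[ch]
    positions, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("mississippi")
print(find_all(trie, "ssi"))
```

With a real suffix tree the walk below the match point visits O(k) nodes; in this trie it can visit more, which is one reason the sketch does not achieve the O(n+k) bound.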


String and substrings

• Exact set matching:
  – Input
    • Set of patterns {Pi} of total length n
    • Text T of length m
  – Output
    • Positions of all occurrences of each pattern Pi in T
• Solution method
  – Preprocess to create suffix tree for T
    • O(m) time, O(m) space
  – Maximally match each Pi in suffix tree
  – Output all leaf positions below match point
    • O(n+k) time where k is the total number of matches


Comparison with Aho-Corasick

• Aho-Corasick
  – O(n) preprocess time and space
    • to build keyword tree of set of patterns P
  – O(m+k) search time
• Suffix tree approach
  – O(m) preprocess time and space
    • to build suffix tree of T
  – O(n+k) search time
  – Using matching statistics (defined later), can make this tradeoff similar to that of Aho-Corasick


String and substrings

• Substring problem:
  – Input
    • Set of patterns {Pi} of total length n
    • Text T of length m (m < n now)
  – Output
    • Positions of all occurrences of T in each pattern Pi
• Solution method
  – Preprocess to create generalized suffix tree for {Pi}
    • O(n) time, O(n) space
  – Maximally match T in generalized suffix tree
  – Output all leaf positions below match point
    • O(m+k) time where k is the total number of matches


Common Substrings

• Longest Common Substring problem:
  – Input
    • Strings S and T
  – Output
    • Longest common substring of S and T (and its position in S and T)
• Solution method
  – Preprocess to create generalized suffix tree for {S,T}
  – Mark each node by whether its subtree contains a leaf node of S, of T, or of both
    • A simple postorder tree traversal does this
  – Among the nodes marked "both", the path label of the node with greatest string depth is the longest common substring of S and T


Common Substrings

• Common substrings of length k problem:
  – Input
    • Strings S and T
    • Integer k
  – Output
    • All common substrings of S and T (and their positions in S and T) of length at least k
• Solution method
  – Same as previous problem
  – Look for all nodes marked with both leaf labels (S and T) that have string depth at least k


Longest Common Substrings of >2 Strings

• Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings
• Example: {sanddollar, sandlot, handler, grand, pantry}

  j   l(j)   one string
  2   4      sand
  3   3      and
  4   3      and
  5   2      an
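
The l(j) values in the example can be checked by brute force. The sketch below follows the definition directly (enumerate every substring, count how many strings contain it); it is far slower than the suffix tree method and is only meant to make the definition concrete. The function name is hypothetical.

```python
def longest_common_lengths(strings):
    """Brute-force l(j): longest substring length common to at least j strings."""
    containing = {}  # substring -> number of input strings that contain it
    for s in strings:
        # A set, so a substring repeated inside one string is counted once.
        subs = {s[a:b] for a in range(len(s)) for b in range(a + 1, len(s) + 1)}
        for sub in subs:
            containing[sub] = containing.get(sub, 0) + 1
    return {
        j: max((len(sub) for sub, c in containing.items() if c >= j), default=0)
        for j in range(2, len(strings) + 1)
    }


words = ["sanddollar", "sandlot", "handler", "grand", "pantry"]
print(longest_common_lengths(words))
```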


Problem definition and solution

• Longest common substrings of >2 strings:
  – Input
    • Strings S1, …, SK (total length n)
  – Output
    • l(j) (and pointers to substrings) for 2 <= j <= K
• Solution
  – Build a generalized suffix tree for the K strings
    • each string has a unique end character, so each leaf shows up only once


Solution continued

– Build a generalized suffix tree for the K strings
  • each string has a unique end character, so each leaf shows up only once
– C(v): number of distinct leaf labels in subtree rooted at node v
– Given C(v) values and string-depth values, do a simple traversal of the tree to find these K-1 values and pointers to locations in substrings
– Computing C(v) efficiently
  • The number of leaves is not correct, as some leaves may have the same label
  • Use a length-K bit vector, 1 bit per string in the set
  • OR your way up the tree
    – Each OR op takes O(K) time, which gives O(Kn) running time
  • Can be improved to O(n) (shown later)
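
A minimal sketch of the bit-vector computation of C(v), assuming the tree is handed in as plain dictionaries (children maps a node to its child list, leaf_string maps a leaf to the index of the string it came from). These structures and names are illustrative, not part of the slides; Python integers stand in for the length-K bit vectors.

```python
def distinct_label_counts(children, leaf_string, root):
    """C(v) for every node v, by OR-ing one-bit-per-string masks bottom-up."""
    masks = {}

    def visit(v):
        if v in leaf_string:
            masks[v] = 1 << leaf_string[v]  # set the bit of this leaf's string
        else:
            mask = 0
            for child in children[v]:
                mask |= visit(child)  # one OR per child, O(K) bits each
            masks[v] = mask
        return masks[v]

    visit(root)
    # C(v) is the number of set bits in v's mask.
    return {v: bin(mask).count("1") for v, mask in masks.items()}


children = {"root": ["a", "b"], "a": ["l1", "l2", "l3"], "b": ["l4", "l5"]}
leaf_string = {"l1": 0, "l2": 0, "l3": 1, "l4": 2, "l5": 2}
print(distinct_label_counts(children, leaf_string, "root"))
```

Node a has three leaves but only two distinct labels, which is exactly why counting leaves alone gives the wrong C(v).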


Repeated substrings

• Given a single string S
• Definitions
  – A maximal pair in S is a pair of identical substrings α and β in S such that the character to the immediate left (right) of α is different from the character to the immediate left (right) of β
    • Add unique characters to the front and end of S to include prefixes and suffixes
  – Representation: (p1, p2, n’)
    • starting positions and length of the maximal pair
  – R(S) is the set of all triples representing maximal pairs in S


Example

• S = xabcyiiizabcqabcyrxar
•     123456789012345678901 (last digit of each position, 1 through 21)
  – (2, 10, 3) is a maximal pair
  – (10, 14, 3) is a maximal pair
  – (2, 14, 3) is not a maximal pair
    • (2, 14, 4) is a maximal pair
  – Note positions 2 and 14 are the start positions of two distinct maximal pairs
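
The four claims in this example can be checked directly against the definition. The helper below is a hypothetical sketch: it pads S with '^' and '$' as the unique front and end characters (assuming neither occurs in S) and takes 1-based positions as on the slide.

```python
def is_maximal_pair(s, p1, p2, length):
    """True if the substrings of the given length at 1-based p1 and p2 form a maximal pair."""
    padded = "^" + s + "$"  # after padding, 1-based position p sits at index p
    if p1 == p2:
        return False
    a = padded[p1:p1 + length]
    b = padded[p2:p2 + length]
    if a != b:
        return False  # not identical substrings
    left_differs = padded[p1 - 1] != padded[p2 - 1]
    right_differs = padded[p1 + length] != padded[p2 + length]
    return left_differs and right_differs


S = "xabcyiiizabcqabcyrxar"
for triple in [(2, 10, 3), (10, 14, 3), (2, 14, 3), (2, 14, 4)]:
    print(triple, is_maximal_pair(S, *triple))
```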


More definitions

• A maximal repeat α is a substring of S that is the substring defined by a maximal pair of S
• R’(S) is the set of maximal repeats
• Previous example
  – abc and abcy are maximal repeats of S
  – abc is represented only once
  – |R’(S)| is smaller than |R(S)| here, as abc shows up twice in R(S) but only once in R’(S)


Even more definitions

• A supermaximal repeat α is a maximal repeat of S that never occurs as a substring of another maximal repeat of S
• Previous example
  – abcy is a supermaximal repeat of S
  – abc is NOT a supermaximal repeat of S


Problem definition

• Maximal repeats

 – Input

• String S (length n)

 – Output

• R’(S) 


Properties of maximal repeats

• Construct suffix tree T for S
• Lemma
  – If α is a maximal repeat in S, then α is the path-label of an internal node v in T
    • α does not end in the middle of an edge
    • (captures that the next character after α is distinct)
• Corollary
  – There are at most n maximal repeats
    • n leaves
    • all internal nodes except the root have at least two children
    • therefore, at most n internal nodes


More properties of maximal repeats

• Definitions
  – Character S(i-1) is the left character of position i
  – The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
  – A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters
• Theorem
  – String α labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
    • Captures that the character before α is different


Identifying left diverse nodes

• Bottom up procedure
  – All nodes will have a left character label
  – Leaf node:
    • Label leaves with their left character
  – Internal node v:
    • If any child is left diverse, so is v
    • If two children have different left character labels, v is left diverse
    • Otherwise, take on the left character value of the children
• Compact representation
  – There is a compact tree T that consists only of left diverse nodes and represents all maximal repeats
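
The bottom-up procedure can be tried out on a small input. The sketch below is an illustration, not the linear-time method: it uses a naive uncompressed suffix trie (quadratic size), in which the branching nodes play the role of the suffix tree's internal nodes. It assumes '$' and '^' do not occur in S and uses '^' as the left character of position 1; the function name and the "#left" key are made up for this example.

```python
def maximal_repeats(s):
    """Maximal repeats of s via left-diverse branching nodes of a naive suffix trie."""
    text = s + "$"  # assumption: '$' and '^' do not occur in s
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        # Leaf: remember the left character of this suffix ('^' before position 1).
        node["#left"] = text[i - 1] if i > 0 else "^"

    DIVERSE = object()  # marker meaning "this subtree is left diverse"
    repeats = set()

    def visit(node, label):
        if "#left" in node:  # leaf: labelled with its left character
            return node["#left"]
        labels = [visit(child, label + ch) for ch, child in node.items()]
        if any(x is DIVERSE for x in labels) or len(set(labels)) > 1:
            result = DIVERSE
        else:
            result = labels[0]  # all children agree: inherit their left character
        # Branching + left diverse + not the root => path label is a maximal repeat.
        if result is DIVERSE and len(node) >= 2 and label:
            repeats.add(label)
        return result

    visit(root, "")
    return repeats


print(sorted(maximal_repeats("xabcyiiizabcqabcyrxar")))
```

On the running example this reports abc and abcy, along with the shorter maximal repeats a, xa, i, ii and r.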


Problem definition

• Supermaximal repeats
  – Input
    • String S (length n)
  – Output
    • The set of supermaximal repeats of S
• Key property
  – A left diverse node v represents a supermaximal repeat if and only if all of v’s children are leaves, and each has a distinct left character
  – Prove this


Matching Statistics

• Setting
  – Text T of length m
  – Pattern P of length n
• Definition
  – For 1 <= i <= m, matching statistic ms(i) is the length of the longest substring beginning at T(i) that matches a substring somewhere in P
• With matching statistics, one can solve several problems with less space than a suffix tree
  – Exact matching example: P occurs at i in T if and only if ms(i) = |P|
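
To make the definition concrete, here is a brute-force sketch that computes ms(i) straight from the definition using Python's substring test. It does not use a suffix tree or suffix links and is much slower than the method developed in these slides; the function names are hypothetical.

```python
def matching_statistics(text, pattern):
    """ms[i-1] = length of the longest substring of text starting at position i
    that occurs somewhere in pattern (brute force, by definition)."""
    ms = []
    for i in range(len(text)):
        length = 0
        # Extend while the next longer candidate still occurs in pattern.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


def occurrences(text, pattern):
    """1-based positions i with ms(i) = |P|, i.e. exact occurrences of pattern."""
    ms = matching_statistics(text, pattern)
    return [i + 1 for i, value in enumerate(ms) if value == len(pattern)]


print(matching_statistics("xabcy", "abc"))
print(occurrences("mississippi", "ssi"))
```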


Why study matching statistics

• With matching statistics, one can solve several problems with less space than a suffix tree
  – Exact matching example
    • We’ll show a solution with O(n) preprocessing time and O(m) search time, matching the traditional methods
    • Key observation: P matches the substring beginning at i in T if and only if ms(i) = |P|


Construction Problem

• Input

 – Text T of length m

 – Pattern P

• Output

 – Compute ms(i) for 1 <=i <= m


Solution

• Compute suffix tree of P retaining suffix links
• ms(1): match T against tree
• ms(i+1) given ms(i)
  – we are at some node v in the tree
    • If it is internal, follow suffix link to s(v)
    • Else if it is a leaf, go up one level to parent w
      – If w is an internal node, follow suffix link to s(w)
      – Traverse downwards using skip/count trick until we have matched all the characters in edge label (w,v)
    • Now match against T character by character till we have a mismatch and can output ms(i+1)


Adding location of substring in P

• p(i): a location in P such that the substring at p(i) matches the substring starting at T(i) for exactly ms(i) positions
• Before computing ms(i) values, mark each node in the suffix tree of P with the leaf number of one of its leaves
• Simply output this value when outputting ms(i) values


Applying matching statistics to LCS problem

• Input
  – strings S and T
• Output
  – longest common substring of S and T
• Solution method
  – Compute suffix tree for the shorter string, say S
  – Compute ms(i) values for T
  – Maximal ms(i) value identifies LCS
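
A sketch of this method with the suffix tree swapped out for brute force: the shorter string is searched with Python's substring test, ms(i) is computed for the longer string, and the position with the largest ms(i) gives the answer. It illustrates why the maximal ms(i) identifies the LCS but not the time or space bounds; the function name is hypothetical.

```python
def longest_common_substring(s, t):
    """Longest common substring via the largest matching statistic (brute force)."""
    if len(s) > len(t):
        s, t = t, s  # s is the shorter string, the one that would be indexed
    best_start, best_len = 0, 0
    for i in range(len(t)):
        length = 0
        # ms(i): longest substring of t starting at i that occurs in s.
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        if length > best_len:
            best_start, best_len = i, length
    return t[best_start:best_start + best_len]


print(longest_common_substring("sanddollar", "sandlot"))
```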


Suffix Arrays

• Setting
  – Text T of length m
• Definition
  – A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
    • Add terminating character $ which is lexically smallest
• Example

  T                  m  i  s  s  i  s  s  i  p  p  i
  i                  1  2  3  4  5  6  7  8  9 10 11
  Pos(i)            11  8  5  2  1 10  9  7  4  6  3
  rank of suffix i   5  4 11  9  3 10  8  2  7  6  1
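
The example can be reproduced by sorting the suffixes directly. This is a simple sketch (sorting costs far more than the suffix tree traversal in the worst case); Python's string comparison already orders a proper prefix before any longer string, which has the same effect as the lexically smallest $.

```python
def suffix_array(text):
    """Pos as a list of 1-based suffix start positions in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def ranks(pos):
    """Inverse of Pos: rank[i-1] is the lexicographic rank of suffix i."""
    rank = [0] * len(pos)
    for r, start in enumerate(pos, start=1):
        rank[start - 1] = r
    return rank


pos = suffix_array("mississippi")
print(pos)
print(ranks(pos))
```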


Computing Suffix Arrays

• Input
  – Text T of length m
• Output
  – Pos array
• Solution
  – Compute suffix tree of T
  – Do a lexical depth-first traversal of the suffix tree, filling in Pos with the leaves in the order they are encountered
  – Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)


Using Suffix Arrays

• Input

 – Text T of length m

 – Pattern P of length n

• Output

 – All occurrences of P in T

• Solution

 – Compute suffix array Pos for T


Properties of Suffix Arrays

• If P occurs in T, then all of its locations are grouped consecutively in Pos
• O(n log m) solution to matching problem
  – Using binary search, find the smallest index i’ such that P exactly matches the first n characters of suffix Pos(i’)
  – Similarly, find the largest index i such that P exactly matches the first n characters of suffix Pos(i)
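
The two binary searches can be sketched as follows. This is the plain O(n log m) version with no Lcp acceleration; the suffix array is built by sorting for brevity, and the function name is hypothetical.

```python
def find_occurrences(text, pattern):
    """1-based start positions of pattern in text, via two binary searches on Pos."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # 0-based suffix array
    n = len(pattern)

    def prefix(k):
        # First n characters of the k-th smallest suffix.
        return text[pos[k]:pos[k] + n]

    # Smallest index whose n-character prefix is >= pattern.
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Smallest index whose n-character prefix is > pattern.
    hi = len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Everything in pos[first:lo] starts with pattern; report 1-based positions.
    return sorted(p + 1 for p in pos[first:lo])


print(find_occurrences("mississippi", "issi"))
```

Each probe compares up to n characters, and there are O(log m) probes per search, which is where the O(n log m) bound comes from.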


Speeding up binary search

• Let L and R denote the left and right boundaries of the current search interval
  – Initialization: L = 1, R = m
• Let l and r denote the lengths of the longest prefixes of suffixes Pos(L) and Pos(R), respectively, that match a prefix of P
• Define M = ceiling((L+R)/2)
  – Define mlr = min(l,r)
  – Can begin comparison of P against suffix Pos(M) at position mlr+1
• In practice, this is sufficient to achieve O(n + log m) search time, but the worst case is Ω(n log m)


Longest common prefixes

• Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes beginning at Pos(i) and Pos(j).
• Mississippi example
  – Pos(3) = 5 (issippi)
  – Pos(4) = 2 (ississippi)
  – Lcp(3,4) = 4


Getting to max(l,r) with Lcp’s 

• L, R, M, l, r defined as before
• If l = r, compare P against suffix Pos(M) starting at position l+1 = r+1
• Suppose l > r
  – If Lcp(L,M) > l, the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and suffix Pos(L)
  – Therefore, P agrees with suffix Pos(M) up through position l but disagrees in position l+1
  – Furthermore, suffix Pos(M) is lexically smaller than P
  – Update: L = M, l and r unchanged


Getting to max(l,r) with Lcp’s 

• Suppose l > r
  – If Lcp(L,M) < l, the common prefix of suffix Pos(L) and suffix Pos(M) is shorter than the common prefix of P and suffix Pos(L)
  – Therefore, P agrees with suffix Pos(M) up through position Lcp(L,M)
  – Character Lcp(L,M)+1 of P and of suffix Pos(L) is lexically smaller than the corresponding character of suffix Pos(M)
  – Update: R = M, r = Lcp(L,M)


Getting to max(l,r) with Lcp’s 

• Suppose l > r
  – If Lcp(L,M) = l, the common prefix of suffix Pos(L) and suffix Pos(M) is the same length as the common prefix of P and suffix Pos(L)
  – Therefore, P agrees with suffix Pos(M) up through position l and maybe even further
  – Need to compare P starting at position l+1 to the corresponding positions in suffix Pos(M)
  – Update: Will update R or L according to the outcome of the comparisons


O(n + log m) bound

• Since we begin at max(l,r), we never compare a matched position in P more than once
• Redundant comparisons of P are limited to at most one per binary search iteration, giving us O(n + log m)


Computing Lcp values quickly

• We want to get them in O(m) time
• However, there are potentially O(m²) different (i,j) pairs for which an Lcp value could be asked
• Crucial point
  – Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure
  – See Figure 7.7 for an example


Process for needed Lcp values

• Lcp(i,i+1): string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree when going from the Pos(i) leaf to the Pos(i+1) leaf
• Other Lcp values
  – Lcp(i,j): min over k from i to j-1 of Lcp(k,k+1)
  – Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
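
The relation between adjacent and general Lcp values can be checked on the mississippi example. In this sketch the adjacent values are computed by direct character comparison rather than from the suffix tree traversal, and the helper names are hypothetical.

```python
def adjacent_lcps(text):
    """adj[k-1] = Lcp(k, k+1) for the suffix array of text (1-based k)."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])

    def common_prefix(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k

    return [common_prefix(pos[k], pos[k + 1]) for k in range(len(pos) - 1)]


def lcp(adj, i, j):
    """Lcp(i, j) for 1-based ranks i < j, as the min of the adjacent values between them."""
    return min(adj[i - 1:j - 1])


adj = adjacent_lcps("mississippi")
print(adj)
print(lcp(adj, 3, 4))
```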


Lowest common ancestor

• 1-time input

 –  Tree T (not necessarily a suffix tree)

• Later input

 –  2 nodes, v and w, of T

• Output

 –  lowest common ancestor of v,w in T

• Goal

 –  linear preprocess time

 –  O(1) query time


Longest Common Extension

• 1-time input

 –  Strings S1 and S2 

• Later input

 –  index positions i and j

• Output

 –  length of longest substring of S1 beginning at i that matches substring of S2 beginning at j

• Goal

 –  linear preprocess time

 –  O(1) query time


Illustration

• Relationship to longest common substring

 – Similar, but now start positions are fixed

[Figure: S1 with position i marked and S2 with position j marked]


Solution

• Linear preprocessing
  – Create generalized suffix tree for S1 and S2
  – Compute string depth at each node
  – Process tree to allow for constant time LCA queries
  – Establish pointers to all leaf nodes in tree
• Constant time query processing
  – Find u = lca(v,w), where v and w are the leaves for suffix i of S1 and suffix j of S2
  – Output string depth of u


More space-efficient solution

• Linear preprocessing (assume |S2| < |S1|)
  – Create suffix tree for S2
  – Compute matching statistics ms(i) and p(i) for S1
    • ms(i): length of longest match of the substring starting at i in S1 with some substring in S2
    • p(i): the starting point of a location in S2 where that match occurs
  – Process tree to allow for constant time LCA queries
  – Establish pointers to all leaf nodes in tree
• Constant time query processing, for a query (i, j)
  – Find u = lca of the leaves for suffixes p(i) and j in the suffix tree for S2
  – Output min(ms(i), string depth of u)
    • why is this correct?


Related Problem

• Maximal Palindromes

• Input
  – String S
• Output
  – Location of all maximal palindromes in S
• Solution
  – Longest common extensions of specific pairs of positions in S and Sr (the reverse of S)
  – O(|S|) solution
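
Which pairs of positions are meant can be seen in a small sketch. Here the longest common extension is computed by direct comparison, so the whole thing is quadratic in the worst case rather than O(|S|); with the constant-time LCE queries described earlier the same loop would run in linear time. Function names and the (start, length) output format are assumptions of this example.

```python
def lce(a, i, b, j):
    """Naive longest common extension of a from index i and b from index j (0-based)."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def maximal_palindromes(s):
    """(1-based start, length) of the maximal palindrome at every center of s."""
    n = len(s)
    r = s[::-1]  # Sr: index n-q in r corresponds to index q-1 in s
    result = []
    for q in range(n):
        # Odd length, centered on s[q]: extend right from q+1 and left from q-1.
        k = lce(s, q + 1, r, n - q)
        result.append((q - k + 1, 2 * k + 1))
        # Even length, centered between s[q-1] and s[q].
        if q > 0:
            k = lce(s, q, r, n - q)
            if k > 0:
                result.append((q - k + 1, 2 * k))
    return result


print(max(maximal_palindromes("abacaba"), key=lambda p: p[1]))
```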


Common substrings revisited

• Longest common substrings of >2 strings:
  – Input
    • Strings S1, …, SK (total length n)
  – Output
    • l(j) (and pointers to substrings) for 2 <= j <= K
• Problem with previous solution
  – O(Kn) time to compute C(v) values
  – C(v): number of distinct leaf labels in subtree rooted at node v


Definitions

• S(v): total number of leaves in v’s subtree
• U(v): number of “duplicate suffixes” from same string that occur in v’s subtree
• C(v) = S(v) - U(v)
• ni(v) = number of leaves with identifier i in the subtree rooted at node v
• ni = total number of leaves with identifier i


Key Concepts

• Definitions
  – S(v): total number of leaves in v’s subtree
  – U(v): number of “duplicate suffixes” from same string that occur in v’s subtree
  – ni(v) = number of leaves with identifier i in the subtree rooted at node v
  – ni = total number of leaves with identifier i
• Observations
  – U(v) = Σ max(ni(v) - 1, 0), summed over all string identifiers i
  – C(v) = S(v) - U(v)


Solution

• Computing U(v) values
  – DF traversal of tree numbering leaves in the order that they are encountered
  – For each string label i
    • Let Li be the list of leaves with identifier i, in increasing order of their dfs numbers
    • Compute the lca of each pair of consecutive leaves in Li
    • For each node v, let h(v) denote the number of times it is the lca computed in the step above
  – Key property
    • U(v) = Σ h(w) over all nodes w in v’s subtree, since exactly ni(v) - 1 of the lcas computed for Li lie in v’s subtree whenever ni(v) > 0


Solution

• Computing U(v) values
  – DF traversal of tree numbering leaves in the order that they are encountered
  – Set h(v) to 0 for all nodes v
  – For each string label i
    • Compute the lca v of each pair of consecutive leaves in Li
    • Increment h(v) by 1
  – Propagate h(v) values up the tree by addition to set U(v)
  – Set C(v) = S(v) - U(v)
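
These steps can be sketched on a small hand-built tree. Two assumptions to note: the tree is given as plain dictionaries (children and leaf_string are illustrative structures, not from the slides), and the lca is found by walking up parent pointers, which is not constant time, so the sketch shows the counting argument rather than the O(n) bound.

```python
def distinct_counts_via_lca(children, leaf_string, root):
    """C(v) = S(v) - U(v), with U(v) obtained by summing h(w) over v's subtree."""
    parent, depth, order = {root: None}, {root: 0}, []
    stack = [root]
    while stack:  # preorder DFS; leaves appear in 'order' in DFS order
        v = stack.pop()
        order.append(v)
        for c in reversed(children.get(v, [])):
            parent[c], depth[c] = v, depth[v] + 1
            stack.append(c)

    def lca(a, b):  # naive walk-up; a real solution would use O(1) LCA queries
        while a != b:
            if depth[a] < depth[b]:
                a, b = b, a
            a = parent[a]
        return a

    h = {v: 0 for v in order}
    last_leaf = {}  # string identifier -> previous leaf of that string in DFS order
    for v in order:
        if v in leaf_string:
            i = leaf_string[v]
            if i in last_leaf:
                h[lca(last_leaf[i], v)] += 1  # consecutive pair in L_i
            last_leaf[i] = v

    S = {v: 1 if v in leaf_string else 0 for v in order}
    U = dict(h)
    for v in reversed(order):  # every node is handled before its parent
        if parent[v] is not None:
            S[parent[v]] += S[v]
            U[parent[v]] += U[v]
    return {v: S[v] - U[v] for v in order}


children = {"root": ["a", "b"], "a": ["l1", "l2", "l3"], "b": ["l4", "l5", "l6"]}
leaf_string = {"l1": 0, "l2": 0, "l3": 1, "l4": 2, "l5": 2, "l6": 0}
print(distinct_counts_via_lca(children, leaf_string, "root"))
```

In this tree string 0 has leaves under both a and b, so one of its lcas lands on the root; the root's U value is 3 and C(root) = 6 - 3 = 3.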