Suffix Tree Applications

8/3/2019 Suffix Tree Applications


Applications

• Exact string and substring matching

• Longest common substrings

• Finding and representing repeated substrings efficiently

• Applications that lead to alternative, space-efficient implementations

– Matching statistics

– Suffix arrays



String and substrings

• Exact string matching:

– Input

• Pattern P of length n

• Text T of length m

– Output

• Positions of all occurrences of P in T

• Solution method

– Preprocess to create suffix tree for T

• O(m) time, O(m) space

– Maximally match P in suffix tree

– Output all leaf positions below match point

• O(n+k) time where k is number of matches
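The matching step above can be sketched in a few lines. This is only an illustration: it assumes a naive suffix trie (quadratic size) in place of a linear-time suffix tree, and the names `build_suffix_trie` and `find_occurrences` are hypothetical, not from the slides.

```python
def build_suffix_trie(text):
    """Naive suffix trie of `text` (quadratic size). Each node records the
    1-based start positions of the suffixes that pass through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find_occurrences(root, pattern):
    """Walk a non-empty `pattern` down from the root. The positions stored at
    the match point play the role of the leaves below it in a suffix tree."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []          # pattern falls off the trie: no occurrence
    return list(node["starts"])
```

Once the trie for T exists, each additional pattern costs only a walk of its own length plus the size of the output, which is the point of preprocessing the text rather than the pattern.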




• Exact set matching:

– Input

• Set of patterns {Pi} of total length n

• Text T of length m

– Output

• Positions of all occurrences of each pattern Pi in T

• Solution method

– Preprocess to create suffix tree for T

• O(m) time, O(m) space

– Maximally match each Pi in suffix tree

• O(n+k) time where k is the total number of matches



Comparison with Aho-Corasick

• Aho-Corasick

– O(n) preprocess time and space

• to build keyword tree of the set of patterns

– O(m+k) search time

• Suffix tree approach

– O(m) preprocess time and space

• to build suffix tree of T

– O(n+k) search time

– Using matching statistics (defined later), this tradeoff can be made similar to that of Aho-Corasick




• Substring problem:

– Input

• Set of patterns {Pi} of total length n

• Text T of length m (m < n now)

– Output

• Positions of all occurrences of T in each pattern Pi

• Solution method

– Preprocess to create generalized suffix tree for {Pi}

• O(n) time, O(n) space

– Maximally match T in generalized suffix tree

• O(m+k) time where k is the total number of matches



Common Substrings

• Longest common substring problem:

– Input

• Strings S and T

– Output

• Longest common substring of S and T (and its position in S and T)

• Solution method

– Preprocess to create generalized suffix tree for {S,T}

– Mark each node by whether its subtree contains a leaf of S, of T, or of both

• A simple postorder tree traversal does this

– Among the nodes marked with both, the path label of the node with greatest string depth is the longest common substring of S and T
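A minimal sketch of the marking idea, under loud assumptions: it uses an uncompressed suffix trie instead of a generalized suffix tree, and it sets the marks while inserting suffixes, which in a trie gives the same marks as the postorder pass. The names `Node` and `longest_common_substring` are illustrative only.

```python
class Node:
    __slots__ = ("children", "mark")

    def __init__(self):
        self.children = {}
        self.mark = 0   # bit 1: on a suffix of S, bit 2: on a suffix of T


def longest_common_substring(s, t):
    root = Node()
    for bit, string in ((1, s), (2, t)):
        for start in range(len(string)):
            node = root
            for ch in string[start:]:
                node = node.children.setdefault(ch, Node())
                node.mark |= bit
    # Explore only nodes marked by both strings and keep the deepest label.
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node.children.items():
            if child.mark == 3:
                stack.append((child, label + ch))
    return best
```

When several common substrings share the maximum length, this sketch returns whichever one the traversal reaches first.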



Common Substrings

• Common substrings of length k problem:

– Input

• Strings S and T

• Integer k

– Output

• All common substrings of S and T (and their positions in S and T) of length at least k

• Solution method

– Same as previous problem

– Look for all nodes marked with both strings whose string depth is at least k



Longest Common Substrings of >2 Strings

• Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings

• Example: {sanddollar, sandlot, handler, grand, pantry}

– j    l(j)   one string

– 2    4      sand

– 3    3      and

– 4    3      and

– 5    2      an



Problem definition and solution

• Longest common substrings of >2 strings:

– Input

• Strings S1, …, SK (total length n)

– Output

• l(j) (and pointers to substrings) for 2 <= j <= K

• Solution

– Build a generalized suffix tree for the K strings

• each string has a unique end character, so each leaf shows up only once



Solution continued

– Build a generalized suffix tree for the K strings

• each string has a unique end character, so each leaf shows up only once

– C(v): number of distinct leaf labels in subtree rooted at node v

– Given C(v) values and string-depth values, do a simple traversal of the tree to find these K-1 values and pointers to locations in substrings

– Computing C(v) efficiently

• the number of leaves is not correct, as some leaves may have the same label

• length-K bit vector, 1 bit per string in the set

• OR your way up the tree

– Each OR operation takes O(K) time, which gives O(Kn) running time

• Can be improved to O(n) later
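The bit-vector computation of C(v) can be sketched as follows. As an assumption for brevity, the sketch uses an uncompressed suffix trie without end characters rather than a generalized suffix tree, and Python integers serve as the K-bit vectors. The name `longest_common_lengths` is hypothetical.

```python
def longest_common_lengths(strings):
    """Return {j: (l(j), one substring of that length)} for 2 <= j <= K."""
    def new_node():
        return {"children": {}, "bits": 0}

    root = new_node()
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                if ch not in node["children"]:
                    node["children"][ch] = new_node()
                node = node["children"][ch]
            node["bits"] |= 1 << idx          # a suffix of string idx ends here

    # Preorder list; read in reverse, it visits every child before its parent.
    order, stack = [], [(root, "")]
    while stack:
        node, label = stack.pop()
        order.append((node, label))
        for ch, child in node["children"].items():
            stack.append((child, label + ch))

    best = {j: (0, "") for j in range(2, len(strings) + 1)}
    for node, label in reversed(order):
        for child in node["children"].values():
            node["bits"] |= child["bits"]     # OR the children's vectors upward
        count = bin(node["bits"]).count("1")  # C(v)
        for j in range(2, count + 1):
            if len(label) > best[j][0]:
                best[j] = (len(label), label)
    return best
```

On {sanddollar, sandlot, handler, grand, pantry} the lengths come out as 4, 3, 3, 2 for j = 2, 3, 4, 5. The substring reported for j = 2 may be "sand" or "andl", since both have length 4 and occur in two strings.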



Repeated substrings

• Given a single string S

• Definitions

– A maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different from the character to the immediate left (right) of b

• Add unique characters to the front and end of S to include prefixes and suffixes

– Representation: (p1, p2, n’)

• starting positions and length of the maximal pair

– R(S) is the set of all triples representing maximal pairs in S



Example

• S = xabcyiiizabcqabcyrxar

•     123456789012345678901 (positions 1 to 21, last digit only)

– (2, 10, 3) is a maximal pair

– (10, 14, 3) is a maximal pair

– (2, 14, 3) is not a maximal pair

• (2, 14, 4) is a maximal pair

– Note positions 2 and 14 are the start positions

of two distinct maximal pairs
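The triples in this example can be checked directly against the definition. The sketch below is a brute-force test, not an algorithm from the slides. The function name `is_maximal_pair` and the sentinel characters "^" and "$" are assumptions, and the sentinels must not occur in S.

```python
def is_maximal_pair(s, p1, p2, n):
    """Check the triple (p1, p2, n): 1-based start positions and length."""
    if p1 == p2 or n < 1 or min(p1, p2) < 1 or max(p1, p2) + n - 1 > len(s):
        return False
    padded = "^" + s + "$"      # 1-based position p in s is index p in padded
    if padded[p1:p1 + n] != padded[p2:p2 + n]:
        return False            # the two substrings are not identical
    left_differs = padded[p1 - 1] != padded[p2 - 1]
    right_differs = padded[p1 + n] != padded[p2 + n]
    return left_differs and right_differs
```

For instance, (2, 14, 3) fails because both copies of abc are followed by y, while extending the length to 4 gives different right characters (i and r).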



More definitions

• A maximal repeat a is a substring of S that is the substring defined by a maximal pair of S

• R’(S) is the set of maximal repeats

• Previous example

– abc and abcy are maximal repeats of S

– abc is represented only once

– |R’(S)| is no larger than |R(S)|: abc accounts for two triples in R(S) but appears only once in R’(S)



Even more definitions

• A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S

• Previous example

– abcy is a supermaximal repeat of S

– abc is NOT a supermaximal repeat of S



Problem definition

• Maximal repeats

– Input

• String S (length n)

– Output

• R’(S)



Properties of maximal repeats

• Construct suffix tree T for S

• Lemma

– If a is a maximal repeat in S, then a is the path label of an internal node v in T

• a does not end in the middle of an edge

• (captures that the next character after a is distinct)

• Corollary

– There are at most n maximal repeats

• n leaves

• all internal nodes except the root have at least two children

• therefore, at most n internal nodes



More properties of maximal repeats

• Definitions

– Character S(i-1) is the left character of position i

– The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf

– A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters

• Theorem

– String a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse

• Captures that the character before a is different



Identifying left diverse nodes

• Bottom-up procedure

– All nodes will have a left character label

– Leaf node:

• Label leaves with their left character

– Internal node v:

• If any child is left diverse, so is v

• If two children have different left character labels, v is left diverse

• Otherwise, take on the left character value of the children

• Compact representation

– There is a compact tree that consists only of left diverse nodes and represents all maximal repeats



Problem definition

• Supermaximal repeats

– Input

• String S (length n)

– Output

• The set of supermaximal repeats of S

• Key property

– A left diverse node v represents a supermaximal repeat if and only if all of v’s children are leaves, and each has a distinct left character

– Prove this
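For small inputs, the left-diverse characterization and the supermaximal definition can be explored by brute force. The sketch below is not the suffix-tree algorithm. It enumerates substrings directly, treats "at least two distinct right characters" as standing in for "path label of an internal node", and assumes sentinels "^" and "$" that do not occur in S. The function names are hypothetical.

```python
from collections import defaultdict


def maximal_repeats(s):
    padded = "^" + s + "$"      # 1-based position p in s is index p in padded
    starts_of = defaultdict(list)
    for start in range(1, len(s) + 1):
        for end in range(start + 1, len(s) + 2):
            starts_of[padded[start:end]].append(start)
    repeats = set()
    for sub, starts in starts_of.items():
        left_chars = {padded[p - 1] for p in starts}          # left diversity
        right_chars = {padded[p + len(sub)] for p in starts}  # branching node
        if len(left_chars) >= 2 and len(right_chars) >= 2:
            repeats.add(sub)
    return repeats


def supermaximal_repeats(s):
    reps = maximal_repeats(s)
    # Keep the maximal repeats that are not a substring of another one.
    return {r for r in reps if not any(r != other and r in other for other in reps)}
```

On S = xabcyiiizabcqabcyrxar, both abc and abcy are reported as maximal repeats, and only abcy of the two is supermaximal.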



Matching Statistics

• Setting

– Text T of length m

– Pattern P of length n

• Definition

– For 1 <= i <= m, matching statistic ms(i) is the length of the longest substring beginning at T(i) that matches a substring somewhere in P

• With matching statistics, one can solve several problems with less space than a suffix tree

– Exact matching example: P occurs at i in T if and only if ms(i) = |P|



Why study matching statistics

• With matching statistics, one can solve several problems with less space than a suffix tree

– Exact matching example

• We’ll show an O(n) preprocessing time and O(m)

search time solution matching the traditional

methods

• Key observation: P matches substring beginning at i

in T if and only if ms(i) = |P|
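To make the definition concrete, here is a brute-force sketch that computes ms(i) straight from the definition and then applies the key observation. It is quadratic or worse and is not the linear-time suffix-link construction. The names `matching_statistics` and `occurrences` are illustrative assumptions.

```python
def matching_statistics(text, pattern):
    """ms[i-1] is the length of the longest substring starting at T(i)
    that occurs somewhere in P (naive computation from the definition)."""
    ms = []
    for i in range(len(text)):
        length = 0
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


def occurrences(text, pattern):
    """1-based positions i in T where ms(i) = |P|."""
    ms = matching_statistics(text, pattern)
    return [i + 1 for i, value in enumerate(ms) if value == len(pattern)]
```

With T = abcxabcdex and P = wyabcwzqabcdw, ms(1) = 3 and ms(5) = 4.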



Construction Problem

• Input

– Text T of length m

– Pattern P

• Output

– Compute ms(i) for 1 <= i <= m



Solution

• Compute suffix tree of P retaining suffix links

• ms(1): match T against tree

• ms(i+1) given ms(i)

– we are at some node v in the tree

• If it is internal, follow suffix link to s(v)

• Else if it is a leaf, go up one level to parent w

– If w is an internal node, follow suffix link to s(w)

– Traverse downwards using skip/count trick until we have matched all the characters in edge label (w,v)

• Now match against T character by character until we have a mismatch and can output ms(i+1)



Adding location of substring in P

• p(i): a location in P such that the substring at p(i) matches the substring starting at T(i) for exactly ms(i) positions

• Before computing ms(i) values, mark each node in the suffix tree of P with the leaf number of one of its leaves

• Simply output this value when outputting ms(i) values



Applying matching statistics to LCS problem

• Input

– strings S and T

• Output

– longest common substring of S and T

• Solution method

– Compute suffix tree for the shorter string, say S

– Compute ms(i) values for T

– Maximal ms(i) value identifies LCS



Suffix Arrays

• Setting

– Text T of length m

• Definition

– A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T

• Add terminating character $, which is lexically smallest

• Example

– i       1  2  3  4  5  6  7  8  9  10 11

– T(i)    m  i  s  s  i  s  s  i  p  p  i

– Pos(i)  11 8  5  2  1  10 9  7  4  6  3
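The definition can be checked on small inputs by sorting the suffixes directly. This sketch is a naive construction, not a linear-time one. The name `suffix_array` is an assumption, and it relies on Python ordering a string before any of its extensions, which matches treating $ as the smallest character.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of `text`, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


pos = suffix_array("mississippi")
# Pos(1) = 11 is the suffix "i", Pos(3) = 5 is "issippi", Pos(4) = 2 is "ississippi".
```

For mississippi, the sorted suffixes are i, ippi, issippi, ississippi, mississippi, pi, ppi, sippi, sissippi, ssippi, ssissippi.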



Computing Suffix Arrays

• Input

– Text T of length m

• Output

– Pos array

• Solution

– Compute suffix tree of T

– Do a lexical depth-first traversal of the suffix tree, filling in Pos(i) with the leaves in the order they are encountered

– Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)



Using Suffix Arrays

• Input

– Text T of length m

– Pattern P of length n

• Output

– All occurrences of P in T

• Solution

– Compute suffix array Pos for T



Properties of Suffix Arrays

• If P is in T, then all of its occurrences will be grouped consecutively in Pos

• O(n log m) solution to matching problem

– Using binary search, find the smallest index i’ such that P exactly matches the first n characters of suffix Pos(i’)

– Similarly, find the largest index i such that P exactly matches the first n characters of suffix Pos(i)
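A minimal sketch of the two binary searches, assuming the suffix array is built by naive sorting and that positions are 1-based as on the slides. The names `suffix_array` and `find_with_suffix_array` are hypothetical.

```python
def suffix_array(text):
    """Naive construction: sort the 1-based suffix start positions."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_with_suffix_array(text, pos, pattern):
    """All 1-based occurrences of `pattern`, using O(log m) comparisons of n characters."""
    n = len(pattern)

    def prefix(k):
        # First n characters of suffix Pos(k), with k a 0-based index into pos.
        start = pos[k] - 1
        return text[start:start + n]

    lo, hi = 0, len(pos)
    while lo < hi:                      # smallest index whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:                      # smallest index whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])        # the occurrences form one contiguous block
```

The second search starts from the first boundary, so both boundaries of the block are found with two passes over at most log m midpoints each.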



Speeding up binary search

• Let L and R denote the left and right boundaries of the current search interval

– Initialization: L = 1, R = m

• Let l and r denote the length of the longest prefix of suffix Pos(L) and suffix Pos(R), respectively, that matches a prefix of P

• Define M = ceiling((L+R)/2)

– Define mlr = min(l,r)

– Can begin comparison of P with suffix Pos(M) at position mlr+1

• In practice, this is sufficient to achieve O(n + log m) search time, but the worst case is Ω(n log m)
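The mlr shortcut can be sketched as a lower-bound search that skips the first min(l, r) characters at every midpoint. This is an illustration under assumptions: indices into Pos are 0-based inside the code, the suffix array is built naively, and the names `suffix_array` and `lower_bound_mlr` are not from the slides.

```python
def suffix_array(text):
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def lower_bound_mlr(text, pos, pattern):
    """Smallest 0-based index k with suffix Pos(k) >= pattern, or len(pos) if none."""
    n = len(pattern)

    def extend(k, matched):
        # Compare pattern with suffix Pos(k), skipping `matched` characters that
        # are already known to agree. Returns (match length, suffix < pattern).
        start = pos[k] - 1
        while (matched < n and start + matched < len(text)
               and text[start + matched] == pattern[matched]):
            matched += 1
        if matched == n:
            return matched, False       # pattern is a prefix of the suffix
        if start + matched == len(text):
            return matched, True        # suffix ended first, so it is smaller
        return matched, text[start + matched] < pattern[matched]

    if not pos:
        return 0
    l, smaller = extend(0, 0)
    if not smaller:
        return 0
    r, smaller = extend(len(pos) - 1, 0)
    if smaller:
        return len(pos)
    L, R = 0, len(pos) - 1              # invariant: Pos(L) < pattern <= Pos(R)
    while R - L > 1:
        M = (L + R + 1) // 2            # ceiling((L+R)/2)
        k, smaller = extend(M, min(l, r))
        if smaller:
            L, l = M, k
        else:
            R, r = M, k
    return R
```

Skipping mlr characters is safe because every suffix between Pos(L) and Pos(R) shares the prefix that both boundary suffixes share with P.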



Longest common prefixes

• Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes beginning at Pos(i) and Pos(j)

• Mississippi example

– Pos(3) = 5 (issippi)

– Pos(4) = 2 (ississippi)

– Lcp(3,4) = 4



Getting to max(l,r) with Lcp’s

• L, R, M, l, r defined as before

• If l = r, compare P against suffix Pos(M) starting at position l+1 = r+1

• Suppose l > r

– If Lcp(L,M) > l, the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and suffix Pos(L)

– Therefore, P agrees with suffix Pos(M) up through position l but disagrees in position l+1

– Furthermore, suffix Pos(M) is lexically smaller than P

– Update: L = M, l and r unchanged




• Suppose l > r

– If Lcp(L,M) < l, the common prefix of suffix Pos(L)

and suffix Pos(M) is shorter than the common prefix of

P and Pos(L)

– Therefore, P agrees with suffix Pos(M) up through

position Lcp(L,M).

– Character Lcp(L,M)+1 of both P and suffix Pos(L) is lexically smaller than the corresponding character of suffix Pos(M)

– Update: R = M, r = Lcp(L,M)




• Suppose l > r

– If Lcp(L,M) = l, the common prefix of suffix Pos(L)

and suffix Pos(M) is equal to the common prefix of P

and Pos(L)

– Therefore, P agrees with suffix Pos(M) up through

position l and maybe even further

– Need to compare P(l+1) to corresponding position in

Pos(M)

– Update: Will update R or L according to final

determination of comparisons



O(n + log m) bound

• Since we begin at max(l,r), we never compare a matched position in P more than once

• Redundant comparisons of characters of P are limited to at most one per binary search iteration, giving us O(n + log m)



Computing Lcp values quickly

• We want to get them in O(m) time

• However, there are potentially O(m^2) different pairs (i,j) for which an Lcp value could be requested

• Crucial point

– Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure

– See Figure 7.7 for an example



Process for needed Lcp values

• Lcp(i,i+1): string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree from the Pos(i) leaf to the Pos(i+1) leaf

• Other Lcp values

– Lcp(i,j) = min over k from i to j-1 of Lcp(k,k+1)

– Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
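The min rule for Lcp values can be checked on the mississippi example. The sketch below computes adjacent values by direct character comparison instead of from a suffix tree traversal, so it is quadratic in the worst case. The names `suffix_array`, `adjacent_lcps` and `lcp_range` are assumptions.

```python
def suffix_array(text):
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def adjacent_lcps(text, pos):
    """lcps[k-1] holds Lcp(k, k+1) for 1 <= k < m (1-based indices into Pos)."""
    def lcp(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [lcp(pos[k] - 1, pos[k + 1] - 1) for k in range(len(pos) - 1)]


def lcp_range(lcps, i, j):
    """Lcp(i, j) for 1 <= i < j <= m, as the min of Lcp(k, k+1) for k = i .. j-1."""
    return min(lcps[i - 1:j - 1])
```

A real implementation would precompute only the O(m) values that binary search can request, instead of taking a min over a slice on every query.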



Lowest common ancestor

• 1-time input

– Tree T (not necessarily a suffix tree)

• Later input

– 2 nodes, v and w, of T

• Output

– lowest common ancestor of v,w in T

• Goal

– linear preprocess time

– O(1) query time



Longest Common Extension

• 1-time input

– Strings S1 and S2

• Later input

– index positions i and j

• Output

– length of longest substring of S1 beginning at i that matches a substring of S2 beginning at j

• Goal

– linear preprocess time

– O(1) query time



Illustration

• Relationship to longest common substring

– Similar, but now start positions are fixed

[Figure: strings S1 and S2, with position i marked in S1 and position j marked in S2]



Solution

• Linear preprocessing

– Create generalized suffix tree for S1 and S2

– Compute string depth at each node

– Process tree to allow for constant-time LCA queries

– Establish pointers to all leaf nodes in tree

• Constant-time query processing

– Find u = lca(v,w), where v is the leaf for suffix i of S1 and w is the leaf for suffix j of S2

– Output string depth of u



More space-efficient solution

• Linear preprocessing (assume |S2| < |S1|)

– Create suffix tree for S2

– Compute matching statistics ms(i) and p(i) for S1

• ms(i): length of longest match of the substring starting at i in S1 with some substring in S2

• p(i): the starting point of a location in S2 that matches

– Process tree to allow for constant-time LCA queries

– Establish pointers to all leaf nodes in tree

• Constant-time query processing for positions i in S1 and j in S2

– Find u = lca of the leaves for suffixes p(i) and j in the suffix tree for S2

– Output min(ms(i), string depth of u)

• why is this correct?



Related Problem

• Maximal Palindromes

• Input

– String S

• Output

– Location of all maximal palindromes in S

• Solution

– Longest common extensions of specific pairs of positions in S and Sr (the reverse of S)

– O(|S|) solution
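The reduction to longest common extensions can be sketched as follows. One assumption is made loudly: `lce` here compares characters one by one, standing in for the constant-time suffix-tree/LCA query, so the sketch is quadratic in the worst case rather than linear. Both function names are hypothetical.

```python
def lce(s1, i, s2, j):
    """Naive longest common extension of s1 from i and s2 from j (0-based)."""
    n = 0
    while i + n < len(s1) and j + n < len(s2) and s1[i + n] == s2[j + n]:
        n += 1
    return n


def maximal_palindromes(s):
    """(1-based start, length) of the maximal palindrome around every center.
    Even-length centers with no palindrome are omitted."""
    n, rev = len(s), s[::-1]
    found = []
    for q in range(n):                  # odd length, centered on s[q]
        k = lce(s, q + 1, rev, n - q)   # extend right in s and left via the reverse
        found.append((q - k + 1, 2 * k + 1))
    for q in range(1, n):               # even length, centered between s[q-1] and s[q]
        k = lce(s, q, rev, n - q)
        if k > 0:
            found.append((q - k + 1, 2 * k))
    return found
```

Position n - q in the reversed string corresponds to s[q-1], so each query pairs the text to the right of the center with the text to its left read backwards.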



Common substrings revisited

• Longest common substrings of >2 strings:

– Input

• Strings S1, …, SK (total length n)

– Output

• l(j) (and pointers to substrings) for 2 <= j <= K

• Problem with previous solution

– O(Kn) time to compute C(v) values

– C(v): number of distinct leaf labels in subtree rooted at node v



Definitions

• S(v): total number of leaves in v’s subtree

• U(v): number of “duplicate suffixes” from the same string that occur in v’s subtree

• C(v) = S(v) - U(v)

• ni(v) = number of leaves with identifier i in the subtree rooted at node v

• ni = total number of leaves with identifier i



Key Concepts

• Definitions

– S(v): total number of leaves in v’s subtree

– U(v): number of “duplicate suffixes” from the same string that occur in v’s subtree

– ni(v) = number of leaves with identifier i in the subtree rooted at node v

– ni = total number of leaves with identifier i

• Observations

– U(v) = sum over all identifiers i of max(ni(v) - 1, 0)

– C(v) = S(v) - U(v)



Solution

• Computing U(v) values

– DF traversal of tree numbering leaves in the order that they are encountered

– For each string label i

• Let Li be the list of leaves with identifier i, in increasing order of their dfs numbers

• Compute the lca of each pair of consecutive leaves in Li

• For each node v, let h(v) denote the number of times it is the lca computed in the step above

– Key property

• U(v) = sum of h(w) over all nodes w in v’s subtree, since exactly ni(v) - 1 of the lcas computed for Li lie in v’s subtree whenever ni(v) >= 1



Solution

• Computing U(v) values

– DF traversal of tree numbering leaves in order that they

are encountered

– Set h(v) to 0 for all nodes v

– For each string label i

• Compute the lca v of each pair of consecutive leaves in Li

• Increment h(v) by 1

– Propagate h(v) values up the tree by addition to set U(v)

– Set C(v) = S(v) - U(v)
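The h(v) and U(v) bookkeeping can be sketched on a small generalized suffix trie. Several things here are assumptions for illustration: the trie is uncompressed, each string gets a unique end marker `("end", idx)`, and `lca` climbs parent pointers instead of answering in constant time, so the sketch does not achieve the O(n) bound. The names `Node`, `build`, `lca` and `distinct_counts` are hypothetical.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1
        self.children = {}
        self.leaf_id = None          # string identifier if this node is a leaf
        self.h = self.S = self.U = self.C = 0


def build(strings):
    root = Node()
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for key in list(s[start:]) + [("end", idx)]:   # unique end marker
                if key not in node.children:
                    node.children[key] = Node(node)
                node = node.children[key]
            node.leaf_id = idx
    return root


def lca(a, b):
    """Naive lowest common ancestor by climbing parent pointers."""
    while a.depth > b.depth:
        a = a.parent
    while b.depth > a.depth:
        b = b.parent
    while a is not b:
        a, b = a.parent, b.parent
    return a


def distinct_counts(strings):
    root = build(strings)
    order, stack = [], [root]
    while stack:                     # depth-first preorder
        node = stack.pop()
        order.append(node)
        stack.extend(node.children.values())
    last_leaf = {}                   # previous leaf of each identifier in dfs order
    for node in order:
        if node.leaf_id is not None:
            node.S = 1
            prev = last_leaf.get(node.leaf_id)
            if prev is not None:
                lca(prev, node).h += 1
            last_leaf[node.leaf_id] = node
    for node in reversed(order):     # children are finished before their parent
        node.U += node.h
        node.C = node.S - node.U
        if node.parent is not None:
            node.parent.S += node.S
            node.parent.U += node.U
    return root, order
```

After the call, every node carries S, U and C = S - U, and C can be compared against a direct count of the distinct identifiers below the node.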
