Suffix trees and suffix arrays presentation by Haim Kaplan.

33
Suffix trees and suffix arrays presentation by Haim Kaplan
  • date post

    20-Dec-2015
  • Category

    Documents

  • view

    230
  • download

    2

Transcript of Suffix trees and suffix arrays presentation by Haim Kaplan.

Page 1: Suffix trees and suffix arrays presentation by Haim Kaplan.

Suffix trees and suffix arrays

presentation by Haim Kaplan

Page 2: Suffix trees and suffix arrays presentation by Haim Kaplan.

Trie

• A tree representing a set of strings.

a

c

b

c

e

e

f

d b

f

e g

{ aeef ad bbfe bbfg c }

Page 3: Suffix trees and suffix arrays presentation by Haim Kaplan.

Trie (Cont)

• Assume no string is a prefix of another

a

c

b

c

e

e

f

d b

f

e g

Each edge is labeled by a letter,no two edges outgoing from the same node are labeled the same.

Each string corresponds to a leaf.

Page 4: Suffix trees and suffix arrays presentation by Haim Kaplan.

Compressed Trie

• Compress unary nodes, label edges by strings

a

c

b

c

e

e

f

d b

f

e g

a

c

bbf

c

eefd

e g

Page 5: Suffix trees and suffix arrays presentation by Haim Kaplan.

Suffix tree

Given a string s a suffix tree of s is a compressed trie of all suffixes of s

To make these suffixes prefix-free we add a special character, say $, at the end of s

Page 6: Suffix trees and suffix arrays presentation by Haim Kaplan.

Suffix tree (Example)

Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$

{ $ b$ ab$ bab$ abab$ }

ab

ab

$

ab

$

b

$

$

$

Page 7: Suffix trees and suffix arrays presentation by Haim Kaplan.

Trivial algorithm to build a Suffix tree

Put the largest suffix in

Put the suffix bab$ in

abab$

abab

$

ab$

b

Page 8: Suffix trees and suffix arrays presentation by Haim Kaplan.

Put the suffix ab$ in

ab

ab

$

ab$

b

ab

ab

$

ab$

b

$

Page 9: Suffix trees and suffix arrays presentation by Haim Kaplan.

Put the suffix b$ in

ab

ab

$

ab$

b

$

ab

ab

$

ab$

b

$

$

Page 10: Suffix trees and suffix arrays presentation by Haim Kaplan.

Put the suffix $ in

ab

ab

$

ab$

b

$

$

ab

ab

$

ab$

b

$

$

$

Page 11: Suffix trees and suffix arrays presentation by Haim Kaplan.

We will also label each leaf with the starting point of the corres. suffix.

ab

ab

$

ab$

b

$

$

$

12

ab

ab

$

ab

$

b

3

$ 4

$

5

$

Page 12: Suffix trees and suffix arrays presentation by Haim Kaplan.

Analysis

Naively, takes O(n2) time to build.

More sophisticated algorithms in O(n) time

Page 13: Suffix trees and suffix arrays presentation by Haim Kaplan.

What can we do with it ?

Exact string matching:

Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide when it occurs in T.

W e may also want to find all occurrences of P in T

Page 14: Suffix trees and suffix arrays presentation by Haim Kaplan.

Exact string matchingIn preprocessing we just build a suffix tree in O(n) time

12

ab

ab

$

ab$

b

3

$ 4

$

5

$

Given a pattern P = ab we traverse the tree according to the pattern.

Page 15: Suffix trees and suffix arrays presentation by Haim Kaplan.

12

ab

ab

$

ab$

b

3

$ 4

$

5

$

If we did not get stuck traversing the pattern then the pattern occurs in the text.

Each leaf in the subtree below the node we reach corresponds to an occurrence.

By traversing this subtree we get all k occurrences in O(n+k) time

Page 16: Suffix trees and suffix arrays presentation by Haim Kaplan.

Generalized suffix tree

Given a set of strings S a generalized suffix tree of S is a compressed trie of all suffixes of s S

To make these suffixes prefix-free we add a special char, say $, at the end of s

To associate each suffix with a unique string in S add a different special char to each s

Page 17: Suffix trees and suffix arrays presentation by Haim Kaplan.

Generalized suffix tree (Example)

Let s1=abab and s2=aab

here is a generalized suffix tree for s1 and s2

{ $ # b$ b# ab$ ab# bab$ aab# abab$ }

1

2

a

b

ab

$

ab$

b

3

$

4

$

5

$

1

b#

a

2

#

3

#

4

#

Page 18: Suffix trees and suffix arrays presentation by Haim Kaplan.

So what can we do with it ?

Matching a pattern against a database of strings

Page 19: Suffix trees and suffix arrays presentation by Haim Kaplan.

Longest common substring (of two strings)

Every node with a leaf

descendant from string s1 and a

leaf descendant from string s2

represents a maximal common substring and vice versa.

1

2

a

b

ab

$

ab$

b

3

$

4

$

5

$

1

b#

a

2

#

3

#

4

#

Find such node with largest “string depth”

Page 20: Suffix trees and suffix arrays presentation by Haim Kaplan.

Lowest common ancestors

A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Page 21: Suffix trees and suffix arrays presentation by Haim Kaplan.

Why?

The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes

1

2

a

b

ab

$

ab$

b

3

$

4

$

5

$

1

b#

a

2

#

3

#

4

#

Page 22: Suffix trees and suffix arrays presentation by Haim Kaplan.

Finding maximal palindromes

• A palindrome: caabaac, cbaabc• Want to find all maximal palindromes in a string s

Let s = cbaaba

The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr

Page 23: Suffix trees and suffix arrays presentation by Haim Kaplan.

Maximal palindromes algorithmPrepare a generalized suffix tree for s = cbaaba$ and sr = abaabc#

For every i find the LCA of suffix i of s and suffix m-i+1 of sr

Page 24: Suffix trees and suffix arrays presentation by Haim Kaplan.

3

a

a b

a

baaba$

b

3

$

7

$

b

7

#

c

1

6a b

c #

5

2 2

a $

c #

a

5

6

$

4

4

1

c #

a $

$abc #

c #

Let s = cbaaba$ then sr = abaabc#

Page 25: Suffix trees and suffix arrays presentation by Haim Kaplan.

Analysis

O(n) time to identify all palindromes

Page 26: Suffix trees and suffix arrays presentation by Haim Kaplan.

Drawbacks

• Suffix trees consume a lot of space

• It is O(n) but the constant is quite big

• Notice that if we indeed want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node

Page 27: Suffix trees and suffix arrays presentation by Haim Kaplan.

Suffix array

• We lose some of the functionality but we save space.

Let s = abab

Sort the suffixes lexicographically: ab, abab, b, bab

The suffix array gives the indices of the suffixes in sorted order

3 1 4 2

Page 28: Suffix trees and suffix arrays presentation by Haim Kaplan.

How do we build it ?

• Build a suffix tree

• Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array.

• O(n) time

Page 29: Suffix trees and suffix arrays presentation by Haim Kaplan.

How do we search for a pattern ?

• If P occurs in T then all its occurrences are consecutive in the suffix array.

• Do a binary search on the suffix array

• Takes O(mlogn) time

Page 30: Suffix trees and suffix arrays presentation by Haim Kaplan.

ExampleLet S = mississippi

iippiissippiississippimississippipi

8

5

2

1

10

9

7

4

11

6

3

ppisippisisippissippississippi

L

R

Let P = issa

M

Page 31: Suffix trees and suffix arrays presentation by Haim Kaplan.

How do we accelerate the search ?

L

R

Maintain l = LCP(P,L)Maintain r = LCP(P,R)

M

If l = r then start comparing M to P at l + 1

l

r

Page 32: Suffix trees and suffix arrays presentation by Haim Kaplan.

How do we accelerate the search ?

L

R

Suppose we know LCP(L,M)

If LCP(L,M) < l we go left

If LCP(L,M) > l we go right

If LCP(L,M) = l we start comparing at l + 1

M

If l > r then

r

l

Page 33: Suffix trees and suffix arrays presentation by Haim Kaplan.

Analysis of the acceleration

If we do more than a single comparison in an iteration then max(l, r ) grows by 1 for each comparison O(logn + m) time