Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree...

36
Information Retrieval and Organisation Suffix Trees Dell Zhang Birkbeck, University of London adapted from http://www.math.tau.ac.il/~haimk/seminar02/suffixtrees.ppt

Transcript of Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree...

Page 1: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Information Retrieval and Organisation

Suffix Trees

Dell Zhang Birkbeck, University of London

adapted from

http://www.math.tau.ac.il/~haimk/seminar02/suffixtrees.ppt

Page 2: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Trie

A tree representing a set of strings

a

c

b

c

e

e

f

d b

f

e g

{

aeef

ad

bbfe

bbfg

c

}

Page 3: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Trie

Assume no string is a prefix of another

1) Each edge is labeled by

a letter.

2) No two edges outgoing

from the same node are

labeled the same.

3) Each string corresponds

to a leaf.

a b

c

e

e

f

d b

f

e g

Page 4: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Compressed Trie

Compress unary nodes, label edges by strings

a b

c

e

e

f

d b

f

e g

a

bbf

c

eef

d

e g

Page 5: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Given a string s, a suffix tree of s is a

compressed trie of all suffixes of s.

To make these suffixes prefix-free we add a

special character, say $, at the end of s.

Suffix Tree

Page 6: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Suffix Tree

For example, let s = abab, a suffix tree of s is a

compressed trie of all suffixes of abab$.

{

$

b$

ab$

bab$

abab$

}

a b

a b

$

a b $

b

$

$

$

Note that a suffix tree has O(n) nodes n = |s|. Why?

Page 7: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Suffix Tree Construction

The trivial algorithm

Put the largest suffix in

a b a b $

Page 8: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Put the suffix bab$ in

a b a b $

a b a b

$

a b $

b

Page 9: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Put the suffix ab$ in

a b a b

$

a b $

b

a b

a b

$

a b $

b

$

Page 10: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Put the suffix b$ in

a b

a b

$

a b $

b

$

a b

a b

$

a b $

b

$

$

Page 11: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Put the suffix $ in

a b

a b

$

a b $

b

$

$

a b

a b

$

a b $

b

$

$

$

Page 12: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

We will also label each leaf

with the starting point of the

corresponding suffix

a b

a b

$

a b $

b

$

$

$

0 1

a b

a b

$

a b

$

b

2

$ 3

$

4

$

Page 13: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Suffix Tree Construction

The trivial algorithm takes O(n2) time.

It is possible to build a suffix tree in O(n) time

using Ukkonen’s algorithm.

But, how come? Does it take O(n) space?

To use only O(n) space, encode the edge-labels

as (beginning-position, end-position) .

Page 14: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

$

a

bbbbbb$

a b

a

a

a

b

b

b

b

b$

bbbbbb$

bbbbbb$

bbbbbb$

bbbbbb$

$

$

$

$

$

abbbbbb$

Consider the string aaaaaabbbbbb$

Page 15: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Consider the string aaaaaabbbbbb$

$

a

bbbbbb$

a b

a

a

a

b

b

b

b

b$

bbbbbb$

bbbbbb$

bbbbbb$

(6,12)

$

$

$

$

$

abbbbbb$

Page 16: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Consider the string aaaaaabbbbbb$

(0,0)

(6,12)

(6,6)

(11,12)

(6,12)

(6,12)

(6,12)

(6,12)

(5,12)

(1,1)

(2,2)

(3,3)

(4,4)

(7,7)

(8,8)

(9,9)

(10,10)

(12,12)

(12,12)

(12,12)

(12,12)

(12,12)

(12,12)

Page 17: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Suffix Tree Applications

What Can We Do with It?

Exact String Matching

Exact Set Matching

The Substring Problem for a Database of Patterns

Longest Common Substring of Two Strings

Recognising DNA Contamination

Common Substring of More Than Two Strings

……

Page 18: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Exact String Matching

Given text T (|T| = n), pre-process it such that

when a pattern P (|P| = m) arrives you can quickly

decide when it occurs in T.

We may also want to find all occurrences of P in

T.

Page 19: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Exact String Matching

In pre-processing, we just build a suffix tree in

O(n) time

0 1

a b

a b

$

a b $

b

2

$ 3

$

4

$

Page 20: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Exact String Matching

Given a pattern P = ab we traverse the tree

according to the pattern.

If we do not get stuck traversing the pattern then

the pattern occurs in the text, otherwise it does

not.

Each leaf in the subtree below the node we

reach corresponds to an occurrence.

By traversing this subtree we get all k

occurrences in O(n+k) time.

Page 21: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Exact String Matching

How to match a pattern (query) against a

database of strings (documents)?

Page 22: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Generalized Suffix Tree

Given a set of strings S, the generalized suffix

tree of S is a compressed trie of all suffixes of

each s S.

To make these suffixes prefix-free we add a

special char, say $, at the end of s.

To associate each suffix with a unique string in S,

add a different special char to each s.

Each leaf node needs to be labelled by the

document id together with the suffix position.

Page 23: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Generalized Suffix Tree

For example, Let s1 = abab and s2 = aab, here is a

generalized suffix tree for s1 and s2.

{

$ #

b$ b#

ab$ ab#

bab$ aab#

abab$

}

0

1

a

b

a b

$

a b $

b

2

$

3

$

4

$

0

b #

a

1

#

2

#

3

#

Page 24: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Longest Common Substring

Given two strings s1 and s2, we build their

generalized suffix tree.

Every node with a leaf descendant from string s1

and a leaf descendant from string s2 represents a

maximal common substring and vice versa.

Find such node with largest “string depth”.

Page 25: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Lowest Common Ancestor

A lot more can be gained from the suffix tree, if

we pre-process it so that we can answer LCA

queries on it in constant time.

Page 26: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Lowest Common Ancestor

Why? The LCA of two leaves represents the

longest common prefix (LCP) of these 2 suffixes

0

1

a

b

a b

$

a b $

b

2

$

3

$

4

$

0

b #

a

1

#

2

#

3

#

Page 27: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Finding Maximal Palindromes

A palindrome: cbaabc, caabaac, …

To find all palindromes in a string s (of length m),

we build a generalized suffix tree for the string s

and the reversed string sr.

The palindrome with centre between i-1 and i is

the LCP of the suffix at position i of s and the

suffix at position m-i of sr.

Page 28: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

For example, consider the string cbaaba.

Prepare a generalized suffix tree for

s = cbaaba$ and sr = abaabc#

For every i find the LCA of

the suffix i of s and the suffix m-i of sr.

All palindromes can be identified in linear time.

Finding Maximal Palindromes

Page 29: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

2

a

a

b

2

$

6

$

b

6

#

c

0

5

4

1 1

a

4

5

$

3

3

0

a

$

$

Let s = cbaaba$ then sr = abaabc#

Page 30: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Suffix Tree Drawbacks

It is O(n) but the constant is quite big.

It consume a lot of space.

Notice that if we indeed want to traverse an edge

in O(1) time then we need an array (of pointers) of

size |Σ| in each node, where Σ is the alphabet.

Page 31: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Suffix Array

It is much simpler and easier to implement.

Compared with suffix trees, we lose some

functionality, but we save space.

Page 32: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Suffix Array

For example, let s = abab

Sort the suffixes lexicographically: ab, abab, b, bab

The suffix array gives the indices of the suffixes in

sorted order

2 0 3 1

Page 33: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Suffix Array Construction

The trivial algorithm

Quicksort

The linear time algorithm

Build a suffix tree in O(n) time first, and then

traverse the tree in in-order, lexicographically

picking edges outgoing from each node, and fill

the suffix array.

It can also be built in O(n) time directly.

Page 34: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Exact String Matching

How do we search for a pattern P in the text T,

using the suffix array of T?

If P occurs in T, then all its occurrences are

consecutive in the suffix array.

So we can do two binary searches on the suffix

array: the first search locates the starting position

of the interval, and the second one determines

the end position.

It takes O(m log(n)) time, as a single suffix

comparison needs to compare up to m

characters.

Page 35: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

Exact String Matching

It is also possible to do it in O(m+log(n)) with an

additional array of LCP.

Manber & Myers (1990)

Page 36: Information Retrieval and Organisationdell/teaching/nlp/dell_ir_suffixtrees.pdf · Suffix Tree Construction The trivial algorithm takes O(n2) time. It is possible to build a suffix

T = mississippi

P = issa

i

ippi

issippi

ississippi

mississippi

pi

7

4

1

0

9

8

6

3

10

5

2

ppi

sippi

sisippi

ssippi

ssissippi

L

R

M