Space Efficient Suffix Trees


J. Ian Munro, Venkatesh Raman, S. Srinivasa Rao

Page 1: Space Efficient Suffix Trees
Page 2: Space Efficient Suffix Trees

Introduction

n – length of text, m – length of search pattern string

Generally, suffix tree construction takes O(n) time and O(n) space, and searching takes O(m) time.

Although the space requirement is O(n) words, the constant factor is usually large.

Page 3: Space Efficient Suffix Trees

Introduction (Cont.)

The motivation is to develop a space-efficient data structure whose leading term in n has as small a constant as possible.

We present a suffix tree that uses n + O(n/lg n) words, or equivalently n lg n + O(n) bits, and supports string searching in O(m) time (for a fixed-size alphabet).

Page 4: Space Efficient Suffix Trees

Previous Representations

Earlier representations either had a larger lower-order space term together with an expected-case assumption, or required more time for searching.

Below are some approaches:

1. Keep an alphabet-size vector at each node, so the constant in the space bound is at least |Σ|.

2. Keep a pair <start, end> for each compressed node (or, equivalently, a pair <start, length>).

3. Save only the length, called the “skip value”.

Page 5: Space Efficient Suffix Trees

Using Skip Values for Search

Skip value: the length of the compressed string at a node.

At a compressed node, skip as many characters as the skip value specifies before comparing with the pattern.

Search until the pattern is exhausted or the current character of the pattern has no match at the current node. In the first case, any leaf of the subtree rooted at that node gives a possible starting position of the pattern in the text.

Start at the position given by any of these leaves and confirm that the pattern actually occurs there in the text.
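The search procedure above can be sketched as follows. This is a minimal illustration, not the representation from the slides: the Node class, build and search are hypothetical names, the trie is built naively, positions are 0-based, and the text is assumed to end with a unique terminator such as #. Each node keeps only its skip value, so the final comparison against the text is what actually confirms a match.

```python
class Node:
    """Node of a compressed suffix trie that stores only skip values."""
    def __init__(self, skip, leaf=None):
        self.skip = skip      # length of the compressed string entering this node
        self.leaf = leaf      # suffix start position for a leaf, else None
        self.children = {}    # first character of an outgoing edge -> child


def build(text, starts, depth=0):
    """Naively build the trie for the suffixes text[s:], s in starts.

    All of these suffixes agree on their first `depth` characters.
    Assumes text ends with a unique terminator, so no suffix is a
    prefix of another.
    """
    if len(starts) == 1:
        s = starts[0]
        return Node(len(text) - s - depth, leaf=s)
    k = 0  # extend the common prefix until two suffixes disagree
    while len({text[s + depth + k] for s in starts}) == 1:
        k += 1
    node = Node(k)
    groups = {}
    for s in starts:
        groups.setdefault(text[s + depth + k], []).append(s)
    for ch, group in groups.items():
        node.children[ch] = build(text, group, depth + k)
    return node


def search(root, text, pattern):
    """Return one start position of pattern in text, or None."""
    node, pos = root, root.skip
    while pos < len(pattern):
        child = node.children.get(pattern[pos])
        if child is None:
            return None               # no edge matches the current character
        node = child
        pos += node.skip              # skip characters without comparing them
    while node.leaf is None:          # any leaf below is a candidate position
        node = next(iter(node.children.values()))
    start = node.leaf
    # Skipped characters were never checked, so verify against the text.
    return start if text.startswith(pattern, start) else None


text = "bababa#"
root = build(text, list(range(len(text))))
print(search(root, text, "aba"))   # a valid start position of "aba"
print(search(root, text, "abb"))   # None: rejected only by the final check
```

Searching for abb follows the same path as aba, which shows why the confirmation step cannot be dropped when only skip values are stored.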

Page 6: Space Efficient Suffix Trees

Suffix Trees (Patricias)

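[Figure: compressed suffix tree (Patricia trie) with skip values. Edges are labelled with the substrings a, ba, # and ba#; the numbers at the nodes give skip values and suffix positions.]

Input text: bababa#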

Page 7: Space Efficient Suffix Trees

Example

[Figure: the same compressed suffix tree with skip values, used as a search example.]

Input text: bababa#

Pattern: aba

Page 8: Space Efficient Suffix Trees

Previous Representations (Cont.)

In a compressed suffix tree there are n+1 leaf nodes and at most n internal nodes, for a total of at most 2n+1 nodes.

Storage is required for three components:

1. the tree itself

2. the skip values

3. the position indices at the leaves

A representation using skip values requires (2n+1) + n + (n+1), about 4n words: 2n+1 for the tree, n for the skip values, and n+1 for the leaf indices. Each word takes lg n bits, so the total space required is about 4n lg n + O(n) bits.

A suffix array uses 2n words and has O(m + lg n) search time. A more compact representation uses 1.25n words, but its search time is only an expected bound.

Page 9: Space Efficient Suffix Trees

Binary tree ↔ rooted ordered tree

There is an isomorphism between binary trees and rooted ordered trees.

The ordered tree has a root that does not correspond to any node in the binary tree.

The left child of a binary tree node corresponds to the leftmost child of the corresponding node in the ordered tree.

The right child of a binary tree node corresponds to the next sibling to the right in the ordered tree.
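A small sketch of this correspondence, going from an ordered tree to its binary tree. The function name ordered_to_binary and the dictionary layout are assumptions made for the illustration; node 0 plays the role of the extra root that has no counterpart in the binary tree.

```python
def ordered_to_binary(children, root):
    """Map an ordered tree to a binary tree (first child / next sibling).

    children: dict node -> ordered list of its children.
    Returns (left, right) dicts describing the binary tree.
    """
    left, right = {}, {}
    for parent, kids in children.items():
        # The added root has no counterpart in the binary tree.
        if kids and parent != root:
            left[parent] = kids[0]          # leftmost child -> left child
        for a, b in zip(kids, kids[1:]):
            right[a] = b                    # next sibling -> right child
    return left, right


# Illustrative ordered tree: root 0 with children 1, 6, 7, and so on.
children = {0: [1, 6, 7], 1: [2, 3, 5], 3: [4], 7: [8, 10], 8: [9]}
left, right = ordered_to_binary(children, root=0)
print(left)    # {1: 2, 3: 4, 7: 8, 8: 9}
print(right)   # {1: 6, 6: 7, 2: 3, 3: 5, 8: 10}
```

The binary tree's root is the first child of the added root (node 1 here), and a node is a binary-tree leaf exactly when it appears in neither dictionary as a key.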

Page 10: Space Efficient Suffix Trees

Binary tree representation using the parenthesis sequence

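The given binary tree on 10 nodes

[Figure: binary tree with root 1. Node 1 has left child 2 and right child 6; node 2 has right child 3; node 3 has children 4 and 5; node 6 has right child 7; node 7 has left child 8; node 8 has children 9 and 10.]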

Page 11: Space Efficient Suffix Trees

Binary tree representation using the parenthesis sequence

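Equivalent rooted ordered tree

[Figure: rooted ordered tree with added root 0, whose children are 1, 6 and 7. Node 1 has children 2, 3 and 5; node 3 has child 4; node 7 has children 8 and 10; node 8 has child 9.]

The parenthesis representation (each parenthesis is labelled with its node):

 0  1  2  2  3  4  4  3  5  5  1  6  6  7  8  9  9  8 10 10  7  0
 (  (  (  )  (  (  )  )  (  )  )  (  )  (  (  (  )  )  (  )  )  )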

Page 12: Space Efficient Suffix Trees

Parentheses tree representation

A general rooted ordered tree on n nodes can be represented by 2n parentheses.
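As a rough illustration of the 2n-parenthesis encoding, the sketch below writes an open parenthesis on entering a node and a close parenthesis on leaving it. The names to_parens and matching_close are hypothetical, and the linear scan for the matching parenthesis stands in for the constant-time operations of the succinct structure.

```python
def to_parens(children, node):
    """Depth-first encoding: '(' on entering a node, ')' on leaving it."""
    kids = children.get(node, [])
    return "(" + "".join(to_parens(children, c) for c in kids) + ")"


def matching_close(seq, i):
    """Position of the ')' matching the '(' at position i (linear scan)."""
    depth = 0
    for j in range(i, len(seq)):
        depth += 1 if seq[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced parenthesis sequence")


# Illustrative ordered tree on 11 nodes (0 is the root).
children = {0: [1, 6, 7], 1: [2, 3, 5], 3: [4], 7: [8, 10], 8: [9]}
seq = to_parens(children, 0)
print(seq)                       # ((()(())())()((())()))
print(len(seq))                  # 22, two parentheses per node
close = matching_close(seq, 1)   # node 1 opens at position 1
print((close - 1 + 1) // 2)      # 5 nodes in the subtree of node 1
```

The subtree size falls out of the distance between a node's two parentheses, which is why the encoding can report it without storing any counts.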

Use a (2n+o(n))-bit encoding of an n-node binary tree that supports, in constant time:

1. move to left/right child

2. move to parent

3. get the size of the subtree

Page 13: Space Efficient Suffix Trees

Succinct Suffix Tree Representation

Encode each symbol of the alphabet in binary (0, 1), so that the suffix tree becomes a binary tree. Support these additional operations in constant time:

leafrank(x): return the number of leaves to the left of node x (in the preorder numbering)

leafselect(j): return the jth leaf in the left-to-right ordering of the leaves

leafsize(x): return the number of leaves in the subtree rooted at node x

leftmost(x): return the leftmost leaf in the subtree rooted at node x

rightmost(x): return the rightmost leaf in the subtree rooted at node x

Page 14: Space Efficient Suffix Trees

Example

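[Figure: a six-node binary tree (node 1 on the top level; nodes 2 and 5 on the second level; nodes 3, 4 and 6 on the third level), shown three times to illustrate:]

leafrank(1) = 2

leafselect(3) = 6

leafsize(1) = 3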

Page 15: Space Efficient Suffix Trees

Succinct Suffix Tree Representation (Cont.)

Important navigation operations:

rank(i): the number of 1’s up to and including position i

select(i): the position of the ith 1

rank_p(i): the number of occurrences of pattern p up to and including position i

select_p(i): the position of the ith occurrence of p in the given binary string
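To pin down what these operations return, here is a naive linear-time sketch. It fixes conventions that the slides leave open and that are therefore assumptions: positions are 0-based, an occurrence is counted at its starting position, occurrences may overlap, and the ith occurrence is numbered from 1. Plain rank and select are the special case p = "1".

```python
def occurrences(s, p):
    """Start positions of all (possibly overlapping) occurrences of p in s."""
    return [k for k in range(len(s) - len(p) + 1) if s.startswith(p, k)]


def rank_p(s, p, i):
    """Number of occurrences of p that start at a position <= i."""
    return sum(1 for k in occurrences(s, p) if k <= i)


def select_p(s, p, j):
    """Start position of the j-th occurrence of p (j counted from 1)."""
    occ = occurrences(s, p)
    return occ[j - 1] if 1 <= j <= len(occ) else None


s = "110100110"
print(rank_p(s, "1", 4))     # 3
print(select_p(s, "1", 4))   # 6
print(rank_p(s, "10", 3))    # 2
print(select_p(s, "10", 3))  # 7
```

These scans take O(n) time per query; the point of the theorem on the next slide is that the same answers can be produced in constant time with only o(n) extra bits.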

Page 16: Space Efficient Suffix Trees

THEOREM 1

Given a binary string of length n and a binary pattern p of length at most ε lg n, where ε is any constant less than 1/2, both rank_p(i) and select_p(i) can be supported in constant time using o(n) bits, in addition to the space required for the given binary string.

Page 17: Space Efficient Suffix Trees

Intuition

Divide the string into blocks of size lg² n and keep the rank information for the first element of every block.

Divide each block further into small blocks. For the smallest blocks, keep a precomputed table of answers in o(n) bits.
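The block idea can be sketched for the simplest case, plain rank of 1 bits. This is an illustration under assumptions, not the construction behind the theorem: RankDirectory is a hypothetical name, small blocks have about (lg n)/2 bits, big blocks about lg² n bits, and the lookup table is filled only for block contents that actually occur. Python string slicing stands in for reading a block as one machine word.

```python
import math


class RankDirectory:
    """Two-level rank directory over a string of '0'/'1' characters."""

    def __init__(self, bits):
        self.bits = bits
        lg = max(2, math.ceil(math.log2(max(len(bits), 2))))
        self.small = lg // 2      # small block length, about (lg n)/2
        self.per_big = 2 * lg     # small blocks per big block (about lg^2 n bits)
        self.big_rank = []        # number of 1s before each big block
        self.small_rank = []      # 1s from big-block start to each small block
        self.table = {}           # small-block content -> prefix counts of 1s
        total = within = 0
        for start in range(0, len(bits), self.small):
            if (start // self.small) % self.per_big == 0:
                self.big_rank.append(total)   # a new big block begins here
                within = 0
            self.small_rank.append(within)
            chunk = bits[start:start + self.small]
            if chunk not in self.table:
                counts, c = [], 0
                for ch in chunk:
                    c += ch == "1"
                    counts.append(c)
                self.table[chunk] = counts
            ones = self.table[chunk][-1]
            within += ones
            total += ones

    def rank(self, i):
        """Number of 1s in bits[0..i], inclusive."""
        b = i // self.small
        chunk = self.bits[b * self.small:(b + 1) * self.small]
        return (self.big_rank[b // self.per_big]
                + self.small_rank[b]
                + self.table[chunk][i % self.small])


bits = "1011001110001011" * 5
rd = RankDirectory(bits)
print(rd.rank(0), rd.rank(15), rd.rank(len(bits) - 1))
```

A query adds three stored numbers: the count before the big block, the count inside the big block before the small block, and a table entry for the position inside the small block.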

Page 18: Space Efficient Suffix Trees

THEOREM 2

A static binary tree on n nodes can be represented using 2n+o(n) bits such that, given a node x, in addition to finding its parent, left child, right child, and the size of the subtree rooted at node x, we can support leafrank(x), leafselect(j), leafsize(x), leftmost(x), and rightmost(x) operations in constant time.

Page 19: Space Efficient Suffix Trees

Proof

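Convert the binary tree into a rooted ordered tree.

Leaves of the binary tree correspond to leaves of the general tree that are the rightmost child of their parent.

Each such rightmost leaf appears as the pattern “())” in the parenthesis string: its own “()” followed by the closing parenthesis of its parent.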

Page 20: Space Efficient Suffix Trees

Proof (Cont.)

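[Figure: the 10-node binary tree and its equivalent rooted ordered tree (with added root 0), shown above the labelled parenthesis sequence.]

 0  1  2  2  3  4  4  3  5  5  1  6  6  7  8  9  9  8 10 10  7  0
 (  (  (  )  (  (  )  )  (  )  )  (  )  (  (  (  )  )  (  )  )  )

The pattern “())” occurs four times, at nodes 4, 5, 9 and 10, which are exactly the leaves of the binary tree.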

Page 21: Space Efficient Suffix Trees

Proof (Cont.)

Since rank_p(x) counts the occurrences of p from the left end of the string up to x, with p = “())” it gives the number of leaves to the left of node x.

leafrank(x) = rank_p(x), with p = “())”

Page 22: Space Efficient Suffix Trees

Proof (Cont.)

leafselect(j) = select_p(j)

When p = “())”, the operation select_p(j) picks the jth leaf from the left.

leftmost(x) = select_p(rank_p(x) + 1)

rightmost(x) = select_p(rank_p(close(parent(x)) - 1))

leafsize(x) = rank_p(f(x)) - rank_p(x)

Note that f(x) is the closing parenthesis of the parent of node x.
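The leaf operations can be checked on the 10-node example with a naive sketch. Several things here are assumptions: a node is identified by the 0-based position of its open parenthesis, an occurrence of the pattern is counted at its starting position, and the helper analyse finds parents and matching parentheses by a linear scan. Under these conventions the off-by-one terms differ slightly from the formulas on the slide, and x must not be the added root.

```python
P = "())"


def analyse(seq):
    """Return (close, parent) dicts keyed by open-parenthesis position."""
    close, parent, stack = {}, {}, []
    for i, c in enumerate(seq):
        if c == "(":
            parent[i] = stack[-1] if stack else None
            stack.append(i)
        else:
            close[stack.pop()] = i
    return close, parent


def rank_p(seq, i):
    """Occurrences of P starting at positions 0..i (0 when i < 0)."""
    return sum(seq.startswith(P, k) for k in range(0, i + 1))


def select_p(seq, j):
    """Start position of the j-th occurrence of P (j counted from 1)."""
    for k in range(len(seq)):
        if seq.startswith(P, k):
            j -= 1
            if j == 0:
                return k
    return None


def leafrank(seq, x):
    return rank_p(seq, x - 1)              # leaves strictly before x

def leafselect(seq, j):
    return select_p(seq, j)

def leftmost(seq, x):
    return select_p(seq, rank_p(seq, x - 1) + 1)

def rightmost(seq, x, close, parent):
    return select_p(seq, rank_p(seq, close[parent[x]]))

def leafsize(seq, x, close, parent):
    return rank_p(seq, close[parent[x]]) - rank_p(seq, x - 1)


seq = "((()(())())()((())()))"
close, parent = analyse(seq)
x = 4                                      # open parenthesis of node 3
print(leafsize(seq, x, close, parent))     # 2 (leaves 4 and 5)
print(leftmost(seq, x))                    # 5 (node 4)
print(rightmost(seq, x, close, parent))    # 8 (node 5)
```

The binary subtree of x corresponds to the stretch of the string from the open parenthesis of x up to the closing parenthesis of its parent, which is why f(x) appears in leafsize and rightmost.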

Page 23: Space Efficient Suffix Trees

Representing Suffix Tree

With the binary encoding, the suffix tree becomes a binary tree on 2n+1 nodes.

Use the succinct representation of a binary tree, which takes 2N+o(N) bits of space for a tree on N nodes.

With N = 2n+1, the tree structure takes 4n+o(n) bits. The third component (the position indices at the leaves) takes n lg n bits. The second component (the skip values) is not stored. Total space needed: 4n + n lg n + o(n) bits

= n lg n + O(n) bits = n + O(n/lg n) words.
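For a feel of the difference, the two space bounds can be evaluated for a concrete text length. The function names are hypothetical, a word is taken to be ceil(lg n) bits, and the lower-order o(n) terms are simply ignored, so the numbers are only indicative.

```python
import math


def skip_value_tree_bits(n):
    """About 4n words of lg n bits: tree, skip values and leaf indices."""
    return 4 * n * math.ceil(math.log2(n))


def succinct_tree_bits(n):
    """n lg n bits for the leaf indices plus 4n bits for the tree shape."""
    return n * math.ceil(math.log2(n)) + 4 * n


n = 1 << 20   # a text of about one million characters
print(skip_value_tree_bits(n) // n)   # 80 bits per text character
print(succinct_tree_bits(n) // n)     # 24 bits per text character
```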

Page 24: Space Efficient Suffix Trees

Skip values storage trick

Skip values need not be stored; they can be computed on the fly when needed.

To find the skip value at a node, go to the leftmost and rightmost leaves of its subtree and compare the text at the two corresponding suffixes until they disagree. Suppose k characters agree, and that they occupy l bits.

Then find how many leading bits agree in the binary codes of the two differing characters. Suppose j bits.

The skip value is l+j bits.
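The quantity l+j can be sketched as below. The fixed-width character codes in CODE, the width WIDTH and the function name common_prefix_bits are assumptions chosen for the illustration; i and j are the 0-based text positions at which the two leaves' suffixes are compared.

```python
# Hypothetical fixed-width binary codes for the alphabet {#, a, b}.
CODE = {"#": 0b00, "a": 0b01, "b": 0b10}
WIDTH = 2   # bits per character


def common_prefix_bits(text, i, j):
    """Bits shared by the binary encodings of text[i:] and text[j:]."""
    k = 0
    while (i + k < len(text) and j + k < len(text)
           and text[i + k] == text[j + k]):
        k += 1
    l = k * WIDTH                       # k equal characters occupy l bits
    if i + k >= len(text) or j + k >= len(text):
        return l                        # one suffix ran out (only if i == j)
    diff = CODE[text[i + k]] ^ CODE[text[j + k]]
    shared = WIDTH - diff.bit_length()  # equal leading bits of the two codes
    return l + shared


text = "bababa#"
print(common_prefix_bits(text, 1, 3))   # 6: "aba" agrees, then b vs # share 0 bits
print(common_prefix_bits(text, 5, 6))   # 1: a (01) and # (00) share one leading bit
```

The XOR of the two codes has its highest set bit at the first position where they differ, so the number of equal leading bits is the code width minus the bit length of the XOR.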

Page 25: Space Efficient Suffix Trees

Searching

Perform the search as before. If the search stops at a leaf, first find the leafrank of that leaf and then look up the suffix index in the array of pointers.

If the end of the pattern is reached at an internal node, then any leaf in its subtree represents a possible matching suffix. Such a leaf can be found with leftmost(x) or rightmost(x) in constant time.

Page 26: Space Efficient Suffix Trees

Searching (Cont.)

With an alphabet of size |Σ|, the time to find a skip value is O(lg|Σ| + skip value).

The sum of the skip values is at most m, so the total time spent finding skip values is O(m lg|Σ|).

Page 27: Space Efficient Suffix Trees

Searching (Cont.)

Once we have confirmed that the pattern occurs (in O(m) time), the number of occurrences of the pattern is the leafsize of the node where the search ended.

Theorem 3:

A suffix tree for a text of length n can be represented using nlgn+O(n) bits such that, given a pattern of size m, the number of occurrences of the pattern in the string can be found in O(mlg|Σ|) time. Finding the positions of all the occurrences of the pattern requires O(m+s) time, where s is the number of occurrences of the pattern in the text.

Page 28: Space Efficient Suffix Trees