Suffix arrays

Suffix Arrays in Linear Time

Index text, so substring queries can be answered fast

The Text: C G A C G C T

[Figure: Suffix Tree of the text]

[Figure: Suffix Tree of the text, with the Substring Query C G C]

Trees take too much space. Are there smaller indices?

Suffix Array: Sorted List of Suffixes

3 1 4 6 2 5 7

[Figure: the Suffix Tree next to the Suffix Array of the text]

The Text: C G A C G C T

Suffix Array: 3 1 4 6 2 5 7

Burrows-Wheeler Index (an array)

How can one compute the Suffix Array in Linear Time?

Task

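Sort all n suffixes of the text lexicographically

A comparison sort uses O(n log n) comparisons, each taking up to n time, so O(n^2 log n) in the worst case

Obtain two arrays: f[i], the rank of the ith suffix in sorted order, and g[i], the suffix that is ith in sorted order (g is the Suffix Array)

Input: a string of length n with characters in the range 1..n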
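A minimal sketch of the naive approach, for reference only: it uses Python's built-in comparison sort on whole suffixes, so it is not the linear-time algorithm. The name `naive_suffix_array` and the 0-indexed positions are assumptions of this sketch. The example text CGACGCT is chosen to be consistent with the Suffix Array 3 1 4 6 2 5 7 shown earlier.

```python
def naive_suffix_array(text):
    """Return (f, g): f[i] is the rank of suffix i, g[r] is the suffix of rank r.

    Positions are 0-indexed. Each comparison may scan up to n characters,
    so this is the slow baseline, not the linear-time construction.
    """
    n = len(text)
    g = sorted(range(n), key=lambda i: text[i:])  # suffix array
    f = [0] * n                                   # inverse permutation (ranks)
    for rank, start in enumerate(g):
        f[start] = rank
    return f, g


f, g = naive_suffix_array("CGACGCT")
print([i + 1 for i in g])  # 1-indexed suffix array: [3, 1, 4, 6, 2, 5, 7]
```

The two arrays are inverse permutations of each other: `g[f[i]] == i` for every position `i`.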

Divide and Conquer

Separate odd and even suffixes; sort the even suffixes recursively, derive the order of the odd suffixes from it, then combine

Sorting Even Suffixes

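Cut the text into n/2 pairs of adjacent characters: A1A2, A3A4, ...

Sort these n/2 pairs and map them to single chars in the range 1..n/2

This gives a new text of half the length; sort its suffixes recursively

The order of the suffixes of the new text is the order of the even suffixes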
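The pair-renaming step can be sketched as follows. This is an illustration under assumptions: characters are integers starting at 1, a 0 is used as padding when the length is odd, and `sorted` stands in for the radix sort that would be needed to keep this step linear. The name `rename_pairs` is hypothetical.

```python
def rename_pairs(text):
    """Map each pair of adjacent characters to a single new character.

    text: list of ints >= 1. Equal pairs get equal names, and the names
    preserve the lexicographic order of the pairs.
    """
    s = list(text)
    if len(s) % 2:
        s.append(0)  # padding, smaller than every real character
    pairs = [(s[i], s[i + 1]) for i in range(0, len(s), 2)]
    # Comparison sort for brevity; a radix sort would make this O(n).
    names = {p: r for r, p in enumerate(sorted(set(pairs)), start=1)}
    return [names[p] for p in pairs]


# CGACGCT encoded as A=1, C=2, G=3, T=4
print(rename_pairs([2, 3, 1, 2, 3, 2, 4]))  # [2, 1, 3, 4]
```

Suffix k of the returned list corresponds to the suffix of the original text that starts at position 2k (0-indexed).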

Sorting Odd Suffixes

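Each odd suffix Oi is a char Ai followed by an even suffix Ei: O1 = A1,E1  O2 = A2,E2  O3 = A3,E3  O4 = A4,E4

Sort these n/2 pairs; the E's are the even suffixes, whose order we know, so each pair compares in constant time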
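A small sketch of this step, assuming 0-indexed positions and that the ranks of the even suffixes are already available in a dictionary. The name `sort_odd_suffixes` and the dictionary format are assumptions; `sorted` again stands in for a radix sort.

```python
def sort_odd_suffixes(text, even_rank):
    """Order the suffixes at odd positions, given ranks of the even ones.

    even_rank maps an even position to the rank of its suffix among the
    even suffixes. The odd suffix at i is text[i] followed by the even
    suffix at i + 1, so the pair (text[i], even_rank[i + 1]) is its key.
    """
    n = len(text)

    def key(i):
        # If i is the last position, the remaining suffix is empty: rank -1.
        return (text[i], even_rank.get(i + 1, -1))

    return sorted(range(1, n, 2), key=key)


# Even suffixes of CGACGCT in sorted order start at 2, 0, 4, 6.
even_rank = {2: 0, 0: 1, 4: 2, 6: 3}
print(sort_odd_suffixes("CGACGCT", even_rank))  # [3, 5, 1]
```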

Time Complexity

T(n) = O(n) + T(n/2) + time for merging even and odd suffixes

If merging takes O(n), then T(n) = O(n)

Merging

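Do we have any info to determine the relative order of an odd suffix and an even one?

An odd suffix is a char A followed by an even suffix E; an even suffix is a char B followed by an odd suffix O

If A = B, we must compare E with O: again an even suffix against an odd one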

The Trick: Sanders, Kärkkäinen

Split suffixes into 3 groups instead of 2, by starting index: 0 mod 3, 1 mod 3 and 2 mod 3

Sorting 0 and 1 Together

Text: A B C D E F G H I J K L

Sort the 2n/3 triplets that start at indices 0 mod 3 and 1 mod 3, and map them to single chars

This gives a new text of length 2n/3; sort its suffixes recursively
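One way to build the new text is sketched below. It follows the choice on this slide of sampling indices 0 mod 3 and 1 mod 3. The padding scheme is an assumption of the sketch: zeros are appended, and when n is a multiple of 3 an extra all-zero triplet separates the two halves. The recursion is replaced by a naive sort for brevity. The names `sample_string` and `sample_suffix_order` are hypothetical.

```python
def sample_string(text):
    """Build the reduced text from triplets at positions 0 and 1 mod 3.

    text: list of ints >= 1, positions 0-indexed.
    Returns (starts, reduced): starts[k] is the text position whose
    triplet was renamed to reduced[k].
    """
    n = len(text)
    padded = list(text) + [0, 0, 0]
    # Mod-0 triplets first. range(0, n + 1, 3) adds a dummy (0, 0, 0)
    # triplet when n % 3 == 0, so the mod-0 half always ends in padding.
    starts = list(range(0, n + 1, 3)) + list(range(1, n, 3))
    triples = [tuple(padded[i:i + 3]) for i in starts]
    names = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    return starts, [names[t] for t in triples]


def sample_suffix_order(text):
    """Sorted order of the suffixes starting at positions 0 or 1 mod 3."""
    starts, reduced = sample_string(text)
    # Naive sort stands in for the recursive call on the reduced text.
    order = sorted(range(len(reduced)), key=lambda k: reduced[k:])
    return [starts[k] for k in order if starts[k] < len(text)]


text = [2, 3, 1, 2, 3, 2, 4]      # CGACGCT with A=1, C=2, G=3, T=4
print(sample_string(text)[1])     # [1, 2, 5, 3, 4]
print(sample_suffix_order(text))  # [0, 3, 1, 4, 6]
```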

Sorting Suffixes in 2

Each mod 2 suffix 2_i is a char A_i followed by a mod 0 suffix 0_i: 2_1 = A_1,0_1  2_2 = A_2,0_2  2_3 = A_3,0_3  2_4 = A_4,0_4

Sort these n/3 pairs; the 0's are the mod 0 suffixes, whose order we know
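As a sketch of this step, assuming 0-indexed positions and a dictionary `sample_rank` (a hypothetical name) that holds the rank of every suffix starting at an index 0 or 1 mod 3:

```python
def sort_mod2_suffixes(text, sample_rank):
    """Order the suffixes at positions 2 mod 3 using known sample ranks.

    The suffix at i (i % 3 == 2) is text[i] followed by the suffix at
    i + 1, which starts at an index 0 mod 3 and has a known rank.
    """
    n = len(text)

    def key(i):
        # Past the end of the text the remaining suffix is empty: rank -1.
        return (text[i], sample_rank.get(i + 1, -1))

    return sorted(range(2, n, 3), key=key)


# Sample suffixes of CGACGCT in sorted order start at 0, 3, 1, 4, 6.
sample_rank = {0: 0, 3: 1, 1: 2, 4: 3, 6: 4}
print(sort_mod2_suffixes("CGACGCT", sample_rank))  # [2, 5]
```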

Merging

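We know the order of all 0,1 suffixes!

A mod 1 suffix is two chars AB followed by a mod 0 suffix; a mod 2 suffix is two chars CD followed by a mod 1 suffix

Compare AB with CD; on a tie, the two remaining suffixes are both 0,1 suffixes, whose order we know

A mod 0 suffix against a mod 2 suffix needs only one char each: what remains is a mod 1 suffix and a mod 0 suffix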
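The merge can be sketched as a standard two-way merge in which every comparison looks at one or two characters plus a known rank. This is an illustration under assumptions: 0-indexed positions, the two sorted lists are given, and the name `merge_sample_and_mod2` is hypothetical.

```python
def merge_sample_and_mod2(text, sample_order, mod2_order):
    """Merge sorted 0/1-mod-3 suffixes with sorted 2-mod-3 suffixes."""
    rank = {p: r for r, p in enumerate(sample_order)}

    def rk(p):
        return rank.get(p, -1)  # empty suffix past the end sorts first

    def less(i, j):
        """True if suffix i (0 or 1 mod 3) is smaller than suffix j (2 mod 3)."""
        if i % 3 == 0:
            # One char each; i + 1 is 1 mod 3 and j + 1 is 0 mod 3.
            return (text[i:i + 1], rk(i + 1)) < (text[j:j + 1], rk(j + 1))
        # Two chars each; i + 2 is 0 mod 3 and j + 2 is 1 mod 3.
        return (text[i:i + 2], rk(i + 2)) < (text[j:j + 2], rk(j + 2))

    out, a, b = [], 0, 0
    while a < len(sample_order) and b < len(mod2_order):
        if less(sample_order[a], mod2_order[b]):
            out.append(sample_order[a])
            a += 1
        else:
            out.append(mod2_order[b])
            b += 1
    return out + sample_order[a:] + mod2_order[b:]


print(merge_sample_and_mod2("CGACGCT", [0, 3, 1, 4, 6], [2, 5]))
# [2, 0, 3, 5, 1, 4, 6], i.e. 3 1 4 6 2 5 7 when counted from 1
```

Slices such as `text[i:i + 2]` are used instead of indexing so that a suffix that ends early compares as the smaller one without special cases.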

Time Complexity

T(n) = O(n) + T(2n/3) + O(n) = O(n)
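The bound follows by unrolling the recurrence into a geometric series. In the sketch below, c is an assumed constant that bounds the two O(n) terms together:

```latex
\begin{aligned}
T(n) &\le cn + T\!\left(\tfrac{2}{3}n\right) \\
     &\le cn\left(1 + \tfrac{2}{3} + \left(\tfrac{2}{3}\right)^{2} + \cdots\right) \\
     &\le cn \cdot \frac{1}{1 - 2/3} = 3cn = O(n)
\end{aligned}
```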

Generalization

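Pick a set D of indices mod v

Take the block of v chars starting at every index j such that j mod v is in D, and map each block to a single char

This string has size |D|n/v

Time taken to create this string is O(n |D|)

Sorting suffixes of this string gives the sorted order of all suffixes which begin at indices j such that j mod v is in D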

Key Property of D

For any 2 indices i and j, (i-j) mod v is the distance between some two beads in D

So there is a shift x<v such that i+x and j+x both fall on beads of D

D is a Difference Cover if the distances between beads in D generate 0,1,…,v-1
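Both statements are easy to check by brute force for small v. The sketch below is an illustration; the names `is_difference_cover` and `common_shift` are hypothetical.

```python
def is_difference_cover(D, v):
    """True if every residue 0..v-1 is a difference of two elements of D mod v."""
    diffs = {(a - b) % v for a in D for b in D}
    return diffs == set(range(v))


def common_shift(i, j, D, v):
    """Smallest x < v with (i + x) % v and (j + x) % v both in D, else None."""
    members = {d % v for d in D}
    for x in range(v):
        if (i + x) % v in members and (j + x) % v in members:
            return x
    return None


print(is_difference_cover({0, 1}, 3))     # True: the 0,1 mod 3 split above
print(is_difference_cover({0, 1}, 4))     # False: distance 2 is missing
print(is_difference_cover({0, 1, 3}, 7))  # True
print(common_shift(2, 5, {0, 1}, 3))      # 1
```

When D is a Difference Cover, `common_shift` never returns `None`: after at most v-1 character comparisons, two suffixes reach positions whose relative order is already known.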

Size of D

There exists a Difference Cover of size 1.5*sqrt(v)!

[Figure: a Difference Cover drawn on the indices mod v, with two spans labeled sqrt(v)]
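One simple construction with about 1.5*sqrt(v) elements, up to a small additive constant, is sketched below. It is an assumption of this sketch, not necessarily the construction drawn on the slide: take a run of about sqrt(v) consecutive indices, plus multiples of the run length up to about v/2. Differences larger than v/2 come for free, because if d is a difference then so is v-d.

```python
import math


def is_difference_cover(D, v):
    """True if every residue 0..v-1 is a difference of two elements of D mod v."""
    return {(a - b) % v for a in D for b in D} == set(range(v))


def simple_difference_cover(v):
    """A difference cover mod v (v >= 1) of size at most 1.5*sqrt(v) + 2."""
    r = math.isqrt(v - 1) + 1  # ceil(sqrt(v)) for v >= 1
    run = set(range(r))        # 0, 1, ..., r-1
    # Multiples of r reaching just past v/2, reduced mod v.
    steps = {(m * r) % v for m in range(1, (v // 2) // r + 2)}
    return sorted(run | steps)


D = simple_difference_cover(16)
print(D)                           # [0, 1, 2, 3, 4, 8, 12]
print(is_difference_cover(D, 16))  # True
```

Any d up to v/2 can be written as (a multiple of r) minus (an element of the run), which is why the two parts together cover every distance.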

Time Complexity

T(n) = O(n|D|) + T(|D|n/v) + O(nv)

For |D| = 1.5 sqrt(v)

T(n) = O(n sqrt(v)) + T(1.5n/sqrt(v)) + O(nv)