Compressed Suffix Arrays and Suffix Trees Roberto Grossi, Jeffery Scott Vitter.


Compressed Suffix Arrays and Suffix Trees

Roberto Grossi, Jeffery Scott Vitter


Outline

Reminders
Motivation
Compression results: time & space bounds
  Compressed Suffix Tree
  Compressed Suffix Array
Proof of bounds


Reminder - Symbols

T = t_1 t_2 ... t_{n-1}: text of length n-1, with an end-of-file symbol # at the nth position

T[i,n] is suffix i of text T, for i = 1,…,n


Reminder - Symbols

P = p_1 p_2 ... p_m: pattern of length m

0 < ε ≤ 1


Reminder - Main Goal

Search for string pattern P within text T
Support fast queries
Text T is fully scanned only once


Reminder – Suffix Trees

Leaf with value i represents suffix T[i,n]

Build time: O(n)

Search time: O(m)

Structure space: O(n)


Reminder – Suffix Arrays

Suffixes in lexicographic order; SA[i] = the starting position in T of the i-th suffix

Σ = {a,b}, a < # < b

T = bbba#

i   suffix   SA[i]
1   a#       4
2   #        5
3   ba#      3
4   bba#     2
5   bbba#    1
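The suffix array for T = bbba# can be checked with a few lines of Python. This is a naive sketch that sorts whole suffixes (far slower than the build times quoted in these slides); the names `ORDER` and `suffix_array` are illustrative and not from the original.

```python
# Alphabet order taken from the slide: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}

def suffix_array(text):
    """Return SA as a list of 1-based start positions, sorted by suffix."""
    def key(i):
        # Translate the suffix starting at position i into ranks under ORDER.
        return [ORDER[c] for c in text[i - 1:]]
    return sorted(range(1, len(text) + 1), key=key)

print(suffix_array("bbba#"))  # [4, 5, 3, 2, 1]
```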


Reminder – Suffix Arrays

Build time: O(n log n)

Search time: O(m + log n)

Structure space: O(n)
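As an illustration of searching with a suffix array, the sketch below runs two binary searches over SA to find the block of suffixes that start with P. Note the assumption: this plain version costs O(m log n) comparisons; the O(m + log n) bound on the slide needs extra longest-common-prefix information that is not modeled here. The function names are hypothetical.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # alphabet order a < # < b, as in the reminder slide

def encode(s):
    return [ORDER[c] for c in s]

def find_occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    p, m = encode(pattern), len(pattern)

    def prefix(idx):
        # First m symbols of the suffix stored at SA index idx (0-based).
        start = sa[idx] - 1
        return encode(text[start:start + m])

    # Lower bound: first suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

print(find_occurrences("bbba#", [4, 5, 3, 2, 1], "bb"))  # [1, 2]
```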


Motivation

So far: greedy in space, fast searching

Need for space-efficient text indexing: reduce both space and query time


Compressed Suffix Tree

Build time: O(n)

Search time: O(m/log n + (log n)^ε)

Structure space: (ε^-1 + O(1)) n bits


Compressed Suffix Tree

Build Suffix Array
Build Compressed Suffix Tree
Patricia Tries
Compress Suffix Array


CSA Basic Operations

Compress(T,SA)
  Return succinct representation of SA
  Retain T
  Discard SA

Lookup(i)
  Return SA[i]
  Use compressed SA


CSA Primary measures

Compress: preprocessing time for the compressed SA; space of the compressed SA

Lookup: query time


Compressed Suffix Array

Build time O(n)

Structure space ½nloglogn + O(n)

lookup time O(loglogn)


Suffix Arrays Optimization

Main idea: decomposition scheme, recursive structure of permutations


Decomposition Scheme

Levels k = 0,…,l

SA_0 = SA (original SA), n_0 = n

n = |T|; assumption: n is a power of 2

n_k = n/2^k

SA_k is a permutation of {1,2,…,n_k}


SAk Succinct Representation

4 main steps:
1. Produce bit vector B_k
2. Map B_k 0's to 1's
3. Compute the number of 1's for each prefix of B_k, using function rank_k(j)
4. 'Pack' SA_k


Step #1: Produce bit vector Bk

|B_k| = n_k

B_k[i] = 1 if SA_k[i] is even

B_k[i] = 0 if SA_k[i] is odd

T = bba#

i      1 2 3 4
SA_0   3 4 2 1
B_0    0 1 1 0


Step #2 : Map Bk 0’s to 1’s

New function Ψ_k(i), i = 1,…,n_k

Ψ_k(i) = j   if SA_k[i] is odd and SA_k[j] = SA_k[i] + 1
Ψ_k(i) = i   otherwise (SA_k[i] is even)

T = bba#

i      1 2 3 4
SA_0   3 4 2 1
B_0    0 1 1 0
Ψ_0    2 2 3 3


Step #3 : Compute 1’s for Bk

Recall function rank_k(j), j = 1,…,n_k: rank_k(j) = number of 1's in the first j bits of B_k

T = bba#

i        1 2 3 4
SA_0     3 4 2 1
B_0      0 1 1 0
rank_0   0 1 2 2


Step #4 : ‘Pack’ SAk

Pack even values of SA_k

Divide by 2: new permutation of {1,2,…,n_{k+1}}

n_{k+1} = n_k/2 = n/2^{k+1}

Store new permutation into SA_{k+1}

Remove SA_k

|SA_{k+1}| = |SA_k|/2
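The four steps can be sketched for a single level as follows. This is a minimal illustration that keeps B_k, Ψ_k and rank_k as plain Python lists (so nothing is actually compressed) and assumes n_k is even, which holds when n is a power of 2. The function name `decompose_level` is hypothetical. The usage line runs it on SA_0 for T = bba# with a < # < b.

```python
def decompose_level(sa_k):
    """From SA_k (a permutation of 1..n_k) compute B_k, Psi_k, rank_k, SA_{k+1}."""
    # 1-based position of every value in SA_k, used to resolve Psi_k.
    pos = {v: i for i, v in enumerate(sa_k, start=1)}
    # Step 1: B_k[i] = 1 if SA_k[i] is even, 0 if odd.
    b = [1 if v % 2 == 0 else 0 for v in sa_k]
    # Step 2: Psi_k(i) = i for even entries, else the j with SA_k[j] = SA_k[i] + 1.
    psi = [i if v % 2 == 0 else pos[v + 1] for i, v in enumerate(sa_k, start=1)]
    # Step 3: rank_k(j) = number of 1's in the first j bits of B_k.
    rank, ones = [], 0
    for bit in b:
        ones += bit
        rank.append(ones)
    # Step 4: pack the even values, halved, into SA_{k+1}.
    sa_next = [v // 2 for v in sa_k if v % 2 == 0]
    return b, psi, rank, sa_next

print(decompose_level([3, 4, 2, 1]))
# ([0, 1, 1, 0], [2, 2, 3, 3], [0, 1, 2, 2], [2, 1])
```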


Example: level 0, steps 1-3


Example: level 0, step 4


Lemma : Reconstruct SAk

Results of phase k: B_k, Ψ_k, rank_k, SA_{k+1}

Reconstruct SA_k:

SA_k[i] = 2 * SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] - 1)

for i = 1,…,n_k
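The lemma can be checked numerically on a small case. The values below are the level-0 tables for T = bba# under the order a < # < b (SA_0 = 3 4 2 1); the helper `reconstruct` is a hypothetical name, and the 1-based indices of the slides are shifted by one to fit Python lists.

```python
# Level-0 tables for T = bba# (1-based in the slides, 0-based in these lists).
B0    = [0, 1, 1, 0]
PSI0  = [2, 2, 3, 3]
RANK0 = [0, 1, 2, 2]
SA1   = [2, 1]          # even values of SA_0 (4 and 2), halved

def reconstruct(i):
    """SA_0[i] = 2 * SA_1[rank_0(Psi_0(i))] + (B_0[i] - 1), with 1-based i."""
    j = PSI0[i - 1]                 # Psi_0(i)
    r = RANK0[j - 1]                # rank_0(Psi_0(i))
    return 2 * SA1[r - 1] + (B0[i - 1] - 1)

print([reconstruct(i) for i in range(1, 5)])  # [3, 4, 2, 1]
```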


Proof, case 1, Bk[i] = 1

Claim: SA_k[i] = 2 * SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] - 1)

Step #2: Ψ_k(i) = i

Step #4: SA_k[i]/2 is stored in the rank_k(i)-th entry of SA_{k+1}

Hence SA_k[i] = 2 * SA_{k+1}[rank_k(i)], which matches the claim because B_k[i] - 1 = 0


Proof, case 2, Bk[i] = 0

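Claim: SA_k[i] = 2 * SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] - 1)

Ψ_k(i) = j

Step #2: SA_k[i] = SA_k[j] - 1

B_k[j] = 1

Apply case 1 on j: SA_k[j] = 2 * SA_{k+1}[rank_k(j)]

Hence SA_k[i] = 2 * SA_{k+1}[rank_k(Ψ_k(i))] - 1, which matches the claim because B_k[i] - 1 = -1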


Example, case 1, Bk[i] = 1

SA0[2] = ?

B0[2]=1, Ψ0(2)=2, rank0(2) = 1

SA0[2]/2 stored in 1st entry of SA1

SA0[2] = 2 * SA1[1] = 2 * 8 = 16


Example, case 2, Bk[i] = 0

SA0[3] = ?

B0[3]=0, Ψ0(3) = 14, rank0(14) = 6

SA0[14] = 2 * SA1[6] = 2 * 16 = 32

SA0[3] = SA0[14] - 1 = 32 - 1 = 31


Example - Decomposition


Determining l

n_0 = n = 32

n_3 = 4 ≈ n/log n, so SA_3 can be stored in ≤ n bits

Conclusion: l = ⌈log log n⌉
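The choice of l can be reproduced with a little integer arithmetic. The sketch below assumes n is a power of 2 and reads "l = log log n" as the smallest l with 2^l ≥ log n (the ceiling), which gives l = 3 for n = 32; the function name `num_levels` is hypothetical.

```python
def num_levels(n):
    """Smallest l with 2**l >= log2(n), for n a power of 2."""
    log_n = n.bit_length() - 1               # exact log2 for powers of two
    return (log_n - 1).bit_length() if log_n > 0 else 0

n = 32
l = num_levels(n)
n_l = n >> l                                 # n_l = n / 2**l entries in SA_l
bits_for_sa_l = n_l * (n.bit_length() - 1)   # n_l entries of log n bits each
print(l, n_l, bits_for_sa_l)                 # 3 4 20
```

With these numbers the last level costs 20 bits, which is within the n = 32 bits allowed.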


CSA Structure

Levels k = 0,1,…,l-1: store B_k, Ψ_k, rank_k

Final level k = l: store only SA_l


CSA Structure & Build

B_k: n_k bits per vector; O(n_k) build time

rank_k: O(n_k (log log n_k) / log n_k) bits (as shown before); O(n_k) build time

SA_l: (n/2^l) log n bits


CSA Structure space - Ψk

List method: 2^(2^k) lists, one for each possible 2^k-symbol 'prefix' of a suffix

Number of lists increases with k

L_k = concatenation of all the lists at level k

|L_k| = n_k/2

|L_k| decreases with k


CSA Structure space - Ψk

For i = 1,…,n_k/2:

j = position of the i-th 1 in B_k

The pattern in T at positions 2^k (SA_k[j] - 1),…, 2^k * SA_k[j] - 1 selects the list that j is matched to


Level 0

a list = {2,14,15,18,23,28,30,31}
b list = {7,8,10,13,16,17,21,27}


Levels 1,2

Level 1
aa = {} (empty list)
ab = {9}
ba = {1,6,12,14}
bb = {2,4,5}

Level 2
abba = {5,8}
baba = {1}
aabb = {4}


Reconstruct Ψk

If B_k[i] = 1:

Ψ_k(i) = i

If B_k[i] = 0:

h = number of 0's among the first i bits of B_k, that is h = i - rank_k(i)

Ψ_k(i) = L_k[h]
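A small sketch of this reconstruction follows. As a simplifying assumption, L_k is built here directly from a known Ψ_k (its values at the 0-positions of B_k, in order) rather than through the per-pattern lists; the function names are hypothetical and the data is the level-0 table for T = bba# with a < # < b.

```python
def build_L(b, psi):
    """Values of Psi at the positions where B is 0, in increasing position order."""
    return [psi[i] for i in range(len(b)) if b[i] == 0]

def psi_from_L(i, b, L):
    """Recover Psi(i) for 1-based i from the bit vector B and the list L."""
    if b[i - 1] == 1:
        return i                       # even entry: Psi(i) = i
    h = b[:i].count(0)                 # zeros in the first i bits, i.e. i - rank(i)
    return L[h - 1]                    # odd entry: Psi(i) = L[h]

B0, PSI0 = [0, 1, 1, 0], [2, 2, 3, 3]
L0 = build_L(B0, PSI0)
print(L0, [psi_from_L(i, B0, L0) for i in range(1, 5)])  # [2, 3] [2, 2, 3, 3]
```

Counting zeros with `b[:i].count(0)` takes linear time; it stands in for the constant-time rank_k structure of the slides.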


example : Reconstruct Ψk

Ψ0(25) = ?

B0[25] =0

h = 25 - 12 = 13

Ψ0(25) = L0[13] = 16


example : Reconstruct Ψk

rank0(16) = 8

SA1[8] = ?

Ψ1(8) = ?

B1[8] =0

h = 8 - 5 = 3

Ψ1(8) = L1[3] = 6


Lemma

s sorted integers, w bits per number, s < 2^w

Storage: s(2 + w - log s) + O(s / log log s) bits

Retrieve h-th integer: O(1) time
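One way to see where a bound of this shape can come from is a quotient/remainder encoding in the style of Elias-Fano: the top ⌊log s⌋ bits of each number are stored as unary-coded gaps (at most 2s bits) and the remaining w - ⌊log s⌋ bits are stored verbatim. The sketch below is an assumption about the technique, not the slides' exact construction, and it retrieves by scanning; the O(1) retrieval needs an auxiliary select structure, which is what the O(s / log log s) term would pay for. Names are hypothetical, and the sample data is the level-0 "a list".

```python
def encode_sorted(values, w):
    """Encode non-decreasing ints in [0, 2**w) as (r_bits, lows, high)."""
    s = len(values)
    q_bits = s.bit_length() - 1            # floor(log2 s) high bits per number
    r_bits = w - q_bits                    # low bits kept verbatim
    lows = [v & ((1 << r_bits) - 1) for v in values]
    high, prev = [], 0
    for v in values:
        q = v >> r_bits
        high.extend([0] * (q - prev))      # unary gap between quotients
        high.append(1)                     # one marker bit per number
        prev = q
    return r_bits, lows, high

def retrieve(h, r_bits, lows, high):
    """Return the h-th (1-based) integer by locating the h-th 1 in `high`."""
    ones = 0
    for pos, bit in enumerate(high):
        ones += bit
        if bit == 1 and ones == h:
            q = pos + 1 - h                # zeros before the h-th 1 = quotient
            return (q << r_bits) | lows[h - 1]
    raise IndexError(h)

a_list = [2, 14, 15, 18, 23, 28, 30, 31]
r_bits, lows, high = encode_sorted(a_list, w=5)
print(retrieve(4, r_bits, lows, high))     # 18
```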


Store Lk

Storage: n(1/2 + 3/2^(k+1)) + O(n / (2^k log log n)) bits

Retrieve h-th integer: O(1) time

Preprocess time: O(n/2^k + 2^(2^k))


CSA Structure - Summary

B_k: n_k bits

rank_k: O(n_k (log log n_k) / log n_k) bits

SA_l: (n/2^l) log n bits

Ψ_k: n(½ + 3/2^(k+1)) + O(n / (2^k log log n)) bits


Summing it up…

(n log n)/2^l + ½ * l * n + 5n + O(n / log log n)

≤½nloglogn+n

½nloglogn + O(n) bits of storage


Preprocess - summary

B_k: O(n_k)

rank_k: O(n_k)

Ψ_k: O(n/2^k + 2^(2^k))

Summing up levels 0,…,l-1: preprocess time O(n)


lookup(i)

lookup(i) refers to SA0[i]

Need to reconstruct SA0[i]

New procedure: rlookup(i,k)
Recursive
Based on the lemma for reconstructing SA_k


rlookup(i,k)

rlookup(i,k)
  if k = l
    return SA_l[i]
  else
    return 2 * rlookup(rank_k(Ψ_k(i)), k+1) + (B_k[i] - 1)
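The recursion can be exercised end to end with a small sketch. It builds the per-level tables as uncompressed Python lists (so it shows the logic of lookup, not the space savings), assumes every n_k is even, and uses hypothetical names `build_csa`, `rlookup` and `lookup`.

```python
def build_csa(sa, levels):
    """Return ([(B_k, Psi_k, rank_k) for k < levels], SA_l) from SA_0 = sa."""
    tables = []
    for _ in range(levels):
        pos = {v: i for i, v in enumerate(sa, start=1)}
        b = [1 - v % 2 for v in sa]                       # 1 for even, 0 for odd
        psi = [i if v % 2 == 0 else pos[v + 1]
               for i, v in enumerate(sa, start=1)]
        rank, ones = [], 0
        for bit in b:                                     # prefix counts of 1's
            ones += bit
            rank.append(ones)
        tables.append((b, psi, rank))
        sa = [v // 2 for v in sa if v % 2 == 0]           # pack into next level
    return tables, sa

def rlookup(i, k, tables, sa_last):
    """Recursive lookup of SA_k[i] (1-based i), following the slide's procedure."""
    if k == len(tables):
        return sa_last[i - 1]
    b, psi, rank = tables[k]
    j = rank[psi[i - 1] - 1]                              # rank_k(Psi_k(i))
    return 2 * rlookup(j, k + 1, tables, sa_last) + (b[i - 1] - 1)

def lookup(i, tables, sa_last):
    return rlookup(i, 0, tables, sa_last)

sa0 = [3, 4, 2, 1]                                        # SA for T = bba#, a < # < b
tables, sa_last = build_csa(sa0, levels=1)
print([lookup(i, tables, sa_last) for i in range(1, 5)])  # [3, 4, 2, 1]
```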


Reconstruct SAk

Lemma: SA_k[i] = 2 * SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] - 1)

lookup(i) = rlookup(i,0)


Example - lookup(i)

lookup(5) = rlookup(5,0), l=3

2*rlookup(rank0(Ψ0(5)),1)+(B0[5]-1)

2*rlookup(10,1)+(-1)


Example – cont.

rlookup(10,1) = 2*rlookup(rank1(Ψ1(10)),2)+(B1[10]-1)

2*rlookup(7,2)+(-1)


Example - cont.

rlookup(7,2) = 2*rlookup(rank2(Ψ2(7)),3)+(B2[7]-1)

2*rlookup(2,3)+(-1)


Example - cont.

rlookup(2,3) = SA3[2] = 3

lookup(5) = 2*(2*(2*3+(-1))+(-1))+(-1)
          = 2*(2*(5)+(-1))+(-1)
          = 2*(9)+(-1)
          = 17


lookup(i)

lookup(i) = rlookup(i,0)
l+1 levels
O(1) per level

O(loglogn) lookup time


Compressed Suffix Array

Build time O(n)

Structure space ½nloglogn + O(n)

lookup time O(loglogn)


Compressed Suffix Tree

Build time: O(n)

Search time: O(m/log n + (log n)^ε)

Structure space: (ε^-1 + O(1)) n bits