The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching
The Burrows-Wheeler Transform: Theory and Practice
Article by: Giovanni Manzini
Original Algorithm by:M. Burrows and D. J. Wheeler
Lecturer: Eran Vered
Overview
The Burrows-Wheeler transform (bwt).
Statistical compression overview
Compressing using bwt
Analysis of the compression results.
General
bwt: permutes the order of the symbols of a text.
The bwt output is very easy to compress.
Used by the compressor bzip2.
Calculating bw(s)
Add an end-of-string symbol ($) to s
Generate a matrix of all the cyclic shifts of s$
Sort the matrix rows in right-to-left lexicographic order
bw(s) is the first column of the sorted matrix
The $ sign is dropped; its location is saved
BWT Example
s = mississippi

Cyclic shifts of s$:
mississippi$
ississippi$m
ssissippi$mi
sissippi$mis
issippi$miss
ssippi$missi
sippi$missis
ippi$mississ
ppi$mississi
pi$mississip
i$mississipp
$mississippi

After sorting the rows (right-to-left lexicographic order):
mississippi$
ssissippi$mi
$mississippi
ssippi$missi
ppi$mississi
ississippi$m
pi$mississip
i$mississipp
sissippi$mis
sippi$missis
issippi$miss
ippi$mississ

bw(s) = (msspipissii, 3)

Sorting the rows of the matrix is equivalent to sorting the suffixes of s^r (ippississim).
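The steps above can be sketched directly in Python (the function name and the return convention — the transformed string plus the 1-based position of the dropped $ — are mine, not from the slides):

```python
def bw(s):
    """Burrows-Wheeler transform as defined on the slide: sort all
    cyclic shifts of s$ in right-to-left lexicographic order and
    read off the first column."""
    t = s + "$"                                   # '$' sorts below all letters
    shifts = [t[i:] + t[:i] for i in range(len(t))]
    shifts.sort(key=lambda row: row[::-1])        # right-to-left lexicographic order
    first_col = "".join(row[0] for row in shifts)
    pos = first_col.index("$")                    # save the location of $ ...
    return first_col[:pos] + first_col[pos + 1:], pos + 1  # ... and drop it (1-based)
```

Running it on the slide's example, `bw("mississippi")` reproduces `("msspipissii", 3)`.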
BWT Matrix Properties
F            L
m ississippi $
s sissippi$m i
$ mississipp i
s sippi$miss i
p pi$mississ i
i ssissippi$ m
p i$mississi p
i $mississip p
s issippi$mi s
s ippi$missi s
i ssippi$mis s
i ppi$missis s

F = first column (bw(s) with the $ added back); L = last column
Sorting F gives L
s1 = F1
Fi follows Li in s$
Equal symbols in L are ordered the same as in F
Reconstructing s

F = ms$spipissii  (bw(s) with $ reinserted at the saved position)
L = $iiiimppssss

Sort F to get L
s1 = F1
Fi follows Li in s$
Equal symbols: same order of appearance in F and L

s = m i s s i … ?

Reconstructing s

F = ms$spipissii
L = sort(F) = $iiiimppssss

s = F1
j = 1
for i = 2 to n {
  a = # of appearances of Fj in { F1, F2, …, Fj }
  j = index of the a’th appearance of Fj in L
  s = s + Fj
}
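The reconstruction procedure above can be written out directly; this sketch assumes the (bw string, 1-based $-position) convention from the earlier example (names are mine):

```python
def inverse_bw(bwt_str, dollar_pos):
    """Invert the transform with the slide's algorithm: reinsert $
    to recover F, sort F to get L, then repeatedly match the rank
    of F[j] among equal symbols to its position in L."""
    F = bwt_str[:dollar_pos - 1] + "$" + bwt_str[dollar_pos - 1:]
    L = "".join(sorted(F))          # sort F to get L
    s = F[0]                        # s1 = F1
    j = 0                           # current position in F (0-based)
    for _ in range(len(F) - 1):
        c = F[j]
        a = F[:j + 1].count(c)      # F[j] is the a-th appearance of c in F
        seen = 0                    # j <- index of the a-th appearance of c in L
        for idx, ch in enumerate(L):
            if ch == c:
                seen += 1
                if seen == a:
                    j = idx
                    break
        s += F[j]
    return s[:-1]                   # reconstructed string ends with $; drop it
```

For the running example, `inverse_bw("msspipissii", 3)` returns `"mississippi"`.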
What’s good about bwt?
bwt(s) is locally homogeneous: for every substring w of s, all the symbols following w in s are grouped together.

(In the sorted matrix, rows that end with the same string w are adjacent, so their first symbols — exactly the symbols that follow w in s — are consecutive in bw(s).)

These symbols will usually be homogeneous.
What’s good about bwt?
s      = miss_mississippi_misses_miss_missouri
bwt(s) = mmmmmssssss_spiiiiiupii_ssssss_e_ioir

Runs in the output correspond to the symbols that follow the same context in s: for example, the symbols following m, mi, mis, and _ each form a group.
Statistical Compression
We will discuss lossless statistical compression with the following notation:
s = input string over the alphabet Σ = { a1, a2, a3, …, ah }
h = |Σ|
n = |s|
ni = number of appearances of ai in s
log x = log2 x
Zeroth Order Encoding
Every input symbol is replaced by the same codeword for all its appearances: ai → ci

Example prefix code: e → 0, a → 10, c → 111, …

Kraft’s Inequality: Σi=1..h 2^(−|ci|) ≤ 1

Output size: Σi=1..h ni |ci|

Minimum achieved for: |ci| = log(n/ni)
Zeroth Order Encoding
Compressing a string using Huffman Coding or Arithmetic Coding produces an output whose size is close to |s|H0(s) bits.

Specifically: |Arit(s)| ≤ |s|H0(s) + 0.01|s| + 2

H0(s) is the zeroth-order empirical entropy of s:

H0(s) = Σi=1..h (ni/n) log(n/ni)
Zeroth order Entropy: Example
n1 = n2 = … = nh :   H0(s) = log h

n1 >> n2, n3, …, nh :   H0(s) ≈ 0

s = mississippi:
H0(s) = (1/11)log(11/1) + (4/11)log(11/4) + (4/11)log(11/4) + (2/11)log(11/2) = 1.82
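The definition translates into a few lines of Python (the helper name is mine):

```python
from collections import Counter
from math import log2

def H0(s):
    """Zeroth-order empirical entropy: sum of (ni/n) * log(n/ni)."""
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values())
```

`H0("mississippi")` evaluates to about 1.82, matching the slide's computation; a string of h equally frequent symbols gives log h, and a one-symbol string gives 0.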
k-th Order Encoding
The codeword that encodes an input symbol is determined by that symbol and its k preceding symbols.

ws – a string containing all the symbols following (occurrences of) w in s.

k-th order empirical entropy of s:

Hk(s) = (1/|s|) Σw∈Σ^k |ws| H0(ws)

Output size is bounded by |s|Hk(s) bits.
k-th order Entropy: Example
s = mississippi (k=1)

ms = i      H0(i) = 0
is = ssp    H0(ssp) = 0.92
ss = sisi   H0(sisi) = 1
ps = pi     H0(pi) = 1

H1(s) = (1/11)(1·0 + 3·0.92 + 4·1 + 2·1) = 0.79

(compare: H0(s) = 1.82)
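The same computation can be scripted; this sketch bundles its own H0 helper so it is self-contained (names are mine):

```python
from collections import Counter
from math import log2

def H0(s):
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values())

def Hk(s, k):
    """k-th order empirical entropy: (1/|s|) * sum over contexts w
    of |ws| * H0(ws), where ws collects the symbols following w in s."""
    follows = {}
    for i in range(len(s) - k):
        w = s[i:i + k]
        follows[w] = follows.get(w, "") + s[i + k]
    return sum(len(ws) * H0(ws) for ws in follows.values()) / len(s)
```

`Hk("mississippi", 1)` gives about 0.79, as on the slide, and a perfectly predictable string like ababab… has H1 = 0.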
Did we get an optimal k-th order compressor?
k-th Order Encoding and bwt

After applying bwt, for every substring w of s, all the symbols following w in s are grouped together:

bwt(s) = w1s w2s … wls   (the strings ws, one per context w ∈ Σ^k, concatenated in some order)

Σi=1..l |wis| H0(wis) ≤ |w1s w2s … wls| H0(w1s w2s … wls)

Not yet: local homogeneity instead of global homogeneity.

mmmmmssssss_spiiiiiupii_ssssss_e_ioir   (runs correspond to contexts such as i$, s_, mi, i_, se)
k-th Order Encoding and bwt
For example: s = ababababababab…
bwt(s) = abbbbbbbbbbaaaaaaaaa

w1 ($)   w2 (a)   w3 (b)

H1(s) = 0   (wa = bbb…, wb = aaa…)
H0(wi) = 0 for each i, but
H0(w1w2w3) = H0(s) = 1
Compressing bwt
s → bwt → MoveToFront → Arithmetic coding
MoveToFront Compression
Every input symbol is encoded by the number of distinct symbols that have occurred since the last appearance of that symbol.
Implemented using a list of symbols ordered by recency of use.
The output contains many small numbers if the text is locally homogeneous: mtf transforms local homogeneity into global homogeneity.
MoveToFront Compression
Σ = { d, e, h, l, o, r, w }
s = h e l l o w o r l d

mtf-list:                  output:
{ d, e, h, l, o, r, w }    h → 2
{ h, d, e, l, o, r, w }    e → 2
{ e, h, d, l, o, r, w }    l → 3
{ l, e, h, d, o, r, w }    l → 0
{ l, e, h, d, o, r, w }    o → 4
{ o, l, e, h, d, r, w }    w → 6
{ w, o, l, e, h, d, r }    o → 1
…

mtf(s) = 2 2 3 0 4 6 1 …

The initial list may be either:
Ordered alphabetically
Symbols in order of appearance in the string (this order must then be added to the output)
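A minimal sketch of the list-based implementation (the alphabet is passed in explicitly, matching the slide's initial-list discussion; the function name is mine):

```python
def mtf(s, alphabet):
    """Move-to-front: each symbol is encoded by its position in a
    recency list, and the list is updated by moving that symbol
    to the front."""
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.pop(i)
        lst.insert(0, ch)
    return out
```

With the slide's initial list, `mtf("helloworld", "dehlorw")` yields `[2, 2, 3, 0, 4, 6, 1, 6, 3, 6]`, whose first values match the example.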
bwt0 Compression
bwt0(s) = arit( mtf( bw(s) ) )

Theorem 1. For any k:
|bwt0(s)| ≤ 8|s|Hk(s) + c1|s| + h^k(2h log h + 9) + c2
where c1 = 2/25 + 0.01 and h = size of the alphabet.
Notations
x’ = mtf(x)
For a string w over {0, 1, 2, …, m} define:
w01 : w, with all the non-zeros replaced by 1
x’01 : x’, with all the non-zeros replaced by 1
Note: |bw(x)| = |x|, |mtf(x)| = |x|
Theorem 1 - Proof
Lemma 1. Let s = s1s2…st and s’ = mtf(s). Then:
|s’| H0(s’) ≤ 8 Σi=1..t |si| H0(si) + (2/25)|s| + t(2h log h + 9)
Theorem 1 - Proof

bw(s) can be partitioned into at most h^k substrings w1, w2, …, wl such that:
Σi=1..l |wi| H0(wi) ≤ |s| Hk(s)

s’ = mtf(bw(s)). By Lemma 1:
|s’| H0(s’) ≤ 8 Σi=1..l |wi| H0(wi) + (2/25)|s| + h^k(2h log h + 9)
≤ 8|s|Hk(s) + (2/25)|s| + h^k(2h log h + 9)

Using the bound on the output of Arit:
|bwt0(s)| = |Arit(s’)| ≤ |s’|H0(s’) + 0.01|s’| + 2
≤ 8|s|Hk(s) + c1|s| + h^k(2h log h + 9) + c2
Lemma 1 - Proof
Encoding of s’:
For each symbol: is it 0 or not?
For non-zeros: encode one of 1, 2, 3, …, h-1
Note: ignoring some inter-substring problems.

Lemma 1. s = s1s2…st, s’ = mtf(s). Then:
|s’| H0(s’) ≤ 8 Σi=1..t |si| H0(si) + (2/25)|s| + t(2h log h + 9)
Encoding non-zeros of s’
Use a prefix code (i → ci):   s’’ = pcnz(s’)
c1 = 10
c2 = 11
ci = 00…0 B(i+1)   (i > 2), with |B(i+1)| − 2 leading zeros (B = binary representation)
|ci| ≤ 2 log(i+1)   (|c0| = 0)

|s’’| = Σj=1..n |c(s’j)| = Σi=1..h−1 mi |ci| ≤ 2 Σi=1..h−1 mi log(i+1)

mi = # occurrences of i in s’.
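The prefix code above is easy to write down; this sketch uses Python's bin() for B(i+1) (function names are mine):

```python
def c(i):
    """Codeword for value i >= 1: the binary representation of i+1,
    preceded by |B(i+1)| - 2 zeros (so c1 = 10 and c2 = 11)."""
    b = bin(i + 1)[2:]                # B(i+1)
    return "0" * (len(b) - 2) + b

def pcnz(values):
    """Concatenate the codewords of the non-zero MTF values;
    zeros are handled separately by the 0/1 indicator string."""
    return "".join(c(v) for v in values if v != 0)
```

Note |c(i)| = 2|B(i+1)| − 2, which is at most 2 log(i+1), and no codeword is a prefix of another.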
Encoding non-zeros of s’=mtf(s)
|s’’| ≤ 2|s|H0(s) + 2|s|

Proof: Let a be a symbol with Na occurrences in s, at positions p1, p2, …, pNa. Then:
s’p1 ≤ p1,   s’pi ≤ pi − pi−1

Σi=1..Na |c(s’pi)| ≤ 2 log(p1 + 1) + Σi=2..Na 2 log(pi − pi−1 + 1) ≤ 2 Na log(n/Na + 1)

using concavity of log:   (1/N) Σi=1..N log xi ≤ log( (1/N) Σi=1..N xi )
and p1 + Σi (pi − pi−1) ≤ n.

Sum over all symbols a of s, using Σa Na log(n/Na) = n H0(s) and log(n/Na + 1) ≤ log(n/Na) + 1.

For any string s:   |s’’| ≤ 2|s|H0(s) + 2|s|   (recall |ci| ≤ 2 log(i+1))
Encoding non-zeros of s’
For every i:
|s’’i| ≤ 2|si|H0(si) + 2|si|

Summing for all substrings (s = s1s2…st):
Σi=1..t |s’’i| ≤ 2 Σi=1..t |si| H0(si) + 2|s|
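A quick empirical sanity check of this bound on small strings (helper names are mine; the initial MTF list is taken to be the sorted alphabet of s):

```python
from collections import Counter
from math import log2

def H0(s):
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values())

def mtf(s):
    lst = sorted(set(s))                 # initial list: alphabetical
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.pop(i)
        lst.insert(0, ch)
    return out

def code_len(i):
    """|c_i| = 2|B(i+1)| - 2 for the prefix code of the previous slide."""
    return 2 * (i + 1).bit_length() - 2

def nonzero_bits(s):
    """Total length of pcnz(mtf(s)): bits spent on non-zero MTF values."""
    return sum(code_len(v) for v in mtf(s) if v != 0)
```

For s = mississippi this gives 20 bits, comfortably below 2|s|H0(s) + 2|s| ≈ 62.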
Encoding of s’
For each symbol: is it 0 or not? (encode s’01)
For non-zeros: encode one of 1, 2, 3, …, h-1
No more than 2 Σi=1..t |si| H0(si) + 2|s| bits
Encoding s’01
If for every s’i01 the number of 0’s is at least as large as the number of 1’s:
|s’’01| ≤ 3 Σi=1..t |s’i01| H0(s’i01) + |s’01| / 40
and |s’i01| H0(s’i01) ≤ 2 |si| H0(si)

(s’’01 denotes the encoding of s’01)

It follows that:
|s’’01| ≤ 6 Σi=1..t |si| H0(si) + |s| / 40

Otherwise …
Encoding s’01 (second case)

Suppose s’i01 has more 1’s than 0’s for i = 1, 2, …, l (and not for i = l+1, …, t).
If there are more 1’s than 0’s in s’i01, then |si| ≤ 2 |si| H0(si).

It follows that:
|s’’01| ≤ 6 Σi=1..t |si| H0(si) + 3|s| / 40
Encoding of s’

For non-zeros: encode one of 1, 2, 3, …, h-1
No more than 2 Σi=1..t |si| H0(si) + 2|s| bits

For each symbol: is it 0 or not? (encode s’01)
No more than 6 Σi=1..t |si| H0(si) + 3|s| / 40 bits

Total (after fixing some inaccuracies): no more than
8 Σi=1..t |si| H0(si) + (2/25)|s| + t(2h log h + 9) bits
Improvement
Use RLE:

bw0RL(s) = arit( rle( mtf( bw(s) ) ) )

Better performance in practice
Better theoretical bound:
|bw0RL(s)| ≤ (5 + 3μ)|s| H*k(s) + gk
(H*k = modified k-th order empirical entropy, μ = the arithmetic-coding overhead, gk = a constant depending on k)
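A simplified stand-in for the run-length step, collapsing each maximal run of zeros in the MTF output into a (0, length) pair — the actual scheme encodes run lengths in binary, so this sketch is only an illustration (the function name is mine):

```python
def rle_zeros(values):
    """Replace each maximal run of zeros in an MTF output list with
    a single (0, run_length) pair; non-zero values pass through."""
    out = []
    i = 0
    while i < len(values):
        if values[i] == 0:
            j = i
            while j < len(values) and values[j] == 0:
                j += 1
            out.append((0, j - i))
            i = j
        else:
            out.append(values[i])
            i += 1
    return out
```

Since the MTF output of a locally homogeneous string is dominated by long zero runs, this step shrinks exactly the part of the output that bwt makes abundant.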
Notes

Compressor implementation: use blocks of text. Sort using one of:
Compact suffix trees (long average LCP)
Suffix arrays (medium average LCP)
General string sorter (short average LCP)
Search in a compressed text: Extract suffix-array from bwt(s).
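Following the earlier note that sorting the matrix rows is equivalent to sorting the suffixes of s^r, bw(s) can also be computed from a (naively built) suffix array of the reversed string; the function name and conventions are mine, and a production implementation would use a linear-time suffix-array construction instead of this O(n² log n) sort:

```python
def bw_via_suffix_array(s):
    """Compute bw(s) by sorting the suffixes of reverse(s)$, which is
    equivalent to the right-to-left sort of the cyclic-shift matrix."""
    n = len(s)
    t = s + "$"
    sr = s[::-1] + "$"
    # naive suffix array of reverse(s)$ (suffix starting positions, sorted)
    sa = sorted(range(n + 1), key=lambda i: sr[i:])
    # the suffix starting at i corresponds to the cyclic shift of s$
    # starting at position n - i, whose first symbol is t[n - i]
    first_col = "".join(t[n - i] for i in sa)
    pos = first_col.index("$")
    return first_col[:pos] + first_col[pos + 1:], pos + 1
```

On the running example this agrees with the matrix-based definition: `bw_via_suffix_array("mississippi")` gives `("msspipissii", 3)`.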
Empirical Results…