The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching
The Burrows-Wheeler Transform: Theory and Practice
Article by: Giovanni Manzini
Original Algorithm by:M. Burrows and D. J. Wheeler
Lecturer: Eran Vered
Overview
The Burrows-Wheeler transform (bwt).
Statistical compression overview
Compressing using bwt
Analysis of the compression results.
General
bwt: permutes the order of the symbols of a text.
The bwt output is very easy to compress.
Used by the compressor bzip2.
Calculating bw(s)
Add an end-of-string symbol ($) to s
Generate a matrix of all the cyclic shifts of s$
Sort the matrix rows in right-to-left lexicographic order
bw(s) is the first column of the sorted matrix
The $ sign is dropped; its location is saved
BWT Example
s = mississippi

Cyclic shifts of s$:
mississippi$
ississippi$m
ssissippi$mi
sissippi$mis
issippi$miss
ssippi$missi
sippi$missis
ippi$mississ
ppi$mississi
pi$mississip
i$mississipp
$mississippi

After sorting the rows (right-to-left lexicographic order):
mississippi$
ssissippi$mi
$mississippi
ssippi$missi
ppi$mississi
ississippi$m
pi$mississip
i$mississipp
sissippi$mis
sippi$missis
issippi$miss
ippi$mississ

bw(s) = (msspipissii, 3)

Sorting the rows of the matrix is equivalent to sorting the suffixes of s^r (ippississim).
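The steps above can be sketched directly in Python (the function name and the return convention — the transformed string plus the 1-based position of the dropped $ — are mine, not from the slides):

```python
def bw(s):
    """Burrows-Wheeler transform as defined on the slide: sort all
    cyclic shifts of s$ in right-to-left lexicographic order and
    read off the first column."""
    t = s + "$"                                   # '$' sorts below all letters
    shifts = [t[i:] + t[:i] for i in range(len(t))]
    shifts.sort(key=lambda row: row[::-1])        # right-to-left lexicographic order
    first_col = "".join(row[0] for row in shifts)
    pos = first_col.index("$")                    # save the location of $ ...
    return first_col[:pos] + first_col[pos + 1:], pos + 1  # ... and drop it (1-based)
```

Running it on the slide's example, `bw("mississippi")` reproduces `("msspipissii", 3)`.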
BWT Matrix Properties
F            L
m ississippi $
s sissippi$m i
$ mississipp i
s sippi$miss i
p pi$mississ i
i ssissippi$ m
p i$mississi p
i $mississip p
s issippi$mi s
s ippi$missi s
i ssippi$mis s
i ppi$missis s

F = first column (bw(s) with the $ added back); L = last column
Sorting F gives L
s1 = F1
Fi follows Li in s$
Equal symbols in L are ordered the same as in F
Reconstructing s

F = ms$spipissii  (bw(s) with $ reinserted at the saved position)
L = $iiiimppssss

Sort F to get L
s1 = F1
Fi follows Li in s$
Equal symbols: same order of appearance in F and L

s = m i s s i … ?

Reconstructing s

F = ms$spipissii
L = sort(F) = $iiiimppssss

s = F1
j = 1
for i = 2 to n {
  a = # of appearances of Fj in { F1, F2, …, Fj }
  j = index of the a’th appearance of Fj in L
  s = s + Fj
}
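The reconstruction procedure above can be written out directly; this sketch assumes the (bw string, 1-based $-position) convention from the earlier example (names are mine):

```python
def inverse_bw(bwt_str, dollar_pos):
    """Invert the transform with the slide's algorithm: reinsert $
    to recover F, sort F to get L, then repeatedly match the rank
    of F[j] among equal symbols to its position in L."""
    F = bwt_str[:dollar_pos - 1] + "$" + bwt_str[dollar_pos - 1:]
    L = "".join(sorted(F))          # sort F to get L
    s = F[0]                        # s1 = F1
    j = 0                           # current position in F (0-based)
    for _ in range(len(F) - 1):
        c = F[j]
        a = F[:j + 1].count(c)      # F[j] is the a-th appearance of c in F
        seen = 0                    # j <- index of the a-th appearance of c in L
        for idx, ch in enumerate(L):
            if ch == c:
                seen += 1
                if seen == a:
                    j = idx
                    break
        s += F[j]
    return s[:-1]                   # reconstructed string ends with $; drop it
```

For the running example, `inverse_bw("msspipissii", 3)` returns `"mississippi"`.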
What’s good about bwt?
bwt(s) is locally homogeneous: for every substring w of s, all the symbols following w in s are grouped together.

(In the sorted matrix, rows that end with the same string w are adjacent, so their first symbols — exactly the symbols that follow w in s — are consecutive in bw(s).)

These symbols will usually be homogeneous.
What’s good about bwt?
s      = miss_mississippi_misses_miss_missouri
bwt(s) = mmmmmssssss_spiiiiiupii_ssssss_e_ioir

Runs in the output correspond to the symbols that follow the same context in s: for example, the symbols following m, mi, mis, and _ each form a group.
Statistical Compression
We will discuss lossless statistical compression with the following notation:
s = input string over the alphabet Σ = { a1, a2, a3, …, ah }
h = |Σ|
n = |s|
ni = number of appearances of ai in s
log x = log2 x
Zeroth Order Encoding
Every input symbol is replaced by the same codeword for all its appearances: ai → ci

Example prefix code: e → 0, a → 10, c → 111, …

Kraft’s Inequality: Σi=1..h 2^(−|ci|) ≤ 1

Output size: Σi=1..h ni |ci|

Minimum achieved for: |ci| = log(n/ni)
Zeroth Order Encoding
Compressing a string using Huffman Coding or Arithmetic Coding produces an output whose size is close to |s|H0(s) bits.

Specifically: |Arit(s)| ≤ |s|H0(s) + 0.01|s| + 2

H0(s) is the zeroth-order empirical entropy of s:

H0(s) = Σi=1..h (ni/n) log(n/ni)
Zeroth order Entropy: Example
n1 = n2 = … = nh :   H0(s) = log h

n1 >> n2, n3, …, nh :   H0(s) ≈ 0

s = mississippi:
H0(s) = (1/11)log(11/1) + (4/11)log(11/4) + (4/11)log(11/4) + (2/11)log(11/2) = 1.82
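The definition translates into a few lines of Python (the helper name is mine):

```python
from collections import Counter
from math import log2

def H0(s):
    """Zeroth-order empirical entropy: sum of (ni/n) * log(n/ni)."""
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values())
```

`H0("mississippi")` evaluates to about 1.82, matching the slide's computation; a string of h equally frequent symbols gives log h, and a one-symbol string gives 0.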
k-th Order Encoding
The codeword that encodes an input symbol is determined by that symbol and its k preceding symbols.

ws – a string containing all the symbols following (occurrences of) w in s.

k-th order empirical entropy of s:

Hk(s) = (1/|s|) Σw∈Σ^k |ws| H0(ws)

Output size is bounded by |s|Hk(s) bits.
k-th order Entropy: Example
s = mississippi (k=1)

ms = i      H0(i) = 0
is = ssp    H0(ssp) = 0.92
ss = sisi   H0(sisi) = 1
ps = pi     H0(pi) = 1

H1(s) = (1/11)(1·0 + 3·0.92 + 4·1 + 2·1) = 0.79

(compare: H0(s) = 1.82)
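The same computation can be scripted; this sketch bundles its own H0 helper so it is self-contained (names are mine):

```python
from collections import Counter
from math import log2

def H0(s):
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values())

def Hk(s, k):
    """k-th order empirical entropy: (1/|s|) * sum over contexts w
    of |ws| * H0(ws), where ws collects the symbols following w in s."""
    follows = {}
    for i in range(len(s) - k):
        w = s[i:i + k]
        follows[w] = follows.get(w, "") + s[i + k]
    return sum(len(ws) * H0(ws) for ws in follows.values()) / len(s)
```

`Hk("mississippi", 1)` gives about 0.79, as on the slide, and a perfectly predictable string like ababab… has H1 = 0.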
Did we get an optimal k-th order compressor?
k-th Order Encoding and bwt

After applying bwt, for every substring w of s, all the symbols following w in s are grouped together:

bwt(s) = w1s w2s … wls   (the strings ws, one per context w ∈ Σ^k, concatenated in some order)

Σi=1..l |wis| H0(wis) ≤ |w1s w2s … wls| H0(w1s w2s … wls)

Not yet: local homogeneity instead of global homogeneity.

mmmmmssssss_spiiiiiupii_ssssss_e_ioir   (runs correspond to contexts such as i$, s_, mi, i_, se)
k-th Order Encoding and bwt
For example: s = ababababababab…
bwt(s) = abbbbbbbbbbaaaaaaaaa

w1 ($)   w2 (a)   w3 (b)

H1(s) = 0   (wa = bbb…, wb = aaa…)
H0(wi) = 0 for each i, but
H0(w1w2w3) = H0(s) = 1
Compressing bwt
s → bwt → MoveToFront → Arithmetic coding
MoveToFront Compression
Every input symbol is encoded by the number of distinct symbols that have occurred since the last appearance of that symbol.
Implemented using a list of symbols ordered by recency of use.
The output contains many small numbers if the text is locally homogeneous: mtf transforms local homogeneity into global homogeneity.
MoveToFront Compression
Σ = { d, e, h, l, o, r, w }
s = h e l l o w o r l d

mtf-list:                  output:
{ d, e, h, l, o, r, w }    h → 2
{ h, d, e, l, o, r, w }    e → 2
{ e, h, d, l, o, r, w }    l → 3
{ l, e, h, d, o, r, w }    l → 0
{ l, e, h, d, o, r, w }    o → 4
{ o, l, e, h, d, r, w }    w → 6
{ w, o, l, e, h, d, r }    o → 1
…

mtf(s) = 2 2 3 0 4 6 1 …

The initial list may be either:
Ordered alphabetically
Symbols in order of appearance in the string (this order must then be added to the output)
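A minimal sketch of the list-based implementation (the alphabet is passed in explicitly, matching the slide's initial-list discussion; the function name is mine):

```python
def mtf(s, alphabet):
    """Move-to-front: each symbol is encoded by its position in a
    recency list, and the list is updated by moving that symbol
    to the front."""
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.pop(i)
        lst.insert(0, ch)
    return out
```

With the slide's initial list, `mtf("helloworld", "dehlorw")` yields `[2, 2, 3, 0, 4, 6, 1, 6, 3, 6]`, whose first values match the example.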
bwt0 Compression
bwt0(s) = arit( mtf( bw(s) ) )

Theorem 1. For any k:
|bwt0(s)| ≤ 8|s|Hk(s) + c1|s| + h^k(2h log h + 9) + c2
where c1 = 2/25 + 0.01 and h = size of the alphabet.
Notations
x’ = mtf(x)
For a string w over {0, 1, 2, …, m} define:
w01 : w, with all the non-zeros replaced by 1
x’01 : x’, with all the non-zeros replaced by 1
Note: |bw(x)| = |x|, |mtf(x)| = |x|
Theorem 1 - Proof
Lemma 1. Let s = s1s2…st and s’ = mtf(s). Then:
|s’| H0(s’) ≤ 8 Σi=1..t |si| H0(si) + (2/25)|s| + t(2h log h + 9)
Theorem 1 - Proof

bw(s) can be partitioned into at most h^k substrings w1, w2, …, wl such that:
Σi=1..l |wi| H0(wi) ≤ |s| Hk(s)

s’ = mtf(bw(s)). By Lemma 1:
|s’| H0(s’) ≤ 8 Σi=1..l |wi| H0(wi) + (2/25)|s| + h^k(2h log h + 9)
≤ 8|s|Hk(s) + (2/25)|s| + h^k(2h log h + 9)

Using the bound on the output of Arit:
|bwt0(s)| = |Arit(s’)| ≤ |s’|H0(s’) + 0.01|s’| + 2
≤ 8|s|Hk(s) + c1|s| + h^k(2h log h + 9) + c2
Lemma 1 - Proof
Encoding of s’:
For each symbol: is it 0 or not?
For non-zeros: encode one of 1, 2, 3, …, h-1
Note: ignoring some inter-substring problems.

Lemma 1. s = s1s2…st, s’ = mtf(s). Then:
|s’| H0(s’) ≤ 8 Σi=1..t |si| H0(si) + (2/25)|s| + t(2h log h + 9)
Encoding non-zeros of s’
Use a prefix code (i → ci):   s’’ = pcnz(s’)
c1 = 10
c2 = 11
ci = 00…0 B(i+1)   (i > 2), with |B(i+1)| − 2 leading zeros (B = binary representation)
|ci| ≤ 2 log(i+1)   (|c0| = 0)

|s’’| = Σj=1..n |c(s’j)| = Σi=1..h−1 mi |ci| ≤ 2 Σi=1..h−1 mi log(i+1)

mi = # occurrences of i in s’.
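The prefix code above is easy to write down; this sketch uses Python's bin() for B(i+1) (function names are mine):

```python
def c(i):
    """Codeword for value i >= 1: the binary representation of i+1,
    preceded by |B(i+1)| - 2 zeros (so c1 = 10 and c2 = 11)."""
    b = bin(i + 1)[2:]                # B(i+1)
    return "0" * (len(b) - 2) + b

def pcnz(values):
    """Concatenate the codewords of the non-zero MTF values;
    zeros are handled separately by the 0/1 indicator string."""
    return "".join(c(v) for v in values if v != 0)
```

Note |c(i)| = 2|B(i+1)| − 2, which is at most 2 log(i+1), and no codeword is a prefix of another.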
Encoding non-zeros of s’=mtf(s)
|s’’| ≤ 2|s|H0(s) + 2|s|

Proof: Let a be a symbol with Na occurrences in s, at positions p1, p2, …, pNa. Then:
s’p1 ≤ p1,   s’pi ≤ pi − pi−1

Σi=1..Na |c(s’pi)| ≤ 2 log(p1 + 1) + Σi=2..Na 2 log(pi − pi−1 + 1) ≤ 2 Na log(n/Na + 1)

using concavity of log:   (1/N) Σi=1..N log xi ≤ log( (1/N) Σi=1..N xi )
and p1 + Σi (pi − pi−1) ≤ n.

Sum over all symbols a of s, using Σa Na log(n/Na) = n H0(s) and log(n/Na + 1) ≤ log(n/Na) + 1.

For any string s:   |s’’| ≤ 2|s|H0(s) + 2|s|   (recall |ci| ≤ 2 log(i+1))
Encoding non-zeros of s’
For every i:
|s’’i| ≤ 2|si|H0(si) + 2|si|

Summing for all substrings (s = s1s2…st):
Σi=1..t |s’’i| ≤ 2 Σi=1..t |si| H0(si) + 2|s|
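A quick empirical sanity check of this bound on small strings (helper names are mine; the initial MTF list is taken to be the sorted alphabet of s):

```python
from collections import Counter
from math import log2

def H0(s):
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values())

def mtf(s):
    lst = sorted(set(s))                 # initial list: alphabetical
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.pop(i)
        lst.insert(0, ch)
    return out

def code_len(i):
    """|c_i| = 2|B(i+1)| - 2 for the prefix code of the previous slide."""
    return 2 * (i + 1).bit_length() - 2

def nonzero_bits(s):
    """Total length of pcnz(mtf(s)): bits spent on non-zero MTF values."""
    return sum(code_len(v) for v in mtf(s) if v != 0)
```

For s = mississippi this gives 20 bits, comfortably below 2|s|H0(s) + 2|s| ≈ 62.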
Encoding of s’
For each symbol: is it 0 or not? (encode s’01)
For non-zeros: encode one of 1, 2, 3, …, h-1
No more than 2 Σi=1..t |si| H0(si) + 2|s| bits
Encoding s’01
If for every s’i01 the number of 0’s is at least as large as the number of 1’s:
|s’’01| ≤ 3 Σi=1..t |s’i01| H0(s’i01) + |s’01| / 40
and |s’i01| H0(s’i01) ≤ 2 |si| H0(si)

(s’’01 denotes the encoding of s’01)

It follows that:
|s’’01| ≤ 6 Σi=1..t |si| H0(si) + |s| / 40

Otherwise …
Encoding s’01 (second case)

Suppose s’i01 has more 1’s than 0’s for i = 1, 2, …, l (and not for i = l+1, …, t).
If there are more 1’s than 0’s in s’i01, then |si| ≤ 2 |si| H0(si).

It follows that:
|s’’01| ≤ 6 Σi=1..t |si| H0(si) + 3|s| / 40
Encoding of s’

For non-zeros: encode one of 1, 2, 3, …, h-1
No more than 2 Σi=1..t |si| H0(si) + 2|s| bits

For each symbol: is it 0 or not? (encode s’01)
No more than 6 Σi=1..t |si| H0(si) + 3|s| / 40 bits

Total (after fixing some inaccuracies): no more than
8 Σi=1..t |si| H0(si) + (2/25)|s| + t(2h log h + 9) bits
Improvement
Use RLE:

bw0RL(s) = arit( rle( mtf( bw(s) ) ) )

Better performance in practice
Better theoretical bound:
|bw0RL(s)| ≤ (5 + 3μ)|s| H*k(s) + gk
(H*k = modified k-th order empirical entropy, μ = the arithmetic-coding overhead, gk = a constant depending on k)
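A simplified stand-in for the run-length step, collapsing each maximal run of zeros in the MTF output into a (0, length) pair — the actual scheme encodes run lengths in binary, so this sketch is only an illustration (the function name is mine):

```python
def rle_zeros(values):
    """Replace each maximal run of zeros in an MTF output list with
    a single (0, run_length) pair; non-zero values pass through."""
    out = []
    i = 0
    while i < len(values):
        if values[i] == 0:
            j = i
            while j < len(values) and values[j] == 0:
                j += 1
            out.append((0, j - i))
            i = j
        else:
            out.append(values[i])
            i += 1
    return out
```

Since the MTF output of a locally homogeneous string is dominated by long zero runs, this step shrinks exactly the part of the output that bwt makes abundant.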
Notes

Compressor implementation: use blocks of text. Sort using one of:
Compact suffix trees (long average LCP)
Suffix arrays (medium average LCP)
General string sorter (short average LCP)
Search in a compressed text: Extract suffix-array from bwt(s).
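Following the earlier note that sorting the matrix rows is equivalent to sorting the suffixes of s^r, bw(s) can also be computed from a (naively built) suffix array of the reversed string; the function name and conventions are mine, and a production implementation would use a linear-time suffix-array construction instead of this O(n² log n) sort:

```python
def bw_via_suffix_array(s):
    """Compute bw(s) by sorting the suffixes of reverse(s)$, which is
    equivalent to the right-to-left sort of the cyclic-shift matrix."""
    n = len(s)
    t = s + "$"
    sr = s[::-1] + "$"
    # naive suffix array of reverse(s)$ (suffix starting positions, sorted)
    sa = sorted(range(n + 1), key=lambda i: sr[i:])
    # the suffix starting at i corresponds to the cyclic shift of s$
    # starting at position n - i, whose first symbol is t[n - i]
    first_col = "".join(t[n - i] for i in sa)
    pos = first_col.index("$")
    return first_col[:pos] + first_col[pos + 1:], pos + 1
```

On the running example this agrees with the matrix-based definition: `bw_via_suffix_array("mississippi")` gives `("msspipissii", 3)`.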
Empirical Results…