Optimal Partitions of Strings: A New Class of Burrows-Wheeler Compression Algorithms

Raffaele Giancarlo, Marinella Sciortino
raffaele@math.unipa.it, mari@math.unipa.it

University of Palermo, ITALY


Outline of the Talk

• The Burrows-Wheeler Transform [BW94]

• abraca → bacraa, 1

• A New Class of Algorithms

• Combinatorial Dependency [BCCFM00, BFG02]

• Lower Bound on Compression Performance

• Conjecture by Manzini [M01]

• Universal Encoding of Integers [L68, E75]

• BWT Compression Algorithms

I → BWT → MTF → H/AC → O


Why BWT is Useful?

INTUITION

Let us consider the effect on a single letter in a common word in a block of English text:

w = … The…the… The… the…those…the…the…that…the…

The characters following th are grouped together inside BWT(w).

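[Figure: excerpt of the sorted matrix with first column F and last column L; the rows for the occurrences of th and Th are adjacent, so the characters that follow them are clustered.]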

Extensive experimental work confirms this “clustering effect” [BW94, F96]
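The transform itself can be sketched in a few lines. This is a minimal, quadratic-time illustration of the standard rotation-sort definition from [BW94], not the construction used in the talk; the function names `bwt` and `inverse_bwt` are hypothetical. With this definition abraca maps to caraab, 1. The pair bacraa, 1 quoted in the outline is what the same function returns on the reversed string, which suggests the talk sorts by left context so that the characters following a context are grouped (an inference, not something the slides state).

```python
def bwt(s: str) -> tuple[str, int]:
    """Return the last column of the sorted rotation matrix of s,
    together with the row index at which s itself appears."""
    n = len(s)
    # Sort rotation start positions by the rotation they generate.
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # The last character of the rotation starting at i is s[i - 1].
    last = "".join(s[i - 1] for i in order)
    return last, order.index(0)


def inverse_bwt(last: str, index: int) -> str:
    """Rebuild the original string from the last column and the row index."""
    n = len(last)
    # A stable sort of positions by character maps each row of the first
    # column to the row of the last column holding the same occurrence.
    to_last = sorted(range(n), key=lambda i: last[i])
    out = []
    row = index
    for _ in range(n):
        row = to_last[row]
        out.append(last[row])
    return "".join(out)


if __name__ == "__main__":
    print(bwt("abraca"))            # standard definition
    print(bwt("abraca"[::-1]))      # same function on the reversed string
    print(inverse_bwt(*bwt("abraca")))
```

A linear-time implementation would use a suffix array instead of sorting explicit rotations; the sketch favours readability.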


Why Useful

• “Clustering” of Symbols and MTF

• Move-To-Front Coding (MTF) [BeSTaWe86]:

Encodes an instance of a character x by an integer that counts the number of distinct symbols seen since the latest occurrence of x.

EXAMPLE abaaaabbbbbcccccaaaaa → 011000100002000020000

• BWT+MTF = many runs of zeroes, good for order-0 encoders

• Relation between compressibility of files and high percentage of zeroes [F96]
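A minimal sketch of Move-To-Front coding as defined above, assuming the recency list starts in alphabet order (a, b, c); the slides do not state the initial order, and the function names are hypothetical. Each symbol is replaced by its current position in the list and then moved to the front, so a run of equal symbols becomes a run of zeroes.

```python
def mtf_encode(text: str, alphabet: str) -> list[int]:
    """Encode text as positions in a recency list (assumed initial order: alphabet)."""
    table = list(alphabet)
    ranks = []
    for ch in text:
        rank = table.index(ch)      # number of distinct symbols seen since ch
        ranks.append(rank)
        table.insert(0, table.pop(rank))  # move ch to the front
    return ranks


def mtf_decode(ranks: list[int], alphabet: str) -> str:
    """Invert mtf_encode by replaying the same list updates."""
    table = list(alphabet)
    out = []
    for rank in ranks:
        ch = table.pop(rank)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


if __name__ == "__main__":
    codes = mtf_encode("abaaaabbbbbcccccaaaaa", "abc")
    print("".join(map(str, codes)))   # 011000100002000020000
    print(mtf_decode(codes, "abc"))
```

On the 21-symbol example string, 16 of the 21 output values are zero, which is the skew that an order-0 coder exploits.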


Two Main Research Questions

• Is MTF an essential step for the successful use of BWT [F96]?

• Experiments [AM97, BK98, WM01];

• Theory ?

• Analysis of the compression performance of BWT-based algorithms.

• Experiments (see DCC)

• Information Theory [Ef99, Sa98]

• Worst Case Setting

• Empirical Entropy of Strings [M01] - No Assumptions


Zero-th Order Empirical Entropy

• $s$ is a string over the alphabet $\Sigma = \{a_1, a_2, \dots, a_h\}$

• $n_i$ is the number of occurrences of $a_i$ in $s$. Assume that $n_i \ge n_{i+1}$

• The zero-th order empirical entropy of $s$:

$$ H_0(s) = -\sum_{i=1}^{h} \frac{n_i}{|s|} \log \frac{n_i}{|s|} $$

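• The zero-th order modified empirical entropy [M01]:

$$ H_0^*(s) = \begin{cases} 0 & \text{if } |s| = 0 \\ (1 + \lfloor \log |s| \rfloor)/|s| & \text{if } |s| \neq 0 \text{ and } H_0(s) = 0 \\ H_0(s) & \text{otherwise} \end{cases} $$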
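Both quantities are easy to compute directly from symbol counts. The sketch below assumes logarithms in base 2 (the slides write only "log") and uses hypothetical function names `h0` and `h0_star`. It shows why the modified entropy matters: for a string such as aaaa the plain entropy is 0, while the modified one still charges the bits needed to write down the length.

```python
import math
from collections import Counter


def h0(s: str) -> float:
    """Zero-th order empirical entropy of s, in bits per symbol (log base 2 assumed)."""
    n = len(s)
    if n == 0:
        return 0.0
    # sum of p * log2(1/p) over the symbols that occur in s
    return sum(c / n * math.log2(n / c) for c in Counter(s).values())


def h0_star(s: str) -> float:
    """Zero-th order modified empirical entropy, following the three-case definition."""
    n = len(s)
    if n == 0:
        return 0.0
    if len(set(s)) == 1:            # H0(s) = 0: a single repeated symbol
        return (1 + math.floor(math.log2(n))) / n
    return h0(s)


if __name__ == "__main__":
    for w in ["aabb", "aaaa", "abracadabra"]:
        print(w, h0(w), h0_star(w))
```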


k-th Order Empirical Entropy

• $\Sigma^k$: the set of all strings of length $k$ over $\Sigma$

• $\Sigma^{\le k}$: the set of all strings of length at most $k$

• Fix an integer $k \ge 0$. For any string $y$ in $\Sigma^k$, $y_s$ is the string consisting of the characters following $y$ in $s$.

• The $k$-th order empirical entropy of $s$ is

$$ H_k(s) = \frac{1}{|s|} \sum_{y \in \Sigma^k} |y_s| \, H_0(y_s) $$

• The $k$-th order modified empirical entropy:

$$ H_k^*(s) = \frac{1}{|s|} \sum_{y \in T_k} |y_s| \, H_0^*(y_s) $$

where $T_k$ denotes a set of strings of length at most $k$ such that each string in $\Sigma^k$ has a unique suffix in $T_k$ and such that, among the sets having this property, $T_k$ is the one minimizing the right-hand side.
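A direct computation of the k-th order empirical entropy, as a sketch under two assumptions: logarithms are in base 2, and each occurrence of a context y contributes the single character that immediately follows it in s. The helper names are hypothetical. The modified variant is not computed here, because it requires minimizing over the candidate sets T_k.

```python
import math
from collections import Counter, defaultdict


def h0(s: str) -> float:
    """Zero-th order empirical entropy in bits per symbol (log base 2 assumed)."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum(c / n * math.log2(n / c) for c in Counter(s).values())


def hk(s: str, k: int) -> float:
    """k-th order empirical entropy: entropy of the followers of each length-k context."""
    if not s:
        return 0.0
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(len(s) - k):
        followers[s[i:i + k]].append(s[i + k])   # character following context y
    total = sum(len(ys) * h0("".join(ys)) for ys in followers.values())
    return total / len(s)


if __name__ == "__main__":
    print(hk("abababab", 0))   # two equiprobable symbols
    print(hk("abababab", 1))   # each symbol fully determines the next one
```

On abababab the order-0 entropy is 1 bit per symbol, while the order-1 entropy is 0, because every context has a single possible follower.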


Results by Manzini

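• Let BW0 be a BWT-based algorithm with an arithmetic coder as zero-th order compressor. Then, $\forall k \ge 0$:

$$ \mathrm{BW0}(s) \le 8|s|H_k(s) + \frac{2}{25}|s| $$

• Let BW0RL be a BWT-based algorithm using run-length encoding with an arithmetic coder as zero-th order compressor. Then, $\forall k \ge 0$, $\exists g'_k \ge 0$ such that

$$ \mathrm{BW0}_{RL}(s) \le (5 + \varepsilon)|s|H_k^*(s) + g'_k $$

where $\varepsilon = 10^{-2}$.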


Insights by Manzini

THEOREM (Manzini): Let $s$ be a string. For each $k \ge 0$, there exists an $f \le h^k$ and a partition $s'_1, s'_2, \dots, s'_f$ of BWT(s) such that

$$ H_k(s) = \frac{1}{|s|} \sum_{i=1}^{f} |s'_i| \, H_0(s'_i) $$

An analogous result holds for $H_k^*(s)$.

REMARK: If there existed an ideal compressor A such that, for any partition $s_1, s_2, \dots, s_p$ of a string $s$,

$$ A(s) \le \sum_{i=1}^{p} |s_i| \, H_0(s_i) $$

then $A(\mathrm{BWT}(s)) \le |s|H_k(s)$. Analogously for $H_k^*(s)$.
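The step behind the remark can be written out in one line. This is a sketch of the reasoning, assuming A satisfies the stated inequality for every partition: apply it to BWT(s) with the particular partition $s'_1, \dots, s'_f$ supplied by Manzini's theorem.

```latex
A(\mathrm{BWT}(s))
  \;\le\; \sum_{i=1}^{f} |s'_i| \, H_0(s'_i)
  \;=\; |s| \cdot \frac{1}{|s|} \sum_{i=1}^{f} |s'_i| \, H_0(s'_i)
  \;=\; |s| \, H_k(s).
```

The inequality is the defining property of A, and the last equality is the theorem itself.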


Open Problems by Manzini

• Conjectures by Manzini:

• No BWT-compression method can get to a bound of the form $|s|H_k^*(s) + g_k$ for $k \ge 0$ and $g_k \ge 0$ constant.

• The ideal algorithm A does not exist.

We show that A does not exist, but we can approximate it.

So, we prove that both conjectures are true.


Our Contributions

• We provide a new class of BWT-based algorithms, based on partitions of strings, that do not use MTF as part of the compression process.

• We analyze two of those new methods in the worst-case setting. We obtain better theoretical bounds than Manzini:

$$ \mathrm{BW}_{CD}(s) \le |s|H_k(s) + |s|, \quad \forall k \ge 0 $$

$$ \forall k \ge 0,\ \exists g_k \ge 0 \text{ such that } \mathrm{BW}_{CD,RL}(s) \le 2.5\,|s|H_k^*(s) + g_k $$

• Under a natural hypothesis on the inner working of the algorithm, no BWT-compressor using that type of algorithm can achieve

$$ |s|H_k^*(s) + g_k $$


Algorithms That Use Optimal Partitions of Strings (rather than MTF)

• Compute BWT(s);

• Optimally partition the transformed string with respect to a suitable cost function;

• Compress each piece separately.

[Figure: the transformed string cut into pieces, with # marking the cuts.]


Combinatorial Dependency

• Techniques by Buchsbaum et al. [BCCFM00, BFG02] for Table Compression.

Surprisingly, it specializes to strings.

Fix a data compressor C that adds a special end-of-string marker # before compressing the string.

DEFINITION: Two strings x and y are combinatorially dependent with respect to the data compressor C if |C(xy#)| < |C(x#)| + |C(y#)|.

OPTIMAL PARTITION IN TERMS OF THE BASE COMPRESSOR C: By Dynamic Programming

$$ E[j] = \min_{0 \le k < j} \left( E[k] + |C(s[k+1, j]\#)| \right) $$
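The recurrence can be sketched as follows, with two assumptions that are not from the slides: the base case E[0] = 0, and zlib as a stand-in for the base compressor C (the talk uses its own Huffman-based compressors). All function names are hypothetical. The 1-indexed piece s[k+1, j] corresponds to the Python slice `s[k:j]`.

```python
import zlib


def zlib_cost(piece: bytes) -> int:
    """|C(piece#)| in bytes, with zlib standing in for the base compressor C."""
    return len(zlib.compress(piece + b"#"))


def combinatorially_dependent(x: bytes, y: bytes, cost=zlib_cost) -> bool:
    """True if compressing xy together is cheaper than compressing x and y apart."""
    return cost(x + y) < cost(x) + cost(y)


def optimal_partition(s: bytes, cost=zlib_cost):
    """Return (minimum total cost, pieces) over all partitions of s.

    E[j] is the cheapest way to encode the prefix s[:j]; cut[j] remembers
    where the last piece of that optimal encoding starts.
    """
    n = len(s)
    E = [0] + [float("inf")] * n      # assumed base case: E[0] = 0
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for k in range(j):            # 0 <= k < j
            candidate = E[k] + cost(s[k:j])
            if candidate < E[j]:
                E[j], cut[j] = candidate, k
    pieces = []
    j = n
    while j > 0:                      # walk the cuts back to the start
        pieces.append(s[cut[j]:j])
        j = cut[j]
    return E[n], pieces[::-1]


if __name__ == "__main__":
    total, pieces = optimal_partition(b"aaaaaaaabbbbbbbbcccccccc")
    print(total, pieces)
```

The double loop makes a quadratic number of calls to the cost function, so the running time is dominated by the base compressor.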


The new class BWTOPT

Given the input string s

1. Compute BWT(s);

2. Optimally partition BWT(s) using C as the base compressor;

3. Compress each piece of the partition separately.

TIME COMPLEXITY of BWTOPT: It depends critically on that of C and it is $\Omega(n^2)$. Fortunately, if C has a linear-time decompression algorithm, then BWTOPT also admits a linear-time decompression algorithm.

ASSUMPTIONS: Let C be a data compressor such that:

• given an input string x, C adds a special end-of-string marker # and compresses x#;

• either # is really appended at the end of the string or the length of x is explicitly stored as a prefix of the compressed string (universal encoding of integer).


Lower Bound

ASSUMPTIONS:

• Given a compressor C, we assume that $\{C(a^n) \mid n > 0\}$ is a codeword set for the integers.

• For technical reasons we also assume that $|C(a^n)|$ is a non-decreasing function of $n$.

The lower bound comes from a theorem in [Levenshtein, 1968], which we restate in our notation:

THEOREM There exists a countable number of strings $s$ such that

$$ |C(s)| \ge |s|H_k^*(s) + f(|s|) $$

where $f(n)$ is a diverging function of $n$.

COROLLARY No compression algorithm satisfying the previous assumptions can achieve the bound formulated in the conjecture by Manzini, i.e. $|s|H_k^*(s) + g_k$ for $k \ge 0$ and $g_k \ge 0$ constant.

Such a result holds independently of whether or not BWT is applied as a preprocessing step.
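To make the first assumption concrete: compressing a^n amounts to writing down the integer n with a uniquely decodable code. The sketch below uses the Elias gamma code, a standard universal code for the positive integers; choosing it here is an assumption for illustration, since the slides do not say which code a given compressor induces. The function names are hypothetical. Its codeword length, 2⌊log2 n⌋ + 1 bits, grows without bound in n, which gives some intuition for why no constant additive term can suffice.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma codeword of a positive integer, as a string of bits."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]                      # binary expansion, leading bit is 1
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then the bits


def elias_gamma_decode_all(bits: str) -> list[int]:
    """Decode a concatenation of gamma codewords (shows the code is prefix-free)."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


if __name__ == "__main__":
    for n in (1, 2, 5, 17, 1000):
        code = elias_gamma(n)
        print(n, code, len(code))
```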


A prefix code compressor HC

• # is an end-of-string marker

• The base compressor C is a modification of Huffman encoding so that we can encode # basically for free.

THEOREM Consider a string $s$. Let $p_1, p_2, \dots, p_h$ be the empirical probability distribution of $s$. Then

$$ |HC(s\#)| \le |s|H_k(s) + |s| + O(h^k \log h), \quad k \ge 0 $$


A compressor RHC based on Prefix and Run Length Encoding

• It combines Huffman encoding with run-length encoding.

• It uses knowledge about the symbol frequencies in a string. For low-entropy strings it is essential to use RLE.

• The RLE scheme we use depends critically on a variable-length encoding of a sequence of integers.

THEOREM Consider a string $s$. Let $p_1, p_2, \dots, p_h$ be the empirical probability distribution of $s$. Then

$$ |RHC(s\#)| \le 2.5\,|s|H_k^*(s) + O(h^k \log h), \quad k \ge 0 $$

The solution we propose works well in conjunction with CD, where the strings we need to compress may consist of only a few symbols.

PROBLEM Given two positive integers t and w, t<w, and the increasing sequence of integers d1,d2,…,dt in [1,w], find an algorithm to produce a binary encoding of d1,d2,…,dt and w.