Compression Algorithms


Ecole Polytechnique Federale de Lausanne

Master Semester Project

Compression Algorithms

Author: Ludovic Favre

Supervisor: Ghid Maatouk

Professor: Amin Shokrollahi

June 11, 2010


Contents

1 Theory for Data Compression
  1.1 Model
  1.2 Entropy
  1.3 Source Coding
    1.3.1 Bound on the optimal code length
    1.3.2 Other properties

2 Source Coding Algorithms
  2.1 Huffman coding
    2.1.1 History
    2.1.2 Description
    2.1.3 Optimality
  2.2 Arithmetic coding

3 Adaptive Dictionary techniques: Lempel-Ziv
  3.1 History
  3.2 LZ77
    3.2.1 LZ77 encoding and decoding
    3.2.2 Performance discussion
  3.3 LZ78
    3.3.1 LZ78 encoding and decoding
    3.3.2 Optimality
  3.4 Improvements for LZ77 and LZ78

4 Burrows-Wheeler Transform
  4.1 History
  4.2 Description
    4.2.1 Encoding
    4.2.2 Decoding
    4.2.3 Why it compresses well
  4.3 Algorithms used in combination with BWT
    4.3.1 Run-length encoding
    4.3.2 Move-to-front encoding

5 Implementation
  5.1 LZ78
    5.1.1 Code details
  5.2 Burrows Wheeler Transform
  5.3 Huffman coding
    5.3.1 Binary input and output
    5.3.2 Huffman implementation
  5.4 Move-to-front
  5.5 Run-length encoding
  5.6 Overview of source files

6 Practical Results
  6.1 Benchmark files
    6.1.1 Notions used for comparison
    6.1.2 Other remarks
  6.2 Lempel-Ziv 78
    6.2.1 Lempel-Ziv 78 with dictionary reset
    6.2.2 Comparison between with and without dictionary reset version
    6.2.3 Comparison between my LZ78 implementation and GZIP
  6.3 Burrows-Wheeler Transform
    6.3.1 Comparison of BWT schemes to LZ78
    6.3.2 Influence of the block size
    6.3.3 Comparison between my optimal BWT method and BZIP2
  6.4 Global comparison

7 Supplementary Material
  7.1 Using the program
    7.1.1 License
    7.1.2 Building the program
  7.2 Collected data
    7.2.1 Scripts
    7.2.2 Spreadsheets
    7.2.3 Repository

8 Conclusion


Introduction

With the increasing amount of data traveling over various means, such as wireless network links from mobile phones to servers, lossless data compression has become an important factor in optimizing spectrum utilization.

As a computer science student with an interest in domains like computational biology, data processing mattered to me as a way to understand how to handle the large amounts of data coming from high-throughput sequencing technologies. Since I am interested both in algorithms and in concrete data processing, getting in touch with data compression techniques immediately appealed to me.

During this semester project, it was decided to focus on two lossless compression algorithms to allow some comparison. The algorithms chosen for implementation were Lempel-Ziv 78 and the more recent Burrows-Wheeler Transform; the latter enables well-known techniques such as run-length encoding, move-to-front and Huffman coding to easily outperform, in most situations, the more complicated Lempel-Ziv-based techniques. These two compression techniques take very different approaches to compressing data and are commonly used, for example, in the GZip¹ and BZip2² software.

In this report, I will first introduce some information theory material and present the theoretical part of the project, in which I learnt how popular compression techniques attempt to reduce the size required for heterogeneous types of data. The subsequent chapters will then detail my practical work during the semester and present the implementation I have done in C/C++. Finally, the last two chapters consist of the results obtained on the famous Calgary Corpus benchmark files, where I will highlight the differences in performance and explain the choices made in actual compression software.

¹ http://www.gnu.org/software/gzip (May 31. 2010)
² http://www.bzip.org (May 31. 2010)


Chapter 1

Theory for Data Compression

1.1 Model

The general model used for the theoretical part is the First-Order Model [1, theory]: in this model, the symbols are independent of one another, and the probability distribution of the symbols is determined by the source X. We will consider X as a discrete random variable with alphabet A¹. We will also assume that there is a probability mass function p(x) over A. We also denote a finite sequence of length n by Xⁿ.

1.2 Entropy

In information theory, the concept of entropy is due to Claude Shannon in 1948². It is used to quantify the minimal average number of bits required to encode a source X.

Definition 1. The entropy H(X) of a discrete random variable X is defined as

H(X) = −∑_{x∈A} p(x) log p(x)

where p(x) is the probability mass function for x ∈ A to be encountered.
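As a small concrete illustration (not part of the project sources; the function name entropy is made up for this example), the following C++ sketch estimates p(x) from symbol frequencies and evaluates Definition 1 in base 2:

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Empirical entropy (in bits per symbol) of a byte sequence:
// estimate p(x) by relative frequencies and apply Definition 1.
double entropy(const std::vector<unsigned char>& data) {
    std::size_t counts[256] = {0};
    for (unsigned char c : data) ++counts[c];
    double h = 0.0;
    for (std::size_t i = 0; i < 256; ++i) {
        if (counts[i] == 0) continue;                       // 0 * log 0 is taken as 0
        double p = static_cast<double>(counts[i]) / data.size();
        h -= p * std::log2(p);
    }
    return h;
}

int main() {
    std::vector<unsigned char> data = {'a', 'b', 'b', 'a', 'c', 'c', 'c', 'c'};
    std::printf("H = %.3f bits/symbol\n", entropy(data)); // p = {1/4, 1/4, 1/2} -> 1.5 bits
    return 0;
}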

1.3 Source Coding

1.3.1 Bound on the optimal code length

Before giving a bound for the entropy, we have to introduce some definitions. The first definition introduces the notions of codeword and binary code.

Definition 2. A binary code C for the random variable X is a mapping from A to finite binary strings. We denote by C(x) the codeword mapped to x ∈ A and let l(x) be the length of C(x).

Moreover, a property that is often wanted for a binary code is to be instantaneous:

¹ The alphabet A of X is the set of all possible symbols X can output.
² http://en.wikipedia.org/wiki/Entropy_(information_theory)


Definition 3. A code is said to be instantaneous (or prefix-free) if no codeword is a prefixof any other codeword.

The property of an instantaneous code is quite interesting since it permits transmitting the codewords for multiple input symbols x_1, x_2, x_3, · · · by simply concatenating the codewords C(x_1)C(x_2)C(x_3) · · · while still being able to decode x_i instantly after C(x_i) has been received. Another definition required for the entropy bound theorem concerns the expected length of a binary code C.

Definition 4. Given a binary code C, the expected length for C is given by

L(C) = ∑_{x∈A} p(x) l(x)

Finally, the Kraft inequality connects the instantaneous property of a code to the codeword lengths.

The Kraft inequality

The theorem formalizing the Kraft inequality is given below:

Theorem 1. (Kraft inequality [7, p.107]) For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths l(x_1), l(x_2), · · · , l(x_m) must satisfy the inequality

∑_i D^{−l(x_i)} ≤ 1

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.

The Kraft inequality theorem will not be proven here. The proof can be found in [7, pp.107-109].

We are now able to give the theorem for the entropy bound on the expected length of a binary code C.

Theorem 2. The expected length of an optimal instantaneous code C satisfies the following double inequality

H(X) ≤ L(C) ≤ H(X) + 1

Proof. The proof will take place in two phases:

1. We will first prove the upper bound, that is, L(C) ≤ H(X) + 1. We choose an integer word-length assignment for the word x_i:

l(x_i) = ⌈log_D (1/p(x_i))⌉

These lengths satisfy the Kraft inequality because

∑_i D^{−⌈log_D (1/p(x_i))⌉} ≤ ∑_i D^{−log_D (1/p(x_i))} = ∑_i p(x_i) = 1


hence, by Theorem 1, there exists an instantaneous code with these word lengths. The upper bound is then obtained as follows:

∑_{x∈A} p(x) l(x) = ∑_{x∈A} p(x) ⌈log_D (1/p(x))⌉ ≤ ∑_{x∈A} p(x) (log_D (1/p(x)) + 1) = H(X) + 1

which proves the upper bound.

2. The lower bound is obtained as follows: by our word-length assignment, we can deduce that

log_D (1/p(x_i)) ≤ l(x_i)

and therefore we obtain

L(C) = ∑_{x∈A} p(x) l(x) ≥ ∑_{x∈A} p(x) log_D (1/p(x)) = H(X)

which proves the two inequality parts from Theorem 2.

1.3.2 Other properties

The entropy can also be used to qualify multiple sources (random variables). For suchcases, we use the joint entropy.

Definition 5. The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as [7, p.16]

H(X, Y) = −∑_{(x,y)} p(x, y) log p(x, y)

It is also possible to use the conditional entropy:

Definition 6. The conditional entropy H(Y|X) is defined as

H(Y|X) = ∑_x p(x) H(Y|X = x)
       = −∑_x p(x) ∑_y p(y|x) log p(y|x)
       = −∑_y ∑_x p(x, y) log p(y|x)


Theorem 3. From the previous definitions, we obtain the following theorem:

H(X,Y ) = H(X) +H(Y |X) = H(Y ) +H(X|Y )

Proof. Using the previously seen definitions and properties, proving Theorem 3 is simply a matter of developing the formulas.

H(X, Y) = −∑_x ∑_y p(x, y) log p(x, y)
        = −∑_x ∑_y p(x, y) log p(x)p(y|x)
        = −∑_x ∑_y p(x, y) log p(x) − ∑_x ∑_y p(x, y) log p(y|x)
        = −∑_x p(x) log p(x) − ∑_x ∑_y p(x, y) log p(y|x)
        = H(X) + H(Y|X)

The proof is similar for the second part of the equality.

In the next chapters, the presented notions will be used to introduce the optimality of the Huffman and Lempel-Ziv algorithms.


Chapter 2

Source Coding Algorithms

2.1 Huffman coding

2.1.1 History

Huffman coding was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes"¹. The idea behind his technique is to represent symbols that have higher probability with fewer bits than symbols with lower probability.

2.1.2 Description

The entropy coding problem can be defined as follows:

Definition 7. Entropy coding problem: given a set of symbols and their corresponding probabilities, find a prefix-free binary code so that the expected codeword length is minimized.

Formally, we have:

A = {x_1, x_2, · · · , x_n}, the alphabet of size n,

P = {p_1, p_2, · · · , p_n}, the corresponding symbol probabilities, with ∑_{i=1}^{n} p_i = 1,

and we want to find:

C = {c_1, c_2, · · · , c_n} such that L(C) = ∑_{i=1}^{n} l(c_i) p_i is minimized.

As most of the readers are probably familiar with Huffman coding, I will just present a quick review of it in Algorithm 1:

1http://en.wikipedia.org/wiki/Huffman_coding (April 1st 2010)


Algorithm 1 Huffman coding
1: Q ← priority queue containing nodes, where the node with the lowest probability is given the highest priority
2: Create a node for each symbol and add it to Q
3: while Q.size > 1 do
4:   n1 ← Q.head
5:   Q.removeHead
6:   n2 ← Q.head
7:   Q.removeHead
8:   Create a new internal node n3 with probability equal to the sum of n1's and n2's probabilities and n1, n2 as left and right child respectively
9:   Q.add(n3)
10: end while
11: Traverse the tree from the root, assigning different bits to the left and right child, and propagate to the leaves every time

Figure 2.1 illustrates Algorithm 1 graphically.

Figure 2.1: Illustration of a Huffman coding tree
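As an additional illustration of Algorithm 1, here is a compact, self-contained C++ sketch (independent of the project sources: the names Node, ByProb, buildCodes and huffman are made up for this example) that merges the two least probable nodes until a single root remains, then derives the codewords by traversal:

#include <map>
#include <memory>
#include <queue>
#include <string>
#include <vector>

struct Node {
    double prob;
    unsigned char symbol;              // meaningful only for leaves
    Node* left;
    Node* right;
};

struct ByProb {                        // lowest probability = highest priority
    bool operator()(const Node* a, const Node* b) const { return a->prob > b->prob; }
};

// Walk the finished tree, assigning '0' to the left branch and '1' to the right.
void buildCodes(const Node* n, const std::string& prefix,
                std::map<unsigned char, std::string>& codes) {
    if (!n->left && !n->right) { codes[n->symbol] = prefix.empty() ? "0" : prefix; return; }
    buildCodes(n->left,  prefix + "0", codes);
    buildCodes(n->right, prefix + "1", codes);
}

// Assumes a non-empty probability map.
std::map<unsigned char, std::string>
huffman(const std::map<unsigned char, double>& probs) {
    std::vector<std::unique_ptr<Node>> pool;   // owns every node
    std::priority_queue<Node*, std::vector<Node*>, ByProb> q;
    for (const auto& kv : probs) {
        pool.push_back(std::unique_ptr<Node>(new Node{kv.second, kv.first, nullptr, nullptr}));
        q.push(pool.back().get());
    }
    while (q.size() > 1) {                     // merge the two least likely nodes (lines 3-10)
        Node* n1 = q.top(); q.pop();
        Node* n2 = q.top(); q.pop();
        pool.push_back(std::unique_ptr<Node>(new Node{n1->prob + n2->prob, 0, n1, n2}));
        q.push(pool.back().get());
    }
    std::map<unsigned char, std::string> codes;
    buildCodes(q.top(), "", codes);            // line 11 of Algorithm 1
    return codes;
}

In the project implementation described later (Section 5.3.2), the traversal step of line 11 is instead folded into the merging loop, so no separate pass over the tree is needed.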

2.1.3 Optimality

The optimality of Huffman codes can be proved by induction [7, pp.123-127]. It has to be kept in mind that there exist many optimal codes: inverting all bits of a given optimal code, for example, yields another optimal code.

We know that a code is optimal if ∑_{i=1}^{n} p_i l(c_i) is minimal.

Lemma 1. For any distribution, there is an optimal instantaneous code (with minimum expected length) that satisfies the following properties:


1. The lengths are ordered inversely with the probabilities: pj > pk ⇒ l(cj) ≤ l(ck).

2. The two longest codewords have the same length.

3. Two of the longest codewords differ only in the last bit and correspond to the two least likely symbols.

Proof. I will not give the complete proof but rather the insight behind the formal proof, which is based on contradiction.

1. If this is not the case, then exchanging the two codewords c_j and c_k will diminish the sum ∑_{i=1}^{n} p_i l(c_i), giving a better code.

2. If the two longest codewords do not have the same length, then, using the fact that the code is prefix-free, we can delete the last bit of the longest codeword and achieve a better expected codeword length.

3. If a maximal-length codeword has no sibling (a codeword differing from it only in the last bit), then, since the code is prefix-free, its last bit can be removed without altering the prefix property, yielding a better code and contradicting the optimality hypothesis. Therefore every maximal-length codeword in any optimal code has a sibling. Now, the two longest codewords that differ only in the last bit can be assumed to correspond to the two least likely symbols; otherwise, exchanging codewords would yield a better code, again contradicting the optimality hypothesis.

Proof of the optimality

A code satisfying Lemma 1 is called a canonical code. To prove the optimality of the Huffman code we can observe that at each step of the Huffman algorithm, Lemma 1 is verified. Using induction and Lemma 1, we can show that Huffman coding is optimal (for the formal proof, you can refer to [7, pp.125-127]).

2.2 Arithmetic coding

With Huffman coding, we have seen a popular method for producing variable-length codes. The main advantage of Huffman codes is the simplicity of producing them, as they are built from a tree.

Arithmetic coding takes a completely different approach to producing variable-length codes. It is especially useful when dealing with sources with a small alphabet, such as binary sources, and with alphabets with highly skewed probabilities [10, p.81]. An important fact about arithmetic coding is that the number of patents covering it influenced BZip and the JPEG file format to use Huffman coding instead². Since I did not use arithmetic coding in the implementation part of the project, I will not present the details of this technique.

2http://en.wikipedia.org/wiki/Arithmetic_coding#US_patents_on_arithmetic_coding (June 9. 2010)


Chapter 3

Adaptive Dictionary techniques: Lempel-Ziv

3.1 History

The Lempel-Ziv compression algorithms are due to Abraham Lempel and Jacob Ziv. They published two lossless data compression algorithms in papers in 1977 and 1978 (LZ77 and LZ78 are the respective abbreviations for these two algorithms). In those two papers, two different approaches are provided to build adaptive dictionaries. I will briefly present the LZ77 algorithm and then detail more precisely the LZ78 method, which I implemented for my project.

3.2 LZ77

In LZ77, the dictionary consists of a portion of the previously encoded sequence. The algorithm uses a sliding window made of two parts [10, p.123]: the search buffer and the look-ahead buffer. Those two buffers are illustrated in Figure 3.1.

Figure 3.1: Representation of the search buffer and look-ahead buffer with the match pointer

The search buffer is used as a dictionary and belongs to the previously encoded sequence. The look-ahead buffer contains the sequence to encode. The offset is the distance from the look-ahead buffer to the match pointer. The length of the match is the number of consecutive symbols in the search buffer that match the same consecutive symbols in the look-ahead buffer.


3.2.1 LZ77 encoding and decoding

We now look at the way LZ77 encodes (and compresses) a sequence. Suppose it has a look-ahead buffer of size six and a search buffer of size seven. At each slide of the window, one of the following situations can be encountered by the LZ77 algorithm (with the corresponding output):

• no match is found : < 0, 0, c >

• a match is found : < o, l, c >

• the match is extended inside the look-ahead buffer: < o, l, c >

where o is the offset to the match pointer, l is the length of the match and c is the code of the symbol following the matched string in the look-ahead buffer. Note that the third case is just a special case of the second.

For example, if we look at Figure 3.2, we can see that o = 7, l = 4 and c = code(r). Therefore, the transmitted triple is < 7, 4, code(r) >.

Figure 3.2: A step of the encoding

The decoding of the triple follows logically from the encoding procedure. The output sequence is produced by construction, based on the elements previously extracted to the output sequence. An illustration of the decoder based on the previous triple is given in Figure 3.3.


Figure 3.3: Decoding the next elements of the sequence with the received triple < 7, 4, code(r) >

For more details about LZ77, you can have a look at [10, pp.121-125].
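To make the triple output concrete, here is a deliberately brute-force C++ sketch of the encoder loop (illustrative only: the Triple type, the function name and the default buffer sizes are chosen for this example, the sizes matching those used in the text):

#include <cstddef>
#include <string>
#include <vector>

struct Triple { std::size_t offset; std::size_t length; char next; };

// Brute-force LZ77 encoder: for every position, scan the search buffer for the
// longest match, then emit <offset, length, next symbol>.
std::vector<Triple> lz77Encode(const std::string& in,
                               std::size_t searchSize = 7,
                               std::size_t lookAheadSize = 6) {
    std::vector<Triple> out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        std::size_t windowStart = (pos > searchSize) ? pos - searchSize : 0;
        std::size_t bestLen = 0, bestOffset = 0;
        for (std::size_t s = windowStart; s < pos; ++s) {
            std::size_t len = 0;
            // The match may extend into the look-ahead buffer (third case in the
            // list above); one symbol is always kept to be transmitted literally.
            while (len < lookAheadSize - 1 && pos + len < in.size() - 1 &&
                   in[s + len] == in[pos + len])
                ++len;
            if (len > bestLen) { bestLen = len; bestOffset = pos - s; }
        }
        out.push_back({bestOffset, bestLen, in[pos + bestLen]}); // <0,0,c> when no match
        pos += bestLen + 1;
    }
    return out;
}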

3.2.2 Performance discussion

The most costly part of this method is finding the longest match. It has been proven by A. D. Wyner and J. Ziv in [12] that LZ77 is asymptotically optimal. LZ77 is for example used within GZip¹. The implicit assumption of the LZ77 approach is that exact pattern copies will occur close to each other. This assumption can be a drawback if two pattern copies are not located within the sliding window. However, the sliding window is usually big, and such problems usually do not occur often enough to have a significant impact on the quality of the compression. I will now present how the LZ78 approach tries to solve the problem of patterns not being located within the same window.

3.3 LZ78

The LZ78 approach solves the above LZ77 problem by dropping the concept of a local search window and using an incrementally built dictionary.

3.3.1 LZ78 encoding and decoding

The way the dictionary is built during the encoding is expressed in Algorithm 2.

¹ http://www.gnu.org/software/gzip (May 31. 2010)


Algorithm 2 LZ78
1: B ← empty buffer of symbols
2: D ← empty dictionary
3: while not End-Of-File do
4:   S ← the next symbol from the input stream
5:   I ← index of B in D
6:   B.append(S)
7:   if B is not in D then
8:     Add B to the end of the dictionary
9:     Output the two symbols < I, S >
10:    B ← empty buffer
11:  end if
12: end while
13: if B is not empty then
14:   I ← index of B without its last symbol in D
15:   Output the two symbols < I, S >
16: end if

Note that the buffer has to be handled carefully (this is important during the implementation) so that we do not forget symbols at the end of the file (see line 13 in Algorithm 2).

Figure 3.4: Encoding with LZ78. The shown state of the dictionary is the one reached after encoding the given input

You can see an example of LZ78 encoding in Figure 3.4. The decoding process is quite similar and should not be too difficult to derive after having seen how a sequence is encoded. If you want more details about the decoding phase, you may have a look at [10, pp.125-127].
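The following self-contained C++ sketch mirrors Algorithm 2 (illustrative only and much simpler than the project's LZ78.cpp: it keeps the dictionary in a std::map of strings and never flushes it):

#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One output token of Algorithm 2: <index of the longest known prefix, next symbol>.
// Index 0 stands for the empty prefix.
using Pair = std::pair<std::size_t, char>;

std::vector<Pair> lz78Encode(const std::string& in) {
    std::map<std::string, std::size_t> dict;   // D in Algorithm 2: phrase -> index (1-based)
    std::vector<Pair> out;
    std::string buffer;                        // B in Algorithm 2
    for (char s : in) {
        // I: index of the current buffer before appending S (it is always in the dictionary).
        std::size_t prefixIndex = buffer.empty() ? 0 : dict.at(buffer);
        buffer.push_back(s);
        if (dict.find(buffer) == dict.end()) { // new phrase: store it and emit <I, S>
            std::size_t newIndex = dict.size() + 1;
            dict[buffer] = newIndex;
            out.push_back({prefixIndex, s});
            buffer.clear();
        }
    }
    if (!buffer.empty()) {                     // leftover phrase at end of file (line 13 of Algorithm 2)
        std::string prefix = buffer.substr(0, buffer.size() - 1);
        out.push_back({prefix.empty() ? 0 : dict.at(prefix), buffer.back()});
    }
    return out;
}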

Note that the pure algorithmic description does not cover implementation details such as the dictionary growing without bound. This is usually where problems are encountered or decisions have to be made; for more details, you can refer to the implementation chapter. This algorithm is particularly simple to understand and, because of its speed and efficiency, it has become one of the standard algorithms for file compression in current compression software, like LZ77 has in GZip. We will now have a look at the optimality of the LZ78 algorithm.

3.3.2 Optimality

The complete optimality proof can be found in [7, pp.448-455]. Note that I will not go into the complete proof because of its size (it would take at least ten pages of this report). I will rather highlight the main ideas of the proof.


We first start with a definition [7, p.448] for a parsing of a string.

Definition 8. A parsing S of a binary string x1x2 · · · xn is a division of the string into phrases, separated by commas. A distinct parsing is a parsing such that no two phrases are identical. For example, 0, 111, 1 is a distinct parsing of 01111, while 0, 11, 11 is not.

We now denote by c(n) the number of phrases in the LZ78 parsing of a sequence Xⁿ of length n. After applying the Lempel-Ziv algorithm, the compressed sequence consists of c(n) pairs < p, b > of numbers. Each pair is made of a pointer p to the previous occurrence of the phrase prefix and the last bit b of the phrase. Each pointer p requires log c(n) bits. Thus the total length of the compressed sequence is c(n) · (log c(n) + 1) bits.

The goal of the proof is to show that c(n) · (log c(n) + 1)/n → H(X), i.e. LZ78 is bounded by the entropy rate with probability 1 for large n (asymptotic optimality).

The first part of the proof highlights the fact that the number of phrases in a distinct parsing of a sequence is less than n/log n, arguing that there are not enough distinct short phrases.

Then, the second idea is to bound the probability of a sequence based on the number of distinct phrases. As an example, consider an independent and identically-distributed sequence of four random variables taking four possible values with distinct probabilities: the probability of a specific sequence containing four distinct values is maximized (at 1/256) when the probabilities are equal (with value 1/4). On the other hand, if the sequence is made of two values each appearing twice, its probability is maximized (at 1/16) when the two values have probability 1/2 each. This illustrates the following point: sequences with a large number of distinct symbols or phrases cannot have a large probability. This idea is used in Ziv's inequality [7, p.452].

Finally, since the description length of a sequence after the parsing grows as c · log c, sequences that have very few distinct phrases can be compressed efficiently and correspond to strings that may have a high probability. On the other hand, by Ziv's inequality, the probability of sequences that have a large number of distinct phrases (and do not compress well) cannot be too large. Starting with Ziv's inequality, it is possible, by connecting the logarithm of the probability of the sequence with the number of phrases in its parsing, to prove the theorem stating that LZ78 is asymptotically optimal.

3.4 Improvements for LZ77 and LZ78

There are a number of ways to modify the LZ77/78 algorithms, and the most obvious modifications one may imagine to improve them have probably already been tried. A well-known modification of LZ78 is the one proposed by Terry Welch [10, p.127], known as LZW [11].

LZW

The idea in LZW is to remove the need for the second element in the pair < I, S >. The main idea is to start with a dictionary that is pre-filled with all possible symbols and to only send the dictionary index to the output. I will not go further into the description of LZW since it is not used in the implementation of my project.

The UNIX compress utility uses, for example, the LZW approach [10, p.133]. It has an adaptive dictionary size, starting at 512 entries and doubling every time the dictionary is full, enabling reasonable bit usage. In this way, during the earlier part of the coding process, when the strings are still short, the codewords used to encode them are also short. The compress utility also monitors the compression ratio: when it falls below a threshold, the dictionary is flushed and the dictionary building process is restarted, permitting the dictionary to reflect the local characteristics of the source.

Other improvements

Other kinds of improvements are still actively used in current LZ77/78 implementations. Most of them use clever methods to pre-fill the dictionary² or to allow the algorithm to monitor its performance during the running compression and take an appropriate action³. Interest in LZ77/78-based compression techniques is still present: most of those improvements are protected by patents, which may require you to pay the author of the method in order to use it in your software. Therefore, using an efficient LZ77/78-based algorithm implementation nowadays may require you either to invent and patent a new technique or to use an existing modification and pay fees. Another possibility is to use a method whose patent has expired, which will likely be the case for LZ77/78 because of the time elapsed since the introduction of these algorithms.

² http://www.freepatentsonline.com/5951623.html
³ http://www.freepatentsonline.com/5243341.html


Chapter 4

Burrows-Wheeler Transform

4.1 History

The Burrows-Wheeler transform (denoted BWT, also called block-sorting compression) was invented by Michael Burrows and David Wheeler in 1994 [6] while they were working at the DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983¹.

4.2 Description

The description of the encoding and decoding phases is taken from the original paper by Michael Burrows and David Wheeler [6].

4.2.1 Encoding

The most convenient way to describe the Burrows-Wheeler transform is through an example. Suppose we want to perform the BWT on the string mississippi$. The first step is to generate all the possible circular permutations of this string and place them in a matrix as shown in Figure 4.1.

Figure 4.1: The matrix of circular permutations for the string mississippi$

The main step of BWT then consists of sorting the rows of the matrix in alphabetical order as in Figure 4.2(a). Once sorted, you can see that the original string is now located at the 6th row of the matrix.

1http://en.wikipedia.org/wiki/Burrows-Wheeler_transform (April 20. 2010)



Figure 4.2: Sorting of the rows, and output illustration for BWT

The sorting phase is in fact the most expensive phase of BWT. After that, the output of BWT is simply the last column of the matrix together with the position of the original string, as illustrated in Figure 4.2(b); here, that is ipssm$pissii and position 6.
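A naive but faithful C++ sketch of this encoding step is given below (illustrative only: it materializes every rotation explicitly, which is exactly the memory problem discussed later in the implementation chapter):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct BwtResult { std::string lastColumn; std::size_t originalRow; };

// Naive BWT: build every circular rotation, sort them, keep the last column
// and the row at which the original string ends up.
BwtResult bwtEncode(const std::string& block) {
    std::size_t n = block.size();
    std::vector<std::string> rotations(n);
    for (std::size_t i = 0; i < n; ++i)
        rotations[i] = block.substr(i) + block.substr(0, i);
    std::sort(rotations.begin(), rotations.end());
    BwtResult res;
    res.originalRow =
        std::find(rotations.begin(), rotations.end(), block) - rotations.begin();
    for (const std::string& r : rotations)
        res.lastColumn.push_back(r.back());
    return res;
}

// bwtEncode("mississippi$") yields lastColumn == "ipssm$pissii" and
// originalRow == 5 (0-based), i.e. row 6 when counting from 1 as in Figure 4.2.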

4.2.2 Decoding

Decoding the string L = ipssm$pissii (with position 6) may seem almost impossible at first sight. However, the technique to decode it is very easy to understand and faster than the one used for encoding. We initialize a matrix whose last column is equal to L. The first column F of the matrix can also easily be computed by sorting the symbols in L and placing them in alphabetical order, as shown in Figure 4.3.

Figure 4.3: Initialization of the first and last column of the matrix and retrieval of the first two symbols

After that, the original string (located at row 6) is reconstructed using a simple iterative method. We start at the provided original string row and then proceed as follows: if the current row is j and F[j] is the kth instance of symbol c in F, then the row i to consider at the next step is the one where L[i] is the kth instance of c in L, and the next symbol to add to the original string is given by F[i]. The retrieval of the first two symbols 'i' and 's' is shown in the second part of Figure 4.3. The same procedure is repeated until all remaining symbols have been found, as in Figure 4.4.


Figure 4.4: Retrieval of the complete original sequence

Finally, we can recover the original string at the given position in L. Giovanni Manzini, professor of computer science at the University of Piemonte Orientale, shows in [8] how to bound the compression ratio of two BWT-based algorithms in terms of the k-th order empirical entropy of the input string, meaning that those algorithms are able to exploit all the regularities present in the input string. Due to the remaining material to present in this report, I will not go into the proof; it can be found in [8].
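For completeness, here is a direct C++ translation of the decoding procedure just described (a sketch assuming the encoder transmitted the last column L and the 0-based row of the original string, as in the encoding sketch above, and that the block is non-empty):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Inverse BWT following Section 4.2.2: F is L sorted; from row j, if F[j] is
// the kth occurrence of symbol c in F, the next row is the one holding the
// kth occurrence of c in L, and the symbol appended is F at that next row.
std::string bwtDecode(const std::string& L, std::size_t originalRow) {
    std::size_t n = L.size();
    std::string F = L;
    std::sort(F.begin(), F.end());

    // Positions of each symbol in L, in order of appearance.
    std::vector<std::vector<std::size_t>> posInL(256);
    for (std::size_t i = 0; i < n; ++i)
        posInL[static_cast<unsigned char>(L[i])].push_back(i);

    // next[j]: row whose L entry holds the same occurrence of F[j]'s symbol.
    std::vector<std::size_t> next(n);
    std::vector<std::size_t> seen(256, 0);
    for (std::size_t j = 0; j < n; ++j) {
        unsigned char c = static_cast<unsigned char>(F[j]);
        next[j] = posInL[c][seen[c]++];      // kth 'c' in F maps to kth 'c' in L
    }

    std::string out;
    std::size_t row = originalRow;
    out.push_back(F[row]);                   // first symbol of the original string
    for (std::size_t step = 1; step < n; ++step) {
        row = next[row];
        out.push_back(F[row]);
    }
    return out;
}

// bwtDecode("ipssm$pissii", 5) reconstructs "mississippi$".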

4.2.3 Why it compresses well

In fact, a BWT string has a lot in common with the Lempel-Ziv approaches in the idea of capturing similarities. Let us consider, as an example, a common word in the English language², like "the ", presumably the most frequently used word. Based on what we have seen before, L (like F) will contain a large number of 't' characters, intermingled with other characters that can precede the "he " suffix, like 's' (i.e., the word "she "). The same argument can be applied to all other words, like " of " with " if ".

As a result, some regions of L are likely to contain long stretches drawn from a small number of distinct characters. This is exactly the kind of string for which the move-to-front coder will output a majority of low values, highly repeated (see Figure 4.6 for a concrete example on a simple string).

2http://www.world-english.org/english500.htm lists some of them for example


Then, the move-to-front output can be efficiently encoded using Huffman.

In the results, I will show that this is empirically the best compression scheme in most situations.

4.3 Algorithms used in combination with BWT

4.3.1 Run-length encoding

Run-length encoding is a simple method to encode long runs of the same symbol. An example is provided in Figure 4.5.

Figure 4.5: Run-length encoding on the string abbacccabaabbb
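A minimal C++ sketch of such an encoder (illustrative only; the on-disk format of the project's RLE.cpp may differ, but the principle of emitting a symbol together with its run length is the same):

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Encode each run as (symbol, run length). Note that an isolated symbol still
// costs two output values, which is why RLE alone gains little (see Section 5.5).
std::vector<std::pair<char, unsigned>> rleEncode(const std::string& in) {
    std::vector<std::pair<char, unsigned>> out;
    for (std::size_t i = 0; i < in.size(); ) {
        std::size_t j = i;
        while (j < in.size() && in[j] == in[i]) ++j;   // extend the current run
        out.push_back({in[i], static_cast<unsigned>(j - i)});
        i = j;
    }
    return out;
}

// rleEncode("abbacccabaabbb") -> (a,1)(b,2)(a,1)(c,3)(a,1)(b,1)(a,2)(b,3)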

This encoding technique has been used with BWT in a compression scheme during the project.

4.3.2 Move-to-front encoding

Move-to-front encoding is another simple but useful method to improve the performance of an entropy coder like Huffman. The main idea is to start with an initial stack of recently used symbols. Then, every time a symbol is encountered, its index in the stack is written to the output. At the same time, the symbol is moved to the top of the stack if it is not already there. Move-to-front is illustrated in Figure 4.6.

Figure 4.6: Move-to-front encoding on the string abbacccabaabbb
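A small C++ sketch of the encoder (illustrative only; it assumes the stack is initialized with all 256 byte values in increasing order, which may differ from the initialization used in the project's MTF.cpp):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Move-to-front: output the current stack index of each symbol, then move
// that symbol to the front of the stack.
std::vector<unsigned char> mtfEncode(const std::string& in) {
    std::vector<unsigned char> stack(256);
    std::iota(stack.begin(), stack.end(), 0);           // assumed initial order: 0..255
    std::vector<unsigned char> out;
    for (char ch : in) {
        unsigned char c = static_cast<unsigned char>(ch);
        std::size_t idx = std::find(stack.begin(), stack.end(), c) - stack.begin();
        out.push_back(static_cast<unsigned char>(idx));
        stack.erase(stack.begin() + idx);               // move the symbol to the front
        stack.insert(stack.begin(), c);
    }
    return out;
}

// On typical BWT output, long runs of equal symbols become runs of zeros,
// which Huffman then encodes very compactly.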

We will see in the results that it enables a very good compression ratio in combination with BWT and Huffman.


Chapter 5

Implementation

In this chapter, details about the algorithm implementations will be presented. The implementation part of the project (with benchmarking) took the majority of the total time dedicated to the project. The implementations were done on two GNU/Linux x86_64 machines running Gentoo¹. The code has been compiled with GCC² versions 4.3.4, 4.4.3 and 4.5. Compilation on Linux or UNIX systems should therefore work but is not guaranteed. Under Windows, it will probably need to be done under Cygwin³. For more details about the compilation process, see the supplementary material.

All of these algorithms were implemented using a mix of C++ and C. C++⁴ permits easily readable object-oriented code, while C ensures low-level access to files. This low-level access is fundamental when dealing with file encoding, especially when byte-level or even bit-level access is required.

5.1 LZ78

Being the first algorithm implemented in my project, LZ78 also introduced me to C/C++ difficulties: I unfortunately started with the std::string container for phrases, which led me to use tricks for string termination, like manually adding a '\0' symbol for end-of-string. Moreover, the 0x7F byte code, usually known as the EOF constant on GNU/Linux, terminated the stream input from a file before the real end of the file. All these constraints made the compressor incompatible with binary files, and consequently with most of the benchmark files [9]. Fortunately, having a little more time during the last weeks of the semester, I decided to re-implement this algorithm using std::vector as the container, allowing compression of binary files and benchmarks on the complete Calgary Corpus.

5.1.1 Code details

As previously said, std::vector is used as the container for dictionary phrases. The dictionary itself is an STL map, mapping the vectors to their dictionary index. These indexes are bounded between 0 and 127 (for a dictionary index of 1 byte) since the first bit has to be reserved to signify a match or a miss in the decoding phase. Some of these definitions are illustrated in Listing 5.1 (the complete code can be found in LZ78.hpp).

¹ Gentoo Linux distribution: http://www.gentoo.org
² http://gcc.gnu.org/
³ www.cygwin.com/
⁴ http://www.cplusplus.com/reference/clibrary/ (April 20. 2010)

Listing 5.1: Representation of match, miss and masks for LZ78 for 1 to 3 bytes dictionary

#define DICT_NUMBYTE 2

#if DICT_NUMBYTE == 3
#define MAX_DICSIZE 8388607
#elif DICT_NUMBYTE == 2
#define MAX_DICSIZE 32767
#else
#define MAX_DICSIZE 127
#endif

#if DICT_NUMBYTE == 1
#define ZERO 0x00
#define MASK 0x80
#define MATCH 0x80
#define MISS 0x0F
#define DICT_FULL 0x08
#define EOFILE 0x7F // 01111111b
#define UNSET_MASK 0x7F
#endif

//... similar for other sizes of dictionary

5.2 Burrows Wheeler Transform

The Burrows Wheeler Transform (BWT) implementation was probably the most challenging one in my project. In fact, the performance of BWT is critical for the rest of the compression pipeline since it is the first component of this pipeline.

The main idea of the implementation is to read the input file block by block. But as soon as a decision concerning the way to sort the permutations has to be taken, a lot of options and techniques currently under research have to be considered. Even in current implementations of algorithms using BWT, optimizing the permutation phase is still an active concern.

The first version I implemented simply stored the complete set of permutations in an ordered list of std::vector. This implementation was space consuming because, for a block of size n, it uses O(n) space per permutation, i.e. O(n²) in total. Knowing that blocks of 500'000 bytes are common, this would require more than 250'000'000'000 bytes of RAM, which is clearly too expensive even on the latest generation of computers.

A possibility to reduce the consumed space while still being able to produce the desired output is to perform suffix sorting [5, pp.75-94]. It reduces the space required to store the permutations to O(n) but introduces some complications in the sorting: we have to be really careful when sorting blocks where one suffix is a prefix of another suffix. If we use sets where elements are unique, a way to distinguish them has to be found. Many suffix sorting algorithms are proposed in [5], since this is currently an active domain of research for optimizing the running time of BWT.


My final version uses pointers and is shown in Listing 5.2. It does not store the suffix itself but rather a pointer to its position in the original block. The reason for this is that memcmp will perform a bytewise comparison of the two sequences referenced by the pointers until it finds a difference. Note that with this method, poor performance occurs especially with repeated sequences of similar suffixes, which causes some fluctuations in the time required for sorting the permutations.

Listing 5.2: The sorting routine used in the final BWT implementation

class RuntimeCmp {
public:
    int operator()( const unsigned char *p1,
                    const unsigned char *p2 ) const
    {
        int result = memcmp( doubleBuffer-(buffer - p1),
                             doubleBuffer-(buffer - p2),
                             curr_length );
        if ( result < 0 )
            return 1;
        if ( result > 0 )
            return 0;
        return p1 > p2;
    }
};

5.3 Huffman coding

For Huffman, even bit-level access was needed. This is quite obvious, since Huffman's goal is to use as few bits as possible to represent the most frequent symbols (or words if the encoding is done at a higher level).

5.3.1 Binary input and output

In current computers, the smallest unit of access to a file is the byte. Therefore, to perform per-bit reads and writes, a caching technique has been used. I will not go into details as the implementation of binary input and output is very simple. The main idea is to store the last read byte in a buffer and shift it by one bit every time a bit is read. Once 8 bits have been read, a new byte is fetched from the file. The output works in the same manner, and both input and output have an automatic flush mechanism so that the user of the class does not have to check at every bit whether the buffer has to be flushed or refilled. The binary input and output class is accessible in the BinaryIO.cpp/hpp files (see Table 5.1 for the exact location).
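The idea can be sketched with a simplified, memory-backed bit writer (illustrative only; the project's BinaryIO class works directly on files, also supports reading, and its real interface differs):

#include <cstdint>
#include <vector>

// Minimal bit writer: bits are packed most-significant-bit first into a byte
// buffer; once 8 bits are accumulated, the byte is emitted.
class BitWriter {
public:
    void writeBit(bool bit) {
        current_ = static_cast<std::uint8_t>((current_ << 1) | (bit ? 1 : 0));
        if (++filled_ == 8) { bytes_.push_back(current_); current_ = 0; filled_ = 0; }
    }
    void flush() {                       // pad the last partial byte with zeros
        while (filled_ != 0) writeBit(false);
    }
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }
private:
    std::vector<std::uint8_t> bytes_;
    std::uint8_t current_ = 0;
    int filled_ = 0;
};

Reading works symmetrically: a byte is fetched, then shifted out one bit at a time until it is exhausted and the next byte is loaded.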

5.3.2 Huffman implementation

Since my Huffman implementation codes at the byte level, the goal is to code a symbol represented by a byte with fewer than 8 bits. I have just presented the BinaryIO class giving me bit-by-bit access to a file. The implementation of Huffman coding follows the main idea from Algorithm 1 and Figure 2.1. I have created a node class (see HNode in Table 5.1) to represent the nodes that the algorithm merges together up to the root. The only difference in my implementation is that line 11 of Algorithm 1 is performed during the node merging step, so that a new traversal of the tree is not required. Once the root is the only node remaining, I simply retrieve the references to the leaves to get the incrementally created code and use it as a coding table. To store the dictionary in the file, I start by writing it into a header. Because all codes of 8 bits or less are likely to be in use, the number of bytes to read by the decoder is also written in the header; this avoids having to reserve a special pattern for when the encoded file does not terminate exactly on a full byte.

5.4 Move-to-front

The implementation of move-to-front is really trivial. It mostly consists of keeping the stack up to date and moving the last seen entry to the front. It produced probably the most impressive result of the project, since it enables Huffman coding to do an outstanding job at entropy coding.

5.5 Run-length encoding

The run-length encoding implementation is as simple as move-to-front, except that the benefit from it is very limited. In fact, because we always have to specify the length of the run we have encoded, an isolated occurrence of one byte will use two bytes in the output. This is why run-length encoding yields a very limited gain in compression ratio and even performs worse than using just move-to-front in combination with Huffman.

5.6 Overview of source files

Using the notion of classes, all the required algorithms were implemented in separate files with their corresponding headers. A small UML diagram giving an overview of these classes is shown in Figure 5.1. These classes are the core of the compression program, but some other files are required for it to run. Table 5.1 lists all the source code files and other required source files.


Table 5.1: List of the important source files required to build the program and use all the implemented algorithms

Folder: src
  CMakeLists.txt — File used by CMake to generate the Makefile
  defines.hpp — Some utility functions and shared defines
  License.txt — GPLv3 license
  main.cpp — CLI code, launches the required coding scheme

Folder: src/codecs
  BWT.cpp/hpp — The new BWT implementation, using pointers for block sorting
  BWTOld.cpp/hpp — The old BWT implementation, too expensive in terms of running time
  Huffman.cpp/hpp — Huffman encoding class
  ICodec.hpp — Codec interface
  IEncoder.hpp — Encoder interface
  LZ78.cpp/hpp — Second version of LZ78, flushing the dictionary when full
  LZ78NR.cpp/hpp — First version of LZ78, without dictionary flush
  MTF.cpp/hpp — Move-to-front encoding class
  RLE.cpp/hpp — Run-length encoding class

Folder: src/codecs/huffman
  BinaryIO.cpp — Utility class, allowing per-bit access to files
  HNode.cpp — Class to represent a node in the Huffman coding step

Folder: src/codeschemes
  HuffBwt.cpp/hpp — Performs BWT→Huffman
  HuffMtfBwt.cpp/hpp — Performs BWT→MTF→Huffman
  HuffMtfRleBwt.cpp/hpp — Performs BWT→RLE→MTF→Huffman
  HuffRleBwt.cpp/hpp — Performs BWT→RLE→Huffman
  HuffRleMtfBwt.cpp/hpp — Performs BWT→MTF→RLE→Huffman
  Lz.cpp/hpp — Performs LZ78 encoding (LZ78.cpp)
  LzNoReset.cpp/hpp — Performs LZ78 encoding (LZ78NR.cpp)

Folder: src/test
  Makefile — Used to build the unit tests
  run.sh — Build and run the unit tests
  TestBinIO.cpp — Test BinaryIO class (outdated test!)
  TestBWT.cpp — Test BWT
  TestBWT2.cpp — Test BWT
  TestHNode.cpp — Test HNode
  TestHuffBWT.cpp — Test BWT followed by Huffman
  TestHuffman.cpp — Test Huffman coding
  TestMTF.cpp — Test MTF
  TestRLE.cpp — Test RLE


Figure 5.1: UML diagram of the most important classes. Note that the coding scheme classes and the main class are not drawn. For simplicity, methods and members of the codec and encoder classes have been omitted


Chapter 6

Practical Results

6.1 Benchmark files

To evaluate the performance of the implemented algorithms, I used the Calgary Corpus [9] benchmark files, specially designed for comparing compression methods. The large version of this corpus contains the eighteen files listed below. Note that there also exists an improved version of this corpus, called the Canterbury Corpus¹, chosen in 1997 because the results of existing compression algorithms on it are "typical", in the hope that this will also be true for new methods. However, since LZ78 is rather old and there was no clear advantage in performing benchmarks with this newer corpus for my implementation, I decided to stay with my initial choice. The following table gives an overview of the Calgary Corpus files and details what each file is supposed to contain or represent:

File     Category                          Size (in bytes)
bib      Bibliography (refer format)       111261
book1    Fiction book                      768771
book2    Non-fiction book (troff format)   610856
geo      Geophysical data                  102400
news     USENET batch file                 377109
obj1     Object code for VAX               21504
obj2     Object code for Apple Mac         246814
paper1   Technical paper                   53161
paper2   Technical paper                   82199
paper3   Technical paper                   46256
paper4   Technical paper                   13286
paper5   Technical paper                   11954
paper6   Technical paper                   38105
pic      Black and white fax picture       513216
progc    Source code in "C"                39611
progl    Source code in LISP               71646
progp    Source code in PASCAL             49379
trans    Transcript of terminal session    93695

1http://corpus.canterbury.ac.nz/descriptions/#cantrbry


While writing the report, I also learnt on the Canterbury Corpus website² that paper3 to paper6 were no longer used because they did not add anything to the evaluation. However, I decided to keep the results for these four files.

6.1.1 Notions used for comparison

Data Compression ratio

The data compression ratio, also known as compression power, is used to quantify the reduction in data-representation size produced by a data compression algorithm. The formula used to compute this ratio is:

Compression ratio = Compressed size / Uncompressed size

Note that I will further refer to it as the compression ratio.

Running time

The running time reported in the results is taken from the UNIX time³ command. I used the real time (i.e. total running time) as the comparison criterion.

6.1.2 Other remarks

Before going into the results, I want to highlight the fact that I used line charts instead of scatter plots with only markers because they make it easier to follow the evolution of performance across algorithms, even if the line between two markers has no real meaning: it is only there for visual purposes. See the supplementary material for the complete set of data.

6.2 Lempel-Ziv 78

6.2.1 Lempel-Ziv 78 with dictionary reset

In this subsection, I present the results obtained when comparing the three possible dictionary index sizes for LZ78: from 1 byte to 3 bytes. In general, for big files, the 1 byte version has to perform a lot of dictionary flushes while the 3 bytes dictionary only has to be cleared occasionally. Due to the constantly growing behavior of the dictionary, the obtained results were very similar across files, and differences in performance are mostly due to the file size or to the very specific structure of the file. Figure 6.1 provides a graphical view of the compression results.

² http://corpus.canterbury.ac.nz/descriptions/#large
³ see man time


(a) The complete result for the Calgary Corpus
(b) Same as 6.1(a) but with a smaller file set

Figure 6.1: Obtained compressed file size comparison for the LZ78 implementation with dictionary sizes from 1 to 3 bytes

As can be seen in Figure 6.1(a), the optimal dictionary index size for files like books is 2 bytes. The most impressive result from the benchmark is obtained with the picture file, for which the algorithm reaches a very good compression ratio. For smaller files, Figure 6.1(b) confirms the previous observations: the two-byte version of LZ78 is the optimal one for this corpus.

6.2.2 Comparison between with and without dictionary reset version

During the project, I also decided to compare the first implemented version, where no dictionary flush is performed, with the conventional implementation where the dictionary is cleared every time it is full. Figure 6.2(a) shows the file sizes obtained without dictionary reset for 1 to 3 bytes, compared to the best version with dictionary flush, the 2-byte one seen in the previous subsection. As we can see in Figure 6.2, the benefit from not resetting the dictionary is almost non-existent. It would only be useful in situations with local similarities spread all over the file, where the first part of the file lets LZ78 build a good dictionary that captures every pattern in the file. See the supplementary material for the complete set of obtained data. Pre-filtered data are available as text files or in MS-Excel format.

6.2.3 Comparison between my LZ78 implementation and GZIP

The comparison performed in this subsection is mostly for information purposes; my implementation should not be considered a real competitor to GZip.


(a) The complete result for the Calgary Corpus
(b) The only benchmark files where the LZ78 implementation without dictionary reset performs (slightly) better than the common LZ78 technique

Figure 6.2: Ratio comparison between LZ78 with a 2-byte dictionary and reset, and the LZ78 implementation without reset from 1 to 3 bytes

The following points have to be kept in mind when interpreting the results:

1. I am using the LZ78 compression technique while GZIP⁴ obviously uses an LZ77-based algorithm.

2. GZIP was first introduced in 1992, so there is no hope of beating the performance of this software after only a few weeks of work, with myself as the only contributor to the code.

However, it is interesting to see in Figure 6.3 that I can achieve a pretty good ratio for the picture and geo files compared to GZIP, despite the poor performance my implementation obtains on the other files. The running time of the compression process is also an important parameter to take into account when evaluating compression algorithms. In Figure 6.4, we can see that my implementation fluctuates a lot when dealing with big files or with the picture, while GZIP seems to depend mostly on the file size. These results show that LZ77 can still perform better than LZ78 with some tweaks and optimizations, especially with large search and look-ahead buffers. The running time difference observed in Figure 6.4 has at least two explanations:

1. LZ77 (used in GZip) only searches the limited search buffer for a match, while my LZ78 implementation has to search a dictionary of 32767 entries (when using 2 bytes).

2. The running time is particularly high for large files like the books, news, obj2 and pic. These files tend to fill the dictionary quickly, making the search expensive.

In the next section, I will present the results obtained with the BWT techniques.

⁴ see GZIP man pages


Figure 6.3: Compression ratio comparison between my LZ78 implementation and GZIP

Figure 6.4: Compression time comparison between my LZ78 implementation using two bytes for the dictionary and GZIP

6.3 Burrows-Wheeler Transform

The benchmarks for BWT involved a lot more work than the Lempel-Ziv part since many compression schemes have to be considered. In [5, p.92], several schemes to use in combination with BWT are suggested. I decided to use some of them, introduced in the table below:


Short Name   Coding scheme                  Description
BH           Huff(BWT(Input))               Perform a Huffman coding on the input transformed by BWT
BMH          Huff(MTF(BWT(Input)))          Perform BWT followed by move-to-front and Huffman coding
BRH          Huff(RLE(BWT(Input)))          Perform BWT followed by run-length encoding and Huffman coding
BMRH         Huff(MTF(RLE(BWT(Input))))     Perform BWT followed by run-length encoding, move-to-front and Huffman coding
BRMH         Huff(RLE(MTF(BWT(Input))))     Perform BWT followed by move-to-front, run-length encoding and Huffman coding

The BH scheme immediately showed in the first benchmarks that it is very inefficient; this is not a surprise since performing a Huffman coding directly on BWT(Input) is very similar (in terms of performance) to just applying Huffman coding directly to the source. Such a coding scheme can only be efficient with source files where the most probable symbols are very few and have a significantly higher probability than the less probable ones, allowing the average code length to be at most 8 bits (this is the case for my implementation, since I use per-byte symbol coding for Huffman). In fact, the BH scheme would probably only be interesting if per-block coding of the source were applied, since BWT tends to group similar blocks together; but the difficulty arising from this would be to detect such blocks. Due to the poor results of the BH scheme, it will not be mentioned anymore in the upcoming results of this chapter, but it could obviously be a path of research for a future algorithm using just Huffman with BWT. I will start the next subsection by comparing the above schemes to the previously seen LZ78 algorithm.

I will start the next subsection by comparing the above schemes to the previously seenLZ78 algorithm.

6.3.1 Comparison of BWT schemes to LZ78

The first results, shown in Figure 6.6 and also visible as a chart in Figure 6.5, involve a compression ratio comparison between LZ78 and the presented schemes. What can immediately be seen from Figure 6.6 is that the only valuable schemes based on BWT are BMH and BMRH. The poor results for BRH and BRMH can easily be explained by two observations:

1. Run-length encoding always outputs two symbols, even when a symbol is repeated a single time (see the implementation chapter).

2. Run-length encoding breaks the blocks of similar symbols, introducing a new byte for the length of the repetition and causing move-to-front to move the symbol corresponding to this new length byte to the front. Move-to-front alone tends to break such blocks less, and also enables a small set of bytes to have high probability.

6.3.2 Influence of the block size

According to BZip2 documentation,


(a) Ratios for all files
(b) Second chart with the bad ratios of 6.5(a) removed

Figure 6.5: Comparison chart between LZ78 with two bytes for the dictionary and various BWT schemes

Figure 6.6: Comparison between LZ78 with two bytes for the dictionary and various BWT schemes

“Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.”⁵

Thus, I decided to verify whether this was also the case with BMH, the most efficient scheme in terms of ratio. I used various block sizes from 500 to 900'000 bytes to verify the above statement. In Figure 6.7, many irregularities in the ratio are visible for block sizes below 128k. For the geo file, the ratio is even very bad, growing the file to five times its original size. According to this result, we can observe that the block size is a critical parameter for efficient compression.

(a) The complete result for the Calgary Corpus
(b) Highest ratios removed from 6.7(a)
(c) Isolated ratios not quite visible in 6.7(b)

Figure 6.7: Ratio comparison for different block sizes using the BMH scheme

The values of the ratios are visible in the table in Figure 6.8.

Figure 6.8: Ratio table for BMH containing the data used for the charts in Figure 6.7, with per-row coloring of ratios. Dark green marks the best ratio and red the worst ratio.

For the next subsection, all comparisons were done using BMH with a block size of 128k. I will now compare the performance of my implementation with BZip2, a well-known compression software.

5see man bzip2


6.3.3 Comparison between my optimal BWT method and BZIP2

In this subsection, the same remarks as for LZ78 apply regarding the difference in development resources between the two implementations. The results for the running time can be observed in Figure 6.9. As expected, my version almost always has a higher running time than BZip2, but the absolute time is still good, staying in the order of one second. Moreover, the pic file is not part of the chart: it has been removed because my BMH implementation took approximately 50 seconds to compress it while BZip2 stayed in the order of a second. Such poor performance on the pic file can easily be explained by the implementation of the permutation sorting algorithm. While BZip2 certainly uses suffix sorting with an imaginary end-of-block symbol, my version will sometimes have to extend the suffix to complete the comparison. In the case of the picture, where blocks of hundreds of thousands of bytes with value zero are common, my sorting step will require 128'000² comparisons for each suffix, which clearly affects the running time.

Figure 6.9: Compression time comparison between BZip2 and BMH

Therefore, for a decent and stable running time, suffix sorting is mandatory in BWT. In Figure 6.10, I present the compression-ratio comparison between BMH and BZip2. The two algorithms are clearly very close, as the shape of the ratio “curve” shows, except for the obj1 file, which tends to penalize the 128k version of BMH (you can observe in Figure 6.8 that, interestingly, a block size of 500 gives better results for this file than larger block sizes). Therefore, according to Figure 6.10, I can highlight the fact that my algorithm

Figure 6.10: Compression ratio comparison between BZip2 and BMH

performs very well compared to BZip2.


6.4 Global comparison

After having compared LZ78 and BMH separately, I will now synthesize the results by comparing the best versions of the two algorithms with BZip2 and GZip. In Figure 6.11, we can observe that BZip2 outperforms all the other algorithms (except on the obj1 file).

Figure 6.11: Comparison between all tested algorithms / software

This means that within a few years of research, the BWT has enabled efficient compressors such as BZip2 to become available, and that it is worth using when online compression is not needed.


Chapter 7

Supplementary Material

In this chapter, you will find all the material and information required to verify the data provided in the Implementation and Practical Results chapters.

7.1 Using the program

7.1.1 License

I have decided to place my C/C++ code under the GPLv3 license (http://gplv3.fsf.org/) so that my contribution is preserved without preventing future work based on it. If another type of license is needed, this can be discussed by contacting me. All the details about the license can be found in the source files.

7.1.2 Building the program

To build the program, you will need CMake (http://www.cmake.org/) to generate the build files and a GNU-compatible system (it has only been tested on Linux). Once CMake is installed, run the following commands:

1. cd Compress/

2. mkdir build && cd build

3. cmake ../ (there should be no error if your system is compatible)

4. make (this builds the program; the binary will be located in bin/ and named compress)

Calling ./bin/compress -h will give you the information required to run a specific algorithm and will show the possible options.


Unit testing files

To build the unit-testing files (located in the src/test subdirectory), you will need the Boost C++ library (http://www.boost.org/) compiled with test support. CMake is not required, since a standard Makefile is provided. Note that the unit test for binary I/O is no longer up to date and is not expected to succeed.
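For reference, this is a hypothetical, self-contained example in the style of a Boost.Test case: the toy run-length coder inside it is a stand-in written for this illustration, not the project's code, and the header-only variant of Boost.Test is used so the snippet builds without linking against the compiled library.

// Hypothetical Boost.Test case, in the spirit of the tests shipped in src/test.
#define BOOST_TEST_MODULE compression_roundtrip
#include <boost/test/included/unit_test.hpp>

#include <cstddef>
#include <string>

// Toy run-length coder: every symbol is followed by its repeat count,
// mirroring the "always outputs two symbols" behaviour discussed in Chapter 6.
std::string rle_encode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t j = i;
        while (j < in.size() && in[j] == in[i] && j - i < 255) ++j;
        out.push_back(in[i]);                       // the symbol
        out.push_back(static_cast<char>(j - i));    // its run length
        i = j;
    }
    return out;
}

std::string rle_decode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2)
        out.append(static_cast<std::size_t>(static_cast<unsigned char>(in[i + 1])),
                   in[i]);
    return out;
}

BOOST_AUTO_TEST_CASE(rle_roundtrip)
{
    // Decompressing the compressed data must give back the original input.
    const std::string original = "aaaabbbcccd";
    BOOST_CHECK_EQUAL(rle_decode(rle_encode(original)), original);
}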

7.2 Collected data

7.2.1 Scripts

The benchmark results were collected using simple shell scripts located in each benchmark directory. The common scripts in these directories are:

• run.sh: the main script, used to run the compression program and to collect the file sizes together with the running times.

• run-dec.sh: script used to verify the integrity of the compressed files, i.e. that each decompressed file is identical to the original one. This is useful for checking that the algorithm has not been broken by a trivial change in the code. It may also be used to check the compatibility of the algorithm on other platforms.

7.2.2 Spreadsheets

If you do not want to perform the complete data-collection phase from scratch, you may prefer to have the data already filtered and formatted in a spreadsheet. There are three spreadsheet files (.xlsx format) located in the filtered-benchmarks subdirectory of the repository, and also provided on the CD-ROM:

1. BWT.xlsx: Burrows-Wheeler Transform and BZip2 related data

2. Lz78.xlsx: Lempel-Ziv 78 and GZip related data

3. ResultsMixed.xlsx: old (outdated) data for all tested software and implemented algorithms

7.2.3 Repository

The complete project files are also accessible on polysvn (http://www.polysvn.ch/) via svn, using:

svn co https://polysvn.ch/repos/Compression/tags/submitted-version

If you need access and do not yet have it, you can send me an email.


Chapter 8

Conclusion

Contributions on a personal and professional level

Doing this project on compression algorithms has been a great opportunity for me to become more familiar with information theory, which I had only briefly seen in various courses during my first year at EPFL. Moreover, the compression domain was very attractive, since compression software is part of everyday computer use. Being able to look at and implement compression algorithms demystified, in a sense, the impressive possibility of saving a huge amount of space on a disk or over a wireless transmission. I particularly appreciated the Burrows-Wheeler approach, which is capable of boosting classical techniques such as move-to-front and run-length encoding.

Another benefit I gained from this project is an improvement of my C/C++ skills, especially for very low-level file processing.

Finally, I appreciated the strong link between the compression domain and Computational Biology (the DNA domain), since I was following a course on that topic during the same semester. Many techniques used in DNA data handling are inspired by the compression domain.

General conclusion

In this project, two very different lossless compression techniques were analyzed and tested against the Calgary Corpus. Remarkably, the Burrows-Wheeler transform allows efficient compression of data with very little effort. The most noticeable drawback of this approach is that it is not online, unlike Lempel-Ziv. But since caching is already common in wired and wireless communications, for example in video streaming, this could be a path worth following because of the very good compression ratios it can achieve.

Finally, BWT sorting remains an active research area, since the running time of the sorting step is clearly what still has to be optimized.


Acknowledgments

I would like to take this opportunity to thank all the people who have contributed in some way to the progress of this project, particularly Ghid Maatouk, who proposed the project and supervised my work during the whole semester. Her advice was really useful and appreciated. My thanks also go to Masoud Alipour, who gave me some hints about the implementation part of the project, and to all the members of the ALGO and LMA labs (http://algo.epfl.ch/en/group/members) whom I may have met during this project. Finally, I would like to thank Prof. Amin Shokrollahi (head of ALGO and LMA) for his suggestions about the project orientation.

