
Huffman Codes

Introduction

Huffman codes are a very effective technique for compressing data; savings of 20% to 90% are typical, depending on the characteristics of the file being compressed. Huffman's greedy algorithm uses a table of the frequencies of occurrence of each character in the file to build up an optimal way of representing each character as a binary string.

Suppose we have a 100,000-character data file that we wish to store compactly. Further suppose the characters in the file occur with the following frequencies:
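(The frequency table was a figure in the original slides; the reconstruction below is inferred from the codewords and bit counts used later in this text. Frequencies are in thousands.)

                               a     b     c     d     e      f
    Frequency (in thousands)   45    13    12    16    9      5
    Fixed-length codeword      000   001   010   011   100    101
    Variable-length codeword   0     101   100   111   1101   1100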

That is, there are only 6 different characters in the file, and, for example, the character a appears 45,000 times.

There are many ways to represent such a file of information. We consider the problem of designing a binary character code (or code for short), wherein each character is represented by a unique binary string. If we use a fixed-length code, we need 3 bits to represent six characters, and hence 300,000 bits for the entire file.

A variable-length code can do considerably better, by giving frequent characters short codewords and infrequent characters long codewords. Using the variable-length code of our example, we need only

1·45 + 3·13 + 3·12 + 3·16 + 4·9 + 4·5 = 224 thousand bits,

that is, 224,000 bits for the entire file, a saving of approximately 25%. In fact, this is an optimal character code for this file.
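To double-check the arithmetic, here is a minimal Python sketch (names and structure are our own) that totals frequency × codeword length for both codes:

    # Frequencies in thousands and codeword lengths from the running example.
    freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
    fixed_len = {c: 3 for c in freq}                      # 3 bits per character
    var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}

    def total_bits(freq, length):
        # Total encoded size: sum of f(c) * codeword length of c.
        return sum(freq[c] * length[c] for c in freq)

    print(total_bits(freq, fixed_len))  # 300 -> 300,000 bits
    print(total_bits(freq, var_len))    # 224 -> 224,000 bits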

Prefix codes

We consider here only codes in which no codeword is also a prefix of some other codeword. Such codes are called prefix (or prefix-free) codes. It is possible to show that the optimal data compression achievable by a character code can always be achieved with a prefix code, so there is no loss of generality in restricting attention to prefix codes.

Prefix codes are desirable because they simplify encoding (compression) and decoding. Encoding is always easy for any binary character code; we just concatenate the codewords representing each character of the file. In our example, we code “abc” as 0101100 if we use the variable-length prefix code.
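Concretely, encoding with a codebook is a one-liner; the following Python sketch (the dictionary name code is our own) reproduces the example:

    # Variable-length prefix code of the example.
    code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

    def encode(text):
        # Concatenate the codeword of each character.
        return "".join(code[ch] for ch in text)

    print(encode("abc"))  # 0101100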

Decoding is also quite simple with a prefix code. Since no codeword is a prefix of any other, the codeword that begins an encoded file is unambiguous. We can simply identify the initial codeword, translate it back to the original character, remove it from the encoded file, and repeat the decoding process on the remainder of the encoded file. In our example, the string 001011101 parses uniquely as 0 0 101 1101, which decodes to “aabe”.
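The following Python sketch transcribes this decoding loop directly (helper names are ours); prefix-freeness guarantees that at most one codeword matches the front of the remaining string:

    code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

    def decode(bits):
        out = []
        while bits:
            # Identify the unique codeword that begins the encoded string,
            # translate it back, remove it, and repeat on the remainder.
            for ch, w in code.items():
                if bits.startswith(w):
                    out.append(ch)
                    bits = bits[len(w):]
                    break
            else:
                raise ValueError("not a valid encoding")
        return "".join(out)

    print(decode("001011101"))  # aabe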

The decoding process needs a convenient representation for the prefix code, so that the initial codeword can easily be picked off. A binary tree whose leaves are the given characters provides one such representation. We interpret the binary codeword for a character as the path from the root to that character, where 0 means “go to the left child” and 1 means “go to the right child”. The following figure shows the trees for the two codes of our example.

[Figure: the binary trees for the fixed-length code and the variable-length code of our example.]

An optimal code for a file is always represented by a full binary tree, in which every nonleaf node has two children (why?). The fixed-length code in our example is not optimal, since its tree is not a full binary tree: there are codewords beginning 10…, but none beginning 11…. Since we can now restrict our attention to full binary trees, we can say that if C is the alphabet from which the characters are drawn, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter of the alphabet, and exactly |C| − 1 internal nodes.

Given a tree T corresponding to a prefix code, it is a simple matter to compute the number of bits required to encode a file. For each character c in the alphabet C, let f(c) denote the frequency of c in the file and let dT(c) denote the depth of c’s leaf in the tree. Note that dT(c) is also the length of the codeword for the character c. The number of bits required to encode the file is thus

B(T) = ∑c∈C f(c) dT(c),

which we define as the cost of the tree T.
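As an illustration, here is a small Python sketch (the Node class and function names are our own) that computes B(T) by summing frequency × leaf depth over a code tree:

    class Node:
        # A leaf carries a character; an internal node carries two children.
        def __init__(self, char=None, left=None, right=None):
            self.char, self.left, self.right = char, left, right

    def cost(node, freq, depth=0):
        # B(T) = sum over c in C of f(c) * dT(c).
        if node.char is not None:
            return freq[node.char] * depth
        return cost(node.left, freq, depth + 1) + cost(node.right, freq, depth + 1)

    # The variable-length code tree of the example (0 = left, 1 = right).
    tree = Node(left=Node("a"),
                right=Node(left=Node(left=Node("c"), right=Node("b")),
                           right=Node(left=Node(left=Node("f"), right=Node("e")),
                                      right=Node("d"))))
    freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
    print(cost(tree, freq))  # 224 (thousand bits)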

Constructing a Huffman code

Huffman invented a greedy algorithm that constructs an optimal prefix code, called a Huffman code. In the pseudocode that follows, C is a set of n characters, and each c ∈ C has a defined frequency f[c]. The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C| − 1 “merging” operations to create the final tree. A priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of merging two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged.

HUFFMAN(C)
  n ← |C|
  Q ← C
  for i ← 1 to n − 1
      do allocate a new node z
         left[z] ← x ← EXTRACT-MIN(Q)
         right[z] ← y ← EXTRACT-MIN(Q)
         f[z] ← f[x] + f[y]
         INSERT(Q, z)
  return EXTRACT-MIN(Q)
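For comparison, here is a runnable Python version of the same procedure. It uses the standard-library heapq module as the priority queue; the tuple layout and the tie-breaking counter are our own implementation choices, not part of the pseudocode:

    import heapq
    from itertools import count

    def huffman(freq):
        # Build the code tree bottom-up; a tree is a character or a (left, right) pair.
        tiebreak = count()  # keeps tuple comparison from ever reaching the trees
        heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
        heapq.heapify(heap)
        for _ in range(len(freq) - 1):      # |C| - 1 merge steps
            fx, _, x = heapq.heappop(heap)  # the two least-frequent objects
            fy, _, y = heapq.heappop(heap)
            heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
        return heap[0][2]

    def codewords(tree, prefix=""):
        # Read codewords off the tree: 0 = go left, 1 = go right.
        if isinstance(tree, str):
            return {tree: prefix}
        return {**codewords(tree[0], prefix + "0"), **codewords(tree[1], prefix + "1")}

    freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
    print(codewords(huffman(freq)))
    # Ties may be broken differently, but the codeword lengths match the example:
    # a:1, b:3, c:3, d:3, e:4, f:4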

For our example, the following figures show how the algorithm works. There are 6 letters, and so the size of the initial queue is n = 6. There are 5 merge steps. The final tree represents the optimal prefix code; the codeword for a letter is the sequence of edge labels on the path from the root to the letter.

[Figure: the successive states of the queue during the 5 merge steps on the example, ending in the final code tree.]

The analysis of the algorithm is quite simple. We first build the queue, and then perform n − 1 merge steps: in each step we extract the two least frequent characters, merge them into a new object, and insert the result at its proper place in the queue. If we implement the queue as a binary heap, building it takes O(n) time and each heap operation takes O(log n) time, so the running time is O(n log n).

Correctness of Huffman’s algorithm

We present several lemmas that will lead to the desired conclusion.

Lemma 16.2: Let C be an alphabet in which each character c ∈ C has frequency f[c]. Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.

Proof: The idea is to take the tree T representing an arbitrary optimal prefix code and modify it to make a tree representing another optimal prefix code, such that the characters x and y appear as sibling leaves of maximum depth in the new tree. If we succeed, then their codewords will have the same length and will differ only in the last bit.

Let a and b be two characters that are sibling leaves of maximum depth in T. Without loss of generality, we assume that f[a] ≤ f[b] and f[x] ≤ f[y]. Since f[x] and f[y] are the two lowest leaf frequencies, in order, and f[a] and f[b] are two arbitrary frequencies, in order, we have f[x] ≤ f[a] and f[y] ≤ f[b]. We now exchange the positions in T of a and x to get a tree T’, and then exchange the positions in T’ of b and y to produce a tree T’’. We now calculate the difference in cost between T and T’’.

[Figure: swapping leaves a and x, and then b and y, transforms T into T’ and then T’’, making x and y sibling leaves of maximum depth.]

We start with

B(T) − B(T’) = ∑c∈C f[c] dT(c) − ∑c∈C f[c] dT’(c)
             = f[x] dT(x) + f[a] dT(a) − f[x] dT’(x) − f[a] dT’(a)
             = f[x] dT(x) + f[a] dT(a) − f[x] dT(a) − f[a] dT(x)
             = ( f[a] − f[x] ) ( dT(a) − dT(x) )
             ≥ 0,

because both f[a] − f[x] and dT(a) − dT(x) are nonnegative (why?). Similarly, when we move from T’ to T’’, we do not increase the cost. Therefore, B(T’’) ≤ B(T), but since T was optimal, B(T) ≤ B(T’’), which implies B(T) = B(T’’). Thus, T’’ is an optimal tree in which x and y appear as sibling leaves of maximum depth, and the lemma follows.

The lemma implies that the process of building up an optimal tree by mergers can, without loss of generality, begin with the greedy choice of merging together the two characters of lowest frequency. The next lemma shows that the problem of constructing optimal prefix codes has (what we call) the optimal substructure property:

Lemma 16.3: Let T be a full binary tree representing an optimal prefix code over an alphabet C, where frequency f[c] is defined for each character c ∈ C. Consider any two characters x and y that appear as sibling leaves in T, and let z be their parent. Then, considering z as a character with frequency f[z] = f[x] + f[y], the tree

T’ = T − { x, y }

represents an optimal prefix code for the alphabet

C’ = ( C − { x, y } ) ∪ { z }.

Proof: We first show that the cost B(T) of the tree T can be expressed in terms of the cost B(T’) of the tree T’ by considering the individual summands in the definition of B(·). For each c ∈ C − { x, y }, we have dT(c) = dT’(c), resulting in

f[c] dT(c) = f[c] dT’(c).

Since dT(x) = dT(y) = dT’(z) + 1, we have

f[x] dT(x) + f[y] dT(y) = ( f[x] + f[y] ) ( dT’(z) + 1 )
                        = f[z] dT’(z) + ( f[x] + f[y] ),

leading to

B(T) = B(T’) + f[x] + f[y].

If T’ represented a non-optimal prefix code for the alphabet C’, then there would exist a tree T’’ whose leaves are the characters in C’ such that B(T’’) < B(T’). Since z is treated as a character in C’, it appears as a leaf in T’’. If we add x and y as the children of z in T’’, we obtain a prefix code for C with cost

B(T’’) + f[x] + f[y] < B(T’) + f[x] + f[y] = B(T),

contradicting the optimality of T. Thus, T’ must be optimal for the alphabet C’.

Theorem: Procedure HUFFMAN produces an optimal prefix code.

Proof: Immediate from the two lemmas: Lemma 16.2 shows that the greedy choice of merging the two least frequent characters is safe, and Lemma 16.3 shows that the remaining subproblem has the optimal substructure property, so induction on |C| completes the argument.

Last updated: 2/08/2010