Page 1: EE800 Term Project Huffman Coding

Abdullah Aldahami (11074595)

April 6, 2010

Huffman coding is a simple algorithm that generates a set of variable-length codes with the minimum average length.

Huffman codes are part of several data formats, such as ZIP, MPEG and JPEG.

The code is generated based on the estimated probability of occurrence of each symbol.

Huffman coding works by creating an optimal binary tree of nodes, which can be stored in a regular array.

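The method starts by building a list of all the alphabet symbols in descending order of their probabilities (frequencies of appearance).

It then constructs a tree from the bottom up.

At each step, the two symbols with the smallest probabilities are selected, added to the top of the partial tree, and replaced in the list by an auxiliary symbol that represents both of them and carries the sum of their probabilities.

When the tree is complete, the codes of the symbols are assigned by following the branches from the root to each leaf.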
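
The procedure above can be sketched in Python as follows. This is a minimal illustration rather than the project's own implementation: it replaces the sorted list with a priority queue (`heapq`), and the function name `huffman_codes` and the tie-breaking counter are assumptions made for this example. Symbols are assumed to be single characters.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string} for a {symbol: weight} mapping."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    # A heap entry is (weight, tiebreak, tree); a tree is a symbol or a (left, right) pair.
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate alphabet: give the only symbol a 1-bit code
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # smallest weight
        w2, _, right = heapq.heappop(heap)  # second smallest weight
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    codes = {}

    def assign(tree, prefix):
        # Walk from the root; a left branch appends 0, a right branch appends 1.
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    assign(heap[0][2], "")
    return codes


text = "circuit elements in digital computations"
codes = huffman_codes(Counter(text))
total_bits = sum(len(codes[ch]) for ch in text)
```

Ties between equal weights are broken arbitrarily, so individual codes may differ from a hand-built tree. On this text the strictly greedy order gives 153 bits in total, one bit fewer than the hand-built tree of the worked example, which joins c, e and n to weight-4 subtrees instead of merging two of them first.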

Example: "circuit elements in digital computations"

The sum of the frequencies (number of characters, spaces included) is 40.

Character  Frequency
i          6
t          5
space      4
c          3
e          3
n          3
u          2
l          2
m          2
s          2
a          2
o          2
r          1
d          1
g          1
p          1
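
The frequency table can be reproduced with a few lines of Python. This is only a sketch for checking the counts; the variable names are chosen for this example.

```python
from collections import Counter

text = "circuit elements in digital computations"
freqs = Counter(text)  # maps each character to its number of occurrences

# Print the symbols from most to least frequent, with their relative frequency.
for symbol, n in freqs.most_common():
    label = "space" if symbol == " " else symbol
    print(f"{label:<6}{n:>3}  {n / len(text):.3f}")
```

The relative frequency `n / len(text)` is the estimated probability that the code construction works from.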

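Example: circuit elements in digital computations (Huffman tree)

[Figure: Huffman tree for the example text. Every node carries the sum of the frequencies below it, and the two branches leaving each node are labeled 0 and 1.]

The tree is built from the leaves up to the root in the following merges:

r 1 + d 1 = 2,  g 1 + p 1 = 2
u 2 + (r, d) 2 = 4,  l 2 + m 2 = 4,  (g, p) 2 + s 2 = 4,  a 2 + o 2 = 4
c 3 + (u, r, d) 4 = 7,  e 3 + (l, m) 4 = 7,  n 3 + (g, p, s) 4 = 7,  space 4 + (a, o) 4 = 8
t 5 + (c, u, r, d) 7 = 12,  i 6 + (e, l, m) 7 = 13,  (n, g, p, s) 7 + (space, a, o) 8 = 15
12 + 13 = 25
25 + 15 = 40 (root)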

So, the code will be generated as follows:

Character  Frequency  Code    Code length  Total length
i          6          010     3            18
t          5          000     3            15
space      4          110     3            12
c          3          0010    4            12
e          3          0110    4            12
n          3          101     3            9
u          2          00110   5            10
l          2          01110   5            10
m          2          01111   5            10
s          2          1001    4            8
a          2          1110    4            8
o          2          1111    4            8
r          1          001110  6            6
d          1          001111  6            6
g          1          10000   5            5
p          1          10001   5            5

The total is 154 bits with Huffman coding, compared to 240 bits with no compression (6 bits per character).
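
As a check on the total, the sketch below encodes the example text with the code table of this section and decodes it again. The table is copied from the example, with n coded as 101 so that no code is a prefix of another; the helper names `encode` and `decode` are assumptions made for this illustration.

```python
CODES = {
    "i": "010", "t": "000", " ": "110", "c": "0010", "e": "0110", "n": "101",
    "u": "00110", "l": "01110", "m": "01111", "s": "1001", "a": "1110",
    "o": "1111", "r": "001110", "d": "001111", "g": "10000", "p": "10001",
}


def encode(text, codes=CODES):
    """Concatenate the code of every character."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes=CODES):
    """Read bits left to right; emit a symbol as soon as a complete code is seen."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free code: the first match is unambiguous
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


text = "circuit elements in digital computations"
bits = encode(text)
restored = decode(bits)
```

Because no code is a prefix of another, the decoder needs no separators between codes.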

Input
  Symbol                           i      t      ' '    c      e      n      u      l
  Probability P(x)                 0.15   0.125  0.1    0.075  0.075  0.075  0.05   0.05
Output
  Code                             010    000    110    0010   0110   101    00110  01110
  Code length L_i (bits)           3      3      3      4      4      3      5      5
  Weighted path length L_i × P(x)  0.45   0.375  0.3    0.3    0.3    0.225  0.25   0.25
Optimality
  Probability budget 2^(-L_i)      1/8    1/8    1/8    1/16   1/16   1/8    1/32   1/32
  Information I(x) = -log2 P(x)    2.74   3.00   3.32   3.74   3.74   3.74   4.32   4.32
  Entropy H(x) = -P(x) log2 P(x)   0.411  0.375  0.332  0.280  0.280  0.280  0.216  0.216

• Entropy is a measure defined in information theory that quantifies the information content of an information source.

• The entropy serves as a yardstick for the success of a data compression process: it is a lower bound on the average number of bits per symbol that a code of this kind can reach.
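
The information and entropy rows of the table can be recomputed from the character counts. The following is a small sketch under the assumption that probabilities are estimated as relative frequencies of the example text; the function name `information` is chosen for this example.

```python
import math
from collections import Counter

text = "circuit elements in digital computations"
counts = Counter(text)
total = sum(counts.values())


def information(p):
    """Information content of a symbol with probability p, in bits."""
    return -math.log2(p)


entropy = 0.0
for symbol, n in counts.most_common():
    p = n / total
    contribution = p * information(p)  # this symbol's share of the entropy
    entropy += contribution
    print(f"{symbol!r:>4}  P={p:.3f}  I={information(p):.2f}  -P*log2(P)={contribution:.3f}")

print(f"H = {entropy:.3f} bit/symbol")
```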

Input
  Symbol                           m       s       a       o       r       d       g       p       Sum
  Probability P(x)                 0.05    0.05    0.05    0.05    0.025   0.025   0.025   0.025   1
Output
  Code                             01111   1001    1110    1111    001110  001111  10000   10001
  Code length L_i (bits)           5       4       4       4       6       6       5       5
  Weighted path length L_i × P(x)  0.25    0.2     0.2     0.2     0.15    0.15    0.125   0.125   3.85
Optimality
  Probability budget 2^(-L_i)      1/32    1/16    1/16    1/16    1/64    1/64    1/32    1/32    1
  Information I(x) = -log2 P(x)    4.32    4.32    4.32    4.32    5.32    5.32    5.32    5.32
  Entropy H(x) = -P(x) log2 P(x)   0.216   0.216   0.216   0.216   0.133   0.133   0.133   0.133   3.787 bit/sym

The sums are taken over all 16 symbols of the example.

• The sum of the probability budgets across all symbols is always less than or equal to one. In this example, the sum is equal to one; as a result, the code is termed a complete code.

• In this example, Huffman coding reaches 98.36% of the optimum: (3.787 / 3.85) × 100 = 98.36%.
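
Both statements can be verified numerically. The sketch below takes the frequencies and code lengths from the tables of this section (the dictionary name `FREQ_AND_LENGTH` is an assumption for this example) and computes the sum of the probability budgets, the average code length and the ratio of entropy to average length.

```python
import math
from fractions import Fraction

# symbol: (frequency, code length in bits), copied from the example tables
FREQ_AND_LENGTH = {
    "i": (6, 3), "t": (5, 3), " ": (4, 3), "c": (3, 4), "e": (3, 4), "n": (3, 3),
    "u": (2, 5), "l": (2, 5), "m": (2, 5), "s": (2, 4), "a": (2, 4), "o": (2, 4),
    "r": (1, 6), "d": (1, 6), "g": (1, 5), "p": (1, 5),
}

total = sum(f for f, _ in FREQ_AND_LENGTH.values())

# Sum of the probability budgets 2^(-L_i), kept exact with fractions.
budget_sum = sum(Fraction(1, 2 ** length) for _, length in FREQ_AND_LENGTH.values())

# Average code length: sum of L_i * P(x).
avg_length = sum(f * length for f, length in FREQ_AND_LENGTH.values()) / total

# Entropy: sum of -P(x) * log2 P(x).
entropy = -sum((f / total) * math.log2(f / total) for f, _ in FREQ_AND_LENGTH.values())

efficiency = entropy / avg_length
print(f"budget sum = {budget_sum}, average length = {avg_length:.2f}, "
      f"entropy = {entropy:.3f}, efficiency = {efficiency:.2%}")
```

The efficiency computed from the unrounded entropy differs from the slide's 98.36% only in the last digit, because the slide divides the rounded value 3.787 by 3.85.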

Static probability distribution (Static Huffman Coding)

Coding procedures with static Huffman codes operate with a predefined code tree, which is defined in advance for any type of data and is independent of the particular contents.

The primary problem of a static, predefined code tree arises if the real probability distribution differs strongly from the assumptions. In this case the compression rate decreases drastically.
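
The effect of a mismatched static table can be illustrated with a short experiment. In this sketch the static code is built once from the example text and then applied to a hypothetical input (`skewed`, invented for this illustration) that consists mostly of symbols the reference text rarely uses. The helper names are assumptions, and only code lengths are computed since they determine the size.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_lengths(freqs):
    """Code length per symbol for a {symbol: weight} mapping (greedy merge).

    Assumes at least two symbols; a single symbol would get length 0.
    """
    order = count()  # tie-breaker so the dicts in the heap are never compared
    heap = [(w, next(order), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Every symbol under the new parent moves one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(order), merged))
    return heap[0][2]


def cost(text, lengths):
    """Number of bits needed to encode text with the given code lengths."""
    return sum(lengths[ch] for ch in text)


reference = "circuit elements in digital computations"  # data the static table assumes
static_lengths = huffman_lengths(Counter(reference))

skewed = "dddd rrrr gggg pppp"  # hypothetical input with a very different distribution
static_bits = cost(skewed, static_lengths)
own_bits = cost(skewed, huffman_lengths(Counter(skewed)))
print(f"static table: {static_bits} bits, table fitted to the input: {own_bits} bits")
```

The static table spends its longest codes on exactly the symbols that dominate the new input, which is the situation the slide describes.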

Adaptive probability distribution (Adaptive Huffman Coding)

The adaptive coding procedure uses a code tree that is continuously adapted to the previously encoded or decoded data, starting with an empty tree or a standard distribution.

This variant is characterized by its minimal requirements for header data, but the attainable compression rate is unfavourable at the beginning of the coding or for small files.
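
The idea can be sketched with a deliberately naive scheme: encoder and decoder both start from a standard distribution (every symbol counted once) and rebuild the code from the running counts after each symbol. This is not how practical adaptive Huffman coders work (the FGK and Vitter algorithms update the tree incrementally instead of rebuilding it); it only shows that no code table has to be transmitted. All function names, the shared `alphabet` and the explicit symbol count passed to the decoder are assumptions made for this example.

```python
import heapq
from itertools import count


def build_codes(counts):
    """Greedy Huffman codes for {symbol: count}; deterministic for a given dict order."""
    order = count()
    heap = [(w, next(order), sym) for sym, w in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(order), (left, right)))
    codes, stack = {}, [(heap[0][2], "")]
    while stack:  # iterative walk: left branch appends 0, right branch appends 1
        tree, prefix = stack.pop()
        if isinstance(tree, tuple):
            stack.append((tree[0], prefix + "0"))
            stack.append((tree[1], prefix + "1"))
        else:
            codes[tree] = prefix
    return codes


def adaptive_encode(text, alphabet):
    counts = {sym: 1 for sym in alphabet}  # standard distribution: all symbols equal
    out = []
    for ch in text:
        out.append(build_codes(counts)[ch])  # code from the data seen so far
        counts[ch] += 1                      # adapt only after the symbol is coded
    return "".join(out)


def adaptive_decode(bits, alphabet, n_symbols):
    counts = {sym: 1 for sym in alphabet}
    out, pos = [], 0
    for _ in range(n_symbols):
        inverse = {code: sym for sym, code in build_codes(counts).items()}
        current = ""
        while current not in inverse:  # read bits until a complete code is seen
            current += bits[pos]
            pos += 1
        out.append(inverse[current])
        counts[out[-1]] += 1           # mirror the encoder's update
    return "".join(out)


text = "circuit elements in digital computations"
alphabet = sorted(set(text))  # assumed to be known to both sides in advance
bits = adaptive_encode(text, alphabet)
restored = adaptive_decode(bits, alphabet, len(text))
```

With 16 equally weighted symbols the first character costs 4 bits, which reflects the slide's remark that compression is poor at the start and for small inputs.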
