Huffman coding

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
Optimal codes - I
A code is optimal if it has the shortest average codeword length L.
This can be seen as an optimization problem:
$\min_{l_1, \dots, l_m} L = \sum_{i=1}^{m} p_i l_i \quad \text{subject to} \quad \sum_{i=1}^{m} D^{-l_i} \le 1$

Optimal codes - II
Let's make two simplifying assumptions:
- no integer constraint on the codelengths
- Kraft inequality holds with equality
Lagrange-multiplier problem (log without a base denotes the natural logarithm):
$J = \sum_{i=1}^{m} p_i l_i + \lambda \left( \sum_{i=1}^{m} D^{-l_i} - 1 \right)$
$\frac{\partial J}{\partial l_j} = 0 \;\rightarrow\; p_j - \lambda D^{-l_j} \log D = 0 \;\rightarrow\; D^{-l_j} = \frac{p_j}{\lambda \log D}$

Optimal codes - III
Substitute $D^{-l_j} = \frac{p_j}{\lambda \log D}$ into the Kraft inequality:
$\sum_{i=1}^{m} \frac{p_i}{\lambda \log D} = 1 \;\rightarrow\; \lambda = \frac{1}{\log D} \;\rightarrow\; p_i = D^{-l_i}$
that is
$l_i^* = -\log_D p_i$
Note that
$L^* = \sum_{i=1}^{m} p_i l_i^* = -\sum_{i=1}^{m} p_i \log_D p_i = H_D(X)$ !!
the entropy, when we use base D for logarithms
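As a numerical check of this derivation, here is a minimal Python sketch. The dyadic distribution and the function names are illustrative assumptions, not taken from the slides. It evaluates the real-valued lengths $l_i^* = -\log_D p_i$, checks that the Kraft sum equals 1, and compares the resulting average length with the entropy.

```python
import math

def ideal_lengths(probs, D=2):
    """Real-valued lengths l_i* = -log_D(p_i) from the Lagrangian solution."""
    return [-math.log(p) / math.log(D) for p in probs]

def entropy(probs, D=2):
    """H_D(X) = -sum p_i log_D p_i (zero-probability terms contribute nothing)."""
    return -sum(p * math.log(p) / math.log(D) for p in probs if p > 0)

# Hypothetical source: dyadic probabilities make the ideal lengths integers.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = ideal_lengths(probs)
kraft = sum(2 ** -l for l in lengths)                  # Kraft sum, expected to be 1
avg_len = sum(p * l for p, l in zip(probs, lengths))   # L* = sum p_i l_i*
print(lengths, kraft, avg_len, entropy(probs))
```

For non-dyadic probabilities the same lengths are not integers, which is why the result is only a lower bound in practice.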

Optimal codes - IV
In practice the codeword lengths must be integer values, so the result obtained above is a lower bound.
Theorem: The expected length of any instantaneous D-ary code for a r.v. X satisfies
$L \ge H_D(X)$
This fundamental result derives from the work of Shannon.

Optimal codes - V
What about the upper bound?
Theorem: Given a source alphabet (i.e. a r.v.) of entropy $H(X)$, it is possible to find an instantaneous binary code whose average length satisfies
$H(X) \le L \le H(X) + 1$
A similar theorem could be stated if we use the wrong probabilities $\{q_i\}$ instead of the true ones $\{p_i\}$; the only difference is a term which accounts for the relative entropy.
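One standard way to meet this bound, not spelled out on the slide, is to choose the integer lengths $l_i = \lceil -\log_2 p_i \rceil$. The sketch below is only an illustration under that assumption: it checks the Kraft inequality and the two-sided bound for the five-symbol distribution used later in the Shannon-Fano slides. The variable and function names are hypothetical.

```python
import math

def shannon_lengths(probs):
    """Integer lengths l_i = ceil(-log2 p_i); rounding up keeps the Kraft sum <= 1."""
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.35, 0.17, 0.17, 0.16, 0.15]
lengths = shannon_lengths(probs)
H = -sum(p * math.log2(p) for p in probs)        # entropy in bits
L = sum(p * l for p, l in zip(probs, lengths))   # average codeword length
kraft = sum(2 ** -l for l in lengths)            # must not exceed 1
redundancy = L - H                               # lies between 0 and 1
print(lengths, kraft, H, L, redundancy)
```

These lengths are generally not optimal; they only show that a code within one bit of the entropy always exists.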

The redundancy
It is defined as the average codeword length minus the entropy:
$\text{Redundancy} = L - \left( -\sum_i p_i \log p_i \right)$
Note that
$0 \le \text{redundancy} \le 1$
(why?)

Compression ratio
It is the ratio between the average number of bits/symbol in the original message and the same quantity for the coded message, i.e.
$C = \frac{\langle \text{average original symbol length} \rangle}{\langle \text{average compressed symbol length} \rangle}$
Note that the average compressed symbol length of a given message is in general $\ne L(X)$ !!
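The definition can be sketched in a few lines of Python. Everything here is an illustrative assumption: the prefix code, the message, and the choice of a fixed-length code of $\lceil \log_2(\text{alphabet size}) \rceil$ bits as the "original" representation.

```python
import math

def compression_ratio(message, code):
    """C = (original bits per symbol) / (compressed bits per symbol) for one message."""
    # Assumption: the original message uses a fixed-length code over the same alphabet.
    original_bits = max(1, math.ceil(math.log2(len(code))))
    compressed_bits = sum(len(code[s]) for s in message) / len(message)
    return original_bits / compressed_bits

# Hypothetical prefix code and message.
code = {"x": "0", "y": "10", "z": "110", "w": "111"}
ratio = compression_ratio("xxyxzxwx", code)
print(ratio)
```

The denominator is an average over the symbols that actually occur in the message, so it depends on the message and not only on the source probabilities.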

Uniquely decodable codes
The set of instantaneous codes is a small subset of the uniquely decodable codes.
Is it possible to obtain a lower average code length L using a uniquely decodable code that is not instantaneous? NO.
So we use instantaneous codes, which are easier to decode.

Summary
Average codeword length L for uniquely decodable codes (and for instantaneous codes):
$L \ge H(X)$
In practice, for each r.v. $X$ with entropy $H(X)$ we can build a code with average codeword length that satisfies
$H(X) \le L \le H(X) + 1$

Shannon-Fano coding
The main advantage of the Shannon-Fano technique is its simplicity:
- Source symbols are listed in order of non-increasing probability.
- The list is divided in such a way as to form two groups of as nearly equal probabilities as possible.
- Each symbol in the first group receives a 0 as first digit of its codeword, while the others receive a 1.
- Each of these groups is then divided according to the same criterion and additional code digits are appended.
- The process is continued until each group contains only one message.
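The procedure can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name and the tie-breaking rule (the earliest split point wins when two splits are equally balanced) are assumptions.

```python
def shannon_fano(probs):
    """Return {symbol: codeword} for a dict {symbol: probability}."""
    # Step 1: list the symbols in order of non-increasing probability.
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    codes = {sym: "" for sym, _ in items}

    def split(group):
        if len(group) < 2:
            return                                  # one message left: done
        total = sum(p for _, p in group)
        running, best_cut, best_diff = 0.0, 1, float("inf")
        for cut in range(1, len(group)):
            running += group[cut - 1][1]
            diff = abs(total - 2 * running)         # |first group - second group|
            if diff < best_diff:
                best_cut, best_diff = cut, diff
        for sym, _ in group[:best_cut]:
            codes[sym] += "0"                       # first group gets a 0
        for sym, _ in group[best_cut:]:
            codes[sym] += "1"                       # second group gets a 1
        split(group[:best_cut])
        split(group[best_cut:])

    split(items)
    return codes

dyadic = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/16, "e": 1/32, "f": 1/32}
print(shannon_fano(dyadic))
```

A source with a single symbol would receive an empty codeword in this sketch; a real encoder would have to handle that case separately.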

Shannon-Fano coding - example
Symbol  Prob.  Codeword
a       1/2    0
b       1/4    10
c       1/8    110
d       1/16   1110
e       1/32   11110
f       1/32   11111
H = 1.9375 bits
L = 1.9375 bits

Shannon-Fano coding - exercise
Symb.  Prob.
*      12%
?      5%
!      13%
&      2%
$      29%
€      13%
§      10%
°      6%
@      10%
Encode, using the Shannon-Fano algorithm.

Is Shannon-Fano coding optimal?
Symbol  Prob.  Shannon-Fano code  Alternative code
a       0.35   00                 0
b       0.17   01                 100
c       0.17   10                 101
d       0.16   110                110
e       0.15   111                111
H = 2.2328 bits
L = 2.31 bits (Shannon-Fano code)
L1 = 2.3 bits (alternative code)

Huffman coding - I
There is another algorithm whose performance is slightly better than Shannon-Fano: the famous Huffman coding.
- It works by constructing a tree bottom-up, with the symbols in the leaves.
- The two leaves with the smallest probabilities become siblings under a parent node whose probability is the sum of the two children's probabilities.

Huffman coding - II
- The operation is then repeated, considering also the new parent node and ignoring its children.
- The process continues until there is only one parent node, with probability 1, which is the root of the tree.
- Then the two branches of every non-leaf node are labeled 0 and 1 (typically 0 on the left branch, but the order is not important).
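The bottom-up construction can be sketched with a priority queue. The following Python is a minimal illustration under some assumptions: the function name is hypothetical, ties are broken by insertion order, and symbols must not themselves be tuples (tuples mark internal nodes). Because ties may be broken differently, the codewords can differ from a hand-built tree while the average length stays the same.

```python
import heapq
from itertools import count

def huffman(probs):
    """Return {symbol: codeword} built bottom-up from {symbol: probability}."""
    tie = count()        # unique tie-breaker, so the heap never compares subtrees
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)     # the two smallest probabilities...
        p1, _, right = heapq.heappop(heap)
        # ...become siblings under a parent whose probability is their sum.
        heapq.heappush(heap, (p0 + p1, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node: label branches 0 and 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"       # a one-symbol source still needs one bit

    walk(heap[0][2], "")
    return codes

probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman(probs)
avg_len = sum(probs[s] * len(w) for s, w in codes.items())
print(codes, avg_len)
```

The heap makes each merge cost O(log n), so this sketch runs in O(n log n) overall.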

Huffman coding - example
Symbol  Prob.
a       0.05
b       0.05
c       0.1
d       0.2
e       0.3
f       0.2
g       0.1
[Figure: Huffman tree built bottom-up over the leaves a-g. Internal nodes: a+b = 0.1, 0.1+c = 0.2, f+g = 0.3, 0.2+d = 0.4, e+0.3 = 0.6, 0.4+0.6 = 1.0 (root). The two branches of each internal node are labeled 0 and 1.]

Huffman coding - example
Symbol  Prob.  Codeword
a       0.05   0000
b       0.05   0001
c       0.1    001
d       0.2    01
e       0.3    10
f       0.2    110
g       0.1    111
Exercise: evaluate H(X) and L(X)
H(X) = 2.5464 bits
L(X) = 2.6 bits !!

Huffman coding - exercise
Code the sequence aeebcddegfced and calculate the compression ratio.
Code: a 0000, b 0001, c 001, d 01, e 10, f 110, g 111
Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01
Aver. orig. symb. length = 3 bits
Aver. compr. symb. length = 34/13
C = .....

Huffman coding - exercise
Decode the sequence 0111001001000001111110
Code: a 0000, b 0001, c 001, d 01, e 10, f 110, g 111
Sol: dfdcadgf
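Because the code is instantaneous, a decoder can emit a symbol as soon as the bits read so far match a codeword. The sketch below illustrates this with the codeword table of this exercise; the function name and the dictionary-based lookup are assumptions made for the illustration, not the tree traversal a real decoder would normally use.

```python
def decode(bits, code):
    """Decode a bit string with an instantaneous (prefix-free) code."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:       # a complete codeword: emit it immediately
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit stream ends in the middle of a codeword")
    return "".join(out)

# Codeword table from the Huffman example in the slides.
code = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}
decoded = decode("0111001001000001111110", code)
encoded = "".join(code[s] for s in "aeebcddegfced")
print(decoded, len(encoded))
```

No lookahead is needed: no codeword is a prefix of another, so the first match is always the right one.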

Huffman coding - exercise
Encode with Huffman the sequence 01$cc0a02ba10 and evaluate entropy, average codeword length and compression ratio.
Symb.  Prob.
a      0.10
b      0.03
c      0.14
0      0.4
1      0.22
2      0.04
$      0.07

Huffman coding - exercise
Symb.  Prob.
0      0.16
1      0.02
2      0.15
3      0.29
4      0.17
5      0.04
%      0.17
Decode (if possible) the Huffman-coded bit stream 01001011010011110101...

Huffman coding - notes
In Huffman coding, if at any time there is more than one way to choose a smallest pair of probabilities, any such pair may be chosen.
Sometimes the list of probabilities is initialized to be non-increasing and reordered after each node creation. This detail doesn't affect the correctness of the algorithm, but it allows a more efficient implementation.

Huffman coding - notes
There are cases in which Huffman coding does not uniquely determine the codeword lengths, due to the arbitrary choice among equal minimum probabilities.
For example, for a source with probabilities $\{0.4, 0.2, 0.2, 0.1, 0.1\}$ it is possible to obtain codeword lengths of $\{1, 2, 3, 4, 4\}$ and of $\{2, 2, 2, 3, 3\}$.
It would be better to have a code whose codeword lengths have minimum variance, as this solution needs the minimum buffer space in the transmitter and in the receiver.
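A short computation makes the difference concrete. The sketch below, with a hypothetical helper name, evaluates the mean and the variance of the codeword length for the two sets of lengths given in this example.

```python
def mean_and_variance(probs, lengths):
    """Mean and variance of the codeword length under the source probabilities."""
    mean = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - mean) ** 2 for p, l in zip(probs, lengths))
    return mean, var

probs = [0.4, 0.2, 0.2, 0.1, 0.1]
mean_a, var_a = mean_and_variance(probs, [1, 2, 3, 4, 4])
mean_b, var_b = mean_and_variance(probs, [2, 2, 2, 3, 3])
print(mean_a, var_a)   # same mean as the second code, larger variance
print(mean_b, var_b)
```

Both codes have an average length of 2.2 bits, but the variance is about 1.36 for the first and about 0.16 for the second, so the second is preferable when buffer space matters.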

Huffman coding - notes
Schwarz defines a variant of the Huffman algorithm that builds the code with minimum $l_{\max}$.
There are several other variants; we will explain the most important ones in a while.

Optimality of Huffman coding - I
It is possible to prove that, in the case of character coding (one symbol, one codeword), Huffman coding is optimal.
In other words, the Huffman code has minimum redundancy.
An upper bound for the redundancy has been found:
$\text{redundancy} \le p_1 + 1 - \log_2 e + \log_2(\log_2 e) \approx p_1 + 0.086$
where $p_1$ is the probability of the most likely symbol.
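The constant in this bound is easy to verify numerically. In the sketch below the names SIGMA and redundancy_bound are illustrative labels only; the check uses the seven-symbol example from the slides, whose most likely symbol has probability 0.3 and whose redundancy is 2.6 - 2.5464 bits.

```python
import math

# Constant term of the bound: 1 - log2(e) + log2(log2(e)), roughly 0.086.
SIGMA = 1 - math.log2(math.e) + math.log2(math.log2(math.e))

def redundancy_bound(p1):
    """Upper bound p1 + SIGMA, with p1 the probability of the most likely symbol."""
    return p1 + SIGMA

observed = 2.6 - 2.5464          # L(X) - H(X) for the example in the slides
print(SIGMA, redundancy_bound(0.3), observed)
```

The observed redundancy of about 0.054 bits is well below the bound of about 0.386 bits for that source.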

Optimality of Huffman coding - II
Why does the Huffman code “suffer” when there is one symbol with very high probability?
Remember the notion of uncertainty...
$p(x) \to 1 \;\Rightarrow\; -\log(p(x)) \to 0$
The main problem is given by the integer constraint on codelengths!!
This consideration opens the way to a more powerful coding... we will see it later.

Huffman coding - implementation
A Huffman code can be generated in O(n) time, where n is the number of source symbols, provided that the probabilities have been presorted (however, this sort costs O(n log n)...).
Nevertheless, encoding is very fast.
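The slide does not say which linear-time method it has in mind. One well-known possibility is the two-queue construction, sketched below as an assumption: leaves wait in one queue (already sorted), merged nodes are appended to a second queue in creation order, and the smallest remaining weight is always at the front of one of the two. The function name and the returned representation (codeword lengths only) are illustrative choices.

```python
from collections import deque

def huffman_lengths_sorted(probs):
    """Codeword lengths for probabilities given in non-decreasing order, in O(n)."""
    n = len(probs)
    if n == 1:
        return [1]
    leaves = deque((p, i) for i, p in enumerate(probs))   # (weight, node id)
    merged = deque()                                      # internal nodes, in creation order
    parent = {}
    next_id = n

    def pop_min():
        # The smallest remaining weight is at the front of one of the two queues.
        if not merged or (leaves and leaves[0][0] <= merged[0][0]):
            return leaves.popleft()
        return merged.popleft()

    while len(leaves) + len(merged) > 1:
        (p0, a), (p1, b) = pop_min(), pop_min()
        parent[a] = parent[b] = next_id
        merged.append((p0 + p1, next_id))
        next_id += 1

    root = next_id - 1
    depth = {root: 0}
    for node in range(root - 1, -1, -1):    # a parent always has a larger id than its child
        depth[node] = depth[parent[node]] + 1
    return [depth[i] for i in range(n)]

probs = [0.05, 0.05, 0.1, 0.1, 0.2, 0.2, 0.3]   # the example source, sorted
lengths = huffman_lengths_sorted(probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(lengths, avg_len)
```

Every queue operation is O(1), so the whole construction is linear once the input is sorted.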

Huffman coding - implementation
However, the space and time complexity of the decoding phase are far more important because, on average, decoding will happen more frequently.
Consider a Huffman tree with n symbols:
- n leaves and n-1 internal nodes
- each leaf has the pointer to a symbol and the info that it is a leaf
- each internal node has two pointers
$2n + 2(n-1) \approx 4n$ words (32 bits)

Huffman coding - implementation
1 million symbols: 16 MB of memory!
Moreover, traversing a tree from root to leaf involves following a lot of pointers, with little locality of reference. This causes several page faults or cache misses.
To solve this problem a variant of Huffman coding has been proposed: canonical Huffman coding.

canonical Huffman coding - I
Symb.  Prob.  Code 1  Code 2  Code 3
a      0.11   000     111     000
b      0.12   001     110     001
c      0.13   100     011     010
d      0.14   101     010     011
e      0.24   01      10      10
f      0.26   11      00      11
[Figure: Huffman tree with leaves a 0.11, b 0.12, c 0.13, d 0.14, e 0.24, f 0.26 and internal nodes 0.23, 0.27, 0.47, 0.53, 1.0 (root). Each branch carries a label 0 or 1 and, in parentheses, the complementary label.]

canonical Huffman coding - II
Symb.  Code 3
a      000
b      001
c      010
d      011
e      10
f      11
This code cannot be obtained through a Huffman tree!
We do call it a Huffman code because it is instantaneous and the codeword lengths are the same as those of a valid Huffman code.
Numerical sequence property:
- codewords with the same length are ordered lexicographically
- when the codewords are sorted in lexical order they are also in order from the longest to the shortest codeword

canonical Huffman coding - III
The main advantage is that it is not necessary to store a tree in order to decode.
We need:
- a list of the symbols ordered according to the lexical order of the codewords
- an array with the first codeword of each distinct length

canonical Huffman coding - IV
Encoding. Suppose there are n distinct symbols, that for symbol i we have calculated the Huffman codelength $l_i$, and that $\forall i \; l_i \le maxlength$.
for k = 1 to maxlength { numl[k] = 0; }
for i = 1 to n { numl[l_i] = numl[l_i] + 1; }
firstcode[maxlength] = 0;
for k = maxlength - 1 downto 1 { firstcode[k] = ⌈(firstcode[k+1] + numl[k+1]) / 2⌉; }
for k = 1 to maxlength { nextcode[k] = firstcode[k]; }
for i = 1 to n {
    codeword[i] = nextcode[l_i];
    symbol[l_i, nextcode[l_i] - firstcode[l_i]] = i;
    nextcode[l_i] = nextcode[l_i] + 1;
}
numl[k] = number of codewords with length k
firstcode[k] = integer for first code of length k
nextcode[k] = integer for the next codeword of length k to be assigned
symbol[,] used for decoding
codeword[i] the rightmost $l_i$ bits of this integer are the code for symbol i
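The encoding procedure translates almost line by line into Python. The sketch below is an illustration, not the author's implementation: the function name, the use of dictionaries for codeword and symbol, and the convention of leaving index 0 of the arrays unused are assumptions. Symbols are processed in the order in which they appear in the input dictionary.

```python
def canonical_codes(lengths):
    """lengths: {symbol: codeword length}. Returns (codeword, firstcode, symbol)."""
    maxlength = max(lengths.values())
    numl = [0] * (maxlength + 1)           # numl[k]: number of codewords of length k
    for l in lengths.values():
        numl[l] += 1
    firstcode = [0] * (maxlength + 1)      # firstcode[maxlength] = 0
    for k in range(maxlength - 1, 0, -1):
        # Integer form of ceil((firstcode[k+1] + numl[k+1]) / 2).
        firstcode[k] = (firstcode[k + 1] + numl[k + 1] + 1) // 2
    nextcode = firstcode[:]
    codeword = {}
    symbol = {k: [] for k in range(1, maxlength + 1)}
    for sym, l in lengths.items():
        # Keep the rightmost l bits of the integer, written as a bit string.
        codeword[sym] = format(nextcode[l], "0{}b".format(l))
        symbol[l].append(sym)              # list position equals nextcode[l] - firstcode[l]
        nextcode[l] += 1
    return codeword, firstcode, symbol

# Code lengths for the symbols a-h of the worked example in the slides.
lengths = {"a": 2, "b": 5, "c": 5, "d": 3, "e": 2, "f": 5, "g": 5, "h": 2}
codeword, firstcode, symbol = canonical_codes(lengths)
print(codeword, firstcode[1:], symbol)
```

Only the codeword lengths are needed as input, which is what makes the canonical code cheap to store and transmit.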

canonical Huffman - example
1. Evaluate array numl
Symb.  a  b  c  d  e  f  g  h
l_i    2  5  5  3  2  5  5  2
numl: [0 3 1 0 4]
2. Evaluate array firstcode
firstcode: [2 1 1 2 0]
3. Construct arrays codeword and symbol
Symb.  codeword  bits
a      1         01
b      0         00000
c      1         00001
d      1         001
e      2         10
f      2         00010
g      3         00011
h      3         11
symbol (rows: codeword length; columns: index within that length)
     0  1  2  3
1
2    a  e  h
3    d
4
5    b  c  f  g

canonical Huffman coding - V
Decoding. We have the arrays firstcode and symbol.
v = nextinputbit();
k = 1;
while v < firstcode[k] {
    v = 2 * v + nextinputbit();
    k = k + 1;
}
Return symbol[k, v - firstcode[k]];
nextinputbit() function that returns next input bit
firstcode[k] = integer for first code of length k
symbol[k,n] returns the symbol number n with codelength k
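The decoding loop can be sketched in Python as follows. The function name, the use of a string of '0'/'1' characters as input, and the way firstcode and symbol are stored (index 0 unused, one string of symbols per length) are assumptions for the illustration; the array contents are those of the worked example in the slides.

```python
def canonical_decode(bits, firstcode, symbol):
    """Decode a string of '0'/'1' characters using only firstcode and symbol."""
    stream = iter(bits)
    out = []
    for first in stream:                   # first bit of the next codeword
        v, k = int(first), 1
        while v < firstcode[k]:            # too small to be a codeword of length k
            nxt = next(stream, None)
            if nxt is None:
                raise ValueError("bit stream ends in the middle of a codeword")
            v = 2 * v + int(nxt)
            k += 1
        out.append(symbol[k][v - firstcode[k]])
    return "".join(out)

firstcode = [None, 2, 1, 1, 2, 0]          # index 0 unused; lengths 1..5
symbol = {2: "aeh", 3: "d", 5: "bcfg"}     # symbols of each length, in codeword order
decoded = canonical_decode("00111100000001001", firstcode, symbol)
print(decoded)
```

The decoder touches two small arrays instead of following tree pointers, which is the locality advantage that motivates canonical codes.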

canonical Huffman - example
firstcode: [2 1 1 2 0]
symbol (rows: codeword length; columns: index within that length)
     0  1  2  3
1
2    a  e  h
3    d
4
5    b  c  f  g
Input bits: 00111100000001001
001   -> symbol[3,0] = d
11    -> symbol[2,2] = h
10    -> symbol[2,1] = e
00000 -> symbol[5,0] = b
01    -> symbol[2,0] = a
001   -> symbol[3,0] = d
Decoded: dhebad