
Transcript of Information and Communication Theory Lecture 4

Page 1: Information and Communication Theory Lecture 4

Information and Communication Theory

Lecture 4

Optimal Coding

Mario A. T. Figueiredo

DEEC, Instituto Superior Técnico, University of Lisbon, Portugal

2021


Page 2: Information and Communication Theory Lecture 4

Source Coding

[Figure: discrete source → encoder → error-free channel → decoder]

Lossless encoding: output of the decoder equal to that of the source.

Assumption: when encoding X_t, the distribution f_{X_t} is known:

• For memoryless sources, this is just f_X;

• For Markov sources, this is f_{X_t | X_{t−1}, ...}.

Without loss of generality, we simply write f_X.

Goal: economy, that is, use the channel as little as possible.


Page 3: Information and Communication Theory Lecture 4

Variable-Length Coding


Code uses D-ary alphabet D = {0, 1, ..., D − 1}.

Typically, binary coding, D = 2, D = {0, 1}.

Variable-length encoding: D∗ is the Kleene closure of D:

D∗ = {all finite strings of symbols of D} = ⋃_{n=0}^{∞} D^n

Example: for D = {0, 1},
D∗ = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, 0000, ...}


Page 4: Information and Communication Theory Lecture 4

Non-Singular and Uniquely Decodable Codes


For C⁻¹ to exist: non-singular code (C injective). For any x, y ∈ X,

x ≠ y ⇒ C(x) ≠ C(y)

To be useful for encoding sequences of symbols, non-singularity alone is not enough.

Example: {C(1) = 0, C(2) = 10, C(3) = 01} is non-singular

Received sequence: 010; is it C(1)C(2) or C(3)C(1)?

Impossible to know!

Codes that do not have this problem are called uniquely decodable.
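To see the ambiguity concretely, here is a small sketch (not from the slides) that enumerates every way the received string 010 can be split into codewords of the code above:

```python
def parses(s, code, prefix=()):
    """Enumerate all ways of splitting s into codewords of `code` (symbol -> word)."""
    if not s:
        yield prefix
    for sym, word in code.items():
        if s.startswith(word):
            yield from parses(s[len(word):], code, prefix + (sym,))

code = {1: "0", 2: "10", 3: "01"}     # non-singular, but not uniquely decodable
print(list(parses("010", code)))      # [(1, 2), (3, 1)]: two valid decodings
```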


Page 5: Information and Communication Theory Lecture 4

Instantaneous Codes

Consider a uniquely decodable code: {C(1) = 01, C(2) = 11, C(3) = 00, C(4) = 110}

How to decode the sequence 11 00...00 11, with n zeros in the middle?

• If n is even: C⁻¹(11 00...00 11) = 2 33...3 2, with n/2 threes.

• If n is odd: C⁻¹(11 00...00 11) = 4 33...3 2, with (n − 1)/2 threes.

To decode the first symbol, we may need to wait for many others.

A code that does not have this problem is called instantaneous.


Page 6: Information and Communication Theory Lecture 4

Instantaneous Codes

If no codeword is a prefix of another, decoding is instantaneous. Other names: prefix codes, prefix-free codes.

Length function:

l_C(x) = length(C(x))

Expected length:

L(C) = E[l_C(X)] = ∑_{x∈X} f_X(x) l_C(x)

Example:

L(C) = 7/4 bits/symbol
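The example code itself appears only as a table on the slide; a minimal sketch of the computation, assuming the classic example f_X = (1/2, 1/4, 1/8, 1/8) with codewords 0, 10, 110, 111 (an assumption, since the table is not in the transcript), reproduces the 7/4 bits/symbol:

```python
fX = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/8}          # assumed example distribution
C  = {1: "0", 2: "10", 3: "110", 4: "111"}     # assumed instantaneous code
L = sum(fX[x] * len(C[x]) for x in fX)         # L(C) = E[l_C(X)]
print(L)                                       # 1.75 = 7/4 bits/symbol
```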


Page 7: Information and Communication Theory Lecture 4

Instantaneous Codes: Tree Representation

Instantaneous code: no codeword is a prefix of another.

Decoding an instantaneous code: follow a path from the root to a leaf of a tree:

For D-ary codes: D-ary trees.

L(C) is the sum of the probabilities of the inner nodes. (Show why.)


Page 8: Information and Communication Theory Lecture 4

Instantaneous Codes: Kraft-McMillan Inequality

If C is a D-ary instantaneous code, it necessarily satisfies

∑_{x∈X} D^{−l_C(x)} ≤ 1        (KMI)

...i.e., if some words are short others have to be long!

Proof:

• Let l_max = max{l_C(1), ..., l_C(N)} (length of the longest word).

• There are D^{l_max} words of length l_max.

• For each codeword C(x), there are D^{l_max − l_C(x)} words of length l_max that have C(x) as a prefix.

• The sets of length-l_max words that have each codeword as a prefix are disjoint, therefore

∑_{x∈X} D^{l_max − l_C(x)} ≤ D^{l_max}, and dividing by D^{l_max} gives ∑_{x∈X} D^{−l_C(x)} ≤ 1.

Important: the KMI is a necessary, not sufficient, condition. (why?)
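A quick numerical check (a sketch, not from the slides): the first two sets of lengths satisfy the KMI, while the last violates it, so no instantaneous code can have lengths (1, 1, 2).

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft-McMillan inequality for a list of codeword lengths."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))      # 1.0    -> satisfies the KMI (with equality)
print(kraft_sum([2, 2, 2, 3, 4]))   # 0.9375 -> satisfies the KMI
print(kraft_sum([1, 1, 2]))         # 1.25   -> violates the KMI
```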


Page 9: Information and Communication Theory Lecture 4

Kraft-McMillan Inequality: Graphical Proof


Page 10: Information and Communication Theory Lecture 4

Exercises

For each of the following codes, say whether it is non-singular, uniquely decodable, or instantaneous. Compute the expected length of each of the codes.

Converse of the KMI: given a collection of N positive integers, l_1, ..., l_N, that satisfy

∑_{x=1}^{N} D^{−l_x} ≤ 1,

is it necessarily possible to construct an instantaneous code such that l_C(x) = l_x? Give an example.


Page 11: Information and Communication Theory Lecture 4

Exercises

Given the collection of numbers (2, 2, 2, 3, 4), is it possible to build an instantaneous code with these lengths? Why? If yes, give an example. Could the code be optimal, or can we build another instantaneous code with necessarily shorter expected code-length?

Repeat the previous question for the set of numbers (1, 2, 2, 3, 4).

List all possible distributions of lengths for instantaneous codes with the following numbers of words: N = 3, N = 4, and N = 5.


Page 12: Information and Communication Theory Lecture 4

Source Coding Theorem

Source X ∈ X = {1, ..., N} with probability mass function fX .

For any collection of N positive integers, l_1, ..., l_N,

∑_{x∈X} D^{−l_x} ≤ 1  (KMI)   ⇒   ∑_{x∈X} f_X(x) l_x ≥ H_D(X).

Proof: let q(x) = D^{−l_x} / A > 0, where A = ∑_{x∈X} D^{−l_x} ≤ 1; note that ∑_{x∈X} q(x) = 1.

0 ≤ D_KL(f_X ‖ q) = ∑_{x∈X} f_X(x) log_D (f_X(x) / q(x))
  = ∑_{x∈X} f_X(x) log_D f_X(x)  +  (log_D A) ∑_{x∈X} f_X(x)  +  ∑_{x∈X} f_X(x) l_x,

where the first term is −H_D(X) and log_D A ≤ 0; since ∑_{x∈X} f_X(x) = 1, this gives ∑_{x∈X} f_X(x) l_x ≥ H_D(X) − log_D A ≥ H_D(X).

...equality iff A = 1 and q(x) = f_X(x) ⇔ l_x = − log_D f_X(x) (only possible if these are integers).


Page 13: Information and Communication Theory Lecture 4

Source Coding Theorem

Source X ∈ X = {1, ..., N} with probability mass function fX .

Corollary of the result in previous slide:

C is instantaneous ⇒ KMI ⇒ ∑_{x∈X} f_X(x) l_C(x) ≥ H_D(X), where the left-hand side is the expected code-length L(C)

...with equality if and only if l_C(x) = − log_D f_X(x).

Equality is only possible if the − log_D f_X(x) are integers.

Shannon-Fano code (SFC): just take l_SFC(x) = ⌈− log_D f_X(x)⌉

Clearly, the SFC satisfies the KMI (since ⌈u⌉ ≥ u, for any u ∈ ℝ):

∑_{x∈X} D^{−l_SFC(x)} ≤ ∑_{x∈X} D^{log_D f_X(x)} = ∑_{x∈X} f_X(x) = 1
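A small sketch of the Shannon-Fano length assignment (assuming a binary code, D = 2, and using the distribution from one of the exercises below), checking the KMI and the bound H(X) ≤ L < H(X) + 1:

```python
import math

def shannon_fano_lengths(probs, D=2):
    """Shannon-Fano code lengths: l(x) = ceil(-log_D f_X(x))."""
    return [math.ceil(-math.log(p, D)) for p in probs]

probs = [1/3, 1/3, 1/9, 1/9, 1/9]
lengths = shannon_fano_lengths(probs)                 # [2, 2, 4, 4, 4]
H = -sum(p * math.log2(p) for p in probs)             # entropy, in bits
L = sum(p * l for p, l in zip(probs, lengths))        # expected code-length
print(lengths, sum(2 ** -l for l in lengths) <= 1, H <= L < H + 1)
```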


Page 14: Information and Communication Theory Lecture 4

Optimal Code

Source X ∈ X = {1, ..., N} with probability mass function fX .

Optimal code lengths:

(l_1^optimal, ..., l_N^optimal) = arg min_{(l_1, ..., l_N)} ∑_{x∈X} f_X(x) l_x

subject to: l_1, ..., l_N ∈ ℕ  and  ∑_{x∈X} D^{−l_x} ≤ 1

Optimal code: C_optimal is any instantaneous code with l_{C_optimal}(x) = l_x^optimal, for x ∈ X.

Because it satisfies the KMI, L(Coptimal) ≥ H(X).

Because it is optimal, L(Coptimal) ≤ L(CSF)
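For small alphabets, the optimization above can be solved by brute force (a sketch only; Huffman's algorithm, introduced later in these slides, does this efficiently):

```python
import itertools, math

def optimal_lengths(probs, D=2, max_len=8):
    """Brute-force search for integer code lengths minimizing the expected length
    subject to the Kraft-McMillan inequality (feasible only for small alphabets)."""
    best, best_L = None, math.inf
    for ls in itertools.product(range(1, max_len + 1), repeat=len(probs)):
        if sum(D ** (-l) for l in ls) <= 1:            # KMI constraint
            L = sum(p * l for p, l in zip(probs, ls))
            if L < best_L:
                best, best_L = ls, L
    return best, best_L

print(optimal_lengths([1/2, 1/4, 1/8, 1/8]))           # ((1, 2, 3, 3), 1.75)
```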


Page 15: Information and Communication Theory Lecture 4

Bounds on Optimal Code-length

Because it is optimal, L(C_optimal) ≤ L(C_SF).

Because ⌈u⌉ < u + 1, for any u ∈ ℝ,

L(C_optimal) ≤ L(C_SF) = ∑_{x∈X} f_X(x) ⌈− log_D f_X(x)⌉ < ∑_{x∈X} f_X(x) (− log_D f_X(x) + 1) = H(X) + 1

In summary: H(X) ≤ L(C_optimal) < H(X) + 1

Code efficiency: ρ_C = H(X) / L(C).

Ideal code: ρ_C = 1. Important: ideal ⇒ optimal, but optimal ⇏ ideal.


Page 16: Information and Communication Theory Lecture 4

Exercises

1. List all possible distributions of word-lengths of optimal binary codes for sources with N = 3, N = 4, and N = 5 symbols. (For N = 2, there is only one choice: (1, 1).)

2. For the source X ∈ X = {1, 2, 3, 4, 5} with f_X = (1/3, 1/3, 1/9, 1/9, 1/9), obtain a binary Shannon-Fano code and compute its expected length and efficiency. Is it an optimal code?

3. For the source X ∈ X = {1, 2, 3, 4} with f_X = (1/2, 1/4, 1/8, 1/8), obtain an optimal binary code and compute its expected length and efficiency. Is it an ideal code?

4. Repeat the two previous exercises, but now for ternary codes.

5. Show that there are sources and corresponding optimal codes such that L(C_optimal) is arbitrarily close to H(X) + 1.


Page 17: Information and Communication Theory Lecture 4

Coding With a Wrong Distribution

Source X ∈ X = {1, ..., N} with probability mass function fX .

Build a Shannon-Fano code assuming g_X: l_C(x) = ⌈− log g_X(x)⌉

Lower bound:

L(C) = ∑_{x∈X} ⌈− log g_X(x)⌉ f_X(x)
     ≥ − ∑_{x∈X} f_X(x) log g_X(x)
     = ∑_{x∈X} f_X(x) log [ f_X(x) / (g_X(x) f_X(x)) ]
     = − ∑_{x∈X} f_X(x) log f_X(x)  +  ∑_{x∈X} f_X(x) log (f_X(x) / g_X(x))
     =   H(X)                       +  D_KL(f_X ‖ g_X)


Page 18: Information and Communication Theory Lecture 4

Coding With a Wrong Distribution

Source X ∈ X = {1, ..., N} with probability mass function fX .

Build a Shannon-Fano code assuming g_X: l_C(x) = ⌈− log g_X(x)⌉

Upper bound:

L(C) = ∑_{x∈X} ⌈− log g_X(x)⌉ f_X(x) < ∑_{x∈X} f_X(x) (− log g_X(x) + 1) = H(X) + D_KL(f_X ‖ g_X) + 1

Summarizing: if C is built from g_X and the true distribution is f_X,

H(X) + D_KL(f_X ‖ g_X) ≤ L(C) < H(X) + D_KL(f_X ‖ g_X) + 1
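A numerical illustration of these bounds (a sketch; f and g are made-up example distributions, not from the slides):

```python
import math

def avg_length_sf(f, g):
    """Expected length of a binary Shannon-Fano code built from g when the true pmf is f."""
    return sum(p * math.ceil(-math.log2(q)) for p, q in zip(f, g))

f = [1/2, 1/4, 1/8, 1/8]                     # true distribution
g = [1/4, 1/4, 1/4, 1/4]                     # assumed (wrong) distribution
H  = -sum(p * math.log2(p) for p in f)       # H(X) = 1.75 bits
KL = sum(p * math.log2(p / q) for p, q in zip(f, g))   # D_KL(f || g) = 0.25 bits
L  = avg_length_sf(f, g)
print(H + KL <= L < H + KL + 1)              # True: the penalty is D_KL(f || g)
```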


Page 19: Information and Communication Theory Lecture 4

Approaching the Bound: Source Extension

Discrete stationary source Xt ∈ X = {1, ..., N}

Extension: group n consecutive symbols: (X1, ..., Xn) ∈ {1, ..., N}n.

The optimal code for the extended symbols (X_1, ..., X_n) satisfies

H(X_1, ..., X_n) ≤ L(C_n^optimal)  [bits/(n symbols)]  < H(X_1, ..., X_n) + 1

Memoryless source: H(X_1, ..., X_n) = nH(X_1), thus

L(C_n^optimal) < nH(X_1) + 1   ⇒   L(C_n^optimal)/n  [bits/symbol]  < H(X_1) + 1/n

...via extension, the expected code-length per symbol can be made arbitrarily close to the entropy.

Non-memoryless source: H(X_1, ..., X_n) < nH(X_1) and the result is even stronger.
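A sketch of the extension argument for an assumed memoryless binary source with f_X = (0.9, 0.1): coding n-blocks with Shannon-Fano lengths, the per-symbol expected length stays below H(X) + 1/n and approaches the entropy as n grows.

```python
import math, itertools

f = {0: 0.9, 1: 0.1}                                   # assumed memoryless source
H = -sum(p * math.log2(p) for p in f.values())         # ≈ 0.469 bits/symbol

for n in (1, 2, 4, 8):
    # probability of each n-block, using independence (memoryless source)
    blocks = {b: math.prod(f[s] for s in b)
              for b in itertools.product(f, repeat=n)}
    # Shannon-Fano lengths for the extended alphabet
    L_n = sum(p * math.ceil(-math.log2(p)) for p in blocks.values())
    print(n, L_n / n)                                   # bits/symbol, < H + 1/n
```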


Page 20: Information and Communication Theory Lecture 4

Exercises

1. For the source X ∈ X = {1, 2} with f_X = (7/8, 1/8), obtain a binary optimal code and compute its expected length and efficiency.

2. Find the optimal code for the order-2 extension of the source in the previous question.

3. Repeat the two previous questions for f_X = (1/2, 1/2).

4. Consider a stationary Markov source X_t ∈ X = {1, 2, 3}, with the following probability transition matrix:

P = [ 0    1/8  7/8
      7/8  0    1/8
      1/8  7/8  0  ]

Obtain the optimal coding for this source and for its order-2 extension.


Page 21: Information and Communication Theory Lecture 4

Huffman Codes

Huffman (1952) algorithm to obtain optimal codes.

Builds a D-ary tree, starting from the leaves, which are the symbols.

Algorithm (for D = 2; generalizing to D > 2 requires some care):

1 Input: a list of symbol probabilities (p1, ..., pN ).

2 Output: a binary tree with each symbol as a leaf.

3 Assign each symbol to a leaf of the tree.

4 Find the 2 smallest probabilities: pi and pj .

5 Create the parent node for nodes i and j with probability pi + pj .

6 Remove pi and pj from the list and insert pi + pj .

7 If the list still contains more than one probability, go back to step 4.

As seen before, a binary tree corresponds to an instantaneous code.
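A compact sketch of the algorithm above (binary case, using a heap for the repeated "two smallest probabilities" step; ties are broken arbitrarily, as the slides allow):

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a list of symbol probabilities.
    Returns a dict: symbol index -> codeword string."""
    if len(probs) == 1:
        return {0: "0"}
    # Heap entries: (probability, tie-breaking counter, symbol index or subtree)
    heap = [(p, i, i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p_i, _, left = heapq.heappop(heap)    # two smallest probabilities
        p_j, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p_i + p_j, counter, (left, right)))
        counter += 1
    _, _, tree = heap[0]
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):           # inner node: recurse into both children
            walk(node[0], prefix + "0")       # edge labels 0/1 are arbitrary
            walk(node[1], prefix + "1")
        else:                                 # leaf: a symbol index
            code[node] = prefix
    walk(tree, "")
    return code

probs = [0.4, 0.1, 0.05, 0.25, 0.2]           # the slides' example
code = huffman_code(probs)
L = sum(p * len(code[i]) for i, p in enumerate(probs))
print(code, L)                                # expected length 2.1 bits/symbol
```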


Page 22: Information and Communication Theory Lecture 4

Huffman Codes

Illustration: probabilities (0.4, 0.1, 0.05, 0.25, 0.2)


Page 23: Information and Communication Theory Lecture 4

Huffman Codes

Illustration: probabilities (0.4, 0.1, 0.05, 0.25, 0.2)


Page 24: Information and Communication Theory Lecture 4

Huffman Codes

Illustration: probabilities (0.4, 0.1, 0.05, 0.25, 0.2)


Page 25: Information and Communication Theory Lecture 4

Huffman Codes

Illustration: probabilities (0.4, 0.1, 0.05, 0.25, 0.2)


Page 26: Information and Communication Theory Lecture 4

Huffman Codes

Illustration: probabilities (0.4, 0.1, 0.05, 0.25, 0.2)


Page 27: Information and Communication Theory Lecture 4

Huffman Codes

Label the edges (arbitrarily) to obtain the code words


Page 28: Information and Communication Theory Lecture 4

Huffman Codes

Expected code-length: sum of the inner node probabilities:
L(C) = 1 + 0.6 + 0.35 + 0.15 = 2.1 bits/symbol
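As a cross-check (a sketch, using the leaf depths 1, 2, 3, 4, 4 that the Huffman construction above produces for these probabilities), the same value follows directly from L(C) = ∑ f_X(x) l_C(x):

```python
probs   = [0.4, 0.25, 0.2, 0.1, 0.05]
lengths = [1, 2, 3, 4, 4]          # depths of the leaves in the Huffman tree
L = sum(p * l for p, l in zip(probs, lengths))
print(L)                           # 2.1 bits/symbol
```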


Page 29: Information and Communication Theory Lecture 4

Huffman Codes

Huffman codes are optimal; see proof in recommended reading.

The converse is not true:

Huffman code ⇒ optimal code, but optimal code ⇏ Huffman code.

Exercise: give an example of a source and an optimal code that is not a Huffman code.

In the case of ties, break them arbitrarily.

For D-ary codes, merge D symbols to build a D-ary tree.

For D-ary codes, optimality requires N = k(D − 1) + 1, where k ∈ ℕ; if this is not satisfied, just append zero-probability symbols.

Exercise: obtain a Huffman ternary code for a source with 4 equiprobable symbols.


Page 30: Information and Communication Theory Lecture 4

Exercises

Find a binary Huffman code for the source X ∈ X = {1, 2, 3, 4} with f_X = (0.27, 0.26, 0.24, 0.23). Are there optimal codes for this source that are not Huffman codes? If yes, how many?

Find a ternary Huffman code for the source in the previous question. Are there optimal ternary codes for this source that are not Huffman codes? If yes, how many?

Consider a coding scheme where the first symbol of the codeword is ternary (in {0, 1, 2}) and the remaining symbols are binary (in {0, 1}). That is, the codewords belong to {0, 1, 2} × {0, 1}∗. Find the optimal code with this structure for the source X ∈ X = {1, 2, 3, 4, 5, 6} with f_X = (17, 16, 8, 7, 6, 5)/59. How to express the expected length of this code?


Page 31: Information and Communication Theory Lecture 4

Exercises

Consider 9 apparently equal spheres, 8 of which weigh 1 kg and one weighs 1.05 kg. Suppose you have a balance scale.

What is the minimum number of weighings needed to identify the heavier sphere? Propose a non-sequential procedure (i.e., each weighing does not depend on the results of previous ones) to find the sphere in the minimum number of weighings.

You have 6 bottles of wine, but you know that one has gone bad (tastes like vinegar). By inspection, you determine that the probabilities of bottle i being the bad one are (8, 6, 4, 2, 2, 1)/23. In what order should you taste the bottles to find the bad one in the minimal number of tastings? What is the expected number of tastings? Can you do better if you are allowed to mix wines and taste the mixture?


Page 32: Information and Communication Theory Lecture 4

Elias Codes for Natural Numbers

Standard number representation is not uniquely decodable.

Binary representation of natural numbers is not uniquely decodable. Example: C(3) = 11, C(21) = 10101, but decoding 1110101 is impossible; it could be C(14)C(5) or C(58)C(1).

Length of the binary representation of x ∈ ℕ is ⌊log_2 x⌋ + 1. Example: C(13) = 1101 has length 4; ⌊log_2 13⌋ + 1 = ⌊3.70⌋ + 1 = 4.

Elias coding:

• instantaneous code for arbitrary natural numbers;

• length not much worse than ⌊log_2 x⌋ + 1.

Useful not only for ℕ, but also for large alphabets X = {1, ..., N} with large and unknown N.


Page 33: Information and Communication Theory Lecture 4

Elias Gamma Code

Length of the binary representation of x ∈ ℕ is ⌊log_2 x⌋ + 1.

Let C_2 denote the standard binary representation.

Elias gamma code: C_γ(x) = 00...0 C_2(x), with ⌊log_2 x⌋ zeros preceding C_2(x).

x : C_γ(x)
1 : 1
2 : 010
4 : 00100
5 : 00101
7 : 00111
9 : 0001001
10 : 0001010
...
19 : 000010011
...
147 : 000000010010011

Obviously instantaneous.

Length: l_{C_γ}(x) = 2⌊log_2 x⌋ + 1.

Twice as bad as C2.
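A direct sketch of the gamma encoder and its (instantaneous) decoder, reproducing the table above:

```python
def elias_gamma(x: int) -> str:
    """Elias gamma code: floor(log2 x) zeros followed by the binary representation of x."""
    assert x >= 1
    b = bin(x)[2:]               # standard binary representation C_2(x)
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits: str) -> int:
    """Decode a single gamma codeword from the start of the string."""
    n = 0
    while bits[n] == "0":        # count the leading zeros
        n += 1
    return int(bits[n:2 * n + 1], 2)

for x in (1, 2, 4, 5, 7, 9, 10, 19, 147):
    print(x, elias_gamma(x))     # reproduces the table above
```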


Page 34: Information and Communication Theory Lecture 4

Elias Delta Code

Elias delta code: C_δ(x) = C_γ(⌊log_2 x⌋ + 1) C̄_2(x)

C̄_2 is C_2 without the leading 1 (e.g., C_2(9) = 1001, C̄_2(9) = 001).

Length: l_{C_δ}(x) = l_{C_γ}(⌊log_2 x⌋ + 1) + ⌊log_2 x⌋ = 2⌊log_2(⌊log_2 x⌋ + 1)⌋ + ⌊log_2 x⌋ + 1

x : C_δ(x)
1 : 1
2 : 0100
3 : 0101
4 : 01100
7 : 01111
8 : 00100000
10 : 00100010
...
19 : 001010011
...
147 : 00010000010011

Obviously instantaneous.

For x > 32, l_{C_δ}(x) < l_{C_γ}(x).

Approaches C_2 for large x:

lim_{x→∞} l_{C_δ}(x) / l_{C_2}(x) = 1
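A sketch of the delta encoder, reusing the elias_gamma function from the previous sketch:

```python
def elias_delta(x: int) -> str:
    """Elias delta code: gamma-code the length of C_2(x), then append C_2(x) without its leading 1."""
    assert x >= 1
    b = bin(x)[2:]                        # C_2(x)
    return elias_gamma(len(b)) + b[1:]    # len(b) = floor(log2 x) + 1

for x in (1, 2, 3, 4, 7, 8, 10, 19, 147):
    print(x, elias_delta(x))              # reproduces the table above
```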


Page 35: Information and Communication Theory Lecture 4

Recommended Reading

T. Cover and J. Thomas, "Elements of Information Theory", John Wiley & Sons, 2006 (Chapter 5).

M. Figueiredo, "Elias Coding for Arbitrary Natural Numbers", available at the course webpage in Fenix.
