INFORMATION THEORY & CODING - SUSTCeee.sustc.edu.cn/p/wangrui/docs/Lecture 07.pdf · 1-1...

Post on 30-Aug-2018

1-1

INFORMATION THEORY & CODING

Dr. Qi Wang
Department of Computer Science and Technology
Office: Nanshan i-park A7
Email: wangqi@sustc.edu.cn

Dr. Rui Wang
Department of Electrical and Electronic Engineering
Office: Nanshan i-park A7-1107
Email: wangr@sustc.edu.cn
Website: eee.sustc.edu.cn/p/wangrui

2-1

Review Summary

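McMillan inequality

A uniquely decodable code with codeword lengths $\ell_i$ exists $\iff \sum_i D^{-\ell_i} \le 1$.

Huffman code

$$L^* = \min_{\sum_i D^{-\ell_i} \le 1} \sum_i p_i \ell_i$$

$$H_D(X) \le L^* < H_D(X) + 1.$$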
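The inequality in the review is easy to check numerically. The following is a minimal sketch, not part of the lecture; the function name `kraft_sum` and the example length sets are assumptions chosen for illustration.

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Return sum_i D^(-l_i) exactly, as a Fraction."""
    return sum(Fraction(1, D ** l) for l in lengths)

# Lengths of the binary prefix code {0, 10, 110, 111} (hypothetical example).
print(kraft_sum([1, 2, 3, 3]) <= 1)   # True: the sum is exactly 1

# These lengths violate the inequality, so no uniquely decodable
# binary code can have them.
print(kraft_sum([1, 1, 2]) <= 1)      # False: the sum is 5/4
```

Exact fractions are used so that the boundary case (sum equal to 1) is not affected by floating-point rounding.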

3-1

Optimality of Huffman Codes

Lemma 5.8.1 For any distribution, the optimal prefix codes (with minimum expected length) should satisfy the following properties:
1. If $p_j > p_k$, then $\ell_j \le \ell_k$.
2. The two longest codewords have the same length.
3. There exists an optimal prefix code such that two of the longest codewords differ only in the last bit and correspond to the two least likely symbols.

⇒ If $p_1 \ge p_2 \ge \cdots \ge p_m$, then there exists an optimal code with $\ell_1 \le \ell_2 \le \cdots \le \ell_{m-1} = \ell_m$, and the codewords $C(x_{m-1})$ and $C(x_m)$ differ only in the last bit (canonical codes).

4-1

Optimality of Huffman Codes

We prove the optimality of Huffman codes by induction. Assume a binary code in the proof.

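[Figure: binary Huffman tree for five symbols with probabilities 0.25, 0.25, 0.2, 0.15, 0.15. Condense merges symbols 4 and 5 (0.15 + 0.15 = 0.3), then symbols 2 and 3 (0.45), then symbol 1 with 4 + 5 (0.55), ending at the root with weight 1. Expand reverses these steps, appending a 0 or 1 at each split.]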
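The condense step in the figure can be sketched in code. This is an illustrative implementation under assumptions, not the lecture's own: the function name `huffman_lengths` is hypothetical, and ties between equal weights are broken by insertion order, so the pairing of individual symbols may differ from the figure even though the codeword lengths agree.

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Return binary Huffman codeword lengths, one per input probability."""
    # Heap items: (weight, tie_breaker, indices of the symbols under this node).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)   # least likely node
        w2, _, s2 = heapq.heappop(heap)   # second least likely node
        for i in s1 + s2:                 # each symbol under the merge gains one bit
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths

probs = [0.25, 0.25, 0.2, 0.15, 0.15]     # distribution from the figure
lengths = huffman_lengths(probs)           # [2, 2, 2, 3, 3]
L = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * log2(p) for p in probs)
print(lengths, round(L, 4), round(H, 4))
print(H <= L < H + 1)                      # the bound from the review slide
```

For this distribution the expected length is 2.3 bits, which lies between the entropy (about 2.29 bits) and the entropy plus one.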

5-1

Optimality of Huffman Codes

Proof. For $p = (p_1, p_2, \ldots, p_m)$ with $p_1 \ge p_2 \ge \cdots \ge p_m$, we define the Huffman reduction $p' = (p_1, p_2, \ldots, p_{m-1} + p_m)$ over an alphabet of size $m - 1$. Let $C^*_{m-1}(p')$ be an optimal Huffman code for $p'$, and let $C^*_m(p)$ be the canonical optimal code for $p$.

Key idea.

Expand $C^*_{m-1}(p')$ to $C_m(p)$ ⇒ $L_m(p) = L^*_m(p)$, where $L_m(p)$ is the expected length of the code $C_m(p)$, and $L^*_m(p)$ is the minimum expected length for the source distribution $p$.

[Figure: code trees labeled $C_{m-1}(p')$ and $C^*_m(p)$.]

Expand $C^*_{m-1}(p')$ to $C_m(p)$:

$$L(p) = L^*(p') + p_{m-1} + p_m$$

Condense $C^*_m(p)$ to $C_{m-1}(p')$:

$$L^*(p) = L(p') + p_{m-1} + p_m$$

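Subtracting $L^*(p) = L(p') + p_{m-1} + p_m$ from $L(p) = L^*(p') + p_{m-1} + p_m$ and rearranging gives

$$\underbrace{\bigl(L(p') - L^*(p')\bigr)}_{\ge 0} + \underbrace{\bigl(L(p) - L^*(p)\bigr)}_{\ge 0} = 0.$$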

Since two nonnegative terms sum to zero, each is zero. Thus, $L(p) = L^*(p)$. Minimizing the expected length $L(C_m)$ is equivalent to minimizing $L(C_{m-1})$. The problem is reduced to one with $m - 1$ symbols and probability masses $(p_1, p_2, \ldots, p_{m-1} + p_m)$. Proceeding this way, we finally reduce the problem to two symbols, in which case the optimal code is obvious: assign 0 to one symbol and 1 to the other.
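The identity $L(p) = L(p') + p_{m-1} + p_m$ used in the proof can be applied recursively to compute the expected length of a binary Huffman code without building any codewords. The sketch below is an illustration only; the function name `huffman_expected_length` is an assumption.

```python
def huffman_expected_length(probs):
    """Expected binary Huffman length via L(p) = L(p') + p_{m-1} + p_m."""
    p = sorted(probs, reverse=True)
    if len(p) <= 1:
        return 0.0                      # a single symbol needs no bits
    merged = p[-2] + p[-1]              # Huffman reduction: merge the two least likely
    return huffman_expected_length(p[:-2] + [merged]) + merged

# Distribution from the Condense/Expand example: 0.3 + 0.45 + 0.55 + 1.0
print(huffman_expected_length([0.25, 0.25, 0.2, 0.15, 0.15]))
```

Each level of the recursion adds the weight of one merged node, so the result is the sum of all internal node weights of the Huffman tree (about 2.3 bits for this distribution).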

10-1

Coin tossing vs. Poker

Toss a fair coin and see the sequence

Head, Tail, Tail, Head, ...

$$p(x_1, x_2, \ldots, x_n) \approx 2^{-nH(X)}$$

Play card games and see the sequence

$$p(x_1, x_2, \ldots, x_n) = \ ?$$

11-1

Outline

Time-invariant Markov chain: a simple but powerful tool to model random phenomena.

Entropy rate: measures the information of a stochastic process.

12-1

How to Model dependence: Markov Chains

A stochastic process $\{X_i\}$ is an indexed sequence of random variables $(X_1, X_2, \ldots)$ characterized by the joint PMF $\Pr[(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)] = p(x_1, x_2, \ldots, x_n)$, where $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ for $n = 1, 2, \ldots$.

Definition A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index, i.e.,

$$\Pr[X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n] = \Pr[X_{1+\ell} = x_1, X_{2+\ell} = x_2, \ldots, X_{n+\ell} = x_n]$$

for every $n$ and every shift $\ell$ and for all $x_1, x_2, \ldots, x_n \in \mathcal{X}$.

13-1

Markov Chains

Definition A discrete stochastic process $X_1, X_2, \ldots$ is said to be a Markov chain or a Markov process when, for $n = 1, 2, \ldots$,

$$\Pr[X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1] = \Pr[X_{n+1} = x_{n+1} \mid X_n = x_n]$$

for all $x_1, x_2, \ldots, x_n, x_{n+1} \in \mathcal{X}$.

In this case, the joint PMF can be written as

$$p(x_1, x_2, \ldots, x_n) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1}).$$

14-1

Markov Chains

Definition The Markov chain is called time invariant if the conditional probability $\Pr[X_{n+1} \mid X_n]$ does NOT depend on $n$, i.e., for $n = 1, 2, \ldots$,

$$\Pr[X_{n+1} = b \mid X_n = a] = \Pr[X_2 = b \mid X_1 = a] \quad \text{for all } a, b \in \mathcal{X}.$$

We deal with time-invariant Markov chains, for which the terminology is defined below:

• If $\{X_i\}$ is a Markov chain, $X_n$ is called the state at time $n$.
• $\Pr[X_{n+1} \mid X_n]$ is the state transition probability.
• A time-invariant Markov chain is characterized by its initial distribution and a probability transition matrix $P = [P_{ij}]$, $i, j \in \{1, 2, \ldots, m\}$, where $P_{ij} = \Pr[X_{n+1} = j \mid X_n = i]$.

15-1

Simple weather model

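$\mathcal{X} = \{S, R\}$, where $S$ = Sunny and $R$ = Rainy

$p(S \mid S) = 1 - \beta$, $p(R \mid R) = 1 - \alpha$, $p(R \mid S) = \beta$, $p(S \mid R) = \alpha$

With rows and columns ordered $(S, R)$:

$$P = \begin{bmatrix} 1 - \beta & \beta \\ \alpha & 1 - \alpha \end{bmatrix}$$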

16-1

Simple weather model

Probability of seeing a sequence SSRR:

$$p(SSRR) = p(S)\,p(S \mid S)\,p(R \mid S)\,p(R \mid R) = p(S)(1 - \beta)\beta(1 - \alpha)$$

The joint distribution of a time-invariant Markov chain is determined by its initial distribution and probability transition matrix.
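The SSRR computation generalizes to any sequence: multiply the initial probability by one transition probability per step. The sketch below is illustrative; the function name `sequence_probability`, the numeric values of `alpha` and `beta`, and the choice of initial distribution are all assumptions.

```python
def sequence_probability(seq, initial, P):
    """p(x1,...,xn) = p(x1) * prod_k P[x_k][x_{k+1}] for a time-invariant Markov chain."""
    prob = initial[seq[0]]
    for a, b in zip(seq, seq[1:]):
        prob *= P[a][b]                 # one transition probability per step
    return prob

alpha, beta = 0.3, 0.1                  # hypothetical parameter values
P = {"S": {"S": 1 - beta, "R": beta},
     "R": {"S": alpha, "R": 1 - alpha}}
initial = {"S": 0.75, "R": 0.25}        # assumed initial distribution

p = sequence_probability("SSRR", initial, P)
print(p)                                # equals p(S)(1 - beta) beta (1 - alpha)
```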

17-1

Stationary Distribution

If the PMF of the random variable at time $n$ is $p(x_n)$, the PMF at time $n + 1$ is

$$p(x_{n+1}) = \sum_{x_n} p(x_n) P_{x_n x_{n+1}}.$$

A distribution $\mu$ on the states such that the distribution at time $n + 1$ is the same as the distribution at time $n$ is called a stationary distribution.

18-1

Stationary Distribution

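– If $\mu(S) = \frac{\alpha}{\alpha + \beta}$, $\mu(R) = \frac{\beta}{\alpha + \beta}$, and

$$P = \begin{bmatrix} 1 - \beta & \beta \\ \alpha & 1 - \alpha \end{bmatrix}$$

– then

$$\begin{aligned}
p(X_{n+1} = S) &= p(S \mid S)\,\mu(S) + p(S \mid R)\,\mu(R) \\
&= (1 - \beta)\frac{\alpha}{\alpha + \beta} + \alpha\,\frac{\beta}{\alpha + \beta} = \frac{\alpha}{\alpha + \beta} = \mu(S).
\end{aligned}$$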
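For completeness, the same check can be carried out for the Rainy state. This worked step is not on the slide; it follows directly from the transition probabilities of the weather model.

```latex
\begin{aligned}
p(X_{n+1} = R) &= p(R \mid S)\,\mu(S) + p(R \mid R)\,\mu(R) \\
&= \beta\,\frac{\alpha}{\alpha + \beta} + (1 - \alpha)\frac{\beta}{\alpha + \beta}
 = \frac{\alpha\beta + \beta - \alpha\beta}{\alpha + \beta}
 = \frac{\beta}{\alpha + \beta} = \mu(R).
\end{aligned}
```

Both states keep their probabilities after one step, so $\mu$ is indeed stationary.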

19-1

Stationary Distribution

How to calculate the stationary distribution?
– The stationary distribution $\mu_i$, $i = 1, 2, \ldots, |\mathcal{X}|$, satisfies

$$\mu_i = \sum_j \mu_j P_{ji} \quad (\mu = \mu P), \quad \text{and} \quad \sum_{i=1}^{|\mathcal{X}|} \mu_i = 1.$$
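One simple way to approximate a solution of $\mu = \mu P$ numerically is to apply $P$ repeatedly to a starting distribution. This is a sketch of that approach rather than the method prescribed by the lecture: it assumes the chain is irreducible and aperiodic so that the iteration converges, and the function name `stationary_distribution` and the parameter values are hypothetical.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate mu with mu = mu P by repeatedly applying P to a uniform start.

    Assumes an irreducible, aperiodic chain; P is a row-stochastic list of lists.
    """
    m = len(P)
    mu = [1.0 / m] * m
    for _ in range(iterations):
        # One step of mu_i <- sum_j mu_j P_ji
        mu = [sum(mu[j] * P[j][i] for j in range(m)) for i in range(m)]
    return mu

alpha, beta = 0.3, 0.1                      # hypothetical weather-model parameters
P = [[1 - beta, beta], [alpha, 1 - alpha]]  # states ordered (S, R)
mu = stationary_distribution(P)
print(mu)                                   # close to [alpha/(alpha+beta), beta/(alpha+beta)]
```

For small chains, solving the linear system $\mu = \mu P$ together with $\sum_i \mu_i = 1$ directly is an equally valid alternative.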

20-1

Entropy Rate

When the $X_i$ are i.i.d., the entropy is

$$H(X^n) = H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i) = nH(X).$$

When the $X_i$ are dependent, how does $H(X^n)$ grow with $n$?

The entropy rate characterizes this growth rate.

21-1

Entropy Rate

Definition 1: average entropy per symbol

$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n}$$

Definition 2: conditional entropy of the last r.v. given the past

$$H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)$$

22-1

Entropy Rate

Theorem 4.2.2 For a stationary stochastic process, $H(X_n \mid X_{n-1}, \ldots, X_1)$ is nonincreasing in $n$ and has a limit $H'(\mathcal{X})$.

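Proof.

$$\begin{aligned}
H(X_{n+1} \mid X_1, X_2, \ldots, X_n) &\le H(X_{n+1} \mid X_n, \ldots, X_2) && \text{(conditioning reduces entropy)} \\
&= H(X_n \mid X_{n-1}, \ldots, X_1) && \text{(stationarity)}
\end{aligned}$$

– $H(X_n \mid X_{n-1}, \ldots, X_1)$ is nonincreasing in $n$.
– $H(X_n \mid X_{n-1}, \ldots, X_1) \ge 0$.
– A nonincreasing sequence that is bounded below converges, so the limit must exist.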

23-1

Entropy Rate

Theorem 4.2.1 For a stationary stochastic process, $H(\mathcal{X}) = H'(\mathcal{X})$.

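Proof. By the chain rule,

$$\frac{1}{n} H(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1).$$

• $H(X_n \mid X_{n-1}, \ldots, X_1) \to H'(\mathcal{X})$
• Cesàro mean: if $a_n \to a$ and $b_n = \frac{1}{n} \sum_{i=1}^{n} a_i$, then $b_n \to a$.
• So $\frac{1}{n} H(X_1, \ldots, X_n) \to H'(\mathcal{X})$.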
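The convergence of the per-symbol entropy can be observed numerically on a small example. The sketch below uses a stationary two-state Markov chain purely as an assumed test case (the theorem itself holds for any stationary process) and computes $\frac{1}{n} H(X_1, \ldots, X_n)$ by brute-force enumeration; the function name and parameter values are hypothetical.

```python
from itertools import product
from math import log2

def joint_entropy_per_symbol(n, mu, P):
    """(1/n) H(X1,...,Xn) in bits, by enumerating all length-n state sequences."""
    m = len(mu)
    total = 0.0
    for seq in product(range(m), repeat=n):
        prob = mu[seq[0]]                     # start from the stationary distribution
        for a, b in zip(seq, seq[1:]):
            prob *= P[a][b]
        if prob > 0:
            total -= prob * log2(prob)
    return total / n

alpha, beta = 0.3, 0.1                        # hypothetical parameters
P = [[1 - beta, beta], [alpha, 1 - alpha]]
mu = [alpha / (alpha + beta), beta / (alpha + beta)]

rates = [joint_entropy_per_symbol(n, mu, P) for n in range(1, 13)]
for n, r in enumerate(rates, start=1):
    print(n, round(r, 4))                     # nonincreasing in n
```

The enumeration costs $2^n$ terms, so it is only practical for small $n$; it is meant as a check on the theorem, not as a way to compute entropy rates.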

24-1

Entropy Rate for Markov Chain

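For a time-invariant Markov chain with a stationary initial distribution, the entropy rate is

$$H(\mathcal{X}) = H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1) = \lim_{n \to \infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1).$$

By definition, $p(X_2 = j \mid X_1 = i) = P_{ij}$.

Entropy rate of a stationary Markov chain

$$H(\mathcal{X}) = H(X_2 \mid X_1) = \sum_i \mu_i \Bigl( \sum_j -P_{ij} \log P_{ij} \Bigr) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij}.$$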

25-1

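To Calculate Entropy Rate

1. Find the stationary distribution $\mu_i$:

$$\mu_i = \sum_j \mu_j P_{ji} \quad (\mu = \mu P), \quad \text{and} \quad \sum_{i=1}^{|\mathcal{X}|} \mu_i = 1.$$

2. Use the transition probabilities $P_{ij}$:

$$H(\mathcal{X}) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij}$$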
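Step 2 translates directly into a double sum. The following sketch assumes the stationary distribution from step 1 is already known; the function name `entropy_rate` and the three-state example chain are hypothetical, and the logarithm is taken base 2 so the result is in bits per symbol.

```python
from math import log2

def entropy_rate(mu, P):
    """H = -sum_{i,j} mu_i P_ij log2 P_ij, with 0 log 0 taken as 0."""
    return -sum(
        mu[i] * P[i][j] * log2(P[i][j])
        for i in range(len(P))
        for j in range(len(P[i]))
        if P[i][j] > 0
    )

# Hypothetical three-state chain. Every column sums to 1, so the uniform
# distribution satisfies mu = mu P.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
mu = [1 / 3, 1 / 3, 1 / 3]
print(entropy_rate(mu, P))   # approximately 1 bit: each state has two equally likely successors
```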

26-1

Entropy Rate of Weather Model

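Stationary distribution: $\mu(S) = \frac{\alpha}{\alpha + \beta}$, $\mu(R) = \frac{\beta}{\alpha + \beta}$

$$P = \begin{bmatrix} 1 - \beta & \beta \\ \alpha & 1 - \alpha \end{bmatrix}$$

$$\begin{aligned}
H(\mathcal{X}) &= \mu(S) H(\beta) + \mu(R) H(\alpha) \\
&= \frac{\alpha}{\alpha + \beta} H(\beta) + \frac{\beta}{\alpha + \beta} H(\alpha) \\
&\le H\!\left(\frac{2\alpha\beta}{\alpha + \beta}\right) && \text{(Jensen's inequality)}
\end{aligned}$$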

The entropy rate attains its maximum of 1 bit when $\alpha = \beta = 1/2$, in which case the chain degenerates to an independent (i.i.d.) process.
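The closed form and the Jensen bound for the weather model can be checked numerically. In the sketch below, $H(\cdot)$ applied to a number is read as the binary entropy function in bits; the function names and the grid of test values are assumptions made for illustration.

```python
from math import log2

def binary_entropy(x):
    """H(x) = -x log2 x - (1 - x) log2 (1 - x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def weather_entropy_rate(alpha, beta):
    """mu(S) H(beta) + mu(R) H(alpha) with mu(S) = alpha/(alpha+beta)."""
    return (alpha * binary_entropy(beta) + beta * binary_entropy(alpha)) / (alpha + beta)

def jensen_bound(alpha, beta):
    """Upper bound H(2 alpha beta / (alpha + beta)) from Jensen's inequality."""
    return binary_entropy(2 * alpha * beta / (alpha + beta))

grid = [k / 10 for k in range(1, 10)]          # 0.1, 0.2, ..., 0.9
print(all(weather_entropy_rate(a, b) <= jensen_bound(a, b) + 1e-12
          for a in grid for b in grid))        # the bound holds on the grid
print(weather_entropy_rate(0.5, 0.5))          # 1.0, the maximum
```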

27-1

Examples

Random Walk on a Weighted Graph (Chapter 4.3)

Second law of thermodynamics (Chapter 4.4)