Transcript of EE514a – Information Theory I, Fall Quarter 2019, Lecture 4 (Oct 7th, 2019), Prof. Jeff Bilmes.

  • EE514a – Information Theory I, Fall Quarter 2019

    Prof. Jeff Bilmes

    University of Washington, Seattle
    Department of Electrical & Computer Engineering

    Fall Quarter, 2019
    https://class.ece.uw.edu/514/bilmes/ee514_fall_2019/

    Lecture 4 – Oct 7th, 2019

  • Class Road Map - IT-I

    L1 (9/25): Overview, Communications, Information, Entropy

    L2 (9/30): Entropy, Mutual Information, KL-Divergence

    L3 (10/2): More KL, Jensen, more Venn, Log Sum, Data Proc. Inequality

    L4 (10/7): Data Proc. Ineq., thermodynamics, Stats, Fano

    L5 (10/9): M. of Conv, AEP, Source Coding

    L6 (10/14):

    L7 (10/16):

    L8 (10/21):

    L9 (10/23):

    L10 (10/28):

    L11 (10/30):

    L12 (11/4):

    LXX (11/6): In class midterm exam

    LXX (11/11): Veterans Day holiday

    L13 (11/13):

    L14 (11/18):

    L15 (11/20):

    L16 (11/25):

    L17 (11/27):

    L18 (12/2):

    L19 (12/4):

    LXX (12/10): Final exam

    Finals Week: December 9th–13th.

  • Cumulative Outstanding Reading

    Read chapters 1 and 2 in our book (Cover & Thomas, “Information Theory”) (including Fano’s inequality).

    Read chapters 3 and 4 in our book (Cover & Thomas, “Information Theory”).

  • Homework

    Homework 1, on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), was due Tuesday, Oct 8th, 11:55pm.

    Homework 2 will be out by early next week.

  • Fano’s Inequality: Summary

    Consider the following situation where we send X through a noisy channel, receive Y, and do further processing:

    [Diagram: X → noisy channel p(y|x) → Y → processing g(·) → X̂]

    X̂ is an estimate of X.

    An error occurs if X ≠ X̂. How do we measure the error? With the probability Pe ≜ p(X ≠ X̂). Intuitively, conditional entropy should tell us something about the error possibilities; in fact, we have

    Theorem 4.2.7 (Fano’s Inequality)

    H(Pe) + Pe log(|X| − 1) ≥ H(X|X̂) ≥ H(X|Y)   (4.34)

  • Fano’s Inequality

    Theorem 4.2.7

    H(Pe) + Pe log(|X| − 1) ≥ H(X|X̂) ≥ H(X|Y)   (4.28)

    So Pe = 0 requires that H(X|Y) = 0! Note, the theorem simplifies to (and implies) 1 + Pe log(|X|) ≥ H(X|Y), or

    Pe ≥ (H(X|Y) − 1) / log|X|   (4.29)

    yielding a lower bound on the error.

    This will be used to prove the converse to Shannon’s coding theorem, i.e., that any code with probability of error → 0 as the block length increases must have a rate R < C = the capacity of the channel (to be defined).
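
    A quick numerical illustration of (4.29) (not from the slides — a minimal Python sketch with made-up values for H(X|Y) and the alphabet size):

        import math

        def fano_error_lower_bound(H_X_given_Y, alphabet_size):
            """Lower bound on Pe from the simplified Fano inequality:
            Pe >= (H(X|Y) - 1) / log2(|X|).  (A negative value means the bound is vacuous.)"""
            return (H_X_given_Y - 1.0) / math.log2(alphabet_size)

        # Hypothetical numbers: X over a 26-letter alphabet, H(X|Y) = 2.5 bits.
        print(fano_error_lower_bound(2.5, 26))   # ~0.319: any estimator X_hat errs at least ~32% of the time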

  • An interesting bound on probability of equality

    Lemma 4.3.1

    Let X, X′ be two independent r.v.s with X ∼ p(x) and X′ ∼ r(x), with x, x′ ∈ X (same alphabet). Lower bound on the cross collision probability:

    p(X = X′) ≥ max( 2^{−H(p)−D(p||r)}, 2^{−H(r)−D(r||p)} )   (4.1)

    Proof.

    2^{−H(p)−D(p||r)} = 2^{ ∑_x p(x) log p(x) + ∑_x p(x) log (r(x)/p(x)) }   (4.2)
                      = 2^{ ∑_x p(x) log r(x) }   (4.3)
                      ≤ ∑_x p(x) 2^{log r(x)}   (4.4)   (by Jensen’s inequality, since 2^y is convex)
                      = ∑_x p(x) r(x)   (4.5)
                      = p(X = X′)   (4.6)

    The second term in the max follows by symmetry, swapping the roles of p and r.
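
    A small numerical sanity check of Lemma 4.3.1 (not from the slides — a Python sketch with two arbitrary distributions p and r on the same 3-letter alphabet):

        import math

        p = [0.5, 0.25, 0.25]        # X  ~ p
        r = [0.1, 0.6, 0.3]          # X' ~ r  (same alphabet, independent of X)

        H = lambda q: -sum(qi * math.log2(qi) for qi in q if qi > 0)
        D = lambda a, b: sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

        collision = sum(pi * ri for pi, ri in zip(p, r))           # p(X = X')
        bound = max(2 ** (-H(p) - D(p, r)), 2 ** (-H(r) - D(r, p)))

        print(collision, bound, collision >= bound)                # 0.275 >= ~0.268 -> True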

  • An interesting bound on probability of equality

    Thus, taking p(x) = r(x), the probability that two i.i.d. random variables X, X′ are the same can be bounded as follows:

    P(X = X′) ≥ 2^{−H(p)}   (4.7)

    P(X = X′) = ∑_x p(x)² is known as the probability of collision.

    The above improves the standard collision probability bound (here n = |X|):

    ∑_x p(x)² = ∑_x [(p(x) − 1/n) + 1/n]²   (4.8)
              = 1/n + ∑_x [p(x) − 1/n]²  ≥  1/n   (4.9)

    so equality is achieved when p(x) = 1/n (uniform distribution), in which case also P(X = X′) = 2^{−H(p)} = 2^{−log n} = 1/n and the two bounds coincide. Since H(p) ≤ log n always, 2^{−H(p)} ≥ 1/n, which is why (4.7) improves on (4.9).

    Many other probabilistic quantities can be bounded in terms of entropic quantities, as we will see throughout the course.
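
    A sketch comparing these two lower bounds on the i.i.d. collision probability (the distribution below is an arbitrary example of mine):

        import math

        p = [0.7, 0.2, 0.1]                        # i.i.d. X, X' ~ p; alphabet size n = 3
        collision = sum(pi * pi for pi in p)        # P(X = X') = sum_x p(x)^2
        H = -sum(pi * math.log2(pi) for pi in p)

        print(collision)        # 0.54
        print(2 ** (-H))        # ~0.449  (entropy bound (4.7))
        print(1 / len(p))       # ~0.333  (standard 1/n bound (4.9)) -- weaker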

  • Weak Law of Large Numbers (WLLN)

    Let X1, X2, . . . be a sequence of i.i.d. r.v.s with EXi = µ < ∞, and let Sn = X1 + X2 + · · · + Xn. Then for any ε > 0, the WLLN says that

    p( | (1/n)Sn − µ | > ε ) → 0   as n → ∞   (4.10)

    Written as (1/n)Sn →p µ (converges in probability).

    (1/n)Sn gets as close to µ as we want, and the variance of (1/n)Sn gets arbitrarily small, if n gets big enough.

    SLLN: If all r.v.s have finite 2nd moments, i.e., E(Xi²) < ∞, then (1/n)Sn → µ almost surely (a stronger mode of convergence, defined on the next slide).
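
    A Monte Carlo sketch of (4.10) (my example: Bernoulli(0.3) trials), estimating the deviation probability for increasing n:

        import random

        p, eps, trials = 0.3, 0.05, 500
        for n in (10, 100, 1000, 10000):
            # Estimate p(|S_n/n - mu| > eps) by repeated simulation.
            bad = sum(abs(sum(random.random() < p for _ in range(n)) / n - p) > eps
                      for _ in range(trials))
            print(n, bad / trials)   # fraction of runs with a large deviation; shrinks toward 0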

  • Aside: modes of convergence

    Xn → X almost surely (Xn →a.s. X) if the set {ω ∈ Ω : Xn(ω) → X(ω) as n → ∞}   (4.11) is an event with probability 1. So all such events combined together have probability 1, and any event left out has probability zero, according to the underlying probability measure.

    Xn → X in the rth mean (r ≥ 1) (written Xn →r X) if E|Xn^r| < ∞ for all n and E( |Xn − X|^r ) → 0 as n → ∞.   (4.12)

    Xn → X in probability (written Xn →p X) if p( |Xn − X| > ε ) → 0 as n → ∞, for all ε > 0.   (4.13)

    Xn → X in distribution (written Xn →D X) if p(Xn ≤ x) → p(X ≤ x) as n → ∞   (4.14)

    for all points x at which FX(x) = p(X ≤ x) is continuous.

  • Aside: modes of convergence, expanded definition

    Expanded description of convergence in probability.

    Let X1, X2, . . . be a sequence of i.i.d. r.v.s with EX = µ < ∞, and let Sn = X1 + · · · + Xn. Then ∀ε > 0, ∀δ > 0, ∃n0 such that for all n > n0, we have

    p( | (1/n)Sn − µ | > ε ) < δ   for n > n0   (4.15)

    Equivalently,

    p( | (1/n)Sn − µ | ≤ ε ) > 1 − δ   for n > n0   (4.16)
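
    A sketch of how one might pick an n0 for given ε and δ in (4.15) via Chebyshev’s inequality, p(|Sn/n − µ| > ε) ≤ σ²/(nε²) (using Chebyshev here is my addition, not something the slides prescribe):

        import math

        def chebyshev_n0(sigma2, eps, delta):
            """Smallest n guaranteeing sigma2/(n*eps^2) < delta, hence p(|S_n/n - mu| > eps) < delta."""
            return math.floor(sigma2 / (delta * eps ** 2)) + 1

        # Hypothetical: Bernoulli(0.3) trials, so sigma^2 = 0.3 * 0.7 = 0.21
        print(chebyshev_n0(0.21, eps=0.05, delta=0.01))   # 8401: any n > 8400 works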

  • Aside: modes of convergence, implications

    Some modes of convergence are stronger than others. That is:

    (a.s.) ⇒ (p),   (r) ⇒ (p) for all r ≥ 1,   and (p) ⇒ (D).

    Also, if r > s ≥ 1, then (r) ⇒ (s).

    Different versions of things like the law of large numbers differ only in the strength of their required modes of convergence.

    Keep this in mind when considering the AEP, which we turn to next.

  • Asymptotic Equipartition Property (AEP)

    At first, we’ll emphasize mostly intuition.

    Please read chapter 3 if you haven’t yet. If you have read it, read it again.

    Consider blocks of outcomes of random variables (i.e., random vectors of length n). n = the block length.

    Let X1, X2, . . . , Xn be i.i.d. r.v.s all distributed according to p (we say Xi ∼ p(x)).

    As before, ∀i, Xi ∈ {a1, a2, . . . , aK}, so K possible symbols (alphabet or state space of size K).

    For the n random variables (X1, X2, . . . , Xn), there are K^n possible outcomes.

  • Towards AEP

    We wish to encode these K^n outcomes with binary digit strings (i.e., code words) of length m. Hence, there are M = 2^m possible code words.

    We can represent the encoder as follows:

    [Encoder diagram: source messages {X1, X2, . . . , Xn}, Xi ∈ {a1, a2, . . . , aK}, K^n possible messages, n source letters in each source msg → Encoder → code words {Y1, Y2, . . . , Ym}, Yi ∈ {0, 1}, 2^m possible messages, m total bits.]

    Example: English letters would have K = 26 (alphabet size K); a “source message” consists of n letters.

    We want to have a code word for every possible source message; what condition must hold?

    M = 2^m ≥ K^n  ⇒  m ≥ n log K   (4.17)
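
    A quick check of (4.17) for the English-letters example (K = 26 is from the slide; the block lengths n below are arbitrary):

        import math

        K = 26
        for n in (1, 10, 100):
            m = math.ceil(n * math.log2(K))      # bits needed to index all K^n source messages
            print(n, m, 2 ** m >= K ** n)        # e.g. n=10 -> m=48 bits, and the condition holds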

  • Towards AEP

    A question on rate: how many bits are used per source letter?

    R = rate = (log M)/n = m/n ≥ log K   bits per source letter   (4.18)

    Not surprising; e.g., for English we need ⌈log K⌉ = 5 bits.

    Question: can we use fewer bits than this per source letter (on average) and still have essentially no error? Yes.

    How? One way: some source messages would not have a code.

    [Diagram: source messages mapped to code words, with some source messages mapped to “garbage” (no code word).]

    I.e., code words are only assigned to a subset of the source messages!

  • Towards AEP

    Any source message assigned to garbage would, if we wish to send that message, result in an error.

    Alternatively, and perhaps less distressingly, rather than throw some messages into the trash, we could assign to them long code words, and assign the non-garbage messages short code words.

    [Diagram: source messages mapped either to short code words or to long code words.]

    In either case, if n gets big enough, we can make the code such that the probability of getting one of those error source messages (or long-code-word source messages) is very small!

  • Probability of Source Words

    The probability of a source word can be expressed as

    p(X1 = x1, X2 = x2, . . . , Xn = xn) = ∏_{i=1}^n p(Xi = xi)   (4.19)

    Recall, the Shannon/Hartley information about an event {X = x} is −log p(x) = I(x), so the information of the joint event {x1, x2, . . . , xn} is:

    I(x1, x2, . . . , xn) = −log p(x1, x2, . . . , xn)   (4.20)
                          = −log ∏_{i=1}^n p(xi) = ∑_{i=1}^n −log p(xi) = ∑_{i=1}^n I(xi)   (4.21)

    Also note that E I(X) = H(X).

    The WLLN says that (1/n)Sn →p µ, where Sn is the sum of i.i.d. r.v.s with mean µ = EXi.

    So, I(Xi) is also a random variable, with mean H(X).
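
    A small sketch of (4.19)–(4.21): the self-information of an i.i.d. sequence is the sum of the per-symbol informations (the toy distribution and sequence are my own):

        import math

        p = {'a': 0.5, 'b': 0.25, 'c': 0.25}            # toy source distribution
        seq = ['a', 'b', 'a', 'c', 'a']

        I = lambda x: -math.log2(p[x])                   # Shannon/Hartley information of one symbol
        joint_prob = math.prod(p[x] for x in seq)        # product form of (4.19)

        print(sum(I(x) for x in seq))                    # 7.0 bits
        print(-math.log2(joint_prob))                    # 7.0 bits -- the same thing, per (4.21)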

  • WLLN and entropy

    Combining the above, we get

    (1/n) ∑_{i=1}^n I(Xi) →p H(X)   as n → ∞   (4.22)

    Thus, if n is big enough, we have that (this is where it gets cool):

    (1/n) ∑_{i=1}^n I(xi) ≈ H(X)   when ∀i, xi ∼ p(x)   (4.23)
    ⇒ −(1/n) ∑_{i=1}^n log p(xi) ≈ H(X)   (4.24)
    ⇒ −log ∏_{i=1}^n p(xi) ≈ nH(X)   (4.25)
    ⇒ −log p(x1, x2, . . . , xn) ≈ nH(X)   (4.26)
    ⇒ p(x1, . . . , xn) ≈ 2^{−nH(X)}   (4.27)
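
    A Monte Carlo sketch of (4.22)–(4.27) for a Bernoulli(0.2) source (my example parameters): the per-symbol sample information −(1/n) log p(x1, . . . , xn) concentrates around H(X):

        import math, random

        p = 0.2
        H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))        # ~0.722 bits

        def sample_info_rate(n):
            xs = [random.random() < p for _ in range(n)]             # draw x_1, ..., x_n i.i.d.
            logp = sum(math.log2(p if x else 1 - p) for x in xs)     # log p(x_1, ..., x_n)
            return -logp / n

        for n in (10, 100, 1000, 10000):
            print(n, round(sample_info_rate(n), 3), "vs H =", round(H, 3))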

  • Towards AEP

    We repeat this last equation: when n is large enough, we have

    p(x1, . . . , xn) ≈ 2^{−nH(X)}   when ∀i, xi ∼ p(x)   (4.28)

    Note, perhaps somewhat startlingly, the r.h.s. is the probability, and it does not depend on the specific sequence instance x1, x2, . . . , xn!

    So, if n gets large enough, pretty much all sequences that happen have the same probability . . .

    . . . and that probability is equal to 2^{−nH}.

    Those sequences that have that probability (which means pretty much all of them that occur) are called the typical sequences, represented by the set A.

  • Said again: Almost all events are almost equally probable

    If X1, X2, . . . , Xn are i.i.d. and Xi ∼ p(x) for all i, and

    if n is large enough,

    then for any sample x1, x2, . . . , xn:

    The probability of the sample is essentially independent of the sample, i.e.,

    p(x1, . . . , xn) ≈ 2^{−nH(X)}   (4.29)

    where H(X) is the entropy of p(x).

    Thus, there can only be about 2^{nH} such samples, and it may be that 2^{nH} ≪ K^n.

    Those samples that will happen are called typical, and they are represented by A_ε^{(n)}.

    Thus, a large portion of X^n essentially won’t happen, i.e., it could be that 2^{nH} ≈ |A_ε^{(n)}| ≪ |X^n| = K^n.

  • The Typical Set

    Let A_ε^{(n)} be the set of typical sequences (i.e., those with probability ≈ 2^{−nH(X)}).

    If “all” events have the same probability p, then there are 1/p of them.

    So the number of typical sequences is

    |A_ε^{(n)}| ≈ 2^{nH(X)}.   (4.30)

    Thus, to represent (or code for) the typical sequences, we need only nH(X) bits, so we take

    m = nH(X)   (4.31)

    in the encoder model. Thus, the rate of the code is H(X).

    [Encoder diagram as before: source messages {X1, . . . , Xn}, Xi ∈ {a1, . . . , aK}, K^n possible messages, n source letters per message → Encoder → code words {Y1, . . . , Ym}, Yi ∈ {0, 1}, 2^m possible messages, m total bits.]
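
    A brute-force sketch (my addition) that enumerates all 2^n sequences of a Bernoulli(0.2) source for a small n, counts those meeting the usual ε-typicality condition |−(1/n) log p(x^n) − H| ≤ ε (formalized in chapter 3), and compares |A_ε^(n)| to 2^{nH}:

        import math
        from itertools import product

        p, n, eps = 0.2, 12, 0.1
        H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

        typical, prob_mass = 0, 0.0
        for xs in product((0, 1), repeat=n):
            k = sum(xs)
            prob = p ** k * (1 - p) ** (n - k)
            if abs(-math.log2(prob) / n - H) <= eps:
                typical += 1
                prob_mass += prob

        print(typical, 2 ** (n * H))   # |A_eps^(n)| vs 2^{nH}
        print(prob_mass)               # probability of the typical set (-> 1 as n grows, for fixed eps)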

  • Code Only The Typical Set

    If we take m = nH, then the average number of bits per source alphabet letter (i.e., the rate) becomes:

    m/n = H,   which could be ≤ log K   (4.32)

    So to summarize, we have three uses and interpretations of entropy here for source coding:

    1. The probability of a typical sequence is 2^{−nH(X)}.
    2. The number of typical sequences is 2^{nH(X)}.
    3. The number of bits per source symbol is H(X), when we code only for the typical set.

  • Fano WLLN & Modes of Conv. Towards AEP AEP Source Coding

    AEP setup

    Consider Bernoulli trials X1, X2, . . . , Xn i.i.d. with p(Xi = 1) = p = 1 − p(Xi = 0).

    The probability of a single sequence is

    p(x1, x2, . . . , xn) = ∏_{i=1}^{n} p^{x_i} (1 − p)^{1−x_i} = p^{∑_i x_i} (1 − p)^{n − ∑_i x_i} (4.33)

    There are 2^n possible sequences. Do they all have the same probability? No; consider when p = 0.1, (1 − p) = 0.9. The sequence of all zeros is much more likely.

    What is the most probable sequence? When p = 0.1, the sequence of all 0s.

    Do the sequences that collectively have “any” probability all have the same probability? Depends what we mean by “any”, but for small n, no. But as n gets large, something funny happens and “yes” becomes a more appropriate answer.

    Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 4 - Oct 7th, 2019 L4 F23/50(pg.84/244)
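A quick illustration in Python (assumed parameters, not from the slides): with p = 0.1, the single most probable sequence is all zeros, yet a sequence with about np ones has probability close to 2^{−nH(p)}.

```python
# Per-sequence probabilities for i.i.d. Bernoulli(p): the all-zeros sequence is
# the most probable one, but a sequence with ~np ones has probability ~2^{-nH(p)}.
import math

p, n = 0.1, 100
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def seq_prob(num_ones):
    """Probability of one particular sequence containing num_ones ones."""
    return (p ** num_ones) * ((1 - p) ** (n - num_ones))

print(f"all-zeros sequence : p = {seq_prob(0):.3e}")
print(f"one typical seq    : p = {seq_prob(int(n * p)):.3e}")
print(f"2^(-n H(p))        :     {2 ** (-n * H):.3e}")
```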

  • Fano WLLN & Modes of Conv. Towards AEP AEP Source Coding

    AEP setup

    Notational reminder: H = H(X) is the entropy of a single random variable distributed as p(x).

    Can we predict the probability that a particular sequence has a particular probability value? I.e.,

    Pr(p(X1, X2, . . . , Xn) = α) = ? (4.34)

    Note: “p(X1, X2, . . . , Xn)” is a random variable! It is a random probability, and it is a true random variable since it is a probability that is a function of a set of random variables.

    It turns out that

    Pr(p(X1, X2, . . . , Xn) ≈ 2^{−nH}) ≈ 1 (4.35)

    if n is large enough.

    In English, this can be read as: almost all events (that occur collectively with any appreciable probability) are all equally likely.

    Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 4 - Oct 7th, 2019 L4 F24/50(pg.93/244)
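A Monte Carlo sketch of (4.35) in Python (assumed source and parameters, not from the slides): estimate Pr(|−(1/n) log2 p(X1, . . . , Xn) − H| < ε) for an i.i.d. Bernoulli(p) source and watch it approach 1 as n grows.

```python
# Estimate Pr( p(X1,...,Xn) ~ 2^{-nH} ), i.e. Pr(|-(1/n) log2 p(X^n) - H| < eps),
# by sampling sequences from an i.i.d. Bernoulli(p) source.
import math
import random

p, eps, trials = 0.3, 0.05, 20000
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def is_close(n):
    k = sum(random.random() < p for _ in range(n))           # number of 1s in one sampled sequence
    log_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
    return abs(-log_prob / n - H) < eps

for n in (10, 100, 1000):
    hits = sum(is_close(n) for _ in range(trials))
    print(f"n = {n:5d}:  Pr(p(X^n) ~ 2^(-nH)) ~= {hits / trials:.3f}")
```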

  • Fano WLLN & Modes of Conv. Towards AEP AEP Source Coding

    Ex: Bernoulli trials

    Let Sn ∼ Binomial(n, p) with Sn = X1 + X2 + · · · + Xn, Xi ∼ Bernoulli(p).

    Hence, E Sn = np and var(Sn) = npq, and

    p(Sn = k) = (n choose k) p^k q^{n−k} (4.36)

    Then, the expression 2^{−nH} can be viewed in an intuitive way. I.e.,

    2^{−nH(p)} = 2^{−n(−p log p − (1−p) log(1−p))} (4.37)
               = 2^{np log p + n(1−p) log(1−p)} (4.38)
               = p^{np} q^{nq},  where q = 1 − p (4.39)

    Here H = H(p) is the binary entropy with probability p.

    np is the expected number of 1’s.

    nq is the expected number of 0’s.

    Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 4 - Oct 7th, 2019 L4 F25/50(pg.98/244)
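A one-line numeric check of (4.37)–(4.39) in Python (illustrative p and n, not from the slides): a sequence with exactly np ones and nq zeros has probability 2^{−nH(p)}.

```python
# Check the identity 2^{-n H(p)} = p^{np} q^{nq} numerically.
import math

p, n = 0.2, 50
q = 1 - p
H = -p * math.log2(p) - q * math.log2(q)     # binary entropy H(p)

lhs = 2 ** (-n * H)
rhs = (p ** (n * p)) * (q ** (n * q))
print(f"2^(-nH(p))    = {lhs:.6e}")
print(f"p^(np) q^(nq) = {rhs:.6e}")
```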

  • Fano WLLN & Modes of Conv. Towards AEP AEP Source Coding

    Towards AEP

    In other words, all sequences that occur are the ones where the numbers of 1s and 0s are roughly equal to their expected values.

    No other sequences have any appreciable probability!

    The sequence X1, X2, . . . , Xn was assumed i.i.d., but this can be extended to Markov chains, and to ergodic stationary random processes.

    But before doing any of that, we need more formalism.

    Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 4 - Oct 7th, 2019 L4 F26/50(pg.106/244)

  • Fano WLLN & Modes of Conv. Towards AEP AEP Source Coding

    Asymptotic Equipartition Property (AEP)

    Theorem 4.5.1 (AEP)

    If X1, X2, . . . , Xn are i.i.d. and Xi ∼ p(x) for all i, then

    −(1/n) log p(X1, X2, . . . , Xn) −p→ H(X) (4.40)

    Proof.

    −(1/n) log p(X1, X2, . . . , Xn) = −(1/n) log ∏_{i=1}^{n} p(Xi) (4.41)
                                     = −(1/n) ∑_i log p(Xi) −p→ −E log p(X) (4.42)
                                     = H(X) (4.43)

    where the convergence in (4.42) follows from the weak law of large numbers applied to the i.i.d. terms log p(Xi).

    Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 4 - Oct 7th, 2019 L4 F27/50(pg.110/244)
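A small simulation sketch of Theorem 4.5.1 in Python (the three-letter distribution is an illustrative assumption, not from the slides): for an i.i.d. source, −(1/n) log2 p(X1, . . . , Xn) concentrates around H(X) as n grows.

```python
# Empirically show -(1/n) log2 p(X1,...,Xn) -> H(X) in probability (via WLLN).
import math
import random

pmf = {"a": 0.5, "b": 0.3, "c": 0.2}                        # assumed 3-letter source
H = -sum(q * math.log2(q) for q in pmf.values())

def normalized_neg_log_prob(n):
    xs = random.choices(list(pmf), weights=pmf.values(), k=n)
    return -sum(math.log2(pmf[x]) for x in xs) / n

print(f"H(X) = {H:.4f} bits")
for n in (10, 100, 1000, 10000):
    samples = [normalized_neg_log_prob(n) for _ in range(200)]
    spread = max(samples) - min(samples)
    print(f"n = {n:6d}: mean = {sum(samples)/len(samples):.4f}, spread = {spread:.4f}")
```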

  • Fano WLLN & Modes of Conv. Towards AEP AEP Source Coding

    Typical Set

    Definition 4.5.2 (Typical Set)

    The typical set A^{(n)}_ε w.r.t. p(x) is the set of sequences (x1, x2, . . . , xn) ∈ X^n with the property that

    2^{−n(H(X)+ε)} ≤ p(x1, x2, . . . , xn) ≤ 2^{−n(H(X)−ε)} (4.44)

    Equivalently, we may write A^{(n)}_ε as

    A^{(n)}_ε = { (x1, x2, . . . , xn) : | −(1/n) log p(x1, . . . , xn) − H | < ε } (4.45)

    The typical set consists of those sequences whose log probability lies within nε of −nH, i.e., in the interval [−n(H + ε), −n(H − ε)].

    A^{(n)}_ε has a number of interesting properties.

    Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 4 - Oct 7th, 2019 L4 F28/50(pg.116/244)
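A minimal Python sketch of Definition 4.5.2 (assuming an i.i.d. source given by a pmf dictionary; parameters are illustrative): test whether a particular sequence lies in A^{(n)}_ε.

```python
# Membership test for the typical set A_eps^(n) of an i.i.d. source.
import math

def in_typical_set(seq, pmf, eps):
    """True iff |-(1/n) log2 p(seq) - H(X)| < eps for the i.i.d. source pmf."""
    H = -sum(q * math.log2(q) for q in pmf.values())
    n = len(seq)
    log_prob = sum(math.log2(pmf[x]) for x in seq)
    return abs(-log_prob / n - H) < eps

pmf = {0: 0.7, 1: 0.3}
print(in_typical_set([0] * 20, pmf, eps=0.1))              # all zeros: False (too probable)
print(in_typical_set([1] * 6 + [0] * 14, pmf, eps=0.1))    # ~np ones: True
```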

  • Fano WLLN & Modes of Conv. Towards AEP AEP Source Coding

    Typical Set

    Definition 4.5.2 (Typical Set)

    The typical set A(n)� w.r.t. p(x) is the set of sequences

    (x1,