
Data Compression / Source Coding

source P_X

How much “information” is contained in X?

• compress it into a minimal number L of bits per source symbol

• decompress reliably ⇒ the average information content is L bits per symbol

Shannon’s source-coding theorem: L ≈ H(X)

X → compress → L bits → inflate → X

Compressing blocks of n source symbols:

X1, …, Xn → compress → L·n bits → inflate → X1, …, Xn
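To make L ≈ H(X) concrete, here is a minimal sketch (Python 3, standard library only; zlib stands in for the compress/inflate boxes, so it is a generic lossless compressor rather than the code the theorem constructs, and the helper names binary_entropy and zlib_rate are ours): it draws n i.i.d. Bernoulli(p1) symbols, compresses them, and compares the achieved bits per symbol with the binary entropy H2(p1).

```python
# Minimal sketch, standard library only. zlib is a stand-in for the ideal
# "compress" box: it will not reach H(X) exactly, but it shows the rate
# dropping far below 1 bit per source symbol for a biased source.
import math
import random
import zlib

def binary_entropy(p: float) -> float:
    """H2(p) = -p log2 p - (1 - p) log2(1 - p), with H2(0) = H2(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def zlib_rate(p1: float, n: int) -> float:
    """Bits per source symbol that zlib needs for n i.i.d. Bernoulli(p1) bits."""
    random.seed(0)
    bits = [1 if random.random() < p1 else 0 for _ in range(n)]
    # pack 8 source bits per byte so the compressor sees the raw bit stream
    packed = bytes(sum(b << i for i, b in enumerate(bits[k:k + 8]))
                   for k in range(0, n, 8))
    return 8 * len(zlib.compress(packed, level=9)) / n

if __name__ == "__main__":
    p1, n = 0.1, 100_000
    print(f"H(X)      = {binary_entropy(p1):.3f} bits/symbol")
    print(f"zlib rate = {zlib_rate(p1, n):.3f} bits/symbol")
```

For p1 = 0.1 the entropy is about 0.47 bits/symbol; zlib typically lands well below 1 bit/symbol but noticeably above H(X), which is the gap that purpose-built source codes close.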


Two Types of Compression

source P_X

X1, …, Xn → compress → L·n bits → inflate → X1, …, Xn

1. Lossless compression (e.g. zip):

• maps all source strings to different encodings

• it shortens some, but necessarily makes others longer

• design it such that the average length is shorter (see the block-code sketch after this slide)

2. Lossy compression (e.g. image compression):

• maps some source strings to the same encoding (recovery sometimes fails)

• if the error can be made arbitrarily small, the scheme can still be useful in practice

Shannon’s source-coding theorem: L ≈ H(X)
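The lossless bullets above can be illustrated with a block code. The sketch below (Python standard library; huffman_lengths and avg_bits_per_symbol are our own helper names, and a Huffman code is just one standard optimal prefix code, not something prescribed by the slide) builds a Huffman code over blocks of k i.i.d. Bernoulli(p1) symbols: some blocks receive codewords longer than k bits, yet the average length per source symbol falls toward H(X) as k grows.

```python
# Sketch, standard library only: an optimal prefix (Huffman) code over blocks
# of k source symbols. Some blocks get codewords longer than k bits, but the
# expected length per source symbol approaches H2(p1) as k grows.
import heapq
import itertools
import math

def huffman_lengths(probs):
    """Codeword lengths of an optimal prefix code for the given pmf."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        pa, _, a = heapq.heappop(heap)
        pb, _, b = heapq.heappop(heap)
        for sym in a + b:           # every merge adds one bit to the codewords
            lengths[sym] += 1       # of all symbols inside the merged subtree
        heapq.heappush(heap, (pa + pb, min(a + b), a + b))
    return lengths

def avg_bits_per_symbol(p1: float, k: int) -> float:
    """Expected Huffman codeword length per source symbol for block length k."""
    blocks = list(itertools.product([0, 1], repeat=k))
    probs = [p1 ** sum(b) * (1 - p1) ** (k - sum(b)) for b in blocks]
    lengths = huffman_lengths(probs)
    return sum(p * l for p, l in zip(probs, lengths)) / k

if __name__ == "__main__":
    p1 = 0.1
    H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)
    for k in (1, 2, 4, 8):
        print(f"k={k}: {avg_bits_per_symbol(p1, k):.3f} bits/symbol  (H = {H:.3f})")
```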

Book by David MacKay — Chapter 4: The Source Coding Theorem

[Figure 4.10. The top 15 strings are samples from X^100, where p_1 = 0.1 and p_0 = 0.9. The bottom two are the most and least probable strings in this ensemble. The final column shows the log-probabilities log_2 P(x) of the random strings, which may be compared with the entropy H(X^100) = 46.9 bits.]

except for tails close to δ = 0 and 1. As long as we are allowed a tiny probability of error δ, compression down to NH bits is possible. Even if we are allowed a large probability of error, we still can compress only down to NH bits. This is the source coding theorem.

Theorem 4.1 Shannon’s source coding theorem. Let X be an ensemble with entropy H(X) = H bits. Given ε > 0 and 0 < δ < 1, there exists a positive integer N_0 such that for N > N_0,

$$\left| \tfrac{1}{N} H_\delta(X^N) - H \right| < \epsilon. \qquad (4.21)$$

4.4 Typicality

Why does increasing N help? Let’s examine long strings from X^N. Table 4.10 shows fifteen samples from X^N for N = 100 and p_1 = 0.1. The probability of a string x that contains r 1s and N − r 0s is

$$P(x) = p_1^r (1 - p_1)^{N-r}. \qquad (4.22)$$

The number of strings that contain r 1s is

$$n(r) = \binom{N}{r}. \qquad (4.23)$$

So the number of 1s, r, has a binomial distribution:

$$P(r) = \binom{N}{r} p_1^r (1 - p_1)^{N-r}. \qquad (4.24)$$

These functions are shown in figure 4.11. The mean of r is N p_1, and its standard deviation is $\sqrt{N p_1 (1 - p_1)}$ (p. 1). If N is 100 then

$$r \sim N p_1 \pm \sqrt{N p_1 (1 - p_1)} \simeq 10 \pm 3. \qquad (4.25)$$
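The quantities behind equations (4.22)–(4.25) and Figure 4.11 are easy to tabulate. A small sketch (Python standard library; N and p1 as in the figure, n_strings and p_single are our own helper names):

```python
# Sketch, standard library only: the quantities plotted in Figure 4.11 for
# N = 100 and p1 = 0.1, plus the concentration statement (4.25).
import math

N, p1 = 100, 0.1

def n_strings(r: int) -> int:
    """n(r) = C(N, r): number of length-N strings containing r ones."""
    return math.comb(N, r)

def p_single(r: int) -> float:
    """P(x) = p1^r (1 - p1)^(N - r): probability of one string with r ones."""
    return p1 ** r * (1 - p1) ** (N - r)

mean = N * p1
std = math.sqrt(N * p1 * (1 - p1))
print(f"r ~ {mean:.0f} +/- {std:.0f}")     # 10 +/- 3, as in (4.25)

for r in (0, 5, 10, 15, 100):
    print(f"r={r:3d}  n(r)={n_strings(r):.3e}  "
          f"log2 P(x)={math.log2(p_single(r)):8.1f}  "
          f"P(r)={n_strings(r) * p_single(r):.3e}")
```

The printed log-probabilities for r = 0 and r = 100 match the two extreme rows of Figure 4.10 (about −15.2 and −332), and r concentrates around 10 ± 3 as in (4.25).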



[Figure 4.11. Anatomy of the typical set T. For p_1 = 0.1 and N = 100 and N = 1000, these graphs show, as functions of r: n(r) = C(N, r), the number of strings containing r 1s; the probability P(x) = p_1^r (1 − p_1)^{N−r} of a single string that contains r 1s; the same probability on a log scale; and the total probability n(r) P(x) of all strings that contain r 1s. The plot of log_2 P(x) also shows by a dotted line the mean value of log_2 P(x) = −N H_2(p_1), which equals −46.9 when N = 100 and −469 when N = 1000. The typical set includes only the strings that have log_2 P(x) close to this value. The range marked T shows the set T_{Nβ} (as defined in section 4.4) for N = 100 and β = 0.29 (left) and N = 1000, β = 0.09 (right).]


Book by David MacKay


[Figure 4.12. Schematic diagram showing all strings in the ensemble X^N ranked by their probability log_2 P(x), and the typical set T_{Nβ} (the strings with log_2 P(x) close to −NH(X)).]

The ‘asymptotic equipartition’ principle is equivalent to:

Shannon’s source coding theorem (verbal statement). N i.i.d. random variables each with entropy H(X) can be compressed into more than NH(X) bits with negligible risk of information loss, as N → ∞; conversely if they are compressed into fewer than NH(X) bits it is virtually certain that information will be lost.

These two theorems are equivalent because we can define a compression algorithm that gives a distinct name of length NH(X) bits to each x in the typical set.
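As a toy version of the naming argument in the previous paragraph (a sketch only: Python standard library, brute-force enumeration, so N must stay small, and the values of N, p1 and β are arbitrary choices of ours), one can list the typical set T_{Nβ} for a Bernoulli(p1) source explicitly and check that a fixed-length index of roughly NH(X) bits is enough to give each element a distinct name:

```python
# Sketch, standard library only. Brute-force enumeration of the typical set
# T_{N,beta} = { x : | -(1/N) log2 P(x) - H | < beta } for a Bernoulli(p1)
# source; N, p1 and beta are arbitrary small choices for illustration.
import itertools
import math

def typical_set(N: int, p1: float, beta: float):
    H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)
    T = []
    for x in itertools.product([0, 1], repeat=N):
        r = sum(x)
        log2_p = r * math.log2(p1) + (N - r) * math.log2(1 - p1)
        if abs(-log2_p / N - H) < beta:
            T.append(x)
    return T, H

if __name__ == "__main__":
    N, p1, beta = 20, 0.1, 0.1
    T, H = typical_set(N, p1, beta)
    index_bits = math.ceil(math.log2(len(T)))   # bits needed to name each typical x
    print(f"|T| = {len(T)}, N*H = {N * H:.1f} bits, index needs {index_bits} bits")
    # give each typical string the binary expansion of its position in T
    codebook = {x: format(i, f"0{index_bits}b") for i, x in enumerate(T)}
    print("codeword for", "".join(map(str, T[0])), "=", codebook[T[0]])
    # (at N this small the typical set does not yet carry most of the
    #  probability; that is where N -> infinity comes in)
```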

4.5 Proofs

This section may be skipped if found tough going.

The law of large numbers

Our proof of the source coding theorem uses the law of large numbers.

Mean and variance of a real random variable are $\mathcal{E}[u] = \bar{u} = \sum_u P(u)\, u$ and $\mathrm{var}(u) = \sigma_u^2 = \mathcal{E}[(u - \bar{u})^2] = \sum_u P(u)(u - \bar{u})^2$.

Technical note: strictly I am assuming here that u is a function u(x) of a sample x from a finite discrete ensemble X. Then the summations $\sum_u P(u) f(u)$ should be written $\sum_x P(x) f(u(x))$. This means that P(u) is a finite sum of delta functions. This restriction guarantees that the mean and variance of u do exist, which is not necessarily the case for general P(u).

Chebyshev’s inequality 1. Let t be a non-negative real random variable, and let α be a positive real number. Then

$$P(t \geq \alpha) \leq \frac{\bar{t}}{\alpha}. \qquad (4.30)$$

Proof: $P(t \geq \alpha) = \sum_{t \geq \alpha} P(t)$. We multiply each term by $t/\alpha \geq 1$ and obtain: $P(t \geq \alpha) \leq \sum_{t \geq \alpha} P(t)\, t/\alpha$. We add the (non-negative) missing terms and obtain: $P(t \geq \alpha) \leq \sum_t P(t)\, t/\alpha = \bar{t}/\alpha$. ✷
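A quick numerical sanity check of inequality (4.30) (illustrative only; the uniform distribution on {0, …, 9} is just a convenient non-negative test variable):

```python
# Quick numerical check of (4.30), standard library only; the uniform
# distribution on {0, ..., 9} is just a convenient non-negative test variable.
import random

random.seed(1)
values = [random.randint(0, 9) for _ in range(100_000)]     # samples of t >= 0
t_bar = sum(values) / len(values)                            # empirical mean of t
for alpha in (2.0, 5.0, 8.0):
    lhs = sum(v >= alpha for v in values) / len(values)      # empirical P(t >= alpha)
    print(f"alpha={alpha}: P(t >= alpha) = {lhs:.3f}  <=  t_bar/alpha = {t_bar / alpha:.3f}")
```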

University of Illinois at Chicago ECE 534, Fall 2009, Natasha Devroye

Consequences of the AEP


where the second inequality follows from (3.6). Hence

$$|A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}. \qquad (3.12)$$

Finally, for sufficiently large n, $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$, so that

$$1 - \epsilon < \Pr\{A_\epsilon^{(n)}\} \qquad (3.13)$$
$$\leq \sum_{x \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)} \qquad (3.14)$$
$$= 2^{-n(H(X)-\epsilon)}\, |A_\epsilon^{(n)}|, \qquad (3.15)$$

where the second inequality follows from (3.6). Hence,

$$|A_\epsilon^{(n)}| \geq (1 - \epsilon)\, 2^{n(H(X)-\epsilon)}, \qquad (3.16)$$

which completes the proof of the properties of $A_\epsilon^{(n)}$. ∎
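As a sanity check on (3.12)–(3.16), the following sketch (Python standard library; a Bernoulli(p1) source with arbitrarily chosen n, p1 and ε, counting strings by their number of ones) computes |A_ε^(n)| and its probability and places log2 |A_ε^(n)| between the two exponents:

```python
# Numerical check of the bounds (3.12) and (3.16), standard library only;
# Bernoulli(p1) source, with n, p1 and eps chosen arbitrarily, and sizes
# computed by grouping strings by their number of ones.
import math

n, p1, eps = 500, 0.1, 0.1
H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def in_typical_set(r: int) -> bool:
    """True if a string with r ones lies in A_eps^(n)."""
    log2_p = r * math.log2(p1) + (n - r) * math.log2(1 - p1)
    return abs(-log2_p / n - H) <= eps

size = sum(math.comb(n, r) for r in range(n + 1) if in_typical_set(r))
prob = sum(math.comb(n, r) * p1 ** r * (1 - p1) ** (n - r)
           for r in range(n + 1) if in_typical_set(r))

print(f"Pr(A_eps^(n))    = {prob:.4f}")
print(f"log2 |A_eps^(n)| = {math.log2(size):.1f}")
print(f"n (H - eps)      = {n * (H - eps):.1f}   (lower bound, up to log2(1 - eps))")
print(f"n (H + eps)      = {n * (H + eps):.1f}   (upper bound)")
```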

3.2 CONSEQUENCES OF THE AEP: DATA COMPRESSION

Let X_1, X_2, . . . , X_n be independent, identically distributed random variables drawn from the probability mass function p(x). We wish to find short descriptions for such sequences of random variables. We divide all sequences in X^n into two sets: the typical set A_ε^(n) and its complement, as shown in Figure 3.1.

[Figure 3.1. Typical sets and source coding. X^n, with |X|^n elements in total, is divided into the typical set A_ε^(n), containing 2^{n(H+ε)} elements, and the non-typical set.]

Typical set contains almost all the probability!


[Figure 3.2. Source code using the typical set. Sequences in the typical set are described with n(H + ε) + 2 bits; non-typical sequences are described with n log|X| + 2 bits.]

We order all elements in each set according to some order (e.g., lexicographic order). Then we can represent each sequence of A_ε^(n) by giving the index of the sequence in the set. Since there are ≤ 2^{n(H+ε)} sequences in A_ε^(n), the indexing requires no more than n(H + ε) + 1 bits. [The extra bit may be necessary because n(H + ε) may not be an integer.] We prefix all these sequences by a 0, giving a total length of ≤ n(H + ε) + 2 bits to represent each sequence in A_ε^(n) (see Figure 3.2). Similarly, we can index each sequence not in A_ε^(n) by using not more than n log|X| + 1 bits. Prefixing these indices by 1, we have a code for all the sequences in X^n.

Note the following features of the above coding scheme:

• The code is one-to-one and easily decodable. The initial bit acts as a flag bit to indicate the length of the codeword that follows.

• We have used a brute-force enumeration of the atypical set A_ε^(n)c without taking into account the fact that the number of elements in A_ε^(n)c is less than the number of elements in X^n. Surprisingly, this is good enough to yield an efficient description.

• The typical sequences have short descriptions of length ≈ nH.

We use the notation x^n to denote a sequence x_1, x_2, . . . , x_n. Let l(x^n) be the length of the codeword corresponding to x^n. If n is sufficiently large so that Pr{A_ε^(n)} ≥ 1 − ε, the expected length of the codeword is

$$E(l(X^n)) = \sum_{x^n} p(x^n)\, l(x^n) \qquad (3.17)$$

Slide annotation: how many sequences are in this set? That count (obtained here simply by enumeration) is exactly what makes the typical set useful for source coding (compression).
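The two-class code just described is small enough to simulate. In the sketch below (Python standard library; grouping strings by their number of ones is our shortcut so that n can be moderately large, and the concrete values of n, p1 and ε are arbitrary), typical sequences are named by a 0-flag plus an index into A_ε^(n), all other sequences by a 1-flag plus a raw n-bit index, and the resulting expected rate is compared with H + ε:

```python
# Sketch of the two-class (flag bit + index) code, standard library only.
# Grouping strings by their number of ones r is our shortcut: for an i.i.d.
# Bernoulli(p1) source all strings with the same r are equiprobable, so the
# typical set and all probabilities can be handled exactly without listing
# 2^n strings. The values of n, p1 and eps are arbitrary.
import math

n, p1, eps = 200, 0.1, 0.15
H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def log2_prob(r: int) -> float:
    """log2 p(x^n) for any one string with exactly r ones."""
    return r * math.log2(p1) + (n - r) * math.log2(1 - p1)

def is_typical(r: int) -> bool:
    """x^n in A_eps^(n), i.e. |-(1/n) log2 p(x^n) - H| <= eps."""
    return abs(-log2_prob(r) / n - H) <= eps

size_typical = sum(math.comb(n, r) for r in range(n + 1) if is_typical(r))
bits_typical = 1 + math.ceil(math.log2(size_typical))   # '0' flag + index into A_eps
bits_atypical = 1 + n                                    # '1' flag + raw n-bit string

prob_typical = sum(math.comb(n, r) * 2.0 ** log2_prob(r)
                   for r in range(n + 1) if is_typical(r))
expected_len = prob_typical * bits_typical + (1 - prob_typical) * bits_atypical

print(f"Pr(A_eps^(n))  = {prob_typical:.3f}")
print(f"E[l(X^n)] / n  = {expected_len / n:.3f} bits/symbol   (H + eps = {H + eps:.3f})")
```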



AEP and data compression


“High-probability set” vs. “typical set”

• Typical set: small number of outcomes that contain most of the probability

• Is it the smallest such set?



Book by David MacKay


Part 2: $\frac{1}{N} H_\delta(X^N) > H - \epsilon$.

Imagine that someone claims this second part is not so – that, for any N, the smallest δ-sufficient subset S_δ is smaller than the above inequality would allow. We can make use of our typical set to show that they must be mistaken. Remember that we are free to set β to any value we choose. We will set β = ε/2, so that our task is to prove that a subset S′ having |S′| ≤ 2^{N(H−2β)} and achieving P(x ∈ S′) ≥ 1 − δ cannot exist (for N greater than an N_0 that we will specify).

So, let us consider the probability of falling in this rival smaller subset S′. The probability of the subset S′ is

$$P(x \in S') = P(x \in S' \cap T_{N\beta}) + P(x \in S' \cap \overline{T}_{N\beta}), \qquad (4.38)$$

[Margin figure: Venn diagram of S′ and T_{Nβ}, marking the pieces S′ ∩ T_{Nβ} and S′ ∩ T̄_{Nβ}.]

where $\overline{T}_{N\beta}$ denotes the complement $\{x \notin T_{N\beta}\}$. The maximum value of the first term is found if S′ ∩ T_{Nβ} contains 2^{N(H−2β)} outcomes all with the maximum probability, 2^{−N(H−β)}. The maximum value the second term can have is $P(x \in \overline{T}_{N\beta})$. So:

$$P(x \in S') \leq 2^{N(H-2\beta)}\, 2^{-N(H-\beta)} + \frac{\sigma^2}{\beta^2 N} = 2^{-N\beta} + \frac{\sigma^2}{\beta^2 N}. \qquad (4.39)$$

We can now set β = ε/2 and N_0 such that P(x ∈ S′) < 1 − δ, which shows that S′ cannot satisfy the definition of a sufficient subset S_δ. Thus any subset S′ with size |S′| ≤ 2^{N(H−ε)} has probability less than 1 − δ, so by the definition of H_δ, H_δ(X^N) > N(H − ε).

Thus for large enough N, the function $\frac{1}{N} H_\delta(X^N)$ is essentially a constant function of δ, for 0 < δ < 1, as illustrated in figures 4.9 and 4.13. ✷

4.6 Comments

The source coding theorem (p. 78) has two parts, $\frac{1}{N} H_\delta(X^N) < H + \epsilon$, and $\frac{1}{N} H_\delta(X^N) > H - \epsilon$. Both results are interesting.

The first part tells us that even if the probability of error δ is extremely small, the number of bits per symbol $\frac{1}{N} H_\delta(X^N)$ needed to specify a long N-symbol string x with vanishingly small error probability does not have to exceed H + ε bits. We need to have only a tiny tolerance for error, and the number of bits required drops significantly from H_0(X) to (H + ε).

What happens if we are yet more tolerant to compression errors? Part 2 tells us that even if δ is very close to 1, so that errors are made most of the time, the average number of bits per symbol needed to specify x must still be at least H − ε bits. These two extremes tell us that regardless of our specific allowance for error, the number of bits per symbol needed to specify x is H bits; no more and no less.
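Both parts can be seen numerically for a Bernoulli(p1) source. The sketch below (Python standard library; for an i.i.d. binary source the smallest δ-sufficient subset S_δ can be built exactly by taking the most probable strings, grouped by their number of ones, which is our shortcut rather than anything in the text) computes (1/N) H_δ(X^N) for several δ and shows it settling toward H as N grows, whether δ is small or large:

```python
# Sketch, standard library only: (1/N) H_delta(X^N) for a Bernoulli(p1)
# source, computed exactly by greedily filling the smallest delta-sufficient
# subset S_delta with the most probable strings, grouped by number of ones.
import math

def h_delta_rate(N: int, p1: float, delta: float) -> float:
    """(1/N) log2 |S_delta| for N i.i.d. Bernoulli(p1) bits."""
    def log2p(r: int) -> float:            # log2-probability of one string with r ones
        return r * math.log2(p1) + (N - r) * math.log2(1 - p1)
    covered, count = 0.0, 0
    for r in sorted(range(N + 1), key=log2p, reverse=True):   # most probable first
        p_single = 2.0 ** log2p(r)
        group = math.comb(N, r)            # number of strings with exactly r ones
        if covered + group * p_single >= 1 - delta:
            # only part of this group is needed to reach probability 1 - delta
            count += math.ceil((1 - delta - covered) / p_single)
            break
        covered += group * p_single
        count += group
    return math.log2(count) / N

if __name__ == "__main__":
    p1 = 0.1
    H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)
    for N in (100, 1000):
        row = ", ".join(f"delta={d}: {h_delta_rate(N, p1, d):.3f}"
                        for d in (0.01, 0.1, 0.5))
        print(f"N={N:4d}  (1/N) H_delta = {row}   [H = {H:.3f}]")
```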

Caveat regarding ‘asymptotic equipartition’

I put the words ‘asymptotic equipartition’ in quotes because it is important not to think that the elements of the typical set T_{Nβ} really do have roughly the same probability as each other. They are similar in probability only in the sense that their values of $\log_2 \frac{1}{P(x)}$ are within 2Nβ of each other. Now, as β is decreased, how does N have to increase, if we are to keep our bound on the mass of the typical set, $P(x \in T_{N\beta}) \geq 1 - \frac{\sigma^2}{\beta^2 N}$, constant? N must grow as 1/β², so, if we write β in terms of N as α/√N, for some constant α, then …