
  • Rate distortion theory: An introduction

    Paulo J S G Ferreira

    SPL — Signal Processing Lab / IEETA, Univ. Aveiro, Portugal

    April 23, 2010


  • Contents

    1 Preliminaries

    2 Weak and strong typicality

    3 Rate distortion function

    4 Rate distortion theorem

    5 Computation algorithms



  • Entropy

    The entropy of a discrete random variable X with probability mass function p(x) is

    $$H(X) = -\sum_x p(x) \log p(x)$$

    Base two logarithms are tacitly used throughout the lecture. The term −log p(x) is associated with the uncertainty of X; the entropy is the average uncertainty:

    $$H(X) = E_{p(x)} \log \frac{1}{p(x)}$$

    Exercise: show that 0 ≤ H(X) ≤ log n, where n is the alphabet size, and identify the distributions for which the bounds are attained.
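    As a quick numerical companion (an illustration added in editing, not part of the original slides), a minimal Python sketch that evaluates H(X) and checks the two bounds of the exercise:

    ```python
    import numpy as np

    def entropy(p):
        """Shannon entropy of a probability vector p, in bits."""
        p = np.asarray(p, dtype=float)
        nz = p > 0                          # convention: 0 log 0 = 0
        return -np.sum(p[nz] * np.log2(p[nz]))

    # The uniform distribution over n = 4 symbols attains the upper bound
    # log2(4) = 2 bits; a degenerate distribution attains the lower bound 0.
    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
    print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0
    ```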

  • Relative entropy

    The relative entropy measures the "distance" between two probability functions:

    $$D(p\|q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_{p(x)} \log \frac{p(x)}{q(x)}$$

    It is nonnegative, and zero if and only if p = q. It is not symmetric and it does not satisfy the triangle inequality; hence, it is not a metric. It is sometimes called the Kullback-Leibler distance.

    Exercise: show that D(p||q) ≥ 0.
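    A minimal sketch of D(p||q) (illustrative, assuming q > 0 wherever p > 0), which also exhibits the asymmetry mentioned above:

    ```python
    import numpy as np

    def kl_divergence(p, q):
        """Relative entropy D(p||q) in bits; assumes q > 0 wherever p > 0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        nz = p > 0
        return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # ≈ 0.737
    print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ≈ 0.531, so not symmetric
    ```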

  • Conditional entropy

    H(X|Y) is the expected value of the entropies of the conditional distributions p(x|y), averaged over the conditioning variable Y:

    $$\begin{aligned}
    H(X|Y) &= \sum_y p(y)\, H(X|Y=y) \\
    &= -\sum_y p(y) \sum_x p(x|y) \log p(x|y) \\
    &= -\sum_{x,y} p(y)\, p(x|y) \log p(x|y) \\
    &= -\sum_{x,y} p(x,y) \log p(x|y) \\
    &= -E_{p(x,y)} \log p(x|y)
    \end{aligned}$$

  • Mutual information

    It is the Kullback-Leibler distance between the joint distribution p(x,y) and the product p(x)p(y):

    $$I(X,Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = E_{p(x,y)} \log \frac{p(x,y)}{p(x)p(y)}$$

    Meaning: how far X and Y are from being independent; the information that X contains about Y.

    Exercise: show that I(X,Y) ≥ 0.

  • Mutual information and entropy

    $$\begin{aligned}
    I(X,Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\
    &= \sum_{x,y} p(x,y) \log \frac{p(x|y)\,p(y)}{p(x)\,p(y)} \\
    &= \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)} \\
    &= -\sum_{x,y} p(x,y) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y) \\
    &= -\sum_x p(x) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y) \\
    &= H(X) - H(X|Y)
    \end{aligned}$$

    Meaning: the mutual information measures the reduction in the uncertainty of X due to knowing Y.
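    The identity I(X,Y) = H(X) − H(X|Y) is easy to confirm numerically. A small sketch on a hypothetical 2×2 joint distribution (the numbers are only for illustration):

    ```python
    import numpy as np

    pxy = np.array([[0.4, 0.1],          # hypothetical joint p(x, y)
                    [0.1, 0.4]])
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    HX_given_Y = H(pxy.ravel()) - H(py)  # chain rule: H(X|Y) = H(X,Y) - H(Y)
    I = sum(pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
            for i in range(2) for j in range(2))
    print(I, H(px) - HX_given_Y)         # both ≈ 0.278 bits
    ```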

  • Conditional entropy and entropy

    We have seen that

    $$I(X,Y) = H(X) - H(X|Y)$$

    We have also seen that I(X,Y) is nonnegative. It follows that

    $$H(X) \ge H(X|Y)$$

    Meaning: on the average, adding more information does not increase the uncertainty.

  • Chain rules

    The simplest is

    $$H(X,Y) = H(X) + H(Y|X)$$

    Compare with p(x,y) = p(y|x)p(x). By repeated application,

    $$H(X,Y,Z) = H(X,Y) + H(Z|X,Y) = H(X) + H(Y|X) + H(Z|X,Y)$$

    The general case should be apparent:

    $$H(X_1, X_2, \ldots, X_n) = \sum_i H(X_i \,|\, X_1, \ldots, X_{i-1})$$

  • Differential entropy

    The continuous case is mathematically much more subtle. The differential entropy of X with probability density f(x) is

    $$h(X) = -\int_{-\infty}^{+\infty} f(x) \log f(x)\, dx$$

    Log: base e; unit: nats. Example: the differential entropy of a random variable uniform on [a, b] is

    $$h(X) = -\int_a^b \frac{1}{b-a} \log \frac{1}{b-a}\, dx = \log(b-a)$$

    It can be negative (whenever b − a < 1)!

  • Gaussian variable

    The differential entropy of a Gaussian variable with density

    $$g(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

    is

    $$\begin{aligned}
    h(X) &= -\int_{-\infty}^{+\infty} g(x) \log g(x)\, dx \\
    &= -\log \frac{1}{\sqrt{2\pi\sigma^2}} + \log e \int_{-\infty}^{+\infty} \frac{(x-\mu)^2}{2\sigma^2}\, g(x)\, dx \\
    &= \frac{1}{2}\log 2\pi\sigma^2 + \frac{1}{2}\log e = \frac{1}{2}\log 2\pi e \sigma^2
    \end{aligned}$$
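    A Monte Carlo check of this formula (added for illustration; everything here is in nats, consistent with the base-e convention for differential entropy):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 2.0
    x = rng.normal(mu, sigma, 1_000_000)

    # -E[ln g(X)] estimated by sampling, versus 0.5 * ln(2*pi*e*sigma^2)
    log_g = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    print(-log_g.mean())                              # ≈ 2.112 nats
    print(0.5 * np.log(2 * np.pi * np.e * sigma**2))  # = 2.112... nats
    ```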

  • Gaussian variable

    The differential entropy of a Gaussian variable is maximum among all densities with the same variance. The proof depends on

    $$\int_{-\infty}^{+\infty} f(x) \log \frac{f(x)}{g(x)}\, dx \ge 0$$

    Recall the relative entropy / Kullback-Leibler distance to see why. Expanding the integral gives h(f) ≤ −∫ f log g; since log g is quadratic in x, the right-hand side depends on f only through its mean and variance, and therefore equals h(g) when f and g share them.


  • Main ideas

    Typicality: describing "typical" sequences. These are forms of the asymptotic equipartition property. They are to information theory what the law of large numbers is to probability; in fact, they depend on the law of large numbers. Weak typicality requires that the "empirical entropy" approach the "true entropy". Strong typicality requires that the relative frequency of each outcome approach its probability.

  • Weak typicality

    Let Xⁿ be an i.i.d. sequence X₁, X₂, ..., Xₙ. It follows that

    $$p(X^n) = p(X_1)\, p(X_2) \cdots p(X_n)$$

    The weakly typical set is formed by the sequences such that

    $$\left| -\frac{1}{n} \log p(x^n) - H(X) \right| \le \varepsilon$$

    The term on the left is the empirical entropy. The probability of the typical set approaches 1 as n → ∞. This does not mean that "most sequences are typical", but rather that the non-typical sequences have small probability. The most likely sequence may not be weakly typical.
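    The law-of-large-numbers behaviour behind weak typicality is easy to see experimentally. A sketch (the source distribution is a hypothetical example):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    p = np.array([0.7, 0.2, 0.1])                 # hypothetical source p(x)
    H = -np.sum(p * np.log2(p))                   # true entropy ≈ 1.157 bits

    n = 10_000
    xn = rng.choice(len(p), size=n, p=p)          # i.i.d. sequence x^n
    empirical = -np.log2(p[xn]).sum() / n         # -(1/n) log2 p(x^n)
    print(H, empirical)                           # close for large n
    ```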

  • Strong typicality

    Given ε > 0, a sequence is strongly typical with respect to p(x) if the average number of occurrences of each symbol a in the sequence deviates from its probability by less than ε:

    $$\left| \frac{N(a)}{n} - p(a) \right| < \varepsilon$$

    Also required: symbols of zero probability do not occur, that is, N(a) = 0 if p(a) = 0.

  • Strong typicality

    It is stronger than weak typicality, and it allows better probability bounds. The probability of a non-typical sequence not only goes to zero, it satisfies an exponential bound:

    $$P[X^n \notin T(\varepsilon)] \le 2^{-n\varphi(\varepsilon)}$$

    where φ is a positive function. The proof depends on a Chernoff-type bound: for the unit step u and b ≥ 0, the inequality $u(x-a) \le 2^{b(x-a)}$ yields

    $$E[u(X-a)] = P(X \ge a) \le E[2^{b(X-a)}] = 2^{-ba}\, E[2^{bX}]$$


  • Rate distortion theory

    Compression can be lossless or lossy. Lossy compression implies distortion. Rate distortion theory describes the trade-off between the lossy compression rate and the corresponding distortion.

  • Representation of continuous variables

    Shannon wrote in Part V of A Mathematical Theory of Communication:

    ...a continuously variable quantity can assume an infinite number of values and requires, therefore, an infinite number of bits for exact specification.

    This means that...

    ...to transmit the output of a continuous source with exact recovery at the receiving point requires, in general, a channel of infinite capacity (in bits per second). Since, ordinarily, channels have a certain amount of noise, and therefore a finite capacity, exact transmission is impossible.

  • Is everything lost?

    Still quoting Shannon:

    Practically, we are not interested in exact transmission when we have a continuous source, but only in transmission to within a certain tolerance.

    This leads to the real issue:

    The question is, can we assign a definite rate to a continuous source when we require only a certain fidelity of recovery, measured in a suitable way.

  • Yet another angle

    Consider a source with entropy rate H. The source coding theorem says that there are good source codes of rate R if R > H. What if the available rate is below H? By the converse of the source coding theorem, the error probability tends to 1 as n increases, which is bad news. What is needed is a rate distortion code that reproduces the sequence within a certain allowed distortion.

  • The problem

    Find a way of measuring distortion, then determine the minimum rate necessary for a given distortion. As the distortion requirement is made more stringent, the necessary rate increases. It turns out that it is possible to define a rate such that:

    Proper encoding makes transmission possible over a channel with capacity equal to that rate, at that distortion. A channel of smaller capacity is insufficient.

    This lecture revolves around this problem.

  • Specific cases

    Lossless case: the special case of zero distortion (the source coding theorem, etc.)

    Low rate / high compression: if the maximum distortion has been fixed, what is the minimum rate?

    Low distortion: what is the minimum distortion that can be achieved for a certain channel?

  • Rate distortion plane

    [Figure: the rate distortion plane, with distortion D on one axis and rate R on the other; one may move toward minimum rate, toward minimum distortion, or toward the lossless point.]

  • Terminology

    There is a source alphabet and a reproduction alphabet. xⁿ is the source sequence, with n symbols x₁, x₂, ..., xₙ; x̂ⁿ is the reproduced sequence, also with n symbols.

    xⁿ → Encoder → Decoder → x̂ⁿ

  • Distortion measures — symbols

    How far apart are the data and their representations? Distortion measures answer this. They are functions d(x, x̂) on pairs of symbols x, x̂. The functions take nonnegative values and are zero on the "diagonal": d(x, x) = 0. They measure the cost of replacing x with x̂.

  • Average distortion

    The average distortion is

    $$E_{p(x,\hat{x})}\, d(x, \hat{x})$$

    Note that

    $$\sum_{x,\hat{x}} p(x,\hat{x})\, d(x,\hat{x}) = \sum_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d(x,\hat{x})$$

    There are three terms: d(x, x̂), determined by the per-symbol distance; p(x), determined by the source; and p(x̂|x), determined by the coding procedure.

    Corollary: vary p(x̂|x) to find interesting codes.

  • Distortion measures — example

    The absolute value distortion measure is

    $$d(x, \hat{x}) = |x - \hat{x}|$$

    The average absolute value distortion is $E_{p(x,\hat{x})} |x - \hat{x}|$.

  • Distortion measures — example

    The squared error distortion measure is

    $$d(x, \hat{x}) = (x - \hat{x})^2$$

    The average squared distortion is $E_{p(x,\hat{x})} (x - \hat{x})^2$.

  • Distortion measures — example

    The Hamming distance is

    $$d(x, \hat{x}) = \begin{cases} 0, & x = \hat{x} \\ 1, & x \ne \hat{x} \end{cases}$$

    The average Hamming distortion is $E_{p(x,\hat{x})} d(x, \hat{x})$, which is easy to simplify: it equals $P(X \ne \hat{X})$.
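    The three per-symbol measures just listed are one-liners; a small sketch on made-up binary data (note that on binary symbols all three coincide):

    ```python
    import numpy as np

    x    = np.array([0, 1, 1, 0, 1])      # hypothetical source symbols
    xhat = np.array([0, 1, 0, 0, 0])      # hypothetical reproductions

    print(np.abs(x - xhat).mean())        # average absolute distortion
    print(((x - xhat) ** 2).mean())       # average squared distortion
    print((x != xhat).mean())             # average Hamming distortion
    ```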

  • Distortion measures — sequences

    One way of measuring the distortion between the sequences is

    $$d(x^n, \hat{x}^n) = \frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{x}_i)$$

    This is simply the distortion per symbol. Another possibility is the ℓ∞ (max) norm of the d(xᵢ, x̂ᵢ). Other norms are possible, but seldom used.

  • Rate distortion code

    A (2^{nR}, n)-rate distortion code consists of:

    An encoding function $f_n : X^n \to \{1, 2, \ldots, 2^{nR}\}$
    A decoding function $g_n : \{1, 2, \ldots, 2^{nR}\} \to \hat{X}^n$

    xⁿ → Encoder → f_n(xⁿ) → Decoder → x̂ⁿ

  • Why sequences?

    We are representing n symbols — i.i.d. random variables — with nR bits. The encoder replaces the sequence with an index; the decoder does the opposite. There are, of course, 2^{nR} possibilities. Surprisingly, it is better to represent entire sequences at one go than to treat each symbol separately — even though the symbols are chosen i.i.d.

  • Code distortion

    The code distortion is

    $$D = E[d(X^n, \hat{X}^n)] = E[d(X^n, g_n(f_n(X^n)))] = \sum_{x^n} p(x^n)\, d(x^n, g_n(f_n(x^n)))$$

    The average is over the probability distribution on the sequences.

  • The rate distortion region

    A point (R, D) in the rate distortion plane is achievable if there exists a sequence of rate distortion codes (f_n, g_n) such that

    $$\lim_{n\to\infty} E[d(X^n, g_n(f_n(X^n)))] \le D$$

    The rate distortion region is the closure of the set of achievable (R, D). Its boundary is important (why?). There are two (equivalent) ways of looking at it.

  • Rate distortion function

    Roughly speaking: given D, search for the smallest achievable rate. This defines a function of D (it yields a rate for every given D): the rate distortion function. More precisely, the rate distortion function R(D) is the infimum of the rates R such that (R, D) is in the rate distortion region C for the given distortion D:

    $$R(D) = \inf_{(R,D)\in C} R$$

  • Distortion rate function

    "The rate distortion function, with a twist." Given R, search for the smallest achievable distortion. This defines a function of R (it yields a distortion for every given R). More precisely, the distortion rate function D(R) is the infimum of the distortions D such that (R, D) is in the rate distortion region C for the given rate R:

    $$D(R) = \inf_{(R,D)\in C} D$$

  • Information rate distortion

    Consider a source X with a distortion measure d(x, x̂). The information rate distortion function R_I(D) for X is

    $$R_I(D) = \min I(X, \hat{X})$$

    The minimisation is over all conditional distributions p(x̂|x) such that E[d(X, X̂)] ≤ D. Recall that p(x, x̂) = p(x) p(x̂|x). We thus want the minimum over all p(x̂|x) such that

    $$\sum_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d(x,\hat{x}) \le D$$

  • Information rate distortion

    R_I(D) is obviously nonnegative. It is also nonincreasing, and it is convex.

    [Figure: R_I(D), a nonincreasing convex curve in the (D, R) plane.]


  • Rate distortion theorem

    If the rate R is above R(D), there exists a sequence of codes X̂ⁿ(Xⁿ) with at most 2^{nR} codewords whose average distortion approaches D:

    $$E[d(X^n, \hat{X}^n)] \to D$$

    If the rate R is below R(D), no such codes exist.

  • Rate distortion theorem

    The rate distortion and the information rate distortion functions are equal:

    $$R(D) = R_I(D)$$

    More explicitly,

    $$R(D) = \min_{p(\hat{x}|x)\,:\, E[d(X,\hat{X})] \le D} I(X, \hat{X})$$

    Approach: show that R(D) ≥ R_I(D) and R(D) ≤ R_I(D).

  • Example: binary source

    Consider a binary source with

    $$P(X = 1) = p, \qquad P(X = 0) = q$$

    and the Hamming distance as the distortion measure. We need to find

    $$R(D) = \min I(X, \hat{X})$$

    where the minimum is computed with respect to all p(x̂|x) that satisfy the distortion constraint:

    $$\sum_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d(x,\hat{x}) \le D$$

  • R(D) for the binary source

    The rate distortion function for the Hamming distance is

    $$R(D) = \begin{cases} H(p) - H(D), & 0 \le D \le \min(p, q) \\ 0, & \text{otherwise} \end{cases}$$

    [Figure: R(D) curves for p = 0.1, 0.2, 0.5.]
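    A direct evaluation of this formula (a sketch; the clipping in Hb only avoids log 0 at the endpoints):

    ```python
    import numpy as np

    def Hb(a):
        """Binary entropy in bits, with Hb(0) = Hb(1) = 0."""
        a = np.clip(a, 1e-12, 1 - 1e-12)
        return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

    def R_binary(D, p):
        return np.where(D <= min(p, 1 - p), Hb(p) - Hb(D), 0.0)

    D = np.linspace(0.0, 0.5, 6)
    print(R_binary(D, p=0.2))   # falls from Hb(0.2) ≈ 0.722 bits to 0
    ```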

  • Proof

    The idea is to lower bound I(X, X̂), then show that the bound is achievable. We know that

    $$I(X, \hat{X}) = H(X) - H(X|\hat{X})$$

    The first term H(X) is just the entropy

    $$H_b(p) = -p \log p - q \log q$$

    How to handle the second term H(X|X̂)? Assume without loss of generality that p ≤ 1/2. If D > p, rate zero is sufficient (why?). Therefore assume D < p ≤ 1/2.

  • Proof (continued)

    Let ⊕ denote addition modulo 2 (that is, exclusive-or). Note that (why?)

    $$H(X|\hat{X}) = H(X \oplus \hat{X}\,|\,\hat{X})$$

    Conditioning cannot increase entropy:

    $$H(X \oplus \hat{X}\,|\,\hat{X}) \le H(X \oplus \hat{X})$$

    H(X ⊕ X̂) is H_b(a), where a is the probability that X ⊕ X̂ = 1. But X ⊕ X̂ = 1 if and only if X ≠ X̂, and the probability of X ≠ X̂ does not exceed D, due to the distortion constraint. Since H_b is increasing on [0, 1/2] and a ≤ D < 1/2, it follows that

    $$H(X \oplus \hat{X}) \le H(D)$$

  • Proof (continued)

    Putting it all together, if D < p ≤ 1/2,

    $$I(X, \hat{X}) = H(X) - H(X|\hat{X}) \ge H(p) - H(D)$$

    The bound is achieved for

    $$P(X = 0\,|\,\hat{X} = 1) = P(X = 1\,|\,\hat{X} = 0) = D$$
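    This achievability claim can be verified numerically: build the backward test channel, choose P(X̂ = 1) so that the X-marginal matches the source, and compute the mutual information. A sketch (p and D are illustrative values):

    ```python
    import numpy as np

    def Hb(a):
        return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

    p, D = 0.3, 0.1
    r = (p - D) / (1 - 2 * D)       # P(Xhat = 1) that matches P(X = 1) = p

    # joint p(xhat, x) built from p(xhat) and the backward channel p(x|xhat)
    joint = np.array([[(1 - r) * (1 - D), (1 - r) * D],
                      [r * D,             r * (1 - D)]])
    px, pxh = joint.sum(axis=0), joint.sum(axis=1)
    I = sum(joint[i, j] * np.log2(joint[i, j] / (pxh[i] * px[j]))
            for i in range(2) for j in range(2))
    print(px)                  # (0.7, 0.3): the source marginal is matched
    print(I, Hb(p) - Hb(D))    # both ≈ 0.412 bits
    ```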

  • Example: Gaussian source

    Source: Gaussian, zero mean, variance σ². Quadratic distortion:

    $$d(x, y) = (x - y)^2$$

    We need to find

    $$R(D) = \min I(X, \hat{X})$$

    where the minimum is computed subject to the constraint

    $$E[(X - \hat{X})^2] \le D$$

  • Solution for D ≥ σ2

    Claim: for D ≥ σ² the minimum rate is zero. How to get rate zero: set X̂ = 0. Then no bits need to be transmitted and I(X, X̂) = 0 (why?). This leads to a distortion of E[(X − X̂)²] = E[X²] = σ². This value does not exceed D and so satisfies the constraint. It remains to consider the case D < σ².

  • Solution for D < σ2

    $$I(X, \hat{X}) = h(X) - h(X|\hat{X}) = h(X) - h(X - \hat{X}\,|\,\hat{X}) \ge h(X) - h(X - \hat{X})$$

    To minimise this, maximise h(X − X̂); thus X − X̂ should be Gaussian. The constraint E[(X − X̂)²] ≤ D means the variance of X − X̂ can be at most D. Result:

    $$I(X, \hat{X}) \ge \frac{1}{2}\log 2\pi e\sigma^2 - \frac{1}{2}\log 2\pi e D = \frac{1}{2}\log \frac{\sigma^2}{D}$$

    The bound is met when X − X̂ is Gaussian with zero mean and variance D, independent of X̂.

  • R(D) for the Gaussian source

    Thus, the rate distortion function for the Gaussian source is

    $$R(D) = \begin{cases} \frac{1}{2}\log\frac{\sigma^2}{D}, & D < \sigma^2 \\ 0, & \text{otherwise} \end{cases}$$

    [Figure: R(D) curves for variance 1, 2, 3.]

  • Distortion rate for the Gaussian source

    Solve for D in

    $$R = \frac{1}{2}\log\frac{\sigma^2}{D}$$

    This leads to

    $$D(R) = \sigma^2\, 2^{-2R}$$

    Increasing R by one bit reduces the distortion to 1/4 of its value. Define the SNR as

    $$\mathrm{SNR} = 10 \log_{10} \frac{\sigma^2}{D}$$

    Then

    $$\mathrm{SNR} = 10 \log_{10} 2^{2R} \approx 6R\ \mathrm{dB}$$

    Hence the SNR varies at the rate of about 6 dB per bit.
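    The distortion rate function and the 6 dB/bit rule in a few lines (a numerical illustration):

    ```python
    import numpy as np

    sigma2 = 1.0

    def D_of_R(R):
        """Distortion rate function D(R) = sigma^2 * 2^(-2R)."""
        return sigma2 * 2.0 ** (-2.0 * R)

    for bits in range(4):
        snr_db = 10 * np.log10(sigma2 / D_of_R(bits))
        print(bits, D_of_R(bits), round(snr_db, 2))
    # 0 1.0 0.0 / 1 0.25 6.02 / 2 0.0625 12.04 / 3 0.015625 18.06
    ```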

  • Example: multiple Gaussian variables

    How should R bits be distributed among n variables N(0, σᵢ²)? In this case

    $$R(D) = \min I(X^n, \hat{X}^n)$$

    The minimum is over all density functions f(x̂ⁿ|xⁿ) such that

    $$E[d(X^n, \hat{X}^n)] \le D$$

    Here, d(·,·) is the sum of the per-symbol distortions:

    $$d(x^n, \hat{x}^n) = \sum_i (x_i - \hat{x}_i)^2$$

    Arguments similar to those previously used lead to

    $$I(X^n, \hat{X}^n) \ge \sum_i \left( \frac{1}{2}\log\frac{\sigma_i^2}{D_i} \right)^{\!+}$$

    where (z)⁺ = max(z, 0).

  • Example: multiple Gaussian variables

    It turns out that this lower bound can be met. We still need to find

    $$R(D) = \min_{\sum_i D_i = D} \sum_i \left( \frac{1}{2}\ln\frac{\sigma_i^2}{D_i} \right)^{\!+}$$

    This is a simple variational problem; the Lagrangian is

    $$L = \sum_i \frac{1}{2}\ln\frac{\sigma_i^2}{D_i} + \lambda \sum_i D_i$$

    This leads to

    $$\frac{\partial L}{\partial D_i} = -\frac{1}{2D_i} + \lambda = 0$$

    Thus Dᵢ = constant, and the optimal bit allocation means "the same distortion for each variable".

  • Example: multiple Gaussian variables

    Based on the previous result it can be shown that the rate distortion function for multiple N(0, σᵢ²) variables is

    $$R(D) = \sum_i \frac{1}{2}\log\frac{\sigma_i^2}{D_i}$$

    where

    $$D_i = \begin{cases} c, & c < \sigma_i^2 \\ \sigma_i^2, & c \ge \sigma_i^2 \end{cases}$$

    and c must be chosen so that

    $$\sum_i D_i = D$$

    is satisfied.

  • Example: multiple Gaussian variables

    Reverse water-filling: fix c, and describe the variables such as X₁, X₂, X₃ whose variance is greater than c. Ignore variables such as X₄ whose variance is smaller than c.

    [Figure: reverse water-filling; the level c cuts across the variances σᵢ², giving distortions D₁ = D₂ = D₃ = c for X₁, X₂, X₃ and D₄ = σ₄² for X₄.]
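    A sketch of the reverse water-filling computation: bisect on the level c so that ∑ min(c, σᵢ²) equals the distortion budget D (the variances are illustrative):

    ```python
    import numpy as np

    def reverse_waterfill(sigma2, D, iters=100):
        """Find the water level c with sum_i min(c, sigma_i^2) = D."""
        lo, hi = 0.0, float(sigma2.max())
        for _ in range(iters):
            c = 0.5 * (lo + hi)
            if np.minimum(c, sigma2).sum() > D:
                hi = c                       # level too high: lower it
            else:
                lo = c
        Di = np.minimum(c, sigma2)
        rates = np.maximum(0.5 * np.log2(sigma2 / Di), 0.0)
        return c, Di, rates

    sigma2 = np.array([4.0, 2.0, 1.0, 0.1])
    print(reverse_waterfill(sigma2, D=1.6))
    # c = 0.5, Di = [0.5 0.5 0.5 0.1], rates = [1.5 1.0 0.5 0.0] bits
    ```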

  • Rate distortion theorem

    The proof of the theorem depends on certain typical sequences. The idea behind typical sequences is that the typical set has high probability; thus, if the rate distortion code works well for typical sequences, it will work well on the average. These ideas enter the proof of the rate distortion theorem. Before going on, it is necessary to define the typical sequences that will be used.

  • Distortion typical sequences

    Given ε > 0, a pair of sequences (xⁿ, x̂ⁿ) is distortion typical if

    $$\left| -\frac{\log p(x^n)}{n} - H(X) \right| < \varepsilon$$

    $$\left| -\frac{\log p(\hat{x}^n)}{n} - H(\hat{X}) \right| < \varepsilon$$

    $$\left| -\frac{\log p(x^n, \hat{x}^n)}{n} - H(X, \hat{X}) \right| < \varepsilon$$

    $$\left| d(x^n, \hat{x}^n) - E[d(X, \hat{X})] \right| < \varepsilon$$

  • Distortion typical sequences

    According to the law of large numbers,

    $$-\frac{\log p(X^n)}{n} \to -E[\log p(X)] = H(X)$$

    as n → ∞. The same argument applies to the other conditions: as n → ∞ they will all be met.

  • Strongly typical sequences

    Given ε > 0, a sequence is strongly typical with respect to p(x) if the average number of occurrences of each alphabet symbol a in the sequence deviates from its probability by less than ε divided by the alphabet size:

    $$\left| \frac{N(a)}{n} - p(a) \right| < \frac{\varepsilon}{|A|}$$

    Also required: symbols of zero probability do not occur, that is, N(a) = 0 if p(a) = 0.

  • Strongly typical pairs

    The definition for pairs of sequences is very similar. How many symbol pairs (a, b) occur in a sequence pair? The definition involves the joint probability p(a, b):

    $$\left| \frac{N(a,b)}{n} - p(a,b) \right| < \frac{\varepsilon}{|A|\,|B|}$$

    for all possible a, b. Again, pairs of zero probability must not occur.

  • Set size

    A typical sequence with respect to a distribution follows that distribution well. Intuitively, if n is large enough, this is "very likely". This can be made rigorous: due to the law of large numbers, the probability of the strongly typical set approaches 1 as n → ∞. Thus, roughly speaking, if a system works well for the typical set, it will work well on average.

  • Rate distortion theorem

    Select a p(x̂|x) and δ > 0. The codebook consists of 2^{nR} sequences X̂ⁿ. To encode, map Xⁿ to an index k; to decode, map the index k to X̂ⁿ(k). Pick k = 1 if there is no k such that (Xⁿ, X̂ⁿ(k)) is in the strongly jointly typical set; otherwise pick, for example, the smallest such k. In this way, only 2^{nR} values of k are needed.
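    The random-codebook construction can be imitated in a toy experiment: draw 2^{nR} codewords and encode by nearest codeword. This is only a sketch (small n, squared error, Gaussian source, codewords drawn from an assumed N(0, σ² − D) reproduction distribution), so the measured distortion sits above the D(R) target, approaching it only as n grows:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n, R, sigma2 = 8, 1.0, 1.0
    D_target = sigma2 * 2.0 ** (-2 * R)              # = 0.25 for R = 1

    # 2^(nR) random codewords and 1000 source blocks of length n
    codebook = rng.normal(0.0, np.sqrt(sigma2 - D_target), (2 ** int(n * R), n))
    x = rng.normal(0.0, np.sqrt(sigma2), (1000, n))

    # encoder: nearest codeword; decoder: the codeword itself
    idx = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)
    D = ((x - codebook[idx]) ** 2).mean()            # per-symbol distortion
    print(D_target, D)                               # D > D_target at small n
    ```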

  • Rate distortion theorem

    It is necessary to compute the average distortion, averaging over the random codebook choice:

    $$D = \sum_{x^n} p(x^n)\, E[d(x^n, \hat{X}^n)]$$

    The sequences xⁿ can be divided into three sets: the non-typical sequences; the typical sequences for which an X̂ⁿ(k) exists that is jointly typical with them; and the typical sequences for which no such X̂ⁿ(k) exists. What is the contribution of each set to the distortion?

  • Rate distortion theorem

    The total probability of the non-typical sequences can be made as small as desired by increasing n. Hence, they contribute a vanishingly small amount to the distortion if d(·,·) is bounded and n is sufficiently large. More precisely, they contribute ε d_max, where d_max is the maximum possible distortion.

  • Rate distortion theorem

    Consider now the case of the typical sequences that have codewords jointly typical with them. This means that the relative frequency of each pair (a, b) in the coded and decoded sequences is "close" to the joint probability. The distortion is a continuous function of the joint probability, thus it will also be "close" to D; if d(·,·) is bounded, the distortion is bounded by D + ε d_max. The total probability of this set is below 1, hence it contributes at most D + ε d_max to the expected distortion.

  • Rate distortion theorem

    Finally, consider the typical sequences without jointly typical codewords. Let the total probability of this set be P_t. If d(·,·) is bounded, the contribution due to this set is at most P_t d_max. It is possible to derive a bound for P_t that shows that it converges to zero as n → ∞.

  • Putting it all together

    The three sets contribute differently to the expected distortion. The non-typical set contributes at most ε d_max. The typical and jointly typical set contributes at most D + ε d_max. Finally, the typical but not jointly typical set contributes a vanishingly small fraction. These terms add up to D + δ, where δ can be made arbitrarily small. This completes the argument.

  • Rate distortion theorem (converse)

    Draw independent samples from a source X with density p(x). The rate R of any (2^{nR}, n) rate distortion code with distortion ≤ D satisfies R ≥ R(D). To show this, assume that E[d(Xⁿ, X̂ⁿ)] ≤ D; that R ≥ R(D) then follows from a chain of inequalities.

  • Rate distortion theorem (converse)

    $$\begin{aligned}
    nR &\ge H(f_n(X^n)) && f_n \text{ assumes at most } 2^{nR} \text{ values} \\
    &\ge H(f_n(X^n)) - H(f_n(X^n)\,|\,X^n) && \text{entropy is nonnegative} \\
    &= I(X^n, f_n(X^n)) \ge I(X^n, \hat{X}^n) && \hat{X}^n \text{ is a function of } f_n(X^n) \\
    &= H(X^n) - H(X^n|\hat{X}^n) && \text{definition} \\
    &= \sum_i H(X_i) - H(X^n|\hat{X}^n) && \text{independence} \\
    &= \sum_i H(X_i) - \sum_i H(X_i\,|\,\hat{X}^n, X_1, \ldots, X_{i-1}) && \text{chain rule} \\
    &\ge \sum_i H(X_i) - \sum_i H(X_i|\hat{X}_i) && \text{conditioning reduces entropy} \\
    &= \sum_i I(X_i, \hat{X}_i) \\
    &\ge \sum_i R\!\left(E[d(X_i, \hat{X}_i)]\right) && \text{rate distortion definition} \\
    &= n\,\frac{1}{n} \sum_i R\!\left(E[d(X_i, \hat{X}_i)]\right) \\
    &\ge n\, R\!\left(\frac{1}{n} \sum_i E[d(X_i, \hat{X}_i)]\right) && \text{convexity of } R(\cdot) \\
    &= n\, R\!\left(E[d(X^n, \hat{X}^n)]\right) && \text{definition of sequence distortion} \\
    &\ge n\, R(D) && E[d(X^n, \hat{X}^n)] \le D,\ R \text{ nonincreasing}
    \end{aligned}$$


  • Alternating projections

    Consider two subspaces A and B of a common parent space. A point in the intersection can be found by alternating projections.

    [Figure: alternating projections between two subspaces, converging to a point in their intersection.]

  • POCS

    The method depends on the possibility of defining a "projection". Subspaces are obviously convex, and everything works for convex sets because projections onto them are well defined.

    [Figure: projections onto two convex sets A and B.]

  • Distance between convex sets

    POCS works to find a point in the intersection of two convex sets A, B. If the sets are disjoint, one can instead find the distance between them:

    $$d_{\min} = \min_{a \in A,\, b \in B} d(a, b)$$

    Project zᵢ on A to find zᵢ₊₁; project zᵢ₊₁ on B to find zᵢ₊₂; repeat to obtain z₁, z₂, z₃, ... The distance between successive points converges to d_min.

  • POCS and rate distortion

    We need to recast rate distortion as a convex optimisation problem. The mutual information I(X, X̂) is a Kullback-Leibler distance D(p||q), hence the rate distortion function is obtained by minimising this "distance". How to proceed?

  • POCS and rate distortion

    The Kullback-Leibler distance from p(x)p(y|x) to p(x)q(y) is minimised when q(y) is the marginal

    $$q_0(y) = \sum_x p(x)\, p(y|x)$$

    Proof: compute the distances from p(x)p(y|x) to p(x)q(y) and from p(x)p(y|x) to p(x)q₀(y), and show that the first is ≥ the second. This is an elementary consequence of the nonnegativity of D(·||·).

  • POCS and rate distortion

    $$\begin{aligned}
    R(D) &= \min_{p(\hat{x}|x)\,:\, \sum p(x)p(\hat{x}|x)d(x,\hat{x}) \le D} I(X, \hat{X}) \\
    &= \min_{p(\hat{x}|x)} \sum_{x,\hat{x}} p(x)\, p(\hat{x}|x) \log \frac{p(\hat{x}|x)}{p(\hat{x})} \\
    &= \min_{p(\hat{x}|x)} D\big(p(x)p(\hat{x}|x) \,\|\, p(x)p(\hat{x})\big) \\
    &= \min_{q(\hat{x})} \min_{p(\hat{x}|x)} D\big(p(x)p(\hat{x}|x) \,\|\, p(x)q(\hat{x})\big) \\
    &= \min_{a \in A} \min_{b \in B} D(a\|b)
    \end{aligned}$$

    (all minima over p(x̂|x) are subject to the distortion constraint)

  • Blahut-Arimoto

    Alternating minimisation, as in POCS. Start with q(x̂) and find the p(x̂|x) that minimises I(X, X̂), subject to the distortion constraint. For this p(x̂|x), find the q(x̂) that minimises the mutual information, which is simply the marginal

    $$q_0(\hat{x}) = \sum_x p(x)\, p(\hat{x}|x)$$
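    A compact sketch of this alternating minimisation in its usual Lagrangian form, where a slope parameter β > 0 selects one point of the R(D) curve (β, the initialisation, and the binary test case are illustrative choices, not from the slides):

    ```python
    import numpy as np

    def blahut_arimoto(p, d, beta, iters=500):
        """One (R, D) point for slope parameter beta.

        p: source distribution; d[x, xhat]: distortion matrix.
        Alternates the two minimisations described above."""
        q = np.full(d.shape[1], 1.0 / d.shape[1])    # initial output marginal
        for _ in range(iters):
            w = q * np.exp(-beta * d)                # minimise over p(xhat|x)
            cond = w / w.sum(axis=1, keepdims=True)
            q = p @ cond                             # minimise over q(xhat)
        D = np.sum(p[:, None] * cond * d)
        R = np.sum(p[:, None] * cond * np.log2(cond / q))
        return R, D

    # Binary source with Hamming distortion: points should lie on H(p) - H(D)
    p = np.array([0.8, 0.2])
    d = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    for beta in (1.0, 2.0, 4.0):
        print(blahut_arimoto(p, d, beta))            # (rate, distortion) pairs
    ```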
