
  • Rate distortion theory: An introduction

    Paulo J S G Ferreira

    SPL — Signal Processing Lab / IEETA, Univ. Aveiro, Portugal

    April 23, 2010


  • Contents

    1 Preliminaries

    2 Weak and strong typicality

    3 Rate distortion function

    4 Rate distortion theorem

    5 Computation algorithms



  • Entropy

    The entropy of a discrete random variable X with probability mass function p(x) is

    $$H(X) = -\sum_x p(x) \log p(x)$$

    Base two logarithms are tacitly used throughout the lecture. The term −log p(x) is associated with the uncertainty of X; the entropy is the average uncertainty:

    $$H(X) = E_{p(x)} \log \frac{1}{p(x)}$$

    Exercise: show that 0 ≤ H(X) ≤ log n, where n is the alphabet size, and identify the distributions for which the bounds are attained.
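    As a quick numerical companion (an illustration added in editing, not part of the original slides), a minimal Python sketch that evaluates H(X) and checks the two bounds of the exercise:

    ```python
    import numpy as np

    def entropy(p):
        """Shannon entropy of a probability vector p, in bits."""
        p = np.asarray(p, dtype=float)
        nz = p > 0                          # convention: 0 log 0 = 0
        return -np.sum(p[nz] * np.log2(p[nz]))

    # The uniform distribution over n = 4 symbols attains the upper bound
    # log2(4) = 2 bits; a degenerate distribution attains the lower bound 0.
    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
    print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0
    ```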

  • Relative entropy

    The relative entropy measures the "distance" between two probability functions:

    $$D(p\|q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_{p(x)} \log \frac{p(x)}{q(x)}$$

    It is nonnegative, and zero if and only if p = q. It is not symmetric and it does not satisfy the triangle inequality; hence, it is not a metric. It is sometimes called the Kullback-Leibler distance.

    Exercise: show that D(p||q) ≥ 0.
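    A minimal sketch of D(p||q) (illustrative, assuming q > 0 wherever p > 0), which also exhibits the asymmetry mentioned above:

    ```python
    import numpy as np

    def kl_divergence(p, q):
        """Relative entropy D(p||q) in bits; assumes q > 0 wherever p > 0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        nz = p > 0
        return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # ≈ 0.737
    print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ≈ 0.531, so not symmetric
    ```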

  • Conditional entropy

    H(X|Y) is the expected value of the entropies of the conditional distributions p(x|y), averaged over the conditioning variable Y:

    $$\begin{aligned}
    H(X|Y) &= \sum_y p(y)\, H(X|Y=y) \\
    &= -\sum_y p(y) \sum_x p(x|y) \log p(x|y) \\
    &= -\sum_{x,y} p(y)\, p(x|y) \log p(x|y) \\
    &= -\sum_{x,y} p(x,y) \log p(x|y) \\
    &= -E_{p(x,y)} \log p(x|y)
    \end{aligned}$$

  • Mutual information

    It is the Kullback-Leibler distance between the joint distribution p(x,y) and the product p(x)p(y):

    $$I(X,Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = E_{p(x,y)} \log \frac{p(x,y)}{p(x)p(y)}$$

    Meaning: how far X and Y are from being independent; the information that X contains about Y.

    Exercise: show that I(X,Y) ≥ 0.

  • Mutual information and entropy

    $$\begin{aligned}
    I(X,Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\
    &= \sum_{x,y} p(x,y) \log \frac{p(x|y)\,p(y)}{p(x)\,p(y)} \\
    &= \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)} \\
    &= -\sum_{x,y} p(x,y) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y) \\
    &= -\sum_x p(x) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y) \\
    &= H(X) - H(X|Y)
    \end{aligned}$$

    Meaning: the mutual information measures the reduction in the uncertainty of X due to knowing Y.
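    The identity I(X,Y) = H(X) − H(X|Y) is easy to confirm numerically. A small sketch on a hypothetical 2×2 joint distribution (the numbers are only for illustration):

    ```python
    import numpy as np

    pxy = np.array([[0.4, 0.1],          # hypothetical joint p(x, y)
                    [0.1, 0.4]])
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    HX_given_Y = H(pxy.ravel()) - H(py)  # chain rule: H(X|Y) = H(X,Y) - H(Y)
    I = sum(pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
            for i in range(2) for j in range(2))
    print(I, H(px) - HX_given_Y)         # both ≈ 0.278 bits
    ```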

  • Conditional entropy and entropy

    We have seen that

    $$I(X,Y) = H(X) - H(X|Y)$$

    We have also seen that I(X,Y) is nonnegative. It follows that

    $$H(X) \ge H(X|Y)$$

    Meaning: on the average, adding more information does not increase the uncertainty.

  • Chain rules

    The simplest is

    $$H(X,Y) = H(X) + H(Y|X)$$

    Compare with p(x,y) = p(y|x)p(x). By repeated application,

    $$H(X,Y,Z) = H(X,Y) + H(Z|X,Y) = H(X) + H(Y|X) + H(Z|X,Y)$$

    The general case should be apparent:

    $$H(X_1, X_2, \ldots, X_n) = \sum_i H(X_i \,|\, X_1, \ldots, X_{i-1})$$

  • Differential entropy

    The continuous case is mathematically much more subtle. The differential entropy of X with probability density f(x) is

    $$h(X) = -\int_{-\infty}^{+\infty} f(x) \log f(x)\, dx$$

    Log: base e; unit: nats. Example: the differential entropy of a random variable uniform on [a, b] is

    $$h(X) = -\int_a^b \frac{1}{b-a} \log \frac{1}{b-a}\, dx = \log(b-a)$$

    It can be negative (whenever b − a < 1)!

  • Gaussian variable

    The differential entropy of a Gaussian variable with density

    $$g(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

    is

    $$\begin{aligned}
    h(X) &= -\int_{-\infty}^{+\infty} g(x) \log g(x)\, dx \\
    &= -\log \frac{1}{\sqrt{2\pi\sigma^2}} + \log e \int_{-\infty}^{+\infty} \frac{(x-\mu)^2}{2\sigma^2}\, g(x)\, dx \\
    &= \frac{1}{2}\log 2\pi\sigma^2 + \frac{1}{2}\log e = \frac{1}{2}\log 2\pi e \sigma^2
    \end{aligned}$$
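    A Monte Carlo check of this formula (added for illustration; everything here is in nats, consistent with the base-e convention for differential entropy):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 2.0
    x = rng.normal(mu, sigma, 1_000_000)

    # -E[ln g(X)] estimated by sampling, versus 0.5 * ln(2*pi*e*sigma^2)
    log_g = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    print(-log_g.mean())                              # ≈ 2.112 nats
    print(0.5 * np.log(2 * np.pi * np.e * sigma**2))  # = 2.112... nats
    ```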

  • Gaussian variable

    The differential entropy of a Gaussian variable is maximum among all densities with the same variance. The proof depends on

    $$\int_{-\infty}^{+\infty} f(x) \log \frac{f(x)}{g(x)}\, dx \ge 0$$

    Recall the relative entropy / Kullback-Leibler distance to see why. Expanding the integral gives h(f) ≤ −∫ f log g; since log g is quadratic in x, the right-hand side depends on f only through its mean and variance, and therefore equals h(g) when f and g share them.


  • Main ideas

    Typicality: describing "typical" sequences. These are forms of the asymptotic equipartition property. They are to information theory what the law of large numbers is to probability; in fact, they depend on the law of large numbers. Weak typicality requires that the "empirical entropy" approach the "true entropy". Strong typicality requires that the relative frequency of each outcome approach its probability.

  • Weak typicality

    Let Xⁿ be an i.i.d. sequence X₁, X₂, ..., Xₙ. It follows that

    $$p(X^n) = p(X_1)\, p(X_2) \cdots p(X_n)$$

    The weakly typical set is formed by the sequences such that

    $$\left| -\frac{1}{n} \log p(x^n) - H(X) \right| \le \varepsilon$$

    The term on the left is the empirical entropy. The probability of the typical set approaches 1 as n → ∞. This does not mean that "most sequences are typical", but rather that the non-typical sequences have small probability. The most likely sequence may not be weakly typical.
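    The law-of-large-numbers behaviour behind weak typicality is easy to see experimentally. A sketch (the source distribution is a hypothetical example):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    p = np.array([0.7, 0.2, 0.1])                 # hypothetical source p(x)
    H = -np.sum(p * np.log2(p))                   # true entropy ≈ 1.157 bits

    n = 10_000
    xn = rng.choice(len(p), size=n, p=p)          # i.i.d. sequence x^n
    empirical = -np.log2(p[xn]).sum() / n         # -(1/n) log2 p(x^n)
    print(H, empirical)                           # close for large n
    ```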

  • Strong typicality

    Given ε > 0, a sequence is strongly typical with respect to p(x) if the average number of occurrences of each symbol a in the sequence deviates from its probability by less than ε:

    $$\left| \frac{N(a)}{n} - p(a) \right| < \varepsilon$$

    Also required: symbols of zero probability do not occur, that is, N(a) = 0 if p(a) = 0.

  • Strong typicality

    It is stronger than weak typicality, and it allows better probability bounds. The probability of a non-typical sequence not only goes to zero, it satisfies an exponential bound:

    $$P[X^n \notin T(\varepsilon)] \le 2^{-n\varphi(\varepsilon)}$$

    where φ is a positive function. The proof depends on a Chernoff-type bound: for the unit step u and b ≥ 0, the inequality $u(x-a) \le 2^{b(x-a)}$ yields

    $$E[u(X-a)] = P(X \ge a) \le E[2^{b(X-a)}] = 2^{-ba}\, E[2^{bX}]$$


  • Rate distortion theory

    Compression can be lossless or lossy. Lossy compression implies distortion. Rate distortion theory describes the trade-off between the lossy compression rate and the corresponding distortion.

  • Representation of continuous variables

    Shannon wrote in Part V of A Mathematical Theory of Communication:

    ...a continuously variable quantity can assume an infinite number of values and requires, therefore, an infinite number of bits for exact specification.

    This means that...

    ...to transmit the output of a continuous source with exact recovery at the receiving point requires, in general, a channel of infinite capacity (in bits per second). Since, ordinarily, channels have a certain amount of noise, and therefore a finite capacity, exact transmission is impossible.

  • Is everything lost?

    Still quoting Shannon:

    Practically, we are not interested in exact transmission when we have a continuous source, but only in transmission to within a certain tolerance.

    This leads to the real issue:

    The question is, can we assign a definite rate to a continuous source when we require only a certain fidelity of recovery, measured in a suitable way.

  • Yet another angle

    Consider a source with entropy rate H. The source coding theorem says that there are good source codes of rate R if R > H. What if the available rate is below H? By the converse of the source coding theorem, the error probability tends to 1 as n increases, which is bad news. What is needed is a rate distortion code that reproduces the sequence within a certain allowed distortion.

  • The problem

    Find a way of measuring distortion, then determine the minimum rate necessary for a given distortion. As the distortion requirement is made more stringent, the necessary rate increases. It turns out that it is possible to define a rate such that:

    Proper encoding makes transmission possible over a channel with capacity equal to that rate, at that distortion. A channel of smaller capacity is insufficient.

    This lecture revolves around this problem.

  • Specific cases

    Lossless case: the special case of zero distortion (the source coding theorem, etc.)

    Low rate / high compression: if the maximum distortion has been fixed, what is the minimum rate?

    Low distortion: what is the minimum distortion that can be achieved for a certain channel?

  • Rate distortion plane

    [Figure: the rate distortion plane, with distortion D on one axis and rate R on the other; one may move toward minimum rate, toward minimum distortion, or toward the lossless point.]

  • Terminology

    There is a source alphabet and a reproduction alphabet. xⁿ is the source sequence, with n symbols x₁, x₂, ..., xₙ; x̂ⁿ is the reproduced sequence, also with n symbols.

    xⁿ → Encoder → Decoder → x̂ⁿ

  • Distortion measures — symbols

    How far apart are the data and their representations? Distortion measures answer this. They are functions d(x, x̂) on pairs of symbols x, x̂. The functions take nonnegative values and are zero on the "diagonal": d(x, x) = 0. They measure the cost of replacing x with x̂.

  • Average distortion

    The average distortion is

    $$E_{p(x,\hat{x})}\, d(x, \hat{x})$$

    Note that

    $$\sum_{x,\hat{x}} p(x,\hat{x})\, d(x,\hat{x}) = \sum_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d(x,\hat{x})$$

    There are three terms: d(x, x̂), determined by the per-symbol distance; p(x), determined by the source; and p(x̂|x), determined by the coding procedure.

    Corollary: vary p(x̂|x) to find interesting codes.

  • Distortion measures — example

    The absolute value distortion measure is

    $$d(x, \hat{x}) = |x - \hat{x}|$$

    The average absolute value distortion is $E_{p(x,\hat{x})} |x - \hat{x}|$.

  • Distortion measures — example

    The squared error distortion measure is

    $$d(x, \hat{x}) = (x - \hat{x})^2$$

    The average squared distortion is $E_{p(x,\hat{x})} (x - \hat{x})^2$.

  • Distortion measures — example

    The Hamming distance is

    $$d(x, \hat{x}) = \begin{cases} 0, & x = \hat{x} \\ 1, & x \ne \hat{x} \end{cases}$$

    The average Hamming distortion is $E_{p(x,\hat{x})} d(x, \hat{x})$, which is easy to simplify: it equals $P(X \ne \hat{X})$.
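    The three per-symbol measures just listed are one-liners; a small sketch on made-up binary data (note that on binary symbols all three coincide):

    ```python
    import numpy as np

    x    = np.array([0, 1, 1, 0, 1])      # hypothetical source symbols
    xhat = np.array([0, 1, 0, 0, 0])      # hypothetical reproductions

    print(np.abs(x - xhat).mean())        # average absolute distortion
    print(((x - xhat) ** 2).mean())       # average squared distortion
    print((x != xhat).mean())             # average Hamming distortion
    ```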

  • Distortion measures — sequences

    One way of measuring the distortion between the sequences is

    $$d(x^n, \hat{x}^n) = \frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{x}_i)$$

    This is simply the distortion per symbol. Another possibility is the ℓ∞ (max) norm of the d(xᵢ, x̂ᵢ). Other norms are possible, but seldom used.

  • Rate distortion code

    A (2^{nR}, n)-rate distortion code consists of:

    An encoding function $f_n : X^n \to \{1, 2, \ldots, 2^{nR}\}$
    A decoding function $g_n : \{1, 2, \ldots, 2^{nR}\} \to \hat{X}^n$

    xⁿ → Encoder → f_n(xⁿ) → Decoder → x̂ⁿ

  • Why sequences?

    We are representing n symbols — i.i.d. random variables — with nR bits. The encoder replaces the sequence with an index; the decoder does the opposite. There are, of course, 2^{nR} possibilities. Surprisingly, it is better to represent entire sequences at one go than to treat each symbol separately — even though the symbols are chosen i.i.d.

  • Code distortion

    The code distortion is

    $$D = E[d(X^n, \hat{X}^n)] = E[d(X^n, g_n(f_n(X^n)))] = \sum_{x^n} p(x^n)\, d(x^n, g_n(f_n(x^n)))$$

    The average is over the probability distribution on the sequences.

  • The rate distortion region

    A point (R, D) in the rate distortion plane is achievable if there exists a sequence of rate distortion codes (f_n, g_n) such that

    $$\lim_{n\to\infty} E[d(X^n, g_n(f_n(X^n)))] \le D$$

    The rate distortion region is the closure of the set of achievable (R, D). Its boundary is important (why?). There are two (equivalent) ways of looking at it.

  • Rate distortion function

    Roughly speaking: given D, search for the smallest achievable rate. This defines a function of D (it yields a rate for every given D): the rate distortion function. More precisely, the rate distortion function R(D) is the infimum of the rates R such that (R, D) is in the rate distortion region C for the given distortion D:

    $$R(D) = \inf_{(R,D)\in C} R$$

  • Distortion rate function

    "The rate distortion function, with a twist." Given R, search for the smallest achievable distortion. This defines a function of R (it yields a distortion for every given R). More precisely, the distortion rate function D(R) is the infimum of the distortions D such that (R, D) is in the rate distortion region C for the given rate R:

    $$D(R) = \inf_{(R,D)\in C} D$$

  • Information rate distortion

    Consider a source X with a distortion measure d(x, x̂). The information rate distortion function R_I(D) for X is

    $$R_I(D) = \min I(X, \hat{X})$$

    The minimisation is over all conditional distributions p(x̂|x) such that E[d(X, X̂)] ≤ D. Recall that p(x, x̂) = p(x) p(x̂|x). We thus want the minimum over all p(x̂|x) such that

    $$\sum_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d(x,\hat{x}) \le D$$

  • Information rate distortion

    R_I(D) is obviously nonnegative. It is also nonincreasing, and it is convex.

    [Figure: R_I(D), a nonincreasing convex curve in the (D, R) plane.]


  • Rate distortion theorem

    If the rate R is above R(D), there exists a sequence of codes X̂ⁿ(Xⁿ) with at most 2^{nR} codewords whose average distortion approaches D:

    $$E[d(X^n, \hat{X}^n)] \to D$$

    If the rate R is below R(D), no such codes exist.

  • Rate distortion theorem

    The rate distortion and the information rate distortion functions are equal:

    $$R(D) = R_I(D)$$

    More explicitly,

    $$R(D) = \min_{p(\hat{x}|x)\,:\, E[d(X,\hat{X})] \le D} I(X, \hat{X})$$

    Approach: show that R(D) ≥ R_I(D) and R(D) ≤ R_I(D).

  • Example: binary source

    Consider a binary source with

    $$P(X = 1) = p, \qquad P(X = 0) = q$$

    and the Hamming distance as the distortion measure. We need to find

    $$R(D) = \min I(X, \hat{X})$$

    where the minimum is computed with respect to all p(x̂|x) that satisfy the distortion constraint:

    $$\sum_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d(x,\hat{x}) \le D$$

  • R(D) for the binary source

    The rate distortion function for the Hamming distance is

    $$R(D) = \begin{cases} H(p) - H(D), & 0 \le D \le \min(p, q) \\ 0, & \text{otherwise} \end{cases}$$

    [Figure: R(D) curves for p = 0.1, 0.2, 0.5.]
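    A direct evaluation of this formula (a sketch; the clipping in Hb only avoids log 0 at the endpoints):

    ```python
    import numpy as np

    def Hb(a):
        """Binary entropy in bits, with Hb(0) = Hb(1) = 0."""
        a = np.clip(a, 1e-12, 1 - 1e-12)
        return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

    def R_binary(D, p):
        return np.where(D <= min(p, 1 - p), Hb(p) - Hb(D), 0.0)

    D = np.linspace(0.0, 0.5, 6)
    print(R_binary(D, p=0.2))   # falls from Hb(0.2) ≈ 0.722 bits to 0
    ```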

  • Proof

    The idea is to lower bound I(X, X̂), then show that the bound is achievable. We know that

    $$I(X, \hat{X}) = H(X) - H(X|\hat{X})$$

    The first term H(X) is just the entropy

    $$H_b(p) = -p \log p - q \log q$$

    How to handle the second term H(X|X̂)? Assume without loss of generality that p ≤ 1/2. If D > p, rate zero is sufficient (why?). Therefore assume D < p ≤ 1/2.

  • Proof (continued)

    Let ⊕ denote addition modulo 2 (that is, exclusive-or). Note that (why?)

    $$H(X|\hat{X}) = H(X \oplus \hat{X}\,|\,\hat{X})$$

    Conditioning cannot increase entropy:

    $$H(X \oplus \hat{X}\,|\,\hat{X}) \le H(X \oplus \hat{X})$$

    H(X ⊕ X̂) is H_b(a), where a is the probability that X ⊕ X̂ = 1. But X ⊕ X̂ = 1 if and only if X ≠ X̂, and the probability of X ≠ X̂ does not exceed D, due to the distortion constraint. Since H_b is increasing on [0, 1/2] and a ≤ D < 1/2, it follows that

    $$H(X \oplus \hat{X}) \le H(D)$$

  • Proof (continued)

    Putting it all together, if D < p ≤ 1/2,

    $$I(X, \hat{X}) = H(X) - H(X|\hat{X}) \ge H(p) - H(D)$$

    The bound is achieved for

    $$P(X = 0\,|\,\hat{X} = 1) = P(X = 1\,|\,\hat{X} = 0) = D$$
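    This achievability claim can be verified numerically: build the backward test channel, choose P(X̂ = 1) so that the X-marginal matches the source, and compute the mutual information. A sketch (p and D are illustrative values):

    ```python
    import numpy as np

    def Hb(a):
        return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

    p, D = 0.3, 0.1
    r = (p - D) / (1 - 2 * D)       # P(Xhat = 1) that matches P(X = 1) = p

    # joint p(xhat, x) built from p(xhat) and the backward channel p(x|xhat)
    joint = np.array([[(1 - r) * (1 - D), (1 - r) * D],
                      [r * D,             r * (1 - D)]])
    px, pxh = joint.sum(axis=0), joint.sum(axis=1)
    I = sum(joint[i, j] * np.log2(joint[i, j] / (pxh[i] * px[j]))
            for i in range(2) for j in range(2))
    print(px)                  # (0.7, 0.3): the source marginal is matched
    print(I, Hb(p) - Hb(D))    # both ≈ 0.412 bits
    ```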

  • Example: Gaussian source

    Source: Gaussian, zero mean, variance σ². Quadratic distortion:

    $$d(x, y) = (x - y)^2$$

    We need to find

    $$R(D) = \min I(X, \hat{X})$$

    where the minimum is computed subject to the constraint

    $$E[(X - \hat{X})^2] \le D$$

  • Solution for D ≥ σ2

    Claim: for D ≥ σ² the minimum rate is zero. How to get rate zero: set X̂ = 0. Then no bits need to be transmitted and I(X, X̂) = 0 (why?). This leads to a distortion of E[(X − X̂)²] = E[X²] = σ². This value does not exceed D and so satisfies the constraint. It remains to consider the case D < σ².

  • Solution for D < σ2

    $$I(X, \hat{X}) = h(X) - h(X|\hat{X}) = h(X) - h(X - \hat{X}\,|\,\hat{X}) \ge h(X) - h(X - \hat{X})$$

    To minimise this, maximise h(X − X̂); thus X − X̂ should be Gaussian. The constraint E[(X − X̂)²] ≤ D means the variance of X − X̂ can be at most D. Result:

    $$I(X, \hat{X}) \ge \frac{1}{2}\log 2\pi e\sigma^2 - \frac{1}{2}\log 2\pi e D = \frac{1}{2}\log \frac{\sigma^2}{D}$$

    The bound is met when X − X̂ is Gaussian with zero mean and variance D, independent of X̂.

  • R(D) for the Gaussian source

    Thus, the rate distortion function for the Gaussian source is

    $$R(D) = \begin{cases} \frac{1}{2}\log\frac{\sigma^2}{D}, & D < \sigma^2 \\ 0, & \text{otherwise} \end{cases}$$

    [Figure: R(D) curves for variance 1, 2, 3.]

  • Distortion rate for the Gaussian source

    Solve for D in

    $$R = \frac{1}{2}\log\frac{\sigma^2}{D}$$

    This leads to

    $$D(R) = \sigma^2\, 2^{-2R}$$

    Increasing R by one bit reduces the distortion to 1/4 of its value. Define the SNR as

    $$\mathrm{SNR} = 10 \log_{10} \frac{\sigma^2}{D}$$

    Then

    $$\mathrm{SNR} = 10 \log_{10} 2^{2R} \approx 6R\ \mathrm{dB}$$

    Hence the SNR varies at the rate of about 6 dB per bit.
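    The distortion rate function and the 6 dB/bit rule in a few lines (a numerical illustration):

    ```python
    import numpy as np

    sigma2 = 1.0

    def D_of_R(R):
        """Distortion rate function D(R) = sigma^2 * 2^(-2R)."""
        return sigma2 * 2.0 ** (-2.0 * R)

    for bits in range(4):
        snr_db = 10 * np.log10(sigma2 / D_of_R(bits))
        print(bits, D_of_R(bits), round(snr_db, 2))
    # 0 1.0 0.0 / 1 0.25 6.02 / 2 0.0625 12.04 / 3 0.015625 18.06
    ```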

  • Example: multiple Gaussian variables

    How should R bits be distributed among n variables N(0, σᵢ²)? In this case

    $$R(D) = \min I(X^n, \hat{X}^n)$$

    The minimum is over all density functions f(x̂ⁿ|xⁿ) such that

    $$E[d(X^n, \hat{X}^n)] \le D$$

    Here, d(·,·) is the sum of the per-symbol distortions:

    $$d(x^n, \hat{x}^n) = \sum_i (x_i - \hat{x}_i)^2$$

    Arguments similar to those previously used lead to

    $$I(X^n, \hat{X}^n) \ge \sum_i \left( \frac{1}{2}\log\frac{\sigma_i^2}{D_i} \right)^{\!+}$$

    where (z)⁺ = max(z, 0).

  • Example: multiple Gaussian variables

    It turns out that this lower bound can be met. We still need to find

    $$R(D) = \min_{\sum_i D_i = D} \sum_i \left( \frac{1}{2}\ln\frac{\sigma_i^2}{D_i} \right)^{\!+}$$

    This is a simple variational problem; the Lagrangian is

    $$L = \sum_i \frac{1}{2}\ln\frac{\sigma_i^2}{D_i} + \lambda \sum_i D_i$$

    This leads to

    $$\frac{\partial L}{\partial D_i} = -\frac{1}{2D_i} + \lambda = 0$$

    Thus Dᵢ = constant, and the optimal bit allocation means "the same distortion for each variable".

  • Example: multiple Gaussian variables

    Based on the previous result it can be shown that the rate distortion function for multiple N(0, σᵢ²) variables is

    $$R(D) = \sum_i \frac{1}{2}\log\frac{\sigma_i^2}{D_i}$$

    where

    $$D_i = \begin{cases} c, & c < \sigma_i^2 \\ \sigma_i^2, & c \ge \sigma_i^2 \end{cases}$$

    and c must be chosen so that

    $$\sum_i D_i = D$$

    is satisfied.

  • Example: multiple Gaussian variables

    Reverse water-filling: fix c, and describe the variables such as X₁, X₂, X₃ whose variance is greater than c. Ignore variables such as X₄ whose variance is smaller than c.

    [Figure: reverse water-filling; the level c cuts across the variances σᵢ², giving distortions D₁ = D₂ = D₃ = c for X₁, X₂, X₃ and D₄ = σ₄² for X₄.]
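    A sketch of the reverse water-filling computation: bisect on the level c so that ∑ min(c, σᵢ²) equals the distortion budget D (the variances are illustrative):

    ```python
    import numpy as np

    def reverse_waterfill(sigma2, D, iters=100):
        """Find the water level c with sum_i min(c, sigma_i^2) = D."""
        lo, hi = 0.0, float(sigma2.max())
        for _ in range(iters):
            c = 0.5 * (lo + hi)
            if np.minimum(c, sigma2).sum() > D:
                hi = c                       # level too high: lower it
            else:
                lo = c
        Di = np.minimum(c, sigma2)
        rates = np.maximum(0.5 * np.log2(sigma2 / Di), 0.0)
        return c, Di, rates

    sigma2 = np.array([4.0, 2.0, 1.0, 0.1])
    print(reverse_waterfill(sigma2, D=1.6))
    # c = 0.5, Di = [0.5 0.5 0.5 0.1], rates = [1.5 1.0 0.5 0.0] bits
    ```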

  • Rate distortion theorem

    The proof of the theorem depends on certain typical sequences. The idea behind typical sequences is that the typical set has high probability; thus, if the rate distortion code works well for typical sequences, it will work well on the average. These ideas enter the proof of the rate distortion theorem. Before going on, it is necessary to define the typical sequences that will be used.

  • Distortion typical sequences

    Given ε > 0, a pair of sequences (xⁿ, x̂ⁿ) is distortion typical if

    $$\left| -\frac{\log p(x^n)}{n} - H(X) \right| < \varepsilon$$

    $$\left| -\frac{\log p(\hat{x}^n)}{n} - H(\hat{X}) \right| < \varepsilon$$

    $$\left| -\frac{\log p(x^n, \hat{x}^n)}{n} - H(X, \hat{X}) \right| < \varepsilon$$

    $$\left| d(x^n, \hat{x}^n) - E[d(X, \hat{X})] \right| < \varepsilon$$

  • Distortion typical sequences

    According to the law of large numbers,

    $$-\frac{\log p(X^n)}{n} \to -E[\log p(X)] = H(X)$$

    as n → ∞. The same argument applies to the other conditions: as n → ∞ they will all be met.

  • Strongly typical sequences

    Given ε > 0, a sequence is strongly typical with respect to p(x) if the average number of occurrences of each alphabet symbol a in the sequence deviates from its probability by less than ε divided by the alphabet size:

    $$\left| \frac{N(a)}{n} - p(a) \right| < \frac{\varepsilon}{|A|}$$

    Also required: symbols of zero probability do not occur, that is, N(a) = 0 if p(a) = 0.

  • Strongly typical pairs

    The definition for pairs of sequences is very similar. How many symbol pairs (a, b) occur in a sequence pair? The definition involves the joint probability p(a, b):

    $$\left| \frac{N(a,b)}{n} - p(a,b) \right| < \frac{\varepsilon}{|A|\,|B|}$$

    for all possible a, b. Again, pairs of zero probability must not occur.

  • Set size

    A typical sequence with respect to a distribution follows that distribution well. Intuitively, if n is large enough, this is "very likely". This can be made rigorous: due to the law of large numbers, the probability of the strongly typical set approaches 1 as n → ∞. Thus, roughly speaking, if a system works well for the typical set, it will work well on average.

  • Rate distortion theorem

    Select a p(x̂|x) and δ > 0. The codebook consists of 2^{nR} sequences X̂ⁿ. To encode, map Xⁿ to an index k; to decode, map the index k to X̂ⁿ(k). Pick k = 1 if there is no k such that (Xⁿ, X̂ⁿ(k)) is in the strongly jointly typical set; otherwise pick, for example, the smallest such k. In this way, only 2^{nR} values of k are needed.
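    The random-codebook construction can be imitated in a toy experiment: draw 2^{nR} codewords and encode by nearest codeword. This is only a sketch (small n, squared error, Gaussian source, codewords drawn from an assumed N(0, σ² − D) reproduction distribution), so the measured distortion sits above the D(R) target, approaching it only as n grows:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n, R, sigma2 = 8, 1.0, 1.0
    D_target = sigma2 * 2.0 ** (-2 * R)              # = 0.25 for R = 1

    # 2^(nR) random codewords and 1000 source blocks of length n
    codebook = rng.normal(0.0, np.sqrt(sigma2 - D_target), (2 ** int(n * R), n))
    x = rng.normal(0.0, np.sqrt(sigma2), (1000, n))

    # encoder: nearest codeword; decoder: the codeword itself
    idx = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)
    D = ((x - codebook[idx]) ** 2).mean()            # per-symbol distortion
    print(D_target, D)                               # D > D_target at small n
    ```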

  • Rate distortion theorem

    It is necessary to compute the average distortion, averaging over the random codebook choice:

    $$D = \sum_{x^n} p(x^n)\, E[d(x^n, \hat{X}^n)]$$

    The sequences xⁿ can be divided into three sets: the non-typical sequences; the typical sequences for which an X̂ⁿ(k) exists that is jointly typical with them; and the typical sequences for which no such X̂ⁿ(k) exists. What is the contribution of each set to the distortion?

  • Rate distortion theorem

    The total probability of the non-typical sequences can be made as small as desired by increasing n. Hence, they contribute a vanishingly small amount to the distortion if d(·,·) is bounded and n is sufficiently large. More precisely, they contribute ε d_max, where d_max is the maximum possible distortion.

  • Rate distortion theorem

    Consider now the case of the typical sequences that have codewords jointly typical with them. This means that the relative frequency of each pair (a, b) in the coded and decoded sequences is "close" to the joint probability. The distortion is a continuous function of the joint probability, thus it will also be "close" to D; if d(·,·) is bounded, the distortion is bounded by D + ε d_max. The total probability of this set is below 1, hence it contributes at most D + ε d_max to the expected distortion.

  • Rate distortion theorem

    Finally, consider the typical sequences without jointly typical codewords. Let the total probability of this set be P_t. If d(·,·) is bounded, the contribution due to this set is at most P_t d_max. It is possible to derive a bound for P_t that shows that it converges to zero as n → ∞.

  • Putting it all together

    The three sets contribute differently to the expected distortion. The non-typical set contributes at most ε d_max. The typical and jointly typical set contributes at most D + ε d_max. Finally, the typical but not jointly typical set contributes a vanishingly small fraction. These terms add up to D + δ, where δ can be made arbitrarily small. This completes the argument.

  • Rate distortion theorem (converse)

    Draw independent samples from a source X with density p(x). The rate R of any (2^{nR}, n) rate distortion code with distortion ≤ D satisfies R ≥ R(D). To show this, assume that E[d(Xⁿ, X̂ⁿ)] ≤ D; that R ≥ R(D) then follows from a chain of inequalities.

  • Rate distortion theorem (converse)

    $$\begin{aligned}
    nR &\ge H(f_n(X^n)) && f_n \text{ assumes at most } 2^{nR} \text{ values} \\
    &\ge H(f_n(X^n)) - H(f_n(X^n)\,|\,X^n) && \text{entropy is nonnegative} \\
    &= I(X^n, f_n(X^n)) \ge I(X^n, \hat{X}^n) && \hat{X}^n \text{ is a function of } f_n(X^n) \\
    &= H(X^n) - H(X^n|\hat{X}^n) && \text{definition} \\
    &= \sum_i H(X_i) - H(X^n|\hat{X}^n) && \text{independence} \\
    &= \sum_i H(X_i) - \sum_i H(X_i\,|\,\hat{X}^n, X_1, \ldots, X_{i-1}) && \text{chain rule} \\
    &\ge \sum_i H(X_i) - \sum_i H(X_i|\hat{X}_i) && \text{conditioning reduces entropy} \\
    &= \sum_i I(X_i, \hat{X}_i) \\
    &\ge \sum_i R\!\left(E[d(X_i, \hat{X}_i)]\right) && \text{rate distortion definition} \\
    &= n\,\frac{1}{n} \sum_i R\!\left(E[d(X_i, \hat{X}_i)]\right) \\
    &\ge n\, R\!\left(\frac{1}{n} \sum_i E[d(X_i, \hat{X}_i)]\right) && \text{convexity of } R(\cdot) \\
    &= n\, R\!\left(E[d(X^n, \hat{X}^n)]\right) && \text{definition of sequence distortion} \\
    &\ge n\, R(D) && E[d(X^n, \hat{X}^n)] \le D,\ R \text{ nonincreasing}
    \end{aligned}$$


  • Alternating projections

    Consider two subspaces A and B of a common parent space. A point in the intersection can be found by alternating projections.

    [Figure: alternating projections between two subspaces, converging to a point in their intersection.]

  • POCS

    The method depends on the possibility of defining a "projection". Subspaces are obviously convex, and everything works for convex sets because projections onto them are well defined.

    [Figure: projections onto two convex sets A and B.]

  • Distance between convex sets

    POCS works to find a point in the intersection of two convex sets A, B. If the sets are disjoint, one can instead find the distance between them:

    $$d_{\min} = \min_{a \in A,\, b \in B} d(a, b)$$

    Project zᵢ on A to find zᵢ₊₁; project zᵢ₊₁ on B to find zᵢ₊₂; repeat to obtain z₁, z₂, z₃, ... The distance between successive points converges to d_min.

  • POCS and rate distortion

    We need to recast rate distortion as a convex optimisation problem. The mutual information I(X, X̂) is a Kullback-Leibler distance D(p||q), hence the rate distortion function is obtained by minimising this "distance". How to proceed?

  • POCS and rate distortion

    The Kullback-Leibler distance from p(x)p(y|x) to p(x)q(y) is minimised when q(y) is the marginal

    $$q_0(y) = \sum_x p(x)\, p(y|x)$$

    Proof: compute the distances from p(x)p(y|x) to p(x)q(y) and from p(x)p(y|x) to p(x)q₀(y), and show that the first is ≥ the second. This is an elementary consequence of the nonnegativity of D(·||·).

  • POCS and rate distortion

    $$\begin{aligned}
    R(D) &= \min_{p(\hat{x}|x)\,:\, \sum p(x)p(\hat{x}|x)d(x,\hat{x}) \le D} I(X, \hat{X}) \\
    &= \min_{p(\hat{x}|x)} \sum_{x,\hat{x}} p(x)\, p(\hat{x}|x) \log \frac{p(\hat{x}|x)}{p(\hat{x})} \\
    &= \min_{p(\hat{x}|x)} D\big(p(x)p(\hat{x}|x) \,\|\, p(x)p(\hat{x})\big) \\
    &= \min_{q(\hat{x})} \min_{p(\hat{x}|x)} D\big(p(x)p(\hat{x}|x) \,\|\, p(x)q(\hat{x})\big) \\
    &= \min_{a \in A} \min_{b \in B} D(a\|b)
    \end{aligned}$$

    (all minima over p(x̂|x) are subject to the distortion constraint)

  • Blahut-Arimoto

    Alternating minimisation, as in POCS. Start with q(x̂) and find the p(x̂|x) that minimises I(X, X̂), subject to the distortion constraint. For this p(x̂|x), find the q(x̂) that minimises the mutual information, which is simply the marginal

    $$q_0(\hat{x}) = \sum_x p(x)\, p(\hat{x}|x)$$
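    A compact sketch of this alternating minimisation in its usual Lagrangian form, where a slope parameter β > 0 selects one point of the R(D) curve (β, the initialisation, and the binary test case are illustrative choices, not from the slides):

    ```python
    import numpy as np

    def blahut_arimoto(p, d, beta, iters=500):
        """One (R, D) point for slope parameter beta.

        p: source distribution; d[x, xhat]: distortion matrix.
        Alternates the two minimisations described above."""
        q = np.full(d.shape[1], 1.0 / d.shape[1])    # initial output marginal
        for _ in range(iters):
            w = q * np.exp(-beta * d)                # minimise over p(xhat|x)
            cond = w / w.sum(axis=1, keepdims=True)
            q = p @ cond                             # minimise over q(xhat)
        D = np.sum(p[:, None] * cond * d)
        R = np.sum(p[:, None] * cond * np.log2(cond / q))
        return R, D

    # Binary source with Hamming distortion: points should lie on H(p) - H(D)
    p = np.array([0.8, 0.2])
    d = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    for beta in (1.0, 2.0, 4.0):
        print(blahut_arimoto(p, d, beta))            # (rate, distortion) pairs
    ```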
