On Lossy Compression



Transcript of On Lossy Compression

Page 1: On  Lossy  Compression

On Lossy Compression

Paul Vitanyi CWI, University of Amsterdam, National ICT Australia

Joint work with Kolya Vereshchagin

Page 2: On  Lossy  Compression

You can import music in a variety of formats, such as MP3 or AAC, and at whatever quality level you’d prefer. Lossy Compression

You can even choose the new Apple Lossless encoder. Music encoded with that option offers sound quality indistinguishable from the original CDs at about half the file size of the original. Lossless Compression

Page 3: On  Lossy  Compression

Lossy Compression drives the Web

Pictures: JPEG. Sound: MP3. Video: MPEG.

The majority of Web transfers are lossy-compressed data: HTTP traffic was exceeded by peer-to-peer music and video sharing in 2002.

Page 4: On  Lossy  Compression

Lena Compressed by JPEG

(Images: the original Lena image, 256 x 256 pixels, 24-bit RGB; a JPEG-compressed version at compression ratio 43:1; and a JPEG2000-compressed version at compression ratio 43:1.)

As can be seen from the comparison images above, at compression ratios above 40:1 the JPEG algorithm begins to lose its effectiveness, while the JPEG2000-compressed image shows very little distortion.

Page 5: On  Lossy  Compression

Rate Distortion Theory Underlies Lossy Compression

Claude Elwood Shannon, 1948 & 1959, Defines Rate Distortion

(With learning mouse “Theseus” in the picture)

Page 6: On  Lossy  Compression

Rate Distortion

X is a set of source words; Y is a set of code words.

If |Y| < |X|, then no code is faithful.

Distortion

Page 7: On  Lossy  Compression

Distortion

Choose a distortion measure d: X × Y → Real Numbers

Coding maps a source word x ∈ X to a code word y ∈ Y.

Distortion = d(x,y)

Distortion measures the fidelity of the coded version relative to the source data.

Page 8: On  Lossy  Compression

Example Distortion Measures

List distortion for bit rate R:
Source word x is a finite binary string; code word y is a finite set of source words containing x, and y is described in ≤ R bits.
Distortion d(x,y) = log |y| (rounded up to an integer value).

Hamming distortion for bit rate R:
Source word x and code word y are binary strings of length n; y can be described in ≤ R bits.
Distortion d(x,y) = number of flipped bits between x and y.
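For concreteness, here is a minimal Python sketch of these two distortion measures; the example strings and the example set y are made-up illustrations, not taken from the slides.

```python
from math import ceil, log2

def hamming_distortion(x: str, y: str) -> int:
    """Number of flipped bits between two equal-length bit strings."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def list_distortion(x: str, y: set) -> int:
    """log |y| (rounded up), for a finite set y of source words containing x."""
    assert x in y
    return ceil(log2(len(y)))

# Toy example (illustrative values only):
x = "10110010"
print(hamming_distortion(x, "10010110"))                         # 2 bit flips
print(list_distortion(x, {"10110010", "10110011", "10110000"}))  # ceil(log2 3) = 2
```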

Page 9: On  Lossy  Compression

Example Distortion Measures

Euclidean Distortion for parameter R :

Source word x is a real number in [0,1]; code word y is a rational number that can be described in ≤ R bits.

Distortion d(x,y) = |x - y|
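A small Python sketch of one natural way to realize this coding, rounding x to an R-bit dyadic rational; the rounding rule and the sample value of x are assumptions for illustration.

```python
def euclidean_code(x: float, R: int) -> float:
    """Round x in [0,1] to a rational with an R-bit binary expansion (one natural coding choice)."""
    return round(x * 2**R) / 2**R

x = 0.637
for R in (2, 4, 8):
    y = euclidean_code(x, R)
    print(R, y, abs(x - y))   # distortion d(x,y) = |x - y| shrinks roughly like 2^-(R+1)
```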

Page 10: On  Lossy  Compression

Distortion-rate function

Minimal distortion as a function of a given rate R. Random source: x_1 x_2 ... x_n.

Coding uses a sequence of codes c_1, c_2, ..., c_n from a prescribed code class: y_1 = c_1(x_1), y_2 = c_2(x_2), ..., y_n = c_n(x_n), with code length |y_1 y_2 ... y_n| ≤ nR bits.

Distortion-rate function:

D(R) = lim_{n→∞} min ∑_{x_1 x_2 ... x_n} p(x_1 x_2 ... x_n) (1/n) ∑_{i=1}^{n} d(x_i, y_i),

where the minimum is over all code sequences c_1, c_2, ..., c_n satisfying the rate constraint.

Page 11: On  Lossy  Compression

Rate-distortion function

Minimal rate as a function of the maximal allowed distortion D. Random source: x_1 x_2 ... x_n.

Coding uses a sequence of codes c_1, c_2, ..., c_n from a prescribed code class: y_1 = c_1(x_1), y_2 = c_2(x_2), ..., y_n = c_n(x_n), with expected distortion ∑_{x_1 x_2 ... x_n} p(x_1 x_2 ... x_n) (1/n) ∑_{i=1}^{n} d(x_i, y_i) ≤ D.

Rate-distortion function:

R(D) = lim_{n→∞} min ∑_{x_1 x_2 ... x_n} p(x_1 x_2 ... x_n) (1/n) ∑_{i=1}^{n} |y_i|,

where the minimum is over all code sequences c_1, c_2, ..., c_n satisfying the distortion constraint.

Since D(R) is convex and nonincreasing, R(D) is its inverse.

Page 12: On  Lossy  Compression

Function graphs

Rate-distortion graph, Hamming distortion: R(D) = n(1 - H(D/n)), where |x_i| = n and D = expected number of bit flips.

Rate-distortion graph, List distortion: R(D) = n - D, where |x_i| = n and D = expected log-cardinality of the list (set).

Rate-distortion graph, Euclidean distortion: R(D) = log 1/D, where x_i is a real in [0,1] and D = expected distance between x_i and the rational code word y_i.
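A short Python sketch that evaluates these three textbook curves at sample parameter values; the choices n = 100, D = 10, and D = 0.01 are illustrative only.

```python
from math import log2

def H(p: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

n = 100
# Hamming distortion: R(D) = n*(1 - H(D/n)), D = expected number of bit flips (D <= n/2)
print(n * (1 - H(10 / n)))   # about 53.1 bits for D = 10 flips out of n = 100
# List distortion: R(D) = n - D, D = expected log-cardinality of the list
print(n - 10)                # 90 bits
# Euclidean distortion: R(D) = log(1/D), D = expected |x - y|
print(log2(1 / 0.01))        # about 6.6 bits
```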

Page 13: On  Lossy  Compression

Problems with this approach

These functions give expectations, or at best the rate-distortion relation for a high-probability set of typical sequences.

It is often assumed that the random source is stationary ergodic (to be able to determine the curve).

This is fine for data that satisfy simple statistical properties, but not for complex data that satisfy global relations, like images, music, and video. Such complex data are usually atypical.

Just like lossless compression requires lots of tricks to be able to compress meaningful data, so does lossy compression.

There is a wealth of ad hoc theories and solutions for special application fields and problems.

Can we find a general theory for lossy compression of individual data?

Page 14: On  Lossy  Compression

Andrey Nikolaevich Kolmogorov(1903-1987, Tambov, Russia)

Measure theory, probability, analysis, intuitionistic logic, cohomology, dynamical systems, hydrodynamics, Kolmogorov complexity.

Page 15: On  Lossy  Compression

Background Kolmogorov complexity: Randomness of individual objects. First: A story of Dr. Samuel Johnson

… Dr. Beattie observed, as something remarkable which had happened to him, that he chanced to see both No.1 and No.1000 hackney-coaches. “Why sir,” said Johnson “there is an equal chance for one’s seeing those two numbers as any other two.” Boswell’s Life of Johnson

Page 16: On  Lossy  Compression

Defining Randomness: Precursor Ideas

• Von Mises: A sequence is random if it has about the same number of 1's and 0's, and this holds for its 'reasonably' selected subsequences.

• P. Laplace: A sequence is "extraordinary" (nonrandom) when it contains a rare "regularity".

• But what is "reasonable"?
A. Wald: countably many selection functions.
A. Church: recursive functions.
J. Ville: von Mises-Wald-Church randomness is still not good enough.

Page 17: On  Lossy  Compression

Kolmogorov Complexity

Solomonoff (1960)-Kolmogorov (1965)-Chaitin (1969): The amount of information in a string is the size of the smallest program generating that string.

Invariance Theorem: It does not matter which universal Turing machine U we choose. I.e. all “encoding methods” are ok.

K(x) = min { |p| : U(p) = x }
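K(x) itself is uncomputable, but any real compressor yields a computable upper bound on it. The following Python sketch uses zlib as such a stand-in; the choice of compressor and the test inputs are assumptions for illustration, not part of the slides.

```python
import os
import zlib

def K_upper_bound(x: bytes) -> int:
    """Computable upper bound on K(x): length in bits of a zlib-compressed description.
    K(x) is uncomputable; a real compressor only approximates it from above."""
    return 8 * len(zlib.compress(x, 9))

print(K_upper_bound(b"ab" * 500))        # highly regular input: compresses to far fewer than 8000 bits
print(K_upper_bound(os.urandom(1000)))   # random bytes: stays close to 8000 bits
```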

Page 18: On  Lossy  Compression

Kolmogorov complexity

K(x) = length of the shortest description of x.
K(x|y) = length of the shortest description of x given y.
A string x is random if K(x) ≥ |x|.

K(x) - K(x|y) is the information y knows about x.
Theorem (Mutual Information): K(x) - K(x|y) = K(y) - K(y|x), up to an additive O(log) term.
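A hedged Python sketch of how this quantity can be estimated in practice with a real compressor, using the common heuristic C(x|y) ≈ C(yx) - C(y); the helper names and inputs are illustrative, and compressed lengths only upper-bound the true complexities.

```python
import zlib

def C(s: bytes) -> int:
    """Compressed length in bits; a computable stand-in (upper bound) for K."""
    return 8 * len(zlib.compress(s, 9))

def info_about(x: bytes, y: bytes) -> int:
    """Rough estimate of K(x) - K(x|y), via the heuristic C(x|y) ~ C(y + x) - C(y)."""
    return C(x) - (C(y + x) - C(y))

x = b"the quick brown fox jumps over the lazy dog " * 20
y = b"the quick brown fox jumps over the lazy dog " * 10
print(info_about(x, y))   # y shares structure with x, so the estimate comes out large
```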

Page 19: On  Lossy  Compression

Applications of Kolmogorov complexity

Mathematics: probability theory, logic.
Physics: chaos, thermodynamics.
Computer science.
Biology: complex systems.
Philosophy: randomness.
Information theory: today's topic.

Page 20: On  Lossy  Compression

Individual Rate-Distortion

Given datum x, a class of models Y = {y}, and distortion d(x,y):

Rate-distortion function: r_x(d) = min_y {K(y) : d(x,y) ≤ d}

Distortion-rate function: d_x(r) = min_y {d(x,y) : K(y) ≤ r}

Page 21: On  Lossy  Compression

Individual characteristics: more detail, especially for meaningful (nonrandom) data

• Example, list distortion: data x, y, z of length n, with K(y) = n/2, K(x) = n/3, K(z) = n/9.
• All > (1 - 1/n) 2^n data strings u of complexity n - log n ≤ K(u) ≤ n + O(log n) have individual rate-distortion curves approximately coinciding with Shannon's single curve.
• Therefore, the expected individual rate-distortion curve coincides with Shannon's single curve (up to a small error).
• Those data are typical data that are essentially 'random' (noise) and have no meaning.
• Data with meaning that we may be interested in (music, text, pictures) are extraordinary (rare) and have regularities expressing that meaning, hence small Kolmogorov complexity and rate-distortion curves differing in size and shape from Shannon's.

Page 22: On  Lossy  Compression

Upper bound Rate-Distortion graph

For all data x the rate-distortion function is monotonic non-increasing and:

r_x(d_max) ≤ K(y_0)

r_x(d) ≤ r_x(d') + log [α B(d')/B(d)] + O(small)   [all d ≤ d']

The cardinality of a ball is B_y(d) = |{x : d(x,y) ≤ d}|: the ball of all data x within distortion d of the code word ('center') y. We often don't write the center if it is understood. The set of source words X is a ball of radius d_max with center y_0.

Covering property: for all d ≤ d' such that B(d) > 0, every ball of radius d' in X can be covered by at most α B(d')/B(d) balls of radius d. (Figure: a ball of radius d' covered by balls of radius d ≤ d'.)

For this to be useful, we require that α be polynomial in n, the number of bits in data x. This is satisfied for many distortion measures.

This means the function r_x(d) + log B(d) is monotonic non-decreasing up to fluctuations of size O(log α).

Page 23: On  Lossy  Compression

Lower Bound Rate-Distortion Graph

r_x(d) ≥ K(x) - log B(d) + O(small)

If we have the center of the ball in r_x(d) bits, together with the value d in O(log d) bits, then we can enumerate all B(d) elements of the ball and give the index of x in log B(d) bits.
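The enumeration argument can be spelled out in a few lines of Python for Hamming balls over tiny strings (brute force, purely illustrative): knowing the center, the radius, and x's index in a fixed enumeration of the ball suffices to recover x, and the index costs about log B(d) bits.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    return sum(u != v for u, v in zip(a, b))

def ball(center: str, D: int):
    """All strings within Hamming distance D of center, in a fixed (lexicographic) order."""
    n = len(center)
    return [s for s in (''.join(bits) for bits in product('01', repeat=n))
            if hamming(s, center) <= D]

# Describe x by (center y, radius D, index of x in the enumerated ball):
y, D, x = "0000", 1, "0100"
members = ball(y, D)
idx = members.index(x)
assert members[idx] == x      # x is recovered from y, D and its index
print(len(members), idx)      # B(D) = 5 elements, so the index costs about log2(5) bits
```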

Page 24: On  Lossy  Compression

Rate-distortion functions of every shape

Lemma: Let r(d) + log B(d) be monotonic non-decreasing, with r(d_max) = 0. Then there is a datum x such that

|r(d) - r_x(d)| ≤ O(small)

That is, for every code and distortion, every function between the lower bound and the upper bound is realized by some datum x

(up to a small error, and provided the function decreases at least at the proper slope).

Page 25: On  Lossy  Compression

Hamming Distortion

Lemma: For n-bit strings, α = O(n^4).

Here D is the Hamming distance and the radius is d = D/n: a ball of Hamming radius d' can be covered with O(n^4) B(d')/B(d) balls of Hamming radius d, for every d ≤ d'.

This is a new result (as far as we know) on sparsely covering large Hamming balls by small Hamming balls.

Lemma: (i) log B(d) = nH(d) + O(log n), with d = D/n ≤ ½ and H(d) = d log 1/d + (1-d) log 1/(1-d);
(ii) d_max = ½, with D = n/2: every string is within n/2 bit flips of either center 00...0 or center 11...1.
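A quick numeric check of part (i) in Python, comparing the exact log-cardinality of a Hamming ball with nH(d); the parameters n = 100 and D = 20 are illustrative.

```python
from math import comb, log2

def log_ball_size(n: int, D: int) -> float:
    """log2 of the number of n-bit strings within Hamming distance D of a fixed center."""
    return log2(sum(comb(n, i) for i in range(D + 1)))

def H(p: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

n, D = 100, 20
print(log_ball_size(n, D))   # roughly 69 bits
print(n * H(D / n))          # roughly 72 bits; the two agree up to O(log n) terms
```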

Page 26: On  Lossy  Compression

Hamming Distortion, Continued

(Figure: rate r_x(d) versus distortion d = D/n, showing the upper bound n(1 - H(d)), the lower bound K(x) - nH(d), the actual curve r_x(d), and the minimum sufficient statistic; marks at K(x), n, log n, and ½.)

At rate K(x) we can describe data x perfectly: no distortion, d = D/n = 0.

With distortion d = D/n = ½ we only need to specify the number of bits of data x, in O(log n) bits.

Every monotonic non-increasing function r(d), with r(d) + log B(d) monotonic non-decreasing and r(½) = 0, that is, every function in between the lower and upper bounds descending at least at the proper slope, can be realized as the rate-distortion function of some datum x, with precision |r(d) - r_x(d)| ≤ O(√n log n) + K(r).

Page 27: On  Lossy  Compression

Theory to practice, using real compressors—with Steven de Rooij

Rate        Distortion   Datum x
16.076816   20.000000    0000100100110011000111
16.491853   19.000000    00000100000100010100100
16.813781   18.000000    000001001100100010000101
17.813781   17.000000    0101010010001010100101000
18.076816   16.000000    00101101110111001111011101
18.299208   15.000000    001011011101110011101011100
19.299208   14.000000    0101010010001001010011010010
19.884171   13.000000    00001010010101010010100010101
20.299208   12.000000    001011010010101010010101010100
20.621136   11.000000    0010100100010101010010100010101
21.621136   10.000000    01010100100010101010010101010100
22.106563   9.000000     0010110110011010110100110110110101
23.106563   8.000000     01010110110011010110100110110110101
24.106563   7.000000     1101011001010101011010101010111010101
24.691525   6.000000     110101101010100101011010101010111010101
26.621136   5.000000     010101001010001010010101010101001010101
29.206099   4.000000     010101001010001011010100101010101110101
32.469133   3.000000     0101010010101010010101010101101010111110101
33.884171   2.000000     0101010110100010101010010101010111110101
38.130911   1.000000     01010100101000101010101010101010111110101
42.697952   0.000000     010101010101000101010101010101010111110101
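The sketch below is not the actual experimental code; it only illustrates the idea in Python under stated assumptions: zlib compressed length stands in for K(y), the candidate model class is a crude hand-picked set, and Hamming distance is used as the distortion.

```python
import random
import zlib

def C(bits: str) -> int:
    """Compressed length in bits: a computable upper bound standing in for K(y)."""
    return 8 * len(zlib.compress(bits.encode(), 9))

def hamming(a: str, b: str) -> int:
    return sum(u != v for u, v in zip(a, b))

def distortion_rate_curve(x: str, candidates, rates):
    """For each rate bound r, the least Hamming distortion over candidate code words y with C(y) <= r."""
    curve = {}
    for r in rates:
        feasible = [y for y in candidates if C(y) <= r]
        curve[r] = min(hamming(x, y) for y in feasible) if feasible else None
    return curve

random.seed(0)
n = 64
x = ''.join(random.choice('01') for _ in range(n))
# Candidate models: simple periodic strings plus x itself (a crude, illustrative model class).
candidates = ['0' * n, '1' * n, '01' * (n // 2), '0011' * (n // 4), x]
print(distortion_rate_curve(x, candidates, rates=range(80, 200, 20)))
```

With a richer model class and a better compressor, the resulting curve approximates the datum's individual distortion-rate curve from above.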

Page 28: On  Lossy  Compression

Mouse: Original Picture

Page 29: On  Lossy  Compression

Mouse: Increasing Rate of Codes

Page 30: On  Lossy  Compression

Mouse: MDL code-length

Page 31: On  Lossy  Compression

Penguin: Original (Linux)

Page 32: On  Lossy  Compression

Penguin: Rate of code-lengths

Page 33: On  Lossy  Compression

Euclidean Distortion

Lemma: d = |x - y| (Euclidean distance between the real datum x and the rational code word y).

α = 2; d_max = ½; r_x(½) = O(1);
r_x(d) ≤ r_x(d') + log d'/d   [all 0 < d ≤ d' ≤ ½]

Every non-increasing function r(d), such that r(d) + log d is monotonic non-decreasing and r(½) = 0, can be realized as the rate-distortion function of some real x, with precision |r(d) - r_x(d)| ≤ O(√(log 1/d))   [all 0 < d ≤ ½].

Page 34: On  Lossy  Compression

List Distortion

Lemma: d = |y|, the cardinality of the finite set y (the code word) containing x, where |x| = n.

α = 2; d_max = 2^n; r_x(2^n) = O(log n);
r_x(d) ≤ r_x(d') + log d'/d + O(small)   [all 0 < d ≤ d' ≤ 2^n]

Every non-increasing function r(d), such that r(d) + log d is monotonic non-decreasing and r(2^n) = 0, can be realized as the rate-distortion function of some string x of length n, with precision |r(d) - r_x(d)| ≤ O(log n + K(r))   [all 1 < d ≤ 2^n].

Page 35: On  Lossy  Compression

List distortion continued: Distortion-rate graph

(Figure: distortion-rate graph for list distortion, plotting distortion log |y| against rate r; the curve d_x(r) lies above the lower bound d_x(r) ≥ K(x) - r.)

Page 36: On  Lossy  Compression

List distortion continued: Positive and negative randomness

(Figure: two strings x and x' of equal length |x| = |x'| and equal complexity K(x) = K(x') whose distortion-rate curves d_x(r) and d_x'(r) differ; distortion log |y| is plotted against rate r.)

Page 37: On  Lossy  Compression

List distortion continued: Precision of following given function d(r)

(Figure: a given target function d(r) and the realized curve d_x(r) of some datum x following it closely; distortion log |y| is plotted against rate r.)

Page 38: On  Lossy  Compression

Expected individual rate-distortion equals Shannon’s rate-distortion

Lemma: Given m repetitions of an i.i.d. random variable with probability f(x) of obtaining outcome x, where f is a total recursive function (so K(f) is finite):

lim_{m→∞} ∑_{x^m} p(x^m) (1/m) d_{x^m}(mR) = D(R),

where x^m = x_1 ... x_m, and p(·) is the extension of f to m repetitions of the random variable.

Page 39: On  Lossy  Compression

Algorithmic Statistics

Paul Vitanyi CWI, University of Amsterdam, National ICT Australia

Joint work with Kolya Vereshchagin

Page 40: On  Lossy  Compression

Kolmogorov’s Structure function

Page 41: On  Lossy  Compression

Non-Probabilistic Statistics

Page 42: On  Lossy  Compression

Classic Statistics--Recalled

Page 43: On  Lossy  Compression

Sufficient Statistic

Page 44: On  Lossy  Compression

Sufficient Statistic, Contn’d

Page 45: On  Lossy  Compression

Kolmogorov Complexity--Revisited

Page 46: On  Lossy  Compression

Kolmogorov complexity and Shannon Information

Page 47: On  Lossy  Compression

Randomness Deficiency

Page 48: On  Lossy  Compression

Algorithmic Sufficient Statistic

Page 49: On  Lossy  Compression

Maximum Likelihood Estimator,Best-Fit Estimator

Page 50: On  Lossy  Compression

Minimum Description Length estimator, Relations between estimators

Page 51: On  Lossy  Compression

Primogeniture of ML/MDL estimators

• ML/MDL estimators can be approximated from above.
• The best-fit estimator cannot be approximated, either from above or from below, up to any precision.
• But the approximable ML/MDL estimators yield the best-fitting models, even though we don't know the quantity of goodness-of-fit: ML/MDL estimators implicitly optimize goodness-of-fit.

Page 52: On  Lossy  Compression

Positive- and Negative Randomness,

and Probabilistic Models

Page 53: On  Lossy  Compression

List distortion continued

Page 54: On  Lossy  Compression

Recapitulation

Page 55: On  Lossy  Compression

Selected Bibliography

N.K. Vereshchagin, P.M.B. Vitanyi, A theory of lossy compression of individual data, http://arxiv.org/abs/cs.IT/0411014, submitted.

P.D. Grunwald, P.M.B. Vitanyi, Shannon information and Kolmogorov complexity, IEEE Trans. Information Theory, submitted.

N.K. Vereshchagin and P.M.B. Vitanyi, Kolmogorov's structure functions and model selection, IEEE Trans. Inform. Theory, 50:12(2004), 3265-3290.

P. Gacs, J. Tromp, P. Vitanyi, Algorithmic statistics, IEEE Trans. Inform. Theory, 47:6(2001), 2443-2463.

Q. Gao, M. Li and P.M.B. Vitanyi, Applying MDL to learning best model granularity, Artificial Intelligence, 121:1-2(2000), 1--29.

P.M.B. Vitanyi and M. Li, Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity, IEEE Trans. Inform. Theory, IT-46:2(2000), 446--464.