2 Lossless image compression


3 Lossy compression

In lossy compression the decoded data (e.g. the reconstructed image) does not need to be precisely identical to the original source data. This leads to more compression, since several source sequences may be represented by the same codeword. The codes are not uniquely decodable any more, and the entropy of the codes used may be lower than the entropy of the original source. Rate-distortion theory investigates the trade-offs between the codeword lengths (rate) and the decoding error (distortion).

3.1 Image distortion measures

A good measure of image distortion would coincide with the human visual system: the distortion between two images should tell how different the images seem to a human observer. Calculating such distortion – if at all possible – would be computationally expensive. Such a measure would presumably also be theoretically difficult to analyze and minimize. In practice, simpler measures must be used.

Let x and y be the intensities of a pixel in the same position in two images. The two most commonly used measures for their difference are the absolute value |x − y| and the square (x − y)². More generally, any power |x − y|^r can be used, where r > 0. The distortion between two images is the average of the pixel-by-pixel differences over all pixels of the two images. Let x = (x_1, x_2, . . . , x_N) and y = (y_1, y_2, . . . , y_N) be the vectors of pixel intensities of two grayscale images of resolution N. Then the mean squared error (MSE) is the average of the squared differences:

MSE(x, y) = (1/N) · ∑_{i=1}^{N} (x_i − y_i)²,

and the mean absolute error (MAE) is the average of the absolute differences:

MAE(x, y) = (1/N) · ∑_{i=1}^{N} |x_i − y_i|.

More generally, the L_r-error is

L_r(x, y) = (1/N) · ∑_{i=1}^{N} |x_i − y_i|^r.

Hence the MSE and MAE are the L_2- and the L_1-errors, respectively. The L_∞-error is the maximum of the pixel differences:

L_∞(x, y) = max_{i=1,...,N} |x_i − y_i|.

MSE is the most widely used image distortion measure. It has the advantage that it is the usual Euclidean norm of the vector space R^N (squared and normalized by dividing it by N), so that it is amenable to mathematical treatment.

44

Page 2: 2 Lossless image compression

MSE is frequently measured in the logarithmic scale. This is the peak signal-to-noise ratio

PSNR(x, y) = 10 log_10 ( M / MSE(x, y) ),

where M is the square of the maximum intensity value, e.g. M = 255² for 8 bit images. The unit of PSNR is the decibel, abbreviated dB. Notice that a large PSNR value means small distortion, in contrast to MSE and MAE where large values mean large distortions.

If in the formula for PSNR we replace M by the average square intensity of the original image, then we get the signal-to-noise ratio SNR.
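The distortion measures above are easy to check in code. Here is a minimal Python sketch of MSE, MAE and PSNR for flat intensity sequences; the sample values are made up for illustration:

```python
import math

def mse(x, y):
    """Mean squared error between two equal-length intensity sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def mae(x, y):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def psnr(x, y, peak=255):
    """Peak signal-to-noise ratio in decibels; M = peak**2 as in the text."""
    return 10 * math.log10(peak ** 2 / mse(x, y))

original = [100, 120, 130]
decoded = [101, 118, 131]
print(mse(original, decoded))              # -> 2.0  (average of 1, 4, 1)
print(round(psnr(original, decoded), 1))   # -> 45.1 (dB)
```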

3.2 Scalar quantization

In image compression algorithms we typically need to entropy code sequences of numerical values. The values can be pixel intensities, prediction errors (in DPCM), transformed intensities, etc. In lossless compression we take advantage of the uneven probability distribution of the values to compress the source down to the entropy of the distribution.

In lossy compression we are allowed to change the values to gain in the compression rate. Scalar quantization refers to the simple approach of "rounding" the numerical values before lossless coding. The decoder reconstructs the code into the numerical value that minimizes the average distortion.

A scalar quantizer is specified by the following two items:

• A partition {D_1, D_2, . . . , D_m} of the source alphabet S ⊆ R into a finite number of decision regions D_i. Each region D_i corresponds to one code, which we denote by the index i. All elements of D_i are represented as the same code i. Sequences of codes are encoded losslessly as binary words. Without loss of generality we may assume that {D_1, D_2, . . . , D_m} is a partition of the entire R. (Real numbers that are not in S can be distributed into the regions D_i arbitrarily.) In practice, the decision regions are intervals, so they are specified simply by the boundaries between the intervals.

• Reconstruction values r_i ∈ R for i = 1, 2, . . . , m. The decoder reconstructs the code i as the numerical value r_i.

Given the decision regions D_i we can calculate the bitrate of the quantizer. The probability of a code i is the total probability of the source inside the decision region D_i:

P(D_i) = ∑_{a∈D_i} P(a).

The codes are encoded losslessly, so the rate is the entropy of the code distribution:

R = − ∑_{i=1}^{m} P(D_i) log P(D_i).

45

Page 3: 2 Lossless image compression

Given the decision regions D_i and the reconstruction values r_i we can also calculate the average (or expected) distortion of the quantizer, as the average distortion between the source symbols and their reconstruction values:

D = ∑_{i=1}^{m} ∑_{a∈D_i} P(a) d(a, r_i),

where d(a, r_i) is a distortion function. Here it is assumed that the image distortion metric we use is the average of the pixel-to-pixel distortion values d(·, ·). For example, using d(a, r_i) = |a − r_i| or d(a, r_i) = (a − r_i)² leads to the MAE or MSE metrics, respectively. Notice that we have to change the sums into integrals if the source distribution is not discrete but a continuous probability density function.

There are algorithms to construct good scalar quantizers for given probability distributions of real numbers, but in practice the uniform quantizer (equal length decision intervals) is used very often. Notice that any digital image has gone through scalar quantization when the real valued, "analog" pixel intensities are converted into finite bit-depth "digital" intensities.
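The rate and distortion formulas above can be sketched for a uniform quantizer in a few lines of Python. This is illustrative only: the decision intervals all have length `step`, the code is the interval index, the reconstruction value is the interval midpoint, and the sample data and step sizes are made up:

```python
import math
from collections import Counter

def quantize(value, step):
    """Code of a value: index of its (uniform) decision interval."""
    return round(value / step)

def reconstruct(code, step):
    """Reconstruction value: the midpoint of the decision interval."""
    return code * step

def rate_and_distortion(samples, step):
    """Entropy of the code distribution (bits/sample) and MSE distortion."""
    n = len(samples)
    codes = [quantize(v, step) for v in samples]
    counts = Counter(codes)
    rate = -sum((c / n) * math.log2(c / n) for c in counts.values())
    dist = sum((v - reconstruct(q, step)) ** 2
               for v, q in zip(samples, codes)) / n
    return rate, dist

samples = [0.1, 0.4, 2.2, 1.9, 2.1, 0.3, 1.8, 0.2]   # made-up source data
print(rate_and_distortion(samples, 1.0))    # coarse step: low rate, high distortion
print(rate_and_distortion(samples, 0.25))   # fine step: higher rate, lower distortion
```

Varying `step` traces out rate-distortion pairs of the kind rate-distortion theory studies.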

3.3 Lossy DPCM

Scalar quantization can be used with the Differential Pulse Code Modulation of Section 2.3 to get our first lossy image compression method. The idea is to quantize the prediction errors e = x − x̂. In this way the number of different error values to encode is reduced, lowering the bitrate. At the same time the reconstruction is not exact, which introduces distortion.

Since the decoder has to be able to calculate the same linear prediction as the encoder, the prediction for each pixel is calculated from the reconstructed – not the original – intensity values of the pixels in the neighborhood. So the DPCM encoder works as follows. For each pixel it

1. calculates the prediction

â = α_1 a′_1 + α_2 a′_2 + . . . + α_m a′_m

from the reconstructed values a′_1, a′_2, . . . , a′_m of the neighboring pixels.

2. quantizes the prediction error e = a − â. Typically, a uniform quantizer is used for its simplicity. Let e′ be the quantized value of e.

3. entropy encodes e′ using Huffman or arithmetic coding. This is the compressed data stored or transmitted to the decoder.

46

Page 4: 2 Lossless image compression

4. calculates the reconstructed pixel value

a′ = e′ + â

to be used in calculating future predictions.

The decoder does the same steps as in the lossless DPCM. Quantization does not affect the decoder. For each pixel the decoder

1. calculates the prediction â = α_1 a′_1 + α_2 a′_2 + . . . + α_m a′_m.

2. decodes from the compressed data the (quantized) prediction error e′.

3. adds the quantized prediction error e′ to the prediction â to get the reconstructed pixel value

a′ = e′ + â.

The same values a′ are reconstructed by the encoder and the decoder.
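The encoder and decoder loops above can be sketched in Python. This is a deliberately simplified 1D version: the predictor is just the previous reconstructed sample (a one-tap predictor with α_1 = 1), the quantizer is uniform, and the entropy coding step is omitted:

```python
def dpcm_encode(pixels, step):
    """Lossy DPCM, 1D sketch: predict each sample by the previous
    reconstructed sample, then quantize the prediction error uniformly."""
    codes = []
    prev = 0                          # reconstructed value of previous sample
    for x in pixels:
        prediction = prev             # one-tap predictor
        e = x - prediction            # prediction error
        q = round(e / step)           # quantized error index (entropy coded)
        codes.append(q)
        prev = prediction + q * step  # same reconstruction the decoder gets
    return codes

def dpcm_decode(codes, step):
    """Decoder: identical prediction, plus the dequantized error."""
    out, prev = [], 0
    for q in codes:
        prev = prev + q * step
        out.append(prev)
    return out

codes = dpcm_encode([100, 102, 105, 104, 108], step=3)
print(dpcm_decode(codes, step=3))   # -> [99, 102, 105, 105, 108]
```

Note how the encoder tracks `prev` from its own reconstructions, so encoder and decoder stay in lockstep even though the quantization is lossy.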

Example 22. Let us return to the test image "peppers", and compress it with lossy DPCM. The predictor is the ten neighbor predictor from page 42. A uniform quantizer with different step sizes gives rate-distortion pairs, whose graph is shown below. The dotted graph is the rate-distortion performance of JPEG on the same test image. Lossy DPCM compares favorably at high bitrates. At lower bitrates the quantization errors in the prediction neighborhood are large, so that the prediction â becomes inaccurate, causing large prediction errors and therefore an increase in the entropy.

47

Page 5: 2 Lossless image compression

3.4 Vector quantization

Vector quantization (VQ) refers to the method of encoding several numerical values together as a block. As with scalar quantization, the values can be pixel intensities, prediction errors, etc. In the following discussion we take the intensity values of the pixels as the source symbols, but the same methods work with other interpretations.

The vector quantizer divides the source sequence into blocks of size n, for some n. Since images are two-dimensional arrays, the blocks are usually square (or rectangular) regions of pixels. Let n = h × v be the block size. A vector quantizer is then specified by the following two items (cf. the scalar quantizer):

• A partition {D_1, D_2, . . . , D_m} of the set of source blocks S^n ⊆ R^n into a finite number of decision regions D_i. Each region D_i corresponds to one code i, and the codes are encoded losslessly as binary words. Without loss of generality we may assume that {D_1, D_2, . . . , D_m} is a partition of the entire R^n.

• Reconstruction values r_i ∈ R^n for i = 1, 2, . . . , m. The decoder reconstructs the code i as the block r_i.

In the vector quantization literature it is common to call the reconstruction values r_i code vectors, and the set {r_1, r_2, . . . , r_m} of the code vectors is the codebook.

An image is encoded by dividing it into non-overlapping rectangular blocks of size h × v. For each block b the encoder determines which decision region D_i contains b. This gives the code i of the block. In practice, the decision regions D_i are usually determined by the codebook so that a vector belongs to the decision region D_i that corresponds to the closest code vector r_i. In this case, the encoder evaluates the distortions between the source block and all code vectors, and chooses the index i that corresponds to the code vector r_i that gives the smallest distortion.

The decoder reconstructs the image by pasting the corresponding code vector r_i in the place of b. The process is identical to scalar quantization, except that encoding and reconstruction are done in blocks.

Vector quantization takes advantage of the statistical correlations between pixels that belong to the same block. In scalar quantization the pixels are viewed as independent, i.e. the source model is iid. For this reason the scalar quantizer requires DPCM or some other method to decorrelate the samples before quantization. The vector quantizer performs well even when applied to blocks of the original image intensities, although prediction may still be helpful in taking full advantage of correlations between adjacent blocks. There exist various algorithms aimed at generating good codebooks.
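The closest-code-vector encoding rule can be sketched as follows. The codebook here is a made-up toy with four 1 × 2 code vectors; real codebooks are trained from image data (e.g. with the LBG algorithm):

```python
def encode_block(block, codebook):
    """Index of the code vector closest to the block in squared error."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sqdist(block, codebook[i]))

# Toy codebook of four 1x2 code vectors (illustrative values only).
codebook = [(0, 0), (80, 80), (160, 160), (240, 240)]

blocks = [(70, 75), (10, 5), (200, 230)]
indices = [encode_block(b, codebook) for b in blocks]   # stored/transmitted
print(indices)                          # -> [1, 0, 3]
print([codebook[i] for i in indices])   # decoder pastes the code vectors back
```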

4 Transform coding

4.1 Linear image transformations

Linear image transformations view a grayscale image as a real vector whose components are the pixel intensities of the image. One can either take the entire image as a single vector, or


one can divide the image into blocks as in the vector quantization, and view the blocks as vectors. Let n be the number of pixels in the image (or in the image blocks), so the images (or blocks) are elements of the real vector space R^n. A linear transformation is a linear function

f : R^n → R^n

of the image space into itself. The purpose of transformations in image compression is to change the input image into a form where the image is easy to compress.

Example 23. Let us divide the "peppers" image into blocks of size 1 × 2. The blocks are then elements of R². Since the neighboring pixels have similar intensities one would expect that the vectors (x, y) we extract from the image are concentrated around the line y = x. Indeed, if we draw a black dot for each of the over 100,000 vectors we extract from the "peppers" we get the following plot:

The x- and the y-coordinates of the dots have very similar distributions:

[Figure: histograms of the x- and y-coordinate distributions; both are spread over the full intensity range 0–255 and look very similar.]

Let us transform the vectors using the linear transformation that corresponds to the 45° rotation of the plane:

( x )       ( x′ )         (  1  1 ) ( x )
( y )   ↦   ( y′ )  = 1/√2 ( −1  1 ) ( y )


Here is the plot of the transformed vectors (x′, y′), obtained from the original plot through a rotation of the coordinate system:

Now the y′ coordinates are much more concentrated around zero, and their distribution has lower entropy than before:

[Figure: histograms of the x′- and y′-coordinate distributions; the x′-coordinates spread over roughly 0–350, while the y′-coordinates are sharply peaked around zero.]

The new coordinates are easier to compress than the original ones. The rotation is lossless, so the original coordinates (x, y) can be obtained from the transformed values (x′, y′) through a 45° rotation to the opposite direction. □

The idea of the 45° rotation in the previous example was to "decorrelate" the x- and the y-coordinates, and to concentrate image "energy" into the x-coordinate. These are the general goals of transforming the vectors. Let us define the concepts of the image energy and the correlation of the coordinates.

In the following discussion the following notation is used for the coordinates of a vector: if x ∈ R^n then x(i) ∈ R is the i'th coordinate of x, for i = 1, 2, . . . , n. The energy of an image is defined as the square norm of the vector, that is,

Energy(x) = ∑_{i=1}^{n} x(i)².

The number

Energy_i(x) = x(i)²


is the energy in coordinate number i, for i = 1, 2, . . . , n. For a finite set

X = {x_1, x_2, . . . , x_m} ⊆ R^n

of images, the energy of the set X is the average of the image energies:

Energy(X) = (1/m) ∑_{j=1}^{m} Energy(x_j) = (1/m) ∑_{j=1}^{m} ∑_{i=1}^{n} x_j(i)²,

and the energy in coordinate i is the average energy in coordinate i in the images of the set:

Energy_i(X) = (1/m) ∑_{j=1}^{m} Energy_i(x_j) = (1/m) ∑_{j=1}^{m} x_j(i)².

(Note: we allow x_i and x_j to be identical for i ≠ j, so, strictly speaking, X is an indexed set, or a multiset.) Clearly, the total energy in X is the sum of the coordinate energies:

Energy(X) = ∑_{i=1}^{n} Energy_i(X).

The energy is the average MSE distance between the zero vector and the elements of X. Sometimes the mean of the set X is subtracted from all elements of X before calculating the energy. We call this the variance of X, or the mean corrected energy. More precisely, let

y = (1/m) ∑_{j=1}^{m} x_j

be the centroid of X . The variance is now the energy of the set

X′ = {x_1 − y, x_2 − y, . . . , x_m − y}

whose elements are the differences x_j − y, for j = 1, 2, . . . , m. So the variance in coordinate i is

Var_i(X) = (1/m) ∑_{j=1}^{m} [x_j(i) − y(i)]².

The total variance in all n coordinates is

Var(X) = ∑_{i=1}^{n} Var_i(X).

The variance is the average MSE error between the elements of X and the centroid of X. Intuitively, the variance of a coordinate describes how much data there is in that coordinate. Low variance coordinates have a peaked distribution, which usually corresponds to low entropy. Thus they are easier to compress than high variance coordinates.

Example 24. In the previous example, the energies and the variances in the x- and the y-coordinates are


          Energy    Variance
x        17223.7      2906.8
y        17280.9      2884.7
total    34504.6      5791.5

and in the x′- and y′-coordinates

          Energy    Variance
x′       34426.1      5713.1
y′          78.5        78.4
total    34504.6      5791.5

The total energy and the total variance remained invariant under the transformation, but the energy and the variance got concentrated in the x′-coordinate. This is called energy compaction. □

Here, we defined the terms energy and variance for finite sets of vectors. The concepts could be as easily defined for probability distributions over vectors. The energy is the second moment of the distribution, and the variance, as commonly defined, is the second moment around the mean. Our setup of a finite collection of m vectors is then a special case where the vectors are assigned the equal probability 1/m.

In general, the goal of the image transformation f is to pack the energy into a small number of coordinates. The transformation should be lossless, which means that f is injective (=one-to-one). Even more strictly, our transformations will be such that the MSE distortion between images remains invariant under the transformation. In other words, the transformation should not change the Euclidean distances in the space R^n. This means that the transformations are rigid "rotations" of the n-dimensional space. These transformations are called orthogonal or unitary. The fact that MSE is invariant implies that the total image energy does not change (as it is the MSE to the zero vector) and that the total variance of a set does not change either (as it is the average MSE to the centroid of the set, and the centroid of the transformed vectors is the same as the transformed centroid of the original vectors). The feature that MSE is invariant is useful in lossy compression, because it means that we can quantize the transformed coordinates, and the MSE error that the quantization causes on the original image is the same as the MSE quantization error in the transformed image.

On a very general level encoding consists of the following steps:

1. Apply an orthogonal transformation of Rn to the image.

2. Quantize the new coordinates.

3. Entropy code the quantized coordinates.

The decoding does the following:


1. Entropy decode the quantized coordinates.

2. Apply the inverse transformation.

Example 25. Consider the block (x, y) = (70, 75). Our 45° rotation transforms this vector into (x′, y′) = (102.53, 3.54). Let us quantize both coordinates to the closest multiples of 10, so the quantized (x′, y′) is (100, 0). Coordinates 100 and 0 are entropy coded. The decoder applies the inverse rotation to (100, 0), and gets the vector (70.71, 70.71). This is the decoded block. The quantization introduces the MSE error

((102.53 − 100)² + (3.54 − 0)²) / 2 ≈ 9.5

in the transformed coefficients. The reconstruction error in the original coordinates is the same:

((70 − 70.71)² + (75 − 70.71)²) / 2 ≈ 9.5.
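The rotate-quantize-reconstruct steps of this kind can be traced in a short pure-Python script (the rotation matrix is hard-coded; the two MSE values come out equal because the transformation is orthogonal):

```python
import math

s = 1 / math.sqrt(2)
A = [[s, s], [-s, s]]        # the 45-degree rotation from the text
AT = [[s, -s], [s, s]]       # inverse rotation (transpose)

def apply(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

x = [70, 75]
c = apply(A, x)                       # transform coefficients
print([round(t, 2) for t in c])       # -> [102.53, 3.54]

q = [round(t / 10) * 10 for t in c]   # quantize to closest multiples of 10
y = apply(AT, q)                      # decoded block
print([round(t, 2) for t in y])       # -> [70.71, 70.71]

mse_coeff = sum((a - b) ** 2 for a, b in zip(c, q)) / 2
mse_pixel = sum((a - b) ** 2 for a, b in zip(x, y)) / 2
print(round(mse_coeff, 1), round(mse_pixel, 1))   # -> 9.5 9.5
```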

Energy compaction, as discussed above, is one goal of the transformation. Decorrelation of the coordinates is another one. Let us define the coenergy (non-standard terminology!) between the i'th and the k'th coordinates in the set X as

CoE_{i,k}(X) = (1/m) ∑_{j=1}^{m} x_j(i) x_j(k).

In other words, the coenergy is the inner product between the vectors formed by the i'th and the k'th coordinates of the elements of X, divided by the number m of elements in X. Usually the coenergy is measured between the mean corrected vectors. This is the covariance of the coordinates. If y is the centroid of X then the covariance is the coenergy in the set

X′ = {x_1 − y, x_2 − y, . . . , x_m − y}

whose vectors are the differences x_j − y. In other words,

CoV_{i,k}(X) = (1/m) ∑_{j=1}^{m} [x_j(i) − y(i)] · [x_j(k) − y(k)].

Notice that the coenergy and the covariance of a coordinate i with itself are the energy and the variance of coordinate i, respectively, that is,

CoE_{i,i}(X) = Energy_i(X), and
CoV_{i,i}(X) = Var_i(X).

Note, again, that these concepts could equally well be defined for general probability distributions on vectors. Then, if X_i denotes the random variable for the i'th coordinates of vectors, the covariance CoV_{i,k}(X) would be the usual covariance between the random variables X_i and X_k.

Coordinates i and k are uncorrelated if CoV_{i,k}(X) = 0. This coincides with the usual definition of correlation in statistics if the coordinates are viewed as random variables. Recall that independence of random variables is a stronger property in statistics: all independent variables are uncorrelated, but the converse statement is not true.

Example 26. The coenergy and the covariance between the x- and the y-coordinates in our earlier examples are 17173.8 and 2817.4, respectively. The coenergy and the covariance between the transformed coordinates x′ and y′ are −28.6 and 11.1, that is, much smaller. The x′- and y′-coordinates are almost, but not totally, uncorrelated. □

If two coordinates have a non-zero covariance then they contain common information about the image. Therefore it is wasteful to encode the related coordinates independently, thus storing the redundant information twice. Hence a second goal of the image transformation is to rotate the coordinate system in such a way that the covariances between the coordinates become zero, or close to zero. Fortunately this goal and the goal of energy compaction coincide, as we'll prove later.
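The covariance formula is easy to sketch directly. Here we use made-up strongly correlated pairs in place of real neighboring-pixel data, and check that the 45° rotation nearly decorrelates them:

```python
def covariance(X, i, k):
    """Covariance between coordinates i and k over a finite set of vectors."""
    m = len(X)
    yi = sum(x[i] for x in X) / m          # centroid coordinate i
    yk = sum(x[k] for x in X) / m          # centroid coordinate k
    return sum((x[i] - yi) * (x[k] - yk) for x in X) / m

# Made-up correlated pairs, as neighboring pixel intensities tend to be:
X = [(10, 12), (50, 49), (90, 95), (130, 128)]
print(covariance(X, 0, 1))     # -> 1970.0 (large positive covariance)

s = 2 ** -0.5
Xr = [(s * (x + y), s * (-x + y)) for (x, y) in X]   # the 45-degree rotation
print(abs(covariance(Xr, 0, 1)) < 100)               # -> True (nearly decorrelated)
```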

4.2 A short review of orthogonal transformations

• Orthogonal transformations are rotations of the n-dimensional coordinate system. They are linear transformations defined by orthogonal square matrices. A square matrix is orthogonal if

(a) any two different rows of the matrix are orthogonal to each other, that is, theirinner product is zero, and

(b) the rows are normalized in such a way that the inner product of each row withitself is one.

If A is an n × n orthogonal matrix, and x ∈ R^n is any n-dimensional column vector (the data vector), then c = Ax is the transformed vector, that is, c = f(x) where f : R^n → R^n is the orthogonal transformation specified by matrix A. We use the convention that all vectors are column vectors, unless otherwise stated.

• For every square matrix A the element (i, j) of A · Aᵀ is the inner product of A's rows number i and j. Therefore, if matrix A is orthogonal we have

A · Aᵀ = I,

where I is the identity matrix. In other words, if A is orthogonal then the inverse A⁻¹ of A is simply the transpose Aᵀ. Then we also have Aᵀ · A = I, so the transpose Aᵀ is also orthogonal. It is the matrix of the inverse transformation f⁻¹.


• The rows of an orthogonal matrix A will be called the basis vectors (also basis functions) of the corresponding transformation. Let b_i be the i'th basis vector, that is, b_iᵀ is the i'th horizontal row of A.

• Let x = (x_1, x_2, . . . , x_n)ᵀ be a data vector, and c = (c_1, c_2, . . . , c_n)ᵀ the transformed data vector c = Ax. The coordinates c_i in the new "rotated" coordinate system are called the transform coefficients of x.

• The i'th transform coefficient c_i is just the inner product of the data vector x and the i'th basis vector b_i:

c_i = b_iᵀ · x.

• The original data vector x is the linear combination of the basis vectors, where the coefficients in the linear combination are the transform coefficients:

x = c_1 b_1 + c_2 b_2 + . . . + c_n b_n.

This follows from the equality x = Aᵀ · c, and the fact that the vectors b_1, b_2, . . . , b_n are the columns of matrix Aᵀ.

• The inner product between any two vectors is invariant under orthogonal transformations. Namely, let x, y ∈ R^n be two data vectors, and let c = (c_1, c_2, . . . , c_n)ᵀ and d = (d_1, d_2, . . . , d_n)ᵀ be the corresponding transformed vectors c = Ax and d = Ay. Then

xᵀ · y = (∑_{i=1}^{n} c_i b_i)ᵀ · (∑_{j=1}^{n} d_j b_j) = ∑_{i=1}^{n} ∑_{j=1}^{n} c_i d_j (b_iᵀ · b_j) = ∑_{i=1}^{n} c_i d_i = cᵀ · d.

• Since the inner product is invariant, so is the square norm

||x||² = xᵀ · x

and the MSE distance

d(x, y) = ||x − y||² / n.

In other words, for all x, y ∈ R^n,

d(Ax, Ay) = d(x, y).

• The counterpart of orthogonal transformations in complex vector spaces are the unitary transformations. A complex matrix is called unitary if its inverse is the complex conjugate of its transpose. Orthogonal matrices are the same as unitary, real valued matrices.
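The properties above can be verified numerically for the example rotation matrix; a pure-Python sketch, with no linear algebra library assumed:

```python
import math

s = 1 / math.sqrt(2)
A = [[s, s], [-s, s]]    # the example rotation matrix

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

AT = [list(row) for row in zip(*A)]

# A . A^T should be the identity matrix:
I = matmul(A, AT)
print([[round(v, 10) for v in row] for row in I])   # -> [[1.0, 0.0], [0.0, 1.0]]

# Orthogonal transforms preserve the square norm (and hence MSE distances):
x = [3.0, 4.0]
c = [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
print(round(sum(t * t for t in x), 9), round(sum(t * t for t in c), 9))   # -> 25.0 25.0
```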


Example 27. In our two-dimensional example 23, the transformation matrix A is

A  =  1/√2 (  1  1 )
           ( −1  1 ).

It is orthogonal, so its inverse is the transpose

A⁻¹  =  Aᵀ  =  1/√2 ( 1  −1 )
                    ( 1   1 ).

The basis vectors of the transformation are

b_1  =  1/√2 ( 1 )        b_2  =  1/√2 ( −1 )
             ( 1 )   and                (  1 ).

The transform coefficients of the data vector x = (70, 75)ᵀ are

c_1 = b_1ᵀ · (70, 75)ᵀ = (70 + 75)/√2 = 102.53,

and

c_2 = b_2ᵀ · (70, 75)ᵀ = (−70 + 75)/√2 = 3.54.

The new coordinates c_1 and c_2 are the coefficients in the expression of x as the linear combination of the basis vectors b_1 and b_2:

x = 102.53 · b_1 + 3.54 · b_2.

There exist several different transformations that are used in image compression. They differ in their energy compaction capability and computational complexity:

1. The optimal transformation in the energy compaction sense is the Karhunen-Loeve transform (KLT), also known as the Principal Component Analysis (PCA) of the data. The transform is data dependent. Finding the KLT basis for given images is slow, and sending it to the decoder introduces overhead. Also there is no fast algorithm for taking the transform.

2. The Discrete Cosine Transform (DCT) is used in JPEG. In practical image compression situations its energy compaction performance is excellent, almost as good as that of the KLT. There exist fast algorithms for computing the DCT.

3. The Walsh-Hadamard Transform (WHT) is very fast to compute: it requires only additions and subtractions. In the energy compaction sense WHT is inferior to DCT.

4. Haar Transform is the simplest wavelet transform. It is very fast to compute.


5. Wavelet transforms concentrate image energy very well into a small number of coordinates. But the indices of the high energy coordinates depend on the image. Wavelet transforms also provide a multiresolution representation of the image: a small number of transform coefficients produces a low resolution version of the input image. This can be used in progressive transmission of images.

In the following sections we investigate some of the mentioned transformations.

The transforms are applied to images, i.e. to "two-dimensional" data. (Here, the term "dimension" is not the same as the dimension n of the vector space, that is, the number of pixels in the images or image blocks.) All the transforms above, except KLT, are first developed for "one-dimensional" input signals, i.e. sequences of real valued samples. The 2D basis vectors are then constructed as tensor products of the 1D basis vectors, as described below. Such a basis is called separable. Let

b_1, b_2, . . . , b_n ∈ R^n

be the 1D basis vectors. Then the corresponding separable 2D basis is obtained by multiplying the basis vectors b_i and b_j, for all pairs i, j = 1, 2, . . . , n: the basis vectors are the outer products

b_i · b_jᵀ.

These are n × n matrices, and they are understood as elements of R^{n²} = R^{n×n}. (So, to obtain a standard n² × n² transformation matrix, the outer products have to be "flattened" by reading the values row-by-row. However, when displaying the basis vectors, the 2D version is more instructive than the flattened one, in the same way as displaying the input data as a two-dimensional image makes more sense than displaying it as a long vector.)

If the original 1D transformation is orthogonal, so is the separable product. Indeed, the inner product of the basis vectors b_i · b_jᵀ and b_s · b_tᵀ is

∑_{x=1}^{n} ∑_{y=1}^{n} b_i(x) b_j(y) b_s(x) b_t(y) = (∑_{x=1}^{n} b_i(x) b_s(x)) · (∑_{y=1}^{n} b_j(y) b_t(y)) = (b_iᵀ · b_s)(b_jᵀ · b_t).

This is zero, except if i = s and j = t, in which case it is one.

Example 28. Let's make a separable, orthogonal 2D basis from our sample transformation

( x )          (  1  1 ) ( x )
( y )  ↦  1/√2 ( −1  1 ) ( y )

The original transformation is for 1 × 2 blocks, so the new transformation will be for 2 × 2 image blocks. The four 2 × 2 basis vectors are obtained by pairwise multiplying the original


1 × 2 basis vectors (1/√2)(1, 1) and (1/√2)(−1, 1):

b_1  =  1/2 ( 1  1 )      b_2  =  1/2 ( −1  1 )
            ( 1  1 )                  ( −1  1 )

b_3  =  1/2 (  1   1 )    b_4  =  1/2 ( −1   1 )
            ( −1  −1 )                (  1  −1 )

Let us transform the 2 × 2 image block

x  =  ( 68  72 )
      ( 70  75 )

The inner products between x and the orthogonal basis vectors b_1, b_2, b_3 and b_4 provide the transform coefficients 142.5, 4.5, −2.5 and −0.5. As we know, these also tell how to combine the basis vectors to get the data vector x:

( 68  72 )
( 70  75 )  =  142.5 · b_1 + 4.5 · b_2 − 2.5 · b_3 − 0.5 · b_4.

Let X be an n × n data block we want to transform. Let A be the orthogonal n × n matrix whose i'th row is the vector b_i, so A is the matrix of the 1D transformation. Then, for every i, j, the element (i, j) of the product matrix AXAᵀ is

(AXAᵀ)_{ij} = ∑_{k=1}^{n} ∑_{m=1}^{n} A_{ik} X_{km} A_{jm} = ∑_{k=1}^{n} ∑_{m=1}^{n} X_{km} [b_i(k) b_j(m)] = ∑_{k=1}^{n} ∑_{m=1}^{n} X_{km} (b_i b_jᵀ)_{km}.

This means that (AXAᵀ)_{ij} is the transform coefficient of X that corresponds to the 2D basis vector b_i b_jᵀ. In other words, the elements of the matrix AXAᵀ are the output of the separable 2D transformation. Because

AXAᵀ = (AX)Aᵀ = A(XAᵀ),

the separable transformation can be taken by first performing the original 1D transformation on the columns of X, and then performing the 1D transformation again on the rows of the resulting matrix AX. (Or in the reverse order.)

This way of calculating the separable transformation is fast:


• Without separation we need n² multiplications for each coefficient, because the vector space has n² dimensions. This gives a total of n⁴ multiplications to get all n² coefficients.

• With separation we need n² multiplications for each row in the horizontal step, which gives a total of n³ multiplications. Then n² operations are needed for each column during the vertical step. Altogether we have 2n³ multiplications.

Example 29. Let us demonstrate the calculation of the separable transformation of the previous example in matrix form. The data block is

X  =  ( 70  75 )
      ( 68  72 )

and the transformation matrix is

A  =  1/√2 (  1  1 )
           ( −1  1 )

so the transformed block is

AXAᵀ  =  1/√2 (  1  1 ) ( 70  75 ) · 1/√2 ( 1  −1 )
              ( −1  1 ) ( 68  72 )        ( 1   1 )

      =  1/2 ( 138  147 ) ( 1  −1 )
             (  −2   −3 ) ( 1   1 )

      =  1/2 ( 285    9 )
             (  −5   −1 )

      =  ( 142.5   4.5 )
         (  −2.5  −0.5 )
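The separable two-pass calculation can be reproduced with a short script; the `matmul` helper is an illustrative pure-Python implementation:

```python
import math

s = 1 / math.sqrt(2)
A = [[s, s], [-s, s]]        # 1D transformation matrix
X = [[70, 75], [68, 72]]     # data block of the example

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

AT = [list(row) for row in zip(*A)]
coeffs = matmul(matmul(A, X), AT)    # 1D transform on columns, then on rows
print([[round(v, 1) for v in row] for row in coeffs])
# -> [[142.5, 4.5], [-2.5, -0.5]]
```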

4.3 Walsh-Hadamard and Haar transformations

The Walsh-Hadamar transformation (WHT) and the Haar transformation (HT) are two sim-plest orthogonal linear transformations of images. They are very similar, but their differencesillustrate the different philosophies behind the classical transformations (such as DCT) andthe wavelet transformations. Both WHT and HT are generalizations of the 45◦ rotation ideawe have used in our examples before.

The 2D Walsh-Hadamar transformation is separable, so we start with the 1D WHTfirst. It operates on the n-dimensional space R

n where n = 2k is a power of two. The


transformation matrix is defined recursively. The 2n = 2^{k+1} dimensional transformation matrix W_{2n} is expressed in terms of the n dimensional transformation matrix W_n as follows:

\[
W_{2n} = \frac{1}{\sqrt{2}} \begin{pmatrix} W_n & W_n \\ W_n & -W_n \end{pmatrix} \tag{6}
\]

The starting point of the recursion is the familiar 2 × 2 matrix

\[
W_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.
\]

Using tensor products of matrices we can write the recursion as

\[
W_{2n} = W_2 \otimes W_n,
\]

so that

\[
W_{2^k} = \underbrace{W_2 \otimes W_2 \otimes \dots \otimes W_2}_{k \text{ times}}.
\]

We do not need parentheses as the tensor product is associative. For example,

\[
W_4 = \frac{1}{2} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}
\]

The recursive definition above guarantees that W_{2n} is orthogonal if W_n is orthogonal. Since the first matrix W_2 is orthogonal, all WHT transformations are orthogonal.
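The tensor-product recursion is easy to mirror in a few lines of NumPy (our own helper, not from the notes), using `np.kron` for the tensor product:

```python
import numpy as np

def wht_matrix(n):
    # W_n = W_2 (x) W_2 (x) ... (x) W_2, built with the recursion (6);
    # n must be a power of two.
    W2 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    W = np.array([[1.0]])
    while W.shape[0] < n:
        W = np.kron(W2, W)   # W_{2m} = W_2 tensor W_m
    return W

W4 = wht_matrix(4)
# All entries of W_4 are +-1/2, and the matrix is orthogonal.
assert np.allclose(np.abs(W4), 0.5)
assert np.allclose(W4 @ W4.T, np.eye(4))
```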

The two-dimensional version of WHT is obtained as the separable product of the vertical and horizontal one-dimensional versions of WHT. The transformation matrix for the 2^n × 2^n WHT then has all elements ±1/2^n. Therefore the transformation can be done by using only additions and subtractions, and dividing the coefficients in the end by 2^n (= shifting n bits in binary). The following figure shows the 64 WHT basis vectors for 8 × 8 blocks. White represents 1/8 and black −1/8. The vectors are shown as 8 × 8 blocks.


In the illustration, the 2D basis vector in position (i, j) is the outer product of the i'th vertical and the j'th horizontal 1D basis vectors. The 1D vectors are ordered in the order in which they come out from the recursive formula (6). The basis vector in the upper left corner is the constant vector. It is commonly called the DC vector, while the other basis vectors are called AC vectors.

Example 30. Let us apply the 8 × 8 WHT transform on the test image "peppers". The image is divided into blocks of size 8 × 8, so there are 512/8 × 512/8 = 4096 non-overlapping image blocks, i.e., there are m = 4096 data vectors of the n = 64 dimensional space R^n. In the original data blocks the energy and variance are roughly evenly distributed in the 64 coordinates. In the transformed blocks most of the variance (88.4%) is packed into the DC coordinate. The 10 coordinates of highest variance contain over 97% of the total variance. The following figure shows the cumulative variance in the coordinates, when the coordinates are ordered from the highest variance coordinate to the lowest variance coordinate:


[Figure: Accumulation of energy in 'peppers' with 8x8 WHT. x-axis: number of coefficients; y-axis: cumulative energy.]

The DC coefficient has the highest variance. Here is the list of the nine highest variance basis vectors:

88.39%  3.28%  2.28%
 0.78%  0.65%  0.58%
 0.52%  0.44%  0.26%

Note that more variance is packed in the basis vectors that have few changes between black and white. The variance in the rapidly changing basis vectors is typically small. Loosely speaking, the number of changes from black to white and back indicates the frequency of the basis vector. Smooth parts of images contain low frequencies, so the high frequency basis vectors capture less image energy than the low frequency ones.

The WHT transformation itself is lossless and it produces little compression. In lossy compression the transform coefficients are quantized. The transformation is orthogonal, so the MSE quantization error in the coefficients is the same as the MSE error in the reconstructed image.


Let us compress the "peppers" image using the 8 × 8 WHT transform with a uniform quantizer. For different quantizer step sizes we can calculate the distortion, and as the bitrate we use the entropies of the distributions of the quantized transform coefficients, separately for each of the 64 coefficients. This experiment gives the following rate vs. distortion graph. For comparison, the rate-distortion behavior of JPEG is also plotted. JPEG performs better, at least at the intermediate bitrates.

[Figure: Rate-distortion of 'Peppers' using WHT vs. JPEG. x-axis: bitrate (bpp); y-axis: distortion (mse).]

JPEG uses another transformation (DCT) which has better energy compaction performance than WHT. Note that the comparison is not entirely fair because the bitrate of WHT is calculated as the entropy of the coefficient distributions, i.e. under the assumption that the probability model fits the data perfectly. □

Let us investigate the computational complexity of the WHT transform. Using the separability of the 2D transform, we perform first the horizontal 1D WHT followed by the vertical WHT. The normalization of the coefficients can be done in the end (one shift per pixel), so we only need to count the number of additions and subtractions in the 1D WHT.

Consider the 1D transform of size n = 2^k. If we simply multiply the vectors by the transformation matrix W_n we need n² operations. But using the recursive definition we get a faster algorithm: First divide the data vector x into two parts x1 and x2, both of size n/2. Then perform the WHT on both halves. Since


\[
W_n \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \frac{1}{\sqrt{2}} \begin{pmatrix} W_{n/2} & W_{n/2} \\ W_{n/2} & -W_{n/2} \end{pmatrix}
  \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \frac{1}{\sqrt{2}} \begin{pmatrix} W_{n/2}x_1 + W_{n/2}x_2 \\ W_{n/2}x_1 - W_{n/2}x_2 \end{pmatrix}
\]

we only need to calculate the sum and the difference of the n/2-dimensional vectors W_{n/2}x_1 and W_{n/2}x_2. This requires n/2 + n/2 = n operations.

Let S_n be the total number of additions/subtractions for the n = 2^k dimensional WHT. The analysis above shows that

\[
S_n = 2 S_{n/2} + n.
\]

Clearly S_1 = 0. This recurrence equation has the exact solution

\[
S_n = n \log_2 n.
\]

In the 2D case we have n log₂ n operations on each row and column of the block. This means a total of 2n² log₂ n operations per n × n block, or 2 log₂ n operations per pixel. In the end we normalize the coefficients by one binary shift operation per pixel. The speed of WHT makes it attractive. But there are other transformations (such as DCT) that have better energy compaction performance, and are almost as fast to compute.
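The divide-and-conquer recursion above translates directly into code. A minimal NumPy sketch (our own, not from the notes) of the fast 1D WHT:

```python
import numpy as np

def fast_wht(x):
    # Fast 1D WHT from the recursion: transform both halves, then take
    # their sum and difference -- n log2(n) additions/subtractions in
    # total, plus the final normalization by powers of 1/sqrt(2).
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n == 1:
        return x
    a = fast_wht(x[:n // 2])   # W_{n/2} x1
    b = fast_wht(x[n // 2:])   # W_{n/2} x2
    return np.concatenate([a + b, a - b]) / np.sqrt(2)

x = np.array([70., 75., 68., 72.])
y = fast_wht(x)
# The transform is orthogonal, so it preserves the energy of the vector:
assert np.isclose(np.sum(y ** 2), np.sum(x ** 2))
```

The result agrees with multiplying by the explicit matrix W_n, but uses only O(n log n) operations instead of n².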

Next we take a closer look at the definition of WHT. The recursive definition

\[
W_{2n} = W_2 \otimes W_n
\]

says that the 2n-size transformation can be done by dividing the data vector in the middle into two segments of equal size, performing the n-size WHT on both halves independently, and then executing W_2 on the corresponding coefficients of the two halves:

[Diagram: two n-size WHT boxes side by side, followed by W2 butterflies (+/−) on the corresponding coefficients of the two halves.]

This is the view we used in the analysis of the time complexity above. But using the associativity of the tensor product we can also write

\[
W_{2n} = W_2 \otimes W_n = W_2 \otimes W_2 \otimes \dots \otimes W_2 = W_n \otimes W_2.
\]


This means that the same transformation can also be done by dividing the data into segments of length 2 and performing W_2 on the segments, then combining corresponding elements from the segments together into two vectors of size n and performing the n-size WHT on these two vectors:

[Diagram: W2 butterflies (+/−) applied to the length-2 segments first, followed by two n-size WHT boxes.]

Notice that in the illustrations the coefficients are permuted between the applications of W_2 and W_n, so at the bottom of the two illustrations the coefficients are the same but in a different order.

In 2D, the second illustration corresponds to the following steps to take the WHT transform of an image:

(i) Divide the image into 2× 2 blocks

(ii) Transform each 2 × 2 block using the orthogonal transformation whose basis vectors are

\[
\text{"low-low"} = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad
\text{"low-high"} = \frac{1}{2}\begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix}, \quad
\text{"high-low"} = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ -1 & -1 \end{pmatrix}, \quad
\text{"high-high"} = \frac{1}{2}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}
\]

(iii) Collect the corresponding coefficients from the blocks together to form four quarter-size images:


(iv) Recursively transform each quarter separately using WHT.

The quarters are called low-low (LL), low-high (LH), high-low (HL) and high-high (HH) frequency components, or subbands, of the image. The LL component contains most of the image energy. The LH (and the HL) component captures energy at vertical (resp. horizontal) edges of the image. Note that the whole image was transformed, not just 8 × 8 image blocks.

All subbands except the LL subband contain large coefficients only along object boundaries. Smooth areas of the image produce small coefficients. This means that the image energy is already well concentrated in the HL, LH and HH subbands. Consequently, applying the WHT transformation recursively to all four subbands may not be such a good idea: The transformation spreads the localized large values into several coefficients.
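One level of the 2 × 2 subband split of steps (i)–(iii) can be sketched in NumPy as follows (our own snippet with one possible sign convention for the basis vectors; it is not taken from the notes):

```python
import numpy as np

def subband_split(img):
    # One level of the 2x2 block transform: each coefficient uses one of
    # the four +-(1/2) basis vectors, and the coefficients are collected
    # into the LL, LH, HL and HH quarter-size subbands.
    a = img[0::2, 0::2]   # top-left pixel of every 2x2 block
    b = img[0::2, 1::2]   # top-right
    c = img[1::2, 0::2]   # bottom-left
    d = img[1::2, 1::2]   # bottom-right
    LL = (a + b + c + d) / 2
    LH = (a - b + c - d) / 2
    HL = (a + b - c - d) / 2
    HH = (a - b - c + d) / 2
    return LL, LH, HL, HH

img = np.tile(np.arange(4.0), (4, 1))   # rows are identical: no vertical edges
LL, LH, HL, HH = subband_split(img)
assert np.allclose(HL, 0) and np.allclose(HH, 0)
# Orthogonality: the total energy of the subbands equals the image energy.
assert np.isclose(sum((s ** 2).sum() for s in (LL, LH, HL, HH)), (img ** 2).sum())
```

The full WHT would recurse on all four subbands; the Haar transform defined next recurses on LL only.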

The Haar transformation (HT) is similar to WHT, except that in the step (iv) above only the LL subband is recursively transformed. The HL, LH and HH subbands are kept as they are. As a result, HT produces a hierarchy of subbands:


The Haar transformation is the simplest example of a wavelet transform. Here are the 2D Haar basis vectors of size 8 × 8:

Notice that the high frequency basis vectors are localized: they capture high frequencies from specific locations, unlike WHT where high frequencies were collected from the entire


block. This is an advantage of the Haar transformation, and other wavelet transformations. Typical images are composed of smooth areas separated by sudden changes of intensity at image boundaries. WHT performs well on the smooth areas, but image boundaries spread energy over several coefficients. In the Haar transformation image boundaries are captured in a small number of coefficients.

Example 31. To demonstrate the difference of WHT and HT on image boundaries, let us transform the one-dimensional step function

using WHT and HT. This example simulates an object boundary in an image.

The result depends on the position of the step: if it is at an even location then the first level of W_2 produces a zero high frequency band, and the low frequency band contains the step function. The situation is reduced to another, smaller, step function.

Assume then the step is at an odd location. Then W_2 produces a "spike" to the high frequency band, and a "softer" step function to the low frequency band:

The Haar transform leaves the "spike" unchanged, which is optimal in the energy compaction sense: all high frequency energy is concentrated on one coefficient. But in the WHT both the low frequency and the high frequency bands are transformed. The energy of the "spike" gets spread evenly over all coefficients:

(The actual signs of the coefficients depend on the position of the "spike".) We can conclude that the Haar transformation treats the step function better. □

On typical images (consisting of large smooth areas separated by localized object boundaries)

• WHT concentrates more energy than Haar on predetermined, image independent basis vectors, but

• if the coefficients are ordered inside each block in the order of magnitude, then the Haar transformation concentrates energy better in fewer coefficients than WHT.

In other words: After the Haar transform there are fewer large coordinates than after WHT. But the position of the large coefficients is image dependent. The large coefficients are concentrated around edges and other sharp changes of intensity. WHT concentrates energy better than HT on fixed, image independent coordinates.

This is the basic difference between the classical image transforms (WHT, DCT, KLT) and the wavelet transforms. Wavelets, including Haar, concentrate image energy very well.


They can be applied to the entire image, without partitioning it first into blocks. Quantization makes most coefficients equal to zero. The non-zero coefficients are in the locations that correspond to image edges. The compressed representation must specify the positions of the non-zero coefficients as well as their values.

In contrast, the classical transformations are applied to image blocks. The reason is that otherwise edges and other sharp intensity changes would affect a large number of coefficients, as we saw in Example 31. The division into blocks localizes the edge effect inside the block. The block energy is now concentrated into a small number of low-frequency transform coefficients. The high-frequency coefficients are small and will be quantized to zero. The position of the large coefficients is image independent and known.

Example 32. The following experiments demonstrate further the difference between the WHT and the Haar transformations.

1. The "peppers" image was transformed using WHT and HT. The entire frame was taken as a single block, which results in 512 × 512 = 262144 transform coefficients. Let us first order the squares of the transform coefficients in the decreasing order of magnitude and compute their cumulative sums in this order. The following graphs show the cumulative totals for the first 1000 coefficients — which represent 0.38 % of the total number of coefficients — for both WHT and Haar (the lower and the higher graph, respectively).

The unit of the plot on the y-axis is the ratio of the cumulative square sum to the total image energy.

From the graphs we see that the largest 1000 Haar coefficients contain 98.0 % of the total energy. The total energy in "peppers" is 17252.5 per pixel. This means that if we


keep the largest 1000 coefficients and replace all other coefficients by zero we obtain the MSE error

\[
0.020 \times 17252.5 = 353.2
\]

On the other hand, the largest 1000 WHT coefficients contain only 97.4 % of the total square sum, which means that replacing the other coefficients by zero introduces the MSE error

\[
0.026 \times 17252.5 = 453.4.
\]

The trend continues throughout the cumulative counts. For example, keeping the 26214 largest coefficients (10% of all coefficients) introduces MSE errors 20.4 and 53.5 in the cases of Haar and WHT, respectively.

2. The purpose of the following two figures is to show which coefficients contain most of the energy. The largest 10% of the coefficients are shown after the Haar and the WHT transformation:

In the figures, the basis vectors are organized in their natural order, provided by the recursive division of the image into the subbands.

So we observe that after the Haar transformation more image energy is concentrated in fewer coefficients (good!) but the positions of the large coefficients are image dependent (bad!). We need to identify both the positions and the values of the large coefficients.

3. The goal of the next experiment is to show that WHT packs energy better than Haar if a fixed, image independent order of coefficients is used. In order to demonstrate this, let us perform the Haar transformation and the WHT on the 8 × 8 blocks of "peppers". There are 4096 blocks, so we get 4096 values for each of the 64 basis vectors. Let us compute the variances of the 64 coordinates, and order them in the decreasing order of variance. This simulates the best ordering of the coordinates that is block independent, that is, the same in each block. The following plot shows the cumulative sums of variances


over the 64 basis vectors, using Haar and WHT. Now the higher plot comes from WHT and the lower one from the Haar transformation.

This plot indicates that the first eight WHT coefficients contain 96.92 % of the total variance, but the first eight Haar coefficients capture only 96.15 % of the total variance. The total variance is 2893.83, so replacing the other 56 smallest coefficients with constant values would introduce the MSE errors 0.0308 × 2893.83 = 89.2 and 0.0385 × 2893.83 = 111.4 under WHT and Haar, respectively.

Finally, observe that the Haar transformation is very fast to compute: one level of W_2 requires two additions/subtractions per pixel, and a division by 2 of each value. The first level of the transformation is done on all pixels, the second level is done only on the LL subband, that is, on one quarter of the pixels, and so on. The total number of additions/subtractions per pixel is

\[
2 \times (1 + 1/4 + 1/16 + \dots) = 8/3,
\]

that is, the number of operations per pixel is less than three.

4.4 Karhunen-Loeve Transformation

The Karhunen-Loeve transformation (KLT) — also known as the Principal Component Analysis — is optimal in its energy compaction performance in the classical sense. A maximum amount of variance (or image energy) is packed into the smallest possible number of coordinates. This is established by fully decorrelating the coordinates. The transformation is calculated from the given data vectors, so it is image dependent. Of course one can also take


a large number of training images and find the optimal transformation for them, hoping that the training data is similar to the actual data to be compressed.

Consider the n-dimensional space R^n, and let

\[
X = \{x_1, x_2, \dots, x_m\} \subseteq R^n
\]

be a set of m vectors. We want to find an orthogonal transformation matrix A such that the coenergy of the transformed coordinates i and j is zero for all i ≠ j. Let us arrange the elements of X into an n × m matrix

\[
C = (x_1, x_2, \dots, x_m)
\]

whose columns are the data vectors x_i. The elements of the n × n square matrix CC^T are the coenergies of X, multiplied by m, that is, the element (i, j) of CC^T is

\[
[CC^T]_{i,j} = m \cdot \mathrm{CoE}_{i,j}(X).
\]

Let A be any orthogonal n × n square matrix. The matrix whose columns are the transformed vectors Ax_i is

\[
AC = (Ax_1, Ax_2, \dots, Ax_m).
\]

The coenergies of the transformed coordinates are then obtained from the elements of

\[
AC(AC)^T = A(CC^T)A^T.
\]

The element (i, j) is

\[
[A(CC^T)A^T]_{i,j} = m \cdot \mathrm{CoE}_{i,j}(AX),
\]

where AX is the set of transformed vectors. In order to have all coenergies between different coordinates equal to zero, we need a transformation matrix A such that

\[
A(CC^T)A^T = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)
\]

is a diagonal matrix. This is exactly the well understood diagonalization problem of square matrices. Notice that the matrix CC^T has the following properties.

(i) Matrix CC^T is symmetric: (CC^T)^T = CC^T.

(ii) Matrix CC^T is positive semi-definite: For all x ∈ R^n

\[
x^T CC^T x = (C^T x)^T (C^T x) \geq 0.
\]

Moreover, if the rows of C are linearly independent then CC^T is positive definite: For all x ∈ R^n, x ≠ 0, we have C^T x ≠ 0, so

\[
x^T CC^T x > 0 \quad \text{for all } x \neq 0.
\]


The following results are well known in linear algebra:

1. Every symmetric (real) square matrix M is orthogonally diagonalizable. In other words, there exists an orthogonal matrix A such that AMA^T is a diagonal matrix. The diagonal elements of AMA^T are the eigenvalues of M, and the rows of the matrix A are orthonormal eigenvectors of M.

2. If a diagonalizable square matrix M is positive semidefinite (positive definite) then all eigenvalues are non-negative (strictly positive, respectively).

If we apply the results to M = CC^T we see that there exists an orthogonal matrix A whose rows are orthonormal eigenvectors of CC^T such that ACC^TA^T is a diagonal matrix whose diagonal elements are the eigenvalues of CC^T. These eigenvalues are non-negative, and strictly positive if the rows of C are linearly independent.

We can freely choose the order of the rows in A, so let us choose A in such a way that the diagonal elements of ACC^TA^T are in the decreasing order. Then we have

\[
ACC^TA^T = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n), \quad \text{where } \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n \geq 0.
\]

The orthogonal transformation whose matrix is A is the Karhunen-Loeve transform (KLT) of X. After the transformation the coenergies between different coordinates are zero. The energies of the coordinates are the eigenvalues λ_i of CC^T, divided by m.

More commonly we are interested in the transformation that makes all covariances zero. This transformation is the KLT of the modified data X′ that one gets by subtracting the mean vector

\[
y = \frac{1}{m} \sum_{j=1}^{m} x_j
\]

from all elements of X, that is, the KLT of the set

\[
X' = \{x_1 - y,\; x_2 - y,\; \dots,\; x_m - y\}.
\]

As the energies and coenergies in X′ are the same as the variances and the covariances in X, the transformation that makes the coenergies in X′ zero also makes the covariances in X zero.
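The whole construction can be sketched in a few lines of NumPy (our own helper, not from the notes): subtract the mean, diagonalize CC^T with a symmetric eigendecomposition, and sort the eigenvectors by decreasing eigenvalue.

```python
import numpy as np

def klt(X):
    # X is an n x m matrix whose columns are the data vectors.
    # Returns the KLT matrix A (rows are eigenvectors of C C^T, ordered
    # by decreasing eigenvalue) and the eigenvalues themselves.
    C = X - X.mean(axis=1, keepdims=True)    # subtract the mean vector y
    evals, evecs = np.linalg.eigh(C @ C.T)   # symmetric eigendecomposition
    order = np.argsort(evals)[::-1]          # decreasing eigenvalue order
    return evecs[:, order].T, evals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))
A, lam = klt(X)
C = X - X.mean(axis=1, keepdims=True)
# After the transform all covariances between different coordinates vanish:
assert np.allclose(A @ C @ C.T @ A.T, np.diag(lam))
```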

Example 33. Consider for example the following 5 blocks of two pixels:

\[
X = \left\{ \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} -2 \\ -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ -1 \end{pmatrix} \right\}
\]

The means have been subtracted. The variances (without dividing by 5, the number of vectors) of the two coordinates are

\[
1^2 + (-2)^2 + 0^2 + 1^2 + 0^2 = 6
\]

and

\[
2^2 + (-1)^2 + 0^2 + 0^2 + (-1)^2 = 6.
\]


Let us find the KLT basis vectors for this data. The data matrix is

\[
C = \begin{pmatrix} 1 & -2 & 0 & 1 & 0 \\ 2 & -1 & 0 & 0 & -1 \end{pmatrix},
\]

so the covariance matrix is

\[
CC^T = \begin{pmatrix} 6 & 4 \\ 4 & 6 \end{pmatrix}
\]

Orthogonal eigenvectors of the matrix CC^T are (1, −1) and (1, 1), and the corresponding eigenvalues are 2 and 10:

\[
\begin{pmatrix} 6 & 4 \\ 4 & 6 \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} = 2 \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} 6 & 4 \\ 4 & 6 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 10 \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\]

Normalizing the eigenvectors gives the orthonormal eigenvectors \( \frac{1}{\sqrt{2}}(1, -1) \) and \( \frac{1}{\sqrt{2}}(1, 1) \), which are chosen as the KLT basis vectors. In this case KLT happens to coincide with WHT. We have

\[
ACC^TA^T
= \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}
  \begin{pmatrix} 6 & 4 \\ 4 & 6 \end{pmatrix}
  \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}
= \begin{pmatrix} 10 & 0 \\ 0 & 2 \end{pmatrix}
\]

The covariance between the transformed coordinates is zero, so the new coordinates are uncorrelated. Notice that the variances of the new coordinates are 10 and 2, so the energy has been concentrated in one of the coordinates. □
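The numbers in the example are easy to verify with NumPy (our own check, not from the notes):

```python
import numpy as np

C = np.array([[1, -2, 0, 1, 0],
              [2, -1, 0, 0, -1]], dtype=float)   # the data matrix
M = C @ C.T                                       # the covariance matrix
assert np.allclose(M, [[6, 4], [4, 6]])

evals, evecs = np.linalg.eigh(M)   # eigenvalues in increasing order
assert np.allclose(evals, [2, 10])
```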

An interesting fact is that while the KLT transformation decorrelates the coordinates, it also provides optimal energy compaction. Let us prove this fact. First notice that if B is any orthogonal n × n matrix then the elements of BCC^TB^T are the coenergies of the coordinates (multiplied by m) after the transformation B has been applied to the data. The diagonal elements are the coordinate energies, again multiplied by m. The sum of the diagonal elements is the total energy, so the sum is the same for all orthogonal transformations B. (This is the well known fact that similar matrices have the same trace.)

According to the next theorem the KLT transformation A packs the image energy optimally, in the strong sense that for every k, the k highest energy coordinates after applying A contain at least as much energy as any k coordinates after any orthogonal transformation B.


Theorem 10 Let A be the KLT transformation matrix, that is,

\[
ACC^TA^T = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n), \quad \text{where } \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n \geq 0,
\]

and let B be an arbitrary n × n orthogonal matrix. Let μ_1, μ_2, . . . , μ_n be the diagonal elements of the matrix BCC^TB^T. Then for every k, 1 ≤ k ≤ n, we have

\[
\lambda_1 + \dots + \lambda_k \geq \mu_1 + \dots + \mu_k.
\]

Proof. Let us denote

\[
D = ACC^TA^T = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)
\]

and M = BA^T. Then M is orthogonal (as the product of two orthogonal matrices), and

\[
MDM^T = BCC^TB^T.
\]

The numbers μ_1, μ_2, . . . , μ_n are the diagonal elements of MDM^T, and the numbers λ_1, λ_2, . . . , λ_n are the diagonal elements of the diagonal matrix D. Performing the matrix products in MDM^T gives the diagonal elements

\[
\mu_i = \lambda_1 m_{i1}^2 + \lambda_2 m_{i2}^2 + \dots + \lambda_n m_{in}^2,
\]

where m_{ij} are the elements of the matrix M. In other words, if we denote by

\[
S = \begin{pmatrix}
m_{11}^2 & m_{12}^2 & \dots & m_{1n}^2 \\
m_{21}^2 & m_{22}^2 & \dots & m_{2n}^2 \\
\vdots & & & \vdots \\
m_{n1}^2 & m_{n2}^2 & \dots & m_{nn}^2
\end{pmatrix}
\]

the matrix whose elements are the squares of the elements of M, then

\[
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix}
= S \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{pmatrix}.
\]

Because M and M^T are orthogonal, the matrix S is row and column stochastic. This means that all elements are non-negative, and the sums of the elements on each row and each column of S are equal to one.


Let k be arbitrary, 1 ≤ k ≤ n. We have

\[
\mu_1 + \dots + \mu_k = s_1\lambda_1 + s_2\lambda_2 + \dots + s_n\lambda_n,
\]

where

\[
s_i = m_{1i}^2 + m_{2i}^2 + \dots + m_{ki}^2.
\]

This means that 0 ≤ s_i ≤ 1 and

\[
s_1 + s_2 + \dots + s_n = k.
\]

Because λ_1 ≥ λ_2 ≥ . . . ≥ λ_n, we have

\[
\begin{aligned}
\mu_1 + \dots + \mu_k
&= s_1\lambda_1 + s_2\lambda_2 + \dots + s_k\lambda_k + s_{k+1}\lambda_{k+1} + \dots + s_n\lambda_n \\
&\leq s_1\lambda_1 + s_2\lambda_2 + \dots + s_k\lambda_k + s_{k+1}\lambda_k + \dots + s_n\lambda_k \\
&= s_1\lambda_1 + s_2\lambda_2 + \dots + s_k\lambda_k + (k - s_1 - s_2 - \dots - s_k)\lambda_k \\
&= s_1\lambda_1 + s_2\lambda_2 + \dots + s_k\lambda_k + [(1 - s_1) + (1 - s_2) + \dots + (1 - s_k)]\lambda_k \\
&= [s_1\lambda_1 + (1 - s_1)\lambda_k] + [s_2\lambda_2 + (1 - s_2)\lambda_k] + \dots + [s_k\lambda_k + (1 - s_k)\lambda_k] \\
&\leq \lambda_1 + \lambda_2 + \dots + \lambda_k. \qquad \square
\end{aligned}
\]

Example 34. Let us evaluate the energy compaction performance of KLT on the test image "peppers". Let us divide the image again into 4096 blocks of size 8 × 8. The following figure shows the KLT basis vectors in the descending order of variance:


As expected, the first basis vectors are smooth, low frequency vectors. The very first vector is close to the constant valued DC basis vector. The accumulation of the variance in the coefficients is compared with the WHT data:

[Figure: Accumulation of energy with 8x8 KLT vs. WHT. x-axis: number of coefficients; y-axis: cumulative energy.]

Notice that the KLT basis is designed to concentrate the maximum amount of energy in as few coordinates as possible, where the order of the basis is fixed and the same in all transformed blocks. In this sense comparison with WHT is more natural than comparison with the Haar transformation.

In KLT the proportions of the variance in the first coefficients are 88.42%, 4.22%, 2.81%, 0.93%, 0.71%, 0.58%, 0.52%, 0.33% and 0.28%. The total in the first eight coefficients is 98.29 %, so if we replace the other 56 coefficients by their average value, we introduce the MSE error

\[
0.0171 \times 2893.83 = 49.5.
\]

Finally, to compare the effect on the rate-distortion performance we can quantize the coefficients using a uniform quantizer. The following graph shows the results, together with the similar results of WHT, and the rate-distortion curve of JPEG. Note that the bitrate of KLT is just the entropies of the coefficients. No overhead of specifying the basis vectors is included in the bitrate.


[Figure: Rate-distortion of 'Peppers' using KLT, WHT and JPEG. x-axis: bitrate (bpp); y-axis: distortion (mse).]

Despite its nice mathematical properties and the fact that it decorrelates the coordinates and optimally packs the energy, KLT is not much used in practical image compression algorithms. The fact that the transform is data dependent is a problem. Also, there is no fast algorithm to calculate the transformation of a vector, so encoding and decoding are slow. In practice, the discrete cosine transform packs the energy almost as well, and it has fast encoding and decoding algorithms.

4.5 Discrete cosine transformation

Discrete cosine transformation (DCT) is a discrete trigonometric transformation. Its basis vectors are obtained from the cosine function by discrete sampling at regular intervals. The 2D transformation is a separable product of 1D transformations in the horizontal and vertical directions.

The 1D DCT basis vectors of size 8 are obtained from functions

cos(kπx) for k = 0, 1, . . . , 7

by sampling them at points

\[
x = \frac{1}{16}, \frac{3}{16}, \frac{5}{16}, \dots, \frac{15}{16},
\]

as shown by the following illustration:


[Figure: the functions cos(kπx) for k = 0, 1, . . . , 7 on the interval [0, 1], with the sampling points marked.]

More generally, the 1D DCT basis vectors of size n are obtained from the functions

\[
\cos(k\pi x), \quad k = 0, 1, \dots, n-1,
\]

by sampling at the points

\[
x = \frac{1}{2n}, \frac{3}{2n}, \frac{5}{2n}, \dots, \frac{2n-1}{2n}.
\]

The vectors have to be normalized by multiplying them by a suitable normalization factor that makes the norm of each vector equal to one. Let D denote the n × n transformation matrix whose columns and rows are indexed by 0, 1, . . . , n−1. Then the j'th element of the k'th basis vector (row) is

\[
D_{k,j} = C_k \cos\left( k\pi \frac{2j+1}{2n} \right)
\]

where C_k is the normalization factor

\[
C_k = \begin{cases} \sqrt{1/n}, & \text{if } k = 0, \\ \sqrt{2/n}, & \text{otherwise.} \end{cases}
\]
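The definition translates directly into a small NumPy helper (our own sketch, not from the notes), which also lets us check the orthogonality claim of the next theorem numerically:

```python
import numpy as np

def dct_matrix(n):
    # D[k, j] = C_k cos(k pi (2j+1) / (2n)) with C_0 = sqrt(1/n) and
    # C_k = sqrt(2/n) for k > 0.
    k = np.arange(n).reshape(-1, 1)
    j = np.arange(n).reshape(1, -1)
    D = np.cos(k * np.pi * (2 * j + 1) / (2 * n))
    D[0] *= np.sqrt(1.0 / n)    # DC row: all entries were cos(0) = 1
    D[1:] *= np.sqrt(2.0 / n)   # AC rows
    return D

D = dct_matrix(8)
assert np.allclose(D @ D.T, np.eye(8))       # orthogonality (Theorem 11)
assert np.allclose(D[0], np.sqrt(1.0 / 8))   # constant DC basis vector
```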


The first row, given by k = 0, is the DC basis vector whose elements are all equal to the constant √(1/n). The other rows are AC basis vectors. Let us prove that the transformation is orthogonal.

Theorem 11 DCT is an orthogonal transformation.

Proof. Adding up the trigonometric identities

\[
\begin{aligned}
\cos(\alpha + \beta) &= \cos\alpha \cos\beta - \sin\alpha \sin\beta, \text{ and} \\
\cos(\alpha - \beta) &= \cos\alpha \cos\beta + \sin\alpha \sin\beta
\end{aligned}
\]

gives a formula for the product of two cosines:

\[
2 \cos\alpha \cos\beta = \cos(\alpha + \beta) + \cos(\alpha - \beta). \tag{7}
\]

Let us prove first that the DCT basis vectors (except the DC vector) are orthogonal to the constant vector. More precisely, let us show that

\[
\sum_{j=0}^{n-1} \cos\left( k\pi \frac{2j+1}{2n} \right) =
\begin{cases}
n, & \text{if } k \text{ is an even multiple of } 2n, \\
-n, & \text{if } k \text{ is an odd multiple of } 2n, \\
0, & \text{if } k \text{ is not a multiple of } 2n.
\end{cases} \tag{8}
\]

Indeed, if k is an even multiple of 2n then all kπ(2j+1)/(2n) are even multiples of π, so each cos(kπ(2j+1)/(2n)) = 1. If k is an odd multiple of 2n then all kπ(2j+1)/(2n) are odd multiples of π, so cos(kπ(2j+1)/(2n)) = −1 for every j. These prove the first two lines of (8).
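The identity (8) is easy to probe numerically before reading the proof of the remaining case (our own snippet, not from the notes):

```python
import numpy as np

def cosine_sum(k, n):
    # The left-hand side of (8).
    j = np.arange(n)
    return np.sum(np.cos(k * np.pi * (2 * j + 1) / (2 * n)))

n = 8
assert np.isclose(cosine_sum(2 * n * 2, n), n)    # even multiple of 2n ->  n
assert np.isclose(cosine_sum(2 * n * 1, n), -n)   # odd multiple of 2n  -> -n
assert np.isclose(cosine_sum(5, n), 0, atol=1e-12)  # not a multiple of 2n -> 0
```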

Assume then that k is not a multiple of 2n. Let

\[
x = \sum_{j=0}^{n-1} \cos\left( k\pi \frac{2j+1}{2n} \right)
\]

be the sum we want to evaluate. By multiplying both sides with 2cos(kπ/n) and using (7), we get

\[
\begin{aligned}
2x \cos\frac{k\pi}{n}
&= \sum_{j=0}^{n-1} 2 \cos\frac{k\pi}{n} \cos\left( k\pi \frac{2j+1}{2n} \right) \\
&= \sum_{j=0}^{n-1} \cos\left( k\pi \frac{2j+1}{2n} + \frac{k\pi}{n} \right) + \cos\left( k\pi \frac{2j+1}{2n} - \frac{k\pi}{n} \right) \\
&= \sum_{j=0}^{n-1} \cos\left( k\pi \frac{2(j+1)+1}{2n} \right) + \sum_{j=0}^{n-1} \cos\left( k\pi \frac{2(j-1)+1}{2n} \right).
\end{aligned}
\]

The two sums of the last line are almost the same as x. The first one has the same terms as x, except that it does not have the term cos(kπ/(2n)) of x and it has the additional term cos(kπ(2n+1)/(2n)). So it has the same value as

\[
x - \cos\frac{k\pi}{2n} + \cos\left( k\pi \frac{2n+1}{2n} \right).
\]

In the same way, the second sum has the value

\[
x - \cos\left( k\pi \frac{2n-1}{2n} \right) + \cos\frac{-k\pi}{2n}.
\]

Altogether we have

\[
2x \cos\frac{k\pi}{n}
= \left( x - \cos\frac{k\pi}{2n} + \cos k\pi \frac{2n+1}{2n} \right)
+ \left( x - \cos k\pi \frac{2n-1}{2n} + \cos\frac{-k\pi}{2n} \right)
= 2x,
\]

where the last step follows from the facts that

\[
\begin{aligned}
\cos\frac{-k\pi}{2n} &= \cos\frac{k\pi}{2n}, \text{ and} \\
\cos k\pi \frac{2n+1}{2n} &= \cos\left( k\pi + \frac{k\pi}{2n} \right) = \cos\left( k\pi - \frac{k\pi}{2n} \right) = \cos k\pi \frac{2n-1}{2n}.
\end{aligned}
\]

We have

\[
2x \left( \cos\frac{k\pi}{n} - 1 \right) = 0,
\]

so either x = 0 or cos(kπ/n) = 1. The latter happens only when kπ/n is a multiple of 2π, that is, only if k is a multiple of 2n. So we have x = 0, as claimed in (8).

Now we are ready to prove the theorem. Let a, b ∈ {0, 1, . . . , n−1} be arbitrary. Then the inner product of the a'th and the b'th row of the DCT transformation matrix is

\[
\begin{aligned}
\sum_{j=0}^{n-1} D_{a,j} D_{b,j}
&= C_a C_b \sum_{j=0}^{n-1} \cos\left( a\pi \frac{2j+1}{2n} \right) \cos\left( b\pi \frac{2j+1}{2n} \right) \\
&= C_a C_b \sum_{j=0}^{n-1} \frac{1}{2} \left[ \cos\left( (a+b)\pi \frac{2j+1}{2n} \right) + \cos\left( (a-b)\pi \frac{2j+1}{2n} \right) \right] \\
&= \frac{1}{2} C_a C_b \left[ \sum_{j=0}^{n-1} \cos\left( (a+b)\pi \frac{2j+1}{2n} \right) + \sum_{j=0}^{n-1} \cos\left( (a-b)\pi \frac{2j+1}{2n} \right) \right].
\end{aligned}
\]

According to (8) the first sum has value n if a = b = 0 and value 0 otherwise, and the second sum has value n if a = b and value 0 otherwise. So the inner product of the a'th and the b'th rows is 0 if a ≠ b. If a = b = 0 then the inner product is

\[
\frac{1}{2} C_a C_b (n + n) = \frac{1}{2} \sqrt{\frac{1}{n}} \sqrt{\frac{1}{n}} \cdot 2n = 1,
\]

and if a = b ≠ 0 then the inner product is

\[
\frac{1}{2} C_a C_b (0 + n) = \frac{1}{2} \sqrt{\frac{2}{n}} \sqrt{\frac{2}{n}} \cdot n = 1.
\]


This proves that the transformation matrix is orthogonal. □

Different DCT basis vectors capture energy of different frequencies present in the input data. Typical images have more energy in low frequencies, so more energy is concentrated in the low frequency DCT coefficients.

The 2D version of DCT has a separable basis formed as the outer product of the 1D DCT basis vectors. Here are the 2D 8 × 8 basis vectors:

The basis vectors are shown in the order of increasing frequency.

Example 35. Let us apply DCT to the 8 × 8 blocks of "Peppers". The energy compaction is almost as good as with KLT, and much better than with WHT. The following figure shows the cumulative variances in the 64 transform coefficients after KLT, DCT and WHT:


[Figure: Accumulation of energy with 8x8 DCT, KLT and WHT. Cumulative energy (from 0.88 to 1) as a function of the number of coefficients (0 to 70).]

The coefficients that produce most of the energy are the low frequency ones. The following table shows the percentage of the total variance provided by various coefficients. The coefficients are shown in the same order as the 8 × 8 basis vectors in our earlier figure.

88.39  4.06  0.81  0.31  0.16  0.09  0.04  0.01
 2.84  0.60  0.22  0.07  0.04  0.02  0.01  0.01
 0.64  0.21  0.09  0.04  0.02  0.01  0.01  0.00
 0.26  0.08  0.04  0.02  0.01  0.01  0.01  0.00
 0.14  0.04  0.02  0.01  0.01  0.00  0.01  0.01
 0.08  0.02  0.01  0.01  0.00  0.00  0.01  0.00
 0.04  0.01  0.01  0.00  0.01  0.00  0.01  0.01
 0.01  0.01  0.01  0.00  0.01  0.00  0.01  0.02

The variance is highest in the upper left corner, and decreases as one moves down and to the right. We’ll see shortly how the JPEG compression algorithm takes advantage of this fact. □

Let us consider briefly the computational complexity of the DCT transformation. Since the 2D version is separable, we would normally divide it into horizontal and vertical 1D DCTs. For n-dimensional data, the naive 1D algorithm requires n^2 multiplications and additions. There are n rows and n columns to be transformed, resulting in 2n^3 operations per n × n block, or 2n operations per pixel. This can however be improved substantially by taking advantage of the symmetries present in the trigonometric functions.
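The separable row/column evaluation can be sketched as follows (a sketch assuming NumPy; `dct_matrix` and `dct2` are illustrative names, and in matrix form the 2D transform of a block X is D X Dᵀ):

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # The n x n DCT matrix with entries C_k * cos(k*pi*(2j+1)/(2n)).
    j = np.arange(n)
    D = np.cos(np.outer(np.arange(n), np.pi * (2 * j + 1) / (2 * n)))
    D[0, :] *= np.sqrt(1.0 / n)
    D[1:, :] *= np.sqrt(2.0 / n)
    return D

def dct2(block: np.ndarray) -> np.ndarray:
    """Separable 2D DCT: a 1D DCT of every row, then of every column,
    which in matrix form is D @ block @ D.T."""
    D = dct_matrix(block.shape[0])
    return D @ block @ D.T

rng = np.random.default_rng(0)
X = rng.integers(-128, 128, size=(8, 8)).astype(float)
Y = dct2(X)
# The transform is orthogonal, so it preserves energy (sum of squares).
print(np.isclose((X ** 2).sum(), (Y ** 2).sum()))  # True
```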

Using any of several existing Fast Fourier Transform algorithms, the complexity can be reduced to O(n log n) when n is a power of 2. For 2D n × n blocks this gives complexity


O(n^2 log n), or O(log n) operations per pixel. Notice the usage of the ”order” symbol O(·). In practice we are interested in minimizing the exact number of operations needed. An especially important case is n = 8, as many compression standards apply DCT to blocks of size 8 × 8.

The fastest known algorithms for 1D DCT of size n = 8 require 13 multiplications and 29 additions, and 8 of the multiplications are done in the end. Such ”renormalizations” can be combined with the quantization operation that follows the transformation, leaving 13 − 8 = 5 multiplications per 1D transform. With 16 one-dimensional transforms per block, this gives a total of 16 × 5 = 80 multiplications, 16 × 29 = 464 additions and 64 renormalizations per one 8 × 8 block.

One can do even better if one does not do the horizontal and vertical transforms separately, but uses the additional symmetries of the 2D basis vectors. The fastest 2D DCT is by Feig, and it is natively 2D. Feig’s algorithm requires only 54 multiplications, 464 additions, 6 shifts, and 64 coefficient renormalizations in the end. This is less than one multiplication and a little over 7 additions per pixel.

It is of course equally (if not more) important that there exist fast inverse DCT transforms. The inverse transform is done in the decoder. Unlike in the case of WHT, where the inverse was identical to the transform itself, the inverse DCT is not identical to the forward DCT. However, the fast computation techniques of DCT are reversible, so there are inverse transforms with the same number of operations as the forward direction.

DCT is the transform used in most international standards for lossy image and video compression: JPEG, the MPEGs, H.261 and H.263. It has great energy compaction properties, and it is easy and reasonably fast to compute. Many IC manufacturers provide hardware implementations of DCT and inverse DCT for applications where speed is critical, e.g. real-time video coding.

4.6 Color images

Until now we have discussed coding of grayscale images. The same techniques work with color images as well. Any color can be obtained by mixing three basic colors. The basic colors used in devices that emit light are red, green and blue. Hence each pixel has a red, a green and a blue intensity value. This is called the RGB color system.

By taking the red values of all pixels one gets the red component of the image. Similarly, one obtains the green and blue components. One can compress the color image by using grayscale compression techniques on all three color components separately. However, the RGB color components are correlated, and therefore the RGB representation is not the best choice for compression. It is better to change the color representation first in order to remove correlation.

Better color representations from the compression point of view are luminance-chrominance representations, such as the YUV and YIQ representations. The Y component is the luminance value of the pixel, and it alone provides a grayscale version of the image. The chrominance components U and V (or I and Q) add the color. Most of the image energy is packed in the luminance, so the chrominance components are typically heavily compressible.

Both YUV and YIQ representations can be obtained from the RGB representation by a


simple linear transformation. For example, we get the YIQ representation as follows:

\[
\begin{pmatrix} Y \\ I \\ Q \end{pmatrix}
=
\begin{pmatrix}
0.299 & 0.587 & 0.114 \\
0.596 & -0.274 & -0.322 \\
0.212 & -0.523 & 0.311
\end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}.
\]

The transformation matrix is regular, and the inverse matrix gives the inverse transformation from YIQ back to RGB. Notice that the transformation matrix is not orthogonal, so the MSE distortion measure of color images depends on the color space used.

In the YUV representation Y is the same as in YIQ, that is,

Y = 0.299 × R + 0.587 × G + 0.114 × B,

but U and V are defined simply by

U = 0.492 (B − Y)
V = 0.877 (R − Y)

An RGB image is gray if R = G = B. In this case Y = R = G = B, so the chrominance values I, Q, U and V are all zero.
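A small numerical check of the conversion and of the gray-image property (a sketch assuming NumPy; the matrix entries are the YIQ coefficients given above, and `rgb_to_yiq` is an illustrative name):

```python
import numpy as np

# The RGB -> YIQ transformation matrix given above.
M = np.array([[0.299,  0.587,  0.114],
              [0.596, -0.274, -0.322],
              [0.212, -0.523,  0.311]])

def rgb_to_yiq(rgb: np.ndarray) -> np.ndarray:
    """Apply the linear transform to an RGB vector (or an (..., 3) array)."""
    return rgb @ M.T

# For a gray pixel R = G = B, Y equals the gray value and I = Q = 0.
y, i, q = rgb_to_yiq(np.array([100.0, 100.0, 100.0]))
print(np.isclose(y, 100.0) and np.allclose([i, q], 0.0))  # True

# The matrix is regular, so the inverse matrix converts YIQ back to RGB.
rgb_back = np.linalg.inv(M) @ np.array([y, i, q])
print(np.allclose(rgb_back, 100.0))  # True
```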

The chrominance components U and V (or I and Q) are easier to compress than the Y component since they contain less information. Also, the human visual system is less sensitive to chrominance data than it is to luminance data; therefore the U and V components are often also subsampled:

• In 422 format every other column is removed from the U and V images,

• In 411 format both every other column and every other row are removed, so the number of pixels in the U and V components is only one quarter of the number of pixels in the Y component.

Subsampling, unlike color conversion, is a lossy operation, but the error is visually small. The chrominance components may also be quantized more than the luminance component.
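In terms of arrays, both subsampling formats are simple slicing operations. A sketch (assuming NumPy; the function names are illustrative):

```python
import numpy as np

def subsample_422(c: np.ndarray) -> np.ndarray:
    """422 format: drop every other column of a chrominance component."""
    return c[:, ::2]

def subsample_411(c: np.ndarray) -> np.ndarray:
    """411 format: drop every other column and every other row, leaving
    one quarter of the pixels of the component."""
    return c[::2, ::2]

U = np.zeros((480, 640))
print(subsample_422(U).shape)  # (480, 320)
print(subsample_411(U).shape)  # (240, 320)
```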

4.7 JPEG compression standard

JPEG (Joint Photographic Experts Group) is a joint image compression standard by three international organizations: ISO, CCITT and IEC (International Electrotechnical Commission). JPEG is not just one image compression algorithm. It consists of a baseline algorithm that can be extended by several options depending on the requirements of the applications.

The baseline algorithm uses DCT, Huffman coding and sequential transmission. The extended options include arithmetic coding instead of Huffman coding, restart capability for simple error concealment, and progressive sequential or hierarchical modes. All these options are built on top of the baseline algorithm. There is an independent lossless mode that is not based on DCT but uses predictive coding instead.

Let us study the baseline algorithm first. Briefly, it consists of the following steps:


1. The original image is partitioned into 8 × 8 blocks. The blocks are transformed with DCT.

2. The transform coefficients are quantized using uniform quantizers. Different transform coefficients may use quantizers with different step sizes. Typically lower frequency coefficients are quantized less than higher frequency coefficients.

3. The quantized DC coefficients are encoded using a lossless DPCM where the predictor is simply the quantized DC coefficient of the previously encoded block. The prediction errors are Huffman encoded.

4. The quantized AC coefficients are ordered in order of increasing frequency. This gives a vector of 63 values that need to be coded. The idea is to have the small high frequency coefficients at the end of the list.

5. The positions of zero AC coefficients are encoded using runlength coding. The runlengths are Huffman encoded together with the values of the non-zero AC coefficients separating the runs.

6. Color images are compressed as three separate grayscale images. JPEG does not specify the color representation used.

Let us look into the different steps in more detail, with the aid of an example.

1. Consider a sample 8 × 8 image block from ”Peppers”. As an array of pixel intensities it is

 77  76  80  80  83  85 114  77
 75  80  80  80  87 105 169 133
 81  77  80  86 116 167 171 180
 67  79  86 135 170 169 169 161
 80  87 119 168 176 165 159 161
 83 122 166 175 177 163 166 155
117 168 172 179 165 162 162 159
168 174 180 169 172 162 155 160


The first step is to subtract neutral gray 2^(d−1) from all pixel values, where d is the bit depth of the input image. In the baseline JPEG d = 8, so 128 is subtracted from all pixel values. This places the intensities in the interval −128, . . . , 127. This does not affect the AC coefficients after the DCT transform, but it puts the DC coefficient in the range −1024, . . . , 1016. In the sample block, the subtraction gives

-51 -52 -48 -48 -45 -43 -14 -51
-53 -48 -48 -48 -41 -23  41   5
-47 -51 -48 -42 -12  39  43  52
-61 -49 -42   7  42  41  41  33
-48 -41  -9  40  48  37  31  33
-45  -6  38  47  49  35  38  27
-11  40  44  51  37  34  34  31
 40  46  52  41  44  34  27  32

Then we calculate the DCT transform of the block. The transform coefficients (rounded to integers) are as follows:

  28 -158  -47   -5  -14    8  -17    5
-212  -61   56   32   -9   22   -9    7
 -27  104   44  -12  -23    7   -6    9
 -30   29  -50  -25   -1   11  -10    5
 -11   21  -23   19   25    0    0    1
  -3    2  -12    4   -4  -12   -2   -9
   2    0    1    6   -1   -5   19    3
   0    0   11    2    3   -2   11  -11

All AC coefficients x satisfy −1024 < x < 1024. (Proved in the homework assignments.)
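The stated range can be checked numerically on random level-shifted blocks (a sketch assuming NumPy; the DCT matrix is built directly from the formula D[k, j] = C_k cos(kπ(2j+1)/(2n))):

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # The n x n DCT matrix D[k, j] = C_k * cos(k*pi*(2j+1)/(2n)).
    j = np.arange(n)
    D = np.cos(np.outer(np.arange(n), np.pi * (2 * j + 1) / (2 * n)))
    D[0, :] *= np.sqrt(1.0 / n)
    D[1:, :] *= np.sqrt(2.0 / n)
    return D

D = dct_matrix(8)
rng = np.random.default_rng(1)
ok = True
for _ in range(1000):
    # A random 8-bit block, level-shifted to -128..127.
    block = rng.integers(0, 256, size=(8, 8)).astype(float) - 128
    coef = D @ block @ D.T
    dc, ac = coef[0, 0], coef.copy()
    ac[0, 0] = 0.0
    # AC coefficients stay inside (-1024, 1024); DC inside [-1024, 1016].
    ok = ok and np.all(np.abs(ac) < 1024) and -1024 <= dc <= 1016
print(bool(ok))  # True
```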

2. Next the coefficients are quantized using uniform quantizers. Different coefficients may use different step sizes. The step sizes are specified in a quantization table. A quantization table is an 8 × 8 array of quantization step sizes. JPEG does not fix the quantization table, but it allows the user to specify the table to be used. The table is then included as part of the compressed data. Up to four different quantization tables may be specified in the baseline JPEG. Different tables may be used for luminance and chrominance data, for example.

The idea in allowing different coefficients to be quantized differently is to take advantage of the high sensitivity of the human visual system to low frequencies. The high frequency coefficients may be quantized more with no visual effect.

Choosing the quantizers is the rate control mechanism of JPEG: larger step sizes mean lower bitrate and lower quality. In our example, let us use the following quantization table:


16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99

This table has been built by empirically measuring the psychovisual thresholding of different frequencies, and it is the default quantization table in many JPEG implementations. The quantized coefficients are shown in the following array. They are obtained by dividing the coefficient by the quantizer step size and rounding the result to the nearest integer:

  2 -14  -5   0  -1   0   0   0
-18  -5   4   2   0   0   0   0
 -2   8   3   0  -1   0   0   0
 -2   2  -2  -1   0   0   0   0
 -1   1  -1   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0

Notice the large number of zeroes in the high frequency part of the table. This is due to two factors: (a) higher frequencies were quantized more since the human visual system is not very sensitive to high frequencies, and (b) the block contains less high frequency information (energy) in the first place.
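Quantization and the decoder's dequantization are element-wise operations against the table. A sketch (assuming NumPy; the values below are the first row of our example coefficients and of the quantization table):

```python
import numpy as np

def quantize(coef, Q):
    # Divide each coefficient by its step size, round to nearest integer.
    return np.round(np.asarray(coef) / np.asarray(Q)).astype(int)

def dequantize(q, Q):
    # The decoder multiplies back by the step sizes.
    return np.asarray(q) * np.asarray(Q)

# First row of the example DCT coefficients and of the quantization table.
coef = np.array([28, -158, -47, -5, -14, 8, -17, 5])
step = np.array([16, 11, 10, 16, 24, 40, 51, 61])

q = quantize(coef, step)
print(q.tolist())                    # [2, -14, -5, 0, -1, 0, 0, 0]
print(dequantize(q, step).tolist())  # [32, -154, -50, 0, -24, 0, 0, 0]
```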

3. Next we take care of the DC coefficient. It is the quantized average of the block. The DC coefficient is encoded using DPCM where the prediction is just the previous quantized DC coefficient that was encoded. In our example, assume that the previous block that was encoded had quantized DC value 7. For the current block we therefore encode the prediction error e = 2 − 7 = −5.

Notice that the prediction error e is always within the interval −2040, . . . , 2040. The prediction error is encoded in two parts: first we use a Huffman code to specify the magnitude of the value. There are 12 magnitude categories indicating the number of significant bits in e (after removing leading zeros). In other words: number e belongs to the category number

C = 1 + ⌊log2 |e|⌋

if e ≠ 0, and category C = 0 if e = 0. Here are the categories, and sample Huffman codewords for the categories.


Category   DC coefficient range                 Huffman code
   0       0                                    00
   1       -1, 1                                010
   2       -3, -2, 2, 3                         011
   3       -7, ..., -4, 4, ..., 7               100
   4       -15, ..., -8, 8, ..., 15             101
   5       -31, ..., -16, 16, ..., 31           110
   6       -63, ..., -32, 32, ..., 63           1110
   7       -127, ..., -64, 64, ..., 127         11110
   8       -255, ..., -128, 128, ..., 255       111110
   9       -511, ..., -256, 256, ..., 511       1111110
  10       -1023, ..., -512, 512, ..., 1023     11111110
  11       -2047, ..., -1024, 1024, ..., 2047   111111110

The Huffman codes of the categories are given by the user. Two different Huffman tables may be specified in the baseline JPEG (for example, one for luminance and one for chrominance). 16 bits is the longest Huffman codeword allowed in JPEG.

After the category has been specified, we have to add more bits to identify the exact prediction error inside the category. This is done using fixed length coding. Category C contains 2^C values, so the exact value can be specified by adding C bits.

• If e is positive then the C-bit binary representation of e is used. Since 2^(C−1) ≤ e < 2^C, this binary representation starts with bit 1.

• If e is negative then the complement of the C-bit representation of −e is used. Again, 2^(C−1) ≤ −e < 2^C, so the complement word starts with bit 0.

Our example e = −5 belongs to category 3, so we transmit the Huffman code 100 for category 3, followed by the complement of the binary word for −e = 5 = 101b. So the code for our DC coefficient is

100 010
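The category computation and the C additional bits can be sketched as follows (the Huffman codewords are the sample codes from the table above; `dc_code` and `extra_bits` are illustrative names):

```python
def category(e: int) -> int:
    """C = 1 + floor(log2 |e|) for e != 0, and C = 0 for e = 0;
    this is the number of significant bits of |e|."""
    return abs(e).bit_length()

# Sample Huffman codewords for the DC categories (from the table above).
DC_HUFFMAN = {0: "00", 1: "010", 2: "011", 3: "100", 4: "101", 5: "110",
              6: "1110", 7: "11110", 8: "111110", 9: "1111110",
              10: "11111110", 11: "111111110"}

def extra_bits(e: int) -> str:
    """The C additional bits: the binary representation of e if e > 0,
    the complement of the representation of -e if e < 0, nothing if e = 0."""
    c = category(e)
    if e > 0:
        return format(e, "b")
    if e < 0:
        return "".join("1" if b == "0" else "0" for b in format(-e, f"0{c}b"))
    return ""

def dc_code(e: int) -> str:
    return DC_HUFFMAN[category(e)] + extra_bits(e)

print(dc_code(-5))  # 100010
```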

4. Next we encode the quantized AC coefficients. In order to get long runs of consecutive zeros, the coefficients are read in the following zigzag order:

[Figure: the zigzag scan order of the 8 × 8 coefficient block, starting from the DC coefficient in the upper left corner, traversing the antidiagonals, and ending in the lower right corner.]


In our example this produces the sequence

−14, −18, −2, −5, −5, 0, 4, 8, −2, −1, 2, 3, 2, −1, 0, 0, 0, −2, 1, 0, 0, 0, −1, −1, −1, 0, 0, . . .

where the end of the sequence contains only 0’s.
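The zigzag order itself can be generated programmatically. A sketch (coordinates are (row, column) pairs, with (0, 0) the DC position):

```python
def zigzag_order(n: int = 8):
    """The zigzag scan order of an n x n block as a list of (row, col)
    pairs, starting from the DC position (0, 0)."""
    order = []
    for s in range(2 * n - 1):      # s = row + col, one antidiagonal at a time
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()          # even antidiagonals run bottom-left to top-right
        order.extend(diag)
    return order

print(zigzag_order()[:6])  # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```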

5. The sequence is processed from left to right. For each non-zero coefficient three numbersare needed:

• The length of the zero run preceding the coefficient, i.e. the number of zero coefficients between two non-zero coefficients,

• The category of the coefficient (there are now only 10 different categories), and

• the exact value of the coefficient in the category.

There are only 10 categories since category 0 does not exist (we only code non-zero coefficients), and all AC coefficients are inside the range −1023, . . . , 1023. Here are the 10 categories:

Category   AC coefficient range
   1       -1, 1
   2       -3, -2, 2, 3
   3       -7, ..., -4, 4, ..., 7
   4       -15, ..., -8, 8, ..., 15
   5       -31, ..., -16, 16, ..., 31
   6       -63, ..., -32, 32, ..., 63
   7       -127, ..., -64, 64, ..., 127
   8       -255, ..., -128, 128, ..., 255
   9       -511, ..., -256, 256, ..., 511
  10       -1023, ..., -512, 512, ..., 1023

The runlengths and the category are combined in one Huffman table. That is, each Huffman codeword specifies both the number of zero coefficients and the category of the next non-zero coefficient.

The runlengths are restricted to 0, . . . , 15. If there are more than 15 consecutive zeroes, the run is encoded in pieces: there is a special codeword in the Huffman table that says that we have a run of 15 zeros, and also the next symbol is zero. This takes care of 16 zeros, and the 16’th zero is set as the starting point for the next run.

There is another special end-of-block symbol EOB in the Huffman table, which indicates that all the remaining coefficients are zero. So the total number of entries in the Huffman table is

10× 16 + 2 = 162.

As in the case of DC coefficients, the AC Huffman table is also specified by the user. Two different tables may be specified.


The Huffman code for the runlength and category is followed by the C additional bits needed to specify the exact coefficient inside the category number C.

In our example, the AC coefficients are encoded as

H〈0,4〉0001 H〈0,5〉01101 H〈0,2〉01 H〈0,3〉010 H〈0,3〉010 H〈1,3〉100 H〈0,4〉1000 H〈0,2〉01 H〈0,1〉0 H〈0,2〉10 H〈0,2〉11 H〈0,2〉10 H〈0,1〉0 H〈3,2〉01 H〈0,1〉1 H〈3,1〉0 H〈0,1〉0 H〈0,1〉0 H〈EOB〉

where H〈x, y〉 denotes the Huffman codeword for runlength x and category y. Using the sample Huffman codes specified in Annex K of the JPEG standard gives the AC bitstream

1011 0001 11010 01101 01 01 . . .

the total length of which is 105 bits. Adding the 6 bits for the DC coefficient gives a total bit count of 111 bits, or 1.73 bpp for our sample block.
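The run/category symbols of step 5 can be reproduced with a short sketch (`ac_symbols` is an illustrative name; the special run-of-15-zeros codeword is omitted since this example does not need it):

```python
def category(v: int) -> int:
    # Number of significant bits of |v|, as for the DC prediction error.
    return abs(v).bit_length()

def ac_symbols(coeffs):
    """Turn the 63 zigzag-ordered AC coefficients into (runlength,
    category, value) triples; trailing zeros become a single EOB.
    (The special run-of-15-zeros codeword is not handled here.)"""
    symbols, run = [], 0
    for v in coeffs:
        if v == 0:
            run += 1
        else:
            symbols.append((run, category(v), v))
            run = 0
    if run:
        symbols.append("EOB")    # all remaining coefficients are zero
    return symbols

seq = [-14, -18, -2, -5, -5, 0, 4, 8, -2, -1, 2, 3, 2, -1,
       0, 0, 0, -2, 1, 0, 0, 0, -1, -1, -1] + [0] * 38
print(ac_symbols(seq)[:6])
# [(0, 4, -14), (0, 5, -18), (0, 2, -2), (0, 3, -5), (0, 3, -5), (1, 3, 4)]
```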

[Figure: block diagram of the baseline JPEG encoder. An 8x8 block is transformed with DCT and quantized using the quantization table. The 1 DC coefficient goes through DPCM and is Huffman coded using the DC category Huffman table; the 63 AC coefficients are zigzag scanned and Huffman coded using the AC runlength/category Huffman table. The outputs form the bitstream.]

Let us move on to the JPEG decoder. From the bitstream it receives, it can Huffman decode the runlengths and the quantized transform coefficients. The decoder multiplies the


coefficients with the corresponding quantizer step sizes. In our example, the decoded values are

  32 -154  -50    0  -24    0    0    0
-216  -60   56   38    0    0    0    0
 -28  104   48    0  -40    0    0    0
 -28   34  -44  -29    0    0    0    0
 -18   22  -37    0    0    0    0    0
   0    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0

Then the decoder applies the inverse DCT to compute the reconstructed block, and adds 128 to every pixel value. Here is the decoded block:

 70  87  86  71  77 100 100  78
 84  82  72  69  93 130 145 138
 84  74  71  91 127 160 175 177
 68  72  97 135 163 169 166 166
 66  90 133 171 180 165 154 154
 94 125 161 178 174 164 161 163
132 160 178 172 164 168 168 161
154 180 188 169 162 172 165 144


Here are a couple of notes about the baseline JPEG:

• JPEG does not specify default quantization tables and Huffman tables. They have to be provided by the user or the application using JPEG.

• One widely used JPEG implementation is IJG (Independent JPEG Group). Our sample tables are the default tables used by IJG.

• The quality setting in IJG is a parameter 1..100 that determines a factor used in multiplying the elements of the quantization table we have used in our example above. Quality setting 50 gives factor one, i.e. then IJG uses our table. Quality setting 1 multiplies our table by 50, and quality setting 100 by 0. (Of course the step sizes must be at least 1, so setting 100 makes all of them 1, not 0.)


• There are many important image parameters that are not specified inside JPEG. These include the color representation used, and the horizontal and vertical pixel densities and aspect ratio. Therefore a common file format for exchanging JPEG files has to be agreed upon externally. The most common file format using JPEG is JFIF (JPEG File Interchange Format). Also TIFF (Tag Image File Format) has JPEG available.

• The quality of an image may slightly degrade when it is compressed and decompressed multiple times. The loss is smallest if the same quantization table is used on all iterations. That is, if the quality setting was 50 the first time the file was compressed, one should use 50, not 100, also when recompressing. Of course if the image was changed in the meantime (e.g. cropped) the situation may be totally different: the division into 8x8 blocks may have shifted.

• JPEG contains a compliance test procedure. The most important part of the compliance test guarantees that every JPEG implementation uses a sufficiently accurate DCT.
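The IJG quality scaling mentioned in the notes above can be sketched as follows (this reflects the behaviour as described, not anything in the JPEG standard itself; the function names are illustrative):

```python
def ijg_factor(quality: int) -> float:
    """Multiplier for the quantization table at a given IJG quality
    setting in 1..100: 50 -> 1.0, 1 -> 50.0, 100 -> 0.0."""
    if quality < 50:
        return 50.0 / quality
    return (200 - 2 * quality) / 100.0

def scale_table(steps, quality: int):
    f = ijg_factor(quality)
    # Step sizes must stay at least 1, so quality 100 gives all ones.
    return [max(1, round(s * f)) for s in steps]

print(ijg_factor(50), ijg_factor(1), ijg_factor(100))  # 1.0 50.0 0.0
print(scale_table([16, 11, 10, 16], 100))  # [1, 1, 1, 1]
```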

Some of the extended options of JPEG are as follows:

• The baseline JPEG allows 2 DC and 2 AC Huffman tables. There is an extended mode that allows 4 DC and 4 AC Huffman tables.

• The baseline JPEG allows only images with intensity bit depths up to 8. There is an extended mode where bit depth up to 12 is allowed. In this mode a greater accuracy of DCT is required.

• Arithmetic coding. The standard includes a mode where the Huffman coding of coefficients and runlengths is replaced by arithmetic coding. The arithmetic coder used is the QM-coder.

• The baseline mode is sequential: the blocks are processed one at a time, sequentially from the top left corner to the bottom right corner. In progressive mode the quantized DCT coefficients are transmitted in a different order. First, all blocks are transformed using DCT, and the coefficients are stored in a buffer. There are two choices for the progressive transmission of the coefficients: spectral selection, and successive approximation.

If spectral selection is used, the DCT coefficients are divided into bands according to their position in the zigzag scan. A band consists of all coefficients (of all blocks) whose zigzag indices are inside a specified interval. The bands are encoded sequentially, one after the other. We may for example encode first coefficient 0 (= DC coefficient) of all blocks, then coefficients 1 and 2 of all blocks, followed by coefficients 3, 4 and 5 of all blocks, etc. In this way a low frequency approximation of the image is obtained rapidly, after a small number of bits, and bits that come later add high frequency details to the image.

If successive approximation is used, low precision approximations of the DCT coefficients are transmitted first, followed by less significant bits of the coefficients. On the first


stage a certain number of most significant bits of the DCT coefficients in all blocks are encoded. On each successive stage we add one more bit layer to the coefficients.

Consider the following illustration of all bits of all DCT coefficients in all blocks:

[Figure: the bits of all DCT coefficients in all blocks, arranged as a three-dimensional array with axes for the blocks, the DCT coefficients 0, . . . , 63, and the bits from msb to lsb.]

In the sequential mode the coefficients are coded block-by-block:

[Figure: in sequential mode, all bits of all 64 coefficients of the 1st block are sent first, then those of the 2nd block, and so on.]

In spectral selection they are transmitted band-by-band:

[Figure: in spectral selection, the 1st band (coefficient 0 of all blocks) is sent first, then the 2nd band, and so on.]

In successive approximation more and more bits of all coefficients are coded:

[Figure: in successive approximation, the most significant bits of all coefficients of all blocks are sent on the 1st stage, the next bit layer on the 2nd stage, and so on.]

Spectral selection and successive approximation may also be combined.

• JPEG also has an independent lossless mode that is not based on DCT, but on predictive coding.

• The hierarchical coding of JPEG can be used in both lossy and lossless modes. In the hierarchical mode the image consists of several frames. Typically the frames represent the image at different resolutions. The frames are processed one after the other.

The first frame (typically the low resolution frame) is encoded using normal DCT(lossy) or DPCM (lossless).

Subsequent frames may refer to any preceding frames. In that case only the frame differences are encoded. Such frames are called differential frames. The reference frame may have to be upsampled before forming the difference.

If the first frame is compressed losslessly, then all frames are compressed losslessly. If the first frame is compressed using DCT, then all frames except the last one have to use DCT. The last frame may use either DCT or DPCM.
