Adaptive Signal and Image Processing

Adaptive Signal and Image Processing Gabriel Peyré www.numerical-tours.com

description

Slides from a course given at the summer school "Ecole Analyse multirésolution pour l'image", June 20-22, 2012, Auxerre, France.

Transcript of Adaptive Signal and Image Processing

Page 1: Adaptive Signal and Image Processing

Adaptive Signal and Image Processing

Gabriel Peyré, www.numerical-tours.com

Page 2: Adaptive Signal and Image Processing

Natural Image Priors

Fourier decomposition vs. wavelet decomposition.


Page 4: Adaptive Signal and Image Processing

Overview

•Sparsity for Compression and Denoising

•Geometric Representations

•L0 vs. L1 Sparsity

•Dictionary Learning

Page 5: Adaptive Signal and Image Processing

Sparse Approximation in a Basis


Pages 8-12: Adaptive Signal and Image Processing

Image and Texture Models

Uniformly smooth C^α image (∫|∇f|² < +∞):
Fourier, wavelets: ||f − f_M||² = O(M^−α).

Discontinuous image with bounded variation (∫|∇f| < +∞):
Wavelets: ||f − f_M||² = O(M^−1).

C²-geometrically regular image:
Curvelets: ||f − f_M||² = O(log³(M) · M^−2).

More complex images: needs adaptivity.

Pages 13-19: Adaptive Signal and Image Processing

Compression by Transform-coding

Forward transform: a[m] = ⟨f, ψ_m⟩ ∈ ℝ.

Quantization: q[m] = sign(a[m]) ⌊|a[m]| / T⌋ ∈ ℤ (bin of width T).

Entropic coding: use statistical redundancy (many 0's).

Decoding, then dequantization: ã[m] = sign(q[m]) (|q[m]| + 1/2) T.

Backward transform: f_R = Σ_{m ∈ I_T} ã[m] ψ_m.

Shown: image f, zoom on f, and the decoded f_R at R = 0.2 bit/pixel.

"Theorem:" ||f − f_M||² = O(M^−α) ⟹ ||f − f_R||² = O(log^α(R) · R^−α).
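The quantization/dequantization pair above is easy to sketch in Python (a minimal illustration; the coefficient values and bin width T are invented for the example):

```python
import numpy as np

def quantize(a, T):
    # q[m] = sign(a[m]) * floor(|a[m]| / T): integer index of a bin of width T
    return np.sign(a) * np.floor(np.abs(a) / T)

def dequantize(q, T):
    # a~[m] = sign(q[m]) * (|q[m]| + 1/2) * T: center of the bin q[m]
    return np.sign(q) * (np.abs(q) + 0.5) * T

a = np.array([0.3, -1.7, 2.4, 0.0])
T = 1.0
q = quantize(a, T)          # coefficients in (-T, T) fall in the zero bin
a_hat = dequantize(q, T)    # surviving coefficients are off by at most T/2
```

Note how many small coefficients map to q[m] = 0: this is exactly the redundancy that the entropic coding step then exploits.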

Page 20: Adaptive Signal and Image Processing

Enter Wavelets…

• Standard 2-D tensor-product wavelet transform + embedded coder.
→ chunks of large coefficients.

JPEG-2000 vs. JPEG

Image f; JPEG at R = 0.19 bit/pixel (256×256 pixels, 12,500 total bits); EZW/JPEG2000 at R = 0.15 bit/pixel (256×256 pixels, 9,800 total bits).

JPEG2000 exploits the statistical redundancy of coefficients:
→ neighboring coefficients are not independent.

Pages 21-23: Adaptive Signal and Image Processing

Denoising (Donoho/Johnstone)

Decomposition: f = Σ_{m=0}^{N−1} ⟨f, ψ_m⟩ ψ_m.

Thresholding: f̃ = Σ_{|⟨f, ψ_m⟩| > T} ⟨f, ψ_m⟩ ψ_m.

Theorem: for T = σ √(2 log N), if ||f_0 − f_{0,M}||² = O(M^−α),
then E(||f̃ − f_0||²) = O(σ^{2α/(α+1)}).

In practice: T ≈ 3σ.
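The thresholding estimator can be sketched end-to-end with a hand-written orthonormal Haar transform (an illustrative stand-in for the slides' wavelet transform; the test signal, noise level, and T = 3σ choice are invented for the demo):

```python
import numpy as np

def haar(x):
    # orthonormal Haar transform: coarse approximation + detail levels
    c, details = x.copy(), []
    while len(c) > 1:
        details.append((c[0::2] - c[1::2]) / np.sqrt(2))
        c = (c[0::2] + c[1::2]) / np.sqrt(2)
    return c, details[::-1]            # details listed coarsest first

def ihaar(c, details):
    # exact inverse, since the transform is orthonormal
    for d in details:
        x = np.empty(2 * len(c))
        x[0::2] = (c + d) / np.sqrt(2)
        x[1::2] = (c - d) / np.sqrt(2)
        c = x
    return c

rng = np.random.default_rng(0)
N, sigma = 256, 0.1
f0 = np.repeat([0.0, 1.0, 0.3, 0.8], N // 4)   # piecewise-constant signal
f = f0 + sigma * rng.standard_normal(N)        # noisy observation

T = 3 * sigma                                   # practical choice T ~ 3 sigma
approx, details = haar(f)
details_t = [np.where(np.abs(d) > T, d, 0.0) for d in details]
f_denoised = ihaar(approx, details_t)

err_noisy = np.sum((f - f0) ** 2)
err_denoised = np.sum((f_denoised - f0) ** 2)
```

Since the transform is orthonormal, white noise stays white with the same σ in the coefficient domain, so thresholding at a few σ kills almost all noise-only coefficients while keeping the few large ones carried by the signal's discontinuities.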

Page 24: Adaptive Signal and Image Processing

Overview

•Sparsity for Compression and Denoising

•Geometric Representations

•L0 vs. L1 Sparsity

•Dictionary Learning

Page 25: Adaptive Signal and Image Processing

Piecewise Regular Functions in 1D

Theorem: If f is C^α outside a finite set of discontinuities:

||f − f_M||² = O(M^−1) (Fourier), O(M^−2α) (wavelets).

For Fourier: linear ≈ non-linear, sub-optimal.
For wavelets: linear ≪ non-linear, optimal.

Pages 26-27: Adaptive Signal and Image Processing

Piecewise Regular Functions in 2D

Theorem: If f is C^α outside a set of finite-length edge curves:

||f − f_M||² = O(M^−1/2) (Fourier), O(M^−1) (wavelets).

Fourier ≪ wavelets.
Wavelets: same result for BV functions (optimal).
Regular C^α edges: sub-optimal (requires anisotropy).

Page 28: Adaptive Signal and Image Processing

Geometrically Regular Images

Geometric image model: f is C^α outside a set of C^α edge curves.

BV image (∫|∇f| < +∞): level sets have finite length.
Geometric image: level sets are regular.

Geometry = cartoon image: sharp edges vs. smoothed edges.

Pages 29-30: Adaptive Signal and Image Processing

Curvelets for Cartoon Images

Curvelets: [Candès, Donoho] [Candès, Demanet, Ying, Donoho]

Redundant tight frame (redundancy ≈ 5): not efficient for compression.
Denoising by curvelet thresholding: recovers edges and geometric textures.

Pages 31-34: Adaptive Signal and Image Processing

Approximation with Triangulation

Triangulation (V, F): vertices V = {v_i}_{i=1}^M, faces F ⊂ {1, …, M}³.

Piecewise linear approximation: f_M = Σ_{m=1}^M λ_m φ_m,
where φ_m(v_i) = δ_{m,i}, φ_m is affine on each face of F,
and λ = argmin_μ ||f − Σ_m μ_m φ_m||.

Regular areas: ≈ M/2 equilateral triangles of width M^−1/2.
Singular areas: ≈ M/2 anisotropic triangles.

Theorem: there exists (V, F) such that ||f − f_M||² ≤ C_f M^−2.

Optimal (V, F): NP-hard.
Provably good greedy schemes: [Mirebeau, Cohen, 2009].
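The least-squares fit λ = argmin_μ ||f − Σ_m μ_m φ_m|| is itself straightforward to demonstrate; here is a 1D analogue with hat functions on hand-picked vertices (the hard part, selecting the triangulation, is not implemented; the signal and vertex count are invented):

```python
import numpy as np

# 1D analogue of the least-squares fit over a fixed "triangulation":
# hat functions phi_m with phi_m(v_i) = delta_{mi}, affine between vertices
t = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * t)                     # signal to approximate

v = np.linspace(0.0, 1.0, 9)                  # M = 9 hand-picked vertices
Phi = np.stack(
    [np.interp(t, v, np.eye(len(v))[m]) for m in range(len(v))], axis=1
)

# lambda = argmin_mu ||f - Phi mu||  (ordinary least squares)
lam, *_ = np.linalg.lstsq(Phi, f, rcond=None)
fM = Phi @ lam
rel_err = np.linalg.norm(f - fM) / np.linalg.norm(f)
```

In 2D the columns of Phi would be the hat functions of the mesh vertices; the greedy schemes cited above choose where to put those vertices.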

Page 35: Adaptive Signal and Image Processing

Greedy Triangulation Optimization

Anisotropic triangulation vs. JPEG2000, for M = 200 and M = 600 vertices.

[Bougleux, Peyré, Cohen, ICCV'09]

Pages 36-39: Adaptive Signal and Image Processing

Geometric Multiscale Processing

Image f ∈ ℝ^N. Wavelet transform: f_j[n] = ⟨f, ψ_{j,n}⟩
→ wavelet coefficients f_j at each scale 2^j.

Geometric transform of the wavelet coefficients: c_b[j, ℓ, n] = ⟨f_j, b_{ℓ,n}⟩
→ geometric coefficients.

C^α cartoon image: using M = M_band + M_geom coefficients (band plus geometry),
||f − f_M||² = O(M^−α).

Page 40: Adaptive Signal and Image Processing

Overview

•Sparsity for Compression and Denoising

•Geometric Representations

•L0 vs. L1 Sparsity

•Dictionary Learning

Pages 41-44: Adaptive Signal and Image Processing

Image Representation

Dictionary D = {d_m}_{m=0}^{Q−1} of atoms d_m ∈ ℝ^N.

Image decomposition: f = Σ_{m=0}^{Q−1} x_m d_m = Dx.
Image approximation: f ≈ Dx.

Orthogonal dictionary: N = Q, x_m = ⟨f, d_m⟩.
Redundant dictionary: N < Q → x is not unique.
Examples: translation-invariant wavelets, curvelets, …

Pages 45-47: Adaptive Signal and Image Processing

Sparsity

Decomposition: f = Σ_{m=0}^{Q−1} x_m d_m = Dx. Example: wavelet transform.

Sparsity: most x_m are small.

Ideal sparsity: most x_m are zero; J_0(x) = |{m : x_m ≠ 0}|.

Approximate sparsity (compressibility): ||f − Dx|| is small with J_0(x) ≤ M.
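A tiny numerical sketch of J_0 and compressibility, using the trivial orthonormal dictionary D = Id (the coefficient values are invented):

```python
import numpy as np

def J0(x):
    # l0 pseudo-norm: J0(x) = |{m : x_m != 0}|
    return int(np.count_nonzero(x))

def best_M_term(f, D, M):
    # compressibility check for an orthonormal dictionary D (assumption):
    # keep the M largest-magnitude coefficients, zero out the rest
    x = D.T @ f
    keep = np.argsort(np.abs(x))[::-1][:M]
    xM = np.zeros_like(x)
    xM[keep] = x[keep]
    return np.linalg.norm(f - D @ xM), J0(xM)

f = np.array([4.0, 0.1, -2.0, 0.05])       # two big, two small coefficients
err, nnz = best_M_term(f, np.eye(4), M=2)  # trivial dictionary D = Id
```

The residual is exactly the energy of the discarded small coefficients, which is what "compressible" means here.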

Pages 48-51: Adaptive Signal and Image Processing

Sparse Coding

Redundant dictionary D = {d_m}_{m=0}^{Q−1}, Q ≥ N
→ non-unique representation f = Dx.

Sparsest decomposition: min_{f = Dx} J_0(x).

Sparsest approximation: min_x (1/2)||f − Dx||² + λ J_0(x),
equivalently min_{J_0(x) ≤ M} ||f − Dx|| or min_{||f − Dx|| ≤ ε} J_0(x)
(equivalence λ ↔ M ↔ ε).

Ortho-basis D: x_m = ⟨f, d_m⟩ if |⟨f, d_m⟩| > √(2λ), and 0 otherwise
→ pick the M largest coefficients in {⟨f, d_m⟩}_m.

General redundant dictionary: NP-hard.
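For an orthonormal D, that closed-form solution is plain hard thresholding at √(2λ); a minimal sketch (D = Id and the signal values are invented):

```python
import numpy as np

def sparse_code_ortho(f, D, lam):
    # For orthonormal D, min_x 1/2 ||f - D x||^2 + lam * J0(x) decouples
    # per coefficient: keeping x_m costs lam, dropping it costs x_m^2 / 2,
    # so the exact solution is hard thresholding at T = sqrt(2 * lam).
    x = D.T @ f
    T = np.sqrt(2 * lam)
    return np.where(np.abs(x) > T, x, 0.0)

f = np.array([3.0, 0.2, -1.0, 0.05])
x = sparse_code_ortho(f, np.eye(4), lam=0.5)   # threshold T = 1
```

This per-coefficient decoupling is exactly what is lost in a general redundant dictionary, where atoms interact and the problem becomes NP-hard.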

Pages 52-54: Adaptive Signal and Image Processing

Convex Relaxation: L1 Prior

Image with 2 pixels (coefficients on atoms d_0, d_1):
J_0(x) = |{m : x_m ≠ 0}|.
J_0(x) = 0 ⟺ null image; J_0(x) = 1 ⟺ sparse image; J_0(x) = 2 ⟺ non-sparse image.

ℓ^q priors: J_q(x) = Σ_m |x_m|^q (convex for q ≥ 1);
unit balls shown for q = 0, 1/2, 1, 3/2, 2.

Sparse ℓ¹ prior: J_1(x) = ||x||₁ = Σ_m |x_m|.

Pages 55-56: Adaptive Signal and Image Processing

Inverse Problems

Denoising/approximation: Φ = Id.
Examples: inpainting, super-resolution, compressed sensing.

Pages 57-59: Adaptive Signal and Image Processing

Regularized Inversion

Denoising/compression: y = f_0 + w ∈ ℝ^N.
Sparse approximation: f* = Dx*, where
x* ∈ argmin_x (1/2)||y − Dx||² + λ||x||₁ (data fidelity + sparsity prior).

Inverse problems: y = Φf_0 + w ∈ ℝ^P. Replace D by ΦD:
x* ∈ argmin_x (1/2)||y − ΦDx||² + λ||x||₁.

Numerical solvers: proximal splitting schemes → www.numerical-tours.com
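One concrete proximal-splitting scheme for this problem is iterative soft thresholding (ISTA); a sketch on an invented toy inverse problem, with the matrix A playing the role of ΦD (all sizes and parameters are made up for the demo):

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(y, A, lam, n_iter=300):
    # iterative soft thresholding for min_x 1/2 ||y - A x||^2 + lam ||x||_1:
    # alternate a gradient step on the fidelity with the l1 prox
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 40)) / np.sqrt(20)   # stands in for Phi D
x0 = np.zeros(40)
x0[3], x0[17] = 3.0, -2.0                          # sparse ground truth
y = A @ x0 + 0.01 * rng.standard_normal(20)
x_hat = ista(y, A, lam=0.02)
```

With only P = 20 measurements of a 40-dimensional but 2-sparse vector, the ℓ¹ prior is what makes the recovery well-posed.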

Page 60: Adaptive Signal and Image Processing

Inpainting Results

Page 61: Adaptive Signal and Image Processing

Overview

•Sparsity for Compression and Denoising

•Geometric Representations

•L0 vs. L1 Sparsity

•Dictionary Learning

Pages 62-66: Adaptive Signal and Image Processing

Dictionary Learning: MAP Energy

Set of (noisy) exemplars {y_k}_k.

Sparse approximation: min_{x_k} (1/2)||y_k − Dx_k||² + λ||x_k||₁.

Dictionary learning: min_{D ∈ C} Σ_k min_{x_k} (1/2)||y_k − Dx_k||² + λ||x_k||₁.

Constraint: C = {D = (d_m)_m : ∀m, ||d_m|| ≤ 1}
(otherwise D → +∞, X → 0).

Matrix formulation: min f(X, D) = (1/2)||Y − DX||² + λ||X||₁,
with X ∈ ℝ^{Q×K} and D ∈ C ⊂ ℝ^{N×Q}.

→ Convex with respect to X.
→ Convex with respect to D.
→ Non-convex with respect to (X, D): min_X f(X, D) has local minima in D.

Pages 67-70: Adaptive Signal and Image Processing

Dictionary Learning: Algorithm

Start from an initial dictionary D.

Step 1: ∀k, minimization on x_k:
min_{x_k} (1/2)||y_k − Dx_k||² + λ||x_k||₁
→ convex sparse coding.

Step 2: minimization on D:
min_{D ∈ C} ||Y − DX||²
→ convex constrained minimization, by projected gradient descent:
D^(ℓ+1) = Proj_C( D^(ℓ) − τ (D^(ℓ)X − Y) X^T ).

Convergence: toward a stationary point of f(X, D).
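The two-step scheme sketches directly into code; below is a plain, unoptimized alternating minimization on synthetic exemplars (ISTA for step 1, one projected-gradient step for step 2; all sizes, seeds, and parameters are invented for the demo):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def learn_dictionary(Y, Q, lam=0.05, n_outer=30, n_inner=50, seed=0):
    # alternating minimization of f(X, D) = 1/2 ||Y - D X||^2 + lam ||X||_1
    # subject to ||d_m|| <= 1 for every atom (plain sketch, not optimized)
    rng = np.random.default_rng(seed)
    N, K = Y.shape
    D = rng.standard_normal((N, Q))
    D /= np.linalg.norm(D, axis=0)                 # start inside C
    X = np.zeros((Q, K))
    for _ in range(n_outer):
        # Step 1: sparse coding (ISTA on all columns of Y at once)
        L = np.linalg.norm(D, 2) ** 2
        for _ in range(n_inner):
            X = soft_threshold(X + D.T @ (Y - D @ X) / L, lam / L)
        # Step 2: one projected-gradient step on D
        tau = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-10)
        D = D - tau * (D @ X - Y) @ X.T
        D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)   # Proj_C
    return D, X

rng = np.random.default_rng(2)
D_true = rng.standard_normal((8, 12))
D_true /= np.linalg.norm(D_true, axis=0)
X_true = rng.standard_normal((12, 100)) * (rng.random((12, 100)) < 0.15)
Y = D_true @ X_true                                 # synthetic exemplars y_k
D, X = learn_dictionary(Y, Q=12)
residual = np.linalg.norm(Y - D @ X) / np.linalg.norm(Y)
```

Since the joint problem is non-convex, different seeds give different local minima; only convergence to a stationary point is guaranteed, as the slides note.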

Pages 71-72: Adaptive Signal and Image Processing

Patch-based Learning

Learning the dictionary D from exemplar patches y_k [Olshausen, Field 1997].

→ State-of-the-art denoising [Elad et al. 2006].
→ Sparse texture synthesis, inpainting [Peyré 2008].

Pages 73-74: Adaptive Signal and Image Processing

Comparison with PCA

PCA dimensionality reduction: D^(k) = (d_m)_{m=0}^{k−1}, ∀k, min_D ||Y − D^(k) X||.

Linear (PCA): Fourier-like atoms.
Sparse (learning): Gabor-like atoms.

[Embedded figure from Rubinstein et al., "Dictionaries for Sparse Representation": a few 12×12 DCT atoms (left) and the first 40 KLT atoms trained on 12×12 image patches from Lena (right); panels labeled DCT and PCA.]

In the second half of the 1980’s, though, the signal process-ing community was particularly excited about the developmentof a new very powerful tool, known as wavelet analysis [5],[25], [26]. In a pioneering work from 1984, Grossman andMorlet [27] proposed a signal expansion over a series oftranslated and dilated versions of a single elementary function,taking the form

W =n

⇤n,m(x) = �n/2f(�nx� ⇥m)o

n,m�Z.

This simple idea captivated the signal processing and harmonicanalysis communities, and in a series of influential works byMeyer, Daubechies, Mallat and others [13], [14], [28]–[33],an extensive wavelet theory was formalized. The theory wasformulated for both the continuous and discrete domains, anda complete mathematical framework relating the two was putforth. A significant breakthrough came from Meyer’s work in1985 [28], who found that unlike the Gabor transform (and

RUBINSTEIN et al.: DICTIONARIES FOR SPARSE REPRESENTATION 3

Fig. 1. Left: A few 12×12 DCT atoms. Right: The first 40 KLT atoms, trained using 12×12 image patches from Lena.

B. Non-Linear Revolution and Elements of Modern DictionaryDesign

In statistics research, the 1980’s saw the rise of a new powerful approach known as robust statistics. Robust statistics advocates sparsity as a key for a wide range of recovery and analysis tasks. The idea has its roots in classical Physics, and more recently in Information Theory, and promotes simplicity and conciseness in the description of guiding phenomena. Motivated by these ideas, the 1980’s and 1990’s were characterized by a search for sparser representations and more efficient transforms.

Increasing sparsity required departure from the linear model, towards a more flexible non-linear formulation. In the non-linear case, each signal is allowed to use a different set of atoms from the dictionary in order to achieve the best approximation. Thus, the approximation process becomes

x ≈ Σ_{n ∈ I_K(x)} c_n φ_n ,    (5)

where I_K(x) is an index set adapted to each signal individually (we refer the reader to [5], [7] for more information on this wide topic).
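The adaptive selection in (5) is easy to sketch numerically: in an orthonormal basis, the best K-term approximation keeps the K largest-magnitude coefficients. A minimal numpy illustration (the function and variable names are ours, not from the paper):

```python
import numpy as np

def best_k_term(x, Phi, K):
    """Best K-term approximation of x in an orthonormal basis Phi.
    The columns of Phi are the atoms phi_n; the index set I_K(x) is
    chosen per signal as the K largest-magnitude coefficients."""
    c = Phi.T @ x                     # analysis: c_n = <x, phi_n>
    idx = np.argsort(np.abs(c))[-K:]  # adaptive index set I_K(x)
    return Phi[:, idx] @ c[idx], idx  # synthesis over the selected atoms

# Toy example in the identity basis: the best 2-term approximation
# keeps the two largest entries.
Phi = np.eye(4)
x = np.array([3.0, -0.1, 2.0, 0.05])
xK, idx = best_k_term(x, Phi, 2)
```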

The non-linear view paved the way to the design of newer, more efficient transforms. In the process, many of the fundamental concepts guiding modern dictionary design were formed. Following the historic time line, we trace the emergence of the most important modern dictionary design concepts, which were mostly formed during the last two decades of the 20th century.

Localization: To achieve sparsity, transforms required better localization. Atoms with concentrated supports allow more flexible representations based on the local signal characteristics, and limit the effects of irregularities, which are observed to be the main source of large coefficients. In this spirit, one of the first structures to be used was the Short Time Fourier Transform (STFT) [8], which emerges as a natural extension to the Fourier transform. In the STFT, the Fourier transform is applied locally to (possibly overlapping) portions of the signal, revealing a time-frequency (or space-frequency) description of the signal. An example of the STFT is the JPEG image compression algorithm [9], which is based on this concept.

During the 1980’s and 1990’s, the STFT was extensively researched and generalized, becoming better known as the Gabor transform, named in homage to Dennis Gabor, who first suggested the time-frequency decomposition back in 1946 [10]. Gabor’s work was independently rediscovered in 1980 by Bastiaans [11] and Janssen [12], who studied the fundamental properties of the expansion.

A basic 1-D Gabor dictionary consists of windowed waveforms

G = { φ_{n,m}(x) = w(x − αm) e^{i2πβnx} }_{n,m∈Z} ,

where w(·) is a low-pass window function localized at 0 (typically a Gaussian), and α and β control the time and frequency resolution of the transform. Much of the mathematical foundations of this transform were laid out during the late 1980’s by Daubechies, Grossman and Meyer [13], [14], who studied the transform from the angle of frame theory, and by Feichtinger and Gröchenig [15]–[17], who employed a generalized group-theoretic point of view. Study of the discrete version of the transform and its numerical implementation followed in the early 1990’s, with notable contributions by Wexler and Raz [18] and by Qian and Chen [19].
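A minimal numerical sketch of one such atom, with a Gaussian window shifted in time by αm and modulated at frequency βn (the parameter names mirror the formula above; the implementation details are ours):

```python
import numpy as np

def gabor_atom(x, n, m, alpha, beta, sigma=1.0):
    """One atom of a 1-D Gabor dictionary: a Gaussian window w centered
    at alpha*m, modulated at frequency beta*n (sigma is the window
    width; defaults are illustrative)."""
    w = np.exp(-0.5 * ((x - alpha * m) / sigma) ** 2)  # low-pass window
    return w * np.exp(2j * np.pi * beta * n * x)       # complex modulation

x = np.linspace(-4.0, 4.0, 257)
atom = gabor_atom(x, n=3, m=0, alpha=1.0, beta=0.5)
```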

In higher dimensions, more complex Gabor structures were developed which add directionality, by varying the orientation of the sinusoidal waves. This structure gained substantial support from the work of Daugman [20], [21], who discovered oriented Gabor-like patterns in simple-cell receptive fields in the visual cortex. These results motivated the deployment of the transform to image processing tasks, led by works such as Daugman [22] and Porat and Zeevi [23]. Today, practical uses of the Gabor transform are mainly in analysis and detection tasks, as a collection of directional filters. Figure 2 shows some examples of 2-D Gabor atoms of various orientations and sizes.

Multi-Resolution: One of the most significant conceptual advancements achieved in the 1980’s was the rise of multi-scale analysis. It was realized that natural signals, and images specifically, exhibited meaningful structures over many scales, and could be analyzed and described particularly efficiently by multi-scale constructions. One of the simplest and best known such structures is the Laplacian pyramid, introduced in 1984 by Burt and Adelson [24]. The Laplacian pyramid represents an image as a series of difference images, where each one corresponds to a different scale and roughly a different frequency band.

In the second half of the 1980’s, though, the signal processing community was particularly excited about the development of a new very powerful tool, known as wavelet analysis [5], [25], [26]. In a pioneering work from 1984, Grossman and Morlet [27] proposed a signal expansion over a series of translated and dilated versions of a single elementary function, taking the form

W = { φ_{n,m}(x) = a^{n/2} f(a^n x − bm) }_{n,m∈Z} .

This simple idea captivated the signal processing and harmonic analysis communities, and in a series of influential works by Meyer, Daubechies, Mallat and others [13], [14], [28]–[33], an extensive wavelet theory was formalized. The theory was formulated for both the continuous and discrete domains, and a complete mathematical framework relating the two was put forth. A significant breakthrough came from Meyer’s work in 1985 [28], who found that unlike the Gabor transform (and contrary to common belief) the wavelet transform could be designed to be orthogonal while maintaining stability — an extremely appealing property to which much of the initial success of the wavelets can be attributed.

4 IEEE PROCEEDINGS, VOL. X, NO. X, XX 20XX

Fig. 2. Left: A few 12×12 Gabor atoms at different scales and orientations. Right: A few atoms trained by Olshausen and Field (extracted from [34]).

Specifically of interest to the signal processing community was the work of Mallat and his colleagues [31]–[33], which established the wavelet decomposition as a multi-resolution expansion and put forth efficient algorithms for computing it. In Mallat’s description, a multi-scale wavelet basis is constructed from a pair of localized functions referred to as the scaling function and the mother wavelet, see Figure 3. The scaling function is a low frequency signal, and along with its translations, spans the coarse approximation of the signal. The mother wavelet is a high frequency signal, and with its various scales and translations spans the signal detail. In the orthogonal case, the wavelet basis functions at each scale are critically sampled, spanning precisely the new detail introduced by the finer level.

Non-linear approximation in the wavelet basis was shown to be optimal for piecewise-smooth 1-D signals with a finite number of discontinuities, see e.g. [32]. This was a striking finding at the time, realizing that this is achieved without prior detection of the discontinuity locations. Unfortunately, in higher dimensions the wavelet transform loses its optimality; the multi-dimensional transform is a simple separable extension of the 1-D transform, with atoms supported over rectangular regions of different sizes (see Figure 3). This separability makes the transform simple to apply; however, the resulting dictionary is only effective for signals with point singularities, while most natural signals exhibit elongated edge singularities. The JPEG2000 image compression standard, based on the wavelet transform, is indeed known for its ringing (smoothing) artifacts near edges.
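The 1-D behavior around discontinuities is easy to observe with the simplest wavelet, the Haar basis: a step signal produces only a handful of nonzero coefficients, however many samples it has. A self-contained toy implementation (ours, not from the paper):

```python
import numpy as np

def haar_fwd(f):
    """Orthogonal Haar transform (length must be a power of two):
    returns the detail coefficients at every scale plus the final
    coarse average."""
    f = f.astype(float).copy()
    coeffs = []
    while len(f) > 1:
        a = (f[0::2] + f[1::2]) / np.sqrt(2)  # scaling (low-pass) part
        d = (f[0::2] - f[1::2]) / np.sqrt(2)  # wavelet (detail) part
        coeffs.append(d)
        f = a
    coeffs.append(f)
    return coeffs

# A step signal: only the coefficients interacting with the jump are
# nonzero, so a 2-term representation is exact here.
f = np.concatenate([np.zeros(8), np.ones(8)])
n_nonzero = sum(int(np.count_nonzero(np.abs(c) > 1e-12)) for c in haar_fwd(f))
```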

Adaptivity: Going into the 1990’s, the desire to push sparsity even further, and describe increasingly complex phenomena, was gradually revealing the limits of approximation in orthogonal bases. The weakness was mostly associated with the small and fixed number of atoms in the dictionary — dictated by the orthogonality — from which the optimal representation could be constructed. Thus, one option to obtain further sparsity was to adapt the transform atoms themselves to the signal content.

One of the first such structures to be proposed was the wavelet packet transform, introduced by Coifman, Meyer and Wickerhauser in 1992 [35]. The transform is built upon the success of the wavelet transform, adding adaptivity to allow finer tuning to the specific signal properties. The main observation of Coifman et al. was that the wavelet transform enforced a very specific time-frequency structure, with high frequency atoms having small supports and low frequency atoms having large supports. Indeed, this choice has deep connections to the behavior of real natural signals; however, for specific signals, better partitionings may be possible. The wavelet packet dictionary essentially unifies all dyadic time-frequency atoms which can be derived from a specific pair of scaling function and mother wavelet, so atoms of different frequencies can come in an array of time supports. Out of this large collection, the wavelet packet transform allows one to efficiently select an optimized orthogonal sub-dictionary for any given signal, with the standard wavelet basis being just one of an exponential number of options. The process was thus named by the authors a Best Basis search. The wavelet packet transform is, by definition, at least as good as wavelets in terms of coding efficiency. However, we note that the multi-dimensional wavelet packet transform remains a separable and non-oriented transform, and thus does not generally provide a substantial improvement over wavelets for images.

Fig. 3. Left: Coiflet 1-D scaling function (solid) and mother wavelet (dashed). Right: Some 2-D separable Coiflet atoms.

Geometric Invariance and Overcompleteness: In 1992, Simoncelli et al. [36] published a thorough work advocating a dictionary property they termed shiftability, which describes the invariance of the dictionary under certain geometric deformations, e.g. translation, rotation or scaling. Indeed, the main weakness of the wavelet transform is its strong translation-sensitivity, as well as rotation-sensitivity in higher dimensions. The authors concluded that achieving these properties required abandoning orthogonality in favor of overcompleteness, since the critical number of atoms in an orthogonal transform was simply insufficient. In the same work, the authors developed an overcomplete oriented wavelet transform — the steerable wavelet transform — which was based on their previous work on steerable filters and consisted of localized 2-D wavelet atoms in many orientations, translations and scales.

For the basic 1-D wavelet transform, translation-invariance can be achieved by increasing the sampling density of the atoms. The stationary wavelet transform, also known as the undecimated or non-subsampled wavelet transform, is obtained from the orthogonal transform by eliminating the sub-sampling and collecting all translations of the atoms over the signal domain. The algorithmic foundation for this was laid by Beylkin in 1992 [37], with the development of an efficient algorithm for computing the undecimated transform. The stationary wavelet transform was indeed found to substantially improve signal recovery compared to orthogonal wavelets, and its benefits were independently demonstrated in 1995 by Nason and Silverman [38] and Coifman and Donoho [39].
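The undecimated transform can equivalently be obtained by averaging the decimated transform over all circular shifts, in the spirit of the cycle-spinning idea of [39]. A toy one-level Haar sketch of this equivalence (illustrative only; names are ours):

```python
import numpy as np

def denoise_once(f, thresh):
    """One-level decimated Haar shrinkage (a stand-in for a full
    orthogonal wavelet denoiser)."""
    a = (f[0::2] + f[1::2]) / np.sqrt(2)
    d = (f[0::2] - f[1::2]) / np.sqrt(2)
    d = d * (np.abs(d) > thresh)          # hard thresholding of details
    out = np.empty_like(f, dtype=float)
    out[0::2] = (a + d) / np.sqrt(2)      # inverse one-level transform
    out[1::2] = (a - d) / np.sqrt(2)
    return out

def cycle_spin(f, thresh):
    """Average the decimated denoiser over all circular shifts; the
    result is translation-equivariant, as with the stationary transform."""
    n = len(f)
    acc = np.zeros(n)
    for s in range(n):
        acc += np.roll(denoise_once(np.roll(f, s), thresh), -s)
    return acc / n

rng = np.random.default_rng(0)
f = rng.normal(size=8)
out = cycle_spin(f, 0.5)
```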


Page 75: Adaptive Signal and Image Processing

[Aharon & Elad 2006]

yk(·) = f(zk + ·)

Patch-based Denoising

Step 1: Extract patches. Noisy image: f = f0 + w.

yk
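Step 1 can be sketched in a few lines of numpy (the patch size and grid step below are illustrative, not values from the slides):

```python
import numpy as np

def extract_patches(f, p, step):
    """Step 1: collect the overlapping p-by-p patches y_k = f(z_k + .)
    whose top-left corners z_k lie on a grid with the given step."""
    H, W = f.shape
    return np.array([f[i:i + p, j:j + p]
                     for i in range(0, H - p + 1, step)
                     for j in range(0, W - p + 1, step)])

f = np.arange(64.0).reshape(8, 8)
Y = extract_patches(f, p=4, step=2)  # a 3 x 3 grid of 4x4 patches
```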

Page 76: Adaptive Signal and Image Processing

Step 2: Dictionary learning.

[Aharon & Elad 2006]

yk(·) = f(zk + ·)

min_{D,(x_k)_k} Σ_k 1/2 ||y_k − D x_k||² + λ ||x_k||_1

Patch-based Denoising

Step 1: Extract patches. Noisy image: f = f0 + w.

yk
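A minimal sketch of the alternating minimization behind this objective: sparse coding of each patch by iterative soft thresholding, then a least-squares dictionary update with renormalized atoms (a generic sketch, not the exact algorithm of [Aharon & Elad 2006]):

```python
import numpy as np

def ista(D, y, lam, n_iter=200):
    """Sparse coding: minimize (1/2)||y - D x||^2 + lam ||x||_1 by ISTA."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = x - D.T @ (D @ x - y) / L    # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrinkage
    return x

def dictionary_step(Y, X):
    """Dictionary update: least-squares fit of D to the patches Y given
    the codes X, followed by renormalization of the atoms."""
    D = Y @ X.T @ np.linalg.pinv(X @ X.T)
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)

# Sanity check with the identity dictionary: the sparse code is the
# soft-thresholded signal.
x_hat = ista(np.eye(3), np.array([2.0, 0.0, 0.0]), lam=0.5)
```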

Page 77: Adaptive Signal and Image Processing

Step 3: Patch averaging.

Step 2: Dictionary learning.

yk = Dxk

[Aharon & Elad 2006]

f(·) ≈ Σ_k y_k(· − z_k)

yk(·) = f(zk + ·)

min_{D,(x_k)_k} Σ_k 1/2 ||y_k − D x_k||² + λ ||x_k||_1

Patch-based Denoising

Step 1: Extract patches. Noisy image: f = f0 + w.

yk
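The averaging in Step 3 can be sketched as follows (uniform weights on the overlaps; all names are ours):

```python
import numpy as np

def average_patches(patches, corners, shape, p):
    """Step 3: place each processed patch y_k back at its position z_k
    and average wherever patches overlap."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for (i, j), patch in zip(corners, patches):
        acc[i:i + p, j:j + p] += patch
        cnt[i:i + p, j:j + p] += 1
    return acc / np.maximum(cnt, 1)

# Averaging the full set of constant patches reproduces the constant image.
corners = [(i, j) for i in range(3) for j in range(3)]
patches = [np.ones((2, 2))] * 9
out = average_patches(patches, corners, (4, 4), 2)
```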

Page 78: Adaptive Signal and Image Processing

Inverse problem:

D ∈ C

pk(f) = f(zk + ·)

Learning with Missing Data

y = Φ f0 + w

min_{f,(x_k)_k} 1/2 ||y − Φ f||² + λ Σ_k ( 1/2 ||p_k(f) − D x_k||² + μ ||x_k||_1 )

Patch extractor:

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

LEARNING MULTISCALE AND SPARSE REPRESENTATIONS 237

(a) Original (b) Damaged (c) Restored, N = 1 (d) Restored, N = 2

Fig. 14. Inpainting using N = 2 and n = 16×16 (bottom-right image), or N = 1 and n = 8×8 (bottom-left). J = 100 iterations were performed, producing an adaptive dictionary. During the learning, 50% of the patches were used. A sparsity factor L = 10 has been used during the learning process and L = 25 for the final reconstruction. The damaged image was created by removing 75% of the data from the original image. The initial PSNR is 6.13 dB. The resulting PSNR for N = 2 is 33.97 dB and 31.75 dB for N = 1.


Page 79: Adaptive Signal and Image Processing

Inverse problem:

D ∈ C

⇒ Convex sparse coding.

pk(f) = f(zk + ·)

Learning with Missing Data

y = Φ f0 + w

min_{f,(x_k)_k} 1/2 ||y − Φ f||² + λ Σ_k ( 1/2 ||p_k(f) − D x_k||² + μ ||x_k||_1 )

Step 1: ∀ k, minimization on xk

Patch extractor:


Page 80: Adaptive Signal and Image Processing

Inverse problem:

D ∈ C

Step 2: Minimization on D

⇒ Convex sparse coding.

⇒ Quadratic, constrained.

pk(f) = f(zk + ·)

Learning with Missing Data

y = Φ f0 + w

min_{f,(x_k)_k} 1/2 ||y − Φ f||² + λ Σ_k ( 1/2 ||p_k(f) − D x_k||² + μ ||x_k||_1 )

Step 1: ∀ k, minimization on xk

Patch extractor:


Page 81: Adaptive Signal and Image Processing

Inverse problem:

D ∈ C

Step 2: Minimization on D

Step 3: Minimization on f

⇒ Convex sparse coding.

⇒ Quadratic, constrained.

⇒ Quadratic.

pk(f) = f(zk + ·)

Learning with Missing Data

y = Φ f0 + w

min_{f,(x_k)_k} 1/2 ||y − Φ f||² + λ Σ_k ( 1/2 ||p_k(f) − D x_k||² + μ ||x_k||_1 )

Step 1: ∀ k, minimization on xk

Patch extractor:
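With the codes x_k and the dictionary D held fixed, the minimization over f (Step 3) is a quadratic problem. A 1-D toy sketch via the normal equations (all names and sizes are illustrative, not from the slides):

```python
import numpy as np

def update_f(y, mask, targets, corners, p, lam):
    """Step 3 sketch: solve the normal equations
    (Phi^T Phi + lam * sum_k P_k^T P_k) f = Phi^T y + lam * sum_k P_k^T (D x_k),
    where Phi = diag(mask) observes part of f and P_k extracts patch k."""
    n = len(y)
    A = np.diag(mask.astype(float))
    b = mask * y
    for z, t in zip(corners, targets):
        P = np.zeros((p, n))                 # patch extractor p_k as a matrix
        P[np.arange(p), z + np.arange(p)] = 1.0
        A += lam * P.T @ P
        b += lam * P.T @ t
    return np.linalg.solve(A, b)

# Consistency check: with everything observed and patch targets that
# agree with a signal f0, the update returns f0 exactly.
f0 = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.ones(4, dtype=bool)
corners = [0, 2]
targets = [f0[z:z + 2] for z in corners]
f = update_f(f0, mask, targets, corners, p=2, lam=1.0)
```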


Page 82: Adaptive Signal and Image Processing

Image f0. Observations: y = Φ f0 + w.

Regularized f

[Mairal et al. 2008]

Inpainting Example


Page 83: Adaptive Signal and Image Processing

[Peyre, Fadili, Starck 2010]

Adaptive Inpainting and Separation

Local DCT / Wavelets / Learned

Page 84: Adaptive Signal and Image Processing

MAIRAL et al.: SPARSE REPRESENTATION FOR COLOR IMAGE RESTORATION 57

Fig. 2. Dictionaries with 256 atoms learned on a generic database of natural images, with two different sizes of patches. Note the large number of color-less atoms. Since the atoms can have negative values, the vectors are presented scaled and shifted to the [0,255] range per channel: (a) 5×5×3 patches; (b) 8×8×3 patches.

Fig. 3. Examples of color artifacts while reconstructing a damaged version of the image (a) without the improvement proposed here ( in the new metric). Color artifacts are reduced with our proposed technique ( in our proposed new metric). Both images have been denoised with the same global dictionary. In (b), one observes a bias effect in the color from the castle and in some part of the water. What is more, the color of the sky is piecewise constant when (false contours), which is another artifact our approach corrected. (a) Original. (b) Original algorithm, dB. (c) Proposed algorithm, dB.

Fig. 4. (a) Training image; (b) resulting dictionary; (b) is the dictionary learned from the image in (a). The dictionary is more colored than the global one.

Higher Dimensional Learning


MAIRAL et al.: SPARSE REPRESENTATION FOR COLOR IMAGE RESTORATION 61

Fig. 7. Data set used for evaluating denoising experiments.

TABLE I. PSNR RESULTS OF OUR DENOISING ALGORITHM WITH 256 ATOMS OF SIZE 7×7×3 FOR AND 6×6×3 FOR . EACH CASE IS DIVIDED IN FOUR PARTS: THE TOP-LEFT RESULTS ARE THOSE GIVEN BY MCAULEY ET AL. [28] WITH THEIR “3×3 MODEL.” THE TOP-RIGHT RESULTS ARE THOSE OBTAINED BY APPLYING THE GRAYSCALE K-SVD ALGORITHM [2] ON EACH CHANNEL SEPARATELY WITH 8×8 ATOMS. THE BOTTOM-LEFT ARE OUR RESULTS OBTAINED WITH A GLOBALLY TRAINED DICTIONARY. THE BOTTOM-RIGHT ARE THE IMPROVEMENTS OBTAINED WITH THE ADAPTIVE APPROACH WITH 20 ITERATIONS. BOLD INDICATES THE BEST RESULTS FOR EACH GROUP. AS CAN BE SEEN, OUR PROPOSED TECHNIQUE CONSISTENTLY PRODUCES THE BEST RESULTS

TABLE II. COMPARISON OF THE PSNR RESULTS ON THE IMAGE “CASTLE” BETWEEN [28] AND WHAT WE OBTAINED WITH 256 6×6×3 AND 7×7×3 PATCHES. FOR THE ADAPTIVE APPROACH, 20 ITERATIONS HAVE BEEN PERFORMED. BOLD INDICATES THE BEST RESULT, INDICATING ONCE AGAIN THE CONSISTENT IMPROVEMENT OBTAINED WITH OUR PROPOSED TECHNIQUE

patch), in order to prevent any learning of these artifacts (over-fitting). We then define the patch sparsity of the decomposition as this number of steps. The stopping criterion in (2) becomes the number of atoms used instead of the reconstruction error. Using a small value during the OMP permits learning a dictionary specialized in providing a coarse approximation. Our assumption is that (pattern) artifacts are less present in coarse approximations, preventing the dictionary from learning them. We then propose the algorithm described in Fig. 6. We typically used to prevent the learning of artifacts and found out that two outer iterations in the scheme in Fig. 6 are sufficient to give satisfactory results, while within the K-SVD, 10–20 iterations are required.
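The patch-sparsity rule above amounts to running Orthogonal Matching Pursuit with a fixed number of atoms as the stopping criterion, rather than a target reconstruction error. A minimal sketch of this idea (our own simplified illustration, not the authors' implementation; all names are ours):

```python
import numpy as np

def omp(D, x, n_atoms):
    """Greedy Orthogonal Matching Pursuit that stops after
    `n_atoms` steps (the 'patch sparsity'), instead of stopping
    at a reconstruction-error target."""
    residual = x.copy()
    support = []
    coeffs = None
    for _ in range(n_atoms):
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # least-squares refit on the current support
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    alpha = np.zeros(D.shape[1])
    alpha[support] = coeffs
    return alpha

# usage: a signal built from 2 atoms of a random unit-norm dictionary
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)
x = 2.0 * D[:, 3] - 1.5 * D[:, 10]
alpha = omp(D, x, n_atoms=2)
```

By construction `alpha` has at most 2 non-zero coefficients, which is exactly the "number of atoms used" stopping rule discussed in the excerpt.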

To conclude, in order to address the demosaicing problem, weuse the modified K-SVD algorithm that deals with nonuniformnoise, as described in previous section, and add to it an adaptivedictionary that has been learned with low patch sparsity in orderto avoid over-fitting the mosaic pattern. The same technique canbe applied to generic color inpainting as demonstrated in thenext section.

V. EXPERIMENTAL RESULTS

We are now ready to present the color image denoising, inpainting, and demosaicing results that are obtained with the proposed framework.

A. Denoising Color Images

The state-of-the-art performance of the algorithm on grayscale images has already been studied in [2]. We now evaluate our extension for color images. We trained some dictionaries with different sizes of atoms 5×5×3, 6×6×3, 7×7×3 and 8×8×3, on 200 000 patches taken from a database of 15 000 images with the patch-sparsity parameter

(six atoms in the representations). We used the LabelMe database [55] to build our image database. Then we trained each dictionary with 600 iterations. This provided us with a set of generic dictionaries that we used as initial dictionaries in our denoising algorithm. Comparing the results obtained with the global approach and the adaptive one permits us to see the improvements in the learning process. We chose to evaluate
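Dictionaries of this kind are learned by alternating a sparse-coding step with a dictionary-update step. A minimal MOD-style sketch (a simplified stand-in for K-SVD, with our own hypothetical names, using plain thresholding as the coder):

```python
import numpy as np

def learn_dictionary(X, n_atoms, n_iter=20, sparsity=3, seed=0):
    """Alternating minimization: sparse-code the training patches
    (columns of X) by keeping the largest correlations, then update
    D by least squares (MOD-style), renormalizing the atoms."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        # sparse coding: keep the `sparsity` largest correlations per patch
        A = D.T @ X
        thresh = -np.sort(-np.abs(A), axis=0)[sparsity - 1]
        A[np.abs(A) < thresh] = 0.0
        # dictionary update: D = X A^+ (least-squares fit to the codes)
        D = X @ np.linalg.pinv(A)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D

# usage: learn 12 unit-norm atoms from 100 random training patches
X = np.random.default_rng(1).standard_normal((8, 100))
D = learn_dictionary(X, n_atoms=12)
```

K-SVD differs in updating one atom at a time via a rank-one SVD, but the alternating structure is the same.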

Learning D

Page 85: Adaptive Signal and Image Processing

Inpainting

Higher Dimensional Learning



Learning D

ONLINE LEARNING FOR MATRIX FACTORIZATION AND SPARSE CODING

Figure 7: Inpainting example on a 12-Megapixel image. Top: Damaged and restored images. Bottom: Zooming on the damaged and restored images. Note that the pictures presented here have been scaled down for display. (Best seen in color.)

6.4 Application to Large-Scale Image Processing

We demonstrate in this section that our algorithm can be used for a difficult large-scale image processing task, namely, removing the text (inpainting) from the damaged 12-Megapixel image of Figure 7. Using a multi-threaded version of our implementation, we have learned a dictionary with 256 elements from the roughly 7×10^6 undamaged 12×12 color patches in the image with two epochs in about 8 minutes on a 2.4GHz machine with eight cores. Once the dictionary has been learned, the text is removed using the sparse coding technique for inpainting of Mairal et al. (2008b). Our intent here is of course not to evaluate our learning procedure in inpainting tasks, which would require a thorough comparison with state-of-the-art techniques on standard data sets. Instead, we just wish to demonstrate that it can indeed be applied to a realistic, non-trivial image processing task on a large image. Indeed, to the best of our knowledge, this is the first time that dictionary learning is used for image restoration on such large-scale data. For comparison, the dictionaries used for inpainting in Mairal et al. (2008b) are learned (in batch mode) on 200,000 patches only.
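Once a dictionary is fixed, inpainting a patch reduces to sparse-coding it over the observed pixels only, then synthesizing the missing pixels from the full atoms. A minimal per-patch sketch (our own simplified OMP variant, not the paper's implementation; names are ours):

```python
import numpy as np

def inpaint_patch(D, x, mask, n_atoms=5):
    """Fill the masked entries of patch `x` (mask == True means
    observed) by greedy pursuit restricted to the observed rows of D,
    then synthesize the holes with the full atoms."""
    Do, xo = D[mask], x[mask]
    Do_n = Do / np.maximum(np.linalg.norm(Do, axis=0), 1e-12)
    residual, support = xo.copy(), []
    for _ in range(n_atoms):
        k = int(np.argmax(np.abs(Do_n.T @ residual)))
        if k not in support:
            support.append(k)
        coeffs, *_ = np.linalg.lstsq(Do[:, support], xo, rcond=None)
        residual = xo - Do[:, support] @ coeffs
    # synthesize the full patch, keeping the observed pixels untouched
    full = D[:, support] @ coeffs
    out = x.copy()
    out[~mask] = full[~mask]
    return out

# usage: a 2-sparse patch with a random observation mask
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)
x = D[:, 5] - 0.5 * D[:, 40]
mask = rng.random(64) > 0.25
out = inpaint_patch(D, x, mask, n_atoms=2)
```

In practice this is applied to all (possibly overlapping) patches and the results are averaged, which is what makes the 12-Megapixel experiment above tractable once the dictionary is learned.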


Page 86: Adaptive Signal and Image Processing

Movie Inpainting

Page 87: Adaptive Signal and Image Processing

Image registration.

Facial Image Compression

show recognizable faces. We use a database containing around 6000 such facial images, some of which are used for training and tuning the algorithm, and the others for testing it, similar to the approach taken in [17].

In our work we propose a novel compression algorithm, relatedto the one presented in [17], improving over it.

Our algorithm relies strongly on recent advancements made in using sparse and redundant representations of signals [18–26], and learning their sparsifying dictionaries [27–29]. We use the K-SVD algorithm for learning the dictionaries for representing small image patches in a locally adaptive way, and use these to sparse-code the patches' content. This is a relatively simple and straightforward algorithm with hardly any entropy coding stage. Yet, it is shown to be superior to several competing algorithms: (i) the JPEG2000, (ii) the VQ-based algorithm presented in [17], and (iii) a Principal Component Analysis (PCA) approach.2

In the next section we provide some background material for this work: we start by presenting the details of the compression algorithm developed in [17], as their scheme is the one we embark from in the development of ours. We also describe the topic of sparse and redundant representations and the K-SVD, which are the foundations for our algorithm. In Section 3 we turn to present the proposed algorithm in detail, showing its various steps, and discussing its computational/memory complexities. Section 4 presents results of our method, demonstrating the claimed superiority. We conclude in Section 5 with a list of future activities that can further improve over the proposed scheme.

2. Background material

2.1. VQ-based image compression

Among the thousands of papers that study still image compression algorithms, there are relatively few that consider the treatment of facial images [2–17]. Among those, the most recent and best performing algorithm is the one reported in [17]. That paper also provides a thorough literature survey that compares the various methods and discusses similarities and differences between them. Therefore, rather than repeating such a survey here, we refer the interested reader to [17]. In this sub-section we concentrate on the description of the algorithm in [17], as our method resembles it to some extent.

This algorithm, like some others before it, starts with a geometrical alignment of the input image, so that the main features (ears, nose, mouth, hair-line, etc.) are aligned with those of a database of pre-aligned facial images. Such alignment further increases the redundancy in the handled image, due to its high cross similarity to the database. The warping in [17] is done by an automatic detection of 13 feature points on the face, and moving them to pre-determined canonical locations. These points define a slicing of the input image into a disjoint and covering set of triangles, each exhibiting an affine warp, being a function of the motion of its three vertices. Side information on these 13 feature locations enables a reverse warp of the reconstructed image in the decoder. Fig. 1 (left side) shows the features and the induced triangles. After the warping, the image is sliced into square and non-overlapping patches (of size 8×8 pixels), each of which is coded separately. Such a possible slicing (for illustration purposes we show this slicing with larger patches) is shown in Fig. 1 (right side).
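Each triangle's affine warp is fully determined by its three vertex correspondences (6 unknowns, 6 equations). A small sketch of recovering those parameters (illustrative only; function name is ours):

```python
import numpy as np

def affine_from_triangle(src, dst):
    """Solve for A (2x2) and t (2,) such that A @ s + t = d for the
    three vertex correspondences src -> dst, each of shape (3, 2)."""
    S = np.hstack([src, np.ones((3, 1))])        # homogeneous coordinates
    # [s 1] @ M = d; exact for a non-degenerate triangle
    M, *_ = np.linalg.lstsq(S, dst, rcond=None)
    return M[:2].T, M[2]

# usage: recover a known affine map from its action on one triangle
A_true = np.array([[2.0, 0.5], [0.0, 3.0]])
t_true = np.array([1.0, -1.0])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = src @ A_true.T + t_true
A, t = affine_from_triangle(src, dst)
```

Applying the inverse map per triangle gives the reverse warp used in the decoder.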

Coding of the image patches in [17] is done using vector quantization (VQ) [30–32]. The VQ dictionaries are trained (using tree-K-Means) for each patch separately, using patches taken from the same location from 5000 training images. This way, each VQ is adapted to the expected local content, hence the high performance presented by this algorithm. The number of code-words in the VQ is a function of the bit-allocation for the patches. As we argue in the next section, VQ coding is limited by the available number of examples and the desired rate, forcing relatively small patch sizes. This, in turn, leads to a loss of some redundancy between adjacent patches, and thus loss of potential compression.
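VQ coding itself is nearest-codeword assignment: each patch is replaced by the index of its closest codeword. A minimal sketch (our own hypothetical names):

```python
import numpy as np

def vq_encode(codebook, patches):
    """Each patch (a column) is encoded as the index of its nearest
    codeword (a column of `codebook`) in Euclidean distance."""
    # squared distances via ||p||^2 - 2 c.p + ||c||^2
    d2 = (np.sum(patches**2, axis=0)[None, :]
          - 2 * codebook.T @ patches
          + np.sum(codebook**2, axis=0)[:, None])
    return np.argmin(d2, axis=0)

def vq_decode(codebook, indices):
    """Decoding is a simple table lookup."""
    return codebook[:, indices]

# usage: 4 codewords in R^4, patches are slightly perturbed codewords
codebook = np.eye(4)
patches = codebook[:, [2, 0, 3]] + 0.01
idx = vq_encode(codebook, patches)
recon = vq_decode(codebook, idx)
```

The bit cost per patch is just log2 of the codebook size, which is why the rate constraint forces small codebooks and small patches, as the excerpt argues.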

Another ingredient in this algorithm that partly compensates for the above-described shortcoming is a multi-scale coding scheme. The image is scaled down and VQ-coded using patches of size 8×8. Then it is interpolated back to the original resolution, and the residual is coded using VQ on 8×8 pixel patches once again. This method can be applied on a Laplacian pyramid of the original (warped) image with several scales [33].
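The two-scale residual scheme can be sketched as: code a downscaled image, upsample its decoding, and code the full-resolution residual. A toy sketch with a lossless stand-in coder (names and the nearest-neighbour resampling are our own simplifications):

```python
import numpy as np

def two_scale_code(img, code_fn, scale=2):
    """Multi-scale residual coding: code a downscaled image, then
    code the full-resolution residual against its upsampled decoding."""
    coarse = code_fn(img[::scale, ::scale])              # crude downscale + code
    up = np.repeat(np.repeat(coarse, scale, 0), scale, 1)
    up = up[:img.shape[0], :img.shape[1]]                # nearest-neighbour upsample
    return coarse, code_fn(img - up)

def two_scale_decode(coarse, res, shape, scale=2):
    up = np.repeat(np.repeat(coarse, scale, 0), scale, 1)[:shape[0], :shape[1]]
    return up + res

# usage: with a lossless stand-in coder, reconstruction is exact
img = np.arange(36.0).reshape(6, 6)
coarse, res = two_scale_code(img, lambda z: z)
recon = two_scale_decode(coarse, res, img.shape)
```

With a lossy coder (e.g. the VQ above) in place of the identity, the residual layer absorbs the coarse layer's quantization error, which is the point of the pyramid scheme.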

As already mentioned above, the results shown in [17] surpass those obtained by JPEG2000, both visually and in Peak-Signal-to-Noise Ratio (PSNR) quantitative comparisons. In our work we propose to replace the coding stage from VQ to sparse and redundant representations; this leads us to the next subsection, where we describe the principles behind this coding strategy.

2.2. Sparse and redundant representations

We now turn to describe a model for signals known as Sparseland [29]. This model suggests a parametric description of signal sources in a way that adapts to their true nature. This model will be harnessed in this work to provide the coding mechanism for the image patches. We consider a family of image patches of size N×N pixels, ordered lexicographically as column vectors x ∈ R^n (with n = N^2). Assume that we are given a matrix D ∈ R^{n×k} (with possibly k > n). We refer hereafter to this matrix as the dictionary. The Sparseland model suggests that every such image patch, x, could be represented sparsely using this dictionary, i.e., the solution of

a " argmina

kak0 subject to kDa# xk22 6 e2; $1%

is expected to be very sparse, kak0 & n. The notation kak0 counts thenon-zero entries in a. Thus, every signal instance from the family weconsider is assumed to be represented as a linear combination offew columns (referred to hereafter as atoms) from the redundantdictionary D.

The requirement kDa# xk2 6 e suggests that the approximationof x using Da need not be exact, and could absorb a moderate errore. This suggests an approximation that trades-off accuracy of repre-sentation with its simplicity, very much like the rate-distortion

2 The PCA algorithm is developed in this work as a competitive benchmark, and while it generally performs very well, it is inferior to the main algorithm presented in this work.

Fig. 1. (Left) Piece-wise affine warping of the image by triangulation. (Right) A uniform slicing to disjoint square patches for coding purposes.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271

[Elad et al. 2009]

Page 88: Adaptive Signal and Image Processing

Image registration.

Non-overlapping patches (f_k)_k.

Facial Image Compressionshow recognizable faces. We use a database containing around 6000such facial images, some of which are used for training and tuningthe algorithm, and the others for testing it, similar to the approachtaken in [17].

In our work we propose a novel compression algorithm, relatedto the one presented in [17], improving over it.

Our algorithm relies strongly on recent advancements made inusing sparse and redundant representation of signals [18–26], andlearning their sparsifying dictionaries [27–29]. We use the K-SVDalgorithm for learning the dictionaries for representing smallimage patches in a locally adaptive way, and use these to sparse-code the patches’ content. This is a relatively simple andstraight-forward algorithm with hardly any entropy coding stage.Yet, it is shown to be superior to several competing algorithms:(i) the JPEG2000, (ii) the VQ-based algorithm presented in [17],and (iii) A Principal Component Analysis (PCA) approach.2

In the next section we provide some background material forthis work: we start by presenting the details of the compressionalgorithm developed in [17], as their scheme is the one we embarkfrom in the development of ours. We also describe the topic ofsparse and redundant representations and the K-SVD, that arethe foundations for our algorithm. In Section 3 we turn to presentthe proposed algorithm in details, showing its various steps, anddiscussing its computational/memory complexities. Section 4presents results of our method, demonstrating the claimedsuperiority. We conclude in Section 5 with a list of future activitiesthat can further improve over the proposed scheme.

2. Background material

2.1. VQ-based image compression

Among the thousands of papers that study still imagecompression algorithms, there are relatively few that considerthe treatment of facial images [2–17]. Among those, the mostrecent and the best performing algorithm is the one reported in[17]. That paper also provides a thorough literature survey thatcompares the various methods and discusses similarities anddifferences between them. Therefore, rather than repeating sucha survey here, we refer the interested reader to [17]. In thissub-section we concentrate on the description of the algorithmin [17] as our method resembles it to some extent.

This algorithm, like some others before it, starts with a geomet-rical alignment of the input image, so that the main features (ears,nose, mouth, hair-line, etc.) are aligned with those of a database ofpre-aligned facial images. Such alignment increases further theredundancy in the handled image, due to its high cross similarityto the database. The warping in [17] is done by an automaticdetection of 13 feature points on the face, and moving them topre-determined canonical locations. These points define a slicingof the input image into disjoint and covering set of triangles, eachexhibiting an affine warp, being a function of the motion of itsthree vertices. Side information on these 13 feature locationsenables a reverse warp of the reconstructed image in the decoder.Fig. 1 (left side) shows the features and the induced triangles. Afterthe warping, the image is sliced into square and non-overlappingpatches (of size 8! 8 pixels), each of which is coded separately.Such possible slicing (for illustration purpose we show this slicingwith larger patches) is shown in Fig. 1 (right side).

Coding of the image patches in [17] is done using vector quantization (VQ) [30–32]. The VQ dictionaries are trained (using tree-K-Means) for each patch separately, using patches taken from the same location in 5000 training images. This way, each VQ is adapted to the expected local content, hence the high performance presented by this algorithm. The number of code-words in the VQ is a function of the bit allocation for the patches. As we argue in the next section, VQ coding is limited by the available number of examples and the desired rate, forcing relatively small patch sizes. This, in turn, leads to a loss of some redundancy between adjacent patches, and thus a loss of potential compression.
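The per-location VQ idea can be sketched with plain k-means (a minimal stand-in for the tree-K-Means training used in [17]); the function names and the flat k-means loop below are illustrative, not the paper's implementation. Coding a patch then amounts to transmitting one codeword index, i.e. log2(n_codewords) bits before any entropy coding.

```python
import numpy as np

def train_vq_codebook(patches, n_codewords, n_iter=20, seed=0):
    """Plain k-means VQ codebook for vectorized patches.
    patches: (n_examples, dim) array. Returns a (n_codewords, dim) codebook."""
    rng = np.random.default_rng(seed)
    codebook = patches[rng.choice(len(patches), n_codewords, replace=False)]
    for _ in range(n_iter):
        # assign each patch to its nearest codeword (squared L2 distance)
        d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute each codeword as the mean of its assigned patches
        for k in range(n_codewords):
            members = patches[labels == k]
            if len(members):
                codebook[k] = members.mean(0)
    return codebook

def vq_encode(patch, codebook):
    """Coding a patch = transmitting the index of its nearest codeword."""
    return int(((codebook - patch) ** 2).sum(1).argmin())
```

In the scheme of [17] one such codebook is trained per patch position, which is exactly why the number of training examples caps the usable rate.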

Another ingredient in this algorithm that partly compensates for the above-described shortcoming is a multi-scale coding scheme. The image is scaled down and VQ-coded using patches of size 8 × 8. It is then interpolated back to the original resolution, and the residual is coded using VQ on 8 × 8 pixel patches once again. This method can be applied on a Laplacian pyramid of the original (warped) image with several scales [33].
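A toy sketch of this two-scale residual idea follows. Block averaging and nearest-neighbour upsampling stand in for the actual filters, and in the real scheme both the coarse image and the residual would be VQ-coded rather than kept exact; all names here are invented for the sketch.

```python
import numpy as np

def downsample(img):
    """Crude 2x downsampling by block averaging (stands in for filtering)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean((1, 3))

def upsample(img):
    """Nearest-neighbour 2x upsampling (stands in for interpolation)."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def two_scale_residual(img):
    """Two-scale decomposition: a coarse version to code first, plus the
    residual between the image and the interpolated coarse version."""
    coarse = downsample(img)
    residual = img - upsample(coarse)
    return coarse, residual

def reconstruct(coarse, residual):
    """Decoder side: interpolate the coarse image back and add the residual."""
    return upsample(coarse) + residual
```

Iterating `two_scale_residual` on the coarse image yields a Laplacian pyramid with several scales, as in [33].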

As already mentioned above, the results shown in [17] surpass those obtained by JPEG2000, both visually and in quantitative Peak Signal-to-Noise Ratio (PSNR) comparisons. In our work we propose to replace the VQ coding stage with sparse and redundant representations; this leads us to the next subsection, where we describe the principles behind this coding strategy.

2.2. Sparse and redundant representations

We now turn to describe a model for signals known as Sparseland [29]. This model suggests a parametric description of signal sources in a way that adapts to their true nature. This model will be harnessed in this work to provide the coding mechanism for the image patches. We consider a family of image patches of size N × N pixels, ordered lexicographically as column vectors x ∈ ℝⁿ (with n = N²). Assume that we are given a matrix D ∈ ℝ^(n×k) (with possibly k > n). We refer hereafter to this matrix as the dictionary. The Sparseland model suggests that every such image patch, x, could be represented sparsely using this dictionary, i.e., the solution of

a* = argmin_a ‖a‖₀ subject to ‖Da − x‖₂² ≤ ε²,  (1)

is expected to be very sparse, ‖a‖₀ ≪ n. The notation ‖a‖₀ counts the non-zero entries of a. Thus, every signal instance from the family we consider is assumed to be represented as a linear combination of few columns (referred to hereafter as atoms) from the redundant dictionary D.

The requirement ‖Da − x‖₂ ≤ ε suggests that the approximation of x using Da need not be exact, and could absorb a moderate error ε. This suggests an approximation that trades off accuracy of representation against its simplicity, very much like the rate-distortion
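Problem (1) is combinatorial, and a standard greedy approximation is Orthogonal Matching Pursuit (OMP), the solver used for the dictionaries shown later in Figs. 6 and 7. The sketch below is a minimal NumPy version under the usual assumption of unit-norm dictionary columns, not the paper's implementation:

```python
import numpy as np

def omp(D, x, eps, max_atoms):
    """Greedy approximation of problem (1): pick atoms one at a time until
    the residual norm drops below eps or max_atoms atoms are used.
    D: (n, k) dictionary with (assumed) unit-norm columns; x: (n,) signal."""
    residual = x.astype(float).copy()
    support, c = [], np.zeros(0)
    while np.linalg.norm(residual) > eps and len(support) < max_atoms:
        # atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:  # no further progress possible
            break
        support.append(j)
        # re-fit all selected atoms jointly (least squares on the support)
        c, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ c
    a = np.zeros(D.shape[1])
    a[support] = c
    return a
```

The joint least-squares re-fit at every step is what distinguishes OMP from plain matching pursuit, and the stopping rule `‖residual‖ ≤ eps` mirrors the error constraint in (1).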

2 The PCA algorithm is developed in this work as a competitive benchmark, and while it generally performs very well, it is inferior to the main algorithm presented in this work.

Fig. 1. (Left) Piece-wise affine warping of the image by triangulation. (Right) A uniform slicing into disjoint square patches for coding purposes.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271

show recognizable faces. We use a database containing around 6000 such facial images, some of which are used for training and tuning the algorithm, and the others for testing it, similar to the approach taken in [17].

In our work we propose a novel compression algorithm, related to the one presented in [17] and improving over it.

Our algorithm relies strongly on recent advancements made in using sparse and redundant representations of signals [18–26], and in learning their sparsifying dictionaries [27–29]. We use the K-SVD algorithm to learn the dictionaries for representing small image patches in a locally adaptive way, and use these to sparse-code the patches' content. This is a relatively simple and straightforward algorithm with hardly any entropy-coding stage. Yet, it is shown to be superior to several competing algorithms: (i) JPEG2000, (ii) the VQ-based algorithm presented in [17], and (iii) a Principal Component Analysis (PCA) approach.2


[Elad et al. 2009]

fk

Page 89: Adaptive Signal and Image Processing

Image registration.

Non-overlapping patches (fk)k.

Dictionary learning (Dk)k.

Dk

Facial Image Compression

Before turning to present the results we should add the following: while all the results shown here refer to the specific database we operate on, the overall scheme proposed is general and should apply to other face-image databases just as well. Naturally, some changes in the parameters might be necessary and, among those, the patch size is the most important to consider. We also note that as one shifts from one source of images to another, the relative size of the background in the photos may vary, and this necessarily leads to changes in performance. More specifically, when the background regions are larger (e.g., the images we use here have relatively small such regions), the compression performance is expected to improve.

4.1. K-SVD dictionaries

The primary stopping condition for the training process was a limit on the maximal number of K-SVD iterations (100). A secondary stopping condition was a limit on the minimal representation error. In the image compression stage we added a limit on the maximal number of atoms per patch. These conditions allowed us to better control the rates of the resulting images and the overall simulation time.
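The K-SVD training loop referenced above alternates two stages: sparse-code all training patches over the current dictionary, then update the atoms one at a time via a rank-1 SVD of the restricted residual. A bare-bones illustration follows (fixed sparsity L, no stopping on representation error, all helper names invented here); it is a sketch of the algorithm, not the authors' code:

```python
import numpy as np

def omp_fixed(D, x, L):
    """Greedy OMP with at most L atoms (the sparse-coding stage of K-SVD)."""
    r, support, c = x.astype(float).copy(), [], np.zeros(0)
    for _ in range(L):
        j = int(np.argmax(np.abs(D.T @ r)))
        if j in support:
            break
        support.append(j)
        c, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        r = x - D[:, support] @ c
    a = np.zeros(D.shape[1])
    a[support] = c
    return a

def ksvd(X, n_atoms, L, n_iter=10):
    """Minimal K-SVD. X holds training patches as columns (n, m).
    Returns a dictionary D (n, n_atoms) with unit-norm atoms."""
    # init: the n_atoms highest-energy training patches, normalized
    idx = np.argsort(-np.linalg.norm(X, axis=0))[:n_atoms]
    D = X[:, idx].astype(float).copy()
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        # sparse-coding stage
        A = np.column_stack([omp_fixed(D, X[:, i], L) for i in range(X.shape[1])])
        # dictionary-update stage: one atom at a time
        for k in range(n_atoms):
            users = np.nonzero(A[k])[0]        # patches that use atom k
            if users.size == 0:
                continue
            # residual of those patches with atom k's contribution removed
            E = X[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                  # best rank-1 fit: new atom
            A[k, users] = s[0] * Vt[0]         # and its updated coefficients
    return D
```

Updating the coefficients together with the atom (the two rank-1 factors of the SVD) is the feature that separates K-SVD from a plain alternating least-squares update.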

Every obtained dictionary contains 512 atoms, each a patch of size 15 × 15 pixels. In Fig. 6 we can see the dictionary that was trained for patch number 80 (the left eye) with L = 4 sparse-coding atoms, and similarly, in Fig. 7 we can see the dictionary that was trained for patch number 87 (the right nostril), also with L = 4 sparse-coding atoms. It can be seen that both dictionaries contain images similar in nature to the image patch for which they were trained. A similar behavior was observed in the other dictionaries.

4.2. Reconstructed images

Our coding strategy allows us to learn which parts of the image are more difficult to code than others. This is done by assigning the same representation error threshold to all of the patches and observing how many atoms are required, on average, for the representation of each patch. Clearly, patches with a small number of allocated atoms are simpler to represent than others. We would expect the representation of smooth areas of the image, such as the background, parts of the face and maybe parts of the clothes, to be simpler than the representation of areas containing high-frequency elements such as the hair or the eyes. Fig. 8 shows maps of atom allocation per patch and representation error (RMSE, the square root of the mean squared error) per patch for the images in the test set at two different bit-rates. It can be seen that more atoms were allocated to patches containing the facial details (hair, mouth, eyes, and

Fig. 6. The dictionary obtained by K-SVD for patch No. 80 (the left eye) using the OMP method with L = 4.

Fig. 7. The dictionary obtained by K-SVD for patch No. 87 (the right nostril) using the OMP method with L = 4.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 275


[Elad et al. 2009]

fk

Page 90: Adaptive Signal and Image Processing

Image registration.

Non-overlapping patches (fk)k.

Dictionary learning (Dk)k.

fk ≈ Dk xk

Sparse approximation:

Entropy coding: xk → file.

Comparison: JPEG-2k / PCA / Learning.

Dk
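The pipeline listed on this slide (registration aside) can be sketched end to end: slice the image into non-overlapping patches fk, sparse-code each over its own learned dictionary Dk, and store the codes xk, which an entropy coder would then pack into the file. The one-shot "keep the L most correlated atoms, then least-squares" coder below is an illustrative stand-in for OMP, and all names are invented for the sketch:

```python
import numpy as np

def extract_patches(img, p):
    """Slice img into non-overlapping p x p patches, vectorized as columns."""
    H, W = img.shape
    return np.column_stack([
        img[i:i + p, j:j + p].ravel()
        for i in range(0, H - p + 1, p)
        for j in range(0, W - p + 1, p)])

def encode_image(img, dicts, p, L):
    """Per-position sparse coding: patch k is coded over its own dictionary
    dicts[k]. The stored code is the pair (support, coefficients)."""
    F = extract_patches(img, p)
    codes = []
    for k in range(F.shape[1]):
        D, f = dicts[k], F[:, k]
        # keep the L atoms most correlated with the patch, then re-fit
        support = np.argsort(-np.abs(D.T @ f))[:L]
        c, *_ = np.linalg.lstsq(D[:, support], f, rcond=None)
        codes.append((support, c))
    return codes

def decode_image(codes, dicts, p, shape):
    """Invert the coder: fk ~ Dk xk, then tile the patches back together."""
    H, W = shape
    img = np.zeros(shape)
    k = 0
    for i in range(0, H - p + 1, p):
        for j in range(0, W - p + 1, p):
            support, c = codes[k]
            img[i:i + p, j:j + p] = (dicts[k][:, support] @ c).reshape(p, p)
            k += 1
    return img
```

In the actual scheme, the per-patch pair of atom indices and quantized coefficients is what the entropy coder writes to the file, and the rate is controlled through L and the error threshold.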


a " argmina

kak0 subject to kDa# xk22 6 e2; $1%

is expected to be very sparse, kak0 & n. The notation kak0 counts thenon-zero entries in a. Thus, every signal instance from the family weconsider is assumed to be represented as a linear combination offew columns (referred to hereafter as atoms) from the redundantdictionary D.

The requirement kDa# xk2 6 e suggests that the approximationof x using Da need not be exact, and could absorb a moderate errore. This suggests an approximation that trades-off accuracy of repre-sentation with its simplicity, very much like the rate-distortion

2 The PCA algorithm is developed in this work as a competitive benchmark, andwhile it is generally performing very well, it is inferior to the main algorithmpresented in this work.

Fig. 1. (Left) Piece-wise affine warping of the image by triangulation. (Right) Auniform slicing to disjoint square patches for coding purposes.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271

show recognizable faces. We use a database containing around 6000 such facial images, some of which are used for training and tuning the algorithm, and the others for testing it, similar to the approach taken in [17].

In our work we propose a novel compression algorithm, related to the one presented in [17], improving over it.

Our algorithm relies strongly on recent advancements made in using sparse and redundant representations of signals [18–26], and learning their sparsifying dictionaries [27–29]. We use the K-SVD algorithm for learning the dictionaries for representing small image patches in a locally adaptive way, and use these to sparse-code the patches' content. This is a relatively simple and straight-forward algorithm with hardly any entropy coding stage. Yet, it is shown to be superior to several competing algorithms: (i) JPEG2000, (ii) the VQ-based algorithm presented in [17], and (iii) a Principal Component Analysis (PCA) approach.²

In the next section we provide some background material for this work: we start by presenting the details of the compression algorithm developed in [17], as their scheme is the one we embark from in the development of ours. We also describe the topic of sparse and redundant representations and the K-SVD, which are the foundations for our algorithm. In Section 3 we turn to present the proposed algorithm in detail, showing its various steps, and discussing its computational/memory complexities. Section 4 presents results of our method, demonstrating the claimed superiority. We conclude in Section 5 with a list of future activities that can further improve over the proposed scheme.

2. Background material

2.1. VQ-based image compression

Among the thousands of papers that study still image compression algorithms, there are relatively few that consider the treatment of facial images [2–17]. Among those, the most recent and the best performing algorithm is the one reported in [17]. That paper also provides a thorough literature survey that compares the various methods and discusses similarities and differences between them. Therefore, rather than repeating such a survey here, we refer the interested reader to [17]. In this sub-section we concentrate on the description of the algorithm in [17], as our method resembles it to some extent.

This algorithm, like some others before it, starts with a geometrical alignment of the input image, so that the main features (ears, nose, mouth, hair-line, etc.) are aligned with those of a database of pre-aligned facial images. Such alignment further increases the redundancy in the handled image, due to its high cross-similarity to the database. The warping in [17] is done by an automatic detection of 13 feature points on the face, and moving them to pre-determined canonical locations. These points define a slicing of the input image into a disjoint and covering set of triangles, each exhibiting an affine warp, being a function of the motion of its three vertices. Side information on these 13 feature locations enables a reverse warp of the reconstructed image in the decoder. Fig. 1 (left side) shows the features and the induced triangles. After the warping, the image is sliced into square and non-overlapping patches (of size 8 × 8 pixels), each of which is coded separately. Such a possible slicing (for illustration purposes we show this slicing with larger patches) is shown in Fig. 1 (right side).

Coding of the image patches in [17] is done using vector quantization (VQ) [30–32]. The VQ dictionaries are trained (using tree-K-Means) per each patch separately, using patches taken from the same location in 5000 training images. This way, each VQ is adapted to the expected local content, hence the high performance presented by this algorithm. The number of code-words in the VQ is a function of the bit-allocation for the patches. As we argue in the next section, VQ coding is limited by the available number of examples and the desired rate, forcing relatively small patch sizes. This, in turn, leads to a loss of some redundancy between adjacent patches, and thus a loss of potential compression.
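As a rough illustration of the VQ training step, here is a plain Lloyd K-Means code-book learner (the paper uses a tree-structured variant; all names below are ours, not from [17]):

```python
import numpy as np

def vq_codebook(patches, n_codes, n_iter=20, seed=0):
    """Plain Lloyd K-Means on an (m, n) array of m training patches.
    Returns the code-book and the per-patch assignments."""
    rng = np.random.default_rng(seed)
    # initialize code-words with distinct training patches
    codes = patches[rng.choice(len(patches), n_codes, replace=False)].copy()
    assign = np.zeros(len(patches), dtype=int)
    for _ in range(n_iter):
        # nearest code-word for every training patch
        d2 = ((patches[:, None, :] - codes[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # move each code-word to the centroid of its cell
        for c in range(n_codes):
            members = patches[assign == c]
            if len(members) > 0:
                codes[c] = members.mean(axis=0)
    return codes, assign
```

Coding a patch then amounts to transmitting the index of its nearest code-word, so the rate per patch is about log2(n_codes) bits, which is why the code-book size is tied to the bit-allocation.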

Another ingredient in this algorithm that partly compensates for the above-described shortcoming is a multi-scale coding scheme. The image is scaled down and VQ-coded using patches of size 8 × 8. Then it is interpolated back to the original resolution, and the residual is coded using VQ on 8 × 8 pixel patches once again. This method can be applied on a Laplacian pyramid of the original (warped) image with several scales [33].

As already mentioned above, the results shown in [17] surpass those obtained by JPEG2000, both visually and in Peak Signal-to-Noise Ratio (PSNR) quantitative comparisons. In our work we propose to replace the coding stage from VQ to sparse and redundant representations; this leads us to the next subsection, where we describe the principles behind this coding strategy.

2.2. Sparse and redundant representations

We now turn to describe a model for signals known as Sparseland [29]. This model suggests a parametric description of signal sources in a way that adapts to their true nature. This model will be harnessed in this work to provide the coding mechanism for the image patches. We consider a family of image patches of size N × N pixels, ordered lexicographically as column vectors x ∈ R^n

(with n = N²). Assume that we are given a matrix D ∈ R^(n×k) (with possibly k > n). We refer hereafter to this matrix as the dictionary. The Sparseland model suggests that every such image patch, x, could be represented sparsely using this dictionary, i.e., the solution of

α̂ = argmin_α ‖α‖₀  subject to  ‖Dα − x‖₂² ≤ ε²,    (1)

is expected to be very sparse, ‖α̂‖₀ ≪ n. The notation ‖α‖₀ counts the non-zero entries of α. Thus, every signal instance from the family we consider is assumed to be represented as a linear combination of few columns (referred to hereafter as atoms) from the redundant dictionary D.
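A minimal NumPy sketch of a greedy pursuit for problem (1) — orthogonal matching pursuit; the paper's actual encoder differs in its details, so treat this as illustrative:

```python
import numpy as np

def omp(D, x, eps):
    """Greedy approximation of: min ||a||_0  s.t.  ||D a - x||_2 <= eps."""
    n, k = D.shape
    a = np.zeros(k)
    support, residual = [], x.astype(float).copy()
    while np.linalg.norm(residual) > eps and len(support) < n:
        # atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:          # numerical stagnation guard
            break
        support.append(j)
        # least-squares fit on the selected atoms, then refresh the residual
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        a[:] = 0.0
        a[support] = coef
        residual = x - D @ a
    return a
```

For a patch that is an exact combination of a few incoherent atoms, the loop typically recovers them in as many iterations, so ‖α̂‖₀ stays far below n.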

The requirement ‖Dα − x‖₂ ≤ ε suggests that the approximation of x using Dα need not be exact, and could absorb a moderate error ε. This suggests an approximation that trades off accuracy of representation with its simplicity, very much like the rate-distortion

² The PCA algorithm is developed in this work as a competitive benchmark, and while it generally performs very well, it is inferior to the main algorithm presented in this work.

Fig. 1. (Left) Piece-wise affine warping of the image by triangulation. (Right) A uniform slicing into disjoint square patches for coding purposes.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271

Much like other compression methods, the quality of the reconstructed images in our method improves as the bit-rate increases. However, the contribution gained from such a rate increment is not divided equally over the image. Additional bits are allocated to patches with higher representation error, and those are improved first. This property is directly caused by the nature of the compression process, which is RMSE-oriented and not bit-rate-oriented. The compression process sets a single RMSE threshold for all the patches, forcing each of them to reach it without fixing the number of allocated atoms per patch. Patches with simple (smooth) content are most likely to have a representation error far below the threshold even using zero or one atom, whereas patches with more complex content are expected to give a representation error very close to the threshold. Such problematic patches will be forced to improve their representation error by increasing the number of atoms they use as the RMSE threshold is decreased, while patches with a representation error below the threshold will not be forced to change at all. Fig. 11 illustrates the gradual improvement in the image quality as the bit-rate increases. As can be seen, not all the patches improve as the bit-rate increases but only some of them, such as several patches in the clothes area, in the ears, and in the outline of the hair. These patches were more difficult to represent than others.

4.3. Comparing to other techniques

An important part in assessing the performance of our compression method is its comparison to known and competitive compression techniques. As mentioned before, we compare our results in this work with JPEG, JPEG2000, the VQ-based compression method described in [17], and a PCA-based compression method that was built especially for this work as a competitive benchmark. We therefore start with a brief description of the PCA technique.

The PCA-based compression method is very similar to the scheme described in this work, simply replacing the K-SVD dictionaries with Principal Component Analysis (PCA) ones. These dictionaries are square matrices storing the eigenvectors of the autocorrelation matrices of the training examples in each patch, sorted in decreasing order of their corresponding eigenvalues.
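The per-location PCA dictionaries described above can be sketched as follows (our reconstruction from the description, not the authors' code):

```python
import numpy as np

def pca_dictionary(patches):
    """patches: (n, m) array holding m training patches (dimension n)
    taken from one fixed spatial location.  Returns an n x n orthonormal
    dictionary whose columns are the eigenvectors of the autocorrelation
    matrix, sorted by decreasing eigenvalue."""
    n, m = patches.shape
    R = patches @ patches.T / m            # autocorrelation matrix
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending order
    return eigvecs[:, np.argsort(eigvals)[::-1]]
```

Coding a patch then amounts to keeping (and quantizing) its first few coefficients in this basis, since the leading eigenvectors capture the most training variance.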

Fig. 12. Facial image compression with a bit-rate of 400 bytes: comparing the results of JPEG2000, PCA, and our K-SVD method. The values in the brackets are the representation RMSE.


[Elad et al. 2009] — compressed faces f_k at 400 bytes.

Page 91: Adaptive Signal and Image Processing

Learning: PCA

Dictionary learning: C = {D : ‖d_m‖ ≤ 1}

Exemplars Y

Constraints on the Learning

[Figure 3 from “Online Learning for Matrix Factorization and Sparse Coding”: results obtained by (a) PCA, (c) NMF, (e) dictionary learning, and (b, d, f) SPCA with τ = 70%, 30%, 10%.]

min_{X, D ∈ C}  (1/2) ‖Y − D X‖² + λ ‖X‖₁
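One way to attack this objective is to alternate a few ISTA steps on X with a projected gradient step on D, under the dictionary-learning constraint C = {D : ‖d_m‖ ≤ 1}. A didactic sketch (helper names are ours, not a reference solver):

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def learn_dictionary(Y, k, lam=0.1, n_iter=50, rng=None):
    """Alternating minimization of 0.5 ||Y - D X||^2 + lam ||X||_1
    over X and D in C = {D : ||d_m|| <= 1}."""
    rng = np.random.default_rng(rng)
    n, p = Y.shape
    D = rng.standard_normal((n, k))
    D /= np.linalg.norm(D, axis=0)
    X = np.zeros((k, p))
    for _ in range(n_iter):
        # sparse coding: ISTA steps on X, D fixed
        L = np.linalg.norm(D, 2) ** 2                 # Lipschitz constant
        for _ in range(20):
            X = soft(X - (D.T @ (D @ X - Y)) / L, lam / L)
        # dictionary update: one projected gradient step on D, X fixed
        Lx = max(np.linalg.norm(X, 2) ** 2, 1e-12)
        D -= ((D @ X - Y) @ X.T) / Lx
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # project onto C
    return D, X
```

Both half-steps decrease the objective (ISTA with a 1/L step, and projected gradient onto the convex set C), so the scheme monotonically improves over the X = 0 start.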

Page 92: Adaptive Signal and Image Processing

Learning: PCA, NMF

Dictionary learning: C = {D : ‖d_m‖ ≤ 1}

Non-negative matrix factorization: C = {D : ‖d_m‖ ≤ 1, D ≥ 0}

Exemplars Y

Constraints on the Learning

[Figure 3 from “Online Learning for Matrix Factorization and Sparse Coding”: results obtained by (a) PCA, (c) NMF, (e) dictionary learning, and (b, d, f) SPCA with τ = 70%, 30%, 10%.]

min_{X, D ∈ C}  (1/2) ‖Y − D X‖² + λ ‖X‖₁
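For the non-negative variant, the classical multiplicative updates of Lee & Seung solve the λ = 0 case, min ‖Y − DX‖² with D ≥ 0, X ≥ 0 (a sketch only; the ℓ₁-penalized, norm-constrained version on this slide needs a different solver):

```python
import numpy as np

def nmf(Y, k, n_iter=200, rng=0):
    """Lee & Seung multiplicative updates for min ||Y - D X||_F^2
    subject to D >= 0 and X >= 0."""
    rng = np.random.default_rng(rng)
    n, p = Y.shape
    D = rng.random((n, k)) + 1e-3
    X = rng.random((k, p)) + 1e-3
    for _ in range(n_iter):
        # multiplicative updates preserve non-negativity and
        # never increase the squared reconstruction error
        X *= (D.T @ Y) / (D.T @ D @ X + 1e-12)
        D *= (Y @ X.T) / (D @ X @ X.T + 1e-12)
    return D, X
```

Because the updates are purely multiplicative, zero entries stay zero, which is what gives NMF its parts-based, additive decompositions.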

Page 93: Adaptive Signal and Image Processing

Learning: PCA, NMF, Sparse PCA

Dictionary learning: C = {D : ‖d_m‖ ≤ 1}

Non-negative matrix factorization: C = {D : ‖d_m‖ ≤ 1, D ≥ 0}

Sparse-PCA: C = {D : ‖d_m‖² + λ‖d_m‖₁ ≤ 1}

Exemplars Y

Constraints on the Learning

[Figure 3 from “Online Learning for Matrix Factorization and Sparse Coding”: results obtained by (a) PCA, (c) NMF, (e) dictionary learning, and (b, d, f) SPCA with τ = 70%, 30%, 10%.]

min_{X, D ∈ C}  (1/2) ‖Y − D X‖² + λ ‖X‖₁
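The three constraint sets can be compared through how a dictionary column is mapped back into C after a gradient step. A sketch (for the Sparse-PCA set we use a radial rescaling onto the boundary, a feasible retraction rather than the exact Euclidean projection):

```python
import numpy as np

def project_dl(d):
    """Dictionary learning: C = {d : ||d||_2 <= 1}."""
    return d / max(np.linalg.norm(d), 1.0)

def project_nmf(d):
    """NMF: C = {d : ||d||_2 <= 1, d >= 0} -- clip to >= 0, then shrink."""
    return project_dl(np.maximum(d, 0.0))

def rescale_spca(d, lam):
    """Sparse-PCA: C = {d : ||d||_2^2 + lam * ||d||_1 <= 1}.
    Scale d by the t > 0 solving t^2 a + t lam b = 1, with
    a = ||d||^2 and b = ||d||_1 (feasible retraction)."""
    a, b = float(d @ d), float(np.abs(d).sum())
    if a + lam * b <= 1.0 or a == 0.0:
        return d
    t = (-lam * b + np.sqrt(lam * lam * b * b + 4.0 * a)) / (2.0 * a)
    return t * d
```

Larger λ shrinks the Sparse-PCA feasible set toward smaller-norm atoms whose mass concentrates on few entries, which is what produces the sparse atoms seen in the SPCA panels of Figure 3.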

Page 94: Adaptive Signal and Image Processing

Translation invariance + patches: [Aharon & Elad 2008], [Jojic et al. 2003]

Low-dimensional dictionary parameterization: D = (d_m)_m

Dictionary D

Image f

d_m(t) = d(z_m + t)

Signature d

Dictionary Signature / Epitome
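The parameterization d_m(t) = d(z_m + t) says that every atom is a patch read off one small "signature" (epitome) image d at offset z_m. Extracting the induced dictionary is then a matter of sliding a window (a sketch; names are ours):

```python
import numpy as np

def epitome_dictionary(signature, w):
    """All w x w patches of the signature image: one normalized
    column d_m per valid top-left corner z_m = (i, j)."""
    H, W = signature.shape
    cols = []
    for i in range(H - w + 1):
        for j in range(W - w + 1):
            patch = signature[i:i + w, j:j + w].ravel()
            nrm = np.linalg.norm(patch)
            cols.append(patch / nrm if nrm > 0 else patch)
    return np.stack(cols, axis=1)   # shape (w*w, n_atoms)
```

An E × E signature with only E² parameters yields (E − w + 1)² atoms of dimension w², which is the low-dimensional parameterization the slide refers to: neighboring atoms share pixels, so the dictionary is learned faster and carries the atoms' spatial layout.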

Figure 7. 1, 2, 4 and 20 epitomes learned on the barbara image for the same parameters. They are of sizes 42, 32, 25 and 15, in order to keep the same number of elements in D. They are not represented to scale.

5.3. Influence of the Number of Epitomes

We present in this section an experiment where the number of learned epitomes varies, while keeping the same number of columns in D. The 1, 2, 4 and 20 epitomes learned on the image barbara are shown in Figure 7. When the number of epitomes is small, we observe in the epitomes some discontinuities between texture areas with different visual characteristics, which is not the case when learning several independent epitomes.

5.4. Application to Denoising

In order to evaluate the performance of epitome learning in various regimes (single epitome, multiple epitomes), we use the same methodology as [1], which uses the successful denoising method first introduced by [9]. Let us consider first the classical problem of restoring a noisy image y in R^n which has been corrupted by white Gaussian noise of standard deviation σ. We denote by y_i in R^m the patch of y centered at pixel i (with any arbitrary ordering of the image pixels).

The method of [9] proceeds as follows:

• Learn a dictionary D adapted to all overlapping patches y₁, y₂, . . . from the noisy image y.

• Approximate each noisy patch using the learned dictionary with a greedy algorithm called orthogonal matching pursuit (OMP) [17], obtaining a clean estimate of every patch y_i by addressing the following problem

argmin_{α_i ∈ R^p} ‖α_i‖₀  s.t.  ‖y_i − D α_i‖₂² ≤ C σ²,

where D α_i is a clean estimate of the patch y_i, ‖α_i‖₀ is the ℓ₀ pseudo-norm of α_i, and C is a regularization parameter. Following [9], we choose C = 1.15.

• Since every pixel in y admits many clean estimates (one estimate for every patch the pixel belongs to), average the estimates.
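The three steps above can be strung together as follows. This is illustrative only: we skip the learning step and plug in an arbitrary fixed dictionary D, and we scale the residual threshold by the patch dimension w² so that it matches the expected noise energy (conventions for the stopping constant vary across implementations):

```python
import numpy as np

def omp(D, x, eps2):
    """Greedy L0 coding: stop once ||x - D a||^2 <= eps2."""
    a = np.zeros(D.shape[1])
    support, r = [], x.copy()
    while (r @ r) > eps2 and len(support) < D.shape[0]:
        support.append(int(np.argmax(np.abs(D.T @ r))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        a[:] = 0.0
        a[support] = coef
        r = x - D @ a
    return a

def denoise(y, D, w, sigma, C=1.15):
    """Sparse-code every overlapping w x w patch of y with threshold
    C * w^2 * sigma^2, then average the clean estimates per pixel."""
    H, W = y.shape
    acc = np.zeros_like(y)
    cnt = np.zeros_like(y)
    eps2 = C * (w * w) * sigma ** 2
    for i in range(H - w + 1):
        for j in range(W - w + 1):
            patch = y[i:i + w, j:j + w].ravel()
            est = D @ omp(D, patch, eps2)
            acc[i:i + w, j:j + w] += est.reshape(w, w)
            cnt[i:i + w, j:j + w] += 1.0
    return acc / cnt
```

The averaging in the last step is what turns per-patch estimates into a translation-invariant estimate of the whole image.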

Figure 8. Artificially noised boat image (with standard deviation σ = 15), and the result of our denoising algorithm.

Quantitative results for a single epitome and for multi-scale multi-epitomes are presented in Table 1, on six images and five levels of noise. We evaluate the performance of the denoising process by computing the peak signal-to-noise ratio (PSNR) for each pair of images. For each level of noise, we have selected the best regularization parameter λ over all six images, and have then used it in all the experiments. The PSNR values are averaged over 5 experiments with 5 different noise realizations. The mean standard deviation is of 0.05 dB both for the single epitome and the multi-scale multi-epitomes.

We see from this experiment that the formulation we propose is competitive compared to the one of [1]. Learning multiple epitomes instead of a single one seems to provide better results, which might be explained by the lack of flexibility of the single-epitome representation. Evidently, these results are not as good as recent state-of-the-art denoising algorithms such as [7, 15], which exploit more sophisticated

ourselves for simplicity to the case of single and multiple epitomes of the same size and shape.

The multi-epitome version of our approach can be seen as an interpolation between a classical dictionary and a single epitome. Indeed, defining a multitude of epitomes of the same size as the considered patches is equivalent to working with a dictionary. Defining a large number of epitomes slightly larger than the patches is equivalent to shift-invariant dictionaries. In Section 5, we experimentally compare these different regimes for the task of image denoising.

4.4. Initialization

Because of the nonconvexity of the optimization problem, the question of the initialization is an important issue in epitome learning. We have already mentioned a multi-scale strategy to overcome this issue, but for the first scale, the problem remains. Whereas classical flat dictionaries can naturally be initialized with prespecified dictionaries such as an overcomplete DCT basis (see [9]), the epitome does not admit such a natural choice. In all the experiments (unless written otherwise), we use as the initialization a single epitome (or a collection of epitomes), common to all experiments, which is learned using our algorithm, initialized with a Gaussian low-pass filtered random image, on a set of 100,000 random patches extracted from 5000 natural images (all different from the test images used for denoising).

5. Experimental Validation

Figure 4. House, Peppers, Cameraman, Lena, Boat and Barbara images.

We provide in this section qualitative and quantitative validation. We first study the influence of the different model hyperparameters on the visual aspect of the epitome before moving to an image denoising task. We choose to represent the epitomes as images in order to visualize more easily the patches that will be extracted to form the images. Since epitomes contain negative values, they are arbitrarily rescaled between 0 and 1 for display.

In this section, we will work with several images, which are shown in Figure 4.

5.1. Influence of the Initialization

In order to measure the influence of the initialization on the resulting epitome, we have run the same experiment with different initializations. Figure 5 shows the different results obtained.

The difference in contrast may be due to the scaling of the data in the displaying process. This experiment illustrates that different initializations lead to visually different epitomes. Whereas this property might not be desirable, the classical dictionary learning framework also suffers from this issue, yet it has led to successful applications in image processing [9].

Figure 5. Three epitomes obtained on the boat image for different initializations, but all with the same parameters. Left: epitome obtained with initialization on an epitome learned on random patches from natural images. Middle and Right: epitomes obtained for two different random initializations.

5.2. Influence of the Size of the Patches

The size of the patches seems to play an important role in the visual aspect of the epitome. We illustrate in Figure 6 an experiment where pairs of epitomes of size 46 × 46 are learned with different sizes of patches.

Figure 6. Pairs of epitomes of width 46 obtained for patches of width 6, 8, 9, 10 and 12. All other parameters are unchanged. Experiments run with 2 scales (20 iterations for the first scale, 5 for the second) on the house image.

As we see, learning epitomes with small patches seems to introduce finer details and structures in the epitome, whereas large patches induce epitomes with coarser structures.

Page 95: Adaptive Signal and Image Processing

Translation invariance + patches: [Aharon & Elad 2008], [Jojic et al. 2003]

Low-dimensional dictionary parameterization: D = (d_m)_m

Dictionary D

Image f

→ Faster learning.
→ Makes use of the atoms' spatial locations z_m.

d_m(t) = d(z_m + t)

Signature d

Dictionary Signature / Epitome

Figure 7. 1, 2, 4 and 20 epitomes learned on the barbara imagefor the same parameters. They are of sizes 42, 32, 25 and 15 inorder to keep the same number of elements in D. They are notrepresented to scale.

5.3. Influence of the Number of EpitomesWe present in this section an experiment where the num-

ber of learned epitomes vary, while keeping the same num-bers of columns in D. The 1, 2, 4 and 20 epitomes learnedon the image barbara are shown in Figure 7. When the num-ber of epitomes is small, we observe in the epitomes somediscontinuities between texture areas with different visualcharacteristics, which is not the case when learning severalindependant epitomes.

5.4. Application to DenoisingIn order to evaluate the performance of epitome learn-

ing in various regimes (single epitome, multiple epitomes),we use the same methodology as [1] that uses the success-ful denoising method first introduced by [9]. Let us con-sider first the classical problem of restoring a noisy image y

in Rn which has been corrupted by a white Gaussian noiseof standard deviation ⇥. We denote by yi in Rm the patchof y centered at pixel i (with any arbitrary ordering of theimage pixels).

The method of [9] proceeds as follows:

• Learn a dictionary D adapted to all overlappingpatches y1,y2, . . . from the noisy image y.

• Approximate each noisy patch using the learned dic-tionary with a greedy algorithm called orthogonalmatching pursuit (OMP) [17] to have a clean estimateof every patch of yi by addressing the following prob-lem

argmin↵i�Rp

⇤↵i⇤0 s.t. ⇤yi �D↵i⇤22 � (C⇥2),

where D↵i is a clean estimate of the patch yi, ⇤↵i⇤0

is the ⌃0 pseudo-norm of ↵i, and C is a regularizationparameter. Following [9], we choose C = 1.15.

• Since every pixel in y admits many clean estimates(one estimate for every patch the pixel belongs to), av-erage the estimates.

Figure 8. Artificially noised boat image (with standard deviation� = 15), and the result of our denoising algorithm.

Quantitative results for single epitome, and multi-scalemulti-epitomes are presented in Table 1 on six images andfive levels of noise. We evaluate the performance of the de-noising process by computing the peak signal-to-noise ratio(PSNR) for each pair of images. For each level of noise,we have selected the best regularization parameter � overallthe six images, and have then used it all the experiments.The PNSR values are averaged over 5 experiments with 5different noise realizations. The mean standard deviation isof 0.05dB both for the single epitome and the multi-scalemulti-epitomes.

We see from this experiment that the formulation we pro-pose is competitive compared to the one of [1]. Learningmulti epitomes instead of a single one seems to provide bet-ter results, which might be explained by the lack of flexi-bility of the single epitome representation. Evidently, theseresults are not as good as recent state-of-the-art denoisingalgorithms such as [7, 15] which exploit more sophisticated

2918

ourselves for simplicity to the case of single and multipleepitomes of the same size and shape.

The multi-epitome version of our approach can be seenas an interpolation between classical dictionary and singleepitome. Indeed, defining a multitude of epitomes of thesame size as the considered patches is equivalent to work-ing with a dictionary. Defining a large number a epito-mes slightly larger than the patches is equivalent to shift-invariant dictionaries. In Section 5, we experimentally com-pare these different regimes for the task of image denoising.

4.4. InitializationBecause of the nonconvexity of the optimization prob-

lem, the question of the initialization is an important issuein epitome learning. We have already mentioned a multi-scale strategy to overcome this issue, but for the first scale,the problem remains. Whereas classical flat dictionaries cannaturally be initialized with prespecified dictionaries suchas overcomplete DCT basis (see [9]), the epitome does notadmit such a natural choice. In all the experiences (un-less written otherwise), we use as the initialization a singleepitome (or a collection of epitomes), common to all ex-periments, which is learned using our algorithm, initializedwith a Gaussian low-pass filtered random image, on a set of100 000 random patches extracted from 5 000 natural im-ages (all different from the test images used for denoising).

5. Experimental Validation

Figure 4. House, Peppers, Cameraman, Lena, Boat and Barbaraimages.

We provide in this section qualitative and quantitativevalidation. We first study the influence of the differentmodel hyperparameters on the visual aspect of the epitomebefore moving to an image denoising task. We choose torepresent the epitomes as images in order to visualize moreeasily the patches that will be extracted to form the images.Since epitomes contain negative values, they are arbitrarilyrescaled between 0 and 1 for display.

In this section, we will work with several images, which are shown in Figure 4.

5.1. Influence of the Initialization

In order to measure the influence of the initialization on the resulting epitome, we have run the same experiment with different initializations. Figure 5 shows the different results obtained.

The difference in contrast may be due to the scaling of the data in the displaying process. This experiment illustrates that different initializations lead to visually different epitomes. While this property might not be desirable, the classical dictionary learning framework suffers from the same issue, yet has led to successful applications in image processing [9].

Figure 5. Three epitomes obtained on the boat image for different initializations, but all with the same parameters. Left: epitome obtained with initialization on an epitome learned on random patches from natural images. Middle and right: epitomes obtained for two different random initializations.

5.2. Influence of the Size of the Patches

The size of the patches seems to play an important role in the visual aspect of the epitome. We illustrate in Figure 6 an experiment where pairs of epitomes of size 46 × 46 are learned with different sizes of patches.

Figure 6. Pairs of epitomes of width 46 obtained for patches of width 6, 8, 9, 10 and 12. All other parameters are unchanged. Experiments run with 2 scales (20 iterations for the first scale, 5 for the second) on the house image.

As we see, learning epitomes with small patches seems to introduce finer details and structures in the epitome, whereas large patches induce epitomes with coarser structures.


Page 96: Adaptive Signal and Image Processing

Conclusion

Page 97: Adaptive Signal and Image Processing

Conclusion

Page 98: Adaptive Signal and Image Processing

Conclusion

Fourier

wavelets, curvelets, bandlets

texturelets?

• Quest for the best representation:

Page 99: Adaptive Signal and Image Processing

Conclusion

More sparsity ⇒ better prior

⇒ better recovery.

Convex sparsity prior: ℓ¹. Compressed sensing.

• Inverse problems regularization:

∑_m |⟨f, ψ_m⟩|

Fourier

wavelets, curvelets, bandlets

texturelets?

• Quest for the best representation:

Page 100: Adaptive Signal and Image Processing

Conclusion

More sparsity ⇒ better prior

⇒ better recovery.

Convex sparsity prior: ℓ¹. Compressed sensing.

• Inverse problems regularization:

∑_m |⟨f, ψ_m⟩|

Wavelets

More sparsity ⇒ better synthesis

• Texture synthesis:

Fourier

wavelets, curvelets, bandlets

texturelets?

• Quest for the best representation:
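The convex sparsity prior ∑_m |⟨f, ψ_m⟩| on the concluding slides can be illustrated with a tiny orthogonal Haar transform (a sketch of the general idea, not code from the course): a signal that is sparse in the basis has a small ℓ¹ prior, which is what makes it a good candidate for the recovery and synthesis tasks listed above.

```python
import numpy as np

def haar(f):
    """Full orthogonal 1D Haar transform (len(f) a power of two)."""
    f = np.asarray(f, dtype=float)
    out = []
    while len(f) > 1:
        a = (f[0::2] + f[1::2]) / np.sqrt(2)  # approximation coefficients
        d = (f[0::2] - f[1::2]) / np.sqrt(2)  # detail coefficients
        out.append(d)
        f = a
    out.append(f)
    return np.concatenate(out[::-1])

def l1_prior(f):
    """Convex sparsity prior sum_m |<f, psi_m>| in the Haar basis."""
    return np.abs(haar(f)).sum()

# A piecewise-constant signal is much sparser in the Haar basis than an
# oscillating one of the same energy, so its l1 prior is smaller.
step = np.concatenate([np.ones(8), -np.ones(8)])
osc = np.cos(np.pi * np.arange(16))  # alternating +1/-1, same l2 norm
print(l1_prior(step) < l1_prior(osc))  # True
```

Both signals have the same ℓ² energy, yet the step function concentrates into a couple of Haar coefficients while the oscillation spreads over all details, which is the "more sparsity ⇒ better prior" slogan in miniature.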