
    Universal Regularizers for Robust Sparse Coding and Modeling

    Ignacio Ramírez, Student Member, IEEE, and Guillermo Sapiro, Member, IEEE

    Abstract: Sparse data models, where data is assumed to be well represented as a linear combination of a few elements from a dictionary, have gained considerable attention in recent years, and their use has led to state-of-the-art results in many signal and image processing tasks. It is now well understood that the choice of the sparsity regularization term is critical in the success of such models. Based on a codelength minimization interpretation of sparse coding, and using tools from universal coding theory, we propose a framework for designing sparsity regularization terms which have theoretical and practical advantages when compared with the more standard $\ell_0$ or $\ell_1$ ones. The presentation of the framework and theoretical foundations is complemented with examples that show its practical advantages in image denoising, zooming, and classification.

    Index Terms: Classification, denoising, dictionary learning, sparse coding, universal coding, zooming.

    I. INTRODUCTION

    SPARSE modeling calls for constructing a succinct representation of some data as a combination of a few typical patterns (atoms) learned from the data itself. Significant contributions to the theory and practice of learning such collections of atoms (usually called dictionaries or codebooks) [1]–[3], and of representing the actual data in terms of them [4]–[6], have been developed in recent years, leading to state-of-the-art results in many signal and image processing tasks [7]–[10]. We refer the reader to [11] for a recent review on the subject.

    A critical component of sparse modeling is the actual sparsity of the representation, which is controlled by a regularization term (regularizer for short) and its associated parameters. The choice of the functional form of the regularizer and its parameters is a challenging task. Several solutions to this problem have been proposed in the literature, ranging from the automatic tuning of the parameters [12] to Bayesian models, where these parameters are considered as random variables [12]–[14].

    Manuscript received August 2, 2010; revised January 25, 2012; accepted April 9, 2012. Date of publication May 1, 2012; date of current version August 22, 2012. This work was supported in part by NSF, NGA, ARO, ONR, NSSEFF, and Fundaciba-Antel. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mark R. Bell.

    I. Ramírez was with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455-0170 USA. He is now with the Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República, Montevideo 11300, Uruguay (e-mail: [email protected]).

    G. Sapiro is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455-0170 USA (e-mail: [email protected]).

    Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TIP.2012.2197006

    In this paper, we adopt the interpretation of sparse coding as a codelength minimization problem. This is a natural and objective method for assessing the quality of a statistical model for describing given data, and it is based on the minimum description length (MDL) principle [15]. Here, the regularization term in the sparse coding formulation is interpreted as the cost in bits of describing the sparse coefficients, which are used to reconstruct the data. Several image processing works using this approach were developed in the 1990s, following the popularization of MDL as a powerful modeling tool [16]–[18]. The focus of these early papers was on denoising using wavelet bases, using either an early asymptotic form of MDL, or fixed probability models, to compute the description length of the coefficients. A later, major breakthrough in MDL theory was the adoption of universal coding for computing optimal codelengths. In this paper, we improve and extend on previous results in this line of work by designing regularization terms based on universal codes for natural image coefficients. The resulting framework not only formalizes sparse coding from the MDL and universal coding perspectives but also leads to a family of universal regularizers which we show to consistently improve results in image processing tasks such as denoising and classification. These models also enjoy several desirable theoretical and practical properties such as statistical consistency (in certain cases), improved robustness to outliers in the data, and improved sparse signal recovery (e.g., decoding of sparse signals from a compressive sensing point of view [19]) when compared with the traditional $\ell_0$- and $\ell_1$-based techniques in practice. This is complemented by a simple and efficient optimization technique for solving the corresponding sparse coding problems as a series of weighted $\ell_1$ subproblems, which, in turn, can be solved with off-the-shelf algorithms such as least angle regression (LARS) [5] or iterative soft thresholding (IST) [4]. Details are given in the following sections.

    Finally, we apply our universal regularizers not only for coding using fixed dictionaries, but also for learning the dictionaries themselves, leading to further improvements in all the aforementioned tasks. Since the original submission of this paper, we have built upon this work by using these regularizers as building blocks of a new MDL-based framework for sparse coding and dictionary learning that is completely parameter free. See [20] for details and additional example applications of the ideas developed in this paper.

    The remainder of this paper is organized as follows. In Section II, we introduce the standard framework of sparse modeling. Section III is dedicated to the derivation of our proposed universal sparse modeling framework, and Section IV deals with its implementation. Section V presents experimental results showing the practical benefits of the proposed framework in image denoising, zooming, and classification tasks. Concluding remarks are given in Section VI.

    II. BACKGROUND ON SPARSE MODELING

    Let $X \in \mathbb{R}^{M \times N}$ be a set of $N$ column data samples $x_j \in \mathbb{R}^M$, $D \in \mathbb{R}^{M \times K}$ a dictionary of $K$ column atoms $d_k \in \mathbb{R}^M$, and $A \in \mathbb{R}^{K \times N}$, $a_j \in \mathbb{R}^K$, a set of reconstruction coefficients such that $X = DA$. We use $a_k^T$ to denote the $k$th row of $A$, the coefficients associated with the $k$th atom in $D$. For each $j = 1, \ldots, N$ we define the active set of $a_j$ as $\mathcal{A}_j = \{k : a_{kj} \neq 0,\ 1 \leq k \leq K\}$, and $\|a_j\|_0 = |\mathcal{A}_j|$ as its cardinality. The goal of sparse modeling is to design a dictionary $D$ such that for all or most data samples $x_j$, there exists a vector $a_j$ such that $x_j \approx D a_j$ and $\|a_j\|_0$ is small. One way to formalize this goal is by solving

    $$\min_{D,A} \sum_{j=1}^{N} \psi(a_j) \quad \text{s.t.} \quad \|x_j - D a_j\|_2^2 \leq \epsilon \ \ \forall j, \quad \|d_k\|_2 \leq 1 \ \ \forall k \qquad (1)$$

    where $\psi(\cdot)$ is a regularization term which induces sparsity in the columns of the solution $A$. The norm constraint on $d_k$ is added to prevent an arbitrary decrease of the cost function by means of $D \rightarrow \alpha D$, $A \rightarrow \frac{1}{\alpha} A$, $\alpha > 1$. For fixed $D$, the problem of finding a sparse $a_j$ for each sample $x_j$ is called sparse coding

    $$a_j = \arg\min_{a} \ \psi(a) \quad \text{s.t.} \quad \|x_j - D a\|_2^2 \leq \epsilon. \qquad (2)$$

    Among possible choices of $\psi(\cdot)$ are the $\ell_0$ pseudo-norm, $\psi(\cdot) = \|\cdot\|_0$, and the $\ell_1$ norm. The former tries to solve directly for the sparsest $a_j$, but as it is nonconvex, it is commonly replaced by the $\ell_1$ norm, its closest convex approximation. Further, under certain conditions on (fixed) $D$ and the sparsity of $a_j$, the solutions to the $\ell_0$- and $\ell_1$-based sparse coding problems coincide (see [19]). Equation (1) is also usually formulated in Lagrangian form

    $$\min_{D,A} \sum_{j=1}^{N} \|x_j - D a_j\|_2^2 + \lambda\psi(a_j) \qquad (3)$$

    along with its respective problem when $D$ is fixed

    $$a_j = \arg\min_{a} \ \|x_j - D a\|_2^2 + \lambda\psi(a), \quad \|d_k\|_2 \leq 1 \ \ \forall k. \qquad (4)$$

    Even when $\psi(\cdot)$ is convex, (1) and (3) are jointly nonconvex in $(D, A)$. The standard approach to find an approximate solution to (1) [correspondingly (3)] is to use alternate minimization

    in $A$ and $D$. Here, updating $A$ is performed by solving (2) [correspondingly (4)], using, for example, IST [4] or LARS [5] when $\psi(\cdot) = \|\cdot\|_1$, or orthogonal matching pursuit (OMP) [21] when $\psi(\cdot) = \|\cdot\|_0$. The update of $D$ (dictionary update step) can be done using, for example, MOD [2] or K-SVD [1].
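    To make the alternate minimization structure concrete, the following minimal Python/NumPy sketch alternates an ISTA-based $\ell_1$ sparse coding step with an MOD-style dictionary update. It is our own illustration of the generic scheme described above (function names and the ISTA/MOD choices are ours, not necessarily the algorithms used in the paper):

        import numpy as np

        def soft_threshold(z, t):
            # Proximal operator of t * ||.||_1 (element-wise soft thresholding).
            return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

        def sparse_code_ista(X, D, lam, n_iter=100):
            # Approximately solve min_A ||X - D A||_F^2 + lam * sum|A| via ISTA.
            L = np.linalg.norm(D, 2) ** 2            # largest singular value squared
            A = np.zeros((D.shape[1], X.shape[1]))
            for _ in range(n_iter):
                grad = D.T @ (D @ A - X)             # half of the true gradient
                A = soft_threshold(A - grad / L, lam / (2 * L))
            return A

        def dictionary_update_mod(X, A, eps=1e-12):
            # MOD-style least-squares update of D, then projection of each atom
            # onto the unit ball ||d_k||_2 <= 1, as required in (1).
            D = X @ A.T @ np.linalg.pinv(A @ A.T + eps * np.eye(A.shape[0]))
            return D / np.maximum(np.linalg.norm(D, axis=0), 1.0)

        def learn_dictionary(X, K=256, lam=0.1, n_outer=20, seed=0):
            # Alternate minimization in A (sparse coding) and D (dictionary update).
            rng = np.random.default_rng(seed)
            D = rng.standard_normal((X.shape[0], K))
            D /= np.linalg.norm(D, axis=0)
            for _ in range(n_outer):
                A = sparse_code_ista(X, D, lam)
                D = dictionary_update_mod(X, A)
            return D, A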

    A. Interpretations of the Sparse Coding Problem

    The problem of finding the sparsest vector of coefficients $a_j$ that yields a good approximation of $x_j$ admits several interpretations. Following is a summary of these interpretations and the insights that they provide into sparse models that are relevant to our derivation.

    1) Model Selection in Statistics: Using the $\ell_0$ norm as $\psi(\cdot)$ in (4) is known in the statistics community as the Akaike information criterion when $\lambda = 1$, or the Bayes information criterion (BIC) when $\lambda = \frac{1}{2}\log M$, two popular forms of model selection (see [22, Ch. 7]). The $\ell_1$ regularizer was introduced in [23], again as a convex approximation of the above model selection methods, and is commonly known (either in its constrained or Lagrangian form) as the Lasso. Note, however, that in the regression interpretation of (4), the roles of $D$ and $X$ are very different.

    2) Maximum a Posteriori (MAP): Another interpretation of (4) is that of a MAP estimation of $a_j$ in the logarithmic scale, that is

    $$a_j = \arg\max_{a} \{\log P(a|x_j)\} = \arg\max_{a} \{\log P(x_j|a) + \log P(a)\} = \arg\min_{a} \{-\log P(x_j|a) - \log P(a)\} \qquad (5)$$

    where the observed samples $x_j$ are assumed to be contaminated with additive, zero-mean, independently and identically distributed (IID) Gaussian noise with variance $\sigma^2$, $P(x_j|a) \propto e^{-\frac{1}{2\sigma^2}\|x_j - D a\|_2^2}$, and a prior probability model on $a$ of the form $P(a) \propto e^{-\theta\psi(a)}$ is considered. The energy term in (4) follows by plugging the previous two probability models into (5) and factoring out $2\sigma^2$, so that $\lambda = 2\sigma^2\theta$. According to (5), the $\ell_1$ regularizer corresponds to an IID Laplacian prior with mean $0$ and inverse-scale parameter $\theta$, $P(a) = \prod_{k=1}^{K} \frac{\theta}{2} e^{-\theta|a_k|} = \left(\frac{\theta}{2}\right)^K e^{-\theta\|a\|_1}$.
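    To make the factorization explicit, a one-line expansion of (5) under these two models (added here for clarity; constants independent of $a$ are dropped) gives

    $$-\log P(x_j|a) - \log P(a) = \frac{1}{2\sigma^2}\|x_j - D a\|_2^2 + \theta\|a\|_1 + \mathrm{const} \ \propto\ \|x_j - D a\|_2^2 + 2\sigma^2\theta\,\|a\|_1$$

    so that minimizing (5) is exactly (4) with $\psi(a) = \|a\|_1$ and $\lambda = 2\sigma^2\theta$.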

    3) Codelength Minimization: Sparse coding, in all its forms, has yet another important interpretation. Suppose that we have a fixed dictionary $D$ and that we want to use it to compress an image, either losslessly, by encoding the reconstruction coefficients $A$ and the residual $X - DA$, or in a lossy manner, by obtaining a good approximation $X \approx DA$ and encoding only $A$. Consider for example the latter case. Most modern compression schemes consist of two parts: 1) a probability assignment stage, where the data, in this case $A$, is assigned a probability $P(A)$, and 2) an encoding stage, where a code $C(A)$ of length $L(A)$ bits is assigned to the data given its probability, so that $L(A)$ is as short as possible. The techniques known as arithmetic and Huffman coding provide the best possible solution for the encoding step, which is to approximate the Shannon ideal codelength $L^\star(A) = -\log P(A)$ [24, Ch. 5]. Therefore, modern compression theory deals with finding the coefficients $A$ that minimize $-\log P(A)$. In the lossy case, where we admit a certain $\ell_2$ distortion on each $x_j$, $\|x_j - D a_j\|_2^2 \leq \epsilon$, and assuming that $P(A) = \prod_j P(a_j)$, we can obtain the optimum $a_j$ for each sample $x_j$ via

    $$a_j = \arg\min_{a} \ -\log P(a) \quad \text{s.t.} \quad \|x_j - D a\|_2^2 \leq \epsilon \qquad (6)$$

    which, for the choice $P(a) \propto e^{-\psi(a)}$, coincides with the error-constrained sparse coding problem (2). In the lossless case, we also need to encode the residual $x_j - D a_j$. Because $P(x, a) = P(x|a)P(a)$, the total codelength will be

    $$L(x_j, a_j) = -\log P(x_j, a_j) = -\log P(x_j|a_j) - \log P(a_j). \qquad (7)$$


    Thus, computing $a_j$ amounts to solving $\min_a L(x_j, a)$, which is precisely the MAP formulation of (5), which in turn, for proper choices of $P(x|a)$ and $P(a)$, leads to the Lagrangian form of sparse coding (4).¹

    As one can see, the codelength interpretation of sparse coding is able to unify and interpret both the constrained and unconstrained formulations into one consistent framework. Furthermore, this framework offers a natural and objective measure for comparing the quality of different models $P(x|a)$ and $P(a)$ in terms of the codelengths obtained.
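    As a concrete illustration of (7), the following Python sketch (our own, hypothetical example) computes the two-part codelength of a patch under a Gaussian residual model and a Laplacian coefficient model, with coefficients quantized to a precision delta as discussed in footnote 1 (the analogous quantization constant for the residual is omitted, as in the text):

        import numpy as np

        def codelength_bits(x, D, a, sigma, theta, delta=1.0 / 256):
            # Two-part codelength L(x, a) = -log2 P(x|a) - log2 P(a), cf. (7).
            # P(x|a): IID Gaussian residual with variance sigma^2.
            # P(a):   IID Laplacian coefficients quantized to precision delta,
            #         so that P(a_k) ~= delta * f(a_k), f the Laplacian density.
            r = x - D @ a
            nll_residual = (np.sum(r ** 2) / (2 * sigma ** 2)
                            + 0.5 * len(r) * np.log(2 * np.pi * sigma ** 2))
            f = 0.5 * theta * np.exp(-theta * np.abs(a))
            nll_coeffs = -np.sum(np.log(delta * f))
            return (nll_residual + nll_coeffs) / np.log(2)   # nats -> bits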

    4) Remarks on Related Work: As mentioned in the introduction, the codelength interpretation of signal coding was already studied in the context of orthogonal wavelet-based denoising. An early example of this line of work considers a regularization term which uses the Shannon entropy function $-\sum_i p_i \log p_i$ to give a measure of the sparsity of the solution [16]. However, the entropy function is not used as a measure of the ideal codelength for describing the coefficients, but as a measure of the sparsity (actually, group sparsity) of the solution. The MDL principle was applied to the signal estimation problem in [18]. In this case, the codelength term includes the description of both the location and the magnitude of the nonzero coefficients. Although a pioneering effort, the model assumed in [18] for the coefficient magnitudes is a uniform distribution on [0, 1], which does not exploit a priori knowledge of image coefficient statistics, and the description of the support is slightly wasteful. Furthermore, the codelength expression used is an asymptotic result, actually equivalent to BIC (see Section II-A1), which can be misleading when working with small sample sizes, such as when encoding small image patches, as in current state-of-the-art image processing applications. The uniform distribution was later replaced by the universal code for integers [25] in [17]. However, as in [18], the model is so general that it does not perform well for the specific case of coefficients arising from image decompositions, leading to poor results. In contrast, our models are derived following a careful analysis of image coefficient statistics. Finally, probability models suitable to image coefficient statistics of the form $P(a) \propto e^{-\theta|a|^{\beta}}$ (known as generalized Gaussians) were applied to the MDL-based signal coding and estimation framework in [17]. The justification for such models is based on the empirical observation that sparse coefficient statistics exhibit heavy tails (see Section II-B). However, the choice is ad hoc and no optimality criterion is available to compare it with other possibilities. Moreover, there is no closed form solution for performing parameter estimation on such families of models, requiring numerical optimization techniques.

    estimation on such family of models, requiring numerical

    ¹Laplacian models, as well as Gaussian models, are probability distributions over $\mathbb{R}$, characterized by continuous probability density functions, $f(a) = F'(a)$, $F(a) = P(x \leq a)$. If the reconstruction coefficients are considered real numbers, under any of these distributions, any instance of $A \in \mathbb{R}^{K \times N}$ will have measure $0$, that is, $P(A) = 0$. To use such distributions as our models for the data, we assume that the coefficients in $A$ are quantized to a precision $\Delta$, small enough for the density function $f(a)$ to be approximately constant in any interval $[a - \Delta/2, a + \Delta/2]$, $a \in \mathbb{R}$, so that we can approximate $P(a) \approx \Delta f(a)$, $a \in \mathbb{R}$. Under these assumptions, $-\log P(a) \approx -\log f(a) - \log \Delta$, and the effect of $\Delta$ on the codelength produced by any model is the same. Therefore, we will omit $\Delta$ in the sequel, and treat density functions and probability distributions interchangeably as $P(\cdot)$. Of course, in real compression applications, $\Delta$ needs to be tuned.

    In Section III, we derive a number of probability models for which parameter estimation can be computed efficiently in closed form, and which are guaranteed to optimally describe image coefficients.

    B. Need for a Better Model

    As explained in Section II-A2, the use of the $\ell_1$ regularizer implies that all the coefficients in $A$ share the same Laplacian parameter $\theta$. However, as noted in [26] and references therein, the empirical variance of coefficients associated with different atoms, that is, of the different rows $a_k^T$ of $A$, varies greatly with $k = 1, \ldots, K$. This is clearly seen in Fig. 1(a)–(c), which show the empirical distribution of discrete cosine transform (DCT) coefficients of $8 \times 8$ patches. As the variance of a Laplacian is $2/\theta^2$, different variances indicate different underlying $\theta$. The histogram of the set $\{\hat{\theta}_k, k = 1, \ldots, K\}$ of estimated Laplacian parameters for each row $k$, Fig. 1(d), shows that this is indeed the case, with significant occurrences of values of $\theta$ in a range of 5–25. The straightforward modification suggested by this phenomenon is to use a model where each row of $A$ has its own weight associated with it, leading to a weighted $\ell_1$ regularizer. However, from a modeling perspective, this results in $K$ parameters to be adjusted instead of just one, which often results in poor generalization properties. For example, in the cases studied in Section V, even with thousands of images for learning these parameters, the results of applying the learned model to new images were always significantly worse (over 1 dB in estimation problems) when compared with those obtained using simpler models such as an unweighted $\ell_1$.² One reason for this failure may be that real images, as well as other types of signals such as audio samples, are far from stationary. In this case, even if each atom $k$ is associated with its own $\theta_k$ ($\lambda_k$), the optimal value of $\theta_k$ can have significant local variations at different positions or times. This effect is shown in Fig. 1(e), where, for each $k$, every $\theta_k$ was re-estimated several times using samples from different regions of an image, and the histogram of the different estimated values of $\theta_k$ was computed. Here, again, $D$ is the DCT basis.
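    The per-atom and local estimates behind Fig. 1(d)–(e) are easy to reproduce. A minimal sketch (ours), using the standard maximum-likelihood estimate of the Laplacian parameter, $\hat{\theta} = n / \sum_i |a_i|$:

        import numpy as np

        def laplacian_theta_mle(coeffs):
            # MLE of the Laplacian inverse-scale parameter: theta_hat = n / sum(|a_i|),
            # computed on the nonzero coefficients (see Section IV).
            a = coeffs[coeffs != 0]
            return len(a) / np.sum(np.abs(a))

        def per_atom_thetas(A):
            # One theta_hat per row of A, i.e., per dictionary atom, cf. Fig. 1(d).
            return np.array([laplacian_theta_mle(row) for row in A])

        def local_thetas(row, block=250, n_rep=4000, seed=0):
            # Re-estimate theta from random contiguous blocks of one row of A,
            # mimicking the local-variability experiment of Fig. 1(e).
            rng = np.random.default_rng(seed)
            starts = rng.integers(0, len(row) - block, size=n_rep)
            return np.array([laplacian_theta_mle(row[s:s + block]) for s in starts])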

    The need for a flexible model which simultaneously has a small number of parameters leads naturally to Bayesian formulations where the different possible $\theta_k$ are marginalized out by imposing a hyper-prior distribution on $\theta$, sampling $\theta$ using its posterior distribution, and then averaging the estimates obtained with the sampled sparse-coding problems. Examples of this recent line of work, and the closely related Bayesian compressive sensing, are developed in [27]–[30].

    Despite its promising results, the Bayesian approach is often criticized because of the potentially expensive sampling process (something that can be reduced for certain choices of the priors involved [27]), arbitrariness in the choice of the priors, and lack of proper theoretical justification for the proposed models [30].

    In this paper, we pursue the same goal of deriving a more flexible and accurate sparse model than the traditional ones,

    ²Note that, in this case, the weights are found by maximum likelihood (ML). Other applications of weighted $\ell_1$ regularizers, using other types of weighting strategies, are known to improve over $\ell_1$-based ones for certain applications (see [14]).


    Fig. 1. (a) Standard $8 \times 8$ DCT dictionary. (b) Global empirical distribution of the coefficients in $A$ (log scale). (c) Empirical distributions of the coefficients associated to each of the $K = 64$ DCT atoms (log scale). The distributions in (c) have a similar heavy-tailed shape (heavier than Laplacian), but the variance in each case can be significantly different. (d) Histogram of the $K = 64$ different $\hat{\theta}_k$ values obtained by fitting a Laplacian distribution to each row $a_k^T$ of $A$. Note that there are significant occurrences between $\theta = 5$ and $\theta = 25$. The coefficients $A$ used in (b)–(d) were obtained from encoding $10^6$ $8 \times 8$ patches (after removing their DC component) randomly sampled from the Pascal 2006 dataset of natural images [31]. (e) Histograms showing the spatial variability of the best local estimates of $\theta_k$ for a few rows of $A$ across different regions of an image. In this case, the coefficients $A$ correspond to the sparse encoding of all $8 \times 8$ patches from a single image, in scan-line order. For each $k$, each value of $\theta_k$ was computed from a random contiguous block of 250 samples from $a_k^T$. The procedure was repeated 4000 times to obtain an empirical distribution. The wide supports of the empirical distributions indicate that the estimated $\theta$ can have very different values, even for the same atom, depending on the region of the data from which the coefficients are taken.

    while avoiding an increase in the number of parameters and the burden of possibly solving several sampled instances of the sparse coding problem. For this, we deploy tools from the very successful information-theoretic field of universal coding, which is an extension of the compression scenario summarized above in Section II-A to the case in which the probability model for the data is itself unknown and has to be described as well.

    III. UNIVERSAL MODELS FOR SPARSE CODING

    Following the discussion in the preceding section, we consider two encoding scenarios. First, we may still want a single value of $\theta$ to work well for all the coefficients in $A$, and try to design a sparse coding scheme that does not depend on prior knowledge of the value of $\theta$. Second, we can consider an independent (but not identically distributed) Laplacian model where the underlying parameter $\theta_k$ can be different for each atom $d_k$, $k = 1, \ldots, K$.

    Consider the model class of all IID Laplacian models over $A \in \mathbb{R}^{K \times N}$, $\mathcal{M} = \{P(A|\theta) : \theta \in \Theta\}$ with $\Theta \subseteq \mathbb{R}^{+}$, $P(A|\theta) = \prod_{j=1}^{N}\prod_{k=1}^{K} P(a_{kj}|\theta)$, and $P(a_{kj}|\theta) = (\theta/2)\,e^{-\theta|a_{kj}|}$. Ideally, our goal in the first scenario would be to choose the model $\hat{P}$ from $\mathcal{M}$ which minimizes $L_{\hat{P}}(A)$. With hindsight of $A$, this corresponds to $\hat{P} = P(\cdot|\hat{\theta})$ with $\hat{\theta}$ the ML estimator of $\theta$ given $A$. In general, however, we do not have $A$ within hindsight. The goal of universal coding [15], [32] is to find a probability model $Q(A)$, so that $L_Q(A) \approx L_{\hat{P}}(A)$ without hindsight of $A$, in which case we say that $Q(A)$ is universal with respect to the model class $\mathcal{M}$.

    For simplicity, in the following discussion we rearrange the matrix $A$ as one long column vector of length $n = KN$, $\mathbf{a} = (a_1, \ldots, a_n)$. The letter $a$, without subindex, will denote a given coefficient value.

    First we need to define a criterion for comparing the fitting quality of different models. In universal coding theory, this is done in terms of the codelengths $L(\mathbf{a})$ required by each model to describe $\mathbf{a}$.

    If the model class consists of a single probability distribution $P(\cdot)$, we know from Section II-A3 that the optimum codelength corresponds to $L_P(\mathbf{a}) = -\log P(\mathbf{a})$. Moreover, this relationship defines a one-to-one correspondence between distributions and codelengths, so that for any coding scheme with lengths $L_Q(\mathbf{a})$, $Q(\mathbf{a}) = 2^{-L_Q(\mathbf{a})}$. Now, if $\mathcal{M}$ is a family of models parametrized by $\theta$, the ideal choice, as mentioned above, is $\hat{P} = P(\cdot|\hat{\theta})$. Unfortunately, the decoder does not know $\hat{\theta}$ in advance, and we also need to describe its value within the code $C(\mathbf{a})$ for the decoder to be able to reconstruct $\mathbf{a}$, thereby increasing the total codelength. Thus, we have that any model $Q(\mathbf{a})$ inducing valid codelengths $L_Q(\mathbf{a})$ will have $L_Q(\mathbf{a}) > -\log P(\mathbf{a}|\hat{\theta})$. The overhead of $L_Q(\mathbf{a})$ with respect to $-\log P(\mathbf{a}|\hat{\theta})$ is known as the codelength regret

    $$\mathcal{R}(\mathbf{a}, Q) := L_Q(\mathbf{a}) - \left(-\log P(\mathbf{a}|\hat{\theta}(\mathbf{a}))\right) = \log Q(\mathbf{a})^{-1} + \log P(\mathbf{a}|\hat{\theta}(\mathbf{a})).$$

    A model $Q(\mathbf{a})$ (or, more precisely, a sequence of models, one for each data length $n$) is called universal if $\mathcal{R}(\mathbf{a}, Q)$ grows sublinearly in $n$ for all possible realizations of $\mathbf{a}$, that is, $\frac{1}{n}\mathcal{R}(\mathbf{a}, Q) \rightarrow 0$, $\forall\, \mathbf{a} \in \mathbb{R}^{n}$, so that the codelength regret with respect to the maximum likelihood estimation (MLE) becomes asymptotically negligible.

    There are a number of ways to construct universal probability models. The simplest one is the so-called two-part code, where the data is described in two parts. The first part describes the optimal parameter $\hat{\theta}(\mathbf{a})$ with $L(\hat{\theta})$ bits, and the second part describes the data $\mathbf{a}$ given $\hat{\theta}$, using $-\log P(\mathbf{a}|\hat{\theta}(\mathbf{a}))$ bits. For uncountable parameter spaces $\Theta$, such as a compact subset of $\mathbb{R}$, the value of $\hat{\theta}$ has to be quantized for $L(\hat{\theta})$ to be finite. Let $[\hat{\theta}]_\delta$ denote the value of $\hat{\theta}$ quantized uniformly in steps of size $\delta$. The regret for this model is thus

    $$\mathcal{R}(\mathbf{a}, Q) = L([\hat{\theta}]_\delta) + L(\mathbf{a}|[\hat{\theta}]_\delta) - L(\mathbf{a}|\hat{\theta}) = L([\hat{\theta}]_\delta) - \log P(\mathbf{a}|[\hat{\theta}]_\delta) - \left(-\log P(\mathbf{a}|\hat{\theta})\right).$$

    The key for this model to be universal is to choose $\delta$ so that both its description length $L([\hat{\theta}]_\delta)$ and the difference $-\log P(\mathbf{a}|[\hat{\theta}]_\delta) - (-\log P(\mathbf{a}|\hat{\theta}))$ grow sublinearly in $n$. As shown in [15], this can be achieved for the choice $\delta(n) = O(1/\sqrt{n})$, so that $L([\hat{\theta}]_\delta) = \frac{1}{2}\log n + o(1)$ bits are required to describe each dimension of $[\hat{\theta}]_\delta$. This gives us a total regret for two-part codes which grows as $\frac{\dim(\Theta)}{2}\log n$, where $\dim(\Theta)$ is the dimension of the parameter space $\Theta$.

    Another important universal code is the so-called normalized ML (NML) [33]. In this case, the universal model $Q^{*}(\mathbf{a})$ corresponds to the model that minimizes the worst case regret

    $$Q^{*} = \arg\min_{Q}\max_{\mathbf{a}} \left\{ -\log Q(\mathbf{a}) + \log P(\mathbf{a}|\hat{\theta}(\mathbf{a})) \right\}$$

    which can be written in closed form as $Q^{*}(\mathbf{a}) = P(\mathbf{a}|\hat{\theta}(\mathbf{a})) / \mathcal{C}(\mathcal{M}, n)$, where the normalization constant $\mathcal{C}(\mathcal{M}, n) := \int_{\mathbf{a} \in \mathbb{R}^n} P(\mathbf{a}|\hat{\theta}(\mathbf{a}))\,d\mathbf{a}$ is the value of the minimax regret and depends only on $\mathcal{M}$ and the length of the data $n$.³ Note that the NML model requires $\mathcal{C}(\mathcal{M}, n)$ to be finite, something which is often not the case.

    The two previous examples are good for assigning a probability to coefficients $\mathbf{a}$ that have already been computed, but they cannot be used as a model for computing the coefficients themselves, since they depend on having observed them in the first place. For this and other reasons that will become clearer later, we concentrate our paper on a third important family of universal codes derived from the so-called mixture models (also called Bayesian mixtures). In a mixture model, $Q(\mathbf{a})$ is a convex mixture of all the models $P(\mathbf{a}|\theta) \in \mathcal{M}$, $Q(\mathbf{a}) = \int_{\Theta} P(\mathbf{a}|\theta) w(\theta)\,d\theta$, where $w(\theta)$ specifies the weight of each model. Being a convex mixture implies that $w(\theta) \geq 0$ and $\int_{\Theta} w(\theta)\,d\theta = 1$, thus $w(\theta)$ is itself a probability measure over $\Theta$. We will restrict ourselves to the particular case when $\mathbf{a}$ is considered a sequence of independent random variables⁴

    $$Q(\mathbf{a}) = \prod_{j=1}^{n} Q_j(a_j), \quad Q_j(a_j) = \int_{\Theta} P(a_j|\theta)\,w_j(\theta)\,d\theta \qquad (8)$$

    where the mixing function $w_j(\theta)$ can be different for each sample $j$. An important particular case of this scheme is the so-called sequential Bayes code, in which $w_j(\theta)$ is computed sequentially as a posterior distribution based on previously observed samples, that is, $w_j(\theta) = P(\theta|a_1, a_2, \ldots, a_{j-1})$ [34, Ch. 6]. In this paper, for simplicity, we restrict ourselves to the case where $w_j(\theta) = w(\theta)$ is the same for all $j$. The result is an IID model where the probability of each sample $a_j$ is given by

    $$Q_j(a_j) = Q(a_j) = \int_{\Theta} P(a_j|\theta)\,w(\theta)\,d\theta, \quad j = 1, \ldots, n. \qquad (9)$$
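    As a sanity check of (9), one can evaluate the mixture integral numerically for a given weight $w(\theta)$ and compare it with the closed-form results derived next (e.g., (10)). A minimal Python sketch (ours), assuming a Gamma mixing function:

        import numpy as np
        from scipy import integrate, stats

        def mixture_density(a, w_pdf):
            # Numerically evaluate Q(a) = integral of P(a|theta) w(theta) dtheta, cf. (9),
            # for the one-sided exponential model P(a|theta) = theta*exp(-theta*a), a >= 0.
            integrand = lambda theta: theta * np.exp(-theta * a) * w_pdf(theta)
            val, _ = integrate.quad(integrand, 0.0, np.inf)
            return val

        # Gamma mixing function with shape kappa and rate beta, as in Section III-A.
        kappa, beta = 3.0, 0.1
        w = stats.gamma(a=kappa, scale=1.0 / beta).pdf
        q_numeric = mixture_density(0.5, w)
        q_closed = kappa * beta ** kappa * (0.5 + beta) ** -(kappa + 1)  # MOE form, cf. (10)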

    A well-known result for IID mixture (Bayesian) codes states that their asymptotic regret is $O\!\left(\frac{\dim(\Theta)}{2}\log n\right)$, thus establishing their universality, as long as the weighting function

    ³The minimax optimality of $Q^{*}(\mathbf{a})$ derives from the fact that it defines a complete uniquely decodable code for all data $\mathbf{a}$ of length $n$, that is, it satisfies the Kraft inequality with equality, $\int_{\mathbf{a} \in \mathbb{R}^n} 2^{-L_{Q^{*}}(\mathbf{a})}\,d\mathbf{a} = 1$. Because every uniquely decodable code with lengths $\{L_Q(\mathbf{a}) : \mathbf{a} \in \mathbb{R}^n\}$ must satisfy the Kraft inequality (see [24, Ch. 5]), if there exists a value of $\mathbf{a}$ such that $L_Q(\mathbf{a}) < L_{Q^{*}}(\mathbf{a})$ (i.e., $2^{-L_Q(\mathbf{a})} > 2^{-L_{Q^{*}}(\mathbf{a})}$), then there exists a vector $\mathbf{a}'$ for which $L_Q(\mathbf{a}') > L_{Q^{*}}(\mathbf{a}')$ for the Kraft inequality to hold. Therefore, the regret of $Q$ for $\mathbf{a}'$ is necessarily greater than $\mathcal{C}(\mathcal{M}, n)$, which shows that $Q^{*}$ is minimax optimal.

    ⁴More sophisticated models which include dependencies between the elements of $\mathbf{a}$ are out of the scope of this paper.

    $w(\theta)$ is positive, continuous, and unimodal over $\Theta$ (see [34, Th. 8.1], [35]). In principle, this gives us great flexibility in the choice of the function $w(\theta)$. In practice, because the results are asymptotic and the $o(\log n)$ terms can be large, the choice of $w(\theta)$ can be relevant for small sample sizes.

    Next, we derive several IID mixture models for the class $\mathcal{M}$ of Laplacian models. For this purpose, it will be convenient to consider the corresponding one-sided counterpart of the Laplacian, which is the exponential distribution over the absolute value of the coefficients, $|a|$, and then symmetrize back to obtain the final distribution over the signed coefficients $a$.

    A. Conjugate Prior

    In general, (9) can be computed in closed form if $w(\theta)$ is the conjugate prior of $P(a|\theta)$. When $P(a|\theta)$ is an exponential (one-sided Laplacian), the conjugate prior is the Gamma distribution

    $$w(\theta|\kappa, \beta) = \Gamma(\kappa)^{-1}\beta^{\kappa}\theta^{\kappa-1}e^{-\beta\theta}, \quad \theta \in \mathbb{R}^{+}$$

    where $\kappa$ and $\beta$ are its shape and scale parameters, respectively.

    Plugging this into (9) we obtain the mixture of exponentials (MOE) model, which has the following form (see Appendix for the full derivation):

    $$Q_{\mathrm{MOE}}(a|\kappa, \beta) = \kappa\beta^{\kappa}(a + \beta)^{-(\kappa+1)}, \quad a \in \mathbb{R}^{+}. \qquad (10)$$

    With some abuse of notation, we will also denote the symmetric distribution on $a$ as MOE

    $$Q_{\mathrm{MOE}}(a|\kappa, \beta) = \frac{\kappa}{2}\beta^{\kappa}(|a| + \beta)^{-(\kappa+1)}, \quad a \in \mathbb{R}. \qquad (11)$$

    Note that the parameters $\kappa$ and $\beta$ are noninformative, in the sense that, according to the theory, the universality of $Q_{\mathrm{MOE}}$ does not depend on their choice. However, if desired,

    both $\kappa$ and $\beta$ can be easily estimated using the method of moments (see Appendix). Given sample estimates of the first and second noncentral moments, $\hat{\mu}_1 = \frac{1}{n}\sum_{j=1}^{n}|a_j|$ and $\hat{\mu}_2 = \frac{1}{n}\sum_{j=1}^{n}|a_j|^2$, we have that

    $$\hat{\kappa} = \frac{2(\hat{\mu}_2 - \hat{\mu}_1^2)}{\hat{\mu}_2 - 2\hat{\mu}_1^2} \quad \text{and} \quad \hat{\beta} = (\hat{\kappa} - 1)\hat{\mu}_1. \qquad (12)$$
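    The estimators in (12) are straightforward to implement. A minimal sketch (ours), operating on the nonzero coefficient magnitudes:

        import numpy as np

        def moe_moment_estimates(coeffs):
            # Method-of-moments estimates (12) for the MOE parameters (kappa, beta),
            # computed from the nonzero coefficient magnitudes.
            a = np.abs(coeffs[coeffs != 0])
            mu1, mu2 = a.mean(), (a ** 2).mean()
            # Assumes mu2 > 2*mu1**2 (heavy-tailed data, i.e., kappa > 2).
            kappa = 2 * (mu2 - mu1 ** 2) / (mu2 - 2 * mu1 ** 2)
            beta = (kappa - 1) * mu1
            return kappa, beta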

    When the MOE prior is plugged into (5) instead of the standard Laplacian, the following new sparse coding formulation is obtained:

    $$a_j = \arg\min_{a} \ \|x_j - D a\|_2^2 + \lambda_{\mathrm{MOE}}\sum_{k=1}^{K}\log(|a_k| + \beta) \qquad (13)$$

    where $\lambda_{\mathrm{MOE}} = 2\sigma^2(\kappa + 1)$. An example of the MOE regularizer, and the thresholding function it induces, is shown in Fig. 2 (center column) for $\kappa = 2.5$, $\beta = 0.05$. Smooth, differentiable nonconvex regularizers such as the one in (13) have become a mainstream robust alternative to the $\ell_1$ norm in statistics [14], [36]. Furthermore, it has been shown that the use of such regularizers in regression leads to consistent estimators which are able to identify the relevant variables in a regression model (oracle property) [36]. This is not always the case for the $\ell_1$ regularizer, as was proved in [14]. The MOE regularizer has also been recently proposed in the context of



    Fig. 2. (a) $\ell_1$ (green), (b) MOE (red), and (c) JOE (blue) regularizers and their corresponding thresholding functions $\mathrm{thres}(x) := \arg\min_{a}\{(x - a)^2 + \lambda\rho(|a|)\}$. The unbiasedness of MOE is due to the fact that large coefficients are not shrunk by the thresholding function. Also, although the JOE regularizer is biased, the shrinkage of large coefficients can be much smaller than the one applied to small coefficients.

    compressive sensing [37], where it is conjectured to be better than the $\ell_1$ term at recovering sparse signals in compressive sensing applications.⁵ This conjecture was partially confirmed recently for nonconvex regularizers of the form $\psi(a) = \|a\|_r^r$ with $0 < r < 1$ in [38] and [39], and for a more general family of nonconvex regularizers including the one in (13) in [40]. In all cases, it was shown that the conditions on the sensing matrix (here $D$) can be significantly relaxed to guarantee exact recovery if nonconvex regularizers are used instead of the $\ell_1$ norm, provided that the exact solution to the nonconvex optimization problem can be computed. In practice, this regularizer is being used with success in a number of applications here and in [41] and [42].⁶ Our experimental results in Section V provide further evidence of the benefits of nonconvex regularizers, leading to a much improved recovery accuracy of sparse coefficients compared with $\ell_1$ and $\ell_0$. We also show in Section V that the MOE prior is much more accurate than the standard Laplacian at modeling the distribution of reconstruction coefficients drawn from a large database of image patches. We also show in Section V how these improvements lead to better results in applications such as image estimation and classification.
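    The thresholding functions in Fig. 2 are easy to reproduce numerically. A small brute-force sketch (ours; a grid search rather than a closed form) evaluates $\mathrm{thres}(x) = \arg\min_{a}\{(x-a)^2 + \lambda\rho(|a|)\}$ for the $\ell_1$ and MOE penalties:

        import numpy as np

        def thres(x, reg, lam, grid=np.linspace(-50, 50, 200001)):
            # Brute-force thres(x) = argmin_a (x - a)^2 + lam * reg(|a|).
            costs = (x - grid) ** 2 + lam * reg(np.abs(grid))
            return grid[np.argmin(costs)]

        l1_reg  = lambda t: t                               # rho(t) = t
        moe_reg = lambda t, beta=0.05: np.log(t + beta)     # rho(t) = log(t + beta), cf. (13)

        x_vals = np.linspace(-5, 5, 11)
        t_l1  = [thres(x, l1_reg, lam=1.0) for x in x_vals]
        t_moe = [thres(x, moe_reg, lam=1.0) for x in x_vals]
        # Large |x| are left almost unshrunk by the MOE thresholding, while the
        # l1 thresholding shifts them by lam/2 (soft thresholding).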

    B. Jeffreys Prior

    The Jeffreys prior for a parametric model class $\mathcal{M} = \{P(a|\theta), \theta \in \Theta\}$ is defined as

    $$w(\theta) = \frac{\sqrt{|I(\theta)|}}{\int_{\Theta}\sqrt{|I(\xi)|}\,d\xi} \qquad (14)$$

    where $|I(\theta)|$ is the determinant of the Fisher information matrix

    $$I(\theta) = E_{P(a|\theta)}\left[-\frac{\partial^2}{\partial\theta^2}\log P(a|\theta)\right]. \qquad (15)$$

    The Jeffreys prior is well known in Bayesian theory due to three important properties: it virtually eliminates the hyper-parameters of the model, it is invariant to the original parametrization of the distribution, and it is a noninformative prior, meaning that it represents well the lack of prior information on the unknown parameter $\theta$ [43]. It turns out that, for quite different reasons, the Jeffreys prior is also of paramount

    ⁵In [37], the logarithmic regularizer arises from approximating the $\ell_0$ pseudo-norm as an $\ell_1$-normalized element-wise sum, without the insight and theoretical foundation here reported.

    ⁶While these works support the use of such nonconvex regularizers, none of them formally derives them using the universal coding framework as in this paper.

    importance in the theory of universal coding. For instance, it has been shown in [44] that the worst-case regret of the mixture code obtained using the Jeffreys prior approaches that of the NML as the number of samples $n$ grows. Thus, by using Jeffreys, one can attain the minimum worst-case regret asymptotically, while retaining the advantages of a mixture (not needing hindsight of $\mathbf{a}$), which in our case means being able to use it as a model for computing $\mathbf{a}$ via sparse coding.

    For the exponential distribution, we have that $I(\theta) = 1/\theta^2$. Clearly, if we let $\Theta = (0, \infty)$, the integral in (14) evaluates to $\infty$. Therefore, to obtain a proper integral, we need to exclude $0$ and $\infty$ from $\Theta$ (note that this was not needed for the conjugate prior). We choose to define $\Theta = [\theta_1, \theta_2]$, $0 < \theta_1 < \theta_2 < \infty$, leading to $w(\theta) = \frac{1}{\ln(\theta_2/\theta_1)}\cdot\frac{1}{\theta}$, $\theta \in [\theta_1, \theta_2]$.

    The resulting mixture, after being symmetrized around $0$, has the following form (see Appendix):

    $$Q_{\mathrm{JOE}}(a|\theta_1, \theta_2) = \frac{1}{2\ln(\theta_2/\theta_1)}\cdot\frac{1}{|a|}\left(e^{-\theta_1|a|} - e^{-\theta_2|a|}\right), \quad a \in \mathbb{R}\setminus\{0\}. \qquad (16)$$

    We refer to this as a Jeffreys mixture of exponentials (JOE), and again overload this acronym to refer to the symmetric case as well. Note that although $Q_{\mathrm{JOE}}$ is not defined for $a = 0$, its limit when $a \rightarrow 0$ is finite and evaluates to $\frac{\theta_2 - \theta_1}{2\ln(\theta_2/\theta_1)}$. Thus, by defining $Q_{\mathrm{JOE}}(0) := \frac{\theta_2 - \theta_1}{2\ln(\theta_2/\theta_1)}$, we obtain a prior that is well defined and continuous for all $a \in \mathbb{R}$. When plugged into (5) together with a Gaussian error model, we get the JOE-based sparse coding formulation

    $$a_j = \arg\min_{a} \ \|x_j - D a\|_2^2 + 2\sigma^2\sum_{k=1}^{K}\left\{\log|a_k| - \log\left(e^{-\theta_1|a_k|} - e^{-\theta_2|a_k|}\right)\right\} \qquad (17)$$

    where, according to the convention just defined for $Q_{\mathrm{JOE}}(0)$, we define $\rho_{\mathrm{JOE}}(0) := -\log(\theta_2 - \theta_1)$.

    As with MOE, the JOE-based regularizer, $\rho_{\mathrm{JOE}}(\cdot) = -\log Q_{\mathrm{JOE}}(\cdot)$, is continuous and differentiable in $\mathbb{R}^{+}$, and its derivative converges to a finite value at zero, $\lim_{u \rightarrow 0^{+}}\rho'_{\mathrm{JOE}}(u) = \frac{\theta_2^2 - \theta_1^2}{2(\theta_2 - \theta_1)}$. As we will see later in Section IV, these properties are important to guarantee the convergence of sparse coding algorithms using nonconvex priors. Note from (17) that we can rewrite the JOE regularizer as

    $$\rho_{\mathrm{JOE}}(a_k) = \log|a_k| - \log\left(e^{-\theta_1|a_k|}\left(1 - e^{-(\theta_2-\theta_1)|a_k|}\right)\right) = \theta_1|a_k| + \log|a_k| - \log\left(1 - e^{-(\theta_2-\theta_1)|a_k|}\right)$$

    so that for sufficiently large $|a_k|$, $\log\left(1 - e^{-(\theta_2-\theta_1)|a_k|}\right) \approx 0$ and $\theta_1|a_k| \gg \log|a_k|$, and we have that $\rho_{\mathrm{JOE}}(|a_k|) \approx \theta_1|a_k|$. Thus, for large $|a_k|$, the JOE regularizer behaves like $\ell_1$ with $\lambda = 2\sigma^2\theta_1$. In terms of the probability model, this means that the tails of the JOE mixture behave like a Laplacian with $\theta = \theta_1$, the region where this happens being determined by the value of $\theta_2 - \theta_1$. Finally, although having Laplacian tails means that the estimated $a$ will be biased [36], the sharper peak at $0$ allows us to perform a more aggressive thresholding of small values, without excessively clipping large coefficients, which leads to the typical over-smoothing of signals recovered using an $\ell_1$ regularizer. See Fig. 2 (two rightmost columns).

    As with MOE, the JOE regularizer has two noninformative hyper-parameters $(\theta_1, \theta_2)$ which define $\Theta$. One possibility is to choose a conservative range $[\theta_1, \theta_2]$ which accommodates all physically possible realizations of $\theta$. In practice, it may be advantageous to adjust $[\theta_1, \theta_2]$ to the data at hand, in which case any standard optimization technique can be easily applied to obtain their ML estimates. See Appendix for details.
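    For reference, a minimal sketch (ours) of the JOE penalty evaluation, handling the $a \rightarrow 0$ limit explicitly and using, purely for illustration, the range fitted in Section V:

        import numpy as np

        def rho_joe(a, theta1, theta2):
            # JOE regularizer rho(a) = log|a| - log(exp(-theta1*|a|) - exp(-theta2*|a|)),
            # with the a -> 0 limit rho(0) = -log(theta2 - theta1) handled explicitly.
            a = np.atleast_1d(np.abs(np.asarray(a, dtype=float)))
            out = np.full_like(a, -np.log(theta2 - theta1))
            nz = a > 0
            out[nz] = np.log(a[nz]) - np.log(np.exp(-theta1 * a[nz]) - np.exp(-theta2 * a[nz]))
            return out

        # For large |a| the penalty grows like theta1*|a| (Laplacian tail), while near
        # zero it is much steeper than l1, allowing aggressive thresholding of small values.
        vals = rho_joe([0.0, 0.01, 0.1, 1.0, 10.0], theta1=2.4, theta2=371.4)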

    C. Conditional Jeffreys

    A recent approach to deal with the case when the integral over $\Theta$ in the Jeffreys prior is improper is the conditional Jeffreys [34, Ch. 11]. The idea is to construct a proper prior based on the improper Jeffreys prior and the first few $n_0$ samples of $\mathbf{a}$, $(a_1, a_2, \ldots, a_{n_0})$, and then use it for the remaining data. The key observation is that although the normalizing integral $\int_{\Theta}\sqrt{I(\theta)}\,d\theta$ in the Jeffreys prior is improper, the unnormalized prior $w(\theta) = \sqrt{I(\theta)}$ can be used as a measure to weight $P(a_1, a_2, \ldots, a_{n_0}|\theta)$

    $$w(\theta) = \frac{P(a_1, a_2, \ldots, a_{n_0}|\theta)\sqrt{I(\theta)}}{\int_{\Theta}P(a_1, a_2, \ldots, a_{n_0}|\xi)\sqrt{I(\xi)}\,d\xi}. \qquad (18)$$

    It turns out that the integral in (18) usually becomes proper for small $n_0$, in the order of $\dim(\Theta)$. In our case we have that, for any $n_0 \geq 1$, the resulting prior is a $\mathrm{Gamma}(\kappa_0, \beta_0)$ distribution with $\kappa_0 := n_0$ and $\beta_0 := \sum_{j=1}^{n_0} a_j$ (see Appendix for details). Therefore, using the conditional Jeffreys prior in the mixture leads to a particular instance of MOE, which we denote by CMOE (although the functional form is identical to MOE), where the Gamma parameters $\kappa$ and $\beta$ are automatically selected from the data. This may explain in part why the Gamma prior performs so well in practice, as we will see in Section V.

    Furthermore, we observe that the value of $\beta$ obtained with this approach ($\beta_0$) coincides with the one estimated using the method of moments for MOE if the $\kappa$ in MOE is fixed to $\kappa = \kappa_0 + 1 = n_0 + 1$. Indeed, if computed from $n_0$ samples, the method of moments for MOE gives $\beta = (\kappa - 1)\hat{\mu}_1$, with $\hat{\mu}_1 = \frac{1}{n_0}\sum_j a_j$, which gives us $\beta = (n_0 + 1 - 1)\frac{1}{n_0}\sum_j a_j = \beta_0$. It turns out in practice that the value of $\kappa$ estimated using the method of moments is between 2 and 3 for the type of data that we deal with (see Section V), which is just above the minimum acceptable value for the CMOE prior to be defined, which is $n_0 = 1$. This justifies our choice of $n_0 = 2$ when applying CMOE in practice.

    As $n_0$ becomes large, so does $\kappa_0 = n_0$, and the Gamma prior $w(\theta)$ obtained with this method converges to a Kronecker delta at the mean value of the Gamma distribution, $\delta_{\kappa_0/\beta_0}(\theta)$. Consequently, when $w(\theta) \approx \delta_{\kappa_0/\beta_0}(\theta)$, the mixture $\int P(a|\theta)w(\theta)\,d\theta$ will be close to $P(a|\kappa_0/\beta_0)$. Moreover, from the definition of $\kappa_0$ and $\beta_0$ we have that $\kappa_0/\beta_0$ is exactly the MLE of $\theta$ for the Laplacian distribution. Thus, for large $n_0$, the conditional Jeffreys method approaches the MLE Laplacian model.

    Although this is not a problem from a universal coding point of view, if $n_0$ is large, the conditional Jeffreys model will lose its flexibility to deal with the case when different coefficients in $A$ have different underlying $\theta$. On the other hand, a small $n_0$ can lead to a prior $w(\theta)$ that is overfitted to the local properties of the first samples, which can be problematic for nonstationary data such as image patches. Ultimately, $n_0$ defines a trade-off between the degree of flexibility and the accuracy of the resulting model.
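    The CMOE parameter selection itself is a one-liner. A minimal sketch (ours), using the first $n_0$ nonzero coefficient magnitudes:

        import numpy as np

        def cmoe_params(coeffs, n0=2):
            # Conditional Jeffreys (CMOE): Gamma prior parameters from the first
            # n0 samples, kappa0 = n0 and beta0 = sum of their magnitudes.
            a = np.abs(coeffs[coeffs != 0])[:n0]
            return float(n0), float(a.sum())    # (kappa0, beta0)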

    IV. OPTIMIZATION AND IMPLEMENTATION DETAILS

    All of the mixture models discussed so far yield nonconvex regularizers, rendering the sparse coding problem nonconvex in $a$. It turns out, however, that these regularizers satisfy certain conditions which make the resulting sparse coding optimization well suited to be approximated using a sequence of successive convex sparse coding problems, a technique known as local linear approximation (LLA) [45] (see also [42], [46] for alternative techniques). In a nutshell, suppose we need to obtain an approximate solution to

    $$a_j = \arg\min_{a} \ \|x_j - D a\|_2^2 + \lambda\sum_{k=1}^{K}\rho(|a_k|) \qquad (19)$$

    where $\rho(\cdot)$ is a nonconvex function over $\mathbb{R}^{+}$. At each LLA iteration, we compute $a_j^{(t+1)}$ by doing a first-order expansion of $\rho(\cdot)$ around the $K$ elements of the current estimate $a_{kj}^{(t)}$

    $$\tilde{\rho}_k^{(t)}(|a|) = \rho\left(|a_{kj}^{(t)}|\right) + \rho'\left(|a_{kj}^{(t)}|\right)\left(|a| - |a_{kj}^{(t)}|\right) = \rho'\left(|a_{kj}^{(t)}|\right)|a| + c_k$$

    and solving the convex weighted $\ell_1$ problem that results after discarding the constant terms $c_k$

    $$a_j^{(t+1)} = \arg\min_{a} \ \|x_j - D a\|_2^2 + \lambda\sum_{k=1}^{K}\tilde{\rho}_k^{(t)}(|a_k|) = \arg\min_{a} \ \|x_j - D a\|_2^2 + \lambda\sum_{k=1}^{K}\rho'\left(|a_{kj}^{(t)}|\right)|a_k| = \arg\min_{a} \ \|x_j - D a\|_2^2 + \sum_{k=1}^{K}\lambda_k^{(t)}|a_k| \qquad (20)$$

    where we have defined $\lambda_k^{(t)} := \lambda\rho'\left(|a_{kj}^{(t)}|\right)$. If $\rho'(\cdot)$ is continuous in $(0, +\infty)$, and right-continuous and finite at $0$, then the LLA algorithm converges to a stationary point of (19) [14]. These conditions are met for both the MOE and JOE regularizers. Although, for the JOE prior, the derivative $\rho'(\cdot)$ is not defined at $0$, it converges to the limit $\frac{\theta_2^2 - \theta_1^2}{2(\theta_2 - \theta_1)}$ when $|a| \rightarrow 0$, which is well defined for $\theta_2 \neq \theta_1$. If $\theta_2 = \theta_1$, the JOE mixing function is a Kronecker delta and the prior becomes a Laplacian with parameter $\theta = \theta_1 = \theta_2$. Therefore, for all of the mixture models studied, the LLA method converges to a stationary point. In practice, we have observed that five iterations are enough to converge. Thus, the cost of sparse coding with the proposed nonconvex regularizers is at most five times that of a single $\ell_1$ sparse coding. Of course, we need a starting point $a_j^{(0)}$ and, being a nonconvex problem, this choice will influence the approximation that we obtain. One reasonable choice, used in this paper, is to define $a_{kj}^{(0)} = a_0$, $k = 1, \ldots, K$, $j = 1, \ldots, N$, where $a_0$ is a scalar such that $\rho'(a_0) = E_w[\theta]$, that is, the first sparse coding iteration corresponds to a Laplacian regularizer whose parameter is the average value of $\theta$ under the mixing prior $w(\theta)$.
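    A compact Python sketch of the LLA iteration (20) for the MOE case (ours; the weighted $\ell_1$ subproblems are solved here with a simple ISTA loop, whereas the paper mentions LARS or IST as off-the-shelf options, and our reading of the initialization rule is stated in the comments):

        import numpy as np

        def weighted_l1_ista(x, D, lam_vec, n_iter=200):
            # Solve min_a ||x - D a||_2^2 + sum_k lam_vec[k] * |a_k| via ISTA.
            L = 2 * np.linalg.norm(D, 2) ** 2       # Lipschitz constant of the gradient
            a = np.zeros(D.shape[1])
            for _ in range(n_iter):
                z = a - 2 * D.T @ (D @ a - x) / L
                a = np.sign(z) * np.maximum(np.abs(z) - lam_vec / L, 0.0)
            return a

        def moe_sparse_code_lla(x, D, sigma, kappa, beta, n_lla=5):
            # LLA for the MOE-regularized problem (13):
            #   rho(t) = log(t + beta),  rho'(t) = 1 / (t + beta),
            #   lambda_MOE = 2 * sigma**2 * (kappa + 1).
            lam = 2 * sigma ** 2 * (kappa + 1)
            # Initialization (our reading of the rule in the text): a_0 such that the
            # first weighted-l1 pass acts like a Laplacian with theta = E_w[theta]
            # = kappa/beta, i.e., lam * rho'(a_0) = 2*sigma**2*kappa/beta => a_0 = beta/kappa.
            a = np.full(D.shape[1], beta / kappa)
            for _ in range(n_lla):
                lam_vec = lam / (np.abs(a) + beta)  # weights lam * rho'(|a_k|), cf. (20)
                a = weighted_l1_ista(x, D, lam_vec)
            return a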

    Finally, note that although the discussion here has revolved around the Lagrangian formulation of sparse coding (4), this technique is also applicable to the constrained formulation of sparse coding given by (1) for a fixed dictionary $D$.

    Expected Approximation Error: Because we are solving a convex approximation to the actual target optimization problem, it is of interest to know how good this approximation is in terms of the original cost function. To give an idea of this, after an approximate solution $\hat{a}$ is obtained, we compute the expected value of the difference between the true and approximate regularization term values. The expectation is taken, naturally, in terms of the assumed distribution of the coefficients in $a$. Because the regularizers are separable, we can compute the error in a separable way as an expectation over each $k$th coefficient, $\Delta_q(\hat{a}_k) = E_q[\tilde{\rho}_k(\cdot) - \rho(\cdot)]$, where $\tilde{\rho}_k(\cdot)$ is the approximation of $\rho_k(\cdot)$ around the final estimate of $a_k$. For the case of $q = \mathrm{MOE}$, the expression obtained is (see Appendix)

    $$\Delta_{\mathrm{MOE}}(\hat{a}_k, \beta, \kappa) = E_{\mathrm{MOE}(\beta,\kappa)}\left[\tilde{\rho}_k(\cdot) - \rho(\cdot)\right] = \log(\hat{a}_k + \beta) + \frac{\beta/(\kappa - 1) - \hat{a}_k}{\hat{a}_k + \beta} - \log\beta - \frac{1}{\kappa}.$$

    In the MOE case, for $\kappa$ and $\beta$ fixed, the minimum of $\Delta_{\mathrm{MOE}}$ occurs when $\hat{a}_k = \beta/(\kappa - 1) = \mu(\kappa, \beta)$. We also have $\Delta_{\mathrm{MOE}}(0) = (\kappa - 1)^{-1} - \kappa^{-1}$.

    The function $\Delta_q(\cdot)$ can be evaluated on each coefficient of $A$ to give an idea of its quality. For example, from the experiments in Section V, we obtained an average value of 0.16, which lies between $\Delta_{\mathrm{MOE}}(0) = 0.19$ and $\min_a\Delta_{\mathrm{MOE}}(a) = 0.09$. Depending on the experiment, this represents 6% to 7% of the total sparse coding cost function value, showing the efficiency of the proposed optimization.

    Comments on Parameter Estimation: The models presented so far, with the exception of the conditional Jeffreys, depend on hyper-parameters which, in principle, should be tuned for optimal performance (remember that they do not influence the universality of the model). If tuning is needed, it is important to remember that the proposed universal models are intended for reconstruction coefficients of clean data, and thus their hyper-parameters should be computed from statistics of clean data, or by compensating for the distortion in the statistics caused by noise (see [47]). Finally, note that when $D$ is linearly dependent and $\mathrm{rank}(D) = M$, the coefficient matrix $A$ resulting from an exact reconstruction of $X$ will have many zeros which are not properly explained by any continuous distribution such as a Laplacian. We sidestep this issue by computing the statistics only from the nonzero coefficients in $A$. See [20] on how to deal with the case $P(a = 0) > 0$.

    V. EXPERIMENTAL RESULTS

    In the following experiments, the testing data $X$ are $8 \times 8$ patches drawn from the Pascal VOC2006 testing subset,⁷ which consists of high-quality $640 \times 480$ RGB images with 8 bits per channel. For the experiments, we converted the 2600 images to grayscale by averaging the channels, and scaled the dynamic range to lie in the $[0, 1]$ interval. Similar results to those shown here are also obtained for other patch sizes.

    A. Dictionary Learning

    For the experiments that follow, unless otherwise stated, we use a global overcomplete dictionary $D$ with $K = 4M = 256$ atoms, trained on the full VOC2006 training subset using the method described in [48] and [49], which seeks to minimize the following cost during training⁸:

    $$\min_{D,A} \frac{1}{N}\sum_{j=1}^{N}\left\{\|x_j - D a_j\|_2^2 + \lambda\psi(a_j)\right\} + \mu\|D^T D\|_F^2 \qquad (21)$$

    where $\|\cdot\|_F$ denotes the Frobenius norm. The additional term, $\mu\|D^T D\|_F^2$, encourages incoherence in the learned dictionary, that is, it forces the atoms to be as orthogonal as possible. Dictionaries with lower coherence are well known to have several theoretical advantages, such as improved ability to recover sparse signals [4], [50], and faster and better convergence to the solution of the sparse coding problems (1) and (3) [51]. Furthermore, in [48] it was shown that adding incoherence leads to improvements in a variety of sparse modeling applications, including the ones discussed below.

    We used MOE as the regularizer $\psi(\cdot)$ in (21), with $\lambda = 0.1$ and $\mu = 1$, both chosen empirically. See [1], [8], [48] for details on the optimization of (3) and (21).
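    A minimal way to fold the incoherence term of (21) into the dictionary update is a projected gradient step. The sketch below is ours (the authors use the method of [48], [49]); it only illustrates the gradient of the penalty as written in (21):

        import numpy as np

        def dict_update_incoherent(X, A, D, mu=1.0, step=1e-3, n_steps=50):
            # Gradient descent on ||X - D A||_F^2 / N + mu * ||D^T D||_F^2,
            # followed by projection of each atom onto the unit ball ||d_k||_2 <= 1.
            N = X.shape[1]
            for _ in range(n_steps):
                grad = -2 * (X - D @ A) @ A.T / N + 4 * mu * D @ (D.T @ D)
                D = D - step * grad
                D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
            return D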

    B. MOE as a Prior for Sparse Coding Coefficients

    We begin by comparing the performance of the Laplacian and MOE priors for fitting a single global distribution to the whole matrix $A$. We compute $A$ using (1) with $\epsilon \approx 0$ and then, following the discussion in Section IV, restrict our study to the nonzero elements of $A$.

    The empirical distribution of $A$ is plotted in Fig. 3(a), along with the best fitting Laplacian, MOE, JOE, and a particularly

    ⁷Available at http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html#VOC2006.

    ⁸While we could have used off-the-shelf dictionaries such as DCT to test our universal sparse coding framework, it is important to use dictionaries that lead to state-of-the-art results to show the additional potential improvement of our proposed regularizers.


    Fig. 3. (a) Empirical distribution of the coefficients in $A$ for image patches (blue dots), best fitting Laplacian (green), MOE (red), CMOE (orange), and JOE (yellow) distributions. The Laplacian (KLD = 0.17 bits) clearly does not fit the tails properly, and is not sufficiently peaked at zero. The two models based on a Gamma prior, MOE (KLD = 0.01 bits) and CMOE (KLD = 0.01 bits), provide an almost perfect fit. The fitted JOE (KLD = 0.14) is the most sharply peaked at 0, but does not fit the tails as tightly as desired. As a reference, the entropy of the empirical distribution is H = 3.00 bits. (b) KLD for the best fitting global Laplacian (dark green), per-atom Laplacian (light green), global MOE (dark red), and per-atom MOE (light red), relative to the KLD between the globally fitted MOE distribution and the empirical distribution. The horizontal axis represents the index of each atom, $k = 1, \ldots, K$, ordered according to the difference in KLD between the global MOE and the per-atom Laplacian model. Note how the global MOE outperforms both the global and per-atom Laplacian models in all but the first four cases. (c) Active set recovery accuracy of $\ell_1$ and MOE, as defined in Section V-C, for $L = 5$ and $L = 10$, as a function of $\sigma$. The improvement of MOE over $\ell_1$ is a factor of 5 to 9. (d) Peak signal-to-noise ratio (PSNR) of the recovered sparse signals with respect to the true signals. In this case, significant improvements can be observed in the high SNR range, especially for highly sparse ($L = 5$) signals. The performance of both methods is practically the same for $\sigma \geq 10$.

    good example of the conditional Jeffreys (CMOE) distributions.⁹ The MLE for the Laplacian fit is $\hat{\theta} = N_1/\|A\|_1 = 27.2$ (here $N_1$ is the number of nonzero elements in $A$). For MOE, using (12), we obtained $\hat{\kappa} = 2.8$ and $\hat{\beta} = 0.07$. For JOE, $\hat{\theta}_1 = 2.4$ and $\hat{\theta}_2 = 371.4$. According to the discussion in Section III-C, we used the value $\hat{\kappa} = 2.8$ obtained using the method of moments for MOE as a hint for choosing $n_0 = 2$ ($n_0 + 1 = 3 \approx 2.8$), yielding $\beta_0 = 0.07$, which coincides with the $\hat{\beta}$ obtained using the method of moments. As observed in Fig. 3(a), in all cases, the proposed mixture models fit the data better, significantly for both Gamma-based mixtures, MOE and CMOE, and slightly for JOE. This is further confirmed by the Kullback–Leibler divergence (KLD) obtained in each case. Note that JOE fails to significantly improve on the Laplacian model because of the excessively large estimated range $[\theta_1, \theta_2]$. In this sense, it is clear that the JOE model is very sensitive to its hyper-parameters, and a better and more robust estimation would be needed for it to be useful in practice.

    Given these results, hereafter we concentrate on the best case, which is the MOE prior (which, as detailed above, can be derived from the conditional Jeffreys as well, thus representing both approaches).

    From Fig. 1(e), we know that the optimal $\theta$ varies locally across different regions, thus we expect the mixture models to perform well also on a per-atom basis. This is confirmed in Fig. 3(b), where we show, for each row $a_k^T$, $k = 1, \ldots, K$, the difference in KLD between the globally fitted MOE distribution and the best per-atom fitted MOE, the globally fitted Laplacian, and the per-atom fitted Laplacians, respectively. As can be observed, the KLD obtained with the global MOE is significantly smaller than that of the global Laplacian in all cases, and even the per-atom Laplacians in most of the cases. This shows that MOE, with only two parameters (which can be easily estimated, as detailed in the text), is a much better model than $K$ Laplacians (requiring $K$ critical parameters) fitted specifically to the coefficients associated with each atom. Whether these modeling improvements have a practical impact is explored in the next experiments.

    ⁹To compute the empirical distribution, we quantized the elements of $A$ uniformly in steps of $2^{-8}$, which, for the amount of data available, gives us enough detail and simultaneously reliable statistics for all the quantized values.

    C. Recovery of Noisy Sparse Signals

    Here we compare the active set recovery properties of the MOE prior with those of the $\ell_1$-based one, on data for which the sparsity assumption $|\mathcal{A}_j| \leq L$ holds exactly for all $j$. To this end, we obtain sparse approximations to each sample $x_j$ using the $\ell_0$-based OMP algorithm on $D$ [21], and record the resulting active sets $\mathcal{A}_j$ as ground truth. The data is then contaminated with additive Gaussian noise of variance $\sigma^2$ and the recovery is performed by solving (1) for $A$ with $\epsilon = C M \sigma^2$ and either the $\ell_1$ or the MOE-based regularizer for $\psi(\cdot)$. We use $C = 1.32$, which is a standard value in denoising applications (see [9]).

    For each sample $j$, we measure the error of each method in recovering the active set as the Hamming distance between the true and estimated support of the corresponding reconstruction coefficients. The accuracy of the method is then given as the percentage of the samples for which this error falls below a certain threshold $T$. The results are shown in Fig. 3(c) for $L = (5, 10)$ and $T = (2, 4)$, respectively, for various values of $\sigma$. Note the very significant improvement obtained with the proposed model.
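    The evaluation metric just described is simple to reproduce. A minimal sketch (ours):

        import numpy as np

        def support_hamming(a_true, a_est):
            # Hamming distance between the supports (active sets) of two
            # coefficient vectors.
            return int(np.sum((a_true != 0) != (a_est != 0)))

        def active_set_accuracy(A_true, A_est, T):
            # Fraction of samples (columns) whose support-recovery error is
            # below the threshold T, as in Section V-C.
            errors = [support_hamming(A_true[:, j], A_est[:, j])
                      for j in range(A_true.shape[1])]
            return float(np.mean(np.array(errors) < T))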

    Given the estimated active set $\hat{\mathcal{A}}_j$, the estimated clean patch is obtained by projecting $x_j$ onto the subspace defined by the atoms that are active according to $\hat{\mathcal{A}}_j$, using least squares (which is the standard procedure for denoising once the active set is determined). We then measure the PSNR of the estimated patches with respect to the true ones. The results are shown in Fig. 3(d), again for various values of $\sigma$. As can be observed,


    TABLE I

    DENOISING RESULTS. IN EACH TABLE, EACH COLUMN SHOWS THE DENOISING PERFORMANCE OF A LEARNING + CODING COMBINATION. RESULTS ARE SHOWN IN PAIRS, WHERE THE LEFT NUMBER IS THE PSNR BETWEEN THE CLEAN AND RECOVERED INDIVIDUAL PATCHES, AND THE RIGHT NUMBER IS THE PSNR BETWEEN THE CLEAN AND RECOVERED IMAGES.

    Best results are in bold. The proposed MOE produces better final results than both the $\ell_0$ and $\ell_1$ ones in all cases, and at patch level for all $\sigma > 10$. Note that the average values reported are the PSNR of the average MSE, and not the average of the PSNRs.


    Fig. 4. Sample image denoising results. (a) Barbara, $\sigma = 30$. (b) Boat, $\sigma = 40$. From left to right: noisy, $\ell_1$/OMP, $\ell_1$/$\ell_1$, MOE/MOE. The reconstruction obtained with the proposed model is more accurate, as evidenced by a better reconstruction of the texture in Barbara and sharper edges in Boat, and does not produce the artifacts seen in both the $\ell_1$ and $\ell_0$ reconstructions, which appear as black/white speckles all over Barbara, and ringing on the edges in Boat.

    the MOE-based recovery is significantly better, especially in the high SNR range. Notably, the more accurate active set recovery of MOE does not seem to improve the denoising performance in this case. However, as we will see next, it makes a difference when denoising real-life signals, as well as for classification tasks.
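    For reference, a minimal sketch (ours) of the least-squares projection step described above, which estimates the clean patch from the atoms in the recovered active set:

        import numpy as np

        def project_on_active_set(x, D, active_set):
            # Least-squares estimate of the clean patch using only the atoms
            # indexed by the recovered active set (standard denoising step).
            Ds = D[:, sorted(active_set)]
            coef, *_ = np.linalg.lstsq(Ds, x, rcond=None)
            return Ds @ coef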

    D. Recovery of Real Signals With Simulated Noise

    This experiment is analogous to the previous one, but the data are now the original natural image patches (without forcing exact sparsity). Because, in this case, the sparsity assumption is only approximate, and no ground truth is available for the active sets, we compare the different methods in terms of their denoising performance.

    A critical strategy in image denoising is the use of overlapping patches, where for every pixel in the image a patch is extracted with that pixel as its center. The patches are denoised independently as $M$-dimensional signals and then recombined into the final denoised image by simple averaging. Although this consistently improves the final result in all cases, the improvement is very different depending on the method used to denoise the individual patches. Therefore, we now compare the denoising performance of each method at two levels: individual patches and final image.
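    The overlapping-patch scheme itself is method-agnostic. A minimal sketch (ours), where denoise_patch stands for any per-patch estimator, e.g., sparse coding followed by reconstruction:

        import numpy as np

        def denoise_image(img, denoise_patch, m=8):
            # Extract all overlapping m x m patches, denoise each one independently,
            # and recombine the estimates by plain averaging.
            H, W = img.shape
            acc = np.zeros_like(img, dtype=float)
            cnt = np.zeros_like(img, dtype=float)
            for i in range(H - m + 1):
                for j in range(W - m + 1):
                    patch = img[i:i + m, j:j + m]
                    est = denoise_patch(patch.reshape(-1)).reshape(m, m)
                    acc[i:i + m, j:j + m] += est
                    cnt[i:i + m, j:j + m] += 1.0
            return acc / cnt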

    To denoise each image, the global dictionary described in Section V-A is further adapted to the noisy image patches using (21) for a few iterations, and then used to encode the noisy patches via (2) with $\epsilon = C M \sigma^2$. We repeated the experiment for two learning variants ($\ell_1$ and MOE regularizers) and two coding variants [(2) with the regularizer used for learning, and $\ell_0$ via OMP]. The four variants were applied to the standard images Barbara, Boats, Lena, Man, and Peppers, and the results are summarized in Table I. We show sample results in Fig. 4. Although the quantitative improvements seen in Table I are small compared with $\ell_1$, there is a significant improvement at the visual level, as can be seen in Fig. 4. In all cases, the PSNR obtained coincides with or surpasses the ones reported in [1].¹⁰

    ¹⁰Note that in [1], the denoised image is finally blended with the noisy image using an empirical weight, providing an extra improvement to the final PSNR in some cases. The results reported here are already better without this extra step.



    Fig. 5. Zoomed-in results. (a) Summary. (b) Tools. (c) Detail of the zoomed-in results for the framed region, top to bottom and left to right: cubic, $\ell_0$, $\ell_1$, MOE. As can be seen, the MOE result is as sharp as $\ell_0$ but produces fewer artifacts. This is reflected in the 0.1 dB overall improvement obtained with MOE, as seen in the leftmost summary table.

    Fig. 6. Textures used in the texture classification example.

    E. Zooming

As an example of signal recovery in the absence of noise, we took the previous set of images, plus a particularly challenging one (Tools), and subsampled them to half their size on each side. We then simulated a zooming effect by upsampling them and estimating each of the 75% missing pixels (see [52] and references therein). We use a technique similar to the one in [53]: the image is first interpolated and then deconvolved using a Wiener filter, and the artifacts of the deconvolved image are treated as noise in the reconstruction. However, as there is no real noise, we do not perform averaging of the patches, using only the center pixel of xj to fill in the missing pixel at j. The results are summarized in Fig. 5, where we again observe that using MOE instead of ℓ0 and ℓ1 improves the results.
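The following rough sketch, not the authors' implementation, mirrors the steps just described with off-the-shelf SciPy components: cubic upsampling (scipy.ndimage.zoom), a local adaptive Wiener filter (scipy.signal.wiener) as a crude stand-in for the deconvolution scheme of [53], and a user-supplied patch-wise sparse reconstruction routine. The assumption that the observed low-resolution samples map onto the even grid of the upsampled image is also part of the sketch, not of the paper.

    # Rough sketch of the zooming pipeline: upsample, crude Wiener deconvolution,
    # then fill the missing pixels from a patch-wise sparse reconstruction.
    import numpy as np
    from scipy.ndimage import zoom
    from scipy.signal import wiener

    def zoom_2x(low_res, reconstruct_patches):
        """low_res: 2x-downsampled image; reconstruct_patches: callable implementing the
        patch-wise sparse reconstruction (e.g., a variant of the sketch in Section V-D)."""
        up = zoom(low_res, 2, order=3)      # interpolate the 75% missing pixels
        deconv = wiener(up, mysize=5)       # residual artifacts are treated as "noise"
        recon = reconstruct_patches(deconv)
        out = deconv.copy()
        missing = np.ones_like(out, dtype=bool)
        missing[::2, ::2] = False           # assumed layout of the observed samples
        out[missing] = recon[missing]       # keep observed pixels, fill in the rest
        return out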

    F. Classification With Universal Sparse Models

In this section, we apply our proposed universal models to a classification problem where each sample xj is to be assigned a class label yj = 1, . . . , c, which serves as an index into the set of possible classes {C1, C2, . . . , Cc}. We follow the procedure of [49], where the classifier assigns each sample xj by means of the MAP criterion (5), with the term −log P(a) corresponding to the assumed prior, and the dictionaries representing each class are learned from training samples using (21) with the corresponding regularizer −log P(a). Each experiment is repeated for the baseline Laplacian model, implied by the ℓ1 regularizer, and for the universal model MOE, and the results are then compared. In this case, we expect that the more accurate prior model for the coefficients will result in an improved likelihood estimation, which in turn should improve the accuracy of the system.
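A minimal sketch of this MAP rule, assuming per-class dictionaries and a generic sparse coder are available; sparse_code, lam, and the MOE parameters below are placeholders, and the MOE term is the negative log of the symmetrized MOE density up to an additive constant.

    # Sketch of the MAP classification rule: encode a sample with each class dictionary
    # and pick the class minimizing reconstruction error plus the negative log prior.
    import numpy as np

    def moe_neg_log_prior(a, kappa, beta):
        # -log of the (symmetrized) MOE density, up to an additive constant
        return (kappa + 1.0) * np.sum(np.log(np.abs(a) + beta))

    def classify(x, dictionaries, sparse_code, kappa, beta, lam=1.0):
        costs = []
        for D in dictionaries:               # one learned dictionary per class
            a = sparse_code(x, D)
            fit = 0.5 * np.sum((x - D @ a) ** 2)
            costs.append(fit + lam * moe_neg_log_prior(a, kappa, beta))
        return int(np.argmin(costs))         # index of the most likely class

Replacing moe_neg_log_prior by lam * ||a||_1 recovers the baseline Laplacian (ℓ1) classifier against which MOE is compared below.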

We begin with a classic texture classification problem, where patches have to be identified as belonging to one out of a number of possible textures. In this case, we experimented with samples of c = 2 and c = 3 textures drawn at random from the Brodatz database,11 the ones actually used being shown in Fig. 6. In each case, the experiment was repeated 10 times. In each repetition, a dictionary of K = 300 atoms was learned from all 16×16 patches of the leftmost halves of each sample texture. We then classified the patches from the rightmost halves of the texture samples. For c = 2, we obtained an average error rate of 5.13% using ℓ1 against 4.12% when using MOE, which represents a reduction of 20% in classification error. For c = 3, the average error rate obtained was 13.54% using ℓ1 and 11.48% using MOE, which is 15% lower. Thus, using the universal model instead of ℓ1 yields a significant improvement in this case (see [8] for other results in classification of Brodatz textures).

The second sample problem presented is the Graz02 bike detection problem,12 where each pixel of each testing image has to be classified as either background or as part of a bike.

11Available at http://www.ux.uis.no/tranden/brodatz.html.
12Available at http://lear.inrialpes.fr/people/marszalek/data/ig02/.



Fig. 7. Classification results. (a) Precision versus recall curve. (b) Sample image from the Graz02 dataset. (c) From left to right: its ground truth and the corresponding estimated maps obtained with ℓ1 and MOE for a fixed threshold. The precision versus recall curve shows that the mixture model gives a better precision in all cases. In the example, the classification obtained with MOE yields fewer false positives and more true positives than the one obtained with ℓ1.

In the Graz02 dataset, each of the pixels can belong to one of two classes: bike or background. On each of the training images (which by convention are the first 150 even-numbered images), we are given a mask that tells us whether each pixel belongs to a bike or to the background. We then train a dictionary for bike patches and another for background patches. Patches that contain pixels from both classes are assigned to the class corresponding to the majority of their pixels.

In Fig. 7, we show the precision versus recall curves obtained with the detection framework when either the ℓ1 or the MOE regularizer was used in the system. As can be seen, the MOE-based model outperforms ℓ1 in this classification task as well, giving a better precision for all recall values.
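For reference, a hypothetical sketch of how such pixel-level precision versus recall curves can be computed from a per-pixel score map (e.g., the difference of class-conditional costs) and the binary ground-truth mask, by sweeping a decision threshold; this is not the evaluation code used in the paper.

    # Pixel-level precision/recall curve from a score map and a ground-truth mask.
    import numpy as np

    def precision_recall(scores, gt_mask, num_thresholds=100):
        s = scores.ravel()
        g = gt_mask.ravel().astype(bool)
        precisions, recalls = [], []
        for t in np.linspace(s.min(), s.max(), num_thresholds):
            pred = s >= t
            tp = np.sum(pred & g)
            fp = np.sum(pred & ~g)
            fn = np.sum(~pred & g)
            precisions.append(tp / max(tp + fp, 1))
            recalls.append(tp / max(tp + fn, 1))
        return np.array(precisions), np.array(recalls)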

In the above experiments, the parameters for the ℓ1 prior (λ), the MOE model (λMOE), and the incoherence term were all adjusted by cross-validation. The only exception is the MOE parameter β, which was chosen based on the fitting experiment as β = 0.07.

    VI. CONCLUSION

A framework for designing sparse modeling priors was introduced in this paper, using tools from universal coding, which formalizes sparse coding and modeling from an MDL perspective. The priors obtained lead to models with both theoretical and practical advantages over the traditional ℓ0- and ℓ1-based ones. In all derived cases, the designed nonconvex problems are suitable to be efficiently (approximately) solved via a few iterations of (weighted) ℓ1 subproblems. We also showed that these priors are able to fit the empirical distribution of sparse codes of image patches significantly better than the traditional IID Laplacian model, and even than the nonidentically distributed independent Laplacian model where a different Laplacian parameter is adjusted to the coefficients associated with each atom, thus showing the flexibility and accuracy of the proposed models. The additional flexibility, furthermore, comes at the small cost of only two parameters that can be easily and efficiently tuned (either (κ, β) in the MOE model, or (θ1, θ2) in the JOE model), instead of K (dictionary size), as in weighted ℓ1 models. The additional accuracy of the proposed models was shown to have significant

practical impact in active set recovery of sparse signals, image denoising, and classification applications. Compared with the Bayesian approach, we avoid the potential burden of solving several sampled sparse problems, or being forced to use a conjugate prior for computational reasons (although in our case, a fortiori, the conjugate prior does provide us with a good model). Overall, as demonstrated in this paper, the introduction of information theory tools can lead to formally addressing critical aspects of sparse modeling.

Future work in this direction includes the design of priors that take into account the nonzero mass at a = 0 that appears in overcomplete models, and online learning of the model parameters from noisy data, following the technique in [47].

APPENDIX A
DERIVATION OF THE MOE MODEL

In this case, we have P(a|θ) = θe^{−θa} and w(θ|κ, β) = (β^κ/Γ(κ)) θ^{κ−1} e^{−βθ}, which, when plugged into (9), gives

$$
Q(a\mid\kappa,\beta)
= \int_{0}^{\infty} \theta e^{-\theta a}\,
  \frac{\beta^{\kappa}}{\Gamma(\kappa)}\,\theta^{\kappa-1} e^{-\beta\theta}\, d\theta
= \frac{\beta^{\kappa}}{\Gamma(\kappa)}
  \int_{0}^{\infty} \theta^{\kappa} e^{-\theta(a+\beta)}\, d\theta .
$$

After the change of variables u := θ(a + β) (u(0) = 0, u(∞) = ∞), the integral can be written as

$$
\begin{aligned}
Q(a\mid\kappa,\beta)
&= \frac{\beta^{\kappa}}{\Gamma(\kappa)}
   \int_{0}^{\infty} e^{-u}\left(\frac{u}{a+\beta}\right)^{\!\kappa}\frac{du}{a+\beta} \\
&= \frac{\beta^{\kappa}}{\Gamma(\kappa)}\,(a+\beta)^{-(\kappa+1)}
   \int_{0}^{\infty} e^{-u} u^{\kappa}\, du \\
&= \frac{\beta^{\kappa}}{\Gamma(\kappa)}\,(a+\beta)^{-(\kappa+1)}\,\Gamma(\kappa+1)
 = \frac{\beta^{\kappa}}{\Gamma(\kappa)}\,(a+\beta)^{-(\kappa+1)}\,\kappa\,\Gamma(\kappa),
\end{aligned}
$$

obtaining Q(a|κ, β) = κβ^κ (a + β)^{−(κ+1)}, because the integral on the second line is precisely the definition of Γ(κ + 1) = κΓ(κ). The symmetrization is obtained by substituting a by |a| and dividing the normalization constant by two, Q(|a| | κ, β) = 0.5 κβ^κ (|a| + β)^{−(κ+1)}.
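As a quick numerical sanity check of this closed form (using the notation above and arbitrary test values of κ and β), one can compare the Gamma-mixed exponential, evaluated by numerical integration, with κβ^κ(a + β)^{−(κ+1)}:

    # Sanity check: the Gamma mixture of exponentials matches the closed form above.
    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gamma as Gamma

    def moe_by_integration(a, kappa, beta):
        integrand = lambda th: (th * np.exp(-th * a)
                                * beta**kappa / Gamma(kappa)
                                * th**(kappa - 1) * np.exp(-beta * th))
        value, _ = quad(integrand, 0.0, np.inf)
        return value

    def moe_closed_form(a, kappa, beta):
        return kappa * beta**kappa * (a + beta) ** (-(kappa + 1))

    for a in (0.0, 0.1, 1.0, 5.0):
        assert np.isclose(moe_by_integration(a, 2.5, 0.07),
                          moe_closed_form(a, 2.5, 0.07), rtol=1e-4)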


The mean of the MOE distribution (which is defined only for κ > 1) can be easily computed using integration by parts

$$
\mu(\kappa,\beta)
= \int_{0}^{\infty} u\,\kappa\beta^{\kappa}(u+\beta)^{-(\kappa+1)}\, du
= \Big[-u\,\beta^{\kappa}(u+\beta)^{-\kappa}\Big]_{0}^{\infty}
  + \int_{0}^{\infty} \frac{\beta^{\kappa}}{(u+\beta)^{\kappa}}\, du
= \frac{\beta}{\kappa-1} .
$$

In the same way, it is easy to see that the noncentral moments of order i are $\mu_i = \beta^{i}\big/\binom{\kappa-1}{i}$, defined for κ > i.

The MLE estimates of κ and β can be obtained using any nonlinear optimization technique, such as Newton's method, using, for example, the estimates obtained with the method of moments as a starting point. In practice, however, we have not observed any significant improvement from using the MLE estimates over the moments-based ones.
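A possible implementation of that moments-based starting point, obtained by inverting μ1 = β/(κ − 1) and μ2 = 2β²/((κ − 1)(κ − 2)) (the i = 1, 2 cases of the moment formula above); it implicitly assumes κ > 2, i.e., that the empirical second moment exceeds twice the squared first moment.

    # Method-of-moments estimates for the MOE parameters (kappa, beta).
    # Solves mu1 = beta/(kappa-1), mu2 = 2*beta**2/((kappa-1)*(kappa-2)).
    import numpy as np

    def moe_moment_estimates(coeffs):
        a = np.abs(np.asarray(coeffs, dtype=float))
        m1, m2 = a.mean(), (a ** 2).mean()
        r = m2 / (m1 ** 2)                    # equals 2*(kappa-1)/(kappa-2) under the model
        kappa = 2.0 * (r - 1.0) / (r - 2.0)   # requires r > 2 (kappa > 2)
        beta = m1 * (kappa - 1.0)
        return kappa, beta

These values can then seed a Newton-type refinement of the MLE, as mentioned above.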

    A. Expected Approximation Error in Cost Function

As mentioned in the optimization section, the LLA approximates the MOE regularizer by a weighted ℓ1. Here we develop an expression for the expected error between the true function and its convex approximation, where the expectation is taken (naturally) with respect to the MOE distribution. Given the value of the current iterate a^{(t)} = a0 (assumed positive, because the function and its approximation are symmetric), the approximated regularizer is ψ̃^{(t)}(a) = log(a0 + β) + (a − a0)/(|a0| + β). We have

$$
\begin{aligned}
\mathbb{E}_{a\sim\mathrm{MOE}(\kappa,\beta)}\!\left[\tilde{\psi}^{(t)}(a)-\psi(a)\right]
&= \int_{0}^{\infty}\frac{\kappa\beta^{\kappa}}{(a+\beta)^{\kappa+1}}
   \left[\log(a_0+\beta)+\frac{a-a_0}{a_0+\beta}-\log(a+\beta)\right] da \\
&= \log(a_0+\beta)-\frac{a_0}{a_0+\beta}
   +\frac{1}{a_0+\beta}\int_{0}^{\infty}\frac{\kappa\beta^{\kappa}\,a}{(a+\beta)^{\kappa+1}}\, da
   -\int_{0}^{\infty}\frac{\kappa\beta^{\kappa}\log(a+\beta)}{(a+\beta)^{\kappa+1}}\, da \\
&= \log(a_0+\beta)-\frac{a_0}{a_0+\beta}
   +\frac{\beta}{(a_0+\beta)(\kappa-1)}-\log\beta-\frac{1}{\kappa} .
\end{aligned}
$$
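The closed form above can be checked numerically, under the same notation, by integrating the pointwise difference against the MOE density; the values of a0, κ, and β below are arbitrary test values.

    # Numerical check of the expected LLA approximation error derived above.
    import numpy as np
    from scipy.integrate import quad

    def expected_lla_error_numeric(a0, kappa, beta):
        q = lambda a: kappa * beta**kappa * (a + beta) ** (-(kappa + 1))
        diff = lambda a: (np.log(a0 + beta) + (a - a0) / (a0 + beta)
                          - np.log(a + beta)) * q(a)
        value, _ = quad(diff, 0.0, np.inf)
        return value

    def expected_lla_error_closed(a0, kappa, beta):
        return (np.log((a0 + beta) / beta)
                + (beta / (kappa - 1.0) - a0) / (a0 + beta)
                - 1.0 / kappa)

    print(expected_lla_error_numeric(0.5, 3.0, 0.07),
          expected_lla_error_closed(0.5, 3.0, 0.07))

Since the approximation is the tangent of a concave function, the expected difference is nonnegative, as the printed values confirm.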

APPENDIX B
DERIVATION OF THE CONSTRAINED JEFFREYS (JOE) MODEL

In the case of the exponential distribution, the Fisher information matrix in (15) evaluates to

$$
I(\theta)
= \mathbb{E}_{P(a\mid\theta)}\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\,(\theta a-\log\theta)\right]
= \mathbb{E}_{P(a\mid\theta)}\!\left[\frac{1}{\theta^{2}}\right]
= \frac{1}{\theta^{2}} .
$$

Plugging this result into (14) with Θ = [θ1, θ2], 0 < θ1 < θ2, gives the constrained Jeffreys prior w(θ) = 1/(θ ln(θ2/θ1)) for θ ∈ [θ1, θ2], and the corresponding mixture Q(a|θ1, θ2) = (e^{−θ1 a} − e^{−θ2 a})/(a ln(θ2/θ1)). The noncentral moments of this distribution involve integrals of the form

$$
\kappa_{i}^{+} = \int_{0}^{+\infty} a^{i-1} e^{-\theta_{1} a}\, da ,
\qquad
\kappa_{i}^{-} = \int_{0}^{+\infty} a^{i-1} e^{-\theta_{2} a}\, da .
$$

These integrals can be solved using integration by parts

$$
\begin{aligned}
\kappa_{i}^{+} &= \Big[-a^{i-1}\,\theta_{1}^{-1} e^{-\theta_{1} a}\Big]_{0}^{+\infty}
+ \frac{i-1}{\theta_{1}} \int_{0}^{+\infty} a^{i-2} e^{-\theta_{1} a}\, da , \\
\kappa_{i}^{-} &= \Big[-a^{i-1}\,\theta_{2}^{-1} e^{-\theta_{2} a}\Big]_{0}^{+\infty}
+ \frac{i-1}{\theta_{2}} \int_{0}^{+\infty} a^{i-2} e^{-\theta_{2} a}\, da ,
\end{aligned}
$$

where the first term on the right-hand side of both equations evaluates to 0 for i > 1. Therefore, for i > 1 we obtain the recursions κ_i^+ = ((i − 1)/θ1) κ_{i−1}^+ and κ_i^− = ((i − 1)/θ2) κ_{i−1}^−, which, combined with the result for i = 1, give the final expression for all the moments of order i > 0

$$
\mu_{i} = \frac{(i-1)!}{\ln(\theta_{2}/\theta_{1})}
\left(\frac{1}{\theta_{1}^{\,i}} - \frac{1}{\theta_{2}^{\,i}}\right),
\qquad i = 1, 2, \ldots
$$

In particular, for i = 1 and i = 2, we have μ1 = (ln(θ2/θ1))^{−1}(θ1^{−1} − θ2^{−1}) and μ2 = (ln(θ2/θ1))^{−1}(θ1^{−2} − θ2^{−2}), which, when combined, give us

$$
\theta_{1} = \frac{2\mu_{1}}{\mu_{2} + \ln(\theta_{2}/\theta_{1})\,\mu_{1}^{2}} ,
\qquad
\theta_{2} = \frac{2\mu_{1}}{\mu_{2} - \ln(\theta_{2}/\theta_{1})\,\mu_{1}^{2}} . \tag{23}
$$


One possibility is to solve the nonlinear equation θ2/θ1 = (μ2 + ln(θ2/θ1)μ1²)/(μ2 − ln(θ2/θ1)μ1²) for u = θ2/θ1, that is, to find the roots of u = (μ2 + μ1² ln u)/(μ2 − μ1² ln u) and choose one of them based on some side information. Another possibility is to simply fix the ratio θ2/θ1 beforehand and solve for θ1 and θ2 using (23).
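A sketch of this moment-matching procedure, assuming sample moments computed from |a| and using a bracketed root search (u = 1 is always a spurious root of the equation above, so the search starts just above it); the bracket-expansion loop and the requirement μ2 > 2μ1² are implementation assumptions of this sketch.

    # Moment matching for the JOE parameters: find the nontrivial root u = theta2/theta1
    # of u = (mu2 + mu1**2*ln u)/(mu2 - mu1**2*ln u), then recover theta1, theta2 via (23).
    import numpy as np
    from scipy.optimize import brentq

    def joe_moment_estimates(coeffs):
        a = np.abs(np.asarray(coeffs, dtype=float))
        m1, m2 = a.mean(), (a ** 2).mean()     # assumes m2 > 2*m1**2 (heavy-tailed data)

        # f(u) = 0 at the sought ratio; f is positive just above u = 1 and eventually negative.
        f = lambda u: u * (m2 - m1**2 * np.log(u)) - (m2 + m1**2 * np.log(u))

        lo, hi = 1.0 + 1e-6, 2.0
        while f(hi) > 0:                       # expand the bracket until f changes sign
            hi *= 2.0
        u = brentq(f, lo, hi)

        L = np.log(u)                          # = ln(theta2/theta1)
        theta1 = 2.0 * m1 / (m2 + L * m1**2)
        theta2 = 2.0 * m1 / (m2 - L * m1**2)
        return theta1, theta2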

APPENDIX C
DERIVATION OF THE CONDITIONAL JEFFREYS (CMOE) MODEL

The conditional Jeffreys method defines a proper prior w(θ) by assuming that n0 samples a^{n0} = (a1, . . . , a_{n0}) from the data to be modeled were already observed. Plugging the Fisher information for the exponential distribution, I(θ) = θ^{−2}, into (18) we obtain

$$
w(\theta)
= \frac{P(a^{n_0}\mid\theta)\,\theta^{-1}}
       {\int_{0}^{+\infty} P(a^{n_0}\mid\theta)\,\theta^{-1}\, d\theta}
= \frac{\left(\prod_{j=1}^{n_0}\theta e^{-\theta a_j}\right)\theta^{-1}}
       {\int_{0}^{+\infty}\left(\prod_{j=1}^{n_0}\theta e^{-\theta a_j}\right)\theta^{-1}\, d\theta}
= \frac{\theta^{\,n_0-1}\, e^{-\theta\sum_{j=1}^{n_0} a_j}}
       {\int_{0}^{+\infty}\theta^{\,n_0-1}\, e^{-\theta\sum_{j=1}^{n_0} a_j}\, d\theta} .
$$

Denoting S0 = Σ_{j=1}^{n0} a_j and performing the change of variables u := θS0, we obtain

$$
w(\theta)
= \frac{\theta^{\,n_0-1} e^{-\theta S_0}\, S_0^{\,n_0}}
       {\int_{0}^{+\infty} u^{\,n_0-1} e^{-u}\, du}
= \frac{S_0^{\,n_0}\,\theta^{\,n_0-1} e^{-\theta S_0}}{\Gamma(n_0)} ,
$$

where the last equality follows from the definition of the Gamma function, Γ(n0). We see that the resulting prior w(θ) is a Gamma distribution Gamma(κ0, β0) with κ0 = n0 and β0 = S0 = Σ_{j=1}^{n0} a_j.
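Making the connection with Appendix A explicit (this is only a restatement: it follows from the computation of Appendix A applied with κ = n0 and β = S0):

$$
Q(a \mid a^{n_0})
= \int_{0}^{\infty} \theta e^{-\theta a}\,
  \frac{S_0^{\,n_0}}{\Gamma(n_0)}\,\theta^{\,n_0-1} e^{-S_0\theta}\, d\theta
= \frac{n_0\, S_0^{\,n_0}}{(a + S_0)^{\,n_0+1}} ,
$$

that is, the conditional Jeffreys mixture is itself an MOE with parameters κ = n0 and β = S0.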

ACKNOWLEDGMENT

The authors would like to thank J. Mairal for providing his fast sparse modeling toolbox, SPAMS.13 The authors also thank F. Lecumberry for his participation in the incoherent dictionary learning method and helpful comments.

13Available at http://www.di.ens.fr/willow/SPAMS/.

    REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[2] K. Engan, S. Aase, and J. Husoy, "Multi-frame compression: Theory and design," Signal Process., vol. 80, no. 10, pp. 2121–2140, Oct. 2000.
[3] B. Olshausen and D. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?" Vision Res., vol. 37, no. 23, pp. 3311–3325, 1997.
[4] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, 2004.
[5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Stat., vol. 32, no. 2, pp. 407–499, 2004.
[6] S. Chen, D. Donoho, and M. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.

[7] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink, "Sparse multinomial logistic regression: Fast algorithms and generalization bounds," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 957–968, Jun. 2005.
[8] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Supervised dictionary learning," in Proc. Adv. Neural Inform. Process. Syst., 2008, pp. 1–8.
[9] J. Mairal, G. Sapiro, and M. Elad, "Learning multiscale sparse representations for image and video restoration," Comput. Inform. Sci., vol. 7, no. 1, pp. 214–241, Apr. 2008.
[10] R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng, "Self-taught learning: Transfer learning from unlabeled data," in Proc. Int. Conf. Mach. Learn., Jun. 2007, pp. 759–766.
[11] A. Bruckstein, D. Donoho, and M. Elad, "From sparse solutions of systems of equations to sparse modeling of signals and images," SIAM Rev., vol. 51, no. 1, pp. 34–81, Feb. 2009.
[12] R. Giryes, Y. Eldar, and M. Elad, "Automatic parameter setting for iterative shrinkage methods," in Proc. IEEE 25th Convent. Electron. Elect. Eng., Dec. 2008, pp. 820–824.
[13] M. Figueiredo, "Adaptive sparseness using Jeffreys prior," in Adv. NIPS, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, Dec. 2001, pp. 697–704.
[14] H. Zou, "The adaptive LASSO and its oracle properties," J. Amer. Stat. Assoc., vol. 101, no. 476, pp. 1418–1429, 2006.
[15] J. Rissanen, "Universal coding, information, prediction and estimation," IEEE Trans. Inform. Theory, vol. 30, no. 4, pp. 629–636, Jul. 1984.
[16] R. Coifman and M. Wickerhauser, "Entropy-based algorithms for best basis selection," IEEE Trans. Inform. Theory, vol. 38, no. 2, pp. 713–718, Mar. 1992.
[17] P. Moulin and J. Liu, "Analysis of multiresolution image denoising schemes using generalized-Gaussian and complexity priors," IEEE Trans. Inform. Theory, vol. 45, no. 3, pp. 909–919, Apr. 1999.
[18] N. Saito, "Simultaneous noise suppression and signal compression using a library of orthonormal bases and the MDL criterion," in Wavelets in Geophysics, E. Foufoula-Georgiou and P. Kumar, Eds. New York: Academic, 1994.
[19] E. J. Candès, "Compressive sampling," in Proc. Int. Congr. Math., Aug. 2006, pp. 1–20.
[20] I. Ramírez and G. Sapiro, "An MDL framework for sparse coding and dictionary learning," IEEE Trans. Signal Process., vol. 60, no. 6, pp. 2913–2927, Jun. 2012.
[21] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397–3415, Dec. 1993.
[22] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer-Verlag, Feb. 2009.
[23] R. Tibshirani, "Regression shrinkage and selection via the LASSO," J. Royal Stat. Soc. Series B, vol. 58, no. 1, pp. 267–288, 1996.
[24] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 2006.
[25] J. Rissanen, Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific, 1992.
[26] E. Lam and J. Goodman, "A mathematical analysis of the DCT coefficient distributions for images," IEEE Trans. Image Process., vol. 9, no. 10, pp. 1661–1666, Oct. 2000.
[27] S. Ji, Y. Xue, and L. Carin, "Bayesian compressive sensing," IEEE Trans. Signal Process., vol. 56, no. 6, pp. 2346–2356, Jun. 2008.
[28] M. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211–244, Jan. 2001.
[29] D. Wipf and B. Rao, "An empirical Bayesian strategy for solving the simultaneous sparse approximation problem," IEEE Trans. Signal Process., vol. 55, no. 7, pp. 3704–3716, Jul. 2007.
[30] D. Wipf, J. Palmer, and B. Rao, "Perspectives on sparse Bayesian learning," in Proc. Adv. Neural Inform. Process. Syst., Dec. 2003.
[31] M. Everingham, A. Zisserman, C. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
[32] N. Merhav and M. Feder, "Universal prediction," IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2124–2147, Oct. 1998.


[33] Y. Shtarkov, "Universal sequential coding of single messages," Probl. Inform. Transm., vol. 23, no. 3, pp. 3–17, Jul. 1987.
[34] P. Grünwald, The Minimum Description Length Principle. Cambridge, MA: MIT Press, Jun. 2007.
[35] G. Schwarz, "Estimating the dimension of a model," Ann. Stat., vol. 6, no. 2, pp. 461–464, 1978.
[36] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," J. Amer. Stat. Assoc., vol. 96, no. 456, pp. 1348–1360, Dec. 2001.
[37] E. J. Candès, M. Wakin, and S. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," J. Fourier Anal. Appl., vol. 14, no. 5, pp. 877–905, Dec. 2008.
[38] R. Saab, R. Chartrand, and O. Yilmaz, "Stable sparse approximation via nonconvex optimization," in Proc. IEEE Acoust. Speech Signal Process. Int. Conf., Apr. 2008, pp. 3885–3888.
[39] S. Foucart and M. Lai, "Sparsest solutions of underdetermined linear systems via ℓq-minimization for 0 < q ≤ 1," Appl. Comput. Harmonic Anal., vol. 26, no. 3, pp. 395–407, 2009.
[40] J. Trzasko and A. Manduca, "Relaxed conditions for sparse signal recovery with general concave priors," IEEE Trans. Signal Process., vol. 57, no. 11, pp. 4347–4354, Nov. 2009.
[41] R. Chartrand, "Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few data," in Proc. IEEE Biomed. Imag.: From Nano to Macro Int. Symp., Jun. 2009, pp. 262–265.
[42] J. Trzasko and A. Manduca, "Highly undersampled magnetic resonance image reconstruction via homotopic ℓ0-minimization," IEEE Trans. Med. Imag., vol. 28, no. 1, pp. 106–121, Jan. 2009.
[43] J. Bernardo and A. Smith, Bayesian Theory. New York: Wiley, 1994.
[44] A. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2743–2760, Oct. 1998.
[45] H. Zou and R. Li, "One-step sparse estimates in nonconcave penalized likelihood models," Ann. Stat., vol. 36, no. 4, pp. 1509–1533, 2008.
[46] G. Gasso, A. Rakotomamonjy, and S. Canu, "Recovering sparse signals with non-convex penalties and DC programming," IEEE Trans. Signal Process., vol. 57, no. 12, pp. 4686–4698, Dec. 2009.
[47] G. Motta, E. Ordentlich, I. Ramírez, G. Seroussi, and M. Weinberger, "The iDUDE framework for grayscale image denoising," IEEE Trans. Image Process., vol. 20, no. 1, pp. 1–21, Jan. 2011.
[48] I. Ramírez, F. Lecumberry, and G. Sapiro, "Universal priors for sparse modeling," in Proc. IEEE Comput. Adv. Multi-Sensor Adaptive Process. Int. Workshop, Dec. 2009, pp. 197–200.
[49] I. Ramírez, P. Sprechmann, and G. Sapiro, "Classification and clustering via dictionary learning with structured incoherence and shared features," in Proc. IEEE Comput. Vision Pattern Recogn. Conf., Jun. 2010, pp. 3501–3508.
[50] J. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Trans. Inform. Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004.
[51] M. Elad, "Optimized projections for compressed sensing," IEEE Trans. Signal Process., vol. 55, no. 12, pp. 5695–5702, Dec. 2007.
[52] G. Yu, G. Sapiro, and S. Mallat, "Solving inverse problems with piecewise linear estimators: From Gaussian mixture models to structured sparsity," IEEE Trans. Image Process., vol. 21, no. 5, pp. 2481–2499, May 2012.
[53] R. Neelamani, H. Choi, and R. Baraniuk, "ForWaRD: Fourier-wavelet regularized deconvolution for ill-conditioned systems," IEEE Trans. Signal Process., vol. 52, no. 2, pp. 418–433, Feb. 2004.

Ignacio Ramírez (S'06) received the E.E. degree and the M.Sc. degree in electrical engineering from the Universidad de la República, Montevideo, Uruguay, and the Ph.D. degree in scientific computation from the University of Minnesota (UofM), Minneapolis, in 2002, 2007, and 2012, respectively.

He was a Research Assistant with UofM from 2008 to 2012. He held temporary research positions with UofM in 2007 and with Hewlett-Packard Laboratories, Palo Alto, CA, in 2004. He has held an assistantship with the Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, UdelaR, since 1999. His current research interests include applied information theory, statistical signal processing, and machine learning, with a focus on multimedia data processing and automatic model selection for sparse linear models.

Guillermo Sapiro (M'94) was born in Montevideo, Uruguay, on April 3, 1966. He received the B.Sc. (summa cum laude), M.Sc., and Ph.D. degrees from the Department of Electrical Engineering, Technion–Israel Institute of Technology, Haifa, Israel, in 1989, 1991, and 1993, respectively.

He was a Technical Staff Member at the research facilities of HP Laboratories, Palo Alto, CA, after he was a Post-Doctoral Researcher with the Massachusetts Institute of Technology, Cambridge. He is currently with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, where he is a Distinguished McKnight University Professor and the Vincentine Hermes-Luh Chair in electrical and computer engineering. He is engaged in research on differential geometry and geometric partial differential equations, both in theory and in applications in computer vision, computer graphics, medical imaging, and image analysis. He recently co-edited a special issue of the IEEE TRANSACTIONS ON IMAGE PROCESSING on this topic and a second one in the Journal of Visual Communication and Image Representation. He has authored or co-authored numerous papers and has written a book (Cambridge University Press, 2001).

Prof. Sapiro was a recipient of the Gutwirth Scholarship for Special Excellence in Graduate Studies in 1991, the Ollendorff Fellowship for Excellence in Vision and Image Understanding Work in 1992, the Rothschild Fellowship for Post-Doctoral Studies in 1993, the Office of Naval Research Young Investigator Award in 1998, the Presidential Early Career Award for Scientists and Engineers in 1998, the National Science Foundation CAREER Award in 1999, and the National Security Science and Engineering Faculty Fellowship in 2010. He is a member of SIAM. He is the founding Editor-in-Chief of the SIAM Journal on Imaging Sciences.