
Bayesian Inference for Neighborhood Filters with Application in Denoising

Chao-Tsung Huang
National Tsing Hua University, Taiwan

[email protected]

Abstract

Range-weighted neighborhood filters are useful and popular for their edge-preserving property and simplicity, but they were originally proposed as intuitive tools. Previous works needed to connect them to other tools or models for indirect property reasoning or parameter estimation. In this paper, we introduce a unified empirical Bayesian framework to do both directly. A neighborhood noise model is proposed to reason and infer the Yaroslavsky, bilateral, and modified non-local means filters. An EM+ algorithm is devised to estimate the essential parameter, range variance, via model fitting to empirical distributions. Finally, we apply this framework to color-image denoising. Experimental results show that the proposed model fits noisy images well and that the range variance is estimated successfully. The image quality can also be improved by a proposed recursive fitting and filtering scheme.

1. Introduction

Range-weighted formulation has been widely used to provide edge-preserving denoising since the introduction of local neighborhood filtering, especially the Yaroslavsky [28] and bilateral [23] filters. Many variations have also been proposed for different applications, such as the trilateral filter for high-contrast images [6] and the cross bilateral filter for fusing image pairs [20]. The reader is referred to [17] for more applications. The non-local means (NLM) [2] was further proposed for a non-local neighborhood and spawned fruitful patch-based extensions in denoising.

The intuitive idea behind these filters is to assign coefficients based on similarity; e.g., the bilateral filter is given by

\hat{z}_l = \frac{\sum_{i \in \Lambda_l} w_{l,i} d_{l,i} \cdot y_i}{\sum_{i \in \Lambda_l} w_{l,i} d_{l,i}}, \quad (1)

where y and ẑ are the observed and filtered signals respectively, and Λ_l represents the neighborhood at position l. The adaptive kernel consists of one range weight w_{l,i} = exp(-||y_l - y_i||²₂ / (2σ_r²)) and one distance weight d_{l,i} = exp(-||l - i||²₂ / (2σ_d²)). The former adapts to pixel similarity for edge preservation, which is controlled by the range variance σ_r²; the latter provides a spatial smoothing window.
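As a concrete illustration, the following is a minimal sketch of (1) in Python/NumPy for a single-channel image. It is a naive per-pixel loop written only to make the formula explicit, not the paper's implementation; all names are ours:

```python
import numpy as np

def bilateral_filter(y, sigma_r2, sigma_d2, radius):
    """Naive bilateral filter implementing Eq. (1) on a 2-D image y."""
    H, W = y.shape
    z = np.empty((H, W))
    # Distance weights d_{l,i} on the (2*radius+1)^2 window.
    off = np.arange(-radius, radius + 1)
    dy, dx = np.meshgrid(off, off, indexing="ij")
    d = np.exp(-(dy**2 + dx**2) / (2.0 * sigma_d2))
    pad = np.pad(y.astype(np.float64), radius, mode="reflect")
    for r in range(H):
        for c in range(W):
            nbr = pad[r:r + 2*radius + 1, c:c + 2*radius + 1]
            w = np.exp(-(nbr - y[r, c])**2 / (2.0 * sigma_r2))  # range weights w_{l,i}
            k = w * d
            z[r, c] = np.sum(k * nbr) / np.sum(k)
    return z
```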

If the spatial weights are all equal, it degenerates to the Yaroslavsky filter. On the other hand, it evolves into the NLM if w_{l,i} is defined by patch similarity:

w_{l,i} = \exp\Big(-\frac{\sum_{b \in \mathcal{B}} \|y_{l+b} - y_{i+b}\|_2^2}{2B\sigma_r^2}\Big), \quad B = |\mathcal{B}|, \quad (2)

where {y_{i+b} | b ∈ B} forms the patch at position i.

Filtering based on range-weighted similarity is effective but lacks a theoretical basis. To understand its mathematical properties, several previous works studied connections to other classical methods, such as mean shift [1], anisotropic diffusion [3], and the Bayesian approach [8, 16], and thus found improvements on the neighborhood filters. However, these connections were unable to provide further information to infer the range variance σ_r² directly from the observed data. In contrast, without reasoning about the properties, statistical techniques have been adopted to estimate the parameters indirectly using the basic observation model

y = z + n, (3)

where n is additive Gaussian noise. The χ² test was used to choose the parameters for the NLM filter in [11]. Stein's unbiased risk estimate (SURE) [22] can provide an unbiased estimate of the mean squared error (MSE) from the noisy and filtered images. Thus many parameter combinations can be tested, and the one giving the smallest estimated MSE can be selected. Though accurate, the complexity is quite high because each combination needs to filter the whole image separately.

Contribution. In this paper, we build a unified empirical Bayesian framework to both infer the neighborhood filters and estimate their range variance. We first introduce a novel neighborhood noise model. It reasons about the range-weighted formulation using maximum a posteriori (MAP) estimation via a soft-edge prior and infers the filters using maximum likelihood (ML) estimation on different likelihood


functions. We then present an EM+ algorithm to perform the model fitting for estimating the range variance σ_r² from noisy images. From the experimental results on color-image denoising, we show that the proposed model fits noisy images well and estimates σ_r² as accurately as SURE does. The advantage over SURE is also shown both computation-wise and quality-wise when considering a proposed recursive filtering scheme. We also believe that this framework can be extended to other range-weighted algorithms for deeper theoretical understanding and better parameter estimation.

2. Related Work

Joint filter/parameter inference. In [21], a wavelet-domain denoising algorithm was developed under the assumption that the latent image z follows Gaussian scale mixtures (GSM) [26]. The adaptive parameters can then be statistically inferred from the neighborhood. If detailed camera information is available, [13] can perform noise estimation and removal automatically. Not surprisingly, these filters differ from the range-weighted ones. The PLOW filter [4] is equivalent to the NLM filter plus a residual filter. Although the residual filter is based on covariance matrices inferred from geometric clusters, the range variance was still given in a heuristic way. In this paper, we target joint inference for neighborhood filters.

SURE-based parameter estimation. The SURE-based method gives state-of-the-art accuracy for parameter estimation, and it is basically applicable to any parameter. It needs to estimate the noise variance first, and one popular method is the median absolute deviation (MAD). It has been applied to the bilateral filter for grey [18] and multispectral images [19] and also to the NLM filter [24, 25, 5]. A fast implementation for the bilateral filter was also proposed in [12], but several passes of filtering were still required. For the SURE-based method to work well, two conditions should be met: the Gaussian noise assumption and a filter kernel that is differentiable on the noisy image y. In this paper, the model fitting itself is only able to estimate the range variance. However, we will show its applicability in two cases where the SURE-based method would fail. One is recursive filtering, for which the noise is no longer Gaussian. The other is NLM filters which use motion estimation to select candidate patches, so that the kernel is dynamic and thus not differentiable on y.

Image denoising. Several types of approaches have been studied, including local [28, 23], transform-based [21, 27], non-local [2, 4], and sparsity-based ones (e.g. BM3D [7], NLSM [14] and WNNM [9]). The last type recently showed superior image quality, based on grouping by patch similarity and optimization for sparse representation. However, the parameters are mostly given heuristically. Besides, these methods were often proposed and optimized for grey images, and additional modification is required to support color channels. In contrast, our method can work on multi-channel signals directly.

3. Noise Model and Bayesian Inference

We will first propose a novel noise model and infer the Yaroslavsky filter directly from it. By modifying the likelihood function to improve the robustness of estimation, we will then infer the bilateral filter and a modified NLM filter.

3.1. Neighborhood noise model (NNM)

For a latent k-channel signal z_l at position l, we formulate its neighbors y_{i∈Λ_l} by GSM to model localized intensity edges using soft decisions:

y_i = z_l + \frac{n_{l,i}}{\sqrt{w_{l,i}}}, \quad (4)

where the n_{l,i} are additive white Gaussian noises (AWGN) with covariance matrix σ²I_k, and w_{l,l} = 1 in the basic model. For the neighboring pixels, a smaller w_{l,i} means a more widely distributed y_i, so an edge between positions l and i is more likely. Conversely, smooth texture is inferred if w_{l,i} is close to one. To infer the Gaussian range-weighted kernel in (1), the w_{l,i≠l} are defined as white hidden variables with a prior governed by two parameters ε and α:

f_w(w; \varepsilon, \alpha) = \frac{1}{N(\varepsilon, \alpha)} \, w^{-k/2} w^{-\alpha w} e^{\alpha w}, \quad w \in [\varepsilon, 1], \quad (5)

where N(\varepsilon, \alpha) = \int_\varepsilon^1 w^{-k/2} w^{-\alpha w} e^{\alpha w} \, dw for normalization, α links σ² to σ_r², and a non-zero ε guarantees convergence of the integral for k ≥ 3.

3.2. Inference for Yaroslavsky filter

Given observed data y_i, we have the posterior Φ_l ∝ p(y_l; z_l) ∏_{i∈Λ_l, i≠l} p(y_i | w_{l,i}; z_l) f_w(w_{l,i}). Its energy function L_l can then be derived:

L_l = \frac{g_{l,l}^2}{2\sigma^2} + \sum_{i \neq l} \Big[ \frac{w_{l,i} g_{l,i}^2}{2\sigma^2} - \log w_{l,i}^{k/2} - \log f_w(w_{l,i}) \Big], \quad (6)

where g_{l,i}² ≜ ||z_l - y_i||²₂. The estimates of w_{l,i} and z_l can be derived by minimizing L_l:

\partial L_l / \partial w_{l,i} = 0 \;\Rightarrow\; w_{l,i} = e^{-g_{l,i}^2 / (2\sigma_r^2)}, \quad \sigma_r^2 = \alpha\sigma^2, \quad (i \neq l) \quad (7)

\partial L_l / \partial z_l = 0 \;\Rightarrow\; z_l = \frac{\sum_{i \in \Lambda_l} w_{l,i} \cdot y_i}{\sum_{i \in \Lambda_l} w_{l,i}}. \quad (8)

The Yaroslavsky filter is then equivalent to the first-iteration estimate when solving this fixed-point problem with the initial condition z_l^(0) = y_l.
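To make the fixed-point view concrete, here is a minimal sketch for one neighborhood (Python/NumPy; names are illustrative, not from the paper). With iters=1 it reproduces the Yaroslavsky output for that pixel:

```python
import numpy as np

def nnm_fixed_point(y_l, y_nbrs, sigma_r2, iters=1):
    """Alternate the MAP step (7) for w_{l,i} and the ML step (8) for z_l.

    y_l    : center pixel y_l, shape (k,)
    y_nbrs : neighbors {y_i, i in Lambda_l, i != l}, shape (n, k)
    """
    z = y_l.astype(np.float64)                 # initial condition z^(0) = y_l
    for _ in range(iters):
        g2 = np.sum((z - y_nbrs)**2, axis=-1)  # g^2_{l,i} = ||z_l - y_i||^2_2
        w = np.exp(-g2 / (2.0 * sigma_r2))     # MAP estimate (7), for i != l
        # ML estimate (8); the center keeps its fixed weight w_{l,l} = 1.
        z = (y_l + (w[:, None] * y_nbrs).sum(axis=0)) / (1.0 + w.sum())
    return z
```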

The NNM estimates (7) and (8) can be further interpreted as two sequential steps:

1. Independent MAP estimation of each w_{l,i} by maximizing p(y_i | w_{l,i}; z_l) f_w(w_{l,i}) given a fixed z_l;

2. ML estimation of z_l by optimizing the likelihood L_z(z_l) = \frac{g_{l,l}^2}{2\sigma^2} + \sum_{i \neq l} \frac{w_{l,i} g_{l,i}^2}{2\sigma^2} given fixed w_{l,i}.

By modifying the likelihood function in the second step to account for proximity and patch similarity, the bilateral and NLM filters can be derived respectively as follows.

3.3. Extension for bilateral filter

To consider proximity in a nonparametric way using the Gaussian weighting kernel d_{l,i}, locally weighted maximum likelihood (LWML) [15] can be applied to the likelihood L_z to obtain the pseudo-likelihood function

L_z(z_l) = d_{l,l} \cdot \frac{g_{l,l}^2}{2\sigma^2} + \sum_{i \neq l} d_{l,i} \cdot \frac{w_{l,i} g_{l,i}^2}{2\sigma^2}. \quad (9)

Then the LWML estimation by ∂L_z/∂z_l = 0 gives the same formulation as the bilateral filter.

3.4. Extension for NLM filter

The corresponding NNM for each latent pixel in a patch {z_{l+b}, b ∈ B} and its neighborhood Λ_{l+b} is as defined in (4). The first-iteration MAP estimate for each soft-edge variable then gives

w_{l+b,i+b} = e^{-\|y_{l+b} - y_{i+b}\|_2^2 / (2\sigma_r^2)}, \quad i \in \Lambda_l. \quad (10)

Let the column vectors of the latent patch and the observed patches be Z_l and Y_{i∈Λ}. They are formed by stacking z_{l+b} and y_{i+b} respectively in a predefined order Υ for b ∈ B. Then the partial likelihood from Y_i becomes

L(Z_l; Y_i) = G(Z_l; Y_i, \Sigma_{l,i}), \quad (11)

where G(·; μ, Σ) is a multivariate Gaussian function, and the diagonal entries of Σ_{l,i} are formed by stacking \mathrm{diag}(\frac{\sigma^2}{w_{l+b,i+b}} I_k) in the predefined order Υ. The entries outside the diagonal have no effect on the following derivation.

For NLM filtering, one observed patch Y_i has only one summation weight W_{l,i}. Thus we use a patch-based likelihood function to approximate (11), devised as

\tilde{L}(Z_l; Y_i) = G\Big(Z_l; Y_i, \frac{\sigma^2}{W_{l,i}} I_{kB}\Big). \quad (12)

Since these two likelihood functions both behave like probability density functions (pdf), we choose W_{l,i} by minimizing the Kullback-Leibler divergence (KLD) between them:

D_{\mathrm{NLM}} \triangleq D_{\mathrm{KL}}(L \,\|\, \tilde{L}) = -\int L \log \frac{\tilde{L}}{L} \, dZ_l, \quad (13)

\frac{\partial D_{\mathrm{NLM}}}{\partial W_{l,i}} = -\int L \Big( \frac{kB}{2} \frac{1}{W_{l,i}} - \frac{\|Z_l - Y_i\|_2^2}{2\sigma^2} \Big) dZ_l = -\Big( \frac{kB}{2} \frac{1}{W_{l,i}} - \frac{1}{2} \sum_{b \in \mathcal{B}} \frac{k}{w_{l+b,i+b}} \Big) = 0, \quad (14)

\Rightarrow\; W_{l,i} = \Big( \sum_{b \in \mathcal{B}} w_{l+b,i+b}^{-1} / B \Big)^{-1}. \quad (15)

Then the ML estimate for the combined patch-based likelihood \sum_{i \in \Lambda} \tilde{L}(Z_l; Y_i) becomes

Z_l = \frac{\sum_{i \in \Lambda_l} W_{l,i} \cdot Y_i}{\sum_{i \in \Lambda_l} W_{l,i}}, \quad (16)

which suggests a modified NLM (MNLM) filter with the coefficients W_{l,i} in (15). Note that W_{l,i} is the harmonic average of the w_{l+b,i+b}, while the conventional NLM in (2) uses the geometric average.
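The only difference between the MNLM weight (15) and the conventional NLM weight (2) is thus the choice of mean over the per-pixel soft-edge weights. A small sketch (illustrative names; patches are arrays of shape (B, k)):

```python
import numpy as np

def pixel_weights(p_l, p_i, sigma_r2):
    """Per-pixel soft-edge weights (10) between two patches of shape (B, k)."""
    return np.exp(-np.sum((p_l - p_i)**2, axis=-1) / (2.0 * sigma_r2))

def w_nlm(p_l, p_i, sigma_r2):
    """Conventional NLM (2): the geometric mean of the per-pixel weights."""
    return np.exp(np.mean(np.log(pixel_weights(p_l, p_i, sigma_r2))))

def w_mnlm(p_l, p_i, sigma_r2):
    """MNLM (15): the harmonic mean of the per-pixel weights."""
    return 1.0 / np.mean(1.0 / pixel_weights(p_l, p_i, sigma_r2))
```

Because the harmonic mean is dominated by its smallest terms, a few very dissimilar pixels suppress W_{l,i} more strongly than under the geometric average.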

4. Model Fitting and Parameter Estimation

In the following, we will first introduce how to build a robust observation model in chi scale mixtures and then how we fit the model using an expectation-maximization-plus (EM+) algorithm.

4.1. Observable chi scale mixtures (CSM)

Let s_{l,i} ≜ ||y_l - y_i||₂. It is independent of z_l and follows the CSM formulation:

s_{l,i} = \sigma \sqrt{\frac{w_{l,i} + 1}{w_{l,i}}} \, u_{l,i}, \quad u_{l,i} \sim \chi_k, \quad (17)

where u_{l,i} follows a chi distribution with k degrees of freedom. For simplicity, we write s = s_{l,i} and w = w_{l,i} in the following. The marginal pdf of s can be derived as

f_s(s; \sigma, \varepsilon, \alpha) = \int_\varepsilon^1 f_{s,w}(s, w; \sigma, \varepsilon, \alpha) \, dw, \quad (18)

f_{s,w} = \frac{1}{T(\varepsilon, \alpha)} \frac{s^{k-1} \sigma^{-k}}{(w+1)^{k/2}} \, e^{-\frac{w}{w+1} \frac{s^2}{2\sigma^2}} \, w^{-\alpha w} e^{\alpha w}, \quad (19)

where f_{s,w} is the joint pdf of s and w, and T(\varepsilon, \alpha) = 2^{k/2-1} \Gamma(k/2) N(\varepsilon, \alpha) for normalization.

Fig. 1 shows how the soft-edge prior f_w(w) and the CSM pdf f_s(s) behave for different α and ε. f_w(w) represents how likely intensity edges are to appear, i.e. more mass at small w means more edges. For smaller α or ε, f_w(w) tilts more to the left and f_s(s) has a longer tail, so more edges are expected. For larger α or ε, f_w(w) concentrates more on the right, and f_s(s) gets closer to a chi distribution. Thus we can model noisy images by varying α and ε to fit different image properties and varying σ for different noise intensities.
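Under the model above, f_s(s) in (18)-(19) can be evaluated by one-dimensional numerical integration over w; a sketch (Python/SciPy, our variable names) used only for illustration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def f_s(s, sigma, eps, alpha, k=3):
    """CSM marginal pdf (18), integrating the joint pdf (19) over w in [eps, 1]."""
    prior = lambda w: np.exp(alpha * w * (1.0 - np.log(w)))   # w^{-aw} e^{aw}
    def joint(w):   # unnormalized f_{s,w}; the prior's w^{-k/2} cancels in (19)
        return (s**(k - 1) * sigma**(-k) / (w + 1)**(k / 2.0)
                * np.exp(-w / (w + 1) * s**2 / (2.0 * sigma**2)) * prior(w))
    num, _ = quad(joint, eps, 1.0)
    # T(eps, alpha) = 2^{k/2-1} Gamma(k/2) N(eps, alpha), with N from (5).
    N, _ = quad(lambda w: w**(-k / 2.0) * prior(w), eps, 1.0)
    T = 2.0**(k / 2.0 - 1.0) * np.exp(gammaln(k / 2.0)) * N
    return num / T
```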

[Figure 1. Distributions of (a) the soft-edge prior f_w(w) and (b) the observable chi scale mixtures f_s(s) with σ = 1, for different values of α and ε (k = 3); the curves fix α = 2 while varying ε ∈ {0.001, 0.01, 0.1}, and fix ε = 0.01 while varying α ∈ {1, 2, 5}.]

Another advantage of defining s in the l2-norm is its robustness. When the components of y_i are not independent in the observed color space, e.g. RGB, we may apply an orthogonal transform to diagonalize their covariance matrix Σ_y for decorrelation. However, since the l2-norm is invariant to orthogonal transforms, we can calculate s (and w) in the observed color space without performing the decorrelation. Thus the model fitting (and the filtering) can be applied to the observed y_i directly without loss of optimality.
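The invariance follows directly from Q^T Q = I for any orthogonal decorrelating transform Q:

```latex
\| Q(y_l - y_i) \|_2^2
  = (y_l - y_i)^T Q^T Q \,(y_l - y_i)
  = \| y_l - y_i \|_2^2 .
```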

4.2. EM+ algorithm for CSM fitting

Given an observed data set S, its empirical distribution is defined by P(s ∈ S). We then estimate the corresponding pdf f_s(s) by iteratively updating the model parameters based on P(s). Assume the estimated parameters from the previous iteration are σ, α, and ε. Our EM+ algorithm updates them to σ̂, α̂, and ε̂ through the following three steps: an EM update, a KLD update, and a Quasi-Newton (QN) update.

4.2.1 EM update: (σ, α, ε) ⇒ (σ̃, α̃, ε)

For simplicity, we omit σ, α, and ε from f_{s,w} and f_s when the previous-iteration parameters are used. The expected value of the log-likelihood function can be derived as Q(σ̃, α̃ | σ, α) = \sum_j P(s_j) \, q(s_j, σ̃, α̃ | σ, α), where

q(s, \tilde{\sigma}, \tilde{\alpha} | \sigma, \alpha) = \int_\varepsilon^1 p(w|s) \, L_{EM}(\tilde{\sigma}, \tilde{\alpha}, \varepsilon; w, s) \, dw, \quad (20)

p(w|s) = f_{s,w} / f_s, \quad L_{EM} = \log f_{s,w}(s, w; \tilde{\sigma}, \varepsilon, \tilde{\alpha}). \quad (21)

Setting ∂Q/∂σ̃ = ∂Q/∂α̃ = 0 gives the EM updates for σ and α:

\tilde{\sigma}^2 = \frac{1}{k} \sum_j P(s_j) \cdot \frac{\int_\varepsilon^1 f_{s,w}(s_j, w) \, s_j^2 \frac{w}{w+1} \, dw}{f_s(s_j)}, \quad (22)

H(\tilde{\alpha}, \varepsilon) = \sum_j P(s_j) \cdot \frac{\int_\varepsilon^1 f_{s,w}(s_j, w) \, w(1 - \log w) \, dw}{f_s(s_j)}, \quad (23)

where H(α, ε) ≜ E_{w;α,ε}[w(1 - log w)] is an increasing function of α because ∂H/∂α = Var_{w;α,ε}[w(1 - log w)] ≥ 0. Thus the corresponding α̃ can be found quickly by a bisection search on H(α̃, ε).
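Since H is increasing in α, inverting (23) amounts to a root search. A minimal sketch of the bisection (Python/SciPy; the target h is the right-hand side of (23), and the bracket [lo, hi] is an assumption of ours):

```python
import numpy as np
from scipy.integrate import quad

def H(alpha, eps, k=3):
    """H(alpha, eps) = E_{w; alpha, eps}[w (1 - log w)] under the prior (5)."""
    dens = lambda w: w**(-k / 2.0) * np.exp(alpha * w * (1.0 - np.log(w)))
    N, _ = quad(dens, eps, 1.0)
    num, _ = quad(lambda w: dens(w) * w * (1.0 - np.log(w)), eps, 1.0)
    return num / N

def solve_alpha(h, eps, lo=0.1, hi=50.0, tol=1e-6):
    """Bisection for the alpha with H(alpha, eps) = h; H is increasing in alpha."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if H(mid, eps) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```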

4.2.2 KLD update: (σ̃, α̃, ε) ⇒ (σ̃, α̃, ε̃)

The EM algorithm is unable to update ε because the support of w for p(w|s) and L_{EM} in (20) must be the same. Instead, we update it by optimizing the approximate KLD D = \sum_j -P(s_j) \cdot \log \frac{f_s(s_j; \tilde{\sigma}, \varepsilon, \tilde{\alpha})}{P(s_j)}. With ∂D/∂ε = 0, a fixed-point representation of the optimal ε can be derived, and we use the first-iteration result with the initial condition ε^(0) = ε as our update formula:

\Big( \frac{\tilde{\varepsilon} + 1}{\tilde{\varepsilon}} \Big)^{k/2} = \sum_j P(s_j) \cdot \frac{s_j^{k-1} \tilde{\sigma}^{-k} \, e^{-\frac{\varepsilon}{\varepsilon+1} \frac{s_j^2}{2\tilde{\sigma}^2}}}{2^{k/2-1} \Gamma(k/2) \, f_s(s_j; \tilde{\sigma}, \varepsilon, \tilde{\alpha})}. \quad (24)

4.2.3 QN update: (σ̃, α̃, ε̃) ⇒ (σ̂, α̂, ε̂)

The update formulas (22), (23), and (24) require numerical evaluation of many integrals, which prolongs the execution time of one iteration. Moreover, we found that the update step sizes are usually small near the optimum, which increases the number of iterations. Thus the above updates take a long time to converge in many cases. We adopt the QN method, QN1 in [10], to accelerate the fitting process. Let θ = (σ, α, ε)^T, θ̃ = (σ̃, α̃, ε̃)^T, θ̂ = (σ̂, α̂, ε̂)^T, and F(θ) = θ̃(θ) - θ. The QN1 method solves F(θ) = 0 by maintaining a matrix A which approximates the inverse Jacobian J_F^{-1} and is updated by Broyden's method. The QN update is then θ̂ = θ + δ, where

\delta = -A \cdot F(\theta). \quad (25)

In practice, however, the QN update can be unstable when A has a high condition number. Some heuristic, e.g. reinitializing A, is required in this situation. We choose to confine θ̂ to a reasonably large range Ψ for quick convergence. When θ̂ falls outside Ψ, we directly set δ to F(θ), as in the original EM/KLD update, and may further scale down its value to bring θ̂ inside Ψ if necessary.
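Below is a sketch of one safeguarded QN1 step, with A updated by Broyden's method and the Ψ-projection of the text; em_kld_update and in_range are placeholders for the EM/KLD map and the range check, and the details are our reading of [10], not the paper's code:

```python
import numpy as np

def qn1_step(theta, A, em_kld_update, in_range):
    """One safeguarded QN step on F(theta) = em_kld_update(theta) - theta."""
    F = em_kld_update(theta) - theta
    delta = -A @ F                      # QN update (25)
    if not in_range(theta + delta):
        delta = F                       # fall back to the plain EM/KLD step
        for _ in range(20):             # scale down until theta stays inside Psi
            if in_range(theta + delta):
                break
            delta *= 0.5
    theta_new = theta + delta
    # Broyden update of the inverse-Jacobian approximation A.
    F_new = em_kld_update(theta_new) - theta_new
    y = F_new - F
    denom = delta @ (A @ y)
    if abs(denom) > 1e-12:              # skip the update when it is ill-conditioned
        A = A + np.outer(delta - A @ y, delta @ A) / denom
    return theta_new, A
```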

4.3. Discriminative capability of KLD

The fitting quality relies on the discriminative capability of the KLD between different parameters. Fig. 2 shows the KLD discrimination of f_s(s) for k = 3 when α has a small change Δα = 0.5 and when ε increases by Δε = 0.005. It is clear that the discriminative capability declines as α or ε increases. For sufficiently large α or ε, f_s(s) is very similar to a chi distribution and the KLD becomes insensitive to the parameters. Therefore, upper bounds for α and ε may be chosen without compromising the KLD.

5. Application in Image Denoising

For the inferred filters, each neighborhood can be viewed as a realization of the NNM. Thus the CSM model fitting can provide parameter estimation. In the following, we apply this approach to color-image denoising (k = 3).

5.1. Parameter estimation

Before model fitting, we need to derive the empirical distribution P(s). For the Yaroslavsky filter, we have a fixed neighborhood Λ_l for each pixel, and P(s) can simply be obtained by accumulating the histogram of all s_{l,i}. For the bilateral filter, we construct P(s) by accumulating the histogram with the spatial weight d_{l,i} to account for proximity; for example, the frequency of s = 2 is increased by d_{l,i} for an event s_{l,i} = 2. For the MNLM filter, we consider a dynamic neighborhood Λ_l which may depend on patch-similarity sorting or hard thresholding for better performance. In this case, we need to compute Λ_l first and then accumulate the histogram of the s_{l+b,i+b}.

Given P(s), we perform the CSM fitting using the EM+ algorithm and select the final result based on the KLD calculated between P(s) and the estimated distribution P(s; θ^(m)) in each iteration. Besides, we also devise an ε-bounded estimation to handle the KLD-insensitivity issue. It is activated when ε is close to a given upper bound ε_bd, i.e. |ε - ε_bd| < Δε. Since the sensitivity of the KLD becomes small when ε is near ε_bd, we can directly compare KLD values at ε = ε_bd using a bisection search on α. For each α, the σ-update (22), which converges very quickly, is used to find the best corresponding σ and KLD.

The algorithm parameters, fixed throughout this paper, are as follows: maximum fitting iteration M = 15; initial CSM parameter θ_ini = (s_md/2, 3, 10^-3), where s_md is the mode of P(s); ε-bound criterion Δε = 10^-3; and parameter range Ψ = {θ | σ ≥ 10^-5, α ∈ [3, 15], ε ∈ [10^-5, ε_bd]}. The convergence condition is that the KLD is smaller than 10^-5 or that σ_r² changes by less than 0.1%. Note that Ψ is used to avoid unstable QN updates and is defined to be sufficiently large that final results only seldom touch the boundaries. The only exception is ε_bd, which can be used to activate the ε-bounded estimation.
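As an illustration of the bilateral case, the spatially weighted histogram can be accumulated as follows (a sketch with wrap-around borders for brevity; the binning and all names are our choices):

```python
import numpy as np

def empirical_P_s(y, radius, sigma_d2, n_bins=512):
    """Weighted histogram of s_{l,i} = ||y_l - y_i||_2 over all neighborhoods."""
    H, W, k = y.shape
    hist = np.zeros(n_bins)
    s_max = np.sqrt(k) * (y.max() - y.min()) + 1e-9
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            if dr == 0 and dc == 0:
                continue
            d = np.exp(-(dr**2 + dc**2) / (2.0 * sigma_d2))   # spatial weight d_{l,i}
            shifted = np.roll(np.roll(y, dr, axis=0), dc, axis=1)
            s = np.linalg.norm(y - shifted, axis=2).ravel()
            idx = np.minimum((s / s_max * n_bins).astype(int), n_bins - 1)
            np.add.at(hist, idx, d)       # each event adds its weight d_{l,i}
    return hist / hist.sum()              # normalized empirical P(s)
```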

5.2. Recursive local filter

It was shown above that the conventional Yaroslavsky/bilateral filter is equivalent to the first-iteration MAP/ML estimation. Simply applying more iterations with the same σ_r² indeed reduces the energy function (6) but does not help to increase PSNR. Instead, we propose to apply the model fitting and filtering recursively. Each iteration estimates the NNM parameters for the current noisy image y^(m) and filters it with the estimated range variance σ_r^2(m). This process terminates when the filter iteration count exceeds M_flt or the estimated noise intensity σ^(m) falls below a threshold σ_cl which represents a clean image.
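A sketch of the loop, with fit_csm and filter_once standing in for the model fitting of Section 4 and any of the inferred filters (both are placeholders, not APIs from the paper):

```python
def recursive_filter(y, fit_csm, filter_once, M_flt=3, sigma2_cl=10.0):
    """Alternate CSM fitting and filtering until the image looks clean."""
    z = y
    for m in range(M_flt):
        sigma2, sigma_r2 = fit_csm(z)   # NNM parameters of the current image y^(m)
        if sigma2 < sigma2_cl:          # estimated noise below the clean threshold
            break
        z = filter_once(z, sigma_r2)    # filter with the estimated range variance
    return z
```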

5.3. Recursive MNLM filter

The proposed recursive MNLM filter has three differences from the recursive local one. First, the basic processing unit becomes a patch, instead of a pixel, and a flag b_ag decides how to aggregate these patches into one image. If b_ag = 1, each image pixel at position l is derived by averaging all its corresponding values in the neighboring estimated patches X_{l+b}, b ∈ B; otherwise, no aggregation is performed. Second, the neighborhood for each patch is constructed dynamically based on given constraints. Third, a DCT-Wiener filter is introduced to increase performance, activated by a flag b_dct.

The DCT-Wiener filter serves a similar role to the residual filter in PLOW. We use the DCT, denoted T(·), to approximate the decorrelation matrix and then apply element-wise Wiener filtering to update the patch X_l:

X_l \leftarrow T^{-1}(W_{wie} \circ T(X_l)), \quad (26)

where the elements of the shrinkage matrix W_wie are

(W_{wie})_{i'j'k'} = \frac{\sigma_{X,i'j'k'}^2}{\sigma_{X,i'j'k'}^2 + \sigma_l^2}, \quad (27)

the signal variance is estimated from neighbors by \sigma_{X,i'j'k'}^2 = E_{i \in \Lambda_l}[(T(Y_i^{(m)}))_{i'j'k'}^2] - \sigma^{2(m)}, and the noise variance is \sigma_l^2 = \sigma^{2(m)} / \sum_{i \in \Lambda_l} W_{l,i} due to the weighted averaging.
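A sketch of (26)-(27) on one k-channel patch, taking T to be an orthonormal multidimensional DCT (SciPy's dctn/idctn); the clipping of the signal variance at zero is our safeguard, not from the paper:

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_wiener(X, Y_nbrs, sigma2, W_sum):
    """DCT-Wiener shrinkage (26)-(27) on a patch X of shape (p, p, k).

    Y_nbrs : neighbor patches {Y_i, i in Lambda_l}, each of shape (p, p, k)
    sigma2 : current noise variance sigma^2(m)
    W_sum  : sum of the MNLM weights W_{l,i} over Lambda_l
    """
    T = lambda a: dctn(a, norm="ortho")
    # Per-coefficient signal variance: E_i[(T(Y_i))^2] - sigma^2, clipped at 0.
    sig2 = np.maximum(np.mean([T(Yi)**2 for Yi in Y_nbrs], axis=0) - sigma2, 0.0)
    noise2 = sigma2 / W_sum                    # residual noise after weighted averaging
    W_wie = sig2 / (sig2 + noise2)             # shrinkage matrix (27)
    return idctn(W_wie * T(X), norm="ortho")   # update (26)
```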

The recursive MNLM filter is summarized in Algorithm 1.

Algorithm 1 Recursive MNLM Filter
Input: Noisy image y; ε-bound ε_bd; DCT-Wiener flag b_dct; Aggregation flag b_ag
Output: Denoised image z̄
1: Initialize y^(1) = y, z̄ = y, m = 1
2: repeat
3:   Get the empirical P(s) of y^(m) by constructing each Λ_l
4:   Get σ_r^(m), σ^(m) using parameter estimation
5:   if (σ^(m) ≥ σ_cl) then
6:     for each patch Y_l in y^(m) do
7:       Get the estimate X_l = Σ_{i∈Λ_l} W_{l,i}·Y_i^(m) / Σ_{i∈Λ_l} W_{l,i}
8:       Perform DCT-Wiener filtering with b_dct
9:     end for
10:    Aggregate the X_l to form z^(m) with b_ag
11:    y^(m+1) ← z^(m), z̄ ← z^(m)
12:  end if
13:  m ← m + 1
14: until (m > M_flt) or (σ^(m) < σ_cl)

[Figure 2. KLD discrimination (k = 3) on (a) α and (b) ε, plotted as log₁₀ D_Δα and log₁₀ D_Δε, where D_Δα(α, ε) ≜ D_KL(f_s(s; σ, ε, α) ‖ f_s(s; σ, ε, α + Δα)) and D_Δε(α, ε) ≜ D_KL(f_s(s; σ, ε, α) ‖ f_s(s; σ, ε + Δε, α)). Note that σ affects only the scaling and has no effect on D_Δα and D_Δε.]

6. Experiments on Model Fitting and Filtering

6.1. Parameter setting

Six filter configurations are tested:

1. YF-5×5: Yaroslavsky filter, Λ: 5×5, ε_bd = 0.1/1.0;
2. YF-9×9: Yaroslavsky filter, Λ: 9×9, ε_bd = 0.1/1.0;
3. BF-9×9: Bilateral filter, Λ: 9×9, ε_bd = 0.1/1.0;
4. BF-13×13: Bilateral filter, Λ: 13×13, ε_bd = 0.1/1.0;
5. MNLM-Simple: MNLM filter, ε_bd = 0.1, b_dct = 0 (DCT-Wiener off), b_ag = 0 (no aggregation);
6. MNLM-DCT: MNLM filter, ε_bd = 1.0 (ε-bound off), b_dct = 1 (DCT-Wiener on), b_ag = 1 (one-pixel grid).

Two settings of ε_bd are tested for the local filters. The basic ε_bd = 1.0 simply follows the definition in (5). In contrast, the empirical ε_bd = 0.1 uses the ε-bounded estimation for large ε. The spatial width σ_d of the bilateral filters is set to the radius of Λ, e.g. σ_d = 4 for Λ = 9×9. The patch size for the MNLM filter is 9×9. To construct the MNLM neighborhood, we apply motion estimation in a 31×31 search window around position l and choose the best ten candidates as Λ_l. For filter termination, the maximum iteration number M_flt is set to 3 and σ_cl² to 10.

To obtain the best σ_r² and the SURE estimate for comparison, we perform a σ_r² scan (30 values) from α = 0.5 to 25.0 for each test condition. The SURE-based method uses the noise variance σ_MAD² estimated by MAD, as done in [4], and is denoted MAD+SURE. Twelve standard color images of different properties are used in our experiments, with AWGN of five different intensity values σ_n.

6.2. Yaroslavsky/Bilateral filter

Fig. 3 shows one typical example of the test results; the results for the other test filters and test images can be browsed online¹. The CSM model fits the long-tailed empirical distributions well whether the noise is small or large. It also successfully predicts that α should become larger as the noise intensity σ_n increases, while conventional heuristics usually apply a fixed value.

Table 1 lists the results of BF-9×9 (ε_bd = 0.1) in detail. MAD+SURE fails in two cases. One is small σ_n (e.g. σ_n = 5, 1st iteration), due to the inaccuracy of σ_MAD². The other is the iterations after the first (e.g. σ_n = 50, 2nd iteration), because the noise is no longer Gaussian. In contrast, the CSM estimation performs well in terms of both PSNR and σ_r² accuracy in these two cases. This means that edge information can be well captured by the CSM model even when the noise is small or becomes non-Gaussian; this useful property also enables the proposed recursive scheme. In the other cases, the CSM estimation shows performance similar to MAD+SURE. An interesting property can also be observed: the CSM estimation usually underestimates the noise variance σ², which means it may mistake noise for edges; MAD+SURE, in contrast, tends to overestimate because the MAD may mistake edges for noise.

The execution time is also shown in Table 1. The proposed method runs much faster than MAD+SURE since it does not require a σ_r² scan. Each of its iterations consists of three steps: getting P(s), CSM fitting, and filtering. The first and third steps take time proportional to the image resolution and filter support size (7.9 s and 6.5 s on average, respectively). The fitting time, in contrast, depends on the number of EM+ iterations and on whether the ε-bounded estimation is turned on. For larger noise, the fitting tends to be insensitive to the KLD, and thus more time is required, e.g. 9.6 s on average for σ_n = 50 (1st iteration) versus only 3.5 s for σ_n = 5. The percentage of fitting time will be smaller for larger images. Note that MAD+SURE not only fails to predict in the recursive iterations but also requires significantly higher computational complexity due to the σ_r² scan.

The test results of the Yaroslavsky and bilateral filters are summarized in Table 2. For ε_bd = 0.1, the CSM estimation performs comparably to SURE in the first iteration for large σ_n and outperforms it for small σ_n. Moreover, the proposed recursive fitting and filtering can increase PSNR by up to 1.2 dB. As for ε_bd = 1.0, its first-iteration results are not as good as those of ε_bd = 0.1 because of its less accurate σ_r² estimation and higher KLD (due to KLD insensitivity). However, recursive filtering can recover the quality drop and even make it slightly better than ε_bd = 0.1 with a few more iterations. Therefore, the basic ε_bd = 1.0 gives slightly better quality when using recursive filtering, but the empirical ε_bd = 0.1 may be preferred when only one or a few iterations are allowed. The choice of σ_r² affects the performance much more sensitively than the choices of Λ and d_{l,i}, which also justifies the need for a good σ_r² estimator.

¹ http://www.ee.nthu.edu.tw/chaotsung/nnm

[Figure 3. BF-9×9 (ε_bd = 0.1) for Lena with σ_n = 5, 20, and 50. The top row shows the model fitting of the empirical distributions P(s); the CSM fits are (α, ε, σ, KLD) = (5.28, 0.0061, 5.68, 0.0034), (7.37, 0.0281, 18.47, 0.0011), and (14.33, 0.1000, 47.08, 0.0018), respectively. The bottom row presents the filtered results over a σ_r² scan, where the best σ_r² is marked by a circle, the CSM estimation by a cross, and MAD+SURE by a triangle: for σ_n = 5, best (σ_r² = 117.06, PSNR = 37.44 dB), CSM (170.23, 37.27 dB), MAD+SURE (471.64, 35.87 dB); for σ_n = 20, best (3811.97, 30.85 dB), CSM (2512.61, 30.66 dB), MAD+SURE (5580.31, 30.70 dB); for σ_n = 50, best (55408.73, 26.63 dB), CSM (31763.08, 26.40 dB), MAD+SURE (55408.73, 26.63 dB).]

Table 1. Detailed results for BF-9×9 (ε_bd = 0.1). ΔPSNR is calculated by comparing the PSNR to the best result derived by the σ_r² scan, and likewise Δσ_r², which is presented as a relative percentage, i.e. (σ_r² - σ_{r,best}²) / σ_{r,best}². The execution time is evaluated by running MATLAB on a 3.4 GHz CPU (single-thread).

Table 2. Summary for the Yaroslavsky/bilateral filters. The numbers represent the corresponding averages over the twelve test images. ΔPSNR here is calculated by comparing the PSNR to the best first-iteration result. |Δσ_r²| presents the relative percentage of absolute difference. m̄ stands for the average number of iterations. The results of ε_bd = 0.1/1.0 differ only for σ_n = 40/50.

6.3. MNLM filter

The test results are summarized in Table 3. The MNLM-Simple filter is directly inferred from the patch-based NNM variation, and the CSM estimation tracks σ_r² well for different σ_n and across iterations. The recursive MNLM-Simple filter can increase PSNR by up to 3.9 dB. The MNLM-DCT filter, on the other hand, provides better PSNR thanks to the DCT-Wiener filtering, and it also terminates earlier. Though less accurate, the CSM fitting still provides a good estimate of σ_r² for it. Note that MAD+SURE cannot be applied here because the neighborhood is dynamically derived. To compare the conventional NLM and the MNLM, we also tested the NLM in the form of (2) using a σ_r² scan. Its best PSNR differs by less than 1% from that of MNLM-Simple, which indicates that the NLM and MNLM have similar performance.

Two state-of-the-art denoising algorithms, CPLOW [4] and CBM3D [7] ("C" stands for the color-image versions), are also tested for comparison. The MAD-derived noise variance σ_MAD² is used for them, and all other parameters are set to the defaults in the software provided by their authors. CBM3D performs best, due to its processing in a sparse representation and its good heuristic parameters. Our MNLM-DCT outperforms CPLOW despite sharing a similar flow. Besides the CSM parameter optimization, one other reason is that CPLOW processes color channels separately while we consider all channels jointly.

7. Conclusion

In this paper, we propose and study a unified framework which can both infer the neighborhood filters and estimate the range variance. With the neighborhood noise model, we show that the Yaroslavsky, bilateral, and MNLM filters can be derived by joint MAP and ML optimization. We also present an EM+ algorithm for parameter estimation via fitting the observable CSM. The experimental results on color-image denoising show the framework's effectiveness: noisy images can be fit well, and the range variance can be tracked as accurately as with the multi-pass SURE-based method. Moreover, the CSM fitting also works for filtered images, which enables a recursive filtering scheme and improves PSNR.

The proposed framework can be used to build efficient filters for different constraints automatically, instead of by heuristic tuning. It can also be expected to apply to other range-weighted algorithms by formulating the corresponding likelihood functions. We therefore believe it will help many computer vision problems be solved in an empirical Bayesian way, instead of an intuitive one.

Acknowledgment

This work was supported by the Ministry of Science and Technology, Taiwan, R.O.C. under Grant No. MOST 103-2218-E-007-008-MY3.

Table 3. Summary of test results for the MNLM filters. ΔPSNR₁ and ΔPSNR₂ are relative to the first-iteration and second-iteration best PSNR respectively; the case of |Δσ_r²| is similar. The results of CPLOW and CBM3D are given for reference.

References

[1] D. Barash and D. Comaniciu. A common framework for nonlinear diffusion, adaptive smoothing, bilateral filtering and mean shift. Image and Vision Computing, 22(1):73-81, Jan 2004.
[2] A. Buades, B. Coll, and J.-M. Morel. A review of image denoising algorithms, with a new one. SIAM Journal on Multiscale Modeling and Simulation, 4(2):490-530, 2005.
[3] A. Buades, B. Coll, and J.-M. Morel. The staircasing effect in neighborhood filters and its solution. IEEE Transactions on Image Processing, 15(6):1499-1505, Jun 2006.
[4] P. Chatterjee and P. Milanfar. Patch-based near-optimal image denoising. IEEE Transactions on Image Processing, 21(4):1635-1649, Apr 2012.
[5] Y. Chen and K. J. R. Liu. Image denoising games. IEEE Transactions on Circuits and Systems for Video Technology, 23(10):1704-1716, Oct 2013.
[6] P. Choudhury and J. Tumblin. The trilateral filter for high contrast images and meshes. In ACM SIGGRAPH 2005 Courses, SIGGRAPH '05, New York, NY, USA, 2005. ACM.
[7] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080-2095, Aug 2007.
[8] M. Elad. On the origin of the bilateral filter and ways to improve it. IEEE Transactions on Image Processing, 11(10):1141-1151, Oct 2002.
[9] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 2862-2869, 2014.
[10] M. Jamshidian and R. I. Jennrich. Acceleration of the EM algorithm by using quasi-Newton methods. Journal of the Royal Statistical Society, Series B (Methodological), 59(3):569-587, 1997.
[11] C. Kervrann and J. Boulanger. Optimal spatial adaptation for patch-based image denoising. IEEE Transactions on Image Processing, 15(10):2866-2878, Oct 2006.
[12] H. Kishan and C. S. Seelamantula. SURE-fast bilateral filters. In Proceedings of International Conference on Acoustics, Speech, and Signal Processing, pages 1129-1132, 2012.
[13] C. Liu, R. Szeliski, S. B. Kang, C. L. Zitnick, and W. T. Freeman. Automatic estimation and removal of noise from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):299-314, Feb 2008.
[14] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In IEEE International Conference on Computer Vision, pages 2272-2279, 2009.
[15] D. P. McMillen and J. F. McDonald. Locally weighted maximum likelihood estimation: Monte Carlo evidence and an application. In Advances in Spatial Econometrics, pages 225-239. Springer Berlin Heidelberg, 2004.
[16] P. Milanfar. A tour of modern image filtering: New insights and methods, both practical and theoretical. IEEE Signal Processing Magazine, 30(1):106-128, Jan 2013.
[17] S. Paris, P. Kornprobst, J. Tumblin, and F. Durand. Bilateral filtering: Theory and applications. In Foundations and Trends in Computer Graphics and Vision, volume 4, pages 1-73, 2008.
[18] H. Peng and R. Rao. Bilateral kernel parameter optimization by risk minimization. In Proceedings of International Conference on Image Processing, pages 3293-3296, 2010.
[19] H. Peng, R. Rao, and S. A. Dianat. Multispectral image denoising with optimized vector bilateral filter. IEEE Transactions on Image Processing, 23(1):264-273, Jan 2014.
[20] G. Petschnigg, R. Szeliski, M. Agrawala, M. Cohen, H. Hoppe, and K. Toyama. Digital photography with flash and no-flash image pairs. In ACM SIGGRAPH 2004 Papers, SIGGRAPH '04, pages 664-672, New York, NY, USA, 2004. ACM.
[21] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338-1351, Nov 2003.
[22] C. M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135-1151, 1981.
[23] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proceedings of International Conference on Computer Vision, pages 839-846, 1998.
[24] D. V. D. Ville and M. Kocher. SURE-based non-local means. IEEE Signal Processing Letters, 16(11):973-976, Nov 2009.
[25] D. V. D. Ville and M. Kocher. Nonlocal means with dimensionality reduction and SURE-based parameter selection. IEEE Transactions on Image Processing, 20(9):2683-2690, Sep 2011.
[26] M. J. Wainwright and E. P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. Advances in Neural Information Processing Systems, 12:855-861, May 2000.
[27] L. Yaroslavsky, K. Egiazarian, and J. Astola. Transform domain image restoration methods: review, comparison, and interpretation. In Nonlinear Image Processing and Pattern Analysis XII, volume 4304 of Proc. SPIE, pages 155-169, 2001.
[28] L. P. Yaroslavsky. Digital Picture Processing - An Introduction. Springer Verlag, 1985.