
Soft Comput (2018) 22:3287–3299 · https://doi.org/10.1007/s00500-017-2565-z

METHODOLOGIES AND APPLICATION

MAP approximation to the variational Bayes Gaussian mixture model and application

Kart-Leong Lim1 · Han Wang1

Published online: 4 April 2017 · © Springer-Verlag Berlin Heidelberg 2017

Abstract The learning of variational inference can be widely seen as first estimating the class assignment variable and then using it to estimate the parameters of the mixture model. The estimate is mainly performed by computing the expectations of the prior models. However, learning is not exclusive to expectation. Several authors report other possible configurations that use different combinations of maximization or expectation for the estimation. For instance, variational inference is generalized under the expectation–expectation (EE) algorithm. Inspired by this, another variant known as the maximization–maximization (MM) algorithm has recently been exploited on various models such as the Gaussian mixture, the Field-of-Gaussians mixture, and the sparse-coding-based Fisher vector. Despite this recent success, MM is not without issues. Firstly, it is very rare to find any theoretical study comparing MM to EE. Secondly, the computational efficiency and accuracy of MM are seldom compared to EE. Hence, it is difficult to justify the use of MM over a mainstream learner such as EE or even Gibbs sampling. In this work, we revisit the learning of EE and MM on a simple Bayesian GMM case. We also make a theoretical comparison of MM with EE and find that they in fact obtain near identical solutions. In the experiments, we performed unsupervised classification, comparing the computational efficiency and accuracy of MM and EE on two datasets. We also performed

Communicated by V. Loia.

B Kart-Leong Lim [email protected]

Han Wang [email protected]

1 School of Electrical and Electronics Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore

unsupervised feature learning, comparing a Bayesian approach such as MM with other maximum likelihood approaches on two datasets.

Keywords Variational Bayes · Gaussian mixture model · Expectation maximization · Image classification

1 Introduction

Approximating algorithms such as variational inference (Bishop 2006) or Gibbs sampling (Neal 2000) are usually employed when learning a Bayesian mixture model due to the analytically intractable integrals arising from expectation functions. Variational inference is a modern alternative to Gibbs sampling due to its faster convergence and comparable accuracy. Today, variational inference is a mainstream learner for numerous Bayesian mixture models, ranging from non-Gaussian mixtures (Fan and Bouguila 2013; Ma and Leijon 2011) to nonparametric mixtures (Blei et al. 2006; Fan and Bouguila 2013) to hierarchical mixtures (Teh et al. 2004; Paisley et al. 2015). The learning algorithm of variational inference (Bishop 2006) can be widely seen as first estimating the class assignment variable and then using it to estimate the parameters of the mixture model at each iteration until convergence. The estimate is performed by computing the expectations of the prior models. However, the learning algorithm is not necessarily exclusive to computing expectations. Welling and Kurihara saw other possible configurations that use different combinations of maximization and expectation for estimating the class assignment variable and the mixture model parameters. They call it a family of alternating learners, and variational inference is generalized as the expectation–expectation (EE) algorithm.

Inspired by Welling and Kurihara (2006) and Kurihara and Welling (2009), a variant known as the maximization–maximization (MM) algorithm has very recently been applied to various Bayesian mixture models such as the Gaussian mixture, the Field-of-Gaussian-based Gaussian mixture, and the sparse-coding-based Gaussian mixture.

In Welling and Kurihara (2006) and Kurihara and Welling (2009), the authors suggest that there is a trade-off between computational efficiency and accuracy within the alternating-learner family: MM is the most efficient to compute, while EE is the most accurate. Despite the recent success of MM, it suffers from several problems: (1) MM is not as widely known as EE. (2) It is very rare to find any theoretical study comparing MM to EE. (3) The computational efficiency and accuracy of MM are seldom compared to EE. Hence, it is difficult to justify the use of MM over a mainstream learner such as EE or even Gibbs sampling.

Our contributions are as follows: (1) We explain in detail the algorithms for computing EE and MM on a simple Bayesian GMM case. (2) We give a theoretical comparison of MM with EE. We show that the analytical solution of MM is in fact identical to that of EE for the mixture model parameters, despite the two solutions being attained using different approaches. Thus, in theory, MM should not be inferior to EE in terms of accuracy. (3) We run two sets of experiments comparing the efficiency and accuracy of MM and EE. (4) We discuss how a Bayesian approach using MM compares to maximum likelihood approaches such as K-means and EM. This is also shown on two sets of experiments. (5) We also explain how we can relate a Bayesian approach to a maximum likelihood approach using MM. (6) We also discuss how to check the convergence of MM, guaranteed via the lower bound, in the "Appendix."

In our first experiment, we use MM to perform unsupervised classification on the Caltech (Fei-Fei et al. 2007) and Scene Categories (Lazebnik et al. 2006) datasets. We mainly compare the efficiency and accuracy of MM with EE. We observe that MM is typically faster to compute but has slightly poorer accuracy than EE. Both methods are also comparable to state-of-the-art results on these datasets. In our second experiment, we use MM to perform unsupervised feature learning on two object classification datasets, Multi-view Car (Ozuysal et al. 2009) and Caltech (Fei-Fei et al. 2007). Our experiments show that a Bayesian approach (such as MM) consistently obtains a better mean average precision score than maximum likelihood approaches (such as K-means and EM) at unsupervised feature learning for all dictionary sizes.

The paper is organized as follows. We discuss related works in Sect. 2, followed by the basic terminology of the Bayesian GMM and variational inference in Sect. 3. In Sect. 4, we explain the algorithms of EE and MM in depth, and we give a theoretical comparison of EE and MM in Sect. 5.

The experiments are detailed in Sect. 6, and the paper is concluded in Sect. 7. In the appendix, we discuss the convergence of MM via computing the lower bound.

2 Related work

In Lim et al. (2016), the authors used MM to learn a Bayesian GMM for computing BoW and VLAD. It is compared with a non-Bayesian approach such as K-means. Their finding on two image classification datasets shows that, in general, the MM learner outperforms K-means at unsupervised feature learning for all dictionary sizes. We have also repeated this experiment using the same setup. The main difference is that our Bayesian GMM is not restricted to a shared precision. Hence, our best result in this work of 0.74 outperforms the 0.68 reported in Lim et al. (2016).

MM can also be applied to a more complex model such as the Field-of-Gaussian mixture in Lim and Wang (2016). In this case, instead of learning random image patches from an object category dataset, the authors use random local neighborhoods of patches. Each neighborhood contains several adjacently connected image patches. The approach effectively takes a product of individual likelihoods for MM learning. The additional complexity due to the product of (Gaussian) likelihoods can also be learnt straightforwardly using MM. The Field-of-Gaussian was also found to outperform the standard GMM.

Another recent work using MM is the sparse-coding-based GMM in Lim and Wang (2017). In this work, the standard GMM is extended to include an additional sparse coefficient variable. Instead of using an off-the-shelf learner such as the feature-sign algorithm, MM is used to perform learning for the hidden variables of the sparse-coding-based GMM. The learning performance of MM is reported to be at least on par with or better than the feature-sign algorithm on the sparse-coding-based Fisher vector (Liu et al. 2014) representation, as reported in their experiments.

3 Background

3.1 Gaussian mixture model

GMM models a set of $N$ observed variables denoted $x = \{x_n\}_{n=1}^{N} \in \mathbb{R}^D$ with a set of hidden variables $\theta = \{\mu, \tau, z, \pi\}$. The dimension of each observed instance is denoted $D$. The cluster assignment is denoted $z = \{z_n\}_{n=1}^{N}$, where $z_n$ is a 1-of-$K$ binary vector subject to $\sum_{k=1}^{K} z_{nk} = 1$ and $z_{nk} \in \{0, 1\}$. The mixture weights are denoted $\pi = \{\pi_k\}_{k=1}^{K}$. We assume a diagonal covariance $\Sigma = \sigma^2 I$ and denote the precision as $\tau = 1/\sigma^2$, where $\sigma = \{\sigma_k\}_{k=1}^{K} \in \mathbb{R}^D$, and the mean is denoted $\mu = \{\mu_k\}_{k=1}^{K} \in \mathbb{R}^D$.

Fig. 1 Graph model of Bayesian GMM. The unshaded circles in boxes are hidden variables, while the variables outside the boxes are hyperparameters. The shaded circle refers to the observed variable. The arrows show the relationships between variables and hyperparameters

3.2 Bayesian approach

In a Bayesian approach to GMM (Fig. 1), each hidden variable is now modeled by a prior distribution as follows (MacKay 2003; Bishop 2006)

$x \mid z, \mu, \tau \sim \mathcal{N}(\mu, \tau^{-1})^{z}, \quad \mu \mid \tau \sim \mathcal{N}(m_0, (\lambda_0\tau)^{-1}), \quad \tau \sim \mathrm{Gam}(a_0, b_0), \quad z \mid \pi \sim \mathrm{Mult}(\pi), \quad \pi \sim \mathrm{Dir}(\alpha_0)$  (1)

and their joint distribution can be factorized as

$p(x, z, \pi, \mu, \tau) = p(x \mid z, \mu, \tau)\, p(z \mid \pi)\, p(\pi)\, p(\mu \mid \tau)\, p(\tau)$  (2)

The hyperparameters $\alpha_0, a_0, b_0, m_0, \lambda_0$ are usually empirically determined or assumed noninformative.
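As an aside, the generative model in Eq. (1) can be simulated directly. The following is a minimal one-dimensional sketch in NumPy (not part of the original paper); the hyperparameter values are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 500
alpha0, a0, b0, m0, lam0 = 1.0, 2.0, 2.0, 0.0, 1.0       # illustrative hyperparameters

pi = rng.dirichlet(alpha0 * np.ones(K))                  # pi ~ Dir(alpha0)
tau = rng.gamma(shape=a0, scale=1.0 / b0, size=K)        # tau_k ~ Gam(a0, b0) with rate b0
mu = rng.normal(m0, 1.0 / np.sqrt(lam0 * tau))           # mu_k | tau_k ~ N(m0, (lam0 tau_k)^-1)
z = rng.choice(K, size=N, p=pi)                          # z_n | pi ~ Mult(pi)
x = rng.normal(mu[z], 1.0 / np.sqrt(tau[z]))             # x_n | z_n ~ N(mu_{z_n}, tau_{z_n}^-1)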

3.3 Variational inference

Variational inference (Bishop 2006; Corduneanu and Bishop 2001) approximates the intractable integral of the marginal distribution

$p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$  (3)

This approximation can be decomposed as a sum

$\ln p(x) = L + \mathrm{KL}$  (4)

where $L$ is a lower bound on the joint distribution between the observed and hidden variables. A tractable distribution $q(\theta)$ is used to compute $L$ and the Kullback–Leibler divergence $\mathrm{KL}$

$L = \int q(\theta) \ln\!\left(\dfrac{p(x, \theta)}{q(\theta)}\right) d\theta$  (5)

$\mathrm{KL} = -\int q(\theta) \ln\!\left(\dfrac{p(\theta \mid x)}{q(\theta)}\right) d\theta$  (6)

When $q(\theta) = p(\theta \mid x)$, the Kullback–Leibler divergence vanishes and $L = \ln p(x)$. The tractable distribution $q(\theta)$ is also called the variational posterior distribution and assumes the following factorization

$q(\theta) = \prod_i q(\theta_i)$  (7)

The variational log posterior distribution is expressed as

$\ln q(\theta_j) = E_{i \neq j}[\ln p(x, \theta_i)] + \text{const.}$  (8)

4 Strategies of Bayesian inference

Welling and Kurihara (2006) and Kurihara and Welling (2009) categorized a family of alternating learners for Bayesian posteriors that are analytically intractable to learn. They can be generalized as variants of variational inference and classified into expectation–expectation (EE), expectation–maximization (EM), maximization–expectation (ME), and maximization–maximization (MM). They trade off computational speed against accuracy.

In standard variational inference, also categorized as the EE algorithm, learning is performed by updating the expectation of each variational posterior at every iteration until convergence. Usually, conjugate priors are assumed so that the moments of well-defined prior distributions can be exploited conveniently (i.e., there is no need to compute integrals). Before computing these moments, the hyperparameters of the conjugate priors must be updated with the sample estimates of $E[z]$ at each iteration. The objective function of EE is

$E[z] \propto \exp\left(E_\Theta[\ln p(x, \Theta, z)]\right) \;\leftrightarrow\; E[\Theta] = \int \Theta\, p(\Theta; z)\, d\Theta$  (9)

where $z$ on the right-hand side refers to the hyperparameters computed using the local statistics of $E[z]$.

Our main interest is the MM algorithm due to its simplicity. In MM, the goal is to alternately compute the MAP estimates of $\ln q(\Theta)$ and $\ln q(z)$. The objective function is

$E[z] \approx \arg\max_z E_\Theta[\ln p(x, \Theta, z)] \;\leftrightarrow\; E[\Theta] \approx \arg\max_\Theta E_z[\ln p(x, \Theta, z)]$  (10)

EE is the most accurate, while MM is the fastest to compute.

4.1 EE algorithm for Bayesian GMM

The main objective of variational inference is to train a model by computing the expectation of each variational posterior distribution at each iteration until convergence. Defining $\Theta = \{\mu, \tau, \pi\} \subset \theta$, the expectation of $q(z)$ is

$E[z_{nk}] \propto \exp\{E_\Theta[\ln p(x, \Theta, z)]\}$
$= \frac{1}{Z}\exp\left\{E_{\mu_k, \tau_k, \pi_k}[\ln p(x_n \mid \mu_k, \tau_k, z_{nk}) + \ln p(z_{nk} \mid \pi_k)]\right\}$
$= \frac{1}{Z}\exp\left\{\tfrac{1}{2}E[\ln\tau_k] - \tfrac{1}{2}E[\tau_k](x_n - E[\mu_k])^2 + E[\ln\pi_k]\right\}$  (11)

(where $Z$ is a normalization term). Each iteration of $E[z_{nk}]$ requires the expectations of the other variational distributions from the previous iteration. Like most EE algorithms, the prior distributions are restricted to the exponential family. Doing so, we can avoid computing the expectations directly and conveniently use the statistical moments of well-defined prior distributions instead. In the case of the Gamma, Gaussian, and Dirichlet distributions, their moments are given as

$E[\ln\tau_k] = \psi(a_k) - \ln(b_k)$  (12)
$E[\tau_k] = a_k / b_k$  (13)
$E[\mu_k] = m_k$  (14)
$E[\ln\pi_k] = \psi(\alpha_k) - \psi\!\left(\sum_{j=1}^{K}\alpha_j\right)$  (15)

(where $\psi$ refers to the digamma function). Before computing these moments, we first need to update their hyperparameters using the previous iteration's value of $E[z_{nk}]$ as follows (Bishop 2006; Cinbis et al. 2016)

$a_k = a_0 + \frac{1}{2}\sum_{n=1}^{N} E[z_{nk}]$  (16)
$b_k = b_0 + \frac{1}{2}\left(\sum_{n=1}^{N}(x_n - m_k)^2 E[z_{nk}] + \lambda_0(m_k - m_0)^2\right)$  (17)
$\lambda_k = \lambda_0 + \sum_{n=1}^{N} E[z_{nk}], \qquad m_k = \frac{1}{\lambda_k}\left(\lambda_0 m_0 + \sum_{n=1}^{N} E[z_{nk}]\, x_n\right)$  (18)
$\alpha_k = \alpha_0 + \sum_{n=1}^{N} E[z_{nk}]$  (19)

(We note that the definition of $b_k$ in Cinbis et al. 2016 is different from ours.) The EE algorithm for learning a Bayesian GMM is summarized in Algorithm 1.

Algorithm 1 Learning EE algorithm for Bayesian GMM
Input: $x$
Output: $E[\ln\pi_k]$, $E[\mu_k]$, $E[\ln\tau_k]$, $E[\tau_k]$, $E[z_{nk}]$
Initialization:
1. Randomly initialize the expectations $E[z_{nk}]$
2. Initialize the hyperparameters $m_0, \lambda_0, a_0, b_0, \alpha_0$
Compute until convergence:
1. Update the hyperparameters:
   $a_k = a_0 + \frac{1}{2}\sum_{n=1}^{N} E[z_{nk}]$
   $b_k = b_0 + \frac{1}{2}\left(\sum_{n=1}^{N}(x_n - m_k)^2 E[z_{nk}] + \lambda_0(m_k - m_0)^2\right)$
   $\lambda_k = \lambda_0 + \sum_{n=1}^{N} E[z_{nk}]$
   $m_k = \frac{1}{\lambda_k}\left(\lambda_0 m_0 + \sum_{n=1}^{N} E[z_{nk}]\, x_n\right)$
   $\alpha_k = \alpha_0 + \sum_{n=1}^{N} E[z_{nk}]$
2. Update the expectations:
   $E[\ln\tau_k] = \psi(a_k) - \ln(b_k)$
   $E[\tau_k] = a_k / b_k$
   $E[\mu_k] = m_k$
   $E[\ln\pi_k] = \psi(\alpha_k) - \psi(\sum_{j=1}^{K}\alpha_j)$
   $E[z_{nk}] = \frac{1}{Z}\exp\left\{\frac{1}{2}E[\ln\tau_k] - \frac{1}{2}E[\tau_k](x_n - E[\mu_k])^2 + E[\ln\pi_k]\right\}$
End
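To make the update order concrete, the following is a minimal NumPy sketch of one EE iteration of Algorithm 1 for one-dimensional data. It is not the authors' code; the function and variable names are our own, and SciPy is assumed only for the digamma and log-sum-exp routines.

import numpy as np
from scipy.special import digamma, logsumexp

def ee_step(x, r, m0, lam0, a0, b0, alpha0):
    """One EE iteration for a 1-D Bayesian GMM. x: (N,), r: (N, K) holding E[z_nk]."""
    Nk = r.sum(axis=0)                                   # sum_n E[z_nk]
    # 1. update the hyperparameters, Eqs. (16)-(19)
    a = a0 + 0.5 * Nk
    lam = lam0 + Nk
    m = (lam0 * m0 + r.T @ x) / lam
    b = b0 + 0.5 * (((x[:, None] - m) ** 2 * r).sum(axis=0) + lam0 * (m - m0) ** 2)
    alpha = alpha0 + Nk
    # 2. update the expectations, Eqs. (12)-(15)
    E_ln_tau = digamma(a) - np.log(b)
    E_tau = a / b
    E_mu = m
    E_ln_pi = digamma(alpha) - digamma(alpha.sum())
    # E[z_nk], Eq. (11), normalized over k for each sample
    log_rho = E_ln_pi + 0.5 * E_ln_tau - 0.5 * E_tau * (x[:, None] - E_mu) ** 2
    r_new = np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
    return r_new, (m, lam, a, b, alpha)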

4.2 MM algorithm for Bayesian GMM

The main steps of the MM algorithm (for learning a Bayesian GMM) are summarized as follows:

(1) variational posteriors are defined for all hidden variables;
(2) a lower bound is used for the moments, $\ln(E[\Theta]) \geq E[\ln\Theta]$;
(3) moments are approximated using the mode of the distribution (MAP), $E[\theta] \approx \arg\max_\theta \ln q(\theta)$;
(4) $E[z_{nk}]$ is now MAP estimated.

We begin by assuming the following mean-field factorization for tractability (Bishop 2006)

$q(z, \pi, \mu, \tau) = \prod_{n=1}^{N}\prod_{k=1}^{K} q(z_{nk})\, q(\pi_k)\, q(\mu_k, \tau_k)$  (20)

The variational posterior $\ln q(\theta)$ is found by keeping the terms related to $\theta$ in $p(x, z, \pi, \mu, \tau)$, while the rest are absorbed into a constant

$\ln q(\tau_k) = E_{z_{nk}, \mu_k}[\ln p(x_n \mid z_{nk}, \mu_k, \tau_k) + \ln p(\mu_k \mid \tau_k) + \ln p(\tau_k)] + \text{const.}$  (21)
$\ln q(\mu_k) = E_{z_{nk}}[\ln p(x_n \mid z_{nk}, \mu_k, \tau_k) + \ln p(\mu_k \mid \tau_k)] + \text{const.}$  (22)
$\ln q(\pi_k) = E_{z_{nk}}[\ln p(z_{nk} \mid \pi_k) + \ln p(\pi_k)] + \text{const.}$  (23)
$\ln q(z_{nk}) = E_{\mu_k, \tau_k, \pi_k}[\ln p(x_n \mid z_{nk}, \mu_k, \tau_k) + \ln p(z_{nk} \mid \pi_k)] + \text{const.}$  (24)

4.2.1 Optimization of expectation functions

MM can be seen as a variant of EE where, instead of computing the expectation of the variational posterior, we compute the mode

$E[\theta] \approx \arg\max_\theta \ln q(\theta)$  (25)

Next, we MAP-approximate the expectations of the variational log posteriors. Their closed-form expressions can be obtained as follows

$E[\mu_k] \approx \arg\max_{\mu_k} \ln q(\mu_k) = \dfrac{\sum_{n=1}^{N} x_n E[z_{nk}] + \lambda_0 m_0}{\sum_{n=1}^{N} E[z_{nk}] + \lambda_0}$  (26)

$E[\pi_k] \approx \arg\max_{\pi_k} \ln q(\pi_k) \;\text{ s.t. } \sum_k \pi_k = 1, \quad = \dfrac{\sum_{n=1}^{N} E[z_{nk}] + (\alpha_0 - 1)}{\sum_{j=1}^{K}\sum_{n=1}^{N} E[z_{nj}]}$  (27)

$E[\tau_k] \approx \arg\max_{\tau_k} \ln q(\tau_k) = \dfrac{\frac{1}{2}\sum_{n=1}^{N} E[z_{nk}] + a_0 - 1}{b_0 + \frac{1}{2}\left(\sum_{n=1}^{N}(x_n - \mu_k)^2 E[z_{nk}] + \lambda_0(\mu_k - m_0)^2\right)}$  (28)

In EE, we compute $E[z_{nk}] \propto \exp\{\ln q(z_{nk})\}$. For MM, $E[z_{nk}]$ is instead obtained by MAP estimation

$E[z_{nk}] \approx \arg\max_{z_{nk}} \ln q(z_{nk}) = \arg\max_{z_{nk}} \left(E[\ln\pi_k] + \tfrac{1}{2}E[\ln\tau_k] - \tfrac{1}{2}E[\tau_k](x_n - E[\mu_k])^2\right) z_{nk}$  (29)

We use Jensen's inequality to simplify $E[z_{nk}]$ as follows

$\ln E[\tau_k] \geq E[\ln\tau_k], \qquad \ln E[\pi_k] \geq E[\ln\pi_k]$  (30)

Thus, we now only have to compute $E[z_{nk}]$ in terms of $E[\tau_k]$, $E[\pi_k]$ and $E[\mu_k]$, as seen below

$E[z_{nk}] = \arg\max_{z_{nk}} \left(\ln E[\pi_k] + \tfrac{1}{2}\ln E[\tau_k] - \tfrac{1}{2}E[\tau_k](x_n - E[\mu_k])^2\right) z_{nk}$  (31)

4.2.2 MM algorithm

In MM, we use the mode of $\ln q(\theta)$ to update $E[\theta]$ instead of computing the actual expectation function. By doing so, we may lose some stochastic behavior in the hidden variables (especially $E[z_{nk}]$); however, the computational load is also reduced. For certain models, e.g., some exponential-family distributions in which the expectation coincides with the mode, MM is identical to EE. This may be a naive assumption on our part and is possibly the main difference between EE and MM. The generic algorithm for learning MM alternates $E[z] \leftrightarrow E[\Theta]$. The MM algorithm for inferring a Bayesian GMM is summarized in Algorithm 2.

Algorithm 2 Learning MM algorithm for Bayesian GMM
Input: $x$
Output: $E[\mu_k]$, $E[\tau_k]$, $E[\pi_k]$, $E[z_{nk}]$
Initialization:
1. Randomly initialize the expectations $E[\mu_k]$, $E[z_{nk}]$, $E[\pi_k]$
2. Initialize the hyperparameters $m_0, \lambda_0, a_0, b_0, \alpha_0$
Compute until convergence:
   $E[\tau_k] = \dfrac{\frac{1}{2}\sum_{n=1}^{N} E[z_{nk}] + (a_0 - 1)}{b_0 + \frac{1}{2}\left\{\sum_{n=1}^{N}(x_n - E[\mu_k])^2 E[z_{nk}] + \lambda_0(E[\mu_k] - m_0)^2\right\}}$
   $E[\mu_k] = \dfrac{\sum_{n=1}^{N} x_n E[z_{nk}] + \lambda_0 m_0}{\sum_{n=1}^{N} E[z_{nk}] + \lambda_0}$
   $E[z_{nk}] = \arg\max_{z_{nk}}\left\{\ln E[\pi_k] + \frac{1}{2}\ln E[\tau_k] - \frac{1}{2}E[\tau_k](x_n - E[\mu_k])^2\right\} z_{nk}$
   $E[\pi_k] = \dfrac{\sum_{n=1}^{N} E[z_{nk}] + (\alpha_0 - 1)}{\sum_{j=1}^{K}\sum_{n=1}^{N} E[z_{nj}]}$
End
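For comparison, the following is a minimal NumPy sketch of one MM iteration of Algorithm 2, again for one-dimensional data and with our own (hypothetical) function and variable names. Note that the assignment step returns a one-hot vector rather than a normalized responsibility.

import numpy as np

def mm_step(x, r, E_mu, m0, lam0, a0, b0, alpha0):
    """One MM iteration for a 1-D Bayesian GMM. x: (N,), r: (N, K) one-hot E[z_nk]."""
    Nk = r.sum(axis=0)
    # MAP estimates of the variational posteriors, Eqs. (26)-(28)
    E_tau = (0.5 * Nk + (a0 - 1)) / (
        b0 + 0.5 * (((x[:, None] - E_mu) ** 2 * r).sum(axis=0) + lam0 * (E_mu - m0) ** 2))
    E_mu = (r.T @ x + lam0 * m0) / (Nk + lam0)
    E_pi = (Nk + (alpha0 - 1)) / Nk.sum()
    # hard (MAP) assignment, Eq. (31): pick the best cluster for each sample
    score = np.log(E_pi) + 0.5 * np.log(E_tau) - 0.5 * E_tau * (x[:, None] - E_mu) ** 2
    r_new = np.eye(len(Nk))[np.argmax(score, axis=1)]
    return r_new, E_mu, E_tau, E_pi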

5 Theoretical analysis of MM algorithm

Perhaps the most interesting finding in this work is the fact that MM and EE are theoretically related. This is despite the differences in their analytical approaches (MM relies on MAP approximation, while EE computes statistical moments via updated hyperparameters). This holds for the Bayesian GMM studied in this work.

5.1 Comparison of E [μk]

A sharp reader may have noticed the similarity between the hyperparameter update in Algorithm 1

$m_k = \dfrac{1}{\lambda_k}\left(\lambda_0 m_0 + \sum_{n=1}^{N} E[z_{nk}]\, x_n\right)$  (32)

and the MAP-approximated moment in Algorithm 2 below

$E[\mu_k] = \dfrac{\sum_{n=1}^{N} x_n E[z_{nk}] + \lambda_0 m_0}{\sum_{n=1}^{N} E[z_{nk}] + \lambda_0}$  (33)

By substituting the hyperparameter update below into $m_k$, we obtain identical expressions for both $m_k$ and $E[\mu_k]$

$\lambda_k = \lambda_0 + \sum_{n=1}^{N} E[z_{nk}]$  (34)
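This equivalence is easy to verify numerically. The short check below (our own illustration, with arbitrary toy values) confirms that the EE hyperparameter update of Eqs. (32) and (34) and the MM estimate of Eq. (33) coincide for a single cluster.

import numpy as np

rng = np.random.default_rng(1)
N, lam0, m0 = 50, 2.0, 0.5
x, Ez = rng.normal(size=N), rng.random(N)            # toy data and soft assignments for one cluster

lam_k = lam0 + Ez.sum()                              # Eq. (34)
m_k = (lam0 * m0 + Ez @ x) / lam_k                   # EE hyperparameter update, Eq. (32)
E_mu = (Ez @ x + lam0 * m0) / (Ez.sum() + lam0)      # MM MAP estimate, Eq. (33)
assert np.isclose(m_k, E_mu)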

5.2 Comparison of E [πk]

A Dirichlet distribution is given as

$\pi_k \sim \mathrm{Dir}(\alpha_k)$  (35)

and its statistical moment is given as

$E[\pi_k] = \dfrac{\alpha_k}{\sum_{j=1}^{K}\alpha_j}$  (36)

The corresponding hyperparameter update in Algorithm 1 is given as

$\alpha_k = \alpha_0 + \sum_{n=1}^{N} E[z_{nk}]$  (37)

If we substitute the hyperparameter update into $E[\pi_k]$, we obtain the following

$E[\pi_k] = \dfrac{\alpha_0 + \sum_{n=1}^{N} E[z_{nk}]}{\sum_{j=1}^{K}\left(\alpha_0 + \sum_{n=1}^{N} E[z_{nj}]\right)}$  (38)

If we constrain $\alpha_0 = 0$ in the denominator of the above expression, then the EE-derived solution is almost identical to the MM-derived solution below from Algorithm 2

$E[\pi_k] = \dfrac{\sum_{n=1}^{N} E[z_{nk}] + (\alpha_0 - 1)}{\sum_{j=1}^{K}\sum_{n=1}^{N} E[z_{nj}]}$  (39)
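The closeness of the two solutions can again be checked numerically. In the toy check below (our own illustration), the gap between Eq. (38) and Eq. (39) shrinks as the sample size grows relative to $K\alpha_0$.

import numpy as np

rng = np.random.default_rng(2)
N, K, alpha0 = 1000, 5, 0.1
Ez = rng.random((N, K))
Ez /= Ez.sum(axis=1, keepdims=True)
Nk = Ez.sum(axis=0)

pi_ee = (alpha0 + Nk) / (alpha0 + Nk).sum()          # Eq. (38): Dirichlet moment with updated alpha_k
pi_mm = (Nk + (alpha0 - 1)) / Nk.sum()               # Eq. (39): MAP estimate
print(np.max(np.abs(pi_ee - pi_mm)))                 # on the order of 1/N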

5.3 Comparison of E [τk]

Next, we revisit the moments of the Gamma distribution. For a Gamma distribution

$\tau_k \sim \mathrm{Gam}(a_k, b_k)$  (40)

the statistical moment is given as

$E[\tau_k] = \dfrac{a_k}{b_k}$  (41)

Now for Algorithm 1, we see that the hyperparameters are

$a_k = a_0 + \frac{1}{2}\sum_{n=1}^{N} E[z_{nk}]$  (42)

$b_k = b_0 + \frac{1}{2}\left(\sum_{n=1}^{N}(x_n - m_k)^2 E[z_{nk}] + \lambda_0(m_0 - m_k)^2\right)$  (43)

If we express the statistical moment in terms of the hyperparameter updates, we obtain the EE-derived solution

$E[\tau_k] = \dfrac{a_0 + \frac{1}{2}\sum_{n=1}^{N} E[z_{nk}]}{b_0 + \frac{1}{2}\left(\sum_{n=1}^{N}(x_n - m_k)^2 E[z_{nk}] + \lambda_0(m_k - m_0)^2\right)}$  (44)

If we substitute $E[\mu_k] = m_k$, we find that the above expression has an identical form to the corresponding expression found in Algorithm 2 below

$E[\tau_k] = \dfrac{\frac{1}{2}\sum_{n=1}^{N} E[z_{nk}] + (a_0 - 1)}{b_0 + \frac{1}{2}\left(\sum_{n}(x_n - E[\mu_k])^2 E[z_{nk}] + \lambda_0(E[\mu_k] - m_0)^2\right)}$  (45)

5.4 Comparisons of E [znk], E [ln τk] and E [lnπk]

Theoretically, since $z_{nk} \in \{0, 1\}$, the expected value of $z_{nk}$ and the maximum value of $z_{nk}$ should be identical when the dataset has sufficient statistics, i.e., as the sample size $N \to \infty$.

We recall that, in our MM algorithm, we simply apply Jensen's inequality to skip the additional computation for the moment approximations

$\ln E[\tau_k] \geq E[\ln\tau_k], \qquad \ln E[\pi_k] \geq E[\ln\pi_k]$  (46)

From the analysis above, we observe that when applying the MM algorithm to infer a Bayesian model, the solution of the MM algorithm in effect combines the hyperparameter updates and the statistical moments of the EE algorithm into one entity. This phenomenon is a coincidence, as we did not involve any hyperparameter updates and did not require knowledge of any statistical moments (of prior distributions) in the MM algorithm.

6 Experiments

We compare the MM algorithm for learning a Bayesian GMM on two different problems.

In the first problem, we use the Bayesian GMM to model class separation. One inherent characteristic of the Bayesian GMM is its ability to predict the number of classes. This means that we do not require class labels for learning the class separation. This problem is known as model selection or unsupervised classification. The objective is to compare the performance of MM for learning a Bayesian model with state-of-the-art results.

In the second problem, we use the Bayesian GMM to build an image dictionary for a Bag-of-Word image representation. This is commonly known as unsupervised feature learning. The goal is to compare the learning performance of the Bayesian approach and maximum likelihood (ML) approaches such as K-means and the expectation–maximization (EM) algorithm. Finally, we conclude by comparing our Bayesian approach with an ML approach, and we explain how the analytical form of the MM algorithm allows an easy interpretation of how the Bayesian and ML approaches are related to each other.

6.1 Unsupervised classification I

We train the Bayesian GMM as a classifier on both the object and scene classification datasets seen in Table 1. For the learner, we mainly compare EE and MM using criteria such as the predicted model, accuracy and convergence time. K refers to the truncation level, also set in Fan and Bouguila (2013). We set K = 15 in both datasets.

In the Caltech dataset (Fei-Fei et al. 2007), we use the train and test setup according to Fan and Bouguila (2013). For the attribute, we use a BoW image representation followed by a PCA reduction. The K-means dictionary used in the BoW is of size 1000, and we reduce it to a 256-dimensional image vector using PCA. We ran this setup for both the EE and MM algorithms. In a practical scenario, we limit both methods to a finite number of iterations. In this experiment, we set this number to 20 for the tabulated results. In Table 2, EE and MM are given 10 runs and the average is taken. Also, we empirically select and fix the best hyperparameters that appear in Algorithms 1 and 2.

We observe in Table 2 that EE is only marginally more accurate than MM but takes twice as long to compute, which is within our expectation. The confusion matrices for EE and MM are shown in Tables 3 and 4, respectively. However, the model prediction of EE is somewhat poorer than that of MM.

Table 1 Dataset used in unsupervised classification I

Train Test Attribute Class truth K

Caltech 1079 1077 256 4 15

Table 2 Performance evaluation using predicted model, accuracy and computational time for 20 iterations of EE and MM

           lnGD    lnGau   KNN     SVM     EE      MM
Model      3.85    3.80    –       –       7.60    5.40
Accuracy   88.03   84.19   84.32   88.46   87.38   86.47
Time       –       –       –       –       9.98    5.66

Table 3 EE confusion matrix

            Airplane   Car   Face   Motorbike
Airplane    288        96    16     0
Car         0          61    0      0
Face        0          0     217    0
Motorbike   0          0     11     388

Bold indicates best values

Table 4 MM confusion matrix

            Airplane   Car   Face   Motorbike
Airplane    270        128   2      0
Car         0          61    0      0
Face        0          0     217    0
Motorbike   0          3     0      396

Bold indicates best values

We observe that the reason is mainly insufficient learning, resulting in poorer convergence for EE. The value of $E[z_n]$ in EE tends to contain insignificant but nonzero values for most of the clusters. In comparison, MM computes only one nonzero value for $E[z_n]$, so many insignificantly weighted clusters can be pruned away early on. However, for EE the problem can be alleviated by increasing the number of iterations, e.g., to 100, as shown in Fig. 2. For MM, increasing the number of iterations only improves the convergence slightly. We note that for unsupervised classification, we set a threshold on the cluster weights generated by both EE and MM to predict the model. In Caltech4, we set the common threshold to 0.05.

We also compare our results to the infinite generalized Dirichlet mixture model (lnGD), the infinite Gaussian mixture model (lnGau), KNN and SVM as published in Fan and Bouguila (2013). In comparison, we found that our accuracy results for EE and MM are better than lnGau, despite using a finite GMM, and only slightly poorer than lnGD. However, our predicted model using a finite mixture model is poorer than those of the infinite mixture models. Also, we can see that the unsupervised classifiers are generally on par with supervised classifiers, e.g., KNN and SVM, for this particular dataset.

6.2 Unsupervised classification II

Next, we run an experiment on the Scene Categories dataset (Lazebnik et al. 2006) for further comparison of EE and MM.

Fig. 2 Visualizing the convergence of EE and MM by increasing the number of iterations from 20 to 100. The true number of clusters is 4

Table 5 Dataset used in unsupervised classification II

Train Test Attribute Class truth K

Scene categories 1252 – 200, 800 4 15

Table 6 Performance evaluation using predicted model, accuracy and computational time for 30 iterations of EE and MM

            GID Var   EE      MM      EE     MM
Attribute   81        200     200     800    800
Model       5         9       10.33   –      6.30
Accuracy    80.19     69.32   68.59   –      79.53
Time        –         16.17   10.22   –      40.89

We use the training setup in Bdiri et al. (2016) for Scene Categories, as seen in Table 5. There is no dataset split into train and test images here; all the available images from the four classes are used for training a Bayesian GMM. We use the same evaluation criteria as previously. For the attribute, we use a 200-dimensional BoW vector and a 2-by-2 spatial pyramid representation, i.e., an 800-dimensional SPR vector. We limit to 30 iterations to obtain the results for EE and MM in Table 6. We also take 10 reruns and use the average result. Our state-of-the-art comparison is the infinite generalized inverted Dirichlet mixture model (GID Var) in Bdiri et al. (2016).

For EE, we found that it slightly outperforms MM in terms of accuracy and predicted model but takes longer to compute. When using the 800-dimensional SPR vector as input, EE met with numerical errors despite our best attempts. We traced the problem to $E[z_n]$ outputting NaN in MATLAB for most of the samples after the first iteration, even though the values of the other expectations appear normal. We also encountered such

Table 7 EE confusion matrix

Forest Highway InsideCity TallBuilding

Forest 214 108 3 3

Highway 0 235 15 10

InsideCity 0 7 191 110

TallBuilding 2 61 73 220

Bold indicates best values

Table 8 MM confusion matrix

Forest Highway InsideCity TallBuilding

Forest 219 14 0 95

Highway 0 224 3 33

InsideCity 0 41 179 88

TallBuilding 1 84 26 245

Bold indicates best values

Table 9 MM confusion matrix (SPR vector)

Forest Highway InsideCity TallBuilding

Forest 262 0 3 63

Highway 0 217 18 25

InsideCity 0 3 261 44

TallBuilding 3 4 95 254

Bold indicates best values

a problem when using the 200 BoW vector, but it only occurred for several samples, which we manually removed from training. Further normalizing the dataset also did not seem to help. We believe such an issue could arise because EE may not work for some datasets due to insufficient statistics.

Fortunately, we do not have the same problem with MM. Moreover, it obtains a comparable accuracy result to GID Var, and it outperforms both EE and MM using the BoW vector with over 10% improvement, as well as a class prediction closer to the ground truth. When increasing the vector size by 4 times for MM, the time taken also increases linearly by 4 times, from 10.22 to 40.89 s in Table 6. The confusion matrices for EE and MM are shown in Tables 7, 8 and 9.

We also visualize the convergence of EE and MM on the Scene Categories dataset in Fig. 3. For MM at 30 iterations (using the SPR vector), we see that there appear to be a few dominantly weighted clusters, although the model selection is not as distinct as in the Caltech dataset case earlier. At 120 iterations, we see that between clusters 1–7 and clusters 8–15 there is a large offset, and if we allow a threshold of 0.06, then according to MM the best model selection is all the clusters from 1 to 7. We observe for EE (using the BoW vector, as EE cannot converge when using the SPR vector) that when increasing the iterations from 30 to 120, the weights of most clusters are suppressed, while 3 clusters appear more dominantly weighted than the rest.

Fig. 3 Visualizing the convergence of EE and MM by increasing the number of iterations from 30 to 120. The true number of clusters is 4

If we take a larger threshold, say 0.08, it may suggest that there are 3 dominant clusters. For a 4-class ground truth, it seems that EE was not able to pick up one of the important clusters. Regardless, for the 30-iteration results of EE in Table 6, we simply use a common threshold (with MM) of 0.06 for convenience (Table 10).

6.3 Unsupervised feature learning

Our concern is how well MM models the data when compared to traditional learners such as K-means and EM. We evaluate the performance of MM on the Multi-view Car 8-bin-pose classification (Ozuysal et al. 2009; Lim et al. 2016) and the Caltech 10-object classification (Fernando et al. 2012; Lian et al. 2010) datasets.

The first dataset contains 2299 image sequences of 20 different cars on a rotating platform (the first 10 cars for training and the next 10 cars for testing). We use 400 of the 1259 training images and 1040 testing images. The second dataset has 400 training images and 2644 testing images. Both experiments use the object bounding box to extract features. We extract a HoG feature every 8 pixels with a 2 × 2 partition, outputting a 1 × 32 descriptor with L2 normalization. In both datasets, we randomly select 2000 and 2500 HoG descriptors per class, respectively, totalling a 20,000-sample training set for unsupervised visual dictionary learning.

Table 10 Dataset used in unsupervised learning

Train Test Attribute Class truth

Car Pose 400 1040 100 ≤ k ≤ 1000 8

Caltech 400 2644 100 ≤ k ≤ 1000 10


Our baseline comparisons are K-means and EM. Both K-means and EM are based on the maximum likelihood approach. For EM, we only use 10 iterations, as computation is very demanding when the number of Gaussians increases, e.g., k = 200 takes 3.9 h and k = 700 takes 46 h for Caltech-10. In comparison, MM takes 0.3 h for k = 200. We use cluster sizes of 100 ≤ k ≤ 1000 for all methods. We also use 10 iterations to train MM, as the learning usually converges after 10 iterations for both datasets.

Each image of a dataset contains its own set of features. After unsupervised learning with GMM or K-means, we represent an image by vector quantization. The most popular method is Bag-of-Word, as follows

$\mathrm{BoW} = \sum_{n=1}^{N} E[z_n]$  (47)
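In code, the BoW encoding of Eq. (47) is a single sum over the per-descriptor assignments. The sketch below is our own illustration; the L1 normalization at the end is an assumption on our part and is not prescribed by Eq. (47).

import numpy as np

def bow_vector(Ez):
    """Bag-of-Word encoding of one image, Eq. (47). Ez: (N, K) assignments of its N descriptors."""
    h = Ez.sum(axis=0)
    return h / h.sum()                               # L1 normalization (our assumption, not in Eq. (47))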

We now introduce the use of two new constants, $C$ and $D$ (Bishop 2006). We later show that, by empirically fixing these two constants, they allow MM to be easily controlled with a single value $\lambda_0$. They are related to the hyperparameters of the Bayesian GMM as follows

$m_0 = \dfrac{C}{\lambda_0}, \qquad a_0 = 1 + \dfrac{\lambda_0}{2}, \qquad b_0 = D - \dfrac{C^2}{2\lambda_0}$  (48)

For $C = 4$, $D = 6$ and a prior belief of $\lambda_0 = 2$, the hyperparameters work out to be $m_0 = 2$, $a_0 = 2$, $b_0 = 2$. This means that the priors $\tau \sim \mathrm{Gam}(a_0, b_0)$ and $\mu \mid \tau \sim \mathcal{N}(m_0, (\lambda_0\tau)^{-1})$ are governed by the hyperparameter values, which in turn affect the estimation of $E[\mu_k]$ and $E[\tau_k]$ in Algorithm 2. Unfortunately, we do not have an optimal procedure for obtaining the hyperparameter values. This is an inherent problem of the Bayesian approach, and hyperparameter optimization is beyond the scope of this paper. Instead, we fix the two constants $C$, $D$ and vary the prior belief $\lambda_0$ in the experiments to empirically select the best result MM can offer.
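A small helper (our own, hypothetical) makes the reparameterization of Eq. (48) explicit; it reproduces the worked values quoted above for C = 4, D = 6 and lambda_0 = 2.

def hyperparams_from_constants(C, D, lam0):
    """Hyperparameters of the Bayesian GMM from the two fixed constants and the prior belief, Eq. (48)."""
    m0 = C / lam0
    a0 = 1.0 + lam0 / 2.0
    b0 = D - C ** 2 / (2.0 * lam0)
    return m0, a0, b0

print(hyperparams_from_constants(C=4, D=6, lam0=2))  # -> (2.0, 2.0, 2.0)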

Linear SVM is used for classification for all methods. Classification performance is measured in mean average precision (mAP), where average precision is the area under the precision–recall curve

$\mathrm{Precision} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP}), \qquad \mathrm{Recall} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN}), \qquad \mathrm{mAP} = \dfrac{1}{|J|}\sum_{j=1}^{|J|}\mathrm{Average\ Precision}_j$  (49)

where $|J|$ is the number of classes. We perform ten reruns and average the mAP results for each method.
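For reference, the sketch below (our own, not the authors' evaluation code) computes per-class average precision from ranked classifier scores and averages it into the mAP of Eq. (49).

import numpy as np

def average_precision(scores, labels):
    """AP for one class: labels are 1 for the positive class, scores are classifier confidences."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, float)
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(1.0 - labels)
    precision = tp / (tp + fp)                       # precision at each rank, Eq. (49)
    # AP = mean precision at the ranks of the true positives (area under the PR curve)
    return float(np.sum(precision * labels) / labels.sum())

def mean_average_precision(per_class_aps):
    """mAP of Eq. (49): average of the per-class APs."""
    return float(np.mean(per_class_aps))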

Fig. 4 Performance comparison for Multi-view Car bin-pose classification

Fig. 5 Performance comparison for Caltech-10 object classification

The behavior of the methods for different cluster sizes is shown in Figs. 4 and 5 for Multi-view Car and Caltech-10, respectively.

In the Multi-view Car experiment in Fig. 4, MM outperforms both K-means and EM for all cluster sizes. MM performs less optimally when $\lambda_0$ is either too large or zero. MM with a small nonzero $\lambda_0$ outperforms $\lambda_0 = 0$. The best mAP for MM, using $\lambda_0 = 5$, is 0.7418 at k = 200, compared to 0.66516 at k = 300 for K-means and 0.67785 at k = 500 for EM. The MM gain over the best baseline is 6.54% in Table 11.

In the Caltech-10 experiment in Fig. 5, using a prior belief $\lambda_0 \leq 1$ works best for MM. The best mAP for MM, using $\lambda_0 = 0.5$, is 0.89608 at k = 300, compared to 0.87769 at k = 1000 for K-means and 0.86046 at k = 500 for EM. The MM gain over the best baseline is 1.84% in Table 12. In both

Table 11 Best average precision evaluation on the Multi-view Cardataset for 8-bin-pose

Bin-pose class K-means EM MMa MM

k = 300 k = 500 k = 300 k = 200

Front 0.99437 0.97332 0.99824 0.98534

Front-left 0.62753 0.45492 0.86211 0.51241

Left 0.69789 0.63366 0.78453 0.89611

Rear-left 0.45773 0.57537 0.31384 0.64906

Rear 0.87559 0.95798 0.96399 0.83582

Rear-right 0.58441 0.5265 0.42318 0.65707

Right 0.33067 0.49751 0.37113 0.50651

Front-right 0.75306 0.80357 0.73772 0.90359

Mean AP 0.66516 0.67785 0.68184 0.74324

a MM using shared precision in Lim et al. (2016)

Table 12 Best average precision evaluation on the Caltech-10 dataset

Object classes K-means EM MM

k = 1000 k = 500 k = 300

Airplane 0.99439 0.98236 0.98774

Bonsai 0.4286 0.30479 0.63457

Car 0.98918 0.99026 0.99026

Chandelier 0.89538 0.88957 0.90423

Faces 0.98578 0.9809 0.988

Hawksbill 0.68685 0.79228 0.58625

Ketch 0.84959 0.83017 0.8814

Leopards 0.98226 0.96713 0.98856

Motorbikes 0.96803 0.98379 0.99302

Watch 0.99688 0.88339 1

Mean AP 0.87769 0.86046 0.89608

Table 13 Comparison of mean average precision on Caltech10

SA SDLM GEMP K-means EM MM

k 400 200 200 1000 500 300

mAP 0.875 0.893 0.929 0.878 0.860 0.896

datasets, it can be seen that even with no prior belief, MM outperforms the best results of K-means and EM at k = 300.

In comparison with recent works (Table 13) such as SA (Fernando et al. 2012), SDLM (Lian et al. 2010) and GEMP (Fernando et al. 2012) on Caltech-10, all three methods SA, SDLM and GEMP are supervised learners for feature learning, while MM is an unsupervised learner. Our best MM performance is slightly better than SDLM and SA but worse than GEMP. An improvement could be to combine supervised learning with MM in future work.

6.3.1 Comparing maximum likelihood and Bayesian approaches using the MM algorithm

We now present an intuitive way to explain the relationship between a Bayesian approach and an ML approach to GMM using the MM algorithm.

MM is governed by a prior belief $\lambda_0$, which determines how much trust we place on the prior models in a Bayesian approach. For zero prior belief, $\lambda_0 = 0$, MM reduces to the ML approach to GMM with $b_0 = 0$ as follows

$E[\mu_k] = \dfrac{\sum_{n=1}^{N} x_n E[z_{nk}] + \lambda_0 m_0}{\sum_{n=1}^{N} E[z_{nk}] + \lambda_0} \;\Rightarrow\; \dfrac{\sum_{n=1}^{N} x_n E[z_{nk}]}{\sum_{n=1}^{N} E[z_{nk}]}$  (50)

$E[\tau_k] = \dfrac{\frac{1}{2}\sum_{n=1}^{N} E[z_{nk}] + (a_0 - 1)}{b_0 + \frac{1}{2}\left\{\sum_{n=1}^{N}(x_n - E[\mu_k])^2 E[z_{nk}] + \lambda_0(E[\mu_k] - m_0)^2\right\}} \;\Rightarrow\; \dfrac{\sum_{n=1}^{N} E[z_{nk}]}{\sum_{n=1}^{N}(x_n - E[\mu_k])^2 E[z_{nk}]}$  (51)

Also, a larger prior belief leads to fewer mixture components contributing to the dataset, while a smaller belief allows more Gaussians to capture the dataset. This can be seen mathematically in the equations above: when $\lambda_0$ increases, we end up with a smaller $E[\tau_k]$, or larger variance ($\sigma^2 = 1/\tau$); hence, the estimated clusters end up with a larger within-cluster distance, and vice versa. On the two unsupervised feature learning datasets, the prior belief is empirically found to be optimal at $\lambda_0 \leq 2$, outperforming both K-means and EM for all cluster sizes.
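The reduction to the ML estimates can be verified directly. In the toy check below (our own illustration), setting $\lambda_0 = 0$, $a_0 = 1$ and $b_0 = 0$ makes the MM updates of Eqs. (26) and (28) coincide with the ML forms of Eqs. (50) and (51).

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
Ez = rng.random(200)                                 # soft assignments for one cluster

lam0, m0, a0, b0 = 0.0, 0.0, 1.0, 0.0                # zero prior belief, with b0 = 0 as in the text
mu_mm = (Ez @ x + lam0 * m0) / (Ez.sum() + lam0)     # Eq. (26)
tau_mm = (0.5 * Ez.sum() + (a0 - 1)) / (b0 + 0.5 * ((x - mu_mm) ** 2 @ Ez + lam0 * (mu_mm - m0) ** 2))  # Eq. (28)

mu_ml = (Ez @ x) / Ez.sum()                          # Eq. (50)
tau_ml = Ez.sum() / ((x - mu_ml) ** 2 @ Ez)          # Eq. (51)
assert np.isclose(mu_mm, mu_ml) and np.isclose(tau_mm, tau_ml)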

7 Discussion and future work

We have presented a new parameter estimation algorithm for the Bayesian GMM using the MM algorithm. The approach is closely related to the variational inference framework. The main difference is that, while variational inference relies on computing statistical moments of the prior distributions for learning, we rely on MAP approximation of the variational posteriors for computational efficiency. Although both methods offer analytical solutions, only our method provides an analytical solution in closed form. This is seen by comparing variational inference to the MM algorithm.

To verify the effectiveness of MM, we first compare MM on two unsupervised image classification datasets. Our experimental results show that MM is more efficient to compute but slightly poorer than variational inference in terms of accuracy and predicted model. Also, both MM and variational inference are comparable to published state-of-the-art results. Secondly, we compare the maximum likelihood approach (K-means and EM) with the Bayesian approach (using MM) on two unsupervised feature learning datasets. We found that, given empirically determined optimal hyperparameter settings, the Bayesian approach significantly outperforms the maximum likelihood approach at modeling the datasets.

A possible future work would be to extend the MM algorithm to areas where variational inference has received success, particularly Bayesian nonparametric models, hierarchical models and deep learning. Also, the Fisher vector traditionally relies on the maximum likelihood approach for learning the GMM. We can easily extend the Fisher vector to a Bayesian approach to further improve the results for unsupervised feature learning.

Acknowledgements We are grateful to Dr Shiping Wang for his helpful discussion and guidance.

Compliance with ethical standards

Conflict of interest Author Kart-Leong Lim declares no conflict of interest. Co-author Han Wang declares no conflict of interest.

Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.

8 Appendix

In our proposed approach for solving the Bayesian GMM using the MM algorithm, we emphasize its computational efficiency. For completeness, we now discuss how we can verify convergence by checking the lower bound at each iteration.

$L \geq \iiiint q(z, \mu, \tau, \pi) \ln\!\left[\dfrac{p(x, z, \mu, \tau, \pi)}{q(z, \mu, \tau, \pi)}\right] dz\, d\mu\, d\tau\, d\pi$
$= E[\ln p(x, z, \mu, \tau, \pi)] - E[\ln q(z, \mu, \tau, \pi)]$
$= E[\ln p(x \mid z, \mu, \tau)] + E[\ln p(z \mid \pi)] + E[\ln p(\pi)] + E[\ln p(\mu, \tau)] - E[\ln q(z)] - E[\ln q(\pi)] - E[\ln q(\mu, \tau)]$  (52)

In the lower bound expression above, each expectation function is taken with respect to all of the hidden variables and is simplified using Jensen's inequality as follows

$E[\ln p(x \mid z, \mu, \tau)] = E\!\left[\sum_{k=1}^{K}\sum_{n=1}^{N}\ln\mathcal{N}(x_n \mid \mu_k, \tau_k^{-1})^{z_{nk}}\right] = \sum_{k=1}^{K}\sum_{n=1}^{N}\left\{\tfrac{1}{2}E[\ln\tau_k] - \tfrac{E[\tau_k]}{2}(x_n - E[\mu_k])^2\right\} E[z_{nk}]$  (53)

$E[\ln p(z \mid \pi)] = E\!\left[\sum_{k=1}^{K}\sum_{n=1}^{N} z_{nk}\ln\pi_k\right] = \sum_{k=1}^{K}\sum_{n=1}^{N} E[z_{nk}]\, E[\ln\pi_k]$  (54)

$E[\ln p(\pi)] = E\!\left[\sum_{k=1}^{K}\ln\mathrm{Dir}(\pi_k \mid \alpha_0)\right] = (\alpha_0 - 1)\sum_{k=1}^{K} E[\ln\pi_k]$  (55)

$E[\ln p(\mu, \tau)] = E\!\left[\sum_{k=1}^{K}\ln\mathcal{N}(\mu_k \mid m_0, (\lambda_0\tau_k)^{-1})\,\mathrm{Gam}(\tau_k \mid a_0, b_0)\right] = \sum_{k=1}^{K}\left(-\dfrac{(E[\mu_k] - m_0)^2}{2(\lambda_0 E[\tau_k])^{-1}} + (a_0 - 1)E[\ln\tau_k] - b_0 E[\tau_k]\right)$  (56)

$E[\ln q(z)] = \sum_{n=1}^{N}\sum_{k=1}^{K}\left\{\ln E[\pi_k] + \tfrac{1}{2}\ln E[\tau_k] - \tfrac{E[\tau_k]}{2}(x_n - E[\mu_k])^2\right\} E[z_{nk}]$  (57)

$E[\ln q(\pi)] = E\!\left[\sum_{n=1}^{N}\sum_{k=1}^{K} E[z_{nk}]\ln\pi_k + (\alpha_0 - 1)\sum_{k=1}^{K}\ln\pi_k\right] = \sum_{n=1}^{N}\sum_{k=1}^{K} E[z_{nk}]\ln E[\pi_k] + (\alpha_0 - 1)\sum_{k=1}^{K}\ln E[\pi_k]$  (58)

$E[\ln q(\mu, \tau)] = E\!\left[\sum_{k=1}^{K}\left(\sum_{n=1}^{N}\left\{\dfrac{\ln\tau_k}{2} - \dfrac{\tau_k}{2}(x_n - E[\mu_k])^2\right\} E[z_{nk}] - \dfrac{(E[\mu_k] - m_0)^2}{2(\lambda_0\tau_k)^{-1}} + (a_0 - 1)\ln\tau_k - b_0\tau_k\right)\right]$
$= \sum_{k=1}^{K}\left(\sum_{n=1}^{N}\left\{\dfrac{E[\ln\tau_k]}{2} - \dfrac{E[\tau_k]}{2}(x_n - E[\mu_k])^2\right\} E[z_{nk}] - \dfrac{(E[\mu_k] - m_0)^2}{2(\lambda_0 E[\tau_k])^{-1}} + (a_0 - 1)E[\ln\tau_k] - b_0 E[\tau_k]\right)$  (59)

In practice, it is not necessary to compute the lower bound, as we can visualize convergence by checking the changes in the weight of each cluster using the equation below, as was done in the experiments

$\pi_k = \dfrac{\sum_{n=1}^{N} E[z_{nk}]}{\sum_{j=1}^{K}\sum_{n=1}^{N} E[z_{nj}]}$  (60)
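The convergence check described above amounts to recomputing Eq. (60) after every iteration and counting the clusters whose weight exceeds the threshold used in the experiments (0.05 or 0.06). A minimal sketch, with our own function names:

import numpy as np

def cluster_weights(Ez):
    """Cluster weights of Eq. (60) from the assignments E[z_nk] of shape (N, K)."""
    Nk = Ez.sum(axis=0)
    return Nk / Nk.sum()

def predicted_model(Ez, threshold=0.05):
    """Predicted number of clusters: count of weights above the chosen threshold."""
    return int(np.sum(cluster_weights(Ez) > threshold))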

References

Bdiri T, Bouguila N, Ziou D (2016) Variational Bayesian inference for infinite generalized inverted Dirichlet mixtures with feature selection and its application to clustering. Appl Intell 44(3):507–525
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Blei DM, Jordan MI et al (2006) Variational inference for Dirichlet process mixtures. Bayesian Anal 1(1):121–144
Cinbis RG, Verbeek J, Schmid C (2016) Approximate Fisher kernels of non-iid image models for image categorization. IEEE Trans Pattern Anal Mach Intell 38(6):1084–1098
Corduneanu A, Bishop CM (2001) Variational Bayesian model selection for mixture distributions. In: Artificial intelligence and statistics, vol 2001. Morgan Kaufmann, Waltham, MA, pp 27–34
Fan W, Bouguila N (2013) Variational learning of a Dirichlet process of generalized Dirichlet distributions for simultaneous clustering and feature selection. Pattern Recognit 46(10):2754–2769
Fei-Fei L, Fergus R, Perona P (2007) Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Comput Vis Image Underst 106(1):59–70
Fernando B, Fromont E, Muselet D, Sebban M (2012) Supervised learning of Gaussian mixture models for visual vocabulary generation. Pattern Recognit 45(2):897–907
Kurihara K, Welling M (2009) Bayesian k-means as a maximization–expectation algorithm. Neural Comput 21(4):1145–1172
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06), vol 2. IEEE, pp 2169–2178
Lian X-C, Li Z, Wang C, Lu B-L, Zhang L (2010) Probabilistic models for supervised dictionary learning. In: Proceedings of the 2010 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 2305–2312
Lim K-L, Wang H (2016) Learning a field of Gaussian mixture model for image classification. In: Proceedings of the 2016 14th international conference on control, automation, robotics and vision (ICARCV). IEEE, pp 1–5
Lim K-L, Wang H (2017) Sparse coding based Fisher vector using a Bayesian approach. IEEE Signal Process Lett 24(1):91
Lim K-L, Wang H, Mou X (2016) Learning Gaussian mixture model with a maximization–maximization algorithm for image classification. In: Proceedings of the 2016 12th IEEE international conference on control and automation (ICCA). IEEE, pp 887–891
Liu L, Shen C, Wang L, van den Hengel A, Wang C (2014) Encoding high dimensional local features by sparse coding based Fisher vectors. In: Advances in neural information processing systems, pp 1143–1151
MacKay DJC (2003) Information theory, inference and learning algorithms. Cambridge University Press, Cambridge
Ma Z, Leijon A (2011) Bayesian estimation of beta mixture models with variational inference. IEEE Trans Pattern Anal Mach Intell 33(11):2160–2173
Neal RM (2000) Markov chain sampling methods for Dirichlet process mixture models. J Comput Graph Stat 9(2):249–265
Ozuysal M, Lepetit V, Fua P (2009) Pose estimation for category specific multiview object localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE, pp 778–785
Paisley J, Wang C, Blei DM, Jordan MI (2015) Nested hierarchical Dirichlet processes. IEEE Trans Pattern Anal Mach Intell 37(2):256–270
Teh YW, Jordan MI, Beal MJ, Blei DM (2004) Sharing clusters among related groups: hierarchical Dirichlet processes. In: Advances in neural information processing systems, pp 1385–1392
Welling M, Kurihara K (2006) Bayesian k-means as a maximization–expectation algorithm. In: Proceedings of the 2006 SIAM international conference on data mining, pp 474–478
