Page 1: Applied Bayesian Nonparametric Mixture Modeling
Session 2 – Dirichlet process mixture models

Athanasios Kottas (thanos@ams.ucsc.edu) and Abel Rodriguez ([email protected])
University of California, Santa Cruz

2014 ISBA World Meeting, Cancún, Mexico
Sunday, July 13, 2014

Page 2: Outline

1 Introduction

2 Definition

3 Prior specification

4 Marginal samplers

5 Blocked Gibbs sampler

6 Variational approximations

Page 3: Motivating Dirichlet process mixtures

Recall that the Dirichlet process (DP) is a conjugate prior for random distributions under i.i.d. sampling.

However, posterior draws under a DP model correspond (almost surely) to discrete distributions ⇒ this is somewhat unsatisfactory if we are modeling continuous data.

One solution is to use convolutions to smooth out posterior estimates ⇒ similar to kernel density estimation.

In a model-based context, this leads to DP mixture models, i.e., a mixture model where the mixing distribution is unknown and assigned a DP prior (note that this is different from a mixture of DPs, in which the parameters of the DP are random).

Strong connection with finite mixture models.

More generally, we might be interested in using a DP as part of a hierarchical Bayesian model to place a prior on the unknown distribution of some of its parameters (e.g., random effects models). This leads to semiparametric Bayesian models.

Page 4: Mixture distributions

Mixture models arise naturally as flexible alternatives to standard parametric families.

Continuous mixture models (e.g., t, Beta-binomial, and Poisson-gamma models) typically achieve increased heterogeneity but are still limited to unimodality and usually symmetry.

Finite mixture distributions provide more flexible modeling, and are now feasible to implement due to advances in simulation-based model fitting (e.g., Richardson and Green, 1997; Stephens, 2000; Jasra, Holmes and Stephens, 2005).

Rather than handling the very large number of parameters of finite mixture models with a large number of mixands, it may be easier to work with an infinite-dimensional specification by assuming a random mixing distribution, which is not restricted to a specified parametric family.

Page 5: Finite mixture models

Recall the structure of a finite mixture model with K components, for example, a mixture of K = 2 Gaussian densities:

yi | w, µ1, µ2, σ1², σ2² ~ind. w N(yi; µ1, σ1²) + (1 − w) N(yi; µ2, σ2²),

that is, observation yi arises from a N(µ1, σ1²) distribution with probability w or from a N(µ2, σ2²) distribution with probability 1 − w (independently for each i = 1, …, n).

In the Bayesian setting, we also set priors for the unknown parameters:

(w, µ1, µ2, σ1², σ2²) ~ p(w, µ1, µ2, σ1², σ2²).

Page 6: Finite mixture models

The model can be rewritten in a few different ways. For example, we can introduce auxiliary random variables L1, …, Ln such that Li = 1 if yi arises from the N(µ1, σ1²) component (component 1) and Li = 2 if yi is drawn from the N(µ2, σ2²) component (component 2). Then the model can be written as

yi | Li, µ1, µ2, σ1², σ2² ~ind. N(yi | µ_{Li}, σ²_{Li}),
P(Li = 1 | w) = w = 1 − P(Li = 2 | w),
(w, µ1, µ2, σ1², σ2²) ~ p(w, µ1, µ2, σ1², σ2²).

If we marginalize over Li we recover the original mixture formulation. The inclusion of indicator variables is very common in finite mixture models, and we will make extensive use of it.

Page 7: Finite mixture models

We can also write

w N(yi; µ1, σ1²) + (1 − w) N(yi; µ2, σ2²) = ∫ N(yi; µ, σ²) dG(µ, σ²),

where

G(·) = w δ_{(µ1,σ1²)}(·) + (1 − w) δ_{(µ2,σ2²)}(·).

A similar expression can be used for a general K-component mixture model.

Note that G is discrete (and random) — a natural alternative is to use a DP prior for G, resulting in a Dirichlet process mixture (DPM).

Working with a countable mixture (rather than a finite one) provides theoretical advantages (full support) as well as practical ones (the model automatically decides how many components are appropriate for a given data set). Recall the comment about "good" nonparametric/semiparametric Bayes.

Page 8: Definition of the Dirichlet process mixture model

The Dirichlet process mixture model:

F(·; G) = ∫ K(·; θ) dG(θ), G ~ DP(α, G0),

where K(·; θ) is a parametric distribution function indexed by θ.

The Dirichlet process has been the most widely used prior for the random mixing distribution G, following the early work by Antoniak (1974), Lo (1984) and Ferguson (1983).

Corresponding mixture density (or probability mass) function:

f(·; G) = ∫ k(·; θ) dG(θ),

where k(·; θ) is the density (or probability mass) function of K(·; θ).

Because G is random, the c.d.f. F(·; G) and the density function f(·; G) are random (Bayesian nonparametric mixture models).

Page 9: Figure 2.1

[Figure 2.1: Two realizations from a DP(α = 2, G0 = N(0, 1)) (left column) and the associated cumulative distribution function (center column) and density function (right column) for a location DP mixture of Gaussian kernels with standard deviation 0.6.]

Page 10: Support of Dirichlet process mixture models

Contrary to DP prior models, the DP mixture F(·; G) can model both discrete distributions (e.g., K(·; θ) might be Poisson or binomial) and continuous distributions, either univariate (K(·; θ) can be, e.g., normal, gamma, or uniform) or multivariate (with K(·; θ), say, multivariate normal).

Several useful results for general mixtures of parametric families:

(Discrete) normal location-scale mixtures, ∑_{j=1}^M wj N(· | µj, σj²), can approximate arbitrarily well any density on the real line (Lo, 1984; Ferguson, 1983; Escobar and West, 1995) — analogously, for densities on R^d (West et al., 1994; Muller et al., 1996).

For any non-increasing density f(t) on the positive real line there exists a distribution function G such that f can be represented as a scale mixture of uniform densities, i.e., f(t) = ∫ θ^{−1} 1_{[0,θ)}(t) dG(θ) — the result yields flexible DP mixture models for symmetric unimodal densities (Brunner and Lo, 1989; Brunner, 1995) as well as general unimodal densities (Brunner, 1992; Lavine and Mockus, 1995; Kottas and Gelfand, 2001; Kottas and Krnjajic, 2009).

Page 11: Semiparametric Dirichlet process mixture models

Typically, semiparametric DP mixtures are employed:

yi | G, φ ~i.i.d. f(·; G, φ) = ∫ k(·; θ, φ) dG(θ), i = 1, …, n,
G ~ DP(α, G0),

with a parametric prior p(φ) placed on φ (and, perhaps, hyperpriors for α and/or the parameters ψ of G0 ≡ G0(· | ψ)).

Hierarchical formulation for DP mixture models: introduce a latent mixing parameter θi associated with yi,

yi | θi, φ ~ind. k(yi; θi, φ), i = 1, …, n,
θi | G ~i.i.d. G, i = 1, …, n,
G | α, ψ ~ DP(α, G0(· | ψ)),
φ, α, ψ ~ p(φ) p(α) p(ψ).

Page 12: An equivalent formulation

In the context of DP mixtures, the (almost sure) discreteness of realizations G from the DP(α, G0) prior is an asset — it allows ties in the θi, and thus makes DP mixture models appealing for many applications, including density estimation and regression.

Using the constructive definition of the DP, G = ∑_{ℓ=1}^∞ ωℓ δ_{ϑℓ}, the prior probability model f(·; G, φ) admits an (almost sure) representation as a countable mixture of parametric densities,

f(·; G, φ) = ∑_{ℓ=1}^∞ ωℓ k(·; ϑℓ, φ).

Weights: ω1 = z1, ωℓ = zℓ ∏_{r=1}^{ℓ−1} (1 − zr), ℓ ≥ 2, with the zr i.i.d. beta(1, α).
Locations: ϑℓ i.i.d. G0 (and the sequences {zr : r = 1, 2, …} and {ϑℓ : ℓ = 1, 2, …} are independent).
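To make the constructive definition concrete, here is a minimal R sketch (not from the slides) that simulates one prior realization of the mixture density by truncating the stick-breaking series; the truncation level L, the N(0, 1) baseline, and the kernel standard deviation 0.6 (borrowed from Figure 2.1) are illustrative assumptions:

# Minimal sketch: one prior draw of f(.; G, phi) under a location DP mixture of normals,
# using a truncated stick-breaking representation of G ~ DP(alpha, N(0, 1))
sim.dpm.density = function(xgrid, alpha, L = 200, kernel.sd = 0.6){
  z = rbeta(L, 1, alpha)              # stick-breaking proportions z_r
  om = z * cumprod(c(1, 1 - z[-L]))   # weights omega_l
  om = om/sum(om)                     # renormalize the truncated weights
  th = rnorm(L)                       # locations theta_l i.i.d. from G0 = N(0, 1)
  sapply(xgrid, function(x) sum(om * dnorm(x, th, kernel.sd)))
}
# e.g., plot(xg <- seq(-4, 4, len = 200), sim.dpm.density(xg, alpha = 2), type = "l")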

Page 13: Connection with finite mixture models

The formulation just described has motivated the study of several variants of the DP mixture model.

It also provides a link between limits of finite mixtures, with prior for the weights given by a symmetric Dirichlet distribution, and DP mixture models (e.g., Ishwaran and Zarepour, 2000).

Consider the K-component finite mixture model

∑_{t=1}^K ωt k(y; ϑt),

with (ω1, …, ωK) ~ Dir(α/K, …, α/K) and ϑt ~i.i.d. G0, t = 1, …, K.

When K → ∞, this model corresponds to a DP mixture with kernel k and a DP(α, G0) prior for the mixing distribution.

Page 14: Prior specification

Taking the expectation over G with respect to its DP(α, G0) prior, we get

E{F(·; G, φ)} = F(·; G0, φ),  E{f(·; G, φ)} = f(·; G0, φ).

These expressions facilitate prior specification for the parameters ψ of G0(· | ψ).

On the other hand, recall that for the DP(α, G0) prior, α controls how close a realization G is to G0.

In the DP mixture model, α controls the prior distribution of the number of distinct elements n* of the vector θ = (θ1, …, θn), and hence the number of distinct components of the mixture that appear in a sample of size n (Antoniak, 1974; Escobar and West, 1995; Liu, 1996).

Page 15: Prior specification

In particular,

P(n* = m | α) = cn(m) n! α^m Γ(α)/Γ(α + n), m = 1, …, n,

where the factors cn(m) = P(n* = m | α = 1) can be computed using certain recurrence formulas (Stirling numbers) (Escobar and West, 1995).

If α is assigned a prior p(α), P(n* = m) = ∫ P(n* = m | α) p(α) dα.

Moreover, for moderately large n,

E(n* | α) ≈ α log((α + n)/α),

which can be further averaged over the prior for α to obtain a prior estimate for E(n*).
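The approximation above makes it easy to check what a given α, or a gamma prior on α, implies about the number of clusters; a small R sketch (not from the slides; the function names are ours):

# Prior expected number of clusters: E(n* | alpha) ~ alpha * log((alpha + n)/alpha)
prior.enstar = function(alpha, n) alpha * log((alpha + n)/alpha)

# Average over alpha ~ Gam(a, b) by Monte Carlo to get a prior estimate of E(n*)
prior.enstar.avg = function(n, a, b, B = 10000) mean(prior.enstar(rgamma(B, a, rate = b), n))

# e.g., prior.enstar(1, 82) is about 4.4, so alpha = 1 suggests roughly 4-5 distinct
# components a priori for a sample of size n = 82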

Page 16: Prior specification

Two limiting special cases of the DP mixture model:

One distinct component, when α → 0+:

yi | θ, φ ~ind. k(yi; θ, φ), i = 1, …, n,
θ | ψ ~ G0(· | ψ),
φ, ψ ~ p(φ) p(ψ).

n components (one associated with each observation), when α → ∞:

yi | θi, φ ~ind. k(yi; θi, φ), i = 1, …, n,
θi | ψ ~i.i.d. G0(· | ψ), i = 1, …, n,
φ, ψ ~ p(φ) p(ψ).

Page 17: Methods for posterior inference

Data = {yi : i = 1, …, n}, i.i.d., conditionally on G and φ, from f(·; G, φ). (If the model includes a regression component, the data also include the covariate vectors xi, and, in such cases, φ typically includes the vector of regression coefficients.)

Interest is in inference for the latent mixing parameters θ = (θ1, …, θn), for φ (and the hyperparameters α, ψ), for f(y0; G, φ), and, in general, for functionals H(F(·; G, φ)) of the random mixture F(·; G, φ) (e.g., c.d.f., hazard function, mean and variance functionals, percentile functionals).

Full and exact inference, given the data, for all these random quantities is based on the joint posterior of the DP mixture model,

p(G, φ, θ, α, ψ | data).

Page 18: Marginal posterior simulation methods

Key result: representation of the joint posterior distribution,

p(G, φ, θ, α, ψ | data) = p(G | θ, α, ψ) p(θ, φ, α, ψ | data).

p(θ, φ, α, ψ | data) is the marginal posterior for the finite-dimensional portion of the full parameter vector (G, φ, θ, α, ψ).

G | θ, α, ψ ~ DP(α̃, G̃0), where α̃ = α + n, and

G̃0(·) = (α/(α + n)) G0(· | ψ) + (1/(α + n)) ∑_{i=1}^n δ_{θi}(·).

(Hence, in terms of the c.d.f., G̃0(t) = (α/(α + n)) G0(t | ψ) + (1/(α + n)) ∑_{i=1}^n 1_{[θi,∞)}(t).)

Sampling from the DP(α̃, G̃0) is possible using one of its definitions — thus, we can obtain full posterior inference under DP mixture models if we sample from the marginal posterior p(θ, φ, α, ψ | data).

Page 19: Marginal posterior simulation methods

The marginal posterior p(θ, φ, α, ψ | data) corresponds to the marginalized version of the DP mixture model, obtained after integrating G over its DP prior (Blackwell and MacQueen, 1973):

yi | θi, φ ~ind. k(yi; θi, φ), i = 1, …, n,
θ = (θ1, …, θn) | α, ψ ~ p(θ | α, ψ),
φ, α, ψ ~ p(φ) p(α) p(ψ).

The induced prior distribution p(θ | α, ψ) for the mixing parameters θi can be developed by exploiting the Polya urn characterization of the DP:

p(θ | α, ψ) = G0(θ1 | ψ) ∏_{i=2}^n { (α/(α + i − 1)) G0(θi | ψ) + (1/(α + i − 1)) ∑_{j=1}^{i−1} δ_{θj}(θi) }.

For increasing sample sizes, the joint prior p(θ | α, ψ) gets increasingly complex to work with.

Page 20: Marginal posterior simulation methods

Therefore, the marginal posterior

p(θ, φ, α, ψ | data) ∝ p(θ | α, ψ) p(φ) p(α) p(ψ) ∏_{i=1}^n k(yi; θi, φ)

is difficult to work with — even point estimates are practically impossible to compute for moderate to large sample sizes.

Early work for posterior inference:
- Some results for certain problems in density estimation, i.e., expressions for Bayes point estimates of f(y0; G) (Lo, 1984; Brunner and Lo, 1989).
- Approximations for special cases, e.g., for binomial DP mixtures (Berry and Christensen, 1979).
- Monte Carlo integration algorithms to obtain point estimates for the θi (Ferguson, 1983; Kuo, 1986a,b).

Page 21: Simulation-based model fitting

Note that, although the joint prior p(θ | α, ψ) has an awkward expression for samples of realistic size n, the prior full conditionals have convenient expressions:

p(θi | {θj : j ≠ i}, α, ψ) = (α/(α + n − 1)) G0(θi | ψ) + (1/(α + n − 1)) ∑_{j≠i} δ_{θj}(θi).

Key idea (Escobar, 1988; 1994): set up a Markov chain to explore the posterior p(θ, φ, α, ψ | data) by simulating only from posterior full conditional distributions, which arise by combining the likelihood terms with the corresponding prior full conditionals (in fact, Escobar's algorithm is essentially a Gibbs sampler developed for a specific class of models!).

Several other Markov chain Monte Carlo (MCMC) methods improve on the original algorithm (e.g., West et al., 1994; Escobar and West, 1995; Bush and MacEachern, 1996; Neal, 2000; Jain and Neal, 2004).

Page 22: Simulation-based model fitting

A key property for the implementation of the Gibbs sampler is the discreteness of G, which induces a clustering of the θi:

- n*: number of distinct elements (clusters) in the vector (θ1, …, θn).
- θ*_j, j = 1, …, n*: the distinct θi.
- w = (w1, …, wn): vector of configuration indicators, defined by wi = j if and only if θi = θ*_j, i = 1, …, n.
- nj: size of the j-th cluster, i.e., nj = |{i : wi = j}|, j = 1, …, n*.

(n*, w, (θ*_1, …, θ*_{n*})) is equivalent to (θ1, …, θn).

The standard Gibbs sampler to draw from p(θ, φ, α, ψ | data) (Escobar and West, 1995) is based on the following full conditionals:

1. p(θi | {θi′ : i′ ≠ i}, α, ψ, φ, data), for i = 1, …, n.
2. p(φ | {θi : i = 1, …, n}, data).
3. p(ψ | {θ*_j : j = 1, …, n*}, n*, data).
4. p(α | n*, data).

(The expressions include conditioning only on the relevant variables, exploiting the conditional independence structure of the model and properties of the DP.)

Page 23: Simulation-based model fitting

1. For each i = 1, …, n, p(θi | {θi′ : i′ ≠ i}, α, ψ, φ, data) is simply a mixture of n*− point masses and the posterior for θi based on yi:

(αq0/D) h(θi | ψ, φ, yi) + ∑_{j=1}^{n*−} (n_j^− qj/D) δ_{θ*_j^−}(θi),

where
- D = αq0 + ∑_{j=1}^{n*−} n_j^− qj is the normalizing constant;
- qj = k(yi; θ*_j^−, φ);
- q0 = ∫ k(yi; θ, φ) g0(θ | ψ) dθ;
- h(θi | ψ, φ, yi) ∝ k(yi; θi, φ) g0(θi | ψ);
- g0 is the density of G0.

The superscript "−" denotes all relevant quantities when θi is removed from the vector (θ1, …, θn); e.g., n*− is the number of clusters in {θi′ : i′ ≠ i}.

Updating θi implicitly updates wi, i = 1, …, n; before updating θi+1, we redefine n*, θ*_j for j = 1, …, n*, wi for i = 1, …, n, and nj for j = 1, …, n*.

Page 24: Example: location normal DP mixture model

For example, consider a location mixture of normals with unknown variance, in which:

- k(yi; θ, φ) = N(yi; θ, φ^{−1});
- G0(· | ψ) = N(· | ψ, ν^{−1});
- p(φ) = Gam(aφ, bφ);
- p(ψ) = N(0, 1).

This model can be seen as a Bayesian semiparametric analogue of Gaussian kernel density estimation, with φ^{−1/2} playing the role of the bandwidth parameter.

In this case:

- h(θi | ψ, φ, yi) = N((φyi + νψ)/(φ + ν), 1/(φ + ν));
- q0 = N(yi | ψ, φ^{−1} + ν^{−1}).

Page 25: R function to sample {θ*_j}

Implementation can be a bit tricky because of the bookkeeping.

We assume that components are labeled consecutively from 1 to n*.

Note that computations of the weights are done on the logarithmic scale (ALWAYS a good practice).

sample.thetast = function(thetast, y, w, phi, alpha, psi, nu){
  n = length(y)
  for(i in 1:n){
    if(sum(w==w[i])==1){ # Is this the only observation in this cluster?
      w[-i] = as.numeric(factor(w[-i], labels = seq(1, length(unique(w[-i]))))) # Relabel clusters
      thetast = thetast[-w[i]] # Eliminate the value of theta* associated with this singleton component
    }
    nst = max(w[-i])
    p = rep(0, nst+1)
    for(j in 1:nst){
      p[j] = log(sum(w[-i]==j)) + dnorm(y[i], thetast[j], sqrt(1/phi), log=TRUE)
    }
    p[nst+1] = log(alpha) + dnorm(y[i], psi, sqrt(1/phi + 1/nu), log=TRUE)
    p = exp(p - max(p))
    p = p/sum(p)
    w[i] = sample(1:(nst+1), 1, replace=TRUE, prob=p)
    if(w[i] == nst+1){ # If a new cluster is created, sample the corresponding value of theta*
      thetast = c(thetast, rnorm(1, (phi*y[i] + nu*psi)/(phi + nu), sqrt(1/(phi + nu))))
    }
  }
  return(list(thetast = thetast, w = w, nst = max(w)))
}

Page 26: Simulation-based model fitting

2. The posterior full conditional for φ does not involve the nonparametric part of the DP mixture model:

p(φ | {θi : i = 1, …, n}, data) ∝ p(φ) ∏_{i=1}^n k(yi; θi, φ).

In our running example with the location normal DP mixture,

φ | {θi : i = 1, …, n}, data ~ Gam(aφ + n/2, bφ + (1/2) ∑_{i=1}^n (yi − θi)²).
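In R this update is one line; a sketch consistent with the notation above (theta denotes the current vector (θ1, …, θn); the function name is ours):

# Gibbs update for phi in the location-normal DP mixture
sample.phi = function(y, theta, aphi, bphi){
  rgamma(1, aphi + length(y)/2, rate = bphi + 0.5*sum((y - theta)^2))
}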

Page 27: Simulation-based model fitting

3. Regarding the parameters ψ of G0,

p(ψ | {θ*_j : j = 1, …, n*}, n*, data) ∝ p(ψ) ∏_{j=1}^{n*} g0(θ*_j | ψ),

leading, typically, to standard updates.

For instance, in our running example we have:

ψ | {θ*_j : j = 1, …, n*}, n*, data ~ N( ν ∑_{j=1}^{n*} θ*_j / (n*ν + 1), 1/(n*ν + 1) ).

(It may be tempting to use a flat prior for ψ, but this is a bad idea: DP priors are not well defined if the baseline measure is improper.)
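Again a one-line draw in R; a sketch under the N(0, 1) prior for ψ used in the running example (function name ours):

# Gibbs update for psi: normal posterior based on the n* distinct cluster locations
sample.psi = function(thetast, nu){
  nst = length(thetast)
  rnorm(1, nu*sum(thetast)/(nst*nu + 1), sqrt(1/(nst*nu + 1)))
}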

Page 28: Simulation-based model fitting

4. Although the posterior full conditional for α is not of a standard form, an augmentation method facilitates sampling if α has a gamma prior (say, with mean aα/bα) (Escobar and West, 1995):

p(α | n*, data) ∝ p(α) α^{n*} Γ(α)/Γ(α + n)
 ∝ p(α) α^{n*−1} (α + n) Beta(α + 1, n)
 ∝ p(α) α^{n*−1} (α + n) ∫_0^1 η^α (1 − η)^{n−1} dη.

Introduce an auxiliary variable η such that

p(α, η | n*, data) ∝ p(α) α^{n*−1} (α + n) η^α (1 − η)^{n−1}.

Extend the Gibbs sampler to draw η | α, data ~ beta(α + 1, n), and α | η, n*, data from the two-component gamma mixture

ε Gam(aα + n*, bα − log(η)) + (1 − ε) Gam(aα + n* − 1, bα − log(η)),

where ε = (aα + n* − 1) / {n(bα − log(η)) + aα + n* − 1}.

Page 29: Sampling α under a gamma prior in R

The following R function implements the algorithm proposed in Escobar and West (1995) to sample the DP precision parameter under a gamma prior.

sample.alpha = function(alpha, n, nst, aalpha, balpha){
  eta = rbeta(1, alpha + 1, n)
  oddsratio = (aalpha + nst - 1)/(n*(balpha - log(eta)))
  epsilon = oddsratio/(1 + oddsratio)
  u = runif(1)
  if(u < epsilon){
    alpha = rgamma(1, aalpha + nst, rate = balpha - log(eta))
  } else {
    alpha = rgamma(1, aalpha + nst - 1, rate = balpha - log(eta))
  }
  return(alpha)
}

Page 30: Improved marginal Gibbs sampler

(West et al., 1994; Bush and MacEachern, 1996): adds one more step, in which the cluster locations θ*_j are resampled at each iteration to improve the mixing of the chain.

At each iteration, once step (1) is completed, we obtain a specific number of clusters n* and configuration w = (w1, …, wn).

After the marginalization over G, the prior for the θ*_j, given the partition (n*, w), is given by ∏_{j=1}^{n*} g0(θ*_j | ψ); i.e., given n* and w, the θ*_j are i.i.d. from G0.

Hence, for each j = 1, …, n*, the posterior full conditional is

p(θ*_j | w, n*, ψ, φ, data) ∝ g0(θ*_j | ψ) ∏_{i: wi = j} k(yi; θ*_j, φ).

In our running example, θ*_j | w, n*, ψ, φ, data ~ N( (νψ + nj φ ȳj)/(ν + nj φ), 1/(ν + nj φ) ), where nj is the size of cluster j, and ȳj = nj^{−1} ∑_{i: wi = j} yi is the average of the observations assigned to cluster j.
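A sketch of this extra step for the running example (cluster locations resampled given the configuration; function name ours):

# Resample the distinct cluster locations theta*_j given the partition (n*, w)
resample.thetast = function(y, w, phi, psi, nu){
  nst = max(w)
  thetast = numeric(nst)
  for(j in 1:nst){
    yj = y[w == j]                  # observations currently assigned to cluster j
    nj = length(yj)
    thetast[j] = rnorm(1, (nu*psi + phi*sum(yj))/(nu + nj*phi), sqrt(1/(nu + nj*phi)))
  }
  thetast
}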

Page 31: Alternative computational schemes

The Gibbs sampler can be difficult or inefficient to implement if:

- The integral ∫ k(y; θ, φ) g0(θ | ψ) dθ is not available in closed form (and numerical integration is not feasible or reliable).
- Random generation from h(θ | ψ, φ, y) ∝ k(y; θ, φ) g0(θ | ψ) is not readily available.

For such cases, alternative MCMC algorithms have been proposed in the literature (e.g., MacEachern and Muller, 1998; Neal, 2000; Dahl, 2005; Jain and Neal, 2007).

Extensions for data structures that include missing or censored observations are also possible (Kuo and Smith, 1992; Kuo and Mallick, 1997; Kottas, 2006).

Alternative (to MCMC) fitting techniques have also been studied (e.g., Liu, 1996; MacEachern et al., 1999; Newton and Zhang, 1999; Blei and Jordan, 2006).

Page 32: Posterior predictive distributions

Implementing one of the available MCMC algorithms for DP mixture models, we obtain B posterior samples

{θb = (θib : i = 1, …, n), αb, ψb, φb}, b = 1, …, B,

from p(θ, φ, α, ψ | data).

Or, equivalently, posterior samples

{n*_b, wb, θ*_b = (θ*_{jb} : j = 1, …, n*_b), αb, ψb, φb}, b = 1, …, B,

from p(n*, w, θ* = (θ*_j : j = 1, …, n*), φ, α, ψ | data).

The Bayesian density estimate is based on the posterior predictive density p(y0 | data) corresponding to a new y0 (with associated θ0).

Using, again, the Polya urn structure for the DP,

p(θ0 | n*, w, θ*, α, ψ) = (α/(α + n)) G0(θ0 | ψ) + (1/(α + n)) ∑_{j=1}^{n*} nj δ_{θ*_j}(θ0).

Page 33: Posterior predictive distributions

The posterior predictive distribution for y0 is given by

p(y0 | data) = ∫ p(y0 | n*, w, θ*, α, ψ, φ) p(n*, w, θ*, α, ψ, φ | data)
 = ∫∫ p(y0 | θ0, φ) p(θ0 | n*, w, θ*, α, ψ) p(n*, w, θ*, α, ψ, φ | data) dθ0 dw dθ* dα dψ dφ.

Hence, a sample (y0,b : b = 1, …, B) from p(y0 | data) can be obtained using the MCMC output, where, for each b = 1, …, B:

- we first draw θ0,b from p(θ0 | n*_b, wb, θ*_b, αb, ψb),
- and then draw y0,b from p(y0 | θ0,b, φb) = k(y0; θ0,b, φb).
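For the running location-normal example, the two draws can be coded directly from a saved state; a sketch (function name and argument layout are ours):

# Draw one predictive sample y0: first theta0 from the Polya urn, then y0 from the kernel
predict.y0 = function(thetast, w, phi, alpha, psi, nu){
  nst = max(w)
  nj = tabulate(w, nbins = nst)                    # cluster sizes n_j
  j = sample(1:(nst + 1), 1, prob = c(nj, alpha))  # weights prop. to (n_1, ..., n_n*, alpha)
  theta0 = if(j <= nst) thetast[j] else rnorm(1, psi, sqrt(1/nu))  # new atom from G0
  rnorm(1, theta0, sqrt(1/phi))
}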

Page 34: Posterior predictive distributions

To further highlight the mixture structure, note that we can also write

p(y0 | data) = ∫ { (α/(α + n)) ∫ k(y0; θ, φ) g0(θ | ψ) dθ + (1/(α + n)) ∑_{j=1}^{n*} nj k(y0; θ*_j, φ) } p(n*, w, θ*, α, ψ, φ | data) dw dθ* dα dψ dφ.

The integrand above is a mixture with n* + 1 components, where the last n* components (which dominate when α is small relative to n) yield a discrete mixture (in θ) of k(·; θ, φ) with the mixture parameters defined by the distinct θ*_j.

The posterior predictive density for y0 is obtained by averaging this mixture with respect to the posterior distribution of n*, w, θ*, and all other parameters.

Page 35: Inference for general functionals of the random mixture

Note that p(y0 | data) is the posterior point estimate for the density f(y0; G, φ) (at point y0), i.e., p(y0 | data) = E(f(y0; G, φ) | data) (the Bayesian density estimate under a DP mixture model can be obtained without sampling from the posterior distribution of G).

Analogously, we can obtain posterior moments for linear functionals H(F(·; G, φ)) = ∫ H(K(·; θ, φ)) dG(θ) (Gelfand and Mukhopadhyay, 1995) — for linear functionals, the functional of the mixture is the mixture of the functionals applied to the parametric kernel (e.g., density and c.d.f. functionals, mean functional).

How about more general inference for functionals?

- Interval estimates for F(y0; G, φ) for specified y0, and, therefore, (pointwise) uncertainty bands for F(·; G, φ)?
- Inference for derived functions from F(·; G, φ), e.g., cumulative hazard, − log(1 − F(·; G, φ)), or hazard, f(·; G, φ)/(1 − F(·; G, φ)), functions?
- Inference for non-linear functionals, e.g., for percentiles?

Page 36: Inference for general functionals of the random mixture

Such inferences require the posterior distribution of G — recall,

p(G, φ, θ, α, ψ | data) = p(G | θ, α, ψ) p(θ, φ, α, ψ | data),

and

G | θ, α, ψ ~ DP( α + n, G̃0 = (α/(α + n)) G0(· | ψ) + (1/(α + n)) ∑_{i=1}^n δ_{θi} ).

Hence, given posterior samples (θb, αb, ψb, φb), for b = 1, …, B, from the marginalized version of the DP mixture, we can draw Gb from p(G | θb, αb, ψb) using:

- the constructive definition of the DP with a truncation approximation (Gelfand and Kottas, 2002; Kottas, 2006);
- the original DP definition, if we only need sample paths for the c.d.f. of the mixture (and y is univariate) (e.g., Krnjajic et al., 2008).

Finally, the posterior samples Gb yield posterior samples {H(F(·; Gb, φb)) : b = 1, …, B} for any functional H(F(·; G, φ)).
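A sketch of the truncation approach for the running example: draw G | θ, α, ψ as a finite stick-breaking measure from DP(α + n, G̃0) (the truncation level L and the function name are ours):

# Draw (weights, atoms) approximating G | theta, alpha, psi ~ DP(alpha + n, G0tilde)
draw.G = function(theta, alpha, psi, nu, L = 500){
  n = length(theta)
  v = rbeta(L - 1, 1, alpha + n)                   # stick-breaking proportions
  wts = c(v, 1) * cumprod(c(1, 1 - v))             # last weight absorbs the remaining mass
  atoms = ifelse(runif(L) < alpha/(alpha + n),     # G0tilde: G0 with prob alpha/(alpha + n),
                 rnorm(L, psi, sqrt(1/nu)),        # i.e., N(psi, 1/nu) in the running example,
                 sample(theta, L, replace = TRUE)) # otherwise one of the observed theta_i
  list(wts = wts, atoms = atoms)
}
# Any functional of G then follows, e.g., a draw of the mixture c.d.f. at t:
# F(t; G, phi) ~ sum(wts * pnorm(t, atoms, sqrt(1/phi)))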

Page 37: Density estimation data example

As an example, we analyze the galaxy data set: velocities (km/second) for 82 galaxies, drawn from six well-separated conic sections of the Corona Borealis region.

The model is a location-scale DP mixture of Gaussian distributions, with a conjugate normal-inverse gamma baseline distribution:

f(·; G) = ∫ N(·; µ, σ²) dG(µ, σ²), G ~ DP(α, G0),

where G0(µ, σ²) = N(µ; µ0, σ²/κ) IGamma(σ²; ν, s).

We consider four different prior specifications to explore the effect of increasing flexibility in the DP prior hyperparameters.

Figure 2.2 shows posterior predictive density estimates obtained using the function DPdensity in the R package DPpackage (the code was taken from one of the examples in the help file).

Page 38: Density estimation data example: Code

# Data
data(galaxy)
galaxy = data.frame(galaxy, speeds = galaxy$speed/1000)
attach(galaxy)

# Initial state
state = NULL

# MCMC parameters
nburn = 1000
nsave = 10000
nskip = 10
ndisplay = 100
mcmc = list(nburn=nburn, nsave=nsave, nskip=nskip, ndisplay=ndisplay)

# Example of Prior information 1
# Fixing alpha, m1, and Psi1
prior1 = list(alpha=1, m1=rep(0,1), psiinv1=diag(0.5,1), nu1=4, tau1=1, tau2=100)

# Example of Prior information 2
# Fixing alpha and m1
prior2 = list(alpha=1, m1=rep(0,1), psiinv2=solve(diag(0.5,1)), nu1=4, nu2=4, tau1=1, tau2=100)

# Example of Prior information 3
# Fixing only alpha
prior3 = list(alpha=1, m2=rep(0,1), s2=diag(100000,1), psiinv2=solve(diag(0.5,1)), nu1=4, nu2=4, tau1=1, tau2=100)

# Example of Prior information 4
# Everything is random
prior4 = list(a0=2, b0=1, m2=rep(0,1), s2=diag(100000,1), psiinv2=solve(diag(0.5,1)), nu1=4, nu2=4, tau1=1, tau2=100)

# Fit the models
fit1.1 = DPdensity(y=speeds, prior=prior1, mcmc=mcmc, state=state, status=TRUE)
fit1.2 = DPdensity(y=speeds, prior=prior2, mcmc=mcmc, state=state, status=TRUE)
fit1.3 = DPdensity(y=speeds, prior=prior3, mcmc=mcmc, state=state, status=TRUE)
fit1.4 = DPdensity(y=speeds, prior=prior4, mcmc=mcmc, state=state, status=TRUE)

# Plot the estimated densities
plot(fit1.1, ask=FALSE)
plot(fit1.2, ask=FALSE)
plot(fit1.3, ask=FALSE)
plot(fit1.4, ask=FALSE)

Page 39: Density estimation data example

[Figure 2.2: Histograms of the raw data and posterior predictive densities under four prior choices for the galaxy data. In the top left panel we set α = 1, µ0 = 0, s = 2, ν = 4, κ ~ Gam(0.5, 50); the top right panel uses the same settings except s ~ IGamma(4, 2); in the bottom left panel we add the hyperprior µ0 ~ N(0, 100000); and in the bottom right panel we further add the hyperprior α ~ Gam(2, 2).]

Page 40: Density estimation data example

To explore the clustering structure generated by the model, we can use the average incidence matrix: entry (i, j) corresponds to the probability that observations yi and yj are assigned to the same mixture component.

A point estimate for the clustering can be obtained from this matrix using the method described by Lau and Green (2007). Plotting a heatmap of the matrix provides information on the uncertainty of the clustering. Often, to get a readable heatmap you need to permute the order of the observations.

Obtaining an incidence matrix from the output of DPdensity can be a bit tricky because the R function does not do it by default. If the output of the sampler is stored in fit, you can access the clustering in the last iteration run by typing fit$state$ss.
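One possible workaround, given that DPdensity only exposes the current state: save one iteration at a time and restart the sampler from the previous state (status=FALSE continues a previous analysis). This is a sketch, reusing prior4 and speeds from the code above; the number of accumulation sweeps B and the object names are our assumptions:

# Accumulate the average incidence matrix over B saved iterations
n = length(speeds)
A = matrix(0, n, n)
B = 1000
mcmc1 = list(nburn = 0, nsave = 1, nskip = 0, ndisplay = 10000)
fit = DPdensity(y = speeds, prior = prior4, mcmc = mcmc, state = state, status = TRUE)
for(b in 1:B){
  fit = DPdensity(y = speeds, prior = prior4, mcmc = mcmc1, state = fit$state, status = FALSE)
  ss = fit$state$ss            # cluster membership at the current iteration
  A = A + outer(ss, ss, "==")  # 1 if y_i and y_j currently share a component
}
A = A/B
# heatmap(A, Rowv = NA, Colv = NA)  # may need to permute observations for readability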

Page 41: Density estimation data example

[Figure 2.3: Clustering structure generated by the DP mixture model for the galaxy data. The left panel reproduces the plot from the bottom right panel of Figure 2.2. The right panel shows the corresponding average incidence matrix.]

Page 42: Conditional posterior simulation methods

The main characteristic of the marginal MCMC methods is that they are based on the posterior distribution of the DP mixture model, p(θ, φ, α, ψ | data), resulting after marginalizing the random mixing distribution G (thus referred to as marginal, or sometimes collapsed, methods).

Although posterior inference for G is possible under the collapsed sampler, it is of interest to study alternative conditional posterior simulation approaches that impute G as part of the MCMC algorithm.

Most of the emphasis on conditional methods has been on finite truncation approximations of G, using its stick-breaking representation — main example: the blocked Gibbs sampler (Ishwaran and Zarepour, 2000; Ishwaran and James, 2001).

Other work is based on retrospective sampling techniques (Papaspiliopoulos and Roberts, 2008) or slice sampling (Walker, 2007; Kalli et al., 2011).

Page 43: Blocked Gibbs sampler

Builds from a truncation approximation to the mixing distribution G given, for finite N, by

GN(·) = ∑_{ℓ=1}^N pℓ δ_{Zℓ}(·).

- The Zℓ, ℓ = 1, …, N, are i.i.d. G0.
- The weights arise through stick-breaking (with truncation):

p1 = V1, pℓ = Vℓ ∏_{r=1}^{ℓ−1} (1 − Vr), ℓ = 2, …, N − 1, pN = ∏_{r=1}^{N−1} (1 − Vr),

where the Vℓ, ℓ = 1, …, N − 1, are i.i.d. beta(1, α).
- The choice of N follows the guidelines discussed earlier.

The joint prior for p = (p1, …, pN), given α, corresponds to a special case of the generalized Dirichlet distribution (Connor and Mosimann, 1969),

f(p | α) = α^{N−1} pN^{α−1} (1 − p1)^{−1} (1 − (p1 + p2))^{−1} × ⋯ × (1 − ∑_{ℓ=1}^{N−2} pℓ)^{−1}.

Page 44: Blocked Gibbs sampler

Replacing G with GN ≡ (p, Z), where Z = (Z1, …, ZN), in the generic DP mixture model hierarchical formulation, we have:

yi | θi, φ ~ind. k(yi; θi, φ), i = 1, …, n,
θi | p, Z ~i.i.d. GN, i = 1, …, n,
p, Z | α, ψ ~ f(p | α) ∏_{ℓ=1}^N g0(Zℓ | ψ),
φ, α, ψ ~ p(φ) p(α) p(ψ).

If we marginalize over the θi in the first two stages of the hierarchical model, we obtain a finite mixture model for the yi,

f(·; p, Z, φ) = ∑_{ℓ=1}^N pℓ k(·; Zℓ, φ)

(conditionally on (p, Z) and φ), which replaces the countable DP mixture, f(·; G, φ) = ∫ k(·; θ, φ) dG(θ) = ∑_{ℓ=1}^∞ ωℓ k(·; ϑℓ, φ).

Page 45: Blocked Gibbs sampler

Now, having approximated the countable DP mixture with a finite mixture, the mixing parameters θi can be replaced with configuration variables L = (L1, …, Ln) — each Li takes values in {1, …, N} such that Li = ℓ if and only if θi = Zℓ, for i = 1, …, n; ℓ = 1, …, N.

Final version of the hierarchical model:

yi | Z, Li, φ ~ind. k(yi; Z_{Li}, φ), i = 1, …, n,
Li | p ~i.i.d. ∑_{ℓ=1}^N pℓ δℓ(Li), i = 1, …, n,
Zℓ | ψ ~i.i.d. G0(· | ψ), ℓ = 1, …, N,
p | α ~ f(p | α),
φ, α, ψ ~ p(φ) p(α) p(ψ).

Marginalizing over the Li, we obtain the same finite mixture model for the yi: f(·; p, Z, φ) = ∑_{ℓ=1}^N pℓ k(·; Zℓ, φ).

Page 46: Posterior full conditional distributions

1. To update Zℓ for ℓ = 1, …, N:

- Let n* be the number of distinct values {L*_j : j = 1, …, n*} of the vector L.
- Then the posterior full conditional for Zℓ, ℓ = 1, …, N, can be expressed in general as

p(Zℓ | …, data) ∝ g0(Zℓ | ψ) ∏_{j=1}^{n*} ∏_{i: Li = L*_j} k(yi; Z_{L*_j}, φ).

- If ℓ ∉ {L*_j : j = 1, …, n*}, Zℓ is drawn from G0(· | ψ).
- For ℓ = L*_j, j = 1, …, n*,

p(Z_{L*_j} | …, data) ∝ g0(Z_{L*_j} | ψ) ∏_{i: Li = L*_j} k(yi; Z_{L*_j}, φ).

Page 47: Posterior full conditional distributions (cont.)

2. The posterior full conditional for p is

p(p | …, data) ∝ f(p | α) ∏_{ℓ=1}^N pℓ^{Mℓ},

where Mℓ = |{i : Li = ℓ}|, ℓ = 1, …, N.

This results in a generalized Dirichlet distribution, which can be sampled through independent latent beta variables:

- V*ℓ ~ind. beta(1 + Mℓ, α + ∑_{r=ℓ+1}^N Mr) for ℓ = 1, …, N − 1;
- p1 = V*1; pℓ = V*ℓ ∏_{r=1}^{ℓ−1} (1 − V*r) for ℓ = 2, …, N − 1; and pN = 1 − ∑_{ℓ=1}^{N−1} pℓ.

3. Updating the Li, i = 1, …, n:

- Each Li is drawn from the discrete distribution on {1, …, N} with probabilities p̃ℓi ∝ pℓ k(yi; Zℓ, φ) for ℓ = 1, …, N (see the sketch after this list).
- Note that the update for each Li does not depend on the other Li′, i′ ≠ i — this aspect of the Gibbs sampler, along with the block updates for the Zℓ, is a key advantage over Polya urn based marginal MCMC methods.
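A minimal R sketch of updates (2) and (3) for the running location-normal example (function names ours; phi is the kernel precision):

# (2) Draw p via the latent beta variables V*_l
sample.p = function(L, N, alpha){
  M = tabulate(L, nbins = N)                  # M_l = |{i : L_i = l}|
  tail.sums = sum(M) - cumsum(M)              # sum_{r > l} M_r
  Vstar = rbeta(N - 1, 1 + M[-N], alpha + tail.sums[-N])
  c(Vstar, 1) * cumprod(c(1, 1 - Vstar))      # stick-breaking; p_N takes the leftover mass
}

# (3) Draw each L_i independently, with weights computed on the log scale
sample.L = function(y, Z, p, phi){
  sapply(y, function(yi){
    lw = log(p) + dnorm(yi, Z, sqrt(1/phi), log = TRUE)
    sample(seq_along(Z), 1, prob = exp(lw - max(lw)))
  })
}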

Page 48: Posterior full conditional distributions (cont.)

4. The posterior full conditional for φ is

p(φ | …, data) ∝ p(φ) ∏_{i=1}^n k(yi; θi, φ).

5. The posterior full conditional for ψ is

p(ψ | …, data) ∝ p(ψ) ∏_{j=1}^{n*} g0(Z_{L*_j} | ψ).

6. The posterior full conditional for α is proportional to p(α) α^{N−1} pN^α, which, with a Gam(aα, bα) prior for α, results in a Gam(N + aα − 1, bα − log pN) distribution. (For numerical stability, compute log pN = log ∏_{r=1}^{N−1} (1 − V*r) = ∑_{r=1}^{N−1} log(1 − V*r).)

Note that the posterior samples from p(Z, p, L, φ, α, ψ | data) directly yield the posterior for GN, and thus full posterior inference for any functional of the (approximate) DP mixture f(·; GN, φ) ≡ f(·; p, Z, φ).
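The α update in step (6) is then one conjugate draw; a sketch (function name ours), using the log pN identity noted above:

# Update alpha given the latent V*_l draws: Gam(N + a_alpha - 1, b_alpha - log p_N)
sample.alpha.blocked = function(Vstar, N, aalpha, balpha){
  logpN = sum(log(1 - Vstar))   # log p_N computed stably from the V*_l
  rgamma(1, N + aalpha - 1, rate = balpha - logpN)
}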

Page 49: Posterior predictive inference

Posterior predictive density for a new y0, with corresponding configuration variable L0:

p(y0 | data) = ∫ k(y0; Z_{L0}, φ) ( ∑_{ℓ=1}^N pℓ δℓ(L0) ) p(Z, p, L, φ, α, ψ | data) dL0 dZ dL dp dφ dα dψ
 = ∫ ( ∑_{ℓ=1}^N pℓ k(y0; Zℓ, φ) ) p(Z, p, L, φ, α, ψ | data) dZ dL dp dφ dα dψ
 = E(f(y0; p, Z, φ) | data).

Hence, p(y0 | data) can be estimated over a grid in y0 by drawing samples {L0b : b = 1, …, B} for L0, based on the posterior samples for p, and computing the Monte Carlo estimate

B^{−1} ∑_{b=1}^B k(y0; Z_{L0b}, φb),

where B is the posterior sample size.
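Equivalently (and with lower Monte Carlo variance), one can average the full finite mixture instead of drawing L0, using the identity p(y0 | data) = E(f(y0; p, Z, φ) | data) above. A sketch for the running example, assuming saved draws stored as B × N matrices p.save and Z.save and a vector phi.save (names ours):

# Rao-Blackwellized estimate of p(y0 | data) over a grid: average f(y0; p, Z, phi)
pred.dens = function(y0grid, p.save, Z.save, phi.save){
  B = nrow(p.save)
  est = numeric(length(y0grid))
  for(b in 1:B){
    est = est + sapply(y0grid, function(y0)
      sum(p.save[b, ] * dnorm(y0, Z.save[b, ], sqrt(1/phi.save[b]))))
  }
  est/B
}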

Page 50: Variational algorithms - Overview

Variational algorithms have become popular in the machine learning literature over the last 10 years or so, but they are not as well known as MCMC, so we will start with a very quick overview.

Consider a generic model where y denotes the data and θ denotes the unknown parameters. Variational algorithms aim at replacing the intractable posterior distribution p(θ | y) with a tractable approximation qη(θ), whose parameter η is chosen to minimize

KL(q||p) = ∫ ln( qη(θ)/p(θ | y) ) qη(θ) dθ,

the Kullback-Leibler divergence between qη(θ) and p(θ | y).

Variational algorithms are useful because, generally speaking, maximizing is easier (and faster!) than integrating.

Page 51: Variational algorithms - Overview

In principle, we have a lot of freedom in choosing the functional form of qη. However, in practice we are limited by the need to have a tractable approximation and to compute the expectations Eq{log[qη(θ)]} and Eq{log[p(θ | y)]}.

Variational algorithms for the DP typically rely on truncated representations. Both "collapsed" and "blocked" versions of the algorithm have been devised, but we will discuss only the "blocked" version.

As we will see in a second, variational approximations are easy in conditionally conjugate models (the same type of models where Gibbs sampling is straightforward). Extensions for non-conjugate models are not well developed.

Page 52: Variational algorithms - Overview

Two key ingredients for a simple algorithm: the particular form of the variational approximation, and the optimization procedure to be used.

We focus on so-called mean-field variational algorithms, which use a fully factorized q,

qη(θ) = ∏_{k=1}^p q_{k,ηk}(θk),

and are extremely popular because of their tractability. Furthermore, for conditionally conjugate models that belong to the exponential family, it is natural to set q_{k,ηk}(θk) as a member of the conditionally conjugate prior family.

We use a coordinate ascent algorithm to solve the maximization problem, recursively updating each ηk one at a time. However, note that this is not necessarily a convex problem!

Page 53: Variational algorithms - Overview

If all the previous assumptions are satisfied, the variational algorithm looks very much like a Gibbs sampler. More concretely, suppose the full conditional distribution for $\theta_k$ is

$$p(\theta_k \mid \theta_{-k}, y) \propto \exp\left\{ \sum_{l=1}^{L} g_{k,l}(\theta_{-k}, y)\, t_l(\theta_k) - h(\theta_k) \right\},$$

where $\theta_{-k} = (\theta_1, \ldots, \theta_{k-1}, \theta_{k+1}, \ldots, \theta_p)$ and $g_{k,1}(\theta_{-k}, y), \ldots, g_{k,L}(\theta_{-k}, y)$ are the natural parameters of the full conditional (viewed as functions of $\theta_{-k}$ and $y$), and let the variational approximation be

$$q_{k,\eta_k}(\theta_k) \propto \exp\left\{ \sum_{l=1}^{L} \eta_{k,l}\, t_l(\theta_k) - h(\theta_k) \right\}.$$

Then the optimal update is $\eta_{k,l} = E_{q_{-k}}\{ g_{k,l}(\theta_{-k}, y) \}$. By iterating updates of this form we can develop a very fast computational algorithm.
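A hedged sketch of these updates for a toy conjugate model of our own, $y_i \sim \mathrm{N}(\mu, \tau^{-1})$ with $\mu \sim \mathrm{N}(m_0, v_0^{-1})$ and $\tau \sim \mathrm{Gam}(a_0, b_0)$: each step replaces the conditioning quantities in the full conditional's natural parameters by their current variational expectations.

## Mean-field CAVI for y_i ~ N(mu, 1/tau), mu ~ N(m0, 1/v0), tau ~ Gam(a0, b0)
## (our own toy, not from the slides).
set.seed(1)
y <- rnorm(200, mean = 1, sd = 2); n <- length(y)
m0 <- 0; v0 <- 0.01; a0 <- 0.1; b0 <- 0.1
an <- a0 + n/2                               # shape of q(tau); never changes
mn <- mean(y); ln <- 1; bn <- b0 + 1         # initial q(mu) = N(mn, 1/ln), q(tau) rate
for (s in 1:100) {
  ln <- v0 + n*an/bn                         # plug E_q[tau] = an/bn into the precision
  mn <- (v0*m0 + (an/bn)*sum(y))/ln
  bn <- b0 + 0.5*(sum(y^2) - 2*sum(y)*mn + n*(mn^2 + 1/ln))  # uses E_q[mu^2]
}
c(E_mu = mn, E_tau = an/bn)                  # compare with mean(y) and 1/var(y)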


Variational inference for conjugate DP mixtures

In order to understand how variational algorithms work for conjugate DP mixtures, it is better to start with a simple example and then explore the general case.

In particular, consider our running example of a (truncated) location mixture of normals, where

$$y_i \mid L_i, \mathbf{Z}, \phi \sim \mathrm{N}(y_i \mid Z_{L_i}, \phi^{-1}), \quad i = 1, \ldots, n,$$
$$L_i \mid \mathbf{p} \sim \sum_{\ell=1}^{N} p_\ell \delta_\ell, \qquad Z_\ell \sim \mathrm{N}(\psi, \nu^{-1}), \quad \ell = 1, \ldots, N,$$
$$\mathbf{p} \mid \alpha \sim f(\mathbf{p} \mid \alpha), \qquad \phi \sim \mathrm{Gam}(a_\phi, b_\phi),$$

with $f(\mathbf{p} \mid \alpha)$ the distribution induced on the weights by the truncated stick-breaking construction.
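For concreteness, data can be simulated from this truncated model as follows (the truncation level and hyperparameter values are our own illustrative choices):

## Hypothetical simulation from the truncated model (values are ours).
set.seed(42)
NN <- 20; nn <- 200                          # truncation level N and sample size n
alph <- 1; psi <- 0; nu <- 0.01; aphi <- 2; bphi <- 2
V <- c(rbeta(NN - 1, 1, alph), 1)            # stick-breaking fractions, V_N = 1
p <- V * cumprod(c(1, 1 - V[-NN]))           # weights p_l = V_l prod_{j<l}(1 - V_j)
Z <- rnorm(NN, psi, sqrt(1/nu))              # component locations
phi <- rgamma(1, aphi, bphi)                 # common precision
L <- sample(1:NN, nn, replace = TRUE, prob = p)
y <- rnorm(nn, Z[L], sqrt(1/phi))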


Our mean-field variational approximation in this case takes the form

$$q_\xi(\phi) \prod_{\ell=1}^{N-1} q_{\gamma_\ell}(V_\ell) \prod_{\ell=1}^{N} q_{\eta_\ell}(Z_\ell) \prod_{i=1}^{n} q_{\varpi_i}(L_i),$$

where $q_\xi(\phi) = \mathrm{Gam}(\phi \mid \xi_1, \xi_2)$, $q_{\gamma_\ell}(V_\ell) = \mathrm{beta}(V_\ell \mid \gamma_{\ell,1}, \gamma_{\ell,2})$, $q_{\eta_\ell}(Z_\ell) = \mathrm{N}(Z_\ell \mid \eta_{\ell,1}, \eta_{\ell,2})$, and $q_{\varpi_i}(L_i) = \mathrm{Categorical}(\varpi_i)$.

A couple of useful results that you might want to keep in mind when deriving the algorithm:

If $X \sim \mathrm{beta}(a, b)$, then $E\{\log X\} = \Psi(a) - \Psi(a + b)$, where $\Psi$ is the digamma function (the derivative of the logarithm of the gamma function).

If $X \sim \mathrm{Gam}(a, b)$, parameterized so that $E\{X\} = a/b$, then $E\{\log X\} = \Psi(a) - \log b$.
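Both identities are easy to check by Monte Carlo in R:

## Monte Carlo sanity check of the two identities (our own check).
a <- 2.5; b <- 4
mean(log(rbeta(1e6, a, b))); digamma(a) - digamma(a + b)
mean(log(rgamma(1e6, a, b))); digamma(a) - log(b)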


Update $q_{\gamma_\ell}(V_\ell) = \mathrm{beta}(V_\ell \mid \gamma_{\ell,1}, \gamma_{\ell,2})$:

$$\gamma_{\ell,1}^{(s)} = 1 + \sum_{i=1}^{n} \varpi_{i,\ell}^{(s-1)}, \qquad \gamma_{\ell,2}^{(s)} = \alpha + \sum_{i=1}^{n} \sum_{j=\ell+1}^{N} \varpi_{i,j}^{(s-1)}.$$

Update $q_{\eta_\ell}(Z_\ell) = \mathrm{N}(Z_\ell \mid \eta_{\ell,1}, \eta_{\ell,2})$:

$$\eta_{\ell,1}^{(s)} = \left(\nu + \frac{\xi_1^{(s-1)}}{\xi_2^{(s-1)}} \sum_{i=1}^{n} \varpi_{i,\ell}^{(s-1)}\right)^{-1} \left(\nu\psi + \frac{\xi_1^{(s-1)}}{\xi_2^{(s-1)}} \sum_{i=1}^{n} \varpi_{i,\ell}^{(s-1)} y_i\right),$$

$$\eta_{\ell,2}^{(s)} = \left(\nu + \frac{\xi_1^{(s-1)}}{\xi_2^{(s-1)}} \sum_{i=1}^{n} \varpi_{i,\ell}^{(s-1)}\right)^{-1}.$$


Update $q_\xi(\phi) = \mathrm{Gam}(\phi \mid \xi_1, \xi_2)$:

$$\xi_1^{(s)} = a_\phi + \frac{n}{2}, \qquad \xi_2^{(s)} = b_\phi + \frac{1}{2}\sum_{i=1}^{n} \sum_{\ell=1}^{N} \varpi_{i,\ell}^{(s)} \left( y_i^2 - 2 y_i \eta_{\ell,1}^{(s)} + \eta_{\ell,2}^{(s)} + \{\eta_{\ell,1}^{(s)}\}^2 \right).$$

Update $q_{\varpi_i}(L_i) = \mathrm{Categorical}(\varpi_i)$, where $\varpi_{i,\ell}^{(s)} \propto \exp\{ M_{i,\ell}^{(s)} \}$ and

$$M_{i,\ell}^{(s)} = \left\{ \Psi(\gamma_{\ell,1}^{(s)}) - \Psi(\gamma_{\ell,1}^{(s)} + \gamma_{\ell,2}^{(s)}) \right\} + \sum_{j=1}^{\ell-1} \left\{ \Psi(\gamma_{j,2}^{(s)}) - \Psi(\gamma_{j,1}^{(s)} + \gamma_{j,2}^{(s)}) \right\} + \frac{1}{2}\left\{ \Psi(\xi_1^{(s)}) - \log \xi_2^{(s)} \right\} - \frac{1}{2}\, \frac{\xi_1^{(s)}}{\xi_2^{(s)}} \left( y_i^2 - 2 y_i \eta_{\ell,1}^{(s)} + \eta_{\ell,2}^{(s)} + \{\eta_{\ell,1}^{(s)}\}^2 \right).$$

(Recall that $\Psi(\cdot)$ denotes the digamma function.)


R implementation, using the relative $L_\infty$ norm of $\varpi$ as the stopping rule (a minimal setup sketch follows the code):

while(s < maxiter){
  ## Update q(V_l) = beta(gamma1[l], gamma2[l])
  for(l in 1:(NN-1)){
    gamma1[l] <- 1 + sum(varpi[, l])
    gamma2[l] <- alph + sum(varpi[, (l+1):NN])
  }
  ## Update q(Z_l) = N(eta1[l], eta2[l]), plugging in E_q[phi] = xi1/xi2
  for(l in 1:NN){
    eta1[l] <- (nu*psi + xi1*sum(varpi[, l]*y)/xi2)/(nu + xi1*sum(varpi[, l])/xi2)
    eta2[l] <- 1/(nu + xi1*sum(varpi[, l])/xi2)
  }
  ## Update q(phi) = Gam(xi1, xi2)
  xi1 <- aa + nn/2
  xi2 <- bb
  for(i in 1:nn){
    xi2 <- xi2 + sum(varpi[i, ]*(y[i]*y[i] - 2*y[i]*eta1 + eta2 + eta1*eta1))/2
  }
  ## Update q(L_i) = Categorical(varpi[i, ])
  for(i in 1:nn){
    q <- rep(0, NN)
    for(l in 1:NN){
      MM <- 0.5*(digamma(xi1) - log(xi2)) -
            0.5*xi1*(y[i]*y[i] - 2*y[i]*eta1[l] + eta2[l] + eta1[l]*eta1[l])/xi2
      if(l < NN){ MM <- MM + digamma(gamma1[l]) - digamma(gamma1[l] + gamma2[l]) }
      if(l > 1){ MM <- MM + sum(digamma(gamma2[1:(l-1)]) -
                                digamma(gamma1[1:(l-1)] + gamma2[1:(l-1)])) }
      q[l] <- MM
    }
    q <- exp(q - max(q))/sum(exp(q - max(q)))   # normalize on the log scale
    varpi[i, ] <- q
  }
  ## Stop when the relative change in varpi is small
  if(max(abs(varpi - varpi.old)/varpi.old) < tol){ break }
  varpi.old <- varpi
  s <- s + 1
}
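A minimal setup for the loop above might look as follows; the specific values (truncation level, vague hyperparameters, uniform initial responsibilities) are our own illustrative choices, not the slides'. Note that symmetric initializations of varpi can favor the one-component local mode discussed later; initializing from, say, a k-means clustering is a common alternative.

## Illustrative setup for the variational loop (our own choices).
nn <- length(y); NN <- 20                   # sample size and truncation level
alph <- 1; psi <- mean(y); nu <- 0.01       # alpha and base-measure hyperparameters
aa <- 1; bb <- 1                            # a_phi and b_phi
xi1 <- aa; xi2 <- bb                        # initial q(phi)
gamma1 <- gamma2 <- numeric(NN - 1)         # pre-allocate the update targets
eta1 <- eta2 <- numeric(NN)
varpi <- matrix(1/NN, nn, NN)               # uniform initial responsibilities
varpi.old <- matrix(2/NN, nn, NN)           # strictly positive starting comparison
s <- 1; maxiter <- 500; tol <- 1e-6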


Truncated Gibbs vs variational algorithms for the DP

Compare the form of these updates with the structure of the blocked (truncated) Gibbs sampler for the same model:

Sample $V_\ell^{(s)} \sim \mathrm{beta}\left(\gamma_{\ell,1}^{(s)}, \gamma_{\ell,2}^{(s)}\right)$, where

$$\gamma_{\ell,1}^{(s)} = 1 + \sum_{i=1}^{n} I\left(L_i^{(s-1)} = \ell\right), \qquad \gamma_{\ell,2}^{(s)} = \alpha + \sum_{i=1}^{n} \sum_{j=\ell+1}^{N} I\left(L_i^{(s-1)} = j\right).$$

Sample $Z_\ell^{(s)} \sim \mathrm{N}\left(\eta_{\ell,1}^{(s)}, \eta_{\ell,2}^{(s)}\right)$, where

$$\eta_{\ell,1}^{(s)} = \left(\nu + \phi^{(s-1)} \sum_{i=1}^{n} I\left(L_i^{(s-1)} = \ell\right)\right)^{-1} \left\{\nu\psi + \phi^{(s-1)} \sum_{i=1}^{n} y_i\, I\left(L_i^{(s-1)} = \ell\right)\right\},$$

$$\eta_{\ell,2}^{(s)} = \left(\nu + \phi^{(s-1)} \sum_{i=1}^{n} I\left(L_i^{(s-1)} = \ell\right)\right)^{-1}.$$


Sample $\phi^{(s)} \sim \mathrm{Gam}\left(\xi_1^{(s)}, \xi_2^{(s)}\right)$, where

$$\xi_1^{(s)} = a_\phi + \frac{n}{2}, \qquad \xi_2^{(s)} = b_\phi + \frac{1}{2}\sum_{i=1}^{n} \sum_{\ell=1}^{N} I\left(L_i^{(s-1)} = \ell\right) \left( y_i^2 - 2 y_i Z_\ell^{(s)} + \{Z_\ell^{(s)}\}^2 \right).$$

Sample $L_i^{(s)} \sim \sum_{\ell=1}^{N} \varpi_{i,\ell}^{(s)} \delta_\ell$, where $\varpi_{i,\ell}^{(s)} \propto \exp\{ M_{i,\ell}^{(s)} \}$ and

$$M_{i,\ell}^{(s)} = \log V_\ell^{(s)} + \sum_{j=1}^{\ell-1} \log\left( 1 - V_j^{(s)} \right) + \frac{1}{2} \log \phi^{(s)} - \frac{\phi^{(s)}}{2} \left( y_i^2 - 2 y_i Z_\ell^{(s)} + \{Z_\ell^{(s)}\}^2 \right).$$
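For comparison, here is a hedged sketch of one sweep of this blocked Gibbs sampler in R, mirroring the variational loop (our own code; it reuses y, NN, nn, alph, nu, psi, aphi, bphi, and current values of L and phi from the simulation sketch above):

## One sweep of the blocked Gibbs sampler (our own illustration).
cnt <- tabulate(L, nbins = NN)                   # n_l = #{i : L_i = l}
ntail <- rev(cumsum(rev(cnt)))                   # ntail[l] = sum_{j >= l} n_j
V <- c(rbeta(NN - 1, 1 + cnt[-NN], alph + ntail[-1]), 1)
logp <- log(V) + c(0, cumsum(log(1 - V[-NN])))   # log stick-breaking weights
prec <- nu + phi*cnt                             # full-conditional precisions for Z_l
S <- sapply(1:NN, function(l) sum(y[L == l]))    # component-wise sums of y
Z <- rnorm(NN, (nu*psi + phi*S)/prec, sqrt(1/prec))
phi <- rgamma(1, aphi + nn/2, bphi + 0.5*sum((y - Z[L])^2))
logM <- sweep(outer(y, Z, function(u, z) -0.5*phi*(u - z)^2), 2, logp, "+") +
        0.5*log(phi)                             # log of unnormalized varpi_{i,l}
L <- apply(logM, 1, function(m) sample(1:NN, 1, prob = exp(m - max(m))))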


Variational inference for conjugate DP mixtures

The structure of the variational distribution for other conjugate DP mixtures is very similar. The variational approximation is typically fully factorized,

$$q_\xi(\phi)\, q_\zeta(\psi) \prod_{\ell=1}^{N-1} q_{\gamma_\ell}(V_\ell) \prod_{\ell=1}^{N} \prod_{k=1}^{d} q_{\eta_{\ell,k}}(Z_{\ell,k}) \prod_{i=1}^{n} q_{\varpi_i}(L_i),$$

although in some cases (e.g., mixtures of multivariate normals) we can use a joint variational approximation for the whole block $Z_\ell$.

The update for each $q_{\gamma_\ell}(V_\ell)$ is identical to the one we just derived, no matter what the form of the kernel or the prior.

Although the details of the update for $q_{\varpi_i}(L_i)$ will differ depending on the kernel, the structure will be similar to the one before: $\varpi_{i,\ell}^{(s)} \propto \exp\{ M_{i,\ell}^{(s)} \}$, where

$$M_{i,\ell}^{(s)} = \left\{ \Psi(\gamma_{\ell,1}^{(s)}) - \Psi(\gamma_{\ell,1}^{(s)} + \gamma_{\ell,2}^{(s)}) \right\} + \sum_{j=1}^{\ell-1} \left\{ \Psi(\gamma_{j,2}^{(s)}) - \Psi(\gamma_{j,1}^{(s)} + \gamma_{j,2}^{(s)}) \right\} + E_q\left[ \log p(y_i \mid \mathbf{Z}, L_i = \ell, \phi) \right].$$


Extending the algorithm to the case where $\alpha$ is unknown is relatively straightforward if $\alpha \sim \mathrm{Gam}(a, b)$ a priori.

As long as we stay within conjugate exponential families, updating $\psi$ is usually straightforward too.

Non-exponential-family models and non-conjugate models are not straightforward, and additional (often ad hoc) approximations are used. This is an active area of research.

A collapsed version (but still based on a finite approximation) is also possible.


The predictive distribution for a new observation $y_{n+1}$ under the variational approximation can be written as

$$p(y_{n+1} \mid y) \approx \sum_{\ell=1}^{N} E_q\{p_\ell\}\, E_q\{ k(y_{n+1} \mid Z_\ell, \phi) \},$$

where

$$E_q\{p_\ell\} = \frac{\gamma_{\ell,1}}{\gamma_{\ell,1} + \gamma_{\ell,2}} \prod_{j=1}^{\ell-1} \frac{\gamma_{j,2}}{\gamma_{j,1} + \gamma_{j,2}}.$$

Note that for our normal kernel with unknown precision $\phi$, $E_q\{k(y_{n+1} \mid Z_\ell, \phi)\}$ cannot be computed in closed form (in spite of the independence assumption, only one of the two integrals, the one over $Z_\ell$, can be computed in closed form). In this case, we approximate the integral with the density of a normal distribution with mean $\eta_{\ell,1}$ and scale $\sqrt{\eta_{\ell,2} + \xi_2/\xi_1}$, i.e., plugging in the variational mean $E_q\{\phi\} = \xi_1/\xi_2$ for $\phi$.
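As a sketch (our own code, reusing the converged gamma1, gamma2, eta1, eta2, xi1, xi2 from the R loop above, with $E_q\{p_N\}$ absorbing the leftover stick under the truncation $V_N = 1$), the approximate predictive density can be evaluated on a grid:

## Approximate predictive density on a grid (our own illustration).
Ep <- c(gamma1/(gamma1 + gamma2), 1) *
      cumprod(c(1, gamma2/(gamma1 + gamma2)))    # E_q{p_l} for l = 1, ..., NN
grid <- seq(min(y) - 1, max(y) + 1, length.out = 200)
pred <- sapply(grid, function(y0) sum(Ep*dnorm(y0, eta1, sqrt(eta2 + xi2/xi1))))
plot(grid, pred, type = "l", xlab = "y", ylab = "Density")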

Inferences for other functionals can be difficult. One can resort to Monte Carlo methods (using samples from $q_\eta(\theta)$ rather than $p(\theta \mid y)$), but that removes some of the appeal of variational approximations ...


Gibbs vs. variational algorithms for location mixtures of normals

Density estimates
[Figure: two panels, Density vs. Speed, showing the estimated density under the blocked Gibbs sampler (left) and the variational approximation (right).]


Pairwise clustering probabilities
[Figure: two heatmaps over pairs of observations (1–81), with probabilities on a 0–1 scale, under the blocked Gibbs sampler (left) and the variational approximation (right).]


Besides the obvious differences in the graphs, it is worth noting that the variational algorithm often converges to local modes. In this example, a random initialization almost always leads to a mixture where essentially only one component is present.

The minimization of $\mathrm{KL}(q_\eta \,\|\, p)$ can also be justified as maximizing a lower bound (the evidence lower bound) on the log marginal likelihood:

$$\log p(y) \geq E_q\{\log p(\theta, y)\} - E_q\{\log q_\eta(\theta)\}.$$

The gap in the bound is precisely the divergence between $q_\eta$ and the true posterior, so the bound can be used to select among multiple solutions.
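A quick numerical illustration of the bound (a toy conjugate model of our own, where $\log p(y)$ is available exactly): $y_i \sim \mathrm{N}(\theta, 1)$ with $\theta \sim \mathrm{N}(0, 1)$, so that $p(\theta \mid y) = \mathrm{N}(\sum_i y_i/(n+1), 1/(n+1))$.

## Monte Carlo ELBO vs. exact log p(y) (our own toy check).
set.seed(3)
y0 <- rnorm(50, 1, 1); n0 <- length(y0)
logjoint <- function(th) sum(dnorm(y0, th, 1, log = TRUE)) +
                         dnorm(th, 0, 1, log = TRUE)
## log p(y) = log p(y, theta) - log p(theta | y), evaluated at theta = 0
lpy <- logjoint(0) - dnorm(0, sum(y0)/(n0 + 1), sqrt(1/(n0 + 1)), log = TRUE)
m <- 0.5; s <- 0.5                               # a deliberately crude q = N(m, s^2)
th <- rnorm(1e4, m, s)
elbo <- mean(sapply(th, logjoint) - dnorm(th, m, s, log = TRUE))
c(log_py = lpy, ELBO = elbo, gap = lpy - elbo)   # gap = KL(q || p) >= 0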

Generally speaking, variational algorithms tend to underestimate the uncertainty in the posterior distribution and to get the correlations among parameters wrong.
