Variational Inference for Dirichlet Process Mixture


Transcript of Variational Inference for Dirichlet Process Mixture

Page 1: Variational Inference for Dirichlet Process Mixture

Variational Inference for Dirichlet Process Mixture

Daniel Klein and Soravit Beer Changpinyo
October 11, 2011

Applied Bayesian Nonparametrics: Special Topics in Machine Learning

Brown University CSCI 2950-P, Fall 2011

Page 2: Variational Inference for Dirichlet Process Mixture

Motivation

• WANTED! A systematic approach to sample from likelihoods and posterior distributions of DP mixture models
• Markov chain Monte Carlo (MCMC)
• Problems with MCMC:
  o Can be slow to converge
  o Convergence can be difficult to diagnose
• One alternative: variational methods

Page 3: Variational Inference for Dirichlet Process Mixture

Variational Methods: Big Picture

• An adjustable lower bound on the log likelihood, indexed by "variational parameters"
• Optimization problem: find the variational parameters that give the tightest lower bound

Page 4: Variational Inference for Dirichlet Process Mixture

Outline

• Brief review: Dirichlet process mixture models
• Variational inference in exponential families
• Variational inference for DP mixtures
• Gibbs sampling (MCMC)
• Experiments

Page 5: Variational Inference for Dirichlet Process Mixture

DP Mixture Models

From E.B. Sudderth’s slides

Page 6: Variational Inference for Dirichlet Process Mixture

DP Mixture Models

Stick lengths = weights assigned to mixture components

Atoms representing mixture components (cluster parameters)
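To make the stick-breaking construction concrete, here is a small Python sketch (not from the slides) that samples from a truncated stick-breaking DP mixture; the unit-variance Gaussian emissions, the N(0, 5²) base measure, and the truncation level are illustrative assumptions.

```python
import numpy as np

def sample_dp_mixture(n, alpha=1.0, trunc=100, seed=0):
    """Draw n points from a truncated stick-breaking DP mixture of
    unit-variance Gaussians (illustrative model choices)."""
    rng = np.random.default_rng(seed)
    # Stick proportions V_t ~ Beta(1, alpha); the induced mixture weights
    # are pi_t = V_t * prod_{j<t} (1 - V_j).
    v = rng.beta(1.0, alpha, size=trunc)
    weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights /= weights.sum()  # renormalize mass lost to truncation
    # Atoms (cluster parameters) drawn i.i.d. from the base measure G0 = N(0, 5^2).
    atoms = rng.normal(0.0, 5.0, size=trunc)
    # Assign observations to components, then emit from the chosen atom.
    z = rng.choice(trunc, size=n, p=weights)
    return atoms[z] + rng.normal(size=n), z
```

Larger α spreads the stick lengths over more components, so samples tend to use more clusters; small α concentrates mass on the first few sticks.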

Page 7: Variational Inference for Dirichlet Process Mixture

DP Mixture Models: Notation

Latent variables, hyperparameters, observations

Page 8: Variational Inference for Dirichlet Process Mixture

DP Mixture Models: Notation

Latent variables: W = {V, η*, Z}

Hyperparameters: θ = {α, λ}

Observations: X

Page 9: Variational Inference for Dirichlet Process Mixture

Variational Inference

Usually intractable

So, we are going to approximate it by finding a lower bound on P(X | θ)

Page 10: Variational Inference for Dirichlet Process Mixture

Variational Inference

Jensen's inequality, applied with a variational distribution, gives the bound written out below.
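The standard bound the slide refers to: introducing a variational distribution $q_\nu(w)$ and applying Jensen's inequality,

```latex
\log p(x \mid \theta)
  = \log \int p(w, x \mid \theta)\, dw
  = \log \mathbb{E}_{q_\nu}\!\left[\frac{p(W, x \mid \theta)}{q_\nu(W)}\right]
  \ge \mathbb{E}_{q_\nu}\!\left[\log p(W, x \mid \theta)\right]
    - \mathbb{E}_{q_\nu}\!\left[\log q_\nu(W)\right]
```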

Page 11: Variational Inference for Dirichlet Process Mixture

Variational Inference

Add a constraint on q by introducing ν, the "free variational parameters": q := q_ν

Page 12: Variational Inference for Dirichlet Process Mixture

Variational Inference

Page 13: Variational Inference for Dirichlet Process Mixture

Variational Inference

How to choose the variational distribution q_ν(w) such that the optimization of the bound is computationally tractable?

Typically, we break some dependencies between latent variables.

Mean field variational approximations: assume "fully factorized" variational distributions

$q_\nu(w) = \prod_{m=1}^{M} q_{\nu_m}(w_m)$

Page 14: Variational Inference for Dirichlet Process Mixture

Mean Field Variational Inference

Assume fully factorized variational distributions

Page 15: Variational Inference for Dirichlet Process Mixture

Mean Field Variational Inference

in Exponential Families

Further assume that p(w_i | w_{-i}, x, θ) is a member of the exponential family.

Further assume that each variational factor q_{ν_i}(w_i) is a member of the exponential family.

Page 16: Variational Inference for Dirichlet Process Mixture

Mean Field Variational Inference

in Exponential Families

Further assume that p(w_i | w_{-i}, x, θ) is a member of the exponential family.

Further assume that each variational factor q_{ν_i}(w_i) is a member of the exponential family.

Page 17: Variational Inference for Dirichlet Process Mixture

Mean Field Variational Inference

in Exponential Families: Coordinate Ascent

Maximize the bound with respect to one ν_i, holding the others fixed.

Leads to an EM-like algorithm: iteratively update each ν_i in turn (the closed-form update is given below).

This algorithm will find a local maximum of the variational lower bound.
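When each exclude-one conditional is in the exponential family, say $p(w_i \mid w_{-i}, x, \theta) = h(w_i)\exp\{g_i(w_{-i}, x, \theta)^\top w_i - a(g_i(w_{-i}, x, \theta))\}$, the coordinate update has a closed form: the optimal factor lies in the same family, with natural parameter

```latex
\nu_i = \mathbb{E}_{q}\!\left[g_i(W_{-i}, x, \theta)\right]
```

and coordinate ascent cycles through $i = 1, \dots, M$ until the bound converges.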

Page 18: Variational Inference for Dirichlet Process Mixture

Recap: Mean Field Variational Inference

in Exponential Families

p(w_i | w_{-i}, x, θ) is a member of the exponential family

+ fully factorized variational distributions

+ some calculus

→ coordinate-ascent updates that converge to a local maximum of the variational lower bound

Page 19: Variational Inference for Dirichlet Process Mixture


Update Equation and Other Inference Methods

• Like Gibbs sampling: iteratively pick a component to update using the exclude-one conditional distribution
  o Gibbs walks on states that approach a sample from the true posterior
  o VDP walks on distributions that approach a locally best approximation to the true posterior
• Like EM: fit a lower bound to the true posterior
  o EM maximizes, VDP marginalizes
  o May find local maxima

Figure from Bishop (2006)

Page 20: Variational Inference for Dirichlet Process Mixture


Aside: Derivation of Update Equation

• Nothing deep involved...
  o Expansion of the variational lower bound using the chain rule for expectations
  o Set the derivative equal to zero and solve
  o Take advantage of the exponential form of the exclude-one conditional distribution
  o Everything cancels... except the update equation (a sketch follows)
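A sketch of that cancellation: by the chain rule for expectations, the terms of the bound that depend on $q_{\nu_i}$ are

```latex
\mathcal{L}_i
  = \mathbb{E}_{q}\!\left[\log p(W_i \mid W_{-i}, x, \theta)\right]
    - \mathbb{E}_{q_{\nu_i}}\!\left[\log q_{\nu_i}(W_i)\right] + \text{const}
```

and setting $\partial \mathcal{L}_i / \partial \nu_i = 0$, using the exponential form of both distributions, leaves exactly the update $\nu_i = \mathbb{E}_q[g_i(W_{-i}, x, \theta)]$ from Page 17.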

Page 21: Variational Inference for Dirichlet Process Mixture

Aside: Which Kullback-Leibler Divergence?

[Figure panels: minimizing KL(q||p) vs. KL(p||q)]

To minimize the reverse KL divergence KL(p||q) (when q factorizes), just match the marginals.

Minimizing the reverse KL is the approach taken in expectation propagation.

Figures from Bishop (2006)

Page 22: Variational Inference for Dirichlet Process Mixture


Aside: Which Kullback-Leibler Divergence?

• Minimizing the KL divergence KL(q||p) is "zero-forcing"
• Minimizing the reverse KL divergence KL(p||q) is "zero-avoiding"

[Figure panels: KL(q||p) and KL(p||q)] Figures from Bishop (2006)

Page 23: Variational Inference for Dirichlet Process Mixture


Applying Mean-Field Variational Inference to DP Mixtures

• "Mean field variational inference in exponential families"
  o But we're in a mixture model, which can't be an exponential family!
• It is enough that the exclude-one conditional distributions are in the exponential family. Examples:
  o Hidden Markov models
  o Mixture models
  o State space models
  o Hierarchical Bayesian models with (mixtures of) conjugate priors

Page 24: Variational Inference for Dirichlet Process Mixture


Variational Lower Bound for DP Mixtures

• Plug in the DP mixture posterior distribution
  o Taking the log so the expectations factor...
  o Shouldn't the emission term depend on η*?
• The last term has implications for the choice of variational distribution

Page 25: Variational Inference for Dirichlet Process Mixture


Picking the Variational Distribution

• Obviously, we want to break dependencies
• Must the factors be exponential families?
  o In some cases, the optimum must be!
• Proof using calculus of variations
  o Easier to compute integrals for the lower bound
  o Guarantee of optimal parameters
• Mapping between canonical and moment parameters
• Beta, exponential family, and multinomial distributions, respectively (written out below)
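Written out, in the notation of the Blei and Jordan paper this deck presents (truncation level T, with q(V_T = 1) = 1):

```latex
q_\nu(v, \eta^*, z)
  = \prod_{t=1}^{T-1} \mathrm{Beta}(v_t \mid \gamma_{t,1}, \gamma_{t,2})
    \prod_{t=1}^{T} q_{\tau_t}(\eta^*_t)
    \prod_{n=1}^{N} \mathrm{Mult}(z_n \mid \phi_n)
```

where each $q_{\tau_t}$ is in the same exponential family as the conjugate base measure.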

Page 26: Variational Inference for Dirichlet Process Mixture


Coordinate Ascent

• Analogy to EM: we might get stuck in local maxima (a minimal implementation sketch follows)
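To make the coordinate-ascent loop concrete, here is a minimal, self-contained Python sketch for one special case: a truncated DP mixture of unit-variance 1-D Gaussians with a N(0, prior_var) prior on the component means. The model choices, truncation level T, and fixed iteration count are simplifying assumptions, not the paper's general algorithm.

```python
import numpy as np
from scipy.special import digamma

def cavi_dp_mixture(x, T=20, alpha=1.0, prior_var=100.0, n_iter=100, seed=0):
    """Coordinate-ascent variational inference for a truncated DP mixture of
    unit-variance 1-D Gaussians: a sketch, not the paper's general recipe."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    N = x.shape[0]
    # Variational factors: Beta(gamma1, gamma2) on sticks V_1..V_{T-1},
    # N(m, s2) on component means, Categorical(phi_n) on assignments z_n.
    gamma1, gamma2 = np.ones(T - 1), np.full(T - 1, alpha)
    m, s2 = rng.normal(0.0, 1.0, size=T), np.ones(T)
    phi = np.full((N, T), 1.0 / T)
    for _ in range(n_iter):
        # Expected log-sticks; the truncation fixes V_T = 1.
        d = digamma(gamma1 + gamma2)
        Elogv = np.concatenate((digamma(gamma1) - d, [0.0]))
        Elog1mv = digamma(gamma2) - d
        Elogpi = Elogv + np.concatenate(([0.0], np.cumsum(Elog1mv)))
        # Update assignments: phi_{n,t} ∝ exp(E[log pi_t] + E[log N(x_n | mu_t, 1)]),
        # dropping terms constant in t.
        logphi = Elogpi + np.outer(x, m) - 0.5 * (m ** 2 + s2)
        logphi -= logphi.max(axis=1, keepdims=True)
        phi = np.exp(logphi)
        phi /= phi.sum(axis=1, keepdims=True)
        # Update sticks: gamma_{t,1} = 1 + sum_n phi_{n,t};
        # gamma_{t,2} = alpha + sum_n sum_{j>t} phi_{n,j}.
        Nt = phi.sum(axis=0)
        gamma1 = 1.0 + Nt[:-1]
        gamma2 = alpha + np.cumsum(Nt[::-1])[::-1][1:]
        # Update component means (conjugate normal-normal update).
        prec = 1.0 / prior_var + Nt
        m = (phi * x[:, None]).sum(axis=0) / prec
        s2 = 1.0 / prec
    return phi, m, gamma1, gamma2
```

In practice one would monitor the variational lower bound rather than run a fixed number of iterations, and restart from several initializations since, as the slide notes, coordinate ascent can get stuck in local maxima.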

Page 27: Variational Inference for Dirichlet Process Mixture


Coordinate Ascent: Derivation

• Relies on clever use of indicator functions and their properties
• All the terms in the truncation have closed-form expressions (see below)
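The flavor of those closed-form updates, in the same notation as above (γ parameterizes the Beta stick factors, φ the assignment probabilities):

```latex
\gamma_{t,1} = 1 + \sum_{n=1}^{N} \phi_{n,t}, \qquad
\gamma_{t,2} = \alpha + \sum_{n=1}^{N} \sum_{j=t+1}^{T} \phi_{n,j}, \qquad
\phi_{n,t} \propto \exp\!\Big(\mathbb{E}_q[\log V_t]
  + \sum_{j<t} \mathbb{E}_q[\log(1 - V_j)]
  + \mathbb{E}_q[\log p(x_n \mid \eta^*_t)]\Big)
```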

Page 28: Variational Inference for Dirichlet Process Mixture


Predictive Distribution

• Under the variational approximation, the distribution of atoms and the (truncated) distribution of stick lengths decouple
• Weighted sum of predictive distributions (in symbols below)
• Suggestive of a Monte Carlo approximation
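In symbols, the decoupling gives the weighted sum mentioned above:

```latex
p(x_{N+1} \mid x, \alpha, \lambda)
  \approx \sum_{t=1}^{T} \mathbb{E}_q\!\left[\pi_t(V)\right]
    \, \mathbb{E}_q\!\left[p(x_{N+1} \mid \eta^*_t)\right],
\qquad \pi_t(V) = V_t \prod_{j<t} (1 - V_j)
```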

Page 29: Variational Inference for Dirichlet Process Mixture


Extensions

• Prior as a mixture of conjugate distributions
• Placing a prior on the scaling parameter α
  o Continue the complete factorization...
  o Natural to place a Gamma prior on α
  o Update equation no more difficult than the others (sketched below)
  o No modification needed to the predictive distribution!
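For reference, with a Gamma(s_1, s_2) prior on α and a Gamma(w_1, w_2) variational factor q(α), the update should take the form below; this is a reconstruction consistent with the stick-breaking terms, worth checking against the paper:

```latex
w_1 = s_1 + T - 1, \qquad
w_2 = s_2 - \sum_{t=1}^{T-1} \mathbb{E}_q\!\left[\log(1 - V_t)\right]
```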

Page 30: Variational Inference for Dirichlet Process Mixture


Empirical Comparison: The Competition

• Collapsed Gibbs sampler (MacEachern 1994)
  o "CDP"
  o Predictive distribution as an average of predictive distributions from MC samples
  o Best suited for conjugate priors
• Blocked Gibbs sampler (Ishwaran and James 2001)
  o "TDP"
  o Recall: the posterior distribution gets truncated
  o Surface similarities to VDP in the updates for Z, V, η*
  o Predictive distribution integrates out everything but Z
• Surprise: [figure comparing TDP and CDP autocorrelation on the size of the largest component]

Page 31: Variational Inference for Dirichlet Process Mixture

Empirical Comparison

Page 32: Variational Inference for Dirichlet Process Mixture

Empirical Comparison

Page 33: Variational Inference for Dirichlet Process Mixture

Empirical Comparison

Page 34: Variational Inference for Dirichlet Process Mixture

Empirical Comparison: Summary

Pros:
• Deterministic
• Fast
• Easy to assess convergence

Cons:
• Sensitive to initialization (converges to a local maximum)
• Approximate

Page 35: Variational Inference for Dirichlet Process Mixture

Image Analysis

Page 36: Variational Inference for Dirichlet Process Mixture

MNIST: Hand-written digits

Kurihara, Welling, and Vlassis 2006

Page 37: Variational Inference for Dirichlet Process Mixture

MNIST: Hand-written digits

Kurihara, Welling, and Teh 2007: "Variational approximations are much more efficient computationally than Gibbs sampling, with almost no loss in accuracy"

Page 38: Variational Inference for Dirichlet Process Mixture

Questions?