Another Walkthrough of Variational Bayes

Bevan Jones

Machine Learning Reading Group

Macquarie University

Variational Bayes?

• Bayes ← Bayes’ Theorem

• But the integral is intractable!
  – Sampling
    • Gibbs, Metropolis-Hastings, Slice Sampling, Particle Filters…
  – Variational Bayes
    • Change the equations, replacing intractable integrals
    • This involves searching for a good approximation

• Variational ← Calculus of Variations
  – A way of searching through a space of functions for the “best” one

2

Useful Concepts

• Probability/Information Theory
  – Bayes’ Theorem
  – Expectations
  – Jensen’s Inequality
  – KL Divergence

• Calculus

– Functionals & Functional Derivatives
– Lagrange Multipliers

• Logarithms

3

Outline

• The true likelihood

• Approximating the posterior

• The lower bound and a definition for “best”

• Finding the optimal approximation
  – Functionals & functional derivatives
  – Connection to KL divergence

• The Mean-field approximation

• An inference procedure

• Dirichlet-multinomial example

4

The (Log) Likelihood

• We have some observed data x
• We have a model p(x, z) relating a latent variable z to the data
• To guess z we need the posterior p(z | x)
• The problem is one of computing the marginal likelihood p(x)
• Or, just as good, the log likelihood log p(x)

5
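The slide’s own equations are not reproduced in this transcript; in standard notation the quantities involved are:

    p(z \mid x) = \frac{p(x, z)}{p(x)}, \qquad
    p(x) = \int p(x, z)\, dz, \qquad
    \log p(x) = \log \int p(x, z)\, dz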

Approximating p(z|x)

• The integral in the expression for p(x) may not be easily computed

• But we might be able to get by with an approximation for p(x, z)

• We’ll focus on approximating only part of it

6

Choosing q

• How to choose q?

• Ideally, we want the q that is closest to p

• Define a lower bound on the (log) likelihood

– Make this a “function” of q

• Maximize the lower bound to make it as tight as possible

– Choose q accordingly

7

Bounding the Log Likelihood w/ Jensen’s Inequality

• Jensen’s Inequality, where f is concave

8
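For reference, the inequality in the form used here, followed by its application to the log likelihood (the slide’s equations do not survive in the transcript):

    f\big(\mathbb{E}[X]\big) \;\ge\; \mathbb{E}\big[f(X)\big] \qquad \text{for concave } f

    \log p(x) \;=\; \log \int q(z)\,\frac{p(x, z)}{q(z)}\,dz
    \;\ge\; \int q(z) \log \frac{p(x, z)}{q(z)}\,dz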

The Lower Bound

• We can’t calculate the log likelihood, but we can compute the lower bound
• Maximizing F tightens the lower bound on the likelihood
• What q maximizes F?
  • If q were a variable, we could do this by taking derivatives and solving for q

12
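Written out, the bound F from the previous slides is:

    F[q] \;=\; \int q(z)\log\frac{p(x,z)}{q(z)}\,dz
    \;=\; \mathbb{E}_{q}\big[\log p(x,z)\big] - \mathbb{E}_{q}\big[\log q(z)\big],
    \qquad \log p(x) \;\ge\; F[q]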

Functionals: the “Variational” in VB

• Functional: a kind of “meta-function” that takes a function as input

• We can view F[q] as a functional of q

• Calculus of functionals parallels that of functions

• Then, we can
  – take the derivative of F[q] with respect to q,
  – set it to 0, and
  – solve for q

13
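A minimal illustration of the parallel: for a functional that, like F[q], involves no derivatives of q inside the integral, the functional derivative is just the partial derivative of the integrand:

    F[q] = \int G\big(q(z), z\big)\,dz
    \qquad\Longrightarrow\qquad
    \frac{\delta F}{\delta q(z)} = \frac{\partial G\big(q(z), z\big)}{\partial q}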

Derivatives

14

• The change in functional as we change its function argument

Functional Derivatives

15

Useful Derivatives

16
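The equations on these slides do not survive in the transcript; the two functional derivatives actually needed in what follows are presumably the standard ones:

    \frac{\delta}{\delta q(z)} \int q(z)\,g(z)\,dz \;=\; g(z),
    \qquad
    \frac{\delta}{\delta q(z)} \int q(z)\log q(z)\,dz \;=\; \log q(z) + 1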

Calculating q

• Use Lagrange multipliers to enforce the constraint that q(z) normalizes to 1

21
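A sketch of the calculation, assuming normalization is the only constraint:

    \mathcal{L}[q] \;=\; F[q] + \lambda\Big(\int q(z)\,dz - 1\Big)

    \frac{\delta \mathcal{L}}{\delta q(z)} \;=\; \log p(x,z) - \log q(z) - 1 + \lambda \;=\; 0
    \qquad\Longrightarrow\qquad
    q(z) \;\propto\; p(x,z)

Normalizing gives q(z) = p(x, z) / p(x) = p(z | x), the result on slide 26.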

KL Divergence: An Alternative View

• Maximizing F is minimizing the KL divergence between q(z) and p(z | x)

25
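The identity behind this view:

    \log p(x) \;=\; F[q] \;+\; \mathrm{KL}\big(q(z)\,\big\|\,p(z\mid x)\big),
    \qquad
    \mathrm{KL}\big(q\,\|\,p\big) \;=\; \int q(z)\log\frac{q(z)}{p(z\mid x)}\,dz \;\ge\; 0

Since log p(x) does not depend on q, making F larger necessarily makes the KL divergence smaller, and F equals log p(x) exactly when q(z) = p(z | x).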

Optimal q

• The best q(z) is p(z|x)

26

Where are we?

• We’ve bounded the likelihood (Jensen’s Ineq.)

• Made this bound tight (Lagrange Multipliers)

• But the best approximation is no approximation at all!

• We need to constrain q so that it’s tractable

27

Optimal q in an Imperfect World

• We can’t compute q(z)=p(z|x) directly

• Instead, constrain the domain of F[q] to some set of more tractable functions

• This is usually done by making independence assumptions

– The mean field assumption: cut all dependencies

28
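In symbols, the mean-field restriction replaces the full joint q with a fully factorized one:

    q(z) \;=\; \prod_{i} q_i(z_i)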

Example 2: Mean Field Assumption

• We have some observed data x
• We have a model relating latent variables z and θ to the data
• To guess z and θ we need the posterior p(z, θ | x)
• But the integral is hard!
• Apply the mean field assumption

29
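The equations for this example are again missing from the transcript; the setup is presumably the two-block version of the above, with

    q(z, \theta) \;=\; q_z(z)\,q_\theta(\theta)
    \qquad\text{approximating}\qquad
    p(z, \theta \mid x) \;=\; \frac{p(x, z, \theta)}{p(x)},
    \quad
    p(x) \;=\; \iint p(x, z, \theta)\,dz\,d\theta

(with the integral over z read as a sum when z is discrete).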

The New Lower Bound

30

The New Lower Bound

Apply mean field assumption

33
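Under that factorization the bound becomes

    F[q_z, q_\theta]
    \;=\; \mathbb{E}_{q_z q_\theta}\big[\log p(x, z, \theta)\big]
    \;-\; \mathbb{E}_{q_z}\big[\log q_z(z)\big]
    \;-\; \mathbb{E}_{q_\theta}\big[\log q_\theta(\theta)\big]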

The Benefit of Independence

• The integrals get simpler
• In fact, some of them go away entirely

34

Optimizing the Lower Bound

35

Optimal qθ(θ)

• Use Lagrange multipliers to enforce the constraint that qθ(θ) normalizes to 1

36

Optimal qz(z)

• Use Lagrange multipliers to enforce the constraint that qz(z) normalizes to 1

37
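Carrying the Lagrange-multiplier argument through for each factor gives the standard mean-field updates, each normalized to a proper distribution:

    q_\theta(\theta) \;\propto\; \exp\Big(\mathbb{E}_{q_z}\big[\log p(x, z, \theta)\big]\Big),
    \qquad
    q_z(z) \;\propto\; \exp\Big(\mathbb{E}_{q_\theta}\big[\log p(x, z, \theta)\big]\Big)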

The Approximation q ≠ p

38

Estimating Parameters

• Now we have our approximation q

• We need to compute the expectations

• Use an EM-like procedure, alternating between the two
  – It was hard to do this for p(z, θ | x)
  – It’s (hopefully) easy for q(z, θ)
    • if we’ve defined p to make use of conjugacy
    • and if we’ve chosen the right constraint for q

39
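Spelled out, the alternation just cycles the two updates above:

    q_\theta^{(t+1)}(\theta) \;\propto\; \exp\Big(\mathbb{E}_{q_z^{(t)}}\big[\log p(x, z, \theta)\big]\Big),
    \qquad
    q_z^{(t+1)}(z) \;\propto\; \exp\Big(\mathbb{E}_{q_\theta^{(t+1)}}\big[\log p(x, z, \theta)\big]\Big)

Each update maximizes F with respect to one factor while the other is held fixed, so F never decreases and can be monitored for convergence.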

Calculating F

40

Calculating F

• As a side effect of inference, we already have one of the terms we need
• It’s the log of the normalization constant for q(z)
• So, we really only need two more expectations

41
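One way to read this slide, assuming the model factorizes as p(x, z, θ) = p(x, z | θ) p(θ) and writing Z_z for the normalization constant of q_z:

    q_z(z) \;=\; \frac{1}{Z_z}\exp\Big(\mathbb{E}_{q_\theta}\big[\log p(x, z \mid \theta)\big]\Big)
    \qquad\Longrightarrow\qquad
    F \;=\; \log Z_z \;+\; \mathbb{E}_{q_\theta}\big[\log p(\theta)\big]
    \;-\; \mathbb{E}_{q_\theta}\big[\log q_\theta(\theta)\big]

so only the last two expectations remain to be computed.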

Uses for F

• We can often use F in cases where we would normally use the log likelihood
  – Measuring convergence
    • No guarantee to maximize likelihood, but we do have F
• Others
  – Model selection
    • Choose the model with the highest lower bound
  – Selecting the number of clusters
    • Pick the number that gives us the highest lower bound
  – Parameter optimization
    • Again, optimize the lower bound w.r.t. the parameters

42

Worked Example

Dirichlet-Multinomial Mixture Model

43

Dirichlet-Multinomial Mixture Model

[Plate diagram, not reproduced in the transcript: hyperparameters α and β; mixture weights φ; cluster parameters π inside a plate of size K; cluster assignments z and observations x inside a plate of size N.]

44
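Read in the usual way, the plate diagram corresponds to the following generative story (the pairing of hyperparameters with variables, β with the mixture weights φ and α with the per-cluster multinomials π_k, is an assumption; the arrows are not recoverable from the transcript):

    \phi \sim \mathrm{Dir}(\beta), \qquad
    \pi_k \sim \mathrm{Dir}(\alpha) \quad (k = 1,\dots,K)

    z_n \mid \phi \sim \mathrm{Mult}(\phi), \qquad
    x_n \mid z_n, \pi \sim \mathrm{Mult}(\pi_{z_n}) \quad (n = 1,\dots,N)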

The Intractable Integral

45

The Mean Field Assumption

46

Optimizing F

• Apply Lagrange multipliers just like example 2

• In this case, we have simply replaced z, x, and θ with vectors

• The math is exactly the same

• But we need to find the expectations we skipped before

– Plug in the Dirichlet and multinomial distributions

47

Optimal q(z,θ)

48

Optimal qθ(θ)

• Borrowed from example 2
• See slides 36-38
• All we need to do is apply the particulars of the mixture model

49

Optimal qφ(φ): The Expectation

50

Dirichlet Distribution

51
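For reference, the Dirichlet density with parameter vector α = (α_1, …, α_K):

    \mathrm{Dir}(\pi \mid \alpha)
    \;=\; \frac{\Gamma\!\big(\sum_{k}\alpha_k\big)}{\prod_{k}\Gamma(\alpha_k)}
    \prod_{k}\pi_k^{\alpha_k - 1},
    \qquad \pi_k \ge 0, \quad \sum_k \pi_k = 1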

Optimal qφ(φ): The Numerator

52

Optimal qφ(φ): The Normalization

53

Optimal qφ(φ): Conjugacy Helps

54
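Putting slides 50–54 together: with a (here symmetric) Dirichlet prior Dir(β) on φ, conjugacy gives a Dirichlet again, with expected cluster counts added to the prior (a reconstruction under the hyperparameter pairing assumed above):

    q_\phi(\phi)
    \;\propto\; \exp\Big(\mathbb{E}_{q_z}\big[\log p(\phi) + \textstyle\sum_n \log p(z_n \mid \phi)\big]\Big)
    \;=\; \mathrm{Dir}\big(\phi \mid \beta + N_1, \dots, \beta + N_K\big),
    \qquad
    N_k = \sum_{n} q(z_n = k)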

Optimal qπ(π)

• q(π) is essentially the same as q(φ)

• The only difference is that there are multiple π’s

• So, q(π) should be a product of Dirichlets

55

Optimal qπ(π): The Expectation

56

Optimal qπ(π): The Numerator

57

Optimal qπ(π): The Denominator

58

Optimal qπ(π): Putting Them Together

59
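By the same argument each cluster’s word distribution gets its own Dirichlet, now with expected word counts; here x_{nv} denotes the count of word v in observation n and V the vocabulary size (notation assumed, not from the slides):

    q_\pi(\pi)
    \;=\; \prod_{k=1}^{K} \mathrm{Dir}\big(\pi_k \mid \alpha + N_{k1}, \dots, \alpha + N_{kV}\big),
    \qquad
    N_{kv} = \sum_{n} q(z_n = k)\, x_{nv}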

A Useful Standard Result

• The digamma function

• The expectation under a Dirichlet of the log of an individual component of a Dirichlet random variable

60
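The result in question, for a Dirichlet with parameter vector a and digamma function ψ:

    \mathbb{E}_{\mathrm{Dir}(\pi \mid a)}\big[\log \pi_k\big]
    \;=\; \psi(a_k) \;-\; \psi\!\Big(\sum_{j} a_j\Big)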

Optimal qz(z)

61

Optimal qz(z)

• Again, borrowed from example 2
• See slides 36-38
• Here, we plug in the model definition

62

Optimal qz(z): The Expectations

• First, let’s work with the simpler multinomial distribution
• Side effect: a kind of estimate for the multinomial parameter vector

63

Optimal qz(z): The Expectations

• Now, let’s work with the product of multinomials
• Side effect: a kind of set of multinomial parameter vectors
• This is essentially the same math required for HMMs and PCFGs

64

Optimal qz(z): The Expectations

65

Optimal qz(z): Putting It Together

66
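Combining the expectations above (same assumed notation as before), the unnormalized responsibility of cluster k for observation n is

    q(z_n = k)
    \;\propto\; \exp\Big(
        \mathbb{E}_{q_\phi}\big[\log \phi_k\big]
        + \sum_{v} x_{nv}\,\mathbb{E}_{q_\pi}\big[\log \pi_{kv}\big]
    \Big)

with each expectation evaluated by the digamma identity on slide 60.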

Implications of Assumption

• We should get the same result with an even weaker assumption

67

Inference

• “E-Step”: Expected Counts
  – Topic counts
  – Topic-word pair counts
• “M-Step”: The Proportions
  – Topic j
  – Topic-word pair j-k

68
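A minimal runnable sketch of this E/M-style loop under the assumptions used above (symmetric priors, X an N×V matrix of word counts); the function and variable names are illustrative, not from the slides:

    import numpy as np
    from scipy.special import digamma

    def vb_dirichlet_multinomial_mixture(X, K, alpha=1.0, beta=1.0, n_iter=100):
        """Mean-field VB for a Dirichlet-multinomial mixture (illustrative sketch only)."""
        N, V = X.shape
        rng = np.random.default_rng(0)
        resp = rng.dirichlet(np.ones(K), size=N)     # q(z_n = k), randomly initialized

        for _ in range(n_iter):
            # "E-step": expected counts under q(z)
            topic_counts = resp.sum(axis=0)          # N_k  = sum_n q(z_n = k)
            topic_word_counts = resp.T @ X           # N_kv = sum_n q(z_n = k) x_nv, shape (K, V)

            # "M-step": the proportions, i.e. expected log parameters of
            # q(phi) = Dir(beta + N_k) and q(pi_k) = Dir(alpha + N_kv),
            # via the digamma identity on slide 60
            weight_post = beta + topic_counts
            word_post = alpha + topic_word_counts
            e_log_phi = digamma(weight_post) - digamma(weight_post.sum())
            e_log_pi = digamma(word_post) - digamma(word_post.sum(axis=1, keepdims=True))

            # Recompute q(z_n = k) up to normalization (slide 66)
            log_resp = e_log_phi + X @ e_log_pi.T
            log_resp -= log_resp.max(axis=1, keepdims=True)   # numerical stability
            resp = np.exp(log_resp)
            resp /= resp.sum(axis=1, keepdims=True)

        return resp, weight_post, word_post

The sketch does not evaluate F; in practice one would also compute the lower bound each sweep (slides 69-71) and stop once it no longer improves.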

Calculating F

69

Calculating F

• Also borrowed from example 2
• See slides 40-41
• But we adapt it for the mixture model

70

Calculating F: The Normalization Constant

71

• By-product of computing