Variational Bayes 101

Informatics and Mathematical Modelling / Lars Kai Hansen

Adv. Signal Proc. 2006

The Bayes scene

Exact averaging in discrete/small models (Bayes networks)

Approximate averaging:
- Monte Carlo methods
- Ensemble/mean field
- Variational Bayes methods

Variational-Bayes.org, MLpedia, Wikipedia

• ISP Bayes:
- ICA: mean field, Kalman, dynamical systems
- NeuroImaging: optimal signal detector
- Approximate inference
- Machine learning methods

Bayes’ methodology

Minimal error rate is obtained when the detector is based on the posterior probability (Bayes decision theory)

\[ P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}, \qquad D = \{ x_n \mid n = 1, \ldots, N \} \]

Likelihood may contain unknown parameters

\[ P(D \mid M) = \int P(D \mid \theta)\, p(\theta \mid M)\, d\theta = \int \Big[ \prod_n P(x_n \mid \theta) \Big]\, p(\theta \mid M)\, d\theta \]
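As a minimal sketch of what integrating out an unknown parameter means in practice, the following evaluates the marginal likelihood for a toy model that is not part of the slides: N Bernoulli samples with k ones and a uniform prior on the rate θ. The function names are hypothetical. The numerical integral is compared with the known closed form, the Beta function B(k+1, N-k+1).

```python
import math

def log_likelihood(theta, k, n):
    """log P(D | theta) for n Bernoulli samples containing k ones."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def evidence_midpoint(k, n, grid=20000):
    """Midpoint-rule estimate of P(D | M) with the assumed prior p(theta | M) = 1 on (0, 1)."""
    h = 1.0 / grid
    return sum(math.exp(log_likelihood((i + 0.5) * h, k, n)) for i in range(grid)) * h

def evidence_exact(k, n):
    """Closed form of the same integral: the Beta function B(k + 1, n - k + 1)."""
    return math.exp(math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2))

k, n = 7, 10
numeric = evidence_midpoint(k, n)
exact = evidence_exact(k, n)
print(numeric, exact)
```

For most models no closed form exists, which is why the approximate averaging methods listed earlier are needed.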

Bayes’ methodology

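Conventional approach is to use the most probable parameters

\[ P(D \mid M) \approx \prod_n P(x_n \mid \theta^*), \qquad \theta^* = \arg\max_\theta \log \Big\{ \Big[ \prod_n P(x_n \mid \theta) \Big]\, p(\theta \mid M) \Big\} \]

However: the averaged model is generalization optimal (Hansen, 1999),

\[ P(x \mid D) = \int P(x \mid \theta)\, P(\theta \mid D)\, d\theta \qquad \text{(Bayesian average)} \]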

The hidden agenda of learning

Typically, learning proceeds by generalization from a limited set of samples… but

we would like to identify the model that generated the data

… so choose the least complex model compatible with the data

“That I figured out in 1386”

Generalizability is defined as the expected performance on a random new sample ... the mean performance of a model on a ”fresh” data set is an unbiased estimate of generalization

Typical loss functions: <-log p(x)>, <# prediction errors>, <[g(x) - ĝ(x)]^2>, <log p(x,g)/(p(x)p(g))>, etc.

Results can be presented as ”bias-variance trade-off curves” or ”learning curves”
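The first three loss functions can be estimated on a "fresh" data set as plain sample averages. The sketch below assumes a fitted Gaussian density for the log-loss case; the function names and the tiny test arrays are hypothetical and only illustrate the bookkeeping.

```python
import math

def mean_neg_log_lik(test_x, mu, sigma):
    """Estimate of <-log p(x)> for a fitted Gaussian density N(mu, sigma^2)."""
    const = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const + 0.5 * ((x - mu) / sigma) ** 2 for x in test_x) / len(test_x)

def error_rate(labels, predictions):
    """Estimate of <# prediction errors>, as a fraction of the test set."""
    return sum(1 for g, g_hat in zip(labels, predictions) if g != g_hat) / len(labels)

def mean_squared_error(targets, predictions):
    """Estimate of <[g(x) - g_hat(x)]^2>."""
    return sum((g - g_hat) ** 2 for g, g_hat in zip(targets, predictions)) / len(targets)

# Hypothetical test samples that were not used for fitting.
test_x = [-1.0, 0.0, 1.0]
nll = mean_neg_log_lik(test_x, mu=0.0, sigma=1.0)
print(nll)
```

Repeating such estimates for increasing training set sizes gives a learning curve.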

Generalization!

Generalization optimal predictive distribution

”The game of guessing a pdf”

Assume: a random teacher drawn from P(θ), and a random data set, D, drawn from P(x|θ)

The prediction / generalization error is

\[ \Gamma(\theta, D, A) = \int \big[ -\log p(x \mid D, A) \big]\, P(x \mid \theta)\, dx \]

\[ \Gamma(A) = \iint \Gamma(\theta, D, A)\, P(\theta)\, P(D \mid \theta)\, d\theta\, dD \]

Here p(x|D,A) is the predictive distribution of model A and P(x|θ) is the test sample distribution.

Generalization optimal predictive distribution

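We define the ”generalization functional” (Hansen, NIPS 1999)

\[ H[q(\cdot \mid \cdot)] = -\iiint \log q(x \mid D)\, P(x \mid \theta)\, dx\; P(D \mid \theta)\, dD\; P(\theta)\, d\theta + \int \lambda(D) \Big[ \int q(x \mid D)\, dx - 1 \Big]\, dD \]

Minimized by the ”Bayesian averaging” predictive distribution

\[ q(x \mid D) = \int P(x \mid \theta)\, \frac{P(D \mid \theta)\, P(\theta)}{\int P(D \mid \theta')\, P(\theta')\, d\theta'}\, d\theta \]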
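The optimality claim can be checked exactly in a toy setting that is not from the slides: a Bernoulli teacher with rate θ drawn from a uniform prior, and a data set summarized by the count k of ones among n samples. In that case every count has probability 1/(n+1) and the Bayesian average is (k+1)/(n+2). The competitor `add_half` is a hypothetical alternative predictor chosen only for comparison.

```python
import math

def expected_loss(predict, n):
    """Generalization error averaged over teachers theta ~ U(0, 1) and data sets of size n.

    Under the uniform prior P(k) = 1/(n + 1), and the test sample distribution
    given the data is P(x = 1 | k) = (k + 1)/(n + 2)."""
    total = 0.0
    for k in range(n + 1):
        p1 = (k + 1) / (n + 2)
        q1 = predict(k, n)
        total += -(p1 * math.log(q1) + (1.0 - p1) * math.log(1.0 - q1)) / (n + 1)
    return total

def bayes_average(k, n):
    """Predictive probability of x = 1 from averaging over the posterior."""
    return (k + 1) / (n + 2)

def add_half(k, n):
    """Hypothetical competitor: an 'add one half' smoothed frequency."""
    return (k + 0.5) / (n + 1)

loss_bayes = expected_loss(bayes_average, 10)
loss_other = expected_loss(add_half, 10)
print(loss_bayes, loss_other)
```

The gap is small but strictly positive, since the averaged cross-entropy has a unique minimum at the Bayesian average.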

Bias-variance trade-off and averaging

Now averaging is good, but can we average ”too much”?

Define the family of tempered posterior distributions

\[ q(x \mid D, T) = \int P(x \mid \theta)\, \frac{\big( P(D \mid \theta)\, P(\theta) \big)^{1/T}}{\int \big( P(D \mid \theta')\, P(\theta') \big)^{1/T}\, d\theta'}\, d\theta \]

Case: univariate normal distribution with unknown mean parameter…

High temperature: widened posterior average

Low temperature: narrow average
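For the univariate normal case the effect of temperature can be worked out in closed form. The sketch below makes assumptions that the slide does not state: unit noise variance, a Gaussian prior N(0, prior_var) on the mean, and n training samples. Under these assumptions the posterior is N(m, v), tempering it gives N(m, T·v), and the tempered predictive density is N(m, 1 + T·v).

```python
import math

def generalization_error(T, n, prior_var=1.0):
    """Average -log predictive density of the tempered Bayesian average.

    Assumed setting: x ~ N(theta, 1), theta ~ N(0, prior_var), n training samples."""
    v = 1.0 / (n + 1.0 / prior_var)   # posterior variance of the mean at T = 1
    pred_var = 1.0 + T * v            # variance of the tempered predictive density
    # Averaged over teacher, data set and test sample, E[(x - m)^2] = 1 + v.
    return 0.5 * math.log(2.0 * math.pi * pred_var) + 0.5 * (1.0 + v) / pred_var

temperatures = [0.01, 0.25, 0.5, 1.0, 2.0, 4.0, 16.0]
errors = {T: generalization_error(T, n=5) for T in temperatures}
best_T = min(errors, key=errors.get)
print(best_T, errors[best_T])
```

In this sketch both narrowing (T < 1) and widening (T > 1) increase the average error, so the untempered average at T = 1 is the best member of the family.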

Bayes’ model selection, example

Let three models A, B, C be given:

A) x is normal N(0,1)
B) x is normal N(0,σ²), σ² is uniform U(0,∞)
C) x is normal N(μ,σ²), μ, σ² are uniform U(0,∞)

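\[ \bar{x} = \frac{1}{N} \sum_n x_n, \qquad s^2 = \frac{1}{N} \sum_n (x_n - \bar{x})^2 \]

Model A

The likelihood of N samples is given by

\[ P(D \mid A) = \Big( \frac{1}{2\pi} \Big)^{N/2} \exp\Big( -\frac{1}{2} \sum_n x_n^2 \Big) \]

Model B

\[ P(D \mid B) = \int P(D \mid 0, \sigma^2)\, P(\sigma^2)\, d\sigma^2 \]

Model C

\[ P(D \mid C) = \iint P(D \mid \mu, \sigma^2)\, P(\mu, \sigma^2)\, d\mu\, d\sigma^2, \qquad P(D \mid \mu, \sigma^2) = \Big( \frac{1}{2\pi\sigma^2} \Big)^{N/2} \exp\Big( -\frac{N \big[ (\bar{x} - \mu)^2 + s^2 \big]}{2\sigma^2} \Big) \]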

• Bayesian model selection: C (green) is the correct model; what if only A (red) + B (blue) are known?

• Bayesian model selection: A (red) is the correct model
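The case where A is the correct model can be sketched numerically. The code below departs from the slide in clearly labelled ways: the improper uniform priors are replaced by bounded uniform priors (σ² on (0, var_max), μ on (-mu_max, mu_max)), the three models get equal prior probability, and the data are a deterministic stand-in for a sample from N(0,1) built from standard normal quantiles. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def log_lik(mu, var, n, mean, s2):
    """log P(D | mu, sigma^2) written in terms of the sample mean and variance."""
    return -0.5 * n * math.log(2.0 * math.pi * var) - 0.5 * n * ((mean - mu) ** 2 + s2) / var

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def log_evidences(data, var_max=10.0, mu_max=5.0, grid=300):
    n = len(data)
    mean = sum(data) / n
    s2 = sum((x - mean) ** 2 for x in data) / n
    # Averaging the likelihood over a uniform midpoint grid approximates the
    # integral against the assumed bounded uniform priors.
    variances = [(i + 0.5) * var_max / grid for i in range(grid)]
    means = [-mu_max + (i + 0.5) * 2.0 * mu_max / grid for i in range(grid)]
    return {
        "A": log_lik(0.0, 1.0, n, mean, s2),
        "B": log_mean_exp([log_lik(0.0, v, n, mean, s2) for v in variances]),
        "C": log_mean_exp([log_lik(m, v, n, mean, s2) for m in means for v in variances]),
    }

def posterior_model_probabilities(log_ev):
    """P(M | D) under equal prior model probabilities."""
    top = max(log_ev.values())
    weights = {name: math.exp(v - top) for name, v in log_ev.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Deterministic stand-in for 40 samples from model A.
data = [NormalDist().inv_cdf((i + 0.5) / 40) for i in range(40)]
log_ev = log_evidences(data)
probs = posterior_model_probabilities(log_ev)
print(log_ev, probs)
```

Models B and C can fit the data at least as well as A, but their evidence is diluted over the prior volume, so the simplest adequate model wins. The numbers depend on the assumed prior bounds.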

Bayesian inference

• Bayesian averaging

\[ p(g \mid x, D) = \int p(g \mid x, \theta)\, p(\theta \mid x, D)\, d\theta, \qquad p(g \mid x, D) \approx p(g \mid x, \hat{\theta}(D)) \]

• Caveats: Bayes can rarely be implemented exactly

Not optimal if the model family is incorrect: ”Bayes cannot detect bias”

However, still asymptotically optimal if the observation model is correct and the prior is ”weak” (Hansen, 1999).

Hierarchical Bayes models

• Multi-level models in Bayesian averaging

\[ p(g \mid x, D) = \iint p(g \mid x, \theta)\, p(\theta \mid x, D, \alpha)\, p(\alpha \mid x, D)\, d\theta\, d\alpha, \qquad p(g \mid x, D) \approx p\big( g \mid x, \hat{\theta}(D, \hat{\alpha}(D)) \big) \]

C.P. Robert: The Bayesian Choice - A Decision-Theoretic Motivation. Springer Texts in Statistics, Springer Verlag, New York (1994).

G. Golub, M. Heath and G. Wahba: Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, pp. 215–223 (1979).

K. Friston: A theory of cortical responses. Phil. Trans. R. Soc. B 360:815-836 (2005).

Hierarchical Bayes models

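\[ p(g \mid x, D) = \iint p(g \mid x, \theta)\, p(\theta \mid D, \alpha)\, p(\alpha \mid D)\, d\theta\, d\alpha \]

\[ p(\theta \mid D, \alpha) = \frac{p(D \mid \theta)\, p(\theta \mid \alpha)}{p(D \mid \alpha)}, \qquad p(\theta \mid \alpha) = C(\alpha) \exp\big( -\alpha f(\theta) \big) \]

\[ p(D \mid \alpha) = \int p(D \mid \theta)\, p(\theta \mid \alpha)\, d\theta \]

\[ \langle f(\theta) \rangle_{\text{posterior}(\alpha)} = \langle f(\theta) \rangle_{\text{prior}(\alpha)} \]

“Learning hyperparameters by adjusting prior expectations”

- Empirical Bayes: MacKay (1992)
- Hansen et al. (Eusipco, 2006)
- Cf. Boltzmann learning (Hinton et al. 1983)

Figure labels: Posterior; “Evidence”; target at maximal evidence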

Hyperparameter dynamics

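\[ \langle \theta^2 \rangle_{\text{posterior}(\alpha)} = \langle \theta^2 \rangle_{\text{prior}(\alpha)}, \qquad p(\theta \mid \alpha) = C(\alpha) \exp\big( -\tfrac{1}{2} \alpha \theta^2 \big) \]

Gaussian prior with adaptive hyperparameter

Discontinuity: the parameter is pruned at low signal-to-noise. Hansen & Rasmussen, Neural Comp (1994); Tipping, “Relevance vector machine” (1999)

θ²A is a signal-to-noise measure

θ_ML is the maximum likelihood optimum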

Hyperparameter dynamics

Hyperparameters dynamically updated implies pruning

Pruning decisions based on SNR

Mechanism for cognitive selection, attention?

Hansen & Rasmussen, Neural Comp (1994)
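The pruning discontinuity can be reproduced in a one-parameter sketch. The assumptions here are mine rather than the slides': a Gaussian likelihood in θ centred at `theta_ml` with precision `precision` (a hypothetical stand-in for the curvature scale of the likelihood), and a prior θ ~ N(0, 1/α). Each iteration matches the prior second moment 1/α to the posterior second moment; the fixed point is α = precision / (snr - 1) with snr = precision · theta_ml², and it exists only when snr > 1.

```python
import math

def evidence_optimal_alpha(theta_ml, precision, max_iter=100000, alpha_max=1e6):
    """Fixed-point iteration that matches <theta^2> under prior and posterior."""
    alpha = 1.0
    for _ in range(max_iter):
        post_var = 1.0 / (precision + alpha)
        post_mean = precision * theta_ml * post_var
        new_alpha = 1.0 / (post_mean ** 2 + post_var)
        if new_alpha > alpha_max:
            return math.inf   # the prior collapses onto zero: the parameter is pruned
        if abs(new_alpha - alpha) < 1e-12 * alpha:
            return new_alpha
        alpha = new_alpha
    return alpha

def closed_form_alpha(theta_ml, precision):
    """Fixed point of the iteration above; infinite when the signal-to-noise is below one."""
    snr = precision * theta_ml ** 2
    return precision / (snr - 1.0) if snr > 1.0 else math.inf

strong = evidence_optimal_alpha(theta_ml=0.3, precision=100.0)   # snr = 9
weak = evidence_optimal_alpha(theta_ml=0.05, precision=100.0)    # snr = 0.25
print(strong, weak)
```

The threshold `alpha_max` is an arbitrary cut-off used to declare divergence; in the weak case α grows without bound, which is the pruning decision.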

Approximations needed for posteriors

- Approximations using asymptotic expansions (Laplace etc.) - JL
- Approximation of posteriors using tractable (factorized) pdfs by KL-fitting…
- Approximation of products using EP - AH, Wednesday
- Approximation by MCMC - OWI, Thursday

P. Højen-Sørensen: Thesis (2001)

Illustration of approximation by a Gaussian pdf
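A minimal sketch of one such Gaussian approximation, the Laplace method, for a toy posterior that is not from the slides: a Bernoulli rate with a uniform prior after observing k ones in n samples (0 < k < n assumed). The Gaussian is centred at the posterior mode and its variance is the negative inverse curvature of the log posterior there.

```python
def laplace_approximation(k, n, steps=50):
    """Mode and variance of the Gaussian fitted at the maximum of the log posterior.

    Unnormalised log posterior: k*log(theta) + (n - k)*log(1 - theta), with 0 < k < n."""
    theta = 0.5
    for _ in range(steps):   # Newton iterations towards the mode
        grad = k / theta - (n - k) / (1.0 - theta)
        hess = -k / theta ** 2 - (n - k) / (1.0 - theta) ** 2
        theta -= grad / hess
    curvature = -k / theta ** 2 - (n - k) / (1.0 - theta) ** 2
    return theta, -1.0 / curvature

mode, variance = laplace_approximation(k=7, n=10)
print(mode, variance)
```

The exact posterior here is Beta(8, 4), with mean 2/3 and variance of about 0.017, so the fitted Gaussian (mode 0.7, variance 0.021) is close but not identical. Plain Newton steps are adequate for this example but are not safeguarded against leaving the interval (0, 1).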

Variational Bayes

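Notation: y are observables and x are hidden variables. We analyse the log likelihood of a mixture model

\[ \log p(\mathbf{y} \mid M) = \log \iint p(\mathbf{y}, \boldsymbol{\theta}, \mathbf{x} \mid M)\, d\boldsymbol{\theta}\, d\mathbf{x}, \qquad \mathbf{y} = \{ y_n \},\ \mathbf{x} = \{ x_n \} \]

\[ p(\mathbf{y}, \boldsymbol{\theta}, \mathbf{x} \mid M) = p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}, M)\, p(\mathbf{x} \mid \boldsymbol{\theta}, M)\, p(\boldsymbol{\theta} \mid M) \]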

Variational Bayes

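\[
\begin{aligned}
\log p(\mathbf{y} \mid M) &= \log \iint p(\mathbf{y}, \boldsymbol{\theta}, \mathbf{x} \mid M)\, d\boldsymbol{\theta}\, d\mathbf{x} \\
&= \log \iint q(\mathbf{x})\, r(\boldsymbol{\theta})\, \frac{p(\mathbf{y}, \boldsymbol{\theta}, \mathbf{x} \mid M)}{q(\mathbf{x})\, r(\boldsymbol{\theta})}\, d\mathbf{x}\, d\boldsymbol{\theta} \\
&\geq \iint q(\mathbf{x})\, r(\boldsymbol{\theta}) \log \frac{p(\mathbf{y}, \boldsymbol{\theta}, \mathbf{x} \mid M)}{q(\mathbf{x})\, r(\boldsymbol{\theta})}\, d\mathbf{x}\, d\boldsymbol{\theta} \\
&= \iint q(\mathbf{x})\, r(\boldsymbol{\theta}) \log \frac{p(\boldsymbol{\theta}, \mathbf{x} \mid \mathbf{y}, M)}{q(\mathbf{x})\, r(\boldsymbol{\theta})}\, d\mathbf{x}\, d\boldsymbol{\theta} + \log p(\mathbf{y} \mid M)
\end{aligned}
\]

\[ q(\mathbf{x}) \propto \exp \big\langle \log p(\boldsymbol{\theta}, \mathbf{x}, \mathbf{y} \mid M) \big\rangle_{r(\boldsymbol{\theta})} \]

\[ r(\boldsymbol{\theta}) \propto \exp \big\langle \log p(\boldsymbol{\theta}, \mathbf{x}, \mathbf{y} \mid M) \big\rangle_{q(\mathbf{x})} \]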
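The two coupled updates can be sketched for a deliberately small mixture model. Everything specific here is an assumption made for illustration: two components with known unit variance and equal mixing weights, unknown means with independent N(0, prior_var) priors, and synthetic data built from normal quantiles around -2 and +2. The hidden variables x are the component assignments, q(x) is the table of responsibilities, and r(θ) is a product of Gaussians over the two means.

```python
import math
from statistics import NormalDist

def vb_two_means(y, prior_var=100.0, iterations=100):
    """Alternate the q(x) and r(theta) updates for y_n ~ 0.5 N(mu_1, 1) + 0.5 N(mu_2, 1)."""
    m = [-1.0, 1.0]   # means of r(mu_k), initial guess
    v = [1.0, 1.0]    # variances of r(mu_k)
    resp = []
    for _ in range(iterations):
        # q(x_n = k) is proportional to exp <log p(y_n, x_n = k | mu)>_r
        resp = []
        for yn in y:
            logits = [-0.5 * ((yn - m[k]) ** 2 + v[k]) for k in range(2)]
            top = max(logits)
            w = [math.exp(l - top) for l in logits]
            resp.append([wk / sum(w) for wk in w])
        # r(mu_k) is proportional to exp <log p(y, x, mu)>_q, a Gaussian with these moments
        for k in range(2):
            nk = sum(r[k] for r in resp)
            v[k] = 1.0 / (1.0 / prior_var + nk)
            m[k] = v[k] * sum(r[k] * yn for r, yn in zip(resp, y))
    return m, v, resp

quantiles = [(i + 0.5) / 30 for i in range(30)]
y = [NormalDist(-2.0, 1.0).inv_cdf(p) for p in quantiles]
y += [NormalDist(2.0, 1.0).inv_cdf(p) for p in quantiles]
m, v, resp = vb_two_means(y)
print(m, v)
```

Unlike EM, the responsibility update uses the posterior uncertainty v[k] of each mean as well as its location, and the result is a distribution over the means rather than a point estimate.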

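Variational Bayes: conjugate exponential families

\[ p(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\theta}, M) = f(\mathbf{y}, \mathbf{x})\, g(\boldsymbol{\theta}) \exp\big[ \boldsymbol{\phi}(\boldsymbol{\theta})^{\top} \mathbf{u}(\mathbf{y}, \mathbf{x}) \big] \]

\[ p(\boldsymbol{\theta} \mid \eta, \boldsymbol{\nu}) = h(\eta, \boldsymbol{\nu})\, g(\boldsymbol{\theta})^{\eta} \exp\big( \boldsymbol{\phi}(\boldsymbol{\theta})^{\top} \boldsymbol{\nu} \big) \]

\[ p(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\theta}, M)\, p(\boldsymbol{\theta} \mid \eta, \boldsymbol{\nu}) \propto h(\eta', \boldsymbol{\nu}')\, g(\boldsymbol{\theta})^{\eta'} \exp\big[ \boldsymbol{\phi}(\boldsymbol{\theta})^{\top} \big( \boldsymbol{\nu} + \mathbf{u}(\mathbf{y}, \mathbf{x}) \big) \big] \]

\[ \eta' = \eta + 1, \qquad \boldsymbol{\nu}' = \boldsymbol{\nu} + \mathbf{u}(\mathbf{y}, \mathbf{x}) \]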
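As a small worked instance of the conjugate update, consider a Bernoulli observation model without hidden variables (my choice of example, not the slides'). Writing p(y|θ) = (1-θ) exp(y·logit θ) gives g(θ) = 1-θ, φ(θ) = logit θ and u(y) = y, so the conjugate prior g(θ)^η exp(φ(θ)ν) = θ^ν (1-θ)^(η-ν) is a Beta(ν+1, η-ν+1) density. For N independent observations the update adds N to η and the summed sufficient statistics to ν. Function names are hypothetical.

```python
def sufficient_statistic(y):
    """u(y) for a Bernoulli observation p(y | theta) = (1 - theta) * exp(y * logit(theta))."""
    return y

def conjugate_update(eta, nu, data):
    """eta' = eta + N and nu' = nu + sum of u(y_n) over the N observations."""
    return eta + len(data), nu + sum(sufficient_statistic(y) for y in data)

def to_beta(eta, nu):
    """Translate (eta, nu) to the parameters of the equivalent Beta density."""
    return nu + 1, eta - nu + 1

data = [1, 0, 1, 1, 0, 1]
eta0, nu0 = 2, 1                       # corresponds to a Beta(2, 2) prior
eta1, nu1 = conjugate_update(eta0, nu0, data)
a, b = to_beta(eta1, nu1)
print(a, b)
```

The result agrees with the familiar Beta-Bernoulli rule: four ones and two zeros turn Beta(2, 2) into Beta(6, 4).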

Mini exercise

What are the natural parameters for a Gaussian?

What are the natural parameters for a MoG?

• Observation model and “Bayes factor”

• “Normal inverse gamma” (NIG) prior: the conjugate prior for the GLM observation model

• The Bayes factor is the ratio between the normalization constants of the NIGs

Exercises

- Matthew Beal’s Mixture of Factor Analyzers code: code available (variational-bayes.org)

- Code a VB version of the BGML for signal detection: code available for exact posterior
