Variational Inference


Transcript of Variational Inference

Page 1: Variational Inference

Variational Inference

Note: Much (meaning almost all) of this has been liberated from John Winn and Matthew Beal's theses, and David MacKay's book.

Page 2: Variational Inference

Overview

• Probabilistic models & Bayesian inference

• Variational Inference

• Univariate Gaussian Example

• GMM Example

• Variational Message Passing

Page 3: Variational Inference

Bayesian networks

• Directed graph
• Nodes represent variables
• Links show dependencies
• Conditional distribution at each node
• Defines a joint distribution:

P(C,L,S,I) = P(L) P(C) P(S|C) P(I|L,S)

[Figure: Bayesian network with nodes C (object class), L (lighting color), S (surface color), I (image color) and factors P(C), P(L), P(S|C), P(I|L,S).]
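To make the factorisation concrete, here is a minimal Python sketch that evaluates this joint for binary variables. The CPT values are invented for illustration; only the factorisation itself comes from the slide.

```python
import numpy as np

# Hypothetical CPTs for the network above; every variable is binary (0 or 1).
P_L = np.array([0.6, 0.4])                 # P(L)
P_C = np.array([0.7, 0.3])                 # P(C)
P_S_given_C = np.array([[0.9, 0.1],        # P(S | C=0)
                        [0.2, 0.8]])       # P(S | C=1)
P_I_given_LS = np.array([[[0.8, 0.2],      # P(I | L=0, S=0)
                          [0.3, 0.7]],     # P(I | L=0, S=1)
                         [[0.6, 0.4],      # P(I | L=1, S=0)
                          [0.1, 0.9]]])    # P(I | L=1, S=1)

def joint(c, l, s, i):
    """P(C,L,S,I) = P(L) P(C) P(S|C) P(I|L,S)."""
    return P_L[l] * P_C[c] * P_S_given_C[c, s] * P_I_given_LS[l, s, i]

# A valid joint sums to 1 over all configurations.
total = sum(joint(c, l, s, i)
            for c in (0, 1) for l in (0, 1) for s in (0, 1) for i in (0, 1))
print(total)  # 1.0
```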

Page 4: Variational Inference

Bayesian inference

• Observed variables D and hidden variables H.
• Hidden variables include parameters and latent variables.
• Learning/inference involves finding:
• P(H1, H2, …| D), or
• P(H|D,M) - explicitly for a generative model.

[Figure: the same network, with object class C, lighting color L, and surface color S hidden, and image color I observed.]

Page 5: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.
• How should we represent this posterior distribution?

P(θ|D) ∝ P(D|θ) P(θ)
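As a concrete sketch of representing a posterior over a single parameter, the snippet below evaluates P(D|θ) P(θ) on a grid for a hypothetical Bernoulli problem (the 7-of-10 data counts and flat prior are invented) and contrasts the MAP point with the posterior mean:

```python
import numpy as np

# Hypothetical problem: theta is a Bernoulli success probability and
# D is 7 successes in 10 trials, with a flat prior on theta.
theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter
prior = np.ones_like(theta)              # P(theta)
likelihood = theta**7 * (1 - theta)**3   # P(D|theta)
post = likelihood * prior                # P(D|theta) P(theta)
post /= post.sum()                       # normalise on the grid

theta_map = theta[np.argmax(post)]       # peak of the density
theta_mean = np.sum(theta * post)        # where the mass is centred
print(theta_map, theta_mean)             # ~0.700 vs ~0.667
```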

Page 6: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.

[Figure: P(D|θ) P(θ) plotted against θ, with its maximum marked at θ_MAP.]

Page 7: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.

[Figure: P(D|θ) P(θ) against θ; θ_MAP lies in a region of high probability density, but the bulk of the probability mass lies elsewhere.]

Page 8: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.

[Figure: P(D|θ) P(θ) against θ, with samples drawn from it and θ_ML marked.]

Page 9: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.

[Figure: P(D|θ) P(θ) against θ, with a variational approximation Q(θ) fitted to it and θ_ML marked.]

Page 10: Variational Inference

Variational Inference (in three easy steps…)

1. Choose a family of variational distributions Q(H).
2. Use the Kullback-Leibler divergence KL(Q||P) as a measure of 'distance' between P(H|D) and Q(H).
3. Find the Q which minimizes the divergence.

Page 11: Variational Inference

Choose Variational Distribution

• P(H|D) ≈ Q(H).
• If P is so complex, how do we choose Q?
• Any Q is better than an ML or MAP point estimate.
• Choose Q so it can "get" close to P and is tractable – factorised, conjugate.

Page 12: Variational Inference

Kullback-Leibler Divergence

• Derived from the variational free energy of Feynman and Bogoliubov.
• Relative entropy between two probability distributions:
• KL(Q||P) ≥ 0 for any Q (Jensen's inequality).
• KL(Q||P) = 0 iff P = Q.
• Not a true distance measure – it is not symmetric.

KL(Q||P) = Σ_x Q(x) ln [ Q(x) / P(x) ]
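A minimal Python sketch of this definition for discrete distributions; the two example distributions are invented, and the output illustrates both non-negativity and asymmetry:

```python
import numpy as np

def kl(q, p):
    """KL(Q||P) = sum_x Q(x) ln [Q(x)/P(x)] for discrete distributions."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0                       # terms with Q(x) = 0 contribute 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.7, 0.2, 0.1])
print(kl(q, p), kl(p, q))   # both >= 0, and they differ: not symmetric
print(kl(p, p))             # 0.0: zero iff Q = P
```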

Page 13: Variational Inference

Kullback-Leibler Divergence

Minimising KL(Q||P) – exclusive:

KL(Q||P) = Σ_H Q(H) ln [ Q(H) / P(H|D) ]

Q locks onto a single mode of P.

Minimising KL(P||Q) – inclusive:

KL(P||Q) = Σ_H P(H|D) ln [ P(H|D) / Q(H) ]

Q spreads to cover all of P's mass.

[Figure: a narrow Q fitted to one mode of a broader P.]
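The exclusive/inclusive behaviour can be seen numerically. The sketch below (all distributions and grids invented) fits a single Gaussian Q to a bimodal P by brute-force grid search under each divergence; the exclusive fit locks onto one mode, while the inclusive fit spreads across both:

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def normal(x, m, s):
    return np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))

# Bimodal target: mixture of two well-separated Gaussians.
p = 0.5 * normal(x, -3, 0.7) + 0.5 * normal(x, 3, 0.7)

def kl(a, b):
    b = np.maximum(b, 1e-300)   # floor avoids log(0); KL blows up where b ~ 0
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# Search a single-Gaussian family Q = N(m, s) under each divergence.
best_excl, best_incl = None, None
for m in np.linspace(-5, 5, 101):
    for s in np.linspace(0.3, 5, 48):
        q = normal(x, m, s)
        e, i = kl(q, p), kl(p, q)
        if best_excl is None or e < best_excl[0]: best_excl = (e, m, s)
        if best_incl is None or i < best_incl[0]: best_incl = (i, m, s)

print("exclusive KL(Q||P):", best_excl)  # m ~ +/-3, small s: one mode
print("inclusive KL(P||Q):", best_incl)  # m ~ 0, large s: covers both modes
```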

Page 14: Variational Inference

Kullback-Leibler Divergence

KL(Q||P) = Σ_H Q(H) ln [ Q(H) / P(H|D) ]

= Σ_H Q(H) ln [ Q(H) P(D) / P(H,D) ]   (Bayes rule)

= Σ_H Q(H) ln [ Q(H) / P(H,D) ] + Σ_H Q(H) ln P(D)   (log property)

= Σ_H Q(H) ln [ Q(H) / P(H,D) ] + ln P(D)   (sum over H: Σ_H Q(H) = 1)

(Compare the inclusive form: KL(P||Q) = Σ_H P(H|D) ln [ P(H|D) / Q(H) ].)

Page 15: Variational Inference

Kullback-Leibler Divergence

DEFINE   L(Q) = Σ_H Q(H) ln P(H,D) − Σ_H Q(H) ln Q(H)

• L(Q) is the expectation of the log joint probability ln P(H,D) under Q, plus the entropy of Q.

• Maximizing L(Q) is equivalent to minimizing the KL divergence, since

KL(Q||P) = Σ_H Q(H) ln [ Q(H) / P(H,D) ] + ln P(D)

KL(Q||P) = ln P(D) − L(Q)

• We could not do the same trick for KL(P||Q); thus we approximate the likelihood with a function that has its mass where the likelihood is most probable (exclusive).
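The decomposition KL(Q||P) = ln P(D) − L(Q) can be checked numerically on a toy discrete model; the joint values below are invented:

```python
import numpy as np

# Toy model with one discrete hidden variable H and the data D fixed:
# joint values P(H=h, D) for h = 0, 1, 2.
P_HD = np.array([0.10, 0.25, 0.05])
P_D = P_HD.sum()                   # marginal likelihood P(D)
P_H_given_D = P_HD / P_D           # posterior P(H|D)

Q = np.array([0.2, 0.5, 0.3])      # an arbitrary variational distribution

L = np.sum(Q * np.log(P_HD)) - np.sum(Q * np.log(Q))   # L(Q)
KL = np.sum(Q * np.log(Q / P_H_given_D))               # KL(Q||P)

print(np.log(P_D), L + KL)         # equal: ln P(D) = L(Q) + KL(Q||P)
```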

Page 16: Variational Inference

Summarize

• For arbitrary Q(H):

ln P(D) = L(Q) + KL(Q||P)

where ln P(D) is fixed, so maximising L(Q) minimises KL(Q||P).

• We choose a family of Q distributions where L(Q) is tractable to compute; KL(Q||P) itself is still difficult in general to calculate.

Page 17: Variational Inference

Minimising the KL divergence

[Figure: the fixed log evidence ln P(D) decomposed into L(Q), which we maximise, and KL(Q||P), which shrinks as L(Q) grows.]


Page 22: Variational Inference

Factorised Approximation

• Assume Q factorises:

Q(H) = Π_i Q_i(H_i)

• Optimal solution for one factor, holding the others fixed, is given by

Q_j*(H_j) = (1/Z) exp ⟨ ln P(H,D) ⟩_{Q_i, i≠j}

where the expectation is over all factors Q_i with i ≠ j, and Z normalises.

• Given the form of Q, find the best H in the KL sense.
• Choose conjugate priors P(H) to give the form of Q.
• Iterate over each Q_i(H_i) in turn.
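A minimal sketch of this coordinate-ascent update for a toy model with two binary hidden variables; the table of joint values is invented. Each factor is refreshed from the expectation of ln P(H,D) under the other factor:

```python
import numpy as np

# Toy joint over two binary hidden variables with the data absorbed:
# P(H1,H2,D) has two equal modes at (0,0) and (1,1).
P = np.array([[0.49, 0.01],
              [0.01, 0.49]])
logP = np.log(P)

def normalise(u):
    u = np.exp(u - u.max())      # subtract max for numerical stability
    return u / u.sum()

# Factorised Q(H1,H2) = Q1(H1) Q2(H2); start slightly biased toward H=0.
Q1 = np.array([0.5, 0.5])
Q2 = np.array([0.6, 0.4])

for _ in range(50):
    Q1 = normalise(logP @ Q2)    # Q1*(h1) ∝ exp( Σ_h2 Q2(h2) ln P(h1,h2,D) )
    Q2 = normalise(Q1 @ logP)    # Q2*(h2) ∝ exp( Σ_h1 Q1(h1) ln P(h1,h2,D) )

# Q locks onto the (0,0) mode and ignores the equally probable (1,1)
# mode - the exclusive behaviour of minimising KL(Q||P).
print(Q1, Q2)
```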

Page 23: Variational Inference

Derivation

Idea: use the factoring of Q to isolate Q_j and maximize L with respect to Q_j.

L(Q) = Σ_H Q(H) ln P(H,D) − Σ_H Q(H) ln Q(H)

= Σ_H Π_i Q_i(H_i) ln P(H,D) − Σ_H Π_i Q_i(H_i) ln Π_j Q_j(H_j)   (substitution)

= Σ_H Π_i Q_i(H_i) ln P(H,D) − Σ_H Π_i Q_i(H_i) Σ_j ln Q_j(H_j)   (log property)

= Σ_{H_j} Q_j(H_j) ⟨ ln P(H,D) ⟩_{i≠j} − Σ_{H_j} Q_j(H_j) ln Q_j(H_j) + (terms not a function of Q_j)   (factor out one term Q_j)

= −KL(Q_j || Q_j*) + log Z + (terms not a function of Q_j)

where Q_j*(H_j) = (1/Z) exp ⟨ ln P(H,D) ⟩_{i≠j}. L(Q) is therefore maximised with respect to Q_j by setting Q_j = Q_j*.

Page 24: Variational Inference

Example: Univariate Gaussian

• Normal distribution: x_n ~ N(μ, γ⁻¹).
• Find P(μ, γ | x).
• Conjugate priors: Normal on the mean μ, Gamma on the precision γ.
• Factorized variational distribution: Q(μ, γ) = Q_μ(μ) Q_γ(γ).
• Q distribution has the same form as the prior distributions.
• Inference involves updating these hidden parameters.

Page 25: Variational Inference

Example: Univariate Gaussian

• Use Q* to derive the updates for Q_μ(μ) and Q_γ(γ).
• Where ⟨·⟩ is the expectation over the Q function.
• Iteratively solve.
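As a sketch of the iterative scheme, the following implements the coordinate-ascent updates for this model, assuming the factorised priors used in the later example (P(μ) = N(0, 1000), P(γ) = Gamma(0.001, 0.001)); the four data values are invented:

```python
import numpy as np

# Model: x_n ~ N(mu, 1/gamma), with factorised priors
# P(mu) = N(0, 1000) and P(gamma) = Gamma(0.001, 0.001).
# Q(mu) = N(m, 1/beta) and Q(gamma) = Gamma(a, b), updated in turn.
x = np.array([-0.3, 0.9, 0.2, 1.1])      # four samples (values invented)
N = len(x)
m0, beta0 = 0.0, 1.0 / 1000.0            # prior on mu, in precision form
a0, b0 = 0.001, 0.001                    # prior on gamma

m, beta = 0.0, 1.0                       # initial Q(mu)
a, b = a0 + N / 2.0, 1.0                 # 'a' is fixed by its own update

for _ in range(100):
    E_gamma = a / b                              # <gamma> under Q(gamma)
    beta = beta0 + N * E_gamma                   # precision of Q(mu)
    m = (beta0 * m0 + E_gamma * x.sum()) / beta  # mean of Q(mu)
    # <sum_n (x_n - mu)^2> with <mu> = m and <mu^2> = m^2 + 1/beta
    E_sq = np.sum(x**2 - 2 * x * m + m**2 + 1.0 / beta)
    b = b0 + 0.5 * E_sq                          # rate of Q(gamma)

print("Q(mu)    = N(%.3f, %.4f)" % (m, 1.0 / beta))
print("Q(gamma) = Gamma(%.3f, %.3f); <gamma> = %.3f" % (a, b, a / b))
```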

Page 26: Variational Inference

Example: Univariate Gaussian

• An estimate of the log evidence ln P(D) can be found by calculating L(Q).

• Where ⟨·⟩ are expectations with respect to Q(·).

Page 27: Variational Inference

Example

[Figure: variational and true posteriors for a Gaussian given four data samples: true posterior (thick line), variational approximations (dashed lines). Priors: P(μ) = N(0, 1000), P(γ) = Gamma(0.001, 0.001).]

Page 28: Variational Inference

VB with Image Segmentation

[Figure: an image with two marked pixel locations and the RGB histograms of those two pixel locations.]

“VB at the pixel level will give better results.”

A feature vector (x, y, Vx, Vy, r, g, b) will have issues with data association.

VB with a GMM will be complex – doing this in real time will be execrable.

Page 29: Variational Inference

Lower Bound for GMM-Ugly

Page 30: Variational Inference

Variational Equations for GMM-Ugly

Page 31: Variational Inference

Brings Up VMP – Efficient Computation

[Figure: the Bayesian network from earlier, with nodes C (object class), L (lighting color), S (surface color), I (image color) and factors P(C), P(L), P(S|C), P(I|L,S).]