
LEARNING AND INFERENCE IN GRAPHICAL MODELS

Chapter 06: Variational Bayesian Inference

Dr. Martin Lauer

University of Freiburg, Machine Learning Lab

Karlsruhe Institute of Technology, Institute of Measurement and Control Systems


References for this chapter

◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 10, Springer, 2006

◮ Charles Fox and Stephen Roberts, A Tutorial on Variational Bayesian Inference, In: Artificial Intelligence Review, vol. 38, no. 2, pp. 85–95, 2012

◮ John Winn and Christopher M. Bishop, Variational Message Passing, In: Journal of Machine Learning Research, vol. 6, pp. 661–694, 2005, http://machinelearning.wustl.edu/mlpapers/paper_files/WinnB05.pdf


Approximative solutions

Observations:

◮ inference on Bayesian networks can be done analytically for polytrees combined with special distributions (categorical, Gauss-linear)

◮ inference is not analytically tractable in the general case

◮ hence, the general case requires numerical or approximate solutions


Approximative inference

The joint probability distribution of many Bayesian networks is quite complicated and hard to treat analytically.

Goal: find a simpler joint probability distribution that approximates the original one and that can be treated analytically

Example:

[Figure: two graphical models over µ, σ² and the Xi (plate of size n); left: the conjugate prior for Gaussians; right: a desirable, simpler prior for Gaussians]


Approximative inference

How can we measure whether two distributions are similar?

Definition: the Kullback-Leibler divergence is an asymmetric measure of the dissimilarity of two probability distributions. It is defined by

KL(p||q) = ∫_{−∞}^{∞} p(x) · log( p(x)/q(x) ) dx

Properties:

◮ KL(p||p) = 0

◮ KL(p||q) ≥ 0 with equality only if p = q almost everywhere

◮ KL(p||q) ≠ KL(q||p)
◮ KL(p||q) + KL(q||r) ≱ KL(p||r)
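As a quick numerical illustration of the asymmetry, the divergence can be approximated on a grid. The following sketch is not part of the slides; the two Gaussians are an arbitrary choice for illustration only.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    # density of N(mu, var)
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def kl_numeric(p, q, dx):
    # Riemann-sum approximation of KL(p||q) = integral of p(x) log(p(x)/q(x)) dx
    return np.sum(p * np.log(p / q)) * dx

x, dx = np.linspace(-20, 20, 20001, retstep=True)
p = gauss_pdf(x, 0.0, 1.0)   # p = N(0, 1)
q = gauss_pdf(x, 2.0, 4.0)   # q = N(2, 4)

print(kl_numeric(p, q, dx))  # KL(p||q) ≈ 0.818
print(kl_numeric(q, p, dx))  # KL(q||p) ≈ 2.807, clearly different
```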


Kullback-Leibler divergence

[Figure: a density p(x) together with the Gaussians argmin_q KL(p||q) and argmin_q KL(q||p)]

Question: which Gaussian q minimizes the Kullback-Leibler divergence KL(p||q) and KL(q||p), respectively?


Variational Bayesian inference

◮ assume we are given a complex distribution p(U|O) where U are the unobserved variables and O the observed variables

◮ we want to approximate it with a parameterized distribution q(U|θ) with θ the set of parameters

◮ we assume that q can be factorized as q(U|θ) = ∏_i q_i(U_i|θ_i), with q_i a conditional distribution

◮ how should we choose θ to obtain the best approximation?

minimize_θ  KL( q(U|θ) || p(U|O) )


Variational Bayesian inference

KL(q(U|θ) || p(U|O)) = ∫ q(u|θ) · log( q(u|θ) / p(u|O) ) du

                     = −∫ q(u|θ) · log( p(u|O) / q(u|θ) ) du

                     = −∫ q(u|θ) · log( p(u,O) / (p(O) · q(u|θ)) ) du

                     = −∫ ( q(u|θ) · log( p(u,O) / q(u|θ) ) − q(u|θ) · log p(O) ) du

                     = −∫ q(u|θ) · log( p(u,O) / q(u|θ) ) du + log p(O)

                     = −L(θ) + log p(O)     with  L(θ) := ∫ q(u|θ) · log( p(u,O) / q(u|θ) ) du

Observe that log p(O) does not depend on θ.

Hence, to minimize the KL divergence we need to maximize L(θ).
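Rearranging this identity gives a useful reading (an added note, not on the original slide): since the KL divergence is nonnegative,

log p(O) = L(θ) + KL( q(U|θ) || p(U|O) ) ≥ L(θ),

so L(θ) is a lower bound on the log evidence (often called the evidence lower bound, ELBO), and maximizing L(θ) both tightens this bound and minimizes the KL divergence.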


Variational Bayesian inference

Use the factorization q(U|θ) = ∏_i q_i(U_i|θ_i):

L(θ) = ∫ ( ∏_{i=1}^{n} q_i(u_i|θ_i) ) · log p(u,O) d(u_1, …, u_n)
       − ∫ ( ∏_{i=1}^{n} q_i(u_i|θ_i) ) · ( ∑_{j=1}^{n} log q_j(u_j|θ_j) ) d(u_1, …, u_n)

     = ∫ ( ∏_{i=1}^{n} q_i(u_i|θ_i) ) · log p(u,O) d(u_1, …, u_n)
       − ∑_{j=1}^{n} ∫ ( ∏_{i=1}^{n} q_i(u_i|θ_i) ) · log q_j(u_j|θ_j) d(u_1, …, u_n)

     = ∫ ( ∏_{i=1}^{n} q_i(u_i|θ_i) ) · log p(u,O) d(u_1, …, u_n)
       − ∑_{j=1}^{n} ( ( ∏_{i≠j} ∫ q_i(u_i|θ_i) du_i ) · ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j )


Variational Bayesian inference

L(θ) = ∫ ( ∏_{i=1}^{n} q_i(u_i|θ_i) ) · log p(u,O) d(u_1, …, u_n)
       − ∑_{j=1}^{n} ( ( ∏_{i≠j} ∫ q_i(u_i|θ_i) du_i ) · ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j )
         (each factor ∫ q_i(u_i|θ_i) du_i equals 1)

     = ∫ ( ∏_{i=1}^{n} q_i(u_i|θ_i) ) · log p(u,O) d(u_1, …, u_n)
       − ∑_{j=1}^{n} ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j


Variational Bayesian inference

Select one (arbitrary) factor k:

L(θ) = ∫ ( ∏_{i=1}^{n} q_i(u_i|θ_i) ) · log p(u,O) d(u_1, …, u_n)
       − ∑_{j=1}^{n} ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j

     = ∫ q_k(u_k|θ_k) · ( ∫⋯∫ log p(u,O) · ∏_{i≠k} q_i(u_i|θ_i) du_1 … du_{k−1} du_{k+1} … du_n ) du_k
       − ∑_{j=1}^{n} ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j

     = ∫ q_k(u_k|θ_k) · log( exp( ∫⋯∫ log p(u,O) · ∏_{i≠k} q_i(u_i|θ_i) du_1 … du_{k−1} du_{k+1} … du_n ) ) du_k
       − ∑_{j=1}^{n} ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j


Variational Bayesian inference

L(θ) = ∫ q_k(u_k|θ_k) · log( exp( ∫⋯∫ log p(u,O) · ∏_{i≠k} q_i(u_i|θ_i) du_1 … du_{k−1} du_{k+1} … du_n ) ) du_k
       − ∑_{j=1}^{n} ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j

     = ∫ q_k(u_k|θ_k) · log q*_k(u_k) du_k + log Z − ∑_{j=1}^{n} ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j

with

Z = ∫ exp( ∫⋯∫ log p(u,O) · ∏_{i≠k} q_i(u_i|θ_i) du_1 … du_{k−1} du_{k+1} … du_n ) du_k

q*_k(u_k) = (1/Z) · exp( ∫⋯∫ log p(u,O) · ∏_{i≠k} q_i(u_i|θ_i) du_1 … du_{k−1} du_{k+1} … du_n )

Z serves as normalization constant so that q*_k becomes a density function of a Gibbs distribution.


Variational Bayesian inference

L(θ) = ∫ q_k(u_k|θ_k) · log q*_k(u_k) du_k + log Z − ∑_{j=1}^{n} ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j

     = ∫ q_k(u_k|θ_k) · log q*_k(u_k) du_k − ∫ q_k(u_k|θ_k) · log q_k(u_k|θ_k) du_k
       + log Z − ∑_{j≠k} ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j

     = −KL( q_k(u_k|θ_k) || q*_k(u_k) ) + log Z − ∑_{j≠k} ∫ q_j(u_j|θ_j) log q_j(u_j|θ_j) du_j

We want to maximize L. If we keep all θ_i fixed except θ_k, we should choose θ_k so that q_k(u_k|θ_k) = q*_k(u_k).

If we apply this idea repeatedly, cycling through all possible values of k, we obtain an iterative algorithm that converges to a local maximum of L(θ).


Variational Bayesian inference

Algorithm:

1. start with arbitrary parameter set θ

2. repeat

3. for k ← 1, . . . , n do

4. select θk so that qk(uk|θk) = q∗k(uk)

5. endfor

6. until convergence of θ

7. return θ
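A minimal sketch of this coordinate-ascent loop in code (an illustration, not the lecturer's implementation; the per-factor update functions are assumed to be supplied by the caller and to return the parameters for which q_k equals q*_k):

```python
import numpy as np

def variational_bayes(theta, update_fns, tol=1e-8, max_iter=1000):
    """Coordinate ascent on L(theta).

    theta      -- dict mapping factor index k to its parameter array theta_k
    update_fns -- dict mapping k to a function f(theta) that returns the
                  parameters for which q_k(u_k | theta_k) = q_k*(u_k)
    """
    for _ in range(max_iter):
        old = {k: np.array(v, dtype=float, copy=True) for k, v in theta.items()}
        for k, update in update_fns.items():
            theta[k] = update(theta)   # step 4 of the algorithm above
        if all(np.allclose(np.asarray(theta[k], dtype=float), old[k],
                           atol=tol, rtol=0.0) for k in theta):
            break                      # convergence of theta
    return theta
```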


Variational Bayesian inference

A closer look at q*_k(u_k)

q*_k(u_k) = (1/Z) · exp( ∫⋯∫ log p(u,O) · ∏_{i≠k} q_i(u_i|θ_i) du_1 … du_{k−1} du_{k+1} … du_n )

If p was derived from a Bayesian network, we know that it factors into terms that belong to the Markov blanket of U_k and remaining terms, i.e. p(u,O) = p′(u,O) · p′′(u,O), where the second factor does not depend on u_k.

q*_k(u_k) = (1/Z) · exp( ∫⋯∫ log p′(u,O) · ∏_{i≠k} q_i(u_i|θ_i) du_1 … du_{k−1} du_{k+1} … du_n
                         + ∫⋯∫ log p′′(u,O) · ∏_{i≠k} q_i(u_i|θ_i) du_1 … du_{k−1} du_{k+1} … du_n )
            (the second summand is constant w.r.t. u_k)

          = (1/Z′) · exp( ∫⋯∫ log p′(u,O) · ∏_{i∈blanket(k)} q_i(u_i|θ_i) d{u_i | i ∈ blanket(k)} )


Example: Gaussian

Example: a sample from a Gaussian with unknown parameters

[Graphical model: fixed hyperparameters m0, r0, a0, b0; unobserved µ and s; observed Xi inside a plate of size n]

µ ~ N(m0, r0)
s ~ Γ⁻¹(a0, b0)
Xi ~ N(µ, s)

p(µ, s, x_1, …, x_n) = d_µ(µ) · d_s(s) · d(µ, s)

with

d_µ(µ)  := 1/√(2πr0) · e^{−(µ−m0)²/(2r0)}

d_s(s)  := b0^{a0}/Γ(a0) · s^{−a0−1} · e^{−b0/s}

d(µ, s) := ∏_{i=1}^{n} 1/√(2πs) · e^{−(x_i−µ)²/(2s)}


Example: Gaussian

[Graphical model as on the previous slide]

Modeling full posterior by variational approximation

q(µ, s|m, r, a, b) = q_µ(µ|m, r) · q_s(s|a, b)

q_µ(µ|m, r) = 1/√(2πr) · e^{−(µ−m)²/(2r)}

q_s(s|a, b) = b^a/Γ(a) · s^{−a−1} · e^{−b/s}


Example: Gaussian

q*_µ(µ) ∝ exp( ∫_{−∞}^{∞} log p(µ, s, x_1, …, x_n) · q_s(s|a, b) ds )

q*_s(s) ∝ exp( ∫_{−∞}^{∞} log p(µ, s, x_1, …, x_n) · q_µ(µ|m, r) dµ )


Example: Gaussian

Side calculation:

d(µ, s) = ∏_{i=1}^{n} 1/√(2πs) · e^{−(x_i−µ)²/(2s)}

        = 1/((2π)^{n/2}·s^{n/2}) · e^{−( ∑x_i² − 2µ·∑x_i + n·µ² )/(2s)}

        = 1/((2π)^{n/2}·s^{n/2}) · e^{−(n/(2s))·( µ² − 2µ·(∑x_i)/n + (∑x_i²)/n )}

        = 1/((2π)^{n/2}·s^{n/2}) · e^{−(n/(2s))·( µ² − 2µ·(∑x_i)/n + ((∑x_i)/n)² − ((∑x_i)/n)² + (∑x_i²)/n )}

        = 1/((2π)^{n/2}·s^{n/2}) · e^{−( (µ−x̄)² + V_x )/(2s/n)}

with x̄ the mean and V_x the variance of the sample:

x̄ = (1/n)·∑ x_i        V_x = (1/n)·∑ x_i² − ( (1/n)·∑ x_i )²


Example: Gaussian

q*_µ(µ) ∝ exp( ∫_{−∞}^{∞} log p(µ, s, x_1, …, x_n) · q_s(s|a, b) ds )

        = exp( ∫_{−∞}^{∞} log( d_µ(µ) · d_s(s) · d(µ, s) ) · q_s(s|a, b) ds )

log q*_µ(µ) = const(µ) + ∫_{−∞}^{∞} log d_µ(µ) · q_s(s|a, b) ds
              + ∫_{−∞}^{∞} log d_s(s) · q_s(s|a, b) ds                    (= const(µ))
              + ∫_{−∞}^{∞} log d(µ, s) · q_s(s|a, b) ds

            = const(µ) + log d_µ(µ) · ∫_{−∞}^{∞} q_s(s|a, b) ds           (the integral equals 1)
              + ∫_{−∞}^{∞} log d(µ, s) · q_s(s|a, b) ds


Example: Gaussian

∫_{−∞}^{∞} log d(µ, s) · q_s(s|a, b) ds

  = ∫_{−∞}^{∞} log( 1/((2π)^{n/2}·s^{n/2}) · e^{−((µ−x̄)²+V_x)/(2s/n)} ) · b^a/Γ(a) · s^{−a−1} · e^{−b/s} ds

  = ∫_{−∞}^{∞} log( 1/((2π)^{n/2}·s^{n/2}) ) · b^a/Γ(a) · s^{−a−1} · e^{−b/s} ds          (= const(µ))
    + ∫_{−∞}^{∞} log( e^{−((µ−x̄)²+V_x)/(2s/n)} ) · b^a/Γ(a) · s^{−a−1} · e^{−b/s} ds

  = const(µ) + ∫_{−∞}^{∞} ( −(1/2)·((µ−x̄)² + V_x)/(s/n) ) · b^a/Γ(a) · s^{−a−1} · e^{−b/s} ds


Example: Gaussian

  = const(µ) + ∫_{−∞}^{∞} ( −(1/2)·((µ−x̄)² + V_x)/(s/n) ) · b^a/Γ(a) · s^{−a−1} · e^{−b/s} ds

  = const(µ) + ∫_{−∞}^{∞} ( −(1/2)·((µ−x̄)² + V_x)/(1/n) ) · Γ(a+1)/(Γ(a)·b) · b^{a+1}/Γ(a+1) · s^{−(a+1)−1} · e^{−b/s} ds

  = const(µ) + ( −(1/2)·((µ−x̄)² + V_x)/(1/n) ) · (a/b) · ∫_{−∞}^{∞} b^{a+1}/Γ(a+1) · s^{−(a+1)−1} · e^{−b/s} ds          (the integral equals 1)

  = const(µ) + (a/b)·( −(1/2)·(µ−x̄)²/(1/n) ) + (a/b)·( −(1/2)·V_x/(1/n) )          (the last term is const(µ))

  = const(µ) + (a/b)·( −(1/2)·(µ−x̄)²/(1/n) )


Example: Gaussian

Assembling all pieces:

q*_µ(µ) ∝ d_µ(µ) · exp( (a/b)·( −(1/2)·(µ−x̄)²/(1/n) ) )

        ∝ exp( −(1/2)·( (µ−m0)²/r0 + (µ−x̄)²/(b/(a·n)) ) )

        = exp( −(1/2)·( (b/(a·n))·(µ² − 2µ·m0 + m0²) + r0·(µ² − 2µ·x̄ + x̄²) ) / ( r0·b/(a·n) ) )

        = exp( −(1/2)·( (r0 + b/(a·n))·µ² − 2µ·( b·m0/(a·n) + r0·x̄ ) + b·m0²/(a·n) + r0·x̄² ) / ( r0·b/(a·n) ) )

        = exp( −(1/2)·( ( µ − (b·m0/(a·n) + r0·x̄)/(r0 + b/(a·n)) )² − ( (b·m0/(a·n) + r0·x̄)/(r0 + b/(a·n)) )² + ( b·m0²/(a·n) + r0·x̄² )/(r0 + b/(a·n)) ) / ( (r0·b/(a·n))/(r0 + b/(a·n)) ) )


Example: Gaussian

q*_µ(µ) ∝ exp( −(1/2)·( ( µ − (b·m0/(a·n) + r0·x̄)/(r0 + b/(a·n)) )² − ( (b·m0/(a·n) + r0·x̄)/(r0 + b/(a·n)) )² + ( b·m0²/(a·n) + r0·x̄² )/(r0 + b/(a·n)) ) / ( (r0·b/(a·n))/(r0 + b/(a·n)) ) )

        ∝ exp( −(1/2)·( µ − (b·m0 + r0·a·n·x̄)/(r0·a·n + b) )² / ( r0·b/(r0·a·n + b) ) )

Comparing q*_µ(µ) with the parameterized form q_µ(µ|m, r) = 1/√(2πr) · e^{−(µ−m)²/(2r)} yields

m ← (b·m0 + r0·a·n·x̄)/(r0·a·n + b)

r ← r0·b/(r0·a·n + b)


Example: Gaussian

With a similar calculation, we obtain

q*_s(s) ∝ s^{−(a0 + n/2) − 1} · e^{−( b0 + (n/2)·(V_x + (m−x̄)² + r) )/s}

Comparing q*_s(s) with the parameterized form q_s(s|a, b) = b^a/Γ(a) · s^{−a−1} · e^{−b/s} yields

a ← a0 + n/2

b ← b0 + (n/2)·(V_x + (m−x̄)² + r)


Example: Gaussian

Algorithm:

1. start with arbitrary values of m, r, a, b

2. repeat

3. set m ← (b·m0 + r0·a·n·x̄)/(r0·a·n + b)

4. set r ← r0·b/(r0·a·n + b)

5. set a ← a0 + n/2

6. set b ← b0 + (n/2)·(V_x + (m−x̄)² + r)

7. until convergence of (m, r, a, b)

8. return θ
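A compact sketch of this algorithm in code (an illustration of the slide's update rules, not the lecturer's implementation; the default prior values below are just one choice of nearly non-informative priors):

```python
import numpy as np

def vb_gaussian(x, m0=0.0, r0=1e6, a0=1e-3, b0=1e-3, n_iter=10):
    # Variational Bayes for one Gaussian: q_mu = N(m, r), q_s = Inv-Gamma(a, b)
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    Vx = x.var()                      # (1/n) * sum(x_i^2) - xbar^2
    m, r, a, b = 0.0, 1.0, 1.0, 1.0   # arbitrary initial values
    for _ in range(n_iter):
        m = (b * m0 + r0 * a * n * xbar) / (r0 * a * n + b)
        r = r0 * b / (r0 * a * n + b)
        a = a0 + n / 2.0
        b = b0 + n / 2.0 * (Vx + (m - xbar) ** 2 + r)
    return m, r, a, b

# Usage corresponding to the experiment on the next slide: n = 20 points from
# N(10, 9), priors close to non-informative, 10 iterations of variational Bayes.
rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=20)    # variance 9, i.e. standard deviation 3
m, r, a, b = vb_gaussian(x)
print("posterior mean of mu:", m, " posterior mean of s:", b / (a - 1))
```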


Example: Gaussian

Experiment:

◮ generate sample (n = 20) from N(10, 9)

◮ use priors for µ and s close to non-informativity

◮ apply 10 iterations of variational Bayes

[Figure: blue: original sample distribution; black: sample points; green: ML estimator; red: MAP estimator after variational Bayes]


Example: Gaussian

Comparison: full posterior vs. variational posterior

[Figure: posterior over µ and √s; left: full posterior; right: variational approximation]


Example: Gaussian

◮ bad experience: the calculations are lengthy and error-prone

◮ good news: the equations can be resolved whenever the distributions involved are from the exponential family, e.g. Gaussian, Gamma, Inverse Gamma, Wishart, Inverse Wishart, Dirichlet, Beta, Categorical.

◮ semi-good news: there is a message passing algorithm (variational message passing) that implements variational inference for exponential family distributions. But it is hard to understand and sometimes it is easier to do all the calculations manually (Winn, 2005)


Mixture distributions

How can we model distributions that are different from and more complex than the standard distribution families?

◮ search the literature for other distributions

◮ combine distributions

• combine distributions of different kinds (unusual)

• combine distributions of the same kind

→ mixture distributions, e.g. mixture of Gaussians, mixture of Dirichlets, ...


Mixture distributions

How do we combine distributions into a mixture?

Requirements for a density function:

◮ f(x) ≥ 0 for all x ∈ R

◮ ∫_{−∞}^{∞} f(x) dx = 1

Observation: If f and g are pdfs and 0 < w < 1, then x ↦ w·f(x) + (1−w)·g(x) is also a pdf.

More generally, if f_1, …, f_k are pdfs and w_1, …, w_k are nonnegative numbers with ∑_{j=1}^{k} w_j = 1, then x ↦ ∑_{j=1}^{k} w_j·f_j(x) is a pdf. Such a distribution is called a mixture distribution.

◮ fj is the j-th component of the mixture

◮ w_j serves as mixing weight and models the amount of contribution of f_j to the mixture
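A small sketch (purely illustrative, not from the slides) that evaluates such a mixture density and draws samples from it by first picking a component and then sampling from it, which anticipates the latent-variable view on the next slide:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, w, mus, variances):
    # f(x) = sum_j w_j * f_j(x); w must be nonnegative and sum to 1
    return sum(wj * gauss_pdf(x, m, v) for wj, m, v in zip(w, mus, variances))

w, mus, variances = [0.3, 0.7], np.array([0.0, 10.0]), np.array([1.0, 4.0])
print(mixture_pdf(np.linspace(-5, 15, 5), w, mus, variances))

# sampling: first z_i ~ Categorical(w), then x_i ~ N(mu_{z_i}, s_{z_i})
rng = np.random.default_rng(0)
z = rng.choice(len(w), size=1000, p=w)
samples = rng.normal(mus[z], np.sqrt(variances[z]))
```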


Mixture distributions

Interpretation of mixture distributions as structured distributions

◮ each component models one class (category), which is described by f_j

◮ class j contributes with a ratio of wj to the whole

◮ each sample element x_i of a mixture belongs to one component, but we do not know to which one

Introduce latent variables z_i that model which component x_i belongs to

[Graphical model: mixing weights ~w; component densities f_j inside a plate of size k; latent Z_i and observed X_i inside a plate of size n]

Z_i ~ C(~w)
X_i | Z_i ~ f_{Z_i}


Example: Gaussian mixture

[Graphical model: mixing weights ~w; component parameters µ_j, s_j inside a plate of size k; latent Z_i and observed X_i inside a plate of size n]

Z_i ~ C(~w)
X_i | Z_i ~ N(µ_{Z_i}, s_{Z_i})

How can we apply variational Bayesian inference in Gaussian mixtures?


Variational Bayes for Gaussian mixture

[Graphical model: fixed hyperparameters m0, r0, a0, b0 and ~β; component parameters µ_j, s_j inside a plate of size k; mixing weights ~w; latent Z_i and observed X_i inside a plate of size n]

µ_j ~ N(m0, r0)
s_j ~ Γ⁻¹(a0, b0)
~w ~ D(~β)
Z_i | ~w ~ C(~w)
X_i | Z_i, µ_{Z_i}, s_{Z_i} ~ N(µ_{Z_i}, s_{Z_i})

p(µ_1, …, µ_k, s_1, …, s_k, w_1, …, w_k, x_1, …, x_n, z_1, …, z_n)

  = ∏_{j=1}^{k} ( 1/√(2πr0) · e^{−(µ_j−m0)²/(2r0)} ) · ∏_{j=1}^{k} ( b0^{a0}/Γ(a0) · s_j^{−a0−1} · e^{−b0/s_j} )

    · ( Γ(∑_{j=1}^{k} β_j) / ∏_{j=1}^{k} Γ(β_j) ) · ∏_{j=1}^{k} w_j^{β_j−1} · ∏_{i=1}^{n} w_{z_i} · ∏_{i=1}^{n} ( 1/√(2πs_{z_i}) · e^{−(x_i−µ_{z_i})²/(2s_{z_i})} )


Variational Bayes for Gaussian mixture

Modeling full posterior by variational approximation

q(µ_1, …, µ_k, s_1, …, s_k, w_1, …, w_k, z_1, …, z_n | m_1, …, m_k, r_1, …, r_k, a_1, …, a_k, b_1, …, b_k, α_1, …, α_k, h_{1,1}, …, h_{n,k})

  = ∏_{j=1}^{k} q_µ(µ_j|m_j, r_j) · ∏_{j=1}^{k} q_s(s_j|a_j, b_j) · q_~w(~w|~α) · ∏_{i=1}^{n} q_z(z_i|h_{i,1}, …, h_{i,k})

with

q_µ(µ_j|m_j, r_j) = 1/√(2πr_j) · e^{−(µ_j−m_j)²/(2r_j)}

q_s(s_j|a_j, b_j) = b_j^{a_j}/Γ(a_j) · s_j^{−a_j−1} · e^{−b_j/s_j}

q_~w(~w|~α) = ( Γ(∑_{j=1}^{k} α_j) / ∏_{j=1}^{k} Γ(α_j) ) · ∏_{j=1}^{k} w_j^{α_j−1}

q_z(z_i|h_{i,1}, …, h_{i,k}) = h_{i,z_i}


Variational Bayes for Gaussian mixture

After some (more or less complicated) calculations we obtain the update rules

m_j ← (b_j·m0 + r0·a_j·n_j·x̄_j) / (b_j + r0·a_j·n_j)

r_j ← r0·b_j / (b_j + r0·a_j·n_j)

a_j ← a0 + n_j/2

b_j ← b0 + (n_j/2)·(V_{x,j} + (m_j − x̄_j)² + r_j)

α_j ← β_j + n_j

h_{i,j} ← c · exp( ψ(α_j) + (1/2)·ψ(a_j) − (1/2)·log(b_j) − (1/2)·(a_j/b_j)·((x_i − m_j)² + r_j) )
          where c normalizes ∑_{j=1}^{k} h_{i,j} to 1

with

n_j = ∑_{i=1}^{n} h_{i,j}        x̄_j = (1/n_j)·∑_{i=1}^{n} h_{i,j}·x_i        V_{x,j} = (1/n_j)·∑_{i=1}^{n} h_{i,j}·x_i² − x̄_j²

ψ denotes the digamma function, ψ(x) = (d/dx Γ(x)) / Γ(x)
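A sketch of the resulting update loop (an illustration of these rules, not the lecturer's Matlab demo; the initialization and the fixed iteration count are arbitrary choices):

```python
import numpy as np
from scipy.special import digamma

def vb_gaussian_mixture(x, k, m0=0.0, r0=1e6, a0=1e-3, b0=1e-3, beta=1.0, n_iter=300):
    # q factorizes into Gaussians q(mu_j), inverse-gamma q(s_j), a Dirichlet q(w),
    # and categorical responsibilities h[i, j].
    x = np.asarray(x, dtype=float)
    n = len(x)
    rng = np.random.default_rng(0)
    h = rng.dirichlet(np.ones(k), size=n)          # random initial responsibilities
    m = rng.choice(x, size=k)                      # initial component means
    r = np.ones(k)
    a = np.full(k, a0 + n / (2.0 * k))
    b = np.ones(k)
    alpha = np.full(k, beta)
    for _ in range(n_iter):
        nj = h.sum(axis=0) + 1e-12                 # effective counts n_j (avoid /0)
        xbar = (h * x[:, None]).sum(axis=0) / nj   # weighted means x̄_j
        Vx = (h * x[:, None] ** 2).sum(axis=0) / nj - xbar ** 2
        # parameter updates from the slide
        m = (b * m0 + r0 * a * nj * xbar) / (b + r0 * a * nj)
        r = r0 * b / (b + r0 * a * nj)
        a = a0 + nj / 2.0
        b = b0 + nj / 2.0 * (Vx + (m - xbar) ** 2 + r)
        alpha = beta + nj
        # responsibility update, normalized so that each row sums to 1
        log_h = (digamma(alpha) + 0.5 * digamma(a) - 0.5 * np.log(b)
                 - 0.5 * (a / b) * ((x[:, None] - m) ** 2 + r))
        log_h -= log_h.max(axis=1, keepdims=True)  # for numerical stability
        h = np.exp(log_h)
        h /= h.sum(axis=1, keepdims=True)
    return m, r, a, b, alpha, h

# Usage roughly matching the demo setup on the next slide: n = 1000, k = 30,
# sample drawn from a uniform distribution, priors close to non-informative.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=1000)
m, r, a, b, alpha, h = vb_gaussian_mixture(x, k=30)
```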


Variational Bayes for Gaussian mixture

Example → Matlab demo

[Figure: iteration = 300, k = 30, n = 1000; the plot shows the MAP estimate after variational inference; the sample of size 1000 is taken from a uniform distribution; priors were set close to non-informativity]


Summary

◮ Kullback-Leibler divergence

◮ principle of variational Bayes and theoretical derivation

◮ Example: variational Bayes for a Gaussian

◮ mixture distributions

◮ Example: variational Bayes for Gaussian mixtures
