LEARNING AND INFERENCE IN GRAPHICAL MODELS
Chapter 06: Variational Bayesian Inference
Dr. Martin Lauer
University of Freiburg, Machine Learning Lab
Karlsruhe Institute of Technology, Institute of Measurement and Control Systems
Learning and Inference in Graphical Models. Chapter 06 – p. 1/38
References for this chapter
◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 10, Springer, 2006
◮ Charles Fox and Stephen Roberts, A Tutorial on Variational Bayesian Inference, In: Artificial Intelligence Review, vol. 38, no. 2, pp. 85–95, 2012
◮ John Winn and Christopher M. Bishop, Variational Message Passing, In: Journal of Machine Learning Research, vol. 6, pp. 661–694, 2005, http://machinelearning.wustl.edu/mlpapers/paper_files/WinnB05.pdf
Learning and Inference in Graphical Models. Chapter 06 – p. 2/38
Approximate solutions
Observations:
◮ inference on Bayesian networks can be done analytically for polytrees combined with special distributions (categorical, Gauss-linear)
◮ inference is not analytically tractable in the general case
◮ the general case therefore requires numerical or approximate solutions
Learning and Inference in Graphical Models. Chapter 06 – p. 3/38
Approximate inference
The joint probability distribution of many Bayesian networks is complicated and hard to treat analytically.
Goal: find a simpler joint probability distribution that approximates the original one and that can be treated analytically.
Example (figure): two plate models over µ, σ² and the observations Xi (i = 1, ..., n) —
left: conjugate prior for Gaussians; right: a desirable, simpler prior for Gaussians.
Learning and Inference in Graphical Models. Chapter 06 – p. 4/38
Approximate inference
How can we measure whether two distributions are similar?
Definition: the Kullback-Leibler divergence is an asymmetric measure of the dissimilarity of two probability distributions. It is defined by
\[
KL(p\,\|\,q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\, dx
\]
Properties:
◮ KL(p||p) = 0
◮ KL(p||q) ≥ 0 with equality only if p = q almost everywhere
◮ KL(p||q) ≠ KL(q||p)
◮ KL(p||q) + KL(q||r) ≥ KL(p||r) does not hold in general (no triangle inequality)
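As a quick numerical illustration of these properties, the following sketch discretizes two Gaussians on a grid and evaluates the divergence; the grid and the parameters are arbitrary illustration choices, not taken from the lecture.

```python
# Numerical sanity check of the KL properties above on a grid (illustrative only).
import numpy as np

def kl(p, q, dx):
    """Approximate KL(p||q) = integral of p(x) * log(p(x)/q(x)) dx on a grid."""
    return np.sum(p * np.log(p / q)) * dx

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
gauss = lambda x, mu, var: np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)
p, q = gauss(x, 0.0, 1.0), gauss(x, 1.0, 2.0)

print(kl(p, p, dx))                # ~0: KL(p||p) = 0
print(kl(p, q, dx), kl(q, p, dx))  # both > 0 and clearly different: KL is asymmetric
```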
Learning and Inference in Graphical Models. Chapter 06 – p. 5/38
Kullback-Leibler-divergence
(Figure: a density p(x) plotted together with the two Gaussian approximations argmin_q KL(p||q) and argmin_q KL(q||p).)
Question: which Gaussian q minimizes the Kullback-Leibler divergence KL(p||q), and which minimizes KL(q||p)?
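To make the difference concrete, here is a small numerical experiment; since the lecture's p is only shown as a figure, a two-component Gaussian mixture is assumed as p for illustration, and the best single Gaussian q is found by a simple grid search under both divergences.

```python
# Grid-search illustration (assumed bimodal p): KL(p||q) prefers a broad Gaussian
# covering both modes, KL(q||p) prefers a narrow Gaussian locked onto one mode.
import numpy as np
from itertools import product

x = np.linspace(-15, 15, 3001)
dx = x[1] - x[0]
gauss = lambda x, mu, var: np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)
p = 0.5 * gauss(x, -3.0, 1.0) + 0.5 * gauss(x, 3.0, 1.0)   # assumed example density

def kl(f, g):
    mask = f > 1e-300                 # skip points where f is numerically zero
    return np.sum(f[mask] * np.log(f[mask] / g[mask])) * dx

grid = list(product(np.linspace(-5, 5, 41), np.linspace(0.5, 20, 40)))
best_pq = min((kl(p, gauss(x, mu, var)), mu, var) for mu, var in grid)
best_qp = min((kl(gauss(x, mu, var), p), mu, var) for mu, var in grid)
print("argmin_q KL(p||q): mu, var =", best_pq[1:])   # roughly mu = 0, large variance
print("argmin_q KL(q||p): mu, var =", best_qp[1:])   # roughly one of the two modes
```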
Learning and Inference in Graphical Models. Chapter 06 – p. 6/38
Variational Bayesian inference
◮ assume we are given a complex distribution p(U|O), where U are the unobserved variables and O the observed variables
◮ we want to approximate it with a parameterized distribution q(U|θ), with θ the set of parameters
◮ we assume that q can be factorized as $q(U|\theta) = \prod_i q_i(U_i|\theta_i)$, with each $q_i$ a conditional distribution
◮ how should we choose θ to obtain the best approximation?
\[
\min_{\theta}\; KL\big(q(U|\theta)\,\|\,p(U|O)\big)
\]
Learning and Inference in Graphical Models. Chapter 06 – p. 7/38
Variational Bayesian inference
\begin{align*}
KL\big(q(U|\theta)\,\|\,p(U|O)\big) &= \int q(u|\theta) \log\frac{q(u|\theta)}{p(u|O)}\,du \\
&= -\int q(u|\theta) \log\frac{p(u|O)}{q(u|\theta)}\,du \\
&= -\int q(u|\theta) \log\frac{p(u,O)}{p(O)\cdot q(u|\theta)}\,du \\
&= -\int \Big( q(u|\theta) \log\frac{p(u,O)}{q(u|\theta)} - q(u|\theta)\log p(O) \Big)\,du \\
&= -\underbrace{\int q(u|\theta) \log\frac{p(u,O)}{q(u|\theta)}\,du}_{=:L(\theta)} + \log p(O)
\end{align*}
Observe that log p(O) does not depend on θ.
Hence, minimizing the KL divergence amounts to maximizing L(θ).
Learning and Inference in Graphical Models. Chapter 06 – p. 8/38
Variational Bayesian inference
Use the factorization $q(U|\theta) = \prod_i q_i(U_i|\theta_i)$:
\begin{align*}
L(\theta) &= \int \Big(\prod_{i=1}^{n} q_i(u_i|\theta_i)\Big) \log p(u,O)\; d(u_1,\dots,u_n)
 - \int \Big(\prod_{i=1}^{n} q_i(u_i|\theta_i)\Big)\Big(\sum_{j=1}^{n} \log q_j(u_j|\theta_j)\Big)\; d(u_1,\dots,u_n) \\
&= \int \Big(\prod_{i=1}^{n} q_i(u_i|\theta_i)\Big) \log p(u,O)\; d(u_1,\dots,u_n)
 - \sum_{j=1}^{n} \int \Big(\prod_{i=1}^{n} q_i(u_i|\theta_i)\Big) \log q_j(u_j|\theta_j)\; d(u_1,\dots,u_n) \\
&= \int \Big(\prod_{i=1}^{n} q_i(u_i|\theta_i)\Big) \log p(u,O)\; d(u_1,\dots,u_n)
 - \sum_{j=1}^{n} \Big( \prod_{i\neq j} \Big(\int q_i(u_i|\theta_i)\,du_i\Big) \cdot \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j \Big)
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 9/38
Variational Bayesian inference
\begin{align*}
L(\theta) &= \int \Big(\prod_{i=1}^{n} q_i(u_i|\theta_i)\Big) \log p(u,O)\; d(u_1,\dots,u_n)
 - \sum_{j=1}^{n} \Big( \prod_{i\neq j} \underbrace{\Big(\int q_i(u_i|\theta_i)\,du_i\Big)}_{=1} \cdot \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j \Big) \\
&= \int \Big(\prod_{i=1}^{n} q_i(u_i|\theta_i)\Big) \log p(u,O)\; d(u_1,\dots,u_n)
 - \sum_{j=1}^{n} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 10/38
Variational Bayesian inference
Select one (arbitrary) factor k:
\begin{align*}
L(\theta) &= \int \Big(\prod_{i=1}^{n} q_i(u_i|\theta_i)\Big) \log p(u,O)\; d(u_1,\dots,u_n)
 - \sum_{j=1}^{n} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j \\
&= \int q_k(u_k|\theta_k) \Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\; du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\, du_k
 - \sum_{j=1}^{n} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j \\
&= \int q_k(u_k|\theta_k) \log\Big( \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\; du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\Big)\, du_k
 - \sum_{j=1}^{n} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 11/38
Variational Bayesian inference
\begin{align*}
L(\theta) &= \int q_k(u_k|\theta_k) \log\Big( \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\; du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\Big)\, du_k
 - \sum_{j=1}^{n} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j \\
&= \int q_k(u_k|\theta_k) \log q^*_k(u_k)\, du_k + \log Z - \sum_{j=1}^{n} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j
\end{align*}
with
\begin{align*}
Z &= \int \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\; du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\, du_k \\
q^*_k(u_k) &= \frac{1}{Z} \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\; du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)
\end{align*}
Z serves as a normalization constant so that q*_k becomes the density function of a Gibbs distribution.
Learning and Inference in Graphical Models. Chapter 06 – p. 12/38
Variational Bayesian inference
\begin{align*}
L(\theta) &= \int q_k(u_k|\theta_k) \log q^*_k(u_k)\, du_k + \log Z - \sum_{j=1}^{n} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j \\
&= \int q_k(u_k|\theta_k) \log q^*_k(u_k)\, du_k - \int q_k(u_k|\theta_k) \log q_k(u_k|\theta_k)\, du_k
 + \log Z - \sum_{j\neq k} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j \\
&= -KL\big(q_k(u_k|\theta_k)\,\|\,q^*_k(u_k)\big) + \log Z - \sum_{j\neq k} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\,du_j
\end{align*}
We want to maximize L. If we keep all θ_i fixed except θ_k, we should choose θ_k so that $q_k(u_k|\theta_k) = q^*_k(u_k)$.
If we apply this idea repeatedly, cycling through all possible values of k, we obtain an iterative algorithm that converges to a local maximum of L(θ).
Learning and Inference in Graphical Models. Chapter 06 – p. 13/38
Variational Bayesian inference
Algorithm:
1. start with arbitrary parameter set θ
2. repeat
3. for k ← 1, . . . , n do
4. select θ_k so that q_k(u_k|θ_k) = q*_k(u_k)
5. endfor
6. until convergence of θ
7. return θ
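A minimal sketch of this loop in Python, assuming the model-specific optima q*_k are supplied as per-parameter update functions and that each θ_k is a scalar (the function and parameter names are hypothetical placeholders, not part of the lecture):

```python
# Generic coordinate-ascent skeleton for variational Bayes (illustrative sketch).
def variational_bayes(theta_init, update_rules, tol=1e-8, max_iter=1000):
    """theta_init: dict k -> initial scalar theta_k (step 1).
    update_rules: dict k -> function(theta) returning the theta_k for which
                  q_k(.|theta_k) equals q*_k given all other current parameters."""
    theta = dict(theta_init)
    for _ in range(max_iter):                      # step 2: repeat
        old = dict(theta)
        for k, update in update_rules.items():     # steps 3-5: cycle over the factors
            theta[k] = update(theta)               # step 4: set q_k = q*_k
        if all(abs(theta[k] - old[k]) < tol for k in theta):
            break                                  # step 6: convergence of theta
    return theta                                   # step 7
```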
Learning and Inference in Graphical Models. Chapter 06 – p. 14/38
Variational Bayesian inference
A closer look at $q^*_k(u_k)$:
\[
q^*_k(u_k) = \frac{1}{Z} \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\; du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)
\]
If p is derived from a Bayesian network, we know that it factors into terms that belong to the Markov blanket of $U_k$ and other terms, i.e. $p(u,O) = p'(u,O) \cdot p''(u,O)$, where the second factor does not depend on $u_k$.
\begin{align*}
q^*_k(u_k) &= \frac{1}{Z} \exp\Big( \int\!\cdots\!\int \log p'(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\; du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \\
&\qquad\qquad + \underbrace{\int\!\cdots\!\int \log p''(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\; du_1 \dots du_{k-1}\, du_{k+1} \dots du_n}_{\text{constant w.r.t. } u_k} \Big) \\
&= \frac{1}{Z'} \exp\Big( \int\!\cdots\!\int \log p'(u,O) \prod_{i\in \mathrm{blanket}(k)} q_i(u_i|\theta_i)\; d\{u_i \mid i \in \mathrm{blanket}(k)\} \Big)
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 15/38
Example: Gaussian
Example: a sample from a Gaussian with unknown parameters.

(Graphical model, plate notation: hyperparameters m_0, r_0, a_0, b_0; parameters µ and s; observations X_i, i = 1, ..., n.)
\begin{align*}
\mu &\sim \mathcal{N}(m_0, r_0) \\
s &\sim \Gamma^{-1}(a_0, b_0) \\
X_i &\sim \mathcal{N}(\mu, s)
\end{align*}
\[
p(\mu, s, x_1, \dots, x_n) =
\underbrace{\frac{1}{\sqrt{2\pi r_0}}\, e^{-\frac{1}{2}\frac{(\mu-m_0)^2}{r_0}}}_{=:d_\mu(\mu)}
\cdot
\underbrace{\frac{b_0^{a_0}}{\Gamma(a_0)}\, s^{-a_0-1} e^{-\frac{b_0}{s}}}_{=:d_s(s)}
\cdot
\underbrace{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi s}}\, e^{-\frac{1}{2}\frac{(x_i-\mu)^2}{s}}}_{=:d(\mu,s)}
\]
Learning and Inference in Graphical Models. Chapter 06 – p. 16/38
Example: Gaussian
(Graphical model, plate notation: hyperparameters m_0, r_0, a_0, b_0; parameters µ and s; observations X_i, i = 1, ..., n.)

Modeling the full posterior by a variational approximation:
\begin{align*}
q(\mu, s \mid m, r, a, b) &= q_\mu(\mu|m, r) \cdot q_s(s|a, b) \\
q_\mu(\mu|m, r) &= \frac{1}{\sqrt{2\pi r}}\, e^{-\frac{1}{2}\frac{(\mu-m)^2}{r}} \\
q_s(s|a, b) &= \frac{b^{a}}{\Gamma(a)}\, s^{-a-1} e^{-\frac{b}{s}}
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 17/38
Example: Gaussian
\begin{align*}
q^*_\mu(\mu) &\propto \exp\Big( \int_{-\infty}^{\infty} \log p(\mu, s, x_1, \dots, x_n) \cdot q_s(s|a,b)\, ds \Big) \\
q^*_s(s) &\propto \exp\Big( \int_{-\infty}^{\infty} \log p(\mu, s, x_1, \dots, x_n) \cdot q_\mu(\mu|m,r)\, d\mu \Big)
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 18/38
Example: Gaussian
Side calculation:
\begin{align*}
d(\mu, s) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi s}}\, e^{-\frac{1}{2}\frac{(x_i-\mu)^2}{s}} \\
&= \frac{1}{(2\pi)^{\frac{n}{2}} s^{\frac{n}{2}}}\, e^{-\frac{1}{2s}\left(\sum x_i^2 - 2\mu\sum x_i + n\mu^2\right)} \\
&= \frac{1}{(2\pi)^{\frac{n}{2}} s^{\frac{n}{2}}}\, e^{-\frac{n}{2s}\left(\mu^2 - 2\mu\frac{\sum x_i}{n} + \frac{\sum x_i^2}{n}\right)} \\
&= \frac{1}{(2\pi)^{\frac{n}{2}} s^{\frac{n}{2}}}\, e^{-\frac{n}{2s}\left(\mu^2 - 2\mu\frac{\sum x_i}{n} + \left(\frac{\sum x_i}{n}\right)^2 - \left(\frac{\sum x_i}{n}\right)^2 + \frac{\sum x_i^2}{n}\right)} \\
&= \frac{1}{(2\pi)^{\frac{n}{2}} s^{\frac{n}{2}}}\, e^{-\frac{1}{2}\frac{(\mu-\bar{x})^2 + V_x}{s/n}}
\end{align*}
with $\bar{x}$ the mean and $V_x$ the variance of the sample:
\[
\bar{x} = \frac{1}{n}\sum x_i, \qquad V_x = \frac{1}{n}\sum x_i^2 - \Big(\frac{1}{n}\sum x_i\Big)^2
\]
Learning and Inference in Graphical Models. Chapter 06 – p. 19/38
Example: Gaussian
\begin{align*}
q^*_\mu(\mu) &\propto \exp\Big( \int_{-\infty}^{\infty} \log p(\mu, s, x_1, \dots, x_n) \cdot q_s(s|a,b)\, ds \Big) \\
&= \exp\Big( \int_{-\infty}^{\infty} \big(\log d_\mu(\mu) + \log d_s(s) + \log d(\mu, s)\big) \cdot q_s(s|a,b)\, ds \Big) \\[4pt]
\log q^*_\mu(\mu) &= \mathrm{const}(\mu) + \int_{-\infty}^{\infty} \log d_\mu(\mu)\, q_s(s|a,b)\, ds
 + \underbrace{\int_{-\infty}^{\infty} \log d_s(s) \cdot q_s(s|a,b)\, ds}_{=\mathrm{const}(\mu)}
 + \int_{-\infty}^{\infty} \log d(\mu, s) \cdot q_s(s|a,b)\, ds \\
&= \mathrm{const}(\mu) + \log d_\mu(\mu) \underbrace{\int_{-\infty}^{\infty} q_s(s|a,b)\, ds}_{=1}
 + \int_{-\infty}^{\infty} \log d(\mu, s) \cdot q_s(s|a,b)\, ds
\end{align*}
Here const(µ) collects all terms that do not depend on µ.
Learning and Inference in Graphical Models. Chapter 06 – p. 20/38
Example: Gaussian
\begin{align*}
&\int_{-\infty}^{\infty} \log d(\mu, s) \cdot q_s(s|a,b)\, ds \\
&= \int_{-\infty}^{\infty} \log\Big( \frac{1}{(2\pi)^{\frac{n}{2}} s^{\frac{n}{2}}}\, e^{-\frac{1}{2}\frac{(\mu-\bar{x})^2 + V_x}{s/n}} \Big) \cdot \frac{b^{a}}{\Gamma(a)}\, s^{-a-1} e^{-\frac{b}{s}}\, ds \\
&= \underbrace{\int_{-\infty}^{\infty} \log\frac{1}{(2\pi)^{\frac{n}{2}} s^{\frac{n}{2}}} \cdot \frac{b^{a}}{\Gamma(a)}\, s^{-a-1} e^{-\frac{b}{s}}\, ds}_{=\mathrm{const}(\mu)}
 + \int_{-\infty}^{\infty} \log e^{-\frac{1}{2}\frac{(\mu-\bar{x})^2 + V_x}{s/n}} \cdot \frac{b^{a}}{\Gamma(a)}\, s^{-a-1} e^{-\frac{b}{s}}\, ds \\
&= \mathrm{const}(\mu) + \int_{-\infty}^{\infty} \Big( -\frac{1}{2}\,\frac{(\mu-\bar{x})^2 + V_x}{s/n} \Big) \cdot \frac{b^{a}}{\Gamma(a)}\, s^{-a-1} e^{-\frac{b}{s}}\, ds
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 21/38
Example: Gaussian
\begin{align*}
&= \mathrm{const}(\mu) + \int_{-\infty}^{\infty} \Big( -\frac{1}{2}\,\frac{(\mu-\bar{x})^2 + V_x}{s/n} \Big) \cdot \frac{b^{a}}{\Gamma(a)}\, s^{-a-1} e^{-\frac{b}{s}}\, ds \\
&= \mathrm{const}(\mu) + \int_{-\infty}^{\infty} \Big( -\frac{1}{2}\,\frac{(\mu-\bar{x})^2 + V_x}{1/n} \Big) \cdot \frac{\Gamma(a+1)}{\Gamma(a)\cdot b}\cdot \frac{b^{a+1}}{\Gamma(a+1)}\, s^{-(a+1)-1} e^{-\frac{b}{s}}\, ds \\
&= \mathrm{const}(\mu) + \Big( -\frac{1}{2}\,\frac{(\mu-\bar{x})^2 + V_x}{1/n} \Big) \cdot \frac{a}{b} \cdot \underbrace{\int_{-\infty}^{\infty} \frac{b^{a+1}}{\Gamma(a+1)}\, s^{-(a+1)-1} e^{-\frac{b}{s}}\, ds}_{=1} \\
&= \mathrm{const}(\mu) + \frac{a}{b}\Big( -\frac{1}{2}\,\frac{(\mu-\bar{x})^2}{1/n} \Big) + \underbrace{\frac{a}{b}\Big( -\frac{1}{2}\,\frac{V_x}{1/n} \Big)}_{=\mathrm{const}(\mu)} \\
&= \mathrm{const}(\mu) + \frac{a}{b}\Big( -\frac{1}{2}\,\frac{(\mu-\bar{x})^2}{1/n} \Big)
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 22/38
Example: Gaussian
Assembling all pieces:
\begin{align*}
q^*_\mu(\mu) &\propto d_\mu(\mu) \cdot \exp\Big( \frac{a}{b}\Big( -\frac{1}{2}\,\frac{(\mu-\bar{x})^2}{1/n} \Big) \Big) \\
&\propto \exp\Big( -\frac{1}{2}\Big( \frac{(\mu-m_0)^2}{r_0} + \frac{(\mu-\bar{x})^2}{\frac{b}{an}} \Big) \Big) \\
&= \exp\Big( -\frac{1}{2}\,\frac{\frac{b}{an}(\mu^2 - 2\mu m_0 + m_0^2) + r_0(\mu^2 - 2\mu\bar{x} + \bar{x}^2)}{\frac{r_0 b}{an}} \Big) \\
&= \exp\Big( -\frac{1}{2}\,\frac{(r_0 + \frac{b}{an})\mu^2 - 2\mu(\frac{b m_0}{an} + r_0\bar{x}) + \frac{b m_0^2}{an} + r_0\bar{x}^2}{\frac{r_0 b}{an}} \Big) \\
&= \exp\Big( -\frac{1}{2}\,\frac{\Big(\mu - \frac{\frac{b m_0}{an} + r_0\bar{x}}{r_0 + \frac{b}{an}}\Big)^2 - \Big(\frac{\frac{b m_0}{an} + r_0\bar{x}}{r_0 + \frac{b}{an}}\Big)^2 + \frac{\frac{b m_0^2}{an} + r_0\bar{x}^2}{r_0 + \frac{b}{an}}}{\frac{\;\frac{r_0 b}{an}\;}{r_0 + \frac{b}{an}}} \Big)
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 23/38
Example: Gaussian
\begin{align*}
q^*_\mu(\mu) &\propto \exp\Big( -\frac{1}{2}\,\frac{\Big(\mu - \frac{\frac{b m_0}{an} + r_0\bar{x}}{r_0 + \frac{b}{an}}\Big)^2 - \Big(\frac{\frac{b m_0}{an} + r_0\bar{x}}{r_0 + \frac{b}{an}}\Big)^2 + \frac{\frac{b m_0^2}{an} + r_0\bar{x}^2}{r_0 + \frac{b}{an}}}{\frac{\;\frac{r_0 b}{an}\;}{r_0 + \frac{b}{an}}} \Big) \\
&\propto \exp\Big( -\frac{1}{2}\,\frac{\big(\mu - \frac{b m_0 + r_0 a n \bar{x}}{r_0 a n + b}\big)^2}{\frac{r_0 b}{r_0 a n + b}} \Big)
\end{align*}
Comparing $q^*_\mu(\mu)$ with the parameterized form $q_\mu(\mu|m,r) = \frac{1}{\sqrt{2\pi r}}\, e^{-\frac{1}{2}\frac{(\mu-m)^2}{r}}$ yields
\[
m \leftarrow \frac{b m_0 + r_0 a n \bar{x}}{r_0 a n + b}, \qquad
r \leftarrow \frac{r_0 b}{r_0 a n + b}
\]
Learning and Inference in Graphical Models. Chapter 06 – p. 24/38
Example: Gaussian
With a similar calculation, we obtain
\[
q^*_s(s) \propto s^{-(a_0 + \frac{n}{2}) - 1}\, e^{-\frac{b_0 + \frac{n}{2}\left(V_x + (m-\bar{x})^2 + r\right)}{s}}
\]
Comparing $q^*_s(s)$ with the parameterized form $q_s(s|a,b) = \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac{b}{s}}$ yields
\[
a \leftarrow a_0 + \frac{n}{2}, \qquad
b \leftarrow b_0 + \frac{n}{2}\left(V_x + (m-\bar{x})^2 + r\right)
\]
Learning and Inference in Graphical Models. Chapter 06 – p. 25/38
Example: Gaussian
Algorithm:
1. start with arbitrary values of m, r, a, b
2. repeat
3. set m ← (b·m_0 + r_0·a·n·x̄) / (r_0·a·n + b)
4. set r ← r_0·b / (r_0·a·n + b)
5. set a ← a_0 + n/2
6. set b ← b_0 + (n/2)·(V_x + (m − x̄)² + r)
7. until convergence of (m, r, a, b)
8. return (m, r, a, b)
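A minimal sketch of this algorithm in Python, with variable names following the slides; the priors and the test sample below are illustrative choices (the second parameter of N(10, 9) is read as a variance), not prescribed by the lecture.

```python
# Variational Bayes for a single Gaussian with unknown mean and variance (sketch).
import numpy as np

def vb_gaussian(x, m0=0.0, r0=1e4, a0=1e-3, b0=1e-3, n_iter=10):
    n = len(x)
    xbar = np.mean(x)
    vx = np.mean(x**2) - xbar**2                  # V_x, the sample variance
    m, r, a, b = m0, r0, a0, b0                   # step 1: arbitrary start values
    for _ in range(n_iter):                       # steps 2-7
        m = (b * m0 + r0 * a * n * xbar) / (r0 * a * n + b)
        r = r0 * b / (r0 * a * n + b)
        a = a0 + n / 2
        b = b0 + n / 2 * (vx + (m - xbar)**2 + r)
    return m, r, a, b                             # step 8

# usage: a sample of size 20 from N(10, 9), close to the experiment on the next slide
rng = np.random.default_rng(0)
print(vb_gaussian(rng.normal(10.0, 3.0, size=20)))
```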
Learning and Inference in Graphical Models. Chapter 06 – p. 26/38
Example: Gaussian
Experiment:
◮ generate a sample (n = 20) from N(10, 9)
◮ use priors for µ and s close to non-informativity
◮ apply 10 iterations of variational Bayes
(Figure — blue: original sample distribution, black: sample points, green: ML estimate, red: MAP estimate after variational Bayes.)
Learning and Inference in Graphical Models. Chapter 06 – p. 27/38
Example: Gaussian
Comparison: full posterior vs. variational posterior
(Figure: two density plots over µ and √s — left: the full posterior, right: the variational approximation.)
Learning and Inference in Graphical Models. Chapter 06 – p. 28/38
Example: Gaussian
◮ bad experience: the calculations are lengthy and error-prone
◮ good news: the equations can be solved in closed form whenever the distributions involved are from the exponential family, e.g. Gaussian, Gamma, Inverse Gamma, Wishart, Inverse Wishart, Dirichlet, Beta, Categorical
◮ semi-good news: there is a message passing algorithm (variational message passing) that implements variational inference for exponential family distributions. But it is hard to understand and sometimes it is easier to do all the calculations manually (Winn, 2005)
Learning and Inference in Graphical Models. Chapter 06 – p. 29/38
Mixture distributions
How can we model distributions that are different from and more complex than the standard distribution families?
◮ search the literature for other distributions
◮ combine distributions
• combine distributions of different kinds (unusual)
• combine distributions of the same kind
→ mixture distributions, e.g. mixture of Gaussians, mixture of Dirichlets, ...
Learning and Inference in Graphical Models. Chapter 06 – p. 30/38
Mixture distributions
How do we combine distributions to a mixture?
Requirements for a density function:
◮ f(x) ≥ 0 for all x ∈ R
◮ ∫_{−∞}^{∞} f(x) dx = 1

Observation: if f and g are pdfs and 0 < w < 1, then x ↦ w·f(x) + (1−w)·g(x) is also a pdf.

More generally, if f_1, ..., f_k are pdfs and w_1, ..., w_k are nonnegative numbers with ∑_{j=1}^{k} w_j = 1, then x ↦ ∑_{j=1}^{k} w_j f_j(x) is a pdf. Such a distribution is called a mixture distribution.
◮ f_j is the j-th component of the mixture
◮ w_j serves as mixing weight and models the amount of contribution of f_j to the mixture
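A short sketch of what such a mixture looks like in code; the three Gaussian components and their weights are illustrative choices, not from the lecture.

```python
# Evaluating and sampling a mixture density f(x) = sum_j w_j * f_j(x)  (sketch).
import numpy as np

weights = np.array([0.5, 0.3, 0.2])                       # w_j >= 0, sum to 1
means, variances = np.array([-2.0, 0.0, 3.0]), np.array([1.0, 0.5, 2.0])

def mixture_pdf(x):
    comps = np.exp(-0.5 * (x[..., None] - means)**2 / variances) \
            / np.sqrt(2 * np.pi * variances)               # f_j(x) per component
    return comps @ weights                                 # weighted sum over j

def sample(n, rng=np.random.default_rng(0)):
    z = rng.choice(len(weights), size=n, p=weights)        # pick a component per draw
    return rng.normal(means[z], np.sqrt(variances[z]))     # draw from that component

print(mixture_pdf(np.linspace(-6.0, 8.0, 5)))              # pointwise density values
print(sample(5))                                           # draws from the mixture
```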
Learning and Inference in Graphical Models. Chapter 06 – p. 31/38
Mixture distributions
Interpretation of mixture distributions as structured distributions
◮ each component models one class (category), which is described by f_j
◮ class j contributes with a ratio of w_j to the whole
◮ each sample element x_i of a mixture belongs to one component, but we do not know to which one

Introduce latent variables Z_i that model to which component x_i belongs.
(Graphical model, plate notation: mixing weights $\vec{w}$, components $f_j$ for $j = 1, \dots, k$, latent variables $Z_i$ and observations $X_i$ for $i = 1, \dots, n$.)
\[
Z_i \sim \mathcal{C}(\vec{w}), \qquad X_i \mid Z_i \sim f_{Z_i}
\]
Learning and Inference in Graphical Models. Chapter 06 – p. 32/38
Example: Gaussian mixture
(Graphical model, plate notation: mixing weights $\vec{w}$, component parameters $\mu_j, s_j$ for $j = 1, \dots, k$, latent variables $Z_i$ and observations $X_i$ for $i = 1, \dots, n$.)
\[
Z_i \sim \mathcal{C}(\vec{w}), \qquad X_i \mid Z_i \sim \mathcal{N}(\mu_{Z_i}, s_{Z_i})
\]
How can we apply variational Bayesian inference in Gaussian mixtures?
Learning and Inference in Graphical Models. Chapter 06 – p. 33/38
Variational Bayes for Gaussian mixture
(Graphical model, plate notation: hyperparameters $m_0, r_0, a_0, b_0$ and $\vec{\beta}$; component parameters $\mu_j, s_j$ for $j = 1, \dots, k$; mixing weights $\vec{w}$; latent variables $Z_i$ and observations $X_i$ for $i = 1, \dots, n$.)
\begin{align*}
\mu_j &\sim \mathcal{N}(m_0, r_0) \\
s_j &\sim \Gamma^{-1}(a_0, b_0) \\
\vec{w} &\sim \mathcal{D}(\vec{\beta}) \\
Z_i \mid \vec{w} &\sim \mathcal{C}(\vec{w}) \\
X_i \mid Z_i, \mu_{Z_i}, s_{Z_i} &\sim \mathcal{N}(\mu_{Z_i}, s_{Z_i})
\end{align*}
\begin{align*}
&p(\mu_1, \dots, \mu_k, s_1, \dots, s_k, w_1, \dots, w_k, x_1, \dots, x_n, z_1, \dots, z_n) \\
&\qquad = \prod_{j=1}^{k}\Big( \frac{1}{\sqrt{2\pi r_0}}\, e^{-\frac{1}{2}\frac{(\mu_j - m_0)^2}{r_0}} \Big)
 \cdot \prod_{j=1}^{k}\Big( \frac{b_0^{a_0}}{\Gamma(a_0)}\, s_j^{-a_0-1} e^{-\frac{b_0}{s_j}} \Big)
 \cdot \frac{\Gamma(\sum_{j=1}^{k}\beta_j)}{\prod_{j=1}^{k}\Gamma(\beta_j)} \prod_{j=1}^{k} w_j^{\beta_j - 1}
 \cdot \prod_{i=1}^{n} w_{z_i}
 \cdot \prod_{i=1}^{n}\Big( \frac{1}{\sqrt{2\pi s_{z_i}}}\, e^{-\frac{1}{2}\frac{(x_i - \mu_{z_i})^2}{s_{z_i}}} \Big)
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 34/38
Variational Bayes for Gaussian mixture
Modeling full posterior by variational approximation
\begin{align*}
&q(\mu_1, \dots, \mu_k, s_1, \dots, s_k, w_1, \dots, w_k, z_1, \dots, z_n \mid m_1, \dots, m_k, r_1, \dots, r_k, a_1, \dots, a_k, b_1, \dots, b_k, \alpha_1, \dots, \alpha_k, h_{1,1}, \dots, h_{n,k}) \\
&\qquad = \prod_{j=1}^{k} q_\mu(\mu_j|m_j, r_j) \cdot \prod_{j=1}^{k} q_s(s_j|a_j, b_j) \cdot q_{\vec{w}}(\vec{w}|\vec{\alpha}) \cdot \prod_{i=1}^{n} q_z(z_i|h_{i,1}, \dots, h_{i,k})
\end{align*}
with
\begin{align*}
q_\mu(\mu_j|m_j, r_j) &= \frac{1}{\sqrt{2\pi r_j}}\, e^{-\frac{1}{2}\frac{(\mu_j - m_j)^2}{r_j}} \\
q_s(s_j|a_j, b_j) &= \frac{b_j^{a_j}}{\Gamma(a_j)}\, s_j^{-a_j-1} e^{-\frac{b_j}{s_j}} \\
q_{\vec{w}}(\vec{w}|\vec{\alpha}) &= \frac{\Gamma(\sum_{j=1}^{k}\alpha_j)}{\prod_{j=1}^{k}\Gamma(\alpha_j)} \prod_{j=1}^{k} w_j^{\alpha_j - 1} \\
q_z(z_i|h_{i,1}, \dots, h_{i,k}) &= h_{i,z_i}
\end{align*}
Learning and Inference in Graphical Models. Chapter 06 – p. 35/38
Variational Bayes for Gaussian mixture
After some (more or less complicated) calculations we obtain the update rules
\begin{align*}
m_j &\leftarrow \frac{b_j m_0 + r_0 a_j n_j \bar{x}_j}{b_j + r_0 a_j n_j} \\
r_j &\leftarrow \frac{r_0 b_j}{b_j + r_0 a_j n_j} \\
a_j &\leftarrow a_0 + \frac{n_j}{2} \\
b_j &\leftarrow b_0 + \frac{n_j}{2}\left( V_{x,j} + (m_j - \bar{x}_j)^2 + r_j \right) \\
\alpha_j &\leftarrow \beta_j + n_j \\
h_{i,j} &\leftarrow c \cdot e^{\psi(\alpha_j) + \frac{1}{2}\psi(a_j) - \frac{1}{2}\log(b_j) - \frac{1}{2}\frac{a_j}{b_j}\left((x_i - m_j)^2 + r_j\right)}
 \qquad \text{where } c \text{ normalizes } \sum_{j=1}^{k} h_{i,j} \text{ to } 1
\end{align*}
with
\[
n_j = \sum_{i=1}^{n} h_{i,j}, \qquad
\bar{x}_j = \frac{1}{n_j}\sum_{i=1}^{n} h_{i,j}\, x_i, \qquad
V_{x,j} = \frac{1}{n_j}\sum_{i=1}^{n} h_{i,j}\, x_i^2 - \bar{x}_j^2
\]
$\psi$ denotes the digamma function $\psi(x) = \frac{\frac{d}{dx}\Gamma(x)}{\Gamma(x)}$.
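A minimal sketch of these update rules in Python for 1-D data; the variable names follow the slides, while the initialization, the priors, and the synthetic data are illustrative assumptions, not taken from the lecture.

```python
# Variational Bayes for a 1-D Gaussian mixture using the update rules above (sketch).
import numpy as np
from scipy.special import digamma

def vb_gaussian_mixture(x, k, m0=0.0, r0=1e4, a0=1e-3, b0=1e-3, beta=1.0,
                        n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    h = rng.dirichlet(np.ones(k), size=n)        # responsibilities h[i, j]
    m = rng.choice(x, size=k)                    # initial component means
    r = np.ones(k)
    a, b, alpha = np.full(k, a0), np.full(k, b0), np.full(k, beta)
    for _ in range(n_iter):
        # weighted sufficient statistics n_j, xbar_j, V_{x,j}
        nj = h.sum(axis=0) + 1e-12
        xbar = (h * x[:, None]).sum(axis=0) / nj
        vx = (h * x[:, None]**2).sum(axis=0) / nj - xbar**2
        # updates for q(mu_j), q(s_j) and q(w)
        m = (b * m0 + r0 * a * nj * xbar) / (b + r0 * a * nj)
        r = r0 * b / (b + r0 * a * nj)
        a = a0 + nj / 2
        b = b0 + nj / 2 * (vx + (m - xbar)**2 + r)
        alpha = beta + nj
        # update of the responsibilities h[i, j], normalized per sample point
        log_h = (digamma(alpha) + 0.5 * digamma(a) - 0.5 * np.log(b)
                 - 0.5 * (a / b) * ((x[:, None] - m)**2 + r))
        log_h -= log_h.max(axis=1, keepdims=True)          # numerical stability
        h = np.exp(log_h)
        h /= h.sum(axis=1, keepdims=True)
    return m, r, a, b, alpha, h

# usage on made-up synthetic data (illustration only)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 0.5, 200)])
m, r, a, b, alpha, h = vb_gaussian_mixture(x, k=5)
print(np.round(m, 2), np.round(alpha / alpha.sum(), 2))    # means and mean weights
```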
Learning and Inference in Graphical Models. Chapter 06 – p. 36/38
Variational Bayes for Gaussian mixture
Example → Matlab demo
(Figure: iteration = 300, k = 30, n = 1000. The plot shows the MAP estimate after variational inference; the sample of size 1000 is taken from a uniform distribution, and the priors were set close to non-informativity.)
Learning and Inference in Graphical Models. Chapter 06 – p. 37/38
Summary
◮ Kullback-Leibler-divergence
◮ principle of variational Bayes and theoretical derivation
◮ Example: variational Bayes for a Gaussian
◮ mixture distributions
◮ Example: variational Bayes for Gaussian mixtures
Learning and Inference in Graphical Models. Chapter 06 – p. 38/38