LEARNING AND INFERENCE IN GRAPHICAL MODELS
Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems
Dr. Martin Lauer
University of Freiburg, Machine Learning Lab
Karlsruhe Institute of Technology, Institute of Measurement and Control Systems
Learning and Inference in Graphical Models. Chapter 08 – p. 1/28
References for this chapter
◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 9, Springer, 2006
◮ Joseph L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, 1997
◮ Zoubin Ghahramani, Michael I. Jordan, Learning from Incomplete Data, Technical Report #1509, MIT Artificial Intelligence Laboratory, 1994. http://dspace.mit.edu/bitstream/handle/1721.1/7202/AIM-1509.pdf?sequence=2
◮ Arthur P. Dempster, Nan M. Laird, Donald B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, in: Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977
◮ Xiao-Li Meng, Donald B. Rubin, Maximum Likelihood Estimation via the ECM Algorithm: A General Framework, in: Biometrika, vol. 80, no. 2, pp. 267-278, 1993
Motivation
◮ up to now:
  1. calculate/approximate p(parameters | data)
  2. find a “meaningful” reference value for p(parameters | data), e.g. argmax_parameters p(parameters | data)
◮ this requires more calculation than is actually necessary
◮ this chapter:
  • find argmax_parameters p(parameters | data) directly (MAP), or
  • find argmax_parameters p(data | parameters) directly (ML)
◮ Remark: ML and MAP require basically the same approaches. The only difference is whether we consider priors (which are just additional factors in graphical models). Therefore, we consider both approaches together.
Direct MAP calculation
◮ Posterior distribution in a graphical model:
  p(u_1, ..., u_n | o_1, ..., o_m) = p(u_1, ..., u_n, o_1, ..., o_m) / p(o_1, ..., o_m)
                                   ∝ p(u_1, ..., u_n, o_1, ..., o_m)
                                   = ∏_i f_i(Neighbors(i))
                                   = e^{Σ_i log f_i(Neighbors(i))}

◮ MAP means: solve

  argmax_{u_1,...,u_n} Σ_i log f_i(Neighbors(i))
Direct MAP calculation
Ways to find the MAP
◮ If the full system of equations
    ∂/∂u_j Σ_i log f_i(Neighbors(i)) = 0   for all j
  can be resolved analytically → analytical solution for the MAP
◮ If each single equation
    ∂/∂u_j Σ_i log f_i(Neighbors(i)) = 0
  can be solved analytically → use an iterative approach
Direct MAP calculation
◮ Iterative approach
  1. repeat
  2.   set u_1 ← argmax_{u_1} Σ_i log f_i(Neighbors(i))
  3.   set u_2 ← argmax_{u_2} Σ_i log f_i(Neighbors(i))
  4.   ...
  5.   set u_n ← argmax_{u_n} Σ_i log f_i(Neighbors(i))
  6. until convergence
  7. return (u_1, ..., u_n)
◮ If the derivatives
    ∂/∂u_j Σ_i log f_i(Neighbors(i))
  can be calculated easily → use a generic gradient ascent algorithm for a numerical solution
The coordinate-wise iterative approach often converges faster than generic gradient ascent.
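The coordinate-wise scheme above can be sketched in a few lines. The 2-D quadratic log-posterior below is a made-up toy objective (not from the slides); each coordinate-wise argmax has a closed form, so steps 2-5 become simple assignments:

```python
def coordinate_ascent_map(iters=100):
    """Coordinate ascent on the toy log-posterior
    log p(u1, u2) = -(u1 - 1)^2 - (u2 - 2)^2 - 0.5*u1*u2 + const."""
    u1, u2 = 0.0, 0.0
    for _ in range(iters):
        # argmax over u1 with u2 fixed: solve d/du1 = -2*(u1 - 1) - 0.5*u2 = 0
        u1 = 1.0 - 0.25 * u2
        # argmax over u2 with u1 fixed: solve d/du2 = -2*(u2 - 2) - 0.5*u1 = 0
        u2 = 2.0 - 0.25 * u1
    return u1, u2

# since this objective is concave, the coordinate-wise fixed point is the MAP
u1, u2 = coordinate_ascent_map()
```

Because the objective is concave, the sweeps converge to the joint maximum; in general graphical models the scheme only guarantees a local optimum.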
Example: bearing-only tracking revisited
◮ observing a moving object from a fixed position
◮ the object moves with constant velocity
◮ for every point in time, the observer senses the angle of observation, but only sometimes the distance to the object
◮ distributions:
  ~x0 ∼ N(~a, R)
  ~v ∼ N(~b, S)
  ~y_i | ~x0, ~v ∼ N(~x0 + t_i ~v, σ² I)
  r_i = ||~y_i||,   ~w_i = ~y_i / ||~y_i||
[Figure: object movement seen from the observer — angle of observation known, distance unknown; graphical model with nodes ~x0, ~v, ~y_i, ~w_i, r_i, t_i, σ and a plate over i = 1..n]
Example: bearing-only tracking revisited
◮ conditional distributions:

  ~x0 | ~v, (~y_i), (t_i) ∼ N( ((n/σ²) I + R^{-1})^{-1} ((1/σ²) Σ_i (~y_i − t_i ~v) + R^{-1} ~a),  ((n/σ²) I + R^{-1})^{-1} )

  ~v | ~x0, (~y_i), (t_i) ∼ N( ((1/σ²) Σ_i t_i² I + S^{-1})^{-1} ((1/σ²) Σ_i t_i (~y_i − ~x0) + S^{-1} ~b),  ((1/σ²) Σ_i t_i² I + S^{-1})^{-1} )

  r_i | ~x0, ~v, t_i, ~w_i ∼ N( ~w_i^T (~x0 + t_i ~v), σ² )

◮ updates derived from the conditionals:

  ~x0 ← ((n/σ²) I + R^{-1})^{-1} ((1/σ²) Σ_i (~y_i − t_i ~v) + R^{-1} ~a)

  ~v ← ((1/σ²) Σ_i t_i² I + S^{-1})^{-1} ((1/σ²) Σ_i t_i (~y_i − ~x0) + S^{-1} ~b)

  r_i ← ~w_i^T (~x0 + t_i ~v)

◮ Matlab demo (using non-informative priors)
Example: Gaussian mixtures revisited
[Figure: graphical model of the Gaussian mixture — hyperparameters m0, r0, a0, b0, ~β; nodes µ_j, s_j (plate over j = 1..k), ~w, Z_i, X_i (plate over i = 1..n)]

µ_j ∼ N(m0, r0)
s_j ∼ Γ^{-1}(a0, b0)
~w ∼ D(~β)
Z_i | ~w ∼ C(~w)
X_i | Z_i, µ_{Z_i}, s_{Z_i} ∼ N(µ_{Z_i}, s_{Z_i})
Example: Gaussian mixtures revisited
◮ conditional distributions: see slide 07/36
◮ derived MAP updates, with n_j = |{i : z_i = j}|:

  ~w ← ( (β_1 + n_1 − 1) / (n − k + Σ_{j=1}^k β_j), ..., (β_k + n_k − 1) / (n − k + Σ_{j=1}^k β_j) )

  µ_j ← (s_j m0 + r0 Σ_{i: z_i=j} x_i) / (s_j + n_j r0)

  s_j ← (b0 + ½ Σ_{i: z_i=j} (x_i − µ_j)²) / (1 + a0 + n_j/2)

  z_i ← argmax_j (w_j / √(2π s_j)) · e^{−(x_i − µ_j)² / (2 s_j)}

◮ Matlab demo (using priors close to non-informativity)
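The four updates above can be cycled until convergence. Below is a sketch for 1-D data; the concrete prior values m0, r0, a0, b0, ~β (chosen close to non-informativity) and the deterministic quantile-based initialization are assumptions for illustration, not taken from the slides:

```python
import math

def map_gmm(xs, k, iters=50):
    """Cyclic MAP updates for a 1-D Gaussian mixture with hard assignments z_i."""
    # near non-informative priors (assumed values)
    m0, r0, a0, b0 = 0.0, 1e6, 1e-3, 1e-3
    beta = [1.0 + 1e-3] * k            # Dirichlet prior slightly above 1
    n = len(xs)
    xs_sorted = sorted(xs)
    mu = [xs_sorted[(2 * j + 1) * n // (2 * k)] for j in range(k)]  # quantile init
    s = [1.0] * k
    w = [1.0 / k] * k
    z = [0] * n
    for _ in range(iters):
        # z_i <- argmax_j  w_j / sqrt(2 pi s_j) * exp(-(x_i - mu_j)^2 / (2 s_j))
        for i, x in enumerate(xs):
            z[i] = max(range(k), key=lambda j: math.log(w[j])
                       - 0.5 * math.log(2 * math.pi * s[j])
                       - (x - mu[j]) ** 2 / (2 * s[j]))
        nj = [z.count(j) for j in range(k)]
        # w <- ((beta_j + n_j - 1) / (n - k + sum(beta)))_j
        denom = n - k + sum(beta)
        w = [(beta[j] + nj[j] - 1) / denom for j in range(k)]
        for j in range(k):
            xj = [xs[i] for i in range(n) if z[i] == j]
            # mu_j <- (s_j m0 + r0 sum_{i: z_i=j} x_i) / (s_j + n_j r0)
            mu[j] = (s[j] * m0 + r0 * sum(xj)) / (s[j] + nj[j] * r0)
            # s_j <- (b0 + 0.5 sum (x_i - mu_j)^2) / (1 + a0 + n_j/2)
            s[j] = (b0 + 0.5 * sum((x - mu[j]) ** 2 for x in xj)) / (1 + a0 + nj[j] / 2)
    return w, mu, s, z

w, mu, s, z = map_gmm([-5.1, -5.0, -4.9, 4.9, 5.0, 5.1], k=2)
```

As the next slide observes, the result depends strongly on the initialization; the quantile initialization used here is enough for this well-separated toy data set.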
Example: Gaussian mixtures revisited
Observations:
◮ convergence is very fast
◮ the result depends very much on the initialization
◮ we treat the z_i like parameters of the model although the mixture model is completely specified by ~w, µ_1, ..., µ_k, s_1, ..., s_k
◮ the z_i are not parameters of the mixture but latent variables which are only used to simplify our calculations
◮ why should we maximize the posterior w.r.t. the z_i?
[Figure: graphical model of the Gaussian mixture, as on slide 9]
Latent variables
◮ Latent variables are
• not part of the stochastic model
• not interesting for the final estimate
• useful to simplify calculations
• often interpreted as missing observation
◮ Examples
• the class assignment variables z_i in the mixture modeling can be interpreted as missing class labels for a multi-class distribution
• the missing distances r_i in the bearing-only tracking task can be interpreted as missing parts of the data
• occluded parts of an object in an image can be seen as missing pixels
• data from a statistical evaluation which have been lost
Incomplete data problems
Let us assume that all data ~x are split into an observed part ~y and a missing part ~z, i.e. ~x = (~y, ~z). We can distinguish three cases:
◮ completely missing at random (CMAR; more commonly called missing completely at random, MCAR): whether an entry of ~x belongs to ~y or ~z is stochastically independent of both ~y and ~z

  P(x_i belongs to ~z) = P(x_i belongs to ~z | ~y) = P(x_i belongs to ~z | ~y, ~z)

◮ missing at random (MAR): whether an entry of ~x belongs to ~y or ~z is stochastically independent of ~z but might depend on ~y

  P(x_i belongs to ~z) ≠ P(x_i belongs to ~z | ~y) = P(x_i belongs to ~z | ~y, ~z)

◮ censored data: whether an entry of ~x belongs to ~y or ~z is stochastically dependent on ~z

  P(x_i belongs to ~z | ~y) ≠ P(x_i belongs to ~z | ~y, ~z)
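The practical difference between the cases can be illustrated by estimating a mean from incompletely observed samples (a hypothetical simulation; the 50% drop rate and the −0.5 censoring threshold are arbitrary choices): under CMAR the observed-data mean stays unbiased, under censoring it is systematically shifted.

```python
import random

def observed_mean(xs, keep):
    """Mean over the entries that remain observed."""
    obs = [x for x in xs if keep(x)]
    return sum(obs) / len(obs)

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(100000)]  # true mean is 0

# CMAR: whether a value is missing is independent of everything
mean_cmar = observed_mean(xs, lambda x: rng.random() < 0.5)

# censored: values below -0.5 are never observed, so missingness depends on
# the missing value itself and the observed mean is biased upwards
mean_censored = observed_mean(xs, lambda x: x > -0.5)
```

For a standard normal, the censored mean approaches E[X | X > −0.5] ≈ 0.51, far from the true mean 0, while the CMAR mean stays close to 0.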
Incomplete data problems
◮ Discuss the following examples of incomplete data:
• the zi in mixture models
• a sensor that measures values only down to a certain minimal value
• an interrupted connection between a sensor and a host computer so that some measurements are not transmitted
• a stereo camera system that measures light intensity and distance but is unable to calculate the distance for overexposed areas
• a sensor that fails often if temperatures are low
  – if the sensor measures the activities of the sun
  – if the sensor measures the persons on a beach
• non-responses at public opinion polls
Incomplete data problems
◮ consequences for stochastic analysis
• CMAR: no problem at all, incomplete data do not disturb our results
• MAR: can be treated if we model the stochastic dependency between the observed data and the missing data
• censored data: no general treatment is possible; results will be biased and the missing data cannot be reconstructed
◮ we focus on the CMAR and MAR cases here
Inference for incomplete data problems
◮ variational Bayes, Monte Carlo: model the full posterior over the parameters of the model and the latent (missing) data. Afterwards, ignore the latent variables and return the result for the parameters of your model.
◮ direct MAP/ML: do not maximize the posterior/likelihood over both the parameters and the latent variables. Instead, consider all possible values that the latent variables can take and maximize the posterior/likelihood only w.r.t. the parameters of your stochastic model.
→ expectation-maximization algorithm (EM)
→ expectation-conditional-maximization algorithm (ECM)
EM algorithm
Let us denote
◮ the parameters of the stochastic model (the posterior distribution): ~θ = (θ_1, ..., θ_k)
◮ the latent variables: ~λ = (λ_1, ..., λ_m)
◮ the observed data: ~o = (o_1, ..., o_n)
◮ the log-posterior: L(~θ, ~λ, ~o) = Σ_i log f_i(Neighbors(i))
EM algorithm
◮ We aim at maximizing the expected log-posterior over all values of the latent variables:

  argmax_{~θ} ∫_{R^m} L(~θ, ~λ, ~o) · p(~λ | ~θ, ~o) d~λ

◮ an iterative approach to solve it:
  1. start with some parameter vector ~θ
  2. repeat
  3.   Q(~θ') ← ∫_{R^m} L(~θ', ~λ, ~o) · p(~λ | ~θ, ~o) d~λ
  4.   ~θ ← argmax_{~θ'} Q(~θ')
  5. until convergence
◮ This algorithm is known as the expectation-maximization (EM) algorithm (Dempster, Laird, Rubin, 1977)
  • step 3: expectation step (E-step)
  • step 4: maximization step (M-step)
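The E- and M-steps can be seen in action on the classic two-coins mixture, a standard textbook illustration that is not from the slides: each record is the number of heads in m tosses of one of two coins with unknown biases, and the coin identity is the latent variable. The data and starting values below are made up:

```python
def em_two_coins(counts, m, iters=50):
    """EM for a mixture of two Bernoulli coins, observing head counts
    out of m tosses; which coin produced each record is latent."""
    pi, tA, tB = 0.5, 0.6, 0.4          # assumed starting values
    for _ in range(iters):
        # E-step: posterior responsibility that record i came from coin A
        rs = []
        for h in counts:
            la = pi * tA ** h * (1 - tA) ** (m - h)
            lb = (1 - pi) * tB ** h * (1 - tB) ** (m - h)
            rs.append(la / (la + lb))
        # M-step: responsibility-weighted maximum-likelihood updates
        pi = sum(rs) / len(rs)
        tA = sum(r * h for r, h in zip(rs, counts)) / (m * sum(rs))
        tB = sum((1 - r) * h for r, h in zip(rs, counts)) / (m * sum(1 - r for r in rs))
    return pi, tA, tB

counts = [9, 8, 9, 2, 3, 2, 8, 3]       # heads out of 10 tosses per record
pi, tA, tB = em_two_coins(counts, m=10)
```

The E-step here computes the expected value of the latent coin indicator, which for this model is exactly the intermediate quantity needed to represent Q — the situation described in the remark on the next slide.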
EM algorithm
◮ Remarks:
• during the E-step, intermediate variables are calculated which allow us to represent Q without relying on the previous values of ~θ
• closed-form expressions for Q and explicit maximization often require lengthy algebraic calculations
• for some applications, calculating the E-step means calculating the expectation values of the latent variables. But this does not apply in general.
◮ Famous application areas
• mixture distributions
• learning hidden Markov models from example sequences (Baum-Welch algorithm)
Example: bearing-only tracking revisited
◮ conditional distribution:

  r_i | ~x0, ~v, ~w_i ∼ N( ~w_i^T (~x0 + t_i ~v), σ² )

◮ the posterior distribution:

  1/(2π √|R|) · e^{−½ (~x0 − ~a)^T R^{-1} (~x0 − ~a)}                  (prior of ~x0)

  · 1/(2π √|S|) · e^{−½ (~v − ~b)^T S^{-1} (~v − ~b)}                  (prior of ~v)

  · ∏_{i=1}^n 1/(2πσ²) · e^{−½ ||~x0 + t_i ~v − r_i ~w_i||² / σ²}     (data term)

[Figure: observer geometry and graphical model, as on slide 7]
Example: bearing-only tracking revisited
... (after lengthy, error-prone calculations) ...
Q(~x0', ~v') = const − ½ (~x0' − ~a)^T R^{-1} (~x0' − ~a) − ½ (~v' − ~b)^T S^{-1} (~v' − ~b)
               − ½ Σ_{i=1}^n ||~x0' + t_i ~v' − ρ_i ~w_i||² / σ²

with

  ρ_i = r_i                       if r_i is observed
  ρ_i = (~x0 + t_i ~v)^T ~w_i     if r_i is unobserved

Determining the maximum w.r.t. ~x0', ~v' leads to the linear system

  ( R^{-1} + (n/σ²) I        ((1/σ²) Σ t_i) I           )   ( ~x0' )   ( R^{-1} ~a + (1/σ²) Σ ρ_i ~w_i     )
  ( ((1/σ²) Σ t_i) I         S^{-1} + ((1/σ²) Σ t_i²) I ) · ( ~v'  ) = ( S^{-1} ~b + (1/σ²) Σ t_i ρ_i ~w_i )
→ Matlab demo (using non-informative priors)
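With non-informative priors (R^{-1} = S^{-1} = 0, as in the Matlab demo) the linear system above decouples per coordinate into two 2×2 systems, and the E-step just fills in ρ_i for the unobserved distances. A sketch on synthetic noiseless data; the trajectory and the choice of which distances are observed are made-up test values:

```python
def em_bearing_only(ws, ts, r_obs, iters=20000):
    """EM for bearing-only tracking with non-informative priors.
    ws: unit bearing vectors ~w_i, ts: observation times t_i,
    r_obs: observed distances r_i, or None where the distance is missing."""
    n = len(ws)
    x0, v = [0.0, 0.0], [0.0, 0.0]
    T, T2 = sum(ts), sum(t * t for t in ts)
    det = n * T2 - T * T  # determinant of the per-coordinate 2x2 system
    for _ in range(iters):
        # E-step: rho_i = r_i if observed, else the projection w_i^T (x0 + t_i v)
        rho = [r if r is not None else
               ws[i][0] * (x0[0] + ts[i] * v[0]) + ws[i][1] * (x0[1] + ts[i] * v[1])
               for i, r in enumerate(r_obs)]
        # M-step: solve [[n, T], [T, T2]] (x0_d, v_d) = (b1_d, b2_d) per coordinate d
        for d in range(2):
            b1 = sum(rho[i] * ws[i][d] for i in range(n))
            b2 = sum(ts[i] * rho[i] * ws[i][d] for i in range(n))
            x0[d] = (T2 * b1 - T * b2) / det
            v[d] = (n * b2 - T * b1) / det
    return x0, v

# synthetic noiseless trajectory (made-up test values)
true_x0, true_v = (1.0, 2.0), (0.5, -0.3)
ts = list(range(10))
ys = [(true_x0[0] + t * true_v[0], true_x0[1] + t * true_v[1]) for t in ts]
rs = [(y[0] ** 2 + y[1] ** 2) ** 0.5 for y in ys]
ws = [(y[0] / r, y[1] / r) for y, r in zip(ys, rs)]

# with every distance observed, a single M-step recovers the trajectory exactly
x0_full, v_full = em_bearing_only(ws, ts, rs, iters=1)

# with only two observed distances, the E-step fills in the rest iteratively
r_obs = [rs[i] if i in (0, 5) else None for i in range(len(ts))]
x0_em, v_em = em_bearing_only(ws, ts, r_obs)
```

Since Q is jointly quadratic in the parameters and the filled-in ρ_i, the alternation is coordinate ascent on a concave objective and converges to the global optimum, though only linearly, hence the generous iteration count.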
ECM algorithm
◮ We still aim at maximizing the expected log-posterior over all values of the latent variables:

  argmax_{~θ} ∫_{R^m} L(~θ, ~λ, ~o) · p(~λ | ~θ, ~o) d~λ

◮ sometimes, the M-step of the EM algorithm cannot be calculated, i.e.

  argmax_{θ'_1,...,θ'_k} Q(~θ')

  cannot be resolved analytically. But it might happen that

  argmax_{θ'_i} Q(~θ')

  can be resolved for each θ'_i individually or for groups of parameters
◮ Expectation-conditional-maximization algorithm (Meng & Rubin, 1993)
ECM algorithm
◮ Define a set of constraints g_i(~θ', ~θ) on the parameter set, e.g.

  g_i : θ'_j = θ_j for all j ≠ i

◮ replace the single M-step of the EM algorithm by a sequence of CM-steps, one for each constraint:
  1. start with some parameter vector ~θ
  2. repeat
  3.   Q(~θ') ← ∫_{R^m} L(~θ', ~λ, ~o) · p(~λ | ~θ, ~o) d~λ   (E-step)
  4.   ~θ ← argmax_{~θ'} Q(~θ') subject to g_1(~θ', ~θ)   (CM-step)
  5.   ...
  6.   ~θ ← argmax_{~θ'} Q(~θ') subject to g_ν(~θ', ~θ)   (CM-step)
  7. until convergence
Example: Gaussian mixtures revisited
[Figure: graphical model of the Gaussian mixture, as on slide 9 — hyperparameters m0, r0, a0, b0, ~β; nodes µ_j, s_j (plate over j = 1..k), ~w, Z_i, X_i (plate over i = 1..n)]

µ_j ∼ N(m0, r0)
s_j ∼ Γ^{-1}(a0, b0)
~w ∼ D(~β)
Z_i | ~w ∼ C(~w)
X_i | Z_i, µ_{Z_i}, s_{Z_i} ∼ N(µ_{Z_i}, s_{Z_i})

◮ conditional distribution (cf. slide 07/35)

  z_i = j | ~w, x_i, µ_1, ..., µ_k, s_1, ..., s_k ∼ C(h_{i,1}, ..., h_{i,k})

  with h_{i,j} ∝ (w_j / √(2π s_j)) · e^{−(x_i − µ_j)² / (2 s_j)}
Example: Gaussian mixtures revisited
◮ Q(~w', µ'_1, ..., µ'_k, s'_1, ..., s'_k)

  = Σ_{z_1=1}^k ··· Σ_{z_n=1}^k [ ( Σ_{j=1}^k log( (1/√(2π r0)) e^{−(µ'_j − m0)² / (2 r0)} )        (prior of µ'_j)

      + Σ_{j=1}^k log( (b0^{a0} / Γ(a0)) (s'_j)^{−a0−1} e^{−b0 / s'_j} )                            (prior of s'_j)

      + log( (Γ(β_1 + ··· + β_k) / (Γ(β_1) ··· Γ(β_k))) ∏_{j=1}^k (w'_j)^{β_j−1} )                  (prior of ~w')

      + Σ_{i=1}^n log( (1/√(2π s'_{z_i})) e^{−(x_i − µ'_{z_i})² / (2 s'_{z_i})} )                   (data terms of the x_i)

      + Σ_{i=1}^n log(w'_{z_i}) )                                                                   (data terms of the z_i)

    · h_{1,z_1} ··· h_{n,z_n} ]

◮ we can easily maximize Q (blackboard/homework)
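Maximizing this Q (the homework) leads to the familiar soft-assignment EM updates, with each point weighted by its responsibility h_{i,j} from the previous slide. A maximum-likelihood sketch with the priors dropped (the non-informative limit); the quantile-based initialization and the variance floor are assumed choices for illustration:

```python
import math

def em_gmm(xs, k, iters=200):
    """ML EM for a 1-D Gaussian mixture; s_j is the variance of component j."""
    n = len(xs)
    xs_sorted = sorted(xs)
    mu = [xs_sorted[(2 * j + 1) * n // (2 * k)] for j in range(k)]  # quantile init
    s = [1.0] * k
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities h[i][j] proportional to w_j N(x_i; mu_j, s_j)
        h = []
        for x in xs:
            dens = [w[j] / math.sqrt(2 * math.pi * s[j])
                    * math.exp(-(x - mu[j]) ** 2 / (2 * s[j])) for j in range(k)]
            tot = sum(dens)
            h.append([d / tot for d in dens])
        # M-step: responsibility-weighted ML updates
        for j in range(k):
            hj = sum(h[i][j] for i in range(n))
            w[j] = hj / n
            mu[j] = sum(h[i][j] * xs[i] for i in range(n)) / hj
            # floor the variance to avoid the degenerate single-point collapse
            s[j] = max(sum(h[i][j] * (xs[i] - mu[j]) ** 2 for i in range(n)) / hj, 1e-8)
    return w, mu, s

w, mu, s = em_gmm([-5.1, -5.0, -4.9, 4.9, 5.0, 5.1], k=2)
```

Compared to the hard-assignment MAP updates of slide 10, each point here contributes to every component in proportion to h_{i,j} instead of being assigned to a single one.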
Example: Gaussian mixtures revisited
◮ Matlab demo (using non-informative priors)
◮ Some observations on EM/ECM for Gaussian mixtures
• very popular
• very sensitive to initialization of parameters
• overfits the data if the mixture has too many components (for ML/MAP with non-informative priors)
Laplace approximation
MAP calculates a best estimate. Can we derive an approximation for the posterior distribution?

Idea: determine a Gaussian that is locally most similar to the posterior.

Taylor approximation of the log-posterior around the MAP estimate ~θ_MAP:

  log p(~θ) ≈ log p(~θ_MAP) + grad^T (~θ − ~θ_MAP) + ½ (~θ − ~θ_MAP)^T H (~θ − ~θ_MAP)
            = log p(~θ_MAP) + ½ (~θ − ~θ_MAP)^T H (~θ − ~θ_MAP)

with H the Hessian of log p at ~θ_MAP; the gradient term vanishes because ~θ_MAP is a maximum.

log of a Gaussian around ~θ_MAP:

  log( 1/((2π)^{d/2} √|Σ|) ) − ½ (~θ − ~θ_MAP)^T Σ^{-1} (~θ − ~θ_MAP)

We obtain the same shape of the Gaussian if we choose Σ^{-1} = −H. This is known as the Laplace approximation.
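The construction can be checked numerically in one dimension: find the MAP by Newton's method and set the Gaussian variance to −1/H, with H the second derivative of log p at the MAP. The test density below, log p(θ) = 3 log θ − 2θ + const (an unnormalized Gamma density), is chosen only as an example; its MAP is 1.5 and −1/H there is 0.75:

```python
import math

def laplace_1d(logp, theta0, h=1e-4, newton_steps=50):
    """1-D Laplace approximation: returns (theta_MAP, sigma^2 = -1/H)."""
    def d1(t):   # central finite-difference first derivative
        return (logp(t + h) - logp(t - h)) / (2 * h)
    def d2(t):   # central finite-difference second derivative (the Hessian H)
        return (logp(t + h) - 2 * logp(t) + logp(t - h)) / (h * h)
    t = theta0
    for _ in range(newton_steps):
        t = t - d1(t) / d2(t)        # Newton step towards the maximum
    return t, -1.0 / d2(t)

theta_map, var = laplace_1d(lambda t: 3 * math.log(t) - 2 * t, theta0=1.0)
```

Since the Hessian of a concave log-posterior is negative definite at the maximum, −1/H is a valid (positive) variance here.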
Summary
◮ direct maximization of likelihood/posterior
◮ latent variables
◮ incomplete data problems
◮ EM/ECM algorithm
◮ Laplace approximation