Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems


Transcript of Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems


LEARNING AND INFERENCE IN GRAPHICAL MODELS

Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems

Dr. Martin Lauer

University of Freiburg
Machine Learning Lab

Karlsruhe Institute of Technology
Institute of Measurement and Control Systems

Learning and Inference in Graphical Models. Chapter 08 – p. 1/28


References for this chapter

◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 9, Springer, 2006

◮ Joseph L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, 1997

◮ Zoubin Ghahramani, Michael I. Jordan, Learning from Incomplete Data, Technical Report #1509, MIT Artificial Intelligence Laboratory, 1994, http://dspace.mit.edu/bitstream/handle/1721.1/7202/AIM-1509.pdf?sequence=2

◮ Arthur P. Dempster, Nan M. Laird, Donald B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, in: Journal of the Royal Statistical Society Series B, vol. 39, pp. 1-38, 1977

◮ Xiao-Li Meng and Donald B. Rubin, Maximum Likelihood Estimation via the ECM Algorithm: A General Framework, in: Biometrika, vol. 80, no. 2, pp. 267-278, 1993

Learning and Inference in Graphical Models. Chapter 08 – p. 2/28


Motivation

◮ up to now:

1. calculate/approximate p(parameters | data)
2. find a "meaningful" reference value for p(parameters | data), e.g. argmax_parameters p(parameters | data)

◮ this requires more calculation than is actually necessary

◮ this chapter:

• find argmax_parameters p(parameters | data) directly (MAP) or

• find argmax_parameters p(data | parameters) directly (ML)

◮ Remark: ML and MAP require basically the same approaches. The only difference is whether we consider priors (which are just additional factors in graphical models). Therefore, we consider both approaches together.

Learning and Inference in Graphical Models. Chapter 08 – p. 3/28


Direct MAP calculation

◮ Posterior distribution in a graphical model:

$$p(u_1, \ldots, u_n \mid o_1, \ldots, o_m) = \frac{p(u_1, \ldots, u_n, o_1, \ldots, o_m)}{p(o_1, \ldots, o_m)} \propto p(u_1, \ldots, u_n, o_1, \ldots, o_m) = \prod_i f_i(\mathrm{Neighbors}(i)) = e^{\sum_i \log f_i(\mathrm{Neighbors}(i))}$$

◮ MAP means: solve

$$\arg\max_{u_1, \ldots, u_n} \sum_i \log f_i(\mathrm{Neighbors}(i))$$

Learning and Inference in Graphical Models. Chapter 08 – p. 4/28


Direct MAP calculation

Ways to find the MAP

◮ The system of equations

$$\frac{\partial}{\partial u_j} \sum_i \log f_i(\mathrm{Neighbors}(i)) = 0, \qquad j = 1, \ldots, n$$

can be resolved analytically → analytical solution for the MAP

◮ Each single equation

$$\frac{\partial}{\partial u_j} \sum_i \log f_i(\mathrm{Neighbors}(i)) = 0$$

can be solved analytically → use an iterative approach

Learning and Inference in Graphical Models. Chapter 08 – p. 5/28


Direct MAP calculation

◮ Iterative approach

1. repeat
2. set $u_1 \leftarrow \arg\max_{u_1} \sum_i \log f_i(\mathrm{Neighbors}(i))$
3. set $u_2 \leftarrow \arg\max_{u_2} \sum_i \log f_i(\mathrm{Neighbors}(i))$
4. ...
5. set $u_n \leftarrow \arg\max_{u_n} \sum_i \log f_i(\mathrm{Neighbors}(i))$
6. until convergence
7. return $(u_1, \ldots, u_n)$

◮ The derivatives

$$\frac{\partial}{\partial u_j} \sum_i \log f_i(\mathrm{Neighbors}(i))$$

can be calculated easily → use a generic gradient descent algorithm for the numerical solution

The second approach (the coordinate-wise iteration above) often converges faster than generic gradient descent.
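
To make the coordinate-wise scheme above concrete, here is a minimal Python sketch (not from the slides; the function name `coordinate_map` and its arguments are hypothetical). It assumes the inner one-dimensional maximizations are done numerically with `scipy.optimize.minimize_scalar`; in the examples that follow they have closed-form solutions instead.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_map(log_posterior, u_init, n_iter=100, tol=1e-8):
    """Coordinate-wise maximization of a log-posterior sum_i log f_i(Neighbors(i)).

    log_posterior(u) evaluates the objective for the full vector u of
    unobserved variables; each inner step maximizes over one coordinate
    while all others are held fixed.
    """
    u = np.asarray(u_init, dtype=float).copy()
    for _ in range(n_iter):
        u_old = u.copy()
        for j in range(u.size):
            # maximize over u_j alone (minimize the negative log-posterior)
            def neg_lp(v, j=j):
                u_try = u.copy()
                u_try[j] = v
                return -log_posterior(u_try)
            u[j] = minimize_scalar(neg_lp).x
        if np.max(np.abs(u - u_old)) < tol:   # convergence check
            break
    return u

# toy usage: for a spherical Gaussian log-posterior the MAP is its mean
if __name__ == "__main__":
    mean = np.array([1.0, -2.0, 0.5])
    log_post = lambda u: -0.5 * np.sum((u - mean) ** 2)
    print(coordinate_map(log_post, np.zeros(3)))   # approx. [1.0, -2.0, 0.5]
```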

Learning and Inference in Graphical Models. Chapter 08 – p. 6/28


Example: bearing-only tracking revisited

◮ observing a moving object from a fixed position

◮ object moves with constant velocity

◮ for every point in time, the observer senses the angle of observation, but only sometimes the distance to the object

◮ distributions:

$$\vec{x}_0 \sim \mathcal{N}(\vec{a}, R) \qquad \vec{v} \sim \mathcal{N}(\vec{b}, S) \qquad \vec{y}_i \mid \vec{x}_0, \vec{v} \sim \mathcal{N}(\vec{x}_0 + t_i \vec{v}, \sigma^2 I) \qquad r_i = \|\vec{y}_i\| \qquad \vec{w}_i = \frac{\vec{y}_i}{\|\vec{y}_i\|}$$

[Figure: sketch of the object movement seen from the observer, with the angle of observation and the unknown distance; plate diagram over $i = 1, \ldots, n$ with nodes $t_i$, $\sigma$, $\vec{x}_0$, $\vec{v}$, $\vec{y}_i$, $\vec{w}_i$, $r_i$.]

Learning and Inference in Graphical Models. Chapter 08 – p. 7/28


Example: bearing-only tracking revisited

◮ conditional distributions:

$$\vec{x}_0 \mid \vec{v}, (\vec{y}_i), (t_i) \sim \mathcal{N}\!\left(\left(\tfrac{n}{\sigma^2} I + R^{-1}\right)^{-1}\!\left(\tfrac{1}{\sigma^2}\sum_i (\vec{y}_i - t_i \vec{v}) + R^{-1}\vec{a}\right),\; \left(\tfrac{n}{\sigma^2} I + R^{-1}\right)^{-1}\right)$$

$$\vec{v} \mid \vec{x}_0, (\vec{y}_i), (t_i) \sim \mathcal{N}\!\left(\left(\tfrac{1}{\sigma^2}\sum_i t_i^2\, I + S^{-1}\right)^{-1}\!\left(\tfrac{1}{\sigma^2}\sum_i t_i(\vec{y}_i - \vec{x}_0) + S^{-1}\vec{b}\right),\; \left(\tfrac{1}{\sigma^2}\sum_i t_i^2\, I + S^{-1}\right)^{-1}\right)$$

$$r_i \mid \vec{x}_0, \vec{v}, t_i, \vec{w}_i \sim \mathcal{N}\!\left(\vec{w}_i^T(\vec{x}_0 + t_i \vec{v}),\; \sigma^2\right)$$

◮ updates derived from the conditionals:

$$\vec{x}_0 \leftarrow \left(\tfrac{n}{\sigma^2} I + R^{-1}\right)^{-1}\!\left(\tfrac{1}{\sigma^2}\sum_i (\vec{y}_i - t_i \vec{v}) + R^{-1}\vec{a}\right)$$

$$\vec{v} \leftarrow \left(\tfrac{1}{\sigma^2}\sum_i t_i^2\, I + S^{-1}\right)^{-1}\!\left(\tfrac{1}{\sigma^2}\sum_i t_i(\vec{y}_i - \vec{x}_0) + S^{-1}\vec{b}\right)$$

$$r_i \leftarrow \vec{w}_i^T(\vec{x}_0 + t_i \vec{v})$$

◮ Matlab demo (using non-informative priors)
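
A possible numerical sketch of these updates in Python is shown below (the slides use a Matlab demo instead). The function name and argument layout are my own; it assumes 2-D positions, that missing distances are marked with `np.nan`, and that $\vec{y}_i = r_i \vec{w}_i$ as defined on slide 7.

```python
import numpy as np

def bearing_only_map(w, t, r_obs, a, R, b, S, sigma2, n_iter=200):
    """Coordinate-wise MAP updates for bearing-only tracking (a sketch).

    w      : (n, 2) unit bearing vectors w_i
    t      : (n,)   observation times t_i
    r_obs  : (n,)   observed distances; np.nan where the distance is missing
    a, R   : prior mean/covariance of the start position x0
    b, S   : prior mean/covariance of the velocity v
    sigma2 : observation noise variance
    """
    n = len(t)
    Rinv, Sinv = np.linalg.inv(R), np.linalg.inv(S)
    x0, v = a.copy(), b.copy()
    r = np.where(np.isnan(r_obs), 1.0, r_obs)        # init missing distances (arbitrary)
    for _ in range(n_iter):
        y = r[:, None] * w                           # y_i = r_i * w_i
        # x0 <- (n/s2 I + R^-1)^-1 (1/s2 sum(y_i - t_i v) + R^-1 a)
        A = (n / sigma2) * np.eye(2) + Rinv
        x0 = np.linalg.solve(A, (y - t[:, None] * v).sum(0) / sigma2 + Rinv @ a)
        # v <- (sum(t_i^2)/s2 I + S^-1)^-1 (1/s2 sum t_i (y_i - x0) + S^-1 b)
        B = (np.sum(t ** 2) / sigma2) * np.eye(2) + Sinv
        v = np.linalg.solve(B, (t[:, None] * (y - x0)).sum(0) / sigma2 + Sinv @ b)
        # unobserved distances: r_i <- w_i^T (x0 + t_i v)
        pred = np.einsum('ij,ij->i', w, x0 + t[:, None] * v)
        r = np.where(np.isnan(r_obs), pred, r_obs)
    return x0, v, r
```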

Learning and Inference in Graphical Models. Chapter 08 – p. 8/28


Example: Gaussian mixtures revisited

[Figure: plate diagram of the Gaussian mixture model — hyperparameters $m_0, r_0, a_0, b_0$; component parameters $\mu_j, s_j$ in a plate of size $k$; mixture weights $\vec{w}$; latent labels $Z_i$ and observations $X_i$ in a plate of size $n$.]

$$\mu_j \sim \mathcal{N}(m_0, r_0) \qquad s_j \sim \Gamma^{-1}(a_0, b_0) \qquad \vec{w} \sim \mathcal{D}(\vec{\beta}) \qquad Z_i \mid \vec{w} \sim \mathcal{C}(\vec{w}) \qquad X_i \mid Z_i, \mu_{Z_i}, s_{Z_i} \sim \mathcal{N}(\mu_{Z_i}, s_{Z_i})$$

Learning and Inference in Graphical Models. Chapter 08 – p. 9/28


Example: Gaussian mixtures revisited

◮ conditional distributions: see slide 07/36

◮ derive MAP updates:

$$\vec{w} \leftarrow \left(\frac{\beta_1 + n_1 - 1}{n - k + \sum_{j=1}^k \beta_j}, \ldots, \frac{\beta_k + n_k - 1}{n - k + \sum_{j=1}^k \beta_j}\right) \quad \text{with } n_j = |\{i \mid z_i = j\}|$$

$$\mu_j \leftarrow \frac{s_j m_0 + r_0 \sum_{i \mid z_i = j} x_i}{s_j + n_j r_0}$$

$$s_j \leftarrow \frac{b_0 + \frac{1}{2}\sum_{i \mid z_i = j} (x_i - \mu_j)^2}{1 + a_0 + \frac{n_j}{2}}$$

$$z_i \leftarrow \arg\max_j \frac{w_j}{\sqrt{2\pi s_j}}\, e^{-\frac{(x_i - \mu_j)^2}{2 s_j}}$$

◮ Matlab demo (using priors close to non-informativity)
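
Below is a small Python sketch of these hard-assignment updates for 1-D data (the slides use a Matlab demo; function and variable names here are my own). It assumes `beta` is given as a length-k numpy array of Dirichlet hyperparameters.

```python
import numpy as np

def gmm_map_hard(x, k, beta, m0, r0, a0, b0, n_iter=100, seed=0):
    """Hard-assignment coordinate MAP updates for a 1-D Gaussian mixture (sketch).

    x is the data vector, k the number of components; beta, m0, r0, a0, b0 are
    the hyperparameters of the Dirichlet, Gaussian and inverse-gamma priors.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    z = rng.integers(0, k, size=n)                   # random initial assignments
    mu = rng.choice(x, size=k, replace=False)        # initial means from the data
    s = np.full(k, np.var(x))
    for _ in range(n_iter):
        nj = np.array([(z == j).sum() for j in range(k)])
        # mixture weights: w_j <- (beta_j + n_j - 1) / (n - k + sum(beta))
        w = (beta + nj - 1) / (n - k + beta.sum())
        for j in range(k):
            xj = x[z == j]
            # mu_j <- (s_j m0 + r0 sum x_i) / (s_j + n_j r0)
            mu[j] = (s[j] * m0 + r0 * xj.sum()) / (s[j] + nj[j] * r0)
            # s_j <- (b0 + 0.5 sum (x_i - mu_j)^2) / (1 + a0 + n_j / 2)
            s[j] = (b0 + 0.5 * np.sum((xj - mu[j]) ** 2)) / (1 + a0 + nj[j] / 2)
        # z_i <- argmax_j w_j N(x_i | mu_j, s_j); guard against empty components
        loglik = (np.log(np.maximum(w, 1e-12)) - 0.5 * np.log(2 * np.pi * s)
                  - 0.5 * (x[:, None] - mu) ** 2 / s)
        z = np.argmax(loglik, axis=1)
    return w, mu, s, z
```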

Learning and Inference in Graphical Models. Chapter 08 – p. 10/28


Example: Gaussian mixtures revisited

Observations:

◮ convergence is very fast

◮ the result depends very much on the initialization

◮ we treat the $z_i$ like parameters of the model although the mixture model is completely specified by $\vec{w}, \mu_1, \ldots, \mu_k, s_1, \ldots, s_k$

◮ the $z_i$ are not parameters of the mixture but latent variables which are only used to simplify our calculations

◮ why should we maximize the posterior w.r.t. the $z_i$?

[Figure: plate diagram of the Gaussian mixture model, as on slide 9.]

Learning and Inference in Graphical Models. Chapter 08 – p. 11/28


Latent variables

◮ Latent variables are

• not part of the stochastic model

• not interesting for the final estimate

• useful to simplify calculations

• often interpreted as missing observations

◮ Examples

• the class assignment variables z_i in the mixture modeling can be interpreted as missing class labels for a multi-class distribution

• the missing distances r_i in the bearing-only tracking task can be interpreted as missing parts of the data

• occluded parts of an object in an image can be seen as missing pixels

• data from a statistical evaluation which have been lost

Learning and Inference in Graphical Models. Chapter 08 – p. 12/28


Incomplete data problems

Let us assume that all data $\vec{x}$ are split into an observed part $\vec{y}$ and a missing part $\vec{z}$, i.e. $\vec{x} = (\vec{y}, \vec{z})$. We can distinguish three cases:

◮ completely missing at random (CMAR): whether an entry of $\vec{x}$ belongs to $\vec{y}$ or $\vec{z}$ is stochastically independent of both $\vec{y}$ and $\vec{z}$:

$$P(x_i \text{ belongs to } \vec{z}) = P(x_i \text{ belongs to } \vec{z} \mid \vec{y}) = P(x_i \text{ belongs to } \vec{z} \mid \vec{y}, \vec{z})$$

◮ missing at random (MAR): whether an entry of $\vec{x}$ belongs to $\vec{y}$ or $\vec{z}$ is stochastically independent of $\vec{z}$ but might depend on $\vec{y}$:

$$P(x_i \text{ belongs to } \vec{z}) \neq P(x_i \text{ belongs to } \vec{z} \mid \vec{y}) = P(x_i \text{ belongs to } \vec{z} \mid \vec{y}, \vec{z})$$

◮ censored data: whether an entry of $\vec{x}$ belongs to $\vec{y}$ or $\vec{z}$ is stochastically dependent on $\vec{z}$:

$$P(x_i \text{ belongs to } \vec{z} \mid \vec{y}) \neq P(x_i \text{ belongs to } \vec{z} \mid \vec{y}, \vec{z})$$

Learning and Inference in Graphical Models. Chapter 08 – p. 13/28


Incomplete data problems

◮ Discuss the following examples of incomplete data:

• the zi in mixture models

• a sensor that measures values only down to a certain minimal value

• an interrupted connection between a sensor and a host computer so that some measurements are not transmitted

• a stereo camera system that measures light intensity and distance but is unable to calculate the distance for overexposed areas

• a sensor that fails often if temperatures are low
  – if the sensor measures the activities of the sun
  – if the sensor measures the persons on a beach

• non-responses at public opinion polls

Learning and Inference in Graphical Models. Chapter 08 – p. 14/28


Incomplete data problems

◮ consequences for stochastic analysis

• CMAR: no problem at all, incomplete data do not disturb our results

• MAR: can be treated if we model the stochastic dependency between the observed data and the missing data

• censored data: no general treatment is possible at all. Results will be disturbed. No reconstruction of the missing data is possible.

◮ we focus on the CMAR+MAR case here

Learning and Inference in Graphical Models. Chapter 08 – p. 15/28


Inference for incomplete data problems

◮ variational Bayes, Monte Carlo:
Model the full posterior over the parameters of the model and the latent (missing) data. Afterwards, ignore the latent variables and return the result for the parameters of your model.

◮ direct MAP/ML:
Do not maximize the posterior/likelihood over the parameters and the latent variables. Instead, consider all possible values that the latent variables can take and maximize the posterior/likelihood only w.r.t. the parameters of your stochastic model.
→ expectation-maximization algorithm (EM)
→ expectation-conditional-maximization algorithm (ECM)

Learning and Inference in Graphical Models. Chapter 08 – p. 16/28


EM algorithm

Let us denote

◮ the parameters of the stochastic model (the posterior distribution): $\vec{\theta} = (\theta_1, \ldots, \theta_k)$

◮ the latent variables: $\vec{\lambda} = (\lambda_1, \ldots, \lambda_m)$

◮ the observed data: $\vec{o} = (o_1, \ldots, o_n)$

◮ the log-posterior: $L(\vec{\theta}, \vec{\lambda}, \vec{o}) = \sum_i \log f_i(\mathrm{Neighbors}(i))$

Learning and Inference in Graphical Models. Chapter 08 – p. 17/28


EM algorithm

◮ We aim at maximizing the expected log-posterior over all values of the latent variables:

$$\arg\max_{\vec{\theta}} \int_{\mathbb{R}^m} L(\vec{\theta}, \vec{\lambda}, \vec{o}) \cdot p(\vec{\lambda} \mid \vec{\theta}, \vec{o})\, d\vec{\lambda}$$

◮ an iterative approach to solve it:

1. start with some parameter vector $\vec{\theta}$
2. repeat
3. $Q(\vec{\theta}') \leftarrow \int_{\mathbb{R}^m} L(\vec{\theta}', \vec{\lambda}, \vec{o}) \cdot p(\vec{\lambda} \mid \vec{\theta}, \vec{o})\, d\vec{\lambda}$
4. $\vec{\theta} \leftarrow \arg\max_{\vec{\theta}'} Q(\vec{\theta}')$
5. until convergence

◮ This algorithm is known as the expectation maximization algorithm (Dempster, Laird, Rubin, 1977)

• step 3: expectation step (E-step)

• step 4: maximization step (M-step)
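
The loop above can be written down generically; the following Python skeleton is only an illustration (the names `e_step`, `m_step`, `theta0` are placeholders I chose). The E-step callable is assumed to return whatever statistics are needed to represent Q, and the M-step callable to return the maximizer of Q.

```python
import numpy as np

def em(e_step, m_step, theta0, n_iter=100, tol=1e-8):
    """Generic EM loop (sketch).

    e_step(theta)  computes the quantities that define Q(theta'), i.e. the
                   expectations under p(lambda | theta, o)
    m_step(stats)  returns argmax_theta' Q(theta')
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        stats = e_step(theta)                  # E-step
        theta_new = np.asarray(m_step(stats))  # M-step
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new                   # converged
        theta = theta_new
    return theta
```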

Learning and Inference in Graphical Models. Chapter 08 – p. 18/28


EM algorithm

◮ Remarks:

• during the E-step, intermediate variables are calculated which allow us to represent Q without relying on the previous values of $\vec{\theta}$

• closed-form expressions for Q and the explicit maximization often require lengthy algebraic calculations

• for some applications, calculating the E-step means calculating the expectation values of the latent variables. But this does not apply in general.

◮ Famous application areas

• mixture distributions

• learning hidden Markov models from example sequences (Baum-Welch algorithm)

Learning and Inference in Graphical Models. Chapter 08 – p. 19/28


Example: bearing-only tracking revisited

◮ conditional distribution:

$$r_i \mid \vec{x}_0, \vec{v}, t_i, \vec{w}_i \sim \mathcal{N}\!\left(\vec{w}_i^T(\vec{x}_0 + t_i \vec{v}),\; \sigma^2\right)$$

◮ the posterior distribution:

$$\underbrace{\frac{1}{2\pi\sqrt{|R|}}\, e^{-\frac{1}{2}(\vec{x}_0 - \vec{a})^T R^{-1}(\vec{x}_0 - \vec{a})}}_{\text{prior of } \vec{x}_0} \cdot \underbrace{\frac{1}{2\pi\sqrt{|S|}}\, e^{-\frac{1}{2}(\vec{v} - \vec{b})^T S^{-1}(\vec{v} - \vec{b})}}_{\text{prior of } \vec{v}} \cdot \prod_{i=1}^{n} \underbrace{\frac{1}{2\pi\sigma^2}\, e^{-\frac{1}{2}\frac{\|\vec{x}_0 + t_i\vec{v} - r_i\vec{w}_i\|^2}{\sigma^2}}}_{\text{data term}}$$

[Figure: as on slide 7 — object movement, observer, angle of observation, unknown distance; plate diagram with nodes $t_i$, $\sigma$, $\vec{x}_0$, $\vec{v}$, $\vec{y}_i$, $\vec{w}_i$, $r_i$ for $i = 1, \ldots, n$.]

Learning and Inference in Graphical Models. Chapter 08 – p. 20/28


Example: bearing-only tracking revisited

... (after lengthy, error-prone calculations) ...

$$Q(\vec{x}_0', \vec{v}') = \text{const} - \frac{1}{2}(\vec{x}_0' - \vec{a})^T R^{-1}(\vec{x}_0' - \vec{a}) - \frac{1}{2}(\vec{v}' - \vec{b})^T S^{-1}(\vec{v}' - \vec{b}) - \frac{1}{2}\sum_{i=1}^{n} \frac{\|\vec{x}_0' + t_i\vec{v}' - \rho_i\vec{w}_i\|^2}{\sigma^2}$$

with

$$\rho_i = \begin{cases} r_i & \text{if } r_i \text{ is observed} \\ (\vec{x}_0 + t_i\vec{v})^T \vec{w}_i & \text{if } r_i \text{ is unobserved} \end{cases}$$

Determining the maxima w.r.t. $\vec{x}_0', \vec{v}'$:

$$\begin{pmatrix} R^{-1} + \frac{n}{\sigma^2} I & \left(\frac{1}{\sigma^2}\sum_i t_i\right) I \\ \left(\frac{1}{\sigma^2}\sum_i t_i\right) I & S^{-1} + \left(\frac{1}{\sigma^2}\sum_i t_i^2\right) I \end{pmatrix} \cdot \begin{pmatrix} \vec{x}_0' \\ \vec{v}' \end{pmatrix} = \begin{pmatrix} R^{-1}\vec{a} + \frac{1}{\sigma^2}\sum_i \rho_i \vec{w}_i \\ S^{-1}\vec{b} + \frac{1}{\sigma^2}\sum_i t_i \rho_i \vec{w}_i \end{pmatrix}$$

→ Matlab demo (using non-informative priors)
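
As an illustration, a compact Python version of this E-step/M-step pair might look as follows (the actual demo is in Matlab; the function name and the convention that missing $r_i$ are marked with `np.nan` are my own assumptions, and 2-D positions are assumed).

```python
import numpy as np

def bearing_only_em(w, t, r_obs, a, R, b, S, sigma2, n_iter=100):
    """EM for bearing-only tracking (sketch): the E-step fills the expected
    distances rho_i, the M-step solves the 4x4 linear system for (x0, v)."""
    n = len(t)
    Rinv, Sinv, I = np.linalg.inv(R), np.linalg.inv(S), np.eye(2)
    x0, v = a.copy(), b.copy()
    for _ in range(n_iter):
        # E-step: rho_i = r_i if observed, else w_i^T (x0 + t_i v)
        pred = np.einsum('ij,ij->i', w, x0 + t[:, None] * v)
        rho = np.where(np.isnan(r_obs), pred, r_obs)
        # M-step: solve the block system from the slide
        A = np.block([
            [Rinv + (n / sigma2) * I,             (t.sum() / sigma2) * I],
            [(t.sum() / sigma2) * I,   Sinv + (np.sum(t ** 2) / sigma2) * I],
        ])
        rhs = np.concatenate([
            Rinv @ a + (rho[:, None] * w).sum(0) / sigma2,
            Sinv @ b + (t[:, None] * rho[:, None] * w).sum(0) / sigma2,
        ])
        sol = np.linalg.solve(A, rhs)
        x0, v = sol[:2], sol[2:]
    return x0, v
```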

Learning and Inference in Graphical Models. Chapter 08 – p. 21/28


ECM algorithm

◮ We still aim at maximizing the expected log-posterior over all values of the latent variables:

$$\arg\max_{\vec{\theta}} \int_{\mathbb{R}^m} L(\vec{\theta}, \vec{\lambda}, \vec{o}) \cdot p(\vec{\lambda} \mid \vec{\theta}, \vec{o})\, d\vec{\lambda}$$

◮ sometimes, the M-step of the EM algorithm cannot be calculated, i.e.

$$\arg\max_{\theta_1', \ldots, \theta_k'} Q(\vec{\theta}')$$

cannot be resolved analytically. But it might happen that

$$\arg\max_{\theta_i'} Q(\vec{\theta}')$$

can be resolved for each $\theta_i'$ or for groups of parameters

◮ Expectation-conditional-maximization algorithm (Meng & Rubin, 1993)

Learning and Inference in Graphical Models. Chapter 08 – p. 22/28


ECM algorithm

◮ Define a set of constraints $g_i(\vec{\theta}', \vec{\theta})$ on the parameter set, e.g.

$$g_i: \quad \theta_j' = \theta_j \ \text{ for all } j \neq i$$

◮ replace the single M-step of the EM algorithm by a sequence of CM-steps, one for each constraint:

1. start with some parameter vector $\vec{\theta}$
2. repeat
3. $Q(\vec{\theta}') \leftarrow \int_{\mathbb{R}^m} L(\vec{\theta}', \vec{\lambda}, \vec{o}) \cdot p(\vec{\lambda} \mid \vec{\theta}, \vec{o})\, d\vec{\lambda}$ (E-step)
4. $\vec{\theta} \leftarrow \arg\max_{\vec{\theta}'} Q(\vec{\theta}')$ subject to $g_1(\vec{\theta}', \vec{\theta})$ (CM-step)
5. ...
6. $\vec{\theta} \leftarrow \arg\max_{\vec{\theta}'} Q(\vec{\theta}')$ subject to $g_\nu(\vec{\theta}', \vec{\theta})$ (CM-step)
7. until convergence
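
A generic ECM loop differs from the EM skeleton shown after slide 18 only in the maximization part; here is a hedged Python sketch (the callable names are placeholders), where each entry of `cm_steps` updates only the parameter block permitted by its constraint $g_i$.

```python
import numpy as np

def ecm(e_step, cm_steps, theta0, n_iter=100, tol=1e-8):
    """Generic ECM loop (sketch): after each E-step, a sequence of conditional
    maximization steps is applied, each one maximizing Q only over the
    parameters its constraint g_i leaves free."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        stats = e_step(theta)                      # E-step
        theta_new = theta.copy()
        for cm in cm_steps:                        # CM-steps, one per constraint
            theta_new = np.asarray(cm(theta_new, stats))
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new                       # converged
        theta = theta_new
    return theta
```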

Learning and Inference in Graphical Models. Chapter 08 – p. 23/28


Example: Gaussian mixtures revisited

[Figure: plate diagram of the Gaussian mixture model, as on slide 9.]

$$\mu_j \sim \mathcal{N}(m_0, r_0) \qquad s_j \sim \Gamma^{-1}(a_0, b_0) \qquad \vec{w} \sim \mathcal{D}(\vec{\beta}) \qquad Z_i \mid \vec{w} \sim \mathcal{C}(\vec{w}) \qquad X_i \mid Z_i, \mu_{Z_i}, s_{Z_i} \sim \mathcal{N}(\mu_{Z_i}, s_{Z_i})$$

◮ conditional distribution (cf. slide 07/35):

$$z_i \mid \vec{w}, x_i, \mu_1, \ldots, \mu_k, s_1, \ldots, s_k \sim \mathcal{C}(h_{i,1}, \ldots, h_{i,k}) \quad \text{with } h_{i,j} \propto \frac{w_j}{\sqrt{2\pi s_j}}\, e^{-\frac{(x_i - \mu_j)^2}{2 s_j}}$$

Learning and Inference in Graphical Models. Chapter 08 – p. 24/28


Example: Gaussian mixtures revisited

◮

$$Q(\vec{w}', \mu_1', \ldots, \mu_k', s_1', \ldots, s_k') = \sum_{z_1=1}^{k} \cdots \sum_{z_n=1}^{k} \Bigg( \bigg( \underbrace{\sum_{j=1}^{k} \log\Big(\tfrac{1}{\sqrt{2\pi r_0}}\, e^{-\frac{(\mu_j' - m_0)^2}{2 r_0}}\Big)}_{\text{prior of } \mu_j'} + \underbrace{\sum_{j=1}^{k} \log\Big(\tfrac{b_0^{a_0}}{\Gamma(a_0)}\, (s_j')^{-a_0-1} e^{-\frac{b_0}{s_j'}}\Big)}_{\text{prior of } s_j'} + \underbrace{\log\Big(\tfrac{\Gamma(\beta_1 + \cdots + \beta_k)}{\Gamma(\beta_1) \cdots \Gamma(\beta_k)} \prod_{j=1}^{k} (w_j')^{\beta_j - 1}\Big)}_{\text{prior of } \vec{w}'} + \sum_{i=1}^{n} \Big( \underbrace{\log\Big(\tfrac{1}{\sqrt{2\pi s_{z_i}'}}\, e^{-\frac{(x_i - \mu_{z_i}')^2}{2 s_{z_i}'}}\Big)}_{\text{data term of } x_i} + \underbrace{\log(w_{z_i}')}_{\text{data term of } z_i} \Big) \bigg) \cdot h_{n,z_n} \cdots h_{1,z_1} \Bigg)$$

◮ easily, we can maximize Q (blackboard/homework)
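
For reference, here is a short Python sketch of the resulting EM iteration for a 1-D Gaussian mixture. To keep it brief it uses non-informative priors, i.e. the plain ML M-step (matching the Matlab demo on the next slide); the MAP M-step with the priors above is the homework mentioned in the last bullet. All names are my own.

```python
import numpy as np

def gmm_em(x, k, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture (sketch, ML / non-informative priors):
    the E-step computes the responsibilities h_ij, the M-step the standard
    weighted updates of w, mu and s."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)        # initial means from the data
    s = np.full(k, np.var(x))
    for _ in range(n_iter):
        # E-step: h_ij proportional to w_j N(x_i | mu_j, s_j)
        logh = (np.log(w) - 0.5 * np.log(2 * np.pi * s)
                - 0.5 * (x[:, None] - mu) ** 2 / s)
        logh -= logh.max(axis=1, keepdims=True)      # numerical stability
        h = np.exp(logh)
        h /= h.sum(axis=1, keepdims=True)
        # M-step: maximize Q w.r.t. w, mu, s
        Nj = h.sum(axis=0)
        w = Nj / n
        mu = (h * x[:, None]).sum(axis=0) / Nj
        s = (h * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return w, mu, s
```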

Learning and Inference in Graphical Models. Chapter 08 – p. 25/28


Example: Gaussian mixtures revisited

◮ Matlab demo (using non-informative priors)

◮ Some observations on EM/ECM for Gaussian mixtures

• very popular

• very sensitive to initialization of parameters

• overfits the data if the mixture is too large (for ML/MAP with non-informative priors)

Learning and Inference in Graphical Models. Chapter 08 – p. 26/28


Laplace approximation

MAP calculates a best estimate. Can we derive an approximation for the posterior distribution?

Idea: determine a Gaussian that is locally most similar to the posterior.

Taylor approximation of the log-posterior around the MAP estimate $\vec{\theta}_{MAP}$:

$$\log p(\vec{\theta}) \approx \log p(\vec{\theta}_{MAP}) + \mathrm{grad} \cdot (\vec{\theta} - \vec{\theta}_{MAP}) + \frac{1}{2}(\vec{\theta} - \vec{\theta}_{MAP})^T H (\vec{\theta} - \vec{\theta}_{MAP}) = \log p(\vec{\theta}_{MAP}) + \frac{1}{2}(\vec{\theta} - \vec{\theta}_{MAP})^T H (\vec{\theta} - \vec{\theta}_{MAP})$$

with $H$ the Hessian of $\log p$; the gradient term vanishes since $\vec{\theta}_{MAP}$ is a maximum of $\log p$.

log of a Gaussian around $\vec{\theta}_{MAP}$:

$$\log \frac{1}{\sqrt{2\pi}^{\,d} \sqrt{|\Sigma|}} - \frac{1}{2}(\vec{\theta} - \vec{\theta}_{MAP})^T \Sigma^{-1} (\vec{\theta} - \vec{\theta}_{MAP})$$

We obtain the same shape of the Gaussian if we choose $\Sigma^{-1} = -H$. This is known as the Laplace approximation.
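
A numerical sketch in Python: estimate the Hessian of the log-posterior at the MAP by finite differences and set $\Sigma = -H^{-1}$. The function name and the finite-difference step are my choices; for models where the Hessian is available in closed form one would use that instead.

```python
import numpy as np

def laplace_approximation(log_posterior, theta_map, eps=1e-4):
    """Laplace approximation (sketch): fit N(theta_map, Sigma) with
    Sigma = -H^{-1}, where H is the Hessian of the log-posterior at the MAP,
    estimated here by central finite differences."""
    d = len(theta_map)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.zeros(d), np.zeros(d)
            e_i[i], e_j[j] = eps, eps
            # mixed central difference for d^2 log p / (d theta_i d theta_j)
            H[i, j] = (log_posterior(theta_map + e_i + e_j)
                       - log_posterior(theta_map + e_i - e_j)
                       - log_posterior(theta_map - e_i + e_j)
                       + log_posterior(theta_map - e_i - e_j)) / (4 * eps ** 2)
    Sigma = -np.linalg.inv(H)
    return theta_map, Sigma

# toy check: for a Gaussian log-density the approximation is exact
if __name__ == "__main__":
    Sigma_true = np.array([[2.0, 0.3], [0.3, 1.0]])
    logp = lambda th: -0.5 * th @ np.linalg.solve(Sigma_true, th)
    print(laplace_approximation(logp, np.zeros(2))[1])   # approx. Sigma_true
```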

Learning and Inference in Graphical Models. Chapter 08 – p. 27/28


Summary

◮ direct maximization of likelihood/posterior

◮ latent variables

◮ incomplete data problems

◮ EM/ECM algorithm

◮ Laplace approximation

Learning and Inference in Graphical Models. Chapter 08 – p. 28/28