Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems


Transcript of Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems


LEARNING AND INFERENCE IN GRAPHICAL MODELS

Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems

Dr. Martin Lauer

University of Freiburg
Machine Learning Lab

Karlsruhe Institute of Technology
Institute of Measurement and Control Systems

Learning and Inference in Graphical Models. Chapter 08 – p. 1/28


References for this chapter

◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 9, Springer, 2006

◮ Joseph L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, 1997

◮ Zoubin Ghahramani, Michael I. Jordan, Learning from Incomplete Data, Technical Report #1509, MIT Artificial Intelligence Laboratory, 1994, http://dspace.mit.edu/bitstream/handle/1721.1/7202/AIM-1509.pdf?sequence=2

◮ Arthur P. Dempster, Nan M. Laird, Donald B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, in: Journal of the Royal Statistical Society Series B, vol. 39, pp. 1-38, 1977

◮ Xiao-Li Meng and Donald B. Rubin, Maximum Likelihood Estimation via the ECM Algorithm: A General Framework, in: Biometrika, vol. 80, no. 2, pp. 267-278, 1993

Learning and Inference in Graphical Models. Chapter 08 – p. 2/28


Motivation

◮ up to now:

1. calculate/approximate p(parameters | data)
2. find a "meaningful" reference value for p(parameters | data), e.g. argmax_parameters p(parameters | data)

◮ this requires more calculation than is actually necessary

◮ this chapter:

• find argmax_parameters p(parameters | data) directly (MAP) or

• find argmax_parameters p(data | parameters) directly (ML)

◮ Remark: ML and MAP require basically the same approaches. The only difference is whether we consider priors (which are just additional factors in graphical models). Therefore, we consider both approaches together.

Learning and Inference in Graphical Models. Chapter 08 – p. 3/28


Direct MAP calculation

◮ Posterior distribution in a graphical model:

$$p(u_1, \ldots, u_n \mid o_1, \ldots, o_m) = \frac{p(u_1, \ldots, u_n, o_1, \ldots, o_m)}{p(o_1, \ldots, o_m)} \propto p(u_1, \ldots, u_n, o_1, \ldots, o_m) = \prod_i f_i(\mathrm{Neighbors}(i)) = e^{\sum_i \log f_i(\mathrm{Neighbors}(i))}$$

◮ MAP means: solve

$$\arg\max_{u_1, \ldots, u_n} \sum_i \log f_i(\mathrm{Neighbors}(i))$$

Learning and Inference in Graphical Models. Chapter 08 – p. 4/28


Direct MAP calculation

Ways to find the MAP

◮ The system of equations

$$\frac{\partial}{\partial u_j} \sum_i \log f_i(\mathrm{Neighbors}(i)) = 0, \qquad j = 1, \ldots, n$$

can be resolved analytically → analytical solution for the MAP

◮ Each single equation

$$\frac{\partial}{\partial u_j} \sum_i \log f_i(\mathrm{Neighbors}(i)) = 0$$

can be solved analytically → use an iterative approach

Learning and Inference in Graphical Models. Chapter 08 – p. 5/28


Direct MAP calculation

◮ Iterative approach

1. repeat
2. set $u_1 \leftarrow \arg\max_{u_1} \sum_i \log f_i(\mathrm{Neighbors}(i))$
3. set $u_2 \leftarrow \arg\max_{u_2} \sum_i \log f_i(\mathrm{Neighbors}(i))$
4. ...
5. set $u_n \leftarrow \arg\max_{u_n} \sum_i \log f_i(\mathrm{Neighbors}(i))$
6. until convergence
7. return $(u_1, \ldots, u_n)$

◮ The derivatives

$$\frac{\partial}{\partial u_j} \sum_i \log f_i(\mathrm{Neighbors}(i))$$

can be calculated easily → use a generic gradient descent algorithm for the numerical solution

The second approach (the coordinate-wise iteration above) often converges faster than generic gradient descent.
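
To make the coordinate-wise scheme above concrete, here is a minimal Python sketch (not from the slides; the function name `coordinate_map` and its arguments are hypothetical). It assumes the inner one-dimensional maximizations are done numerically with `scipy.optimize.minimize_scalar`; in the examples that follow they have closed-form solutions instead.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_map(log_posterior, u_init, n_iter=100, tol=1e-8):
    """Coordinate-wise maximization of a log-posterior sum_i log f_i(Neighbors(i)).

    log_posterior(u) evaluates the objective for the full vector u of
    unobserved variables; each inner step maximizes over one coordinate
    while all others are held fixed.
    """
    u = np.asarray(u_init, dtype=float).copy()
    for _ in range(n_iter):
        u_old = u.copy()
        for j in range(u.size):
            # maximize over u_j alone (minimize the negative log-posterior)
            def neg_lp(v, j=j):
                u_try = u.copy()
                u_try[j] = v
                return -log_posterior(u_try)
            u[j] = minimize_scalar(neg_lp).x
        if np.max(np.abs(u - u_old)) < tol:   # convergence check
            break
    return u

# toy usage: for a spherical Gaussian log-posterior the MAP is its mean
if __name__ == "__main__":
    mean = np.array([1.0, -2.0, 0.5])
    log_post = lambda u: -0.5 * np.sum((u - mean) ** 2)
    print(coordinate_map(log_post, np.zeros(3)))   # approx. [1.0, -2.0, 0.5]
```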

Learning and Inference in Graphical Models. Chapter 08 – p. 6/28


Example: bearing-only tracking revisited

◮ observing a moving object from a fixed position

◮ object moves with constant velocity

◮ for every point in time, the observer senses the angle of observation, but only sometimes the distance to the object

◮ distributions:

$$\vec{x}_0 \sim \mathcal{N}(\vec{a}, R) \qquad \vec{v} \sim \mathcal{N}(\vec{b}, S) \qquad \vec{y}_i \mid \vec{x}_0, \vec{v} \sim \mathcal{N}(\vec{x}_0 + t_i \vec{v}, \sigma^2 I) \qquad r_i = \|\vec{y}_i\| \qquad \vec{w}_i = \frac{\vec{y}_i}{\|\vec{y}_i\|}$$

[Figure: sketch of the object movement seen from the observer, with the angle of observation and the unknown distance; plate diagram over $i = 1, \ldots, n$ with nodes $t_i$, $\sigma$, $\vec{x}_0$, $\vec{v}$, $\vec{y}_i$, $\vec{w}_i$, $r_i$.]

Learning and Inference in Graphical Models. Chapter 08 – p. 7/28


Example: bearing-only tracking revisited

◮ conditional distributions:

$$\vec{x}_0 \mid \vec{v}, (\vec{y}_i), (t_i) \sim \mathcal{N}\!\left(\left(\tfrac{n}{\sigma^2} I + R^{-1}\right)^{-1}\!\left(\tfrac{1}{\sigma^2}\sum_i (\vec{y}_i - t_i \vec{v}) + R^{-1}\vec{a}\right),\; \left(\tfrac{n}{\sigma^2} I + R^{-1}\right)^{-1}\right)$$

$$\vec{v} \mid \vec{x}_0, (\vec{y}_i), (t_i) \sim \mathcal{N}\!\left(\left(\tfrac{1}{\sigma^2}\sum_i t_i^2\, I + S^{-1}\right)^{-1}\!\left(\tfrac{1}{\sigma^2}\sum_i t_i(\vec{y}_i - \vec{x}_0) + S^{-1}\vec{b}\right),\; \left(\tfrac{1}{\sigma^2}\sum_i t_i^2\, I + S^{-1}\right)^{-1}\right)$$

$$r_i \mid \vec{x}_0, \vec{v}, t_i, \vec{w}_i \sim \mathcal{N}\!\left(\vec{w}_i^T(\vec{x}_0 + t_i \vec{v}),\; \sigma^2\right)$$

◮ updates derived from the conditionals:

$$\vec{x}_0 \leftarrow \left(\tfrac{n}{\sigma^2} I + R^{-1}\right)^{-1}\!\left(\tfrac{1}{\sigma^2}\sum_i (\vec{y}_i - t_i \vec{v}) + R^{-1}\vec{a}\right)$$

$$\vec{v} \leftarrow \left(\tfrac{1}{\sigma^2}\sum_i t_i^2\, I + S^{-1}\right)^{-1}\!\left(\tfrac{1}{\sigma^2}\sum_i t_i(\vec{y}_i - \vec{x}_0) + S^{-1}\vec{b}\right)$$

$$r_i \leftarrow \vec{w}_i^T(\vec{x}_0 + t_i \vec{v})$$

◮ Matlab demo (using non-informative priors)
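
A possible numerical sketch of these updates in Python is shown below (the slides use a Matlab demo instead). The function name and argument layout are my own; it assumes 2-D positions, that missing distances are marked with `np.nan`, and that $\vec{y}_i = r_i \vec{w}_i$ as defined on slide 7.

```python
import numpy as np

def bearing_only_map(w, t, r_obs, a, R, b, S, sigma2, n_iter=200):
    """Coordinate-wise MAP updates for bearing-only tracking (a sketch).

    w      : (n, 2) unit bearing vectors w_i
    t      : (n,)   observation times t_i
    r_obs  : (n,)   observed distances; np.nan where the distance is missing
    a, R   : prior mean/covariance of the start position x0
    b, S   : prior mean/covariance of the velocity v
    sigma2 : observation noise variance
    """
    n = len(t)
    Rinv, Sinv = np.linalg.inv(R), np.linalg.inv(S)
    x0, v = a.copy(), b.copy()
    r = np.where(np.isnan(r_obs), 1.0, r_obs)        # init missing distances (arbitrary)
    for _ in range(n_iter):
        y = r[:, None] * w                           # y_i = r_i * w_i
        # x0 <- (n/s2 I + R^-1)^-1 (1/s2 sum(y_i - t_i v) + R^-1 a)
        A = (n / sigma2) * np.eye(2) + Rinv
        x0 = np.linalg.solve(A, (y - t[:, None] * v).sum(0) / sigma2 + Rinv @ a)
        # v <- (sum(t_i^2)/s2 I + S^-1)^-1 (1/s2 sum t_i (y_i - x0) + S^-1 b)
        B = (np.sum(t ** 2) / sigma2) * np.eye(2) + Sinv
        v = np.linalg.solve(B, (t[:, None] * (y - x0)).sum(0) / sigma2 + Sinv @ b)
        # unobserved distances: r_i <- w_i^T (x0 + t_i v)
        pred = np.einsum('ij,ij->i', w, x0 + t[:, None] * v)
        r = np.where(np.isnan(r_obs), pred, r_obs)
    return x0, v, r
```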

Learning and Inference in Graphical Models. Chapter 08 – p. 8/28


Example: Gaussian mixtures revisited

[Figure: plate diagram of the Gaussian mixture model — hyperparameters $m_0, r_0, a_0, b_0$; component parameters $\mu_j, s_j$ in a plate of size $k$; mixture weights $\vec{w}$; latent labels $Z_i$ and observations $X_i$ in a plate of size $n$.]

$$\mu_j \sim \mathcal{N}(m_0, r_0) \qquad s_j \sim \Gamma^{-1}(a_0, b_0) \qquad \vec{w} \sim \mathcal{D}(\vec{\beta}) \qquad Z_i \mid \vec{w} \sim \mathcal{C}(\vec{w}) \qquad X_i \mid Z_i, \mu_{Z_i}, s_{Z_i} \sim \mathcal{N}(\mu_{Z_i}, s_{Z_i})$$

Learning and Inference in Graphical Models. Chapter 08 – p. 9/28


Example: Gaussian mixtures revisited

◮ conditional distributions: see slide 07/36

◮ derive MAP updates:

$$\vec{w} \leftarrow \left(\frac{\beta_1 + n_1 - 1}{n - k + \sum_{j=1}^k \beta_j}, \ldots, \frac{\beta_k + n_k - 1}{n - k + \sum_{j=1}^k \beta_j}\right) \quad \text{with } n_j = |\{i \mid z_i = j\}|$$

$$\mu_j \leftarrow \frac{s_j m_0 + r_0 \sum_{i \mid z_i = j} x_i}{s_j + n_j r_0}$$

$$s_j \leftarrow \frac{b_0 + \frac{1}{2}\sum_{i \mid z_i = j} (x_i - \mu_j)^2}{1 + a_0 + \frac{n_j}{2}}$$

$$z_i \leftarrow \arg\max_j \frac{w_j}{\sqrt{2\pi s_j}}\, e^{-\frac{(x_i - \mu_j)^2}{2 s_j}}$$

◮ Matlab demo (using priors close to non-informativity)
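
Below is a small Python sketch of these hard-assignment updates for 1-D data (the slides use a Matlab demo; function and variable names here are my own). It assumes `beta` is given as a length-k numpy array of Dirichlet hyperparameters.

```python
import numpy as np

def gmm_map_hard(x, k, beta, m0, r0, a0, b0, n_iter=100, seed=0):
    """Hard-assignment coordinate MAP updates for a 1-D Gaussian mixture (sketch).

    x is the data vector, k the number of components; beta, m0, r0, a0, b0 are
    the hyperparameters of the Dirichlet, Gaussian and inverse-gamma priors.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    z = rng.integers(0, k, size=n)                   # random initial assignments
    mu = rng.choice(x, size=k, replace=False)        # initial means from the data
    s = np.full(k, np.var(x))
    for _ in range(n_iter):
        nj = np.array([(z == j).sum() for j in range(k)])
        # mixture weights: w_j <- (beta_j + n_j - 1) / (n - k + sum(beta))
        w = (beta + nj - 1) / (n - k + beta.sum())
        for j in range(k):
            xj = x[z == j]
            # mu_j <- (s_j m0 + r0 sum x_i) / (s_j + n_j r0)
            mu[j] = (s[j] * m0 + r0 * xj.sum()) / (s[j] + nj[j] * r0)
            # s_j <- (b0 + 0.5 sum (x_i - mu_j)^2) / (1 + a0 + n_j / 2)
            s[j] = (b0 + 0.5 * np.sum((xj - mu[j]) ** 2)) / (1 + a0 + nj[j] / 2)
        # z_i <- argmax_j w_j N(x_i | mu_j, s_j); guard against empty components
        loglik = (np.log(np.maximum(w, 1e-12)) - 0.5 * np.log(2 * np.pi * s)
                  - 0.5 * (x[:, None] - mu) ** 2 / s)
        z = np.argmax(loglik, axis=1)
    return w, mu, s, z
```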

Learning and Inference in Graphical Models. Chapter 08 – p. 10/28


Example: Gaussian mixtures revisited

Observations:

◮ convergence is very fast

◮ the result depends very much on the initialization

◮ we treat the $z_i$ like parameters of the model although the mixture model is completely specified by $\vec{w}, \mu_1, \ldots, \mu_k, s_1, \ldots, s_k$

◮ the $z_i$ are not parameters of the mixture but latent variables which are only used to simplify our calculations

◮ why should we maximize the posterior w.r.t. the $z_i$?

[Figure: plate diagram of the Gaussian mixture model, as on slide 9.]

Learning and Inference in Graphical Models. Chapter 08 – p. 11/28


Latent variables

◮ Latent variables are

• not part of the stochastic model

• not interesting for the final estimate

• useful to simplify calculations

• often interpreted as missing observations

◮ Examples

• the class assignment variables z_i in the mixture modeling can be interpreted as missing class labels for a multi-class distribution

• the missing distances r_i in the bearing-only tracking task can be interpreted as missing parts of the data

• occluded parts of an object in an image can be seen as missing pixels

• data from a statistical evaluation which have been lost

Learning and Inference in Graphical Models. Chapter 08 – p. 12/28


Incomplete data problems

Let us assume that all data $\vec{x}$ are split into an observed part $\vec{y}$ and a missing part $\vec{z}$, i.e. $\vec{x} = (\vec{y}, \vec{z})$. We can distinguish three cases:

◮ completely missing at random (CMAR): whether an entry of $\vec{x}$ belongs to $\vec{y}$ or $\vec{z}$ is stochastically independent of both $\vec{y}$ and $\vec{z}$:

$$P(x_i \text{ belongs to } \vec{z}) = P(x_i \text{ belongs to } \vec{z} \mid \vec{y}) = P(x_i \text{ belongs to } \vec{z} \mid \vec{y}, \vec{z})$$

◮ missing at random (MAR): whether an entry of $\vec{x}$ belongs to $\vec{y}$ or $\vec{z}$ is stochastically independent of $\vec{z}$ but might depend on $\vec{y}$:

$$P(x_i \text{ belongs to } \vec{z}) \neq P(x_i \text{ belongs to } \vec{z} \mid \vec{y}) = P(x_i \text{ belongs to } \vec{z} \mid \vec{y}, \vec{z})$$

◮ censored data: whether an entry of $\vec{x}$ belongs to $\vec{y}$ or $\vec{z}$ is stochastically dependent on $\vec{z}$:

$$P(x_i \text{ belongs to } \vec{z} \mid \vec{y}) \neq P(x_i \text{ belongs to } \vec{z} \mid \vec{y}, \vec{z})$$

Learning and Inference in Graphical Models. Chapter 08 – p. 13/28


Incomplete data problems

◮ Discuss the following examples of incomplete data:

• the zi in mixture models

• a sensor that measures values only down to a certain minimal value

• an interrupted connection between a sensor and a host computer so that some measurements are not transmitted

• a stereo camera system that measures light intensity and distance but is unable to calculate the distance for overexposed areas

• a sensor that fails often if temperatures are low
  – if the sensor measures the activities of the sun
  – if the sensor measures the persons on a beach

• non-responses at public opinion polls

Learning and Inference in Graphical Models. Chapter 08 – p. 14/28


Incomplete data problems

◮ consequences for stochastic analysis

• CMAR: no problem at all, incomplete data do not disturb our results

• MAR: can be treated if we model the stochastic dependency between the observed data and the missing data

• censored data: no general treatment is possible at all. Results will be disturbed. No reconstruction of the missing data is possible.

◮ we focus on the CMAR+MAR case here

Learning and Inference in Graphical Models. Chapter 08 – p. 15/28


Inference for incomplete data problems

◮ variational Bayes, Monte Carlo:
Model the full posterior over the parameters of the model and the latent (missing) data. Afterwards, ignore the latent variables and return the result for the parameters of your model.

◮ direct MAP/ML:
Do not maximize the posterior/likelihood over the parameters and the latent variables. Instead, consider all possible values that the latent variables can take and maximize the posterior/likelihood only w.r.t. the parameters of your stochastic model.
→ expectation-maximization algorithm (EM)
→ expectation-conditional-maximization algorithm (ECM)

Learning and Inference in Graphical Models. Chapter 08 – p. 16/28


EM algorithm

Let us denote

◮ the parameters of the stochastic model (the posterior distribution): $\vec{\theta} = (\theta_1, \ldots, \theta_k)$

◮ the latent variables: $\vec{\lambda} = (\lambda_1, \ldots, \lambda_m)$

◮ the observed data: $\vec{o} = (o_1, \ldots, o_n)$

◮ the log-posterior: $L(\vec{\theta}, \vec{\lambda}, \vec{o}) = \sum_i \log f_i(\mathrm{Neighbors}(i))$

Learning and Inference in Graphical Models. Chapter 08 – p. 17/28


EM algorithm

◮ We aim at maximizing the expected log-posterior over all values of the latent variables:

$$\arg\max_{\vec{\theta}} \int_{\mathbb{R}^m} L(\vec{\theta}, \vec{\lambda}, \vec{o}) \cdot p(\vec{\lambda} \mid \vec{\theta}, \vec{o})\, d\vec{\lambda}$$

◮ an iterative approach to solve it:

1. start with some parameter vector $\vec{\theta}$
2. repeat
3. $Q(\vec{\theta}') \leftarrow \int_{\mathbb{R}^m} L(\vec{\theta}', \vec{\lambda}, \vec{o}) \cdot p(\vec{\lambda} \mid \vec{\theta}, \vec{o})\, d\vec{\lambda}$
4. $\vec{\theta} \leftarrow \arg\max_{\vec{\theta}'} Q(\vec{\theta}')$
5. until convergence

◮ This algorithm is known as the expectation maximization algorithm (Dempster, Laird, Rubin, 1977)

• step 3: expectation step (E-step)

• step 4: maximization step (M-step)
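
The loop above can be written down generically; the following Python skeleton is only an illustration (the names `e_step`, `m_step`, `theta0` are placeholders I chose). The E-step callable is assumed to return whatever statistics are needed to represent Q, and the M-step callable to return the maximizer of Q.

```python
import numpy as np

def em(e_step, m_step, theta0, n_iter=100, tol=1e-8):
    """Generic EM loop (sketch).

    e_step(theta)  computes the quantities that define Q(theta'), i.e. the
                   expectations under p(lambda | theta, o)
    m_step(stats)  returns argmax_theta' Q(theta')
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        stats = e_step(theta)                  # E-step
        theta_new = np.asarray(m_step(stats))  # M-step
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new                   # converged
        theta = theta_new
    return theta
```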

Learning and Inference in Graphical Models. Chapter 08 – p. 18/28


EM algorithm

◮ Remarks:

• during the E-step, intermediate variables are calculated which allow us to represent Q without relying on the previous values of $\vec{\theta}$

• closed-form expressions for Q and the explicit maximization often require lengthy algebraic calculations

• for some applications, calculating the E-step means calculating the expectation values of the latent variables. But this does not apply in general.

◮ Famous application areas

• mixture distributions

• learning hidden Markov models from example sequences (Baum-Welch algorithm)

Learning and Inference in Graphical Models. Chapter 08 – p. 19/28


Example: bearing-only tracking revisited

◮ conditional distribution:

$$r_i \mid \vec{x}_0, \vec{v}, t_i, \vec{w}_i \sim \mathcal{N}\!\left(\vec{w}_i^T(\vec{x}_0 + t_i \vec{v}),\; \sigma^2\right)$$

◮ the posterior distribution:

$$\underbrace{\frac{1}{2\pi\sqrt{|R|}}\, e^{-\frac{1}{2}(\vec{x}_0 - \vec{a})^T R^{-1}(\vec{x}_0 - \vec{a})}}_{\text{prior of } \vec{x}_0} \cdot \underbrace{\frac{1}{2\pi\sqrt{|S|}}\, e^{-\frac{1}{2}(\vec{v} - \vec{b})^T S^{-1}(\vec{v} - \vec{b})}}_{\text{prior of } \vec{v}} \cdot \prod_{i=1}^{n} \underbrace{\frac{1}{2\pi\sigma^2}\, e^{-\frac{1}{2}\frac{\|\vec{x}_0 + t_i\vec{v} - r_i\vec{w}_i\|^2}{\sigma^2}}}_{\text{data term}}$$

[Figure: as on slide 7 — object movement, observer, angle of observation, unknown distance; plate diagram with nodes $t_i$, $\sigma$, $\vec{x}_0$, $\vec{v}$, $\vec{y}_i$, $\vec{w}_i$, $r_i$ for $i = 1, \ldots, n$.]

Learning and Inference in Graphical Models. Chapter 08 – p. 20/28


Example: bearing-only tracking revisited

... (after lengthy, error-prone calculations) ...

$$Q(\vec{x}_0', \vec{v}') = \text{const} - \frac{1}{2}(\vec{x}_0' - \vec{a})^T R^{-1}(\vec{x}_0' - \vec{a}) - \frac{1}{2}(\vec{v}' - \vec{b})^T S^{-1}(\vec{v}' - \vec{b}) - \frac{1}{2}\sum_{i=1}^{n} \frac{\|\vec{x}_0' + t_i\vec{v}' - \rho_i\vec{w}_i\|^2}{\sigma^2}$$

with

$$\rho_i = \begin{cases} r_i & \text{if } r_i \text{ is observed} \\ (\vec{x}_0 + t_i\vec{v})^T \vec{w}_i & \text{if } r_i \text{ is unobserved} \end{cases}$$

Determining the maxima w.r.t. $\vec{x}_0', \vec{v}'$:

$$\begin{pmatrix} R^{-1} + \frac{n}{\sigma^2} I & \left(\frac{1}{\sigma^2}\sum_i t_i\right) I \\ \left(\frac{1}{\sigma^2}\sum_i t_i\right) I & S^{-1} + \left(\frac{1}{\sigma^2}\sum_i t_i^2\right) I \end{pmatrix} \cdot \begin{pmatrix} \vec{x}_0' \\ \vec{v}' \end{pmatrix} = \begin{pmatrix} R^{-1}\vec{a} + \frac{1}{\sigma^2}\sum_i \rho_i \vec{w}_i \\ S^{-1}\vec{b} + \frac{1}{\sigma^2}\sum_i t_i \rho_i \vec{w}_i \end{pmatrix}$$

→ Matlab demo (using non-informative priors)
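
As an illustration, a compact Python version of this E-step/M-step pair might look as follows (the actual demo is in Matlab; the function name and the convention that missing $r_i$ are marked with `np.nan` are my own assumptions, and 2-D positions are assumed).

```python
import numpy as np

def bearing_only_em(w, t, r_obs, a, R, b, S, sigma2, n_iter=100):
    """EM for bearing-only tracking (sketch): the E-step fills the expected
    distances rho_i, the M-step solves the 4x4 linear system for (x0, v)."""
    n = len(t)
    Rinv, Sinv, I = np.linalg.inv(R), np.linalg.inv(S), np.eye(2)
    x0, v = a.copy(), b.copy()
    for _ in range(n_iter):
        # E-step: rho_i = r_i if observed, else w_i^T (x0 + t_i v)
        pred = np.einsum('ij,ij->i', w, x0 + t[:, None] * v)
        rho = np.where(np.isnan(r_obs), pred, r_obs)
        # M-step: solve the block system from the slide
        A = np.block([
            [Rinv + (n / sigma2) * I,             (t.sum() / sigma2) * I],
            [(t.sum() / sigma2) * I,   Sinv + (np.sum(t ** 2) / sigma2) * I],
        ])
        rhs = np.concatenate([
            Rinv @ a + (rho[:, None] * w).sum(0) / sigma2,
            Sinv @ b + (t[:, None] * rho[:, None] * w).sum(0) / sigma2,
        ])
        sol = np.linalg.solve(A, rhs)
        x0, v = sol[:2], sol[2:]
    return x0, v
```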

Learning and Inference in Graphical Models. Chapter 08 – p. 21/28


ECM algorithm

◮ We still aim at maximizing the expected log-posterior over all values of the latent variables:

$$\arg\max_{\vec{\theta}} \int_{\mathbb{R}^m} L(\vec{\theta}, \vec{\lambda}, \vec{o}) \cdot p(\vec{\lambda} \mid \vec{\theta}, \vec{o})\, d\vec{\lambda}$$

◮ sometimes, the M-step of the EM algorithm cannot be calculated, i.e.

$$\arg\max_{\theta_1', \ldots, \theta_k'} Q(\vec{\theta}')$$

cannot be resolved analytically. But it might happen that

$$\arg\max_{\theta_i'} Q(\vec{\theta}')$$

can be resolved for each $\theta_i'$ or for groups of parameters

◮ Expectation-conditional-maximization algorithm (Meng & Rubin, 1993)

Learning and Inference in Graphical Models. Chapter 08 – p. 22/28


ECM algorithm

◮ Define a set of constraints $g_i(\vec{\theta}', \vec{\theta})$ on the parameter set, e.g.

$$g_i: \quad \theta_j' = \theta_j \ \text{ for all } j \neq i$$

◮ replace the single M-step of the EM algorithm by a sequence of CM-steps, one for each constraint:

1. start with some parameter vector $\vec{\theta}$
2. repeat
3. $Q(\vec{\theta}') \leftarrow \int_{\mathbb{R}^m} L(\vec{\theta}', \vec{\lambda}, \vec{o}) \cdot p(\vec{\lambda} \mid \vec{\theta}, \vec{o})\, d\vec{\lambda}$ (E-step)
4. $\vec{\theta} \leftarrow \arg\max_{\vec{\theta}'} Q(\vec{\theta}')$ subject to $g_1(\vec{\theta}', \vec{\theta})$ (CM-step)
5. ...
6. $\vec{\theta} \leftarrow \arg\max_{\vec{\theta}'} Q(\vec{\theta}')$ subject to $g_\nu(\vec{\theta}', \vec{\theta})$ (CM-step)
7. until convergence
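
A generic ECM loop differs from the EM skeleton shown after slide 18 only in the maximization part; here is a hedged Python sketch (the callable names are placeholders), where each entry of `cm_steps` updates only the parameter block permitted by its constraint $g_i$.

```python
import numpy as np

def ecm(e_step, cm_steps, theta0, n_iter=100, tol=1e-8):
    """Generic ECM loop (sketch): after each E-step, a sequence of conditional
    maximization steps is applied, each one maximizing Q only over the
    parameters its constraint g_i leaves free."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        stats = e_step(theta)                      # E-step
        theta_new = theta.copy()
        for cm in cm_steps:                        # CM-steps, one per constraint
            theta_new = np.asarray(cm(theta_new, stats))
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new                       # converged
        theta = theta_new
    return theta
```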

Learning and Inference in Graphical Models. Chapter 08 – p. 23/28


Example: Gaussian mixtures revisited

[Figure: plate diagram of the Gaussian mixture model, as on slide 9.]

$$\mu_j \sim \mathcal{N}(m_0, r_0) \qquad s_j \sim \Gamma^{-1}(a_0, b_0) \qquad \vec{w} \sim \mathcal{D}(\vec{\beta}) \qquad Z_i \mid \vec{w} \sim \mathcal{C}(\vec{w}) \qquad X_i \mid Z_i, \mu_{Z_i}, s_{Z_i} \sim \mathcal{N}(\mu_{Z_i}, s_{Z_i})$$

◮ conditional distribution (cf. slide 07/35):

$$z_i \mid \vec{w}, x_i, \mu_1, \ldots, \mu_k, s_1, \ldots, s_k \sim \mathcal{C}(h_{i,1}, \ldots, h_{i,k}) \quad \text{with } h_{i,j} \propto \frac{w_j}{\sqrt{2\pi s_j}}\, e^{-\frac{(x_i - \mu_j)^2}{2 s_j}}$$

Learning and Inference in Graphical Models. Chapter 08 – p. 24/28


Example: Gaussian mixtures revisited

◮

$$Q(\vec{w}', \mu_1', \ldots, \mu_k', s_1', \ldots, s_k') = \sum_{z_1=1}^{k} \cdots \sum_{z_n=1}^{k} \Bigg( \bigg( \underbrace{\sum_{j=1}^{k} \log\Big(\tfrac{1}{\sqrt{2\pi r_0}}\, e^{-\frac{(\mu_j' - m_0)^2}{2 r_0}}\Big)}_{\text{prior of } \mu_j'} + \underbrace{\sum_{j=1}^{k} \log\Big(\tfrac{b_0^{a_0}}{\Gamma(a_0)}\, (s_j')^{-a_0-1} e^{-\frac{b_0}{s_j'}}\Big)}_{\text{prior of } s_j'} + \underbrace{\log\Big(\tfrac{\Gamma(\beta_1 + \cdots + \beta_k)}{\Gamma(\beta_1) \cdots \Gamma(\beta_k)} \prod_{j=1}^{k} (w_j')^{\beta_j - 1}\Big)}_{\text{prior of } \vec{w}'} + \sum_{i=1}^{n} \Big( \underbrace{\log\Big(\tfrac{1}{\sqrt{2\pi s_{z_i}'}}\, e^{-\frac{(x_i - \mu_{z_i}')^2}{2 s_{z_i}'}}\Big)}_{\text{data term of } x_i} + \underbrace{\log(w_{z_i}')}_{\text{data term of } z_i} \Big) \bigg) \cdot h_{n,z_n} \cdots h_{1,z_1} \Bigg)$$

◮ easily, we can maximize Q (blackboard/homework)
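
For reference, here is a short Python sketch of the resulting EM iteration for a 1-D Gaussian mixture. To keep it brief it uses non-informative priors, i.e. the plain ML M-step (matching the Matlab demo on the next slide); the MAP M-step with the priors above is the homework mentioned in the last bullet. All names are my own.

```python
import numpy as np

def gmm_em(x, k, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture (sketch, ML / non-informative priors):
    the E-step computes the responsibilities h_ij, the M-step the standard
    weighted updates of w, mu and s."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)        # initial means from the data
    s = np.full(k, np.var(x))
    for _ in range(n_iter):
        # E-step: h_ij proportional to w_j N(x_i | mu_j, s_j)
        logh = (np.log(w) - 0.5 * np.log(2 * np.pi * s)
                - 0.5 * (x[:, None] - mu) ** 2 / s)
        logh -= logh.max(axis=1, keepdims=True)      # numerical stability
        h = np.exp(logh)
        h /= h.sum(axis=1, keepdims=True)
        # M-step: maximize Q w.r.t. w, mu, s
        Nj = h.sum(axis=0)
        w = Nj / n
        mu = (h * x[:, None]).sum(axis=0) / Nj
        s = (h * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return w, mu, s
```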

Learning and Inference in Graphical Models. Chapter 08 – p. 25/28


Example: Gaussian mixtures revisited

◮ Matlab demo (using non-informative priors)

◮ Some observations on EM/ECM for Gaussian mixtures

• very popular

• very sensitive to initialization of parameters

• overfits the data if the mixture is too large (for ML/MAP with non-informative priors)

Learning and Inference in Graphical Models. Chapter 08 – p. 26/28


Laplace approximation

MAP calculates a best estimate. Can we derive an approximation for the posterior distribution?

Idea: determine a Gaussian that is locally most similar to the posterior.

Taylor approximation of the log-posterior around the MAP estimate $\vec{\theta}_{MAP}$:

$$\log p(\vec{\theta}) \approx \log p(\vec{\theta}_{MAP}) + \mathrm{grad} \cdot (\vec{\theta} - \vec{\theta}_{MAP}) + \frac{1}{2}(\vec{\theta} - \vec{\theta}_{MAP})^T H (\vec{\theta} - \vec{\theta}_{MAP}) = \log p(\vec{\theta}_{MAP}) + \frac{1}{2}(\vec{\theta} - \vec{\theta}_{MAP})^T H (\vec{\theta} - \vec{\theta}_{MAP})$$

with $H$ the Hessian of $\log p$; the gradient term vanishes since $\vec{\theta}_{MAP}$ is a maximum of $\log p$.

log of a Gaussian around $\vec{\theta}_{MAP}$:

$$\log \frac{1}{\sqrt{2\pi}^{\,d} \sqrt{|\Sigma|}} - \frac{1}{2}(\vec{\theta} - \vec{\theta}_{MAP})^T \Sigma^{-1} (\vec{\theta} - \vec{\theta}_{MAP})$$

We obtain the same shape of the Gaussian if we choose $\Sigma^{-1} = -H$. This is known as the Laplace approximation.
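
A numerical sketch in Python: estimate the Hessian of the log-posterior at the MAP by finite differences and set $\Sigma = -H^{-1}$. The function name and the finite-difference step are my choices; for models where the Hessian is available in closed form one would use that instead.

```python
import numpy as np

def laplace_approximation(log_posterior, theta_map, eps=1e-4):
    """Laplace approximation (sketch): fit N(theta_map, Sigma) with
    Sigma = -H^{-1}, where H is the Hessian of the log-posterior at the MAP,
    estimated here by central finite differences."""
    d = len(theta_map)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.zeros(d), np.zeros(d)
            e_i[i], e_j[j] = eps, eps
            # mixed central difference for d^2 log p / (d theta_i d theta_j)
            H[i, j] = (log_posterior(theta_map + e_i + e_j)
                       - log_posterior(theta_map + e_i - e_j)
                       - log_posterior(theta_map - e_i + e_j)
                       + log_posterior(theta_map - e_i - e_j)) / (4 * eps ** 2)
    Sigma = -np.linalg.inv(H)
    return theta_map, Sigma

# toy check: for a Gaussian log-density the approximation is exact
if __name__ == "__main__":
    Sigma_true = np.array([[2.0, 0.3], [0.3, 1.0]])
    logp = lambda th: -0.5 * th @ np.linalg.solve(Sigma_true, th)
    print(laplace_approximation(logp, np.zeros(2))[1])   # approx. Sigma_true
```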

Learning and Inference in Graphical Models. Chapter 08 – p. 27/28


Summary

◮ direct maximization of likelihood/posterior

◮ latent variables

◮ incomplete data problems

◮ EM/ECM algorithm

◮ Laplace approximation

Learning and Inference in Graphical Models. Chapter 08 – p. 28/28