Maximum Likelihood And Expectation Maximization

Lecture Notes for CMPUT 466/551

Nilanjan Ray

MLE and EM

• Maximum Likelihood Estimation (MLE) and Expectation Maximization are two very important tools in Machine Learning

• Essentially, you use them to estimate probability distributions within a learning algorithm; we have already seen one such example: in logistic regression we used MLE

• We will revisit MLE here and see certain difficulties with it

• Then Expectation Maximization (EM) will rescue us

Probability Density Estimation: Quick Points

Two different routes:

Parametric
• Provide a parametrized class of density functions
• Tools:
– Maximum likelihood estimation
– Expectation Maximization
– Sampling techniques
– …

Non-parametric
• Density is modeled by samples
• Tools:
– Kernel methods
– Sampling techniques
– …

Revisiting Maximum Likelihood

The data comes from a probability distribution of known form

The probability distribution has some parameters that are unknown to you

Example: the data is distributed as a Gaussian, $y_i \sim N(\mu, \sigma^2)$, so the unknown parameters here are $\theta = (\mu, \sigma^2)$

MLE is a tool that estimates the unknown parameters of the probability distribution from data

MLE: Recapitulation

• Assume the observations $y_i$, $i = 1, \ldots, N$, are independent; write $Z = \{y_1, y_2, \ldots, y_N\}$

• Form the likelihood:
$$L(\theta; Z) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right), \qquad \theta = (\mu, \sigma^2)$$

• Form the log-likelihood:
$$l(\theta; Z) = \sum_{i=1}^{N}\log\!\left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)\right] = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \mu)^2 - \frac{N}{2}\log(2\pi\sigma^2)$$

• To find the unknown parameter values, maximize the log-likelihood with respect to the unknown parameters:
$$\frac{\partial l}{\partial \mu} = 0, \quad \frac{\partial l}{\partial \sigma^2} = 0 \;\Longrightarrow\; \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} y_i, \quad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{\mu})^2$$
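As a quick sanity check on these closed-form estimates, here is a minimal Python sketch (assuming NumPy is available; the synthetic data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic data: true mu = 2.0, sigma = 1.5

# Closed-form Gaussian MLE: sample mean and (biased) sample variance
mu_hat = y.mean()
sigma2_hat = ((y - mu_hat) ** 2).mean()

print(mu_hat, np.sqrt(sigma2_hat))  # should be close to 2.0 and 1.5
```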

MLE: A Challenging Example

Observation data: $y_1, y_2, \ldots, y_N$

[Figure: histogram of the observed data]

Indicator variable: $\Delta \in \{0, 1\}$

$\pi$ is the probability with which the observation is chosen from density 2

$(1 - \pi)$ is the probability with which the observation is chosen from density 1

Mixture model:

Source: Department of Statistics, CMU

$$Y_1 \sim N(\mu_1, \sigma_1^2); \qquad Y_2 \sim N(\mu_2, \sigma_2^2)$$
$$Y = (1 - \Delta)\,Y_1 + \Delta\,Y_2, \qquad \Delta \in \{0, 1\}, \quad \Pr(\Delta = 1) = \pi$$
$$\theta_1 = (\mu_1, \sigma_1^2); \qquad \theta_2 = (\mu_2, \sigma_2^2)$$

MLE: A Challenging Example…

Maximum likelihood fitting for the parameters $(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2)$ means maximizing the observed-data log-likelihood
$$l(\theta; Z) = \sum_{i=1}^{N}\log\!\left[(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\right],$$
where $\phi_{\theta}(\cdot)$ denotes the normal density with parameters $\theta = (\mu, \sigma^2)$. Because of the sum inside the logarithm, this is numerically (and of course analytically, too) challenging to solve!
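To make the difficulty concrete, here is a minimal sketch (assuming NumPy and SciPy) of the observed-data negative log-likelihood one would have to minimize directly; the sum inside the logarithm is what rules out the closed-form solution we had for a single Gaussian:

```python
import numpy as np
from scipy.stats import norm

def mixture_neg_log_likelihood(params, y):
    """Observed-data negative log-likelihood of a two-component Gaussian mixture."""
    pi, mu1, mu2, sigma1, sigma2 = params
    # Sum of weighted densities sits inside the log: no closed-form maximizer
    density = (1 - pi) * norm.pdf(y, mu1, sigma1) + pi * norm.pdf(y, mu2, sigma2)
    return -np.sum(np.log(density))
```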

Expectation Maximization: A Rescuer

EM augments the data space: it assumes some latent (missing) data

Source: Department of Statistics, CMU

EM: A Rescuer…

Augmented (complete) data: $T = \{(y_i, \Delta_i)\}_{i=1}^{N}$, where the $\Delta_i$ are latent

Complete-data log-likelihood:
$$l_0(\theta; T) = \sum_{i=1}^{N}\left[(1 - \Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)\right] + \sum_{i=1}^{N}\left[(1 - \Delta_i)\log(1 - \pi) + \Delta_i\log\pi\right]$$

Note that we cannot analytically maximize the observed-data log-likelihood $l(\theta; Z)$; maximizing this complete-data form $l_0(\theta; T)$, however, is tractable

Source: Department of Statistics, CMU

EM: The Complete Data Likelihood

By simple differentiations we have:

$$\frac{\partial l_0}{\partial \mu_1} = 0 \;\Rightarrow\; \hat{\mu}_1 = \frac{\sum_{i=1}^{N}(1 - \Delta_i)\,y_i}{\sum_{i=1}^{N}(1 - \Delta_i)}; \qquad \frac{\partial l_0}{\partial \mu_2} = 0 \;\Rightarrow\; \hat{\mu}_2 = \frac{\sum_{i=1}^{N}\Delta_i\,y_i}{\sum_{i=1}^{N}\Delta_i};$$

$$\frac{\partial l_0}{\partial \sigma_1^2} = 0 \;\Rightarrow\; \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N}(1 - \Delta_i)(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N}(1 - \Delta_i)}; \qquad \frac{\partial l_0}{\partial \sigma_2^2} = 0 \;\Rightarrow\; \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N}\Delta_i(y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N}\Delta_i};$$

$$\frac{\partial l_0}{\partial \pi} = 0 \;\Rightarrow\; \hat{\pi} = \frac{1}{N}\sum_{i=1}^{N}\Delta_i$$

So, maximization of the complete-data likelihood is much easier!

But how do we get the latent variables $\Delta_i$?
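If the indicators $\Delta_i$ were observed, these estimates would be simple weighted averages. A minimal NumPy sketch of this computation (an illustrative helper of my own, not code from the notes); the same function serves as the M-step later, once soft responsibilities in $[0, 1]$ replace the hard indicators:

```python
import numpy as np

def m_step(y, delta):
    """Closed-form complete-data MLE; delta in {0,1} (or responsibilities in [0,1])."""
    w2, w1 = delta, 1.0 - delta            # weights for component 2 and component 1
    mu1 = np.sum(w1 * y) / np.sum(w1)
    mu2 = np.sum(w2 * y) / np.sum(w2)
    var1 = np.sum(w1 * (y - mu1) ** 2) / np.sum(w1)
    var2 = np.sum(w2 * (y - mu2) ** 2) / np.sum(w2)
    pi = np.mean(w2)                       # fraction of data attributed to component 2
    return mu1, var1, mu2, var2, pi
```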

Obtaining Latent Variables

The latent variables are computed as expected values given the data and parameters:

$$\gamma_i(\theta) = E(\Delta_i \mid y_i, \theta) = \Pr(\Delta_i = 1 \mid y_i, \theta)$$

Apply Bayes’ rule:

$$\gamma_i(\theta) = \frac{\Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta)}{\Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta) + \Pr(y_i \mid \Delta_i = 0, \theta)\,\Pr(\Delta_i = 0 \mid \theta)} = \frac{\pi\,\phi_{\theta_2}(y_i)}{\pi\,\phi_{\theta_2}(y_i) + (1 - \pi)\,\phi_{\theta_1}(y_i)}$$
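A minimal sketch of this E-step computation (assuming NumPy and SciPy; the function name is my own, and the same calculation reappears inside the full EM loop below):

```python
import numpy as np
from scipy.stats import norm

def responsibilities(y, pi, mu1, sigma1, mu2, sigma2):
    """gamma_i = Pr(Delta_i = 1 | y_i, theta), via Bayes' rule."""
    p2 = pi * norm.pdf(y, mu2, sigma2)        # weighted density of component 2
    p1 = (1 - pi) * norm.pdf(y, mu1, sigma1)  # weighted density of component 1
    return p2 / (p1 + p2)
```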

EM for Two-component Gaussian Mixture

• Initialize $\hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2, \hat{\pi}$

• Iterate until convergence:

– Expectation of latent variables:
$$\hat{\gamma}_i = \frac{\hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}{\hat{\pi}\,\phi_{\hat{\theta}_2}(y_i) + (1 - \hat{\pi})\,\phi_{\hat{\theta}_1}(y_i)}, \qquad \text{where } \phi_{\theta}(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$

– Maximization for finding parameters:
$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N}(1 - \hat{\gamma}_i)\,y_i}{\sum_{i=1}^{N}(1 - \hat{\gamma}_i)}; \qquad \hat{\mu}_2 = \frac{\sum_{i=1}^{N}\hat{\gamma}_i\,y_i}{\sum_{i=1}^{N}\hat{\gamma}_i};$$
$$\hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N}(1 - \hat{\gamma}_i)(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N}(1 - \hat{\gamma}_i)}; \qquad \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N}\hat{\gamma}_i(y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N}\hat{\gamma}_i}; \qquad \hat{\pi} = \frac{1}{N}\sum_{i=1}^{N}\hat{\gamma}_i$$
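Putting the E-step and M-step together, here is a minimal self-contained Python sketch of this algorithm (assuming NumPy and SciPy; the initialization and stopping rule are illustrative choices, not prescribed by the notes):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, n_iter=200, tol=1e-8, seed=0):
    """EM for a two-component univariate Gaussian mixture."""
    rng = np.random.default_rng(seed)
    # Illustrative initialization: two random data points as means,
    # overall variance for both components, pi = 0.5
    mu1, mu2 = rng.choice(y, size=2, replace=False)
    var1 = var2 = y.var()
    pi = 0.5
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma_i = Pr(Delta_i = 1 | y_i, theta)
        p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(var1))
        p2 = pi * norm.pdf(y, mu2, np.sqrt(var2))
        gamma = p2 / (p1 + p2)
        # M-step: weighted means, variances, and mixing probability
        w1, w2 = 1 - gamma, gamma
        mu1 = np.sum(w1 * y) / np.sum(w1)
        mu2 = np.sum(w2 * y) / np.sum(w2)
        var1 = np.sum(w1 * (y - mu1) ** 2) / np.sum(w1)
        var2 = np.sum(w2 * (y - mu2) ** 2) / np.sum(w2)
        pi = gamma.mean()
        # Monitor the observed-data log-likelihood; EM never decreases it
        ll = np.sum(np.log(p1 + p2))
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu1, var1, mu2, var2

# Example usage on synthetic data drawn from two Gaussians
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 0.5, 200)])
print(em_two_gaussians(y))
```

On well-separated synthetic data like this, the sketch typically recovers the component parameters; with an unlucky initialization it can settle at a worse local maximum, one of the issues raised under "EM: Important Issues" below.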

EM for Mixture of K Gaussians

• Initialize mean vectors, covariance matrices, and mixing probabilities: $\mu_k, \Sigma_k, \pi_k$, for $k = 1, 2, \ldots, K$

• Expectation Step: compute responsibilities
$$\gamma_{ik} = \frac{\pi_k\,\Phi(y_i; \mu_k, \Sigma_k)}{\sum_{n=1}^{K}\pi_n\,\Phi(y_i; \mu_n, \Sigma_n)}, \qquad i = 1, \ldots, N, \quad k = 1, \ldots, K$$

• Maximization Step: update parameters
$$\mu_k = \frac{\sum_{i=1}^{N}\gamma_{ik}\,y_i}{\sum_{i=1}^{N}\gamma_{ik}}; \qquad \Sigma_k = \frac{\sum_{i=1}^{N}\gamma_{ik}\,(y_i - \mu_k)(y_i - \mu_k)^T}{\sum_{i=1}^{N}\gamma_{ik}}; \qquad \pi_k = \frac{1}{N}\sum_{i=1}^{N}\gamma_{ik}$$

• Iterate the Expectation and Maximization Steps until convergence
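The same pattern extends to K multivariate components. A compact NumPy/SciPy sketch (again illustrative: the initialization and the small ridge added to each covariance for numerical stability are my own choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_k_gaussians(Y, K, n_iter=100, seed=0):
    """EM for a K-component multivariate Gaussian mixture; Y has shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    # Illustrative initialization: K random data points as means,
    # shared empirical covariance, uniform mixing probabilities
    mu = Y[rng.choice(N, size=K, replace=False)]             # (K, d)
    Sigma = np.stack([np.cov(Y.T) + 1e-6 * np.eye(d)] * K)   # (K, d, d)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities gamma_{ik}
        dens = np.stack([pi[k] * multivariate_normal.pdf(Y, mu[k], Sigma[k])
                         for k in range(K)], axis=1)         # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of mu_k, Sigma_k, pi_k
        Nk = gamma.sum(axis=0)                               # effective counts
        mu = (gamma.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - mu[k]
            # Weighted scatter matrix, plus a small ridge for stability
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / N
    return pi, mu, Sigma
```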

EM Algorithm in General

$T = (Z, Z^m)$ is the complete data; we only observe $Z$, while $Z^m$ is missing

$$\Pr(Z \mid \theta) = \frac{\Pr(Z, Z^m \mid \theta)}{\Pr(Z^m \mid Z, \theta)} = \frac{\Pr(T \mid \theta)}{\Pr(Z^m \mid Z, \theta)}$$

Taking logarithms:
$$l(\theta; Z) = l_0(\theta; T) - l_1(\theta; Z^m \mid Z)$$

Because we have access to the previous parameter values $\theta$, we can do better: take expectations with respect to $\Pr(Z^m \mid Z, \theta)$ (note that $l(\theta'; Z)$ does not depend on $Z^m$):
$$l(\theta'; Z) = E_{T \mid Z, \theta}\!\left[l_0(\theta'; T)\right] - E_{Z^m \mid Z, \theta}\!\left[l_1(\theta'; Z^m \mid Z)\right] \equiv Q(\theta', \theta) - R(\theta', \theta)$$

Let us now consider the expression:
$$l(\theta'; Z) - l(\theta; Z) = \left[Q(\theta', \theta) - Q(\theta, \theta)\right] - \left[R(\theta', \theta) - R(\theta, \theta)\right]$$

It can be shown, by Jensen's inequality, that
$$R(\theta', \theta) \le R(\theta, \theta)$$

Thus if $\theta'$ maximizes $Q(\theta', \theta)$, then $l(\theta'; Z) \ge l(\theta; Z)$
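A sketch of that Jensen step (the standard argument, in the notation above): since $R(\theta', \theta) = E_{Z^m \mid Z, \theta}\!\left[\log \Pr(Z^m \mid Z, \theta')\right]$ and $\log$ is concave,

$$R(\theta', \theta) - R(\theta, \theta) = E_{Z^m \mid Z, \theta}\!\left[\log \frac{\Pr(Z^m \mid Z, \theta')}{\Pr(Z^m \mid Z, \theta)}\right] \le \log E_{Z^m \mid Z, \theta}\!\left[\frac{\Pr(Z^m \mid Z, \theta')}{\Pr(Z^m \mid Z, \theta)}\right] = \log 1 = 0.$$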

EM Algorithm in General…

• Start with initial parameter values $\theta^{(0)}$; set $t = 1$

• Expectation step: compute
$$Q(\theta', \theta^{(t-1)}) = E_{T \mid Z, \theta^{(t-1)}}\!\left[l_0(\theta'; T)\right]$$

• Maximization step:
$$\theta^{(t)} = \arg\max_{\theta'} Q(\theta', \theta^{(t-1)})$$

• Set $t = t + 1$ and iterate

EM Algorithm: Summary

• Augment the original data space by latent/hidden/missing data

• Frame a suitable probability model for the augmented data space

• In EM iterations, first assume initial values for the parameters

• Iterate the Expectation and the Maximization steps:

• In the Expectation step, find the expected values of the latent variables (here you need to use the current parameter values)

• In the Maximization step, first plug the expected values of the latent variables into the log-likelihood of the augmented data; then maximize this log-likelihood to re-estimate the parameters

• Iterate the last two steps until convergence

Applications of EM

– Mixture models
– HMMs
– PCA
– Latent variable models
– Missing data problems
– Many computer vision problems
– …

References

• The EM Algorithm and Extensions by Geoffrey J. McLachlan and Thriyambakam Krishnan

• For a non-parametric density estimate by EM, see: http://bioinformatics.uchc.edu/LectureNotes_2006/Tools_EM_SA_2006_files/frame.htm

EM: Important Issues

• Is the convergence of the algorithm guaranteed?

• Does the outcome of EM depend on the initial choice of the parameter values?

• What about the speed of convergence?

• How easy or difficult could it be to compute the expected values of the latent variables?