Maximum Likelihood And Expectation Maximization Lecture Notes for CMPUT 466/551 Nilanjan Ray.
MLE and EM
• Maximum Likelihood Estimation (MLE) and Expectation Maximization are two very important tools in Machine Learning
• Essentially, you use them to estimate probability distributions within a learning algorithm; we have already seen one such example: in logistic regression we used MLE
• We will revisit MLE here, realize certain difficulties of MLE
• Then Expectation Maximization (EM) will rescue us
Probability Density Estimation: Quick Points
Two different routes:
Parametric
• Provide a parametrized class of density functions
• Tools:
– Maximum likelihood estimation
– Expectation Maximization
– Sampling techniques
– …
Non-parametric
• Density is modeled by samples
• Tools:
– Kernel methods
– Sampling techniques
– …
Revisiting Maximum Likelihood
The data is coming from a known probability distribution
The probability distribution has some parameters that are unknown to you
Example: the data is distributed as a Gaussian, $y_i \sim N(\mu, \sigma^2)$, so the unknown parameters here are $\theta = (\mu, \sigma^2)$
MLE is a tool that estimates the unknown parameters of the probability distribution from data
MLE: Recapitulation
• Assume observation data yi are independent
• Form the likelihood (with $y = \{y_1, y_2, \ldots, y_N\}$):

$$L(\mu; y) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)$$

• Form the log-likelihood:

$$l(\mu; y) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \mu)^2 - \frac{N}{2}\log(2\pi\sigma^2)$$

• To find the unknown parameter values, maximize the log-likelihood with respect to the unknown parameters:

$$\frac{\partial l}{\partial \mu} = 0, \qquad \frac{\partial l}{\partial \sigma^2} = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{\mu})^2$$
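The closed-form Gaussian MLE above is easy to verify numerically. Here is a minimal sketch (in Python with NumPy, a choice of mine rather than anything prescribed by the notes) that computes the two estimators on simulated data:

```python
import numpy as np

def gaussian_mle(y):
    """Closed-form MLE for a univariate Gaussian N(mu, sigma^2)."""
    mu_hat = y.mean()                       # mu_hat = (1/N) sum_i y_i
    var_hat = ((y - mu_hat) ** 2).mean()    # sigma^2_hat = (1/N) sum_i (y_i - mu_hat)^2
    return mu_hat, var_hat

# Simulated data with known parameters, so the estimates can be checked.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=10_000)
mu_hat, var_hat = gaussian_mle(y)
print(mu_hat, var_hat)  # close to mu = 2.0 and sigma^2 = 2.25
```

Note the MLE variance uses the 1/N normalizer (not the unbiased 1/(N-1)), which is exactly what maximizing the log-likelihood yields.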
MLE: A Challenging Example
Observation data:

[Figure: histogram of the observed bimodal data]

Mixture model:

$$Y_1 \sim N(\mu_1, \sigma_1^2); \quad Y_2 \sim N(\mu_2, \sigma_2^2)$$
$$Y = (1 - \Delta)\,Y_1 + \Delta\,Y_2, \quad \Delta \in \{0, 1\}$$

$\Delta$ is an indicator variable: $\pi = \Pr(\Delta = 1)$ is the probability with which the observation is chosen from density 2, and $(1 - \pi)$ is the probability with which the observation is chosen from density 1.

Parameters: $\theta = (\theta_1, \theta_2)$, where $\theta_1 = (\mu_1, \sigma_1^2)$ and $\theta_2 = (\mu_2, \sigma_2^2)$.

Source: Department of Statistics, CMU
MLE: A Challenging Example…
Maximum likelihood fitting for the parameters $(\pi, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$:

$$l(\theta; y) = \sum_{i=1}^{N} \log\!\left[(1-\pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\right]$$

The sum inside the logarithm couples all five parameters, making this numerically (and of course analytically, too) challenging to solve!
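To make the difficulty concrete, here is a minimal sketch (Python/NumPy, my choice of language; the function names are mine) of the mixture log-likelihood. It can be evaluated at any parameter setting, but there is no closed-form maximizer:

```python
import numpy as np

def norm_pdf(y, mu, var):
    # Gaussian density phi_{mu, sigma^2}(y)
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mixture_log_likelihood(y, pi, mu1, var1, mu2, var2):
    """l(theta; y) = sum_i log[(1 - pi) phi_1(y_i) + pi phi_2(y_i)].
    The sum sits inside the log, so the score equations have no
    closed-form solution in the five parameters."""
    density = (1 - pi) * norm_pdf(y, mu1, var1) + pi * norm_pdf(y, mu2, var2)
    return np.sum(np.log(density))

# Evaluating at parameters near vs. far from the data-generating values:
y = np.array([0.2, -0.5, 4.8, 5.3, 0.1])
ll_good = mixture_log_likelihood(y, pi=0.4, mu1=0.0, var1=1.0, mu2=5.0, var2=1.0)
ll_bad = mixture_log_likelihood(y, pi=0.4, mu1=10.0, var1=1.0, mu2=-10.0, var2=1.0)
```

A well-matched parameter setting scores a higher log-likelihood than a mismatched one, but finding the maximizer requires an iterative method, which is where EM comes in.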
Expectation Maximization: A Rescuer
EM augments the data space: it assumes some latent (unobserved) data
Source: Department of Statistics, CMU
EM: A Rescuer…
Augment the observed data with the latent variables $\Delta_i$ to form the complete data $T = \{y_i, \Delta_i\}_{i=1}^{n}$.
We cannot analytically maximize the original log-likelihood; however, maximizing the complete-data log-likelihood $l_0(\theta; T)$ is tractable.
Source: Department of Statistics, CMU
EM: The Complete Data Likelihood

$$l_0(\theta; T) = \sum_{i=1}^{N}\left[(1-\Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)\right] + \sum_{i=1}^{N}\left[(1-\Delta_i)\log(1-\pi) + \Delta_i\log\pi\right]$$

By simple differentiations (setting each partial derivative of $l_0$ to zero) we have:

$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N}(1-\Delta_i)\,y_i}{\sum_{i=1}^{N}(1-\Delta_i)}; \qquad \hat{\mu}_2 = \frac{\sum_{i=1}^{N}\Delta_i\,y_i}{\sum_{i=1}^{N}\Delta_i};$$

$$\hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N}(1-\Delta_i)(y_i-\hat{\mu}_1)^2}{\sum_{i=1}^{N}(1-\Delta_i)}; \qquad \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N}\Delta_i(y_i-\hat{\mu}_2)^2}{\sum_{i=1}^{N}\Delta_i}; \qquad \hat{\pi} = \frac{1}{N}\sum_{i=1}^{N}\Delta_i$$

So, maximization of the complete data likelihood is much easier! But how do we get the latent variables?
Obtaining Latent Variables
The latent variables are computed as expected values given the data and the parameters:

$$\gamma_i(\theta) = E(\Delta_i \mid y_i, \theta) = \Pr(\Delta_i = 1 \mid y_i, \theta)$$

Apply Bayes' rule:

$$\gamma_i(\theta) = \frac{\Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta)}{\Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta) + \Pr(y_i \mid \Delta_i = 0, \theta)\,\Pr(\Delta_i = 0 \mid \theta)} = \frac{\pi\,\phi_{\theta_2}(y_i)}{(1-\pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}$$
EM for Two-component Gaussian Mixture
• Initialize $\hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2, \hat{\pi}$
• Iterate until convergence:
– Expectation of the latent variables:

$$\gamma_i(\hat{\theta}) = \frac{\hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}{(1-\hat{\pi})\,\phi_{\hat{\theta}_1}(y_i) + \hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}, \qquad \phi_{\hat{\theta}_k}(y) = \frac{1}{\sqrt{2\pi\hat{\sigma}_k^2}}\exp\!\left(-\frac{(y-\hat{\mu}_k)^2}{2\hat{\sigma}_k^2}\right)$$

– Maximization for finding the parameters:

$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N}(1-\gamma_i)\,y_i}{\sum_{i=1}^{N}(1-\gamma_i)}; \qquad \hat{\mu}_2 = \frac{\sum_{i=1}^{N}\gamma_i\,y_i}{\sum_{i=1}^{N}\gamma_i};$$

$$\hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N}(1-\gamma_i)(y_i-\hat{\mu}_1)^2}{\sum_{i=1}^{N}(1-\gamma_i)}; \qquad \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N}\gamma_i(y_i-\hat{\mu}_2)^2}{\sum_{i=1}^{N}\gamma_i}; \qquad \hat{\pi} = \frac{1}{N}\sum_{i=1}^{N}\gamma_i$$
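The two-component algorithm above translates almost line-for-line into code. The following is a sketch (Python/NumPy, my choice; the crude min/max initialization is one common heuristic of mine, not part of the algorithm itself):

```python
import numpy as np

def norm_pdf(y, mu, var):
    # Gaussian density phi_{mu, sigma^2}(y)
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(y, n_iter=200):
    """EM for the two-component 1-D Gaussian mixture."""
    mu1, mu2 = y.min(), y.max()   # crude split initialization (assumption)
    var1 = var2 = y.var()
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibilities gamma_i = Pr(Delta_i = 1 | y_i, theta)
        p1 = (1 - pi) * norm_pdf(y, mu1, var1)
        p2 = pi * norm_pdf(y, mu2, var2)
        gamma = p2 / (p1 + p2)
        # M-step: responsibility-weighted versions of the complete-data MLEs
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
        var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
        pi = gamma.mean()
    return pi, mu1, var1, mu2, var2

# Recover the parameters of a simulated 0.7/0.3 mixture of N(0,1) and N(5,1):
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 700), rng.normal(5.0, 1.0, 300)])
pi, mu1, var1, mu2, var2 = em_two_gaussians(y)
```

In practice the loop would stop when the observed-data log-likelihood (or the parameters) stops changing, rather than after a fixed iteration count.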
EM for Mixture of K Gaussians
• Initialize the mean vectors, covariance matrices, and mixing probabilities: $\mu_k, \Sigma_k, \pi_k$, $k = 1, 2, \ldots, K$
• Expectation step: compute the responsibilities

$$\gamma_{ik} = \frac{\pi_k\,\Phi(y_i; \mu_k, \Sigma_k)}{\sum_{n=1}^{K}\pi_n\,\Phi(y_i; \mu_n, \Sigma_n)}, \qquad i = 1, \ldots, N, \quad k = 1, \ldots, K$$

• Maximization step: update the parameters

$$\mu_k = \frac{\sum_{i=1}^{N}\gamma_{ik}\,y_i}{\sum_{i=1}^{N}\gamma_{ik}}; \qquad \Sigma_k = \frac{\sum_{i=1}^{N}\gamma_{ik}\,(y_i-\mu_k)(y_i-\mu_k)^T}{\sum_{i=1}^{N}\gamma_{ik}}; \qquad \pi_k = \frac{1}{N}\sum_{i=1}^{N}\gamma_{ik}$$

• Iterate the Expectation and Maximization steps until convergence
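The K-component updates can be vectorized over all components at once. A sketch for the 1-D case (Python/NumPy, my choice; scalar variances stand in for the covariance matrices $\Sigma_k$, and the quantile-based initialization is an assumption of mine):

```python
import numpy as np

def em_k_gaussians(y, K, n_iter=200):
    """EM for a K-component 1-D Gaussian mixture, vectorized over K."""
    mu = np.quantile(y, (np.arange(K) + 0.5) / K)  # spread initial means over the data
    var = np.full(K, y.var())
    pi = np.full(K, 1.0 / K)
    N = len(y)
    for _ in range(n_iter):
        # E-step: gamma[i, k] = pi_k phi_k(y_i) / sum_n pi_n phi_n(y_i)
        d = y[:, None] - mu[None, :]                       # shape (N, K)
        dens = np.exp(-d ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        num = pi * dens
        gamma = num / num.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means, variances, mixing weights
        Nk = gamma.sum(axis=0)
        mu = (gamma * y[:, None]).sum(axis=0) / Nk
        var = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / N
    return pi, mu, var

# Two well-separated clusters with a 0.6/0.4 split:
rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(0.0, 1.0, 600), rng.normal(8.0, 1.0, 400)])
pi, mu, var = em_k_gaussians(y, K=2)
```

Like any EM run, this converges only to a local maximum, so in practice one restarts from several initializations and keeps the run with the highest log-likelihood.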
EM Algorithm in General
$T = (Z, Z^m)$ is the complete data; we only know $Z$, and $Z^m$ is missing.

$$\Pr(Z^m \mid Z, \theta) = \frac{\Pr(Z^m, Z \mid \theta)}{\Pr(Z \mid \theta)} \quad\Longrightarrow\quad \Pr(Z \mid \theta) = \frac{\Pr(T \mid \theta)}{\Pr(Z^m \mid Z, \theta)}$$

Taking logarithms: $l(\theta; Z) = l_0(\theta; T) - l_1(\theta; Z^m \mid Z)$

Because we have access to the previous parameter values $\theta'$, we can do better: take the expectation with respect to $\Pr(Z^m \mid Z, \theta')$:

$$l(\theta; Z) = E_{Z^m \mid Z, \theta'}\!\left[l_0(\theta; T)\right] - E_{Z^m \mid Z, \theta'}\!\left[l_1(\theta; Z^m \mid Z)\right] = Q(\theta, \theta') - R(\theta, \theta')$$

Let us now consider the expression:

$$l(\theta; Z) - l(\theta'; Z) = \left[Q(\theta, \theta') - Q(\theta', \theta')\right] - \left[R(\theta, \theta') - R(\theta', \theta')\right]$$

It can be shown (via Jensen's inequality) that $R(\theta, \theta') \le R(\theta', \theta')$. Thus if $\theta$ maximizes $Q(\cdot\,, \theta')$, then $l(\theta; Z) \ge l(\theta'; Z)$.
• Start with initial parameter values $\hat{\theta}^{(0)}$; $t = 1$
• Expectation step: compute

$$Q(\theta, \hat{\theta}^{(t-1)}) = E_{Z^m \mid Z, \hat{\theta}^{(t-1)}}\!\left[l_0(\theta; T)\right]$$

• Maximization step:

$$\hat{\theta}^{(t)} = \arg\max_{\theta}\, Q(\theta, \hat{\theta}^{(t-1)})$$

• $t = t + 1$ and iterate
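For completeness, the Jensen's-inequality step the notes invoke can be sketched in one chain, using the definition of $R$ as an expectation of a log-probability:

```latex
\begin{aligned}
R(\theta, \theta') - R(\theta', \theta')
&= E_{Z^m \mid Z, \theta'}\!\left[\log \frac{\Pr(Z^m \mid Z, \theta)}{\Pr(Z^m \mid Z, \theta')}\right] \\
&\le \log E_{Z^m \mid Z, \theta'}\!\left[\frac{\Pr(Z^m \mid Z, \theta)}{\Pr(Z^m \mid Z, \theta')}\right]
  \qquad \text{(Jensen: $\log$ is concave)} \\
&= \log \int \Pr(Z^m \mid Z, \theta)\, dZ^m \;=\; \log 1 \;=\; 0.
\end{aligned}
```

So $R(\theta, \theta') \le R(\theta', \theta')$ for every $\theta$, which is exactly what makes each M-step a guaranteed improvement on the observed-data log-likelihood.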
EM Algorithm: Summary
• Augment the original data space by latent/hidden/missing data
• Frame a suitable probability model for the augmented data space
• In EM iterations, first assume initial values for the parameters
• Iterate the Expectation and the Maximization steps
• In the Expectation step, find the expected values of the latent variables (here you need to use the current parameter values)
• In the Maximization step, first plug the expected values of the latent variables into the log-likelihood of the augmented data; then maximize this log-likelihood to re-evaluate the parameters
• Iterate last two steps until convergence
Applications of EM
– Mixture models
– HMMs
– PCA
– Latent variable models
– Missing data problems
– Many computer vision problems
– …
References
• The EM Algorithm and Extensions by Geoffrey J. McLachlan and Thriyambakam Krishnan
• For a non-parametric density estimate by EM look at: http://bioinformatics.uchc.edu/LectureNotes_2006/Tools_EM_SA_2006_files/frame.htm