07 Machine Learning - Expectation Maximization


Transcript of 07 Machine Learning - Expectation Maximization

Page 1: 07 Machine Learning - Expectation Maximization

Machine Learning for Data Mining: Expectation Maximization

Andres Mendez-Vazquez

July 12, 2015


Page 2: 07 Machine Learning - Expectation Maximization

Outline

1 Introduction
   Maximum-Likelihood

2 Incomplete Data
   Analogy

3 Derivation of the EM-Algorithm
   Hidden Features
   Using Convex functions
   The Final Algorithm
   Notes and Convergence of EM

4 Finding Maximum Likelihood Mixture Densities
   Bayes' Rule for the components
   Maximizing Q
   Gaussian Distributions
   The EM Algorithm

Page 4: 07 Machine Learning - Expectation Maximization

Maximum-Likelihood

We have a density function p(x|Θ). Assume that we have a data set of size N, X = {x₁, x₂, ..., x_N}.

This data is known as the evidence.

In addition, we assume that the vectors are independent and identically distributed (i.i.d.) with distribution p under the parameter set Θ.

Page 6: 07 Machine Learning - Expectation Maximization

What Can We Do With The Evidence?

We may use Bayes' Rule to estimate the parameters Θ:

p(Θ|X) = P(X|Θ) P(Θ) / P(X)   (1)

Or, given a new observation x̃, we may compute

p(x̃|X)   (2)

i.e. the probability of the new observation being supported by the evidence X.

Thus, the former represents parameter estimation and the latter data prediction.

Page 9: 07 Machine Learning - Expectation Maximization

Focusing First on the Estimation of the Parameters Θ

We can interpret Bayes' Rule

p(Θ|X) = P(X|Θ) P(Θ) / P(X)   (3)

as

posterior = (likelihood × prior) / evidence   (4)

Thus, we want the likelihood

likelihood = P(X|Θ)

Page 12: 07 Machine Learning - Expectation Maximization

What we want...

We want to maximize the likelihood as a function of Θ.

Page 13: 07 Machine Learning - Expectation Maximization

Maximum-Likelihood

We have

p(x₁, x₂, ..., x_N |Θ) = ∏_{i=1}^{N} p(x_i|Θ)   (5)

also known as the likelihood function.

Because multiplying many numbers smaller than one is numerically problematic, we work with the log-likelihood

L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = ∑_{i=1}^{N} log p(x_i|Θ)   (6)
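To make the sum-of-logs form in equation (6) concrete, here is a minimal sketch in Python, assuming (purely as an illustration, not something fixed by the slides) that p(x|Θ) is a univariate Gaussian with Θ = (μ, σ):

```python
import numpy as np

def log_likelihood(x, mu, sigma):
    """Log-likelihood of i.i.d. data under a univariate Gaussian p(x | mu, sigma),
    i.e. the sum-of-logs form of equation (6)."""
    # log N(x_i; mu, sigma^2) for each sample, then sum over the data set
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    return np.sum(log_p)

# Example: the log-likelihood is largest near the parameters that generated the data
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)
print(log_likelihood(x, mu=2.0, sigma=1.5))   # higher
print(log_likelihood(x, mu=0.0, sigma=1.5))   # lower
```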

Page 15: 07 Machine Learning - Expectation Maximization

Maximum-Likelihood

We want to find a Θ*:

Θ* = argmax_Θ L(Θ|X)   (7)

The classic method: solve

∂L(Θ|X) / ∂θᵢ = 0   ∀ θᵢ ∈ Θ   (8)
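As a quick worked instance of equation (8) (an illustrative example, not part of the original slides), take a univariate Gaussian p(x|μ, σ) and set the derivative of the log-likelihood with respect to μ to zero:

```latex
% MLE of a Gaussian mean via equation (8)
L(\mu,\sigma\mid\mathcal{X}) = -\frac{N}{2}\log\!\left(2\pi\sigma^{2}\right)
    - \sum_{i=1}^{N}\frac{(x_i-\mu)^{2}}{2\sigma^{2}}
\qquad
\frac{\partial L}{\partial \mu} = \sum_{i=1}^{N}\frac{x_i-\mu}{\sigma^{2}} = 0
\;\Longrightarrow\;
\mu^{*} = \frac{1}{N}\sum_{i=1}^{N} x_i
```

The point is that for fully observed data the stationarity condition (8) often has a closed form; the difficulty addressed in the rest of the lecture is what to do when part of the data is hidden.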

Page 17: 07 Machine Learning - Expectation Maximization

What happens if we have incomplete data?

The data can be split into two parts:
1 X = observed (incomplete) data
2 Y = unobserved data

For this type of problem we have the famous Expectation Maximization (EM) algorithm.

Page 20: 07 Machine Learning - Expectation Maximization

The Expectation Maximization

The EM algorithm was first developed by Dempster et al. (1977).

Its popularity comes from the fact that it can estimate an underlying distribution when the data is incomplete or has missing values.

Two main applications:
1 When missing values exist.
2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden.

Page 24: 07 Machine Learning - Expectation Maximization

Incomplete Data

We assume the data has two parts:
1 X = observed (incomplete) data
2 Y = unobserved data

Thus

Z = (X, Y) = complete data   (9)

and we have the following probability

p(z|Θ) = p(x, y|Θ) = p(y|x, Θ) p(x|Θ)   (10)

Page 29: 07 Machine Learning - Expectation Maximization

New Likelihood Function

The new likelihood function is

L(Θ|Z) = L(Θ|X, Y) = p(X, Y|Θ)   (11)

Note: this is the complete-data likelihood.

Thus, we have

L(Θ|X, Y) = p(X, Y|Θ) = p(Y|X, Θ) p(X|Θ)   (12)

Did you notice?
p(X|Θ) is the likelihood of the observed data.
p(Y|X, Θ) is the likelihood of the unobserved data given the observed data!!!

Page 32: 07 Machine Learning - Expectation Maximization

Rewriting

This can be rewritten as

L(Θ|X, Y) = h_{X,Θ}(Y)   (13)

which basically signifies that X and Θ are constant and the only random part is Y.

In addition,

L(Θ|X)   (14)

is known as the incomplete-data likelihood function.

Page 34: 07 Machine Learning - Expectation Maximization

Thus

We can connect the incomplete-data and complete-data likelihoods as follows:

L(Θ|X) = p(X|Θ)

       = ∑_Y p(X, Y|Θ)

       = ∑_Y p(Y|X, Θ) p(X|Θ)

       = ∑_Y ( ∏_{i=1}^{N} p(x_i|Θ) ) p(Y|X, Θ)

Page 38: 07 Machine Learning - Expectation Maximization

Remarks

Problem: normally, it is almost impossible to obtain a closed analytical solution for the previous equation.

However, we can use the expected value of log p(X, Y|Θ), which allows us to find an iterative procedure to approximate the solution.

Page 40: 07 Machine Learning - Expectation Maximization

The function we would like to have...

The Q function: we want an estimate of the complete-data log-likelihood

log p(X, Y|Θ)   (15)

based on the information provided by X and Θ_{n−1}, where Θ_{n−1} is the set of parameters estimated at the previous step.

Think about the following: if we want to remove Y,

∫ [log p(X, Y|Θ)] p(Y|X, Θ_{n−1}) dY   (16)

Remark: we integrate out Y. This is exactly the expected value of log p(X, Y|Θ).

Page 42: 07 Machine Learning - Expectation Maximization

Use the Expected Value

Then we want

Q(Θ, Θ_{n−1}) = E[ log p(X, Y|Θ) | X, Θ_{n−1} ]   (17)

Take into account that:
1 X and Θ_{n−1} are taken as constants.
2 Θ is an ordinary variable that we wish to adjust.
3 Y is a random variable governed by the distribution p(Y|X, Θ_{n−1}), the marginal distribution of the missing data.

Page 46: 07 Machine Learning - Expectation Maximization

Another Interpretation

Given the previous information,

E[ log p(X, Y|Θ) | X, Θ_{n−1} ] = ∫_{Y ∈ 𝒴} log p(X, Y|Θ) p(Y|X, Θ_{n−1}) dY

Something notable:
1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameters Θ_{n−1}.
2 In the worst of cases, this density might be very hard to obtain.

Actually, we use

p(Y, X|Θ_{n−1}) = p(Y|X, Θ_{n−1}) p(X|Θ_{n−1})   (18)

which does not depend on Θ.

Page 51: 07 Machine Learning - Expectation Maximization

Back to the Q function

The intuition: we have the following analogy.

Consider a function h(θ, Y) with
  θ a constant
  Y ∼ p_Y(y), a random variable with distribution p_Y(y).

Thus, if Y is a discrete random variable,

q(θ) = E_Y[ h(θ, Y) ] = ∑_y h(θ, y) p_Y(y)   (19)
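A tiny numerical illustration of equation (19) (the choice of h, the support of Y and its distribution below are made up for the example):

```python
# q(theta) = E_Y[h(theta, Y)] for a discrete Y, as in equation (19)
def q(theta, h, y_values, p_y):
    """Expectation of h(theta, Y) over a discrete random variable Y."""
    return sum(h(theta, y) * p for y, p in zip(y_values, p_y))

# Example: h(theta, y) = (theta - y)^2, Y uniform on {0, 1, 2}
h = lambda theta, y: (theta - y) ** 2
y_values = [0, 1, 2]
p_y = [1/3, 1/3, 1/3]

print(q(0.5, h, y_values, p_y))  # (0.25 + 0.25 + 2.25) / 3 = 0.9166...
```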

Page 56: 07 Machine Learning - Expectation Maximization

Why "E-step"!!!

From here the name: this is basically the E-step.

The second step tries to maximize the Q function:

Θ_n = argmax_Θ Q(Θ, Θ_{n−1})   (20)

Page 59: 07 Machine Learning - Expectation Maximization

Derivation of the EM-Algorithm

The likelihood function we are going to use: let X be a random vector which results from a parametrized family,

L(Θ) = ln P(X|Θ)   (21)

Note: ln(x) is a strictly increasing function.

We wish to compute Θ, based on an estimate Θ_n (after the nth iteration), such that L(Θ) > L(Θ_n).

Equivalently, we maximize the difference

L(Θ) − L(Θ_n) = ln P(X|Θ) − ln P(X|Θ_n)   (22)

Page 63: 07 Machine Learning - Expectation Maximization

Introducing the Hidden Features

Given that the hidden random vector Y exists with values y,

P(X|Θ) = ∑_y P(X|y, Θ) P(y|Θ)   (23)

Thus

L(Θ) − L(Θ_n) = ln( ∑_y P(X|y, Θ) P(y|Θ) ) − ln P(X|Θ_n)   (24)

Page 65: 07 Machine Learning - Expectation Maximization

Here, we introduce some concepts of convexity

Theorem (Jensen's inequality)
Let f be a convex function defined on an interval I. If x₁, x₂, ..., x_n ∈ I and λ₁, λ₂, ..., λ_n ≥ 0 with ∑_{i=1}^{n} λᵢ = 1, then

f( ∑_{i=1}^{n} λᵢ xᵢ ) ≤ ∑_{i=1}^{n} λᵢ f(xᵢ)   (25)

Page 66: 07 Machine Learning - Expectation Maximization

Proof:

For n = 1 we have the trivial case.

For n = 2 it is the definition of convexity.

Now the inductive hypothesis: we assume that the theorem is true for some n.

Page 69: 07 Machine Learning - Expectation Maximization

Now, we have

Consider the following linear combination of the λᵢ:

f( ∑_{i=1}^{n+1} λᵢ xᵢ ) = f( λ_{n+1} x_{n+1} + ∑_{i=1}^{n} λᵢ xᵢ )

                        = f( λ_{n+1} x_{n+1} + (1 − λ_{n+1}) · (1 / (1 − λ_{n+1})) ∑_{i=1}^{n} λᵢ xᵢ )

                        ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f( (1 / (1 − λ_{n+1})) ∑_{i=1}^{n} λᵢ xᵢ )

Page 72: 07 Machine Learning - Expectation Maximization

Did you notice?

Something notable:

∑_{i=1}^{n+1} λᵢ = 1

Thus

∑_{i=1}^{n} λᵢ = 1 − λ_{n+1}

and finally

(1 / (1 − λ_{n+1})) ∑_{i=1}^{n} λᵢ = 1

Page 75: 07 Machine Learning - Expectation Maximization

Now

We have that

f( ∑_{i=1}^{n+1} λᵢ xᵢ ) ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f( (1 / (1 − λ_{n+1})) ∑_{i=1}^{n} λᵢ xᵢ )

                        ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) (1 / (1 − λ_{n+1})) ∑_{i=1}^{n} λᵢ f(xᵢ)     (by the inductive hypothesis, since the rescaled weights sum to one)

                        = λ_{n+1} f(x_{n+1}) + ∑_{i=1}^{n} λᵢ f(xᵢ)     Q.E.D.

Page 78: 07 Machine Learning - Expectation Maximization

Thus...

It is possible to show that, since ln(x) is a concave function,

ln [ ∑_{i=1}^{n} λᵢ xᵢ ] ≥ ∑_{i=1}^{n} λᵢ ln(xᵢ)

Now take λᵢ = P(y|X, Θ_n). We know that
1 P(y|X, Θ_n) ≥ 0
2 ∑_y P(y|X, Θ_n) = 1
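A quick numerical sanity check of this concave form of Jensen's inequality (a minimal sketch; the weights and points below are arbitrary):

```python
import numpy as np

# Concave Jensen: ln(sum_i lambda_i * x_i) >= sum_i lambda_i * ln(x_i)
lam = np.array([0.2, 0.5, 0.3])        # non-negative weights summing to one
x = np.array([0.5, 2.0, 4.0])          # positive points

lhs = np.log(np.sum(lam * x))          # ln of the convex combination
rhs = np.sum(lam * np.log(x))          # convex combination of the logs
print(lhs, rhs, lhs >= rhs)            # the inequality holds
```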

Page 80: 07 Machine Learning - Expectation Maximization

We have then...

First,

L(Θ) − L(Θ_n) = ln( ∑_y P(X|y, Θ) P(y|Θ) ) − ln P(X|Θ_n)

              = ln( ∑_y P(X|y, Θ) P(y|Θ) · P(y|X, Θ_n) / P(y|X, Θ_n) ) − ln P(X|Θ_n)

              = ln( ∑_y P(y|X, Θ_n) · [ P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n) ] ) − ln P(X|Θ_n)

              ≥ ∑_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n) ) − ∑_y P(y|X, Θ_n) ln P(X|Θ_n)     Why this?

Page 84: 07 Machine Learning - Expectation Maximization

Next

Because ∑_y P(y|X, Θ_n) = 1, we obtain

L(Θ) − L(Θ_n) ≥ ∑_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / [ P(y|X, Θ_n) P(X|Θ_n) ] ) ≜ ∆(Θ|Θ_n)

Page 86: 07 Machine Learning - Expectation Maximization

Then, we have

Then

L(Θ) ≥ L(Θ_n) + ∆(Θ|Θ_n)   (26)

We then define

l(Θ|Θ_n) ≜ L(Θ_n) + ∆(Θ|Θ_n)   (27)

Thus l(Θ|Θ_n) is bounded from above by L(Θ), i.e. l(Θ|Θ_n) ≤ L(Θ).

Page 89: 07 Machine Learning - Expectation Maximization

Now, we can do the following

We have

l(Θ_n|Θ_n) = L(Θ_n) + ∆(Θ_n|Θ_n)

           = L(Θ_n) + ∑_y P(y|X, Θ_n) ln( P(X|y, Θ_n) P(y|Θ_n) / [ P(y|X, Θ_n) P(X|Θ_n) ] )

           = L(Θ_n) + ∑_y P(y|X, Θ_n) ln( P(X, y|Θ_n) / P(X, y|Θ_n) )

           = L(Θ_n)

This means that for Θ = Θ_n the functions L(Θ) and l(Θ|Θ_n) are equal.

Page 95: 07 Machine Learning - Expectation Maximization

What are we looking for...

Thus, we want to select Θ such that l(Θ|Θ_n) is maximized, i.e. given a Θ_n, we want to generate Θ_{n+1}.

The process can be seen in the following graph (figure not included in this transcript).

Page 97: 07 Machine Learning - Expectation Maximization

Formally, we want

The following:

Θ_{n+1} = argmax_Θ { l(Θ|Θ_n) }

        = argmax_Θ { L(Θ_n) + ∑_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / [ P(y|X, Θ_n) P(X|Θ_n) ] ) }

Dropping the terms that depend only on Θ_n (constants, which do not change the argmax):

        = argmax_Θ { ∑_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) ) }

        = argmax_Θ { ∑_y P(y|X, Θ_n) ln( [ P(X, y|Θ) / P(y|Θ) ] · [ P(y, Θ) / P(Θ) ] ) }

        = argmax_Θ { ∑_y P(y|X, Θ_n) ln( ( P(X, y, Θ) / P(Θ) ) / ( P(y, Θ) / P(Θ) ) · [ P(y, Θ) / P(Θ) ] ) }

Page 102: 07 Machine Learning - Expectation Maximization

Thus

Continuing,

Θ_{n+1} = argmax_Θ { ∑_y P(y|X, Θ_n) ln( [ P(X, y, Θ) / P(y, Θ) ] · [ P(y, Θ) / P(Θ) ] ) }

        = argmax_Θ { ∑_y P(y|X, Θ_n) ln( P(X, y, Θ) / P(Θ) ) }

        = argmax_Θ { ∑_y P(y|X, Θ_n) ln P(X, y|Θ) }

        = argmax_Θ { E_{y|X,Θ_n}[ ln P(X, y|Θ) ] }

Hence argmax_Θ { l(Θ|Θ_n) } = argmax_Θ { E_{y|X,Θ_n}[ ln P(X, y|Θ) ] }.

Page 108: 07 Machine Learning - Expectation Maximization

The EM-Algorithm

Steps of EM:
1 Expectation under the hidden variables.
2 Maximization of the resulting formula.

E-Step: determine the conditional expectation E_{y|X,Θ_n}[ ln P(X, y|Θ) ].

M-Step: maximize this expression with respect to Θ.
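The two steps translate directly into a loop: compute the posterior over the hidden variables with the current parameters (E-step), then maximize the expected complete-data log-likelihood (M-step). Below is a minimal sketch for the classic two-coin mixture (an illustrative example, not taken from the lecture): each trial reports the number of heads in m flips, the hidden y_i says which of two coins produced the trial, and a uniform prior over the two coins is assumed.

```python
import numpy as np
from math import comb

# Data: number of heads observed in m flips for each of several trials.
# Hidden variable y_i: which of two coins (A or B) produced trial i.
heads = np.array([5, 9, 8, 4, 7])
m = 10                                  # flips per trial
theta = np.array([0.6, 0.5])            # initial guess (theta_A, theta_B)

def binom_pmf(h, m, p):
    return comb(m, h) * p**h * (1 - p)**(m - h)

for _ in range(20):
    # E-step: responsibility p(y_i = A | x_i, theta_n), assuming a uniform prior on the coin
    pA = np.array([binom_pmf(h, m, theta[0]) for h in heads])
    pB = np.array([binom_pmf(h, m, theta[1]) for h in heads])
    rA = pA / (pA + pB)
    rB = 1 - rA
    # M-step: the expected complete-data log-likelihood is maximized in closed form
    theta = np.array([np.sum(rA * heads) / np.sum(rA * m),
                      np.sum(rB * heads) / np.sum(rB * m)])

print(theta)   # estimated biases of the two coins
```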

Page 113: 07 Machine Learning - Expectation Maximization

Notes and Convergence of EM

Gains between L(Θ) and l(Θ|Θ_n): using the hidden variables it is possible to simplify the optimization of L(Θ) through l(Θ|Θ_n).

Convergence:
1 Remember that Θ_{n+1} is the estimate for Θ which maximizes the difference ∆(Θ|Θ_n).
2 Given the initial estimate Θ_n, we have ∆(Θ_n|Θ_n) = 0.
3 Since Θ_{n+1} is chosen to maximize ∆(Θ|Θ_n), we get ∆(Θ_{n+1}|Θ_n) ≥ ∆(Θ_n|Θ_n) = 0.
Therefore the likelihood L(Θ_n) is non-decreasing from one EM iteration to the next.
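The non-decreasing-likelihood property can be checked numerically on the two-coin sketch from the algorithm slide (again purely illustrative): the observed-data log-likelihood evaluated after each M-step never drops.

```python
import numpy as np
from math import comb

heads, m = np.array([5, 9, 8, 4, 7]), 10
theta = np.array([0.6, 0.5])

def binom_pmf(h, m, p):
    return comb(m, h) * p**h * (1 - p)**(m - h)

def log_lik(theta):
    # Observed-data log-likelihood log P(X | Theta) with a uniform prior over the two coins
    return sum(np.log(0.5 * binom_pmf(h, m, theta[0]) + 0.5 * binom_pmf(h, m, theta[1]))
               for h in heads)

prev = -np.inf
for _ in range(20):
    pA = np.array([binom_pmf(h, m, theta[0]) for h in heads])
    pB = np.array([binom_pmf(h, m, theta[1]) for h in heads])
    rA = pA / (pA + pB)
    theta = np.array([np.sum(rA * heads) / np.sum(rA * m),
                      np.sum((1 - rA) * heads) / np.sum((1 - rA) * m)])
    cur = log_lik(theta)
    assert cur >= prev - 1e-12     # L(Theta_n) is non-decreasing
    prev = cur
```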

Page 118: 07 Machine Learning - Expectation Maximization

Notes and Convergence of EM

Properties:
When the algorithm reaches a fixed point for some Θ_n, that value maximizes l(Θ|Θ_n).
Because L and l are equal at Θ_n, if L and l are differentiable at Θ_n then Θ_n is a stationary point of L, i.e. the derivative of L vanishes there (it may be a local maximum or a saddle point).

For more on this:
Geoffrey McLachlan and Thriyambakam Krishnan, "The EM Algorithm and Extensions," John Wiley & Sons, New York, 1996.

Page 121: 07 Machine Learning - Expectation Maximization

Finding Maximum Likelihood Mixture Densities Parameters via EM

Something notable: the mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community.

We have

p(x|Θ) = ∑_{i=1}^{M} αᵢ pᵢ(x|θᵢ)   (28)

where
1 Θ = (α₁, ..., α_M, θ₁, ..., θ_M)
2 ∑_{i=1}^{M} αᵢ = 1
3 each pᵢ is a density function parametrized by θᵢ.
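As a concrete instance of equation (28), here is a minimal sketch assuming one-dimensional Gaussian components pᵢ with θᵢ = (μᵢ, σᵢ) (an illustrative choice, not dictated by the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def mixture_density(x, alphas, mus, sigmas):
    """p(x | Theta) = sum_i alpha_i p_i(x | theta_i), equation (28),
    with 1D Gaussian components p_i parametrized by theta_i = (mu_i, sigma_i)."""
    return sum(a * gaussian_pdf(x, mu, s) for a, mu, s in zip(alphas, mus, sigmas))

# A two-component example: the mixing weights must sum to one
alphas, mus, sigmas = [0.3, 0.7], [-1.0, 2.0], [0.5, 1.0]
print(mixture_density(0.0, alphas, mus, sigmas))
```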

Page 125: 07 Machine Learning - Expectation Maximization

A log-likelihood for this function

We have

log L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = ∑_{i=1}^{N} log ( ∑_{j=1}^{M} α_j p_j(x_i|θ_j) )   (29)

Note: this is difficult to optimize directly because the log sits outside the sum over components.

However, we can simplify it with the following assumptions:
1 The unobserved data Y = {y_i}_{i=1}^{N} takes values y_i ∈ {1, ..., M}.
2 y_i = k if the i-th sample was generated by the k-th mixture component.

Page 130: 07 Machine Learning - Expectation Maximization

Now

We have

log L(Θ|X, Y) = log[ P(X, Y|Θ) ]   (30)

Thus

log[ P(X, Y|Θ) ] = ∑_{i=1}^{N} log[ P(x_i|y_i, θ_{y_i}) P(y_i|θ_{y_i}) ]   (31)

Question: do you need y_i if you know θ_{y_i}, or the other way around?

Finally,

log L(Θ|X, Y) = ∑_{i=1}^{N} log[ P(x_i|y_i, θ_{y_i}) P(y_i) ] = ∑_{i=1}^{N} log[ α_{y_i} p_{y_i}(x_i|θ_{y_i}) ]   (32)

Nope: you do not need y_i if you know θ_{y_i}, or the other way around.
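Equation (32) is easy to evaluate when the labels are known. A minimal sketch, reusing the illustrative 1D Gaussian components from before (the data and labels below are made up):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def complete_data_log_lik(x, y, alphas, mus, sigmas):
    """Equation (32): sum_i log[ alpha_{y_i} * p_{y_i}(x_i | theta_{y_i}) ],
    assuming the component label y_i of every sample is observed."""
    return sum(np.log(alphas[yi] * gaussian_pdf(xi, mus[yi], sigmas[yi]))
               for xi, yi in zip(x, y))

x = np.array([-1.2, 2.3, 1.9, -0.8])
y = np.array([0, 1, 1, 0])                      # known component of each sample
print(complete_data_log_lik(x, y, [0.3, 0.7], [-1.0, 2.0], [0.5, 1.0]))
```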

Page 135: 07 Machine Learning - Expectation Maximization

Problem

We do not know the values of Y.

We can get around this by using the following idea: treat Y as a random variable.

Page 137: 07 Machine Learning - Expectation Maximization

Thus

You make a first guess for the parameters:

Θ^g = (α₁^g, ..., α_M^g, θ₁^g, ..., θ_M^g)   (33)

Then it is possible to calculate p_j(x_i|θ_j^g).

In addition, the mixing parameters α_j can be thought of as prior probabilities of each mixture component:

α_j = p(component j)   (34)

Page 141: 07 Machine Learning - Expectation Maximization

Using Bayes' Rule

Compute

p(y_i|x_i, Θ^g) = p(y_i, x_i|Θ^g) / p(x_i|Θ^g)

                = p(x_i|y_i, Θ^g) p(y_i|Θ^g) / p(x_i|Θ^g)     (given y_i, only θ^g_{y_i} matters in Θ^g)

                = α^g_{y_i} p_{y_i}(x_i|θ^g_{y_i}) / p(x_i|Θ^g)

                = α^g_{y_i} p_{y_i}(x_i|θ^g_{y_i}) / ∑_{k=1}^{M} α^g_k p_k(x_i|θ^g_k)

In addition,

p(y|X, Θ^g) = ∏_{i=1}^{N} p(y_i|x_i, Θ^g)   (35)

where y = (y₁, y₂, ..., y_N).
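The last expression is the per-sample "responsibility" of each component. A short sketch computing it for all samples at once, again with illustrative 1D Gaussian components:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def responsibilities(x, alphas, mus, sigmas):
    """p(y_i = l | x_i, Theta^g) for every sample i and component l (an N x M matrix)."""
    weighted = np.array([[a * gaussian_pdf(xi, mu, s)
                          for a, mu, s in zip(alphas, mus, sigmas)] for xi in x])
    return weighted / weighted.sum(axis=1, keepdims=True)   # normalize over components

x = np.array([-1.2, 2.3, 1.9, -0.8])
r = responsibilities(x, [0.3, 0.7], [-1.0, 2.0], [0.5, 1.0])
print(r)            # each row sums to one
```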

Page 145: 07 Machine Learning - Expectation Maximization

Now, using equation 17

Then

Q(Θ, Θ^g) = ∑_{y∈𝒴} log( L(Θ|X, y) ) p(y|X, Θ^g)

          = ∑_{y∈𝒴} ∑_{i=1}^{N} log[ α_{y_i} p_{y_i}(x_i|θ_{y_i}) ] ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

          = ∑_{y_1=1}^{M} ∑_{y_2=1}^{M} · · · ∑_{y_N=1}^{M} ∑_{i=1}^{N} log[ α_{y_i} p_{y_i}(x_i|θ_{y_i}) ] ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

          = ∑_{y_1=1}^{M} ∑_{y_2=1}^{M} · · · ∑_{y_N=1}^{M} ∑_{i=1}^{N} ∑_{l=1}^{M} δ_{l,y_i} log[ α_l p_l(x_i|θ_l) ] ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

          = ∑_{i=1}^{N} ∑_{l=1}^{M} log[ α_l p_l(x_i|θ_l) ] ∑_{y_1=1}^{M} ∑_{y_2=1}^{M} · · · ∑_{y_N=1}^{M} δ_{l,y_i} ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

Page 150: 07 Machine Learning - Expectation Maximization

It looks quite daunting, but...

First notice the following:

∑_{y_1=1}^{M} ∑_{y_2=1}^{M} · · · ∑_{y_N=1}^{M} δ_{l,y_i} ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

  = ( ∑_{y_1=1}^{M} · · · ∑_{y_{i−1}=1}^{M} ∑_{y_{i+1}=1}^{M} · · · ∑_{y_N=1}^{M} ∏_{j=1, j≠i}^{N} p(y_j|x_j, Θ^g) ) p(l|x_i, Θ^g)

  = ( ∏_{j=1, j≠i}^{N} ∑_{y_j=1}^{M} p(y_j|x_j, Θ^g) ) p(l|x_i, Θ^g)

  = p(l|x_i, Θ^g) = α_l^g p_l(x_i|θ_l^g) / ∑_{k=1}^{M} α_k^g p_k(x_i|θ_k^g)

because, for every j,

∑_{y_j=1}^{M} p(y_j|x_j, Θ^g) = 1   (36)

Page 154: 07 Machine Learning - Expectation Maximization

Thus

We can write Q in the following way:

Q(Θ, Θ^g) = ∑_{i=1}^{N} ∑_{l=1}^{M} log[ α_l p_l(x_i|θ_l) ] p(l|x_i, Θ^g)

          = ∑_{i=1}^{N} ∑_{l=1}^{M} log(α_l) p(l|x_i, Θ^g) + ∑_{i=1}^{N} ∑_{l=1}^{M} log( p_l(x_i|θ_l) ) p(l|x_i, Θ^g)   (37)

Page 158: 07 Machine Learning - Expectation Maximization

Lagrange Multipliers

To maximize this expression we can maximize the term containing α_l and the term containing θ_l.

This can be done independently, since the two terms are not related.

Page 160: 07 Machine Learning - Expectation Maximization

Lagrange Multipliers

We can use the following constraint for that:

∑_l α_l = 1   (38)

We have the following cost function:

Q(Θ, Θ^g) + λ ( ∑_l α_l − 1 )   (39)

Differentiating with respect to α_l:

∂/∂α_l [ Q(Θ, Θ^g) + λ ( ∑_l α_l − 1 ) ] = 0   (40)

Page 163: 07 Machine Learning - Expectation Maximization

Lagrange Multipliers

We have that

\sum_{i=1}^{N} \frac{1}{\alpha_l}\, p\left(l \mid x_i, \Theta^g\right) + \lambda = 0 \quad (41)

Thus

\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) = -\lambda\, \alpha_l \quad (42)

Summing over l, we get

\lambda = -N \quad (43)

58 / 72

Page 166: 07 Machine Learning - Expectation Maximization

Lagrange Multipliers

Thus

\alpha_l = \frac{1}{N} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) \quad (44)

About θl
It is possible to get analytical expressions for θl as functions of everything else.

This is for you in a homework!!!

For more, look at
"Geometric Idea of Lagrange Multipliers" by John Wyatt.

59 / 72
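A tiny sketch of the mixing-weight update in equation 44, added here for illustration (the responsibilities are random assumptions). Because each row of the responsibility matrix sums to one, the updated weights automatically satisfy the constraint that the Lagrange multiplier enforced.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 10, 3

# Assumed responsibilities p(l | x_i, Theta^g): each row sums to one.
resp = rng.random((N, M))
resp /= resp.sum(axis=1, keepdims=True)

# Mixing-weight update of eq. (44): alpha_l = (1/N) * sum_i p(l | x_i, Theta^g).
alpha_new = resp.mean(axis=0)

print(alpha_new)
assert np.isclose(alpha_new.sum(), 1.0)   # the constraint sum_l alpha_l = 1 holds
```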

Page 169: 07 Machine Learning - Expectation Maximization

Outline
1 Introduction
  Maximum-Likelihood
2 Incomplete Data
  Analogy
3 Derivation of the EM-Algorithm
  Hidden Features
  Using Convex functions
  The Final Algorithm
  Notes and Convergence of EM
4 Finding Maximum Likelihood Mixture Densities
  Bayes' Rule for the components
  Maximizing Q
  Gaussian Distributions
  The EM Algorithm

60 / 72

Page 170: 07 Machine Learning - Expectation Maximization

Remember?

Gaussian Distribution

p_l\left(x \mid \mu_l, \Sigma_l\right) = \frac{1}{(2\pi)^{d/2} \left|\Sigma_l\right|^{1/2}} \exp\left\{ -\frac{1}{2} \left(x - \mu_l\right)^T \Sigma_l^{-1} \left(x - \mu_l\right) \right\} \quad (45)

61 / 72
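Equation 45 can be evaluated directly. The sketch below (an addition for illustration; the point, mean, and covariance are random assumptions) implements the formula with plain numpy and compares it against scipy's reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate eq. (45) for a single point x."""
    d = x.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

rng = np.random.default_rng(3)
d = 3
x = rng.normal(size=d)
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)          # a symmetric positive-definite covariance

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # should match
```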

Page 171: 07 Machine Learning - Expectation Maximization

How to use this for Gaussian Distributions

For this, we need to refresh some linear algebra
1. tr(A + B) = tr(A) + tr(B)
2. tr(AB) = tr(BA)
3. \sum_{i} x_i^T A x_i = tr(AB), where B = \sum_{i} x_i x_i^T
4. \left|A^{-1}\right| = \frac{1}{|A|}

Now, we need the derivative of a matrix function f(A)
Thus, \frac{\partial f(A)}{\partial A} is the matrix whose (i, j)-th entry is \left[\frac{\partial f(A)}{\partial a_{i,j}}\right], where a_{i,j} is the (i, j)-th entry of A.

62 / 72
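The four identities above are easy to verify numerically. The check below is a sketch added here (random matrices and vectors are assumptions), not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 4, 7
A = rng.normal(size=(d, d))
Bm = rng.normal(size=(d, d))
xs = rng.normal(size=(n, d))

# 1) tr(A + B) = tr(A) + tr(B)
assert np.isclose(np.trace(A + Bm), np.trace(A) + np.trace(Bm))
# 2) tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ Bm), np.trace(Bm @ A))
# 3) sum_i x_i^T A x_i = tr(AB) with B = sum_i x_i x_i^T
B = sum(np.outer(x, x) for x in xs)
assert np.isclose(sum(x @ A @ x for x in xs), np.trace(A @ B))
# 4) |A^{-1}| = 1 / |A|
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A))
print("all four identities hold numerically")
```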

Page 176: 07 Machine Learning - Expectation Maximization

In addition

If A is symmetric

\frac{\partial |A|}{\partial A} = \begin{cases} A_{i,j} & \text{if } i = j \\ 2A_{i,j} & \text{if } i \neq j \end{cases} \quad (46)

Where A_{i,j} is the (i, j)-th cofactor of A.
Note: the cofactor is the determinant obtained by deleting the row and column of a given element of a matrix or determinant, preceded by a + or − sign depending on whether the element is in a + or − position.

Thus

\frac{\partial \log |A|}{\partial A} = \begin{cases} \frac{A_{i,j}}{|A|} & \text{if } i = j \\ \frac{2A_{i,j}}{|A|} & \text{if } i \neq j \end{cases} = 2A^{-1} - \text{diag}\left(A^{-1}\right) \quad (47)

63 / 72
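A finite-difference check of equation 47 can make the symmetric-matrix convention concrete: an off-diagonal entry a_{ij} also appears as a_{ji}, so both are perturbed together. The sketch below is an added illustration under that assumption, with a random symmetric positive-definite A.

```python
import numpy as np

rng = np.random.default_rng(5)
d, eps = 3, 1e-6
A = rng.normal(size=(d, d))
A = A @ A.T + d * np.eye(d)              # symmetric positive definite, so log|A| is defined

Ainv = np.linalg.inv(A)
analytic = 2 * Ainv - np.diag(np.diag(Ainv))   # eq. (47)

# Finite differences under the symmetry convention.
numeric = np.zeros_like(A)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(A)
        E[i, j] += eps
        if i != j:
            E[j, i] += eps                     # a_{ij} and a_{ji} move together
        numeric[i, j] = (np.log(np.linalg.det(A + E)) - np.log(np.linalg.det(A))) / eps

print(np.max(np.abs(numeric - analytic)))      # should be tiny
```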

Page 180: 07 Machine Learning - Expectation Maximization

Finally

The last equation we need:

\frac{\partial\, tr(AB)}{\partial A} = B + B^T - \text{diag}(B) \quad (48)

In addition:

\frac{\partial\, x^T A x}{\partial x} = \left(A + A^T\right) x \quad (49)

64 / 72
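Both facts can be checked numerically the same way as before. The sketch below is an added illustration (random matrices and the symmetric-perturbation convention for eq. 48 are assumptions), not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(6)
d, eps = 3, 1e-6
B = rng.normal(size=(d, d))
A = rng.normal(size=(d, d))
A = A + A.T                                    # eq. (48) uses the symmetric-A convention

analytic = B + B.T - np.diag(np.diag(B))       # eq. (48)
numeric = np.zeros_like(A)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(A)
        E[i, j] += eps
        if i != j:
            E[j, i] += eps                     # a_{ij} and a_{ji} move together
        numeric[i, j] = (np.trace((A + E) @ B) - np.trace(A @ B)) / eps
print(np.max(np.abs(numeric - analytic)))      # should be tiny

# Eq. (49): d(x^T A x)/dx = (A + A^T) x, checked with one-sided differences.
x = rng.normal(size=d)
M = rng.normal(size=(d, d))
grad_analytic = (M + M.T) @ x
grad_numeric = np.array([((x + eps * np.eye(d)[k]) @ M @ (x + eps * np.eye(d)[k]) - x @ M @ x) / eps
                         for k in range(d)])
print(np.max(np.abs(grad_numeric - grad_analytic)))   # small, up to first-order error
```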

Page 182: 07 Machine Learning - Expectation Maximization

Thus, using the last part of equation 37
We get, after ignoring constant terms
Remember: they disappear after taking derivatives

\sum_{i=1}^{N} \sum_{l=1}^{M} \log\left(p_l\left(x_i \mid \mu_l, \Sigma_l\right)\right) p\left(l \mid x_i, \Theta^g\right)
  = \sum_{i=1}^{N} \sum_{l=1}^{M} \left[ -\frac{1}{2} \log\left(\left|\Sigma_l\right|\right) - \frac{1}{2} \left(x_i - \mu_l\right)^T \Sigma_l^{-1} \left(x_i - \mu_l\right) \right] p\left(l \mid x_i, \Theta^g\right) \quad (50)

Thus, when taking the derivative with respect to μl:

\sum_{i=1}^{N} \left[ \Sigma_l^{-1} \left(x_i - \mu_l\right) p\left(l \mid x_i, \Theta^g\right) \right] = 0 \quad (51)

Then

\mu_l = \frac{\sum_{i=1}^{N} x_i\, p\left(l \mid x_i, \Theta^g\right)}{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)} \quad (52)

65 / 72
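Equation 52 is simply a responsibility-weighted average of the data, which is a one-liner in code. The sketch below is an added illustration with assumed toy data and random responsibilities.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, d = 8, 2, 2
X = rng.normal(size=(N, d))                 # toy data (assumption)
resp = rng.random((N, M))
resp /= resp.sum(axis=1, keepdims=True)     # assumed responsibilities p(l | x_i, Theta^g)

# Eq. (52): each new mean is a responsibility-weighted average of the data.
mu_new = (resp.T @ X) / resp.sum(axis=0)[:, None]     # shape (M, d)
print(mu_new)
```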

Page 185: 07 Machine Learning - Expectation Maximization

Now, if we derive with respect to Σl

First, we rewrite equation 50:

\sum_{i=1}^{N} \sum_{l=1}^{M} \left[ -\frac{1}{2} \log\left(\left|\Sigma_l\right|\right) - \frac{1}{2} \left(x_i - \mu_l\right)^T \Sigma_l^{-1} \left(x_i - \mu_l\right) \right] p\left(l \mid x_i, \Theta^g\right)
  = \sum_{l=1}^{M} \left[ -\frac{1}{2} \log\left(\left|\Sigma_l\right|\right) \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) - \frac{1}{2} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)\, tr\left\{ \Sigma_l^{-1} \left(x_i - \mu_l\right) \left(x_i - \mu_l\right)^T \right\} \right]
  = \sum_{l=1}^{M} \left[ -\frac{1}{2} \log\left(\left|\Sigma_l\right|\right) \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) - \frac{1}{2} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)\, tr\left\{ \Sigma_l^{-1} N_{l,i} \right\} \right]

Where N_{l,i} = \left(x_i - \mu_l\right) \left(x_i - \mu_l\right)^T.

66 / 72

Page 189: 07 Machine Learning - Expectation Maximization

Deriving with respect to Σl^{-1}

We have that

\frac{\partial}{\partial \Sigma_l^{-1}} \sum_{l=1}^{M} \left[ -\frac{1}{2} \log\left(\left|\Sigma_l\right|\right) \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) - \frac{1}{2} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)\, tr\left\{ \Sigma_l^{-1} N_{l,i} \right\} \right]
  = \frac{1}{2} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) \left(2\Sigma_l - \text{diag}\left(\Sigma_l\right)\right) - \frac{1}{2} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) \left(2 N_{l,i} - \text{diag}\left(N_{l,i}\right)\right)
  = \frac{1}{2} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) \left(2 M_{l,i} - \text{diag}\left(M_{l,i}\right)\right)
  = 2S - \text{diag}(S)

Where M_{l,i} = \Sigma_l - N_{l,i} and S = \frac{1}{2} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) M_{l,i}.

67 / 72

Page 193: 07 Machine Learning - Expectation Maximization

Thus, we have

If 2S - \text{diag}(S) = 0 \implies S = 0

Implying

\frac{1}{2} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) \left[\Sigma_l - N_{l,i}\right] = 0 \quad (53)

Or

\Sigma_l = \frac{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) N_{l,i}}{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)} = \frac{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) \left(x_i - \mu_l\right) \left(x_i - \mu_l\right)^T}{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)} \quad (54)

68 / 72

Page 196: 07 Machine Learning - Expectation Maximization

Thus, we have the iterative updates

They are

\alpha_l^{New} = \frac{1}{N} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)

\mu_l^{New} = \frac{\sum_{i=1}^{N} x_i\, p\left(l \mid x_i, \Theta^g\right)}{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)}

\Sigma_l^{New} = \frac{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) \left(x_i - \mu_l\right) \left(x_i - \mu_l\right)^T}{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)}

69 / 72
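All three updates take a responsibility matrix and return new parameters, so they fit naturally into one M-step routine. The sketch below is an added illustration (toy data and random responsibilities are assumptions); it uses the freshly updated means inside the covariance update, which is the usual reading of these formulas.

```python
import numpy as np

def m_step(X, resp):
    """One M-step: map responsibilities p(l | x_i, Theta^g) to updated parameters."""
    N, d = X.shape
    Nl = resp.sum(axis=0)                          # effective counts per component
    alpha_new = Nl / N
    mu_new = (resp.T @ X) / Nl[:, None]
    Sigma_new = []
    for l in range(resp.shape[1]):
        diff = X - mu_new[l]
        Sigma_new.append((resp[:, l, None] * diff).T @ diff / Nl[l])
    return alpha_new, mu_new, np.array(Sigma_new)

rng = np.random.default_rng(8)
X = rng.normal(size=(20, 2))                       # toy data (assumption)
resp = rng.random((20, 3))
resp /= resp.sum(axis=1, keepdims=True)            # assumed responsibilities
alpha, mu, Sigma = m_step(X, resp)
print(alpha, mu.shape, Sigma.shape)
```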

Page 199: 07 Machine Learning - Expectation Maximization

Outline
1 Introduction
  Maximum-Likelihood
2 Incomplete Data
  Analogy
3 Derivation of the EM-Algorithm
  Hidden Features
  Using Convex functions
  The Final Algorithm
  Notes and Convergence of EM
4 Finding Maximum Likelihood Mixture Densities
  Bayes' Rule for the components
  Maximizing Q
  Gaussian Distributions
  The EM Algorithm

70 / 72

Page 200: 07 Machine Learning - Expectation Maximization

EM for Gaussian Mixtures
1. Initialize the means μl, covariances Σl and mixing coefficients αl, and evaluate the initial value of the log likelihood.
2. E-Step: Evaluate the probabilities of component l given xi using the current parameter values:

p\left(l \mid x_i, \Theta^g\right) = \frac{\alpha_l^g\, p_l\left(x_i \mid \theta_l^g\right)}{\sum_{k=1}^{M} \alpha_k^g\, p_k\left(x_i \mid \theta_k^g\right)}

3. M-Step: Re-estimate the parameters using the current iteration values:

\alpha_l^{New} = \frac{1}{N} \sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)

\mu_l^{New} = \frac{\sum_{i=1}^{N} x_i\, p\left(l \mid x_i, \Theta^g\right)}{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)}

\Sigma_l^{New} = \frac{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right) \left(x_i - \mu_l\right) \left(x_i - \mu_l\right)^T}{\sum_{i=1}^{N} p\left(l \mid x_i, \Theta^g\right)}

4. Evaluate the log likelihood:

\log p\left(\mathcal{X} \mid \mu, \Sigma, \alpha\right) = \sum_{i=1}^{N} \log\left\{ \sum_{l=1}^{M} \alpha_l\, p_l\left(x_i \mid \mu_l, \Sigma_l\right) \right\}

Check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

71 / 72
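A compact end-to-end sketch of these four steps is given below. It is an illustration added here, not the authors' code: the initialization scheme, the small covariance regularizer, the tolerance, and the toy two-component data are all assumptions; the log likelihood is evaluated for the parameters used in each E-step and the loop stops when it no longer improves.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, M, n_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture, following steps 1-4 above."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # 1. Initialization: means drawn from the data, shared sample covariance, uniform weights.
    alpha = np.full(M, 1.0 / M)
    mu = X[rng.choice(N, size=M, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(M)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. E-step: responsibilities p(l | x_i, Theta^g).
        dens = np.column_stack([alpha[l] * multivariate_normal.pdf(X, mu[l], Sigma[l])
                                for l in range(M)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # 4. Log likelihood of the parameters used in this E-step; convergence check.
        ll = np.sum(np.log(dens.sum(axis=1)))
        if np.abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # 3. M-step: re-estimate alpha, mu, Sigma from the responsibilities.
        Nl = resp.sum(axis=0)
        alpha = Nl / N
        mu = (resp.T @ X) / Nl[:, None]
        for l in range(M):
            diff = X - mu[l]
            Sigma[l] = (resp[:, l, None] * diff).T @ diff / Nl[l] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, ll

# Toy two-component data (an assumption for illustration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(3, 1, size=(100, 2))])
alpha, mu, Sigma, ll = em_gmm(X, M=2)
print("weights:", alpha)
print("means:", mu)
print("final log likelihood:", ll)
```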

Page 208: 07 Machine Learning - Expectation Maximization

Exercises

Duda and Hart, Chapter 3

3.39, 3.40, 3.41

72 / 72