07 Machine Learning - Expectation Maximization

Machine Learning for Data Mining: Expectation Maximization

Andres Mendez-Vazquez

July 12, 2015

1 / 72

Outline

1 Introduction
I Maximum-Likelihood

2 Incomplete Data
I Analogy

3 Derivation of the EM-Algorithm
I Hidden Features
I Using Convex functions
I The Final Algorithm
I Notes and Convergence of EM

4 Finding Maximum Likelihood Mixture Densities
I Bayes' Rule for the components
I Maximizing Q
I Gaussian Distributions
I The EM Algorithm

2 / 72

Maximum-Likelihood

We have a density function p(x|Θ). Assume that we have a data set of size N, X = {x₁, x₂, ..., x_N}.

This data is known as the evidence.

We assume in addition that
The vectors are independent and identically distributed (i.i.d.) with distribution p under parameter Θ.

4 / 72

What Can We Do With The Evidence?

We may use Bayes' Rule to estimate the parameters Θ

p(Θ|X) = P(X|Θ) P(Θ) / P(X)    (1)

Or, given a new observation x̃, compute

p(x̃|X)    (2)

i.e. the probability of the new observation being supported by the evidence X.

Thus
The former represents parameter estimation and the latter data prediction.

5 / 72

Focusing First on the Estimation of the Parameters Θ

We can interpret Bayes' Rule

p(Θ|X) = P(X|Θ) P(Θ) / P(X)    (3)

Interpreted as

posterior = (likelihood × prior) / evidence    (4)

Thus, we want the

likelihood = P(X|Θ)

6 / 72

What we want...

We want to maximize the likelihood as a function of θ

7 / 72

Maximum-Likelihood

We have

p(x₁, x₂, ..., x_N |Θ) = ∏_{i=1}^{N} p(x_i|Θ)    (5)

Also known as the likelihood function.

Problematic
Because the multiplication of numbers less than one can be numerically problematic, we work with the log-likelihood

L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = Σ_{i=1}^{N} log p(x_i|Θ)    (6)

8 / 72

Maximum-Likelihood

We want to find a Θ*

Θ* = argmax_Θ L(Θ|X)    (7)

The classic method

∂L(Θ|X) / ∂θ_i = 0   ∀ θ_i ∈ Θ    (8)

9 / 72
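As a side note not on the original slides, the following minimal Python sketch (assuming NumPy and synthetic data) illustrates equations (6)-(8) for a univariate Gaussian, where setting the partial derivatives of the log-likelihood to zero gives the familiar closed-form estimates.

```python
import numpy as np

# A minimal sketch of maximum likelihood for a univariate Gaussian (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)      # the evidence X

mu_hat = X.mean()                                  # solves dL/dmu = 0
sigma2_hat = ((X - mu_hat) ** 2).mean()            # solves dL/dsigma^2 = 0 (biased MLE)

def log_likelihood(mu, sigma2, X):
    """L(Theta|X) = sum_i log p(x_i | Theta) for a Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (X - mu) ** 2 / (2 * sigma2))

print(mu_hat, sigma2_hat, log_likelihood(mu_hat, sigma2_hat, X))
```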

What happens if we have incomplete data?

The data could have been split into
1 X = observed data or incomplete data
2 Y = unobserved data

For this type of problem
We have the famous Expectation Maximization (EM) algorithm.

10 / 72

The Expectation Maximization

The EM algorithm
It was first developed by Dempster et al. (1977).

Its popularity comes from the fact
It can estimate an underlying distribution when data is incomplete or has missing values.

Two main applications
1 When missing values exist.
2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden.

11 / 72

Incomplete Data

We assume the following
Two parts of data
1 X = observed data or incomplete data
2 Y = unobserved data

Thus

Z = (X, Y) = Complete Data    (9)

Thus, we have the following probability

p(z|Θ) = p(x, y|Θ) = p(y|x, Θ) p(x|Θ)    (10)

12 / 72

New Likelihood Function

The New Likelihood Function

L(Θ|Z) = L(Θ|X, Y) = p(X, Y|Θ)    (11)

Note: This is the complete-data likelihood.

Thus, we have

L(Θ|X, Y) = p(X, Y|Θ) = p(Y|X, Θ) p(X|Θ)    (12)

Did you notice?
p(X|Θ) is the likelihood of the observed data.
p(Y|X, Θ) is the likelihood of the unobserved data given the observed data!!!

13 / 72

Rewriting

This can be rewritten as

L(Θ|X, Y) = h_{X,Θ}(Y)    (13)

This basically signifies that X and Θ are constant and the only random part is Y.

In addition

L(Θ|X)    (14)

is known as the incomplete-data likelihood function.

14 / 72

Thus

We can connect both incomplete-complete data equations by doing the following

L(Θ|X) = p(X|Θ)
       = Σ_Y p(X, Y|Θ)
       = Σ_Y p(Y|X, Θ) p(X|Θ)
       = ( ∏_{i=1}^{N} p(x_i|Θ) ) Σ_Y p(Y|X, Θ)

15 / 72

Remarks

Problems
Normally, it is almost impossible to obtain a closed analytical solution for the previous equation.

However
We can use the expected value of log p(X, Y|Θ), which allows us to find an iterative procedure to approximate the solution.

16 / 72

The function we would like to have...

The Q function
We want an estimation of the complete-data log-likelihood

log p(X, Y|Θ)    (15)

based on the information provided by X and Θ_{n−1}, where Θ_{n−1} is the set of parameters estimated at the previous step.

Think about the following: if we want to remove Y,

∫ [log p(X, Y|Θ)] p(Y|X, Θ_{n−1}) dY    (16)

Remark: We integrate out Y; actually, this is the expected value of log p(X, Y|Θ).

17 / 72

Use the Expected value

Then
We want

Q(Θ, Θ_{n−1}) = E[log p(X, Y|Θ) | X, Θ_{n−1}]    (17)

Take into account that
1 X and Θ_{n−1} are taken as constants.
2 Θ is a normal variable that we wish to adjust.
3 Y is a random variable governed by the distribution p(Y|X, Θ_{n−1}), the marginal distribution of the missing data.

18 / 72

Another Interpretation

Given the previous information

E[log p(X, Y|Θ) | X, Θ_{n−1}] = ∫_{Y ∈ 𝒴} log p(X, Y|Θ) p(Y|X, Θ_{n−1}) dY

Something Notable
1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameter Θ_{n−1}.
2 In the worst of cases, this density might be very hard to obtain.

Actually, we use

p(Y, X|Θ_{n−1}) = p(Y|X, Θ_{n−1}) p(X|Θ_{n−1})    (18)

which is not dependent on Θ.

19 / 72

Back to the Q function

The intuition
We have the following analogy. Consider h(θ, Y) a function with
I θ a constant
I Y ∼ p_Y(y), a random variable with distribution p_Y(y).

Thus, if Y is a discrete random variable

q(θ) = E_Y[h(θ, Y)] = Σ_y h(θ, y) p_Y(y)    (19)

21 / 72

Why E-step!!!

From here the name
This is basically the E-step.

The second step
It tries to maximize the Q function

Θ_n = argmax_Θ Q(Θ, Θ_{n−1})    (20)

22 / 72

Derivation of the EM-Algorithm

The likelihood function we are going to use
Let X be a random vector which results from a parametrized family:

L(Θ) = ln P(X|Θ)    (21)

Note: ln(x) is a strictly increasing function.

We wish to compute Θ
Based on an estimate Θ_n (after the nth iteration) such that L(Θ) > L(Θ_n).

Or the maximization of the difference

L(Θ) − L(Θ_n) = ln P(X|Θ) − ln P(X|Θ_n)    (22)

23 / 72

Introducing the Hidden Features

Given that the hidden random vector Y exists with values y

P(X|Θ) = Σ_y P(X|y, Θ) P(y|Θ)    (23)

Thus

L(Θ) − L(Θ_n) = ln( Σ_y P(X|y, Θ) P(y|Θ) ) − ln P(X|Θ_n)    (24)

25 / 72

Here, we introduce some concepts of convexity

Now, we have
Theorem (Jensen's inequality)
Let f be a convex function defined on an interval I. If x₁, x₂, ..., x_n ∈ I and λ₁, λ₂, ..., λ_n ≥ 0 with Σ_{i=1}^{n} λ_i = 1, then

f( Σ_{i=1}^{n} λ_i x_i ) ≤ Σ_{i=1}^{n} λ_i f(x_i)    (25)

26 / 72

Proof:

For n = 1
We have the trivial case.

For n = 2
This is the convexity definition.

Now the inductive hypothesis
We assume that the theorem is true for some n.

27 / 72

Now, we have

The following linear combination for the λ_i

f( Σ_{i=1}^{n+1} λ_i x_i ) = f( λ_{n+1} x_{n+1} + Σ_{i=1}^{n} λ_i x_i )
                           = f( λ_{n+1} x_{n+1} + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i x_i )
                           ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f( [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i x_i )

28 / 72

Did you notice?

Something Notable

Σ_{i=1}^{n+1} λ_i = 1

Thus

Σ_{i=1}^{n} λ_i = 1 − λ_{n+1}

Finally

[1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i = 1

29 / 72

Now

We have that

f( Σ_{i=1}^{n+1} λ_i x_i ) ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f( [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i x_i )
                           ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i f(x_i)
                           = λ_{n+1} f(x_{n+1}) + Σ_{i=1}^{n} λ_i f(x_i)
                           = Σ_{i=1}^{n+1} λ_i f(x_i)   Q.E.D.

30 / 72
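As a quick numerical aside (not part of the slides), the sketch below checks Jensen's inequality for the concave ln, which is exactly the form used on the next slide; it assumes NumPy and randomly drawn convex weights.

```python
import numpy as np

# Sanity check: ln(sum_i lambda_i x_i) >= sum_i lambda_i ln(x_i) for convex weights lambda.
rng = np.random.default_rng(1)
x = rng.uniform(0.1, 5.0, size=6)
lam = rng.dirichlet(np.ones(6))       # lambda_i >= 0, sum lambda_i = 1

lhs = np.log(np.dot(lam, x))
rhs = np.dot(lam, np.log(x))
assert lhs >= rhs - 1e-12             # concave case of Jensen's inequality
print(lhs, rhs)
```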

Thus...

It is possible to show that
Given that ln(x) is a concave function ⟹ ln[ Σ_{i=1}^{n} λ_i x_i ] ≥ Σ_{i=1}^{n} λ_i ln(x_i).

If we take into consideration
Assume that λ_i = P(y|X, Θ_n). We know that
1 P(y|X, Θ_n) ≥ 0
2 Σ_y P(y|X, Θ_n) = 1

31 / 72

We have then...

First

L(Θ) − L(Θ_n) = ln( Σ_y P(X|y, Θ) P(y|Θ) ) − ln P(X|Θ_n)
              = ln( Σ_y P(X|y, Θ) P(y|Θ) · [P(y|X, Θ_n) / P(y|X, Θ_n)] ) − ln P(X|Θ_n)
              = ln( Σ_y P(y|X, Θ_n) · [P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n)] ) − ln P(X|Θ_n)
              ≥ Σ_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n) ) − Σ_y P(y|X, Θ_n) ln P(X|Θ_n)   Why this?

32 / 72

Next

Because

Σ_y P(y|X, Θ_n) = 1

Then

L(Θ) − L(Θ_n) ≥ Σ_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / [P(y|X, Θ_n) P(X|Θ_n)] ) ≜ ∆(Θ|Θ_n)

33 / 72

Then, we have

Then

L(Θ) ≥ L(Θ_n) + ∆(Θ|Θ_n)    (26)

We define then

l(Θ|Θ_n) ≜ L(Θ_n) + ∆(Θ|Θ_n)    (27)

Thus l(Θ|Θ_n)
is bounded from above by L(Θ), i.e. l(Θ|Θ_n) ≤ L(Θ).

34 / 72

Now, we can do the following

We have

l(Θ_n|Θ_n) = L(Θ_n) + ∆(Θ_n|Θ_n)
           = L(Θ_n) + Σ_y P(y|X, Θ_n) ln( P(X|y, Θ_n) P(y|Θ_n) / [P(y|X, Θ_n) P(X|Θ_n)] )
           = L(Θ_n) + Σ_y P(y|X, Θ_n) ln( P(X, y|Θ_n) / P(X, y|Θ_n) )
           = L(Θ_n)

This means that
For Θ = Θ_n, the functions L(Θ) and l(Θ|Θ_n) are equal.

35 / 72

What are we looking for...

Thus, we want to
Select Θ such that l(Θ|Θ_n) is maximized, i.e. given a Θ_n we want to generate Θ_{n+1}.

The process can be seen in the following graph (figure omitted in this transcript).

37 / 72

Formally, we want

The following

Θ_{n+1} = argmax_Θ { l(Θ|Θ_n) }

        = argmax_Θ { L(Θ_n) + Σ_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / [P(y|X, Θ_n) P(X|Θ_n)] ) }

          The terms with Θ_n are constants.

        ≈ argmax_Θ { Σ_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) ) }

        = argmax_Θ { Σ_y P(y|X, Θ_n) ln( [P(X, y|Θ) / P(y|Θ)] · [P(y, Θ) / P(Θ)] ) }

        = argmax_Θ { Σ_y P(y|X, Θ_n) ln( [ (P(X, y, Θ)/P(Θ)) / (P(y, Θ)/P(Θ)) ] · [P(y, Θ) / P(Θ)] ) }

38 / 72

Thus

Then

Θ_{n+1} = argmax_Θ { Σ_y P(y|X, Θ_n) ln( [P(X, y, Θ) / P(y, Θ)] · [P(y, Θ) / P(Θ)] ) }

        = argmax_Θ { Σ_y P(y|X, Θ_n) ln( P(X, y, Θ) / P(Θ) ) }

        = argmax_Θ { Σ_y P(y|X, Θ_n) ln P(X, y|Θ) }

        = argmax_Θ { E_{y|X,Θ_n}[ ln P(X, y|Θ) ] }

Then argmax_Θ { l(Θ|Θ_n) } ≈ argmax_Θ { E_{y|X,Θ_n}[ ln P(X, y|Θ) ] }

39 / 72

The EM-Algorithm

Steps of EM
1 Expectation under the hidden variables.
2 Maximization of the resulting formula.

E-Step
Determine the conditional expectation E_{y|X,Θ_n}[ ln P(X, y|Θ) ].

M-Step
Maximize this expression with respect to Θ.

41 / 72
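The two steps can be written as a generic loop. The sketch below is a schematic Python skeleton (not from the slides); `e_step` and `m_step` are hypothetical placeholders that a concrete model, such as the Gaussian mixture developed next, has to supply.

```python
def em(X, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM skeleton (a sketch).

    e_step(X, theta) -> (stats, log_likelihood), expectations under p(y | X, theta_n)
    m_step(X, stats) -> argmax_theta Q(theta, theta_n)
    """
    theta, prev_ll = theta0, -float("inf")
    for _ in range(max_iter):
        stats, ll = e_step(X, theta)      # E-step
        theta = m_step(X, stats)          # M-step
        if abs(ll - prev_ll) < tol:       # L(theta_n) is non-decreasing, so this settles
            break
        prev_ll = ll
    return theta
```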

Notes and Convergence of EM

Gains between L(Θ) and l(Θ|Θ_n)
Using the hidden variables it is possible to simplify the optimization of L(Θ) through l(Θ|Θ_n).

Convergence
1 Remember that Θ_{n+1} is the estimate for Θ which maximizes the difference ∆(Θ|Θ_n).
2 Then, given the initial estimate of Θ by Θ_n ⇒ ∆(Θ_n|Θ_n) = 0.
3 Now, Θ_{n+1} is chosen to maximize ∆(Θ|Θ_n) ⇒ ∆(Θ_{n+1}|Θ_n) ≥ ∆(Θ_n|Θ_n) = 0.
I Thus the likelihood L(Θ_n) is non-decreasing over the iterations.

43 / 72

Notes and Convergence of EM

Properties
When the algorithm reaches a fixed point for some Θ_n, the value Θ_n maximizes l(Θ|Θ_n).
Because at that point L and l are equal at Θ_n ⟹ if L and l are differentiable at Θ_n, then Θ_n is a stationary point of L, i.e. the derivative of L vanishes at that point (possibly a saddle point).

For more on this
Geoffrey McLachlan and Thriyambakam Krishnan, "The EM Algorithm and Extensions," John Wiley & Sons, New York, 1996.

44 / 72

Finding Maximum Likelihood Mixture Densities Parameters via EM

Something Notable
The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community.

We have

p(x|Θ) = Σ_{i=1}^{M} α_i p_i(x|θ_i)    (28)

where
1 Θ = (α₁, ..., α_M, θ₁, ..., θ_M)
2 Σ_{i=1}^{M} α_i = 1
3 Each p_i is a density function parametrized by θ_i.

45 / 72

A log-likelihood for this function

We have

log L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = Σ_{i=1}^{N} log( Σ_{j=1}^{M} α_j p_j(x_i|θ_j) )    (29)

Note: This is too difficult to optimize due to the log of a sum.

However
We can simplify this by assuming the following:
1 We assume that the unobserved data Y = {y_i}_{i=1}^{N} takes values y_i ∈ {1, ..., M}.
2 y_i = k if the ith sample was generated by the kth mixture component.

46 / 72
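For concreteness, here is a small NumPy/SciPy sketch (not in the slides) that evaluates equation (29) for one-dimensional Gaussian components; the array names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(X, alphas, mus, sigmas):
    """log L(Theta|X) = sum_i log sum_j alpha_j p_j(x_i | theta_j), 1-D Gaussian components."""
    # dens[i, j] = alpha_j * p_j(x_i | theta_j)
    dens = alphas[None, :] * norm.pdf(X[:, None], loc=mus[None, :], scale=sigmas[None, :])
    return np.sum(np.log(dens.sum(axis=1)))   # the log of a sum is what makes direct optimization hard
```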

Now

We have

log L(Θ|X, Y) = log[ P(X, Y|Θ) ]    (30)

Thus

log[ P(X, Y|Θ) ] = Σ_{i=1}^{N} log[ P(x_i|y_i, θ_{y_i}) P(y_i|θ_{y_i}) ]    (31)

Question: Do you need y_i if you know θ_{y_i}, or the other way around?

Finally

log L(Θ|X, Y) = Σ_{i=1}^{N} log[ P(x_i|y_i, θ_{y_i}) P(y_i) ] = Σ_{i=1}^{N} log[ α_{y_i} p_{y_i}(x_i|θ_{y_i}) ]    (32)

NOPE: You do not need y_i if you know θ_{y_i}, or the other way around.

47 / 72

Problem

Which?
We do not know the values of Y.

We can get around this by using the following idea
Assume that Y is a random variable.

48 / 72

Thus

You make a first guess for the parameters

Θ^g = (α₁^g, ..., α_M^g, θ₁^g, ..., θ_M^g)    (33)

Then, it is possible to calculate p_j(x_i|θ_j^g).

In addition
The mixing parameters α_j can be thought of as the prior probabilities of each mixture component:

α_j = p(component j)    (34)

49 / 72

Using Bayes' Rule

Compute

p(y_i|x_i, Θ^g) = p(y_i, x_i|Θ^g) / p(x_i|Θ^g)
                = p(x_i|y_i, θ_{y_i}^g) p(y_i|θ_{y_i}^g) / p(x_i|Θ^g)     (we know θ_{y_i}^g)
                = α_{y_i}^g p_{y_i}(x_i|θ_{y_i}^g) / p(x_i|Θ^g)
                = α_{y_i}^g p_{y_i}(x_i|θ_{y_i}^g) / Σ_{k=1}^{M} α_k^g p_k(x_i|θ_k^g)

In addition

p(y|X, Θ^g) = ∏_{i=1}^{N} p(y_i|x_i, Θ^g)    (35)

where y = (y₁, y₂, ..., y_N).

51 / 72
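A possible NumPy/SciPy transcription of this E-step for one-dimensional Gaussian components is sketched below (the slides do not prescribe an implementation; the function name is illustrative).

```python
import numpy as np
from scipy.stats import norm

def responsibilities(X, alphas, mus, sigmas):
    """p(l | x_i, Theta^g) for every sample i and component l (1-D Gaussian mixture)."""
    dens = alphas[None, :] * norm.pdf(X[:, None], loc=mus[None, :], scale=sigmas[None, :])
    return dens / dens.sum(axis=1, keepdims=True)   # normalize over the M components
```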

Now, using equation 17

Then

Q(Θ|Θ^g) = Σ_{y∈𝒴} log( L(Θ|X, y) ) p(y|X, Θ^g)

         = Σ_{y∈𝒴} Σ_{i=1}^{N} log[ α_{y_i} p_{y_i}(x_i|θ_{y_i}) ] ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

         = Σ_{y₁=1}^{M} Σ_{y₂=1}^{M} ··· Σ_{y_N=1}^{M} Σ_{i=1}^{N} log[ α_{y_i} p_{y_i}(x_i|θ_{y_i}) ] ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

         = Σ_{y₁=1}^{M} ··· Σ_{y_N=1}^{M} Σ_{i=1}^{N} Σ_{l=1}^{M} δ_{l,y_i} log[ α_l p_l(x_i|θ_l) ] ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

         = Σ_{i=1}^{N} Σ_{l=1}^{M} log[ α_l p_l(x_i|θ_l) ] Σ_{y₁=1}^{M} ··· Σ_{y_N=1}^{M} δ_{l,y_i} ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

52 / 72

It looks quite daunting, but...

First notice the following

Σ_{y₁=1}^{M} Σ_{y₂=1}^{M} ··· Σ_{y_N=1}^{M} δ_{l,y_i} ∏_{j=1}^{N} p(y_j|x_j, Θ^g)
    = ( Σ_{y₁=1}^{M} ··· Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} ··· Σ_{y_N=1}^{M} ∏_{j=1, j≠i}^{N} p(y_j|x_j, Θ^g) ) p(l|x_i, Θ^g)
    = ( ∏_{j=1, j≠i}^{N} Σ_{y_j=1}^{M} p(y_j|x_j, Θ^g) ) p(l|x_i, Θ^g)
    = p(l|x_i, Θ^g) = α_l^g p_l(x_i|θ_l^g) / Σ_{k=1}^{M} α_k^g p_k(x_i|θ_k^g)

because

Σ_{k=1}^{M} p(k|x_j, Θ^g) = 1    (36)

53 / 72

Thus

We can write Q in the following way

Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{l=1}^{M} log[ α_l p_l(x_i|θ_l) ] p(l|x_i, Θ^g)
          = Σ_{i=1}^{N} Σ_{l=1}^{M} log(α_l) p(l|x_i, Θ^g) + Σ_{i=1}^{N} Σ_{l=1}^{M} log( p_l(x_i|θ_l) ) p(l|x_i, Θ^g)    (37)

54 / 72

Lagrange Multipliers

To maximize this expression
We can maximize the term containing α_l and the term containing θ_l.

This can be done
In an independent way, since they are not related.

56 / 72

Lagrange Multipliers

We can use the following constraint for that

Σ_l α_l = 1    (38)

We have the following cost function

Q(Θ, Θ^g) + λ ( Σ_l α_l − 1 )    (39)

Deriving with respect to α_l

∂/∂α_l [ Q(Θ, Θ^g) + λ ( Σ_l α_l − 1 ) ] = 0    (40)

57 / 72

Lagrange Multipliers

We have that

Σ_{i=1}^{N} (1/α_l) p(l|x_i, Θ^g) + λ = 0    (41)

Thus

Σ_{i=1}^{N} p(l|x_i, Θ^g) = −λ α_l    (42)

Summing over l, we get

λ = −N    (43)

58 / 72

Lagrange Multipliers

Thus

α_l = (1/N) Σ_{i=1}^{N} p(l|x_i, Θ^g)    (44)

About θ_l
It is possible to get analytical expressions for θ_l as functions of everything else.
This is for you in a homework!!!

For more, look at
"Geometric Idea of Lagrange Multipliers" by John Wyatt.

59 / 72

Remember?

Gaussian Distribution

p_l(x|μ_l, Σ_l) = [ 1 / ( (2π)^{d/2} |Σ_l|^{1/2} ) ] exp{ −(1/2) (x − μ_l)^T Σ_l^{−1} (x − μ_l) }    (45)

61 / 72
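The density (45) can be evaluated directly; the NumPy sketch below (an illustration, not part of the slides) computes it for a single point x.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density of equation (45) at a single point x."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                  # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const
```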

How to use this for Gaussian Distributions

For this, we need to refresh some linear algebra
1 tr(A + B) = tr(A) + tr(B)
2 tr(AB) = tr(BA)
3 Σ_i x_i^T A x_i = tr(AB), where B = Σ_i x_i x_i^T
4 |A^{−1}| = 1 / |A|

Now, we need the derivative of a matrix function f(A)
Thus, ∂f(A)/∂A is the matrix whose (i, j)th entry is [∂f(A)/∂a_{i,j}], where a_{i,j} is the (i, j)th entry of A.

62 / 72

In addition

If A is symmetric

∂|A|/∂A = { A_{i,j}   if i = j
          { 2A_{i,j}  if i ≠ j    (46)

where A_{i,j} is the (i, j)th cofactor of A.
Note: The cofactor is the determinant obtained by deleting the row and column of a given element of a matrix or determinant, preceded by a + or − sign depending on whether the element is in a + or − position.

Thus

∂ log|A| / ∂A = { A_{i,j}/|A|   if i = j
                { 2A_{i,j}/|A|  if i ≠ j    = 2A^{−1} − diag(A^{−1})    (47)

63 / 72

Finally

The last equation we need

∂ tr(AB) / ∂A = B + B^T − diag(B)    (48)

In addition

∂ (x^T A x) / ∂x = (A + A^T) x    (49)

64 / 72

Thus, using the last part of equation 37

We get, after ignoring constant terms
Remember, they disappear after taking derivatives

Σ_{i=1}^{N} Σ_{l=1}^{M} log( p_l(x_i|μ_l, Σ_l) ) p(l|x_i, Θ^g)
    = Σ_{i=1}^{N} Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) − (1/2) (x_i − μ_l)^T Σ_l^{−1} (x_i − μ_l) ] p(l|x_i, Θ^g)    (50)

Thus, when taking the derivative with respect to μ_l

Σ_{i=1}^{N} [ Σ_l^{−1} (x_i − μ_l) p(l|x_i, Θ^g) ] = 0    (51)

Then

μ_l = Σ_{i=1}^{N} x_i p(l|x_i, Θ^g) / Σ_{i=1}^{N} p(l|x_i, Θ^g)    (52)

65 / 72

Now, if we derive with respect to Σ_l

First, we rewrite equation 50

Σ_{i=1}^{N} Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) − (1/2) (x_i − μ_l)^T Σ_l^{−1} (x_i − μ_l) ] p(l|x_i, Θ^g)
  = Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) Σ_{i=1}^{N} p(l|x_i, Θ^g) − (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) tr{ Σ_l^{−1} (x_i − μ_l)(x_i − μ_l)^T } ]
  = Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) Σ_{i=1}^{N} p(l|x_i, Θ^g) − (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) tr{ Σ_l^{−1} N_{l,i} } ]

where N_{l,i} = (x_i − μ_l)(x_i − μ_l)^T.

66 / 72

Deriving with respect to Σ_l^{−1}

We have that

∂/∂Σ_l^{−1} Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) Σ_{i=1}^{N} p(l|x_i, Θ^g) − (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) tr{ Σ_l^{−1} N_{l,i} } ]
  = (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) ( 2Σ_l − diag(Σ_l) ) − (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) ( 2N_{l,i} − diag(N_{l,i}) )
  = (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) ( 2M_{l,i} − diag(M_{l,i}) )
  = 2S − diag(S)

where M_{l,i} = Σ_l − N_{l,i} and S = (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) M_{l,i}.

67 / 72

Thus, we have

If 2S − diag(S) = 0, then S = 0

Implying

(1/2) ∑_{i=1}^{N} p(l|xi, Θg) [Σl − Nl,i] = 0   (53)

Or

Σl = ∑_{i=1}^{N} p(l|xi, Θg) Nl,i / ∑_{i=1}^{N} p(l|xi, Θg)
   = ∑_{i=1}^{N} p(l|xi, Θg)(xi − µl)(xi − µl)^T / ∑_{i=1}^{N} p(l|xi, Θg)   (54)

68 / 72
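As with the mean, equation 54 is a responsibility-weighted average of outer products. A minimal NumPy sketch for one component l (X, r and the variable names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))                  # toy data
r = rng.uniform(size=100)                      # stand-in for p(l | x_i, Theta^g)

mu_l = (r[:, None] * X).sum(axis=0) / r.sum()  # equation 52
D = X - mu_l                                   # centered data
Sigma_l = (r[:, None, None] * np.einsum('ni,nj->nij', D, D)).sum(axis=0) / r.sum()  # equation 54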


Thus, we have the iterative updates

They are

αl^{New} = (1/N) ∑_{i=1}^{N} p(l|xi, Θg)

µl^{New} = ∑_{i=1}^{N} xi p(l|xi, Θg) / ∑_{i=1}^{N} p(l|xi, Θg)

Σl^{New} = ∑_{i=1}^{N} p(l|xi, Θg)(xi − µl)(xi − µl)^T / ∑_{i=1}^{N} p(l|xi, Θg)

69 / 72
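Taken together, the three updates form the M-step of the algorithm on the next slide. A compact, vectorized sketch (assuming X is the N × d data matrix and R is the N × M matrix of posteriors p(l|xi, Θg); the function name and layout are illustrative, not a prescribed implementation):

import numpy as np

def m_step(X, R):
    """One M-step of EM for a Gaussian mixture.

    X : (N, d) data matrix.
    R : (N, M) responsibilities, R[i, l] = p(l | x_i, Theta^g).
    Returns updated mixing weights (M,), means (M, d) and covariances (M, d, d).
    """
    N, d = X.shape
    Nl = R.sum(axis=0)                        # effective number of points per component
    alpha = Nl / N                            # alpha_l^New
    mu = (R.T @ X) / Nl[:, None]              # mu_l^New  (equation 52)
    diff = X[:, None, :] - mu[None, :, :]     # (N, M, d) centered data
    Sigma = np.einsum('nl,nli,nlj->lij', R, diff, diff) / Nl[:, None, None]  # Sigma_l^New (equation 54)
    return alpha, mu, Sigma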



EM for Gaussian Mixtures

1. Initialize the means µl, covariances Σl and mixing coefficients αl, and evaluate the initial value of the log likelihood.

2. E-Step: Evaluate the probabilities of component l given xi using the current parameter values:

   p(l|xi, Θg) = αl^g pl(xi|θl^g) / ∑_{k=1}^{M} αk^g pk(xi|θk^g)

3. M-Step: Re-estimate the parameters using the current iteration values:

   αl^{New} = (1/N) ∑_{i=1}^{N} p(l|xi, Θg)

   µl^{New} = ∑_{i=1}^{N} xi p(l|xi, Θg) / ∑_{i=1}^{N} p(l|xi, Θg)

   Σl^{New} = ∑_{i=1}^{N} p(l|xi, Θg)(xi − µl)(xi − µl)^T / ∑_{i=1}^{N} p(l|xi, Θg)

4. Evaluate the log likelihood:

   log p(X|µ, Σ, α) = ∑_{i=1}^{N} log{∑_{l=1}^{M} αl pl(xi|µl, Σl)}

   Check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

71 / 72
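For completeness, here is a self-contained sketch of the whole loop (a hypothetical NumPy implementation under the slide's assumptions; in practice one would regularize Σl and compute the E-step with log-sum-exp for numerical stability):

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at the rows of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, M, n_iter=200, tol=1e-6, seed=0):
    """EM for a mixture of M Gaussians; stops when the log likelihood stalls."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # 1. Initialization: means at random data points, identity covariances, uniform weights.
    mu = X[rng.choice(N, size=M, replace=False)]
    Sigma = np.stack([np.eye(d)] * M)
    alpha = np.full(M, 1.0 / M)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. E-step: responsibilities p(l | x_i, Theta^g).
        dens = np.stack([alpha[l] * gaussian_pdf(X, mu[l], Sigma[l]) for l in range(M)], axis=1)
        R = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: re-estimate alpha, mu, Sigma from the responsibilities.
        Nl = R.sum(axis=0)
        alpha = Nl / N
        mu = (R.T @ X) / Nl[:, None]
        diff = X[:, None, :] - mu[None, :, :]
        Sigma = np.einsum('nl,nli,nlj->lij', R, diff, diff) / Nl[:, None, None]
        # 4. Log likelihood and convergence check.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return alpha, mu, Sigma, ll

# Toy usage: two well-separated clusters.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 2)), rng.normal(3.0, 1.0, size=(200, 2))])
alpha, mu, Sigma, ll = em_gmm(X, M=2)
print(alpha)
print(mu)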


Exercises

Duda and Hart, Chapter 3
3.39, 3.40, 3.41

72 / 72