07 Machine Learning - Expectation Maximization

Machine Learning for Data Mining: Expectation Maximization

Andres Mendez-Vazquez

July 12, 2015

1 / 72

Outline

1 Introduction
I Maximum-Likelihood

2 Incomplete Data
I Analogy

3 Derivation of the EM-Algorithm
I Hidden Features
I Using Convex functions
I The Final Algorithm
I Notes and Convergence of EM

4 Finding Maximum Likelihood Mixture Densities
I Bayes' Rule for the components
I Maximizing Q
I Gaussian Distributions
I The EM Algorithm

2 / 72

Maximum-Likelihood

We have a density function p(x|Θ). Assume that we have a data set of size N, X = {x₁, x₂, ..., x_N}.

This data is known as the evidence.

We assume in addition that
The vectors are independent and identically distributed (i.i.d.) with distribution p under parameter Θ.

4 / 72

What Can We Do With The Evidence?

We may use Bayes' Rule to estimate the parameters Θ

p(Θ|X) = P(X|Θ) P(Θ) / P(X)    (1)

Or, given a new observation x̃, compute

p(x̃|X)    (2)

i.e. the probability of the new observation being supported by the evidence X.

Thus
The former represents parameter estimation and the latter data prediction.

5 / 72

Focusing First on the Estimation of the Parameters Θ

We can interpret Bayes' Rule

p(Θ|X) = P(X|Θ) P(Θ) / P(X)    (3)

Interpreted as

posterior = (likelihood × prior) / evidence    (4)

Thus, we want the

likelihood = P(X|Θ)

6 / 72

What we want...

We want to maximize the likelihood as a function of θ

7 / 72

Maximum-Likelihood

We have

p(x₁, x₂, ..., x_N |Θ) = ∏_{i=1}^{N} p(x_i|Θ)    (5)

Also known as the likelihood function.

Problematic
Because the multiplication of numbers less than one can be numerically problematic, we work with the log-likelihood

L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = Σ_{i=1}^{N} log p(x_i|Θ)    (6)

8 / 72

Maximum-Likelihood

We want to find a Θ*

Θ* = argmax_Θ L(Θ|X)    (7)

The classic method

∂L(Θ|X) / ∂θ_i = 0   ∀ θ_i ∈ Θ    (8)

9 / 72
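As a side note not on the original slides, the following minimal Python sketch (assuming NumPy and synthetic data) illustrates equations (6)-(8) for a univariate Gaussian, where setting the partial derivatives of the log-likelihood to zero gives the familiar closed-form estimates.

```python
import numpy as np

# A minimal sketch of maximum likelihood for a univariate Gaussian (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)      # the evidence X

mu_hat = X.mean()                                  # solves dL/dmu = 0
sigma2_hat = ((X - mu_hat) ** 2).mean()            # solves dL/dsigma^2 = 0 (biased MLE)

def log_likelihood(mu, sigma2, X):
    """L(Theta|X) = sum_i log p(x_i | Theta) for a Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (X - mu) ** 2 / (2 * sigma2))

print(mu_hat, sigma2_hat, log_likelihood(mu_hat, sigma2_hat, X))
```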

What happens if we have incomplete data?

The data could have been split into
1 X = observed data or incomplete data
2 Y = unobserved data

For this type of problem
We have the famous Expectation Maximization (EM) algorithm.

10 / 72

The Expectation Maximization

The EM algorithm
It was first developed by Dempster et al. (1977).

Its popularity comes from the fact
It can estimate an underlying distribution when data is incomplete or has missing values.

Two main applications
1 When missing values exist.
2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden.

11 / 72

Incomplete Data

We assume the following
Two parts of data
1 X = observed data or incomplete data
2 Y = unobserved data

Thus

Z = (X, Y) = Complete Data    (9)

Thus, we have the following probability

p(z|Θ) = p(x, y|Θ) = p(y|x, Θ) p(x|Θ)    (10)

12 / 72

New Likelihood Function

The New Likelihood Function

L(Θ|Z) = L(Θ|X, Y) = p(X, Y|Θ)    (11)

Note: This is the complete-data likelihood.

Thus, we have

L(Θ|X, Y) = p(X, Y|Θ) = p(Y|X, Θ) p(X|Θ)    (12)

Did you notice?
p(X|Θ) is the likelihood of the observed data.
p(Y|X, Θ) is the likelihood of the unobserved data given the observed data!!!

13 / 72

Rewriting

This can be rewritten as

L(Θ|X, Y) = h_{X,Θ}(Y)    (13)

This basically signifies that X and Θ are constant and the only random part is Y.

In addition

L(Θ|X)    (14)

is known as the incomplete-data likelihood function.

14 / 72

Thus

We can connect both incomplete-complete data equations by doing the following

L(Θ|X) = p(X|Θ)
       = Σ_Y p(X, Y|Θ)
       = Σ_Y p(Y|X, Θ) p(X|Θ)
       = ( ∏_{i=1}^{N} p(x_i|Θ) ) Σ_Y p(Y|X, Θ)

15 / 72

Remarks

Problems
Normally, it is almost impossible to obtain a closed analytical solution for the previous equation.

However
We can use the expected value of log p(X, Y|Θ), which allows us to find an iterative procedure to approximate the solution.

16 / 72

The function we would like to have...

The Q function
We want an estimation of the complete-data log-likelihood

log p(X, Y|Θ)    (15)

based on the information provided by X and Θ_{n−1}, where Θ_{n−1} is the set of parameters estimated at the previous step.

Think about the following: if we want to remove Y,

∫ [log p(X, Y|Θ)] p(Y|X, Θ_{n−1}) dY    (16)

Remark: We integrate out Y; actually, this is the expected value of log p(X, Y|Θ).

17 / 72

Use the Expected value

Then
We want

Q(Θ, Θ_{n−1}) = E[log p(X, Y|Θ) | X, Θ_{n−1}]    (17)

Take into account that
1 X and Θ_{n−1} are taken as constants.
2 Θ is a normal variable that we wish to adjust.
3 Y is a random variable governed by the distribution p(Y|X, Θ_{n−1}), the marginal distribution of the missing data.

18 / 72

Another Interpretation

Given the previous information

E[log p(X, Y|Θ) | X, Θ_{n−1}] = ∫_{Y ∈ 𝒴} log p(X, Y|Θ) p(Y|X, Θ_{n−1}) dY

Something Notable
1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameter Θ_{n−1}.
2 In the worst of cases, this density might be very hard to obtain.

Actually, we use

p(Y, X|Θ_{n−1}) = p(Y|X, Θ_{n−1}) p(X|Θ_{n−1})    (18)

which is not dependent on Θ.

19 / 72

Back to the Q function

The intuition
We have the following analogy. Consider h(θ, Y) a function with
I θ a constant
I Y ∼ p_Y(y), a random variable with distribution p_Y(y).

Thus, if Y is a discrete random variable

q(θ) = E_Y[h(θ, Y)] = Σ_y h(θ, y) p_Y(y)    (19)

21 / 72

Why E-step!!!

From here the name
This is basically the E-step.

The second step
It tries to maximize the Q function

Θ_n = argmax_Θ Q(Θ, Θ_{n−1})    (20)

22 / 72

Derivation of the EM-Algorithm

The likelihood function we are going to use
Let X be a random vector which results from a parametrized family:

L(Θ) = ln P(X|Θ)    (21)

Note: ln(x) is a strictly increasing function.

We wish to compute Θ
Based on an estimate Θ_n (after the nth iteration) such that L(Θ) > L(Θ_n).

Or the maximization of the difference

L(Θ) − L(Θ_n) = ln P(X|Θ) − ln P(X|Θ_n)    (22)

23 / 72

Introducing the Hidden Features

Given that the hidden random vector Y exists with values y

P(X|Θ) = Σ_y P(X|y, Θ) P(y|Θ)    (23)

Thus

L(Θ) − L(Θ_n) = ln( Σ_y P(X|y, Θ) P(y|Θ) ) − ln P(X|Θ_n)    (24)

25 / 72

Here, we introduce some concepts of convexity

Now, we have
Theorem (Jensen's inequality)
Let f be a convex function defined on an interval I. If x₁, x₂, ..., x_n ∈ I and λ₁, λ₂, ..., λ_n ≥ 0 with Σ_{i=1}^{n} λ_i = 1, then

f( Σ_{i=1}^{n} λ_i x_i ) ≤ Σ_{i=1}^{n} λ_i f(x_i)    (25)

26 / 72

Proof:

For n = 1
We have the trivial case.

For n = 2
This is the convexity definition.

Now the inductive hypothesis
We assume that the theorem is true for some n.

27 / 72

Now, we have

The following linear combination for the λ_i

f( Σ_{i=1}^{n+1} λ_i x_i ) = f( λ_{n+1} x_{n+1} + Σ_{i=1}^{n} λ_i x_i )
                           = f( λ_{n+1} x_{n+1} + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i x_i )
                           ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f( [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i x_i )

28 / 72

Did you notice?

Something Notable

Σ_{i=1}^{n+1} λ_i = 1

Thus

Σ_{i=1}^{n} λ_i = 1 − λ_{n+1}

Finally

[1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i = 1

29 / 72

Now

We have that

f( Σ_{i=1}^{n+1} λ_i x_i ) ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f( [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i x_i )
                           ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i f(x_i)
                           = λ_{n+1} f(x_{n+1}) + Σ_{i=1}^{n} λ_i f(x_i)
                           = Σ_{i=1}^{n+1} λ_i f(x_i)   Q.E.D.

30 / 72
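As a quick numerical aside (not part of the slides), the sketch below checks Jensen's inequality for the concave ln, which is exactly the form used on the next slide; it assumes NumPy and randomly drawn convex weights.

```python
import numpy as np

# Sanity check: ln(sum_i lambda_i x_i) >= sum_i lambda_i ln(x_i) for convex weights lambda.
rng = np.random.default_rng(1)
x = rng.uniform(0.1, 5.0, size=6)
lam = rng.dirichlet(np.ones(6))       # lambda_i >= 0, sum lambda_i = 1

lhs = np.log(np.dot(lam, x))
rhs = np.dot(lam, np.log(x))
assert lhs >= rhs - 1e-12             # concave case of Jensen's inequality
print(lhs, rhs)
```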

Thus...

It is possible to show that
Given that ln(x) is a concave function ⟹ ln[ Σ_{i=1}^{n} λ_i x_i ] ≥ Σ_{i=1}^{n} λ_i ln(x_i).

If we take into consideration
Assume that λ_i = P(y|X, Θ_n). We know that
1 P(y|X, Θ_n) ≥ 0
2 Σ_y P(y|X, Θ_n) = 1

31 / 72

We have then...

First

L(Θ) − L(Θ_n) = ln( Σ_y P(X|y, Θ) P(y|Θ) ) − ln P(X|Θ_n)
              = ln( Σ_y P(X|y, Θ) P(y|Θ) · [P(y|X, Θ_n) / P(y|X, Θ_n)] ) − ln P(X|Θ_n)
              = ln( Σ_y P(y|X, Θ_n) · [P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n)] ) − ln P(X|Θ_n)
              ≥ Σ_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n) ) − Σ_y P(y|X, Θ_n) ln P(X|Θ_n)   Why this?

32 / 72

Next

Because

Σ_y P(y|X, Θ_n) = 1

Then

L(Θ) − L(Θ_n) ≥ Σ_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / [P(y|X, Θ_n) P(X|Θ_n)] ) ≜ ∆(Θ|Θ_n)

33 / 72

Then, we have

Then

L(Θ) ≥ L(Θ_n) + ∆(Θ|Θ_n)    (26)

We define then

l(Θ|Θ_n) ≜ L(Θ_n) + ∆(Θ|Θ_n)    (27)

Thus l(Θ|Θ_n)
is bounded from above by L(Θ), i.e. l(Θ|Θ_n) ≤ L(Θ).

34 / 72

Now, we can do the following

We have

l(Θ_n|Θ_n) = L(Θ_n) + ∆(Θ_n|Θ_n)
           = L(Θ_n) + Σ_y P(y|X, Θ_n) ln( P(X|y, Θ_n) P(y|Θ_n) / [P(y|X, Θ_n) P(X|Θ_n)] )
           = L(Θ_n) + Σ_y P(y|X, Θ_n) ln( P(X, y|Θ_n) / P(X, y|Θ_n) )
           = L(Θ_n)

This means that
For Θ = Θ_n, the functions L(Θ) and l(Θ|Θ_n) are equal.

35 / 72

What are we looking for...

Thus, we want to
Select Θ such that l(Θ|Θ_n) is maximized, i.e. given a Θ_n we want to generate Θ_{n+1}.

The process can be seen in the following graph (figure omitted in this transcript).

37 / 72

Formally, we want

The following

Θ_{n+1} = argmax_Θ { l(Θ|Θ_n) }

        = argmax_Θ { L(Θ_n) + Σ_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / [P(y|X, Θ_n) P(X|Θ_n)] ) }

          The terms with Θ_n are constants.

        ≈ argmax_Θ { Σ_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) ) }

        = argmax_Θ { Σ_y P(y|X, Θ_n) ln( [P(X, y|Θ) / P(y|Θ)] · [P(y, Θ) / P(Θ)] ) }

        = argmax_Θ { Σ_y P(y|X, Θ_n) ln( [ (P(X, y, Θ)/P(Θ)) / (P(y, Θ)/P(Θ)) ] · [P(y, Θ) / P(Θ)] ) }

38 / 72

Thus

Then

Θ_{n+1} = argmax_Θ { Σ_y P(y|X, Θ_n) ln( [P(X, y, Θ) / P(y, Θ)] · [P(y, Θ) / P(Θ)] ) }

        = argmax_Θ { Σ_y P(y|X, Θ_n) ln( P(X, y, Θ) / P(Θ) ) }

        = argmax_Θ { Σ_y P(y|X, Θ_n) ln P(X, y|Θ) }

        = argmax_Θ { E_{y|X,Θ_n}[ ln P(X, y|Θ) ] }

Then argmax_Θ { l(Θ|Θ_n) } ≈ argmax_Θ { E_{y|X,Θ_n}[ ln P(X, y|Θ) ] }

39 / 72

The EM-Algorithm

Steps of EM
1 Expectation under the hidden variables.
2 Maximization of the resulting formula.

E-Step
Determine the conditional expectation E_{y|X,Θ_n}[ ln P(X, y|Θ) ].

M-Step
Maximize this expression with respect to Θ.

41 / 72
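The two steps can be written as a generic loop. The sketch below is a schematic Python skeleton (not from the slides); `e_step` and `m_step` are hypothetical placeholders that a concrete model, such as the Gaussian mixture developed next, has to supply.

```python
def em(X, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM skeleton (a sketch).

    e_step(X, theta) -> (stats, log_likelihood), expectations under p(y | X, theta_n)
    m_step(X, stats) -> argmax_theta Q(theta, theta_n)
    """
    theta, prev_ll = theta0, -float("inf")
    for _ in range(max_iter):
        stats, ll = e_step(X, theta)      # E-step
        theta = m_step(X, stats)          # M-step
        if abs(ll - prev_ll) < tol:       # L(theta_n) is non-decreasing, so this settles
            break
        prev_ll = ll
    return theta
```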

Notes and Convergence of EM

Gains between L(Θ) and l(Θ|Θ_n)
Using the hidden variables it is possible to simplify the optimization of L(Θ) through l(Θ|Θ_n).

Convergence
1 Remember that Θ_{n+1} is the estimate for Θ which maximizes the difference ∆(Θ|Θ_n).
2 Then, given the initial estimate of Θ by Θ_n ⇒ ∆(Θ_n|Θ_n) = 0.
3 Now, Θ_{n+1} is chosen to maximize ∆(Θ|Θ_n) ⇒ ∆(Θ_{n+1}|Θ_n) ≥ ∆(Θ_n|Θ_n) = 0.
I Thus the likelihood L(Θ_n) is non-decreasing over the iterations.

43 / 72

Notes and Convergence of EM

Properties
When the algorithm reaches a fixed point for some Θ_n, the value Θ_n maximizes l(Θ|Θ_n).
Because at that point L and l are equal at Θ_n ⟹ if L and l are differentiable at Θ_n, then Θ_n is a stationary point of L, i.e. the derivative of L vanishes at that point (possibly a saddle point).

For more on this
Geoffrey McLachlan and Thriyambakam Krishnan, "The EM Algorithm and Extensions," John Wiley & Sons, New York, 1996.

44 / 72

Finding Maximum Likelihood Mixture Densities Parameters via EM

Something Notable
The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community.

We have

p(x|Θ) = Σ_{i=1}^{M} α_i p_i(x|θ_i)    (28)

where
1 Θ = (α₁, ..., α_M, θ₁, ..., θ_M)
2 Σ_{i=1}^{M} α_i = 1
3 Each p_i is a density function parametrized by θ_i.

45 / 72

A log-likelihood for this function

We have

log L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = Σ_{i=1}^{N} log( Σ_{j=1}^{M} α_j p_j(x_i|θ_j) )    (29)

Note: This is too difficult to optimize due to the log of a sum.

However
We can simplify this by assuming the following:
1 We assume that the unobserved data Y = {y_i}_{i=1}^{N} takes values y_i ∈ {1, ..., M}.
2 y_i = k if the ith sample was generated by the kth mixture component.

46 / 72
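For concreteness, here is a small NumPy/SciPy sketch (not in the slides) that evaluates equation (29) for one-dimensional Gaussian components; the array names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(X, alphas, mus, sigmas):
    """log L(Theta|X) = sum_i log sum_j alpha_j p_j(x_i | theta_j), 1-D Gaussian components."""
    # dens[i, j] = alpha_j * p_j(x_i | theta_j)
    dens = alphas[None, :] * norm.pdf(X[:, None], loc=mus[None, :], scale=sigmas[None, :])
    return np.sum(np.log(dens.sum(axis=1)))   # the log of a sum is what makes direct optimization hard
```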

Now

We have

log L(Θ|X, Y) = log[ P(X, Y|Θ) ]    (30)

Thus

log[ P(X, Y|Θ) ] = Σ_{i=1}^{N} log[ P(x_i|y_i, θ_{y_i}) P(y_i|θ_{y_i}) ]    (31)

Question: Do you need y_i if you know θ_{y_i}, or the other way around?

Finally

log L(Θ|X, Y) = Σ_{i=1}^{N} log[ P(x_i|y_i, θ_{y_i}) P(y_i) ] = Σ_{i=1}^{N} log[ α_{y_i} p_{y_i}(x_i|θ_{y_i}) ]    (32)

NOPE: You do not need y_i if you know θ_{y_i}, or the other way around.

47 / 72

Problem

Which?
We do not know the values of Y.

We can get around this by using the following idea
Assume that Y is a random variable.

48 / 72

Thus

You make a first guess for the parameters

Θ^g = (α₁^g, ..., α_M^g, θ₁^g, ..., θ_M^g)    (33)

Then, it is possible to calculate p_j(x_i|θ_j^g).

In addition
The mixing parameters α_j can be thought of as the prior probabilities of each mixture component:

α_j = p(component j)    (34)

49 / 72

Using Bayes' Rule

Compute

p(y_i|x_i, Θ^g) = p(y_i, x_i|Θ^g) / p(x_i|Θ^g)
                = p(x_i|y_i, θ_{y_i}^g) p(y_i|θ_{y_i}^g) / p(x_i|Θ^g)     (we know θ_{y_i}^g)
                = α_{y_i}^g p_{y_i}(x_i|θ_{y_i}^g) / p(x_i|Θ^g)
                = α_{y_i}^g p_{y_i}(x_i|θ_{y_i}^g) / Σ_{k=1}^{M} α_k^g p_k(x_i|θ_k^g)

In addition

p(y|X, Θ^g) = ∏_{i=1}^{N} p(y_i|x_i, Θ^g)    (35)

where y = (y₁, y₂, ..., y_N).

51 / 72
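A possible NumPy/SciPy transcription of this E-step for one-dimensional Gaussian components is sketched below (the slides do not prescribe an implementation; the function name is illustrative).

```python
import numpy as np
from scipy.stats import norm

def responsibilities(X, alphas, mus, sigmas):
    """p(l | x_i, Theta^g) for every sample i and component l (1-D Gaussian mixture)."""
    dens = alphas[None, :] * norm.pdf(X[:, None], loc=mus[None, :], scale=sigmas[None, :])
    return dens / dens.sum(axis=1, keepdims=True)   # normalize over the M components
```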

Now, using equation 17

Then

Q(Θ|Θ^g) = Σ_{y∈𝒴} log( L(Θ|X, y) ) p(y|X, Θ^g)

         = Σ_{y∈𝒴} Σ_{i=1}^{N} log[ α_{y_i} p_{y_i}(x_i|θ_{y_i}) ] ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

         = Σ_{y₁=1}^{M} Σ_{y₂=1}^{M} ··· Σ_{y_N=1}^{M} Σ_{i=1}^{N} log[ α_{y_i} p_{y_i}(x_i|θ_{y_i}) ] ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

         = Σ_{y₁=1}^{M} ··· Σ_{y_N=1}^{M} Σ_{i=1}^{N} Σ_{l=1}^{M} δ_{l,y_i} log[ α_l p_l(x_i|θ_l) ] ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

         = Σ_{i=1}^{N} Σ_{l=1}^{M} log[ α_l p_l(x_i|θ_l) ] Σ_{y₁=1}^{M} ··· Σ_{y_N=1}^{M} δ_{l,y_i} ∏_{j=1}^{N} p(y_j|x_j, Θ^g)

52 / 72

It looks quite daunting, but...

First notice the following

Σ_{y₁=1}^{M} Σ_{y₂=1}^{M} ··· Σ_{y_N=1}^{M} δ_{l,y_i} ∏_{j=1}^{N} p(y_j|x_j, Θ^g)
    = ( Σ_{y₁=1}^{M} ··· Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} ··· Σ_{y_N=1}^{M} ∏_{j=1, j≠i}^{N} p(y_j|x_j, Θ^g) ) p(l|x_i, Θ^g)
    = ( ∏_{j=1, j≠i}^{N} Σ_{y_j=1}^{M} p(y_j|x_j, Θ^g) ) p(l|x_i, Θ^g)
    = p(l|x_i, Θ^g) = α_l^g p_l(x_i|θ_l^g) / Σ_{k=1}^{M} α_k^g p_k(x_i|θ_k^g)

because

Σ_{k=1}^{M} p(k|x_j, Θ^g) = 1    (36)

53 / 72

Thus

We can write Q in the following way

Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{l=1}^{M} log[ α_l p_l(x_i|θ_l) ] p(l|x_i, Θ^g)
          = Σ_{i=1}^{N} Σ_{l=1}^{M} log(α_l) p(l|x_i, Θ^g) + Σ_{i=1}^{N} Σ_{l=1}^{M} log( p_l(x_i|θ_l) ) p(l|x_i, Θ^g)    (37)

54 / 72

Lagrange Multipliers

To maximize this expression
We can maximize the term containing α_l and the term containing θ_l.

This can be done
In an independent way, since they are not related.

56 / 72

Lagrange Multipliers

We can use the following constraint for that

Σ_l α_l = 1    (38)

We have the following cost function

Q(Θ, Θ^g) + λ ( Σ_l α_l − 1 )    (39)

Deriving with respect to α_l

∂/∂α_l [ Q(Θ, Θ^g) + λ ( Σ_l α_l − 1 ) ] = 0    (40)

57 / 72

Lagrange Multipliers

We have that

Σ_{i=1}^{N} (1/α_l) p(l|x_i, Θ^g) + λ = 0    (41)

Thus

Σ_{i=1}^{N} p(l|x_i, Θ^g) = −λ α_l    (42)

Summing over l, we get

λ = −N    (43)

58 / 72

Lagrange Multipliers

Thus

α_l = (1/N) Σ_{i=1}^{N} p(l|x_i, Θ^g)    (44)

About θ_l
It is possible to get analytical expressions for θ_l as functions of everything else.
This is for you in a homework!!!

For more, look at
"Geometric Idea of Lagrange Multipliers" by John Wyatt.

59 / 72

Remember?

Gaussian Distribution

p_l(x|μ_l, Σ_l) = [ 1 / ( (2π)^{d/2} |Σ_l|^{1/2} ) ] exp{ −(1/2) (x − μ_l)^T Σ_l^{−1} (x − μ_l) }    (45)

61 / 72
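The density (45) can be evaluated directly; the NumPy sketch below (an illustration, not part of the slides) computes it for a single point x.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density of equation (45) at a single point x."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                  # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const
```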

How to use this for Gaussian Distributions

For this, we need to refresh some linear algebra
1 tr(A + B) = tr(A) + tr(B)
2 tr(AB) = tr(BA)
3 Σ_i x_i^T A x_i = tr(AB), where B = Σ_i x_i x_i^T
4 |A^{−1}| = 1 / |A|

Now, we need the derivative of a matrix function f(A)
Thus, ∂f(A)/∂A is the matrix whose (i, j)th entry is [∂f(A)/∂a_{i,j}], where a_{i,j} is the (i, j)th entry of A.

62 / 72

In addition

If A is symmetric

∂|A|/∂A = { A_{i,j}   if i = j
          { 2A_{i,j}  if i ≠ j    (46)

where A_{i,j} is the (i, j)th cofactor of A.
Note: The cofactor is the determinant obtained by deleting the row and column of a given element of a matrix or determinant, preceded by a + or − sign depending on whether the element is in a + or − position.

Thus

∂ log|A| / ∂A = { A_{i,j}/|A|   if i = j
                { 2A_{i,j}/|A|  if i ≠ j    = 2A^{−1} − diag(A^{−1})    (47)

63 / 72

Finally

The last equation we need

∂ tr(AB) / ∂A = B + B^T − diag(B)    (48)

In addition

∂ (x^T A x) / ∂x = (A + A^T) x    (49)

64 / 72

Thus, using the last part of equation 37

We get, after ignoring constant terms
Remember, they disappear after taking derivatives

Σ_{i=1}^{N} Σ_{l=1}^{M} log( p_l(x_i|μ_l, Σ_l) ) p(l|x_i, Θ^g)
    = Σ_{i=1}^{N} Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) − (1/2) (x_i − μ_l)^T Σ_l^{−1} (x_i − μ_l) ] p(l|x_i, Θ^g)    (50)

Thus, when taking the derivative with respect to μ_l

Σ_{i=1}^{N} [ Σ_l^{−1} (x_i − μ_l) p(l|x_i, Θ^g) ] = 0    (51)

Then

μ_l = Σ_{i=1}^{N} x_i p(l|x_i, Θ^g) / Σ_{i=1}^{N} p(l|x_i, Θ^g)    (52)

65 / 72

Now, if we derive with respect to Σ_l

First, we rewrite equation 50

Σ_{i=1}^{N} Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) − (1/2) (x_i − μ_l)^T Σ_l^{−1} (x_i − μ_l) ] p(l|x_i, Θ^g)
  = Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) Σ_{i=1}^{N} p(l|x_i, Θ^g) − (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) tr{ Σ_l^{−1} (x_i − μ_l)(x_i − μ_l)^T } ]
  = Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) Σ_{i=1}^{N} p(l|x_i, Θ^g) − (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) tr{ Σ_l^{−1} N_{l,i} } ]

where N_{l,i} = (x_i − μ_l)(x_i − μ_l)^T.

66 / 72

Deriving with respect to Σ_l^{−1}

We have that

∂/∂Σ_l^{−1} Σ_{l=1}^{M} [ −(1/2) log(|Σ_l|) Σ_{i=1}^{N} p(l|x_i, Θ^g) − (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) tr{ Σ_l^{−1} N_{l,i} } ]
  = (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) ( 2Σ_l − diag(Σ_l) ) − (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) ( 2N_{l,i} − diag(N_{l,i}) )
  = (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) ( 2M_{l,i} − diag(M_{l,i}) )
  = 2S − diag(S)

where M_{l,i} = Σ_l − N_{l,i} and S = (1/2) Σ_{i=1}^{N} p(l|x_i, Θ^g) M_{l,i}.

67 / 72

Thus, we have

If 2S − diag(S) = 0, then S = 0

Implying

(1/2) ∑_{i=1}^{N} p(l|xi, Θg) [Σl − Nl,i] = 0   (53)

Or

Σl = ∑_{i=1}^{N} p(l|xi, Θg) Nl,i / ∑_{i=1}^{N} p(l|xi, Θg)
   = ∑_{i=1}^{N} p(l|xi, Θg)(xi − µl)(xi − µl)^T / ∑_{i=1}^{N} p(l|xi, Θg)   (54)

68 / 72
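As with the mean, equation 54 is a responsibility-weighted average of outer products. A minimal NumPy sketch for one component l (X, r and the variable names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))                  # toy data
r = rng.uniform(size=100)                      # stand-in for p(l | x_i, Theta^g)

mu_l = (r[:, None] * X).sum(axis=0) / r.sum()  # equation 52
D = X - mu_l                                   # centered data
Sigma_l = (r[:, None, None] * np.einsum('ni,nj->nij', D, D)).sum(axis=0) / r.sum()  # equation 54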


Thus, we have the iterative updates

They are

αl^{New} = (1/N) ∑_{i=1}^{N} p(l|xi, Θg)

µl^{New} = ∑_{i=1}^{N} xi p(l|xi, Θg) / ∑_{i=1}^{N} p(l|xi, Θg)

Σl^{New} = ∑_{i=1}^{N} p(l|xi, Θg)(xi − µl)(xi − µl)^T / ∑_{i=1}^{N} p(l|xi, Θg)

69 / 72
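Taken together, the three updates form the M-step of the algorithm on the next slide. A compact, vectorized sketch (assuming X is the N × d data matrix and R is the N × M matrix of posteriors p(l|xi, Θg); the function name and layout are illustrative, not a prescribed implementation):

import numpy as np

def m_step(X, R):
    """One M-step of EM for a Gaussian mixture.

    X : (N, d) data matrix.
    R : (N, M) responsibilities, R[i, l] = p(l | x_i, Theta^g).
    Returns updated mixing weights (M,), means (M, d) and covariances (M, d, d).
    """
    N, d = X.shape
    Nl = R.sum(axis=0)                        # effective number of points per component
    alpha = Nl / N                            # alpha_l^New
    mu = (R.T @ X) / Nl[:, None]              # mu_l^New  (equation 52)
    diff = X[:, None, :] - mu[None, :, :]     # (N, M, d) centered data
    Sigma = np.einsum('nl,nli,nlj->lij', R, diff, diff) / Nl[:, None, None]  # Sigma_l^New (equation 54)
    return alpha, mu, Sigma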



EM for Gaussian Mixtures

1. Initialize the means µl, covariances Σl and mixing coefficients αl, and evaluate the initial value of the log likelihood.

2. E-Step: Evaluate the probabilities of component l given xi using the current parameter values:

   p(l|xi, Θg) = αl^g pl(xi|θl^g) / ∑_{k=1}^{M} αk^g pk(xi|θk^g)

3. M-Step: Re-estimate the parameters using the current iteration values:

   αl^{New} = (1/N) ∑_{i=1}^{N} p(l|xi, Θg)

   µl^{New} = ∑_{i=1}^{N} xi p(l|xi, Θg) / ∑_{i=1}^{N} p(l|xi, Θg)

   Σl^{New} = ∑_{i=1}^{N} p(l|xi, Θg)(xi − µl)(xi − µl)^T / ∑_{i=1}^{N} p(l|xi, Θg)

4. Evaluate the log likelihood:

   log p(X|µ, Σ, α) = ∑_{i=1}^{N} log{∑_{l=1}^{M} αl pl(xi|µl, Σl)}

   Check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

71 / 72
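For completeness, here is a self-contained sketch of the whole loop (a hypothetical NumPy implementation under the slide's assumptions; in practice one would regularize Σl and compute the E-step with log-sum-exp for numerical stability):

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at the rows of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, M, n_iter=200, tol=1e-6, seed=0):
    """EM for a mixture of M Gaussians; stops when the log likelihood stalls."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # 1. Initialization: means at random data points, identity covariances, uniform weights.
    mu = X[rng.choice(N, size=M, replace=False)]
    Sigma = np.stack([np.eye(d)] * M)
    alpha = np.full(M, 1.0 / M)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. E-step: responsibilities p(l | x_i, Theta^g).
        dens = np.stack([alpha[l] * gaussian_pdf(X, mu[l], Sigma[l]) for l in range(M)], axis=1)
        R = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: re-estimate alpha, mu, Sigma from the responsibilities.
        Nl = R.sum(axis=0)
        alpha = Nl / N
        mu = (R.T @ X) / Nl[:, None]
        diff = X[:, None, :] - mu[None, :, :]
        Sigma = np.einsum('nl,nli,nlj->lij', R, diff, diff) / Nl[:, None, None]
        # 4. Log likelihood and convergence check.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return alpha, mu, Sigma, ll

# Toy usage: two well-separated clusters.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 2)), rng.normal(3.0, 1.0, size=(200, 2))])
alpha, mu, Sigma, ll = em_gmm(X, M=2)
print(alpha)
print(mu)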


Exercises

Duda and Hart, Chapter 3
3.39, 3.40, 3.41

72 / 72