236607 Visual Recognition Tutorial

Tutorial 3

• Maximum likelihood – an example

• Maximum likelihood – another example

• Bayesian estimation

• EM for a mixture model

• EM Algorithm General Setting

• Jensen’s inequality


Bayesian Estimation: General Theory

• Bayesian learning considers $\theta$ (the parameter vector to be estimated) to be a random variable.
• Before we observe the data, the parameters are described by a prior, which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.


Bayesian parametric estimation

• Density function for $x$, given the training data set $X^{(n)} = \{x_1,\dots,x_N\}$ (as defined in Lecture 2):
$$p(x \mid X^{(n)}) = \int p(x, \theta \mid X^{(n)})\, d\theta.$$
• From the definition of conditional probability densities,
$$p(x, \theta \mid X^{(n)}) = p(x \mid \theta, X^{(n)})\, p(\theta \mid X^{(n)}).$$
• The first factor is independent of $X^{(n)}$, since it is just our assumed form for the parameterized density:
$$p(x \mid \theta, X^{(n)}) = p(x \mid \theta).$$
• Therefore
$$p(x \mid X^{(n)}) = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta.$$


Bayesian parametric estimation

• Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$.
• If the weighting factor $p(\theta \mid X^{(n)})$, which is the posterior of $\theta$, peaks very sharply about some value $\hat\theta$, we obtain
$$p(x \mid X^{(n)}) \approx p(x \mid \hat\theta).$$
• Thus the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$.


Bayesian decision making

Suppose we know the distribution of possible values of $\theta$, that is, a prior $p_0(\theta)$.
Suppose we also have a loss function $\lambda(\hat\theta, \theta)$ which measures the penalty for estimating $\hat\theta$ when the actual value is $\theta$.
Then we may formulate the estimation problem as Bayesian decision making: choose the value of $\hat\theta$ which minimizes the risk
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, \lambda(\hat\theta, \theta)\, d\theta.$$
Note that the loss function is usually continuous.


Maximum A-Posteriori (MAP) Estimation

Let us look at the 0/1 loss
$$\lambda(\hat\theta, \theta) = \begin{cases} 0 & \text{if } \hat\theta = \theta, \\ 1 & \text{if } \hat\theta \neq \theta. \end{cases}$$
Here the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$. This "most likely value" is given by
$$\hat\theta = \arg\max_{\theta} p(\theta \mid X^{(n)})
= \arg\max_{\theta} \frac{p_0(\theta)\, p(X^{(n)} \mid \theta)}{p(X^{(n)})}
= \arg\max_{\theta} \frac{p_0(\theta)\, p(X^{(n)} \mid \theta)}{\int p(X^{(n)} \mid \theta')\, p_0(\theta')\, d\theta'}.$$
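In practice, for a one-dimensional $\theta$ the MAP estimate can be found by a grid search over the unnormalized log-posterior $\log p_0(\theta) + \log p(X^{(n)} \mid \theta)$. Below is a minimal sketch, assuming a Gaussian likelihood with known variance and a Gaussian prior on its mean; this model choice and the helper name `grid_map` are ours, not from the tutorial.

```python
import numpy as np

def grid_map(x, sigma=1.0, mu0=0.0, tau=10.0):
    """MAP of the mean theta by grid search, assuming x_i ~ N(theta, sigma^2) and theta ~ N(mu0, tau^2)."""
    grid = np.linspace(x.mean() - 5, x.mean() + 5, 10001)
    log_prior = -0.5 * ((grid - mu0) / tau) ** 2                   # log p0(theta), up to a constant
    log_lik = -0.5 * ((x[:, None] - grid[None, :]) / sigma) ** 2   # log p(x_i | theta), up to constants
    log_post = log_prior + log_lik.sum(axis=0)                     # unnormalized log posterior
    return grid[np.argmax(log_post)]

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)
print(grid_map(x))   # close to the sample mean, slightly shrunk toward mu0
```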


Maximum A-Posteriori (MAP) Estimation

• Since the data are i.i.d.,
$$p(X^{(n)} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
• We can disregard the normalizing factor $p(X^{(n)})$ when looking for the maximum.


MAP - continued

So, the $\hat\theta$ we are looking for is
$$\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta}\; p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)
= \arg\max_{\theta}\; \log\Big[ p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \Big]
\quad (\text{log is monotonically increasing})$$
$$= \arg\max_{\theta}\; \Big[ \log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta) \Big].$$


Maximum likelihood

In the MAP estimator, the larger $n$ (the size of the data) is, the less important $\log p_0(\theta)$ is in the expression
$$\log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
This can motivate us to omit the prior. What we get is the maximum likelihood (ML) method.
Informally: we don't use any prior knowledge about the parameters; we seek those values that "explain" the data in the best way:
$$\hat\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
$\log p(X^{(n)} \mid \theta)$ is the log-likelihood of $\theta$ with respect to $X^{(n)}$.
We seek a maximum of the likelihood function, of the log-likelihood, or of any monotonically increasing function of them.


Maximum likelihood – an example

Let us find the ML estimator for the parameter $\theta$ of the exponential density function
$$p(x \mid \theta) = \frac{1}{\theta} \exp\left(-\frac{x}{\theta}\right):$$
$$\hat\theta = \arg\max_{\theta} p(X^{(n)} \mid \theta)
= \arg\max_{\theta} \prod_{i=1}^{n} \frac{1}{\theta} e^{-x_i/\theta}
= \arg\max_{\theta} \sum_{i=1}^{n} \ln\!\left( \frac{1}{\theta} e^{-x_i/\theta} \right),$$
so we are actually looking for the maximum of the log-likelihood.
Observe:
$$\frac{d}{d\theta} \ln\!\left( \frac{1}{\theta} e^{-x_i/\theta} \right)
= \frac{d}{d\theta} \left( -\ln\theta - \frac{x_i}{\theta} \right)
= -\frac{1}{\theta} + \frac{x_i}{\theta^2}.$$
The maximum is achieved where
$$\sum_{i=1}^{n} \left( -\frac{1}{\theta} + \frac{x_i}{\theta^2} \right) = 0
\quad\Longrightarrow\quad
\hat\theta = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
We have got the empirical mean (average).
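A quick numerical sanity check of this example (our own illustration, not part of the tutorial): the closed-form ML estimate, the sample mean, agrees with a grid search over the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0
x = rng.exponential(scale=theta_true, size=1000)     # p(x|theta) = (1/theta) exp(-x/theta)

theta_ml = x.mean()                                   # closed-form ML estimate: the sample mean

grid = np.linspace(0.5, 10.0, 20001)                  # grid search over the log-likelihood
loglik = -len(x) * np.log(grid) - x.sum() / grid      # sum_i [ -ln(theta) - x_i/theta ]
theta_grid = grid[np.argmax(loglik)]

print(theta_ml, theta_grid)   # both close to theta_true = 3.0
```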


Maximum likelihood – another example

Let us find the ML estimator for $\theta$ in the density
$$p(x \mid \theta) = \frac{1}{2} e^{-|x-\theta|}:$$
$$\hat\theta = \arg\max_{\theta} p(X^{(n)} \mid \theta)
= \arg\max_{\theta} \prod_{i=1}^{n} \frac{1}{2} e^{-|x_i-\theta|}
= \arg\max_{\theta} \sum_{i=1}^{n} \ln\!\left( \frac{1}{2} e^{-|x_i-\theta|} \right).$$
Observe:
$$\frac{d}{d\theta} \ln\!\left( \frac{1}{2} e^{-|x_i-\theta|} \right)
= \frac{d}{d\theta} \left( -\ln 2 - |x_i-\theta| \right)
= \operatorname{sign}(x_i-\theta).$$
The maximum is at the $\theta$ where
$$\sum_{i=1}^{n} \operatorname{sign}(x_i-\theta) = 0.$$
This is the median of the sampled data.
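A similar numerical check for this example (again our own illustration): the sample median approximately maximizes the Laplace log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.0
x = rng.laplace(loc=theta_true, scale=1.0, size=1001)   # p(x|theta) = 0.5 * exp(-|x - theta|)

def laplace_loglik(theta, x):
    """sum_i ln( 0.5 * exp(-|x_i - theta|) )."""
    return -len(x) * np.log(2.0) - np.abs(x - theta).sum()

grid = np.linspace(x.min(), x.max(), 20001)
theta_grid = grid[np.argmax([laplace_loglik(t, x) for t in grid])]

print(np.median(x), theta_grid)   # the two estimates agree up to the grid resolution
```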


Bayesian estimation - revisited

We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?

Example 1: $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ ($\theta$ is unidimensional). The total Bayesian risk here is
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, |\hat\theta - \theta|\, d\theta
= \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})(\hat\theta - \theta)\, d\theta
+ \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})(\theta - \hat\theta)\, d\theta.$$
We seek its minimum:
$$\frac{dR[\hat\theta \mid X^{(n)}]}{d\hat\theta}
= \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta
- \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta.$$


Bayesian estimation - continued

At the $\hat\theta$ which is a solution we have
$$\int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta = \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta.$$
That is, for $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ the optimal Bayesian estimator for the parameter is the median of the distribution $p(\theta \mid X^{(n)})$.

Example 2: $\lambda(\hat\theta, \theta) = (\hat\theta - \theta)^2$ (squared error). The total Bayesian risk is
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, (\hat\theta - \theta)^2\, d\theta.$$
Again, in order to find the minimum, we set the derivative equal to 0:


Bayesian estimation - continued

$$\frac{dR[\hat\theta \mid X^{(n)}]}{d\hat\theta}
= 2 \int p(\theta \mid X^{(n)})(\hat\theta - \theta)\, d\theta
= 2 \hat\theta \int p(\theta \mid X^{(n)})\, d\theta - 2 \int \theta\, p(\theta \mid X^{(n)})\, d\theta
= 2 \hat\theta - 2 E[\theta \mid X^{(n)}] = 0.$$
• The optimal estimator here is the conditional expectation of $\theta$ given the data $X^{(n)}$.
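A small Monte-Carlo sketch of Examples 1 and 2 (our own illustration, with gamma-distributed samples standing in for some posterior): the sample median minimizes the average absolute loss, and the sample mean minimizes the average squared loss.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in samples for a posterior p(theta | X^(n)); a skewed gamma keeps mean != median.
theta = rng.gamma(shape=2.0, scale=1.5, size=100_000)

grid = np.linspace(0.0, 15.0, 1501)
abs_risk = np.array([np.abs(t - theta).mean() for t in grid])   # Monte-Carlo E|theta_hat - theta|
sq_risk = np.array([((t - theta) ** 2).mean() for t in grid])   # Monte-Carlo E(theta_hat - theta)^2

print(grid[np.argmin(abs_risk)], np.median(theta))   # absolute loss -> posterior median
print(grid[np.argmin(sq_risk)], theta.mean())        # squared loss  -> posterior mean
```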


Mixture Models

$$p(x \mid \theta) = \sum_{i=1}^{K} \pi_i\, p(x \mid \theta_i), \qquad \sum_{i=1}^{K} \pi_i = 1,$$
where $p(x \mid \theta_i)$ are the mixture components and $\pi_i$ are the mixing proportions.

For example, a mixture of two Gaussians:
$$p(x \mid \theta) = \pi_1 N(x \mid \mu_1, \sigma_1^2) + \pi_2 N(x \mid \mu_2, \sigma_2^2).$$
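A minimal sketch of sampling from such a two-component Gaussian mixture (the parameter values are arbitrary and chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

pi = np.array([0.3, 0.7])        # mixing proportions
mu = np.array([-2.0, 3.0])       # component means
sigma = np.array([0.5, 1.0])     # component standard deviations

N = 5000
z = rng.choice(len(pi), size=N, p=pi)        # latent component index for each sample
x = rng.normal(loc=mu[z], scale=sigma[z])    # draw from the chosen Gaussian component

print(x.mean(), (pi * mu).sum())             # the sample mean is close to the mixture mean
```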


Mixture Models

• Introduce a multinomial random variable $Z$ with components $Z_k$ (shown here for $K = 3$):
$$Z^n \in \left\{ \begin{pmatrix}1\\0\\0\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix} \right\}.$$
• $Z_k^n = 1$ if and only if $Z^n$ takes the $k$th value. Note that $\sum_k Z_k^n = 1$ by definition.


Mixture Models

$$p(x, Z_k = 1 \mid \theta) = p(x \mid Z_k = 1, \theta)\, p(Z_k = 1 \mid \theta) = \pi_k f_k(x \mid \theta_k),$$
where
$$\pi_k = p(Z_k = 1 \mid \theta), \qquad f_k(x \mid \theta_k) = p(x \mid Z_k = 1, \theta), \qquad \theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K).$$
The marginal probability of $X$ is
$$p(x \mid \theta) = \sum_{k=1}^{K} p(x, Z_k = 1 \mid \theta) = \sum_{k=1}^{K} \pi_k f_k(x \mid \theta_k).$$


Mixture Models

A mixture model as a graphical model: $Z$ is a multinomial latent variable.

Conditional probability of $Z$: define the posterior
$$p(z_k = 1 \mid x, \theta)
= \frac{p(x \mid z_k = 1, \theta)\, p(z_k = 1)}{\sum_{j=1}^{K} p(x \mid z_j = 1, \theta)\, p(z_j = 1)}
= \frac{\pi_k f_k(x \mid \theta_k)}{\sum_{j=1}^{K} \pi_j f_j(x \mid \theta_j)}.$$
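A minimal sketch of this posterior computation for Gaussian components (the function and variable names are our own):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def responsibilities(x, pi, mu, sigma):
    """p(z_k = 1 | x_n, theta) for each sample n and component k, with Gaussian f_k."""
    num = pi[None, :] * gauss_pdf(x[:, None], mu[None, :], sigma[None, :])  # pi_k f_k(x_n)
    return num / num.sum(axis=1, keepdims=True)                             # normalize over k

x = np.array([-2.1, 0.0, 2.9])
tau = responsibilities(x, pi=np.array([0.3, 0.7]),
                       mu=np.array([-2.0, 3.0]), sigma=np.array([0.5, 1.0]))
print(tau)   # each row sums to 1; the point near -2 strongly favours component 0
```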


Unconditional Mixture Models

• Conditional mixture models are used to solve regression and classification problems (supervised). They need observations of the data X and labels Y, that is, (X, Y) pairs.
• Unconditional mixture models are used to solve density estimation problems. They need only observations of the data X.
• Applications: detection of outliers, compression, unsupervised classification (clustering), …



Gaussian Mixture Models

Estimate $\theta$ from i.i.d. data $D = \{x_1, \dots, x_N\}$ by maximizing the log-likelihood
$$l(\theta \mid D) = \sum_{n} \log p(x_n \mid \theta) = \sum_{n} \log \sum_{i} \pi_i\, N(x_n \mid \mu_i, \Sigma_i),$$
where
$$N(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right).$$
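A minimal sketch of evaluating this log-likelihood for 1-D data (the helper name `gmm_loglik` and the parameter values are ours):

```python
import numpy as np

def gmm_loglik(x, pi, mu, sigma):
    """l(theta | D) = sum_n log sum_i pi_i N(x_n | mu_i, sigma_i^2), for 1-D data."""
    dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma[None, :]) ** 2) \
           / (np.sqrt(2.0 * np.pi) * sigma[None, :])          # component densities, shape (N, K)
    return np.log((pi[None, :] * dens).sum(axis=1)).sum()

rng = np.random.default_rng(3)
pi, mu, sigma = np.array([0.3, 0.7]), np.array([-2.0, 3.0]), np.array([0.5, 1.0])
z = rng.choice(2, size=5000, p=pi)
x = rng.normal(mu[z], sigma[z])

# The generating parameters should score better than a perturbed copy.
print(gmm_loglik(x, pi, mu, sigma), gmm_loglik(x, pi, mu + 1.0, sigma))
```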


The K-means algorithm

• Group the data $D = \{x_1, \dots, x_N\}$ into a set of K clusters, where K is given. Represent the $i$-th cluster by a single vector, its mean $\mu_i$, and assign each data point to the nearest mean.
• Phase 1: the values of the indicator variables $z_n^i$ are evaluated by assigning each point $x_n$ to the closest mean:
$$z_n^i = \begin{cases} 1 & \text{if } i = \arg\min_j \|x_n - \mu_j\|^2, \\ 0 & \text{otherwise.} \end{cases}$$
• Phase 2: recompute the means
$$\mu_i = \frac{\sum_n z_n^i x_n}{\sum_n z_n^i}.$$
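A minimal sketch of the two alternating phases in numpy (the initialization and stopping rule are our own choices):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate Phase 1 (assign each point to the closest mean) and Phase 2 (recompute the means)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]        # initialize the means at random data points
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances, shape (N, K)
        z = d2.argmin(axis=1)                                       # Phase 1: index of the closest mean
        new_mu = np.array([X[z == i].mean(axis=0) if np.any(z == i) else mu[i]
                           for i in range(K)])                      # Phase 2: recompute each mean
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in [(-2, 0), (2, 0), (0, 3)]])
mu, z = kmeans(X, K=3)
print(np.round(mu, 2))   # roughly (-2, 0), (2, 0), (0, 3), in some order
```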


EM Algorithm

• If $Z_n$ were observed, it would be the "class label", and the estimate of the mean would be
$$\mu_i = \frac{\sum_n z_n^i x_n}{\sum_n z_n^i}.$$
• We don't know the $Z_n$, so we replace them by their conditional expectations, conditioning on the data:
$$E[Z_n^i \mid x_n] = 1 \cdot p(Z_n^i = 1 \mid x_n) + 0 \cdot p(Z_n^i = 0 \mid x_n) = p(Z_n^i = 1 \mid x_n) \equiv \tau_n^i,$$
$$\mu_i = \frac{\sum_n \tau_n^i x_n}{\sum_n \tau_n^i}.$$
• But from (6), (7), $\tau_n^i$ depends on the parameter estimates, so we should iterate.


EM Algorithm

• Iteration formulas:
$$\tau_n^{i(t)} = \frac{\pi_i^{(t)} N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}{\sum_j \pi_j^{(t)} N(x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$$
$$\mu_i^{(t+1)} = \frac{\sum_n \tau_n^{i(t)} x_n}{\sum_n \tau_n^{i(t)}}$$
$$\Sigma_i^{(t+1)} = \frac{\sum_n \tau_n^{i(t)} (x_n - \mu_i^{(t+1)})(x_n - \mu_i^{(t+1)})^{T}}{\sum_n \tau_n^{i(t)}}$$
$$\pi_i^{(t+1)} = \frac{1}{N} \sum_n \tau_n^{i(t)}$$
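A minimal sketch of these four updates for a 1-D Gaussian mixture, i.e. with scalar variances in place of covariance matrices (initialization, iteration count, and helper names are our own choices):

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=200, seed=0):
    """EM for a 1-D Gaussian mixture: E step = responsibilities tau, M step = the update formulas above."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E step: tau[n, i] = pi_i N(x_n | mu_i, var_i) / sum_j pi_j N(x_n | mu_j, var_j)
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / var[None, :]) \
               / np.sqrt(2.0 * np.pi * var[None, :])
        tau = pi[None, :] * dens
        tau /= tau.sum(axis=1, keepdims=True)
        # M step: weighted means, variances, and mixing proportions.
        Nk = tau.sum(axis=0)
        mu = (tau * x[:, None]).sum(axis=0) / Nk
        var = (tau * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return pi, mu, var

rng = np.random.default_rng(3)
z = rng.choice(2, size=5000, p=[0.3, 0.7])
x = rng.normal(np.array([-2.0, 3.0])[z], np.array([0.5, 1.0])[z])
print(em_gmm_1d(x, K=2))   # roughly pi = (0.3, 0.7), mu = (-2, 3), var = (0.25, 1)
```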


EM Algorithm

• The expectation step is (14), the computation of the responsibilities $\tau_n^{i(t)}$.
• The maximization step is the parameter updates (15)-(17).
• What relationship does this algorithm have to the quantity we want to maximize, the log-likelihood (9)?
• Calculating the derivative of $l$ with respect to the parameters, we have
$$\frac{\partial l}{\partial \mu_i}
= \sum_n \frac{\pi_i N(x_n \mid \mu_i, \Sigma_i)}{\sum_j \pi_j N(x_n \mid \mu_j, \Sigma_j)}
\frac{\partial}{\partial \mu_i} \log N(x_n \mid \mu_i, \Sigma_i)
= \sum_n \tau_n^i \Sigma_i^{-1} (x_n - \mu_i).$$


EM Algorithm

• Setting the derivative to zero yields
$$\mu_i = \frac{\sum_n \tau_n^i x_n}{\sum_n \tau_n^i}.$$
• Analogously,
$$\Sigma_i = \frac{\sum_n \tau_n^i (x_n - \mu_i)(x_n - \mu_i)^{T}}{\sum_n \tau_n^i},$$
and the mixing proportions are
$$\pi_i = \frac{1}{N} \sum_{n=1}^{N} \tau_n^i.$$


EM General Setting

• EM is an iterative technique designed for probabilistic models.
• We have two sample spaces:
  – X, which is observed (the dataset)
  – Z, which is missing (latent)
• The probability model is $p(x, z \mid \theta)$.
• If we knew Z, we would do ML estimation by maximizing the complete log-likelihood
$$l_c(\theta) = \log p(x, z \mid \theta). \qquad (22)$$


EM General Setting

• Z is not observed, so we calculate the incomplete log-likelihood
$$l(\theta) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta). \qquad (23)$$
• Since Z is not observed, the complete log-likelihood is a random quantity and cannot be maximized directly.
• Thus we average over Z using some "averaging distribution" $q(z \mid x)$:
$$\langle l_c(\theta) \rangle_q = \sum_z q(z \mid x) \log p(x, z \mid \theta). \qquad (24)$$
• We hope that maximizing this surrogate expression will yield a value of $\theta$ which improves on the initial value of $\theta$.


EM General Setting

• The distribution $q$ can be used to obtain a lower bound on the log-likelihood:
$$l(\theta) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)
= \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}
\ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
\equiv L(q, \theta), \qquad (25)$$
where the inequality follows from Jensen's inequality, since log is concave (see the corollary at the end of this tutorial).
• EM is coordinate ascent on $L(q, \theta)$.
• At the $(t+1)$st iteration, for fixed $\theta^{(t)}$ we first maximize $L(q, \theta^{(t)})$ with respect to $q$, which yields $q^{(t+1)}$. For this $q^{(t+1)}$ we then maximize $L(q^{(t+1)}, \theta)$ with respect to $\theta$, which yields $\theta^{(t+1)}$.


EM General Setting

• E step:
$$q^{(t+1)} = \arg\max_q L(q, \theta^{(t)})$$
• M step:
$$\theta^{(t+1)} = \arg\max_{\theta} L(q^{(t+1)}, \theta)$$
• The M step is equivalently viewed as the maximization of the expected complete log-likelihood. Proof:
$$L(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
= \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x)
= \langle l_c(\theta) \rangle_q - \sum_z q(z \mid x) \log q(z \mid x). \qquad (26)$$
• The second term is independent of $\theta$. Thus maximizing $L(q, \theta)$ with respect to $\theta$ is equivalent to maximizing $\langle l_c(\theta) \rangle_q$.


EM General Setting

• The E step can be solved once and for all: the choice
$$q^{(t+1)}(z \mid x) = p(z \mid x, \theta^{(t)})$$
yields the maximum:
$$L\big( p(z \mid x, \theta^{(t)}),\, \theta^{(t)} \big)
= \sum_z p(z \mid x, \theta^{(t)}) \log \frac{p(x, z \mid \theta^{(t)})}{p(z \mid x, \theta^{(t)})}
= \sum_z p(z \mid x, \theta^{(t)}) \log p(x \mid \theta^{(t)})
= \log p(x \mid \theta^{(t)}) = l(\theta^{(t)}). \qquad (27)$$
Since $L(q, \theta) \le l(\theta)$ for every $q$ by (25), attaining equality shows that this choice indeed maximizes $L(q, \theta^{(t)})$ over $q$.


Jensen's inequality

Definition: a function $f : \mathbb{R} \to \mathbb{R}$ is convex over $(a, b)$ if
$$\forall\, x_1, x_2 \in (a, b),\ \lambda \in [0, 1]: \quad
f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
[Figure: a convex function and a concave function.]
Jensen's inequality: for a convex function $f$,
$$E[f(X)] \ge f(E[X]).$$
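A quick numerical check of the inequality (our own illustration) with the convex function $f(x) = x^2$:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=1.0, scale=2.0, size=100_000)

def f(x):
    """A convex function."""
    return x ** 2

print(f(X).mean(), f(X.mean()))   # E[f(X)] = Var(X) + E[X]^2 ~ 5  >=  f(E[X]) ~ 1
```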


Jensen's inequality

For a discrete random variable with two mass points ($E[X] = p_1 x_1 + p_2 x_2$, $p_i \in [0, 1]$), convexity gives
$$E[f(X)] = p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2) = f(E[X]).$$
Assume Jensen's inequality holds for $k - 1$ mass points. Then
$$E[f(X)] = \sum_{i=1}^{k} p_i f(x_i)
= p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} f(x_i)$$
$$\ge p_k f(x_k) + (1 - p_k)\, f\!\left( \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i \right)
\qquad \text{(due to the induction assumption)}$$
$$\ge f\!\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i \right)
\qquad \text{(due to convexity)}$$
$$= f\!\left( \sum_{i=1}^{k} p_i x_i \right) = f(E[X]).$$


Jensen's inequality corollary

• Let $\sum_j q_j = 1$, $q_j \ge 0$, and $g(j) \ge 0$.
• The function log is concave, so from Jensen's inequality we have $\log(E[g]) \ge E[\log(g)]$:
$$\log\Big( \sum_j q_j\, g(j) \Big) \ge \sum_j q_j \log\big( g(j) \big)
= \sum_j \log\big( g(j)^{q_j} \big)
= \log\Big( \prod_j g(j)^{q_j} \Big),$$
and therefore
$$\sum_j q_j\, g(j) \ge \prod_j g(j)^{q_j}.$$