236607 Visual Recognition Tutorial

Tutorial 3

• Maximum likelihood – an example

• Maximum likelihood – another example

• Bayesian estimation

• EM for a mixture model

• EM Algorithm General Setting

• Jensen’s inequality


Bayesian Estimation: General Theory

• Bayesian learning considers $\theta$ (the parameter vector to be estimated) to be a random variable.
• Before we observe the data, the parameters are described by a prior, which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.


Bayesian parametric estimation

• Density function for $x$, given the training data set $X^{(n)} = \{x_1,\dots,x_N\}$ (as defined in Lecture 2):
$$p(x \mid X^{(n)}) = \int p(x, \theta \mid X^{(n)})\, d\theta.$$
• From the definition of conditional probability densities,
$$p(x, \theta \mid X^{(n)}) = p(x \mid \theta, X^{(n)})\, p(\theta \mid X^{(n)}).$$
• The first factor is independent of $X^{(n)}$, since it is just our assumed form for the parameterized density:
$$p(x \mid \theta, X^{(n)}) = p(x \mid \theta).$$
• Therefore
$$p(x \mid X^{(n)}) = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta.$$


Bayesian parametric estimation

• Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$.
• If the weighting factor $p(\theta \mid X^{(n)})$, which is the posterior of $\theta$, peaks very sharply about some value $\hat\theta$, we obtain
$$p(x \mid X^{(n)}) \approx p(x \mid \hat\theta).$$
• Thus the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$.


Bayesian decision making

Suppose we know the distribution of possible values of $\theta$, that is, a prior $p_0(\theta)$.
Suppose we also have a loss function $\lambda(\hat\theta, \theta)$ which measures the penalty for estimating $\hat\theta$ when the actual value is $\theta$.
Then we may formulate the estimation problem as Bayesian decision making: choose the value of $\hat\theta$ which minimizes the risk
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, \lambda(\hat\theta, \theta)\, d\theta.$$
Note that the loss function is usually continuous.


Maximum A-Posteriori (MAP) Estimation

Let us look at the 0/1 loss
$$\lambda(\hat\theta, \theta) = \begin{cases} 0 & \text{if } \hat\theta = \theta, \\ 1 & \text{if } \hat\theta \neq \theta. \end{cases}$$
Here the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$. This "most likely value" is given by
$$\hat\theta = \arg\max_{\theta} p(\theta \mid X^{(n)})
= \arg\max_{\theta} \frac{p_0(\theta)\, p(X^{(n)} \mid \theta)}{p(X^{(n)})}
= \arg\max_{\theta} \frac{p_0(\theta)\, p(X^{(n)} \mid \theta)}{\int p(X^{(n)} \mid \theta')\, p_0(\theta')\, d\theta'}.$$
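In practice, for a one-dimensional $\theta$ the MAP estimate can be found by a grid search over the unnormalized log-posterior $\log p_0(\theta) + \log p(X^{(n)} \mid \theta)$. Below is a minimal sketch, assuming a Gaussian likelihood with known variance and a Gaussian prior on its mean; this model choice and the helper name `grid_map` are ours, not from the tutorial.

```python
import numpy as np

def grid_map(x, sigma=1.0, mu0=0.0, tau=10.0):
    """MAP of the mean theta by grid search, assuming x_i ~ N(theta, sigma^2) and theta ~ N(mu0, tau^2)."""
    grid = np.linspace(x.mean() - 5, x.mean() + 5, 10001)
    log_prior = -0.5 * ((grid - mu0) / tau) ** 2                   # log p0(theta), up to a constant
    log_lik = -0.5 * ((x[:, None] - grid[None, :]) / sigma) ** 2   # log p(x_i | theta), up to constants
    log_post = log_prior + log_lik.sum(axis=0)                     # unnormalized log posterior
    return grid[np.argmax(log_post)]

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)
print(grid_map(x))   # close to the sample mean, slightly shrunk toward mu0
```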


Maximum A-Posteriori (MAP) Estimation

• Since the data are i.i.d.,
$$p(X^{(n)} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
• We can disregard the normalizing factor $p(X^{(n)})$ when looking for the maximum.


MAP - continued

So, the $\hat\theta$ we are looking for is
$$\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta}\; p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)
= \arg\max_{\theta}\; \log\Big[ p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \Big]
\quad (\text{log is monotonically increasing})$$
$$= \arg\max_{\theta}\; \Big[ \log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta) \Big].$$


Maximum likelihood

In the MAP estimator, the larger $n$ (the size of the data) is, the less important $\log p_0(\theta)$ is in the expression
$$\log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
This can motivate us to omit the prior. What we get is the maximum likelihood (ML) method.
Informally: we don't use any prior knowledge about the parameters; we seek those values that "explain" the data in the best way:
$$\hat\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
$\log p(X^{(n)} \mid \theta)$ is the log-likelihood of $\theta$ with respect to $X^{(n)}$.
We seek a maximum of the likelihood function, of the log-likelihood, or of any monotonically increasing function of them.


Maximum likelihood – an example

Let us find the ML estimator for the parameter $\theta$ of the exponential density function
$$p(x \mid \theta) = \frac{1}{\theta} \exp\left(-\frac{x}{\theta}\right):$$
$$\hat\theta = \arg\max_{\theta} p(X^{(n)} \mid \theta)
= \arg\max_{\theta} \prod_{i=1}^{n} \frac{1}{\theta} e^{-x_i/\theta}
= \arg\max_{\theta} \sum_{i=1}^{n} \ln\!\left( \frac{1}{\theta} e^{-x_i/\theta} \right),$$
so we are actually looking for the maximum of the log-likelihood.
Observe:
$$\frac{d}{d\theta} \ln\!\left( \frac{1}{\theta} e^{-x_i/\theta} \right)
= \frac{d}{d\theta} \left( -\ln\theta - \frac{x_i}{\theta} \right)
= -\frac{1}{\theta} + \frac{x_i}{\theta^2}.$$
The maximum is achieved where
$$\sum_{i=1}^{n} \left( -\frac{1}{\theta} + \frac{x_i}{\theta^2} \right) = 0
\quad\Longrightarrow\quad
\hat\theta = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
We have got the empirical mean (average).
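A quick numerical sanity check of this example (our own illustration, not part of the tutorial): the closed-form ML estimate, the sample mean, agrees with a grid search over the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0
x = rng.exponential(scale=theta_true, size=1000)     # p(x|theta) = (1/theta) exp(-x/theta)

theta_ml = x.mean()                                   # closed-form ML estimate: the sample mean

grid = np.linspace(0.5, 10.0, 20001)                  # grid search over the log-likelihood
loglik = -len(x) * np.log(grid) - x.sum() / grid      # sum_i [ -ln(theta) - x_i/theta ]
theta_grid = grid[np.argmax(loglik)]

print(theta_ml, theta_grid)   # both close to theta_true = 3.0
```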


Maximum likelihood – another example

Let us find the ML estimator for $\theta$ in the density
$$p(x \mid \theta) = \frac{1}{2} e^{-|x-\theta|}:$$
$$\hat\theta = \arg\max_{\theta} p(X^{(n)} \mid \theta)
= \arg\max_{\theta} \prod_{i=1}^{n} \frac{1}{2} e^{-|x_i-\theta|}
= \arg\max_{\theta} \sum_{i=1}^{n} \ln\!\left( \frac{1}{2} e^{-|x_i-\theta|} \right).$$
Observe:
$$\frac{d}{d\theta} \ln\!\left( \frac{1}{2} e^{-|x_i-\theta|} \right)
= \frac{d}{d\theta} \left( -\ln 2 - |x_i-\theta| \right)
= \operatorname{sign}(x_i-\theta).$$
The maximum is at the $\theta$ where
$$\sum_{i=1}^{n} \operatorname{sign}(x_i-\theta) = 0.$$
This is the median of the sampled data.
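A similar numerical check for this example (again our own illustration): the sample median approximately maximizes the Laplace log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.0
x = rng.laplace(loc=theta_true, scale=1.0, size=1001)   # p(x|theta) = 0.5 * exp(-|x - theta|)

def laplace_loglik(theta, x):
    """sum_i ln( 0.5 * exp(-|x_i - theta|) )."""
    return -len(x) * np.log(2.0) - np.abs(x - theta).sum()

grid = np.linspace(x.min(), x.max(), 20001)
theta_grid = grid[np.argmax([laplace_loglik(t, x) for t in grid])]

print(np.median(x), theta_grid)   # the two estimates agree up to the grid resolution
```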


Bayesian estimation - revisited

We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?

Example 1: $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ ($\theta$ is unidimensional). The total Bayesian risk here is
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, |\hat\theta - \theta|\, d\theta
= \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})(\hat\theta - \theta)\, d\theta
+ \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})(\theta - \hat\theta)\, d\theta.$$
We seek its minimum:
$$\frac{dR[\hat\theta \mid X^{(n)}]}{d\hat\theta}
= \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta
- \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta.$$


Bayesian estimation - continued

At the $\hat\theta$ which is a solution we have
$$\int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta = \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta.$$
That is, for $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ the optimal Bayesian estimator for the parameter is the median of the distribution $p(\theta \mid X^{(n)})$.

Example 2: $\lambda(\hat\theta, \theta) = (\hat\theta - \theta)^2$ (squared error). The total Bayesian risk is
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, (\hat\theta - \theta)^2\, d\theta.$$
Again, in order to find the minimum, we set the derivative equal to 0:


Bayesian estimation - continued

$$\frac{dR[\hat\theta \mid X^{(n)}]}{d\hat\theta}
= 2 \int p(\theta \mid X^{(n)})(\hat\theta - \theta)\, d\theta
= 2 \hat\theta \int p(\theta \mid X^{(n)})\, d\theta - 2 \int \theta\, p(\theta \mid X^{(n)})\, d\theta
= 2 \hat\theta - 2 E[\theta \mid X^{(n)}] = 0.$$
• The optimal estimator here is the conditional expectation of $\theta$ given the data $X^{(n)}$.
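A small Monte-Carlo sketch of Examples 1 and 2 (our own illustration, with gamma-distributed samples standing in for some posterior): the sample median minimizes the average absolute loss, and the sample mean minimizes the average squared loss.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in samples for a posterior p(theta | X^(n)); a skewed gamma keeps mean != median.
theta = rng.gamma(shape=2.0, scale=1.5, size=100_000)

grid = np.linspace(0.0, 15.0, 1501)
abs_risk = np.array([np.abs(t - theta).mean() for t in grid])   # Monte-Carlo E|theta_hat - theta|
sq_risk = np.array([((t - theta) ** 2).mean() for t in grid])   # Monte-Carlo E(theta_hat - theta)^2

print(grid[np.argmin(abs_risk)], np.median(theta))   # absolute loss -> posterior median
print(grid[np.argmin(sq_risk)], theta.mean())        # squared loss  -> posterior mean
```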


Mixture Models

$$p(x \mid \theta) = \sum_{i=1}^{K} \pi_i\, p(x \mid \theta_i), \qquad \sum_{i=1}^{K} \pi_i = 1,$$
where $p(x \mid \theta_i)$ are the mixture components and $\pi_i$ are the mixing proportions.

For example, a mixture of two Gaussians:
$$p(x \mid \theta) = \pi_1 N(x \mid \mu_1, \sigma_1^2) + \pi_2 N(x \mid \mu_2, \sigma_2^2).$$
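A minimal sketch of sampling from such a two-component Gaussian mixture (the parameter values are arbitrary and chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

pi = np.array([0.3, 0.7])        # mixing proportions
mu = np.array([-2.0, 3.0])       # component means
sigma = np.array([0.5, 1.0])     # component standard deviations

N = 5000
z = rng.choice(len(pi), size=N, p=pi)        # latent component index for each sample
x = rng.normal(loc=mu[z], scale=sigma[z])    # draw from the chosen Gaussian component

print(x.mean(), (pi * mu).sum())             # the sample mean is close to the mixture mean
```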


Mixture Models

• Introduce a multinomial random variable $Z$ with components $Z_k$ (shown here for $K = 3$):
$$Z^n \in \left\{ \begin{pmatrix}1\\0\\0\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix} \right\}.$$
• $Z_k^n = 1$ if and only if $Z^n$ takes the $k$th value. Note that $\sum_k Z_k^n = 1$ by definition.


Mixture Models

$$p(x, Z_k = 1 \mid \theta) = p(x \mid Z_k = 1, \theta)\, p(Z_k = 1 \mid \theta) = \pi_k f_k(x \mid \theta_k),$$
where
$$\pi_k = p(Z_k = 1 \mid \theta), \qquad f_k(x \mid \theta_k) = p(x \mid Z_k = 1, \theta), \qquad \theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K).$$
The marginal probability of $X$ is
$$p(x \mid \theta) = \sum_{k=1}^{K} p(x, Z_k = 1 \mid \theta) = \sum_{k=1}^{K} \pi_k f_k(x \mid \theta_k).$$


Mixture Models

A mixture model as a graphical model: $Z$ is a multinomial latent variable.

Conditional probability of $Z$: define the posterior
$$p(z_k = 1 \mid x, \theta)
= \frac{p(x \mid z_k = 1, \theta)\, p(z_k = 1)}{\sum_{j=1}^{K} p(x \mid z_j = 1, \theta)\, p(z_j = 1)}
= \frac{\pi_k f_k(x \mid \theta_k)}{\sum_{j=1}^{K} \pi_j f_j(x \mid \theta_j)}.$$
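A minimal sketch of this posterior computation for Gaussian components (the function and variable names are our own):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def responsibilities(x, pi, mu, sigma):
    """p(z_k = 1 | x_n, theta) for each sample n and component k, with Gaussian f_k."""
    num = pi[None, :] * gauss_pdf(x[:, None], mu[None, :], sigma[None, :])  # pi_k f_k(x_n)
    return num / num.sum(axis=1, keepdims=True)                             # normalize over k

x = np.array([-2.1, 0.0, 2.9])
tau = responsibilities(x, pi=np.array([0.3, 0.7]),
                       mu=np.array([-2.0, 3.0]), sigma=np.array([0.5, 1.0]))
print(tau)   # each row sums to 1; the point near -2 strongly favours component 0
```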


Unconditional Mixture Models

• Conditional mixture models are used to solve regression and classification problems (supervised). They need observations of the data X and labels Y, that is, (X, Y) pairs.
• Unconditional mixture models are used to solve density estimation problems. They need only observations of the data X.
• Applications: detection of outliers, compression, unsupervised classification (clustering), …



Gaussian Mixture Models

Estimate $\theta$ from i.i.d. data $D = \{x_1, \dots, x_N\}$ by maximizing the log-likelihood
$$l(\theta \mid D) = \sum_{n} \log p(x_n \mid \theta) = \sum_{n} \log \sum_{i} \pi_i\, N(x_n \mid \mu_i, \Sigma_i),$$
where
$$N(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right).$$
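A minimal sketch of evaluating this log-likelihood for 1-D data (the helper name `gmm_loglik` and the parameter values are ours):

```python
import numpy as np

def gmm_loglik(x, pi, mu, sigma):
    """l(theta | D) = sum_n log sum_i pi_i N(x_n | mu_i, sigma_i^2), for 1-D data."""
    dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma[None, :]) ** 2) \
           / (np.sqrt(2.0 * np.pi) * sigma[None, :])          # component densities, shape (N, K)
    return np.log((pi[None, :] * dens).sum(axis=1)).sum()

rng = np.random.default_rng(3)
pi, mu, sigma = np.array([0.3, 0.7]), np.array([-2.0, 3.0]), np.array([0.5, 1.0])
z = rng.choice(2, size=5000, p=pi)
x = rng.normal(mu[z], sigma[z])

# The generating parameters should score better than a perturbed copy.
print(gmm_loglik(x, pi, mu, sigma), gmm_loglik(x, pi, mu + 1.0, sigma))
```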


The K-means algorithm

• Group the data $D = \{x_1, \dots, x_N\}$ into a set of K clusters, where K is given. Represent the $i$-th cluster by a single vector, its mean $\mu_i$, and assign each data point to the nearest mean.
• Phase 1: the values of the indicator variables $z_n^i$ are evaluated by assigning each point $x_n$ to the closest mean:
$$z_n^i = \begin{cases} 1 & \text{if } i = \arg\min_j \|x_n - \mu_j\|^2, \\ 0 & \text{otherwise.} \end{cases}$$
• Phase 2: recompute the means
$$\mu_i = \frac{\sum_n z_n^i x_n}{\sum_n z_n^i}.$$
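A minimal sketch of the two alternating phases in numpy (the initialization and stopping rule are our own choices):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate Phase 1 (assign each point to the closest mean) and Phase 2 (recompute the means)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]        # initialize the means at random data points
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances, shape (N, K)
        z = d2.argmin(axis=1)                                       # Phase 1: index of the closest mean
        new_mu = np.array([X[z == i].mean(axis=0) if np.any(z == i) else mu[i]
                           for i in range(K)])                      # Phase 2: recompute each mean
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in [(-2, 0), (2, 0), (0, 3)]])
mu, z = kmeans(X, K=3)
print(np.round(mu, 2))   # roughly (-2, 0), (2, 0), (0, 3), in some order
```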


EM Algorithm

• If $Z_n$ were observed, it would be the "class label", and the estimate of the mean would be
$$\mu_i = \frac{\sum_n z_n^i x_n}{\sum_n z_n^i}.$$
• We don't know the $Z_n$, so we replace them by their conditional expectations, conditioning on the data:
$$E[Z_n^i \mid x_n] = 1 \cdot p(Z_n^i = 1 \mid x_n) + 0 \cdot p(Z_n^i = 0 \mid x_n) = p(Z_n^i = 1 \mid x_n) \equiv \tau_n^i,$$
$$\mu_i = \frac{\sum_n \tau_n^i x_n}{\sum_n \tau_n^i}.$$
• But from (6), (7), $\tau_n^i$ depends on the parameter estimates, so we should iterate.


EM Algorithm

• Iteration formulas:
$$\tau_n^{i(t)} = \frac{\pi_i^{(t)} N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}{\sum_j \pi_j^{(t)} N(x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$$
$$\mu_i^{(t+1)} = \frac{\sum_n \tau_n^{i(t)} x_n}{\sum_n \tau_n^{i(t)}}$$
$$\Sigma_i^{(t+1)} = \frac{\sum_n \tau_n^{i(t)} (x_n - \mu_i^{(t+1)})(x_n - \mu_i^{(t+1)})^{T}}{\sum_n \tau_n^{i(t)}}$$
$$\pi_i^{(t+1)} = \frac{1}{N} \sum_n \tau_n^{i(t)}$$
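A minimal sketch of these four updates for a 1-D Gaussian mixture, i.e. with scalar variances in place of covariance matrices (initialization, iteration count, and helper names are our own choices):

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=200, seed=0):
    """EM for a 1-D Gaussian mixture: E step = responsibilities tau, M step = the update formulas above."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E step: tau[n, i] = pi_i N(x_n | mu_i, var_i) / sum_j pi_j N(x_n | mu_j, var_j)
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / var[None, :]) \
               / np.sqrt(2.0 * np.pi * var[None, :])
        tau = pi[None, :] * dens
        tau /= tau.sum(axis=1, keepdims=True)
        # M step: weighted means, variances, and mixing proportions.
        Nk = tau.sum(axis=0)
        mu = (tau * x[:, None]).sum(axis=0) / Nk
        var = (tau * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return pi, mu, var

rng = np.random.default_rng(3)
z = rng.choice(2, size=5000, p=[0.3, 0.7])
x = rng.normal(np.array([-2.0, 3.0])[z], np.array([0.5, 1.0])[z])
print(em_gmm_1d(x, K=2))   # roughly pi = (0.3, 0.7), mu = (-2, 3), var = (0.25, 1)
```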


EM Algorithm

• The expectation step is (14), the computation of the responsibilities $\tau_n^{i(t)}$.
• The maximization step is the parameter updates (15)-(17).
• What relationship does this algorithm have to the quantity we want to maximize, the log-likelihood (9)?
• Calculating the derivative of $l$ with respect to the parameters, we have
$$\frac{\partial l}{\partial \mu_i}
= \sum_n \frac{\pi_i N(x_n \mid \mu_i, \Sigma_i)}{\sum_j \pi_j N(x_n \mid \mu_j, \Sigma_j)}
\frac{\partial}{\partial \mu_i} \log N(x_n \mid \mu_i, \Sigma_i)
= \sum_n \tau_n^i \Sigma_i^{-1} (x_n - \mu_i).$$


EM Algorithm

• Setting the derivative to zero yields
$$\mu_i = \frac{\sum_n \tau_n^i x_n}{\sum_n \tau_n^i}.$$
• Analogously,
$$\Sigma_i = \frac{\sum_n \tau_n^i (x_n - \mu_i)(x_n - \mu_i)^{T}}{\sum_n \tau_n^i},$$
and the mixing proportions are
$$\pi_i = \frac{1}{N} \sum_{n=1}^{N} \tau_n^i.$$


EM General Setting

• EM is an iterative technique designed for probabilistic models.
• We have two sample spaces:
  – X, which is observed (the dataset)
  – Z, which is missing (latent)
• The probability model is $p(x, z \mid \theta)$.
• If we knew Z, we would do ML estimation by maximizing the complete log-likelihood
$$l_c(\theta) = \log p(x, z \mid \theta). \qquad (22)$$


EM General Setting

• Z is not observed, so we calculate the incomplete log-likelihood
$$l(\theta) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta). \qquad (23)$$
• Since Z is not observed, the complete log-likelihood is a random quantity and cannot be maximized directly.
• Thus we average over Z using some "averaging distribution" $q(z \mid x)$:
$$\langle l_c(\theta) \rangle_q = \sum_z q(z \mid x) \log p(x, z \mid \theta). \qquad (24)$$
• We hope that maximizing this surrogate expression will yield a value of $\theta$ which improves on the initial value of $\theta$.


EM General Setting

• The distribution $q$ can be used to obtain a lower bound on the log-likelihood:
$$l(\theta) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)
= \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}
\ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
\equiv L(q, \theta), \qquad (25)$$
where the inequality follows from Jensen's inequality, since log is concave (see the corollary at the end of this tutorial).
• EM is coordinate ascent on $L(q, \theta)$.
• At the $(t+1)$st iteration, for fixed $\theta^{(t)}$ we first maximize $L(q, \theta^{(t)})$ with respect to $q$, which yields $q^{(t+1)}$. For this $q^{(t+1)}$ we then maximize $L(q^{(t+1)}, \theta)$ with respect to $\theta$, which yields $\theta^{(t+1)}$.


EM General Setting

• E step:
$$q^{(t+1)} = \arg\max_q L(q, \theta^{(t)})$$
• M step:
$$\theta^{(t+1)} = \arg\max_{\theta} L(q^{(t+1)}, \theta)$$
• The M step is equivalently viewed as the maximization of the expected complete log-likelihood. Proof:
$$L(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
= \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x)
= \langle l_c(\theta) \rangle_q - \sum_z q(z \mid x) \log q(z \mid x). \qquad (26)$$
• The second term is independent of $\theta$. Thus maximizing $L(q, \theta)$ with respect to $\theta$ is equivalent to maximizing $\langle l_c(\theta) \rangle_q$.


EM General Setting

• The E step can be solved once and for all: the choice
$$q^{(t+1)}(z \mid x) = p(z \mid x, \theta^{(t)})$$
yields the maximum:
$$L\big( p(z \mid x, \theta^{(t)}),\, \theta^{(t)} \big)
= \sum_z p(z \mid x, \theta^{(t)}) \log \frac{p(x, z \mid \theta^{(t)})}{p(z \mid x, \theta^{(t)})}
= \sum_z p(z \mid x, \theta^{(t)}) \log p(x \mid \theta^{(t)})
= \log p(x \mid \theta^{(t)}) = l(\theta^{(t)}). \qquad (27)$$
Since $L(q, \theta) \le l(\theta)$ for every $q$ by (25), attaining equality shows that this choice indeed maximizes $L(q, \theta^{(t)})$ over $q$.


Jensen's inequality

Definition: a function $f : \mathbb{R} \to \mathbb{R}$ is convex over $(a, b)$ if
$$\forall\, x_1, x_2 \in (a, b),\ \lambda \in [0, 1]: \quad
f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
[Figure: a convex function and a concave function.]
Jensen's inequality: for a convex function $f$,
$$E[f(X)] \ge f(E[X]).$$
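A quick numerical check of the inequality (our own illustration) with the convex function $f(x) = x^2$:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=1.0, scale=2.0, size=100_000)

def f(x):
    """A convex function."""
    return x ** 2

print(f(X).mean(), f(X.mean()))   # E[f(X)] = Var(X) + E[X]^2 ~ 5  >=  f(E[X]) ~ 1
```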


Jensen's inequality

For a discrete random variable with two mass points ($E[X] = p_1 x_1 + p_2 x_2$, $p_i \in [0, 1]$), convexity gives
$$E[f(X)] = p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2) = f(E[X]).$$
Assume Jensen's inequality holds for $k - 1$ mass points. Then
$$E[f(X)] = \sum_{i=1}^{k} p_i f(x_i)
= p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} f(x_i)$$
$$\ge p_k f(x_k) + (1 - p_k)\, f\!\left( \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i \right)
\qquad \text{(due to the induction assumption)}$$
$$\ge f\!\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i \right)
\qquad \text{(due to convexity)}$$
$$= f\!\left( \sum_{i=1}^{k} p_i x_i \right) = f(E[X]).$$


Jensen's inequality corollary

• Let $\sum_j q_j = 1$, $q_j \ge 0$, and $g(j) \ge 0$.
• The function log is concave, so from Jensen's inequality we have $\log(E[g]) \ge E[\log(g)]$:
$$\log\Big( \sum_j q_j\, g(j) \Big) \ge \sum_j q_j \log\big( g(j) \big)
= \sum_j \log\big( g(j)^{q_j} \big)
= \log\Big( \prod_j g(j)^{q_j} \Big),$$
and therefore
$$\sum_j q_j\, g(j) \ge \prod_j g(j)^{q_j}.$$