236607 Visual Recognition
Tutorial 4
• Maximum likelihood – an example
• Maximum likelihood – another example
• Bayesian estimation
• Expectation Maximization Algorithm
• Jensen’s inequality
• EM for a mixture model
Bayesian Estimation: General Theory
• Bayesian learning considers $\theta$ (the parameter vector to be estimated) to be a random variable.
• Before we observe the data, the parameters are described by a prior $p(\theta)$, which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.
Bayesian parametric estimation
• Density function for x, given the training data set $X^{(n)} = \{x_1, \dots, x_n\}$ (as defined in Lecture 2):
$$p(x \mid X^{(n)}) = \int p(x, \theta \mid X^{(n)})\, d\theta.$$
• From the definition of conditional probability densities,
$$p(x, \theta \mid X^{(n)}) = p(x \mid \theta, X^{(n)})\, p(\theta \mid X^{(n)}).$$
• The first factor is independent of $X^{(n)}$, since it is just our assumed form for the parameterized density:
$$p(x \mid \theta, X^{(n)}) = p(x \mid \theta).$$
• Therefore
$$p(x \mid X^{(n)}) = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta.$$
Bayesian parametric estimation
• Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$.
• If the weighting factor $p(\theta \mid X^{(n)})$, which is the posterior of $\theta$, peaks very sharply about some value $\hat\theta$, we obtain
$$p(x \mid X^{(n)}) \approx p(x \mid \hat\theta).$$
• Thus the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$.
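The posterior-weighted average above can be illustrated numerically. The following sketch (not part of the original slides; all names and numbers are illustrative) evaluates $p(x \mid X^{(n)}) = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta$ on a grid, assuming a Gaussian likelihood with known variance and a broad Gaussian prior on the mean:

```python
import numpy as np

# Minimal sketch: p(x | X^(n)) as a posterior-weighted average over θ,
# assuming likelihood N(x; θ, 1) and a broad prior N(θ; 0, 10^2).
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=20)       # observed training set X^(n)

theta = np.linspace(-10, 10, 2001)                   # grid over the parameter
d_theta = theta[1] - theta[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prior = gauss(theta, 0.0, 10.0)                      # broad prior p(θ)
log_lik = np.array([np.sum(np.log(gauss(data, t, 1.0))) for t in theta])
posterior = prior * np.exp(log_lik - log_lik.max())  # unnormalized p(θ | X^(n))
posterior /= posterior.sum() * d_theta               # normalize on the grid

# Predictive density at a few query points: weighted average over all θ.
for x in (0.0, 2.0, 4.0):
    p_x = np.sum(gauss(x, theta, 1.0) * posterior) * d_theta
    print(f"p({x} | data) ~ {p_x:.4f}")
```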
Bayesian decision making
• Suppose we know the distribution of possible values of $\theta$, that is, a prior $p_0(\theta)$.
• Suppose we also have a loss function $\lambda(\hat\theta, \theta)$ which measures the penalty for estimating $\hat\theta$ when the actual value is $\theta$.
• Then we may formulate the estimation problem as Bayesian decision making: choose the value of $\hat\theta$ which minimizes the risk
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, \lambda(\hat\theta, \theta)\, d\theta.$$
• Note that the loss function is usually continuous.
Maximum A-Posteriori (MAP) Estimation
• Let us look at the 0/1 loss function
$$\lambda(\hat\theta, \theta) = \begin{cases} 0 & \text{if } \hat\theta = \theta \\ 1 & \text{if } \hat\theta \neq \theta. \end{cases}$$
• The optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$. This "most likely value" is given by
$$\hat\theta = \arg\max_\theta\, p(\theta \mid X^{(n)}) = \arg\max_\theta\, \frac{p_0(\theta)\, p(X^{(n)} \mid \theta)}{p(X^{(n)})} = \arg\max_\theta\, \frac{p_0(\theta)\, p(X^{(n)} \mid \theta)}{\int p(X^{(n)} \mid \theta')\, p_0(\theta')\, d\theta'}.$$
Maximum A-Posteriori (MAP) Estimation
• Since the data are i.i.d.,
$$p(X^{(n)} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
• We can disregard the normalizing factor $p(X^{(n)})$ when looking for the maximum.
MAP - continued
• So the $\hat\theta$ we are looking for is
$$\begin{aligned}
\hat\theta &= \arg\max_\theta\, p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \qquad (\text{log is monotonically increasing})\\
&= \arg\max_\theta\, \log\Big[p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)\Big]\\
&= \arg\max_\theta\, \Big[\log p_0(\theta) + \log \prod_{i=1}^{n} p(x_i \mid \theta)\Big]\\
&= \arg\max_\theta\, \Big[\log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta)\Big].
\end{aligned}$$
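As a concrete (hypothetical) illustration of the last line, the sketch below does a grid search for $\arg\max_\theta [\log p_0(\theta) + \sum_i \log p(x_i \mid \theta)]$, assuming a unit-variance Gaussian likelihood and a Gaussian prior on its mean; the names and numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=15)      # observed samples x_1..x_n

theta = np.linspace(-10, 10, 4001)               # candidate parameter values
log_prior = -0.5 * (theta / 2.0) ** 2            # log p_0(θ) for prior N(0, 2^2), up to a constant
log_lik = np.array([-0.5 * np.sum((x - t) ** 2) for t in theta])  # Σ log p(x_i | θ), up to a constant

theta_map = theta[np.argmax(log_prior + log_lik)]
theta_ml = theta[np.argmax(log_lik)]
print("MAP:", theta_map, " ML:", theta_ml)       # MAP is pulled slightly toward the prior mean 0
```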
Maximum likelihood
• In the MAP estimator, the larger $n$ (the size of the data), the less important $\log p_0(\theta)$ becomes in the expression
$$\log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
• This can motivate us to omit the prior. What we get is the maximum likelihood (ML) method.
• Informally: we don't use any prior knowledge about the parameters; we seek the values that "explain" the data in the best way:
$$\hat\theta_{ML} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
• $\log p(X^{(n)} \mid \theta)$ is the log-likelihood of $\theta$ with respect to $X^{(n)}$.
• We seek a maximum of the likelihood function, the log-likelihood, or any monotonically increasing function of them.
Maximum likelihood – an example
• Let us find the ML estimator for the parameter $\theta$ of the exponential density function
$$p(x \mid \theta) = \frac{1}{\theta} \exp\!\left(-\frac{x}{\theta}\right), \qquad x \ge 0,$$
so we are actually looking for the maximum of the log-likelihood:
$$\hat\theta = \arg\max_\theta\, p(X^{(n)} \mid \theta) = \arg\max_\theta \prod_{i=1}^{n} \frac{1}{\theta} e^{-x_i/\theta} = \arg\max_\theta \sum_{i=1}^{n} \ln\!\left(\frac{1}{\theta} e^{-x_i/\theta}\right).$$
• Observe:
$$\frac{d}{d\theta} \ln\!\left(\frac{1}{\theta} e^{-x/\theta}\right) = \frac{d}{d\theta}\left(-\ln\theta - \frac{x}{\theta}\right) = -\frac{1}{\theta} + \frac{x}{\theta^2}.$$
• The maximum is achieved where
$$\sum_{i=1}^{n}\left(-\frac{1}{\theta} + \frac{x_i}{\theta^2}\right) = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{n} x_i = 0 \;\Rightarrow\; \hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
• We have obtained the empirical mean (average).
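A quick numerical check of this derivation (an illustrative sketch, not from the slides): maximizing the exponential log-likelihood on a grid should recover the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.5, size=1000)        # samples with true θ = 2.5

theta = np.linspace(0.1, 10, 5000)
log_lik = np.array([-len(x) * np.log(t) - x.sum() / t for t in theta])  # Σ ln((1/θ) e^{-x_i/θ})

print("grid argmax:", theta[np.argmax(log_lik)])  # ≈ empirical mean (up to grid resolution)
print("empirical mean:", x.mean())
```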
Maximum likelihood – another example
• Let us find the ML estimator for the parameter $\theta$ of the Laplacian density
$$p(x \mid \theta) = \frac{1}{2} e^{-|x - \theta|}:$$
$$\hat\theta = \arg\max_\theta\, p(X^{(n)} \mid \theta) = \arg\max_\theta \prod_{i=1}^{n} \frac{1}{2} e^{-|x_i - \theta|} = \arg\max_\theta \sum_{i=1}^{n} \ln\!\left(\frac{1}{2} e^{-|x_i - \theta|}\right).$$
• Observe:
$$\frac{d}{d\theta} \ln\!\left(\frac{1}{2} e^{-|x_i - \theta|}\right) = \frac{d}{d\theta}\left(-\ln 2 - |x_i - \theta|\right) = \operatorname{sign}(x_i - \theta).$$
• The maximum is at the $\theta$ where
$$\sum_{i=1}^{n} \operatorname{sign}(x_i - \theta) = 0.$$
• This is the median of the sampled data.
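Similarly, a small sketch (illustrative only) confirms that the Laplacian log-likelihood is maximized at the sample median:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.laplace(loc=1.5, scale=1.0, size=1001)    # odd n so the median is a sample point

theta = np.linspace(-5, 5, 10001)
log_lik = np.array([-np.sum(np.abs(x - t)) for t in theta])  # Σ ln(½ e^{-|x_i-θ|}), up to a constant

print("grid argmax:", theta[np.argmax(log_lik)])
print("sample median:", np.median(x))
```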
Bayesian estimation - revisited
• We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?
• Example 1: $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ ($\theta$ is unidimensional). The total Bayesian risk here is
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, |\hat\theta - \theta|\, d\theta = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})(\hat\theta - \theta)\, d\theta + \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})(\theta - \hat\theta)\, d\theta.$$
• We seek its minimum:
$$\frac{dR[\hat\theta \mid X^{(n)}]}{d\hat\theta} = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta - \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta.$$
Bayesian estimation - continued
• At the $\hat\theta$ which is a solution we have
$$\int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta = \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta.$$
• That is, for the loss $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ the optimal Bayesian estimator for the parameter $\theta$ is the median of the posterior distribution $p(\theta \mid X^{(n)})$.
• Example 2: $\lambda(\hat\theta, \theta) = (\hat\theta - \theta)^2$ (squared error). Total Bayesian risk:
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, (\hat\theta - \theta)^2\, d\theta.$$
• Again, in order to find the minimum, we set the derivative equal to 0:
Bayesian estimation - continued
$$\frac{dR[\hat\theta \mid X^{(n)}]}{d\hat\theta} = 2\int p(\theta \mid X^{(n)})(\hat\theta - \theta)\, d\theta = 2\hat\theta \int p(\theta \mid X^{(n)})\, d\theta - 2\int \theta\, p(\theta \mid X^{(n)})\, d\theta = 2\hat\theta - 2E[\theta \mid X^{(n)}] = 0.$$
• The optimal estimator here is the conditional expectation of $\theta$ given the data $X^{(n)}$.
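The three Bayesian estimators derived so far (MAP for 0/1 loss, the posterior median for absolute loss, the posterior mean for squared loss) can all be read off a posterior represented on a grid. The sketch below is illustrative only; the skewed posterior is made up to show that the three estimates generally differ.

```python
import numpy as np

theta = np.linspace(0, 10, 10001)
d_theta = theta[1] - theta[0]

# A made-up, skewed posterior p(θ | X^(n)) on a grid (Gamma(3,1)-shaped).
post = theta ** 2 * np.exp(-theta)
post /= post.sum() * d_theta

theta_map = theta[np.argmax(post)]                          # 0/1 loss: posterior mode
cdf = np.cumsum(post) * d_theta
theta_median = theta[np.searchsorted(cdf, 0.5)]             # |θ̂ - θ| loss: posterior median
theta_mean = np.sum(theta * post) * d_theta                 # (θ̂ - θ)^2 loss: posterior mean

print(theta_map, theta_median, theta_mean)                  # ≈ 2.0, ≈ 2.67, ≈ 3.0 here
```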
Jensen's inequality
• Definition: a function $f: \mathbb{R} \to \mathbb{R}$ is convex over $(a, b)$ if
$$\forall x_1, x_2 \in (a, b),\ \forall \lambda \in [0, 1]: \quad f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
[Figure: a convex function and a concave function.]
• Jensen's inequality: for a convex function $f$,
$$E[f(X)] \ge f(E[X]).$$
Jensen's inequality
• For a discrete random variable with two mass points,
$$E[f(X)] = p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2) = f(E[X]),$$
where $E[X] = p_1 x_1 + p_2 x_2$, $p_i \in [0, 1]$, $p_1 + p_2 = 1$.
• Suppose Jensen's inequality holds for $k-1$ mass points. Then
$$\begin{aligned}
E[f(X)] &= \sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} f(x_i)\\
&\ge p_k f(x_k) + (1 - p_k)\, f\!\left(\sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i\right) \qquad \text{(due to the induction assumption)}\\
&\ge f\!\left(p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i\right) \qquad \text{(due to convexity)}\\
&= f\!\left(\sum_{i=1}^{k} p_i x_i\right) = f(E[X]).
\end{aligned}$$
Jensen's inequality corollary
• Let $q_j \ge 0$, $\sum_j q_j = 1$, and $g(j) \ge 0$.
• The function $\log$ is concave, so from Jensen's inequality we have $\log(E[g]) \ge E[\log(g)]$, and therefore
$$\log\Big(\sum_j g(j)\Big) = \log\Big(\sum_j q_j \frac{g(j)}{q_j}\Big) \ge \sum_j q_j \log\frac{g(j)}{q_j}.$$
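A small numeric sanity check of the corollary (an illustrative sketch, not from the slides): for any non-negative $g$ and any distribution $q$, $\log\sum_j g(j) \ge \sum_j q_j \log(g(j)/q_j)$.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.uniform(0.1, 5.0, size=10)        # arbitrary non-negative values g(j)
q = rng.uniform(size=10)
q /= q.sum()                              # q is a probability distribution

lhs = np.log(g.sum())
rhs = np.sum(q * np.log(g / q))
print(lhs, rhs, lhs >= rhs)               # the bound holds; equality iff q is proportional to g
```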
EM Algorithm
• EM is an iterative technique designed for probabilistic models.
• We have two sample spaces:
– X, whose values are observed
– Y, whose values are missing
• A vector of parameters $\theta$ gives a distribution over X.
• We should find
$$\theta_{ML} = \arg\max_\theta P(X \mid \theta) \quad \text{or} \quad \theta_{MAP} = \arg\max_\theta P(\theta \mid X).$$
EM Algorithm
• The problem is that calculating
$$P(X \mid \theta) = \int P(X, Y \mid \theta)\, dy$$
is difficult, but calculating $P(X, Y \mid \theta)$ is relatively easy.
• We define
$$Q(\theta \mid \theta') = E_{P(Y \mid X, \theta')}[\log P(X, Y \mid \theta)].$$
• The algorithm cyclically makes two steps:
– E: compute $Q(\theta \mid \theta^m)$ (see (10) below)
– M: $\theta^{m+1} = \arg\max_\theta Q(\theta \mid \theta^m)$
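Schematically, the two steps form the loop below. This is only a sketch: e_step and m_step are hypothetical problem-specific functions (they are not defined on the slides), standing for the computation of $Q(\theta \mid \theta^m)$ and of its maximizer.

```python
def em(x, theta0, e_step, m_step, n_iters=100):
    """Generic EM loop (sketch).

    e_step(x, theta) should return the quantities needed to form Q(. | theta),
    e.g. the posterior over the missing variables Y given X and theta.
    m_step(x, posterior) should return argmax over theta of the resulting Q.
    """
    theta = theta0
    for _ in range(n_iters):
        posterior = e_step(x, theta)      # E: compute Q(θ | θ^m) via P(Y | X, θ^m)
        theta = m_step(x, posterior)      # M: θ^{m+1} = argmax_θ Q(θ | θ^m)
    return theta
```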
EM Algorithm
[Figure: maximizing a function with a lower-bound approximation vs. a linear approximation.]
EM Algorithm
• Gradient descent makes a linear approximation to the objective function (O.F.); Newton's method makes a quadratic approximation. But the optimal step size is not known.
• EM instead makes a local approximation that is a lower bound (l.b.) to the O.F.
• Choosing a new guess to maximize the l.b. will always be an improvement, if the gradient is not zero.
• Thus there are two steps: E – compute a l.b., M – maximize the l.b.
• The bound used by EM follows from Jensen's inequality.
The General EM Algorithm
• We want to maximize the function
$$f(\theta) = p(X \mid \theta), \qquad (2)$$
where X is a matrix of observed data. If f() is simple, we find the maximum by equating its gradient to zero. But if f() is a mixture (of simple functions),
$$f(\theta) = p(X \mid \theta) = \sum_{\mathbf{y}} p(\mathbf{y}, X \mid \theta), \qquad (3)$$
this is difficult. This is a situation for the EM.
• Given a guess for $\theta$, find a lower bound for f() with a function $g(\theta, q(\mathbf{y}))$, parameterized by free variables $q(\mathbf{y})$.
EM Algorithm
• From the corollary of Jensen's inequality we obtain the lower bound
$$f(\theta) = \sum_{\mathbf{y}} p(\mathbf{X}, \mathbf{y} \mid \theta) = \sum_{\mathbf{y}} q(\mathbf{y}) \frac{p(\mathbf{X}, \mathbf{y} \mid \theta)}{q(\mathbf{y})} \ge \prod_{\mathbf{y}} \left(\frac{p(\mathbf{X}, \mathbf{y} \mid \theta)}{q(\mathbf{y})}\right)^{q(\mathbf{y})} \equiv g(\theta, q),$$
provided $\sum_{\mathbf{y}} q(\mathbf{y}) = 1$.
• Define
$$G(\theta, q) = \log g(\theta, q) = \sum_{\mathbf{y}} q(\mathbf{y}) \log p(\mathbf{X}, \mathbf{y} \mid \theta) - \sum_{\mathbf{y}} q(\mathbf{y}) \log q(\mathbf{y}).$$
• If we want the lower bound $g(\theta, q)$ to touch f at the current guess for $\theta$, we choose q to maximize $G(\theta, q)$.
EM Algorithm
• Adding a Lagrange multiplier for the constraint on q gives
$$\tilde G(\theta, q) = \lambda\Big(1 - \sum_{\mathbf{y}} q(\mathbf{y})\Big) + \sum_{\mathbf{y}} q(\mathbf{y}) \log p(\mathbf{X}, \mathbf{y} \mid \theta) - \sum_{\mathbf{y}} q(\mathbf{y}) \log q(\mathbf{y}),$$
$$\frac{\partial \tilde G}{\partial q(\mathbf{y})} = -\lambda + \log p(\mathbf{X}, \mathbf{y} \mid \theta) - \log q(\mathbf{y}) - 1 = 0,$$
$$q(\mathbf{y}) = \frac{p(\mathbf{X}, \mathbf{y} \mid \theta)}{\sum_{\mathbf{y}'} p(\mathbf{X}, \mathbf{y}' \mid \theta)} = p(\mathbf{y} \mid \mathbf{X}, \theta). \qquad (8)$$
• For this choice the bound becomes
$$g(\theta, q) = \prod_{\mathbf{y}} \left(\frac{p(\mathbf{X}, \mathbf{y} \mid \theta)}{p(\mathbf{y} \mid \mathbf{X}, \theta)}\right)^{q(\mathbf{y})} = \prod_{\mathbf{y}} p(\mathbf{X} \mid \theta)^{q(\mathbf{y})} = p(\mathbf{X} \mid \theta). \qquad (9)$$
• So indeed it touches the objective f().
EM Algorithm
• Finding q to get a good bound is the "E" step.
• To get the next guess for $\theta$ we maximize the bound over $\theta$ (this is the "M" step). It is problem-dependent. The relevant term of G is
$$\sum_{\mathbf{y}} q(\mathbf{y}) \log p(\mathbf{X}, \mathbf{y} \mid \theta) = E_{q(\mathbf{y})}[\log p(\mathbf{X}, \mathbf{y} \mid \theta)]. \qquad (10)$$
• It may be difficult, and also it isn't strictly necessary, to maximize the bound over $\theta$. Improving it is enough; this is sometimes called "generalized EM".
• It is clear from the figure that the derivative of g at the current guess is identical to the derivative of f.
EM for a mixture model
• We have a mixture of two one-dimensional Gaussians (k = 2):
$$P(x \mid \theta) = \tfrac{1}{2} N_{\mu_1}(x) + \tfrac{1}{2} N_{\mu_2}(x).$$
• Let the mixture coefficients be equal: $\pi_1 = \pi_2 = 0.5$.
• Let the variances be $\sigma_1 = \sigma_2 = 1$.
• The problem is to find $\theta = (\mu_1, \mu_2)$.
• We have a sample set $x^{(n)} = x_1, \dots, x_n$.
EM for a mixture model
• To use the EM algorithm, define hidden random variables (indicators)
$$z_{i,j} = \begin{cases} 1 & \text{if } x_i \text{ was chosen from } N_{\mu_j} \\ 0 & \text{otherwise.} \end{cases}$$
• Thus for every i we have $z_{i,1} + z_{i,2} = 1$.
• We collect all the hidden variables: $Z = \{z_{i,j}\}$, $i = 1, \dots, n$, $j = 1, 2$.
• The aim is to calculate Q and to maximize it.
EM for a mixture model
• For every $x_i$ we have
$$P(x_i, z_{i,1}, z_{i,2} \mid \theta) = \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}} \exp\!\Big(-\frac{1}{2}\sum_{j=1}^{2} z_{i,j}(x_i - \mu_j)^2\Big).$$
• From the i.i.d. assumption for the sample set we have
$$\log P(x^{(n)}, Z \mid \theta) = \sum_{i=1}^{n} \log P(x_i, z_{i,1}, z_{i,2} \mid \theta) = n \log\frac{1}{2\sqrt{2\pi}} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{2} z_{i,j}(x_i - \mu_j)^2.$$
• We see that this expression is linear in $z_{i,j}$.
EM for a mixture model
• STEP E: we want to calculate the expected value relative to $P(Z \mid x^{(n)}, \theta')$:
$$E_{P(Z \mid x^{(n)}, \theta')}[\log P(x^{(n)}, Z \mid \theta)] = n \log\frac{1}{2\sqrt{2\pi}} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{2} E_{P(Z \mid x^{(n)}, \theta')}[z_{i,j}]\,(x_i - \mu_j)^2,$$
where
$$E_{P(Z \mid x^{(n)}, \theta')}[z_{i,j}] = P(z_{i,j} = 1 \mid x_i, \theta') \cdot 1 + P(z_{i,j} = 0 \mid x_i, \theta') \cdot 0 = P(x_i \text{ was chosen from } N_{\mu_j}) = \frac{\tfrac{1}{2} e^{-\frac{1}{2}(x_i - \mu_j')^2}}{\tfrac{1}{2} e^{-\frac{1}{2}(x_i - \mu_1')^2} + \tfrac{1}{2} e^{-\frac{1}{2}(x_i - \mu_2')^2}}.$$
EM for a mixture model
• STEP M:
$$\theta_{new} = (\mu_1, \mu_2)_{new} = \arg\max_\theta Q(\theta \mid \theta') = \arg\min_{\mu_1, \mu_2} \sum_{i=1}^{n}\sum_{j=1}^{2} E[z_{i,j}]\,(x_i - \mu_j)^2.$$
• Differentiating and equating to zero we have
$$\frac{\partial F}{\partial \mu_j} = \sum_{i=1}^{n} E[z_{i,j}]\,(x_i - \mu_j) = 0.$$
• Thus
$$\mu_j^{new} = \frac{\sum_{i=1}^{n} E[z_{i,j}]\, x_i}{\sum_{i=1}^{n} E[z_{i,j}]}.$$
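Putting the E and M steps together for this two-component example gives the short routine below, a sketch assuming equal weights 1/2 and unit variances as in the slides; the data generation and initial guesses are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic sample from the assumed model: weights 1/2, unit variances.
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

mu = np.array([-1.0, 1.0])                            # initial guess for (μ1, μ2)
for _ in range(50):
    # E step: E[z_ij] = responsibility of component j for sample i.
    log_w = -0.5 * (x[:, None] - mu[None, :]) ** 2    # log component densities, up to a constant
    resp = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: μ_j = Σ_i E[z_ij] x_i / Σ_i E[z_ij].
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)    # ≈ [-2, 3] up to label permutation
```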
EM mixture of Gaussians
• In what follows we use j instead of y, because the missing variables are discrete in this example.
• The model density is a linear combination of component densities p(x | j, θ):
$$p(\mathbf{x}) = \sum_{j=1}^{M} p(\mathbf{x} \mid j)\, P(j), \qquad (11)$$
where M is the number of basis functions (a parameter of the model) and P(j) are the mixing parameters. They are actually the prior probabilities of a data point having been generated from component j of the mixture.
EM mixture of Gaussians
• The mixing parameters satisfy
$$\sum_{j=1}^{M} P(j) = 1, \qquad 0 \le P(j) \le 1. \qquad (12)$$
• The component density functions p(x | j) are normalized:
$$\int p(\mathbf{x} \mid j)\, d\mathbf{x} = 1.$$
• We shall use Gaussians for p(x | j):
$$p(\mathbf{x} \mid j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right).$$
• We should find $P(j)$, $\boldsymbol{\mu}_j$ and $\sigma_j$.
EM mixture of Gaussians
• STEP E: calculate
$$q(j) = p(j \mid \mathbf{x}, \theta^{old}), \qquad Q(\theta^{new}, \theta^{old}) = E_{q(j)}[\log P(j, \mathbf{x} \mid \theta^{new})],$$
where
$$p(j \mid \mathbf{x}, \theta) = \frac{p(\mathbf{x} \mid j, \theta)\, P(j)}{p(\mathbf{x})}, \qquad P(j, \mathbf{x} \mid \theta) = P(j)\, p(\mathbf{x} \mid j, \theta).$$
(See formulas (8) and (10).)
• We have
$$Q(\theta^{new}, \theta^{old}) = \sum_{i=1}^{N}\sum_{j=1}^{M} p(j \mid \mathbf{x}^i, \theta^{old}) \left[\log P^{new}(j) - d \log \sigma_j^{new} - \frac{\|\mathbf{x}^i - \boldsymbol{\mu}_j^{new}\|^2}{2(\sigma_j^{new})^2}\right]. \qquad (17)$$
• We maximize (17) subject to constraint (12):
$$\tilde Q = Q^{new} + \lambda\Big(1 - \sum_{j=1}^{M} P^{new}(j)\Big). \qquad (18)$$
EM mixture of Gaussians
• STEP M: the derivative of (18) with respect to $P^{new}(j)$ gives
$$\sum_{i=1}^{N} \frac{p(j \mid \mathbf{x}^i, \theta^{old})}{P^{new}(j)} - \lambda = 0.$$
• Thus
$$\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old}) = \lambda\, P^{new}(j). \qquad (20)$$
• Using (12) we shall have
$$\lambda = \sum_{j=1}^{M}\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old}) = N. \qquad (21)$$
• So from (21) and (20):
$$P^{new}(j) = \frac{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})}{\sum_{j=1}^{M}\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})} = \frac{1}{N}\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old}). \qquad (22)$$
EM mixture model. General case
• By calculating the derivatives of (18) with respect to $\boldsymbol{\mu}_j^{new}$ and $\sigma_j^{new}$ we have
$$\boldsymbol{\mu}_j^{new} = \frac{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})\, \mathbf{x}^i}{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})}, \qquad (23)$$
$$(\sigma_j^{new})^2 = \frac{1}{d}\, \frac{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})\, \|\mathbf{x}^i - \boldsymbol{\mu}_j^{new}\|^2}{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})}. \qquad (24)$$
EM mixture model. General case
• Algorithm for calculating p(x) (formula (11)):
For every x
  begin initialize $P(j), \boldsymbol{\mu}_j, \sigma_j^2$
    do (a fixed number of times)
      calculate formulas (22), (23), (24)
    return formula (11)
  end
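The loop above can be turned into code directly from formulas (22)-(24). The sketch below assumes spherical (isotropic) Gaussian components as in the slides; the data generation and initialization are illustrative, not prescribed by the tutorial.

```python
import numpy as np

def em_gmm_spherical(X, M, n_iters=100, seed=0):
    """EM for a mixture of M spherical Gaussians (formulas (22)-(24)); a sketch."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    P = np.full(M, 1.0 / M)                               # mixing parameters P(j)
    mu = X[rng.choice(N, M, replace=False)]               # initialize means from the data
    var = np.full(M, X.var())                             # σ_j^2

    for _ in range(n_iters):
        # E step: responsibilities p(j | x_i, θ_old) ∝ P(j) p(x_i | j).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)            # ||x_i - μ_j||^2
        log_r = np.log(P) - 0.5 * d * np.log(2 * np.pi * var) - sq / (2 * var)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)

        # M step: formulas (22), (23), (24).
        Nj = r.sum(axis=0)                                               # Σ_i p(j | x_i)
        P = Nj / N                                                       # (22)
        mu = (r[:, :, None] * X[:, None, :]).sum(axis=0) / Nj[:, None]   # (23)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * sq).sum(axis=0) / (d * Nj)                            # (24)
    return P, mu, var

# Illustrative usage on synthetic 2-D data with two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(5, 1, (300, 2))])
print(em_gmm_spherical(X, M=2))
```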