236607 Visual Recognition
Tutorial 4
• Maximum likelihood – an example
• Maximum likelihood – another example
• Bayesian estimation
• Expectation Maximization Algorithm
• Jensen’s inequality
• EM for a mixture model
Bayesian Estimation: General Theory
• Bayesian learning considers $\theta$ (the parameter vector to be estimated) to be a random variable.
• Before we observe the data, the parameters are described by a prior $p(\theta)$, which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.
Bayesian parametric estimation
• Density function for x, given the training data set $X^{(n)} = \{x_1, \dots, x_n\}$ (as defined in Lecture 2):
$$p(x \mid X^{(n)}) = \int p(x, \theta \mid X^{(n)})\, d\theta.$$
• From the definition of conditional probability densities,
$$p(x, \theta \mid X^{(n)}) = p(x \mid \theta, X^{(n)})\, p(\theta \mid X^{(n)}).$$
• The first factor is independent of $X^{(n)}$, since it is just our assumed form for the parameterized density:
$$p(x \mid \theta, X^{(n)}) = p(x \mid \theta).$$
• Therefore
$$p(x \mid X^{(n)}) = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta.$$
Bayesian parametric estimation
• Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$.
• If the weighting factor $p(\theta \mid X^{(n)})$, which is the posterior of $\theta$, peaks very sharply about some value $\hat\theta$, we obtain
$$p(x \mid X^{(n)}) \approx p(x \mid \hat\theta).$$
• Thus the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$.
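The posterior-weighted average above can be illustrated numerically. The following sketch (not part of the original slides; all names and numbers are illustrative) evaluates $p(x \mid X^{(n)}) = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta$ on a grid, assuming a Gaussian likelihood with known variance and a broad Gaussian prior on the mean:

```python
import numpy as np

# Minimal sketch: p(x | X^(n)) as a posterior-weighted average over θ,
# assuming likelihood N(x; θ, 1) and a broad prior N(θ; 0, 10^2).
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=20)       # observed training set X^(n)

theta = np.linspace(-10, 10, 2001)                   # grid over the parameter
d_theta = theta[1] - theta[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prior = gauss(theta, 0.0, 10.0)                      # broad prior p(θ)
log_lik = np.array([np.sum(np.log(gauss(data, t, 1.0))) for t in theta])
posterior = prior * np.exp(log_lik - log_lik.max())  # unnormalized p(θ | X^(n))
posterior /= posterior.sum() * d_theta               # normalize on the grid

# Predictive density at a few query points: weighted average over all θ.
for x in (0.0, 2.0, 4.0):
    p_x = np.sum(gauss(x, theta, 1.0) * posterior) * d_theta
    print(f"p({x} | data) ~ {p_x:.4f}")
```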
Bayesian decision making
• Suppose we know the distribution of possible values of $\theta$, that is, a prior $p_0(\theta)$.
• Suppose we also have a loss function $\lambda(\hat\theta, \theta)$ which measures the penalty for estimating $\hat\theta$ when the actual value is $\theta$.
• Then we may formulate the estimation problem as Bayesian decision making: choose the value of $\hat\theta$ which minimizes the risk
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, \lambda(\hat\theta, \theta)\, d\theta.$$
• Note that the loss function is usually continuous.
Maximum A-Posteriori (MAP) Estimation
• Let us look at the 0/1 loss function
$$\lambda(\hat\theta, \theta) = \begin{cases} 0 & \text{if } \hat\theta = \theta \\ 1 & \text{if } \hat\theta \neq \theta. \end{cases}$$
• The optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$. This "most likely value" is given by
$$\hat\theta = \arg\max_\theta\, p(\theta \mid X^{(n)}) = \arg\max_\theta\, \frac{p_0(\theta)\, p(X^{(n)} \mid \theta)}{p(X^{(n)})} = \arg\max_\theta\, \frac{p_0(\theta)\, p(X^{(n)} \mid \theta)}{\int p(X^{(n)} \mid \theta')\, p_0(\theta')\, d\theta'}.$$
Maximum A-Posteriori (MAP) Estimation
• Since the data are i.i.d.,
$$p(X^{(n)} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
• We can disregard the normalizing factor $p(X^{(n)})$ when looking for the maximum.
MAP - continued
• So the $\hat\theta$ we are looking for is
$$\begin{aligned}
\hat\theta &= \arg\max_\theta\, p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \qquad (\text{log is monotonically increasing})\\
&= \arg\max_\theta\, \log\Big[p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)\Big]\\
&= \arg\max_\theta\, \Big[\log p_0(\theta) + \log \prod_{i=1}^{n} p(x_i \mid \theta)\Big]\\
&= \arg\max_\theta\, \Big[\log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta)\Big].
\end{aligned}$$
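As a concrete (hypothetical) illustration of the last line, the sketch below does a grid search for $\arg\max_\theta [\log p_0(\theta) + \sum_i \log p(x_i \mid \theta)]$, assuming a unit-variance Gaussian likelihood and a Gaussian prior on its mean; the names and numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=15)      # observed samples x_1..x_n

theta = np.linspace(-10, 10, 4001)               # candidate parameter values
log_prior = -0.5 * (theta / 2.0) ** 2            # log p_0(θ) for prior N(0, 2^2), up to a constant
log_lik = np.array([-0.5 * np.sum((x - t) ** 2) for t in theta])  # Σ log p(x_i | θ), up to a constant

theta_map = theta[np.argmax(log_prior + log_lik)]
theta_ml = theta[np.argmax(log_lik)]
print("MAP:", theta_map, " ML:", theta_ml)       # MAP is pulled slightly toward the prior mean 0
```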
Maximum likelihood
• In the MAP estimator, the larger $n$ (the size of the data), the less important $\log p_0(\theta)$ becomes in the expression
$$\log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
• This can motivate us to omit the prior. What we get is the maximum likelihood (ML) method.
• Informally: we don't use any prior knowledge about the parameters; we seek the values that "explain" the data in the best way:
$$\hat\theta_{ML} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
• $\log p(X^{(n)} \mid \theta)$ is the log-likelihood of $\theta$ with respect to $X^{(n)}$.
• We seek a maximum of the likelihood function, the log-likelihood, or any monotonically increasing function of them.
Maximum likelihood – an example
• Let us find the ML estimator for the parameter $\theta$ of the exponential density function
$$p(x \mid \theta) = \frac{1}{\theta} \exp\!\left(-\frac{x}{\theta}\right), \qquad x \ge 0,$$
so we are actually looking for the maximum of the log-likelihood:
$$\hat\theta = \arg\max_\theta\, p(X^{(n)} \mid \theta) = \arg\max_\theta \prod_{i=1}^{n} \frac{1}{\theta} e^{-x_i/\theta} = \arg\max_\theta \sum_{i=1}^{n} \ln\!\left(\frac{1}{\theta} e^{-x_i/\theta}\right).$$
• Observe:
$$\frac{d}{d\theta} \ln\!\left(\frac{1}{\theta} e^{-x/\theta}\right) = \frac{d}{d\theta}\left(-\ln\theta - \frac{x}{\theta}\right) = -\frac{1}{\theta} + \frac{x}{\theta^2}.$$
• The maximum is achieved where
$$\sum_{i=1}^{n}\left(-\frac{1}{\theta} + \frac{x_i}{\theta^2}\right) = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{n} x_i = 0 \;\Rightarrow\; \hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
• We have obtained the empirical mean (average).
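A quick numerical check of this derivation (an illustrative sketch, not from the slides): maximizing the exponential log-likelihood on a grid should recover the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.5, size=1000)        # samples with true θ = 2.5

theta = np.linspace(0.1, 10, 5000)
log_lik = np.array([-len(x) * np.log(t) - x.sum() / t for t in theta])  # Σ ln((1/θ) e^{-x_i/θ})

print("grid argmax:", theta[np.argmax(log_lik)])  # ≈ empirical mean (up to grid resolution)
print("empirical mean:", x.mean())
```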
Maximum likelihood – another example
• Let us find the ML estimator for the parameter $\theta$ of the Laplacian density
$$p(x \mid \theta) = \frac{1}{2} e^{-|x - \theta|}:$$
$$\hat\theta = \arg\max_\theta\, p(X^{(n)} \mid \theta) = \arg\max_\theta \prod_{i=1}^{n} \frac{1}{2} e^{-|x_i - \theta|} = \arg\max_\theta \sum_{i=1}^{n} \ln\!\left(\frac{1}{2} e^{-|x_i - \theta|}\right).$$
• Observe:
$$\frac{d}{d\theta} \ln\!\left(\frac{1}{2} e^{-|x_i - \theta|}\right) = \frac{d}{d\theta}\left(-\ln 2 - |x_i - \theta|\right) = \operatorname{sign}(x_i - \theta).$$
• The maximum is at the $\theta$ where
$$\sum_{i=1}^{n} \operatorname{sign}(x_i - \theta) = 0.$$
• This is the median of the sampled data.
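Similarly, a small sketch (illustrative only) confirms that the Laplacian log-likelihood is maximized at the sample median:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.laplace(loc=1.5, scale=1.0, size=1001)    # odd n so the median is a sample point

theta = np.linspace(-5, 5, 10001)
log_lik = np.array([-np.sum(np.abs(x - t)) for t in theta])  # Σ ln(½ e^{-|x_i-θ|}), up to a constant

print("grid argmax:", theta[np.argmax(log_lik)])
print("sample median:", np.median(x))
```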
Bayesian estimation - revisited
• We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?
• Example 1: $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ ($\theta$ is unidimensional). The total Bayesian risk here is
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, |\hat\theta - \theta|\, d\theta = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})(\hat\theta - \theta)\, d\theta + \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})(\theta - \hat\theta)\, d\theta.$$
• We seek its minimum:
$$\frac{dR[\hat\theta \mid X^{(n)}]}{d\hat\theta} = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta - \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta.$$
Bayesian estimation - continued
• At the $\hat\theta$ which is a solution we have
$$\int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta = \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta.$$
• That is, for the loss $\lambda(\hat\theta, \theta) = |\hat\theta - \theta|$ the optimal Bayesian estimator for the parameter $\theta$ is the median of the posterior distribution $p(\theta \mid X^{(n)})$.
• Example 2: $\lambda(\hat\theta, \theta) = (\hat\theta - \theta)^2$ (squared error). Total Bayesian risk:
$$R[\hat\theta \mid X^{(n)}] = \int p(\theta \mid X^{(n)})\, (\hat\theta - \theta)^2\, d\theta.$$
• Again, in order to find the minimum, we set the derivative equal to 0:
Bayesian estimation - continued
$$\frac{dR[\hat\theta \mid X^{(n)}]}{d\hat\theta} = 2\int p(\theta \mid X^{(n)})(\hat\theta - \theta)\, d\theta = 2\hat\theta \int p(\theta \mid X^{(n)})\, d\theta - 2\int \theta\, p(\theta \mid X^{(n)})\, d\theta = 2\hat\theta - 2E[\theta \mid X^{(n)}] = 0.$$
• The optimal estimator here is the conditional expectation of $\theta$ given the data $X^{(n)}$.
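The three Bayesian estimators derived so far (MAP for 0/1 loss, the posterior median for absolute loss, the posterior mean for squared loss) can all be read off a posterior represented on a grid. The sketch below is illustrative only; the skewed posterior is made up to show that the three estimates generally differ.

```python
import numpy as np

theta = np.linspace(0, 10, 10001)
d_theta = theta[1] - theta[0]

# A made-up, skewed posterior p(θ | X^(n)) on a grid (Gamma(3,1)-shaped).
post = theta ** 2 * np.exp(-theta)
post /= post.sum() * d_theta

theta_map = theta[np.argmax(post)]                          # 0/1 loss: posterior mode
cdf = np.cumsum(post) * d_theta
theta_median = theta[np.searchsorted(cdf, 0.5)]             # |θ̂ - θ| loss: posterior median
theta_mean = np.sum(theta * post) * d_theta                 # (θ̂ - θ)^2 loss: posterior mean

print(theta_map, theta_median, theta_mean)                  # ≈ 2.0, ≈ 2.67, ≈ 3.0 here
```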
Jensen's inequality
• Definition: a function $f: \mathbb{R} \to \mathbb{R}$ is convex over $(a, b)$ if
$$\forall x_1, x_2 \in (a, b),\ \forall \lambda \in [0, 1]: \quad f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
[Figure: a convex function and a concave function.]
• Jensen's inequality: for a convex function $f$,
$$E[f(X)] \ge f(E[X]).$$
Jensen's inequality
• For a discrete random variable with two mass points,
$$E[f(X)] = p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2) = f(E[X]),$$
where $E[X] = p_1 x_1 + p_2 x_2$, $p_i \in [0, 1]$, $p_1 + p_2 = 1$.
• Suppose Jensen's inequality holds for $k-1$ mass points. Then
$$\begin{aligned}
E[f(X)] &= \sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} f(x_i)\\
&\ge p_k f(x_k) + (1 - p_k)\, f\!\left(\sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i\right) \qquad \text{(due to the induction assumption)}\\
&\ge f\!\left(p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i\right) \qquad \text{(due to convexity)}\\
&= f\!\left(\sum_{i=1}^{k} p_i x_i\right) = f(E[X]).
\end{aligned}$$
Jensen's inequality corollary
• Let $q_j \ge 0$, $\sum_j q_j = 1$, and $g(j) \ge 0$.
• The function $\log$ is concave, so from Jensen's inequality we have $\log(E[g]) \ge E[\log(g)]$, and therefore
$$\log\Big(\sum_j g(j)\Big) = \log\Big(\sum_j q_j \frac{g(j)}{q_j}\Big) \ge \sum_j q_j \log\frac{g(j)}{q_j}.$$
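A small numeric sanity check of the corollary (an illustrative sketch, not from the slides): for any non-negative $g$ and any distribution $q$, $\log\sum_j g(j) \ge \sum_j q_j \log(g(j)/q_j)$.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.uniform(0.1, 5.0, size=10)        # arbitrary non-negative values g(j)
q = rng.uniform(size=10)
q /= q.sum()                              # q is a probability distribution

lhs = np.log(g.sum())
rhs = np.sum(q * np.log(g / q))
print(lhs, rhs, lhs >= rhs)               # the bound holds; equality iff q is proportional to g
```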
EM Algorithm
• EM is an iterative technique designed for probabilistic models.
• We have two sample spaces:
– X, whose values are observed
– Y, whose values are missing
• A vector of parameters $\theta$ gives a distribution over X.
• We should find
$$\theta_{ML} = \arg\max_\theta P(X \mid \theta) \quad \text{or} \quad \theta_{MAP} = \arg\max_\theta P(\theta \mid X).$$
EM Algorithm
• The problem is that calculating
$$P(X \mid \theta) = \int P(X, Y \mid \theta)\, dy$$
is difficult, but calculating $P(X, Y \mid \theta)$ is relatively easy.
• We define
$$Q(\theta \mid \theta') = E_{P(Y \mid X, \theta')}[\log P(X, Y \mid \theta)].$$
• The algorithm cyclically makes two steps:
– E: compute $Q(\theta \mid \theta^m)$ (see (10) below)
– M: $\theta^{m+1} = \arg\max_\theta Q(\theta \mid \theta^m)$
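Schematically, the two steps form the loop below. This is only a sketch: e_step and m_step are hypothetical problem-specific functions (they are not defined on the slides), standing for the computation of $Q(\theta \mid \theta^m)$ and of its maximizer.

```python
def em(x, theta0, e_step, m_step, n_iters=100):
    """Generic EM loop (sketch).

    e_step(x, theta) should return the quantities needed to form Q(. | theta),
    e.g. the posterior over the missing variables Y given X and theta.
    m_step(x, posterior) should return argmax over theta of the resulting Q.
    """
    theta = theta0
    for _ in range(n_iters):
        posterior = e_step(x, theta)      # E: compute Q(θ | θ^m) via P(Y | X, θ^m)
        theta = m_step(x, posterior)      # M: θ^{m+1} = argmax_θ Q(θ | θ^m)
    return theta
```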
EM Algorithm
[Figure: maximizing a function with a lower-bound approximation vs. a linear approximation.]
EM Algorithm
• Gradient descent makes a linear approximation to the objective function (O.F.); Newton's method makes a quadratic approximation. But the optimal step size is not known.
• EM instead makes a local approximation that is a lower bound (l.b.) to the O.F.
• Choosing a new guess to maximize the l.b. will always be an improvement, if the gradient is not zero.
• Thus there are two steps: E – compute a l.b., M – maximize the l.b.
• The bound used by EM follows from Jensen's inequality.
The General EM Algorithm
• We want to maximize the function
$$f(\theta) = p(X \mid \theta), \qquad (2)$$
where X is a matrix of observed data. If f() is simple, we find the maximum by equating its gradient to zero. But if f() is a mixture (of simple functions),
$$f(\theta) = p(X \mid \theta) = \sum_{\mathbf{y}} p(\mathbf{y}, X \mid \theta), \qquad (3)$$
this is difficult. This is a situation for the EM.
• Given a guess for $\theta$, find a lower bound for f() with a function $g(\theta, q(\mathbf{y}))$, parameterized by free variables $q(\mathbf{y})$.
EM Algorithm
• From the corollary of Jensen's inequality we obtain the lower bound
$$f(\theta) = \sum_{\mathbf{y}} p(\mathbf{X}, \mathbf{y} \mid \theta) = \sum_{\mathbf{y}} q(\mathbf{y}) \frac{p(\mathbf{X}, \mathbf{y} \mid \theta)}{q(\mathbf{y})} \ge \prod_{\mathbf{y}} \left(\frac{p(\mathbf{X}, \mathbf{y} \mid \theta)}{q(\mathbf{y})}\right)^{q(\mathbf{y})} \equiv g(\theta, q),$$
provided $\sum_{\mathbf{y}} q(\mathbf{y}) = 1$.
• Define
$$G(\theta, q) = \log g(\theta, q) = \sum_{\mathbf{y}} q(\mathbf{y}) \log p(\mathbf{X}, \mathbf{y} \mid \theta) - \sum_{\mathbf{y}} q(\mathbf{y}) \log q(\mathbf{y}).$$
• If we want the lower bound $g(\theta, q)$ to touch f at the current guess for $\theta$, we choose q to maximize $G(\theta, q)$.
EM Algorithm
• Adding a Lagrange multiplier for the constraint on q gives
$$\tilde G(\theta, q) = \lambda\Big(1 - \sum_{\mathbf{y}} q(\mathbf{y})\Big) + \sum_{\mathbf{y}} q(\mathbf{y}) \log p(\mathbf{X}, \mathbf{y} \mid \theta) - \sum_{\mathbf{y}} q(\mathbf{y}) \log q(\mathbf{y}),$$
$$\frac{\partial \tilde G}{\partial q(\mathbf{y})} = -\lambda + \log p(\mathbf{X}, \mathbf{y} \mid \theta) - \log q(\mathbf{y}) - 1 = 0,$$
$$q(\mathbf{y}) = \frac{p(\mathbf{X}, \mathbf{y} \mid \theta)}{\sum_{\mathbf{y}'} p(\mathbf{X}, \mathbf{y}' \mid \theta)} = p(\mathbf{y} \mid \mathbf{X}, \theta). \qquad (8)$$
• For this choice the bound becomes
$$g(\theta, q) = \prod_{\mathbf{y}} \left(\frac{p(\mathbf{X}, \mathbf{y} \mid \theta)}{p(\mathbf{y} \mid \mathbf{X}, \theta)}\right)^{q(\mathbf{y})} = \prod_{\mathbf{y}} p(\mathbf{X} \mid \theta)^{q(\mathbf{y})} = p(\mathbf{X} \mid \theta). \qquad (9)$$
• So indeed it touches the objective f().
EM Algorithm
• Finding q to get a good bound is the "E" step.
• To get the next guess for $\theta$ we maximize the bound over $\theta$ (this is the "M" step). It is problem-dependent. The relevant term of G is
$$\sum_{\mathbf{y}} q(\mathbf{y}) \log p(\mathbf{X}, \mathbf{y} \mid \theta) = E_{q(\mathbf{y})}[\log p(\mathbf{X}, \mathbf{y} \mid \theta)]. \qquad (10)$$
• It may be difficult, and also it isn't strictly necessary, to maximize the bound over $\theta$. Improving it is enough; this is sometimes called "generalized EM".
• It is clear from the figure that the derivative of g at the current guess is identical to the derivative of f.
EM for a mixture model
• We have a mixture of two one-dimensional Gaussians (k = 2):
$$P(x \mid \theta) = \tfrac{1}{2} N_{\mu_1}(x) + \tfrac{1}{2} N_{\mu_2}(x).$$
• Let the mixture coefficients be equal: $\pi_1 = \pi_2 = 0.5$.
• Let the variances be $\sigma_1 = \sigma_2 = 1$.
• The problem is to find $\theta = (\mu_1, \mu_2)$.
• We have a sample set $x^{(n)} = x_1, \dots, x_n$.
EM for a mixture model
• To use the EM algorithm, define hidden random variables (indicators)
$$z_{i,j} = \begin{cases} 1 & \text{if } x_i \text{ was chosen from } N_{\mu_j} \\ 0 & \text{otherwise.} \end{cases}$$
• Thus for every i we have $z_{i,1} + z_{i,2} = 1$.
• We collect all the hidden variables: $Z = \{z_{i,j}\}$, $i = 1, \dots, n$, $j = 1, 2$.
• The aim is to calculate Q and to maximize it.
EM for a mixture model
• For every $x_i$ we have
$$P(x_i, z_{i,1}, z_{i,2} \mid \theta) = \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}} \exp\!\Big(-\frac{1}{2}\sum_{j=1}^{2} z_{i,j}(x_i - \mu_j)^2\Big).$$
• From the i.i.d. assumption for the sample set we have
$$\log P(x^{(n)}, Z \mid \theta) = \sum_{i=1}^{n} \log P(x_i, z_{i,1}, z_{i,2} \mid \theta) = n \log\frac{1}{2\sqrt{2\pi}} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{2} z_{i,j}(x_i - \mu_j)^2.$$
• We see that this expression is linear in $z_{i,j}$.
EM for a mixture model
• STEP E: we want to calculate the expected value relative to $P(Z \mid x^{(n)}, \theta')$:
$$E_{P(Z \mid x^{(n)}, \theta')}[\log P(x^{(n)}, Z \mid \theta)] = n \log\frac{1}{2\sqrt{2\pi}} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{2} E_{P(Z \mid x^{(n)}, \theta')}[z_{i,j}]\,(x_i - \mu_j)^2,$$
where
$$E_{P(Z \mid x^{(n)}, \theta')}[z_{i,j}] = P(z_{i,j} = 1 \mid x_i, \theta') \cdot 1 + P(z_{i,j} = 0 \mid x_i, \theta') \cdot 0 = P(x_i \text{ was chosen from } N_{\mu_j}) = \frac{\tfrac{1}{2} e^{-\frac{1}{2}(x_i - \mu_j')^2}}{\tfrac{1}{2} e^{-\frac{1}{2}(x_i - \mu_1')^2} + \tfrac{1}{2} e^{-\frac{1}{2}(x_i - \mu_2')^2}}.$$
EM for a mixture model
• STEP M:
$$\theta_{new} = (\mu_1, \mu_2)_{new} = \arg\max_\theta Q(\theta \mid \theta') = \arg\min_{\mu_1, \mu_2} \sum_{i=1}^{n}\sum_{j=1}^{2} E[z_{i,j}]\,(x_i - \mu_j)^2.$$
• Differentiating and equating to zero we have
$$\frac{\partial F}{\partial \mu_j} = \sum_{i=1}^{n} E[z_{i,j}]\,(x_i - \mu_j) = 0.$$
• Thus
$$\mu_j^{new} = \frac{\sum_{i=1}^{n} E[z_{i,j}]\, x_i}{\sum_{i=1}^{n} E[z_{i,j}]}.$$
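Putting the E and M steps together for this two-component example gives the short routine below, a sketch assuming equal weights 1/2 and unit variances as in the slides; the data generation and initial guesses are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic sample from the assumed model: weights 1/2, unit variances.
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

mu = np.array([-1.0, 1.0])                            # initial guess for (μ1, μ2)
for _ in range(50):
    # E step: E[z_ij] = responsibility of component j for sample i.
    log_w = -0.5 * (x[:, None] - mu[None, :]) ** 2    # log component densities, up to a constant
    resp = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: μ_j = Σ_i E[z_ij] x_i / Σ_i E[z_ij].
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)    # ≈ [-2, 3] up to label permutation
```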
EM mixture of Gaussians
• In what follows we use j instead of y, because the missing variables are discrete in this example.
• The model density is a linear combination of component densities p(x | j, θ):
$$p(\mathbf{x}) = \sum_{j=1}^{M} p(\mathbf{x} \mid j)\, P(j), \qquad (11)$$
where M is the number of basis functions (a parameter of the model) and P(j) are the mixing parameters. They are actually the prior probabilities of a data point having been generated from component j of the mixture.
EM mixture of Gaussians
• The mixing parameters satisfy
$$\sum_{j=1}^{M} P(j) = 1, \qquad 0 \le P(j) \le 1. \qquad (12)$$
• The component density functions p(x | j) are normalized:
$$\int p(\mathbf{x} \mid j)\, d\mathbf{x} = 1.$$
• We shall use Gaussians for p(x | j):
$$p(\mathbf{x} \mid j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right).$$
• We should find $P(j)$, $\boldsymbol{\mu}_j$ and $\sigma_j$.
EM mixture of Gaussians
• STEP E: calculate
$$q(j) = p(j \mid \mathbf{x}, \theta^{old}), \qquad Q(\theta^{new}, \theta^{old}) = E_{q(j)}[\log P(j, \mathbf{x} \mid \theta^{new})],$$
where
$$p(j \mid \mathbf{x}, \theta) = \frac{p(\mathbf{x} \mid j, \theta)\, P(j)}{p(\mathbf{x})}, \qquad P(j, \mathbf{x} \mid \theta) = P(j)\, p(\mathbf{x} \mid j, \theta).$$
(See formulas (8) and (10).)
• We have
$$Q(\theta^{new}, \theta^{old}) = \sum_{i=1}^{N}\sum_{j=1}^{M} p(j \mid \mathbf{x}^i, \theta^{old}) \left[\log P^{new}(j) - d \log \sigma_j^{new} - \frac{\|\mathbf{x}^i - \boldsymbol{\mu}_j^{new}\|^2}{2(\sigma_j^{new})^2}\right]. \qquad (17)$$
• We maximize (17) subject to constraint (12):
$$\tilde Q = Q^{new} + \lambda\Big(1 - \sum_{j=1}^{M} P^{new}(j)\Big). \qquad (18)$$
EM mixture of Gaussians
• STEP M: the derivative of (18) with respect to $P^{new}(j)$ gives
$$\sum_{i=1}^{N} \frac{p(j \mid \mathbf{x}^i, \theta^{old})}{P^{new}(j)} - \lambda = 0.$$
• Thus
$$\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old}) = \lambda\, P^{new}(j). \qquad (20)$$
• Using (12) we shall have
$$\lambda = \sum_{j=1}^{M}\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old}) = N. \qquad (21)$$
• So from (21) and (20):
$$P^{new}(j) = \frac{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})}{\sum_{j=1}^{M}\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})} = \frac{1}{N}\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old}). \qquad (22)$$
EM mixture model. General case
• By calculating the derivatives of (18) with respect to $\boldsymbol{\mu}_j^{new}$ and $\sigma_j^{new}$ we have
$$\boldsymbol{\mu}_j^{new} = \frac{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})\, \mathbf{x}^i}{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})}, \qquad (23)$$
$$(\sigma_j^{new})^2 = \frac{1}{d}\, \frac{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})\, \|\mathbf{x}^i - \boldsymbol{\mu}_j^{new}\|^2}{\sum_{i=1}^{N} p(j \mid \mathbf{x}^i, \theta^{old})}. \qquad (24)$$
EM mixture model. General case
• Algorithm for calculating p(x) (formula (11)):
For every x
  begin initialize $P(j), \boldsymbol{\mu}_j, \sigma_j^2$
    do (a fixed number of times)
      calculate formulas (22), (23), (24)
    return formula (11)
  end
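The loop above can be turned into code directly from formulas (22)-(24). The sketch below assumes spherical (isotropic) Gaussian components as in the slides; the data generation and initialization are illustrative, not prescribed by the tutorial.

```python
import numpy as np

def em_gmm_spherical(X, M, n_iters=100, seed=0):
    """EM for a mixture of M spherical Gaussians (formulas (22)-(24)); a sketch."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    P = np.full(M, 1.0 / M)                               # mixing parameters P(j)
    mu = X[rng.choice(N, M, replace=False)]               # initialize means from the data
    var = np.full(M, X.var())                             # σ_j^2

    for _ in range(n_iters):
        # E step: responsibilities p(j | x_i, θ_old) ∝ P(j) p(x_i | j).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)            # ||x_i - μ_j||^2
        log_r = np.log(P) - 0.5 * d * np.log(2 * np.pi * var) - sq / (2 * var)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)

        # M step: formulas (22), (23), (24).
        Nj = r.sum(axis=0)                                               # Σ_i p(j | x_i)
        P = Nj / N                                                       # (22)
        mu = (r[:, :, None] * X[:, None, :]).sum(axis=0) / Nj[:, None]   # (23)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * sq).sum(axis=0) / (d * Nj)                            # (24)
    return P, mu, var

# Illustrative usage on synthetic 2-D data with two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(5, 1, (300, 2))])
print(em_gmm_spherical(X, M=2))
```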