Posted on 20-Jan-2016
236607 Visual Recognition Tutorial 1
Tutorial 3
• Maximum likelihood – an example
• Maximum likelihood – another example
• Bayesian estimation
• EM for a mixture model
• EM Algorithm General Setting
• Jensen’s inequality
Bayesian Estimation: General Theory
• Bayesian learning considers θ (the parameter vector to be estimated) to be a random variable.
• Before we observe the data, the parameters are described by a prior density, which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.
Bayesian parametric estimation
• The training data set is X⁽ⁿ⁾ = {x₁, ..., xₙ}.
• The density function for x, given the training data set (defined in Lecture 2), is
  p(x | X⁽ⁿ⁾) = ∫ p(x, θ | X⁽ⁿ⁾) dθ.
• From the definition of conditional probability densities,
  p(x, θ | X⁽ⁿ⁾) = p(x | θ, X⁽ⁿ⁾) p(θ | X⁽ⁿ⁾).
• The first factor is independent of X⁽ⁿ⁾, since it is just our assumed parameterized form for the density:
  p(x | θ, X⁽ⁿ⁾) = p(x | θ).
• Therefore
  p(x | X⁽ⁿ⁾) = ∫ p(x | θ) p(θ | X⁽ⁿ⁾) dθ.
Bayesian parametric estimation
• Instead of choosing a specific value for θ, the Bayesian approach performs a weighted average over all values of θ:
  p(x | X⁽ⁿ⁾) = ∫ p(x | θ) p(θ | X⁽ⁿ⁾) dθ.
• If the weighting factor p(θ | X⁽ⁿ⁾), which is the posterior of θ, peaks very sharply about some value θ̂, we obtain
  p(x | X⁽ⁿ⁾) ≈ p(x | θ̂).
• Thus the optimal estimator is the most likely value of θ given the data and the prior of θ.
Bayesian decision making
• Suppose we know the distribution of possible values of θ, that is, a prior p₀(θ).
• Suppose we also have a loss function λ(θ̂, θ) which measures the penalty for estimating θ̂ when the actual value is θ.
• Then we may formulate the estimation problem as Bayesian decision making: choose the value of θ̂ which minimizes the risk
  R[θ̂ | X⁽ⁿ⁾] = ∫ p(θ | X⁽ⁿ⁾) λ(θ̂, θ) dθ.
• Note that the loss function is usually continuous.
Maximum A-Posteriori (MAP) Estimation
• Let us look at the 0/1 loss:
  λ(θ̂, θ) = 0 if θ̂ = θ,  1 if θ̂ ≠ θ.
• For this loss the optimal estimator is the most likely value of θ given the data and the prior of θ. This "most likely value" is given by
  θ̂ = argmax_θ p(θ | X⁽ⁿ⁾) = argmax_θ [ p₀(θ) p(X⁽ⁿ⁾ | θ) / p(X⁽ⁿ⁾) ]
    = argmax_θ [ p₀(θ) p(X⁽ⁿ⁾ | θ) / ∫ p(X⁽ⁿ⁾ | θ′) p₀(θ′) dθ′ ].
Maximum A-Posteriori (MAP) Estimation
• Since the data are i.i.d.,
  p(X⁽ⁿ⁾ | θ) = ∏ᵢ₌₁ⁿ p(xᵢ | θ).
• We can disregard the normalizing factor p(X⁽ⁿ⁾) when looking for the maximum.
MAP – continued
• So, the θ̂ we are looking for is
  θ̂ = argmax_θ [ p₀(θ) ∏ᵢ₌₁ⁿ p(xᵢ | θ) ]        (log is monotonically increasing)
    = argmax_θ log[ p₀(θ) ∏ᵢ₌₁ⁿ p(xᵢ | θ) ]
    = argmax_θ [ log p₀(θ) + ∑ᵢ₌₁ⁿ log p(xᵢ | θ) ].
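As a numerical illustration of the MAP recipe above, the following sketch maximizes log p₀(θ) + ∑ᵢ log p(xᵢ | θ) over a grid for a Gaussian likelihood with a Gaussian prior (all numbers here are assumed for illustration; the closed form used for the check is the standard conjugate-Gaussian result):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # data, likelihood N(theta, 1)

# Hypothetical Gaussian prior p0(theta) = N(mu0, s0^2)
mu0, s0, s = 0.0, 10.0, 1.0

def log_posterior(theta):
    # log p0(theta) + sum_i log p(x_i | theta), dropping theta-independent constants
    log_prior = -0.5 * ((theta - mu0) / s0) ** 2
    log_lik = -0.5 * np.sum((x[:, None] - theta) ** 2, axis=0) / s**2
    return log_prior + log_lik

grid = np.linspace(-5, 5, 10001)
theta_map = grid[np.argmax(log_posterior(grid))]

# Closed-form MAP for Gaussian prior + Gaussian likelihood (precision-weighted mean)
n = len(x)
closed = (mu0 / s0**2 + x.sum() / s**2) / (1 / s0**2 + n / s**2)
assert abs(theta_map - closed) < 1e-2
```

With a broad prior (s0 = 10) the MAP estimate is pulled only slightly away from the sample mean, which previews the ML limit discussed next.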
Maximum likelihood
• In the MAP estimator, the larger n (the size of the data), the less important log p₀(θ) is in the expression
  log p₀(θ) + ∑ᵢ₌₁ⁿ log p(xᵢ | θ).
• This motivates us to omit the prior. What we get is the maximum likelihood (ML) method.
• Informally: we don't use any prior knowledge about the parameters; we seek those values that "explain" the data in the best way:
  θ̂ = argmax_θ ∑ᵢ₌₁ⁿ log p(xᵢ | θ).
• log p(X⁽ⁿ⁾ | θ) is the log-likelihood of θ with respect to X⁽ⁿ⁾. We seek a maximum of the likelihood function, the log-likelihood, or any monotonically increasing function of them.
Maximum likelihood – an example
• Let us find the ML estimator for the parameter θ of the exponential density function p(x | θ) = (1/θ) exp(−x/θ):
  θ̂ = argmax_θ p(X⁽ⁿ⁾ | θ) = argmax_θ ∏ᵢ₌₁ⁿ (1/θ) e^(−xᵢ/θ) = argmax_θ ∑ᵢ₌₁ⁿ ln( (1/θ) e^(−xᵢ/θ) ),
  so we are actually looking for the maximum of the log-likelihood.
• Observe:
  d/dθ ln( (1/θ) e^(−x/θ) ) = d/dθ ( −ln θ − x/θ ) = −1/θ + x/θ².
• The maximum is achieved where
  ∑ᵢ₌₁ⁿ ( −1/θ + xᵢ/θ² ) = 0  ⟹  θ̂ = (1/n) ∑ᵢ₌₁ⁿ xᵢ.
• We have got the empirical mean (average).
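The derivation above can be checked numerically: maximizing the exponential log-likelihood over a grid should land on the empirical mean (sample size and grid are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 3.0
x = rng.exponential(scale=theta_true, size=5000)  # p(x|theta) = (1/theta) exp(-x/theta)

def log_likelihood(theta):
    # sum_i ln((1/theta) e^(-x_i/theta)) = -n ln(theta) - sum_i x_i / theta
    return -len(x) * np.log(theta) - x.sum() / theta

grid = np.linspace(0.5, 10.0, 20001)
theta_ml = grid[np.argmax(log_likelihood(grid))]

assert abs(theta_ml - x.mean()) < 1e-3   # argmax coincides with the empirical mean
```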
Maximum likelihood – another example
• Let us find the ML estimator for the Laplacian density p(x | θ) = (1/2) e^(−|x−θ|):
  θ̂ = argmax_θ p(X⁽ⁿ⁾ | θ) = argmax_θ ∏ᵢ₌₁ⁿ (1/2) e^(−|xᵢ−θ|) = argmax_θ ∑ᵢ₌₁ⁿ ln( (1/2) e^(−|xᵢ−θ|) ).
• Observe:
  d/dθ ln( (1/2) e^(−|xᵢ−θ|) ) = d/dθ ( −ln 2 − |xᵢ−θ| ) = sign(xᵢ − θ).
• The maximum is at the θ̂ where
  ∑ᵢ₌₁ⁿ sign(xᵢ − θ̂) = 0.
• This is the median of the sampled data.
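Likewise, the Laplacian log-likelihood (equivalently, minus the sum of absolute deviations) is maximized at the sample median; a small grid-search sketch with assumed parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.laplace(loc=1.5, scale=1.0, size=401)   # p(x|theta) = (1/2) exp(-|x - theta|)

def log_likelihood(theta):
    # -n ln 2 - sum_i |x_i - theta|; the constant does not affect the argmax
    return -np.abs(x[:, None] - theta).sum(axis=0)

grid = np.linspace(-2.0, 5.0, 7001)
theta_ml = grid[np.argmax(log_likelihood(grid))]

assert abs(theta_ml - np.median(x)) < 2e-3  # argmax coincides with the sample median
```

With an odd sample size the median is itself a data point, so the piecewise-linear log-likelihood has a unique maximum there.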
Bayesian estimation – revisited
• We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?
• Example 1: λ(θ̂, θ) = |θ̂ − θ| (θ is unidimensional). The total Bayesian risk here:
  R[θ̂ | X⁽ⁿ⁾] = ∫ p(θ | X⁽ⁿ⁾) |θ̂ − θ| dθ
             = ∫₋∞^θ̂ p(θ | X⁽ⁿ⁾)(θ̂ − θ) dθ + ∫_θ̂^∞ p(θ | X⁽ⁿ⁾)(θ − θ̂) dθ.
• We seek its minimum:
  dR[θ̂ | X⁽ⁿ⁾]/dθ̂ = ∫₋∞^θ̂ p(θ | X⁽ⁿ⁾) dθ − ∫_θ̂^∞ p(θ | X⁽ⁿ⁾) dθ.
Bayesian estimation – continued
• At the θ̂ which is a solution we have
  ∫₋∞^θ̂ p(θ | X⁽ⁿ⁾) dθ = ∫_θ̂^∞ p(θ | X⁽ⁿ⁾) dθ.
• That is, for the loss λ(θ̂, θ) = |θ̂ − θ| the optimal Bayesian estimator for the parameter is the median of the distribution p(θ | X⁽ⁿ⁾).
• Example 2: λ(θ̂, θ) = (θ̂ − θ)² (squared error). Total Bayesian risk:
  R[θ̂ | X⁽ⁿ⁾] = ∫ p(θ | X⁽ⁿ⁾)(θ̂ − θ)² dθ.
• Again, in order to find the minimum, we set the derivative equal to 0.
Bayesian estimation – continued
  dR[θ̂ | X⁽ⁿ⁾]/dθ̂ = ∫ 2 p(θ | X⁽ⁿ⁾)(θ̂ − θ) dθ
                 = 2 θ̂ ∫ p(θ | X⁽ⁿ⁾) dθ − 2 ∫ θ p(θ | X⁽ⁿ⁾) dθ = 2 θ̂ − 2 E[θ | X⁽ⁿ⁾] = 0.
• The optimal estimator here is the conditional expectation of θ given the data X⁽ⁿ⁾:
  θ̂ = E[θ | X⁽ⁿ⁾].
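The two results above (median for absolute loss, mean for squared loss) can be verified numerically by minimizing each Bayesian risk over a grid, for a hypothetical discretized posterior:

```python
import numpy as np

# A hypothetical bimodal posterior p(theta | X), discretized on a grid
grid = np.linspace(0.0, 10.0, 2001)
post = np.exp(-0.5 * (grid - 3.0) ** 2) + 0.5 * np.exp(-0.5 * ((grid - 6.0) / 0.7) ** 2)
post /= post.sum()

def risk(loss):
    # R[est | X] = sum_theta p(theta|X) * loss(est, theta), for each candidate est
    return np.array([np.sum(post * loss(e, grid)) for e in grid])

est_abs = grid[np.argmin(risk(lambda e, t: np.abs(e - t)))]   # absolute loss
est_sq  = grid[np.argmin(risk(lambda e, t: (e - t) ** 2))]    # squared loss

mean = np.sum(post * grid)
median = grid[np.searchsorted(np.cumsum(post), 0.5)]
assert abs(est_sq - mean) < 0.01     # squared loss -> posterior mean
assert abs(est_abs - median) < 0.01  # absolute loss -> posterior median
```

Note that for a skewed posterior the two estimators differ, which is the point of considering the loss function explicitly.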
Mixture Models
• A mixture density has the form
  p(x | θ) = ∑ᵢ₌₁ᴷ αᵢ pᵢ(x | θᵢ),  with  ∑ᵢ₌₁ᴷ αᵢ = 1,
  where pᵢ(x | θᵢ) are the mixture components and αᵢ are the mixing proportions.
• Example – a mixture of two Gaussians:
  p(x | θ) = α₁ N(x | μ₁, σ₁²) + α₂ N(x | μ₂, σ₂²).
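A short sketch of evaluating and sampling from a two-Gaussian mixture of this form (the parameter values are illustrative, not from the tutorial):

```python
import numpy as np

# Assumed parameters of a two-component Gaussian mixture
alphas = np.array([0.3, 0.7])    # mixing proportions, sum to 1
mus    = np.array([-2.0, 3.0])
sigmas = np.array([1.0, 0.5])

def mixture_pdf(x):
    # p(x) = sum_i alpha_i N(x | mu_i, sigma_i^2)
    comps = np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return comps @ alphas

# Sampling: first draw the component index Z, then draw x from that component
rng = np.random.default_rng(0)
z = rng.choice(2, size=10000, p=alphas)
x = rng.normal(mus[z], sigmas[z])

# The empirical mean should approach sum_i alpha_i mu_i = 0.3*(-2) + 0.7*3 = 1.5
assert abs(x.mean() - 1.5) < 0.1
```

The two-stage sampling procedure is exactly the latent-variable view introduced next: Z selects the component, x is drawn from it.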
Mixture Models
• Introduce a multinomial random variable Zₙ with components Zₙᵏ: Zₙᵏ = 1 if and only if Zₙ takes the k-th value. For K = 3, for example,
  Zₙ ∈ { (1,0,0)ᵀ, (0,1,0)ᵀ, (0,0,1)ᵀ }.
• Note that ∑ₖ Zₙᵏ = 1 by definition.
Mixture Models
• The parameters are θ = (π₁, ..., π_K, θ₁, ..., θ_K).
• The joint probability of x and the k-th component is
  p(x, Zᵏ = 1 | θ) = p(x | Zᵏ = 1, θ) p(Zᵏ = 1 | θ) = πₖ fₖ(x | θₖ),
  where πₖ = p(Zᵏ = 1) and fₖ(x | θₖ) = p(x | Zᵏ = 1, θ).
• The marginal probability of x is
  p(x | θ) = ∑ₖ₌₁ᴷ p(x, Zᵏ = 1 | θ) = ∑ₖ₌₁ᴷ πₖ fₖ(x | θₖ).
Mixture Models
• A mixture model as a graphical model: Z is a multinomial latent variable; x is observed.
• Define the posterior (the conditional probability of Z):
  τᵏ = p(zᵏ = 1 | x, θ) = p(x | zᵏ = 1, θ) p(zᵏ = 1) / ∑ⱼ₌₁ᴷ p(x | zʲ = 1, θ) p(zʲ = 1)
     = πₖ fₖ(x | θₖ) / ∑ⱼ₌₁ᴷ πⱼ fⱼ(x | θⱼ).
Unconditional Mixture Models
• Conditional mixture models are used to solve regression and classification problems (supervised). They require observations of data X and labels Y, that is, (X, Y) pairs.
• Unconditional mixture models are used to solve density estimation problems. They require only observations of data X.
• Applications: detection of outliers, compression, unsupervised classification (clustering), ...
Gaussian Mixture Models
• Estimate θ from i.i.d. data D = {x₁, ..., x_N}, where each component is an m-dimensional Gaussian:
  pᵢ(x | θᵢ) = 1 / ( (2π)^(m/2) |Σᵢ|^(1/2) ) · exp( −½ (x − μᵢ)ᵀ Σᵢ⁻¹ (x − μᵢ) ).
• The log-likelihood is
  l(θ | D) = ∑ₙ log p(xₙ | θ) = ∑ₙ log ∑ᵢ αᵢ N(xₙ | μᵢ, Σᵢ).   (9)
The K-means algorithm
• Group the data D = {x₁, ..., x_N} into a set of K clusters, where K is given. Represent the i-th cluster by one vector – its mean μᵢ. Data points are assigned to the nearest mean.
• Phase 1: values for the indicator variables zₙⁱ are evaluated by assigning each point xₙ to the closest mean:
  zₙⁱ = 1 if i = argminⱼ ‖xₙ − μⱼ‖²,  0 otherwise.
• Phase 2: recompute the means:
  μᵢ = ∑ₙ zₙⁱ xₙ / ∑ₙ zₙⁱ.
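The two phases above can be sketched directly in code (initialization from random data points and the synthetic blobs are assumptions of this sketch, not part of the algorithm's definition):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Two-phase K-means: assign points to the nearest mean, then recompute means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # initialize means from the data
    for _ in range(n_iter):
        # Phase 1: z[n] = argmin_j ||x_n - mu_j||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # Phase 2: mu_i = mean of the points assigned to cluster i
        new_mu = np.array([X[z == i].mean(axis=0) if np.any(z == i) else mu[i]
                           for i in range(K)])
        if np.allclose(new_mu, mu):   # stop when assignments no longer move the means
            break
        mu = new_mu
    return mu, z

# Usage on two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])
mu, z = kmeans(X, K=2)
assert sorted(np.round(m.mean()) for m in mu) == [0.0, 5.0]
```

K-means is the "hard-assignment" limit of the EM iterations that follow: each zₙⁱ is 0 or 1 instead of a posterior probability.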
EM Algorithm
• If Zₙ were observed, it would be a "class label" and the estimate of the mean would be
  μ̂ᵢ = ∑ₙ zₙⁱ xₙ / ∑ₙ zₙⁱ.
• We don't know the zₙⁱ, so we replace them by their conditional expectations, conditioned on the data:
  τₙⁱ = E[Zₙⁱ | xₙ] = 1 · p(Zₙⁱ = 1 | xₙ) + 0 · p(Zₙⁱ = 0 | xₙ) = p(Zₙⁱ = 1 | xₙ),
  which gives
  μ̂ᵢ = ∑ₙ τₙⁱ xₙ / ∑ₙ τₙⁱ.
• But τₙⁱ depends on the parameter estimates, so we should iterate.
EM Algorithm
• Iteration formulas:
  τₙⁱ(t) = αᵢ(t) N(xₙ | μᵢ(t), Σᵢ(t)) / ∑ⱼ αⱼ(t) N(xₙ | μⱼ(t), Σⱼ(t))   (14)
  μᵢ(t+1) = ∑ₙ τₙⁱ(t) xₙ / ∑ₙ τₙⁱ(t)   (15)
  Σᵢ(t+1) = ∑ₙ τₙⁱ(t) (xₙ − μᵢ(t+1))(xₙ − μᵢ(t+1))ᵀ / ∑ₙ τₙⁱ(t)   (16)
  αᵢ(t+1) = (1/N) ∑ₙ τₙⁱ(t)   (17)
EM Algorithm
• The expectation step is (14); the maximization step is the parameter updates (15)-(17).
• What relationship does this algorithm have to the quantity we want to maximize – the log-likelihood (9)?
• Calculating the derivative of l with respect to the parameters, we have
  ∂l/∂μᵢ = ∑ₙ [ αᵢ N(xₙ | μᵢ, Σᵢ) / ∑ⱼ αⱼ N(xₙ | μⱼ, Σⱼ) ] Σᵢ⁻¹ (xₙ − μᵢ)
        = ∑ₙ τₙⁱ Σᵢ⁻¹ (xₙ − μᵢ).
EM Algorithm
• Setting ∂l/∂μᵢ to zero yields
  μ̂ᵢ = ∑ₙ τₙⁱ xₙ / ∑ₙ τₙⁱ.
• Analogously,
  Σ̂ᵢ = ∑ₙ τₙⁱ (xₙ − μ̂ᵢ)(xₙ − μ̂ᵢ)ᵀ / ∑ₙ τₙⁱ,
  and the mixing proportions:
  α̂ᵢ = (1/N) ∑ₙ₌₁ᴺ τₙⁱ.
EM General Setting
• EM is an iterative technique designed for probabilistic models.
• We have two sample spaces:
  – X, which is observed (the dataset);
  – Z, which is missing (latent).
• A probability model is p(x, z | θ).
• If we knew Z, we would do ML estimation by maximizing the complete log likelihood
  l_c(θ) = log p(x, z | θ).   (22)
EM General Setting
• Z is not observed, so we calculate the incomplete log likelihood
  log p(x | θ) = log ∑_z p(x, z | θ).   (23)
• Since Z is not observed, the complete log likelihood is a random quantity and cannot be maximized directly.
• Thus we average over Z using some "averaging distribution" q(z | x):
  ⟨l_c(θ)⟩ = ∑_z q(z | x) log p(x, z | θ).   (24)
• We hope that maximizing this surrogate expression will yield a value of θ which improves on the initial value.
EM General Setting
• The distribution q can be used to obtain a lower bound on the log likelihood:
  l(θ) = log p(x | θ) = log ∑_z p(x, z | θ)
       = log ∑_z q(z | x) [ p(x, z | θ) / q(z | x) ]
       ≥ ∑_z q(z | x) log [ p(x, z | θ) / q(z | x) ] =: L(q, θ),   (25)
  where the inequality follows from Jensen's inequality (log is concave).
• EM is coordinate ascent on L(q, θ).
• At the (t+1)-st iteration, for fixed θ⁽ᵗ⁾, we first maximize L(q, θ⁽ᵗ⁾) with respect to q, which yields q⁽ᵗ⁺¹⁾. For this q⁽ᵗ⁺¹⁾ we then maximize L(q⁽ᵗ⁺¹⁾, θ) with respect to θ, which yields θ⁽ᵗ⁺¹⁾.
EM General Setting
• E step:
  q⁽ᵗ⁺¹⁾ = argmax_q L(q, θ⁽ᵗ⁾).
• M step:
  θ⁽ᵗ⁺¹⁾ = argmax_θ L(q⁽ᵗ⁺¹⁾, θ).
• The M step is equivalently viewed as maximization of the expected complete log likelihood. Proof:
  L(q, θ) = ∑_z q(z | x) log [ p(x, z | θ) / q(z | x) ]
          = ∑_z q(z | x) log p(x, z | θ) − ∑_z q(z | x) log q(z | x)
          = ⟨l_c(θ)⟩_q − ∑_z q(z | x) log q(z | x).   (26)
• The second term is independent of θ. Thus maximizing L(q, θ) with respect to θ is equivalent to maximizing ⟨l_c(θ)⟩_q.
EM General Setting
• The E step can be solved once and for all: the choice
  q⁽ᵗ⁺¹⁾(z | x) = p(z | x, θ⁽ᵗ⁾)
  yields the maximum:
  L(q⁽ᵗ⁺¹⁾, θ⁽ᵗ⁾) = ∑_z p(z | x, θ⁽ᵗ⁾) log [ p(x, z | θ⁽ᵗ⁾) / p(z | x, θ⁽ᵗ⁾) ]
                 = ∑_z p(z | x, θ⁽ᵗ⁾) log p(x | θ⁽ᵗ⁾)
                 = log p(x | θ⁽ᵗ⁾) = l(θ⁽ᵗ⁾),   (27)
  i.e. the bound (25) becomes tight at θ⁽ᵗ⁾, so no q can do better.
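The bound (25) and its tightness at the posterior (27) can be checked on a tiny discrete model (the joint probabilities below are invented purely for illustration):

```python
import numpy as np

# A hypothetical joint p(x, z | theta) for one observed x and 2 latent states:
# p(z) = [theta, 1 - theta], p(x | z) = [0.8, 0.3]
theta = 0.6
joint = np.array([theta * 0.8, (1 - theta) * 0.3])   # p(x, z | theta) for each z
px = joint.sum()                                     # incomplete likelihood p(x | theta)

def L(q):
    # Lower bound (25): sum_z q(z) log [ p(x, z | theta) / q(z) ]
    return np.sum(q * np.log(joint / q))

q_uniform = np.array([0.5, 0.5])
q_post = joint / px                   # E-step choice: q(z) = p(z | x, theta)

assert L(q_uniform) <= np.log(px) + 1e-12      # any q gives a lower bound on l(theta)
assert abs(L(q_post) - np.log(px)) < 1e-12     # the bound is tight at the posterior
```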
Jensen's inequality
• Definition: a function f: ℝ → ℝ is convex over (a, b) if
  ∀ x₁, x₂ ∈ (a, b), λ ∈ [0, 1]:  f(λx₁ + (1 − λ)x₂) ≤ λ f(x₁) + (1 − λ) f(x₂).
  (For a concave function the inequality is reversed.)
• Jensen's inequality: for a convex function f,
  E[f(X)] ≥ f(E[X]).
Jensen's inequality
• For a discrete random variable with two mass points,
  E[f(X)] = p₁ f(x₁) + p₂ f(x₂) ≥ f(p₁x₁ + p₂x₂) = f(E[X]),
  where E[X] = p₁x₁ + p₂x₂ and pᵢ ∈ [0, 1], directly by convexity.
• Suppose Jensen's inequality holds for k − 1 mass points. Then
  E[f(X)] = ∑ᵢ₌₁ᵏ pᵢ f(xᵢ) = pₖ f(xₖ) + (1 − pₖ) ∑ᵢ₌₁ᵏ⁻¹ [pᵢ / (1 − pₖ)] f(xᵢ)
          ≥ pₖ f(xₖ) + (1 − pₖ) f( ∑ᵢ₌₁ᵏ⁻¹ pᵢxᵢ / (1 − pₖ) )   (by the induction assumption)
          ≥ f( pₖxₖ + (1 − pₖ) ∑ᵢ₌₁ᵏ⁻¹ pᵢxᵢ / (1 − pₖ) )   (by convexity)
          = f( ∑ᵢ₌₁ᵏ pᵢxᵢ ) = f(E[X]).
Jensen's inequality corollary
• Let ∑ⱼ qⱼ = 1, qⱼ ≥ 0, g(j) ≥ 0.
• The function log is concave, so from Jensen's inequality we have log(E[g]) ≥ E[log(g)]:
  log( ∑ⱼ qⱼ g(j) ) ≥ ∑ⱼ qⱼ log( g(j) ) = ∑ⱼ log( g(j)^(qⱼ) ) = log( ∏ⱼ g(j)^(qⱼ) ),
  hence
  ∑ⱼ qⱼ g(j) ≥ ∏ⱼ g(j)^(qⱼ).
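The corollary (a weighted arithmetic-mean ≥ geometric-mean inequality) is easy to check numerically on random weights and values:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    # Random weights q (non-negative, summing to 1) and positive values g(j)
    q = rng.dirichlet(np.ones(5))
    g = rng.uniform(0.1, 10.0, 5)
    lhs = np.log(np.sum(q * g))      # log(E_q[g])
    rhs = np.sum(q * np.log(g))      # E_q[log g] = log(prod_j g(j)^q_j)
    assert lhs >= rhs - 1e-12        # Jensen: log is concave
```

This is exactly the inequality used in (25) to lower-bound the incomplete log likelihood.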