Parameter Estimation
– In previous chapters:
– We could design an optimal classifier if we knew the prior probabilities P(ωi) and the class-conditional probabilities p(x|ωi).
– Unfortunately, in pattern recognition applications we rarely have this kind of complete knowledge about the probabilistic structure of the problem.
Parameter Estimation
– We have a number of design samples or training data.
– The problem is to find some way to use this information to design or train the classifier.
– One approach:
– Use the samples to estimate the unknown probabilities and probability densities,
– And then use the resulting estimates as if they were the true values.
Parameter Estimation
– What are you going to estimate in your homework?
Parameter Estimation
– If we know the number of parameters in advance and our general knowledge about the problem permits us to parameterize the conditional densities, then the severity of the problem can be reduced significantly.
Parameter Estimation
– For example:
– We can reasonably assume that p(x|ωi) is a normal density with mean µi and covariance matrix Σi,
– We do not know the exact values of these quantities,
– However, this knowledge simplifies the problem from one of estimating an unknown function p(x|ωi) to one of estimating the parameters µi and Σi.
– Estimating p(x|ωi) reduces to estimating µi and Σi.
Parameter Estimation
– Data availability in a Bayesian framework: we could design an optimal classifier if we knew
– P(ωi) (priors)
– p(x|ωi) (class-conditional densities)
– Unfortunately, we rarely have this complete information!
– When we design a classifier from a training sample, there is no problem with prior estimation, but samples are often too small for class-conditional estimation (large dimension of feature space!).
Parameter Estimation
– Given a set of data from each class, how do we estimate the parameters of the class-conditional densities p(x|ωj)?
– Ex: p(x|ωj) = N(µj, Σj) is normal, with parameters θj = (µj, Σj).
Parameter Estimation
– Two major approaches: the Maximum-Likelihood method and the Bayesian method.
– Use P(ωi|x) for our classification rule!
– Results are nearly identical, but the approaches are different.
Maximum-Likelihood vs. Bayesian:
– Maximum Likelihood: parameters are fixed but unknown! The best parameters are obtained by maximizing the probability of obtaining the samples observed.
– Bayes: parameters are random variables having some known distribution. The best parameters are obtained by estimating them given the data.
– Major assumptions:
– A priori information P(ωi) for each category is available.
– Samples are i.i.d. and p(x|ωi) is normal: p(x|ωi) ~ N(µi, Σi).
– Note: each density is characterized by the 2 parameters µi and Σi.
Maximum-Likelihood Estimation
– Advantages:
– 1. Has good convergence properties as the sample size increases.
– 2. Simpler than alternative techniques.
– General principle: assume we have c classes, and the samples in class j have been drawn according to the probability law p(x|ωj).
– p(x|ωj) ~ N(µj, Σj)
– We write p(x|ωj) as p(x|ωj, θj), where θj = (µj, Σj).
– Our problem is to use the information provided by the training samples to obtain good estimates for the unknown parameter vectors θj associated with each category.
Likelihood Function
– Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc); each θi (i = 1, 2, …, c) is associated with one category.
– Suppose that D contains n samples, x1, x2, …, xn. Assuming the samples are drawn independently:
  P(D|θ) = ∏_{k=1}^{n} p(x_k|θ)
– P(D|θ) is called the likelihood of θ with respect to the set of samples.
– Goal: find the estimate θ̂ that maximizes P(D|θ).
– "It is the value of θ that best agrees with the actually observed training sample."
– It is usually easier to maximize the log-likelihood function: l(θ) = ln P(D|θ).
– Let θ = (θ1, θ2, …, θp)^t and let ∇_θ = (∂/∂θ1, …, ∂/∂θp)^t be the gradient operator.
– New problem statement: determine the θ̂ that maximizes the log-likelihood:
  θ̂ = arg max_θ l(θ)
– Since l(θ) = ∑_{k=1}^{n} ln p(x_k|θ), the gradient is
  ∇_θ l = ∑_{k=1}^{n} ∇_θ ln p(x_k|θ)
– The set of necessary conditions for an optimum is:
  ∇_θ l = 0
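As a sanity check on the recipe above, here is a minimal Python sketch (the sample, grid, and parameter values are made up for illustration): it maximizes the Gaussian log-likelihood l(µ) over a grid of candidate means and confirms the maximizer lands next to the sample mean, as the condition ∇_θ l = 0 predicts.

```python
import math
import random

random.seed(0)

# Draw i.i.d. samples x_1..x_n from N(mu=2, sigma=1); mu plays the
# role of the "unknown" parameter theta.
samples = [random.gauss(2.0, 1.0) for _ in range(500)]

def log_likelihood(mu, xs, sigma=1.0):
    """l(mu) = sum_k ln p(x_k | mu) for a 1-D Gaussian with known sigma."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

# Maximize l(mu) by brute-force search over a grid of candidate values.
grid = [i / 1000 for i in range(0, 4000)]          # mu in [0, 4)
mu_hat = max(grid, key=lambda mu: log_likelihood(mu, samples))

# The necessary condition grad l = 0 has the sample mean as its
# closed-form solution, so the grid maximizer should agree with it.
sample_mean = sum(samples) / len(samples)
print(mu_hat, sample_mean)
```

The grid search is only there to show that "maximize the likelihood" is a concrete optimization; the closed-form result derived next makes it unnecessary for the Gaussian case.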
Maximum-Likelihood Estimation
– Example of a specific case: unknown µ.
– p(x_k|µ) ~ N(µ, Σ) (samples are drawn from a multivariate normal population), so θ = µ.
– For a single sample:
  ln p(x_k|µ) = -½ ln[(2π)^d |Σ|] - ½ (x_k - µ)^t Σ^{-1} (x_k - µ)
  and ∇_µ ln p(x_k|µ) = Σ^{-1}(x_k - µ)
– Therefore the ML estimate for µ must satisfy:
  ∑_{k=1}^{n} Σ^{-1}(x_k - µ̂) = 0
– Multiplying by Σ and rearranging, we obtain:
  µ̂ = (1/n) ∑_{k=1}^{n} x_k
– This is just the arithmetic average of the training samples!
– Conclusion: if p(x_k|ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
ML Estimation: Gaussian case, µ and σ unknown
– θ = (θ1, θ2) = (µ, σ²)
– For a single sample x_k:
  l = ln p(x_k|θ) = -½ ln 2πθ2 - (1/(2θ2)) (x_k - θ1)²
– Setting ∇_θ l = (∂l/∂θ1, ∂l/∂θ2)^t = 0 gives:
  ∂/∂θ1 ln p(x_k|θ) = (1/θ2)(x_k - θ1) = 0
  ∂/∂θ2 ln p(x_k|θ) = -1/(2θ2) + (x_k - θ1)²/(2θ2²) = 0
– Summation over the full sample yields the conditions:
  (1) ∑_{k=1}^{n} (1/θ̂2)(x_k - θ̂1) = 0
  (2) -∑_{k=1}^{n} 1/θ̂2 + ∑_{k=1}^{n} (x_k - θ̂1)²/θ̂2² = 0
– Combining (1) and (2), one obtains:
  µ̂ = (1/n) ∑_{k=1}^{n} x_k ;  σ̂² = (1/n) ∑_{k=1}^{n} (x_k - µ̂)²
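The two closed-form estimates can be checked numerically; a minimal Python sketch (the synthetic sample below is invented for illustration, with assumed true values µ = 5, σ = 2). Note the ML variance divides by n, not n − 1, so it is the biased estimate.

```python
import random

random.seed(1)

# Hypothetical 1-D training sample drawn from N(mu=5, sigma=2).
xs = [random.gauss(5.0, 2.0) for _ in range(10_000)]
n = len(xs)

# ML estimates from conditions (1) and (2): the sample mean and the
# *biased* sample variance (divide by n, not n - 1).
mu_hat = sum(xs) / n
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n

print(mu_hat, var_hat)
```

With 10,000 samples the printed estimates should sit close to the assumed µ = 5 and σ² = 4, consistent with the good large-sample convergence noted earlier.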
Example 1:
– Consider an exponential distribution (single feature, single parameter):
  f(x; θ) = θ e^{-θx} for x ≥ 0, and 0 otherwise.
– With a random sample {X1, X2, …, Xn}, estimate θ.
– The likelihood is
  L(θ) = f(X1, X2, …, Xn; θ) = ∏_{i=1}^{n} θ e^{-θx_i} = θ^n e^{-θ ∑_{i=1}^{n} x_i}
– The log-likelihood is
  l(θ) = ln L(θ) = n ln θ - θ ∑_{i=1}^{n} x_i
– Setting dl/dθ = n/θ - ∑_{i=1}^{n} x_i = 0 gives
  θ̂ = n / ∑_{i=1}^{n} x_i = 1/x̄  (inverse of the average); valid for x̄ > 0.
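A short Python check of this result (the sample and the assumed true rate θ = 0.5 are made up for illustration): the ML estimate is just the reciprocal of the sample average.

```python
import random

random.seed(2)

theta_true = 0.5  # assumed "unknown" rate parameter for this sketch

# Draw n i.i.d. samples from an exponential distribution with rate theta.
xs = [random.expovariate(theta_true) for _ in range(50_000)]

# ML estimate: theta_hat = n / sum(x_i) = 1 / (sample average).
theta_hat = len(xs) / sum(xs)
print(theta_hat)
```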
Example 2:
– Multivariate Gaussian with unknown mean vector M. Assume Σ is known.
– k samples from the same distribution (i.i.d.): X1, X2, …, Xk.
  L(X|M) = ∏_{i=1}^{k} (1/((2π)^{n/2} |Σ|^{1/2})) exp(-½ (X_i - M)^t Σ^{-1} (X_i - M))
– Taking the log:
  l = log L = ∑_{i=1}^{k} (-(n/2) log 2π - ½ log|Σ| - ½ (X_i - M)^t Σ^{-1} (X_i - M))
– Setting the gradient with respect to M to zero (linear algebra):
  ∇_M l = ∑_{i=1}^{k} Σ^{-1}(X_i - M) = 0
– so ∑_{i=1}^{k} (X_i - M̂) = 0, and
  M̂ = (1/k) ∑_{i=1}^{k} X_i  (sample average or sample mean)
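A minimal Python sketch of this result (the 2-D sample, Σ = I, and the assumed true mean (1, −2) are invented for illustration): since Σ drops out of the condition ∑(X_i − M̂) = 0, the estimate is just the component-wise sample average.

```python
import random

random.seed(3)

# Hypothetical 2-D Gaussian samples with independent unit-variance
# components (Sigma = I) and assumed true mean M = (1, -2).
k = 20_000
samples = [(random.gauss(1.0, 1.0), random.gauss(-2.0, 1.0))
           for _ in range(k)]

# ML estimate of the mean vector: component-wise sample average.
m_hat = [sum(x[d] for x in samples) / k for d in range(2)]
print(m_hat)
```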
Example 3:
– Binary variables with unknown parameters p_i, 1 ≤ i ≤ n (n parameters).
– How do we estimate these? Think about your homework.
– For a single binary vector X = (x1, …, xn) with independent components:
  P(X) = ∏_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}
  log P(X) = ∑_{i=1}^{n} (x_i log p_i + (1 - x_i) log(1 - p_i))
– So, with k samples X1, …, Xk:
  l = log L = ∑_{j=1}^{k} log P(X_j) = ∑_{j=1}^{k} ∑_{i=1}^{n} (x_ij log p_i + (1 - x_ij) log(1 - p_i))
  where x_ij is the i-th element of the j-th sample X_j.
– The gradient ∇_p log L = (∂log L/∂p_1, …, ∂log L/∂p_n)^t has components
  ∂log L/∂p_i = ∑_{j=1}^{k} (x_ij/p_i - (1 - x_ij)/(1 - p_i))
– Setting ∂log L/∂p_i = 0:
  (1/p̂_i) ∑_{j=1}^{k} x_ij = (1/(1 - p̂_i)) ∑_{j=1}^{k} (1 - x_ij)
  so p̂_i = (1/k) ∑_{j=1}^{k} x_ij
– So p̂_i is the sample average of the i-th feature.
– Since x_ij is binary, ∑_{j=1}^{k} x_ij is the same as counting the occurrences of '1'.
– Consider a character recognition problem with binary matrices (e.g., classes A and B). For each pixel i, count the number of 1's across the samples; divided by k, this is the estimate of p_i.
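The pixel-counting estimate above can be sketched in a few lines of Python (the 4-"pixel" images and the assumed true probabilities are made up for illustration; a real character-recognition setup would use one such estimate per class):

```python
import random

random.seed(4)

# Hypothetical binary "images" with n = 4 pixels; pixel i is 1 with
# probability p_i (assumed true values below, treated as unknown).
p_true = [0.1, 0.5, 0.9, 0.3]
k = 50_000
samples = [[1 if random.random() < p else 0 for p in p_true]
           for _ in range(k)]

# ML estimate: p_hat_i = (1/k) * sum_j x_ij, i.e. the fraction of
# samples in which pixel i is '1'.
p_hat = [sum(x[i] for x in samples) / k for i in range(len(p_true))]
print(p_hat)
```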