
2

– In previous chapters:
  – We could design an optimal classifier if we knew the prior probabilities P(ωi) and the class-conditional densities p(x | ωi).
– Unfortunately, in pattern recognition applications we rarely have this kind of complete knowledge about the probabilistic structure of the problem.

Parameter Estimation

3

– We have a number of design samples or training data.
– The problem is to find some way to use this information to design or train the classifier.
– One approach:
  – Use the samples to estimate the unknown probabilities and probability densities,
  – and then use the resulting estimates as if they were the true values (a short sketch of this plug-in approach follows below).

Parameter Estimation
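As a rough illustration of this plug-in approach, here is an illustrative Python sketch (not part of the original slides; the two-class Gaussian setup, the simulated data, and all variable names are assumptions): estimate the parameters from training samples, then use the estimates as if they were the true values in the classification rule.

```python
import numpy as np

def fit_gaussian(X):
    """ML estimates of mean and covariance from an (n_samples, d) array."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)   # bias=True -> divide by n (the ML estimate)
    return mu, Sigma

def log_gaussian(x, mu, Sigma):
    """Log of the d-dimensional normal density p(x | mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * d * np.log(2.0 * np.pi))

def classify(x, class_params, priors):
    """Plug-in Bayes rule: use the estimates as if they were the true values."""
    scores = [np.log(P) + log_gaussian(x, mu, Sigma)
              for (mu, Sigma), P in zip(class_params, priors)]
    return int(np.argmax(scores))

# Tiny usage sketch with made-up two-class data:
rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X1 = rng.normal([3.0, 3.0], 1.0, size=(100, 2))
params = [fit_gaussian(X0), fit_gaussian(X1)]
print(classify(np.array([2.5, 2.8]), params, priors=[0.5, 0.5]))   # -> 1
```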


5

– What are you going to estimate in your homework?

Parameter Estimation

6

– If we know the number of parameters in advance and our general knowledge about the problem permits us to parameterize the conditional densities, then the severity of the problem can be reduced significantly.

Parameter Estimation

7

– For example:
  – We can reasonably assume that p(x | ωi) is a normal density with mean µi and covariance matrix Σi.
  – We do not know the exact values of these quantities.
  – However, this knowledge simplifies the problem from one of estimating an unknown function p(x | ωi) to one of estimating the parameters, the mean µi and the covariance matrix Σi.
– Estimating p(x | ωi) reduces to estimating µi and Σi.

Parameter Estimation

8

– Data availability in a Bayesian framework:
  We could design an optimal classifier if we knew:
  – P(ωi) (priors)
  – p(x | ωi) (class-conditional densities)
  Unfortunately, we rarely have this complete information!
– Designing a classifier from a training sample:
  – No problem with prior estimation.
  – Samples are often too small for class-conditional estimation (large dimension of feature space!).

Parameter Estimation

9

– Given a set of training data from each class, how do we estimate the parameters of the class-conditional densities p(x | ωi)?

– Example: p(x | ωj) = N(µj, Σj) is normal, with parameters θj = (µj, Σj).

Parameter Estimation

Two major approaches

– Maximum-Likelihood Method
– Bayesian Method

– Use P(ωi | x) for our classification rule!

– Results are nearly identical, but the approaches are different

Maximum-Likelihood vs. Bayesian:

Maximum Likelihood

Parameters are fixed but unknown!

Best parameters are obtained by maximizing the probability of obtaining the samples observed

Bayes

Parameters are random variables having some known distribution

The best parameters are obtained by computing their posterior distribution given the observed data

Major assumptions

– A priori information P(ωi) for each category is available.

– Samples are i.i.d. and p(x | ωi) is normal:

  p(x | ωi) ~ N(µi, Σi)

– Note: each class-conditional density is characterized by the two parameters µi and Σi.

12

– Maximum-likelihood estimation:
  – Has good convergence properties as the sample size increases.
  – Is simpler than alternative techniques.

13


Maximum-Likelihood Estimation

14


Maximum-Likelihood Estimation

General principle:
– Assume we have c classes, and the samples in class j have been drawn according to the probability law p(x | ωj).
– p(x | ωj) ~ N(µj, Σj)
– p(x | ωj) ≡ p(x | ωj, θj), where θj consists of the components µj and Σj.

15


Maximum-Likelihood Estimation

Our problem is to use the information provided by the training samples to obtain good estimates for the unknown parameter vectors associated with each category.

16


Maximum-Likelihood Estimation

Likelihood Function

Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category.

Suppose that D contains n samples, x1, x2,…, xn

17

P(D | θ) = ∏_{k=1}^{n} P(x_k | θ) = F(θ)

P(D | θ) is called the likelihood of θ with respect to the set of samples.

Goal: find an estimate θ̂.

Find the θ̂ which maximizes P(D | θ):

"It is the value of θ that best agrees with the actually observed training samples."
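A minimal numerical illustration of this idea (my own sketch, not from the slides; the univariate Gaussian model with known σ, the simulated data, and the grid of candidate values are all assumptions): evaluate the log of P(D | θ) for many candidate θ and keep the one that best agrees with the observed samples.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=3.0, scale=1.0, size=200)   # observed samples (true mean 3, known sigma 1)

def log_likelihood(theta, data, sigma=1.0):
    """log P(D | theta) for i.i.d. samples under N(theta, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - theta) ** 2 / (2 * sigma**2))

candidates = np.linspace(0.0, 6.0, 601)
theta_hat = candidates[np.argmax([log_likelihood(t, D) for t in candidates])]
print(theta_hat)   # close to the sample mean of D
```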

18

Maximize the log-likelihood function: l(θ) = ln P(D | θ).

Let θ = (θ_1, θ_2, …, θ_p)^t and let ∇_θ be the gradient operator.

New problem statement: determine the θ̂ that maximizes the log-likelihood.

20

∇_θ = [∂/∂θ_1, ∂/∂θ_2, …, ∂/∂θ_p]^t

θ̂ = arg max_θ l(θ)

21

Set of necessary conditions for an optimum is:

∇_θ l = 0, where

l(θ) = ∑_{k=1}^{n} ln P(x_k | θ)   and   ∇_θ l = ∑_{k=1}^{n} ∇_θ ln P(x_k | θ)
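When ∇_θ l = 0 has no convenient closed-form solution, the same necessary condition can be handled numerically by minimizing the negative log-likelihood. A minimal sketch, assuming a univariate Gaussian with both parameters unknown and that SciPy is available (all names and data here are illustrative, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # training samples

def neg_log_likelihood(theta):
    mu, log_sigma = theta                      # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu) ** 2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # matches x.mean() and x.std() (the ML estimates)
```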

22

Example of a specific case: unknown µ

– p(x_k | µ) ~ N(µ, Σ) (the samples are drawn from a multivariate normal population)
– θ = µ; therefore:

ln p(x_k | µ) = -½ ln[(2π)^d |Σ|] - ½ (x_k - µ)^t Σ^{-1} (x_k - µ)

and

∇_µ ln p(x_k | µ) = Σ^{-1} (x_k - µ)

• The ML estimate for µ must satisfy:

∑_{k=1}^{n} Σ^{-1} (x_k - µ̂) = 0

23

• Multiplying by Σ and rearranging, we obtain:

µ̂ = (1/n) ∑_{k=1}^{n} x_k

This is just the arithmetic average of the training samples!

Conclusion: if p(x_k | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ_1, θ_2, …, θ_c)^t and perform an optimal classification!

24

ML Estimation – Gaussian case: unknown µ and σ

θ = (θ_1, θ_2) = (µ, σ²)

l = ln p(x_k | θ) = -½ ln(2π θ_2) - (1/(2θ_2)) (x_k - θ_1)²

∇_θ l = [∂l/∂θ_1, ∂l/∂θ_2]^t = 0, which gives, for each sample x_k:

(1/θ_2)(x_k - θ_1) = 0

-1/(2θ_2) + (x_k - θ_1)²/(2θ_2²) = 0

25

Summation over all n samples yields the conditions:

(1) ∑_{k=1}^{n} (1/σ̂²)(x_k - µ̂) = 0

(2) -∑_{k=1}^{n} (1/σ̂²) + ∑_{k=1}^{n} (x_k - µ̂)²/σ̂⁴ = 0

Combining (1) and (2), one obtains:

µ̂ = (1/n) ∑_{k=1}^{n} x_k  ;   σ̂² = (1/n) ∑_{k=1}^{n} (x_k - µ̂)²
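A quick numerical check of these two formulas (an illustrative sketch, not part of the slides; the simulated data and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # samples with true mu = 5, sigma = 2

mu_hat = x.mean()                          # ML estimate: arithmetic average
sigma2_hat = ((x - mu_hat) ** 2).mean()    # ML estimate: divides by n, not n-1 (biased)

print(mu_hat, sigma2_hat)                  # close to 5 and 4 for a large sample
```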

Example 1:

Consider an exponential distribution (single feature, single parameter):

f(x; θ) = θ e^{-θx}   for x ≥ 0
f(x; θ) = 0           otherwise

With a random sample {X_1, X_2, …, X_n}, estimate θ by maximizing the likelihood L(θ).

Example 1:

Consider the exponential distribution again, with the random sample {X_1, X_2, …, X_n}:

L(θ) = ∏_{i=1}^{n} f(X_i; θ) = ∏_{i=1}^{n} θ e^{-θ X_i} = θ^n e^{-θ ∑_{i=1}^{n} X_i}

l(θ) = ln L(θ) = n ln θ - θ ∑_{i=1}^{n} X_i

dl/dθ = n/θ - ∑_{i=1}^{n} X_i = 0

θ̂ = n / ∑_{i=1}^{n} X_i = 1 / X̄   (the inverse of the sample average), valid for x ≥ 0.
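A small sketch of this estimator (illustrative only; note that NumPy parameterizes the exponential by the scale 1/θ, and the simulated data are an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.5
# NumPy's exponential takes the scale parameter, which is 1/theta.
x = rng.exponential(scale=1.0 / theta_true, size=2000)

theta_hat = 1.0 / x.mean()   # ML estimate: the inverse of the sample average
print(theta_hat)             # close to 0.5 for a large sample
```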

Example 2:

Multivariate Gaussian with unknown mean vector M. Assume the covariance matrix Σ is known.

k samples X_1, X_2, …, X_k drawn i.i.d. from the same distribution:

L(M) = ∏_{i=1}^{k} (2π)^{-n/2} |Σ|^{-1/2} exp( -½ (X_i - M)^T Σ^{-1} (X_i - M) )

l(M) = log L(M) = ∑_{i=1}^{k} [ -(n/2) log(2π) - ½ log|Σ| - ½ (X_i - M)^T Σ^{-1} (X_i - M) ]

Setting the gradient with respect to M to zero (linear algebra):

∇_M l = ∑_{i=1}^{k} Σ^{-1} (X_i - M) = 0

∑_{i=1}^{k} X_i = k M̂

M̂ = (1/k) ∑_{i=1}^{k} X_i   (the sample average, or sample mean)
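The same estimate in code (an illustrative sketch; the dimension, the assumed known Σ, and the simulated samples are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
M_true = np.array([1.0, -2.0, 0.5])
Sigma = np.diag([1.0, 0.5, 2.0])                       # known covariance (assumed)
X = rng.multivariate_normal(M_true, Sigma, size=500)   # k x n array of samples

M_hat = X.mean(axis=0)   # ML estimate of the mean vector: the sample average
print(M_hat)             # close to [1, -2, 0.5]
```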

Example 3:

Binary variables x_1, …, x_n with unknown parameters p_i = P(x_i = 1), 1 ≤ i ≤ n (n parameters).

How do we estimate these parameters? (Think about your homework.)

Example 3:

Binary variables with unknown parameters p_i, 1 ≤ i ≤ n (n parameters). For a single sample X = (x_1, …, x_n):

log P(X) = ∑_{i=1}^{n} [ x_i log p_i + (1 - x_i) log(1 - p_i) ]

So, with k samples X_1, …, X_k:

l = log L = ∑_{j=1}^{k} log P(X_j) = ∑_{j=1}^{k} ∑_{i=1}^{n} [ x_{ij} log p_i + (1 - x_{ij}) log(1 - p_i) ]

where x_{ij} is the i-th element of the j-th sample X_j.

Setting the gradient of the log-likelihood with respect to p = (p_1, …, p_n)^t to zero:

∇_p log L = [∂ log L/∂p_1, …, ∂ log L/∂p_n]^t = 0

∂ log L / ∂p_i = ∑_{j=1}^{k} [ x_{ij}/p_i - (1 - x_{ij})/(1 - p_i) ] = 0

p̂_i = (1/k) ∑_{j=1}^{k} x_{ij}

So, p̂_i is the sample average of the i-th feature.

Since x_i is binary, ∑_{j=1}^{k} x_{ij} is the same as counting the occurrences of '1'.

Consider a character recognition problem where each character is a binary matrix (e.g., images of the letters A and B). For each pixel, count the number of 1's across the training images; that count divided by k is the estimate of p_i.
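A short sketch of this pixel-wise estimate (illustrative code, not from the slides; the 8×8 character size, the simulated training images, and the variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
p_true = rng.uniform(0.1, 0.9, size=64)            # one p_i per pixel of an 8x8 character
X = (rng.random((500, 64)) < p_true).astype(int)   # k = 500 simulated binary images

p_hat = X.mean(axis=0)   # fraction of 1's at each pixel = ML estimate of p_i
print(np.abs(p_hat - p_true).max())   # small for a large training set
```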

References

– R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, New York: John Wiley, 2001.
– Selim Aksoy, Pattern Recognition Course Materials, 2011.
– M. Narasimha Murty and V. Susheela Devi, Pattern Recognition: An Algorithmic Approach, Springer, 2011.
– "Bayes Decision Rule: Discrete Features", http://www.cim.mcgill.ca/~friggi/bayes/decision/binaryfeatures.html, accessed 2011.