Parameter Estimation – Lecture Slides (Transcript, 34 pages)

Page 2: Parameter Estimation

– In previous chapters:
  – We could design an optimal classifier if we knew the prior probabilities P(ωi) and the class-conditional probabilities P(x|ωi).
– Unfortunately, in pattern recognition applications we rarely have this kind of complete knowledge about the probabilistic structure of the problem.

Page 3: Parameter Estimation

– We have a number of design samples, or training data.
– The problem is to find some way to use this information to design or train the classifier.
– One approach (a short code sketch follows below):
  – use the samples to estimate the unknown probabilities and probability densities,
  – and then use the resulting estimates as if they were the true values.
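A minimal Python sketch of this plug-in idea, assuming Gaussian class-conditional densities (as the later slides do) and made-up two-dimensional training data; the SciPy call and all names here are illustrative, not part of the slides:

    # Plug-in Bayes classifier: estimate each class's parameters from training
    # data, then use the estimates as if they were the true values.
    import numpy as np
    from scipy.stats import multivariate_normal

    X1 = np.random.randn(50, 2) + [0.0, 0.0]     # hypothetical samples, class 1
    X2 = np.random.randn(60, 2) + [3.0, 3.0]     # hypothetical samples, class 2

    def fit_gaussian(X):
        # ML estimates of the mean and (biased) covariance, derived later in the slides
        return X.mean(axis=0), np.cov(X, rowvar=False, bias=True)

    params = [fit_gaussian(X1), fit_gaussian(X2)]
    priors = np.array([len(X1), len(X2)], dtype=float)
    priors /= priors.sum()                        # P(w_i) estimated from class frequencies

    x = np.array([2.5, 2.0])                      # a test point
    posteriors = [prior * multivariate_normal(mu, S).pdf(x)
                  for (mu, S), prior in zip(params, priors)]
    print("decide class", int(np.argmax(posteriors)) + 1)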


Page 5: Parameter Estimation

– What are you going to estimate in your homework?

Page 6: Parameter Estimation

– If we know the number of parameters in advance and our general knowledge about the problem permits us to parameterize the conditional densities, then the severity of the problem can be reduced significantly.

Page 7: Parameter Estimation

– For example:
  – We can reasonably assume that p(x|ωi) is a normal density with mean µi and covariance matrix Σi.
  – We do not know the exact values of these quantities.
  – However, this knowledge simplifies the problem from one of estimating an unknown function p(x|ωi) to one of estimating the parameters: the mean µi and the covariance matrix Σi.
– Estimating p(x|ωi)  ⇒  estimating µi and Σi (a short sketch follows below).
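A minimal sketch of this reduction, assuming NumPy and made-up samples from one class; the ML formulas used here are the ones derived later in these slides:

    import numpy as np

    X = np.random.randn(200, 3) * 2.0 + [1.0, -1.0, 0.5]   # hypothetical samples of one class

    mu_hat = X.mean(axis=0)                                 # estimate of the mean vector µ_i
    Sigma_hat = np.cov(X, rowvar=False, bias=True)          # ML (biased) estimate of the covariance Σ_i

    print(mu_hat)
    print(Sigma_hat)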

Page 8: Parameter Estimation

– Data availability in a Bayesian framework: we could design an optimal classifier if we knew
  – P(ωi) (priors)
  – P(x|ωi) (class-conditional densities)
  Unfortunately, we rarely have this complete information!
– Designing a classifier from a training sample:
  – there is usually no problem with prior estimation,
  – but samples are often too small for class-conditional estimation (large dimension of the feature space!).

Page 9: Parameter Estimation

– Given a set of data from each class, how do we estimate the parameters of the class-conditional densities P(x|ωi)?
– Example: P(x|ωj) = N(µj, Σj) is normal, with parameters θj = (µj, Σj).

Page 10:

– Two major approaches:
  – Maximum-Likelihood Method
  – Bayesian Method
– Use P(ωi | x) for our classification rule!
– The results are nearly identical, but the approaches are different.

Page 11: Maximum-Likelihood vs. Bayesian

– Maximum Likelihood:
  – Parameters are fixed but unknown!
  – The best parameters are obtained by maximizing the probability of obtaining the samples observed.
– Bayes:
  – Parameters are random variables having some known distribution.
  – The best parameters are obtained by estimating them given the data.

Page 12:

Major assumptions:
– A priori information P(ωi) for each category is available.
– Samples are i.i.d. and P(x|ωi) is normal:

  P(x|ωi) ~ N(µi, Σi)

– Note: each class-conditional density is characterized by the two parameters µi and Σi.

Page 13: Maximum-Likelihood Estimation

– Has good convergence properties as the sample size increases.
– Simpler than the alternative techniques.

Page 14: Maximum-Likelihood Estimation

Page 15: Maximum-Likelihood Estimation

– General principle: assume we have c classes and that the samples in class j have been drawn according to the probability law p(x|ωj):

  p(x|ωj) ~ N(µj, Σj)

  We write p(x|ωj) as p(x|ωj, θj) to make the dependence on the parameter vector explicit, where θj = (µj, Σj).

Page 16: Maximum-Likelihood Estimation

– Our problem is to use the information provided by the training samples to obtain good estimates for the unknown parameter vectors associated with each category.

Page 17: Likelihood Function

– Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category.
– Suppose that D contains n samples x1, x2, …, xn, drawn independently. Then

  P(D | θ) = ∏_{k=1}^{n} P(x_k | θ) ≡ F(θ)

  P(D | θ) is called the likelihood of θ with respect to the set of samples.
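A minimal numerical illustration of this definition, assuming a one-dimensional Gaussian whose mean is the unknown parameter θ; the data and the candidate values are made up:

    import numpy as np
    from scipy.stats import norm

    D = np.array([1.2, 0.7, 1.9, 1.4, 0.9])          # hypothetical training samples

    def likelihood(theta, sigma=1.0):
        # P(D | theta) = product over the samples of P(x_k | theta)
        return np.prod(norm.pdf(D, loc=theta, scale=sigma))

    for theta in [0.0, 1.0, 1.22, 2.0]:
        print(theta, likelihood(theta))               # largest near the sample mean (1.22)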

Page 18:

– Goal: find an estimate θ̂.
– Find the θ̂ which maximizes P(D | θ):
  "It is the value of θ that best agrees with the actually observed training sample."

Page 20:

– Maximize the log-likelihood function: l(θ) = ln P(D | θ).
– Let θ = (θ1, θ2, …, θp)^t and let ∇_θ be the gradient operator

  ∇_θ = [ ∂/∂θ1, ∂/∂θ2, …, ∂/∂θp ]^t

– New problem statement: determine the θ̂ that maximizes the log-likelihood,

  θ̂ = arg max_θ l(θ)
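A minimal sketch of this maximization done numerically, assuming SciPy's general-purpose optimizer applied to the negative log-likelihood of a univariate Gaussian; the data and starting point are made up:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    D = np.array([1.2, 0.7, 1.9, 1.4, 0.9])                 # hypothetical samples

    def neg_log_likelihood(theta):
        mu, log_sigma = theta                                # sigma parameterized by its log to stay positive
        return -np.sum(norm.logpdf(D, loc=mu, scale=np.exp(log_sigma)))

    res = minimize(neg_log_likelihood, x0=[0.0, 0.0])        # theta_hat = argmax of l(theta)
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
    print(mu_hat, sigma_hat)                                 # close to the sample mean and std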

Page 21:

– The set of necessary conditions for an optimum is

  ∇_θ l = 0,   where   l(θ) = Σ_{k=1}^{n} ln P(x_k | θ)   and   ∇_θ l = Σ_{k=1}^{n} ∇_θ ln P(x_k | θ).

Page 22:

– Example of a specific case: unknown µ
  – P(x_k | µ) ~ N(µ, Σ)  (the samples are drawn from a multivariate normal population)
  – θ = µ, therefore

    ln P(x_k | µ) = -(1/2) ln[ (2π)^d |Σ| ] - (1/2) (x_k - µ)^t Σ^{-1} (x_k - µ)

    and

    ∇_µ ln P(x_k | µ) = Σ^{-1} (x_k - µ)

  – The ML estimate for µ must therefore satisfy

    Σ_{k=1}^{n} Σ^{-1} (x_k - µ̂) = 0

Page 23:

– Multiplying by Σ and rearranging, we obtain

  µ̂ = (1/n) Σ_{k=1}^{n} x_k

  which is just the arithmetic average of the training samples!

– Conclusion: if P(x_k | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a d-dimensional feature space, then we can estimate the parameter vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
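A quick numerical check of this result, assuming NumPy and a made-up two-dimensional data set:

    import numpy as np

    X = np.random.randn(500, 2) @ np.array([[1.0, 0.3], [0.3, 0.5]]) + [2.0, -1.0]  # hypothetical samples

    mu_hat = X.mean(axis=0)                      # arithmetic average = ML estimate of µ
    Sigma = np.cov(X, rowvar=False)              # any fixed covariance works for the check below

    # necessary condition: sum_k Sigma^{-1} (x_k - mu_hat) should vanish
    residual = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)
    print(mu_hat, residual)                      # residual is ~[0, 0] up to floating-point error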

Page 24:

– ML estimation, Gaussian case: both µ and σ unknown (univariate)

  θ = (θ1, θ2) = (µ, σ²)

  l = ln P(x_k | θ) = -(1/2) ln(2π θ2) - (1/(2 θ2)) (x_k - θ1)²

  ∇_θ l = [ ∂(ln P(x_k | θ))/∂θ1 , ∂(ln P(x_k | θ))/∂θ2 ]^t = 0

  which gives, for each sample x_k:

  (1/θ2)(x_k - θ1) = 0        and        -1/(2 θ2) + (x_k - θ1)²/(2 θ2²) = 0

Page 25:

– Summing over all n samples:

  Σ_{k=1}^{n} (1/θ̂2)(x_k - θ̂1) = 0                                (1)

  -Σ_{k=1}^{n} (1/θ̂2) + Σ_{k=1}^{n} (x_k - θ̂1)²/θ̂2² = 0           (2)

– Combining (1) and (2), one obtains:

  µ̂ = (1/n) Σ_{k=1}^{n} x_k          σ̂² = (1/n) Σ_{k=1}^{n} (x_k - µ̂)²
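A minimal sketch of these two estimates, assuming NumPy and made-up one-dimensional data; note that the ML variance divides by n (biased), unlike the usual n − 1 sample variance:

    import numpy as np

    x = np.array([2.1, 1.8, 2.4, 2.0, 1.7, 2.6])         # hypothetical samples

    mu_hat = x.mean()                                     # (1/n) sum_k x_k
    sigma2_hat = np.mean((x - mu_hat) ** 2)               # (1/n) sum_k (x_k - mu_hat)^2  (ML, biased)

    print(mu_hat, sigma2_hat)
    print(np.var(x), np.var(x, ddof=1))                   # np.var's default matches the ML estimate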

Page 26:

– Example 1: consider an exponential distribution (single feature x, single parameter θ):

  f(x; θ) = θ e^{-θx}   for x ≥ 0,        f(x; θ) = 0 otherwise

– Given a random sample {X1, X2, …, Xn}, estimate θ by maximizing the likelihood L(θ).

Page 27:

– Example 1 (continued): for x ≥ 0,

  L(θ) = f(X1, X2, …, Xn; θ) = ∏_{i=1}^{n} θ e^{-θ X_i} = θ^n e^{-θ Σ_{i=1}^{n} X_i}

  l(θ) = ln L(θ) = n ln θ - θ Σ_{i=1}^{n} X_i

  dl/dθ = n/θ - Σ_{i=1}^{n} X_i = 0

  θ̂ = n / Σ_{i=1}^{n} X_i = 1 / X̄      (the inverse of the sample average)
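A minimal numerical check of this closed form, assuming NumPy and a made-up true rate; the grid search simply confirms that 1/mean maximizes the log-likelihood:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.exponential(scale=1.0 / 2.5, size=1000)       # hypothetical data with true rate 2.5

    theta_hat = 1.0 / X.mean()                             # closed-form ML estimate: n / sum(X)
    print(theta_hat)                                       # close to 2.5

    def log_likelihood(theta):
        return len(X) * np.log(theta) - theta * X.sum()    # l(theta) = n ln(theta) - theta sum(X)

    grid = np.linspace(0.5, 5.0, 1000)
    print(grid[np.argmax([log_likelihood(t) for t in grid])])   # agrees with theta_hat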

Page 28:

– Example 2: multivariate Gaussian with unknown mean vector M; assume the covariance Σ is known.
– k i.i.d. samples from the same distribution: X1, X2, …, Xk.

  L(X | M) = ∏_{i=1}^{k} 1/((2π)^{n/2} |Σ|^{1/2}) · exp( -(1/2) (X_i - M)^T Σ^{-1} (X_i - M) )

  l = log L = Σ_{i=1}^{k} [ -(n/2) log(2π) - (1/2) log|Σ| - (1/2) (X_i - M)^T Σ^{-1} (X_i - M) ]

  ∇_M l = Σ_{i=1}^{k} Σ^{-1} (X_i - M)      (using linear algebra)

Page 29:

– Setting the gradient to zero:

  Σ_{i=1}^{k} Σ^{-1} (X_i - M̂) = 0    ⇒    Σ_{i=1}^{k} X_i = k M̂

  M̂ = (1/k) Σ_{i=1}^{k} X_i      (the sample average, or sample mean)
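A minimal sketch of this estimate, assuming NumPy and a made-up known covariance; note that the resulting estimate does not depend on Σ:

    import numpy as np

    rng = np.random.default_rng(1)
    Sigma_known = np.array([[2.0, 0.6], [0.6, 1.0]])            # known covariance (made up)
    M_true = np.array([1.0, -2.0])
    X = rng.multivariate_normal(M_true, Sigma_known, size=300)  # k hypothetical samples X_1, ..., X_k

    M_hat = X.mean(axis=0)                                      # (1/k) sum_i X_i
    print(M_hat)                                                # close to M_true, for any Sigma_known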

Page 30:

– Example 3: binary variables x1, …, xn with unknown parameters pi = P(xi = 1), 1 ≤ i ≤ n  (n parameters).
– How do we estimate these? Think about your homework.

Page 31:

– Example 3 (continued): binary variables with unknown parameters pi, 1 ≤ i ≤ n. For a single sample X,

  log P(X) = Σ_{i=1}^{n} [ x_i log p_i + (1 - x_i) log(1 - p_i) ]

– So, with k samples X1, …, Xk,

  l = log L = Σ_{j=1}^{k} log P(X_j) = Σ_{j=1}^{k} Σ_{i=1}^{n} [ x_{ij} log p_i + (1 - x_{ij}) log(1 - p_i) ]

  where x_{ij} is the i-th element of the j-th sample X_j.

Page 32:

– So,

  ∇_p log L = [ ∂ log L/∂p_1, ∂ log L/∂p_2, …, ∂ log L/∂p_n ]^t

  ∂ log L/∂p_i = Σ_{j=1}^{k} [ x_{ij}/p_i - (1 - x_{ij})/(1 - p_i) ] = 0

  (1/p̂_i) Σ_{j=1}^{k} x_{ij} = (1/(1 - p̂_i)) Σ_{j=1}^{k} (1 - x_{ij})

  p̂_i = (1/k) Σ_{j=1}^{k} x_{ij}

– p̂_i is the sample average of the i-th feature.
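A minimal sketch of this estimate, assuming NumPy and a made-up binary data matrix with one row per sample and one column per feature:

    import numpy as np

    rng = np.random.default_rng(2)
    p_true = np.array([0.1, 0.5, 0.9])                  # hypothetical true parameters, n = 3 features
    X = (rng.random((1000, 3)) < p_true).astype(int)    # k = 1000 binary samples x_ij

    p_hat = X.mean(axis=0)                              # (1/k) sum_j x_ij, computed per feature
    print(p_hat)                                        # close to p_true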

Page 33:

– Since xi is binary, Σ_{j=1}^{k} x_{ij} is simply a count of the occurrences of '1' for feature i.
– Consider a character recognition problem with binary matrices (e.g., classes 'A' and 'B').
– For each pixel, count the number of 1's across the training images of a class; dividing this count by the number of images gives the estimate of pi for that pixel.
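A minimal end-to-end sketch of this idea, assuming NumPy, made-up 5×5 binary "images", and a naive-Bayes rule built from the per-pixel estimates; add-one smoothing is used so no estimated probability is exactly 0 or 1 (none of these names or values come from the slides):

    import numpy as np

    rng = np.random.default_rng(3)

    def make_images(p_pixel, k=200):
        # k hypothetical binary images drawn pixel-wise from a class's p_i values
        return (rng.random((k, p_pixel.size)) < p_pixel.ravel()).astype(int)

    template_A = rng.random((5, 5))                      # made-up per-pixel probabilities for class A
    template_B = rng.random((5, 5))                      # and for class B
    train_A, train_B = make_images(template_A), make_images(template_B)

    def estimate_p(X):
        # per-pixel count of 1's, normalized by the number of images (with add-one smoothing)
        return (X.sum(axis=0) + 1) / (len(X) + 2)

    p_A, p_B = estimate_p(train_A), estimate_p(train_B)

    def log_posterior(x, p, prior=0.5):
        return np.log(prior) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

    x_new = make_images(template_A, k=1)[0]              # one new image, actually from class A
    print("A" if log_posterior(x_new, p_A) > log_posterior(x_new, p_B) else "B")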

Page 34: References

– R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, New York: John Wiley & Sons, 2001.
– Selim Aksoy, Pattern Recognition Course Materials, 2011.
– M. Narasimha Murty and V. Susheela Devi, Pattern Recognition: An Algorithmic Approach, Springer, 2011.
– "Bayes Decision Rule – Discrete Features", http://www.cim.mcgill.ca/~friggi/bayes/decision/binaryfeatures.html, accessed 2011.