Transcript of cse802/S17/slides/Lec_06_07_08_Feb06.pdf · 2017. 2. 9. (108 slides)

Page 1:

Course Outline

MODEL INFORMATION
  COMPLETE → Bayes Decision Theory → "Optimal" Rules
  INCOMPLETE
    Supervised Learning
      Parametric Approach → Plug-in Rules
      Nonparametric Approach → Density Estimation, Geometric Rules (K-NN, MLP)
    Unsupervised Learning
      Parametric Approach → Mixture Resolving
      Nonparametric Approach → Cluster Analysis (Hard, Fuzzy)

Page 2:

Supervised Learning: Two-dimensional Feature Space

Page 3:

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation

● Introduction
● Maximum-Likelihood Estimation
● Bayesian Estimation
● Curse of Dimensionality
● Component Analysis & Discriminants

Page 4:

● Bayesian framework

● To design an optimal classifier we need:
  ● P(ωi): priors
  ● P(x | ωi): class-conditional densities

What if this information is not available?

● Supervised Learning: design a classifier based on a set of labeled training samples
  ● Assume the priors are known
  ● A sufficient number of training samples is available to estimate P(x | ωi)

Page 5:

● Assumption: a parametric model of P(x | ωi) is available
  ● For example, for a Gaussian pdf assume P(x | ωi) ~ N(μi, Σi), i = 1, ..., c
  ● The parameters (μi, Σi) are not known, but labeled training samples are available to estimate them

● Parameter estimation
  ● Maximum-Likelihood (ML) estimation
  ● Bayesian estimation
  ● For large n, the estimates from the two methods are nearly identical

Page 6:

● ML parameter estimation (MLE):
  ● Parameters are assumed to be fixed but unknown
  ● The best parameter estimates are obtained by maximizing the probability of obtaining the samples observed

● Bayesian parameter estimation:
  ● Unknown parameters are random variables with some known prior distribution
  ● Use the prior and the samples to obtain the posterior density
  ● The parameter estimate is derived from the posterior and a loss function

● Both methods use P(ωi | x) for the decision rule!

Page 7:

● Maximum-Likelihood Parameter Estimation

● Has good convergence properties as the sample size increases; the estimated parameter value approaches the true value as n increases
● The simplest method for parameter estimation
● General principle
  ● Assume we have c classes and P(x | ωj) ~ N(μj, Σj)
  ● Write P(x | ωj) ≡ P(x | ωj, θj), where

    θj = (μj, Σj) = (μj1, μj2, ..., σj11², σj22², ..., cov(xj^m, xj^n), ...)

● Use the class ωj samples to estimate the class ωj parameters: μj, Σj

Page 8:

● Use the training samples to estimate θ = (θ1, θ2, ..., θc); θi (i = 1, 2, ..., c) is the parameter vector for class ωi

● The sample set D contains n i.i.d. samples x1, x2, ..., xn

● The ML estimate of θ is the value that maximizes P(D | θ); it is the value of θ that best agrees with the observed training samples:

  P(D | θ) = ∏_{k=1}^{n} P(xk | θ) = F(θ)

  P(D | θ) is called the likelihood of θ w.r.t. the set of samples

Page 9:

Page 10:

● ML estimation

● Let θ = (θ1, θ2, ..., θp)ᵀ and ∇θ be the gradient operator:

  ∇θ = [ ∂/∂θ1, ∂/∂θ2, ..., ∂/∂θp ]ᵀ

● Define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ)

● Determine the θ that maximizes the log-likelihood:

  θ̂ = arg max_θ l(θ)

Page 11:

Set of necessary conditions for an optimum:

  l(θ) = Σ_{k=1}^{n} ln P(xk | θ)

  ∇θ l = 0

Page 12:

● P(x | μ) ~ N(μ, Σ); μ is not known but Σ is known. Samples are drawn from a multivariate Gaussian:

  ln P(xk | μ) = -(1/2) ln[ (2π)^d |Σ| ] - (1/2)(xk - μ)ᵀ Σ⁻¹ (xk - μ)

  and   ∇μ ln P(xk | μ) = Σ⁻¹ (xk - μ)

The ML estimate for μ must satisfy:

  Σ_{k=1}^{n} Σ⁻¹ (xk - μ̂) = 0

Page 13:

• Multiplying by Σ and rearranging terms:

  μ̂ = (1/n) Σ_{k=1}^{n} xk

The MLE of the mean of a Gaussian distribution is the "sample mean".

Conclusion: Given P(xk | ωj, θj), j = 1, 2, ..., c, to be Gaussian in d dimensions, estimate the parameter vector θ = (θ1, θ2, ..., θc)ᵀ and then use the maximum a posteriori rule (Bayes decision rule).
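As a concrete illustration of the plug-in idea, the sketch below (not part of the original slides; the data layout is assumed) computes the ML estimates per class with NumPy: the sample mean and the 1/n sample covariance.

```python
# Minimal NumPy sketch of ML (plug-in) estimation for Gaussian class models.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-class, 2-D training data drawn from known Gaussians.
X1 = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.3], [0.3, 1]], size=500)
X2 = rng.multivariate_normal(mean=[2, 2], cov=[[1, -0.2], [-0.2, 1]], size=500)

def ml_estimates(X):
    """Return the ML estimates (mu_hat, Sigma_hat) for a Gaussian model."""
    mu_hat = X.mean(axis=0)                # sample mean
    diff = X - mu_hat
    sigma_hat = diff.T @ diff / len(X)     # 1/n covariance (the biased ML form)
    return mu_hat, sigma_hat

for name, X in [("class 1", X1), ("class 2", X2)]:
    mu_hat, sigma_hat = ml_estimates(X)
    print(name, "mu_hat =", np.round(mu_hat, 2))
    print(name, "Sigma_hat =\n", np.round(sigma_hat, 2))
```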

Page 14:

● ML Estimation: Univariate Gaussian case, unknown μ and σ

  θ = (θ1, θ2) = (μ, σ²)

● For the k-th sample (observation):

  l = ln P(xk | θ) = -(1/2) ln(2π θ2) - (1/(2θ2)) (xk - θ1)²

  Setting ∇θ l = ∇θ ln P(xk | θ) = 0 gives the two conditions

  (1/θ2)(xk - θ1) = 0

  -(1/(2θ2)) + (xk - θ1)² / (2θ2²) = 0

Page 15:

Introduce summation to account for all n samples:

  (1)  Σ_{k=1}^{n} (1/σ̂²) (xk - μ̂) = 0

  (2)  -Σ_{k=1}^{n} (1/σ̂²) + Σ_{k=1}^{n} (xk - μ̂)² / σ̂⁴ = 0

Combining (1) and (2), we get:

  μ̂ = (1/n) Σ_{k=1}^{n} xk ;    σ̂² = (1/n) Σ_{k=1}^{n} (xk - μ̂)²

Page 16:

● The ML estimate for σ² is biased:

  E[ (1/n) Σ_i (xi - x̄)² ] = ((n-1)/n) σ²  ≠  σ²

● An unbiased estimator for Σ is the sample covariance matrix:

  C = (1/(n-1)) Σ_{k=1}^{n} (xk - μ̂)(xk - μ̂)ᵀ
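A small simulation (assumed setup, not from the slides) makes the bias visible: with n = 10 the ML estimate averages about (n-1)/n of the true variance, while the 1/(n-1) estimate does not.

```python
# Compare the ML variance estimate (divide by n) with the unbiased one (n-1).
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0          # sigma^2 of the generating Gaussian (assumed)
n = 10                  # small sample size makes the bias visible
trials = 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))
ml_var = samples.var(axis=1, ddof=0)        # 1/n   -> biased
unbiased_var = samples.var(axis=1, ddof=1)  # 1/(n-1)

print("E[ML estimate]       ~", ml_var.mean())        # ~ (n-1)/n * 4 = 3.6
print("E[unbiased estimate] ~", unbiased_var.mean())  # ~ 4.0
```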

Page 17:

ML vs. Bayesian Parameter Estimation: the unknown parameter is the probability of heads of a coin

Page 18:
Page 19:
Page 20:
Page 21:
Page 22:

● Bayesian Estimation (Bayesian learning)
  ● In MLE, θ was assumed to have a fixed (but unknown) value
  ● In Bayesian learning, θ is a random variable
  ● Direct estimation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
  ● Goal: compute P(ωi | x, D)

Given the training sample set D, Bayes formula can be written

  P(ωi | x, D) = P(x | ωi, D) P(ωi | D) / Σ_{j=1}^{c} P(x | ωj, D) P(ωj | D)

Page 23:

● Derivation of the preceding equation:

  P(x, ωi | D) = P(x | ωi, D) · P(ωi | D)

  P(x | D) = Σ_{j=1}^{c} P(x, ωj | D)

  P(ωi | D) = P(ωi)   (the training samples provide this!)

  Thus:

  P(ωi | x, D) = P(x | ωi, D) P(ωi) / Σ_{j=1}^{c} P(x | ωj, D) P(ωj)

Page 24:

● Bayesian Parameter Estimation: Gaussian Case

Goal: estimate θ using the a posteriori density P(θ | D)

● The univariate Gaussian case: P(μ | D); μ is the only unknown parameter

  P(x | μ) ~ N(μ, σ²)
  P(μ) ~ N(μ0, σ0²)

  μ0 and σ0 are known!

Page 25:

● Reproducing density:

  (1)  P(μ | D) = P(D | μ) P(μ) / ∫ P(D | μ) P(μ) dμ = α ∏_{k=1}^{n} P(xk | μ) P(μ)

  (2)  P(μ | D) ~ N(μn, σn²)

The updated parameters of the prior:

  μn = ( n σ0² / (n σ0² + σ²) ) μ̂n + ( σ² / (n σ0² + σ²) ) μ0

  σn² = σ0² σ² / (n σ0² + σ²)

  where μ̂n is the sample mean of the n training samples.
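The update above is easy to evaluate numerically; the following sketch (assumed prior and data values, not from the slides) computes μn and σn² for a univariate example.

```python
# Univariate Bayesian update: prior N(mu0, sigma0^2) on the mean,
# known data variance sigma^2.
import numpy as np

rng = np.random.default_rng(2)

sigma = 1.0              # known data standard deviation (assumed)
mu0, sigma0 = 0.0, 2.0   # prior on the unknown mean (assumed)
true_mu = 1.5            # used only to generate the data

x = rng.normal(true_mu, sigma, size=20)
n = len(x)
sample_mean = x.mean()

# Posterior parameters: the "reproducing density" N(mu_n, sigma_n^2).
mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * sample_mean \
       + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
sigma_n2 = sigma0**2 * sigma**2 / (n * sigma0**2 + sigma**2)

print("posterior mean mu_n     =", round(mu_n, 3))
print("posterior var sigma_n^2 =", round(sigma_n2, 4))
# As n grows, mu_n -> sample mean and sigma_n^2 -> 0 (Bayesian learning).
```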

Page 26:

Page 27:

● The univariate case: P(x | D)
  ● P(μ | D) has been computed
  ● P(x | D) remains to be computed:

  P(x | D) = ∫ P(x | μ) P(μ | D) dμ   is Gaussian:

  P(x | D) ~ N(μn, σ² + σn²)

It provides the desired class-conditional density P(x | Dj, ωj). P(x | Dj, ωj) together with P(ωj), using Bayes formula, gives the Bayesian classification rule:

  max_{ωj} [ P(ωj | x, Dj) ]  ≡  max_{ωj} [ P(x | ωj, Dj) P(ωj) ]

Page 28:

● Bayesian Parameter Estimation: General Theory

● The P(x | D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:
  ● The form of P(x | θ) is assumed known, but the value of θ is not known exactly
  ● Our knowledge about θ is assumed to be contained in a known prior density P(θ)
  ● The rest of our knowledge about θ is contained in a set D of n samples x1, x2, ..., xn drawn from P(x)

Page 29:

The basic problem is:
1. Compute the posterior density P(θ | D)
2. Derive P(x | D)

Using Bayes formula, we have:

  P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ

And by the independence assumption:

  P(D | θ) = ∏_{k=1}^{n} P(xk | θ)

Page 30:

Iris Dataset
• Three types of iris flower: Setosa, Versicolor, Virginica
• Four features: sepal length, sepal width, petal length, petal width (all in cm)
• 50 patterns per class
• Available in the UCI Machine Learning Repository

Fisher, R. A., "The use of multiple measurements in taxonomic problems," Annals of Eugenics, 7, Part II, 179-188 (1936).

Page 31:

PCA

Explained variance ratio:
  1st component  0.925
  2nd component  0.053

Page 32:

LDA

Explained variance ratio:
  1st component  0.992
  2nd component  0.009
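The ratios on these two slides are consistent with what scikit-learn reports for the Iris data; a sketch of how they could be reproduced (assuming scikit-learn is available):

```python
# PCA and LDA explained variance ratios on the Iris data.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 x 4, three classes

pca = PCA(n_components=2).fit(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
print("LDA explained variance ratio:", lda.explained_variance_ratio_)
```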

Page 33:

ISOMAP

Page 34:

Low Dimensional Embedding of High Dimensional Data

• Given n patterns in a d-dim space, embed the points in m dimensions, m << d
• Purpose: data compression; avoid overfitting by reducing dimensionality; find "meaningful" low-dim structures in the high-dimensional observations
• Feature selection vs. feature extraction
• Feature extraction: linear vs. non-linear
• Linear feature extraction or projection: unsupervised (PCA) vs. supervised (LDA)
• Non-linear feature extraction (Isomap)

Page 35:

Eigen Decomposition

• Given a linear transformation A, a non-zero vector w is an eigenvector of A if it satisfies the eigenvalue equation

  A w = λ w    for some scalar λ

Solution:
  A w - λ w = 0  ⇒  (A - λI) w = 0  ⇒  det(A - λI) = 0   (characteristic equation)

Ex 1.   A = [ 2  1 ;  1  2 ]

Eigenvalues:
  det([ 2-λ  1 ;  1  2-λ ]) = 0  ⇒  (2-λ)² - 1 = 0  ⇒  λ² - 4λ + 3 = 0  ⇒  λ1 = 1 and λ2 = 3

Eigenvectors: solve [ 2-λi  1 ;  1  2-λi ] ei = 0, with each eigenvector normalized so that e1² + e2² = 1:

  e1 = [ 0.7071, -0.7071 ]ᵀ   (λ1 = 1)
  e2 = [ 0.7071,  0.7071 ]ᵀ   (λ2 = 3)
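A quick NumPy check of Ex. 1 (a sketch, not part of the slides):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(A)    # symmetric matrix -> eigh
print("eigenvalues :", eigvals)          # [1. 3.]
print("eigenvectors (columns):\n", eigvecs)
# Columns are [0.7071, -0.7071] and [0.7071, 0.7071], up to sign.
```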

Page 36:

[Scatter plot of Gaussian samples in the (x, y) plane]

  μ = [2, 1],   Σ = [ 5  2 ;  2  3 ]

Eigenvectors (columns, up to sign):
  [ 0.5238   0.8519 ]
  [ 0.8519   0.5238 ]

Eigenvalues:
  [ 1.7230      0    ]
  [    0     5.6644  ]

Page 37:

[Scatter plot of 3-D Gaussian samples]

  μ = [4, 2, 1],   Σ = I (3×3 identity)

Eigenvectors:
  [ 0.2190   0.0522  -0.9743 ]
  [ 0.8735  -0.4554   0.1720 ]
  [ 0.4347   0.8888   0.1453 ]

Eigenvalues:
  [ 480.4256     0          0      ]
  [    0      498.6763      0      ]
  [    0         0       568.5106  ]

Page 38:

PCA

Find a transformation w such that wᵀx is dispersed the most (maximum dispersion):

  Y = wᵀX

Page 39:

Scatter Matrices
• m = mean vector of all n patterns (grand mean)
• mi = mean vector of the class i patterns
• SW = within-class scatter matrix. It is proportional to the sample covariance matrix for the pooled d-dimensional data. It is symmetric and positive semidefinite, and is usually nonsingular if n > d
• SB = between-class scatter matrix. It is symmetric and positive semidefinite, but because it is the outer product of two vectors (in the two-class case), its rank is at most (C-1)
• ST = total scatter of all n patterns
• For any w, SBw is in the direction of (m1 - m2) (two-class case)

Page 40:

[Slide shows Eqs. (92), (97), (109), (115), (113), (116) from the text]

Page 41:

Principal Component Analysis (PCA)

• What is the best representation of n d-dim samples x1, ..., xn by a single point x0?
• Find x0 such that the sum of the squared distances between x0 and all the xk is minimized
• Define the squared-error criterion function J0(x0) by

  J0(x0) = Σ_{k=1}^{n} ||x0 - xk||²

  and find the x0 that minimizes J0.
• The solution is given by x0 = m, where m is the sample mean,

  m = (1/n) Σ_{k=1}^{n} xk .

Page 42:

Principal Component Analysis

• The sample mean is a zero-dim representation of the data; it does not reveal any of the data variability
• What is the best one-dim representation?
• Project the data onto a line through the sample mean. If e is a unit vector in the direction of the line, the equation of the line can be written as

  x = m + a e

Representing xk by m + ak e, find the "optimal" set of coefficients ak by minimizing the squared error

  J1(a1, ..., an, e) = Σ_{k=1}^{n} ||(m + ak e) - xk||²
                     = Σ_{k=1}^{n} ak² ||e||² - 2 Σ_{k=1}^{n} ak eᵀ(xk - m) + Σ_{k=1}^{n} ||xk - m||²

Page 43:

Principal Component Analysis

• Since ||e|| = 1, differentiate J1 with respect to ak and set the derivative to zero:

  ak = eᵀ(xk - m)

• To obtain a least-squares solution, project the vectors xk onto the line in the direction of e that passes through the sample mean
• What is the best direction e for the line? The solution involves the scatter matrix S:

  S = Σ_{k=1}^{n} (xk - m)(xk - m)ᵀ

• The best direction is the eigenvector of the scatter matrix with the largest eigenvalue

Page 44:

Principal Component Analysis

• The scatter matrix ST is real and symmetric; its eigenvectors are orthogonal and form a set of basis vectors for representing any vector x
• The coefficients ai in Eq. (89) are the components of x in that basis, called the principal components
• Data points x1, ..., xn can be viewed as a cloud in d dimensions; the eigenvectors of the scatter matrix are the principal axes of the point cloud
• PCA reduces dimensionality by restricting attention to those directions along which the scatter of the cloud is greatest (largest eigenvalues)
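A minimal NumPy sketch (not from the slides) of PCA exactly as described above: center the data, eigendecompose the scatter matrix, and keep the directions with the largest eigenvalues. The data used here are hypothetical.

```python
import numpy as np

def pca_project(X, m):
    """Project n x d data X onto its top-m principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    S = Xc.T @ Xc                          # scatter matrix ((n-1) x sample cov)
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending order for symmetric S
    order = np.argsort(eigvals)[::-1]      # largest eigenvalues first
    E = eigvecs[:, order[:m]]              # d x m basis of principal axes
    return Xc @ E, E, eigvals[order]

# Hypothetical example data: most scatter lies in the first two directions.
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0, 0], [[4, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=200)
Y, E, lam = pca_project(X, m=2)
print("scatter eigenvalues (descending):", np.round(lam, 1))
```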

Page 45:

Face Representation using PCA and LDA

[Figure: an input face is represented by coefficients on EigenFaces (PCA basis, which minimizes reconstruction error) and on Fisherfaces (LDA basis, which maximizes the ratio of between-class to within-class scatter), and then reconstructed. Example coefficients shown: 56.4, 38.6, -19.7, 9.8, -45.9, 19.6, -14.2 and 18.3, 35.6, -17.5, -27.6, 60.6, -20.8, 41.9, -9.6.]

Page 46:

Discriminant Analysis
• PCA finds components that explain data variance; those components may not be useful for discrimination between different classes
• Since no category labels are used, the components discarded by PCA might be exactly those that are needed for distinguishing between classes
• Whereas PCA seeks directions that are effective for representation, discriminant analysis seeks directions that are effective for discrimination
• A special case of multiple discriminant analysis is the Fisher linear discriminant, for C = 2

Page 47:

Fisher Linear Discriminant
• Given n d-dim samples x1, ..., xn: n1 in the subset D1 labeled ω1 and n2 in the subset D2 labeled ω2
• Find a projection

  y = wᵀx

  that maintains the separation present in the d-dim space
• Geometrically, if ||w|| = 1, each yi is the projection of the corresponding xi onto a line in the direction of w. The magnitude of w is of no significance, since it merely scales y
• Find w such that, if the d-dim samples labeled ω1 fall more or less into one cluster while those labeled ω2 fall in another, the projected points on the line are well separated as well

Page 48:

Fisher Linear Discriminant

Figure 3.5 illustrates the effect of choosing two different values for w for a two-dimensional example. If the original distributions are multimodal and highly overlapping, even the “best” w is unlikely to provide adequate separation

Page 49:

Fisher Linear Discriminant
• The Fisher linear discriminant is the linear function that maximizes the ratio of between-class scatter to within-class scatter
• The 2-class classification problem has been converted from the given d-dimensional space to a one-dimensional projected space
• Find a threshold, i.e., a point along the one-dimensional subspace, separating the projected points from the two classes

Page 50:

Fisher Linear Discriminant
• In terms of SB and SW, the criterion function J(·) can be written as

  J(w) = (wᵀ SB w) / (wᵀ SW w)

• A vector w that maximizes J(·) must satisfy

  SB w = λ SW w

  for some constant λ, which is a generalized eigenvalue problem
• If SW is nonsingular, we can obtain a conventional eigenvalue problem by writing

  SW⁻¹ SB w = λ w

Page 51:

Fisher Linear Discriminant
In our particular case, it is unnecessary to solve for the eigenvalues and eigenvectors of SW⁻¹SB, due to the fact that SBw is always in the direction of m1 - m2. Since the scale factor for w is immaterial, we can immediately write the solution for the w that optimizes J(·):

  w = SW⁻¹ (m1 - m2)

Page 52:

Fisher Linear Discriminant
• When the conditional densities p(x | ωi) are multivariate normal with equal covariance Σ, the threshold can be computed directly from the optimal decision boundary (Chapter 2)

  wᵀx + w0 = 0

  where w0 is a constant involving w and the priors.
• Thus, for the normal, equal-covariance case, the optimal decision rule is simply to decide ω1 if Fisher's linear discriminant exceeds some threshold, and to decide ω2 otherwise.

Page 53:

Multiple Discriminant Analysis

• Generalize 2-class Fisher’s linear discriminant to c-class problem

• Now, the projection is from a d-dimensional space to a (c - 1)-dimensional space, d ≥ c

Page 54:

Multiple Discriminant Analysis

• Because SB is the sum of c matrices of rank one or less, and because only c-1 of these are independent, SB is of rank c-1 or less. Thus, no more than c-1 of the eigenvalues are nonzero, and so the new dimensionality is at most c-1.

Page 55:

Multiple Discriminant Analysis
• The projection from a d-dimensional space to a (c-1)-dimensional space is accomplished by c-1 discriminant functions

  yi = wiᵀ x,   i = 1, ..., c-1

• If the yi are viewed as components of a vector y and the weight vectors wi are viewed as the columns of a d×(c-1) matrix W, then the projection can be written as a single matrix equation

  y = Wᵀ x

The columns of an optimal W are the generalized eigenvectors that correspond to the largest eigenvalues in

  SB wi = λi SW wi

Page 56:

Multiple Discriminant Analysis

Figure 3.6: Three 3-dimensional distributions are projected onto two-dimensional subspaces, described by normal vectors w1 and w2. Informally, multiple discriminant methods seek the optimum such subspace, i.e., the one with the greatest separation of the projected distributions for a given total within-class scatter; here that is the subspace associated with w1.

Page 57:

LDA

  Y1 = wᵀX1,   Y2 = wᵀX2

Find a transformation w such that wᵀX1 and wᵀX2 are maximally separated and each class is minimally dispersed (maximum separation).

Page 58:

PCA vs. LDA

In both cases X is transformed to Y using w:   Y = wᵀX

PCA:
  Sample mean:          μ = (1/n) Σ_{i=1}^{n} xi
  Scatter matrix:       S = Σ_{i=1}^{n} (xi - μ)(xi - μ)ᵀ
  Eigen decomposition:  S w = λ w

LDA:
  Mean for each class:    μi = (1/Ni) Σ_{x ∈ ωi} x
  Within-class scatter:   SW = Σ_{i=1}^{c} Σ_{xj ∈ Ci} (xj - μi)(xj - μi)ᵀ
  Between-class scatter:  SB = Σ_{i=1}^{c} Ni (μi - μ)(μi - μ)ᵀ
  Eigen decomposition:    SW⁻¹ SB w = λ w

Page 59:

Principal Component Analysis (PCA)
• Example
  • X = {(4,1), (2,4), (2,3), (3,6), (4,4)}

• Statistics:

  μ = [3.0, 3.6],   S = [  4.0  -2.0 ;  -2.0  13.2 ]

• Solve the eigenvalue problem:

  S w = λ w
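A sketch verifying this example with NumPy (not part of the slides):

```python
import numpy as np

X = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)

mu = X.mean(axis=0)                 # [3.0, 3.6]
Xc = X - mu
S = Xc.T @ Xc                       # scatter matrix [[4, -2], [-2, 13.2]]

eigvals, eigvecs = np.linalg.eigh(S)
print("mu =", mu)
print("S  =\n", S)
print("eigenvalues =", np.round(eigvals, 3))
print("principal axis (largest eigenvalue):", np.round(eigvecs[:, -1], 3))
```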

Page 60:

Linear Discriminant Analysis (LDA)
• Example
  • X1 = {(4,1), (2,4), (2,3), (3,6), (4,4)}
  • X2 = {(9,10), (6,8), (9,5), (8,7), (10,8)}

• Class statistics:

  μ1 = [3.0, 3.6],   μ2 = [7.67, 7.0],   μ = [5.7, 5.6]

  S1 = [  4.0  -2.0 ;  -2.0  13.2 ],   S2 = [ 11.89  2.0 ;  2.0  15.0 ]

• Within-class and between-class scatter:

  SW = [ 15.89  0.0 ;  0.0  28.2 ],   SB = [ 72  54 ;  54  40 ]

• Solve the eigenvalue problem:

  SW⁻¹ SB w = λ w
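The corresponding computation can be sketched in NumPy as follows (not from the slides; the printed values depend on the data actually used, so they may differ slightly from the numbers quoted above):

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
m = np.vstack([X1, X2]).mean(axis=0)          # grand mean

S1 = (X1 - m1).T @ (X1 - m1)                  # class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2                                  # within-class scatter

SB = len(X1) * np.outer(m1 - m, m1 - m) \
   + len(X2) * np.outer(m2 - m, m2 - m)       # between-class scatter

# Generalized eigenproblem SB w = lambda SW w, solved via SW^{-1} SB.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
w = eigvecs[:, np.argmax(eigvals.real)].real
print("Fisher direction w ~", np.round(w / np.linalg.norm(w), 3))

# For two classes this agrees (up to scale) with w = SW^{-1} (m1 - m2).
print("SW^-1 (m1 - m2)    ~", np.round(np.linalg.solve(SW, m1 - m2), 3))
```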

Page 61:

A Global Geometric Framework for Nonlinear Dimensionality Reduction
Tenenbaum, de Silva and Langford, Science, V. 290, 22 Dec 2000

• Although the input dimensionality may be quite high (e.g., 4096 for the 64x64 pixel images in Fig. 1A), the perceptually meaningful structure has many fewer independent degrees of freedom
• The images in 1A lie on an intrinsically 3-dim manifold, or constraint surface (two pose variables and a lighting angle)
• Given unordered high-dim inputs, discover low-dim representations
• PCA finds a linear subspace; Fig. 3A illustrates the challenge of non-linearity: points far apart on the underlying manifold, as measured by their geodesic (shortest path) distances, may appear close in high-dim input space, as measured by their straight-line Euclidean distance.

Page 62:
Page 63:
Page 64:
Page 65:
Page 66:
Page 67:
Page 68:
Page 69:
Page 70:
Page 71:
Page 72:

Low Dimensional Representations and Multidimensional Scaling (MDS) (Sec. 10.14)

• Given n points (objects) x1, ..., xn. No class labels
• Suppose only the similarities between the n objects are provided
• The goal is to represent these n objects in some low-dimensional space in such a way that the distances between points in that space correspond to the dissimilarities in the original space
• If an accurate representation can be found in 2 or 3 dimensions, then we can visualize the structure of the data
• Find a configuration of points y1, ..., yn for which the n(n-1)/2 distances dij are as close as possible to the original dissimilarities; this is called multidimensional scaling
• Two cases
  • It is meaningful to talk about the distances between the given n points

Page 73:

Distances Between Given Points Are Meaningful

Page 74:

Criterion Functions
• Sum-of-squared-error functions
• Since they only involve distances between points, they are invariant to rigid-body motions of the configuration
• The criterion functions have been normalized so that their minimum values are invariant to dilations of the sample points

Page 75:

Finding the Optimum Configuration

• Use a gradient-descent procedure to find an optimal configuration y1, ..., yn
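A minimal sketch of such a gradient-descent MDS (the sum-of-squared-error criterion, step size, and the 4-object dissimilarity matrix are assumptions, not from the slides):

```python
import numpy as np

def mds_gradient_descent(delta, dim=2, steps=2000, lr=0.005, seed=0):
    """Gradient descent on J = sum_{i<j} (d_ij - delta_ij)^2."""
    n = delta.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, dim))              # random initial configuration
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]   # y_i - y_j, shape (n, n, dim)
        d = np.linalg.norm(diff, axis=-1)      # current inter-point distances
        np.fill_diagonal(d, 1.0)               # avoid divide-by-zero (diff is 0 there)
        grad = 2.0 * (((d - delta) / d)[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y

# Hypothetical dissimilarities between 4 objects lying along a line.
delta = np.array([[0, 1, 2, 3],
                  [1, 0, 1, 2],
                  [2, 1, 0, 1],
                  [3, 2, 1, 0]], dtype=float)
print(np.round(mds_gradient_descent(delta), 2))
```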

Page 76:

Example

20 iterations with Jef

Page 77:

Nonmetric Multidimensional Scaling

• The numerical values of the dissimilarities are not as important as their rank order
• Monotonicity constraint: the rank order of the dij should equal the rank order of the δij
• The degree to which the dij satisfy the monotonicity constraint is measured by Ĵmon
• Normalize Ĵmon to prevent the configuration from being collapsed

Page 78:

Overfitting

Page 79:

Problem of Insufficient Data
• How to train a classifier (e.g., estimate the covariance matrix) when the training set size is small (compared to the number of features)?
• Reduce the dimensionality
  – Select a subset of features
  – Combine the available features to get a smaller number of more "salient" features
• Bayesian techniques
  – Assume a reasonable prior on the parameters to compensate for the small amount of training data
• Model simplification
  – Assume statistical independence
• Heuristics
  – Threshold the estimated covariance matrix so that only correlations above a threshold are retained

Page 80:

Practical Observations
• Most heuristics and model simplifications are almost surely incorrect
• In practice, however, the performance of classifiers based on model simplification is often better than with full parameter estimation
• Paradox: how can a suboptimal/simplified model perform better on a test dataset than the MLE of the full parameter set?
  – The answer involves the problem of insufficient data

Page 81:

Insufficient Data in Curve Fitting

Page 82:

Curve Fitting Example (contd.)

• The example shows that a 10th-degree polynomial fits the training data with zero error
  – However, the test (generalization) error is much higher for this fitted curve
• When the data size is small, one cannot be sure about how complex the model should be
• A small change in the data will change the parameters of the 10th-degree polynomial significantly, which is not a desirable quality (lack of stability)
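A small sketch (assumed data and underlying curve, not from the slides) that reproduces the effect: a 10th-degree polynomial interpolates 11 noisy training points but generalizes poorly compared to a lower-degree fit.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * np.pi * x)            # assumed underlying curve

x_train = np.linspace(0, 1, 11)
y_train = f(x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.2, size=x_test.shape)

for degree in (3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```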

Page 83:

Handling Insufficient Data
• Heuristics and model simplifications
• Shrinkage is an intermediate approach, which combines a "common covariance" with the individual covariance matrices
  – The individual covariance matrices shrink towards a common covariance matrix
  – Also called regularized discriminant analysis
• Shrinkage estimator for a covariance matrix, given shrinkage factor 0 < α < 1:

  Σi(α) = [ (1-α) ni Σi + α n Σ ] / [ (1-α) ni + α n ]

• Further, the common covariance can be shrunk towards the identity matrix:

  Σ(β) = (1-β) Σ + β I
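Both shrinkage steps are one-liners; a sketch with hypothetical matrices (not from the slides):

```python
import numpy as np

def shrink_class_cov(sigma_i, n_i, sigma_common, n, alpha):
    """Shrink a class covariance toward the common (pooled) covariance."""
    return ((1 - alpha) * n_i * sigma_i + alpha * n * sigma_common) / \
           ((1 - alpha) * n_i + alpha * n)

def shrink_to_identity(sigma, beta):
    """Shrink a covariance matrix toward the identity."""
    return (1 - beta) * sigma + beta * np.eye(sigma.shape[0])

# Hypothetical 2-D example values.
sigma_i = np.array([[4.0, 1.5], [1.5, 2.0]])        # class covariance (n_i samples)
sigma_common = np.array([[3.0, 0.5], [0.5, 3.0]])   # pooled covariance (n samples)
print(shrink_class_cov(sigma_i, n_i=20, sigma_common=sigma_common, n=200, alpha=0.3))
print(shrink_to_identity(sigma_common, beta=0.1))
```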

Page 84:

Principle of Parsimony
• By allowing the covariance matrices of the Gaussian conditional densities to be arbitrary, the number of parameters to be estimated in the resulting quadratic discriminant analysis can be rather large for large d or C
• In such situations, LDF is often preferred, with the principle of parsimony as the main underlying thought
• Dempster (1972) suggested that parameters should be introduced sparingly and only when the data indicate they are required

A.P. Dempster (1972), Covariance selection, Biometrics 28, 157-175.

Page 85:

Problems of Dimensionality

Page 86:

Introduction

• Real-world applications usually come with a large number of features
  – Text in documents is represented using frequencies of tens of thousands of words
  – Images are often represented by extracting local features from a large number of regions within an image
• Naive intuition: the more features, the better the classification performance?
  – Not always!
• There are two issues that must be confronted in high-dimensional feature spaces:
  – How does the classification accuracy depend on the dimensionality and the number of training samples?
  – What is the computational complexity of the classifier?

Page 87:

Statistically Independent Features

• If the features are statistically independent, it is possible to get excellent performance as the dimensionality increases
• For a two-class problem with multivariate normal classes, P(x | ωj) ~ N(μj, Σ), and equal prior probabilities, the probability of error is

  P(e) = (1/√(2π)) ∫_{r/2}^{∞} e^(-u²/2) du

  where the Mahalanobis distance r is defined by

  r² = (μ1 - μ2)ᵀ Σ⁻¹ (μ1 - μ2)

Page 88:

Statistically Independent Features

• When the features are independent, the covariance matrix is diagonal, and we have

  r² = Σ_{i=1}^{d} ( (μi1 - μi2) / σi )²

• Since r² increases monotonically with an increase in the number of features, P(e) decreases
• As long as the means of the features differ in the two classes, the error decreases
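A quick numerical illustration (hypothetical per-feature separations, not from the slides) of how P(e) = Q(r/2) falls as independent features are added; Q(γ) is evaluated as 0.5·erfc(γ/√2).

```python
import math

# Per-feature separations (mu_i1 - mu_i2)/sigma_i -- assumed values.
separations = [1.0, 0.8, 0.5, 0.5, 0.3, 0.3, 0.2]

r2 = 0.0
for d, s in enumerate(separations, start=1):
    r2 += s ** 2                                        # r^2 grows with each feature
    r = math.sqrt(r2)
    p_error = 0.5 * math.erfc((r / 2) / math.sqrt(2))   # P(e) = Q(r/2)
    print(f"d = {d}: r^2 = {r2:.2f}, P(e) = {p_error:.3f}")
```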

Page 89:

Increasing Dimensionality

• If a given set of features does not result in good classification performance, it is natural to add more features
• High dimensionality results in increased cost and complexity for both feature extraction and classification
• If the probabilistic structure of the problem is completely known, adding new features cannot increase the Bayes risk

Page 90:
Page 91:

Curse of Dimensionality

• In practice, increasing the dimensionality beyond a certain point in the presence of a finite number of training samples often leads to worse performance, rather than better performance
• The main reasons for this paradox are as follows:
  – the Gaussian assumption that is typically made is almost surely incorrect
  – the training sample size is always finite, so the estimation of the class-conditional density is not very accurate
• Analysis of this “curse of dimensionality” problem is difficult

Page 92:

A Simple Example
• Trunk (PAMI, 1979) provided a simple example illustrating this phenomenon:

  p(x | ω1) ~ G(μ1, I),    p(x | ω2) ~ G(μ2, I),    p(ω1) = p(ω2) = 1/2

  μ1 = μ = (1, 1/√2, 1/√3, 1/√4, ...),   μ2 = -μ,
  i.e., μi = (1/i)^(1/2) is the i-th component of the mean vector; N is the number of features

  p(x | ω1) = Π_{i=1}^{N} (1/√(2π)) exp[ -(1/2)(xi - μi)² ]
  p(x | ω2) = Π_{i=1}^{N} (1/√(2π)) exp[ -(1/2)(xi + μi)² ]
Page 93:

Case 1: Mean Values Known
• Bayes decision rule:

  Decide ω1 if xᵀμ = μ1x1 + μ2x2 + ... + μNxN > 0

• Probability of error:

  P_e(N) = (1/√(2π)) ∫_{γ}^{∞} e^(-z²/2) dz,   where  γ² = (1/4) ||μ1 - μ2||² = Σ_{i=1}^{N} (1/i)

  Σ_{i=1}^{N} (1/i) is a divergent series

  ∴ P_e(N) → 0 as N → ∞
Page 94:

Case 2: Mean Values Unknown
• m labeled training samples are available
• Pooled estimate of the mean, and the plug-in decision rule:

  μ̂ = (1/m) Σ_{i=1}^{m} xi,   where xi is replaced by -xi if xi ∈ ω2

  Decide ω1 if xᵀμ̂ = μ̂1x1 + μ̂2x2 + ... + μ̂NxN > 0

• Let z = xᵀμ̂ = Σ_{i=1}^{N} μ̂i xi. Then

  P_e(N, m) = P(xᵀμ̂ ≤ 0 | ω1) P(ω1) + P(xᵀμ̂ > 0 | ω2) P(ω2)
            = P(xᵀμ̂ ≤ 0 | ω1)    (due to symmetry)

• It is difficult to compute the distribution of z
Page 95:

Case 2: Mean Values Unknown

  E(z) = Σ_{i=1}^{N} (1/i)

  VAR(z) = Σ_{i=1}^{N} (1 + 1/m)(1/i) + N/m

  P_e(N, m) = P(z ≤ 0 | ω1) = P( (z - E(z)) / √VAR(z)  ≥  E(z) / √VAR(z) )

  lim_{N→∞} (z - E(z)) / √VAR(z) ~ G(0, 1)    (standard normal)

  P_e(m, N) = (1/√(2π)) ∫_{γ}^{∞} e^(-z²/2) dz,   where  γ = E(z) / √VAR(z)

  lim_{N→∞} γ = 0

  ∴ lim_{N→∞} P_e(m, N) = 1/2
Page 96:

Case 2: Mean Values Unknown

Page 97:

• Component Analysis and Discriminants
  – Combine features to increase discriminability and reduce dimensionality
  – Project d-dim. data to m dimensions, m << d
  – Linear combinations are simple and tractable
  – Two approaches for linear transformation:
    • PCA (Principal Component Analysis): the "projection that best represents the data in a least-squares sense"; also called the K-L (Karhunen-Loeve) transform

Page 98:

Diagonalization of the Covariance Matrix

• Find a basis in which the components of a random vector X are uncorrelated
• It can be shown that the eigenvectors of the covariance matrix of X form such a basis
• Covariance matrices (d x d) are positive semidefinite, so there exist d linearly independent eigenvectors that form a basis for X
• If K is the covariance matrix, an eigenvector e and an eigenvalue a satisfy

  Ke = ae
  (K - aI)e = 0

  Characteristic equation: det |K - aI| = 0

Page 99:

[Scatter plot of Gaussian samples in the (x, y) plane]

  μ = [2, 1],   Σ = [ 5  3 ;  3  3 ]

Eigenvectors (columns, up to sign):
  [ 0.5863   0.8101 ]
  [ 0.8101   0.5863 ]

Eigenvalues:
  [ 0.8344      0     ]
  [    0      6.9753  ]

Page 100:

Principal Component Analysis

This can be easily verified by writing

  J0(x0) = Σ_{k=1}^{n} ||(x0 - m) - (xk - m)||²
         = Σ_{k=1}^{n} ||x0 - m||² - 2 Σ_{k=1}^{n} (x0 - m)ᵀ(xk - m) + Σ_{k=1}^{n} ||xk - m||²
         = Σ_{k=1}^{n} ||x0 - m||² - 2 (x0 - m)ᵀ Σ_{k=1}^{n} (xk - m) + Σ_{k=1}^{n} ||xk - m||²
         = Σ_{k=1}^{n} ||x0 - m||² + Σ_{k=1}^{n} ||xk - m||²    (the middle term vanishes; the last sum is independent of x0)

Since the second sum is independent of x0, this expression is minimized by the choice x0 = m.

Page 101:

Principal Component Analysis

• The scatter matrix is merely (n-1) times the sample covariance matrix. It arises here when we substitute the ak found in Eq. (83) into Eq. (82) to obtain

  J1(e) = Σ_{k=1}^{n} ak² - 2 Σ_{k=1}^{n} ak² + Σ_{k=1}^{n} ||xk - m||²
        = - Σ_{k=1}^{n} [ eᵀ(xk - m) ]² + Σ_{k=1}^{n} ||xk - m||²
        = - Σ_{k=1}^{n} eᵀ(xk - m)(xk - m)ᵀ e + Σ_{k=1}^{n} ||xk - m||²
        = - eᵀSe + Σ_{k=1}^{n} ||xk - m||²

Page 102:

Principal Component Analysis

• The vector e that minimizes J1 also maximizes eᵀSe. We use the method of Lagrange multipliers (Section A.3 of the Appendix) to maximize eᵀSe subject to the constraint that ||e|| = 1. Letting λ be the undetermined multiplier, we differentiate

  u = eᵀSe - λ(eᵀe - 1)

  with respect to e to obtain

  ∂u/∂e = 2Se - 2λe

• Setting this gradient vector equal to zero, e is an eigenvector of the scatter matrix:

  Se = λe

Page 103:

Principal Component Analysis

• Since eᵀSe = λeᵀe = λ, it follows that to maximize eᵀSe, we want to select the eigenvector corresponding to the largest eigenvalue of the scatter matrix.
• In other words, to find the best one-dimensional projection of the d-dimensional data (in the least-sum-of-squared-error sense), project the data onto a line through the sample mean in the direction of the eigenvector of the scatter matrix with the largest eigenvalue.

Page 104:

Principal Component Analysis

• This result can be readily extended from a one-dimensional projection to a d'-dimensional projection (d' < d). In place of Eq. (81), we write

  x = m + Σ_{i=1}^{d'} ai ei,   where d' ≤ d.

• It is not difficult to show that the criterion function

  Jd' = Σ_{k=1}^{n} || ( m + Σ_{i=1}^{d'} aki ei ) - xk ||²

  is minimized when the vectors e1, ..., ed' are the d' eigenvectors of S with the largest eigenvalues.

Page 105:

Fisher Linear Discriminant

• How to find the best direction w that will enable accurate classification?
• A measure of the separation between the projected points is the difference of the sample means. If mi is the d-dimensional sample mean

  mi = (1/ni) Σ_{x ∈ Di} x,

then the sample mean for the projected points is

  m̃i = (1/ni) Σ_{y ∈ Yi} y = (1/ni) Σ_{x ∈ Di} wᵀx = wᵀmi

Page 106:

Fisher Linear Discriminant

• The distance between the projected means is

  |m̃1 - m̃2| = |wᵀ(m1 - m2)|

  and we can make this difference as large as we wish merely by scaling w.
• To obtain good separation of the projected data we really want the difference between the means to be large relative to some measure of the standard deviations for each class.
• Define the scatter for the projected samples labeled ωi by

  s̃i² = Σ_{y ∈ Yi} (y - m̃i)²

Page 107:

Fisher Linear Discriminant

• Thus, (1/n)(s̃1² + s̃2²) is an estimate of the variance of the pooled data, and s̃1² + s̃2² is called the total within-class scatter of the projected samples. The Fisher linear discriminant employs that linear function wᵀx for which the criterion function

  J(w) = |m̃1 - m̃2|² / (s̃1² + s̃2²)

  is maximum (and independent of ||w||).
• The vector w maximizing J(·) leads to the best separation between the two projected sets.
• How to solve for the optimal w?

Page 108:

[Scatter plot of Gaussian samples in the (x, y) plane]

  μ = [2, 1],   Σ = [ 5  0 ;  0  3 ]

Eigenvectors (columns, up to sign):
  [ 0.1137   0.9935 ]
  [ 0.9935   0.1137 ]

Eigenvalues:
  [ 3.1757      0     ]
  [    0      5.3882  ]