Source: svcl.ucsd.edu/courses/ece271B-F09/handouts/Dimensionality2.pdf (SVCL, UCSD)


PCA and LDA

Nuno Vasconcelos, ECE Department, UCSD

Principal component analysis

basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g. a 1D subspace in 2D or a 2D subspace in 3D
• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids

Principal component analysis

if y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose
• principal components φi are the eigenvectors of Σ
• principal lengths λi are the eigenvalues of Σ

by computing the eigenvalues we know if the data is flat:
• λ1 >> λ2: flat
• λ1 = λ2: not flat

[figure: equiprobability ellipse in the (y1, y2) plane with principal directions φ1, φ2 and principal lengths λ1, λ2; one flat example (λ1 >> λ2) and one non-flat example (λ1 = λ2)]

Principal component analysis (learning)

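As an illustration of the learning step just named, here is a minimal numpy sketch (my own code, not from the handout): estimate the sample mean and covariance, then eigendecompose the covariance to obtain the principal components φi and eigenvalues λi.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the sample covariance.
    X: d x n data matrix, one example per column (as in the slides).
    Returns the top-k principal components (as columns) and their eigenvalues."""
    n = X.shape[1]
    mu = X.mean(axis=1, keepdims=True)      # sample mean
    Xc = X - mu                             # centered data
    Sigma = (Xc @ Xc.T) / n                 # sample covariance
    lam, Phi = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # sort descending
    return Phi[:, order[:k]], lam[order[:k]]

# toy usage: data that is nearly flat along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 500)) * np.array([[3.0], [0.3]])
Phi, lam = pca_eig(X, 2)
print(lam)   # lambda_1 >> lambda_2 indicates flat data
```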

Principal component analysis

there is an alternative way to compute the principal components, based on the singular value decomposition (SVD):
• any real n x m matrix A (n > m) can be decomposed as

$$A = M \Pi N^T, \qquad M^T M = I, \quad N^T N = I$$

• where M is an n x m column-orthonormal matrix of left singular vectors (the columns of M)
• Π is an m x m diagonal matrix of singular values
• N^T is an m x m row-orthonormal matrix of right singular vectors (the columns of N)
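As a quick illustration (a sketch, with arbitrary matrix sizes), numpy's `svd` returns exactly these factors and satisfies the orthonormality conditions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.normal(size=(n, m))

# economy SVD: M is n x m, pi holds the singular values, Nt is N^T (m x m)
M, pi, Nt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, M @ np.diag(pi) @ Nt))     # A = M Pi N^T
print(np.allclose(M.T @ M, np.eye(m)))          # M^T M = I
print(np.allclose(Nt @ Nt.T, np.eye(m)))        # N^T N = I
```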

PCA by SVD

to relate this to PCA, we consider the data matrix

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}$$

the sample mean is

$$\mu = \frac{1}{n} \sum_i x_i = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix} \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \frac{1}{n} X \mathbf{1}$$

PCA by SVD

and we can center the data by subtracting the mean from each column of X; this gives the centered data matrix

$$X_c = \begin{bmatrix} | & & | \\ x_1 - \mu & \cdots & x_n - \mu \\ | & & | \end{bmatrix} = X - \mu \mathbf{1}^T = X - \frac{1}{n} X \mathbf{1}\mathbf{1}^T = X \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right)$$
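The centering identity above can be checked numerically; a small sketch, with `X` a hypothetical d x n data matrix holding one example per column:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 10
X = rng.normal(size=(d, n))

ones = np.ones((n, 1))
mu = X @ ones / n                                   # sample mean, d x 1
Xc = X - mu @ ones.T                                # subtract the mean from each column
Xc_alt = X @ (np.eye(n) - ones @ ones.T / n)        # X (I - (1/n) 1 1^T)
print(np.allclose(Xc, Xc_alt))                      # True
```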

PCA by SVD

the sample covariance is

$$\Sigma = \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^T = \frac{1}{n} \sum_i x_i^c (x_i^c)^T$$

where $x_i^c$ is the i-th column of $X_c$

this can be written as

$$\Sigma = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix} \begin{bmatrix} (x_1^c)^T \\ \vdots \\ (x_n^c)^T \end{bmatrix} = \frac{1}{n} X_c X_c^T$$
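A short numerical check that the sum of outer products equals the matrix form (a sketch; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 200
X = rng.normal(size=(d, n))
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu

Sigma_sum = sum(np.outer(Xc[:, i], Xc[:, i]) for i in range(n)) / n   # (1/n) sum_i x_i^c (x_i^c)^T
Sigma_mat = Xc @ Xc.T / n                                             # (1/n) X_c X_c^T
print(np.allclose(Sigma_sum, Sigma_mat))                              # True
```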

PCA by SVD

the matrix

$$X_c^T = \begin{bmatrix} (x_1^c)^T \\ \vdots \\ (x_n^c)^T \end{bmatrix}$$

is real n x d. assuming n > d, it has the SVD decomposition

$$X_c^T = M \Pi N^T, \qquad M^T M = I, \quad N^T N = I$$

and

$$\Sigma = \frac{1}{n} X_c X_c^T = \frac{1}{n} N \Pi M^T M \Pi N^T = \frac{1}{n} N \Pi^2 N^T$$

PCA by SVD

$$\Sigma = N \left( \frac{1}{n} \Pi^2 \right) N^T$$

noting that N is d x d and orthonormal, and Π² is diagonal, shows that this is just the eigenvalue decomposition of Σ

it follows that
• the eigenvectors of Σ are the columns of N
• the eigenvalues of Σ are

$$\lambda_i = \frac{1}{n} \pi_i^2$$

this gives an alternative algorithm for PCA

PCA by SVD

computation of PCA by SVD, given X with one example per column:
• 1) create the centered data matrix

$$X_c^T = \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X^T$$

• 2) compute its SVD

$$X_c^T = M \Pi N^T$$

• 3) the principal components are the columns of N, and the eigenvalues are

$$\lambda_i = \frac{1}{n} \pi_i^2$$
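The three steps translate almost line for line into numpy; this is a sketch under the slides' conventions (one example per column, eigenvalues πi²/n), with function and variable names of my own choosing:

```python
import numpy as np

def pca_svd(X):
    """PCA by SVD. X is d x n, one example per column.
    Returns the principal components (columns of N) and eigenvalues pi_i^2 / n."""
    d, n = X.shape
    ones = np.ones((n, 1))
    XcT = (np.eye(n) - ones @ ones.T / n) @ X.T            # 1) centered data matrix X_c^T (n x d)
    M, pi, Nt = np.linalg.svd(XcT, full_matrices=False)    # 2) X_c^T = M Pi N^T
    return Nt.T, pi**2 / n                                 # 3) columns of N, eigenvalues pi_i^2 / n

# consistency check against eigendecomposition of the sample covariance
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))
N, lam = pca_svd(X)
Sigma = np.cov(X, bias=True)                               # same (1/n) normalization as the slides
print(np.allclose(np.sort(lam), np.sort(np.linalg.eigvalsh(Sigma))))   # True
```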

Limitations of PCA

PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out

it is not hard to construct examples where PCA is the worst possible thing we could do

Example

consider a problem with
• two n-D Gaussian classes with covariance Σ = σ²I, σ² = 10, i.e. X ~ N(μi, 10 I)
• we add an extra variable which is the class label itself, X' = [X, i]
• assuming that P_Y(0) = P_Y(1) = 0.5,

$$E[Y] = 0.5 \times 0 + 0.5 \times 1 = 0.5$$
$$\mathrm{var}[Y] = 0.5\,(0 - 0.5)^2 + 0.5\,(1 - 0.5)^2 = 0.25 < 10$$

• dimension n+1 has the smallest variance and is the first to be discarded!
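A quick simulation of this contrived setup (a sketch; the dimension, sample size, and seed are arbitrary choices of mine) confirms that the label coordinate has by far the smallest variance, so PCA ranks it last:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim, n_per_class, sigma2 = 5, 1000, 10.0

# two Gaussian classes with covariance 10*I, plus the class label as an extra coordinate
X0 = rng.normal(0.0, np.sqrt(sigma2), size=(n_per_class, n_dim))
X1 = rng.normal(1.0, np.sqrt(sigma2), size=(n_per_class, n_dim))
labels = np.r_[np.zeros(n_per_class), np.ones(n_per_class)]
Xp = np.c_[np.r_[X0, X1], labels]            # X' = [X, i], one example per row

print(Xp.var(axis=0))                        # label column has variance ~0.25, the others ~10
lam = np.sort(np.linalg.eigvalsh(np.cov(Xp.T, bias=True)))[::-1]
print(lam)                                   # the smallest eigenvalue (<= 0.25) corresponds to the
                                             # perfectly discriminant label direction: PCA drops it first
```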

Example

this is
• a very contrived example
• but it shows that PCA can throw away all the discriminant info

does this mean you should never use PCA?
• no, typically it is a good method to find a suitable subset of variables, as long as you are not too greedy
• e.g. if you start with n = 100, and know that there are only 5 variables of interest
• picking the top 20 PCA components is likely to keep the desired 5
• your classifier will be much better than for n = 100, and probably not much worse than the one with the best 5 features

is there a rule of thumb for finding the number of PCA components?

Principal component analysis

a natural measure is to pick the eigenvectors that explain p% of the data variability
• this can be done by plotting the ratio r_k as a function of k

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}$$

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset

[figure: plot of r_k versus k for an example dataset]
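Computing r_k from the eigenvalues is a one-liner; a small sketch (the helper name and the example eigenvalues are mine, not from the handout):

```python
import numpy as np

def n_components_for(lam, p):
    """Smallest k such that the top-k eigenvalues explain at least a fraction p of the variance."""
    lam = np.sort(np.asarray(lam))[::-1]      # eigenvalues in decreasing order
    r = np.cumsum(lam) / np.sum(lam)          # r_k for k = 1, ..., n
    return int(np.searchsorted(r, p) + 1), r

k, r = n_components_for([5.0, 3.0, 1.5, 0.3, 0.2], p=0.70)
print(k, r)    # k = 2 for these toy eigenvalues; the slide's dataset needed 3 for 70%
```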

Fisher's linear discriminant

what if we really need to find the best features?
• harder question, usually impossible with simple methods
• there are better methods for finding discriminant directions

one good example is linear discriminant analysis (LDA)
• the idea is to find the line that best separates the two classes

[figure: a bad projection vs. a good projection of two classes onto a line]

Linear discriminant analysis

we have two classes such that

$$E_{X|Y}[X|Y=i] = \mu_i$$
$$E_{X|Y}\left[(X - \mu_i)(X - \mu_i)^T \mid Y=i\right] = \Sigma_i$$

and want to find the line

$$z = w^T x$$

that best separates them. one possibility would be to maximize

$$\left( E_{Z|Y}[Z|Y=1] - E_{Z|Y}[Z|Y=0] \right)^2 = \left( E_{X|Y}[w^T x|Y=1] - E_{X|Y}[w^T x|Y=0] \right)^2 = \left( w^T (\mu_1 - \mu_0) \right)^2$$

Linear discriminant analysis

however, this

$$\left( w^T (\mu_1 - \mu_0) \right)^2$$

can be made arbitrarily large by simply scaling w; we are only interested in the direction, not the magnitude, so some type of normalization is needed. Fisher suggested maximizing

$$\max_w \frac{\text{between class scatter}}{\text{within class scatter}} = \max_w \frac{\left( E_{Z|Y}[Z|Y=1] - E_{Z|Y}[Z|Y=0] \right)^2}{\mathrm{var}[Z|Y=1] + \mathrm{var}[Z|Y=0]}$$

Linear discriminant analysis

we have already seen that

$$\left( E_{Z|Y}[Z|Y=1] - E_{Z|Y}[Z|Y=0] \right)^2 = \left( w^T(\mu_1 - \mu_0) \right)^2 = w^T (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w$$

also

$$\begin{aligned}
\mathrm{var}[Z|Y=i] &= E_{Z|Y}\left[ \left( Z - E_{Z|Y}[Z|Y=i] \right)^2 \mid Y=i \right] \\
&= E_{X|Y}\left[ \left( w^T x - w^T \mu_i \right)^2 \mid Y=i \right] \\
&= E_{X|Y}\left[ w^T (x - \mu_i)(x - \mu_i)^T w \mid Y=i \right] \\
&= w^T \Sigma_i w
\end{aligned}$$

Linear discriminant analysis

and

$$J(w) = \frac{\left( E_{Z|Y}[Z|Y=1] - E_{Z|Y}[Z|Y=0] \right)^2}{\mathrm{var}[Z|Y=1] + \mathrm{var}[Z|Y=0]} = \frac{w^T (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w}{w^T (\Sigma_1 + \Sigma_0) w}$$

which can be written as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T \ \text{(between class scatter)}, \qquad S_W = \Sigma_1 + \Sigma_0 \ \text{(within class scatter)}$$
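In code, the scatter matrices and the Fisher ratio J(w) look like this (a minimal sketch assuming `X0` and `X1` hold one example per row); note that J is invariant to rescaling w, which is exactly why a normalization was needed:

```python
import numpy as np

def fisher_ratio(w, X0, X1):
    """J(w) = (w^T S_B w) / (w^T S_W w) for two classes; X0, X1 hold one example per row."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    d = (mu1 - mu0)[:, None]
    S_B = d @ d.T                                               # between-class scatter
    S_W = np.cov(X0.T, bias=True) + np.cov(X1.T, bias=True)     # within-class scatter, Sigma_1 + Sigma_0
    return float(w @ S_B @ w) / float(w @ S_W @ w)

# J is invariant to scaling w: J(w) == J(10 w)
rng = np.random.default_rng(0)
X0, X1 = rng.normal(0, 1, (100, 3)), rng.normal(1, 1, (100, 3))
w = rng.normal(size=3)
print(np.isclose(fisher_ratio(w, X0, X1), fisher_ratio(10 * w, X0, X1)))   # True
```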

Linear discriminant analysis

maximizing the ratio

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

• is equivalent to maximizing the numerator while keeping the denominator constant, i.e.

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

• and can be accomplished using Lagrange multipliers
• define the Lagrangian

$$L = w^T S_B w - \lambda \left( w^T S_W w - K \right)$$

• and maximize with respect to both w and λ

Linear discriminant analysis

setting the gradient of

$$L = w^T \left( S_B - \lambda S_W \right) w + \lambda K$$

with respect to w to zero, we get

$$\nabla_w L = 2 \left( S_B - \lambda S_W \right) w = 0$$

or

$$S_B w = \lambda S_W w$$

this is a generalized eigenvalue problem
the solution is easy when $S_W^{-1} = (\Sigma_1 + \Sigma_0)^{-1}$ exists
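The generalized eigenvalue problem can also be solved directly, e.g. with scipy's symmetric-definite solver (a sketch with toy data of my own; S_W must be positive definite for this route):

```python
import numpy as np
from scipy.linalg import eigh   # eigh(A, B) solves the generalized problem A w = lambda B w

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(200, 3))
X1 = rng.normal(1.0, 1.0, size=(200, 3))

d = (X1.mean(axis=0) - X0.mean(axis=0))[:, None]
S_B = d @ d.T                                               # between-class scatter
S_W = np.cov(X0.T, bias=True) + np.cov(X1.T, bias=True)     # within-class scatter (positive definite here)

lam, W = eigh(S_B, S_W)                                     # solves S_B w = lambda S_W w
w = W[:, np.argmax(lam)]                                    # eigenvector with the largest eigenvalue
print(np.allclose(S_B @ w, lam.max() * S_W @ w))            # True
```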

Linear discriminant analysis

in this case

$$S_W^{-1} S_B w = \lambda w$$

and, using the definition of S_B,

$$S_W^{-1} (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w = \lambda w$$

noting that $(\mu_1 - \mu_0)^T w = \alpha$ is a scalar, this can be written as

$$S_W^{-1} (\mu_1 - \mu_0) = \frac{\lambda}{\alpha} w$$

and since we don't care about the magnitude of w

$$w^* = S_W^{-1} (\mu_1 - \mu_0) = (\Sigma_1 + \Sigma_0)^{-1} (\mu_1 - \mu_0)$$
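Putting the closed form to work, a minimal two-class LDA sketch (function names and toy data are mine): compute w* = S_W^{-1}(μ1 - μ0) and project the data onto z = w^T x.

```python
import numpy as np

def lda_direction(X0, X1):
    """Closed-form Fisher LDA direction w* = (Sigma_1 + Sigma_0)^{-1} (mu_1 - mu_0).
    X0, X1: one example per row."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_W = np.cov(X0.T, bias=True) + np.cov(X1.T, bias=True)
    w = np.linalg.solve(S_W, mu1 - mu0)      # solve S_W w = (mu_1 - mu_0) instead of forming the inverse
    return w / np.linalg.norm(w)             # the magnitude of w is irrelevant, keep unit norm

rng = np.random.default_rng(0)
cov = [[3.0, 1.0], [1.0, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
X1 = rng.multivariate_normal([2.0, 2.0], cov, size=300)
w = lda_direction(X0, X1)
z0, z1 = X0 @ w, X1 @ w                      # 1-D projections z = w^T x
print(z0.mean(), z1.mean(), z0.std(), z1.std())   # class means well separated relative to the spread
```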

Linear discriminant analysis

note that we have seen this before
• for a classification problem with Gaussian classes of equal covariance Σi = Σ, the BDR boundary is the plane with normal

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

• if Σ1 = Σ0, this is also the LDA solution

[figure: two Gaussian classes with equal covariance Σ, means μi and μj, boundary point x0, and normal w]

Linear discriminant analysis

this gives two different interpretations of LDA
• it is optimal if and only if the classes are Gaussian and have equal covariance
• better than PCA, but not necessarily good enough
• a classifier on the LDA feature is equivalent to the BDR after the approximation of the data by two Gaussians with equal covariance