Source: svcl.ucsd.edu/courses/ece271B-F09/handouts/Dimensionality2.pdf (SVCL, UCSD)


PCA and LDA

Nuno Vasconcelos, ECE Department, UCSD

Principal component analysis

basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g. a 1D subspace in 2D or a 2D subspace in 3D
• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids

Principal component analysis

if y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose
• principal components φi are the eigenvectors of Σ
• principal lengths λi are the eigenvalues of Σ

by computing the eigenvalues we know if the data is flat:
• λ1 >> λ2: flat
• λ1 = λ2: not flat

[figure: equiprobability ellipse in the (y1, y2) plane with principal directions φ1, φ2 and principal lengths λ1, λ2; one flat example (λ1 >> λ2) and one non-flat example (λ1 = λ2)]

Principal component analysis (learning)

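As an illustration of the learning step just named, here is a minimal numpy sketch (my own code, not from the handout): estimate the sample mean and covariance, then eigendecompose the covariance to obtain the principal components φi and eigenvalues λi.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the sample covariance.
    X: d x n data matrix, one example per column (as in the slides).
    Returns the top-k principal components (as columns) and their eigenvalues."""
    n = X.shape[1]
    mu = X.mean(axis=1, keepdims=True)      # sample mean
    Xc = X - mu                             # centered data
    Sigma = (Xc @ Xc.T) / n                 # sample covariance
    lam, Phi = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # sort descending
    return Phi[:, order[:k]], lam[order[:k]]

# toy usage: data that is nearly flat along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 500)) * np.array([[3.0], [0.3]])
Phi, lam = pca_eig(X, 2)
print(lam)   # lambda_1 >> lambda_2 indicates flat data
```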

Principal component analysis

there is an alternative way to compute the principal components, based on the singular value decomposition (SVD):
• any real n x m matrix A (n > m) can be decomposed as

$$A = M \Pi N^T, \qquad M^T M = I, \quad N^T N = I$$

• where M is an n x m column-orthonormal matrix of left singular vectors (the columns of M)
• Π is an m x m diagonal matrix of singular values
• N^T is an m x m row-orthonormal matrix of right singular vectors (the columns of N)
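As a quick illustration (a sketch, with arbitrary matrix sizes), numpy's `svd` returns exactly these factors and satisfies the orthonormality conditions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.normal(size=(n, m))

# economy SVD: M is n x m, pi holds the singular values, Nt is N^T (m x m)
M, pi, Nt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, M @ np.diag(pi) @ Nt))     # A = M Pi N^T
print(np.allclose(M.T @ M, np.eye(m)))          # M^T M = I
print(np.allclose(Nt @ Nt.T, np.eye(m)))        # N^T N = I
```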

PCA by SVD

to relate this to PCA, we consider the data matrix

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}$$

the sample mean is

$$\mu = \frac{1}{n} \sum_i x_i = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix} \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \frac{1}{n} X \mathbf{1}$$

PCA by SVD

and we can center the data by subtracting the mean from each column of X; this gives the centered data matrix

$$X_c = \begin{bmatrix} | & & | \\ x_1 - \mu & \cdots & x_n - \mu \\ | & & | \end{bmatrix} = X - \mu \mathbf{1}^T = X - \frac{1}{n} X \mathbf{1}\mathbf{1}^T = X \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right)$$
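The centering identity above can be checked numerically; a small sketch, with `X` a hypothetical d x n data matrix holding one example per column:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 10
X = rng.normal(size=(d, n))

ones = np.ones((n, 1))
mu = X @ ones / n                                   # sample mean, d x 1
Xc = X - mu @ ones.T                                # subtract the mean from each column
Xc_alt = X @ (np.eye(n) - ones @ ones.T / n)        # X (I - (1/n) 1 1^T)
print(np.allclose(Xc, Xc_alt))                      # True
```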

PCA by SVD

the sample covariance is

$$\Sigma = \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^T = \frac{1}{n} \sum_i x_i^c (x_i^c)^T$$

where $x_i^c$ is the i-th column of $X_c$

this can be written as

$$\Sigma = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix} \begin{bmatrix} (x_1^c)^T \\ \vdots \\ (x_n^c)^T \end{bmatrix} = \frac{1}{n} X_c X_c^T$$
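A short numerical check that the sum of outer products equals the matrix form (a sketch; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 200
X = rng.normal(size=(d, n))
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu

Sigma_sum = sum(np.outer(Xc[:, i], Xc[:, i]) for i in range(n)) / n   # (1/n) sum_i x_i^c (x_i^c)^T
Sigma_mat = Xc @ Xc.T / n                                             # (1/n) X_c X_c^T
print(np.allclose(Sigma_sum, Sigma_mat))                              # True
```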

PCA by SVD

the matrix

$$X_c^T = \begin{bmatrix} (x_1^c)^T \\ \vdots \\ (x_n^c)^T \end{bmatrix}$$

is real n x d. assuming n > d, it has the SVD decomposition

$$X_c^T = M \Pi N^T, \qquad M^T M = I, \quad N^T N = I$$

and

$$\Sigma = \frac{1}{n} X_c X_c^T = \frac{1}{n} N \Pi M^T M \Pi N^T = \frac{1}{n} N \Pi^2 N^T$$

PCA by SVD

$$\Sigma = N \left( \frac{1}{n} \Pi^2 \right) N^T$$

noting that N is d x d and orthonormal, and Π² is diagonal, shows that this is just the eigenvalue decomposition of Σ

it follows that
• the eigenvectors of Σ are the columns of N
• the eigenvalues of Σ are

$$\lambda_i = \frac{1}{n} \pi_i^2$$

this gives an alternative algorithm for PCA

PCA by SVD

computation of PCA by SVD, given X with one example per column:
• 1) create the centered data matrix

$$X_c^T = \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X^T$$

• 2) compute its SVD

$$X_c^T = M \Pi N^T$$

• 3) the principal components are the columns of N, and the eigenvalues are

$$\lambda_i = \frac{1}{n} \pi_i^2$$
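The three steps translate almost line for line into numpy; this is a sketch under the slides' conventions (one example per column, eigenvalues πi²/n), with function and variable names of my own choosing:

```python
import numpy as np

def pca_svd(X):
    """PCA by SVD. X is d x n, one example per column.
    Returns the principal components (columns of N) and eigenvalues pi_i^2 / n."""
    d, n = X.shape
    ones = np.ones((n, 1))
    XcT = (np.eye(n) - ones @ ones.T / n) @ X.T            # 1) centered data matrix X_c^T (n x d)
    M, pi, Nt = np.linalg.svd(XcT, full_matrices=False)    # 2) X_c^T = M Pi N^T
    return Nt.T, pi**2 / n                                 # 3) columns of N, eigenvalues pi_i^2 / n

# consistency check against eigendecomposition of the sample covariance
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))
N, lam = pca_svd(X)
Sigma = np.cov(X, bias=True)                               # same (1/n) normalization as the slides
print(np.allclose(np.sort(lam), np.sort(np.linalg.eigvalsh(Sigma))))   # True
```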

Limitations of PCA

PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out

it is not hard to construct examples where PCA is the worst possible thing we could do

Example

consider a problem with
• two n-D Gaussian classes with covariance Σ = σ²I, σ² = 10, i.e. X ~ N(μi, 10 I)
• we add an extra variable which is the class label itself, X' = [X, i]
• assuming that P_Y(0) = P_Y(1) = 0.5,

$$E[Y] = 0.5 \times 0 + 0.5 \times 1 = 0.5$$
$$\mathrm{var}[Y] = 0.5\,(0 - 0.5)^2 + 0.5\,(1 - 0.5)^2 = 0.25 < 10$$

• dimension n+1 has the smallest variance and is the first to be discarded!
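A quick simulation of this contrived setup (a sketch; the dimension, sample size, and seed are arbitrary choices of mine) confirms that the label coordinate has by far the smallest variance, so PCA ranks it last:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim, n_per_class, sigma2 = 5, 1000, 10.0

# two Gaussian classes with covariance 10*I, plus the class label as an extra coordinate
X0 = rng.normal(0.0, np.sqrt(sigma2), size=(n_per_class, n_dim))
X1 = rng.normal(1.0, np.sqrt(sigma2), size=(n_per_class, n_dim))
labels = np.r_[np.zeros(n_per_class), np.ones(n_per_class)]
Xp = np.c_[np.r_[X0, X1], labels]            # X' = [X, i], one example per row

print(Xp.var(axis=0))                        # label column has variance ~0.25, the others ~10
lam = np.sort(np.linalg.eigvalsh(np.cov(Xp.T, bias=True)))[::-1]
print(lam)                                   # the smallest eigenvalue (<= 0.25) corresponds to the
                                             # perfectly discriminant label direction: PCA drops it first
```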

Example

this is
• a very contrived example
• but it shows that PCA can throw away all the discriminant info

does this mean you should never use PCA?
• no, typically it is a good method to find a suitable subset of variables, as long as you are not too greedy
• e.g. if you start with n = 100, and know that there are only 5 variables of interest
• picking the top 20 PCA components is likely to keep the desired 5
• your classifier will be much better than for n = 100, and probably not much worse than the one with the best 5 features

is there a rule of thumb for finding the number of PCA components?

Principal component analysis

a natural measure is to pick the eigenvectors that explain p% of the data variability
• this can be done by plotting the ratio r_k as a function of k

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}$$

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset

[figure: plot of r_k versus k for an example dataset]
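Computing r_k from the eigenvalues is a one-liner; a small sketch (the helper name and the example eigenvalues are mine, not from the handout):

```python
import numpy as np

def n_components_for(lam, p):
    """Smallest k such that the top-k eigenvalues explain at least a fraction p of the variance."""
    lam = np.sort(np.asarray(lam))[::-1]      # eigenvalues in decreasing order
    r = np.cumsum(lam) / np.sum(lam)          # r_k for k = 1, ..., n
    return int(np.searchsorted(r, p) + 1), r

k, r = n_components_for([5.0, 3.0, 1.5, 0.3, 0.2], p=0.70)
print(k, r)    # k = 2 for these toy eigenvalues; the slide's dataset needed 3 for 70%
```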

Fisher's linear discriminant

what if we really need to find the best features?
• harder question, usually impossible with simple methods
• there are better methods for finding discriminant directions

one good example is linear discriminant analysis (LDA)
• the idea is to find the line that best separates the two classes

[figure: a bad projection vs. a good projection of two classes onto a line]

Linear discriminant analysis

we have two classes such that

$$E_{X|Y}[X|Y=i] = \mu_i$$
$$E_{X|Y}\left[(X - \mu_i)(X - \mu_i)^T \mid Y=i\right] = \Sigma_i$$

and want to find the line

$$z = w^T x$$

that best separates them. one possibility would be to maximize

$$\left( E_{Z|Y}[Z|Y=1] - E_{Z|Y}[Z|Y=0] \right)^2 = \left( E_{X|Y}[w^T x|Y=1] - E_{X|Y}[w^T x|Y=0] \right)^2 = \left( w^T (\mu_1 - \mu_0) \right)^2$$

Linear discriminant analysis

however, this

$$\left( w^T (\mu_1 - \mu_0) \right)^2$$

can be made arbitrarily large by simply scaling w; we are only interested in the direction, not the magnitude, so some type of normalization is needed. Fisher suggested maximizing

$$\max_w \frac{\text{between class scatter}}{\text{within class scatter}} = \max_w \frac{\left( E_{Z|Y}[Z|Y=1] - E_{Z|Y}[Z|Y=0] \right)^2}{\mathrm{var}[Z|Y=1] + \mathrm{var}[Z|Y=0]}$$

Linear discriminant analysis

we have already seen that

$$\left( E_{Z|Y}[Z|Y=1] - E_{Z|Y}[Z|Y=0] \right)^2 = \left( w^T(\mu_1 - \mu_0) \right)^2 = w^T (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w$$

also

$$\begin{aligned}
\mathrm{var}[Z|Y=i] &= E_{Z|Y}\left[ \left( Z - E_{Z|Y}[Z|Y=i] \right)^2 \mid Y=i \right] \\
&= E_{X|Y}\left[ \left( w^T x - w^T \mu_i \right)^2 \mid Y=i \right] \\
&= E_{X|Y}\left[ w^T (x - \mu_i)(x - \mu_i)^T w \mid Y=i \right] \\
&= w^T \Sigma_i w
\end{aligned}$$

Linear discriminant analysis

and

$$J(w) = \frac{\left( E_{Z|Y}[Z|Y=1] - E_{Z|Y}[Z|Y=0] \right)^2}{\mathrm{var}[Z|Y=1] + \mathrm{var}[Z|Y=0]} = \frac{w^T (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w}{w^T (\Sigma_1 + \Sigma_0) w}$$

which can be written as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T \ \text{(between class scatter)}, \qquad S_W = \Sigma_1 + \Sigma_0 \ \text{(within class scatter)}$$
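In code, the scatter matrices and the Fisher ratio J(w) look like this (a minimal sketch assuming `X0` and `X1` hold one example per row); note that J is invariant to rescaling w, which is exactly why a normalization was needed:

```python
import numpy as np

def fisher_ratio(w, X0, X1):
    """J(w) = (w^T S_B w) / (w^T S_W w) for two classes; X0, X1 hold one example per row."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    d = (mu1 - mu0)[:, None]
    S_B = d @ d.T                                               # between-class scatter
    S_W = np.cov(X0.T, bias=True) + np.cov(X1.T, bias=True)     # within-class scatter, Sigma_1 + Sigma_0
    return float(w @ S_B @ w) / float(w @ S_W @ w)

# J is invariant to scaling w: J(w) == J(10 w)
rng = np.random.default_rng(0)
X0, X1 = rng.normal(0, 1, (100, 3)), rng.normal(1, 1, (100, 3))
w = rng.normal(size=3)
print(np.isclose(fisher_ratio(w, X0, X1), fisher_ratio(10 * w, X0, X1)))   # True
```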

Linear discriminant analysis

maximizing the ratio

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

• is equivalent to maximizing the numerator while keeping the denominator constant, i.e.

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

• and can be accomplished using Lagrange multipliers
• define the Lagrangian

$$L = w^T S_B w - \lambda \left( w^T S_W w - K \right)$$

• and maximize with respect to both w and λ

Linear discriminant analysis

setting the gradient of

$$L = w^T \left( S_B - \lambda S_W \right) w + \lambda K$$

with respect to w to zero, we get

$$\nabla_w L = 2 \left( S_B - \lambda S_W \right) w = 0$$

or

$$S_B w = \lambda S_W w$$

this is a generalized eigenvalue problem
the solution is easy when $S_W^{-1} = (\Sigma_1 + \Sigma_0)^{-1}$ exists
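The generalized eigenvalue problem can also be solved directly, e.g. with scipy's symmetric-definite solver (a sketch with toy data of my own; S_W must be positive definite for this route):

```python
import numpy as np
from scipy.linalg import eigh   # eigh(A, B) solves the generalized problem A w = lambda B w

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(200, 3))
X1 = rng.normal(1.0, 1.0, size=(200, 3))

d = (X1.mean(axis=0) - X0.mean(axis=0))[:, None]
S_B = d @ d.T                                               # between-class scatter
S_W = np.cov(X0.T, bias=True) + np.cov(X1.T, bias=True)     # within-class scatter (positive definite here)

lam, W = eigh(S_B, S_W)                                     # solves S_B w = lambda S_W w
w = W[:, np.argmax(lam)]                                    # eigenvector with the largest eigenvalue
print(np.allclose(S_B @ w, lam.max() * S_W @ w))            # True
```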

Linear discriminant analysis

in this case

$$S_W^{-1} S_B w = \lambda w$$

and, using the definition of S_B,

$$S_W^{-1} (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w = \lambda w$$

noting that $(\mu_1 - \mu_0)^T w = \alpha$ is a scalar, this can be written as

$$S_W^{-1} (\mu_1 - \mu_0) = \frac{\lambda}{\alpha} w$$

and since we don't care about the magnitude of w

$$w^* = S_W^{-1} (\mu_1 - \mu_0) = (\Sigma_1 + \Sigma_0)^{-1} (\mu_1 - \mu_0)$$
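Putting the closed form to work, a minimal two-class LDA sketch (function names and toy data are mine): compute w* = S_W^{-1}(μ1 - μ0) and project the data onto z = w^T x.

```python
import numpy as np

def lda_direction(X0, X1):
    """Closed-form Fisher LDA direction w* = (Sigma_1 + Sigma_0)^{-1} (mu_1 - mu_0).
    X0, X1: one example per row."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_W = np.cov(X0.T, bias=True) + np.cov(X1.T, bias=True)
    w = np.linalg.solve(S_W, mu1 - mu0)      # solve S_W w = (mu_1 - mu_0) instead of forming the inverse
    return w / np.linalg.norm(w)             # the magnitude of w is irrelevant, keep unit norm

rng = np.random.default_rng(0)
cov = [[3.0, 1.0], [1.0, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
X1 = rng.multivariate_normal([2.0, 2.0], cov, size=300)
w = lda_direction(X0, X1)
z0, z1 = X0 @ w, X1 @ w                      # 1-D projections z = w^T x
print(z0.mean(), z1.mean(), z0.std(), z1.std())   # class means well separated relative to the spread
```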

Linear discriminant analysis

note that we have seen this before
• for a classification problem with Gaussian classes of equal covariance Σi = Σ, the BDR boundary is the plane with normal

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

• if Σ1 = Σ0, this is also the LDA solution

[figure: two Gaussian classes with equal covariance Σ, means μi and μj, boundary point x0, and normal w]

Linear discriminant analysis

this gives two different interpretations of LDA
• it is optimal if and only if the classes are Gaussian and have equal covariance
• better than PCA, but not necessarily good enough
• a classifier on the LDA feature is equivalent to the BDR after the approximation of the data by two Gaussians with equal covariance