Machine Intelligence lab - School of Electronics Engineering - Kyungpook National University
ELEC801 Pattern Recognition
PRINCIPAL COMPONENT ANALYSIS
ELEC801 Pattern Recognition, Fall 2017, KNU
Instructor: Gil-Jin Jang
Slide credits:
Srinivasa Narasimhan, CS, CMU; James J. Cochran, Louisiana Tech University;
Barnabás Póczos, University of Alberta
Slide credit: Narasimhan, Cochran, Póczos 10/23/2017 1
Problem: Hi-Dimensional Features
• Example:
– Input data type: 53 blood and urine measurements (wet chemistry)
– Samples: 65 people (33 alcoholics, 32 non-alcoholics)
• Data given in a matrix format
Slide credit: S. Narasimhan
      H-WBC   H-RBC   H-Hgb   H-Hct    H-MCV   H-MCH  H-MCHC
A1   8.0000  4.8200 14.1000 41.0000  85.0000 29.0000 34.0000
A2   7.3000  5.0200 14.7000 43.0000  86.0000 29.0000 34.0000
A3   4.3000  4.4800 14.1000 41.0000  91.0000 32.0000 35.0000
A4   7.5000  4.4700 14.9000 45.0000 101.0000 33.0000 33.0000
A5   7.3000  5.5200 15.4000 46.0000  84.0000 28.0000 33.0000
A6   6.9000  4.8600 16.0000 47.0000  97.0000 33.0000 34.0000
A7   7.8000  4.6800 14.7000 43.0000  92.0000 31.0000 34.0000
A8   8.6000  4.8200 15.8000 42.0000  88.0000 33.0000 37.0000
A9   5.1000  4.7100 14.0000 43.0000  92.0000 30.0000 32.0000
Data Representation
[Figures: univariate plot of H-Bands per person; bivariate scatter of C-Triglycerides vs. C-LDH; trivariate scatter of C-Triglycerides, C-LDH, and M-EPI]
Multivariate data distributions can be represented by visualizing distributions of pairs or triplets.
Data Representation
• Issues
– Is there any better representation?
– Do we need all 53 dimensions?
– How do we find the BEST lower-dimensional space that conveys maximum useful information?
• One answer: find PRINCIPAL COMPONENTS
– Goal: we wish to explain/summarize the underlying variance-covariance structure of a large set of variables through a few linear combinations of these variables.
Principal Component Analysis
Given N points in a p-dimensional space, for large N:
– how does one project onto a 1-dimensional space while preserving broad trends in the data and allowing it to be visualized?
Choose a line that fits the data so the points are spread out well along it, i.e., one that
– maximizes the variance of the projected data (purple line)
– minimizes the mean squared distance between each data point and its projection (sum of blue lines)
Algebraic Interpretation – 1D
• Minimize the sum of squares of distances to the line.
– Minimizing the sum of squares of distances to the line is the same as maximizing the sum of squares of the projections on that line, thanks to Pythagoras.
PCA: Mathematical Formulation
Let us say we have x_i, i = 1…N data points in p dimensions (p is large).
If we want to represent the data set by a single point x_0, a natural criterion is

    J_0(x_0) = Σ_{i=1}^{N} ||x_0 − x_i||^2

Can we justify this choice mathematically? It turns out that if you minimize J_0, you get

    x_0 = m = (1/N) Σ_{i=1}^{N} x_i

viz., the sample mean.
Source: Chapter 3 of [DHS]
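The minimization claim can be checked numerically: J_0 is smallest exactly at the sample mean. A minimal sketch (Python/NumPy here for illustration, while the course code is MATLAB; the data are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # N = 100 points in p = 5 dimensions

def J0(x0, X):
    # sum of squared distances from a single representative point x0
    return float(np.sum(np.linalg.norm(X - x0, axis=1) ** 2))

m = X.mean(axis=0)                 # sample mean
# perturbing the mean can only increase J0, since
# J0(x0) = J0(m) + N * ||x0 - m||^2
for _ in range(10):
    x0 = m + rng.normal(scale=0.1, size=5)
    assert J0(m, X) < J0(x0, X)
```

The identity in the comment follows by expanding the square, which is exactly the argument sketched in [DHS] Chapter 3.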
PCA: An Intuitive Approach
Representing the data set x_i, i = 1…N by its mean is quite uninformative.
So let's try to represent the data by a straight line of the form:

    x = m + a e

This is the equation of a straight line that passes through m; e is a unit vector along the straight line, and the signed distance of a point x from m is a. The individual points projected onto this straight line would be

    x_i = m + a_i e,   i = 1…N
PCA: An Intuitive Approach
Let us now determine the a_i's:

    J_1(a_1, …, a_N, e) ≡ Σ_{i=1}^{N} ||(m + a_i e) − x_i||^2
                        = Σ_{i=1}^{N} a_i^2 ||e||^2 − 2 Σ_{i=1}^{N} a_i e^T (x_i − m) + Σ_{i=1}^{N} ||x_i − m||^2

Partially differentiating with respect to a_i, we get:

    a_i = e^T (x_i − m)

Plugging in this expression for a_i in J_1, we get:

    J_1(e) = − e^T S e + Σ_{i=1}^{N} ||x_i − m||^2

where

    S = Σ_{i=1}^{N} (x_i − m)(x_i − m)^T

is called the scatter matrix (in this case, covariance matrix).
PCA: An Intuitive Approach…
So minimizing J_1 is equivalent to maximizing:

    e^T S e

subject to the constraint that e is a unit vector:

    e^T e = 1

Use the Lagrange multiplier method to form the objective function:

    e^T S e − λ (e^T e − 1)

Differentiate to obtain the equation:

    2 S e − 2 λ e = 0,   or   S e = λ e

The solution is that e is the eigenvector of S corresponding to the largest eigenvalue.
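The result can be checked numerically: the eigenvector of S with the largest eigenvalue attains a larger e^T S e than any other unit vector. A minimal sketch (Python/NumPy used for illustration; synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.5])  # anisotropic cloud
m = X.mean(axis=0)
S = (X - m).T @ (X - m)            # scatter matrix

# eigh returns eigenvalues in ascending order for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(S)
e = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

# e^T S e equals the largest eigenvalue, and beats random unit vectors
assert np.isclose(e @ S @ e, eigvals[-1])
for _ in range(20):
    u = rng.normal(size=3)
    u /= np.linalg.norm(u)
    assert u @ S @ u <= e @ S @ e + 1e-9
```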
Eigensystem
e is an eigenvector and λ an eigenvalue if

    S e = λ e

• S x points in some other direction in general.
PCA: Extension to Multi-dimensions
The preceding analysis can be extended in the following way. Instead of projecting the data points onto a straight line, we may now want to project them onto a d-dimensional plane of the form:

    x = m + a_1 e_1 + … + a_d e_d

where d is much smaller than the original dimension p. In this case one can form the objective function:

    J_d = Σ_{i=1}^{N} ||(m + Σ_{k=1}^{d} a_{ki} e_k) − x_i||^2

It can also be shown that the vectors e_1, e_2, …, e_d are the d eigenvectors corresponding to the d largest eigenvalues of the scatter matrix

    S = Σ_{i=1}^{N} (x_i − m)(x_i − m)^T
Eigenvalues and Eigenvectors
• Given a square matrix S and some scalar λ (an eigenvalue), a non-zero vector e is an eigenvector if it satisfies

    S e = λ e  ↔  (S − λI) e = 0

• Characteristics
– There are n eigenvectors for a non-singular n × n matrix S
– solutions to the characteristic equation are obtained by finding det(S − λI) = 0 (MATLAB function 'eig')
– example: 2×2 case
    S = [ 2  1 ]       det(S − λI) = (2 − λ)^2 − 1 = 0   ⇒   λ = 1, 3
        [ 1  2 ],
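The 2×2 example can be verified directly. A sketch (Python/NumPy for illustration; the matrix is reconstructed from the garbled slide as S = [[2, 1], [1, 2]], and the eigenvalues 1 and 3 come out the same even if the off-diagonal sign were negative):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam = np.linalg.eigvalsh(S)        # eigenvalues in ascending order
# characteristic equation: det(S - lam*I) = (2 - lam)^2 - 1 = 0
for l in lam:
    assert np.isclose(np.linalg.det(S - l * np.eye(2)), 0.0)
assert np.allclose(lam, [1.0, 3.0])
```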
Eigenvectors on Covariance Matrix
• If S is a covariance matrix (S = E[x x^T]) and e is a unit eigenvector of S (|e| = 1), then E[(e^T x)^2] = λ, since e^T S e = λ e^T e = λ
• E = [e_1; e_2; …; e_n] decorrelates x, i.e., zeros the off-diagonal terms of the covariance matrix of y = E^T x:

    E[y y^T] = E[E^T x x^T E] = E^T E[x x^T] E = E^T S E = [ λ_1  0  ⋯  0
                                                             0   λ_2 ⋯  0
                                                             ⋮        ⋱  ⋮
                                                             0    0  ⋯ λ_n ]
Dimension Reduction
• Suppose each data point is a vector of dimension d.
• E[y_k^2] = E[e_k^T x x^T e_k] = e_k^T E[x x^T] e_k = e_k^T S e_k = λ_k
– The eigenvectors of S define a new coordinate system
» the eigenvector with the largest eigenvalue captures the most variation among the training vectors x
» the eigenvector with the smallest eigenvalue has the least variation
– We can compress the data (represent it with little error) by using only the top few eigenvectors
» this corresponds to choosing a "linear subspace": represent points on a line, plane, or "hyper-plane"
» these eigenvectors are known as the principal components
Code Example: pca.m
function [W, eigvector, eigvalue] = pca(R)
% PCA Principal component analysis
% [W, EIGVECTOR, EIGVALUE] = PCA(covX)
% covX: covariance matrix
% EIGVECTOR: each column is an
%   eigenvector of covX
% EIGVALUE: eigenvalues of covX
% W: whitening transform,
%   E[(W*x)*(W*x)'] = W*covX*W' = I
[v, d] = eig(R);
eigvector = v;
eigvalue = diag(d);
% Sort in a descending order
[~, index] = sort(-eigvalue);
% or use sort(eigvalue,'descend');
eigvalue = eigvalue(index);
eigvector = eigvector(:, index);
N = length(eigvalue);
W = zeros(N,N);
for m = 1:N
    W(m,:) = ...
        eigvector(:,m)'/sqrt(eigvalue(m));
end
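For readers working outside MATLAB, here is a Python/NumPy analogue of pca.m (a sketch, not course code); like the original, it sorts eigenpairs in descending order and returns a whitening transform W with W*covX*W' = I:

```python
import numpy as np

def pca(R):
    """PCA of a covariance matrix R: eigenpairs sorted by descending
    eigenvalue, plus the whitening transform W (W @ R @ W.T == I)."""
    eigvalue, eigvector = np.linalg.eigh(R)   # ascending for symmetric R
    order = np.argsort(-eigvalue)             # descending sort, as in pca.m
    eigvalue = eigvalue[order]
    eigvector = eigvector[:, order]
    W = eigvector.T / np.sqrt(eigvalue)[:, None]
    return W, eigvector, eigvalue

# usage on a toy covariance matrix
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])
W, V, d = pca(R)
```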
PCA Applications:
Face recognition
Facial expression recognition
Barnabás Póczos
University of Alberta
Challenge: Face Recognition
Task: identify a specific person, based on a facial image, regardless of wearing glasses, lighting, …
Can we use all the given 256 x 256 pixels?
• 256 x 256 = 2^16 pixels = 64 KB per image with 256 gray levels
Slide credit: Barnabás Póczos
The Space of Faces
• An image is a point in a high-dimensional space
– An N x M image is a point in R^{NM}
– We can define vectors in this space as we did in the 2D case
Key Idea
• Images in the possible set of faces are highly correlated.
• So, compress them to a low-dimensional subspace that captures key appearance characteristics.
• EIGENFACES [Turk and Pentland]: USE PCA!
Eigenfaces
Eigenfaces look somewhat like generic faces.
Eigenfaces – verbal summary
• Eigenfaces are the eigenvectors of the covariance matrix of the vector space of human faces
• Eigenfaces are the 'standardized face ingredients' derived from the statistical analysis of many pictures of human faces
• A human face may be considered to be a combination of these standardized faces
Generating Eigenfaces
1. A large set of images of human faces is taken.
2. The images are normalized to line up the eyes, mouths, and other features.
3. The eigenvectors of the covariance matrix of the face image vectors are then extracted.
4. These eigenvectors are called eigenfaces.
Eigenfaces for Face Recognition
• When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face.
• Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces.
• Hence eigenfaces provide a means of applying data compression to faces for identification purposes.
Dimensionality Reduction
The set of faces is a "subspace" of the set of images
– Suppose it is K-dimensional
– We can find the best subspace using PCA
– This is like fitting a "hyper-plane" to the set of faces
» spanned by vectors v_1, v_2, …, v_K
Any face can then be approximated by its projection onto this subspace.
Projecting onto the Eigenfaces
• The eigenfaces v_1, …, v_K span the space of faces
– A face x is converted to eigenface coordinates by projecting it onto each eigenface: w_i = v_i^T (x − m)
Choosing the Dimension K
[Figure: decay of the eigenvalues, i = 1 … NM]
• How many eigenfaces to use? Find the "knee."
• Look at the decay of the eigenvalues
– the eigenvalue tells you the amount of variance "in the direction" of that eigenface
– ignore eigenfaces with low variance
Applying PCA: Eigenfaces
• Example data set: images of faces
– Eigenface approach [Turk & Pentland], [Sirovich & Kirby]
• Each face x is …
– 256 × 256 values (luminance at each location)
» x in R^{256×256} (view as a 64K-dimensional vector)
• Form the centered data matrix X = [x_1, …, x_m] (256 × 256 real values, m faces)
• Compute Σ = XX^T
• Problem: Σ is 64K × 64K … HUGE!!!
Method A: Build a PCA subspace for each person and check which subspace can reconstruct the test image the best.
Method B: Build one PCA database for the whole dataset and then classify based on the weights.
Computational Complexity
• Suppose m instances, each of size N
– Eigenfaces: m = 500 faces, each of size N = 64K
• Given an N×N covariance matrix Σ, one can compute
– all N eigenvectors/eigenvalues in O(N^3)
– the first k eigenvectors/eigenvalues in O(kN^2)
• But if N = 64K, this often becomes computationally intractable.
A Clever Workaround
• Note that m << 64K
• Use L = X^T X instead of Σ = XX^T
• If v is an eigenvector of L, then Xv is an eigenvector of Σ
Proof:  L v = γ v
        X^T X v = γ v
        X (X^T X v) = X (γ v) = γ X v
        (X X^T) X v = γ (X v)
        Σ (X v) = γ (X v)
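The trick can be verified on synthetic data: the eigenvectors v of L = X^T X map to eigenvectors Xv of Σ = XX^T with the same eigenvalues. A sketch (Python/NumPy assumed; tiny sizes stand in for 64K-pixel faces):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 50, 6                     # N pixels per face, m faces (m << N)
X = rng.normal(size=(N, m))      # centered data matrix, one face per column

L = X.T @ X                      # small m x m matrix
gam, V = np.linalg.eigh(L)       # eigenpairs of L

Sigma = X @ X.T                  # huge N x N matrix (never needed in practice)
for g, v in zip(gam, V.T):
    u = X @ v                    # candidate eigenvector of Sigma
    assert np.allclose(Sigma @ u, g * u)
```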
Happiness subspace (Method A)
Method A: Build a PCA subspace for each person and check which subspace
can reconstruct the test image the best
Disgust subspace (Method A)
Principal Components (Method B)
Classification with Eigenfaces (Method B)
1. Process the image database (set of images with labels)
– Run PCA to compute eigenfaces
– Calculate the K coefficients for each image
2. Given a new image (to be recognized) x, calculate its K coefficients
3. Detect whether x is a face
4. If it is a face, who is it?
• Find the closest labeled face in the database
• nearest-neighbor search in the K-dimensional space
Reconstructing… (Method B)
• … faster if we train with…
– only people without glasses
– the same lighting conditions
Summary
• Advantages
– PCA is completely knowledge-free
– Reduction in computation and memory requirements
• Shortcomings
– PCA requires carefully controlled data:
» all faces centered in frame and of the same size; sensitive to angles
– Alternative:
» "learn" one set of PCA vectors for each angle
» use the one with the lowest error
LINEAR DISCRIMINANT ANALYSIS
PCA vs LDA
LDA for two classes and multi-classes
Example problem
Applications of LDA
Principal Component Analysis (PCA)
Find a transformation w such that w^T x is dispersed the most (maximum variance):

    Y = w^T X
Linear Discriminant Analysis (LDA)
Find a transformation w such that w^T X_1 and w^T X_2 are maximally separated and each class is minimally dispersed (maximum separation):

    Y_1 = w^T X_1,   Y_2 = w^T X_2
Linear Discriminant Analysis (LDA)
• Performs dimensionality reduction "while preserving as much of the class discriminatory information as possible."
• Seeks to find directions along which the classes are best separated.
• Takes into consideration the scatter within classes as well as the scatter between classes.
• More capable of distinguishing image variation due to identity from variation due to other sources such as illumination and expression.
Linear Discriminant Analysis (LDA)
• Separation by mean, by variation, …
[Figure: three class-separation cases (1, 2, 3)]
Linear Discriminant Analysis (LDA)
• Two classes: ω_1, ω_2
• Introduce
• S_B: between-class scatter matrix
• S_W: within-class scatter matrix

    μ_c = (1/N_c) Σ_{x ∈ ω_c} x                      (class mean)
    S_c = Σ_{x ∈ ω_c} (x − μ_c)(x − μ_c)^T           (class covariance)
    S_W = S_1 + S_2                                  (sum of the covariances)
    S_B = (μ_1 − μ_2)(μ_1 − μ_2)^T                   (scatter of the means)
LDA
Project the data: Y = w^T X. The projected class mean is

    μ̃_c = (1/N_c) Σ_{y ∈ ω_c} y = w^T μ_c

The projected between-class scatter is

    S̃_B = (μ̃_1 − μ̃_2)(μ̃_1 − μ̃_2)^T = w^T (μ_1 − μ_2)(μ_1 − μ_2)^T w = w^T S_B w

and the projected class scatter is

    S̃_c = Σ_{y ∈ ω_c} (y − μ̃_c)(y − μ̃_c)^T = Σ_{x ∈ ω_c} (w^T x − w^T μ_c)(w^T x − w^T μ_c)^T = w^T S_c w

so that

    S̃_1 + S̃_2 = w^T S_W w

Objective function J(w):

    J(w) = (w^T S_B w) / (w^T S_W w)

which maximizes S_B of y and minimizes S_W of y.
LDA by Eigenvector

    J(w) = (w^T S_B w) / (w^T S_W w)

How to maximize this? Equivalently, minimize −(1/2) w^T S_B w such that w^T S_W w = 1.
Use a Lagrange multiplier to form

    Λ(w, λ) = −(1/2) w^T S_B w + (1/2) λ (w^T S_W w − 1)

    ∂Λ/∂w = −S_B w + λ S_W w = 0

    S_B w = λ S_W w   ⇒   S_W^{−1} S_B w = λ w

An eigenvalue problem!!
LDA Example
• Example
• X1 = {(4,1), (2,4), (2,3), (3,6), (4,4)}
• X2 = {(9,10), (6,8), (9,5), (8,7), (10,8)}
• Class statistics

    μ_1 = [3.0  3.6],   μ_2 = [8.4  7.6]

    S_1 = [  0.8  −0.4 ]      S_2 = [  1.84  −0.04 ]
          [ −0.4   2.64 ],          [ −0.04   2.64 ]

• Within- and between-class scatter

    S_W = [  2.64  −0.44 ]    S_B = [ 29.16  21.60 ]
          [ −0.44   5.28 ],         [ 21.60  16.00 ]

• Solve the eigenvalue problem

    S_W^{−1} S_B w = λ w
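The example's class statistics can be reproduced and the discriminant direction computed; a sketch (Python/NumPy assumed; class scatter normalized by N, matching the slide's numbers):

```python
import numpy as np

X1 = np.array([(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)], dtype=float)
X2 = np.array([(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)   # class covariances
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
SW = S1 + S2                               # within-class scatter
d = mu1 - mu2
SB = np.outer(d, d)                        # between-class scatter

assert np.allclose(mu1, [3.0, 3.6]) and np.allclose(mu2, [8.4, 7.6])
assert np.allclose(SW, [[2.64, -0.44], [-0.44, 5.28]])
assert np.allclose(SB, [[29.16, 21.6], [21.6, 16.0]])

# leading eigenvector of SW^-1 SB gives the discriminant direction
vals, vecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
w = vecs[:, np.argmax(vals.real)].real
```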
LDA on more than 2 classes

    J(w) = S̃_B / S̃_W = (w^T S_B w) / (w^T S_W w)
LDA on C > 2 classes
• C classes, with data matrix U = [x_1^1 x_2^1 … x_{n_c}^c] (samples grouped by class):

    S_w = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^i − μ_i)(x_j^i − μ_i)^T

    S_b = Σ_{i=1}^{c} (μ_i − μ)(μ_i − μ)^T

Solve the generalized eigenvalue problem

    S_B V = λ S_W V,   and form W = [v_1 … v_k]
Comparison of PCA and LDA
• PCA finds axes of maximal variance
– Computed by eigenvalue decomposition
– Eigenfaces when applied to face recognition
• LDA finds axes of maximal separation
– Often referred to as Fisher's linear discriminant
– Fisherfaces when applied to face recognition
Which set is Eigenfaces and which is Fisherfaces?
Images from Wikipedia.org
Eigenfaces vs. Fisherfaces
• Independent Comparative Study of PCA and LDA on the FERET Data Set, by Kresimir Delac, Mislav Grgic, Sonja Grgic
– PCA: blurred, like average faces
– LDA finds more discriminant features from face images
References
• Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, July 1997.
• Kresimir Delac, Mislav Grgic, Sonja Grgic, "Independent Comparative Study of PCA, ICA, and LDA on the FERET Data Set."
• M. Turk, A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, 3(1), pp. 71-86, 1991.
• Gregory Shakhnarovich, Baback Moghaddam, "Face Recognition in Subspaces," Mitsubishi TR2004-041, May 2004.
• Naotoshi Seo, Project: Eigenfaces and Fisherfaces, http://note.sonots.com/SciSoftware/FaceRecognition.html
• http://www.face-rec.org/algorithms/
PCA Applications:
Image Compression and Denoising
Barnabás Póczos
University of Alberta
Original Image
• Divide the original 372 x 492 image into patches.
• Each patch is an instance that contains 12 x 12 pixels on a grid.
• View each patch as a 144-D vector.
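The patch pipeline described above can be sketched as follows (Python/NumPy assumed; a random array stands in for the actual 372x492 image, and 60 components are kept as in the later slides):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((372, 492))                 # stand-in for the original image

# cut the image into non-overlapping 12x12 patches, one 144-D row each
patches = np.array([img[i:i+12, j:j+12].ravel()
                    for i in range(0, 372, 12)
                    for j in range(0, 492, 12)])

mean = patches.mean(axis=0)
C = np.cov((patches - mean).T)               # 144 x 144 covariance
vals, vecs = np.linalg.eigh(C)
E = vecs[:, ::-1][:, :60]                    # top-60 eigenvectors

codes = (patches - mean) @ E                 # compress: 144-D -> 60-D
recon = codes @ E.T + mean                   # approximate reconstruction
err = np.mean((patches - recon) ** 2)        # L2 reconstruction error
```

Varying the 60 down to 16, 6, or 3 reproduces the compression sequence shown on the following slides.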
L2 error vs. PCA dimension
[Figure: reconstruction error as a function of the number of PCA components]
60 most important eigenvectors
Looks like the discrete cosine bases of JPEG!
2D Discrete Cosine Basis
http://en.wikipedia.org/wiki/Discrete_cosine_transform
PCA compression: 144D to 60D
16 most important eigenvectors
[Figure: the 16 most important eigenvectors, each displayed as a 12 x 12 patch]
PCA compression: 144D to 16D
6 most important eigenvectors
[Figure: the 6 most important eigenvectors, each displayed as a 12 x 12 patch]
PCA compression: 144D to 6D
3 most important eigenvectors
[Figure: the 3 most important eigenvectors, each displayed as a 12 x 12 patch]
PCA compression: 144D to 3D
Noise Filtering by Auto-Encoder
[Diagram: input x is projected onto the PCA basis U and reconstructed as x']
Noisy image
Denoised: 15 PCA components
Kernel PCA
Motivation
[Figure: two 2-D point sets (z vs. v and z vs. u) whose structure no single linear projection can capture]
Motivation
Linear projections will not detect the pattern.
Nonlinear PCA
Three popular methods are available:
1) Neural-network based PCA (E. Oja, 1982)
2) Method of Principal Curves (T. J. Hastie and W. Stuetzle, 1989)
3) Kernel based PCA (B. Schölkopf, A. Smola, and K. Müller, 1998)
[Figure: linear PCA vs. nonlinear PCA (NPCA) components fitted to the same data]
KPCA: Basic Idea
Kernel PCA Formulation…
• Let C be the scatter matrix of the centered mapping φ(x):

    C = Σ_{i=1}^{N} φ(x_i) φ(x_i)^T

• Let w be an eigenvector of C; then w can be written as a linear combination:

    w = Σ_{k=1}^{N} α_k φ(x_k)

• Also, we have:

    C w = λ w

• Combining, we get:

    Σ_{i=1}^{N} φ(x_i) φ(x_i)^T ( Σ_{k=1}^{N} α_k φ(x_k) ) = λ Σ_{k=1}^{N} α_k φ(x_k)
Kernel PCA Formulation…

    Σ_{i=1}^{N} φ(x_i) φ(x_i)^T ( Σ_{k=1}^{N} α_k φ(x_k) ) = λ Σ_{k=1}^{N} α_k φ(x_k)

Multiplying both sides by φ(x_l)^T on the left:

    Σ_{i=1}^{N} Σ_{k=1}^{N} α_k φ(x_l)^T φ(x_i) φ(x_i)^T φ(x_k) = λ Σ_{k=1}^{N} α_k φ(x_l)^T φ(x_k),   l = 1, 2, …, N

In matrix form, with K_ij = φ(x_i)^T φ(x_j) (the kernel or Gram matrix):

    K K α = λ K α   ⇒   K α = λ α
Kernel PCA Formulation…
From the eigen equation

    K α = λ α

and the fact that the eigenvector w is normalized to 1, we obtain:

    1 = ||w||^2 = ( Σ_{i=1}^{N} α_i φ(x_i) )^T ( Σ_{i=1}^{N} α_i φ(x_i) ) = α^T K α = λ α^T α
    ⇒ α^T α = 1/λ
KPCA Algorithm
Step 1: Compute the Gram matrix: K_ij = k(x_i, x_j), i, j = 1, …, N
Step 2: Compute the (eigenvalue, eigenvector) pairs of K: (λ_l, α^l), l = 1, …, M
Step 3: Normalize the eigenvectors: α^l ← α^l / √λ_l
Thus, an eigenvector w^l of C is now represented as:

    w^l = Σ_{k=1}^{N} α_k^l φ(x_k)

To project a test feature φ(x) onto w^l we need to compute:

    φ(x)^T w^l = Σ_{k=1}^{N} α_k^l φ(x)^T φ(x_k) = Σ_{k=1}^{N} α_k^l k(x, x_k)

So, we never need φ explicitly.
Feature Map Centering
So far we assumed that the feature map φ(x) is centered for the data points x_1, …, x_N, i.e.,

    Σ_{i=1}^{N} φ(x_i) = 0

Actually, this centering can be done on the Gram matrix without ever explicitly computing the feature map φ(x):

    K̃ = (I − 11^T/N) K (I − 11^T/N)

is the kernel matrix for centered features. A similar expression exists for projecting test features onto the feature eigenspace.
Schölkopf, Smola, Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Technical Report #44, Max Planck Institute, 1996.
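Putting the KPCA steps and the Gram-matrix centering together (a Python/NumPy sketch; the RBF kernel and the helper name kernel_pca are illustrative choices, not from the slides):

```python
import numpy as np

def kernel_pca(X, k, n_components):
    N = len(X)
    K = np.array([[k(a, b) for b in X] for a in X])   # Step 1: Gram matrix
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                    # center the feature map
    lam, alpha = np.linalg.eigh(Kc)                   # Step 2: eigenpairs of K
    lam, alpha = lam[::-1], alpha[:, ::-1]            # descending order
    keep = lam > 1e-10                                # drop numerically zero modes
    lam = lam[keep][:n_components]
    alpha = alpha[:, keep][:, :n_components]
    alpha = alpha / np.sqrt(lam)                      # Step 3: normalize
    return Kc @ alpha                                 # projections of training points

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))      # an example kernel
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
Y = kernel_pca(X, rbf, 2)          # 30 points projected onto 2 kernel PCs
```

Note that only kernel evaluations k(x_i, x_j) are used; φ itself never appears, exactly as the derivation promises.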
KPCA: USPS Digit Recognition
Schölkopf, Smola, Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Technical Report #44, Max Planck Institute, 1996.
Kernel function: k(x, y) = (x^T y)^d
Classifier: linear SVM with kernel principal components as features
N = 3000, p = 16-by-16 images
(the case d = 1 is linear PCA)
Input points before kernel PCA
http://en.wikipedia.org/wiki/Kernel_principal_component_analysis
Output after kernel PCA
The three groups are distinguishable using the first component only
PCA Conclusions
• PCA
– finds an orthonormal basis for the data
– sorts dimensions in order of "importance"
– discards low-significance dimensions
• Uses:
– get a compact description
– ignore noise
– improve classification (hopefully)
• Not magic:
– doesn't know class labels
– can only capture linear variations
• One of many tricks to reduce dimensionality!
*Matrix and Vector Derivatives
Matrix and vector derivatives are obtained first by element-wise derivatives and then reforming them into matrices and vectors.
Slide credit: Tae-Kyun Kim
*Matrix and Vector Derivatives