1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006

6. Principal Component Analysis

Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set, that nonetheless retains most of the sample's information.

By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.
Principal component
• direction of maximum variance in the input space
• principal eigenvector of the covariance matrix
Goal: Relate these two definitions
• Variance

A random variable x fluctuating about its mean value ⟨x⟩:

    δx = x − ⟨x⟩

The variance is the average of the square of the fluctuations:

    ⟨δx²⟩ = ⟨x²⟩ − ⟨x⟩²
• Covariance

A pair of random variables x1, x2, each fluctuating about its mean value:

    δx1 = x1 − ⟨x1⟩,  δx2 = x2 − ⟨x2⟩

The covariance is the average of the product of the fluctuations:

    ⟨δx1 δx2⟩ = ⟨x1 x2⟩ − ⟨x1⟩⟨x2⟩
• Covariance matrix

For N random variables x1, …, xN, the covariances form an N×N symmetric matrix whose diagonal elements are the variances:

    C_ij = ⟨x_i x_j⟩ − ⟨x_i⟩⟨x_j⟩
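As an illustrative sketch (not part of the original slides, and assuming NumPy), the covariance matrix can be computed directly from this definition:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))  # 1000 samples of N = 3 variables

# C_ij = <x_i x_j> - <x_i><x_j>, averaging over the sample
mean = x.mean(axis=0)
C = (x.T @ x) / len(x) - np.outer(mean, mean)

# C is symmetric, and its diagonal holds the variances
assert np.allclose(C, C.T)
assert np.allclose(np.diag(C), x.var(axis=0))
```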
Principal components

• eigenvectors with the k largest eigenvalues

Now you can calculate them, but what do they mean?
• Covariance to variance

From the covariance matrix, the variance of any projection can be calculated. Let w be a unit vector; then the variance of the projection wᵀx is

    ⟨(wᵀx)²⟩ − ⟨wᵀx⟩² = wᵀCw = Σ_ij w_i C_ij w_j
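A quick numerical check of this identity (a sketch assuming NumPy, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(5000, 3))
C = np.cov(x, rowvar=False, bias=True)  # covariance matrix of the 3 variables

w = np.array([1.0, 2.0, -1.0])
w /= np.linalg.norm(w)  # unit vector

# Variance of the projection x @ w equals the quadratic form w.T C w
proj = x @ w
assert np.isclose(proj.var(), w @ C @ w)
```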
• Maximizing parallel variance

The variance wᵀCw is maximized by the principal eigenvector of C (the one with the largest eigenvalue):

    w* = argmax_{‖w‖=1} wᵀCw

    λ_max(C) = max_{‖w‖=1} wᵀCw = w*ᵀCw*
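This maximization property can be verified numerically. The sketch below (assuming NumPy; not part of the original slides) compares the principal eigenvector against random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
C = A @ A.T  # a symmetric positive semi-definite "covariance" matrix

# eigh returns eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
lam_max, w_star = eigvals[-1], eigvecs[:, -1]

# The principal eigenvector attains the maximum of w.T C w over unit vectors
assert np.isclose(w_star @ C @ w_star, lam_max)
for _ in range(100):
    w = rng.normal(size=4)
    w /= np.linalg.norm(w)  # random unit vector
    assert w @ C @ w <= lam_max + 1e-9
```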
Geometric picture of principal components
• the 1st PC z1 is a minimum distance fit to a line in X space
• the 2nd PC z2 is a minimum distance fit to a line in the plane perpendicular to the 1st PC
Algebraic definition of PCs

Given a sample of n observations on a vector of p variables x = (x1, x2, …, xp), define the first principal component of the sample by the linear transformation

    z1 = a1ᵀx = Σ_i a_i1 x_i

where the vector a1 = (a11, a21, …, ap1) is chosen such that var[z1] is maximum.
Likewise, define the kth PC of the sample by the linear transformation

    z_k = a_kᵀx,  k = 1, …, p

where the vector a_k = (a1k, a2k, …, apk) is chosen such that var[z_k] is maximum, subject to cov[z_k, z_l] = 0 for k > l ≥ 1 and a_kᵀa_k = 1.
To find a1, first note that

    var[z1] = ⟨z1²⟩ − ⟨z1⟩² = a1ᵀCa1

where C = ⟨xxᵀ⟩ − ⟨x⟩⟨x⟩ᵀ is the covariance matrix for the variables x.
To find a1, maximize var[z1] = a1ᵀCa1 subject to a1ᵀa1 = 1.

Let λ be a Lagrange multiplier; then maximize

    a1ᵀCa1 − λ(a1ᵀa1 − 1)

By differentiating with respect to a1:

    Ca1 − λa1 = 0
    (C − λI_p)a1 = 0

therefore a1 is an eigenvector of C corresponding to eigenvalue λ.
Algebraic derivation of a1

We have maximized

    var[z1] = a1ᵀCa1 = λa1ᵀa1 = λ

So λ is the largest eigenvalue of C, i.e. λ = λ1. The first PC z1 retains the greatest amount of variation in the sample.
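The conclusion that var[z1] equals the largest eigenvalue can be checked on a sample. This is an illustrative sketch assuming NumPy, not part of the original slides:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 4))  # correlated sample

C = np.cov(x, rowvar=False, bias=True)
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
a1 = eigvecs[:, -1]                       # first PC direction

z1 = x @ a1                               # first PC scores
assert np.isclose(z1.var(), eigvals[-1])  # var[z1] = largest eigenvalue
```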
To find the vector a2, maximize var[z2] = a2ᵀCa2 subject to a2ᵀa2 = 1 and cov[z2, z1] = 0.

First note that

    cov[z2, z1] = a2ᵀCa1 = λ1 a2ᵀa1

then let λ and φ be Lagrange multipliers, and maximize

    a2ᵀCa2 − λ(a2ᵀa2 − 1) − φ a2ᵀa1
We find that a2 is also an eigenvector of C, whose eigenvalue λ = λ2 is the second largest. In general:

    var[z_k] = a_kᵀCa_k = λ_k

The kth largest eigenvalue of C is the variance of the kth PC. The kth PC z_k retains the kth greatest fraction of the variation in the sample.
Algebraic formulation of PCA

Given a sample of n observations on a vector of p variables x, define a vector of p PCs

    z = Aᵀx

where A is an orthogonal p × p matrix whose kth column a_k is the kth eigenvector of C.
Then

    Λ = AᵀCA

is the covariance matrix of the PCs, being diagonal with elements Λ_kk = λ_k.
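That AᵀCA is diagonal can be demonstrated numerically. A sketch assuming NumPy (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=(3000, 3)) @ rng.normal(size=(3, 3))  # correlated sample

C = np.cov(x, rowvar=False, bias=True)
eigvals, A = np.linalg.eigh(C)  # columns of A are eigenvectors of C

# Covariance matrix of the PCs: Lambda = A.T C A is diagonal,
# with the eigenvalues of C on the diagonal
Lam = A.T @ C @ A
assert np.allclose(Lam, np.diag(eigvals), atol=1e-10)
```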
In general it is useful to define standardized variables by

    x_k* = x_k / sqrt(var[x_k])

If the x_k are each measured about their sample mean, then the covariance matrix of x* will be equal to the correlation matrix of x, and the PCs will be dimensionless.
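The equivalence between the covariance matrix of standardized variables and the correlation matrix can be checked directly. An illustrative sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(1000, 3)) * np.array([1.0, 10.0, 100.0])  # mixed scales

xc = x - x.mean(axis=0)   # measure about the sample mean
xs = xc / xc.std(axis=0)  # standardized variables

# Covariance of the standardized variables = correlation matrix of x
C_std = np.cov(xs, rowvar=False, bias=True)
assert np.allclose(C_std, np.corrcoef(x, rowvar=False))
```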
Practical computation of PCs

Given a sample of n observations on a vector of p variables x (each measured about its sample mean), compute the covariance matrix

    C = (1/n) XᵀX

where X is the n × p matrix whose ith row is the ith observation x_iᵀ.
Then, compute the n × p matrix Z whose ith row is the PC score z_iᵀ for the ith observation:

    Z = XA

Write X = ZAᵀ to decompose each observation into PCs:

    x_i = Az_i = Σ_k z_ik a_k
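The whole practical recipe fits in a few lines. A sketch assuming NumPy (not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(6)
raw = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))
X = raw - raw.mean(axis=0)      # n x p matrix of centered observations

C = (X.T @ X) / len(X)          # covariance matrix C = (1/n) X.T X
eigvals, A = np.linalg.eigh(C)  # columns of A: eigenvectors of C

Z = X @ A                       # PC scores, one row per observation
assert np.allclose(X, Z @ A.T)  # X = Z A.T recovers the data exactly
```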
Usage of PCA: Data compression

Because the kth PC retains the kth greatest fraction of the variation, we can approximate each observation by truncating the sum at the first m < p PCs:

    x_i ≈ Σ_{k=1..m} z_ik a_k
This reduces the dimensionality of the data from p to m < p by approximating

    X ≈ Z_m A_mᵀ

where Z_m is the n × m portion of Z and A_m is the p × m portion of A.
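A sketch of this truncation on synthetic data that is nearly 2-dimensional (assuming NumPy; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
# Data lying close to a 2-dimensional plane embedded in p = 3 dimensions
latent = rng.normal(size=(400, 2)) @ rng.normal(size=(2, 3))
X = latent + 0.01 * rng.normal(size=(400, 3))
X = X - X.mean(axis=0)

C = (X.T @ X) / len(X)
eigvals, A = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # sort PCs by descending eigenvalue
A = A[:, order]

m = 2
Zm = X @ A[:, :m]          # n x m scores on the first m PCs
X_approx = Zm @ A[:, :m].T  # rank-m reconstruction X ~ Z_m A_m.T

# Almost all of the variation survives the truncation to m = 2 PCs
err = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
assert err < 0.05
```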