Principal Component Analysis
Philosophy of PCA
Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data in terms of a set of uncorrelated variables.
We typically have a data matrix of n observations on p correlated variables x1, x2, …, xp.
PCA looks for a transformation of the xi into p new variables yi that are uncorrelated.
The data matrix
| case | ht (x1) | wt (x2) | age (x3) | sbp (x4) | heart rate (x5) |
|------|---------|---------|----------|----------|-----------------|
| 1    | 175     | 1225    | 25       | 117      | 56              |
| 2    | 156     | 1050    | 31       | 122      | 63              |
| …    | …       | …       | …        | …        | …               |
| n    | 202     | 1350    | 58       | 154      | 67              |
Reduce dimension
The simplest way is to keep one variable and discard all the others: not reasonable!
Weight all variables equally: not reasonable (unless they all have the same variance).
Weighted average based on some criterion.
Which criterion?
Let us write it first
Looking for a linear transformation of the data matrix X (n×p) such that

Y = aᵀX = a₁X₁ + a₂X₂ + … + aₚXₚ

where a = (a₁, a₂, …, aₚ)ᵀ is a column vector of weights with

a₁² + a₂² + … + aₚ² = 1
One good criterion
Maximize the variance of the projection of the observations on the Y variables
Find a so that

Var(aᵀX) = aᵀ Var(X) a is maximal

The matrix C = Var(X) is the covariance matrix of the Xi variables.
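The identity above can be checked numerically. A minimal sketch (the data and the weight vector below are made up for illustration): the variance of the projected observations equals aᵀCa.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # n=100 observations on p=3 variables (toy data)
C = np.cov(X, rowvar=False)          # p x p covariance matrix of the variables

a = np.array([1.0, 1.0, 1.0])
a /= np.linalg.norm(a)               # enforce the constraint a1^2 + ... + ap^2 = 1

proj_var = a @ C @ a                 # Var(a^T X) in matrix form
direct_var = np.var(X @ a, ddof=1)   # the same variance, computed from the projected data
```

Both quantities agree exactly, which is why the maximization can be done on C alone rather than on the raw data.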
Let us see it on a figure
[Figure: projections of the data onto two candidate directions, labelled "Good" and "Better"]
Covariance matrix

C = | v(x₁)     c(x₁,x₂)  …  c(x₁,xₚ) |
    | c(x₂,x₁)  v(x₂)     …  c(x₂,xₚ) |
    | …         …         …  …        |
    | c(xₚ,x₁)  c(xₚ,x₂)  …  v(xₚ)    |

where v(xᵢ) is the variance of xᵢ and c(xᵢ,xⱼ) is the covariance of xᵢ and xⱼ.
And so… we find that
The direction of a is given by the eigenvector a₁ corresponding to the largest eigenvalue λ₁ of the matrix C.
The second vector, orthogonal (uncorrelated) to the first, is the one with the second highest variance: the eigenvector corresponding to the second largest eigenvalue.
And so on …
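A minimal numerical sketch of this result (toy data, made up): the top eigenvector of C achieves at least as much projected variance as any other unit direction.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
C = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(C)        # eigh, since C is symmetric
order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

a1 = eigvecs[:, 0]                          # direction of maximal variance

# Any other unit vector gives no more projected variance than a1
random_dir = rng.normal(size=4)
random_dir /= np.linalg.norm(random_dir)
```

Note that the variance achieved along a1 is exactly the largest eigenvalue λ₁.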
So PCA gives
New variables Yi that are linear combinations of the original variables xi:
Yi = ai1x1 + ai2x2 + … + aipxp ;  i = 1…p
The new variables Yi are derived in decreasing order of importance;
they are called 'principal components'.
Calculating eigenvalues and eigenvectors
The eigenvalues λᵢ are found by solving the equation

det(C − λI) = 0

The eigenvectors are the columns of the matrix A such that C = A D Aᵀ, where

D = | λ₁  0   …  0  |
    | 0   λ₂  …  0  |
    | …   …   …  …  |
    | 0   0   …  λₚ |
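The decomposition C = A D Aᵀ can be verified directly. A sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))         # toy data
C = np.cov(X, rowvar=False)

eigvals, A = np.linalg.eigh(C)       # columns of A are the eigenvectors of C
D = np.diag(eigvals)                 # diagonal matrix of eigenvalues
C_rebuilt = A @ D @ A.T              # should reproduce C exactly
```

Each λᵢ also satisfies det(C − λᵢI) = 0, which is how the eigenvalues are defined.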
An example
Let us take two variables with covariance c > 0:

C = | 1  c |        C − λI = | 1−λ  c   |
    | c  1 |                 | c    1−λ |

det(C − λI) = (1 − λ)² − c²

Solving this we find λ₁ = 1 + c and λ₂ = 1 − c < λ₁.
… and eigenvectors
Any eigenvector A satisfies the condition CA = λA:

CA = | 1  c | | a₁ |  =  | a₁ + c·a₂ |  =  λ | a₁ |
     | c  1 | | a₂ |     | c·a₁ + a₂ |       | a₂ |

Solving, λ₁ = 1 + c gives a₁ = a₂ and λ₂ = 1 − c gives a₁ = −a₂; after normalization:

A₁ = (1/√2) (1, 1)ᵀ        A₂ = (1/√2) (1, −1)ᵀ
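A quick numeric check of this 2×2 example (c = 0.6 is an arbitrary choice): the eigenvalues come out as 1 ± c and the eigenvectors are proportional to (1, 1)ᵀ and (1, −1)ᵀ.

```python
import numpy as np

c = 0.6                                  # any covariance 0 < c < 1
C = np.array([[1.0, c], [c, 1.0]])

eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
lam2, lam1 = eigvals                     # so lam2 = 1 - c comes first, lam1 = 1 + c second

v1 = eigvecs[:, 1]                       # eigenvector for lam1; proportional to (1, 1)
v2 = eigvecs[:, 0]                       # eigenvector for lam2; proportional to (1, -1)
```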
PCA is sensitive to scale
If you multiply one variable by a scalar you get different results (can you show it?).
This is because PCA uses the covariance matrix (and not the correlation matrix).
PCA should be applied to data that have approximately the same scale in each variable.
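Here is one way to show it, as a sketch on made-up data: rescaling a single variable by 100 makes that variable dominate the covariance matrix, so the first principal direction collapses onto its axis.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))            # two toy variables on the same scale

def first_pc(data):
    # eigenvector of the largest eigenvalue of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return eigvecs[:, -1]

pc_before = first_pc(X)

X_scaled = X.copy()
X_scaled[:, 0] *= 100.0                  # rescale the first variable only
pc_after = first_pc(X_scaled)            # now almost aligned with the first axis
```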
Interpretation of PCA
The new variables (PCs) have a variance equal to their corresponding eigenvalue:
Var(Yi) = λi for all i = 1…p
A small λi means a small variance: the data change little in the direction of component Yi.
The relative variance explained by each PC is given by λi / Σλj.
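A sketch of the explained-variance computation, on made-up data with deliberately unequal variances:

```python
import numpy as np

rng = np.random.default_rng(4)
# toy variables with standard deviations roughly 3, 1 and 0.3
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.3])
C = np.cov(X, rowvar=False)

eigvals = np.linalg.eigvalsh(C)[::-1]    # eigenvalues in decreasing order
explained = eigvals / eigvals.sum()      # lambda_i / sum(lambda_j) for each PC
```

The ratios sum to 1, and the first component captures most of the variance here because one variable dominates.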
How many components to keep?
Keep enough PCs to have a cumulative variance explained by the PCs above roughly 50–70%.
Kaiser criterion: keep PCs with eigenvalues > 1.
Scree plot: represents the ability of the PCs to explain the variation in the data.
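The two numeric rules above can be sketched as follows, on made-up standardized data (the Kaiser threshold of 1 is meaningful there, since the eigenvalues of a correlation matrix sum to p):

```python
import numpy as np

rng = np.random.default_rng(5)
# toy data: 5 observed variables driven by 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

R = np.corrcoef(X, rowvar=False)             # correlation matrix (standardized PCA)
eigvals = np.linalg.eigvalsh(R)[::-1]        # decreasing order

n_kaiser = int((eigvals > 1).sum())          # Kaiser criterion: eigenvalues > 1
cumvar = np.cumsum(eigvals) / eigvals.sum()
n_70 = int(np.searchsorted(cumvar, 0.70) + 1)  # smallest k with >= 70% explained
```

With two latent factors, both rules point to keeping one or two components.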
Do it graphically
Interpretation of components
See the weights of the variables in each component.
If Y1 = 0.89·X1 + 0.15·X2 − 0.77·X3 + 0.51·X4
then X1 and X3 have the highest weights (in absolute value) and so are the most important variables in the first PC.
See the correlation between the variables Xi and the PCs: circle of correlation.
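The coordinates plotted in a circle of correlation can be computed as below, a sketch on made-up data: each variable gets the pair (corr(Xi, Y1), corr(Xi, Y2)), which always lies inside the unit circle.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))  # correlated toy data
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
Y = Xc @ eigvecs[:, ::-1]            # PC scores, columns in decreasing-eigenvalue order

# corr(X_i, Y_j) for the first two components: the circle-of-correlation coordinates
corr = np.array([[np.corrcoef(X[:, i], Y[:, j])[0, 1] for j in range(2)]
                 for i in range(3)])
```

Variables plotted near the rim of the circle are well represented by the first two PCs; variables near the center are not.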
Circle of correlation
Normalized (standardized) PCA
If the variables have very heterogeneous variances, we standardize them.
The standardized variables are
Xi* = (Xi − mean) / standard deviation
The new variables all have the same variance (1), so each variable has the same weight.
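A sketch on made-up data with wildly different scales: standardizing each variable is equivalent to running PCA on the correlation matrix, since the covariance matrix of the z-scores is exactly the correlation matrix of the original data.

```python
import numpy as np

rng = np.random.default_rng(7)
# toy variables with scales differing by four orders of magnitude
X = rng.normal(size=(100, 3)) * np.array([100.0, 1.0, 0.01])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardized variables Xi*
C_z = np.cov(Z, rowvar=False)                     # covariance of the z-scores
R = np.corrcoef(X, rowvar=False)                  # correlation matrix of X
```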
Application of PCA in Genomics
PCA is useful for finding new, more informative, uncorrelated features; it reduces dimensionality by rejecting low-variance features.
Analysis of expression data
Analysis of metabolomics data (Ward et al., 2003)
However
PCA is only powerful if the biological question is related to the highest variance in the dataset.
If not, other techniques are more useful: Independent Component Analysis (ICA).
Introduced by Jutten in 1987.
What is ICA?
Which looks like this
The idea behind ICA
How does it work?
Rationale of ICA
Find the components Si that are as independent as possible, in the sense of maximizing some function F(s1, s2, …, sk) that measures independence.
All ICs (except possibly one) should be non-Normal.
The variance of all ICs is 1.
There is no hierarchy between ICs.
How to find ICs?
Many choices of objective function F. One is mutual information:

MI(s₁, s₂, …, s_k) = E[ log ( f(s₁, s₂, …, s_k) / ( f₁(s₁) f₂(s₂) … f_k(s_k) ) ) ]

which is zero exactly when the joint density factorizes, i.e. when the components are independent.
We use the kurtosis of the variables to approximate the distribution function.
The number of ICs is chosen by the user.
Difference with PCA
It is not a dimensionality reduction technique.
There is no single (exact) solution for the components; different algorithms are used (in R: FastICA, PearsonICA, MLICA).
ICs are of course uncorrelated, but also as independent as possible.
ICA is uninteresting for Normally distributed variables.
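A hand-rolled two-source sketch of the idea (not any of the R algorithms above; the sources and mixing matrix are made up): whiten the mixed signals, then search for the rotation whose components are maximally non-Gaussian (largest total |excess kurtosis|).

```python
import numpy as np

rng = np.random.default_rng(9)
n = 20_000

# Two made-up non-Gaussian sources (ICA cannot separate Gaussian ones)
s1 = np.sign(rng.normal(size=n)) * rng.uniform(0.5, 1.5, n)  # symmetric, bimodal
s2 = rng.uniform(-1.0, 1.0, size=n)                          # uniform
S = np.vstack([s1, s2])
A = np.array([[1.0, 0.6], [0.4, 1.0]])     # made-up mixing matrix
X = A @ S                                  # observed mixtures

# Step 1: whiten the mixtures so their covariance is the identity
d, E = np.linalg.eigh(np.cov(X))
Z = (E / np.sqrt(d)) @ E.T @ X

# Step 2: among rotations of the whitened data, pick the one whose
# components have the largest total |excess kurtosis|
def kurt(s):
    return np.mean(s ** 4) - 3.0           # s already has (near-)unit variance

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

angles = np.linspace(0.0, np.pi / 2, 181)
scores = [abs(kurt((rot(t) @ Z)[0])) + abs(kurt((rot(t) @ Z)[1])) for t in angles]
S_hat = rot(angles[int(np.argmax(scores))]) @ Z   # estimated independent components
```

Each recovered component matches one of the true sources up to sign and permutation, the well-known ambiguities of ICA.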
Example: Lee and Batzoglou (2003)
Microarray expression data on 7070 genes in 59 normal human tissue samples (19 types).
Here we are not interested in reducing dimension but rather in looking for genes that show a tissue-specific expression profile (what makes tissue types different).
PCA vs ICA
Hsiao et al. (2002) applied PCA and, by visual inspection, observed three clusters of 425 genes: liver-specific, brain-specific and muscle-specific.
ICA identified more tissue-specific genes than PCA