Data Reduction I: PCA and Factor Analysis
Carlos Pinto
Neuropsychiatric Genetics Group, University of Dublin
Data Analysis Seminars, 11 November 2009
-
Overview
1. Distortions in Data
2. Variance and Covariance
3. PCA: A Toy Example
4. Running PCA in R
5. Factor Analysis
6. Running Factor Analysis in R
7. Conclusions
-
Sources of Confusion in Data
1. Noise
2. Rotation
3. Redundancy
... in linear systems
-
Noise
1. Noise arises from inaccuracies in our measurements
2. Noise is measured relative to the signal
SNR = σ²signal / σ²noise
The SNR must be large if we are to extract useful information
from our experiment.
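For example, with variance as the measure of signal and noise strength (a made-up R illustration; the sine signal and the noise level are arbitrary choices):

# Illustrative SNR computation with made-up data
set.seed(1)
signal <- sin(seq(0, 10, by = 0.01))        # clean underlying signal
noise  <- rnorm(length(signal), sd = 0.2)   # measurement noise
var(signal) / var(noise)                    # SNR on the order of 10: usable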
-
Variance in Signal and Noise
-
Rotation
1. We don't always know the true underlying variables in the
system we are studying.
2. We may actually be measuring some (hopefully linear!)
combination of the true variables.
-
Redundancy
1. We don't always know whether the variables we are
measuring are independent.
2. We may actually be measuring one or more (hopefully
linear!) combinations of the same set of variables.
-
Degrees of Redundancy
-
Goals of PCA
1. To isolate noise
2. To separate out the redundant degrees of freedom
3. To eliminate the effects of rotation
-
Example: Measuring Two Correlated Variables
-
Reduced Data
-
Variance
Variable X: measurements {x1, ..., xn}
Variable Y: measurements {y1, ..., yn}
Variance of X: σ²XX = Σi (xi − x̄)(xi − x̄) / (n − 1)
Variance of Y: σ²YY = Σi (yi − ȳ)(yi − ȳ) / (n − 1)
-
Covariance
Covariance of X and Y:
σ²XY = Σi (xi − x̄)(yi − ȳ) / (n − 1) = σ²YX
The covariance measures the degree of the (linear!)
relationship between X and Y.
Small values of the covariance indicate that the variables are
uncorrelated.
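These formulas match R's built-in var and cov, which also use the n − 1 divisor. A minimal check, using the data from Example 1 below:

# Toy data from Example 1: X = (1, 2), Y = (3, 6)
x <- c(1, 2)
y <- c(3, 6)
n <- length(x)
sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # 1.5, by the formula above
cov(x, y)                                     # 1.5, R's built-in agrees
var(x)                                        # 0.5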
-
Covariance Matrix
We can write the four variances associated with our two
variables in the form of a matrix:

  [ CXX  CXY ]
  [ CYX  CYY ]

The diagonal elements are the variances and the off-diagonal
elements are the covariances.
The covariance matrix is symmetric, since CXY = CYX.
-
Three Variables
For 3 variables, we get a 3 × 3 matrix:

  [ C11  C12  C13 ]
  [ C21  C22  C23 ]
  [ C31  C32  C33 ]

As before, the diagonal elements are the variances, the
off-diagonal elements are the covariances, and the matrix is
symmetric.
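In R, cov applied to a multi-column object returns this matrix directly. A small sketch with made-up data:

# Made-up data: three variables (columns), four observations (rows)
m <- cbind(x = c(1, 2, 4, 3),
           y = c(2, 1, 3, 4),
           z = c(5, 4, 6, 7))
cov(m)                         # 3 x 3 matrix; diagonal entries are variances
all.equal(cov(m), t(cov(m)))   # TRUE: the covariance matrix is symmetric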
-
Interpretation of the Covariance Matrix
1. The diagonal terms are the variances of our different
variables.
2. Large diagonal values correspond to strong signals
3. The off-diagonal terms are the covariances between
different variables. They reflect distortions in the data
(noise, rotation, redundancy).
4. Large off-diagonal values correspond to distortions in our data.
-
Minimizing Distortion
1. If we redefine our variables (as linear combinations of each
other) the covariance matrix will change.
2. We want to change the covariance matrix so that the
off-diagonal elements are close to zero.
3. That is, we want to diagonalize the covariance matrix.
-
Example 1: Toy Example
X = (1, 2),  X̄ = 3/2
Y = (3, 6),  Ȳ = 9/2
Subtract the means, since we are only interested in the
variances:
X′ = (−1/2, 1/2)
Y′ = (−3/2, 3/2)
-
Variances
Dividing by n − 1 = 1, the variances and covariances are:
C11 = (−1/2)(−1/2) + (1/2)(1/2) = 1/2
C12 = (−1/2)(−3/2) + (1/2)(3/2) = 3/2
C21 = (−3/2)(−1/2) + (3/2)(1/2) = 3/2
C22 = (−3/2)(−3/2) + (3/2)(3/2) = 9/2
-
Covariance Matrix
The covariance matrix is:

  C = [ 1/2  3/2 ]
      [ 3/2  9/2 ]

This is not diagonal: the nonzero off-diagonal terms indicate
the presence of distortion, in this case because the two
variables are correlated.
-
Rotation of Axes I
We need to redefine our variables to diagonalize the covariance
matrix.
Choose new variables which are linear combinations of the old
ones:
X′ = a1X + a2Y
Y′ = b1X + b2Y
We have to determine the four constants a1, a2, b1, b2 such that
the covariance matrix is diagonal.
-
-
Rotation of Axes II
In this case, the constants are easy to find:
X′ = X + 3Y
Y′ = −3X + Y
-
Recalculating the Data Points
Our data points relative to the new axes are now:
X′1 = −1/2 + 3(−3/2) = −5
X′2 = 1/2 + 3(3/2) = 5
Y′1 = −3(−1/2) + (−3/2) = 0
Y′2 = −3(1/2) + 3/2 = 0
-
Recalculating the Variances
The new variances (again dividing by n − 1 = 1) are now:
C′11 = (−5)(−5) + (5)(5) = 50
C′12 = (−5)(0) + (5)(0) = 0
C′21 = (0)(−5) + (0)(5) = 0
C′22 = (0)(0) + (0)(0) = 0
-
Diagonalised Covariance Matrix
The covariance matrix is now:

  C′ = [ 50  0 ]
       [  0  0 ]

1. All off-diagonal entries are now zero, so the data distortions
have been removed
2. The variance of Y′ (the second entry on the diagonal) is now
zero. This variable has been eliminated, so the data
has been dimensionally reduced (from 2 dimensions to 1)
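A quick cross-check of this example in R: eigen normalises its eigenvectors to unit length, so it reports the variance along the unit direction, 5; the axis X′ = X + 3Y used above has squared length 1² + 3² = 10, which scales that variance up to the 50 shown here.

# Covariance matrix of the toy example
C <- matrix(c(1/2, 3/2,
              3/2, 9/2), nrow = 2, byrow = TRUE)
e <- eigen(C)
e$values    # 5 0: one nonzero eigenvalue, so the data are 1-dimensional
e$vectors   # first column is proportional to (1, 3), the X′ direction

# Variance along the unnormalised axis (1, 3):
v <- c(1, 3)
t(v) %*% C %*% v   # 50, matching the diagonalised matrix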
-
Eigenvalues
1. The numbers on the diagonal of the diagonalised matrix C′
are called the eigenvalues of the covariance matrix.
2. Large eigenvalues correspond to large variances: these
are the important variables to consider.
3. Small eigenvalues correspond to small variances: these
variables can sometimes be neglected.
-
Eigenvectors
The directions of the new rotated axes are called the
eigenvectors of the covariance matrix
Etymological Note:
vector: from Latin, carrier
eigen: from German, proper to, belonging to
-
Example 2: A More Realistic Example
3 variables, 5 observations
Finance, X   Marketing, Y   Policy, Z
         3              6           5
         7              3           3
        10              9           8
         3              9           7
        10              6           5
-
Running PCA in R
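A minimal sketch of this step in base R, using prcomp on the Example 2 data (princomp is an older alternative; the exact call used in the original session is an assumption):

# Example 2 data: 3 variables, 5 observations
dat <- data.frame(Finance   = c(3, 7, 10, 3, 10),
                  Marketing = c(6, 3, 9, 9, 6),
                  Policy    = c(5, 3, 8, 7, 5))

# PCA on the covariance matrix (center the data, do not rescale)
pca <- prcomp(dat, center = TRUE, scale. = FALSE)
summary(pca)   # standard deviations and proportions of variance
pca$rotation   # loadings: eigenvectors of the covariance matrix
pca$x          # the data expressed on the new (principal) axes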
-
PCA Output from R
-
Assumptions and Limitations of PCA
1. Large variances represent signal. Small variances
represent noise.
2. The new axes are linear combinations of the original axes
3. The principal components are orthogonal (perpendicular)
to each other
4. The distributions of our measurements are fully described
by their mean and variance
-
Advantages of PCA
1. The procedure is hypothesis-free.
2. The procedure is well-defined.
3. The solution is (essentially) unique.
-
Factor Analysis
There are a multiplicity of procedures which go under the name
of factor analysis.
Factor analysis is hypothesis driven and tries to find a priori
structure in the data.
We'll use the data from Example 2 in the next few slides...
-
Factor Model
Hypothesis: A significant part of the variance of our 3
variables can be expressed as linear combinations of two
underlying factors, F1 and F2:
X = a1F1 + a2F2 + eX
Y = b1F1 + b2F2 + eY
Z = c1F1 + c2F2 + eZ
eX, eY and eZ allow for errors in the measurements of our 3
variables.
-
Assumptions
1. The factors Fi are independent with mean 0 and variance 1
2. The error terms are independent with mean 0 and variance σ²i
With these assumptions, we can calculate the variances in
terms of our hypothesised loadings.
-
Factor Model Variances
CXX = a1² + a2² + σ²eX
CYY = b1² + b2² + σ²eY
CZZ = c1² + c2² + σ²eZ
CXY = a1b1 + a2b2
CXZ = a1c1 + a2c2
CYZ = b1c1 + b2c2
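These identities can be checked numerically. A sketch with arbitrary made-up loadings; under the model assumptions, the sample moments should approach the formulas as n grows:

# Simulate the two-factor model with made-up loadings
set.seed(42)
n  <- 100000
F1 <- rnorm(n); F2 <- rnorm(n)   # independent factors: mean 0, variance 1
X  <- 0.8*F1 + 0.3*F2 + rnorm(n, sd = 0.4)
Y  <- 0.1*F1 + 0.9*F2 + rnorm(n, sd = 0.4)
cov(X, Y)   # approx a1*b1 + a2*b2 = 0.8*0.1 + 0.3*0.9 = 0.35
var(X)      # approx a1^2 + a2^2 + 0.4^2 = 0.64 + 0.09 + 0.16 = 0.89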
-
Observed Variances for Example 2
CXX = 9.84
CYY = 5.04
CZZ = 3.04
CXY = −0.36
CXZ = 0.44
CYZ = 3.84
Note: These are population variances (divisor n rather than n − 1)
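These values can be reproduced in R; cov uses the n − 1 divisor, so rescale by (n − 1)/n to get the population form used above:

dat <- data.frame(Finance   = c(3, 7, 10, 3, 10),
                  Marketing = c(6, 3, 9, 9, 6),
                  Policy    = c(5, 3, 8, 7, 5))
n <- nrow(dat)
cov(dat) * (n - 1) / n   # diagonal: 9.84, 5.04, 3.04; CYZ entry: 3.84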
-
Fitting the Model
To fit the model we would like to choose the loadings ai, bi and
ci so that the predicted variances match the observed
variances.
Problem: The solution is not unique
That is, there are different choices for the loadings that yield the
same values for the variances; in fact, an infinite number of them.
-
So ... what does the computer do?
In essence, it finds an initial set of loadings and then rotates
these to find the best solution according to a user-specified
criterion (a short R sketch follows the list):
Varimax: Tends to maximise the difference in loading
between factors.
Quartimax: Tends to maximise the difference in loading
between variables.
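Base R's stats package provides varimax and promax rotations that can be applied to any loadings matrix (factanal, used below, applies varimax by default). The loadings matrix here is made up purely for illustration:

# A made-up 3 x 2 loadings matrix
L <- matrix(c(0.9, 0.1,
              0.8, 0.3,
              0.2, 0.9), ncol = 2, byrow = TRUE)
varimax(L)   # rotated loadings, plus the rotation matrix that was applied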
-
Example 3
Physics History English Maths Politics
8 4 3 8 5
7 3 3 9 3
10 4 3 9 4
3 8 9 4 7
4 8 6 4 7
5 9 8 5 9
-
Running Factor Analysis in R
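A minimal sketch of this step, using base R's factanal on the Example 3 data (the number of factors is a choice; with five variables, one or two factors keep the degrees of freedom non-negative):

# Example 3 data: six students' scores on five subjects
scores <- data.frame(Physics  = c(8, 7, 10, 3, 4, 5),
                     History  = c(4, 3, 4, 8, 8, 9),
                     English  = c(3, 3, 3, 9, 6, 8),
                     Maths    = c(8, 9, 9, 4, 4, 5),
                     Politics = c(5, 3, 4, 7, 7, 9))

# factanal fits by maximum likelihood and applies varimax by default
fa <- factanal(scores, factors = 1)
fa$loadings       # each subject's loading on the factor
fa$uniquenesses   # residual (error) variance for each subject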
-
PCA Analysis of Example 3
-
Factor Analysis of Example 3
-
Some Opinions on Factor Analysis...
-
Take Home
1. Measurements are subject to many distortions including
noise, rotation and redundancy
2. PCA is a robust, well-defined method for reducing data and
investigating underlying structure.
3. There are many approaches to factor analysis; all of them
are subject to a certain degree of arbitrariness and should
be used with caution
-
Further Reading
A Tutorial on Principal Components Analysis
Jonathon Shlens
http://www.snl.salk.edu/~shlens/notes.html

Factor Analysis
Peter Tryfos
http://www.yorku.ca/ptryfos/f1400.pdf