Xuhua Xia Slide 1
Principal Components Analysis
• Objectives:
– Understand the principles of principal components analysis (PCA)
– Recognize conditions under which PCA may be useful
– Use SAS procedure PRINCOMP to
  • perform a principal components analysis
  • interpret PRINCOMP output.
Xuhua Xia Slide 2
Typical Form of Data
A data set in an 8x3 matrix. The rows could be species and the columns sampling sites.
\[
X = \begin{bmatrix}
100 & 97 & 99 \\
96 & 90 & 90 \\
80 & 75 & 60 \\
75 & 85 & 95 \\
62 & 40 & 28 \\
77 & 80 & 78 \\
92 & 91 & 80 \\
75 & 85 & 100
\end{bmatrix}
\]
A matrix is often referred to as an n x p matrix (n for the number of rows and p for the number of columns). Our matrix has 8 rows and 3 columns, so it is an 8x3 matrix. A variance-covariance matrix has n = p and is called a square matrix of order n.
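To make the n x p versus p x p distinction concrete, here is a minimal Python sketch (not part of the SAS workflow in these slides) that builds the sample variance-covariance matrix of the 8x3 data matrix above. The function name `covariance_matrix` is a hypothetical helper, and the divisor n - 1 is assumed, as in the covariance formula used later in the slides.

```python
from statistics import mean

# The 8x3 data matrix X from the slide (rows could be species, columns sites).
X = [
    [100, 97, 99],
    [96, 90, 90],
    [80, 75, 60],
    [75, 85, 95],
    [62, 40, 28],
    [77, 80, 78],
    [92, 91, 80],
    [75, 85, 100],
]


def covariance_matrix(rows):
    """Return the p x p sample variance-covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    col_means = [mean(r[j] for r in rows) for j in range(p)]
    return [
        [
            sum((r[j] - col_means[j]) * (r[k] - col_means[k]) for r in rows) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]


S = covariance_matrix(X)
print(len(S), "x", len(S[0]))  # the result is square, with one row and column per variable
for row in S:
    print([round(v, 2) for v in row])
```

The 8x3 data matrix yields a 3x3 symmetric matrix: variances on the diagonal, covariances off the diagonal.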
Xuhua Xia Slide 3
What are Principal Components?
• Principal components are linear combinations of the observed variables. The coefficients of these linear combinations are chosen to meet three criteria.
\[
Y = b_1 X_1 + b_2 X_2 + \dots + b_p X_p
\]
• What are the three criteria?
Xuhua Xia Slide 4
What are Principal Components?
• The three criteria:
– There are exactly p principal components (PCs), each being a linear combination of the observed variables;
– The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated);
– The components are extracted in order of decreasing variance.
Xuhua Xia Slide 5
A Simple Data Set
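[Figure: scatter plot of Y against X for the five points]

Obs    X               Y
1      -1.264911064    -1.78885
2      -0.632455532    -0.89443
3       0               0
4       0.632455532     0.894427
5       1.264911064     1.788854
Mean    0.0000          0.0000
Var     1               2

\[
\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
\]

\[
r_{X,Y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{5.6569}{\sqrt{4 \times 8}} = 1
\]

Correlation matrix
     X    Y
X    1    1
Y    1    1

Covariance matrix
     X        Y
X    1        1.414
Y    1.414    2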
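The variances, the covariance, and the correlation on this slide can be reproduced in a few lines. This is a minimal Python sketch (a cross-check, not the SAS route used in these slides); the helper name `cov` is hypothetical, and the full-precision Y values are taken from the SAS program on the later slide.

```python
from math import sqrt

# The five data points (X, Y); both variables already have mean 0.
x = [-1.264911064, -0.632455532, 0.0, 0.632455532, 1.264911064]
y = [-1.788854382, -0.894427191, 0.0, 0.894427191, 1.788854382]


def cov(a, b):
    """Sample covariance with divisor n - 1; cov(a, a) is the variance of a."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


var_x, var_y, cov_xy = cov(x, x), cov(y, y), cov(x, y)
r_xy = cov_xy / sqrt(var_x * var_y)

print(round(var_x, 4), round(var_y, 4))  # variances of X and Y
print(round(cov_xy, 4))                  # off-diagonal entry of the covariance matrix
print(round(r_xy, 4))                    # off-diagonal entry of the correlation matrix
```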
Xuhua Xia Slide 6
General Patterns
• The total variance is 3 (= 1 + 2)
• The two variables, X and Y, are perfectly correlated, with all points falling on the regression line.
• The spatial relationship among the 5 points can therefore be represented by a single dimension.
• PCA is a dimension-reduction technique. What would happen if we applied PCA to the data?
Xuhua Xia Slide 7
Graphic PCA
[Figure: plot of Y against X]
Xuhua Xia Slide 8
SAS Program
data pca;
  input x y;
cards;
-1.264911064 -1.788854382
-0.632455532 -0.894427191
0 0
0.632455532 0.894427191
1.264911064 1.788854382
;
/* PCA on the covariance matrix (COV option); scores are saved in pcscore */
proc princomp cov out=pcscore;
proc print;
  var prin1 prin2;
/* PCA on the correlation matrix (the default when COV is not specified) */
proc princomp data=pca out=pcscore;
proc print;
  var prin1 prin2;
run;
The COV option requests that the PCA be carried out on the covariance matrix rather than the correlation matrix.
Without the COV option, the PCA is carried out on the correlation matrix.
Xuhua Xia Slide 9
A positive definite matrix
• When you run the SAS program, the log file will warn that “The Correlation Matrix is not positive definite.” What does that mean?
• A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if z’Mz > 0 for all non-zero vectors z with real entries, where z’ is the transpose of z.
• Given our correlation matrix with all entries being 1, it is easy to find a z that leads to z’Mz = 0, so the matrix is not positive definite:
\[
\begin{bmatrix} z_1 & z_2 \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}
= (z_1 + z_2)^2 = 0
\qquad \text{Solution: } z_1 = -z_2
\]
Replace the correlation matrix with the covariance matrix and solve for z.
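One way to work this exercise, as a sketch that writes the off-diagonal covariance 1.414 as its exact value \(\sqrt{2}\):

```latex
\[
\begin{bmatrix} z_1 & z_2 \end{bmatrix}
\begin{bmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}
= z_1^2 + 2\sqrt{2}\, z_1 z_2 + 2 z_2^2
= \left( z_1 + \sqrt{2}\, z_2 \right)^2 = 0
\quad\Longrightarrow\quad z_1 = -\sqrt{2}\, z_2
\]
```

For example, z = (\(\sqrt{2}\), -1) gives z’Mz = 0, so the covariance matrix of these data is not positive definite either.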
Xuhua Xia Slide 10
SAS Output
Eigenvalues of the Covariance Matrix (variance accounted for by each principal component)
         Eigenvalue   Difference   Proportion   Cumulative
PRIN1    3.00000      3.00000      1.00000      1.00000
PRIN2    0.00000      .            0.00000      1.00000

Eigenvectors
     PRIN1       PRIN2
X    0.577350    0.816497
Y    0.816497    -.577350

Principal component scores
OBS    PRIN1       PRIN2
1      -2.19089    0
2      -1.09545    0
3       0.00000    0
4       1.09545    0
5       2.19089    0

What’s the variance in PC1? How are the values computed?
PC1 = 0.577350*X1 + 0.816497*X2 (X1 and X2 here are the variables X and Y)
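Both questions can be answered by hand. As a worked sketch using the rounded coefficients from the output, the score of observation 1 (X = -1.264911, Y = -1.788854) and the variance of the five PC1 scores are:

```latex
\[
\begin{aligned}
\mathrm{PC1}_1 &= 0.577350\,(-1.264911) + 0.816497\,(-1.788854)
  \approx -0.730297 - 1.460594 \approx -2.19089 \\
\mathrm{Var}(\mathrm{PC1}) &= \frac{\sum_{i=1}^{5} \mathrm{PC1}_i^2}{n - 1}
  = \frac{2\,(2.19089)^2 + 2\,(1.09545)^2 + 0^2}{4}
  \approx \frac{9.6 + 2.4}{4} = 3
\end{aligned}
\]
```

The variance of PC1 equals the first eigenvalue (3), which is also the total variance 1 + 2 of X and Y.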
Xuhua Xia Slide 11
SAS Output
OBS    PRIN1       PRIN2
1      -2.19089    0
2      -1.09545    0
3       0.00000    0
4       1.09545    0
5       2.19089    0

[Figure: plot of PC2 against PC1]
Xuhua Xia Slide 12
SAS Output
Eigenvalues of the Correlation Matrix (variance accounted for by each principal component)
         Eigenvalue   Difference   Proportion   Cumulative
PRIN1    2.00000      2.00000      1.00000      1.00000
PRIN2    0.00000      .            0.00000      1.00000

Eigenvectors
     PRIN1       PRIN2
X    0.707107    0.707107
Y    0.707107    -0.707107

Principal component scores
OBS    PRIN1       PRIN2
1      -1.78885    0
2      -0.89443    0
3       0.00000    0
4       0.89443    0
5       1.78885    0

What’s the variance in PC1?
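A minimal Python sketch of how the correlation-based scores arise, assuming (as is standard for PCA on a correlation matrix) that each variable is first standardized to mean 0 and variance 1 and then weighted by the eigenvector coefficients 0.707107 = 1/sqrt(2). The helper name `standardize` is hypothetical.

```python
from math import sqrt

x = [-1.264911064, -0.632455532, 0.0, 0.632455532, 1.264911064]
y = [-1.788854382, -0.894427191, 0.0, 0.894427191, 1.788854382]


def standardize(values):
    """Centre to mean 0 and scale to unit sample variance (divisor n - 1)."""
    n = len(values)
    m = sum(values) / n
    sd = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]


zx, zy = standardize(x), standardize(y)
w = 1 / sqrt(2)  # eigenvector coefficient 0.707107

prin1 = [w * a + w * b for a, b in zip(zx, zy)]  # first eigenvector (w, w)
prin2 = [w * a - w * b for a, b in zip(zx, zy)]  # second eigenvector (w, -w)

# The scores have mean 0, so the sample variance is sum of squares / (n - 1).
var_prin1 = sum(s ** 2 for s in prin1) / (len(prin1) - 1)

print([round(s, 5) for s in prin1])
print(round(var_prin1, 5))
```

The variance of PC1 matches the first eigenvalue of the correlation matrix (2), which is the total variance of two standardized variables.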
Xuhua Xia Slide 13
Steps in a PCA
• Have at least two variables
• Generate a correlation or variance-covariance matrix
• Obtain eigenvalues and eigenvectors (this is called an eigenvalue problem, and will be illustrated with a simple numerical example)
• Generate principal component (PC) scores
• Plot the PC scores in the space with reduced dimensions
• All of these steps can be automated by using SAS.
Xuhua Xia Slide 14
Covariance or Correlation Matrix?
[Figure: abundance of Sp1 and Sp2]
Xuhua Xia Slide 15
Covariance or Correlation Matrix?
[Figure: abundance of Sp2 and Sp3]
Xuhua Xia Slide 16
Covariance or Correlation Matrix?
[Figure: Sp1, Sp2, and Sp3, on a scale from 0 to 35]
Xuhua Xia Slide 17
The Eigenvalue Problem
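\[
A = \begin{bmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{bmatrix}
\]
The covariance matrix (\(\sqrt{2} \approx 1.414\)).
\[
|A - \lambda I| =
\begin{vmatrix} 1 - \lambda & \sqrt{2} \\ \sqrt{2} & 2 - \lambda \end{vmatrix}
= (1 - \lambda)(2 - \lambda) - 2 = \lambda^2 - 3\lambda = 0
\]
The eigenvalues are the set of values \(\lambda\) that satisfy this condition.
\[
\lambda_1 = 3, \quad \lambda_2 = 0
\]
The resulting eigenvalues (there are p eigenvalues for p variables). The sum of the eigenvalues is equal to the sum of the variances in the covariance matrix.
Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem).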
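For a 2x2 symmetric matrix the characteristic equation is a quadratic, so the eigenvalues have a closed form. The following is a minimal Python sketch under that assumption; `eigenvalues_sym2` is a hypothetical helper name, and it only handles the 2x2 case.

```python
from math import sqrt


def eigenvalues_sym2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first.

    They are the roots of lambda^2 - (a + d) * lambda + (a * d - b * b) = 0.
    """
    trace = a + d
    det = a * d - b * b
    disc = sqrt(max(trace * trace - 4 * det, 0.0))  # guard against rounding
    return (trace + disc) / 2, (trace - disc) / 2


# Covariance matrix of the simple data set: variances 1 and 2, covariance sqrt(2).
lam1, lam2 = eigenvalues_sym2(1.0, sqrt(2.0), 2.0)

print(round(lam1, 6), round(lam2, 6))
print(round(lam1 + lam2, 6))  # equals the sum of the variances, 1 + 2
```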
Xuhua Xia Slide 18
Get the Eigenvectors
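\[
A = \begin{bmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{bmatrix},
\qquad |A - \lambda I| = \lambda^2 - 3\lambda = 0,
\qquad \lambda_1 = 3, \quad \lambda_2 = 0
\]
• An eigenvector is a vector (x) that satisfies the following condition: A x = λ x
• In our case A is a variance-covariance matrix of order 2, and x is a vector specified by x1 and x2.

For λ = 0:
\[
A x = 0 \cdot x:\quad
\begin{bmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\quad \text{which is equivalent to} \quad
\begin{cases} x_1 + \sqrt{2}\, x_2 = 0 \\ \sqrt{2}\, x_1 + 2 x_2 = 0 \end{cases}
\quad\Longrightarrow\quad x_2 = -\frac{x_1}{\sqrt{2}}
\]

For λ = 3:
\[
A x = 3 x:\quad
\begin{bmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} 3 x_1 \\ 3 x_2 \end{bmatrix},
\quad \text{which is equivalent to} \quad
\begin{cases} x_1 + \sqrt{2}\, x_2 = 3 x_1 \\ \sqrt{2}\, x_1 + 2 x_2 = 3 x_2 \end{cases}
\quad\Longrightarrow\quad x_2 = \sqrt{2}\, x_1
\]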
Xuhua Xia Slide 19
Get the Eigenvectors
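• We want to find an eigenvector of unit length, i.e., \(x_1^2 + x_2^2 = 1\)
• We therefore have (using the relations between x1 and x2 from the previous slide, and solving for x1):

For λ = 0:
\[
x_2 = -\frac{x_1}{\sqrt{2}}, \qquad x_1^2 + \frac{x_1^2}{2} = 1
\quad\Longrightarrow\quad x_1 = 0.8165, \quad x_2 = -0.5774
\]

For λ = 3:
\[
x_2 = \sqrt{2}\, x_1, \qquad x_1^2 + 2 x_1^2 = 1
\quad\Longrightarrow\quad x_1 = 0.5774, \quad x_2 = 0.8165
\]

The first eigenvector is the one associated with the largest eigenvalue.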
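The unit-length eigenvectors can be checked numerically. This minimal Python sketch starts from the two relations derived on the previous slide (x2 = sqrt(2) x1 for λ = 3 and x2 = -x1/sqrt(2) for λ = 0), rescales each vector to length 1, and confirms that A x = λ x. The helpers `unit` and `matvec` are hypothetical names.

```python
from math import hypot, sqrt

A = [[1.0, sqrt(2.0)], [sqrt(2.0), 2.0]]  # covariance matrix of the simple data set


def unit(v):
    """Rescale a 2-vector to unit length."""
    length = hypot(v[0], v[1])
    return [v[0] / length, v[1] / length]


def matvec(m, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


v1 = unit([1.0, sqrt(2.0)])         # lambda = 3: x2 = sqrt(2) * x1
v2 = unit([1.0, -1.0 / sqrt(2.0)])  # lambda = 0: x2 = -x1 / sqrt(2)

print([round(c, 4) for c in v1], [round(c, 4) for c in matvec(A, v1)])  # A v1 = 3 v1
print([round(c, 4) for c in v2], [round(c, 4) for c in matvec(A, v2)])  # A v2 = 0
```

The two vectors are also orthogonal: their dot product is zero, as required of principal components.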
Xuhua Xia Slide 20
Get the PC Scores
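\[
\underbrace{\begin{bmatrix}
-1.264911064 & -1.788854382 \\
-0.632455532 & -0.894427191 \\
0 & 0 \\
0.632455532 & 0.894427191 \\
1.264911064 & 1.788854382
\end{bmatrix}}_{\text{Original data (x and y)}}
\underbrace{\begin{bmatrix}
0.577350 & 0.816497 \\
0.816497 & -0.577350
\end{bmatrix}}_{\text{Eigenvectors}}
=
\begin{bmatrix}
-2.19089 & 0 \\
-1.09545 & 0 \\
0.00000 & 0 \\
1.09545 & 0 \\
2.19089 & 0
\end{bmatrix}
\]
In the product, the first column is the first PC score and the second column is the second PC score.
The original data in a two-dimensional space are reduced to one dimension.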
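The score matrix is simply the (mean-centred) data matrix multiplied by the matrix whose columns are the eigenvectors. A minimal Python sketch, with a hypothetical helper `matmul`; the data here already have mean 0, so no centring step is shown.

```python
# Original data (x, y): already centred, since both means are 0.
data = [
    [-1.264911064, -1.788854382],
    [-0.632455532, -0.894427191],
    [0.0, 0.0],
    [0.632455532, 0.894427191],
    [1.264911064, 1.788854382],
]

# Columns are the eigenvectors for PRIN1 and PRIN2 (covariance-matrix PCA).
eigvecs = [
    [0.577350, 0.816497],
    [0.816497, -0.577350],
]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


scores = matmul(data, eigvecs)
for prin1, prin2 in scores:
    print(f"{prin1:9.5f} {prin2:9.5f}")
```

Because the eigenvector coefficients are rounded to six decimals, the second column comes out as values on the order of 1e-6 rather than exactly 0.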
Xuhua Xia Slide 21
What Are Principal Components?
• Principal components are a new set of variables, which are linear combinations of the observed ones, with these properties:
– Because of the decreasing variance property, much of the variance (information in the original set of p variables) tends to be concentrated in the first few PCs. This implies that we can drop the last few PCs without losing much information. PCA is therefore considered a dimension-reduction technique.
– Because PCs are orthogonal, they can be used instead of the original variables in situations where having orthogonal variables is desirable (e.g., regression).
Xuhua Xia Slide 22
Index of hidden variables
• The ranking of Asian universities by the Asian Week
– HKU is ranked second in financial resources, but seventh in academic research
– How did HKU get ranked third?
– Is there a more objective way of ranking?
• An illustrative example:

School   Math   English   Physics   Chemistry   Chinese
1        60     55        65        64          67
2        70     65        69        71          77
3        80     75        72        85          82
4        90     85        85        88          88
5        100    95        95        95          93
6        …      …         …         …           …
Xuhua Xia Slide 23
A Simple Data Set

School   Math   English
1        60     55
2        70     65
3        80     75
4        90     85
5        100    95
Mean     80.0   75.0
Var      250    250

[Figure: scatter plot of English against Math, both axes running from 50 to 100]

• School 5 is clearly the best school
• School 1 is clearly the worst school
Xuhua Xia Slide 24
Graphic PCA
[Figure: the five points plotted along a single axis at -1.7889, -0.8944, 0, 0.8944, and 1.7889]
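The five values on this slide coincide with the first principal component scores obtained from the correlation matrix of Math and English. The following minimal Python sketch reproduces them under that assumption: both subjects are standardized and then combined with the equal weights 1/sqrt(2) that apply to two perfectly correlated variables. The names `standardize` and `index` are hypothetical.

```python
from math import sqrt

math_scores = [60, 70, 80, 90, 100]
english_scores = [55, 65, 75, 85, 95]


def standardize(values):
    """Centre to mean 0 and scale to unit sample variance (divisor n - 1)."""
    n = len(values)
    m = sum(values) / n
    sd = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]


z_math = standardize(math_scores)
z_english = standardize(english_scores)

# First PC of the correlation matrix [[1, 1], [1, 1]]: equal weights 1/sqrt(2).
index = [(a + b) / sqrt(2) for a, b in zip(z_math, z_english)]

for school, score in enumerate(index, start=1):
    print(school, round(score, 4))
```

A single index orders the schools from worst (school 1) to best (school 5), which is the sense in which a principal component can serve as an index of a hidden variable.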
Xuhua Xia Slide 25
Crime Data in 50 States
STATE         MURDER   RAPE   ROBBERY   ASSAULT   BURGLARY   LARCENY   AUTO
ALABAMA       14.2     25.2   96.8      278.3     1135.5     1881.9    280.7
ALASKA        10.8     51.6   96.8      284.0     1331.7     3369.8    753.3
ARIZONA       9.5      34.2   138.2     312.3     2346.1     4467.4    439.5
ARKANSAS      8.8      27.6   83.2      203.4     972.6      1862.1    183.4
CALIFORNIA    11.5     49.4   287.0     358.0     2139.4     3499.8    663.5
COLORADO      6.3      42.0   170.7     292.9     1935.2     3903.2    477.1
CONNECTICUT   4.2      16.8   129.5     131.8     1346.0     2620.7    593.2
DELAWARE      6.0      24.9   157.0     194.2     1682.6     3678.4    467.0
FLORIDA       10.2     39.6   187.9     449.1     1859.9     3840.5    351.4
GEORGIA       11.7     31.1   140.5     256.5     1351.1     2170.2    297.9
HAWAII        7.2      25.5   128.0     64.1      1911.5     3920.4    489.4
IDAHO         5.5      19.4   39.6      172.5     1050.8     2599.6    237.6
ILLINOIS      9.9      21.8   211.3     209.0     1085.0     2828.5    528.6
.             .        .      .         .         .          .         .
.             .        .      .         .         .          .         .
PROC PRINCOMP OUT=CRIMCOMP;
DATA CRIME;
  TITLE 'CRIME RATES PER 100,000 POP BY STATE';
  /* STATENAME is read from columns 1-15, so the names below are padded to that width */
  INPUT STATENAME $1-15 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
CARDS;
Alabama         14.2 25.2 96.8 278.3 1135.5 1881.9 280.7
Alaska          10.8 51.6 96.8 284.0 1331.7 3369.8 753.3
Arizona         9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
Arkansas        8.8 27.6 83.2 203.4 972.6 1862.1 183.4
California      11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
Colorado        6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
Connecticut     4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
Delaware        6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
Florida         10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
Georgia         11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
Hawaii          7.2 25.5 128.0 64.1 1911.5 3920.4 489.4
Idaho           5.5 19.4 39.6 172.5 1050.8 2599.6 237.6
Illinois        9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
Indiana         7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
Iowa            2.3 10.6 41.2 89.8 812.5 2685.1 219.9
Kansas          6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
Kentucky        10.1 19.1 81.1 123.3 872.2 1662.1 245.4
Louisiana       15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
Maine           2.4 13.5 38.7 170.0 1253.1 2350.7 246.9
Maryland        8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
Massachusetts   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan        9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
Minnesota       2.7 19.5 85.9 85.8 1134.7 2559.3 343.1
Mississippi     14.3 19.6 65.7 189.1 915.6 1239.9 144.4
Missouri        9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
Montana         5.4 16.7 39.2 156.8 804.9 2773.2 309.2
Nebraska        3.9 18.1 64.7 112.7 760.0 2316.1 249.1
Nevada          15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
New Hampshire   3.2 10.7 23.2 76.0 1041.7 2343.9 293.4
New Jersey      5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
New Mexico      8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
New York        10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
North Carolina  10.6 17.0 61.3 318.3 1154.1 2037.8 192.1
North Dakota    0.9 9.0 13.3 43.8 446.1 1843.0 144.7
Ohio            7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
Oklahoma        8.6 29.2 73.8 205.0 1288.2 2228.1 326.8
Oregon          4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
Pennsylvania    5.6 19.0 130.3 128.0 877.5 1624.1 333.2
Rhode Island    3.6 10.5 86.5 201.0 1489.5 2844.1 791.4
South Carolina  11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
South Dakota    2.0 13.5 17.9 155.7 570.5 1704.4 147.5
Tennessee       10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
Texas           13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
Utah            3.5 20.3 68.8 147.3 1171.6 3004.6 334.5
Vermont         1.4 15.9 30.8 101.2 1348.2 2201.0 265.2
Virginia        9.0 23.3 92.1 165.7 986.2 2521.2 226.7
Washington      4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
West Virginia   6.0 13.2 42.2 90.9 597.4 1341.7 163.3
Wisconsin       2.8 12.9 52.2 63.7 846.9 2614.2 220.7
Wyoming         5.4 21.9 39.7 173.9 811.6 2772.2 282.0
;
/* PCA on the correlation matrix (default); scores are saved in crimcomp */
PROC PRINCOMP out=crimcomp;
run;
PROC PRINT;
  ID STATENAME;
  VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
run;
PROC GPLOT;
  PLOT PRIN2*PRIN1=STATENAME;
  TITLE2 'PLOT OF THE FIRST TWO PRINCIPAL COMPONENTS';
run;
/* PCA on the covariance matrix (COV option) */
PROC PRINCOMP data=CRIME COV OUT=crimcomp;
run;
PROC PRINT;
  ID STATENAME;
  VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
run;

/* Add to have a map view */
proc sort data=crimcomp out=crimcomp;
  by STATENAME;
run;
proc sort data=maps.us2 out=mymap;
  by STATENAME;
run;
data both;
  merge mymap crimcomp;
  by STATENAME;
run;
proc gmap data=both;
  id _map_geometry_;
  choro PRIN1 PRIN2/levels=15;
  /* choro PRIN1/discrete; */
run;
Xuhua Xia Slide 28
Correlation Matrix

           MURDER   RAPE     ROBBERY  ASSAULT  BURGLARY LARCENY  AUTO
MURDER     1.0000   0.6012   0.4837   0.6486   0.3858   0.1019   0.0688
RAPE       0.6012   1.0000   0.5919   0.7403   0.7121   0.6140   0.3489
ROBBERY    0.4837   0.5919   1.0000   0.5571   0.6372   0.4467   0.5907
ASSAULT    0.6486   0.7403   0.5571   1.0000   0.6229   0.4044   0.2758
BURGLARY   0.3858   0.7121   0.6372   0.6229   1.0000   0.7921   0.5580
LARCENY    0.1019   0.6140   0.4467   0.4044   0.7921   1.0000   0.4442
AUTO       0.0688   0.3489   0.5907   0.2758   0.5580   0.4442   1.0000

If the variables were not correlated, there would be no point in doing PCA.
The correlation matrix is symmetric, so we only need to inspect either the upper or the lower triangular matrix.
Xuhua Xia Slide 29
Eigenvalues

         Eigenvalue   Difference   Proportion   Cumulative
PRIN1    4.11496      2.87624      0.587851     0.58785
PRIN2    1.23872      0.51291      0.176960     0.76481
PRIN3    0.72582      0.40938      0.103688     0.86850
PRIN4    0.31643      0.05846      0.045205     0.91370
PRIN5    0.25797      0.03593      0.036853     0.95056
PRIN6    0.22204      0.09798      0.031720     0.98228
PRIN7    0.12406      .            0.017722     1.00000
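The Proportion and Cumulative columns follow directly from the eigenvalues: each proportion is an eigenvalue divided by the total, which for a correlation matrix equals the number of variables (7). A minimal Python sketch; the 85% cutoff used at the end is an arbitrary illustrative threshold, not something prescribed by these slides.

```python
from itertools import accumulate

# Eigenvalues of the crime-data correlation matrix, as printed by PROC PRINCOMP.
eigenvalues = [4.11496, 1.23872, 0.72582, 0.31643, 0.25797, 0.22204, 0.12406]

total = sum(eigenvalues)  # about 7, the number of standardized variables
proportions = [ev / total for ev in eigenvalues]
cumulative = list(accumulate(proportions))

for i, (ev, p, c) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"PRIN{i}  {ev:8.5f}  {p:8.6f}  {c:7.5f}")

# Hypothetical cutoff: smallest number of PCs whose cumulative proportion reaches 85%.
threshold = 0.85
n_keep = next(i for i, c in enumerate(cumulative, start=1) if c >= threshold)
print(n_keep)
```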
Xuhua Xia Slide 30
Eigenvectors
           PRIN1    PRIN2    PRIN3    PRIN4    PRIN5    PRIN6    PRIN7
MURDER     0.3002   -.6291   0.1782   -.2321   0.5381   0.2591   0.2675
RAPE       0.4317   -.1694   -.2441   0.0622   0.1884   -.7732   -.2964
ROBBERY    0.3968   0.0422   0.4958   -.5579   -.5199   -.1143   -.0039
ASSAULT    0.3966   -.3435   -.0695   0.6298   -.5066   0.1723   0.1917
BURGLARY   0.4401   0.2033   -.2098   -.0575   0.1010   0.5359   -.6481
LARCENY    0.3573   0.4023   -.5392   -.2348   0.0300   0.0394   0.6016
AUTO       0.2951   0.5024   0.5683   0.4192   0.3697   -.0572   0.1470

• Do these eigenvectors mean anything?
– All crimes have positive loadings on the first eigenvector, which is therefore interpreted as a measure of overall crime rate.
– The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is interpreted as measuring the preponderance of property crime over violent crime …
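The first eigenvalue and eigenvector can be recovered from the printed correlation matrix without SAS. The sketch below uses power iteration, a technique swapped in here for illustration (it is not claimed to be what PROC PRINCOMP does internally); `power_iteration` is a hypothetical helper, and because the matrix entries are rounded to four decimals the results agree with the SAS output only approximately.

```python
from math import sqrt

names = ["MURDER", "RAPE", "ROBBERY", "ASSAULT", "BURGLARY", "LARCENY", "AUTO"]
R = [
    [1.0000, 0.6012, 0.4837, 0.6486, 0.3858, 0.1019, 0.0688],
    [0.6012, 1.0000, 0.5919, 0.7403, 0.7121, 0.6140, 0.3489],
    [0.4837, 0.5919, 1.0000, 0.5571, 0.6372, 0.4467, 0.5907],
    [0.6486, 0.7403, 0.5571, 1.0000, 0.6229, 0.4044, 0.2758],
    [0.3858, 0.7121, 0.6372, 0.6229, 1.0000, 0.7921, 0.5580],
    [0.1019, 0.6140, 0.4467, 0.4044, 0.7921, 1.0000, 0.4442],
    [0.0688, 0.3489, 0.5907, 0.2758, 0.5580, 0.4442, 1.0000],
]


def power_iteration(m, iterations=200):
    """Approximate the largest eigenvalue of m and its unit-length eigenvector."""
    v = [1.0] * len(m)
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in m]
        norm = sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    mv = [sum(a * b for a, b in zip(row, v)) for row in m]
    eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return eigenvalue, v


lam, vec = power_iteration(R)
print(round(lam, 4))
for name, loading in zip(names, vec):
    print(f"{name:9s} {loading:.4f}")
```

All seven loadings come out positive, which is the basis for reading the first component as an overall crime rate.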
Xuhua Xia Slide 31
PC Plot: Crime Data
[Figure: scatter plot of PC2 against PC1 for the 50 states, each point labelled with the first two letters of the state name. Annotated groups: North and South Dakota; Nevada, New York, California; Mississippi, Alabama, Louisiana, South Carolina; Maryland.]
[Figure: Plot of PC1. Map of Prin1 by state, 15 levels. Legend ranges:
-3.9640776 to -3.1477220
-2.5815619 to -2.4656229
-2.1507074 to -1.7269086
-1.7200694 to -1.5543424
-1.5073580 to -1.4246347
-1.0544104 to -0.6992517
-0.6340669 to -0.4998955
-0.3213630 to -0.1365951
-0.0498802 to 0.4904076
0.5129025 to 0.8231313
0.9305796 to 0.9784390
1.1202026 to 1.4490021
1.6033606 to 2.2733344
2.4215150 to 3.0141383
3.1117540 to 5.2669853]

[Figure: Plot of PC2. Map of Prin2 by state, 15 levels. Legend ranges:
-2.54671E+00 to -2.09610E+00
-2.08327E+00 to -1.38079E+00
-1.34544E+00 to -9.50756E-01
-8.14251E-01 to -6.81314E-01
-6.24288E-01 to -5.58511E-01
-2.54464E-01 to -1.94742E-01
-2.80416E-02 to 2.60334E-05
6.26829E-02 to 9.42305E-02
1.43187E-01 to 2.25739E-01
2.70992E-01 to 4.32893E-01
5.78785E-01 to 7.37764E-01
7.80831E-01 to 8.44945E-01
9.16596E-01 to 9.44967E-01
9.64209E-01 to 1.29674E+00
1.50123E+00 to 2.63105E+00]