Multivariate Analysis -Overviefuchs/Introduction And Summary.pdf · •z4 =Diastolic blood pressure...
Transcript of Multivariate Analysis -Overviefuchs/Introduction And Summary.pdf · •z4 =Diastolic blood pressure...
Multivariate Analysis - Overview
•In general: - Analysis of Multivariate data, i.e. each observation has two or more variables as predictor variables,
•Analyses
–Analysis of Treatment Means (Single (multivariate) sample, two-samples,etc.)
– Study interrelationships – correlations and predictions (regression)
– Other specific methods (discriminant analysis, principal components, clustering)
• Limitations:
–Many parameters estimated – large sample sizes
– Whenever testing hypotheses, assumption of normality (almost overall)
•We’ll focus on the multivariate methods and applications with somewhat limited mathematical emphasis (without proofs)
•Textbook: Applied Multivariate Analysis - Fifth Edition, Richard Johnson and Dean Wichern
Sweat data – One sample testing means Table 5.1
•Perspiration from 20 healthy women analyzed
•Three components:
•X1 =sweat rate
•X2 =sodium content
•X3 =potasium content
•Null hypothesis: simultaneousely
�µ1 = 4
�µ2 = 50
�µ3 = 10
or
[ 4 ]
µ =[ 50]
[10]
or
�µ’ = [4, 50,10]
Sweat data – One sample testing means Table 5.1Sweat rate Sodium Potasium
3.7 48.5 9.3
5.7 65.1 8
3.8 47.2 10.9
3.2 53.2 12
3.1 55.5 9.7
4.6 36.1 7.9
2.4 24.8 14
7.2 33.1 7.6
6.7 47.4 8.5
5.4 54.1 11.3
3.9 36.9 12.7
4.5 58.8 12.3
3.5 27.8 9.8
4.5 40.2 8.4
1.5 13.5 10.1
8.5 56.4 7.1
4.5 71.6 8.2
6.5 52.8 10.9
4.1 44.1 11.2
5.5 40.9 9.4
Sweat data – One sample testing means Table 5.1
µ0’ = [4, 50,10]One-Sample T: Sweatrate, Sodium, Potasium
Variable N Mean StDev SE Mean 90% CI
Sweatrate 20 4.64000 1.69687 0.37943 (3.98391, 5.29609)
Sodium 20 45.4000 14.1347 3.1606 ( 39.9349, 50.8651)
Potasium 20 9.96500 1.90464 0.42589 (9.22858, 10.70142)
One-Sample T: Sweatrate, Sodium, Potasium
Variable N Mean StDev SE Mean 95% CI
Sweatrate 20 4.64000 1.69687 0.37943 (3.84584, 5.43416)
Sodium 20 45.4000 14.1347 3.1606 (38.7848, 52.0152)
Potasium 20 9.96500 1.90464 0.42589 (9.07360, 10.85640)
Sweat data – One sample testing means Table 5.1
µ0’ = [4, 50,10]
0.401847 -0.001580.257969Potasium
=20*.487=9.74(Xbar-Miu0)’S(-1)n*(Xbar-Miu0)’
F(3,17,.9)=3.20F(3,17,.9)=2.44=2.905F=(n-p)/[(n-1)*p]*T2
-0.001580.006067-0.02209Sodium
0.257969-0.022090.586155Sweat rate
PotasiumSodiumSweat rateS-matrix(-1)
3.627658-5.64-1.80905Potasium
-5.64199.788410.01Sodium
-1.8090510.012.879368Sweat rate
PotasiumSodiumSweat rateS-matrix
Xbar-Miu0-0.035-4.60.64
Miu010504
Means9.96545.44.64
Turtle Carapaces- Two Samples – Testing means –Table 6.7
•Jolicoeur and Mosimann studied relationship of size and shape for painted turtled. Measures on the caprapaces of 24 male and 24 female turtles
•Three components:
•X1 =Length
•X2 =Width
•X3 =Height
•Null hypothesis: µ1 = µ2 (vectors of size 3)
Turtle Carapaces- Two Samples – Testing means –Table 6.7
Length Width Height Gender Length Width Height Gender
98 81 38 female 93 74 37 male
103 84 38 female 94 78 35 male
103 86 42 female 96 80 35 male
105 86 42 female 101 84 39 male
109 88 44 female 102 85 38 male
123 92 50 female 103 81 37 male
123 95 46 female 104 83 39 male
133 99 51 female 106 83 39 male
133 102 51 female 107 82 38 male
133 102 51 female 112 89 40 male
134 100 48 female 113 88 40 male
136 102 49 female 114 86 40 male
138 98 51 female 116 90 43 male
138 99 51 female 117 90 41 male
141 105 53 female 117 91 41 male
147 108 57 female 119 93 41 male
149 107 55 female 120 89 40 male
153 107 56 female 120 93 44 male
155 115 63 female 121 95 42 male
155 117 60 female 125 93 45 male
158 115 62 female 127 96 45 male
159 118 63 female 128 95 45 male
Turtle Carapaces- Two Samples – Testing means –Table 6.7
ANOVA: Length, Width, Height versus Gender
Gender fixed 2 female, male
Analysis of Variance for Length
Source DF SS MS F P
Gender 1 5474.8 5474.8 17.19 0.000
Error 47 14965.9 318.4
Total 48 20440.7
S = 17.8444 R-Sq = 26.78% R-Sq(adj) = 25.23%
Analysis of Variance for Width
Source DF SS MS F P
Gender 1 2208.0 2208.0 18.71 0.000
Error 47 5548.0 118.0
Total 48 7756.0
S = 10.8647 R-Sq = 28.47% R-Sq(adj) = 26.95%
Analysis of Variance for Height
Source DF SS MS F P
Gender 1 1420.8 1420.8 34.47 0.000
Error 47 1937.2 41.2
Total 48 3358.0
S = 6.42005 R-Sq = 42.31% R-Sq(adj) = 41.08%
Turtle Carapaces- Two Samples – Testing means –Table 6.7
Means
Gender N Length Width Height
female 25 134.52 101.72 51.480
male 25 113.38 88.29 40.708
SSCP Matrix for Gender (B) MANOVA for Gender
Length Width Height s = 1 m = 0.5 n = 21.5
Length 5475 3477 2789 Test DF
Width 3477 2208 1771 Criterion Statistic F Num Denom P
Height 2789 1771 1421 Wilks' 0.410 21.570 3 44 0.000
Determinant =-426 Lawley-Hotelling 1.43798 21.570 3 44 0.000
SSCP Matrix for Error (W) Pillai's 0.58983 21.570 3 44 0.000
Length Width Height
Length 14966 8841 5189
Width 8841 5548 3131
Height 5189 3131 1937
Determinant = 6.067E+08
SSCP Matrix for Error (B+W)
Length Width Height
Length 20441 12318 7978
Width 12318 7756 4902
Height 7978 4902 3358
Determinant = 1.481E+09 Determinant (W)/Determinant (B+W)=.6067/1.481=.410
Notations for MANOVA-
Exact F-distributions for Wilk’s Lambda
~ )( )( : 2 )
~ )( )( : 2 )
~ )( )( : 1)
~ )( )( : 1)
)1,min(way}-One {),min(
way)-Onein ( E of
way)-Onein 1-( (H) Hypothesis of
MANOVAin Denote
)1(2,2
11
)1(2,211
1,
11
,1
pvpp
pv
vqqv
pvpp
pv
vqqv
FFqd
FFpc
FF qb
FF pa
gpInqps
N-gdfv
gdfq
−+
−+
Λ
Λ−
−
−
Λ
Λ−
−+
−+
Λ
Λ−
Λ
Λ−
==
==
==
==
−==
=
=
Turtle Carapaces- Two Samples – Testing means –Table 6.7
Hotelling T2Means
Gender N Length Width Height
female 25 134.52 101.72 51.480
male 25 113.38 88.29 40.708
Diff. 22.67 14.29 11.33
SSCP Matrix for Error (W)
Length Width Height
Length 14966 8841 5189
Width 8841 5548 3131
Height 5189 3131 1937
Spooled = SSCP Matrix for Error (W)/(n1+n2-2)
Hotelling T2 =(xbar1-xbar2)’ [Spooled (1/n1+1/n2)]-1(xbar1-xbar2)=65.66
F= Hotelling T2*(n1+n2-p-1)/[(n1+n2-2)p]= 65.66*[44/(3*46)]=65.66*0.319=20.94
df=p,n1+n2-p-1=3,44
Wilks' 0.410 21.570 3 44 0.000
11.333 14.292 22.667
Amitriptyline Data –Multivariate Regression Analysis - Table7.6
•Amitriptyline – drug for depression. Several side effects: irregular heartbeat, abnormal BP, etc. Data on 17 patients admited after amitriptyline overdose Two dependent variables and 5 predictor variables:
•Y1 =Total TCAD plasma level (TOT)
•Y2 =Amount of amitriptyline in TCAD plasma level (AMI)
•z1 =Gender 1=female, 0=male
•z2 =Amount of amitriptyline taken at time of overdose(AMT)
•z3 =PR wave measurements (PR)
•z4 =Diastolic blood pressure (DIAP)
•z5 =QRS wave measurements (QRS)
•Analysis: Model to predict Y1 and Y2 from the predictor variables
•Multivariate Linear Regression Models
Amitriptyline Data –Multivariate Regression Analysis - Table7.6
Y1-Tot -
TCAD Y2-Ami
z1-
Gender
z2-Amt
Anti-
depress z3-PR
z4-Diap
(Diastolic
BP) z5-QRS
3389 3149 1 7500 220 0 140
1101 653 1 1975 200 0 100
1131 810 0 3600 205 60 111
596 448 1 675 160 60 120
896 844 1 750 185 70 83
1767 1450 1 2500 180 60 80
807 493 1 350 154 80 98
1111 941 0 1500 200 70 93
645 547 1 375 137 60 105
628 392 1 1050 167 60 74
1360 1283 1 3000 180 60 80
652 458 1 450 160 64 60
860 722 1 1750 135 90 79
500 384 0 2000 160 60 80
781 501 0 4500 180 0 100
1070 405 0 1500 170 90 120
1754 1520 1 3000 180 0 129
Amitriptyline Data –Multivariate Regression Analysis - Table7.6
•Multivariate Linear Regression ModelsRegression Analysis: Y1 versus Z1, Z2, Z3, Z4, Z5
The regression equation is
Y1 = - 2879 + 676 Z1 + 0.285 Z2 + 10.3 Z3 + 7.25 Z4 + 7.60 Z5
Predictor Coef SE Coef T P
Constant -2879.5 893.3 -3.22 0.008
Z1 675.7 162.1 4.17 0.002
Z2 0.28485 0.06091 4.68 0.001
Z3 10.272 4.255 2.41 0.034
Z4 7.251 3.225 2.25 0.046
Z5 7.598 3.849 1.97 0.074
S = 281.232 R-Sq = 88.7% R-Sq(adj) = 83.6%
Analysis of Variance
Source DF SS MS F P
Regression 5 6835932 1367186 17.29 0.000
Residual Error 11 870008 79092
Total 16 7705940
Amitriptyline Data –Multivariate Regression Analysis - Table7.6
•Multivariate Linear Regression ModelsRegression Analysis: Y2 versus Z1, Z2, Z3, Z4, Z5
The regression equation is
Y2 = - 2729 + 763 Z1 + 0.306 Z2 + 8.90 Z3 + 7.21 Z4 + 4.99 Z5
Predictor Coef SE Coef T P
Constant -2728.7 928.8 -2.94 0.014
Z1 763.0 168.5 4.53 0.001
Z2 0.30637 0.06334 4.84 0.001
Z3 8.896 4.424 2.01 0.070
Z4 7.206 3.354 2.15 0.055
Z5 4.987 4.002 1.25 0.239
Analysis of Variance
Source DF SS MS F P
Regression 5 6669669 1333934 15.60 0.000
Residual Error 11 940709 85519
Total 16 7610378
Radiotheraphy-Principal Components Table1.5
•Data of the 98 average ratings over the course of the treatment for 98 patients undergoing radiotherapy. Six components:
•X1 =number of symptoms (as nausea, sore throat)
•X2 =amount of activity (on a 1-5 scale)
•X3 =amount of sleep (on a 1-5 scale)
•X4 =amount of food consumed (on a 1-3 scale)
•X5 =appetite (on a 1-5 scale)
•X6 =skin reaction (on a 0-3 scale)
•Analysis: Finding a single (or several) measures – linear combinations of the 6 components, to represent patients’response to therapy.
•Principal components
Radiotheraphy-Principal Components Table1.5
Symptoms Activity Sleep Eat Appetite
Skin
Reaction
0.889 1.389 1.555 2.222 1.945 1
2.813 1.437 0.999 2.312 2.312 2
1.454 1.091 2.364 2.455 2.909 3
0.294 0.941 1.059 2 1 1
2.727 2.545 2.819 2.727 4.091 0
3.937 1.25 1.937 2.937 3.749 1
2.786 1.714 2.357 2.071 2 2
5.231 2.692 1.077 1.846 2.539 1
1.15 1.1 0.95 2 1 1
6.5 2.562 1.749 2.562 2.499 1
0.8 1 2.2 2.267 2.466 2
4.6 2 3 2.5 3.4 1
3.5 1.286 2.714 1.286 1.252 3
3.444 2.556 2.388 2.389 3 1
4.071 1 1 2.357 1.572 1
3.692 1 2.538 2.154 2.615 1
5.167 3 1 2.667 3.666 0
0.5 1 1 2 1 0
2.385 1.923 2.539 2.154 2.461 1
2.1 1.3 1.3 1.8 2.6 1
5 3.25 3.125 2.375 3.375 0
4.571 1.214 3.286 2.571 3.572 1
2.733 1.133 2.6 1.933 1.667 1
4.235 2.294 2.706 2.176 1.883 1
0 1 1.941 2 2 0
Radiotheraphy-Principal Components Table1.5
•X1 =number of symptoms (as nausea, sore throat) X2 =amount of activity (on a 1-5 scale)
•X3 =amount of sleep (on a 1-5 scale) X4 =amount of food consumed (on a 1-3 scale)
•X5 =appetite (on a 1-5 scale) X6 =skin reaction (on a 0-3 scale)
•Principal components and Factor Analysis•Principal Component Analysis: Symptoms, Activity, Sleep, Eat, Appetite, SkinRea
Eigenanalysis of the Correlation Matrix
Eigenvalue 2.8643 1.0764 0.7776 0.6503 0.3880 0.2433
Proportion 0.477 0.179 0.130 0.108 0.065 0.041
Cumulative 0.477 0.657 0.786 0.895 0.959 1.000
Variable PC1 PC2 PC3 PC4 PC5 PC6
Symptoms 0.445 -0.027 0.339 0.551 0.601 0.146
Activity 0.429 -0.292 0.499 0.061 -0.687 0.076
Sleep 0.359 0.380 -0.628 0.421 -0.332 0.212
Eat 0.463 -0.021 -0.125 -0.666 0.207 0.533
Appetite 0.521 -0.074 -0.203 -0.201 0.103 -0.794
SkinReaction 0.056 0.874 0.430 -0.179 -0.053 -0.116
Radiotheraphy- Factor Analysis Table1.5•Principal Component Factor Analysis of the Correlation Matrix
Unrotated Factor Loadings and Communalities
Variable Factor1 Factor2 Factor3 Factor4 Communality
Symptoms 0.753 -0.028 0.299 0.444 0.855
Activity 0.727 -0.303 0.440 0.049 0.815
Sleep 0.607 0.394 -0.554 0.340 0.946
Eat 0.783 -0.022 -0.110 -0.537 0.914
Appetite 0.882 -0.076 -0.179 -0.162 0.842
SkinReaction 0.095 0.907 0.379 -0.144 0.996
Variance 2.864 1.076 0.777 0.650 5.368
% Var 0.477 0.179 0.130 0.108 0.895
Rotated Factor Loadings and Communalities - Varimax Rotation
Variable Factor1 Factor2 Factor3 Factor4 Communality
Symptoms 0.125 -0.858 0.314 -0.065 0.855
Activity 0.392 -0.806 -0.091 0.052 0.815
Sleep 0.220 -0.121 0.936 -0.086 0.946
Eat 0.927 -0.188 0.121 -0.066 0.914
Appetite 0.739 -0.393 0.369 0.072 0.842
SkinReaction 0.015 -0.008 0.071 -0.995 0.996
Variance 1.625 1.590 1.138 1.014 5.368
% Var 0.271 0.265 0.190 0.169 0.895
Hemophilia Data- Discriminant Analysis- Table 11.8
•To construct a procedure for detecting potential hemophilia A carriers, blood samples assayed and measurements made on two variables:
•X1 =log10(AHF activity) where AHF=antihemophilic factor
•X2 = log10(AHF-like antigen)
•Measurements taken on two groups of women:– A group of n1=24 women who do not carry the hemophilic gene –Normal group– A group n2=22 women from known hemophilia A carriers (daughters of hemophiliacs, mothers with more than one hemophiliac son, mothers with one hemophiliac son and other hemophilic relatives) –Obligatory carriers
–New cases to be classified–Classification and Discrimination – Discriminant Analysis
Hemophilia Data- Discriminant Analysis- Table 11.8
Noncarriers Obligatory Carriers New cases Requiring Classification
Group
log(AHF
activity)
log(AHF
antigen) Group
log(AHF
activity)
log(AHF
antigen) Group
log(AHF
activity)
log(AHF
antigen)
1 -0.0056 -0.1657 2 -0.3478 0.1151 3 -0.112 -0.279
1 -0.1698 -0.1585 2 -0.3618 -0.2008 3 -0.059 -0.068
1 -0.3469 -0.1879 2 -0.4986 -0.086 3 0.064 0.012
1 -0.0894 0.0064 2 -0.5015 -0.2984 3 -0.043 -0.052
1 -0.1679 0.0713 2 -0.1326 0.0097 3 -0.05 -0.098
1 -0.0836 0.0106 2 -0.6911 -0.339 3 -0.094 -0.113
1 -0.1979 -0.0005 2 -0.3608 0.1237 3 -0.123 -0.143
1 -0.0762 0.0392 2 -0.4535 -0.1682 3 -0.011 -0.037
1 -0.1913 -0.2123 2 -0.3479 -0.1721 3 -0.21 -0.09
1 -0.1092 -0.119 2 -0.3539 0.0722 3 -0.126 -0.019
1 -0.5268 -0.4773 2 -0.4719 -0.1079
1 -0.0842 0.0248 2 -0.361 -0.0399
1 -0.0225 -0.058 2 -0.3226 0.167
1 0.0084 0.0782 2 -0.4319 -0.0687
1 -0.1827 -0.1138 2 -0.2734 -0.002
1 0.1237 0.214 2 -0.5573 0.0548
1 -0.4702 -0.3099 2 -0.3755 -0.1865
1 -0.1519 -0.0686 2 -0.495 -0.0153
1 0.0006 -0.1153 2 -0.5107 -0.2483
1 -0.2015 -0.0498 2 -0.1652 0.2132
1 -0.1932 -0.2293 2 -0.2447 -0.0407
1 0.1507 0.0933 2 -0.4232 -0.0998
Hemophilia Data- Discriminant Analysis- Table 11.8Discriminant Analysis: GroupDisc versus logActivity, logAntigen
Linear Method for Response: GroupDisc
Predictors: logActivity, logAntigen
Group 1 2
Count 30 45
Summary of classification
True Group
Put into Group 1 2
1 27 8
2 3 37
Total N 30 45
N correct 27 37
Proportion 0.900 0.822
N = 75 N Correct = 64 Proportion Correct = 0.853
Squared Distance Between Groups
1 2
1 0.00000 4.57431
2 4.57431 0.00000
Hemophilia Data- Discriminant Analysis- Table 11.8Discriminant Analysis: GroupDisc versus logActivity, logAntigen
Linear Discriminant Function for Groups
1 2
Constant -0.411 -3.970
logActivity -6.824 -26.143
logAntigen 1.270 18.394
Summary of Misclassified Observations
True Pred Squared
Observation Group Group Group Distance Probability
5** 1 2 1 2.7065 0.288
2 0.8962 0.712
32** 2 1 1 2.366 0.502
2 2.383 0.498
Prediction for Test Observations
Squared
Observation Pred Group From Group Distance Probability
1 1 1 4.260 0.998
2 16.607 0.002
Hemophilia Data- Discriminant Analysis- Table 11.8Discriminant Analysis: -Minitab
Distance and discriminant functions
Squared distance: The squared distance (also called the Mahalanobisdistance) of observation x to the center (mean) of group i is given by the general form:
di2 (x) = (x - mi)' Sp
-1 (x - mi) where: x = p-column vector with the values of this observation
mi = column vector of length p containing the means of the predictors calculated from the data in group i;
Sp = pooled covariance matrix, used in linear discriminant analysis;
The linear discriminant function =mi' Sp-1x - 0.5mi'Sp
-1mi + ln p where:
x = column vector of length p containing the values of the predictors for this observation (note, this column vector is stored as one row)
mi = column vector of length p containing the means of the predictors calculated from the data in group i
Sp = pooled covariance matrix
ln p = natural log of the prior probability
For a given x, the group with the smallest squared distance has the largest linear discriminant function.
Hemophilia Data- Discriminant Analysis- Table 11.8Discriminant Analysis: -Minitab
Distance and discriminant functions
Posterior probability- The posterior probability for group i given the data and is calculated by:
pi fi(x)/Σpi fi(x)
where:
pi = prior probability of group i
fi (x) = the joint density for the data in group i (with the population parameters replaced by sample estimates)
The largest posterior probability is equivalent to the largest value of ln [pi fi(x)], where (under normality):
ln [pi fi(x)] = -0.5 [di2(x) - 2 lnpi] - constant value
where:
di2 (x) = -2 [mi' Sp
-1x - 0.5mi'Sp-1mi + lnpi] + x' Sp
-1x
Bone and Skull - White leghorn fowls –
Canonical Correlation- Ex10.4
•To assess correlation between two sets of variables
•Head Measurements (X(1) )–X(1)1 =Skull length
–X(1)2 =Skull breadth
•Leg Measurements (X(1) )–X(2)1 =Femur length
–X(2)2 =Tibia length
–Create new variables which represent the most of the correlations between the two sets of variables –
–Cannonical Correlation
Bone and Skull - White leghorn fowls –
Canonical Correlation- Ex10.4
Skull
Lenghts
Skull
Breadth
Skull
Breadth
Tibia
Length
Skull
Lenghts 1 0.505 0.569 0.602Skull
Breadth 0.505 1 0.422 0.467
Skull
Breadth 0.569 0.422 1 0.926Tibia
Length 0.602 0.467 0.926 1
Universities – Cluster Analysis-Table 12-9
•Data on certain universities for certain variables used to compare or rank major universities. The variables:
•X1 =Average SAT for new freshmen
•X2 = Percent new freshmen in top 10% of high school class
•X3 =Percent of applicants accepted
•X4 =Student faculty ratio
•X5 =Estimated annual expenses
•X6 =Graduation rate (%)
•Analysis: Clustering observations (universities) based on linear combinations of variables (or specific variables). Definition of distances
•Cluster Analysis, Distance Methods and Ordination
Universities – Cluster Analysis-Table 12-9# SAT Top10 Accept SFRatio Expenses
1 Harvard 14.00 91 14 11 39.525
2 Princeton 13.75 91 14 8 30.220
3 Yale 13.75 95 19 11 43.514
4 Stanford 13.60 90 20 12 36.450
5 MIT 13.80 94 30 10 34.870
6 Duke 13.15 90 30 12 31.585
7 CalTech 14.15 100 25 6 63.575
8 Dartmouth 13.40 89 23 10 32.162
9 Brown 13.10 89 22 13 22.704
10 JohnsHopkins 13.05 75 44 7 58.691
11 Uchicago 12.90 75 50 13 38.380
12 UPenn 12.85 80 36 11 27.553
13 Cornell 12.80 83 33 13 21.860
14 Northwestern 12.60 85 39 11 28.052
15 Columbia 13.10 76 24 12 31.510
16 NotreDame 12.55 81 42 13 15.122
17 UVir 12.25 77 44 14 13.349
18 Georgetown 12.55 74 24 12 20.126
19 CarnegieMellon 12.60 62 59 9 25.026
20 Umichigan 11.80 65 68 16 15.470
21 UCBerkeley 12.40 95 40 17 15.140
22 Uwisconsin 10.85 40 69 15 11.857
23 PennState 10.81 38 54 18 10.180
24 Purdue 10.05 28 90 19 9.066
25 TexasA&M 10.75 49 67 25 8.704
Universities – Cluster Analysis-Table 12-9Cluster Analysis of Observations: SAT, Top10, Accept, SFRatio, Expenses, Grad
Euclidean Distance, Single Linkage
Amalgamation Steps
Number
Number of obs.
of Similarity Distance Clusters New in new
Step clusters level level joined cluster cluster
1 24 95.2869 5.3135 16 17 16 2 Notre Dame UVirginia
2 23 94.7291 5.9423 12 14 12 2 UPenn Northwestern
3 22 94.6465 6.0355 4 8 4 2 Stanford Dartmouth
4 21 93.9052 6.8712 5 6 5 2 MIT Duke
5 20 93.4580 7.3753 4 5 4 4 (Stanford, Dartmouth) (MIT, Duke)
6 19 93.4570 7.3765 12 13 12 3
7 18 93.2462 7.6141 1 3 1 2
8 17 92.9253 7.9759 1 4 1 6
9 16 91.4509 9.6381 1 2 1 7
10 15 91.1059 10.0272 1 9 1 8
11 14 89.2653 12.1022 12 16 12 5
12 13 89.1401 12.2433 15 18 15 2
13 12 88.9290 12.4814 1 12 1 13
14 11 88.4325 13.0411 1 15 1 15
15 10 87.1170 14.5242 22 25 22 2
16 9 84.0878 17.9392 22 23 22 3
17 8 83.7468 18.3237 1 11 1 16
18 7 82.2972 19.9579 1 21 1 17
19 6 82.2608 19.9989 19 20 19 2
20 5 80.4746 22.0127 1 10 1 18
21 4 78.0311 24.7675 22 24 22 4
22 3 77.0504 25.8731 1 19 1 20
23 2 76.3836 26.6248 1 22 1 24
24 1 76.3051 26.7134 1 7 1 25
Universities – Cluster Analysis-Table 12-9
Observations
Similarity
CalTech
Purdue
PennState
TexasA&M
Uwisconsin
Umichigan
CarnegieMellon
JohnsHopkins
UCBerkeley
Uchicago
Georgetown
Columbia
Uvir
NotreDame
Cornell
Northwestern
Upenn
Brown
Princeton
Duke
MIT
Dartmouth
Stanford
Yale
Harvard
76.31
84.20
92.10
100.00
Dendrogram with Single Linkage and Euclidean Distance
Universities – Cluster Analysis-Table 12-9
Cluster Analysis of Variables: SAT, Top10, Accept, SFRatio, Expenses, Grad
Correlation Coefficient Distance, Single Linkage- Amalgamation Steps
Number
Number of obs.
of Similarity Distance Clusters New in new
Step clusters level level joined cluster cluster
1 5 96.1261 0.07748 1 2 1 2
2 4 88.9491 0.22102 1 5 1 3
3 3 87.3856 0.25229 1 6 1 4
4 2 81.5832 0.36834 3 4 3 2
5 1 22.0783 1.55843 1 3 1 6
Correlations: SAT, Top10, Accept, SFRatio, Expenses, Grad
SAT Top10 Accept SFRatio Expenses
Top10 0.923
Accept -0.886 -0.859
SFRatio -0.813 -0.643 0.632
Expenses 0.779 0.611 -0.558 -0.782
Grad 0.748 0.746 -0.820 -0.561 0.394
V a r ia b le s
Similarity
S F R a tioA c c e p tG r a dE xp e n s e sT o p 1 0S A T
2 2 .0 8
4 8 .0 5
7 4 .0 3
1 0 0 .0 0
D e nd r o g r am w i th S i n g l e L in k a g e a n d C o r r e l a t i o n C o e f f i c i e n t D i s ta n c e
Universities – Cluster Analysis-Table 12-9Cluster Analysis of Observations: SAT, Top10, Accept, SFRatio, Expenses, Grad
Euclidean Distance, Single Linkage
Amalgamation Steps
K-means Cluster Analysis: SAT, Top10, Accept, SFRatio, Expenses, Grad
Number of clusters: 2 - Cluster Centroids
Variable Cluster1 Cluster2 centroid
SAT 13.1447 11.1433 12.6644
Top10 85.7895 47.0000 76.4800
Accept 30.1579 67.8333 39.2000
SFRatio 11.3684 17.0000 12.7200
Expenses 31.8099 13.3838 27.3876
Grad 90.7368 74.0000 86.7200
Number of clusters: 3 - Cluster Centroids
Grand
Variable Cluster1 Cluster2 Cluster3 centroid
SAT 13.1031 11.1433 13.3667 12.6644
Top10 86.2500 47.0000 83.3333 76.4800
Accept 28.3750 67.8333 39.6667 39.2000
SFRatio 11.8750 17.0000 8.6667 12.7200
Expenses 27.7339 13.3838 53.5487 27.3876
Grad 91.8125 74.0000 85.0000 86.7200