Copyright (c) Bani K. Mallick
STAT 651
Lecture #14
Topics in Lecture #14
The Kruskal-Wallis Test
Review of ANOVA
Theory for the ANOVA Table
Book Sections Covered in Lecture #14
Chapter 8.6
Relevant SPSS Tutorials
Kruskal-Wallis
What you need for the ANOVA Table
Error sum of squares (SSE)
Corrected Total sum of squares (CTSS)
Between sum of squares: SSB = CTSS – SSE
Number of populations t: df1 = t – 1
Total number of observations nT: df2 = nT – t
Critical value from Table 8: F for α, df1, and df2
The ANOVA Table: General Linear Model
Source            Sum of Squares   Degrees of freedom   Mean Square     F for equal means
Variable Name     SSB              df1 = t – 1          SSB/(t – 1)     [SSB/(t – 1)] / [SSE/(nT – t)]
Error             SSE              df2 = nT – t         SSE/(nT – t)
Corrected Total   CTSS             nT – 1
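The table entries can be computed directly from raw data. The following is a minimal sketch using only the Python standard library; the function name `anova_table` and the three small groups are hypothetical and are not taken from the lecture.

```python
from statistics import mean

def anova_table(groups):
    """One-way ANOVA quantities for a list of groups (each a list of numbers)."""
    pooled = [x for g in groups for x in g]
    t, n_total = len(groups), len(pooled)
    grand_mean = mean(pooled)
    # Error sum of squares: squared deviations from each group's own mean
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Corrected total sum of squares: squared deviations from the grand mean
    ctss = sum((x - grand_mean) ** 2 for x in pooled)
    ssb = ctss - sse                      # SSB = CTSS - SSE
    df1, df2 = t - 1, n_total - t
    f_stat = (ssb / df1) / (sse / df2)    # ratio of the two mean squares
    return {"SSB": ssb, "SSE": sse, "CTSS": ctss, "df1": df1, "df2": df2, "F": f_stat}

# Hypothetical data: three groups of three observations each
table = anova_table([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(table)
```

For these made-up numbers the group means are 2, 3, and 7, so almost all of the variation is between groups and the F statistic is large.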
What you need for the ANOVA Table
df1 = t – 1
df2 = nT – t
Critical value from Table 8: F for α, df1, and df2
You compute the F statistic, and reject the hypothesis that the population means are equal at Type I error probability α if the F statistic exceeds this tabulated value
Illustration: What you need for the ANOVA Table
df1 = t – 1 = 4
df2 = nT – t = 25
α = 0.05
Critical value from Table 8: F for α, df1, and df2 = 2.76.
You reject the hypothesis that the population means are equal at Type I error probability α = 0.05 if the F statistic exceeds 2.76
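Written out as a worked equation, the degrees of freedom in this illustration imply t = 5 populations and nT = 30 observations in total. The subscript notation for the critical value is an assumption made here for compactness; the slides only refer to Table 8.

```latex
\[
t = df_1 + 1 = 5, \qquad n_T = df_2 + t = 30, \qquad
F = \frac{SSB/(t-1)}{SSE/(n_T - t)} = \frac{SSB/4}{SSE/25},
\qquad \text{reject } H_0 \text{ if } F > F_{0.05,\,4,\,25} = 2.76 .
\]
```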
Nonparametric Methods
As we found for 1-sample comparisons and 2-sample comparisons, there are also nonparametric methods for ANOVA
These are often called Kruskal-Wallis methods
The idea is the same as in the 2-sample problem
Remember it?
Nonparametric Methods
Replace each observation by its rank in the pooled data
Do the usual ANOVA F-test
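The two steps above can be sketched as follows, using only the Python standard library. The function names and the data are hypothetical. The sketch also returns (N – 1)·SSB/CTSS computed on the ranks, which is the chi-square form of the Kruskal-Wallis statistic; that connection is a standard identity and is not stated on the slides.

```python
from statistics import mean

def pooled_ranks(groups):
    """Replace each observation by its rank in the pooled data (ties get the average rank)."""
    pooled = sorted(x for g in groups for x in g)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 (0-based) hold equal values; average their 1-based ranks
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return [[rank_of[x] for x in g] for g in groups]

def rank_anova(groups):
    """Usual ANOVA F-test computed on ranks, plus the chi-square form H."""
    ranks = pooled_ranks(groups)
    pooled = [r for g in ranks for r in g]
    t, n_total = len(ranks), len(pooled)
    grand = mean(pooled)
    sse = sum((r - mean(g)) ** 2 for g in ranks for r in g)
    ctss = sum((r - grand) ** 2 for r in pooled)
    ssb = ctss - sse
    f_stat = (ssb / (t - 1)) / (sse / (n_total - t))
    h_stat = (n_total - 1) * ssb / ctss
    return f_stat, h_stat

# Hypothetical data, not from the lecture
f_stat, h_stat = rank_anova([[1.2, 3.4, 2.2], [4.1, 5.0, 3.9], [6.3, 7.7, 5.8]])
print(f_stat, h_stat)
```

Because F and H are both functions of the same rank sums of squares, the two versions of the test order data sets in the same way; they differ only in the reference distribution used for the p-value.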
Female Concho Water Snakes, Ages 2-4, Tail Length
[Figure: Normal Q-Q Plot of Residual for TAILL; x-axis: Observed Value, y-axis: Expected Normal Value]
We need a method that allows for non-normal data!
Concho Water Snake Data for Females
Kruskal-Wallis Test
Ranks (Tail Length)
Age     N    Mean Rank
2.00    11   8.82
3.00    17   19.29
4.00    9    30.89
Total   37
Concho Water Snake Data for Females
Test Statistics (a, b): Tail Length
Chi-Square    20.637
df            2
Asymp. Sig.   .000
a. Kruskal Wallis Test
b. Grouping Variable: Age
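As a rough check, the chi-square value can be approximately reproduced from the group sizes and mean ranks in the Ranks table. This sketch uses the Kruskal-Wallis formula without a correction for ties and the rounded mean ranks, so it is expected to land slightly below the SPSS value of 20.637; attributing the gap to ties and rounding is an assumption.

```python
import math

n = [11, 17, 9]                    # group sizes from the Ranks table
mean_rank = [8.82, 19.29, 30.89]   # mean ranks from the Ranks table
N = sum(n)                         # 37 snakes in total

# Kruskal-Wallis statistic without the correction for ties
H = 12 / (N * (N + 1)) * sum(
    ni * (r - (N + 1) / 2) ** 2 for ni, r in zip(n, mean_rank)
)

# For df = 2 the chi-square upper tail has the closed form exp(-H / 2)
p_value = math.exp(-H / 2)
print(round(H, 2), p_value)
```

The statistic comes out at about 20.6, and the p-value is far below 0.001, which is why SPSS displays it as .000.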
Nonparametric Methods
Once you have decided that the populations are different in their means, there is no version of an LSD
You simply have to do each comparison in turn
This is a bit of a pain in SPSS, because you physically must do each 2-population comparison, defining the groups as you go
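Outside SPSS, looping over all pairs is easy to script. The sketch below is one way to do it with the Python standard library, using the normal approximation to the Wilcoxon rank sum (Mann-Whitney) test without a tie correction. The function name and the data are hypothetical, and for samples this small an exact test would normally be preferred.

```python
from itertools import combinations
from statistics import NormalDist

def rank_sum_test(x, y):
    """Two-sided rank sum test via the normal approximation (no tie correction)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs where the x value is below the y value (ties count one half)
    u = sum((a < b) + 0.5 * (a == b) for a in x for b in y)
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p

# Hypothetical data for three groups, not from the lecture
groups = {
    "A": [10, 12, 13, 15, 16],
    "B": [11, 14, 18, 19, 21],
    "C": [20, 22, 23, 25, 27],
}

results = {}
for (name1, x), (name2, y) in combinations(groups.items(), 2):
    results[(name1, name2)] = rank_sum_test(x, y)
    print(name1, "vs", name2, results[(name1, name2)])
```

With t populations there are t(t – 1)/2 such comparisons, so three groups give three tests and four groups give six.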
Nonparametric Methods
Illustrate Kruskal Wallis in SPSS and then remind ourselves about how to do the 2-population comparisons
Lipids Research Study
Four populations:
Healthy, non-smokers
Healthy, smokers
CHD, non-smokers
CHD, Smokers
Compared on basis of cholesterol levels
Lipid Research Study
Between-Subjects Factors: CHD/Smoking Category
Value   Value Label               N
.00     Healthy, NonSmoker        143
1.00    Healthy, Smoker           44
2.00    HeartDisease, NonSmoker   113
3.00    HeartDisease, Smoker      37
Lipid Research Study
[Figure: Lipids Research Study boxplots of Cholesterol (axis 0 to 400) by CHD/Smoking Category: Healthy, NonSmoker (N = 143); Healthy, Smoker (N = 44); CHD, NonSmoker (N = 113); CHD, Smoker (N = 37)]
Lecture 14 Review: Residuals
Testing for Normality in ANOVA
I use the General Linear Model to define these residuals
Form the residuals, which are simply the differences between each observation and its group mean
Then do a q-q plot
Useful if you have many groups with a small number of observations per group
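A minimal sketch of the residual and q-q steps, using only the Python standard library. The function name, the data, and the plotting positions (i – 0.5)/n are assumptions; SPSS may use a different plotting-position convention.

```python
from statistics import NormalDist, mean

def residual_qq_points(groups):
    """Return (observed residual, expected normal value) pairs for a q-q plot."""
    # Residual = observation minus the mean of its own group
    residuals = sorted(x - mean(g) for g in groups for x in g)
    n = len(residuals)
    # Residuals average to zero, so their spread is estimated around zero
    sd = (sum(r ** 2 for r in residuals) / (n - 1)) ** 0.5
    normal = NormalDist(0, sd)
    # Expected normal values at plotting positions (i - 0.5) / n
    expected = [normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(residuals, expected))

# Hypothetical data: three small groups, pooled into one set of nine residuals
points = residual_qq_points([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
for observed, expected in points:
    print(observed, round(expected, 3))
```

Pooling the residuals is what makes this useful with many small groups: three observations per group are too few for a q-q plot, but nine residuals together give a usable picture.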
Lipid Research Study
[Figure: Normal Q-Q Plot of Cholesterol Residuals; x-axis: Observed Value, y-axis: Expected Normal Value]
Lipid Research Study: Note Table Entries
Tests of Between-Subjects Effects
Dependent Variable: Cholesterol
Source            Type III Sum of Squares   df    Mean Square    F          Sig.
Corrected Model   13450.180 (a)             3     4483.393       2.738      .043
Intercept         12637872.4                1     12637872.42    7716.573   .000
NCHDSMOK          13450.180                 3     4483.393       2.738      .043
Error             545373.126                333   1637.757
Total             17709567.0                337
Corrected Total   558823.306                336
a. R Squared = .024 (Adjusted R Squared = .015)
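The entries in this table tie together exactly as in the general ANOVA table. The short check below recomputes them from the Error and Corrected Total sums of squares; the variable names are chosen here for illustration.

```python
# Values copied from the Tests of Between-Subjects Effects table
ctss, sse = 558823.306, 545373.126
t, n_total = 4, 337              # four populations, 337 observations

ssb = ctss - sse                 # between sum of squares (the NCHDSMOK row)
df1, df2 = t - 1, n_total - t    # 3 and 333
msb, mse = ssb / df1, sse / df2  # mean squares
f_stat = msb / mse
r_squared = ssb / ctss
print(round(ssb, 3), round(msb, 3), round(mse, 3), round(f_stat, 3), round(r_squared, 3))
```

The recomputed values agree with the SPSS output to the displayed precision, including R Squared = SSB/CTSS.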
Lipid Research Study
Kruskal-Wallis Test
Test Statistics (a, b): Cholesterol
Chi-Square    7.360
df            3
Asymp. Sig.   .061
a. Kruskal Wallis Test
b. Grouping Variable: CHD/Smoking Category
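The asymptotic significance can be reproduced from the chi-square value alone. For 3 degrees of freedom the chi-square upper tail has a closed form in terms of the complementary error function, so the standard library is enough; the helper name `chi2_sf_df3` is hypothetical.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_kw = chi2_sf_df3(7.360)   # Kruskal-Wallis chi-square from the table, df = 3
print(round(p_kw, 3))
```

The same helper gives a tail probability of about 0.05 at 7.815, the usual 5% critical value for 3 degrees of freedom, so the observed 7.360 falls just short of it.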
Lipids Research Study
The p-values are 0.043 for ANOVA, 0.061 for Kruskal Wallis.
Weakish evidence of population differences
The Q-Q plot was pretty normal, so I would probably go with the smaller p-value and publish, but with some warnings.
BTW, what hypothesis were we testing??
That the population means for the 4 populations were all simultaneously equal.
Lipids Research Study: Fisher LSD
Multiple Comparisons
Dependent Variable: Cholesterol
LSD
(I) CHD/Smoking Category   (J) CHD/Smoking Category   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
Healthy, NonSmoker         Healthy, Smoker            -.9983                  6.9767       .886   -14.7222             12.7257
Healthy, NonSmoker         Heart Disease, NonSmoker   -9.0113                 5.0937       .078   -19.0313             1.0087
Healthy, NonSmoker         Heart Disease, Smoker      -19.1168*               7.4644       .011   -33.8000             -4.4336
Healthy, Smoker            Healthy, NonSmoker         .9983                   6.9767       .886   -12.7257             14.7222
Healthy, Smoker            Heart Disease, NonSmoker   -8.0131                 7.1913       .266   -22.1592             6.1331
Healthy, Smoker            Heart Disease, Smoker      -18.1186*               9.0269       .046   -35.8755             -.3616
Heart Disease, NonSmoker   Healthy, NonSmoker         9.0113                  5.0937       .078   -1.0087              19.0313
Heart Disease, NonSmoker   Healthy, Smoker            8.0131                  7.1913       .266   -6.1331              22.1592
Heart Disease, NonSmoker   Heart Disease, Smoker      -10.1055                7.6653       .188   -25.1840             4.9731
Heart Disease, Smoker      Healthy, NonSmoker         19.1168*                7.4644       .011   4.4336               33.8000
Heart Disease, Smoker      Healthy, Smoker            18.1186*                9.0269       .046   .3616                35.8755
Heart Disease, Smoker      Heart Disease, NonSmoker   10.1055                 7.6653       .188   -4.9731              25.1840
Based on observed means.
*. The mean difference is significant at the .05 level.
Healthy Nonsmokers differ from Heart Disease Smokers
Healthy and CHD patients differ among smokers
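The standard errors in the LSD output follow from the Error mean square in the ANOVA table and the two group sizes. The sketch below recomputes a few of them; the function name `lsd_std_error` is hypothetical, and the group sizes are taken from the Between-Subjects Factors table.

```python
import math

mse = 1637.757   # Error mean square from the ANOVA table (333 df)
n = {
    "Healthy, NonSmoker": 143,
    "Healthy, Smoker": 44,
    "Heart Disease, NonSmoker": 113,
    "Heart Disease, Smoker": 37,
}

def lsd_std_error(group_i, group_j):
    """Standard error of a difference of two group means, using the pooled error variance."""
    return math.sqrt(mse * (1 / n[group_i] + 1 / n[group_j]))

se = lsd_std_error("Healthy, NonSmoker", "Heart Disease, Smoker")
t_stat = -19.1168 / se   # mean difference from the table divided by its standard error
print(round(se, 4), round(t_stat, 3))
```

Because every comparison shares the same pooled error variance, the standard errors differ only through the group sizes, which is why the comparisons involving the two small smoker groups have the widest intervals.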
Lipids Research Study
Fisher’s LSD suggested that Healthy, Non-smokers have significantly lower cholesterol than Heart Disease, Smokers (p = 0.011)
P-value for Wilcoxon rank sum test is 0.024
Ranks (Cholesterol)
CHD/Smoking Category     N     Mean Rank   Sum of Ranks
Healthy, NonSmoker       143   86.04       12304.00
Heart Disease, Smoker    37    107.73      3986.00
Total                    180
Test Statistics (a): Cholesterol
Mann-Whitney U           2008.000
Wilcoxon W               12304.000
Z                        -2.257
Asymp. Sig. (2-tailed)   .024
a. Grouping Variable: CHD/Smoking Category
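The test statistics can be recovered from the rank sum and the two sample sizes. This sketch uses the normal approximation without a tie correction, so the Z value may differ from the SPSS output in the last digit.

```python
from statistics import NormalDist

n1, n2 = 143, 37   # Healthy, NonSmoker and Heart Disease, Smoker
w1 = 12304.0       # sum of ranks for the first group (Wilcoxon W)

u = w1 - n1 * (n1 + 1) / 2                      # Mann-Whitney U
mu = n1 * n2 / 2                                # mean of U under the null hypothesis
sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5   # standard deviation of U, ignoring ties
z = (u - mu) / sigma
p_two_sided = 2 * NormalDist().cdf(-abs(z))
print(u, round(z, 3), round(p_two_sided, 3))
```

As a consistency check on the Ranks table, the two rank sums add up to 180 × 181 / 2 = 16290, the sum of the ranks 1 through 180.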
Variances
The ANOVA method is remarkably robust
Although it assumes that the populations have equal population variances, as long as the sample sizes are reasonably close, it is not much affected by unequal variances
Of course, the sample variances will be different: why?
Variances
It is still possible to compare variances
Realistically, if you are intrinsically interested in whether populations have the same variances or not, you should consult a statistician
However, there is a version of the Levene test that can be computed from SPSS. It uses the same algorithm as in the 2-population case
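One common form of the Levene test is simply the ANOVA F-test applied to the absolute deviations of each observation from its group mean. The sketch below implements that form with the Python standard library; whether SPSS uses exactly this variant is an assumption, and the function name and data are hypothetical.

```python
from statistics import mean

def levene_f(groups):
    """Levene statistic: the one-way ANOVA F computed on |observation - group mean|."""
    abs_dev = [[abs(x - mean(g)) for x in g] for g in groups]
    pooled = [z for g in abs_dev for z in g]
    t, n_total = len(groups), len(pooled)
    grand = mean(pooled)
    # Between and error sums of squares of the absolute deviations
    ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in abs_dev)
    sse = sum((z - mean(g)) ** 2 for g in abs_dev for z in g)
    df1, df2 = t - 1, n_total - t
    return (ssb / df1) / (sse / df2), df1, df2

# Hypothetical data: the second group is twice as spread out as the first
f_stat, df1, df2 = levene_f([[1, 2, 3], [2, 4, 6]])
print(f_stat, df1, df2)
```

The resulting F is compared with the same F table as before, using df1 = t – 1 and df2 = nT – t.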
Lipid Research Study: Variances?
[Figure: Lipids Research Study boxplots of Cholesterol (axis 0 to 400) by CHD/Smoking Category: Healthy, NonSmoker (N = 143); Healthy, Smoker (N = 44); CHD, NonSmoker (N = 113); CHD, Smoker (N = 37)]
Variances
The IQRs are 57, 44, 56, and 74
I don’t see a massive inequality of variability
The medians are 220, 225, 225, 232
Lipid Research Study: Variances?
Levene's Test of Equality of Error Variances (a)
Dependent Variable: Cholesterol
F       df1   df2   Sig.
1.657   3     333   .176
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept+NCHDSMOK
You get this under “Options” in the ANOVA Run
Concho Water Snakes
[Figure: Female Concho Water Snakes boxplots of Tail Length (axis 120 to 220) by Age: 2-year olds (N = 11), 3-year olds (N = 17), 4-year olds (N = 9)]
Concho Water Snakes
Test Statistics (a, b): Tail Length
Chi-Square    20.637
df            2
Asymp. Sig.   .000
a. Kruskal Wallis Test
b. Grouping Variable: Age
Concho Water Snakes
Levene's Test of Equality of Error Variances (a)
Dependent Variable: Tail Length
F       df1   df2   Sig.
2.903   2     34    .069
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept+AGE