Copyright (c) Bani K. Mallick. STAT 651 Lecture #14.

Date posted: 22-Dec-2015

Page 1

STAT 651

Lecture #14

Page 2

Topics in Lecture #14

The Kruskal-Wallis Test

Review of ANOVA

Theory for the ANOVA Table

Page 3

Book Sections Covered in Lecture #14

Chapter 8.6

Page 4

Relevant SPSS Tutorials

Kruskal-Wallis

Page 5

What you need for the ANOVA Table

Error sum of squares (SSE)

Corrected Total sum of squares (CTSS)

Between sum of squares: SSB = CTSS – SSE

Number of populations t: df1 = t – 1

Total number of observations nT: df2 = nT – t

Critical value from Table 8: F for α and df1 and df2
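As a minimal sketch of how these ingredients fit together, the Python function below computes SSE, CTSS, SSB, the two degrees of freedom and the F statistic from raw data. The function name `anova_table` and the toy data are hypothetical, not part of the lecture; the critical value still has to come from Table 8 or from software.

```python
def anova_table(groups):
    """One-way ANOVA quantities from a list of samples (one list per population)."""
    all_obs = [x for g in groups for x in g]
    n_total = len(all_obs)               # nT, total number of observations
    t = len(groups)                      # number of populations
    grand_mean = sum(all_obs) / n_total

    # Corrected total sum of squares: deviations from the grand mean.
    ctss = sum((x - grand_mean) ** 2 for x in all_obs)

    # Error sum of squares: deviations from each group's own mean.
    sse = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        sse += sum((x - group_mean) ** 2 for x in g)

    ssb = ctss - sse                     # between sum of squares
    df1, df2 = t - 1, n_total - t
    f_stat = (ssb / df1) / (sse / df2)
    return {"SSB": ssb, "SSE": sse, "CTSS": ctss, "df1": df1, "df2": df2, "F": f_stat}


# Hypothetical toy data: three populations, three observations each.
table = anova_table([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(table)
```

You would then compare `table["F"]` with the tabulated F value for df1 and df2.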

Page 6

The ANOVA Table: General Linear Model

Source            Sum of Squares   Degrees of freedom   Mean Square     F for equal means
Variable Name     SSB              df1 = t – 1          SSB/(t – 1)     [SSB/(t – 1)] / [SSE/(nT – t)]
Error             SSE              df2 = nT – t         SSE/(nT – t)
Corrected Total   CTSS             nT – 1

Page 7

What you need for the ANOVA Table

df1 = t – 1

df2 = nT – t

Critical value from Table 8: F for α and df1 and df2

You compute the F statistic, and reject the hypothesis that the population means are equal at Type I error probability α if the F statistic exceeds this tabulated value

Page 8

Illustration: What you need for the ANOVA Table

df1 = t – 1 = 4

df2 = nT – t = 25

α = 0.05

Critical value from Table 8: F for α and df1 and df2 = 2.76.

You reject the hypothesis that the population means are equal at Type I error probability α = 0.05 if the F statistic exceeds 2.76

Page 9

Nonparametric Methods

As we found for 1-sample comparisons and 2-sample comparisons, there are also nonparametric methods for ANOVA

These are often called Kruskal-Wallis methods

The idea is the same as in the 2-sample problem

Remember it?

Page 10

Nonparametric Methods

Replace each observation by its rank in the pooled data

Do the usual ANOVA F-test
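A rough sketch of the rank idea in Python, under stated assumptions: ties get the average of the ranks they occupy, and the statistic returned is the chi-square form H = (nT – 1) × SSB / CTSS computed on the ranks. H carries the same information as the F statistic on the ranks, since F = [H/(t – 1)] / [(nT – 1 – H)/(nT – t)]. The function names and toy data are hypothetical.

```python
def midranks(values):
    """Rank the pooled data from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (H, df), with H = (N - 1) * SSB / CTSS computed on the pooled ranks."""
    pooled = [x for g in groups for x in g]
    ranks = midranks(pooled)
    n_total = len(pooled)
    mean_rank = (n_total + 1) / 2
    ctss = sum((r - mean_rank) ** 2 for r in ranks)
    ssb, start = 0.0, 0
    for g in groups:
        group_ranks = ranks[start:start + len(g)]
        start += len(g)
        ssb += len(g) * (sum(group_ranks) / len(g) - mean_rank) ** 2
    return (n_total - 1) * ssb / ctss, len(groups) - 1


# Hypothetical toy data: three populations with no overlap at all.
h, df = kruskal_wallis_h([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
print(h, df)
```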

Page 11

Female Concho Water Snakes, Ages 2-4, Tail Length

[Figure: Normal Q-Q Plot of Residual for TAILL. Horizontal axis: Observed Value; vertical axis: Expected Normal Value.]

We need a method that allows for non-normal data!

Page 12

Concho Water Snake Data for Females

Kruskal-Wallis Test

Ranks (Tail Length)

Age     N    Mean Rank
2.00    11   8.82
3.00    17   19.29
4.00    9    30.89
Total   37

Page 13

Concho Water Snake Data for Females

Test Statistics (a, b)

              Tail Length
Chi-Square    20.637
df            2
Asymp. Sig.   .000

a. Kruskal Wallis Test
b. Grouping Variable: Age

Page 14

Nonparametric Methods

Once you have decided that the populations are different in their means, there is no nonparametric version of the LSD

You simply have to do each comparison in turn

This is a bit of a pain in SPSS, because you physically must do each 2-population comparison, defining the groups as you go

Page 15

Nonparametric Methods

Illustrate Kruskal Wallis in SPSS and then remind ourselves about how to do the 2-population comparisons

Page 16

Lipids Research Study

Four populations:

Healthy, non-smokers

Healthy, smokers

CHD, non-smokers

CHD, Smokers

Compared on basis of cholesterol levels

Page 17

Lipid Research Study

Between-Subjects Factors

CHD/Smoking Category   Value Label                N
.00                    Healthy, NonSmoker         143
1.00                   Healthy, Smoker            44
2.00                   Heart Disease, NonSmoker   113
3.00                   Heart Disease, Smoker      37

Page 18

Lipid Research Study

[Figure: box plots of Cholesterol by CHD/Smoking Category, Lipids Research Study. Groups: Healthy, NonSmoker (N = 143); Healthy, Smoker (N = 44); CHD, NonSmoker (N = 113); CHD, Smoker (N = 37). Vertical axis: Cholesterol, 0 to 400.]

Page 19

Lecture 14 Review: Residuals

Testing for Normality in ANOVA

I use the General Linear Model to define these residuals

Form the residuals, which are simply the differences between each observation and its group mean

Then do a q-q plot

Useful if you have many groups with a small number of observations per group
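The two steps above can be sketched in Python as follows. This is only an illustration: the function names and toy data are hypothetical, and the plotting positions (i – 0.5)/n are an assumption, since packages differ in how they compute the expected normal values.

```python
from statistics import NormalDist, mean, stdev


def group_residuals(groups):
    """Residual = observation minus the mean of its own group."""
    return [x - mean(g) for g in groups for x in g]


def qq_points(residuals):
    """Pair each sorted residual with an expected normal value."""
    n = len(residuals)
    # Normal curve with the same mean and standard deviation as the residuals.
    fitted = NormalDist(mean(residuals), stdev(residuals))
    # Plotting positions (i - 0.5) / n are an assumption, not SPSS's exact rule.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(sorted(residuals), expected))


# Hypothetical toy data: two small groups pooled into one set of residuals.
residuals = group_residuals([[1.0, 2.0, 3.0], [10.0, 12.0, 14.0]])
points = qq_points(residuals)
for observed, expected in points:
    print(round(observed, 2), round(expected, 2))
```

If the points fall close to a straight line, the normality assumption looks reasonable.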

Page 20

Lipid Research Study

[Figure: Normal Q-Q Plot of Cholesterol Residuals. Horizontal axis: Observed Value; vertical axis: Expected Normal Value.]

Page 21

Lipid Research Study: Note Table Entries

Tests of Between-Subjects Effects

Dependent Variable: Cholesterol

Source            Type III Sum of Squares   df    Mean Square    F          Sig.
Corrected Model   13450.180 (a)             3     4483.393       2.738      .043
Intercept         12637872.4                1     12637872.42    7716.573   .000
NCHDSMOK          13450.180                 3     4483.393       2.738      .043
Error             545373.126                333   1637.757
Total             17709567.0                337
Corrected Total   558823.306                336

a. R Squared = .024 (Adjusted R Squared = .015)
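To see how the table entries relate to each other, this short Python check rebuilds SSB, the mean squares, F and R squared from the Error and Corrected Total rows. The numbers are copied from the table; the variable names are only illustrative.

```python
# Entries copied from the Tests of Between-Subjects Effects table.
ctss, sse = 558823.306, 545373.126     # Corrected Total and Error sums of squares
df1, df2 = 3, 333                      # t - 1 and nT - t

ssb = ctss - sse                       # between sum of squares (the NCHDSMOK row)
msb, mse = ssb / df1, sse / df2        # mean squares
f_stat = msb / mse                     # F for equal means
r_squared = ssb / ctss                 # share of variation explained by the groups

print(round(ssb, 3), round(msb, 3), round(mse, 3), round(f_stat, 3), round(r_squared, 3))
```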

Page 22

Lipid Research Study

Kruskal-Wallis Test

Test Statistics (a, b)

              Cholesterol
Chi-Square    7.360
df            3
Asymp. Sig.   .061

a. Kruskal Wallis Test
b. Grouping Variable: CHD/Smoking Category
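The "Asymp. Sig." value is the upper-tail area of a chi-square distribution with t – 1 degrees of freedom. As a sketch, the Python function below uses closed-form tail areas that hold only for df = 2 and df = 3, which happen to be the two cases in this lecture; the function name is hypothetical, and for other df you would use statistical software.

```python
import math


def chi2_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x); closed forms for df = 2 and 3 only."""
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("this sketch only covers df = 2 and df = 3")


p_snakes = chi2_upper_tail(20.637, 2)   # Concho water snakes, 3 age groups
p_lipids = chi2_upper_tail(7.360, 3)    # lipids study, 4 populations
print(round(p_snakes, 3), round(p_lipids, 3))
```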

Page 23

Lipids Research Study

The p-values are 0.043 for ANOVA, 0.061 for Kruskal Wallis.

Weakish evidence of population differences

The Q-Q plot was pretty normal, so I would probably go with the smaller p-value and publish, but with some warnings.

BTW, what hypothesis were we testing??

Page 24

Lipids Research Study

The p-values are 0.043 for ANOVA, 0.061 for Kruskal Wallis.

Weakish evidence of population differences

BTW, what hypothesis were we testing??

That the population means for the 4 populations were all simultaneously equal.

Page 25

Lipids Research Study: Fisher LSD

Multiple Comparisons

Dependent Variable: Cholesterol

LSD

(I) CHD/Smoking Category   (J) CHD/Smoking Category   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
Healthy, NonSmoker         Healthy, Smoker            -.9983                  6.9767       .886   -14.7222             12.7257
Healthy, NonSmoker         Heart Disease, NonSmoker   -9.0113                 5.0937       .078   -19.0313             1.0087
Healthy, NonSmoker         Heart Disease, Smoker      -19.1168*               7.4644       .011   -33.8000             -4.4336
Healthy, Smoker            Healthy, NonSmoker         .9983                   6.9767       .886   -12.7257             14.7222
Healthy, Smoker            Heart Disease, NonSmoker   -8.0131                 7.1913       .266   -22.1592             6.1331
Healthy, Smoker            Heart Disease, Smoker      -18.1186*               9.0269       .046   -35.8755             -.3616
Heart Disease, NonSmoker   Healthy, NonSmoker         9.0113                  5.0937       .078   -1.0087              19.0313
Heart Disease, NonSmoker   Healthy, Smoker            8.0131                  7.1913       .266   -6.1331              22.1592
Heart Disease, NonSmoker   Heart Disease, Smoker      -10.1055                7.6653       .188   -25.1840             4.9731
Heart Disease, Smoker      Healthy, NonSmoker         19.1168*                7.4644       .011   4.4336               33.8000
Heart Disease, Smoker      Healthy, Smoker            18.1186*                9.0269       .046   .3616                35.8755
Heart Disease, Smoker      Heart Disease, NonSmoker   10.1055                 7.6653       .188   -4.9731              25.1840

Based on observed means.

*. The mean difference is significant at the .05 level.

Healthy Nonsmokers differ from Heart Disease Smokers

Healthy and CHD patients differ among smokers
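The Std. Error column can be recovered from the Error mean square and the group sizes, using the pooled-variance formula sqrt(MSE × (1/n_i + 1/n_j)). The Python sketch below does this for one pair; the MSE (1637.757), the sample sizes and the mean difference are taken from the SPSS output in this lecture, while the function and variable names are hypothetical.

```python
import math

mse = 1637.757                      # Error mean square from the ANOVA table
n = {
    "Healthy, NonSmoker": 143,
    "Healthy, Smoker": 44,
    "Heart Disease, NonSmoker": 113,
    "Heart Disease, Smoker": 37,
}


def lsd_std_error(group_i, group_j):
    """Standard error of a difference of two group means under a pooled variance."""
    return math.sqrt(mse * (1 / n[group_i] + 1 / n[group_j]))


se = lsd_std_error("Healthy, NonSmoker", "Heart Disease, Smoker")
t_ratio = -19.1168 / se             # mean difference (I-J) taken from the LSD table
print(round(se, 3), round(t_ratio, 2))
```

The t ratio is compared with a t distribution on the error degrees of freedom (333 here) to get the Sig. column.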

Page 26

Lipids Research Study

Fisher’s LSD suggested that Healthy, Non-smokers have significantly lower cholesterol than Heart Disease, Smokers (p = 0.011)

P-value for Wilcoxon rank sum test is 0.024

Ranks (Cholesterol)

CHD/Smoking Category    N     Mean Rank   Sum of Ranks
Healthy, NonSmoker      143   86.04       12304.00
Heart Disease, Smoker   37    107.73      3986.00
Total                   180

Test Statistics (a)

                         Cholesterol
Mann-Whitney U           2008.000
Wilcoxon W               12304.000
Z                        -2.257
Asymp. Sig. (2-tailed)   .024

a. Grouping Variable: CHD/Smoking Category
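As a reminder of how the 2-population output hangs together, this Python sketch rebuilds U, Z and the two-sided p-value from the sum of ranks. The sample sizes and Wilcoxon W come from the SPSS output; the normal approximation here ignores any correction for ties, which is an assumption, so the last digits can differ slightly from SPSS.

```python
import math
from statistics import NormalDist

n1, n2 = 143, 37                    # Healthy NonSmoker, Heart Disease Smoker
w = 12304.0                         # Wilcoxon W: sum of ranks in the first group

u = w - n1 * (n1 + 1) / 2           # Mann-Whitney U
mean_u = n1 * n2 / 2                # mean of U when the populations are the same
sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # no tie correction (assumption)
z = (u - mean_u) / sd_u
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print(u, round(z, 3), round(p_two_sided, 3))
```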

Page 27

Variances

The ANOVA method is remarkably robust

Although it assumes that the populations have equal population variances, as long as the sample sizes are reasonably close, it is not much affected by unequal variances

Of course, the sample variances will be different: why?

Page 28

Variances

It is still possible to compare variances

Realistically, if you are intrinsically interested in whether populations have the same variances or not, you should consult a statistician

However, there is a version of the Levene test that can be computed from SPSS. It uses the same algorithm as in the 2-population case
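A minimal sketch of the idea behind a Levene-type test, in Python: replace each observation by its absolute deviation from its group mean, then run the usual one-way ANOVA F on those deviations. Centering at the group mean is an assumption here (some variants center at the median), and the function name and toy data are hypothetical.

```python
from statistics import mean


def levene_f(groups):
    """Levene-type statistic: one-way ANOVA F computed on |x - group mean|."""
    devs = [[abs(x - mean(g)) for x in g] for g in groups]
    pooled = [d for g in devs for d in g]
    n_total, t = len(pooled), len(devs)
    grand = mean(pooled)
    # Between and error sums of squares of the absolute deviations.
    ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in devs)
    sse = sum((d - mean(g)) ** 2 for g in devs for d in g)
    return (ssb / (t - 1)) / (sse / (n_total - t)), t - 1, n_total - t


# Hypothetical toy data: the second group is much more spread out.
f_stat, df1, df2 = levene_f([[1.0, 2.0, 3.0, 4.0], [0.0, 4.0, 8.0, 12.0]])
print(round(f_stat, 3), df1, df2)
```

A large F, compared with the F table for df1 and df2, suggests unequal variability.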

Page 29

Lipid Research Study: Variances?

[Figure: box plots of Cholesterol by CHD/Smoking Category, Lipids Research Study. Groups: Healthy, NonSmoker (N = 143); Healthy, Smoker (N = 44); CHD, NonSmoker (N = 113); CHD, Smoker (N = 37). Vertical axis: Cholesterol, 0 to 400.]

Page 30

Variances

The IQRs are 57, 44, 56, 74

I don’t see a massive inequality of variability

The medians are 220, 225, 225, 232

Page 31

Lipid Research Study: Variances?

Levene's Test of Equality of Error Variances (a)

Dependent Variable: Cholesterol

F       df1   df2   Sig.
1.657   3     333   .176

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.

a. Design: Intercept+NCHDSMOK

You get this under “Options” in the ANOVA Run

Page 32

Concho Water Snakes

[Figure: box plots of Tail Length by Age, Female Concho Water Snakes. Groups: 2-year olds (N = 11), 3-year olds (N = 17), 4-year olds (N = 9). Vertical axis: Tail Length, 120 to 220.]

Page 33

Concho Water Snakes

Test Statistics (a, b)

              Tail Length
Chi-Square    20.637
df            2
Asymp. Sig.   .000

a. Kruskal Wallis Test
b. Grouping Variable: Age

Page 34

Concho Water Snakes

Levene's Test of Equality of Error Variances (a)

Dependent Variable: Tail Length

F       df1   df2   Sig.
2.903   2     34    .069

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.

a. Design: Intercept+AGE