Copyright (c) Bani K. Mallick. STAT 651 Lecture # 12.

Date posted: 22-Dec-2015

Page 1:

STAT 651

Lecture # 12

Page 2:

Topics in Lecture #12

The ANOVA F-test, and the basics of the F-table

Page 3:

Book Sections Covered in Lecture #12

Chapter 8.1-8.2

Page 4:

Relevant SPSS Tutorials

ANOVA-GLM

Post hoc tests

Page 5:

ANalysis Of VAriance

We now turn to making inferences when there are 3 or more populations

This is classically called ANOVA

It is somewhat more formula dense than what we have been used to.

Tests for normality are also somewhat more complex

Page 6:

ANOVA

Suppose we form three populations on the basis of body mass index (BMI):

BMI < 22, 22 <= BMI < 28, BMI >= 28

This forms 3 populations

We want to know whether the three populations have the same mean caloric intake, or if their food composition differs.

Page 7:

ANOVA

If you do lots of 95% confidence intervals, you’d expect by chance that about 5% will be wrong

Thus, if you do 20 confidence intervals, you expect 20 x 5% = 1 of them not to include the true population parameter

This is a fact of life
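
The arithmetic on this slide can be checked with a short Python sketch. The second quantity, the chance that at least one interval misses, assumes the 20 intervals are independent; that assumption is added here for illustration and is not stated in the lecture.

```python
# Expected number of 95% confidence intervals that miss the true parameter.
n_intervals = 20
miss_rate = 0.05  # a 95% interval misses about 5% of the time

expected_misses = n_intervals * miss_rate

# Assumption (not from the lecture): the 20 intervals are independent.
prob_at_least_one_miss = 1 - (1 - miss_rate) ** n_intervals

print(f"expected misses: {expected_misses:.2f}")
print(f"P(at least one miss): {prob_at_least_one_miss:.3f}")
```

Under the independence assumption, at least one miss is more likely than not, which is one way to see why a preliminary overall test is attractive.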

Page 8:

ANOVA

One procedure that is often followed is to do a preliminary test to see whether there are any differences among the populations

Then, once you conclude that some differences exist, you allow somewhat more informality in deciding where those differences manifest themselves

The first step is the ANOVA F-test

Page 9:

ANOVA

Consider the ACS data, with 3 BMI groups and measuring the % calories from fat (first FFQ)

What is your preliminary conclusion about differences in means/medians?

About differences in variability?

About massive outliers?

[Figure: boxplots of Baseline FFQ, titled “ACS Data: % Calories from Fat”, by BMI Group (Low BMI, N = 62; Medium BMI, N = 84; High BMI, N = 38); vertical axis from 0 to 70; two points are labelled 8 and 148]

Page 10:

ANOVA

Consider the ACS data, with 3 BMI groups and measuring the % calories from fat (first FFQ)

Between-Subjects Factors

BMIGroup   Value Label   N
1.00       Low BMI       62
2.00       Medium BMI    84
3.00       High BMI      38

Page 11:

ANOVA

The F-test is easy to compute, and provided in all statistical packages

The populations are 1, 2, …, t

The sample sizes are $n_1, \ldots, n_t$

The population means are $\mu_1, \ldots, \mu_t$

The hypothesis to test is $H_0: \mu_1 = \mu_2 = \cdots = \mu_t$

Page 12:

ANOVA

The data from population i are $Y_{i1}, Y_{i2}, \ldots, Y_{i n_i}$

The sample mean from population i is $\bar{Y}_i$

The sample mean of all the data is $\bar{Y}$

The total sample size is the total number of observations, called $n_T$

Page 13:

ANOVA

The ANOVA Table (demo in SPSS in class of ACS data): “Analyze”, “Compare Means”, “1-way ANOVA”. I’ll now show you what each item is

ANOVA: Baseline FFQ

Source            Sum of Squares    df   Mean Square       F   Sig.
Between Groups           960.287     2       480.143   5.689   .004
Within Groups          15275.639   181        84.396
Total                  16235.925   183

Page 14:

ANOVA

The sample mean from population i is $\bar{Y}_i$

The sample mean of all the data is $\bar{Y}$

The idea of the F-test is based on distances

The distance of the data to the overall mean is

TSS = Total Sum of Squares

$\mathrm{TSS} = \sum_{ij} (Y_{ij} - \bar{Y})^2$

Page 15:

ANOVA

The distance of the data to the overall mean is

TSS = Total Sum of Squares

$\mathrm{TSS} = \sum_{ij} (Y_{ij} - \bar{Y})^2$

This has $n_T - 1$ degrees of freedom

Page 16:

ANOVA

Next comes the “Between Groups” row of the ANOVA table for Baseline FFQ:

Between Groups: Sum of Squares = 960.287, df = 2, Mean Square = 480.143, F = 5.689, Sig. = .004

Page 17:

ANOVA

The sum of squares between groups is

$\mathrm{SSB} = \sum_i n_i (\bar{Y}_i - \bar{Y})^2$

It has t-1 degrees of freedom, so the number of populations is the degrees of freedom between groups + 1.

Page 18:

ANOVA

Next comes the “Within Groups” row of the ANOVA table for Baseline FFQ:

Within Groups: Sum of Squares = 15275.639, df = 181, Mean Square = 84.396

Page 19:

ANOVA

The distance of the observations to their sample means is

$\mathrm{SSW} = \sum_{ij} (Y_{ij} - \bar{Y}_i)^2$

This is the Sum of Squares Within

It has $n_T - t$ degrees of freedom
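
As a quick check on the three rows of the SPSS table, the sketch below verifies that the between and within sums of squares add up to the total, and that the degrees of freedom do the same. The additivity itself is a standard property of the one-way decomposition rather than something the slides state explicitly; the numbers are copied from the table.

```python
# Values copied from the SPSS ANOVA table for Baseline FFQ.
ssb, df_between = 960.287, 2
ssw, df_within = 15275.639, 181
tss, df_total = 16235.925, 183

# SSB + SSW should equal TSS; the small gap is rounding in the printed table.
ss_gap = abs((ssb + ssw) - tss)

# (t - 1) + (n_T - t) should equal n_T - 1.
df_adds_up = (df_between + df_within == df_total)

print(f"gap in sums of squares: {ss_gap:.3f}")
print(f"degrees of freedom add up: {df_adds_up}")
```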

Page 20:

ANOVA

Next comes the “Mean Squares”

These are the different sums of squares divided by their degrees of freedom

In the ANOVA table for Baseline FFQ, the Mean Square is 480.143 for Between Groups (960.287 on 2 df) and 84.396 for Within Groups (15275.639 on 181 df)

Page 21:

ANOVA

Next comes the F-statistic

It is the ratio of the mean square between groups to the mean square within groups

In the ANOVA table for Baseline FFQ, F = 5.689 with Sig. = .004
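
The mean squares and the F-statistic can be reproduced from the sums of squares and degrees of freedom in the table. This is a minimal sketch using only the printed values, so the results agree with SPSS up to rounding.

```python
# Sums of squares and degrees of freedom from the SPSS ANOVA table.
ssb, df_between = 960.287, 2
ssw, df_within = 15275.639, 181

# Mean square = sum of squares / degrees of freedom.
ms_between = ssb / df_between
ms_within = ssw / df_within

# F = mean square between groups / mean square within groups.
f_stat = ms_between / ms_within

print(f"MS between = {ms_between:.4f}")
print(f"MS within  = {ms_within:.4f}")
print(f"F          = {f_stat:.4f}")
```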

Page 22:

ANOVA

The F-statistic is

$F = \frac{\mathrm{SSB}/(t-1)}{\mathrm{SSW}/(n_T - t)} = \frac{\sum_i n_i (\bar{Y}_i - \bar{Y})^2 / (t-1)}{\sum_{ij} (Y_{ij} - \bar{Y}_i)^2 / (n_T - t)}$
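
To see the formula at work on raw data, here is a small sketch that computes SSB, SSW, TSS and F directly from their definitions. The function name `one_way_anova` and the three tiny samples are made up for illustration; they are not the ACS data.

```python
def one_way_anova(groups):
    """Return (SSB, SSW, TSS, F) for a list of samples, one list per population."""
    all_obs = [y for g in groups for y in g]
    n_total = len(all_obs)          # n_T
    t = len(groups)                 # number of populations
    grand_mean = sum(all_obs) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between groups: n_i times the squared distance of each group mean to the overall mean.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within groups: squared distance of each observation to its own group mean.
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    # Total: squared distance of each observation to the overall mean.
    tss = sum((y - grand_mean) ** 2 for y in all_obs)

    f_stat = (ssb / (t - 1)) / (ssw / (n_total - t))
    return ssb, ssw, tss, f_stat


# Hypothetical data for three small groups.
toy = [[30.0, 32.0, 34.0], [33.0, 35.0, 37.0], [38.0, 40.0, 42.0]]
ssb, ssw, tss, f_stat = one_way_anova(toy)
print(ssb, ssw, tss, f_stat)
```

For these made-up samples the group means are 32, 35 and 40, so the between-groups part dominates and F is large.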

Page 23:

ANOVA

The F-statistic is

$F = \frac{\sum_i n_i (\bar{Y}_i - \bar{Y})^2 / (t-1)}{\sum_{ij} (Y_{ij} - \bar{Y}_i)^2 / (n_T - t)}$

What values (large or small) indicate differences?

Clearly large, since if the population means are equal, the sample means will be close to each other, and the top will be near 0

Page 24:

Why do they call it ANOVA?

ANOVA = ANalysis Of VAriance

I want to show you the concept in graphs, because these become important in STAT 652

I will illustrate the idea with samples from two populations

The first population will be in red

The second in green

When pooled I will use blue

Page 25:

Why do they call it ANOVA?

The data from the first population

[Figure: six observations, each marked Y, from the first population, with their sample mean $\bar{Y}_1$]

The total distance of the observations to their sample mean is

$\sum_j (Y_{1j} - \bar{Y}_1)^2$

Page 26:

Why do they call it ANOVA?

I will use this funny symbol to denote total distance

[Figure: the six observations from the first population, with the symbol used for total distance]

The total distance of the observations to their sample mean is

$\sum_j (Y_{1j} - \bar{Y}_1)^2$

Page 27:

Why do they call it ANOVA?

Consider two similar populations

[Figure: two samples of six observations each, one per population, each with its own total-distance symbol]

Summing the two symbols is the SSW = Sum of Squared distances Within the two samples

Page 28:

Why do they call it ANOVA?

Now pool the two similar populations

[Figure: the two samples of six observations each, and below them the pooled sample of twelve observations with a Blue total-distance symbol]

The Blue symbol represents the sum of squared distances of the total sample to the total sample mean

This is the Total Sum of Squares, or TSS

Page 29:

Why do they call it ANOVA?

Now pool the two similar populations

Note how the SSW and the TSS are about the same: when this happens, it indicates equal means for the populations

Page 30:

Why do they call it ANOVA?

Now note what happens if the population means are different

[Figure: the two samples, now centered at different values, and the pooled sample of twelve observations with its total-distance symbol]

Note how the TSS has greatly increased

Note how the SSW are the same as before (remember, we add the squared distances separately for the two populations)

It is this phenomenon that the F-test measures
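
The picture can be mimicked numerically. In this sketch the "red" sample is fixed, and the "green" sample is either close to it or shifted far away; all numbers are invented for illustration. The SSW is the same in both cases, while the TSS grows when the means differ.

```python
def sum_sq(values, center):
    """Sum of squared distances of the values to a center."""
    return sum((v - center) ** 2 for v in values)


def ssw_and_tss(sample_a, sample_b):
    """SSW and TSS for two samples, following the slide's definitions."""
    mean_a = sum(sample_a) / len(sample_a)
    mean_b = sum(sample_b) / len(sample_b)
    pooled = sample_a + sample_b
    grand_mean = sum(pooled) / len(pooled)
    # SSW: distances to each sample's own mean, added separately.
    ssw = sum_sq(sample_a, mean_a) + sum_sq(sample_b, mean_b)
    # TSS: distances of the pooled sample to the overall mean.
    tss = sum_sq(pooled, grand_mean)
    return ssw, tss


# Hypothetical samples: same spread, different centers.
red = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
green_similar = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5]        # mean 0.5 above red
green_shifted = [11.0, 12.0, 13.0, 14.0, 15.0, 16.0]  # mean 10 above red

ssw_similar, tss_similar = ssw_and_tss(red, green_similar)
ssw_shifted, tss_shifted = ssw_and_tss(red, green_shifted)
print(ssw_similar, tss_similar)
print(ssw_shifted, tss_shifted)
```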

Page 31:

ANOVA

The F-statistic is compared to the F-distribution with t-1 and $n_T - t$ degrees of freedom.

See Table 8, which lists the cutoff points in terms of $\alpha$. If the F-statistic exceeds the cutoff, you reject the hypothesis of equality of all the means.

SPSS gives you the p-value (significance level) for this test
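
For the ACS example the numerator degrees of freedom are t - 1 = 2, and in that special case the upper tail of the F-distribution has a simple closed form, so the cutoff and p-value can be sketched without a table. The helper names below are made up, and the formulas apply only when the numerator degrees of freedom equal 2; for other values of t you would use Table 8 or a statistics package.

```python
def f_upper_tail_num_df_2(f_value, den_df):
    """P(F > f_value) for an F-distribution with 2 and den_df degrees of freedom.

    Closed form valid only when the numerator degrees of freedom are 2.
    """
    return (1 + 2 * f_value / den_df) ** (-den_df / 2)


def f_cutoff_num_df_2(alpha, den_df):
    """Cutoff exceeded with probability alpha under F with 2 and den_df df."""
    return (den_df / 2) * (alpha ** (-2 / den_df) - 1)


# ACS example: F = 5.689 with t - 1 = 2 and n_T - t = 181 degrees of freedom.
p_value = f_upper_tail_num_df_2(5.689, 181)
cutoff_05 = f_cutoff_num_df_2(0.05, 181)

print(f"p-value = {p_value:.4f}")
print(f"5% cutoff = {cutoff_05:.3f}")
print(f"reject equality of means at 5%: {5.689 > cutoff_05}")
```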

Page 32:

ANOVA

I have used the language of the book up to this point, which coincides with the SPSS 1-way ANOVA output.

For our purposes, the general linear model is more useful.

It allows us to compare all the populations simultaneously

It also allows us to check assumptions about normality

Unfortunately, it uses slightly different terminology

Page 33:

ANOVA

For our purposes, the general linear model is more useful.

ANOVA Procedure

ANOVA: Baseline FFQ

Source            Sum of Squares    df   Mean Square       F   Sig.
Between Groups           960.287     2       480.143   5.689   .004
Within Groups          15275.639   181        84.396
Total                  16235.925   183

General Linear Model Procedure: note how F-statistics and p-values are identical

Tests of Between-Subjects Effects (Dependent Variable: Baseline FFQ)

Source             Type III Sum of Squares    df   Mean Square          F   Sig.
Corrected Model                 960.287(a)     2       480.143      5.689   .004
Intercept                    196009.919        1    196009.919   2322.508   .000
BMIGROUP                        960.287        2       480.143      5.689   .004
Error                         15275.639      181        84.396
Total                        226223.216      184
Corrected Total               16235.925      183

a. R Squared = .059 (Adjusted R Squared = .049)
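
The footnote values can be reproduced from the table. The sketch below uses the usual textbook definitions of R squared and adjusted R squared; that SPSS computes the footnote this way is an assumption here, although the numbers do match.

```python
# Values from the "Tests of Between-Subjects Effects" table.
ss_model = 960.287             # BMIGROUP (Corrected Model)
ss_error = 15275.639           # Error
ss_corrected_total = 16235.925
n_total, t = 184, 3

# R squared: share of the corrected total sum of squares explained by the groups.
r_squared = ss_model / ss_corrected_total

# Adjusted R squared: compares mean squares instead of sums of squares.
adj_r_squared = 1 - (ss_error / (n_total - t)) / (ss_corrected_total / (n_total - 1))

print(f"R squared = {r_squared:.3f}")
print(f"Adjusted R squared = {adj_r_squared:.3f}")
```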

Page 34:

ANOVA

For our purposes, the general linear model is more useful.

“Between Groups” in the ANOVA procedure corresponds to “Corrected Model” or the variable name (BMIGROUP) in the general linear model output: each of these rows reads 960.287, 2, 480.143, 5.689, .004

Page 35:

ANOVA

For our purposes, the general linear model is more useful.

“Within Groups” in the ANOVA procedure corresponds to “Error” in the general linear model output: each of these rows reads 15275.639, 181, 84.396

Page 36:

ANOVA

For our purposes, the general linear model is more useful.

“Total” in the ANOVA procedure corresponds to “Corrected Total” in the general linear model output: each of these rows reads 16235.925, 183

Page 37:

ANOVA

If the populations have a common variance $\sigma^2$, the mean squared error estimates it

For 2 populations, the mean squared error equals the square of the pooled sd: $s_P^2$

Hence, the estimated common s.d. is $\sqrt{84.396} = 9.19$, using the Error mean square of 84.396 from the “Tests of Between-Subjects Effects” table
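
Both statements on this slide can be checked in a few lines. The first part takes the square root of the Error mean square from the table; the second part uses two made-up samples (not the ACS data) to show that, with two populations, the mean squared error coincides with the pooled variance. The helper names are hypothetical.

```python
import math
from statistics import variance

# Estimated common s.d. from the Error mean square in the SPSS table.
mse_acs = 84.396
common_sd = math.sqrt(mse_acs)
print(f"estimated common s.d. = {common_sd:.2f}")


def pooled_variance(a, b):
    """Pooled sample variance s_P^2 of two samples."""
    return ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)


def mean_squared_error(groups):
    """SSW / (n_T - t) for a list of samples."""
    ssw = 0.0
    for g in groups:
        m = sum(g) / len(g)
        ssw += sum((y - m) ** 2 for y in g)
    n_total = sum(len(g) for g in groups)
    return ssw / (n_total - len(groups))


# Hypothetical samples for the two-population check.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
pooled = pooled_variance(a, b)
mse_two = mean_squared_error([a, b])
print(pooled, mse_two)
```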

Page 38:

ANOVA

There appears to be some difference

95% CI for difference in population means between high and low BMI groups is from 2.26 to 10.20

95% CI for difference in population means between medium and low BMI groups is from -1.59 to 4.31

95% CI for difference in population means between high and medium BMI groups is from 1.08 to 8.65

Conclusions?
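
One way to approach the question is to ask which intervals exclude 0, since an interval for a difference in means that excludes 0 suggests the two population means differ. The sketch below only restates the three intervals from the slide; the dictionary keys are labels chosen here.

```python
# 95% confidence intervals for differences in population means (from the slide).
intervals = {
    "High vs Low": (2.26, 10.20),
    "Medium vs Low": (-1.59, 4.31),
    "High vs Medium": (1.08, 8.65),
}

# An interval that excludes 0 points to a difference between the two groups.
excludes_zero = {
    pair: not (low <= 0.0 <= high) for pair, (low, high) in intervals.items()
}

for pair, flag in excludes_zero.items():
    print(f"{pair}: interval excludes 0 -> {flag}")
```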

Page 39:

ANOVA

The ANOVA Table (demo in SPSS in class of ACS data)

Page 40:

ANOVA

The ANOVA Table

Number of populations is t=3, so degrees of freedom for the model (BMIGROUP) is t-1 = 2

Page 41:

ANOVA

The ANOVA Table

Total sample size is 184, so the degrees of freedom for the corrected total is $n_T - 1$ = 183
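
All three degrees-of-freedom entries follow from the group sizes in the Between-Subjects Factors table. A minimal sketch, with variable names chosen here for illustration:

```python
# Group sizes from the Between-Subjects Factors table.
group_sizes = {"Low BMI": 62, "Medium BMI": 84, "High BMI": 38}

t = len(group_sizes)                 # number of populations
n_total = sum(group_sizes.values())  # total sample size n_T

df_model = t - 1                     # BMIGROUP (between groups)
df_error = n_total - t               # Error (within groups)
df_corrected_total = n_total - 1     # Corrected Total

print(n_total, df_model, df_error, df_corrected_total)
```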

Page 42:

ANOVA

The mean square between groups is 480.143

The mean square within groups is 84.396

The F-statistic is the ratio: 5.689

Page 43:

ANOVA

The p-value is 0.004: what does this mean?

What was the null hypothesis?

Page 44:

ANOVA

The p-value is 0.004: what does this mean?

What was the null hypothesis? That the population means are all equal

Page 45:

ANOVA

The p-value is 0.004: what does this mean? If the population means were all equal, an F-statistic as large as 5.689 would occur in only about 0.4% of samples. Informally, we have more than 99% confidence that the null hypothesis is false

What was the null hypothesis? That the population means are all equal