Copyright (c) Bani K. Mallick 1
STAT 651
Lecture # 12
Topics in Lecture #12
The ANOVA F-test, and the basics of the F-table
Book Sections Covered in Lecture #12
Chapter 8.1-8.2
Relevant SPSS Tutorials
ANOVA-GLM
Post hoc tests
ANalysis Of VAriance
We now turn to making inferences when there are 3 or more populations
This is classically called ANOVA
It is somewhat more formula dense than what we have been used to.
Tests for normality are also somewhat more complex
ANOVA
Suppose we form three populations on the basis of body mass index (BMI):
BMI < 22, 22 <= BMI < 28, BMI >= 28
This forms 3 populations
We want to know whether the three populations have the same mean caloric intake, or if their food composition differs.
ANOVA
If you do lots of 95% confidence intervals, you’d expect by chance that about 5% will be wrong
Thus, if you do 20 confidence intervals, you expect about 20 x 5% = 1 of them will not include the true population parameter
This is a fact of life
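The arithmetic above can be checked in a few lines. This is a minimal sketch in Python. The second quantity, the chance that at least one interval misses, additionally assumes the 20 intervals are independent; that is an assumption of this sketch and not something the lecture states.

```python
# Expected number of 95% confidence intervals that miss the true parameter.
n_intervals = 20
miss_rate = 0.05  # each 95% interval misses with probability 5%

expected_misses = n_intervals * miss_rate

# ASSUMPTION (not from the lecture): the 20 intervals are independent.
prob_at_least_one_miss = 1 - (1 - miss_rate) ** n_intervals

print(f"expected number of misses: {expected_misses:.2f}")
print(f"P(at least one miss), if independent: {prob_at_least_one_miss:.3f}")
```

Under the independence assumption, the chance of at least one miss is about 64%, which is why a preliminary overall test is attractive before looking at many individual comparisons.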
ANOVA
One procedure that is often followed is to do a preliminary test to see whether there are any differences among the populations
Then, once you conclude that some differences exist, you allow somewhat more informality in deciding where those differences manifest themselves
The first step is the ANOVA F-test
ANOVA
Consider the ACS data, with 3 BMI groups and measuring the % calories from fat (first FFQ)
What is your preliminary conclusion about differences in means/medians?
About differences in variability?
About massive outliers?
[Figure: "ACS Data: % Calories from Fat". Boxplots of Baseline FFQ (vertical axis, 0 to 70) by BMI Group (Low BMI, Medium BMI, High BMI); N = 62, 84, 38 respectively; points labeled 8 and 148 are marked individually.]
ANOVA
Consider the ACS data, with 3 BMI groups and measuring the % calories from fat (first FFQ)
Between-Subjects Factors
BMIGroup   Value Label   N
1.00       Low BMI       62
2.00       Medium BMI    84
3.00       High BMI      38
ANOVA
The F-test is easy to compute, and provided in all statistical packages
The populations are 1, 2, ..., t
The sample sizes are $n_1, \ldots, n_t$
The population means are $\mu_1, \ldots, \mu_t$
The hypothesis to test is $H_0: \mu_1 = \mu_2 = \cdots = \mu_t$
ANOVA
The data from population i are $Y_{i1}, Y_{i2}, \ldots, Y_{in_i}$
The sample mean from population i is $\bar{Y}_{i\cdot}$
The sample mean of all the data is $\bar{Y}_{\cdot\cdot}$
The total sample size is the total number of observations, called $n_T$
ANOVA
The ANOVA Table (demo in SPSS in class of ACS data): “Analyze”, “Compare Means”, “1-way ANOVA”. I’ll now show you what each item is
ANOVA: Baseline FFQ
Source           Sum of Squares   df    Mean Square   F       Sig.
Between Groups   960.287          2     480.143       5.689   .004
Within Groups    15275.639        181   84.396
Total            16235.925        183
ANOVA
The sample mean from population i is $\bar{Y}_{i\cdot}$
The sample mean of all the data is $\bar{Y}_{\cdot\cdot}$
The idea of the F-test is based on distances
The distance of the data to the overall mean is
TSS = Total Sum of Squares
$$\mathrm{TSS} = \sum_{i=1}^{t} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2$$
ANOVA
The distance of the data to the overall mean is
TSS = Total Sum of Squares
$$\mathrm{TSS} = \sum_{i=1}^{t} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2$$
This has $n_T - 1$ degrees of freedom
ANOVA
Next comes the “Between Groups” row
ANOVA
The sum of squares between groups is
$$\mathrm{SSB} = \sum_{i=1}^{t} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$$
It has t-1 degrees of freedom, so the number of populations is the degrees of freedom between groups + 1.
ANOVA
Next comes the “Within Groups” row
ANOVA
The distance of the observations to their sample means is
$$\mathrm{SSW} = \sum_{i=1}^{t} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})^2$$
This is the Sum of Squares Within
It has $n_T - t$ degrees of freedom
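The three sums of squares (total, between groups, within groups) can be computed directly from their definitions. The following is a minimal sketch in Python on made-up data: the numbers in `groups` are hypothetical and are not the ACS data. It also checks the standard identity that the total sum of squares equals the between-groups plus the within-groups sum of squares, and that the degrees of freedom add up the same way.

```python
# Hypothetical data for t = 3 populations (illustration only, not the ACS data).
groups = [
    [30.0, 35.0, 40.0],
    [33.0, 36.0, 39.0, 44.0],
    [41.0, 45.0, 49.0],
]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


all_obs = [y for g in groups for y in g]
grand_mean = mean(all_obs)               # sample mean of all the data
group_means = [mean(g) for g in groups]  # sample mean of each population

# Total: squared distances of every observation to the overall mean.
tss = sum((y - grand_mean) ** 2 for y in all_obs)
# Between groups: squared distances of group means to the overall mean,
# weighted by the group sample sizes.
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
# Within groups: squared distances of observations to their own group mean.
ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)

n_T, t = len(all_obs), len(groups)
print(f"TSS = {tss:.3f} on {n_T - 1} df")
print(f"SSB = {ssb:.3f} on {t - 1} df")
print(f"SSW = {ssw:.3f} on {n_T - t} df")
print(f"SSB + SSW = {ssb + ssw:.3f}")
```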
ANOVA
Next comes the “Mean Squares”
These are the different sums of squares divided by their degrees of freedom
ANOVA
Next comes the F-statistic
It is the ratio of the mean square between groups to the mean square within groups
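The mean squares and the F-statistic in the SPSS output can be reproduced from its sums of squares and the group sizes. This is a small sketch in Python: the numbers are copied from the output shown in these slides, and the variable names are illustrative.

```python
# Values copied from the SPSS 1-way ANOVA output for the ACS data.
ss_between = 960.287
ss_within = 15275.639
group_sizes = [62, 84, 38]  # Low, Medium, High BMI

t = len(group_sizes)    # number of populations
n_T = sum(group_sizes)  # total sample size
df_between = t - 1
df_within = n_T - t

# A mean square is a sum of squares divided by its degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# The F-statistic is the ratio of the two mean squares.
f_stat = ms_between / ms_within

print(f"df: between = {df_between}, within = {df_within}, total = {n_T - 1}")
print(f"mean squares: between = {ms_between:.3f}, within = {ms_within:.3f}")
print(f"F = {f_stat:.3f}")
```

Up to rounding, this gives the Mean Square and F columns of the table (480.143, 84.396 and 5.689).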
ANOVA
The F-statistic is
$$F = \frac{\mathrm{SSB}/(t-1)}{\mathrm{SSW}/(n_T - t)} = \frac{\sum_{i} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 / (t-1)}{\sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_{i\cdot})^2 / (n_T - t)}$$
ANOVA
The F-statistic is
$$F = \frac{\sum_{i} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 / (t-1)}{\sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_{i\cdot})^2 / (n_T - t)}$$
What values (large or small) indicate differences?
Clearly large, since if the population means are equal, the sample means will be close, and the numerator will be near 0
Why do they call it ANOVA?
ANOVA = ANalysis Of VAriance
I want to show you the concept in graphs, because these become important in STAT 652
I will illustrate the idea with samples from two populations
The first population will be in red
The second in green
When pooled I will use blue
Why do they call it ANOVA?
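The data from the first population
[Figure: six observations Y from the first population on a line, with the sample mean $\bar{Y}_{1\cdot}$ marked]
The total distance of the observations to their sample mean is
$$\sum_{j=1}^{n_1} (Y_{1j} - \bar{Y}_{1\cdot})^2$$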
Why do they call it ANOVA?
I will use this funny symbol to denote total distance
[Figure: the six observations Y from the first population, with the distance symbol drawn over them]
The total distance of the observations to their sample mean is
$$\sum_{j=1}^{n_1} (Y_{1j} - \bar{Y}_{1\cdot})^2$$
Why do they call it ANOVA?
Consider two similar populations
[Figure: six observations Y from each of the two populations, each sample with its own distance symbol]
Summing the two symbols gives the SSW = Sum of Squared distances Within the two samples
Why do they call it ANOVA?
Now pool the two similar populations
[Figure: the two samples of six observations Y, and below them the pooled sample of twelve observations in blue]
The blue symbol represents the sum of squared distances of the total sample to the total sample mean
This is the Total Sum of Squares, or TSS
Note how the SSW and the TSS are about the same: when this happens, it indicates equal means for the populations
Why do they call it ANOVA?
Now note what happens if the population means are different
[Figure: the two samples of six observations Y, now centered at different locations, and below them the pooled sample of twelve observations]
Note how the TSS has greatly increased
Note how the SSW is the same as before (remember, we add the squared distances separately for the two populations)
It is this phenomenon that the F-test measures
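The picture can be mimicked numerically. This is a minimal sketch in Python with two made-up samples of six observations each (the values are hypothetical). Shifting the second sample changes its mean but not its spread, so the SSW stays the same while the TSS of the pooled sample grows.

```python
def sum_sq(values):
    """Sum of squared distances of the values to their own sample mean."""
    m = sum(values) / len(values)
    return sum((y - m) ** 2 for y in values)


# Hypothetical samples (illustration only).
first = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]               # the "red" population
second_similar = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5]      # the "green" population
second_shifted = [y + 10.0 for y in second_similar]  # same spread, different mean

# SSW adds the squared distances separately for the two samples.
ssw_similar = sum_sq(first) + sum_sq(second_similar)
ssw_shifted = sum_sq(first) + sum_sq(second_shifted)

# TSS uses the pooled ("blue") sample and its single overall mean.
tss_similar = sum_sq(first + second_similar)
tss_shifted = sum_sq(first + second_shifted)

print(f"similar means:   SSW = {ssw_similar:.2f}, TSS = {tss_similar:.2f}")
print(f"different means: SSW = {ssw_shifted:.2f}, TSS = {tss_shifted:.2f}")
```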
ANOVA
The F-statistic is compared to the F-distribution with t-1 and $n_T - t$ degrees of freedom.
See Table 8, which lists the cutoff points in terms of $\alpha$. If the F-statistic exceeds the cutoff, you reject the hypothesis of equality of all the means.
SPSS gives you the p-value (significance level) for this test
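For the ACS example the comparison with the F-distribution can be checked without a table. The sketch below is in Python and uses a closed form rather than the book's table or SPSS: when the numerator degrees of freedom equal 2 (that is, t = 3 populations), the upper tail of the F-distribution is $(1 + 2f/d_2)^{-d_2/2}$, where $d_2$ is the denominator degrees of freedom. The function names are illustrative, and the formula does not apply for other numerator degrees of freedom.

```python
# F-statistic and degrees of freedom from the SPSS output for the ACS data.
f_stat = 5.689
df1, df2 = 2, 181  # t - 1 and n_T - t


def f_upper_tail_df1_is_2(f, df2):
    """P(F > f) for an F-distribution with 2 and df2 degrees of freedom.

    Valid ONLY when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f / df2) ** (-df2 / 2.0)


def f_cutoff_df1_is_2(alpha, df2):
    """Cutoff c with P(F > c) = alpha, again only for numerator df = 2."""
    return (df2 / 2.0) * (alpha ** (-2.0 / df2) - 1.0)


p_value = f_upper_tail_df1_is_2(f_stat, df2)
cutoff_05 = f_cutoff_df1_is_2(0.05, df2)

print(f"p-value = {p_value:.4f}")
print(f"5% cutoff = {cutoff_05:.3f}; reject equal means: {f_stat > cutoff_05}")
```

The p-value rounds to .004, in agreement with the Sig. column of the SPSS output.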
ANOVA
I have used the language of the book up to this point, which coincides with the SPSS 1-way ANOVA output.
For our purposes, the general linear model is more useful.
It allows us to compare all the populations simultaneously
It also allows us to check assumptions about normality
Unfortunately, it uses slightly different terminology
ANOVA
For our purposes, the general linear model is more useful.
ANOVA Procedure
ANOVA: Baseline FFQ
Source           Sum of Squares   df    Mean Square   F       Sig.
Between Groups   960.287          2     480.143       5.689   .004
Within Groups    15275.639        181   84.396
Total            16235.925        183
General Linear Model Procedure: note how F-statistics and p-values are identical
Tests of Between-Subjects Effects
Dependent Variable: Baseline FFQ
Source            Type III Sum of Squares   df    Mean Square   F          Sig.
Corrected Model   960.287 (a)               2     480.143       5.689      .004
Intercept         196009.919                1     196009.919    2322.508   .000
BMIGROUP          960.287                   2     480.143       5.689      .004
Error             15275.639                 181   84.396
Total             226223.216                184
Corrected Total   16235.925                 183
a. R Squared = .059 (Adjusted R Squared = .049)
ANOVA
For our purposes, the general linear model is more useful.
The rows of the two outputs match up as follows (1-way ANOVA name = general linear model name):
Between Groups = Corrected Model, or the variable name (BMIGROUP)
Within Groups = Error
Total = Corrected Total
ANOVA
If the populations have a common variance $\sigma^2$, the mean squared error estimates it
For 2 populations, the mean squared error equals the square of the pooled sd: $s_P^2$
Here the mean squared error (the Mean Square in the Error row of the general linear model output) is 84.396
Hence, the estimated common s.d. is $\sqrt{84.396} = 9.19$
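Both statements can be checked numerically. This is a minimal sketch in Python: the two samples are made-up values used only to show that, with 2 populations, the mean squared error SSW / (n_T - 2) coincides with the usual pooled variance; the last lines take the square root of the mean squared error reported for the ACS data.

```python
import math
import statistics

# Hypothetical two-sample data (illustration only, not the ACS data).
sample_1 = [31.0, 36.0, 40.0, 45.0]
sample_2 = [28.0, 33.0, 35.0, 39.0, 44.0]
n1, n2 = len(sample_1), len(sample_2)

# Pooled variance from the two sample variances.
pooled_var = (
    (n1 - 1) * statistics.variance(sample_1)
    + (n2 - 1) * statistics.variance(sample_2)
) / (n1 + n2 - 2)


def sum_sq(values):
    """Sum of squared distances of the values to their own sample mean."""
    m = statistics.mean(values)
    return sum((y - m) ** 2 for y in values)


# Mean squared error = SSW / (n_T - t), with t = 2 populations here.
mse = (sum_sq(sample_1) + sum_sq(sample_2)) / (n1 + n2 - 2)
print(f"pooled variance = {pooled_var:.4f}, mean squared error = {mse:.4f}")

# ACS data: estimated common s.d. from the Error mean square in the SPSS output.
common_sd_acs = math.sqrt(84.396)
print(f"estimated common s.d. = {common_sd_acs:.2f}")
```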
ANOVA
There appears to be some difference
95% CI for difference in population means between high and low BMI groups is from 2.26 to 10.20
95% CI for difference in population means between medium and low BMI groups is from -1.59 to 4.31
95% CI for difference in population means between high and medium BMI groups is from 1.08 to 8.65
Conclusions?
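One way to read the three intervals, sketched below in Python, is to check whether each one contains 0: an interval for a difference in means that excludes 0 points to a difference between those two populations. The slide leaves the conclusion open, so this reading is an assumption of the sketch, and it makes no adjustment for looking at three intervals at once.

```python
# 95% confidence intervals for differences in population means (from the slide).
intervals = {
    ("High", "Low"): (2.26, 10.20),
    ("Medium", "Low"): (-1.59, 4.31),
    ("High", "Medium"): (1.08, 8.65),
}

# An interval that excludes 0 suggests the two population means differ.
excludes_zero = {
    pair: not (lower <= 0.0 <= upper)
    for pair, (lower, upper) in intervals.items()
}

for (a, b), differs in excludes_zero.items():
    verdict = "excludes 0" if differs else "contains 0"
    print(f"{a} vs {b} BMI: interval {intervals[(a, b)]} {verdict}")
```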
ANOVA
The ANOVA Table (demo in SPSS in class of ACS data)
Tests of Between-Subjects Effects
Dependent Variable: Baseline FFQ
Source            Type III Sum of Squares   df    Mean Square   F          Sig.
Corrected Model   960.287 (a)               2     480.143       5.689      .004
Intercept         196009.919                1     196009.919    2322.508   .000
BMIGROUP          960.287                   2     480.143       5.689      .004
Error             15275.639                 181   84.396
Total             226223.216                184
Corrected Total   16235.925                 183
a. R Squared = .059 (Adjusted R Squared = .049)
ANOVA
The ANOVA Table
Number of populations is t = 3, so the degrees of freedom for the model (BMIGROUP) is t-1 = 2
ANOVA
The ANOVA Table
Total sample size is $n_T = 184$, so the degrees of freedom for the corrected total is $n_T - 1 = 183$
ANOVA
The mean square between groups is 480.143
The mean square within groups is 84.396
The F-statistic is the ratio: 480.143 / 84.396 = 5.689
ANOVA
The p-value is 0.004: what does this mean? If the population means were all equal, an F-statistic of 5.689 or larger would occur only about 0.4% of the time, so we reject the null hypothesis at the 1% level
What was the null hypothesis? That the population means are all equal