ANOVA: Analysis of Variance

Xuhua Xia

[email protected]

http://dambe.bio.uottawa.ca

Head of the Statistics Division at the Rothamsted Experimental Station in Hertfordshire. One of the three founders of theoretical population genetics. Developer of statistical methods, especially likelihood methods. Published The Genetical Theory of Natural Selection in 1930, in which he proposed the fundamental theorem of natural selection.

“To call in a statistician after the experiment is done may be no more than asking him to perform a postmortem examination; he may be able to say what the experiment died of.”

Ronald A. Fisher (1890-1962)

Analysis of Variance (ANOVA)

• ANOVA was mainly developed by Ronald A. Fisher.
• The F statistic was named after him.
• The essence of ANOVA is to partition the total variation into its components.
• Assumptions
  – Normality
  – Equal variance among treatment groups
• Alternative methods

One-way ANOVA Model

$x_{ij} = \mu + \alpha_i + \varepsilon_{ij}$ vs. $x_{ij} = \mu + \varepsilon_{ij}$

Is this effect ($\alpha_i$) zero?

This is the same model as for the t-test, except that the subscript i is 1 and 2 in the t-test, but 1, 2, ..., n in one-way ANOVA.

t-test and ANOVA

Male   Female
193    175
188    173
185    168
183    165
180    163
178
170

               Male       Female
n              7          5
Mean           182.4286   168.8
SS             329.7143   104.8
Pooled Var     43.45143
Pooled SE      3.859745
t              3.530951
df             10
P              0.0054
Equal Var.?    P = 0.4939

Groups   Count   Sum    Average    Variance
Male     7       1277   182.4286   54.95238
Female   5       844    168.8      26.2

ANOVA
Source           SS         df   MS         F          P-value
Between Groups   541.7357   1    541.7357   12.46762   0.005438
Within Groups    434.5143   10   43.45143
Total            976.25     11
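The link between the two analyses can be checked numerically. The sketch below is a minimal illustration, not part of the original slides: it recomputes the pooled-variance t statistic and the one-way ANOVA F statistic from the height data and shows that, with two groups, F equals t squared. The helper names `mean` and `sum_sq` are chosen here for illustration.

```python
from math import sqrt

male = [193, 188, 185, 183, 180, 178, 170]
female = [175, 173, 168, 165, 163]

def mean(xs):
    return sum(xs) / len(xs)

def sum_sq(xs):
    """Sum of squared deviations from the group mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

# Pooled-variance two-sample t statistic
df_within = len(male) + len(female) - 2
pooled_var = (sum_sq(male) + sum_sq(female)) / df_within
pooled_se = sqrt(pooled_var * (1 / len(male) + 1 / len(female)))
t = (mean(male) - mean(female)) / pooled_se

# One-way ANOVA F statistic for the same two groups
grand = mean(male + female)
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in (male, female))
df_between = 2 - 1
f = (ss_between / df_between) / pooled_var  # MS within is the pooled variance

print(round(t, 4), round(t ** 2, 4), round(f, 4))
```

The within-groups mean square in the ANOVA table is the pooled variance of the t-test, which is why the two P values agree.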

Variance and Sum of Squares

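$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$

$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$

The numerator of $s^2$ is the Sum of Squared Deviations; the denominator is the Degree of Freedom.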
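As a quick check of the definition, the sketch below computes the sum of squared deviations, the degrees of freedom, and the variance for the seven male heights used in the t-test example. The variable names are illustrative; `statistics.variance` from the standard library is used only as a cross-check.

```python
import statistics

heights = [193, 188, 185, 183, 180, 178, 170]  # male heights from the t-test example

n = len(heights)
x_bar = sum(heights) / n
ss = sum((x - x_bar) ** 2 for x in heights)  # sum of squared deviations
df = n - 1                                   # degrees of freedom
s2 = ss / df                                 # sample variance

print(round(ss, 4), df, round(s2, 5))
print(round(statistics.variance(heights), 5))  # same value from the standard library
```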

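Partition of Variance

[Figure: three group means ($\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$) and the grand mean ($\bar{X}$), with the within-group deviation and the between-group deviation marked.]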

Numerical Illustration of One-Way ANOVA

Treatment     Low-fat food            Medium-fat food   High-fat food
Weight Gain   0                       4                 8
              2                       6                 10
Mean          1                       5                 9
SSB           2(1-5)^2 = 32           2(5-5)^2 = 0      2(9-5)^2 = 32
SSW           (0-1)^2 + (2-1)^2 = 2   2                 2

Grand Mean = (0 + 2 + 4 + … + 10) / 6 = 5
SST = (0-5)^2 + (2-5)^2 + … + (10-5)^2 = 70, with df = 5
SSB = 32 + 0 + 32 = 64, with df = 2
SSW = 2 + 2 + 2 = 6, with df = 3
MSB = 64/2 = 32
MSW = 6/3 = 2
F = MSB/MSW = 16, DFnum = 2, DFdenom = 3, p = 0.0251

Now repeat the ANOVA computation with the following observations added (shown in red on the slide): 1 and 1 for low-fat food, 5 and 5 for medium-fat food, 9 and 9 for high-fat food. Email me SSB, SSW, DFnum, and DFdenom.
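The hand calculation for the original six observations can be sketched in a few lines of Python. The function `one_way_anova` is a hypothetical helper written for this illustration. The p-value line uses a closed form of the F upper tail that holds only when the numerator df is 2, which happens to be the case here; it is not a general F-distribution routine.

```python
def one_way_anova(groups):
    """Return (SSB, SSW, df_between, df_within, F) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    ssb = 0.0
    ssw = 0.0
    for g in groups:
        m = sum(g) / len(g)
        ssb += len(g) * (m - grand_mean) ** 2   # between-group part
        ssw += sum((x - m) ** 2 for x in g)     # within-group part
    df_b = len(groups) - 1
    df_w = len(all_values) - len(groups)
    f = (ssb / df_b) / (ssw / df_w)
    return ssb, ssw, df_b, df_w, f

low, medium, high = [0, 2], [4, 6], [8, 10]
ssb, ssw, df_b, df_w, f = one_way_anova([low, medium, high])

# Upper-tail probability of F; this closed form is valid only for df_b == 2.
p = (1 + df_b * f / df_w) ** (-df_w / 2)

print(ssb, ssw, df_b, df_w, f, round(p, 4))
```

Calling `one_way_anova` with other groups of observations follows the same pattern.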

ANOVA Table

Dependent variable: Weight Gain

Source   DF   SS     MS     F      p
Model    2    64.0   32.0   16.0   0.0251
Error    3    6.0    2.0
Total    5    70.0

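F-distribution

[Table: repeated pairs of samples with columns Mean1, $s_1^2$, Mean2, $s_2^2$, and the variance ratio $s_1^2/s_2^2$ (for example 3/3 = 1 and 3/2 = 1.5), giving ratios such as 1.4, 1.6, 0.6, ..., 2.4, 3.0, 2.6, 2.9.]

[Figure: Empirical F distribution. x-axis: F (0 to 3.5); y-axis: f (0 to 0.8).]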
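An empirical F distribution of this kind can be generated by simulation. The sketch below is an illustration under assumptions not given on the slide: both samples are drawn from the same standard normal population, each sample has 5 observations, and 5000 pairs are drawn. The function name `variance_ratio` is hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def variance_ratio(n1, n2):
    """Draw two samples from the same normal population; return s1^2 / s2^2."""
    a = [random.gauss(0.0, 1.0) for _ in range(n1)]
    b = [random.gauss(0.0, 1.0) for _ in range(n2)]
    return statistics.variance(a) / statistics.variance(b)

ratios = [variance_ratio(5, 5) for _ in range(5000)]

# Crude text histogram of the empirical F distribution on [0, 3.5)
bin_width = 0.5
for i in range(7):
    lower = i * bin_width
    count = sum(lower <= r < lower + bin_width for r in ratios)
    print(f"{lower:3.1f}-{lower + bin_width:3.1f} {'#' * (count // 50)}")
```

Large ratios are rare when the two population variances are equal, which is what the F-test exploits.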

One-way experimental design

              Low-fat food   Medium-fat food   High-fat food
Weight gain   0              4                 8
              2              6                 10

The null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$ is rejected. The three kinds of food differ significantly in their effect on weight gain of rabbits. In particular, Medium-fat and High-fat foods are significantly better than Low-fat food. However, Medium-fat and High-fat foods do not differ in their effect on rabbit weight gain.

Assumptions

Original data (two groups of 8 observations):

Group 1   Group 2
75        82
76        80
80        85
77        85
80        78
77        87
73        82
77        82

            Group 1   Group 2   Subtotal
n           8         8
Mean        76.875    82.625
Var         5.554     8.554
GrandMean   79.750    79.750
SST         231.000
SSB         66.125    66.125    132.250
SSW         38.875    59.875    98.750
dfT         15
dfB         1
dfW         14
MSB         132.250
MSW         7.054
F = 18.749, P = 0.0007

Same data with one extra observation (200) added to the second group:

            Group 1   Group 2     Subtotal
n           8         9
Mean        76.875    95.667
Var         5.554     1538.250
GrandMean   86.824    86.824
SST         13840.471
SSB         791.786   703.810     1495.596
SSW         38.875    12306.000   12344.875
dfT         16
dfB         1
dfW         15
MSB         1495.596
MSW         822.992
F = 1.817, P = 0.1976
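The effect of the single extreme value can be reproduced with a short sketch. `f_statistic` is a hypothetical helper written for this illustration, and the labels `group1` and `group2` are assumed names for the two columns of data.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic: MS between divided by MS within."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_b = len(groups) - 1
    df_w = len(all_values) - len(groups)
    return (ssb / df_b) / (ssw / df_w)

group1 = [75, 76, 80, 77, 80, 77, 73, 77]
group2 = [82, 80, 85, 85, 78, 87, 82, 82]

f_clean = f_statistic([group1, group2])
f_outlier = f_statistic([group1, group2 + [200]])  # one extreme observation added

print(round(f_clean, 3), round(f_outlier, 3))
```

The outlier raises the difference between the group means, yet F drops sharply because the within-group mean square is inflated far more. This is why unequal variances and non-normal data matter for ANOVA.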

Paired-sample t-test

How should we allocate the two crop varieties to the plots? What comparison would be fair?

[Figure: four blocks (Block 1 to Block 4), each containing plots labelled 1 or 2 for the two varieties.]

Using blocks to reduce confounding environmental factors (everything else being equal except for the treatment effect) in evaluating the protein content of two wheat varieties.

Randomized Complete Blocks: Plots

The three crop varieties are randomly allocated to the plots within each block.

[Figure: four blocks (Block 1 to Block 4), each divided into numbered plots.]

Using blocks to reduce confounding environmental factors (everything else being equal except for the treatment effect).

Which of the six strains of clover has the highest protein content? The experimenter divided his field into 5 relatively homogenous blocks each with 6 plots, and randomly assigned his 6 strains to the 6 plots within each block. After harvesting, he determined the nitrogen content for each strain in each plot.

Randomized complete blocks

[Figure: field layout of five blocks (Block 1 to Block 5), each with six plots randomly assigned to the strains 3dok1, 3dok4, 3dok5, 3dok7, 3dok13, and compos.]

If only two strains:

[Figure: five blocks, each containing one plot of 3dok13 and one plot of 3dok4.]

Bartlett’s Test

Feed 1   Feed 2   Feed 3   Feed 4
60.8     68.7     102.6    87.9
57       67.7     102.1    84.2
65       74       100.2    83.1
58.6     66.3     96.5     85.7
61.7     69.8              90.3

k = 4 (number of groups)

            Feed 1     Feed 2     Feed 3     Feed 4     SUM
n           5          5          4          5          19
SS          37.568     34.26      22.97      33.552     128.35
v           4          4          3          4          15
Inverse v   0.25       0.25       0.333333   0.25       1.083333
Var         9.392      8.565      7.656667   8.388
lnVar       2.239858   2.147684   2.035577   2.126802
v*lnVar     8.959433   8.590737   6.10673    8.507208   32.16411

PooledVar     8.556667
lnPooledVar   2.146711
B             0.036552   (more accurate than that in Zar (1996))
C             1.112963
Bc            0.032842
P             0.998433

The null hypothesis for the F-test (or variance ratio test):

$H_0: \sigma_1^2 = \sigma_2^2$.

The null hypothesis for Bartlett’s or Levene’s test with k groups:

$H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$.

The formulae in this sheet use defined variables in EXCEL:

Insert|name|define
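The spreadsheet quantities can be recomputed as follows. This is a minimal sketch using the standard Bartlett formulas, with v denoting the degrees of freedom of each group as in the table; the variable names are illustrative. The p-value line uses a closed form of the chi-square upper tail that is valid only for 3 degrees of freedom (k - 1 = 3 here), not a general chi-square routine.

```python
from math import erfc, exp, log, pi, sqrt

feeds = [
    [60.8, 57.0, 65.0, 58.6, 61.7],   # Feed 1
    [68.7, 67.7, 74.0, 66.3, 69.8],   # Feed 2
    [102.6, 102.1, 100.2, 96.5],      # Feed 3
    [87.9, 84.2, 83.1, 85.7, 90.3],   # Feed 4
]

def sum_sq(xs):
    """Sum of squared deviations from the group mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

k = len(feeds)
v = [len(g) - 1 for g in feeds]                        # df per group
variances = [sum_sq(g) / vi for g, vi in zip(feeds, v)]
pooled_var = sum(sum_sq(g) for g in feeds) / sum(v)

# Bartlett statistic, correction factor, and corrected statistic
B = sum(v) * log(pooled_var) - sum(vi * log(s2) for vi, s2 in zip(v, variances))
C = 1 + (sum(1 / vi for vi in v) - 1 / sum(v)) / (3 * (k - 1))
Bc = B / C

# Chi-square upper tail; this closed form holds only for exactly 3 df.
p = erfc(sqrt(Bc / 2)) + sqrt(2 * Bc / pi) * exp(-Bc / 2)

print(round(B, 6), round(C, 6), round(Bc, 6), round(p, 6))
```

A P value this close to 1 gives no evidence against equal variances among the four feeds.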

Do Six Strains of Clover Differ?

Analysis of Variance Procedure

Class    Levels   Values
STRAIN   6        3dok1 3dok13 3dok4 3dok5 3dok7 compos

Number of observations in data set = 30

Dependent Variable: NITROGEN

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             5    847.046667       169.409333    14.37     0.0001
Error             24   282.928000       11.788667
Corrected Total   29   1129.974667

R-Square   C.V.       Root MSE   NITROGEN Mean
0.749616   17.26515   3.43346    19.8867

Multiple Comparison

Duncan's Multiple Range Test for variable: NITROGEN

NOTE: This test controls the type I comparisonwise error rate, not the experimentwise error rate.

Alpha = 0.05   df = 24   MSE = 11.78867

Number of Means   2       3       4       5       6
Critical Range    4.482   4.707   4.852   4.954   5.031

Means with the same letter are not significantly different. Means are arranged in descending order.

Duncan Grouping   Mean     N   STRAIN
A                 28.820   5   3dok1
B                 23.980   5   3dok5
C B               19.920   5   3dok7
C D               18.700   5   compos
E D               14.640   5   3dok4
E                 13.260   5   3dok13

Comparisonwise & Experimentwise Errors

• The Type I comparisonwise error rate is the probability of a Type I error for an individual test of hypothesis, symbolized by $\alpha_c$.

• The Type I experimentwise error rate is the probability of making at least one Type I error for a set of hypothesis tests, symbolized by $\alpha_e$.

• If $\alpha_c = 0.05$ and N hypotheses are tested, then $\alpha_e \approx 1 - (1 - \alpha_c)^N$.

• With 5 treatments, there are a total of 10 pairwise comparisons between means. Thus, $\alpha_c = 0.05$ would imply $\alpha_e \approx 0.40$. That is, if all means are in fact equal, there is roughly a probability of 0.4 that at least one hypothesis will be incorrectly rejected.

• If we are to control the experimentwise error rate at 0.05, we can set $\alpha_e = 0.05$:

• $\alpha_e \approx 1 - (1 - \alpha_c)^N = 1 - (1 - \alpha_c)^{10} = 0.05$

• and solve the equation, which yields $\alpha_c \approx 0.005$. This of course makes it harder to reject a null hypothesis, even if the null hypothesis is false.
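The arithmetic in these bullets can be verified with a few lines. This is a sketch; the variable names are illustrative, and the formula treats the tests as independent, as the approximation in the text does.

```python
alpha_c = 0.05                 # comparisonwise error rate
k = 5                          # number of treatments
n_tests = k * (k - 1) // 2     # number of pairwise comparisons

# Experimentwise error rate when each test is run at alpha_c
alpha_e = 1 - (1 - alpha_c) ** n_tests

# Comparisonwise rate needed to hold the experimentwise rate at 0.05
alpha_c_needed = 1 - (1 - 0.05) ** (1 / n_tests)

print(n_tests, round(alpha_e, 4), round(alpha_c_needed, 4))
```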

SAS output: I

Dependent Variable: nitrogen

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             9    1045.201333      116.133481    27.40     <.0001
Error             20   84.773333        4.238667
Corrected Total   29   1129.974667

R-Square   Coeff Var   Root MSE   nitrogen Mean
0.924978   10.35268    2.058802   19.88667

Source   DF   Anova SS      Mean Square   F Value   Pr > F
strain   5    847.0466667   169.4093333   39.97     <.0001
Block    4    198.1546667   49.5386667    11.69     <.0001

Multiple Comparison

Duncan's Multiple Range Test for nitrogen

NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

Alpha = 0.05   DFE = 20   MSE = 4.238667

Number of Means   2       3       4       5       6
Critical Range    2.716   2.851   2.937   2.997   3.041

Means with the same letter are not significantly different.

Duncan Grouping   Mean     N   strain
A                 28.820   5   3dok1
B                 23.980   5   3dok5
C                 19.920   5   3dok7
C                 18.700   5   compos
D                 14.640   5   3dok4
D                 13.260   5   3dok13

Ex. ANOVA with repeated measures

Subjects   Drug 1   Drug 2   Drug 3
1          164      152      178
2          202      181      222
3          143      136      132
4          210      194      216
5          228      219      245
6          173      159      182
7          161      157      165

What is the treatment effect? What is the block?

Analyze the data with SAS. Write a concise 1-page report. Submit at the beginning of the next class in hardcopy.

Two-way experimental design

Testing the effect of food and sex on rabbit food consumption

Mean food consumed:

         Fresh food   Rancid food
Male     695.67       535.33
Female   642.67       517.33

Food consumed (individual observations):

         Fresh food      Rancid food
Male     709, 679, 699   592, 538, 476
Female   657, 594, 677   508, 505, 539

Dependent Variable: CONSUMED

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             3    65903.5833       21967.8611    15.06     0.0012
Error             8    11666.6667       1458.3333
Corrected Total   11   77570.2500

R-Square   C.V.       Root MSE   CONSUMED Mean
0.849599   6.388646   38.1881    597.750

Source     DF   Anova SS     Mean Square   F Value   Pr > F
FOOD       1    61204.0833   61204.0833    41.97     0.0002
SEX        1    3780.7500    3780.7500     2.59      0.1460
FOOD*SEX   1    918.7500     918.7500      0.63      0.4503

What is the interaction effect?
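To see where the three sums of squares come from, the sketch below recomputes them from the rabbit data for a balanced two-way design. It is an illustration, not the SAS algorithm: the interaction sum of squares is obtained as the cell (model) sum of squares minus the two main-effect sums of squares, and the dictionary layout and variable names are chosen here for readability.

```python
# cells[food][sex] -> replicate observations (balanced design)
cells = {
    "fresh":  {"male": [709, 679, 699], "female": [657, 594, 677]},
    "rancid": {"male": [592, 538, 476], "female": [508, 505, 539]},
}

def mean(xs):
    return sum(xs) / len(xs)

foods = list(cells)
sexes = list(cells[foods[0]])
r = len(cells[foods[0]][sexes[0]])  # replicates per cell
all_obs = [x for fd in foods for sx in sexes for x in cells[fd][sx]]
grand = mean(all_obs)

food_means = [mean([x for sx in sexes for x in cells[fd][sx]]) for fd in foods]
sex_means = [mean([x for fd in foods for x in cells[fd][sx]]) for sx in sexes]

ss = {}
ss["FOOD"] = r * len(sexes) * sum((m - grand) ** 2 for m in food_means)
ss["SEX"] = r * len(foods) * sum((m - grand) ** 2 for m in sex_means)
ss_cells = r * sum((mean(cells[fd][sx]) - grand) ** 2
                   for fd in foods for sx in sexes)
ss["FOOD*SEX"] = ss_cells - ss["FOOD"] - ss["SEX"]  # what the main effects leave over
ss_error = sum((x - mean(cells[fd][sx])) ** 2
               for fd in foods for sx in sexes for x in cells[fd][sx])

df = {"FOOD": len(foods) - 1, "SEX": len(sexes) - 1,
      "FOOD*SEX": (len(foods) - 1) * (len(sexes) - 1)}
ms_error = ss_error / (len(all_obs) - len(foods) * len(sexes))
f_values = {name: (ss[name] / df[name]) / ms_error for name in ss}

for name in ss:
    print(f"{name:9s} SS = {ss[name]:11.4f}  F = {f_values[name]:6.2f}")
```

The interaction term measures how much the cell means depart from what the two main effects alone would predict.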

What is Interaction?

When the effect of FOOD is independent of SEX, e.g., when fresh food is preferred by both males and females to the same extent, then there is no interaction effect. When the effect of FOOD depends on SEX, e.g., when males eat more rancid food than fresh food but females eat less rancid food than fresh food, then there is an interaction effect.

[Figure: two line plots of Consumption (y-axis) against Sex (x-axis: Male, Female), each with one line for Fresh food and one for Rancid food.]

Interaction Effect: Example

Mean food consumed:

         Fresh food   Rancid food
Male     568.67       695.67
Female   642.67       517.33

Food consumed (individual observations):

         Fresh food      Rancid food
Male     592, 538, 576   709, 679, 699
Female   657, 594, 677   508, 505, 539

Significant Interaction

Dependent Variable: CONSUMED

Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model    3    55920.2500       18640.0833    23.06     0.0003
Error    8    6466.6667        808.3333
Total    11   62386.9167

R-Square   C.V.       Root MSE   CONSUMED Mean
0.896346   4.690973   28.4312    606.083

Source     DF   Anova SS     Mean Square   F Value   Pr > F
FOOD       1    47754.0833   47754.0833    59.08     0.0001
SEX        1    2.0833       2.0833        0.00      0.9608
FOOD*SEX   1    8164.0833    8164.0833     10.10     0.0130

Can we conclude that SEX has no effect on food consumption?

SAS Program for two-way ANOVA

proc format;
  value sexLevel 1='male' 2='female';
  value foodLevel 1='fresh' 2='rancid';
data assign63;
  /* the nested loops generate the factor levels for each value read */
  do food=1 to 2;
    do sex=1 to 2;
      do n=1 to 3;
        input Consumed @@;
        output;
      end;
    end;
  end;
  format sex sexLevel. food foodLevel.;
cards;
709 679 699 657 594 677
592 538 476 508 505 539
;
proc anova;
  class food sex;
  model Consumed=food|sex;   /* main effects plus the food*sex interaction */
  means food / duncan;
run;

Ex.

1. Rewrite the “data” block of the SAS program by using:

data assign63;
  input food sex consumed;
cards;
......
;

2. Run the resulting program to check if the rewriting is correct.

Three-way ANOVA

Mean food consumed:

Race        Sex      Fresh   Rancid
Short-ear   Male     647.5   515.5
            Female   611     500.5
Long-ear    Male     706     594.5
            Female   652.5   548

Food consumed (individual observations):

Race        Sex      Fresh      Rancid
Short-ear   Male     650, 645   511, 520
            Female   610, 612   500, 501
Long-ear    Male     700, 712   601, 588
            Female   650, 655   550, 546

SAS Program

proc format;   /* optional, but will increase clarity in the output */
  value sex 1='male' 2='female';
  value food 1='fresh' 2='rancid';
  value race 1='short-ear' 2='long-ear';
data assign71;
  input race sex food Consumed;
  format sex sex. food food. race race.;   /* optional; must be inside the data step */
cards;
1 1 1 650
1 1 1 645
1 1 2 511
1 1 2 520
1 2 1 610
1 2 1 612
1 2 2 500
1 2 2 501
2 1 1 700
2 1 1 712
2 1 2 601
2 1 2 588
2 2 1 650
2 2 1 655
2 2 2 550
2 2 2 546
;
proc anova;
  class food sex race;
  model Consumed=food|sex|race;
run;

The semicolon that ends the data needs to be on a new line, i.e., not

2 2 2 546;

ANOVA Table

Dependent Variable: CONSUMED

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             7    72138.4375       10305.4911    354.60    0.0001
Error             8    232.5000         29.0625
Corrected Total   15   72370.9375

R-Square   C.V.       Root MSE   CONSUMED Mean
0.996787   0.903104   5.39096    596.938

Source          DF   Anova SS     Mean Square   F Value   Pr > F
FOOD            1    52555.5625   52555.5625    1808.36   0.0001
SEX             1    5738.0625    5738.0625     197.44    0.0001
FOOD*SEX        1    203.0625     203.0625      6.99      0.0296
RACE            1    12825.5625   12825.5625    441.31    0.0001
FOOD*RACE       1    175.5625     175.5625      6.04      0.0395
SEX*RACE        1    588.0625     588.0625      20.23     0.0020
FOOD*SEX*RACE   1    52.5625      52.5625       1.81      0.2156

SAS program listing

data assign71;
  do race=1 to 2;
    do sex=1 to 2;
      do food=1 to 2;
        do n=1 to 2;
          input Consumed @@;
          output;
        end;
      end;
    end;
  end;
cards;
650 645 511 520 610 612 500 501
700 712 601 588 650 655 550 546
;
proc anova;
  class food sex race;
  model Consumed=food|sex|race;
run;

Equivalent data step with the factor levels read from each data line:

data assign71;
  input race sex food Consumed;
cards;
1 1 1 650
1 1 1 645
1 1 2 511
1 1 2 520
1 2 1 610
1 2 1 612
1 2 2 500
1 2 2 501
2 1 1 700
2 1 1 712
2 1 2 601
2 1 2 588
2 2 1 650
2 2 1 655
2 2 2 550
2 2 2 546
;

The Efficacy of Prayer

Class                            N      Mean
Members of Royal family          97     64.04
Clergy                           945    69.49
Lawyers                          294    68.14
Medical Profession               244    67.31
English aristocracy              1179   67.31
Gentry                           1632   70.22
Trade and commerce               513    68.74
Officers in the Royal Navy       366    68.40
English literature and science   395    67.55
Officers of the Army             569    67.07
Fine arts                        239    65.96

Francis Galton (1822-1911)

Other data collected by Galton:
1. Rate of successful delivery between church-going parents and others
2. Life span of believers and non-believers from insurance companies

Galton’s data could be analyzed by a one-way ANOVA. One criterion for a good ANOVA design is that everything else is equal except for the treatment effect. Does the data set above satisfy this criterion?

Replicate

1 2 1 32 6 7 5

Metabolic rate in rabbit liver cells, taken for two samples of liver tissue

1. Model I ANOVA tests the differential effects of the fixed treatment:

$x_{ij} = \mu + \alpha_i + \varepsilon_{ij}$

where $\alpha_i$ stands for fixed treatment effects (e.g., between male and female).

2. Model II ANOVA tests the differential effects of a random variable and estimates its contribution to total variance relative to that from measurement errors (for facilitating experimental design):

$x_{ij} = \mu + A_i + \varepsilon_{ij}$

where $A_i$ stands for random treatment effects (e.g., between randomly sampled rabbits).

Model I and Model II ANOVA

How can we optimize the experiment? More rabbits or more replicates?

Determining Calcium Content in Leaves

Plant   Determination   Leaf 1   Leaf 2   Leaf 3
1       1               3.28     3.52     2.88
        2               3.09     3.48     2.80
2       1               2.46     1.87     2.19
        2               2.44     1.92     2.19
3       1               2.77     3.74     2.55
        2               2.66     3.44     2.55
4       1               3.78     4.07     3.31
        2               3.87     4.12     3.31

SAS Program

data turnip;
  input plant leaf calcium @@;
cards;
1 1 3.28 1 1 3.09 1 2 3.52 1 2 3.48
1 3 2.88 1 3 2.80 2 1 2.46 2 1 2.44
2 2 1.87 2 2 1.92 2 3 2.19 2 3 2.19
3 1 2.77 3 1 2.66 3 2 3.74 3 2 3.44
3 3 2.55 3 3 2.55 4 1 3.78 4 1 3.87
4 2 4.07 4 2 4.12 4 3 3.31 4 3 3.31
;
proc nested;
  class plant leaf;
  var calcium;
run;
proc glm;
  class plant leaf;
  model calcium=plant leaf(plant);   /* leaf is nested within plant */
run;

SAS Output: NESTED

Nested Random Effects Analysis of Variance for Variable CALCIUM

Variance Source   DF   Sum of Squares   F Value   Pr > F   Error Term
TOTAL             23   10.270396
PLANT             3    7.560346         7.665     0.0097   LEAF
LEAF              8    2.630200         49.409    0.0000   ERROR
ERROR             12   0.079850

Variance Source   Mean Square   Variance Component   Percent of Total
TOTAL             0.446539      0.532938             100.0000
PLANT             2.520115      0.365223             68.5302
LEAF              0.328775      0.161060             30.2212
ERROR             0.006654      0.006654             1.2486

Mean                     3.01208333
Standard error of mean   0.32404445
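The variance components in this output follow from the mean squares of a balanced nested design. The sketch below is an illustration of that calculation using the calcium values from the SAS program; it is not how PROC NESTED is implemented, and the nested-list layout and variable names are chosen here for readability.

```python
# calcium[plant][leaf] -> two determinations per leaf
calcium = [
    [[3.28, 3.09], [3.52, 3.48], [2.88, 2.80]],
    [[2.46, 2.44], [1.87, 1.92], [2.19, 2.19]],
    [[2.77, 2.66], [3.74, 3.44], [2.55, 2.55]],
    [[3.78, 3.87], [4.07, 4.12], [3.31, 3.31]],
]

def mean(xs):
    return sum(xs) / len(xs)

a = len(calcium)          # plants
b = len(calcium[0])       # leaves per plant
n = len(calcium[0][0])    # determinations per leaf

grand = mean([x for plant in calcium for leaf in plant for x in leaf])
plant_means = [mean([x for leaf in plant for x in leaf]) for plant in calcium]

ss_plant = b * n * sum((m - grand) ** 2 for m in plant_means)
ss_leaf = n * sum((mean(leaf) - pm) ** 2
                  for plant, pm in zip(calcium, plant_means) for leaf in plant)
ss_error = sum((x - mean(leaf)) ** 2
               for plant in calcium for leaf in plant for x in leaf)

ms_plant = ss_plant / (a - 1)
ms_leaf = ss_leaf / (a * (b - 1))
ms_error = ss_error / (a * b * (n - 1))

# Each level is tested against the level nested directly below it.
f_plant = ms_plant / ms_leaf
f_leaf = ms_leaf / ms_error

# Variance components from the expected mean squares of a balanced design
var_error = ms_error
var_leaf = (ms_leaf - ms_error) / n
var_plant = (ms_plant - ms_leaf) / (b * n)
total = var_plant + var_leaf + var_error

for name, comp in [("PLANT", var_plant), ("LEAF", var_leaf), ("ERROR", var_error)]:
    print(f"{name:6s} {comp:.6f} {100 * comp / total:8.4f}%")
```

Because most of the variance lies among plants and very little in the repeated determinations, sampling more plants would improve the estimate more than adding determinations per leaf.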

SAS Output: GLM

Dependent Variable: calcium

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             11   10.19054583      0.92641326    139.22    <.0001
Error             12   0.07985000       0.00665417
Corrected Total   23   10.27039583

R-Square   Coeff Var   Root MSE   calcium Mean
0.992225   2.708195    0.081573   3.012083

Source        DF   Type I SS    Mean Square   F Value   Pr > F
plant         3    7.56034583   2.52011528    378.73    <.0001
leaf(plant)   8    2.63020000   0.32877500    49.41     <.0001

Source        DF   Type III SS   Mean Square   F Value   Pr > F
plant         3    7.56034583    2.52011528    378.73    <.0001
leaf(plant)   8    2.63020000    0.32877500    49.41     <.0001