Transcript of Topic 20: Single Factor Analysis of Variance (bacraig/notes512/Topic_20.pdf)
Topic 20: Single Factor
Analysis of Variance
Outline
• Single factor Analysis of Variance
–One set of treatments
•Cell means model
•Factor effects model
–Link to linear regression using indicator explanatory variables
One-Way ANOVA
• The response variable Y is continuous
• The explanatory variable is categorical
–We call it a factor
–The possible values are called levels
• This approach is a generalization of the
independent two-sample pooled t-test
• In other words, it can be used when there
are more than two treatments
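The notes use SAS throughout; as a cross-check of this claim, here is a short Python sketch (not part of the original notes) showing that for r = 2 groups the one-way ANOVA F* equals the square of the pooled two-sample t statistic. The data are designs 1 and 2 of the KNNL cereal example used later.

```python
# Sketch (assumption: data values from KNNL Table 16.1): for r = 2 groups,
# the one-way ANOVA F* equals the square of the pooled two-sample t statistic.
import math

g1 = [11, 17, 16, 14, 15]   # design 1
g2 = [12, 10, 15, 19, 11]   # design 2

def mean(x): return sum(x) / len(x)
def ss(x):   return sum((v - mean(x))**2 for v in x)   # sum of squared deviations

n1, n2 = len(g1), len(g2)
sp2 = (ss(g1) + ss(g2)) / (n1 + n2 - 2)                # pooled variance
t = (mean(g1) - mean(g2)) / math.sqrt(sp2 * (1/n1 + 1/n2))

grand = mean(g1 + g2)
ssr = n1 * (mean(g1) - grand)**2 + n2 * (mean(g2) - grand)**2
sse = ss(g1) + ss(g2)
f_star = (ssr / 1) / (sse / (n1 + n2 - 2))             # MSR/MSE with r = 2

print(round(t**2, 6), round(f_star, 6))                # identical values
```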
Data for One-Way ANOVA
• Y is the response variable
• X is the factor (it is qualitative/discrete)
– r is the number of levels
–often refer to these levels as groups
or treatments
Notation
• For Yij we use
– i to denote the level of the factor
– j to denote the jth observation at
factor level i
• i = 1, . . . , r levels of factor X
• j = 1, . . . , ni observations for level i
of factor X
– Note that ni does not need to be the
same for each level
KNNL Example (p 685)
• Y is the number of cases of cereal sold
• X is the design of the cereal package
– there are four levels for X because
there are four different designs
• i =1 to 4 levels
• j =1 to ni stores with design i (ni=5,5,4,5)
• Will use n if ni the same across levels
Data for one-way ANOVA
data a1;
infile 'c:../data/ch16ta01.txt';
input cases design store;
proc print data=a1;
run;
The data
Obs cases design store
1 11 1 1
2 17 1 2
3 16 1 3
4 14 1 4
5 15 1 5
6 12 2 1
7 10 2 2
8 15 2 3
Plot the data
symbol1 v=circle i=none;
proc gplot data=a1;
plot cases*design;
run;
The scatterplot
Plot the means
proc means data=a1;
var cases; by design;
output out=a2 mean=avcases;
proc print data=a2;
symbol1 v=circle i=join;
proc gplot data=a2;
plot avcases*design;
run;
Proc Print: New Data Set
Obs design _TYPE_ _FREQ_ avcases
1 1 0 5 14.6
2 2 0 5 13.4
3 3 0 4 19.5
4 4 0 5 27.2
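The means above come from proc means; as a sketch (not part of the notes), the same group averages can be reproduced in Python. The group-3 and group-4 values are taken from KNNL Table 16.1; the first eight rows match the data listing above.

```python
# Sketch mirroring the proc means output (data assumed from KNNL Table 16.1).
data = {
    1: [11, 17, 16, 14, 15],
    2: [12, 10, 15, 19, 11],
    3: [23, 20, 18, 17],
    4: [27, 33, 22, 26, 28],
}
# one sample mean per level of the factor
avcases = {d: sum(y) / len(y) for d, y in data.items()}
print(avcases)   # {1: 14.6, 2: 13.4, 3: 19.5, 4: 27.2}
```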
Proc Gplot: Means Plot
The Model
• We assume that the response variable is
–Normally distributed with a
1. mean that may depend on the level of the factor
2. constant variance
• All observations assumed independent
• NOTE: Same assumptions as linear regression except there is no assumed linear relationship between X and E(Y|X)
The scatterplot
Based on scatterplot and design:
Independence?
Constant variance?
Normally distributed?
Cell Means Model
• A “cell” refers to a level of the factor
• Yij = μi + εij
–where μi is the theoretical mean or
expected value of all observations
at level (or in cell) i
– the εij are iid N(0, σ²) which means
–Yij ~ N(μi, σ²) and independent
Parameters
• The parameters of the model are
– μ1, μ2, … , μr
– σ²
• Question (Version 1) – Does our
explanatory variable help explain Y?
• Question (Version 2) – Do the μi vary?
H0: μ1= μ2= … = μr = μ (a constant)
Ha: not all μ’s are the same
Estimates
• Estimate μi by the mean of the observations at level i (the sample mean):
  μ̂i = Ȳi· = (Σj Yij)/ni
• For each level i, also get an estimate of the variance (the sample variance):
  si² = (Σj (Yij − Ȳi·)²)/(ni − 1)
• We combine these to get an overall estimate of σ²
• Same approach as pooled t-test
Pooled estimate of σ²
• If the ni were all the same we would average the si²
– Do not average the si
• In general we pool the si², giving weights proportional to the df, ni − 1
• The pooled estimate is
  s² = Σi (ni − 1)si² / Σi (ni − 1) = Σi (ni − 1)si² / (nT − r)
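As a numeric sketch (not part of the notes), pooling the four group variances of the cereal data this way reproduces the MSE that proc glm reports later (10.5467). Data values assumed from KNNL Table 16.1.

```python
# Sketch: pooled variance s^2 = sum((n_i - 1) s_i^2) / (n_T - r),
# using the cereal data (assumed from KNNL Table 16.1).
from statistics import variance   # sample variance, divisor n - 1

groups = [
    [11, 17, 16, 14, 15],
    [12, 10, 15, 19, 11],
    [23, 20, 18, 17],
    [27, 33, 22, 26, 28],
]
num = sum((len(g) - 1) * variance(g) for g in groups)   # sum of (n_i - 1) s_i^2
den = sum(len(g) - 1 for g in groups)                   # n_T - r = 19 - 4 = 15
print(round(num / den, 4))   # 10.5467, the MSE in the proc glm output
```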
Running proc glm
proc glm data=a1;
class design;
model cases=design;
means design;
lsmeans design / stderr;
run;
Difference 1: Need to specify factor variables
Difference 2: Ask for mean estimates
Output
Class Level
Information
Class Levels Values
design 4 1 2 3 4
Number of Observations Read 19
Number of Observations Used 19
Important to check
these summaries!!!
SAS 9.4 default output
for MEANS statement
MEANS statement output
Level of design   N   cases Mean   cases Std Dev
1 5 14.6000000 2.30217289
2 5 13.4000000 3.64691651
3 4 19.5000000 2.64575131
4 5 27.2000000 3.96232255
Table of sample means and
sample standard deviations
SAS 9.4 default output
for LSMEANS statement
LSMEANS statement
output
design   cases LSMEAN   Standard Error   Pr > |t|
1 14.6000000 1.4523544 <.0001
2 13.4000000 1.4523544 <.0001
3 19.5000000 1.6237816 <.0001
4 27.2000000 1.4523544 <.0001
Provides estimates based on
model (i.e., constant variance)
Notation for ANOVA
Ȳi· = (Σj Yij)/ni (trt sample mean)
Ȳ·· = (Σi Σj Yij)/nT (overall sample mean)
nT = Σi ni (total number of observations)
When ni = n for all i, Ȳ·· = (Σi Ȳi·)/r
ANOVA Table
Source   df     SS                          MS
Model    r-1    SSR = Σi ni(Ȳi· − Ȳ··)²     SSR/dfR
Error    nT-r   SSE = Σi Σj (Yij − Ȳi·)²    SSE/dfE
Total    nT-1   SSTO = Σi Σj (Yij − Ȳ··)²   SSTO/dfT
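As a sketch (not part of the notes), the decomposition can be computed by hand in Python; it reproduces the SAS output below (SSR = 588.22, SSE = 158.20, F* = 18.59) and confirms SSR + SSE = SSTO. Data values assumed from KNNL Table 16.1.

```python
# Sketch: ANOVA decomposition for the cereal data (assumed from KNNL Table 16.1).
data = {
    1: [11, 17, 16, 14, 15],
    2: [12, 10, 15, 19, 11],
    3: [23, 20, 18, 17],
    4: [27, 33, 22, 26, 28],
}
all_y = [y for g in data.values() for y in g]
n_t, r = len(all_y), len(data)            # 19 observations, 4 levels
grand = sum(all_y) / n_t                  # overall sample mean

ssr  = sum(len(g) * (sum(g)/len(g) - grand)**2 for g in data.values())
sse  = sum((y - sum(g)/len(g))**2 for g in data.values() for y in g)
ssto = sum((y - grand)**2 for y in all_y)  # equals ssr + sse

f_star = (ssr / (r - 1)) / (sse / (n_t - r))   # MSR / MSE
print(round(ssr, 2), round(sse, 2), round(f_star, 2))   # 588.22 158.2 18.59
```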
ANOVA SAS Output
Source DF
Sum of
Squares
Mean
Square F Value Pr > F
Model 3 588.2210526 196.0736842 18.59 <.0001
Error 15 158.2000000 10.5466667
Cor Total 18 746.4210526
R-Square Coeff Var Root MSE cases Mean
0.788055 17.43042 3.247563 18.63158
Expected Mean Squares
• E(MSR) > E(MSE) when the group means are different
• See KNNL p 694 – 698 for more details
• In more complicated models, the EMS tell us how to construct the F test
E(MSE) = σ²
E(MSR) = σ² + (Σi ni(μi − μ·)²)/(r − 1), where μ· = (Σi ni μi)/nT
F test
• F* = MSR/MSE
• H0: μ1 = μ2 = … = μr
• Ha: not all of the μi are equal
• Under H0, F* ~ F(r-1, nT-r)
• Reject H0 when F* is large
• Typically report the P-value
Maximum Likelihood Approach
1. proc mixed data=a1;
class design;
model cases=design;
lsmeans design;
2. proc glimmix data=a1;
class design;
model cases=design / dist=normal;
lsmeans design;
run;
GLIMMIX Output
Model Information
Data Set WORK.A1
Response Variable cases
Response Distribution Gaussian
Link Function Identity
Variance Function Default
Variance Matrix Diagonal
Estimation Technique Restricted Maximum Likelihood
Degrees of Freedom Method Residual
GLIMMIX Output
Fit Statistics
-2 Res Log Likelihood 84.12
AIC (smaller is better) 94.12
AICC (smaller is better) 100.79
BIC (smaller is better) 97.66
CAIC (smaller is better) 102.66
HQIC (smaller is better) 94.08
Pearson Chi-Square 158.20
Pearson Chi-Square / DF 10.55
GLIMMIX Output
Type III Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
design 3 15 18.59 <.0001
design Least Squares Means
design   Estimate   Standard Error   DF   t Value   Pr > |t|
1 14.6000 1.4524 15 10.05 <.0001
2 13.4000 1.4524 15 9.23 <.0001
3 19.5000 1.6238 15 12.01 <.0001
4 27.2000 1.4524 15 18.73 <.0001
Factor Effects Model
• A reparameterization of the cell means
model
• Useful way of looking at more complicated models
• Null hypotheses are easier to state
• Yij = μ + τi + εij
– the εij are iid N(0, σ²)
Parameters
• The parameters of the model are
– μ, τ1, τ2, … , τr
– σ²
• The cell means model has r + 1 parameters
– r μ's and σ²
• The factor effects model has r + 2 parameters
– μ, the r τ's, and σ²
– Cannot uniquely estimate all parameters
An example
• Suppose r=3; μ1 = 10, μ2 = 20, μ3 = 30
• What is an equivalent set of parameters
for the factor effects model?
• We need to have μ + τi = μi so…
1. μ = 0, τ1 = 10, τ2 = 20, τ3 = 30
2. μ = 20, τ1 = -10, τ2 = 0, τ3 = 10
3. μ = 5000, τ1 = -4990, τ2 = -4980, τ3 = -4970
all provide the same means
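The three parameterizations on this slide can be checked directly; the sketch below (not part of the notes) confirms that each choice of (μ, τ1, τ2, τ3) reproduces the same cell means μi = μ + τi.

```python
# Sketch: the three factor-effects parameterizations from the slide
# all yield the same cell means mu_i = mu + tau_i.
params = [
    (0,    [10, 20, 30]),
    (20,   [-10, 0, 10]),
    (5000, [-4990, -4980, -4970]),
]
for mu, tau in params:
    print([mu + t for t in tau])   # [10, 20, 30] each time
```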
Problem with factor effects?
• These parameters are not estimable or not well defined (i.e., not unique)
– There are many solutions to the least squares problem
– There is an X΄X matrix for this parameterization that does not have an inverse (perfect multicollinearity)
• We addressed a similar situation in multiple regression. The parameter estimators provided by SAS are biased
Factor effects solution
• Put a constraint on the τi
• Common to assume Σi τi = 0
• This effectively reduces the number of parameters by one
• Numerous other constraints possible
–Σi τi = 100
– τr = 0
Consequences
• Regardless of constraint, we always have μi = μ + τi
• The constraint Σi τi = 0 implies
–μ = (Σi μi)/r (unweighted overall mean)
– τi = μi – μ (group effect)
• The “unweighted” complicates
things when the ni are not all equal;
see KNNL p 702-708
Hypotheses
• H0: μ1 = μ2 = … = μr
• H1: not all of the μi are equal
are translated into
• H0: τ1 = τ2 = … = τr = 0
• H1: at least one τi is not 0
Estimates of parameters
• With the constraint Σi τi = 0:
  μ̂ = (Σi Ȳi·)/r = Ȳ·· (if ni = n)
  τ̂i = Ȳi· − μ̂
Solution used by SAS
• Recall, X΄X does not have an inverse
• We can use a generalized inverse in
its place
• (X΄X)- is the standard notation for
generalized inverse
• There are many generalized inverses,
each corresponding to a different
constraint
Solution used by SAS
• (X΄X)- used in proc glm corresponds to the constraint τr = 0
• Recall that μ and the τi are not estimable
• But the linear combinations μ + τi are estimable
• These are estimated by the cell means (i.e., sample means)
Cereal package example
• Y is the number of cases of cereal sold
• X is the design of the cereal package
• i =1 to 4 levels
• j =1 to ni stores with design i
SAS coding for X
• Class statement generates r explanatory variables
•The ith explanatory variable is equal to 1 if the observation is from the ith group
•In other words, the rows of X are
1 1 0 0 0 for design=1
1 0 1 0 0 for design=2
1 0 0 1 0 for design=3
1 0 0 0 1 for design=4
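This coding makes X rank deficient: the four indicator columns sum to the intercept column. A short sketch (not part of the notes) builds X for the 19 stores, verifies the dependence, and reproduces the first row of the X'X matrix shown on the next slides.

```python
# Sketch: the class-statement coding makes the columns of X linearly
# dependent (d1 + d2 + d3 + d4 = intercept), so X'X is singular.
designs = [1]*5 + [2]*5 + [3]*4 + [4]*5   # 19 stores, n_i = 5, 5, 4, 5

# each row: intercept, then one indicator per design level
X = [[1] + [1 if d == lvl else 0 for lvl in (1, 2, 3, 4)] for d in designs]

for row in X:
    assert row[0] == sum(row[1:])   # intercept column = sum of indicators

# X'X (without the cases column); diagonal of the indicator block = group sizes
xtx = [[sum(r[i] * r[j] for r in X) for j in range(5)] for i in range(5)]
print(xtx[0])   # [19, 5, 5, 4, 5], matching the first row of the X'X output
```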
Some SAS options
proc glm data=a1;
class design;
model cases=design
/xpx inverse solution;
run;
xpx Output
The X'X Matrix
        Int   d1   d2   d3   d4  cases
Int      19    5    5    4    5    354
d1        5    5    0    0    0     73
d2        5    0    5    0    0     67
d3        4    0    0    4    0     78
d4        5    0    0    0    5    136
cases   354   73   67   78  136   7342
Last column contains X'Y; bottom-right entry is Y'Y
inverse Output
X'X Generalized Inverse (g2)
        Int     d1     d2     d3    d4   cases
Int     0.2   -0.2   -0.2   -0.2     0    27.2
d1     -0.2    0.4    0.2    0.2     0   -12.6
d2     -0.2    0.2    0.4    0.2     0   -13.8
d3     -0.2    0.2    0.2   0.45     0    -7.7
d4        0      0      0      0     0       0
cases  27.2  -12.6  -13.8   -7.7     0   158.2
Inverse Matrix
•Actually, this matrix is
(X΄X)- (X΄X)- X΄Y
Y΄X(X΄X)- Y΄Y-Y΄X(X΄X)- X΄Y
•Parameter estimates are in upper
right corner, SSE is lower right
corner (last column on previous page)
solution Output:
Parameter estimates
Par    Est       Std Err   t       P
Int    27.2 B    1.45      18.73   <.0001
d1    -12.6 B    2.05      -6.13   <.0001
d2    -13.8 B    2.05      -6.72   <.0001
d3     -7.7 B    2.17      -3.53   0.0030
d4      0.0 B    .         .       .
Caution Message
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
Interpretation
• If τr = 0 (in our case, τ4 = 0), then the
corresponding estimate should be zero
• The intercept μ is then estimated by the
sample mean of Group 4
• Since μ + τi is the mean of group i, the τi
are estimated as the differences between
the sample mean of Group i and the
sample mean of Group 4
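A sketch (not part of the notes) reproducing proc glm's solution under the constraint τ4 = 0 directly from the sample means:

```python
# Sketch: with tau_4 = 0, the intercept is Group 4's sample mean and each
# tau_i is the difference from Group 4 -- matching the proc glm solution.
means = {1: 14.6, 2: 13.4, 3: 19.5, 4: 27.2}   # sample means from the output
intercept = means[4]                            # 27.2
effects = {i: round(m - intercept, 1) for i, m in means.items()}
print(intercept, effects)   # 27.2 {1: -12.6, 2: -13.8, 3: -7.7, 4: 0.0}
```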
Recall the means output
Level of design   N   Mean   Std Dev
1   5   14.6   2.3
2   5   13.4   3.6
3   4   19.5   2.6
4   5   27.2   3.9
Parameter estimates
based on means
Level of design   Mean
                  μ̂ = 27.2
1   14.6          τ̂1 = 14.6 - 27.2 = -12.6
2   13.4          τ̂2 = 13.4 - 27.2 = -13.8
3   19.5          τ̂3 = 19.5 - 27.2 = -7.7
4   27.2          τ̂4 = 27.2 - 27.2 = 0
Last slide
• Read KNNL Chapter 16 up to 16.10
• We used program topic20.sas to generate the output for today
• Will focus more on the relationship between regression and one-way ANOVA in next topic