04/21/23 330 lecture 17 1
STATS 330: Lecture 17
Factors
• In the models discussed so far, all explanatory variables have been numeric.
• Now we want to incorporate categorical variables into our models.
• In R, categorical variables are called factors.
Example
• Consider an experiment to measure the rate of metal removal in a machining process on a lathe.
• The rate depends on the speed setting of the lathe (fast, medium or slow, a categorical measurement) and the hardness of the material being machined (a continuous measurement).
Data
   hardness setting rate
1       120    slow   68
2       140    slow   90
3       150    slow   98
4       125    slow   77
5       136    slow   88
6       165  medium  122
7       140  medium  104
8       120  medium   75
9       125  medium   84
10      133  medium   95
11      175    fast  138
12      132    fast  102
13      124    fast   93
14      141    fast  112
15      130    fast  100
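To follow along in R, the table above can be entered by hand. This is only a sketch: the name metal.df is the one used later in the lecture, but building it with data.frame() (rather than reading it from a file) and storing setting as a factor are assumptions made here.

```r
# Rebuild the lathe data from the table above.
# Assumption: the data are typed in directly; the lecture does not say
# how metal.df was actually created.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136,
               165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88,
               122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

str(metal.df)            # 15 observations of 3 variables
table(metal.df$setting)  # 5 observations at each speed setting
```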
[Figure: Plot of rate versus hardness for different lathe speeds. Horizontal axis: hardness of metal. Vertical axis: rate of metal removal. Points are labelled s (slow), m (medium) and f (fast).]
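Model
A model consisting of 3 parallel lines seems appropriate:
$$
\begin{aligned}
\text{rate} &= \alpha_S + \beta\,\text{hardness} \\
\text{rate} &= \alpha_M + \beta\,\text{hardness} \\
\text{rate} &= \alpha_F + \beta\,\text{hardness}
\end{aligned}
$$
• Note the same slope $\beta$, i.e. parallel lines
• Different intercepts $\alpha_S$, $\alpha_M$, $\alpha_F$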
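Baseline version
We can regard the fast setting as a baseline and express the other settings as “baseline plus offsets”:
$$
\begin{aligned}
\alpha_S &= \alpha + \delta_S \\
\alpha_M &= \alpha + \delta_M \\
\alpha_F &= \alpha
\end{aligned}
$$
Here $\alpha$ is the baseline intercept and $\delta_M$ is the offset for the medium line (likewise $\delta_S$ for the slow line).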
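Baseline version (2)
We can then write the model as
$$
\begin{aligned}
\text{For the slow setting:} \quad & \text{rate} = \alpha + \delta_S + \beta\,\text{hardness} + e \\
\text{For the medium setting:} \quad & \text{rate} = \alpha + \delta_M + \beta\,\text{hardness} + e \\
\text{For the fast setting:} \quad & \text{rate} = \alpha + \beta\,\text{hardness} + e
\end{aligned}
$$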
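“Deviation from mean” version
Now let $\alpha$ be the mean of $\alpha_F$, $\alpha_M$ and $\alpha_S$. Define
$$
\begin{aligned}
\delta_S &= \alpha_S - \alpha \\
\delta_M &= \alpha_M - \alpha \\
\delta_F &= \alpha_F - \alpha
\end{aligned}
$$
Here $\alpha_F$ is the “fast” line intercept and $\alpha$ is the mean of the intercepts.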
“Deviation from mean” version (2)
Then
$$
\begin{aligned}
\alpha_S &= \alpha + \delta_S \\
\alpha_M &= \alpha + \delta_M \\
\alpha_F &= \alpha + \delta_F
\end{aligned}
$$
Thus, $\alpha$ is now the “average” intercept, and there are 3 offsets, one for each line. The 3 offsets add to zero. This is the form used in the Stage 2 course.
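That the three offsets add to zero follows directly from the definition of the mean. A short derivation, writing $\alpha_S$, $\alpha_M$, $\alpha_F$ for the three intercepts, $\alpha$ for their mean and $\delta_S$, $\delta_M$, $\delta_F$ for the offsets (this notation is assumed for the sketch):

```latex
\begin{align*}
\delta_S + \delta_M + \delta_F
  &= (\alpha_S - \alpha) + (\alpha_M - \alpha) + (\alpha_F - \alpha) \\
  &= (\alpha_S + \alpha_M + \alpha_F) - 3\alpha \\
  &= 3\alpha - 3\alpha = 0 .
\end{align*}
```

So only two of the three offsets are free, which matches the two offsets of the baseline form.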
Dummy variables
Back to baseline form: we can combine the 3 “baseline” equations into one by using “dummy variables”. Define
med = 1 if setting = “medium” and 0 otherwise
slow = 1 if setting = “slow” and 0 otherwise
Then we can write the model as
$$\text{rate} = \alpha + \delta_M\,\text{med} + \delta_S\,\text{slow} + \beta\,\text{hardness}$$
Fitting
The model can be fitted as usual using lm:
> med <- ifelse(metal.df$setting=="medium", 1, 0)
> slow <- ifelse(metal.df$setting=="slow", 1, 0)
> summary(lm(rate ~ med + slow + hardness, data=metal.df))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.17042    7.15425  -3.099 0.010124 *
med          -9.44980    1.87275  -5.046 0.000374 ***
slow        -19.00757    1.88875 -10.064 6.94e-07 ***
hardness      0.93426    0.05008  18.654 1.13e-09 ***
Fitting (2)
Thus, the baseline has intercept -22.17042
The “medium” line has intercept -22.17042 -9.44980 = -31.62022
The “slow” line has intercept -22.17042 -19.00757 = -41.17799
[Figure: Plot of rate versus hardness for different lathe speeds. Horizontal axis: hardness of metal. Vertical axis: rate of metal removal. Points are labelled s (slow), m (medium) and f (fast); annotations mark the baseline, “Offset m” and “Offset s”.]
Fitting (3)
Making dummy variables is a pain. Fortunately R allows us to write
> summary(lm(rate ~ setting + hardness, data=metal.df))
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -22.17042    7.15425  -3.099 0.010124 *
settingmedium  -9.44980    1.87275  -5.046 0.000374 ***
settingslow   -19.00757    1.88875 -10.064 6.94e-07 ***
hardness        0.93426    0.05008  18.654 1.13e-09 ***
and get the same result, provided the variable setting is a factor.
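The equivalence of the two fits can be checked directly. The sketch below re-enters the data by hand (an assumption; the lecture does not show how metal.df was created) and uses the hypothetical names fit.dummy and fit.factor for the two fitted models.

```r
# Lathe data, typed in from the data slide (assumed construction).
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

# Fit 1: hand-made dummy variables, as on the "Fitting" slide.
med  <- ifelse(metal.df$setting == "medium", 1, 0)
slow <- ifelse(metal.df$setting == "slow", 1, 0)
fit.dummy <- lm(rate ~ med + slow + hardness, data = metal.df)

# Fit 2: let R build the dummy variables from the factor.
fit.factor <- lm(rate ~ setting + hardness, data = metal.df)

# The design matrix R builds contains the same 0/1 columns.
head(model.matrix(fit.factor))

# Same estimates; only the coefficient names differ.
all.equal(unname(coef(fit.dummy)), unname(coef(fit.factor)))
```

The columns settingmedium and settingslow of the design matrix play the roles of med and slow.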
Factors
• Since the values of setting in the input data were character data, the variable setting was automatically recognized as a factor.
• In fact the 3 settings were 1000, 1200 and 1400 rpm. What would happen if the input data had used these (numerical) values?
• Answer: the lm function would have assumed that setting was a continuous variable and fitted a plane, not 3 parallel lines.
Factors (2)
> rpm = rep(c(1000,1200,1400), c(5,5,5))
> summary(lm(rate ~ rpm + hardness, data=metal.df))
Call:
lm(formula = rate ~ rpm + hardness, data = metal.df)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -88.674624   7.837602  -11.31 9.29e-08 ***
rpm           0.047519   0.004521   10.51 2.09e-07 ***
hardness      0.934226   0.047944   19.49 1.89e-10 ***
When rpm = 1000, the relationship is
rate = -88.674624 + 0.047519 * 1000 + 0.934226 * hardness
i.e. rate = -41.15562 + 0.934226 * hardness
Factors (3)
          Intercept                Slope
          factor      non-factor   factor    non-factor
Fast      -22.17042   -22.14802    0.93426   0.93423
Medium    -31.62022   -31.65182    0.93426   0.93423
Slow      -41.17799   -41.15562    0.93426   0.93423
The non-factor model constrains the 3 intercepts to be equally spaced. OK for this data set, but not in general.
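The equal spacing can be seen by plugging the three rpm values into the non-factor fit. The sketch below only does arithmetic on the estimates quoted in the lecture; the object names (b0, b.rpm, rpm.values, intercepts) are made up for the illustration.

```r
# Estimates copied from the non-factor fit rate ~ rpm + hardness.
b0    <- -88.674624   # intercept
b.rpm <- 0.047519     # coefficient of rpm

# rpm value for each speed setting.
rpm.values <- c(slow = 1000, medium = 1200, fast = 1400)

# Implied intercept of the rate-versus-hardness line at each speed.
intercepts <- b0 + b.rpm * rpm.values
print(intercepts)

# Gaps between successive intercepts: both are forced to equal
# 200 * b.rpm, because the rpm values are 200 apart.
print(diff(intercepts))
```

With a factor, the two gaps are separate parameters (9.55777 and 9.44980 in the factor fit), so the lines are free to be unequally spaced.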
Factors (4)
To avoid this, we could
• recode the variable as character, or (easier)
• use the factor function to coerce the numerical data into a factor:
rpm.as.factor = factor(rpm)
Factors (5)
We can fit the “factor” model using the R code
> rpm.as.factor = factor(rpm)
> summary(lm(rate ~ rpm.as.factor + hardness, data=metal.df))
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       -41.17799    6.84927  -6.012 8.77e-05 ***
rpm.as.factor1200   9.55777    1.86692   5.120 0.000334 ***
rpm.as.factor1400  19.00757    1.88875  10.064 6.94e-07 ***
hardness            0.93426    0.05008  18.654 1.13e-09 ***
These estimates are different!! What’s going on??
Levels
• The different values of a factor are called “levels”.
• The levels of the factor setting are fast, medium, slow:
> levels(metal.df$setting)
[1] "fast" "medium" "slow"
• The levels of the factor rpm.as.factor are 1000, 1200, 1400:
> levels(rpm.as.factor)
[1] "1000" "1200" "1400"
Levels (2)
• By default, the levels are listed in alphabetical order.
• The first level is selected as the baseline.
• Thus, using setting, the baseline is “fast”.
• Using rpm.as.factor, the baseline is “1000”.
Levels (3)
We can change the order of the levels, and hence the baseline, using the factor function:
> rpm.newbaseline <- factor(rpm, levels=c("1400", "1200", "1000"))
> summary(lm(rate ~ rpm.newbaseline + hardness, data=metal.df))
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -22.17042    7.15425  -3.099 0.010124 *
rpm.newbaseline1200  -9.44980    1.87275  -5.046 0.000374 ***
rpm.newbaseline1000 -19.00757    1.88875 -10.064 6.94e-07 ***
hardness              0.93426    0.05008  18.654 1.13e-09 ***
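If only the baseline needs to change, base R's relevel function is an alternative to listing all the levels again. This is offered as a sketch rather than the approach taken in the lecture; the name rpm.fast.base is made up for the illustration.

```r
# Speeds in the same order as the data: 5 slow, 5 medium, 5 fast.
rpm <- rep(c(1000, 1200, 1400), c(5, 5, 5))
rpm.as.factor <- factor(rpm)
levels(rpm.as.factor)      # default order, so the baseline is "1000"

# relevel() moves the chosen level to the front, making it the baseline.
# The remaining levels keep their original relative order.
rpm.fast.base <- relevel(rpm.as.factor, ref = "1400")
levels(rpm.fast.base)
```

Note that relevel leaves the other levels as "1000", "1200", whereas the factor call in the lecture orders them "1200", "1000". The fitted lines are the same either way; only the order of the coefficients in the output changes.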
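Non-parallel lines
What if the lines aren’t parallel? Then the betas are different: the model becomes
$$
\begin{aligned}
\text{rate} &= \alpha_S + \beta_S\,\text{hardness} \\
\text{rate} &= \alpha_M + \beta_M\,\text{hardness} \\
\text{rate} &= \alpha_F + \beta_F\,\text{hardness}
\end{aligned}
$$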
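Baseline version for the betas
As before, we can regard the fast setting as a baseline and express the other settings as “baseline plus offsets”:
$$
\begin{aligned}
\beta_S &= \beta + \gamma_S \\
\beta_M &= \beta + \gamma_M \\
\beta_F &= \beta
\end{aligned}
$$
Here $\beta$ is the baseline slope and $\gamma_M$ is the offset for the medium line (likewise $\gamma_S$ for the slow line).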
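Baseline version for both parameters
We can then write the model as
$$
\begin{aligned}
\text{For the slow setting:} \quad & \text{rate} = \alpha + \delta_S + \beta\,\text{hardness} + \gamma_S\,\text{hardness} + e \\
\text{For the medium setting:} \quad & \text{rate} = \alpha + \delta_M + \beta\,\text{hardness} + \gamma_M\,\text{hardness} + e \\
\text{For the fast setting:} \quad & \text{rate} = \alpha + \beta\,\text{hardness} + e
\end{aligned}
$$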
Dummy variables for both parameters
As before, we can combine these 3 equations into one by using “dummy variables”. Define med and slow as before, and
h.med = hardness x med
h.slow = hardness x slow
Then we can write the model as
$$
\begin{aligned}
\text{rate} = {} & \alpha + \delta_M\,\text{med} + \delta_S\,\text{slow} + \beta\,\text{hardness} \\
& + \gamma_M\,\text{h.med} + \gamma_S\,\text{h.slow}
\end{aligned}
$$
Fitting in R
The model formula for this non-parallel model is
rate ~ setting + hardness + setting:hardness
or, even more compactly,
rate ~ setting * hardness
> summary(lm(rate ~ setting*hardness, data=metal.df))
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            -12.18162   10.32795  -1.179   0.2684
settingmedium          -30.15725   15.49375  -1.946   0.0834 .
settingslow            -33.60120   19.58902  -1.715   0.1204
hardness                 0.86312    0.07295  11.831 8.69e-07 ***
settingmedium:hardness   0.14961    0.11125   1.345   0.2116
settingslow:hardness     0.10546    0.14356   0.735   0.4813
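To read this output as three separate lines, add each offset to the corresponding baseline (fast) value. The sketch below only does arithmetic on the estimates printed above; the names est and fitted.lines are made up for the illustration.

```r
# Estimates copied from the summary output of rate ~ setting * hardness.
est <- c(intercept       = -12.18162,
         medium          = -30.15725,
         slow            = -33.60120,
         hardness        =   0.86312,
         medium.hardness =   0.14961,
         slow.hardness   =   0.10546)

# Baseline (fast) values plus the offsets for medium and slow.
fitted.lines <- data.frame(
  setting   = c("fast", "medium", "slow"),
  intercept = est[["intercept"]] + c(0, est[["medium"]], est[["slow"]]),
  slope     = est[["hardness"]] +
              c(0, est[["medium.hardness"]], est[["slow.hardness"]])
)
print(fitted.lines)
```

Each row gives the intercept and slope of the line that would be obtained by fitting that setting on its own.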
Is the non-parallel model necessary?
This amounts to testing if the slope offsets $\gamma_M$ and $\gamma_S$ are both zero, or, equivalently, if the parallel model
rate ~ setting + hardness
is an adequate submodel of the non-parallel model
rate ~ setting * hardness
As in Lecture 6, we use the anova function to compare the two models:
> model1 <- lm(rate ~ setting + hardness, data=metal.df)
> model2 <- lm(rate ~ setting * hardness, data=metal.df)
> anova(model1, model2)
Analysis of Variance Table
Model 1: rate ~ setting + hardness
Model 2: rate ~ setting * hardness
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     11 95.451
2      9 78.807  2    16.644 0.9504 0.4222
Conclusion: since the F-value is small and the p-value 0.4222 is large, we conclude that the submodel (i.e. the parallel lines model) is adequate.
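The F-value and p-value in the table can be reproduced by hand from the two residual sums of squares. A minimal sketch, using only the numbers printed in the anova table; the variable names are made up for the illustration.

```r
# Residual sums of squares and degrees of freedom from the anova table.
rss.sub  <- 95.451; df.sub  <- 11   # parallel model (submodel)
rss.full <- 78.807; df.full <- 9    # non-parallel model (full model)

# F = (drop in RSS per extra parameter) / (residual mean square of full model)
extra.df <- df.sub - df.full
F.stat   <- ((rss.sub - rss.full) / extra.df) / (rss.full / df.full)

# Upper-tail probability of the F distribution on (2, 9) degrees of freedom.
p.value <- pf(F.stat, extra.df, df.full, lower.tail = FALSE)

round(c(F = F.stat, p = p.value), 4)
```

The two extra degrees of freedom correspond to the two slope offsets that the non-parallel model adds.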