Avoiding bias in RCTs David Torgerson Director, York Trials Unit [email protected] .
Lecture 4 Linear Models III Olivier MISSA, [email protected]@york.ac.uk Advanced Research...
Transcript of Lecture 4 Linear Models III Olivier MISSA, [email protected]@york.ac.uk Advanced Research...
2
Outline
"Refresher" on different types of model:
Multiple regression
Polynomial regression
Model building
Finding the "best" model.
3
When more than one continuous or discrete variables are used to predict a response variable.
Example: The 38 car drivers dataset
Multiple Regression
> data (seatpos) ## from faraway package
> attach(seatpos)
> names(seatpos)[1] "Age" "Weight" "HtShoes" "Ht" "Seated" [6] "Arm" "Thigh" "Leg" "hipcenter"
> summary(lm(hipcenter ~ Age))Coefficients:
Estimate Std. Error t value Pr(>|t|)(Intercept) -192.9645 24.3015 -7.940 2.00e-09 ***Age 0.7963 0.6331 1.258 0.217...Multiple R-squared: 0.0421, Adjusted R-squared: 0.01549...
4
Multiple Regression> summary(lm(hipcenter ~ Age))
Estimate Std. Error t value Pr(>|t|)Age 0.7963 0.6331 1.258 0.217 Multiple R-squared: 0.0421> summary(lm(hipcenter ~ Weight))
Estimate Std. Error t value Pr(>|t|)Weight -1.0674 0.2134 -5.002 1.49e-05 *** Multiple R-squared: 0.41> summary(lm(hipcenter ~ HtShoes))
Estimate Std. Error t value Pr(>|t|)HtShoes -4.2621 0.5391 -7.907 2.21e-09 *** Multiple R-squared: 0.6346> summary(lm(hipcenter ~ Ht))
Estimate Std. Error t value Pr(>|t|)Ht -4.2650 0.5351 -7.970 1.83e-09 *** Multiple R-squared: 0.6383> summary(lm(hipcenter ~ Seated))
Estimate Std. Error t value Pr(>|t|)Seated -8.844 1.375 -6.432 1.84e-07 *** Multiple R-squared: 0.5347
5
Multiple Regression> summary(lm(hipcenter ~ Arm))
Estimate Std. Error t value Pr(>|t|)Arm -10.351 2.391 -4.329 0.000114 *** Multiple R-squared: 0.3423> summary(lm(hipcenter ~ Thigh))
Estimate Std. Error t value Pr(>|t|)Thigh -9.100 2.069 -4.398 9.29e-05 *** Multiple R-squared: 0.3495> summary(lm(hipcenter ~ Leg))
Estimate Std. Error t value Pr(>|t|)Leg -13.795 1.801 -7.658 4.59e-09 *** Multiple R-squared: 0.6196
Age Ht Leg
6
Multiple Regression> summary(mod <- lm(hipcenter ~ Ht + Leg))Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 491.244 99.543 4.935 1.95e-05 ***Ht -2.565 1.268 -2.022 0.0509 Leg -6.136 4.164 -1.473 0.1496 ---Residual standard error: 35.79 on 35 degrees of freedomMultiple R-squared: 0.6594, Adjusted R-squared: 0.6399 F-statistic: 33.88 on 2 and 35 DF, p-value: 6.517e-09
> summary(lm(hipcenter ~ Ht)) Estimate Std. Error t value Pr(>|t|)
Ht -4.2650 0.5351 -7.970 1.83e-09 *** Multiple R-squared: 0.6383
> summary(lm(hipcenter ~ Leg)) Estimate Std. Error t value Pr(>|t|)
Leg -13.795 1.801 -7.658 4.59e-09 *** Multiple R-squared: 0.6196
slope parameters change
Still Significant
No Longer Significant
7
> drop1(mod, test="F")Single term deletions
Model:hipcenter ~ Ht + Leg Df Sum of Sq RSS AIC F value Pr(F) <none> 44835 275 Ht 1 5236 50071 277 4.0877 0.05089 .Leg 1 2781 47616 275 2.1711 0.14957
> cor(Ht,Leg)[1] 0.9097524
> plot(Ht ~ Leg)
Multiple Regression Beware of strong collinearity
8
> add1( lm(hipcenter~Ht), ~. +Age +Weight +HtShoes +Seated +Arm +Thigh +Leg, test="F")
Single term additions
Model:hipcenter ~ Ht Df Sum of Sq RSS AIC F value Pr(F)<none> 47616 275 Age 1 2354 45262 275 1.8199 0.1860Weight 1 196 47420 277 0.1446 0.7061HtShoes 1 26 47590 277 0.0189 0.8913Seated 1 102 47514 277 0.0748 0.7861Arm 1 76 47540 277 0.0558 0.8147Thigh 1 5 47611 277 0.0034 0.9538Leg 1 2781 44835 275 2.1711 0.1496
Multiple Regression Beware of strong collinearity
No Added Variable
Significantly Improves the
Model
9
Comparing models How can we compare models on an equal footing ?
(regardless of the number of parameters).
The multiple-R2 can only increase as more variables enter a model
(because the RSS can only decrease).
The adjusted R2 corrects for the different number of parameters to some extent.
1
12
n
TSSpn
RSSradj
TSS
RSSrmult 12 no good to
compare models
10
Akaike Information Criterion
Invented by Hirotugu Akaike in 1971
after a few weeks of sleepless nights
stressing over a conference presentation.
The AIC originally called 'An Information Criterion'
penalizes the likelihood of a model according
to the number of parameters being estimated.
LkAIC ln22 Maximized value of
the Likelihood functionNumber of Parameters
The lower the AIC value
the better the model is
11
Akaike Information CriterionInvented by Hirotugu Akaike in 1971
after a few weeks of sleepless nights
stressing over a conference presentation.
When residuals are normally & independently distributed:
LkAIC ln22
Exact expression
Simplified expression(not equal)
12
Akaike Information CriterionMay need to be corrected for small sample sizes
A variant, the BIC, 'Bayesian Information Criterion' (Schwartz Criterion)
penalizes free parameters more strongly than AIC
When residuals are normally & independently distributed:
1
12
kn
kkAICAICc
LnkBIC ln2ln
LkAIC ln22
Exact expression
Simplified expression(not equal)
13
> mod <- lm(hipcenter ~. , data=seatpos) ## starts with everything> s.mod <- step(mod) ## by default prunes variables out ('backward')
Start: AIC=283.62hipcenter ~ Age +Weight +HtShoes +Ht +Seated +Arm +Thigh +Leg
Df Sum of Sq RSS AIC- Ht 1 5 41267 282- Weight 1 9 41271 282- Seated 1 29 41290 282- HtShoes 1 108 41370 282- Arm 1 165 41427 282- Thigh 1 263 41525 282<none> 41262 284- Age 1 2632 43894 284- Leg 1 2655 43917 284
Multiple Regression Searching for the "best solution"
Deletions ranked in increasing order of AIC
14
Step: AIC=281.63 ## after removing Hthipcenter ~ Age +Weight +HtShoes +Seated +Arm +Thigh +Leg
Df Sum of Sq RSS AIC- Weight 1 11 41278 280- Seated 1 31 41297 280- Arm 1 161 41427 280- Thigh 1 269 41536 280- HtShoes 1 972 42239 281<none> 41267 282- Leg 1 2665 43931 282- Age 1 2809 44075 282
Multiple Regression Searching for the "best solution"
15
Step: AIC=279.64 ## after removing Ht & Weighthipcenter ~ Age +HtShoes +Seated +Arm +Thigh +Leg
Df Sum of Sq RSS AIC- Seated 1 35 41313 278- Arm 1 156 41434 278- Thigh 1 285 41563 278- HtShoes 1 975 42253 279<none> 41278 280- Leg 1 2661 43939 280- Age 1 3012 44290 280
Multiple Regression Searching for the "best solution"
16
Step: AIC=277.67 ## after removing Ht, Weight & Seatedhipcenter ~ Age +HtShoes +Arm +Thigh +Leg
Df Sum of Sq RSS AIC- Arm 1 172 41485 276- Thigh 1 345 41658 276- HtShoes 1 1853 43166 277<none> 41313 278- Leg 1 2871 44184 278- Age 1 2977 44290 278
Step: AIC=275.83 ## after removing Arm as wellhipcenter ~ Age + HtShoes + Thigh + Leg
Df Sum of Sq RSS AIC- Thigh 1 473 41958 274<none> 41485 276- HtShoes 1 2341 43826 276- Age 1 3501 44986 277- Leg 1 3592 45077 277
Multiple Regression Searching for the "best solution"
17
Step: AIC=274.26 ## after removing Thigh toohipcenter ~ Age + HtShoes + Leg
Df Sum of Sq RSS AIC<none> 41958 274- Age 1 3109 45067 275- Leg 1 3476 45434 275- HtShoes 1 4219 46176 276
> summary(s.mod)Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 456.2137 102.8078 4.438 9.1e-05 ***Age 0.5998 0.3779 1.587 0.1217 HtShoes -2.3023 1.2452 -1.849 0.0732 . Leg -6.8297 4.0693 -1.678 0.1024 ---Residual standard error: 35.13 on 34 degrees of freedomMultiple R-squared: 0.6813, Adjusted R-squared: 0.6531 F-statistic: 24.22 on 3 and 34 DF, p-value: 1.437e-08
Multiple Regression Searching for the "best solution"
18
> anova(s.mod)Response: hipcenter Df Sum Sq Mean Sq F value Pr(>F) Age 1 5541 5541 4.4904 0.04147 * HtShoes 1 80664 80664 65.3647 1.996e-09 ***Leg 1 3476 3476 2.8169 0.10244 Residuals 34 41958 1234
> drop1(s.mod, test="F")hipcenter ~ Age + HtShoes + Leg Df Sum of Sq RSS AIC F value Pr(F) <none> 41958 274 Age 1 3109 45067 275 2.5192 0.12173 HtShoes 1 4219 46176 276 3.4185 0.07318 .Leg 1 3476 45434 275 2.8169 0.10244
Multiple Regression Searching for the "best solution"
19
> plot(hipcenter ~ s.mod$fit)
> abline(0,1, col="red")
> null <- lm(hipcenter ~1, data=seatpos) ## starting from a null model
> s.mod <- step(null, ~. +Age+Weight+Ht+HtShoes+Seated+Arm+Thigh+Leg, direction="forward") ## intermediate steps removedStep: AIC=274.24hipcenter ~ Ht + Leg + Age
Df Sum of Sq RSS AIC<none> 41938 274+ Thigh 1 373 41565 276+ Arm 1 257 41681 276+ Seated 1 121 41817 276+ Weight 1 47 41891 276+ HtShoes 1 13 41925 276
Multiple Regression
No guarantee that the forward & backward searches
will find the same solution
20
> ozone.pollution <- read.table("ozone.data.txt", header=T)
> dim(ozone.pollution)[1] 111 4
> names(ozone.pollution)[1] "rad" "temp" "wind" "ozone"
> attach(ozone.pollution)
> pairs(ozone.pollution, panel=panel.smooth, pch=16, lwd=2)
> model <- lm(ozone ~ ., data=ozone.pollution)
Multiple RegressionAnother example: How is ozone concentration in the atmosphere
related to solar radiation, ambient temperature & wind speed ?
21
> summary(model)Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -64.23208 23.04204 -2.788 0.00628 ** rad 0.05980 0.02318 2.580 0.01124 * temp 1.65121 0.25341 6.516 2.43e-09 ***wind -3.33760 0.65384 -5.105 1.45e-06 ***---Residual standard error: 21.17 on 107 degrees of freedomMultiple R-squared: 0.6062, Adjusted R-squared: 0.5952 F-statistic: 54.91 on 3 and 107 DF, p-value: < 2.2e-16
> drop1(model, test="F")ozone ~ rad + temp + wind Df Sum of Sq RSS AIC F value Pr(F) <none> 47964 682 rad 1 2984 50948 686 6.6565 0.01124 * temp 1 19032 66996 717 42.4567 2.429e-09 ***wind 1 11680 59644 704 26.0567 1.450e-06 ***
Multiple Regression
22
> plot(ozone ~ model$fitted, pch=16, xlab="Model Predictions", ylab="Ozone Concentration")
> abline(0,1, col="red", lwd=2)
> shapiro.test(model$res)
Shapiro-Wilk normality test
data: model$res W = 0.9173, p-value = 3.704e-06
> plot(model, which=1)
Multiple Regression
23
> library(car)
> cr.plot(model, rad, pch=16, main="")
> cr.plot(model, temp, pch=16, main="")
> cr.plot(model, wind, pch=16, main="")
Multiple RegressionWhich predictor variable is
non-linearly related to Ozone ?
rad temp wind
24
> model2 <- lm(ozone ~ poly(rad,2)+poly(temp,2)+poly(wind,2))
> summary(model2)Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 42.099 1.735 24.269 < 2e-16 ***poly(rad, 2)1 65.085 19.379 3.359 0.00110 ** poly(rad, 2)2 -16.259 20.174 -0.806 0.42213 poly(temp, 2)1 142.708 23.502 6.072 2.09e-08 ***poly(temp, 2)2 56.043 19.138 2.928 0.00419 ** poly(wind, 2)1 -125.800 21.391 -5.881 4.99e-08 ***poly(wind, 2)2 88.636 19.199 4.617 1.12e-05 ***---Residual standard error: 18.28 on 104 degrees of freedomMultiple R-squared: 0.7148, Adjusted R-squared: 0.6984 F-statistic: 43.44 on 6 and 104 DF, p-value: < 2.2e-16
Polynomial RegressionWhen the trend between a predictor variable and
the response variable is not linear, the curvature
can be "captured" using polynomials of various degrees.
25
> extractAIC(model2) ## Simplified Version[1] 7.0000 651.8105
> model3 <- lm(ozone ~ rad +poly(temp,2) +poly(wind,2))> extractAIC(model3)[1] 6.0000 650.5016
> model4 <- lm(ozone ~ rad+poly(temp,3)+poly(wind,2) )> extractAIC(model4)[1] 7.0000 648.0394
> model5 <- lm(ozone ~ rad+poly(temp,2)+poly(wind,3) )> extractAIC(model5)[1] 7.0000 651.6489
> model6 <- lm(ozone ~ rad+poly(temp,3)+poly(wind,3) )> extractAIC(model6)[1] 8.0000 649.0149
> extractAIC(model) ## original, strictly linear model[1] 4.0000 681.6233
Polynomial RegressionBut how many degrees should we choose ?
Best Model
26
> plot(model4, which=1)
> shapiro.test(model4$res)
Shapiro-Wilk normality test
data: model4$res W = 0.9309, p-value = 2.267e-05
> plot(model4, which=2)
Polynomial Regression
27
> library(MASS)
> boxcox(model4, plotit=T)> boxcox(model4, plotit=T, lambda=seq(0,1,by=.1) )
> new.ozone <- ozone^(1/3)
> mod4 <- lm(new.ozone ~ rad +poly(temp,3) +poly(wind,2) )
> extractAIC(mod4)[1] 7.0000 -162.6569
> shapiro.test(mod4$res)
Shapiro-Wilk normality test
data: mod4$res W = 0.9899, p-value = 0.5855
Polynomial RegressionFinding the best transformation
of our response variable
28
> par(mfrow=c(2,2))
> plot(mod4)
> anova(mod4)Analysis of Variance Table
Response: new.ozone Df Sum Sq Mean Sq F value Pr(>F) rad 1 15.531 15.531 71.467 1.859e-13 ***poly(temp, 3) 3 40.947 13.649 62.804 < 2.2e-16 ***poly(wind, 2) 2 8.129 4.065 18.703 1.152e-07 ***Residuals 104 22.602 0.217
> summary(mod4)Residual standard error: 0.4662 on 104 degrees of freedomMultiple R-squared: 0.7408, Adjusted R-squared: 0.7259 F-statistic: 49.55 on 6 and 104 DF, p-value: < 2.2e-16
Polynomial Regression
29
> par(mfrow=c(1,1))
> plot(new.ozone ~ mod4$fitted)> abline(0,1, col="red", lwd=2)
> cr.plot(mod4, rad, pch=16, main="")> cr.plot(mod4, poly(temp,3), pch=16, main="")> cr.plot(mod4, poly(wind,2), pch=16, main="")
Polynomial Regression
rad temp wind