Page 1

Lecture 4

Linear Models III

Olivier MISSA, om502@york.ac.uk

Advanced Research Skills

Page 2

Outline

"Refresher" on different types of model:

Multiple regression

Polynomial regression

Model building

Finding the "best" model.

Page 3

Multiple Regression

When more than one continuous or discrete variable is used to predict a response variable.

Example: the 38 car drivers dataset.

> data(seatpos)   ## from the faraway package
> attach(seatpos)
> names(seatpos)
[1] "Age"       "Weight"    "HtShoes"   "Ht"        "Seated"
[6] "Arm"       "Thigh"     "Leg"       "hipcenter"

> summary(lm(hipcenter ~ Age))
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -192.9645    24.3015  -7.940 2.00e-09 ***
Age            0.7963     0.6331   1.258    0.217
...
Multiple R-squared: 0.0421,  Adjusted R-squared: 0.01549
...

Page 4

Multiple Regression

> summary(lm(hipcenter ~ Age))
         Estimate Std. Error t value Pr(>|t|)
Age        0.7963     0.6331   1.258    0.217
Multiple R-squared: 0.0421

> summary(lm(hipcenter ~ Weight))
         Estimate Std. Error t value Pr(>|t|)
Weight    -1.0674     0.2134  -5.002 1.49e-05 ***
Multiple R-squared: 0.41

> summary(lm(hipcenter ~ HtShoes))
         Estimate Std. Error t value Pr(>|t|)
HtShoes   -4.2621     0.5391  -7.907 2.21e-09 ***
Multiple R-squared: 0.6346

> summary(lm(hipcenter ~ Ht))
         Estimate Std. Error t value Pr(>|t|)
Ht        -4.2650     0.5351  -7.970 1.83e-09 ***
Multiple R-squared: 0.6383

> summary(lm(hipcenter ~ Seated))
         Estimate Std. Error t value Pr(>|t|)
Seated     -8.844      1.375  -6.432 1.84e-07 ***
Multiple R-squared: 0.5347

Page 5

Multiple Regression

> summary(lm(hipcenter ~ Arm))
         Estimate Std. Error t value Pr(>|t|)
Arm       -10.351      2.391  -4.329 0.000114 ***
Multiple R-squared: 0.3423

> summary(lm(hipcenter ~ Thigh))
         Estimate Std. Error t value Pr(>|t|)
Thigh      -9.100      2.069  -4.398 9.29e-05 ***
Multiple R-squared: 0.3495

> summary(lm(hipcenter ~ Leg))
         Estimate Std. Error t value Pr(>|t|)
Leg       -13.795      1.801  -7.658 4.59e-09 ***
Multiple R-squared: 0.6196

[Plots of hipcenter against Age, Ht and Leg]
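As a side note (not on the original slide), the single-predictor R-squared values listed above can be collected in one pass rather than fitting each model by hand; a minimal sketch, assuming seatpos is attached as before:

> predictors <- setdiff(names(seatpos), "hipcenter")
> r2 <- sapply(predictors, function(v)
+    summary(lm(reformulate(v, response="hipcenter"), data=seatpos))$r.squared)
> round(sort(r2, decreasing=TRUE), 4)   ## Ht and HtShoes give the largest R-squared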

Page 6

Multiple Regression

> summary(mod <- lm(hipcenter ~ Ht + Leg))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  491.244     99.543   4.935 1.95e-05 ***
Ht            -2.565      1.268  -2.022   0.0509 .
Leg           -6.136      4.164  -1.473   0.1496
---
Residual standard error: 35.79 on 35 degrees of freedom
Multiple R-squared: 0.6594,  Adjusted R-squared: 0.6399
F-statistic: 33.88 on 2 and 35 DF,  p-value: 6.517e-09

> summary(lm(hipcenter ~ Ht))
         Estimate Std. Error t value Pr(>|t|)
Ht        -4.2650     0.5351  -7.970 1.83e-09 ***
Multiple R-squared: 0.6383

> summary(lm(hipcenter ~ Leg))
         Estimate Std. Error t value Pr(>|t|)
Leg       -13.795      1.801  -7.658 4.59e-09 ***
Multiple R-squared: 0.6196

Note how the slope parameters change: on its own each predictor is still highly significant, but in the combined model neither Ht nor Leg remains significant.

Page 7

Multiple Regression: beware of strong collinearity

> drop1(mod, test="F")
Single term deletions

Model:
hipcenter ~ Ht + Leg
       Df Sum of Sq   RSS AIC F value   Pr(F)
<none>              44835 275
Ht      1      5236 50071 277  4.0877 0.05089 .
Leg     1      2781 47616 275  2.1711 0.14957

> cor(Ht, Leg)
[1] 0.9097524

> plot(Ht ~ Leg)
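As a hedged aside (not part of the original slide), the strength of this collinearity can be quantified with variance inflation factors from the car package, which this lecture loads later for cr.plot() anyway:

> library(car)
> vif(mod)   ## for hipcenter ~ Ht + Leg both VIFs equal 1/(1 - 0.9098^2), i.e. roughly 5.8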

Page 8

Multiple Regression: beware of strong collinearity

> add1(lm(hipcenter ~ Ht), ~. +Age +Weight +HtShoes +Seated +Arm +Thigh +Leg, test="F")
Single term additions

Model:
hipcenter ~ Ht
        Df Sum of Sq   RSS AIC F value  Pr(F)
<none>               47616 275
Age      1      2354 45262 275  1.8199 0.1860
Weight   1       196 47420 277  0.1446 0.7061
HtShoes  1        26 47590 277  0.0189 0.8913
Seated   1       102 47514 277  0.0748 0.7861
Arm      1        76 47540 277  0.0558 0.8147
Thigh    1         5 47611 277  0.0034 0.9538
Leg      1      2781 44835 275  2.1711 0.1496

No added variable significantly improves the model.
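The reason nothing helps is that most of the body-size measurements are themselves strongly correlated with Ht. One quick way to see this at a glance (a sketch, not from the original slides):

> round(cor(seatpos[ , names(seatpos) != "hipcenter"]), 2)   ## correlation matrix of the predictors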

Page 9

Comparing models

How can we compare models on an equal footing, regardless of the number of parameters?

The multiple R² can only increase as more variables enter a model (because the RSS can only decrease).

The adjusted R² corrects for the different number of parameters to some extent.

$r^2_{mult} = 1 - \dfrac{RSS}{TSS}$   (no good to compare models)

$r^2_{adj} = 1 - \dfrac{RSS/(n-p)}{TSS/(n-1)}$

(with $n$ observations and $p$ estimated parameters)
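As a quick check of these formulas (not on the original slide), both quantities can be recomputed by hand for the Ht + Leg model fitted earlier and compared with the values summary() reported (0.6594 and 0.6399):

> rss <- sum(residuals(mod)^2)
> tss <- sum((hipcenter - mean(hipcenter))^2)
> n <- length(hipcenter); p <- length(coef(mod))   ## 38 drivers, 3 estimated parameters
> 1 - rss/tss                                      ## multiple R-squared
> 1 - (rss/(n - p)) / (tss/(n - 1))                ## adjusted R-squared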

Page 10

Akaike Information Criterion

Invented by Hirotugu Akaike in 1971, after a few weeks of sleepless nights stressing over a conference presentation.

The AIC (originally called 'An Information Criterion') penalizes the likelihood of a model according to the number of parameters being estimated:

$AIC = 2k - 2\ln(\hat{L})$

where $\hat{L}$ is the maximized value of the likelihood function and $k$ is the number of parameters.

The lower the AIC value, the better the model.
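In R the exact expression can be checked against the built-in AIC() function; a minimal sketch (note that for an lm fit, logLik() counts the residual variance as one extra parameter, so k here is the number of coefficients plus one):

> m <- lm(hipcenter ~ Ht + Leg)       ## any fitted model will do
> ll <- logLik(m)
> k <- attr(ll, "df")                 ## parameters counted by logLik (coefficients + sigma)
> 2*k - 2*as.numeric(ll)              ## AIC = 2k - 2 ln(L)
> AIC(m)                              ## same value from the built-in function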

Page 11

Akaike Information Criterion

Invented by Hirotugu Akaike in 1971, after a few weeks of sleepless nights stressing over a conference presentation.

When residuals are normally & independently distributed:

$AIC = 2k - 2\ln(\hat{L})$   (exact expression)

$AIC = n\ln(RSS/n) + 2k$   (simplified expression, not equal to the exact value)
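The simplified expression is the one extractAIC() reports for lm fits (the ozone slides below label it the 'Simplified Version'); a small sketch of the correspondence, with k taken as the number of coefficients only:

> m <- lm(hipcenter ~ Ht + Leg)
> rss <- sum(residuals(m)^2); n <- nobs(m); k <- length(coef(m))
> n*log(rss/n) + 2*k                  ## simplified AIC
> extractAIC(m)                       ## returns c(edf, AIC): same AIC value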

Page 12

Akaike Information Criterion

May need to be corrected for small sample sizes:

$AIC_c = AIC + \dfrac{2k(k+1)}{n-k-1}$

A variant, the BIC ('Bayesian Information Criterion', also known as the Schwarz criterion), penalizes free parameters more strongly than AIC:

$BIC = k\ln(n) - 2\ln(\hat{L})$

When residuals are normally & independently distributed:

$AIC = 2k - 2\ln(\hat{L})$   (exact expression)

$AIC = n\ln(RSS/n) + 2k$   (simplified expression, not equal to the exact value)
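BIC() is built into R, whereas AICc has to be computed by hand (or via an add-on package); a minimal sketch, using the same convention for k as logLik():

> m <- lm(hipcenter ~ Ht + Leg)
> n <- nobs(m); k <- attr(logLik(m), "df")
> AIC(m) + 2*k*(k + 1)/(n - k - 1)    ## small-sample corrected AICc
> BIC(m)                              ## equals k*log(n) - 2*ln(L)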

Page 13

Multiple Regression: searching for the "best" solution

> mod <- lm(hipcenter ~ . , data=seatpos)   ## starts with everything
> s.mod <- step(mod)                        ## by default prunes variables out ('backward')

Start:  AIC=283.62
hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg

          Df Sum of Sq   RSS AIC
- Ht       1         5 41267 282
- Weight   1         9 41271 282
- Seated   1        29 41290 282
- HtShoes  1       108 41370 282
- Arm      1       165 41427 282
- Thigh    1       263 41525 282
<none>                 41262 284
- Age      1      2632 43894 284
- Leg      1      2655 43917 284

Deletions are ranked in increasing order of AIC.

Page 14

Multiple Regression: searching for the "best" solution

Step:  AIC=281.63   ## after removing Ht
hipcenter ~ Age + Weight + HtShoes + Seated + Arm + Thigh + Leg

          Df Sum of Sq   RSS AIC
- Weight   1        11 41278 280
- Seated   1        31 41297 280
- Arm      1       161 41427 280
- Thigh    1       269 41536 280
- HtShoes  1       972 42239 281
<none>                 41267 282
- Leg      1      2665 43931 282
- Age      1      2809 44075 282

Page 15

Multiple Regression: searching for the "best" solution

Step:  AIC=279.64   ## after removing Ht & Weight
hipcenter ~ Age + HtShoes + Seated + Arm + Thigh + Leg

          Df Sum of Sq   RSS AIC
- Seated   1        35 41313 278
- Arm      1       156 41434 278
- Thigh    1       285 41563 278
- HtShoes  1       975 42253 279
<none>                 41278 280
- Leg      1      2661 43939 280
- Age      1      3012 44290 280

Page 16

Multiple Regression: searching for the "best" solution

Step:  AIC=277.67   ## after removing Ht, Weight & Seated
hipcenter ~ Age + HtShoes + Arm + Thigh + Leg

          Df Sum of Sq   RSS AIC
- Arm      1       172 41485 276
- Thigh    1       345 41658 276
- HtShoes  1      1853 43166 277
<none>                 41313 278
- Leg      1      2871 44184 278
- Age      1      2977 44290 278

Step:  AIC=275.83   ## after removing Arm as well
hipcenter ~ Age + HtShoes + Thigh + Leg

          Df Sum of Sq   RSS AIC
- Thigh    1       473 41958 274
<none>                 41485 276
- HtShoes  1      2341 43826 276
- Age      1      3501 44986 277
- Leg      1      3592 45077 277

Page 17

Multiple Regression: searching for the "best" solution

Step:  AIC=274.26   ## after removing Thigh too
hipcenter ~ Age + HtShoes + Leg

          Df Sum of Sq   RSS AIC
<none>                 41958 274
- Age      1      3109 45067 275
- Leg      1      3476 45434 275
- HtShoes  1      4219 46176 276

> summary(s.mod)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 456.2137   102.8078   4.438  9.1e-05 ***
Age           0.5998     0.3779   1.587   0.1217
HtShoes      -2.3023     1.2452  -1.849   0.0732 .
Leg          -6.8297     4.0693  -1.678   0.1024
---
Residual standard error: 35.13 on 34 degrees of freedom
Multiple R-squared: 0.6813,  Adjusted R-squared: 0.6531
F-statistic: 24.22 on 3 and 34 DF,  p-value: 1.437e-08

Page 18

Multiple Regression: searching for the "best" solution

> anova(s.mod)
Response: hipcenter
          Df Sum Sq Mean Sq F value    Pr(>F)
Age        1   5541    5541  4.4904   0.04147 *
HtShoes    1  80664   80664 65.3647 1.996e-09 ***
Leg        1   3476    3476  2.8169   0.10244
Residuals 34  41958    1234

> drop1(s.mod, test="F")
hipcenter ~ Age + HtShoes + Leg
        Df Sum of Sq   RSS AIC F value   Pr(F)
<none>               41958 274
Age      1      3109 45067 275  2.5192 0.12173
HtShoes  1      4219 46176 276  3.4185 0.07318 .
Leg      1      3476 45434 275  2.8169 0.10244
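The two tables differ because anova() tests terms sequentially, in the order they appear in the formula, while drop1() tests each term as if it entered the model last. Reordering the formula changes the anova() table but not the drop1() one; a quick sketch (not on the original slide):

> anova(lm(hipcenter ~ Leg + HtShoes + Age, data=seatpos))            ## sequential SS change with the order
> drop1(lm(hipcenter ~ Leg + HtShoes + Age, data=seatpos), test="F")  ## marginal tests do not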

Page 19

Multiple Regression

> plot(hipcenter ~ s.mod$fit)
> abline(0, 1, col="red")

> null <- lm(hipcenter ~ 1, data=seatpos)   ## starting from a null model
> s.mod <- step(null, ~. +Age +Weight +Ht +HtShoes +Seated +Arm +Thigh +Leg, direction="forward")

## intermediate steps removed
Step:  AIC=274.24
hipcenter ~ Ht + Leg + Age

          Df Sum of Sq   RSS AIC
<none>                 41938 274
+ Thigh    1       373 41565 276
+ Arm      1       257 41681 276
+ Seated   1       121 41817 276
+ Weight   1        47 41891 276
+ HtShoes  1        13 41925 276

There is no guarantee that the forward and backward searches will find the same solution.
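Here the two searches do end up with different three-predictor models (Age + HtShoes + Leg backward, Ht + Leg + Age forward); their AICs can be put side by side with extractAIC():

> extractAIC(lm(hipcenter ~ Age + HtShoes + Leg, data=seatpos))   ## backward choice: AIC = 274.26
> extractAIC(lm(hipcenter ~ Ht + Leg + Age, data=seatpos))        ## forward choice: AIC = 274.24, marginally lower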

Page 20

Multiple Regression

Another example: how is ozone concentration in the atmosphere related to solar radiation, ambient temperature and wind speed?

> ozone.pollution <- read.table("ozone.data.txt", header=T)
> dim(ozone.pollution)
[1] 111   4

> names(ozone.pollution)
[1] "rad"   "temp"  "wind"  "ozone"

> attach(ozone.pollution)
> pairs(ozone.pollution, panel=panel.smooth, pch=16, lwd=2)
> model <- lm(ozone ~ ., data=ozone.pollution)

Page 21

Multiple Regression

> summary(model)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -64.23208   23.04204  -2.788  0.00628 **
rad           0.05980    0.02318   2.580  0.01124 *
temp          1.65121    0.25341   6.516 2.43e-09 ***
wind         -3.33760    0.65384  -5.105 1.45e-06 ***
---
Residual standard error: 21.17 on 107 degrees of freedom
Multiple R-squared: 0.6062,  Adjusted R-squared: 0.5952
F-statistic: 54.91 on 3 and 107 DF,  p-value: < 2.2e-16

> drop1(model, test="F")
ozone ~ rad + temp + wind
       Df Sum of Sq   RSS AIC F value     Pr(F)
<none>              47964 682
rad     1      2984 50948 686  6.6565   0.01124 *
temp    1     19032 66996 717 42.4567 2.429e-09 ***
wind    1     11680 59644 704 26.0567 1.450e-06 ***

Page 22

Multiple Regression

> plot(ozone ~ model$fitted, pch=16, xlab="Model Predictions", ylab="Ozone Concentration")
> abline(0, 1, col="red", lwd=2)

> shapiro.test(model$res)

        Shapiro-Wilk normality test

data:  model$res
W = 0.9173, p-value = 3.704e-06

> plot(model, which=1)

Page 23

Multiple Regression

Which predictor variable is non-linearly related to ozone?

> library(car)
> cr.plot(model, rad, pch=16, main="")
> cr.plot(model, temp, pch=16, main="")
> cr.plot(model, wind, pch=16, main="")

[Component + residual plots for rad, temp and wind]

Page 24

Polynomial Regression

When the trend between a predictor variable and the response variable is not linear, the curvature can be "captured" using polynomials of various degrees.

> model2 <- lm(ozone ~ poly(rad,2) + poly(temp,2) + poly(wind,2))
> summary(model2)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       42.099      1.735  24.269  < 2e-16 ***
poly(rad, 2)1     65.085     19.379   3.359  0.00110 **
poly(rad, 2)2    -16.259     20.174  -0.806  0.42213
poly(temp, 2)1   142.708     23.502   6.072 2.09e-08 ***
poly(temp, 2)2    56.043     19.138   2.928  0.00419 **
poly(wind, 2)1  -125.800     21.391  -5.881 4.99e-08 ***
poly(wind, 2)2    88.636     19.199   4.617 1.12e-05 ***
---
Residual standard error: 18.28 on 104 degrees of freedom
Multiple R-squared: 0.7148,  Adjusted R-squared: 0.6984
F-statistic: 43.44 on 6 and 104 DF,  p-value: < 2.2e-16
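Note that poly() builds orthogonal polynomial terms by default rather than raw powers, which is why the degree-1 coefficients above differ from the slopes of the purely linear model; the fitted curve is the same either way. A small sketch (assuming the ozone data are still attached):

> fit.orth <- lm(ozone ~ poly(temp, 2))          ## orthogonal polynomial (the default)
> fit.raw  <- lm(ozone ~ temp + I(temp^2))       ## raw powers: different coefficients...
> all.equal(fitted(fit.orth), fitted(fit.raw))   ## ...but identical fitted values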

Page 25

Polynomial Regression

But how many degrees should we choose?

> extractAIC(model2)   ## Simplified Version
[1]   7.0000 651.8105

> model3 <- lm(ozone ~ rad + poly(temp,2) + poly(wind,2))
> extractAIC(model3)
[1]   6.0000 650.5016

> model4 <- lm(ozone ~ rad + poly(temp,3) + poly(wind,2))
> extractAIC(model4)
[1]   7.0000 648.0394    ## Best Model (lowest AIC)

> model5 <- lm(ozone ~ rad + poly(temp,2) + poly(wind,3))
> extractAIC(model5)
[1]   7.0000 651.6489

> model6 <- lm(ozone ~ rad + poly(temp,3) + poly(wind,3))
> extractAIC(model6)
[1]   8.0000 649.0149

> extractAIC(model)   ## original, strictly linear model
[1]   4.0000 681.6233

Page 26

Polynomial Regression

> plot(model4, which=1)

> shapiro.test(model4$res)

        Shapiro-Wilk normality test

data:  model4$res
W = 0.9309, p-value = 2.267e-05

> plot(model4, which=2)

Page 27

Polynomial Regression

Finding the best transformation of our response variable:

> library(MASS)
> boxcox(model4, plotit=T)
> boxcox(model4, plotit=T, lambda=seq(0, 1, by=.1))

> new.ozone <- ozone^(1/3)
> mod4 <- lm(new.ozone ~ rad + poly(temp,3) + poly(wind,2))
> extractAIC(mod4)
[1]    7.0000 -162.6569

> shapiro.test(mod4$res)

        Shapiro-Wilk normality test

data:  mod4$res
W = 0.9899, p-value = 0.5855
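The profile log-likelihood plotted by boxcox() presumably peaks near lambda = 1/3, hence the cube-root transformation chosen above; the maximizing value can also be extracted without plotting (a sketch):

> bc <- boxcox(model4, plotit=FALSE, lambda=seq(0, 1, by=0.01))
> bc$x[which.max(bc$y)]   ## lambda with the highest profile log-likelihood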

Page 28

Polynomial Regression

> par(mfrow=c(2,2))
> plot(mod4)

> anova(mod4)
Analysis of Variance Table

Response: new.ozone
               Df Sum Sq Mean Sq F value    Pr(>F)
rad             1 15.531  15.531  71.467 1.859e-13 ***
poly(temp, 3)   3 40.947  13.649  62.804 < 2.2e-16 ***
poly(wind, 2)   2  8.129   4.065  18.703 1.152e-07 ***
Residuals     104 22.602   0.217

> summary(mod4)
Residual standard error: 0.4662 on 104 degrees of freedom
Multiple R-squared: 0.7408,  Adjusted R-squared: 0.7259
F-statistic: 49.55 on 6 and 104 DF,  p-value: < 2.2e-16

Page 29

Polynomial Regression

> par(mfrow=c(1,1))
> plot(new.ozone ~ mod4$fitted)
> abline(0, 1, col="red", lwd=2)

> cr.plot(mod4, rad, pch=16, main="")
> cr.plot(mod4, poly(temp,3), pch=16, main="")
> cr.plot(mod4, poly(wind,2), pch=16, main="")

[Component + residual plots for rad, temp and wind]