STATS 330: Lecture 16

04/22/23 330 lecture 16 1

STATS 330: Lecture 16

04/22/23 330 lecture 16 2

Case studyAim of today’s lecture

To illustrate the modelling process using the evaporation data.

04/22/23 330 lecture 16 3

The Evaporation data

Data in data frame evap.df Aims of the analysis:

• Understand relationships between explanatory variables and the response

• Be able to predict evaporation loss given the other variables

04/22/23 330 lecture 16 4

Case Study: Evaporation data

Recall from Lecture 15: variables are

evap: the amount of moisture evaporating from the soil in the 24 hour period (response)

maxst: maximum soil temperature over the 24 hour periodminst: minimum soil temperature over the 24 hour periodavst: average soil temperature over the 24 hour periodmaxat: maximum air temperature over the 24 hour periodminat: minimum air temperature over the 24 hour periodavat: average air temperature over the 24 hour periodmaxh: maximum humidity over the 24 hour periodminh: minimum humidity over the 24 hour periodavh: average humidity over the 24 hour periodwind: average wind speed over the 24 hour period.

04/22/23 330 lecture 16 5

Modelling cycle

Examine residuals

Fit model

Transform

Choose Model

Bad fit

Good fit

Use model

Plots, theory

04/22/23 330 lecture 16 6

Modelling cycle (2)Our plan of attack:

1. Graphical check

1. Suitability for regression

2. Gross outliers

2. Preliminary fit

3. Model selection (for prediction)

4. Transforming if required

5. Outlier check

6. Use model for prediction

04/22/23 330 lecture 16 7

Step 1: Plots Preliminary plots Want to get an initial idea of suitability of data

for regression modelling Check for linear relationships, outliers

• Pairs plots, coplots • Data looks OK to proceed, but evap/maxh plot

looks curved

04/22/23 330 lecture 16 8

Points to note Avh has very few values Strong relationships between response and

some variables (particularly maxh, avst) Not much relationship between response and

minst, minat, wind strong relationships between min, av and max No obvious outliers

04/22/23 330 lecture 16 9

Step 2: preliminary fitCoefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -54.074877 130.720826 -0.414 0.68164

avst 2.231782 1.003882 2.223 0.03276 *

minst 0.204854 1.104523 0.185 0.85393

maxst -0.742580 0.349609 -2.124 0.04081 *

avat 0.501055 0.568964 0.881 0.38452

minat 0.304126 0.788877 0.386 0.70219

maxat 0.092187 0.218054 0.423 0.67505

avh 1.109858 1.133126 0.979 0.33407

minh 0.751405 0.487749 1.541 0.13242

maxh -0.556292 0.161602 -3.442 0.00151 **

wind 0.008918 0.009167 0.973 0.33733

Residual standard error: 6.508 on 35 degrees of freedom

Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023

F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11

04/22/23 330 lecture 16 10

04/22/23 330 lecture 16 11

0 10 20 30 40 50

-15

-10

-50

510

15

Fitted values

Res

idua

ls

Residuals vs Fitted

33412

-2 -1 0 1 2

-2-1

01

2

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q-Q

41 33 2

0 10 20 30 40 50

0.0

0.5

1.0

1.5

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale-Location41 33

2

0.0 0.2 0.4 0.6

-3-2

-10

12

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance

1

0.5

0.5

1

Residuals vs Leverage

31

241

Findings Plots OK, normality dubious Gam plots indicated no transformations Point 31 has quite high Cooks distance but

removing it doesn’t change regression much Model is OK. Could interpret coefficients, but variables

highly correlated.

04/22/23 330 lecture 16 12

04/22/23 330 lecture 16 13

Step 3: Model selection

Use APR Model selected was

evap ~ maxat + maxh + wind

However, this model does not fit all that well (outliers, non-normality)

Try “best AIC” model

evap ~ avst + maxst + maxat + minh+maxh Now proceed to step 4

04/22/23 330 lecture 16 14

Step 4: Diagnostic checks

For a quick check, plot the regression object produced by lm

model1.lm<-lm(evap ~ avst + maxst + maxat + minh+maxh, data=evap.df)

plot(model1.lm)

04/22/23 330 lecture 16 15

Outliers ?Non-normal?

0 10 20 30 40 50

-20

-15

-10

-50

510

15

Fitted values

Res

idua

ls

Residuals vs Fitted

41 33

2

-2 -1 0 1 2

-3-2

-10

12

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q-Q

41

332

0 10 20 30 40 50

0.0

0.5

1.0

1.5

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale-Location41

33 2

0.0 0.1 0.2 0.3 0.4

-3-2

-10

12

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance1

0.5

0.5

Residuals vs Leverage

2

41

38

04/22/23 330 lecture 16 16

Conclusions?

No real evidence of non-linearity, but will check further with gams

Normal plot looks curved Some largish outliers Points 2, 41 have largish Cooks D

04/22/23 330 lecture 16 17

Checking linearity Check for linearity with gams

> library(mgcv)>plot(gam(evap ~ s(avst) + s(maxst) + s(maxat) + s(maxh) + s(wind), data=evap.df))

04/22/23 330 lecture 16 18

75 80 85 90 95

-60

-40

-20

020

40

avst

s(av

st,4

.49)

130 140 150 160 170 180 190 200

-60

-40

-20

020

40

maxst

s(m

axst

,1)

150 160 170 180 190 200 210

-60

-40

-20

020

40

maxat

s(m

axat

,1)

340 360 380 400 420 440 460 480

-60

-40

-20

020

40

maxh

s(m

axh,

3.69

)

30 40 50 60 70

-60

-40

-20

020

40

minh

s(m

inh,

2.04

)

Transform avst, maxh ?

04/22/23 330 lecture 16 19

Remedy Gam plots for avst and maxh are curved Try cubics in these variables Plots look better Cubic terms are significant

04/22/23 330 lecture 16 20

04/22/23 330 lecture 16 21

> model2.lm<-lm(evap ~ poly(avst,3) + maxst + maxat + minh+poly(maxh,3), data=evap.df)> summary(model2.lm)

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 74.6521 25.4308 2.935 0.00577 ** poly(avst, 3)1 83.0106 27.3221 3.038 0.00441 ** poly(avst, 3)2 21.4666 8.3097 2.583 0.01399 * poly(avst, 3)3 14.1680 7.2199 1.962 0.05749 . maxst -0.8167 0.1697 -4.814 2.65e-05 ***maxat 0.4175 0.1177 3.546 0.00111 ** minh 0.4580 0.3253 1.408 0.16766 poly(maxh, 3)1 -89.0809 20.0297 -4.447 8.02e-05 ***poly(maxh, 3)2 -10.7374 7.9265 -1.355 0.18398 poly(maxh, 3)3 15.1172 6.3209 2.392 0.02212 * ---

Residual standard error: 5.276 on 36 degrees of freedomMultiple R-squared: 0.8961, Adjusted R-squared: 0.8701 F-statistic: 34.49 on 9 and 36 DF, p-value: 4.459e-15

04/22/23 330 lecture 16 22

New modelLets now adopt model

lm(evap~poly(avst,3)+maxst+maxat+poly(maxh,3) + wind

Outliers are not too bad but lets check

> influenceplots(model2.lm)

04/22/23 330 lecture 16 23

04/22/23 330 lecture 16 24

04/22/23 330 lecture 16 25

Deletion of points Points 2, 6, 7, 41 are affecting the fitted values,

some coefficients. Removing these one at a time and refitting indicates that the cubics are not very robust, so we revert to the non-polynomial model

The coefficients of the non-polynomial model are fairly stable when we delete these points one at a time, so we decide to retain them.

04/22/23 330 lecture 16 26

Normality?However, the normal plot for the non-

polynomial model is not very straight – WB test has p-value 0.

Normality of polynomial model is better

Try predictions with both

04/22/23 330 lecture 16 27

predict.df = data.frame(avst = mean(evap.df$avst),maxst = mean(evap.df$maxst),maxat = mean(evap.df$maxat),maxh = mean(evap.df$maxh),minh = mean(evap.df$minh))

rbind(predict(model1.lm, predict.df,interval="p" ),predict(model2.lm, predict.df,interval="p" )) fit lwr upr1 34.67391 21.75680 47.591031 32.38471 21.39857 43.37084CV fit: fit lwr upr1 34.67391 21.02628 48.32154

STATS 330: Lecture 16

Documents

Transcript of STATS 330: Lecture 16