
Extending the Linear Models 1: Smoothing and Generalised Additive Models Lab

Dr. Matteo Tanadini
Angewandte statistische Regression I, HS19 (ETHZ)

Contents

1 Polynomials to model non-linear effects
  1.1 Polynomials as predictors: from x to f(x)
  1.2 More complex non-linear relationships
2 Are polynomials the ultimate solution for modelling non-linear relationships?
  2.1 Collinearity
  2.2 Extrapolation not guaranteed
  2.3 Summary
3 Regression splines
4 Degree of complexity, how much is enough?
  4.1 Summary
5 Smoothing splines
6 Generalised Additive Models (GAMs)
  6.1 Several smooth terms
7 Appendix
  7.1 Interactions with smooth terms (**)
  7.2 2-dimensional smooth terms (**)
8 Session Information


1 Polynomials to model non-linear effects

In week three of this course we introduced the concept of quadratic effects. Let's briefly recapitulate the concept of polynomials and orthogonal polynomials.

We are going to use the d.trees data set again. This data set contains the growth rates of 557 trees.

## (results are hidden)
d.trees <- read.csv2("../../Data_sets/TreesChamagne2017_Lab.csv")
##
str(d.trees)
head(d.trees)

Let's visualise the effect of density.site with a smoother.

library(ggplot2)
gg.density.smooth <- ggplot(data = d.trees,
                            mapping = aes(y = growth.rate,
                                          x = density.site)) +
  geom_point() +
  geom_smooth()
##
gg.density.smooth

[Figure: growth.rate vs. density.site with a loess smoother]

Let's now fit a linear model to this data, where density.site is modelled as a quadratic effect. Note that, for the sake of simplicity, all other predictors are omitted from the model.

lm.trees.quad <- lm(growth.rate ~ density.site + I(density.site^2),
                    data = d.trees)
##
summary(lm.trees.quad)

Call:
lm(formula = growth.rate ~ density.site + I(density.site^2),
    data = d.trees)


Residuals:
    Min      1Q  Median      3Q     Max
 -0.970  -0.173   0.043   0.195   0.572

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.814776   0.205797    8.82  < 2e-16 ***
density.site      -0.035889   0.009893   -3.63  0.00031 ***
I(density.site^2)  0.000433   0.000116    3.74  0.00020 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.28 on 554 degrees of freedom
Multiple R-squared:  0.0254,  Adjusted R-squared:  0.0219
F-statistic: 7.22 on 2 and 554 DF,  p-value: 8e-04

The summary table indicates that the quadratic term is needed. Indeed, its p-value is < 0.001 (note that when looking at the p-values, we are assuming the model assumptions to be fulfilled). The fact that the quadratic term is significant does not come as a surprise, as it was very clear from the graph above that density.site has a non-linear effect.

As previously mentioned, one problem of using polynomials is the risk of introducing collinearity among the terms (i.e. between the linear and the quadratic term in the example above). Let's estimate the variance inflation factors for the model that contains both density.site and its square.

library(car)
vif(lm.trees.quad)

     density.site I(density.site^2)
               59                59

Indeed, the VIF is clearly larger than 10. Therefore, some action must be taken. The solution here is to use orthogonal polynomials, which are implemented in the poly() function¹.

Let's refit the model with orthogonal polynomials. Note that the correlation between the linear and the quadratic term computed via the poly() function is zero. Indeed, they are orthogonal, and therefore the issue of collinearity is solved.

lm.trees.poly <- lm(growth.rate ~ poly(density.site, degree = 2),
                    data = d.trees)
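As a quick check (not part of the original handout), we can verify the orthogonality claim directly; a minimal sketch:

## the two columns produced by poly() have zero correlation
## (up to numerical precision), so no collinearity remains
round(cor(poly(d.trees$density.site, degree = 2)), digits = 12)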

1.1 Polynomials as predictors: from x to f(x)

It is important to note that the use of polynomials implies that we modify a predictor such that several predictors are created. For example, by adding a quadratic term we implicitly create and add a new predictor to the model. The model clearly becomes more complex, as more parameters are estimated. Formally:

y = \beta_0 + \beta_x \cdot x + \varepsilon

y = \beta_0 + \beta_x \cdot x + \beta_{x^2} \cdot x^2 + \varepsilon

So it is important to note the step:

x \rightarrow f(x) = x + x^2

¹Another possible solution against collinear terms in polynomials is to centre the linear term before squaring it (i.e. first remove the mean and then square).
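A minimal sketch of the centring alternative from footnote 1 (not in the original handout; the names density.c and lm.trees.centred are introduced here for illustration):

## centre the linear term, then square the centred variable
d.trees$density.c <- d.trees$density.site - mean(d.trees$density.site)
lm.trees.centred <- lm(growth.rate ~ density.c + I(density.c^2),
                       data = d.trees)
## the variance inflation factors are now far below 10
vif(lm.trees.centred)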

1.2 More complex non-linear relationships

Let’s turn our attention to a simulated example.

[Figure: simulated data in two panels, "True relationship" and "Observations"; yy vs. x1]

If we were to assume a linear relationship, we would be locally under- and overestimating the true expected value. In other words, our model would not be a good simplified representation of reality.

[Figure: fit assuming a linear effect of x1; yy vs. x1]

Let’s try to model this relationship with a quadratic term.


[Figure: quadratic polynomial fit; yy vs. x1]

The quadratic term is not enough to model the relationship between x1 and yy. More flexibility is required.

Let’s try higher order polynomials, for example a cubic polynomial.

[Figure: cubic polynomial fit; yy vs. x1]

The regression line fits the data very well. As a matter of fact, the true relationship was simulated from a cubic polynomial.

So in this case, we could model this data with the following linear model.

lm.cubic <- lm(yy ~ poly(x1, degree = 3), data = sim.data)


2 Are polynomials the ultimate solution for modelling non-linear relationships?

Looking at the examples above, it may appear that polynomials can solve any problem of non-linear relationships, the only thing at stake being the choice of a "sufficiently complex" polynomial. Unfortunately, this is not the case. Although very useful in many settings, polynomials suffer from several shortcomings. These drawbacks are explained in the following subsections.

2.1 Collinearity

We have already seen that there is a simple solution to this issue: orthogonal polynomials.

2.2 Extrapolation not guaranteed

Sometimes the models we fit will then be used to make predictions. In some cases, it may be required to make predictions outside of the fitting range. This is known as extrapolation.

Let’s try to make predictions from the cubic model we just fitted (code is hidden).
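The hidden code presumably follows a pattern like the one below; a minimal sketch (the grid limits and the names d.new and pred.cubic are assumptions made here for illustration):

## prediction grid extending beyond the fitting range of x1
d.new <- data.frame(x1 = seq(from = -5, to = 25, length.out = 200))
pred.cubic <- predict(lm.cubic, newdata = d.new)
##
plot(sim.data$x1, sim.data$yy, xlim = range(d.new$x1))
lines(d.new$x1, pred.cubic)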

[Figure: predictions from the cubic model beyond the fitting range; sim.data$yy vs. x1]

The predictions on the right-hand side seem reasonable. On the other hand, the predictions on the left-hand side are not supported by the data. In other words, there is no reason why the predictions should increase so dramatically.

In general, the higher the degree of the fitted polynomial, the less reliable the extrapolation becomes.

Let’s now turn our attention to the confidence intervals for the estimated regression line.


[Figure: confidence intervals for the cubic fit beyond the fitting range; sim.data$yy vs. x1]

Considering that we have no idea how the data behave outside the fitting region, the confidence intervals are far too narrow, i.e. overly confident.

2.3 Summary

Polynomials are simple and intuitive functions that allow us to model non-linear relationships. Unfortunately, they suffer from some shortcomings:

• collinearity, which is solved by using orthogonal polynomials

• they are not reliable outside the fitting region (in terms of fit and confidence intervals)

• there is no simple rule that tells us what degree to choose. If too high a degree is used, we risk overfitting the data.

3 Regression splines

Regression splines have better properties than polynomials in terms of reliability outside the fitting region and do not suffer from collinearity issues.

Regression splines are based on basis functions. There are many different types of basis functions that can be used to build a regression spline.

As for polynomials, the dimension (resp. the degree) of the regression spline must be set by the user.

Let's look at one example of regression splines. Here we use natural splines, applied to the data set simulated from the previously used cubic polynomial.

library(splines)
lm.regressionSplines <- lm(yy ~ ns(x1, df = 3), data = sim.data)
summary(lm.regressionSplines)

Call:
lm(formula = yy ~ ns(x1, df = 3), data = sim.data)


Residuals:
    Min      1Q  Median      3Q     Max
-1.5033 -0.3850  0.0393  0.4093  1.2266

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.354      0.182   -1.94    0.055 .
ns(x1, df = 3)1   13.730      0.224   61.16   <2e-16 ***
ns(x1, df = 3)2   15.345      0.457   33.58   <2e-16 ***
ns(x1, df = 3)3   11.324      0.185   61.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.53 on 96 degrees of freedom
Multiple R-squared:  0.989,  Adjusted R-squared:  0.989
F-statistic: 2.85e+03 on 3 and 96 DF,  p-value: <2e-16

Note that the summary output does not look very different from the one obtained with a cubic polynomial. In particular, from a single continuous predictor we created three continuous predictors.

x \rightarrow f(x) = \tilde{x}_1 + \tilde{x}_2 + \tilde{x}_3

Let’s look at the fit of this model.

[Figure: natural spline of degree 3; yy vs. x1]

As expected, the fit is very similar to the one obtained with the cubic polynomial.

Let's now look at the predictions and corresponding confidence intervals for the natural splines outside the fitting region.
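Again the code is hidden; a minimal sketch, reusing the extended grid d.new introduced in the earlier sketch:

## natural-spline predictions with confidence bands on the grid
pred.ns <- predict(lm.regressionSplines, newdata = d.new,
                   interval = "confidence")
plot(sim.data$x1, sim.data$yy, xlim = range(d.new$x1))
matlines(d.new$x1, pred.ns, lty = c(1, 2, 2))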


[Figure: natural spline predictions and confidence intervals beyond the fitting range; sim.data$yy vs. x1]

The behaviour of the natural splines outside the fitting region is more appropriate and reliable than for the cubic polynomial model. In addition, the confidence intervals are wider, which correctly reflects the lack of data.

4 Degree of complexity, how much is enough?

As for polynomials, the user must define the complexity of the regression splines. This is done through the definition of the dimension of the basis function. In the example above, the polynomial had order three and the natural spline had dimension three.

We can actually convince ourselves by printing the first few observations for both functions.

library(dplyr) ## for piping
poly(sim.data$x1, degree = 3) %>% head()

         1    2     3
[1,] -0.17 0.22 -0.25
[2,] -0.17 0.20 -0.22
[3,] -0.16 0.19 -0.19
[4,] -0.16 0.18 -0.16
[5,] -0.16 0.17 -0.14
[6,] -0.15 0.15 -0.11

ns(sim.data$x1, df = 3) %>% head()

           1     2      3
[1,]  0.0000 0.000  0.000
[2,] -0.0077 0.023 -0.015
[3,] -0.0153 0.046 -0.031
[4,] -0.0229 0.069 -0.046
[5,] -0.0303 0.092 -0.061
[6,] -0.0376 0.114 -0.076


Unfortunately, even with regression splines, there is no simple rule on how to choose the "right" degree of complexity. The following graph shows four different choices of complexity for natural splines.

[Figure: natural spline fits with df = 3, 10, 25 and 50; yy vs. x1]

Setting the complexity for this natural spline to 25 or 50 clearly leads to overfitting. However, it can be difficult to argue whether the natural spline with complexity 10 is more appropriate than the one with complexity three.
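The code behind the four panels is not shown in the handout; a minimal sketch of the underlying fits (the object names are introduced here):

## refit the natural-spline model with increasing complexity
lm.ns.10 <- lm(yy ~ ns(x1, df = 10), data = sim.data)
lm.ns.25 <- lm(yy ~ ns(x1, df = 25), data = sim.data)
lm.ns.50 <- lm(yy ~ ns(x1, df = 50), data = sim.data)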

4.1 Summary

Regression splines enable us to model non-linear relationships. They have the following properties:

• they are not affected by collinearity issues

• they are reliable outside the fitting region (in terms of fit and confidence intervals)

• unfortunately, the degree of complexity must be set by the user with the risk of overfitting the data.

5 Smoothing splines

A valid solution to the problem of choosing the correct degree of complexity when modelling non-linear relationships is to use smoothing splines.

In a few words, this boils down to choosing a large enough degree of complexity (e.g. 10) and then avoiding overfitting by penalising the wiggliness of the estimated function (see blackboard).

Note that this is an extreme simplification of what smoothing splines are. See the references mentioned in class for further reading.
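To make the idea concrete (this example is not part of the original handout), base R's smooth.spline() implements such a penalised fit, with the smoothing parameter chosen automatically by (generalised) cross-validation; a minimal sketch:

## a smoothing spline selects its own degree of smoothness
ss.x1 <- smooth.spline(x = sim.data$x1, y = sim.data$yy)
plot(sim.data$x1, sim.data$yy)
lines(ss.x1)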


The "loess" smoother used by default by geom_smooth() works somewhat differently and is based on "local regression" (see blackboard and further reading). There is a wide variety of smoothers. Note, however, that in most practical situations they all lead to very similar results.
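A small sketch illustrating this last point (not in the handout): two different smoothers drawn over the same simulated data give nearly identical curves; the colours are arbitrary.

## compare loess (the geom_smooth() default for small data sets)
## with a penalised regression spline (requires mgcv installed)
ggplot(data = sim.data, mapping = aes(x = x1, y = yy)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, colour = "blue") +
  geom_smooth(method = "gam", formula = y ~ s(x),
              se = FALSE, colour = "red")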

6 Generalised Additive Models (GAMs)

Up to this point we have only dealt with one single predictor. In real settings, we often have several predictors that we want to consider simultaneously.

Generalised Additive Models, GAMs for short, allow us to estimate the effect of several predictors as smooth functions. In other words, a GAM allows us to fit a different smoothing spline to several predictors simultaneously.

The currently most powerful implementation of GAMs is to be found in the {mgcv} package².

Let's fit a GAM to the data simulated from a cubic polynomial.

library(mgcv)
gam.1 <- gam(yy ~ s(x1), data = sim.data)

The s() function allows the user to specify smooth terms. In this case, the effect of x1, the only predictor present in the model, is modelled with a smooth term. Let's visualise the estimated effect of x1.

plot(gam.1, residuals = TRUE, cex = 2, shade = TRUE)

[Figure: estimated smooth s(x1, 7.11) with partial residuals]

This is indeed a reasonable fit. Note that we allowed the smooth function to have a degree of complexity equal to 10 (i.e. the default value). Nevertheless, the function does not seem to overfit the data, as penalisation is used.

To highlight the fact that in GAMs overfitting is properly dealt with, we allow the smooth function to be overly complex. We do this by setting the dimension of the basis function to 40.

²The {gam} package also implements GAMs (see its gam() function). However, the {gam} package is based on the backfitting algorithm, which has been superseded. In addition, the {mgcv} package implements a much wider choice of methods and tools.


gam.40 <- gam(yy ~ s(x1, k = 40), ## "k" sets the dimension of the basis function
              data = sim.data)
plot(gam.40, residuals = TRUE, cex = 2, shade = TRUE)

[Figure: estimated smooth s(x1, 8.56) with partial residuals]

Again, there is no overfitting: penalisation worked well. For the sake of completeness, we show the fit obtained with a complexity degree of 40 but no penalisation. No penalisation is obtained by setting the fx argument to TRUE.

gam.40.fx <- gam(yy ~ s(x1, k = 40, fx = TRUE),
                 data = sim.data)
plot(gam.40.fx, residuals = TRUE, cex = 2, shade = TRUE)

[Figure: unpenalised smooth s(x1, 39) with partial residuals]

This graph clearly shows overfit.
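One way to see the effect of penalisation numerically (not shown in the handout) is to compare the effective degrees of freedom actually used by the three fits; a minimal sketch:

## total effective degrees of freedom per fit (the sums include
## one degree of freedom for the intercept)
sum(gam.1$edf)     ## smooth edf about 7.1, despite k = 10
sum(gam.40$edf)    ## smooth edf about 8.6, despite k = 40
sum(gam.40.fx$edf) ## smooth edf 39: unpenalised, the full basis is used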


6.1 Several smooth terms

We now turn our attention back to the d.trees data set. Let's fit a model that allows the effects of density.site and diversity.site to be smooth functions. Note that we are also including species in the model. For the sake of brevity, no other predictors are considered.

gam.trees.1 <- gam(growth.rate ~ species +
                     s(density.site) + s(diversity.site),
                   data = d.trees)

Let's now look at the output of the summary function.

summary(gam.trees.1)

Family: gaussian
Link function: identity

Formula:
growth.rate ~ species + s(density.site) + s(diversity.site)

Parametric coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.2522     0.0213   58.80  < 2e-16 ***
speciesLarix    -0.3015     0.0306   -9.84  < 2e-16 ***
speciesPicea    -0.0980     0.0311   -3.16   0.0017 **
speciesQuercus  -0.1863     0.0302   -6.18  1.3e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                   edf Ref.df    F p-value
s(density.site)   2.67   3.36  5.0 0.00141 **
s(diversity.site) 1.00   1.00 14.9 0.00012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.196   Deviance explained = 20.5%
GCV = 0.064137  Scale est. = 0.063253  n = 557

Note the two sections in the output: "Parametric coefficients" and "Approximate significance of smooth terms". The column "edf", which stands for estimated degrees of freedom, indicates the complexity of the smooth terms. For example, the effect of diversity.site is estimated to be linear (edf 1.00), whereas the effect of density.site seems to be more complex (edf 2.67). Let's visualise these effects.

plot(gam.trees.1, pages = 1, residuals = TRUE, shade = TRUE)


[Figure: estimated smooths s(density.site, 2.67) and s(diversity.site, 1) with partial residuals]

Note that we can refit the same model where the effect of diversity.site is taken to be linear.

gam.trees.2 <- gam(growth.rate ~ species +
                     s(density.site) + diversity.site,
                   data = d.trees)

summary(gam.trees.2)

Family: gaussian
Link function: identity

Formula:
growth.rate ~ species + s(density.site) + diversity.site

Parametric coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.1041     0.0435   25.37  < 2e-16 ***
speciesLarix    -0.3015     0.0306   -9.84  < 2e-16 ***
speciesPicea    -0.0980     0.0311   -3.16  0.00169 **
speciesQuercus  -0.1863     0.0302   -6.18  1.3e-09 ***
diversity.site   0.0581     0.0150    3.87  0.00012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                 edf Ref.df F p-value
s(density.site) 2.67   3.36 5  0.0014 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.196   Deviance explained = 20.5%
GCV = 0.064137  Scale est. = 0.063253  n = 557
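The two fits are practically identical (same GCV, adjusted R-squared and deviance explained). As a further check (not in the handout), the two treatments of diversity.site could be compared by AIC; a minimal sketch:

## near-identical AIC values confirm that the smooth term with
## edf = 1 and the linear term describe the same effect
AIC(gam.trees.1, gam.trees.2)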


7 Appendix

7.1 Interactions with smooth terms (**)

In a previous Lab, we saw that some variables appeared to interact with species. We may want to inspect whether a different smooth term for size is needed for each species.

gam.trees.3 <- gam(growth.rate ~ species +
                     s(size, by = species),
                   data = d.trees)

summary(gam.trees.3)

Family: gaussian
Link function: identity

Formula:
growth.rate ~ species + s(size, by = species)

Parametric coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.2685     0.0182   69.86  < 2e-16 ***
speciesLarix    -0.3575     0.0261  -13.68  < 2e-16 ***
speciesPicea    -0.1802     0.0267   -6.75  3.9e-11 ***
speciesQuercus  -0.1075     0.0280   -3.84  0.00014 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                        edf Ref.df    F p-value
s(size):speciesFagus   3.41   4.23 14.3 1.8e-11 ***
s(size):speciesLarix   4.30   5.29 10.4 8.0e-10 ***
s(size):speciesPicea   2.31   2.91 33.4 < 2e-16 ***
s(size):speciesQuercus 2.38   2.99 32.5 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.459   Deviance explained = 47.4%
GCV = 0.043805  Scale est. = 0.042516  n = 557

Let's visualise the estimated smooth terms for the four species.

plot(gam.trees.3, residuals = TRUE, cex = 3, pages = 1)


[Figure: species-specific smooths of size: s(size, 3.41):speciesFagus, s(size, 4.3):speciesLarix, s(size, 2.31):speciesPicea, s(size, 2.38):speciesQuercus]

## argument "residuals" does not seem to work properly

7.2 2-dimensional smooth terms (**)

Smooth terms can also be fitted in two dimensions. A classical example would be to model the spatial effect on the response variable with a two-dimensional smooth term. This may represent, for example, soil fertility.

gam.trees.4 <- gam(growth.rate ~ species +
                     s(size) + s(X, Y, k = 40),
                   data = d.trees)
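To inspect the fitted surface (not shown in the handout), mgcv's vis.gam() can draw it; a minimal sketch, assuming X and Y are the coordinate columns used above:

## contour plot of the estimated two-dimensional spatial smooth
vis.gam(gam.trees.4, view = c("X", "Y"), plot.type = "contour")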


8 Session Information

sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] mgcv_1.8-28   nlme_3.1-137  dplyr_0.7.6   gridExtra_2.3 car_3.0-2
[6] carData_3.0-2 ggplot2_3.0.0 knitr_1.20

loaded via a namespace (and not attached):
 [1] zip_2.0.1        Rcpp_1.0.2       cellranger_1.1.0
 [4] pillar_1.3.1     compiler_3.5.3   plyr_1.8.4
 [7] bindr_0.1.1      forcats_0.3.0    tools_3.5.3
[10] digest_0.6.16    lattice_0.20-35  evaluate_0.10.1
[13] tibble_2.1.1     gtable_0.2.0     pkgconfig_2.0.2
[16] rlang_0.4.0      Matrix_1.2-15    openxlsx_4.1.0
[19] curl_3.2         yaml_2.2.0       haven_2.0.0
[22] rio_0.5.16       bindrcpp_0.2.2   withr_2.1.2
[25] stringr_1.3.1    hms_0.4.2        rprojroot_1.3-2
[28] grid_3.5.3       tidyselect_0.2.5 data.table_1.11.8
[31] glue_1.3.1       R6_2.4.0         readxl_1.1.0
[34] foreign_0.8-71   rmarkdown_1.10   purrr_0.3.2
[37] magrittr_1.5     backports_1.1.2  scales_1.0.0
[40] htmltools_0.3.6  abind_1.4-5      assertthat_0.2.0
[43] colorspace_1.3-2 labeling_0.3     stringi_1.2.4
[46] lazyeval_0.2.1   munsell_0.5.0    crayon_1.3.4
