
Linear Models 2: Tree growth Lab

Dr. Matteo Tanadini

Angewandte statistische Regression I, HS19 (ETHZ)

Contents

1 Getting data
2 Testing the effect of a categorical variable
3 Post-hoc contrasts
  3.1 Quercus vs. Picea
  3.2 All pairwise comparisons
  3.3 Broadleaved versus conifers
4 Testing several variables
  4.1 Testing categorical variables
  4.2 Testing continuous and categorical variables
  4.3 Principle of marginality
  4.4 Testing all predictors in a model
  4.5 Sequential sums of squares
5 Appendix
  5.1 Testing several post-hoc hypotheses (**)
  5.2 Conifers vs broadleaved, why not a t-test? (*)
  5.3 Why not adding a dummy variable to test conifers vs broadleaved (***)
6 Session Information

1 Getting data

The data set used here is about the growth rates of 557 trees. The response variable growth.rate represents growth rates measured via tree cores. Trees belong to four species. A few other variables expected to affect growth rates are also present in the data set (e.g. “tree age”). Further information about the data set can be found here.

Let’s load the data into a data frame called d.trees (the results are hidden).
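The data can then be inspected with the usual functions; a minimal sketch (output not shown):

str(d.trees)             ## structure of the data set
table(d.trees$species)   ## number of trees per species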

2 Testing the effect of a categorical variable

To test whether species has an effect on growth rates, we fit an intercept-only model and a model that contains species as a predictor, and compare the two with an F-test via the anova() function.

lm.trees.0 <- lm(growth.rate ~ 1, data = d.trees)
lm.trees.1 <- lm(growth.rate ~ species, data = d.trees)
anova(lm.trees.0, lm.trees.1)   ## (results are hidden)
summary(lm.trees.1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.2528     0.0216   58.05   <2e-16 ***

Note that the two models differ in their complexity and thus in their number of parameters. The very simple model has an intercept only, while lm.trees.1 has four parameters (one for each species). Therefore, the second model has three more parameters. This information is printed in the Df column of the anova() output.

    3 Post-hoc contrasts

    3.1 Quercus vs. Picea

Let’s assume that we want to further investigate the species effect by testing the hypothesis that Quercus differs from Picea in terms of growth rates. To do that we are going to use the glht() function from the {multcomp} package [1].

    To use the glht() function, we must specify three things:

    • the model

    • which predictor is involved in the testing

    • what levels of the latter predictor are to be compared

In the call below, we use the lm.trees.1 model to test whether Quercus and Picea differ in terms of growth rates.

library(multcomp)
ph.test.1 <- glht(model = lm.trees.1,
                  linfct = mcp(species = "Quercus - Picea == 0"))
summary(ph.test.1)

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts

Fit: lm(formula = growth.rate ~ species, data = d.trees)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)

Quercus - Picea == 0  -0.1141     0.0308   -3.71  0.00023 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
(Adjusted p values reported -- single-step method)

The output is quite clear again. There is strong evidence that the growth rates differ between these two species. Picea grows at a rate that is 0.11 larger than the rate of Quercus.
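A confidence interval for this difference can also be extracted from the same object; a minimal sketch:

confint(ph.test.1)   ## simultaneous 95% confidence interval for Quercus - Picea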

Note of caution: post-hoc tests can only be performed for factors that have a significant effect. If a factor is not significant, there is no evidence of differences among groups, and no further analysis is required, nor allowed.

[1] There are several methods implemented in R to test post-hoc hypotheses. The {multcomp} package is possibly the most reliable and flexible currently available.


3.2 All pairwise comparisons

Let’s assume that we want to test all species in a pairwise manner. With four species, there are six possible pairs. What if we had ten species? In that case, there would be 45 pairs.
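The number of pairs can be computed directly in R:

choose(4, 2)    ## 6 pairwise comparisons for four species
choose(10, 2)   ## 45 pairwise comparisons for ten species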

When performing a large number of tests, we run into the issue of multiple testing. Indeed, by running many tests, it may be that some of them are significant just due to chance.
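As a rough illustration, if each of m tests were carried out independently at the 5% level, the probability of at least one spuriously significant result would be 1 - 0.95^m:

1 - 0.95^c(6, 45)   ## roughly 0.26 and 0.90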

Fortunately, there are ways to control for the number of tests one is performing. In the specific case of testing all possible pairs of species, we can use the “Tukey Honest Significant Difference Test” method.

ph.test.THSD <- glht(model = lm.trees.1,
                     linfct = mcp(species = "Tukey"))
summary(ph.test.THSD)

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = growth.rate ~ species, data = d.trees)

Linear Hypotheses:
                   Estimate Std. Error t value Pr(>|t|)

Larix - Fagus == 0  -0.2996     0.0309   -9.71

[Figure: 95% family-wise confidence intervals for all six pairwise differences (Larix - Fagus, Picea - Fagus, Quercus - Fagus, Picea - Larix, Quercus - Larix, Quercus - Picea), plotted against the linear function.]

The above graph shows all the estimated pairwise differences (i.e. the dots) and the associated confidence intervals.
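A graph of this kind can be produced from the fitted contrasts; a minimal sketch:

ci.THSD <- confint(ph.test.THSD)   ## simultaneous 95% confidence intervals
plot(ci.THSD)                      ## one interval per pairwise difference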

    3.3 Broadleaved versus conifers

Assume we want to test the difference between “broadleaved species” and “conifers” (i.e. Fagus and Quercus versus Picea and Larix). Note that we are using a vector to specify this hypothesis.

levels(d.trees$species)

    [1] "Fagus" "Larix" "Picea" "Quercus"

#### vector of contrasts
## coefficients assumed; order of levels: Fagus, Larix, Picea, Quercus (broadleaved minus conifers)
v.conifers_vs_broadleaved <- c(1, -1, -1, 1)
ph.test.2 <- glht(model = lm.trees.1,
                  linfct = mcp(species = v.conifers_vs_broadleaved))
summary(ph.test.2)

Simultaneous Tests for General Linear Hypotheses

    Multiple Comparisons of Means: User-defined Contrasts

    Fit: lm(formula = growth.rate ~ species, data = d.trees)

Linear Hypotheses:
       Estimate Std. Error t value Pr(>|t|)

1 == 0   0.1854     0.0436    4.25  2.5e-05 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
(Adjusted p values reported -- single-step method)

On average, broadleaved species have higher growth rates (the estimated contrast is 0.18). There is strong evidence for this difference.

Here we are testing a single hypothesis. If we were to test several hypotheses, we would have to feed the glht() function with all hypotheses in one go, such that the correction for multiple testing is carried out properly (see the Appendix for an example).

    4 Testing several variables

    4.1 Testing categorical variables

We have just seen how to test the effect of species by feeding the anova() function with two models. In general, we often want to test several variables present in a model.

Let’s consider a model with two further predictors. Density.tree.Class measures the density of trees around the focal tree for which the growth rate was measured. This categorical predictor has three classes (i.e. “low”, “medium” and “high”).

SiteID is another categorical variable that accounts for the fact that the 557 trees are clustered into 45 groups (i.e. sites).

Let’s add these variables to the model. To do that, we could refit the model via the lm() function or “update” the current model by using the update() function.

lm.trees.2 <- update(lm.trees.1, . ~ . + Density.tree.Class + SiteID)

To test these two categorical variables we could again perform two F-tests via the anova() function. However, this is not needed, as the drop1() function performs this automatically for each variable present in a given model.

drop1(lm.trees.2, test = "F")

    Single term deletions

Model:
growth.rate ~ species + Density.tree.Class + SiteID

                   Df Sum of Sq  RSS   AIC F value Pr(>F)
<none>                          36.4 -1505
species             3      6.85 43.3 -1415   34.51 <2e-16 ***

There is very strong evidence for an effect of species. The Density.tree.Class predictor, on the other hand, does not seem to have a relevant effect (its p-value is larger than 0.1).

Finally, the SiteID predictor also seems not to have a relevant effect. However, note that the Df for this predictor is one instead of 44, as would be expected for a categorical variable (number of levels minus one).

    Question: why are Df equal to one?

The reason is that the variable SiteID was not correctly defined as a categorical variable. Indeed, site IDs are plain numbers and therefore this predictor was considered to be a continuous variable.
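This can be checked directly; a minimal sketch:

class(d.trees$SiteID)             ## "integer" or "numeric": R sees plain numbers
nlevels(factor(d.trees$SiteID))   ## 45 sites once converted to a factor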

To fix this problem we must declare this predictor as a factor and refit the model.

d.trees$SiteID.fac <- factor(d.trees$SiteID)

With the factor coding, drop1() reports 44 degrees of freedom for SiteID.fac, as expected.

4.2 Testing continuous and categorical variables

We have previously seen that in order to test continuous and dummy variables (i.e. categorical variables with two levels) we can use the summary() function (via the t-test). Note that drop1() can also be used with continuous predictors.

Let’s add the age predictor.

lm.trees.4 <- lm(growth.rate ~ species + Density.tree.Class + SiteID.fac + age,
                 data = d.trees)
drop1(lm.trees.4, test = "F")

Single term deletions

Model:
growth.rate ~ species + Density.tree.Class + SiteID.fac + age

                   Df Sum of Sq  RSS   AIC F value  Pr(>F)
<none>                          29.3 -1538
species             3      4.02 33.3 -1473   23.10 5.0e-14 ***
Density.tree.Class  2      0.24 29.6 -1538    2.05    0.13
SiteID.fac         44      7.13 36.4 -1505    2.80 3.2e-08 ***
age                 1      0.05 29.4 -1539    0.84    0.36
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

The output tells us that the age predictor does not seem to play an important role. To know whether this predictor has a positive or negative effect on the response variable, we can use the coef() function.
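For example, the sign of the age coefficient can be read off directly; a minimal sketch:

coef(lm.trees.4)["age"]   ## the sign indicates whether growth.rate increases or decreases with age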

    4.3 Principle of marginality

Note that the drop1() function respects the principle of marginality. To explain what this means, we add an interaction to our model. Let’s assume that we expect different species to be affected in different ways by the age of a tree. To properly model this, we must introduce the interaction term between these two predictors.

lm.trees.5 <- update(lm.trees.4, . ~ . + species:age)
drop1(lm.trees.5, test = "F")

Single term deletions

Model:
growth.rate ~ species + Density.tree.Class + SiteID.fac + age + species:age

                   Df Sum of Sq  RSS   AIC F value  Pr(>F)

<none>                          28.0 -1558
Density.tree.Class  2      0.22 28.2 -1557    2.01    0.14
SiteID.fac         44      7.79 35.8 -1509    3.18 3.4e-10 ***
species:age         3      1.31 29.3 -1538    7.85 3.9e-05 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

There is strong evidence that age interacts with species. In other words, the effect of age differs among species.


Question: why, then, was the effect of age alone non-significant?

The reason is that the effect of age may be absent or very weak when averaged over all species. It may be, for example, that age has a strong positive effect for two species and a strong negative effect for the other two; averaged, this gives no overall effect.

Let’s visualise the effect of age for all species together first. Note that we are using the add-on package {ggplot2}. The use of the ggplot() function will be discussed later in this course.

library(ggplot2)
ggplot(data = d.trees,
       mapping = aes(y = growth.rate, x = age)) +
  geom_point() +               ## adds observations
  geom_smooth(method = "lm")   ## adds the regression line with CI

[Figure: scatterplot of growth.rate against age for all species pooled, with the fitted regression line and its confidence band.]

When species are ignored, the effect of age seems to be weak. Let’s visualise each species in a different graph.

ggplot(data = d.trees,
       mapping = aes(y = growth.rate, x = age)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_grid(. ~ species)


[Figure: growth.rate plotted against age in four panels, one per species (Fagus, Larix, Picea, Quercus), each with its own fitted regression line.]

These four plots show clearly that the effect of age is not constant among species. For example, “Larix” trees seem to decrease their growth rates with age, while “Quercus” trees show the opposite pattern.

So the right conclusion is: “there is strong evidence that age has an effect on growth rate and this effect differs among species”.

This conclusion implies that both predictors involved in the interaction are relevant. In other words, as soon as an interaction term is significant, both predictors are to be considered relevant and no further testing is required to assess whether they play a role.

This is why the drop1() function does not test the main effects that are involved in an interaction. Indeed, in the output above neither age nor species is tested.

This is the principle of marginality. Main effects must be tested only if they are not involved in interactions or higher-degree terms. More generally, marginality implies testing the higher-degree terms first.
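Equivalently, the interaction can be tested by comparing the models with and without the species:age term; a minimal check:

anova(lm.trees.4, lm.trees.5)   ## same F-test as the species:age row of drop1()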

    4.4 Testing all predictors in a model

Sometimes, it may be of interest to test whether any of the predictors contained in the model has an influence on the response variable. This can be done by comparing the full model with a simple model that only contains an intercept. This test, rarely used in practice, is referred to as the “global F-test”.

anova(lm.trees.0, lm.trees.5)

    Analysis of Variance Table

Model 1: growth.rate ~ 1
Model 2: growth.rate ~ species + Density.tree.Class + SiteID.fac + age +
    species:age
  Res.Df  RSS Df Sum of Sq    F Pr(>F)
1    556 43.7
2    503 28.0 53      15.7 5.33 <2e-16 ***

There is very strong evidence that at least one of the predictors has an effect on the response variable.

5 Appendix

5.1 Testing several post-hoc hypotheses (**)

Several user-defined hypotheses can be tested in one go by collecting the contrast vectors into a matrix, one row per hypothesis and one column per species level (Fagus, Larix, Picea, Quercus). A sketch of such a matrix (the object name is assumed):

m.contrasts <- rbind("Fagus vs Larix"           = c(1, -1,  0.0,  0.0),
                     "Picea vs Quercus"         = c(0,  0,  1.0, -1.0),
                     "Fagus vs Quercus & Picea" = c(1,  0, -0.5, -0.5),
                     "F - L vs P - Q"           = c(1, -1, -1.0,  1.0))

Let’s perform the testing.

ph.test.mat.1 <- glht(model = lm.trees.1,
                      linfct = mcp(species = m.contrasts))
summary(ph.test.mat.1)

Simultaneous Tests for General Linear Hypotheses

Linear Hypotheses:
                               Estimate Std. Error t value Pr(>|t|)

Fagus vs Larix == 0              0.2996     0.0309    9.71

5.2 Conifers vs broadleaved, why not a t-test? (*)

seemed to depend on species. Therefore, we should rather perform the test “density low” vs “density high” for each species separately. If we were to run t-tests disconnected from the model, we may end up testing hypotheses that should not be tested. Here, we may test “density low” vs “density high” while forgetting that this may not be a sensible choice, as the density effect varies from species to species.

5. Unallowed testing: post-hoc tests can only be performed when the factor of interest is shown to have a significant effect. This prevents us from testing very many hypotheses and then “fishing out” the significant ones.

    5.3 Why not adding a dummy variable to test conifers vs broadleaved (***)

Following the argumentation above, we may say that to test the effect of conifers vs broadleaved, we could simply add a dummy variable that signals whether the species is a conifer or a broadleaved species, and include it in the model alongside species. A sketch of this (the exact coding of the dummy is assumed):

d.trees$conifers.YES <- d.trees$species %in% c("Larix", "Picea")   ## TRUE for conifers
lm.trees.conifers <- lm(growth.rate ~ species + conifers.YES, data = d.trees)

However, this approach fails: the model is rank-deficient, because one column of the design matrix is a linear combination of the others.

The concept of rank-deficiency is not straightforward, so to give you an intuition consider the following argument. In the lm.trees.conifers model we tried to estimate five parameters from four groups of observations (one per species). In other words, we tried to estimate more parameters than available groups, which leads to the problem.
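The linear dependence can also be seen directly in the design matrix; a minimal sketch, assuming the conifers.YES coding above:

X <- model.matrix(~ species + conifers.YES, data = d.trees)
ncol(X)      ## 5 columns: intercept, three species dummies, conifers dummy
qr(X)$rank   ## only 4: the matrix is rank-deficient
## the conifers dummy equals the sum of the Larix and Picea dummies
all(X[, "conifers.YESTRUE"] == X[, "speciesLarix"] + X[, "speciesPicea"])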


6 Session Information

    sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_3.0.0   multcomp_1.4-10 TH.data_1.0-10  MASS_7.3-51.1
[5] survival_2.43-3 mvtnorm_1.0-10  knitr_1.20

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18     bindr_0.1.1      compiler_3.5.3   pillar_1.3.1
 [5] plyr_1.8.4       tools_3.5.3      digest_0.6.16    evaluate_0.10.1
 [9] tibble_2.1.1     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.2
[13] rlang_0.3.4      Matrix_1.2-15    yaml_2.1.19      bindrcpp_0.2.2
[17] withr_2.1.2      stringr_1.3.1    dplyr_0.7.6      tidyselect_0.2.4
[21] rprojroot_1.3-2  grid_3.5.3       glue_1.2.0       R6_2.2.2
[25] rmarkdown_1.10   reshape2_1.4.3   purrr_0.2.5      magrittr_1.5
[29] backports_1.1.2  scales_1.0.0     codetools_0.2-16 htmltools_0.3.6
[33] splines_3.5.3    assertthat_0.2.0 colorspace_1.3-2 labeling_0.3
[37] sandwich_2.5-1   stringi_1.2.4    lazyeval_0.2.1   munsell_0.5.0
[41] crayon_1.3.4     zoo_1.8-3

