General introduction to mixed models


Contents

1 General introduction to mixed models

  1.1 Preface
  1.2 Introductory example: NIR predictions of HPLC measurements
      1.2.1 Simple analysis
      1.2.2 Simple analysis by an ANOVA approach
      1.2.3 The problem of the ANOVA approach
      1.2.4 The mixed model
      1.2.5 Comparison of fixed and mixed model
      1.2.6 Analysis by mixed model
  1.3 Example with missing values
      1.3.1 Analysis by fixed effects ANOVA
      1.3.2 Analysis by mixed model
  1.4 Why use mixed models?
  1.5 R-TUTORIAL: What is R?
  1.6 R-TUTORIAL: Importing Data
  1.7 R-TUTORIAL: Data handling
  1.8 R-TUTORIAL: Creating Graphs


  1.9 R-TUTORIAL: Introductory example: NIR predictions of HPLC measurements
      1.9.1 Simple analysis
      1.9.2 Simple analysis by an ANOVA approach: Data re-structuring
      1.9.3 ANOVA approach: Using function lm
      1.9.4 R-TUTORIAL: ANOVA post hoc
      1.9.5 Analysis by mixed model
  1.10 R-TUTORIAL: Example with missing values
      1.10.1 Simple analysis by an ANOVA approach
      1.10.2 Analysis by mixed model
  1.11 Exercises

1.1 Preface

Analysis of variance and regression analysis are at the heart of applied statistics in research and industry and have been so for many years. The basic methodology is taught in introductory statistics courses within almost any field at any university around the world. Depending on the number and substance of these courses, they usually only provide statistical tools for a rather limited pool of setups, which will often be too simplistic for the more complex features of real life situations. Moreover, the classical approaches are based on rather strict assumptions about the data at hand: the structure must be described by a linear model, and the observations, or rather the residual or error terms, must follow a normal distribution, they must be independent, and the variability should be homogeneous.

This course is aimed at providing participants with knowledge and tools to handle more complex setups without having to fit them into a limited set of predefined settings. The theory behind, and the tool box given by, mixed linear models encompass methodology to relax the assumptions of independence and variance homogeneity. The great versatility of mixed linear models has only relatively recently become generally accessible to users in commercial software packages, like the SAS package used in this course. Even today, many statistical packages offer only a limited version of the possibilities of mixed linear models.


Knowledge about and experience with the possibilities of mixed linear models will also form the basis for relaxing the assumptions of model linearity and normality. The final Module 13 of the course indicates how the entire versatility of linear mixed models (still using the normal distribution) can be embedded into non-linear and/or non-normal/non-quantitative data modelling and analysis. The availability of software to handle situations of this complexity is still very limited, and details of these issues remain a matter of active research within the science of statistics.

We begin in this module by introducing the concept of a random effect. Module 2 introduces the factor structure diagram as a tool to handle general complex experimental structures, together with the model notation to be used throughout. Module 3 is a case study illustrating how a data analysis project is commonly approached, with or without random effects. In Module 4 the basic statistical theory of mixed linear models is presented. Modules 5 and 6 treat two specific and commonly appearing practical settings: the hierarchical data setting and the split-plot setting. In Module 7 ideas and methods for model diagnostics are presented, together with a completion of the case study started in Module 3. Modules 8 and 9 cover two related specific settings: mixed model versions of analysis of covariance (ANCOVA) and random coefficient (regression) models. In Module 10 the final theory is presented, and in Modules 11 and 12 the important topic of repeated measures/longitudinal data is covered by a module on simple analysis methods and a module on more advanced modelling approaches.

The readers are assumed to possess some basic knowledge of statistics. A course including topics on regression and/or analysis of variance, following an introductory statistics course, will usually be sufficient. The focus is on applications and interpretation of results obtained from software, but some insight into the underlying theory, and in particular some feeling for the modelling concepts, will be emphasized. Throughout the material, an effort is made to describe and make available all SAS code used and needed to obtain the presented results.

Most examples are taken from the daily work of the Statistics Group, Department of Mathematics and Physics, The Royal Veterinary and Agricultural University, Copenhagen, Denmark. As such, biologically oriented examples in a broad sense constitute the majority of cases. The material draws on contributions from several staff members over the years: Henrik Stryhn, Ib Skovgaard, Bo Martin Bibby, Torben Martinussen.


1.2 Introductory example: NIR predictions of HPLC measurements

In a pharmaceutical company the use of NIR (Near Infrared Reflectance) spectroscopy was investigated as an alternative to the more cumbersome (and expensive) HPLC method to determine the content of active substance in tablets. Below, the measurements on 10 tablets are shown together with the differences (in mg):

Tablet     HPLC   NIR    Difference
 1         10.4   10.1    0.3
 2         10.6   10.8   -0.2
 3         10.2   10.2    0.0
 4         10.1    9.9    0.2
 5         10.3   11.0   -0.7
 6         10.7   10.5    0.2
 7         10.3   10.2    0.1
 8         10.9   10.9    0.0
 9         10.1   10.4   -0.3
10          9.8    9.9   -0.1

One of the main interests lies in the average difference between the two methods, also called the method bias in this context.

1.2.1 Simple analysis

The most straightforward approach to an analysis of these data is to consider the setup as a paired two-sample setup and carry out the corresponding paired t-test analysis. The paired t-test approach corresponds to a one-sample analysis of the differences, i.e. calculating the average and the standard deviation of the 10 differences given in the table above:

d̄ = −0.05,  s_d = 0.2953

The uncertainty of the estimated difference d̄ = −0.05 is given by the standard error

SE_d = s_d/√n = 0.2953/√10 = 0.0934


A t-test for the hypothesis of no difference (no method bias) is then given by

t = d̄/SE_d = −0.05/0.0934 = −0.535

with a P-value of 0.61, so there is no significant method bias. A 95% confidence band for the difference is given by

d̄ ± t_0.975(9) · SE_d

Since the critical t-value with 9 degrees of freedom is 2.262, the 95% confidence band for the method bias in this case is

−0.05 ± 0.21

showing that even though we cannot claim a significant method difference, the difference could very well be as small as −0.26 or as large as 0.16.
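The calculation above can be checked directly in R; this is a sketch with the 10 differences typed in from the table (the t.test function used in the R tutorial later gives the same interval):

```r
# Differences (HPLC - NIR) for the 10 tablets, typed in from the table above
d  <- c(0.3, -0.2, 0.0, 0.2, -0.7, 0.2, 0.1, 0.0, -0.3, -0.1)
m  <- mean(d)                          # -0.05
se <- sd(d) / sqrt(length(d))          # 0.0934
m + c(-1, 1) * qt(0.975, df = 9) * se  # 95% band: roughly -0.26 to 0.16
```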

The simple analysis just carried out is based on a statistical model. Formally, the statistical model is expressed by

d_i = µ + ε_i,  ε_i ∼ N(0, σ²),

where µ is the true average difference between the two methods (the true method bias), and σ is the true standard deviation. "True" refers here to the value in the "population" from which the "random sample" of 10 tablets was taken (in this case tablets with a nominal level of 10 mg active content). The differences d_i are defined by d_i = y_i2 − y_i1, where y_ij is the measurement by method j (j = 1 (NIR) and j = 2 (HPLC)) for tablet i, i = 1, . . . , 10. Formally, the average and standard deviation of the differences are "estimates" of the population values, and a "hat" notation is usually employed:

µ̂ = d̄,  σ̂ = s_d

1.2.2 Simple analysis by an ANOVA approach

The paired t-test setup can also be regarded as a "randomized blocks" setup with 10 "blocks" (the tablets) and two "treatments" (the methods). The model for this situation becomes:

y_ij = µ + α_i + β_j + ε_ij,  ε_ij ∼ N(0, σ²),   (1-1)

where µ now represents the overall mean of the measurements, α_i is the effect of the ith tablet and β_j is the effect of the jth method. An analysis of variance (ANOVA) will result in the following ANOVA table:


Source of    Degrees of   Sums of    Mean       F       P
variation    freedom      squares    squares
Tablets      9            2.0005     0.2223     5.10    0.0118
Methods      1            0.0125     0.0125     0.29    0.6054
Residual     9            0.3925     0.0436

Note that the P-value for the method effect is the same as for the paired t-test above. This is a consequence of the fact that the F-statistic is exactly equal to the square of the paired t-statistic:

F_Methods = t² = (−0.535)² = 0.29
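A sketch of how this ANOVA table could be reproduced in R with the standard lm and anova functions; the data are typed in here for illustration (the R tutorial below instead reads them from a file and restructures them):

```r
# Measurements for the 10 tablets, typed in from the table above
hplc <- c(10.4, 10.6, 10.2, 10.1, 10.3, 10.7, 10.3, 10.9, 10.1, 9.8)
nir  <- c(10.1, 10.8, 10.2,  9.9, 11.0, 10.5, 10.2, 10.9, 10.4, 9.9)
# Long format: one row per observation, with method and tablet as factors
long <- data.frame(y      = c(hplc, nir),
                   method = factor(rep(c("HPLC", "NIR"), each = 10)),
                   tablet = factor(rep(1:10, times = 2)))
fit <- lm(y ~ tablet + method, data = long)
anova(fit)  # Methods row: F = 0.29, P = 0.6054, as in the table above
```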

The estimate of the residual standard deviation σ is given by

σ̂ = √MS_Residual = √0.0436 = 0.209

Note that this equals the standard deviation of the differences divided by √2:

σ̂ = s_d/√2

The uncertainty of the average difference is given by

SE(ȳ₂ − ȳ₁) = √(σ̂²(1/10 + 1/10)) = 0.0934

exactly as above, and hence also the 95% confidence band will be the same.

1.2.3 The problem of the ANOVA approach

The analysis just carried out is what most statistical textbooks would present as the analysis of randomized complete block design data. When it comes to the test (and possible post hoc analysis) of treatment differences this is perfectly all right, and any statistical software would do this analysis for you (in SAS: PROC GLM). The problem arises if you also ask the program to give you an estimate of the uncertainty of the treatment averages individually, that is, not a treatment difference. For instance, what is the uncertainty of the average value of the NIR method? A standard use of the model (1-1) leads to:

SE(ȳ₁) = σ̂/√10 = 0.066


and this is again what any ordinary ANOVA software procedure would tell you. This is NOT correct, however! Assume for a moment that we only observed the NIR method. Then we would use these 10 values as a random sample to obtain the standard deviation

s₁ = 0.4012

and hence the uncertainty

SE(ȳ₁) = s₁/√10 = 0.127

Note how the model (1-1) dramatically under-estimates what appears to be the real uncertainty in the NIR values. This is so because the variance σ² in the model (1-1) measures the residual variability after possible tablet differences have been corrected for (the effects of tablets in the model), whereas in the subsequent, and in that respect better, analysis it is the variability between tablets that is used. The conceptual difference between the two approaches is whether the 10 tablets are considered a random sample or not. In the ANOVA they are not; each tablet is individually modeled with an effect, and the results of the analysis are only valid for these 10 specific tablets. However, these 10 specific tablets are not of particular interest in themselves - we are interested in tablets in general, and these 10 tablets should be considered as representing this population of tablets, i.e. a random sample. The drawback of analyzing the NIR (resp. HPLC) data in a separate analysis is that the information available in the complete data material about the variability within each tablet is not used, leading to inefficient data analysis. The solution is to combine the two in a model for the complete data that considers the tablets as a random sample, or in other words: where the tablet effect is considered a "random effect".
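The separate NIR calculation can be verified directly (NIR values typed in from the table above):

```r
# NIR measurements for tablets 1-10
nir <- c(10.1, 10.8, 10.2, 9.9, 11.0, 10.5, 10.2, 10.9, 10.4, 9.9)
sd(nir)              # s1 = 0.4012
sd(nir) / sqrt(10)   # standard error of the NIR average: 0.127
```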

1.2.4 The mixed model

The model with tablet as a random effect is expressed as

y_ij = µ + a_i + β_j + ε_ij,  ε_ij ∼ N(0, σ²),   (1-2)

where µ as before represents the overall mean of the measurements, β_j is the effect of the jth method, and a_i is the random effect of the ith tablet, assumed to be independent and normally distributed, a_i ∼ N(0, σ²_T). The random effects are also assumed to be independent of the error terms ε_ij. Since the model contains both random effects (tablet) and fixed (non-random) effects (method), we refer to models of this kind as mixed models. Each random effect in the model gives rise to a variance component. The residual error term ε_ij can be seen as a random effect, and hence the residual variance σ² is a variance component.


1.2.5 Comparison of fixed and mixed model

To understand the conceptual differences between the models (1-1) and (1-2) we will study three theoretical features of the models:

1. The expected value of the ijth observation yij

2. The variance of the ijth observation yij

3. The relation between two different observations (covariance/correlation)

The results are summarized in the following table:

                                Fixed model (1-1)    Mixed model (1-2)
1. E(y_ij)                      µ + α_i + β_j        µ + β_j
2. var(y_ij)                    σ²                   σ²_T + σ²
3. cov(y_ij, y_i'j'), j ≠ j'    0                    σ²_T (if i = i'), 0 (if i ≠ i')

The table is obtained by applying basic rules of calculus for expected values, variances and covariances to the model expressions in (1-1) and (1-2); for instance, the variance of y_ij in the mixed model:

var(y_ij) = var(µ + a_i + β_j + ε_ij)
          = var(a_i + ε_ij)
          = var(a_i) + var(ε_ij)
          = σ²_T + σ²

The expected values and the variances show how the effects of the tablets in the mixed model enter the variance (random) part (as a variance component) rather than the expected (fixed/systematic) part of the model. It also emphasizes that expectations under the mixed model do not depend on the individual tablet, but are expectations over the population of tablets.

To understand the covariance part of the table, recall that the covariance between two random variables, e.g. two different observations in the data set, expresses a relation between those two variables. So independent observations have zero covariance, which is


the case for the ordinary fixed effects model, where all observations are assumed independent. The result for the mixed model is obtained as follows:

cov(y_ij, y_i'j') = cov(µ + a_i + β_j + ε_ij, µ + a_i' + β_j' + ε_i'j'),  using (1-2)
                  = cov(a_i + ε_ij, a_i' + ε_i'j'),  only the random effects
                  = cov(a_i, a_i') + cov(a_i, ε_i'j') + cov(ε_ij, a_i') + cov(ε_ij, ε_i'j')  (each possible pair)

If observations on two different tablets are considered, i ≠ i', the independence assumptions of the mixed model give that all these covariances ("relations") are zero. However, if two different observations on the same tablet are considered, i = i' and j ≠ j', then only the latter three terms in the expression are zero, and the covariance is given by the first term:

cov(y_ij, y_i'j') = cov(a_i, a_i) = var(a_i) = σ²_T

Thus, observations on the same tablet are no longer assumed to be independent - some correlation is allowed between observations on the same tablet. This illustrates an essential feature of going from standard regression/ANOVA models to mixed models.
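The induced correlation, σ²_T/(σ²_T + σ²), can be illustrated by a small simulation; this is a sketch with assumed variance component values (not the values from the example):

```r
# Simulate many tablets, each measured twice; the two measurements share
# one random tablet effect a_i and therefore correlate as
# sigma_T^2 / (sigma_T^2 + sigma^2).
set.seed(1)
n       <- 100000
sigma_T <- 0.3   # assumed tablet-to-tablet standard deviation
sigma   <- 0.2   # assumed residual standard deviation
a  <- rnorm(n, 0, sigma_T)     # random tablet effects
y1 <- a + rnorm(n, 0, sigma)   # method 1 on each tablet
y2 <- a + rnorm(n, 0, sigma)   # method 2 on the same tablet
cor(y1, y2)                    # close to 0.09 / (0.09 + 0.04) = 0.69
```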

1.2.6 Analysis by mixed model

As mentioned above, the analysis of the data based on the mixed model is in this case to a large extent an exact copy of the ordinary analysis: the same decomposition of variability as given by the ANOVA table is used, and the F-tests for method effect (and/or tablet effect) and subsequent method comparisons are the same. The uncertainty of the average NIR-value in the mixed model is:

SE(ȳ₁) = √(σ²_T + σ²)/√10

Thus, to calculate this we need to estimate the two variance components. One way of doing this is by using the so-called expected mean squares. They give the theoretical expectation of the three mean squares of the ANOVA table (we do not go through these theoretical calculations of expectations here):


Source of    Degrees of   Sums of    Mean       E(MS)
variation    freedom      squares    squares
Tablets      9            2.0005     0.2223     2σ²_T + σ²
Methods      1            0.0125     0.0125     σ² + 10 Σ β_j²
Residual     9            0.3925     0.0436     σ²

The expectations show which features of the data enter each mean square. For instance, only the residual error variance component enters the residual mean square, and thus this is a natural estimate of this variance component, σ̂² = 0.0436 (exactly as in the fixed model). And since the expectation for the tablet mean square shows that 0.2223 is an estimate of 2σ²_T + σ², we can use the value for σ̂² to obtain:

σ̂²_T = (0.2223 − 0.0436)/2 = 0.0894

showing that the tablet-to-tablet variation seems to be around twice the size of the residual variation. The uncertainty of the average NIR-value now becomes

SE(ȳ₁) = √(0.0894 + 0.0436)/√10 = 0.115,

now much closer to the figure found in the separate NIR data analysis above. And due to the complete model specification we were now able to decompose the total variability into its two variance components. This gives additional information AND has impact on the degrees of freedom to be used when the standard errors are used for hypothesis testing and/or confidence interval calculations - something we will return to in more detail later.
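The moment estimation from the expected mean squares can be written out as a few lines of R (values taken from the ANOVA table above):

```r
MS_tablet   <- 0.2223
MS_residual <- 0.0436
sigma2   <- MS_residual                         # estimate of sigma^2
sigma2_T <- (MS_tablet - MS_residual) / 2       # estimate of sigma_T^2: 0.0894
SE_nir   <- sqrt(sigma2_T + sigma2) / sqrt(10)  # SE of the NIR average: 0.115
```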

The example has shown how the mixed model comes up as the more proper way of expressing a statistical model that fits the situation at hand. The main message was that we needed the mixed model to avoid mistakes in the direction of under-estimating uncertainty. As such, it came up as a necessary evil. The next example will illustrate how the mixed model may in a direct way give information about the key issues in a data set that a straightforward fixed ANOVA does not.

1.3 Example with missing values

Imagine that in addition to the 10 tablets of the previous example another 10 tablets were observed, but each of them only for one of the two methods, giving the following data table:


Tablet     HPLC   NIR    Difference
 1         10.4   10.1    0.3
 2         10.6   10.8   -0.2
 3         10.2   10.2    0.0
 4         10.1    9.9    0.2
 5         10.3   11.0   -0.7
 6         10.7   10.5    0.2
 7         10.3   10.2    0.1
 8         10.9   10.9    0.0
 9         10.1   10.4   -0.3
10          9.8    9.9   -0.1
11                10.8
12                 9.8
13                10.5
14                10.3
15                 9.7
16         10.3
17          9.6
18         10.0
19         10.2
20          9.9

1.3.1 Analysis by fixed effects ANOVA

An ordinary fixed model analysis will result in the following ANOVA table:

Source of    Degrees of   Sums of    Mean       F       P
variation    freedom      squares    squares
Tablets      19           3.7230     0.1959     4.49    0.0129
Methods      1            0.0125     0.0125     0.29    0.6054
Residual     9            0.3925     0.0436

Note that only the Tablets row of the table has changed compared to the previous analysis. Similarly, the estimate of the average method difference is given, as before, by

β̂₂ − β̂₁ = −0.05


only using the 10 tablets for which both observations are present. And the uncertainty is also given by just these 10 tablets, as before:

SE(β̂₂ − β̂₁) = √(σ̂²(1/10 + 1/10)) = 0.0934

So in summary, the fixed effects analysis only uses the information in the first 10 tablets.

1.3.2 Analysis by mixed model

Consider for a moment how an analysis of the 10 tablets for which only one of the methods was observed could be carried out. This data set can be regarded as two independent samples - a sample of size 5 within each method - and a classical two-sample t-test setting is at hand:

ȳ₁ = 10.22,  s₁ = 0.4658
ȳ₂ = 10.00,  s₂ = 0.2739

The difference is estimated to be

ȳ₂ − ȳ₁ = −0.22,

and the (pooled) standard error to:

SE(ȳ₂ − ȳ₁) = √2 · s/√5 = 0.24,  where s² = (s₁² + s₂²)/2 = 0.146

The results from the two separate analyses can be summarized as:

             Tablets 1-10   Tablets 11-20
Difference   -0.05          -0.22
SE²           0.00872        0.0584

The fixed effects ANOVA only uses the first column of information; it would be preferable to use all the information. Since the two estimates of the method difference have (very) different uncertainties, a weighted average of the two, using the inverse squared standard errors as weights, can be calculated:

β̂₂ − β̂₁ = ((−0.05) · (1/0.00872) + (−0.22) · (1/0.0584)) / (1/0.00872 + 1/0.0584)
         = (0.87)(−0.05) + (0.13)(−0.22) = −0.072


Using basic rules of variance calculus gives the squared standard error of this weighted average:

SE²(β̂₂ − β̂₁) = 1/(1/0.00872 + 1/0.0584) = 0.00759

and hence

SE(β̂₂ − β̂₁) = 0.0871
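The weighted-average calculation spelled out in R (numbers taken from the two separate analyses above):

```r
d1 <- -0.05; v1 <- 0.00872   # tablets 1-10: difference and squared SE
d2 <- -0.22; v2 <- 0.0584    # tablets 11-20: difference and squared SE
w1  <- (1/v1) / (1/v1 + 1/v2)    # weight on the first estimate: 0.87
est <- w1 * d1 + (1 - w1) * d2   # combined estimate: -0.072
se  <- sqrt(1 / (1/v1 + 1/v2))   # its standard error: 0.0871
```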

Note that apart from giving a slightly different value, this estimator is also more precise than the one based only on tablets 1-10. This is the kind of analysis that the mixed model for this situation leads to. And by combining the data in one analysis (rather than two separate ones), the information about the two variance components is used in an optimal way. In this case, the variance components are not easily derived from the ANOVA table. For now we will just state the results as they are given by PROC MIXED in SAS:

σ̂² = 0.0435,  σ̂²_T = 0.1019

β̂₂ − β̂₁ = −0.07211,  SE(β̂₂ − β̂₁) = 0.0870

We see how the mixed model automatically incorporates all the information in the analysis of the method difference and is thus superior to a pure fixed effects ANOVA. This is an example of how analysis by the mixed model automatically "recovers the inter-block information" in an incomplete blocks design, Cochran and Cox (1957).
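A corresponding R analysis can be sketched with the lme function from the nlme package (shipped with R). Note two assumptions: the PROC MIXED output above is simply being re-derived, not quoted, and the assignment of tablets 11-15 to NIR and 16-20 to HPLC is inferred from the group means 10.22 (NIR) and 10.00 (HPLC) in the two-sample analysis above:

```r
library(nlme)
# Tablets 1-10 measured by both methods; tablets 11-15 by NIR only and
# tablets 16-20 by HPLC only (an inferred assignment, see the lead-in).
hplc <- c(10.4, 10.6, 10.2, 10.1, 10.3, 10.7, 10.3, 10.9, 10.1, 9.8,
          10.3,  9.6, 10.0, 10.2,  9.9)          # tablets 1-10, then 16-20
nir  <- c(10.1, 10.8, 10.2,  9.9, 11.0, 10.5, 10.2, 10.9, 10.4, 9.9,
          10.8,  9.8, 10.5, 10.3,  9.7)          # tablets 1-10, then 11-15
long <- data.frame(y      = c(hplc, nir),
                   method = factor(rep(c("HPLC", "NIR"), each = 15)),
                   tablet = factor(c(1:10, 16:20, 1:10, 11:15)))
fit <- lme(y ~ method, random = ~ 1 | tablet, data = long)
fixef(fit)   # methodNIR is NIR minus HPLC, i.e. -(beta2 - beta1), about 0.0721
VarCorr(fit) # variance components, cf. the PROC MIXED values above
```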

1.4 Why use mixed models?

We just saw above how the use of mixed linear models saved us from "making a mistake" when it came to the uncertainty of the expected NIR level. More to the point, the mixed linear model made it possible to broaden the statistical inference made about the average NIR level. Statistical inference is the process of using data to say something about the population(s)/real world from which the data came in the first place. Parameter estimates, uncertainties, confidence intervals and hypothesis testing are all examples of statistical inference. The inference induced by the fixed effects model in the introductory example is only valid for the 10 specific tablets in the experiment: the low uncertainty is valid for the estimation of the average of the 10 unknown true NIR values. The inference induced by the mixed model is valid for the estimation of the tablet population average NIR value.

For the randomized complete block setting, the inference about treatment differences was not affected by the broadening of the inference space. In other situations, when the data have a hierarchical structure, the importance of doing the proper inference is an


issue also for the tests of treatment differences. If 20 patients are allocated to two treatment groups, and then subsequently measured 10 times each, the essential variability when it comes to comparing the two treatments will most likely be in the patient-to-patient differences. And clearly it would not be valid to analyse the data as if 100 independent observations were available in each group. A mixed model with patients as a random effect would handle the situation properly, inducing the inference most likely to be the relevant one in this case.

We also saw above how a mixed model can recover information in the data not found by a fixed effects model when incomplete and/or unbalanced data are at hand. This is an important benefit of mixed models.

The mixed model approach offers a flexible way of modelling covariance/correlation in the data. This is particularly relevant for longitudinal data or other types of repeated measures data, e.g. spatial data as in geostatistics. In this way, the proper inference about fixed effects is obtained, and the covariance structure itself provides additional insight into the problem at hand. The handling of inhomogeneous variances in fixed and mixed models is also included in the tool box.

Hence, there are many advantages to mixed models. And in many cases, a mixed model is really the only reasonable model for the given data. It is only fair to admit that there is also a potential disadvantage: more distributional assumptions are made, and approximations are used in the methodology, leading to potentially biased results. Also, the high complexity of (some of) the models makes the data handling and communication of the results a challenge. However, after this course you should be ready to meet this challenge!

1.5 R-TUTORIAL: What is R?

R is a free and very flexible statistical computer program. In this appendix we introduce some of the basic R-commands necessary when making a statistical data analysis. This involves reading data from a file, preparing the data set for a statistical analysis (this could for instance include choosing a subset of the data or transforming some of the variables in the data set), getting an overview of the variation in the data using plots, and making the actual statistical analysis of the data. A more comprehensive introduction to statistical analysis using R can be found in Dalgaard (2002), or check the manuals at the R homepage or a one-page reference card.


1.6 R-TUTORIAL: Importing Data

The first skill to learn is how to get data into R. Please note that you can cut and paste the code from the browser directly into the script/prompt window in R. When entering R you meet a prompt,

>

which indicates that R is ready to receive a command or a list of commands. Data are most conveniently entered into R using the command (or more precisely the function) read.table. To perform the steps in this description you should download the dataset hplcnir1.txt. Place it somewhere where you can find it again, and assume that the working directory of R is where you have your file. Then writing

> hpnir1 <- read.table("hplcnir1.txt", header=TRUE, sep=",", dec=".")

creates a data set (called a data.frame in R) named hpnir1 using the assignment operator <- and the function read.table, which takes as arguments the file name and an indicator (header=TRUE) telling that there is a header in the file. Note that the Windows backslashes "\" in a file path are written as slashes "/" in R. Missing values are allowed, and in R they are denoted NA. If you wonder exactly what R has created, you only have to type the variable name (and this is always the case in R),

> hpnir1

Note that, when the prompt ">" is shown as part of the R output, remember to omit the prompt if you cut and paste the commands into R. Also, if you ever wonder just what a function does or what arguments you can use, type a ? in front of the function name, for example ?read.table, and the online help will provide you with a description of the function.

1.7 R-TUTORIAL: Data handling

The two variables in the data frame hpnir1 (hplc and nir) can be accessed using the $-notation,


> hpnir1$hplc

Formally, hpnir1$hplc is a vector of length 10 (that is, 10 numbers arranged one after another), and [1] just means that the first number in that line is the first element of the vector. This information is really only relevant if the vector is so long that displaying it exceeds one line. Note that R is case sensitive: writing hpnir1$Hplc would produce NULL as output, indicating that no such variable is present. The somewhat tedious $-notation can be suppressed by telling R to look for the variable names in the data set hpnir1. This is done using the attach function,

> attach(hpnir1)

> hplc

Similarly R is told not to look in the data frame hpnir1 by writing

> detach(hpnir1)

In the following we assume that hpnir1 is attached. Of course, if hpnir1 is modified (as it will be later on) it has to be detached once again. It is possible to modify the data set using the functions subset and transform. Say, for instance, that we only want to consider data with an hplc value above 10.4. This can be obtained by writing

> hpnir12 <- subset(hpnir1, hplc > 10.4)

The result is a new data frame hpnir12 with the same variable names. Similarly, suppose that we want to transform the nir measurements using the natural logarithm. A new variable containing the transformed values can be created using the transform command,

> hpnir13 <- transform(hpnir1, lognir = log(nir))

In other words, the new data frame hpnir13 contains the variables from the original data frame hpnir1 along with a new variable lognir, which is the natural logarithm of the nir values.


Apart from the natural logarithm function log there are a number of built-in mathematical functions (such as exp, the exponential function, and sqrt, the square root) as well as statistical functions (such as mean, var, sd, median and quantile, calculating the empirical mean, variance, standard deviation, median, and quantiles, respectively, of a vector of numbers).

1.8 R-TUTORIAL: Creating Graphs

One of the strong sides of R is its graphical functions. It is generally very easy to produce plots that give insight into the structure of the data. The most important function is plot. One can simply supply the name of a data frame as the argument to the plot function, and plots of all pairs of variables are produced. In the case of the data frame hpnir13 the result is given in Figure 1.1. The way to get the 3-by-3 plots is to use the graphical control parameter par. This is really a function, and one of the possible arguments to it is mfrow, which is read as multiframe rowwise. Specifying c(3,3) means that we want 3 rows and 3 columns of plots.

To produce a simple scatterplot of nir versus hplc simply write the following (see Figure 1.2):

Note that the specification par(mfrow=c(1,1)) sets the setting back to the default, consisting of a single plot per page. The function abline is used to add lines to the plot: vertical, horizontal or, in this case, the fitted line. The function call lsfit(hplc,nir) makes the least squares regression analysis of nir as a function of hplc and actually contains various information from that analysis. However, for now we only need the estimated intercept and slope, and the nice thing is that the abline function knows where to find this information in the result of lsfit. This feature is another strength of R. Clearly there are many options for creating advanced graphics. In Module 3 we will see some more of these.

1.9 R-TUTORIAL: Introductory example: NIR predictions of HPLC measurements

In this section we go through the steps needed to do the calculations given in the Intro-ductory example 1.2.


par(mfrow=c(3,3))

plot(hpnir13)


Figure 1.1: The result of running the R-code plot(hpnir13)

1.9.1 Simple analysis

This is most easily carried out by constructing the difference as a new variable in the data frame and then using some of the basic statistical functions:


attach(hpnir1)

par(mfrow=c(1,1))

plot(hplc,nir,main="Plot of NIR vs HPLC, Example 1",

sub="The data was kindly provided by Lundbeck A/S")

abline(lsfit(hplc,nir))


detach(hpnir1)

Figure 1.2: Plot of NIR vs HPLC, Example 1

hpnir1 <- transform(hpnir1,d=hplc-nir)

attach(hpnir1)

mean(d)

[1] -0.05

var(d)

[1] 0.08722222

sd(d)


[1] 0.2953341

range(d)

[1] -0.7 0.3

quantile(d)

0% 25% 50% 75% 100%

-0.700 -0.175 0.000 0.175 0.300

t.test(d)

One Sample t-test

data: d

t = -0.53537, df = 9, p-value = 0.6054

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

-0.2612693 0.1612693

sample estimates:

mean of x

-0.05

detach(hpnir1)

1.9.2 Simple analysis by an ANOVA approach: Data restructuring

As in SAS the data must be in a different form to do the proper ANOVA, since from this point of view there are 20 observations of concentrations yij and two factors: the "treatment" factor (method) and the "blocking" factor (tablet). So a data set with 20 observations (rows) and 3 variables (columns) is needed. The following R-lines make the


proper transformation of the data using the tidyr package:

require(tidyr, quietly = TRUE)

temp <- transform(hpnir1[, c(1,2)], tablet = 1:10)

temp <- gather(temp, key = "method" , value="value", 1:2)

Note that method is a character valued variable, and even though tablet is a numeric variable, the information in tablet is really only a coding identifying each tablet. Both of these variables are "class variables" (in e.g. SAS terminology), or what we call factors, that divide the experimental units into treatment groups. R does not know this, however, before we tell it, and the way to do this is by using the function factor.

temp$tablet <- factor(temp$tablet)

temp$method <- factor(temp$method)

Some boxplots produced with the ggplot2 package are provided by the following:

## boxplots with the ggplot2 package

library(ggplot2, quietly = TRUE)

ggplot(temp, aes(x = method, y = value,

colour = method)) + geom_boxplot()



## boxplots with the ggplot2 package

ggplot(temp, aes(x = tablet, y = value)) + geom_boxplot()



1.9.3 ANOVA approach: Using function lm

In R a linear model is fitted to data using the function lm, which is in fact an abbreviation of linear model. The following lines will do the job:


model1 <- lm(value ~ method + tablet, data = temp)

model1

Call:

lm(formula = value ~ method + tablet, data = temp)

Coefficients:

(Intercept) methodnir tablet2 tablet3 tablet4

1.022e+01 5.000e-02 4.500e-01 -5.000e-02 -2.500e-01

tablet5 tablet6 tablet7 tablet8 tablet9

4.000e-01 3.500e-01 1.130e-14 6.500e-01 1.130e-14

tablet10

-4.000e-01

The argument to the function lm is a model formula (y ~ method + tablet) which is read as y described by method and tablet. The result of the call to lm, here model1, is a model object whose printed form in itself does not show much information: it lists the function call and the parameter estimates. However, the object does contain a lot more information, which can be extracted using different functions. An example of such a function is summary,

summary(model1)

Call:

lm(formula = value ~ method + tablet, data = temp)

Residuals:

Min 1Q Median 3Q Max

-0.3250 -0.0875 0.0000 0.0875 0.3250

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.022e+01 1.549e-01 66.021 2.12e-13 ***

methodnir 5.000e-02 9.339e-02 0.535 0.6054

tablet2 4.500e-01 2.088e-01 2.155 0.0596 .

tablet3 -5.000e-02 2.088e-01 -0.239 0.8161

tablet4 -2.500e-01 2.088e-01 -1.197 0.2618

tablet5 4.000e-01 2.088e-01 1.915 0.0877 .


tablet6 3.500e-01 2.088e-01 1.676 0.1281

tablet7 1.130e-14 2.088e-01 0.000 1.0000

tablet8 6.500e-01 2.088e-01 3.113 0.0125 *

tablet9 1.130e-14 2.088e-01 0.000 1.0000

tablet10 -4.000e-01 2.088e-01 -1.915 0.0877 .

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.2088 on 9 degrees of freedom

Multiple R-squared: 0.8368,Adjusted R-squared: 0.6555

F-statistic: 4.616 on 10 and 9 DF, p-value: 0.0154

The summary function gives a few summary statistics for the residuals and then it prints the parameter estimates with estimated standard errors and t-tests corresponding to the hypotheses that the parameters are zero. Note here the (default) way R decides to tell the story about the tablet and method effects, which in this case can also be expressed via the simple averages:

## Overall mean of y:

mean(temp$value)

[1] 10.365

## Tablet means:

tapply(temp$value, temp$tablet, mean)

1 2 3 4 5 6 7 8 9 10

10.25 10.70 10.20 10.00 10.65 10.60 10.25 10.90 10.25 9.85

## Method means:

tapply(temp$value, temp$method, mean)

hplc nir

10.34 10.39

So the Intercept is the estimated value for tablet 1 and method 1 (= HPLC, as


ordered by the programme):

µ̂ = ȳ1· + ȳ·1 − ȳ·· = 10.25 + 10.34 − 10.365 = 10.225

And the effect coefficients express the difference between each of levels 2 up to the last level and the first level, e.g. for tablet 2:

α̂2 = ȳ2· − ȳ1· = 10.70 − 10.25 = 0.45

And the NIR method:

β̂NIR = ȳ·2 − ȳ·1 = 10.39 − 10.34 = 0.05
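These three relations can be checked numerically; the following sketch recomputes them from the raw data values (as listed for Example 1 in this eNote):

```r
hplc <- c(10.4, 10.6, 10.2, 10.1, 10.3, 10.7, 10.3, 10.9, 10.1, 9.8)
nir  <- c(10.1, 10.8, 10.2, 9.9, 11.0, 10.5, 10.2, 10.9, 10.4, 9.9)
y1. <- mean(c(hplc[1], nir[1]))   # tablet 1 mean: 10.25
y.1 <- mean(hplc)                 # hplc mean: 10.34
y.. <- mean(c(hplc, nir))         # overall mean: 10.365
y1. + y.1 - y..                   # the Intercept: 10.225
mean(c(hplc[2], nir[2])) - y1.    # the tablet2 effect: 0.45
mean(nir) - y.1                   # the methodnir effect: 0.05
```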

In Module 4 some more details are given related to parameters. In this model there seem to be 1 + 2 + 10 = 13 parameters (intercept µ, method effects β1, β2 and tablet effects α1, . . . , α10). But really there are only 11 "free" parameters in the model, since one method effect and one tablet effect are fixed at zero by R's default treatment-contrast constraint.

The Residual standard error is the estimate of σ given by √(SSe/DFe). The Multiple R-Squared is the R2-statistic. Finally the F-statistic is the F-test statistic associated with the hypothesis that all the mean parameters are zero, which is rarely relevant, since it generally mixes together different things, in this case the method differences and tablet differences. The analysis of variance table is given by the function anova:

require(xtable, quietly = TRUE)

print(xtable(anova(model1)), comment=FALSE)

          Df Sum Sq Mean Sq F value Pr(>F)
method     1   0.01    0.01    0.29 0.6054
tablet     9   2.00    0.22    5.10 0.0118
Residuals  9   0.39    0.04

In general this would correspond to the so-called Type I ANOVA table (Successive effects).
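As a consistency check, the following sketch refits model1 from the raw data (so the snippet is self-contained) and recomputes the parameter count and the residual standard error:

```r
hplc <- c(10.4, 10.6, 10.2, 10.1, 10.3, 10.7, 10.3, 10.9, 10.1, 9.8)
nir  <- c(10.1, 10.8, 10.2, 9.9, 11.0, 10.5, 10.2, 10.9, 10.4, 9.9)
temp <- data.frame(value  = c(hplc, nir),
                   method = factor(rep(c("hplc", "nir"), each = 10)),
                   tablet = factor(rep(1:10, times = 2)))
model1 <- lm(value ~ method + tablet, data = temp)
length(coef(model1))   # 11 free parameters: intercept + 1 method + 9 tablet contrasts
SSe <- sum(residuals(model1)^2)   # residual sum of squares
DFe <- df.residual(model1)        # residual degrees of freedom: 20 - 11 = 9
sqrt(SSe / DFe)                   # the Residual standard error, about 0.2088
```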


Remark 1.1

The Type I table provides a successive decomposition of the total variability in the order that the effects were entered in the model formula (so the results shown depend on this order!). This means that the first row in the Type I table expresses differences for the first factor completely ignoring the other factor. The second row expresses differences for the second factor correcting for the first. The bottom row of the Type I table will always equal the bottom row of the Type III table. The Type III table is in the help pages for SAS explained as: The sum of squares for each term represents the variation among the means for the different levels of the factors. The Type III Tests table presents the Type III sums of squares associated with the effects in the model. The Type III sum of squares for a particular effect is the amount of variation in the response due to that effect after correcting for all other terms in the model. Type III sums of squares, therefore, do not depend on the order in which the effects are specified in the model.

To obtain so-called Type II or Type III tables, we may use the car-package:

Remark 1.2

The library call needs some clarification: there are a large number of additional R packages available that are NOT automatically installed with the base installation. A package is a collection of functions. One such package is car. To access the functions from a package, the package should once and for all be installed on your local computer. This requires an internet connection, since clicking the "Packages" menu bar in R will give you the option "Install package(s) from CRAN" and it will list all the possible packages (that it identifies at an R website). Click the wanted package and it is installed! This only needs to be carried out once. To actually use the functions from an add-on package (and for the help information to be visible) the package must also be loaded - this is done by the function library as shown above. This needs to be done every time R is re-started.

require(car)

print(xtable(Anova(model1)), comment=FALSE)


          Sum Sq Df F value Pr(>F)
method      0.01  1    0.29 0.6054
tablet      2.00  9    5.10 0.0118
Residuals   0.39  9

print(xtable(Anova(model1,type=c("III"))), comment=FALSE)

            Sum Sq Df F value Pr(>F)
(Intercept) 190.09  1 4358.80 0.0000
method        0.01  1    0.29 0.6054
tablet        2.00  9    5.10 0.0118
Residuals     0.39  9

First, the car package is loaded and then the Anova function is applied twice (remember that R is case sensitive). In this case there is no difference between the three Types.

1.9.4 R-TUTORIAL: ANOVA post hoc

As is clear from the example, the default parameter story as provided by the coefficients output of the model summary in R is most often not the way you would like to have the final interpretations of your results. We will get back to more details on this parametrization and why we have to live with this. Here we present briefly how to do some post hoc analysis in this case using the packages lsmeans and multcomp. The focus is here on so-called LSMEANS, that is, estimated expected values for an effect assuming all other effects/factors/covariates at the average level. The summary of LSMEANS and DIFFERENCES of LSMEANS can be a very good way of summarizing what is going on:

require(lsmeans)

lsmeans::lsmeans(model1, pairwise ~ method)

$lsmeans

method lsmean SE df lower.CL upper.CL

hplc 10.34 0.06603871 9 10.19061 10.48939

nir 10.39 0.06603871 9 10.24061 10.53939


Results are averaged over the levels of: tablet

Confidence level used: 0.95

$contrasts

contrast estimate SE df t.ratio p.value

hplc - nir -0.05 0.09339284 9 -0.535 0.6054

Results are averaged over the levels of: tablet

lsmeans::lsmeans(model1, pairwise ~ tablet)

$lsmeans

tablet lsmean SE df lower.CL upper.CL

1 10.25 0.147667 9 9.915954 10.58405

2 10.70 0.147667 9 10.365954 11.03405

3 10.20 0.147667 9 9.865954 10.53405

4 10.00 0.147667 9 9.665954 10.33405

5 10.65 0.147667 9 10.315954 10.98405

6 10.60 0.147667 9 10.265954 10.93405

7 10.25 0.147667 9 9.915954 10.58405

8 10.90 0.147667 9 10.565954 11.23405

9 10.25 0.147667 9 9.915954 10.58405

10 9.85 0.147667 9 9.515954 10.18405

Results are averaged over the levels of: method

Confidence level used: 0.95

$contrasts

contrast estimate SE df t.ratio p.value

1 - 2 -4.500000e-01 0.2088327 9 -2.155 0.5367

1 - 3 5.000000e-02 0.2088327 9 0.239 1.0000

1 - 4 2.500000e-01 0.2088327 9 1.197 0.9551

1 - 5 -4.000000e-01 0.2088327 9 -1.915 0.6632

1 - 6 -3.500000e-01 0.2088327 9 -1.676 0.7856

1 - 7 -1.130050e-14 0.2088327 9 0.000 1.0000

1 - 8 -6.500000e-01 0.2088327 9 -3.113 0.1759

1 - 9 -1.129741e-14 0.2088327 9 0.000 1.0000

1 - 10 4.000000e-01 0.2088327 9 1.915 0.6632

2 - 3 5.000000e-01 0.2088327 9 2.394 0.4197


2 - 4 7.000000e-01 0.2088327 9 3.352 0.1285

2 - 5 5.000000e-02 0.2088327 9 0.239 1.0000

2 - 6 1.000000e-01 0.2088327 9 0.479 0.9999

2 - 7 4.500000e-01 0.2088327 9 2.155 0.5367

2 - 8 -2.000000e-01 0.2088327 9 -0.958 0.9883

2 - 9 4.500000e-01 0.2088327 9 2.155 0.5367

2 - 10 8.500000e-01 0.2088327 9 4.070 0.0492

3 - 4 2.000000e-01 0.2088327 9 0.958 0.9883

3 - 5 -4.500000e-01 0.2088327 9 -2.155 0.5367

3 - 6 -4.000000e-01 0.2088327 9 -1.915 0.6632

3 - 7 -5.000000e-02 0.2088327 9 -0.239 1.0000

3 - 8 -7.000000e-01 0.2088327 9 -3.352 0.1285

3 - 9 -5.000000e-02 0.2088327 9 -0.239 1.0000

3 - 10 3.500000e-01 0.2088327 9 1.676 0.7856

4 - 5 -6.500000e-01 0.2088327 9 -3.113 0.1759

4 - 6 -6.000000e-01 0.2088327 9 -2.873 0.2387

4 - 7 -2.500000e-01 0.2088327 9 -1.197 0.9551

4 - 8 -9.000000e-01 0.2088327 9 -4.310 0.0357

4 - 9 -2.500000e-01 0.2088327 9 -1.197 0.9551

4 - 10 1.500000e-01 0.2088327 9 0.718 0.9984

5 - 6 5.000000e-02 0.2088327 9 0.239 1.0000

5 - 7 4.000000e-01 0.2088327 9 1.915 0.6632

5 - 8 -2.500000e-01 0.2088327 9 -1.197 0.9551

5 - 9 4.000000e-01 0.2088327 9 1.915 0.6632

5 - 10 8.000000e-01 0.2088327 9 3.831 0.0678

6 - 7 3.500000e-01 0.2088327 9 1.676 0.7856

6 - 8 -3.000000e-01 0.2088327 9 -1.437 0.8873

6 - 9 3.500000e-01 0.2088327 9 1.676 0.7856

6 - 10 7.500000e-01 0.2088327 9 3.591 0.0934

7 - 8 -6.500000e-01 0.2088327 9 -3.113 0.1759

7 - 9 3.089444e-18 0.2088327 9 0.000 1.0000

7 - 10 4.000000e-01 0.2088327 9 1.915 0.6632

8 - 9 6.500000e-01 0.2088327 9 3.113 0.1759

8 - 10 1.050000e+00 0.2088327 9 5.028 0.0141

9 - 10 4.000000e-01 0.2088327 9 1.915 0.6632

Results are averaged over the levels of: method

P value adjustment: tukey method for comparing a family of 10 estimates

Or combining the two packages to get a plot of all pairwise differences with confidence


intervals and multiplicity correction:

require(multcomp)

lsm.tablet <- lsmeans::lsmeans(model1, pairwise ~ tablet, glhargs = list())

plot(lsm.tablet[[2]])


Read more about how to use these two packages in this VERY useful vignette from the lsmeans package:


vignette("using-lsmeans", package="lsmeans")

One may also more explicitly compute any kind of contrast, that is, an (estimable) linear function of the parameters of the model. For illustration let us try to find the LSMEANS for the hplc level by this more "manual" approach. Formally, the LSMEANS are in this case, for e.g. the hplc method:

µ + β1 + ᾱ = µ + β1 + 0 · β2 + (1/10)(α1 + · · · + α10)

In practice we then define the list of the 11 numbers matching the 11 coefficients of the R object as a row in a matrix with 11 columns, give this matrix a name (and give the row a name) and use it in a call to the glht function of the multcomp package:

model1$coef

(Intercept) methodnir tablet2 tablet3 tablet4

1.022500e+01 5.000000e-02 4.500000e-01 -5.000000e-02 -2.500000e-01

tablet5 tablet6 tablet7 tablet8 tablet9

4.000000e-01 3.500000e-01 1.130050e-14 6.500000e-01 1.129741e-14

tablet10

-4.000000e-01

myestimates <- matrix(c(1,0,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,

1,1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,

0,1,0,0,0,0,0,0,0,0,0), nrow=3,byrow=T)

rownames(myestimates)=c("LS HPLC","LS NIR","LS DIF")

glht(model1,linfct=myestimates)

General Linear Hypotheses

Linear Hypotheses:

Estimate

LS HPLC == 0 10.34

LS NIR == 0 10.39

LS DIF == 0 0.05

The difference of LSMEANS is obtained by subtracting the two sets of coefficients from each other, as coded for in the last of the three rows.
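A small sketch of this subtraction, with the first two coefficient rows reconstructed here so the snippet is self-contained:

```r
row_hplc <- c(1, 0, rep(0.1, 9))   # the "LS HPLC" coefficients
row_nir  <- c(1, 1, rep(0.1, 9))   # the "LS NIR" coefficients
row_nir - row_hplc                 # c(0, 1, 0, ..., 0): the "LS DIF" row
```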


And finally, the results could be plotted as:

plot(glht(model1,linfct=myestimates))


But just to emphasize: most often we would be able to make the relevant post hoc analysis directly with the multcomp and lsmeans packages. And if you have some contrasts defined explicitly like this, you can still use the glht function to summarize the results in a plot, like we just did. An important feature of the glht function is the default use of a multiplicity correction method: accounting for the additional risk of significance-by-chance when we simultaneously carry out several (sometimes many) post hoc tests. All pairwise tablet comparisons would amount to 10 · 9/2 = 45 tests. A well-known approach to correct for this is the so-called Bonferroni correction, where one would simply multiply the p-values by the number of comparisons made, in this case 45. This is known to be a bit too strict a correction, so in the R functions a version of the Holm method is used: correct the smallest p-value by n, the next-smallest by n − 1, etc.; see also the help of the p.adjust function, which the glht function is based upon:

?p.adjust
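As a small sketch of the two corrections (with made-up p-values from three hypothetical tests):

```r
p <- c(0.001, 0.010, 0.040)          # made-up raw p-values from 3 tests
p.adjust(p, method = "bonferroni")   # each multiplied by 3: 0.003 0.03 0.12
p.adjust(p, method = "holm")         # 0.001*3, 0.010*2, 0.040*1: 0.003 0.02 0.04
```

Note how Holm corrects less severely than Bonferroni for all but the smallest p-value.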

1.9.5 Analysis by mixed model

For linear mixed models in R we will use either the lmer function of the lme4 package or the lme function, which is part of the nlme package. For many reasons we prefer


lmer, and for most of the material this is the function we will use. However, for some of the important longitudinal models to be introduced at the end of the material, we need features of the lme function to be able to run these models, and hence we will use it at that point. It can be anticipated that at some point in time, these longitudinal models will also be implemented within the lme4 framework.

The practical approach is simply to omit the random effects from the (fixed) model specification and then specify these random effects separately. Simply write:

library(lme4)

model2 <- lmer(value ~ method + (1 | tablet), data=temp)

model2

Linear mixed model fit by REML [’merModLmerTest’]

Formula: value ~ method + (1 | tablet)

Data: temp

REML criterion at convergence: 13.9605

Random effects:

Groups Name Std.Dev.

tablet (Intercept) 0.2989

Residual 0.2088

Number of obs: 20, groups: tablet, 10

Fixed Effects:

(Intercept) methodnir

10.34 0.05

The fixed part of the model is written in the usual form (as for lm). The random effects are specified in parentheses, with the "1" followed by the grouping variable after the |. Note that R gives the standard deviations, NOT the variance components themselves. The anova and summary functions also work for lmer objects:

summary(model2)

Linear mixed model fit by REML t-tests use Satterthwaite approximations

to degrees of freedom [lmerMod]

Formula: value ~ method + (1 | tablet)

Data: temp

REML criterion at convergence: 14


Scaled residuals:

Min 1Q Median 3Q Max

-1.28851 -0.50128 -0.03985 0.52348 1.82403

Random effects:

Groups Name Variance Std.Dev.

tablet (Intercept) 0.08933 0.2989

Residual 0.04361 0.2088

Number of obs: 20, groups: tablet, 10

Fixed effects:

Estimate Std. Error df t value Pr(>|t|)

(Intercept) 10.34000 0.11530 12.40100 89.678 <2e-16 ***

methodnir 0.05000 0.09339 9.00000 0.535 0.605

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Correlation of Fixed Effects:

(Intr)

methodnir -0.405
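Since R reports standard deviations rather than the variance components themselves, the variances can be extracted explicitly; a sketch reusing the model2 object fitted above (assumes lme4 is loaded):

```r
# Print variances instead of standard deviations
print(VarCorr(model2), comp = "Variance")
# Or as a data frame; the "vcov" column holds the variances
as.data.frame(VarCorr(model2))
```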

print(xtable(anova(model2)), comment=FALSE)

       Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
method   0.01    0.01  1.00  9.00    0.29 0.6054

As can be seen, the anova table now focusses on the fixed effects, but the plain lme4 version of anova produces no P-value for the hypothesis test of no method effect (the P-value shown here comes from lmerTest). To get such a test directly, we can use the Anova function from the car package:

## Type III ANOVA table

print(xtable(Anova(model2, type = 3, test.statistic = "F")), comment=FALSE)

                  F Df Df.res Pr(>F)
(Intercept) 8042.13  1     12 0.0000
method         0.29  1      9 0.6054

Or we may use the newly developed package lmerTest that overwrites and re-defines the generic anova function:


require(lmerTest, quietly = TRUE)

model2 <- lmer(value ~ method + (1 | tablet), data=temp)

print(xtable(anova(model2)), comment=FALSE)

       Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
method   0.01    0.01  1.00  9.00    0.29 0.6054

The post hoc analysis can be obtained as above for linear models. However, note that we should only consider parameters corresponding to the fixed part of the model:

model2 <- lmer(value ~ method + (1 | tablet), data=temp)

lsmeansm2 <- lsmeans::lsmeans(model2, pairwise ~ method)

print(xtable(lsmeansm2$lsmeans), comment=FALSE)

method  lsmean     SE    df lower.CL upper.CL
hplc   10.3400 0.1153 12.40  10.0897  10.5903
nir    10.3900 0.1153 12.40  10.1397  10.6403
Confidence level used: 0.95

And note the double notation lsmeans::lsmeans for the call, which states explicitly the package the function is taken from. Sometimes it is more informative to omit the intercept term from the model expression to produce a more directly interpretable parametrization:

model3 <- lmer(value ~ -1+ method + (1 | tablet), data=temp)

summary(model3)

Linear mixed model fit by REML t-tests use Satterthwaite approximations

to degrees of freedom [lmerMod]

Formula: value ~ -1 + method + (1 | tablet)

Data: temp

REML criterion at convergence: 14

Scaled residuals:

Min 1Q Median 3Q Max

-1.28851 -0.50128 -0.03985 0.52348 1.82403

Random effects:


Groups Name Variance Std.Dev.

tablet (Intercept) 0.08933 0.2989

Residual 0.04361 0.2088

Number of obs: 20, groups: tablet, 10

Fixed effects:

Estimate Std. Error df t value Pr(>|t|)

methodhplc 10.3400 0.1153 12.4010 89.68 <2e-16 ***

methodnir 10.3900 0.1153 12.4010 90.11 <2e-16 ***

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Correlation of Fixed Effects:

mthdhp

methodnir 0.672

Note that the model is exactly the same, but now the lsmeans for the two methods are given directly.

1.10 R-TUTORIAL: Example with missing values

In this section we go through the steps needed to do the calculations given in the introductory example with missing values. The R-lines needed can be copied from above (with suitable change of file names):

hpnir2 <- read.table("hplcnir2.txt",header=TRUE,sep=",")

tmp1<-subset(hpnir2,select=hplc)

tmp1<-transform(tmp1,method="hplc",y=hplc,tablet=1:20)

tmp1<-subset(tmp1,select=c(tablet,method,y))

tmp2<-subset(hpnir2,select=nir)

tmp2<-transform(tmp2,method="nir",y=nir,tablet=1:20)

tmp2<-subset(tmp2,select=c(tablet,method,y))

temp2 <- rbind(tmp1,tmp2)

temp2

tablet method y


1 1 hplc 10.4

2 2 hplc 10.6

3 3 hplc 10.2

4 4 hplc 10.1

5 5 hplc 10.3

6 6 hplc 10.7

7 7 hplc 10.3

8 8 hplc 10.9

9 9 hplc 10.1

10 10 hplc 9.8

11 11 hplc NA

12 12 hplc NA

13 13 hplc NA

14 14 hplc NA

15 15 hplc NA

16 16 hplc 10.3

17 17 hplc 9.6

18 18 hplc 10.0

19 19 hplc 10.2

20 20 hplc 9.9

21 1 nir 10.1

22 2 nir 10.8

23 3 nir 10.2

24 4 nir 9.9

25 5 nir 11.0

26 6 nir 10.5

27 7 nir 10.2

28 8 nir 10.9

29 9 nir 10.4

30 10 nir 9.9

31 11 nir 10.8

32 12 nir 9.8

33 13 nir 10.5

34 14 nir 10.3

35 15 nir 9.7

36 16 nir NA

37 17 nir NA

38 18 nir NA

39 19 nir NA

40 20 nir NA


Note that R uses NA for missing values.
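A small sketch of how NA behaves in R (with a made-up vector):

```r
y <- c(10.4, NA, 10.2)
is.na(y)               # FALSE  TRUE FALSE
mean(y)                # NA: most functions propagate missing values by default
mean(y, na.rm = TRUE)  # 10.3: remove the NAs before computing
sum(is.na(y))          # 1: count the missing values
```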

1.10.1 Simple analysis by an ANOVA approach

The copies of the R-lines needed are then:

temp2$tablet<-factor(temp2$tablet)

temp2$method<-factor(temp2$method)

model1<-lm(y ~ tablet + method, data=temp2)

print(xtable(anova(model1)), comment=FALSE)

          Df Sum Sq Mean Sq F value Pr(>F)
tablet    19   3.72    0.20    4.49 0.0129
method     1   0.01    0.01    0.29 0.6054
Residuals  9   0.39    0.04

print( xtable(summary(model1)), comment=FALSE)

Note that the order of the factors was changed as compared to above; this matters for the Type I ANOVA table, since it depends on the order when the data are NOT balanced. To get the post hoc analysis:

lsmeansm1 <- lsmeans::lsmeans(model1, pairwise ~ method)

print(xtable(lsmeansm1$lsmeans), comment=FALSE)
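The order dependence of the Type I table can be seen directly by fitting the model with both term orders; a sketch that rebuilds the data from the listing above so the snippet is self-contained:

```r
hplc2 <- c(10.4, 10.6, 10.2, 10.1, 10.3, 10.7, 10.3, 10.9, 10.1, 9.8,
           NA, NA, NA, NA, NA, 10.3, 9.6, 10.0, 10.2, 9.9)
nir2  <- c(10.1, 10.8, 10.2, 9.9, 11.0, 10.5, 10.2, 10.9, 10.4, 9.9,
           10.8, 9.8, 10.5, 10.3, 9.7, NA, NA, NA, NA, NA)
temp2 <- data.frame(y = c(hplc2, nir2),
                    method = factor(rep(c("hplc", "nir"), each = 20)),
                    tablet = factor(rep(1:20, times = 2)))
anova(lm(y ~ tablet + method, data = temp2))  # tablet first
anova(lm(y ~ method + tablet, data = temp2))  # method first: different sums of squares
```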

1.10.2 Analysis by mixed model

Now use the missing values data set:

model2 <- lmer(y ~ method + (1 | tablet), data=temp2)

summary(model2)


(The two tables below are the output of the print(xtable(summary(model1))) and print(xtable(lsmeansm1$lsmeans)) calls from the previous subsection:)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.2250     0.1549   66.02   0.0000
tablet2       0.4500     0.2088    2.15   0.0596
tablet3      -0.0500     0.2088   -0.24   0.8161
tablet4      -0.2500     0.2088   -1.20   0.2618
tablet5       0.4000     0.2088    1.92   0.0877
tablet6       0.3500     0.2088    1.68   0.1281
tablet7       0.0000     0.2088    0.00   1.0000
tablet8       0.6500     0.2088    3.11   0.0125
tablet9       0.0000     0.2088    0.00   1.0000
tablet10     -0.4000     0.2088   -1.92   0.0877
tablet11      0.5250     0.2600    2.02   0.0742
tablet12     -0.4750     0.2600   -1.83   0.1010
tablet13      0.2250     0.2600    0.87   0.4093
tablet14      0.0250     0.2600    0.10   0.9255
tablet15     -0.5750     0.2600   -2.21   0.0543
tablet16      0.0750     0.2600    0.29   0.7795
tablet17     -0.6250     0.2600   -2.40   0.0396
tablet18     -0.2250     0.2600   -0.87   0.4093
tablet19     -0.0250     0.2600   -0.10   0.9255
tablet20     -0.3250     0.2600   -1.25   0.2428
methodnir     0.0500     0.0934    0.54   0.6054

method  lsmean     SE df lower.CL upper.CL
hplc   10.2125 0.0618  9  10.0728  10.3522
nir    10.2625 0.0618  9  10.1228  10.4022
Results are averaged over the levels of: tablet
Confidence level used: 0.95

Linear mixed model fit by REML t-tests use Satterthwaite approximations

to degrees of freedom [lmerMod]

Formula: y ~ method + (1 | tablet)

Data: temp2

REML criterion at convergence: 24.7

Scaled residuals:

Min 1Q Median 3Q Max

-1.16659 -0.43803 0.00315 0.43635 1.84478


Random effects:

Groups Name Variance Std.Dev.

tablet (Intercept) 0.10192 0.3192

Residual 0.04347 0.2085

Number of obs: 30, groups: tablet, 20

Fixed effects:

Estimate Std. Error df t value Pr(>|t|)

(Intercept) 10.21175 0.09259 25.81000 110.288 <2e-16 ***

methodnir 0.07211 0.08697 11.69600 0.829 0.424

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Correlation of Fixed Effects:

(Intr)

methodnir -0.470

print(xtable(Anova(model2,test.statistic="F", type = 3)), comment=FALSE)

                   F Df Df.res Pr(>F)
(Intercept) 11960.92  1     26 0.0000
method          0.65  1     12 0.4371

lsmeansm2 <- lsmeans::lsmeans(model2, pairwise ~ method)

print(xtable(lsmeansm2$lsmeans), comment=FALSE)

method  lsmean     SE    df lower.CL upper.CL
hplc   10.2117 0.0934 25.81  10.0197  10.4037
nir    10.2839 0.0934 25.81  10.0919  10.4759
Confidence level used: 0.95

1.11 Exercises


Exercise 1 R start

1. Start R on your computer.

2. All data sets used in the course for examples and exercises are described and available (downloadable) from the course material. Go to the data eNote 13.

3. Click on the first data set: NIR prediction of HPLC measurements, complete data.

4. Click on [Get the data] (bottom of the section) and you see the raw data in the browser.

5. Save this on your computer by clicking [Files] → [Save as] and choose the place to save the file "hplcnir1.txt".

6. Import the data into R as described in this chapter.

7. Try to reproduce the scatterplot of the NIR values versus the HPLC values given in the main text by "cutting and pasting" the R-lines from the R graphics section into R.

Exercise 2 Sensory evaluation of cookies

Ten different chill or freezer storage treatments were tested on a type of cookies, and after storage the cookies were evaluated by a sensory panel composed of 13 assessors. Each assessor tasted the cookies in randomized order, and tasted each type twice. At each test the assessor gave a score for each of the properties: colour, consistency, taste, quality (combined). The score was an integer between 0 and 10 with 10 as the best. The treatments are numbered 46, ..., 55, and the assessors are numbered 1, ..., 13. One assessor did not give any score for quality. The data set is available in the course material, and is partly listed below:

assessor treatm colour cons taste quality

1 55 3 4 3 4

1 55 5 3 4 4

1 54 4 3 2 3

1 54 5 4 4 4

1 53 4 3 3 3

1 53 3 3 5 4


1 52 4 5 2 4

1 52 2 4 4 3

. . . . . . (There are 260 lines in the data set)

13 46 4 3 2 2

13 46 5 4 4 4

Consider the quality response.

a) Consider what kind of plots would be adequate to explore the information in the quality response and try to do some of these.

b) Carry out a statistical analysis of the quality response with the aim of investigating assessor and treatment differences (use ordinary analysis of variance techniques) and try to answer the following questions:

1. Are there any assessor differences?

2. Are there any treatment differences?

3. Are possible treatment differences the same for all assessors?

(Did you make any plots giving any idea of the answer to these questions?)

The following R-lines may help as a start:

cookies<- read.table("cookies.txt", header = TRUE, sep = ",")

cookies$assessor <- factor(cookies$assessor)

cookies$treatm <- factor(cookies$treatm)

model1 <- lm(quality ~ assessor + treatm + assessor:treatm, data=cookies)

anova(model1)

summary(model1)

c) Answer the same questions for the other three response variables: colour, cons and taste.