
Analysis of Longitudinal Data using R and Epicalc

Author: Virasakdi Chongsuvivatwong [email protected]

Editor: Edward McNeil [email protected]

Epidemiology Unit Prince of Songkla University, THAILAND


Analysis of Longitudinal Data using R and Epicalc

First Paperback Edition

Author : Virasakdi Chongsuvivatwong

Editor : Edward McNeil

First Published : Chamnuang Press Soi 2, 11 Phetkasem Rd, Hat Yai, Songkhla 90110, Thailand

Book design by : Patcharin Potong Book Unit, Faculty of Medicine Prince of Songkla University Hat Yai, Songkhla 90110, Thailand

Cover image :

ISBN :

Printed in Thailand

First Paperback Printing, 2011


Preface

From the theoretical side, this book was written to help new researchers in Health Science to be able to understand the nature of longitudinal studies, the structure of data and the approach in analysis. Equipped with R, an open-source software consisting of a suite of both standard and user contributed packages, readers are encouraged to follow the examples on data manipulation, graphing, exploration and modelling of longitudinal data.

Readers of this book should have some basic epidemiological knowledge, such as measures in epidemiology (incidence, prevalence etc), types of study designs, bias, confounding and interaction. Those who feel that they have an inadequate background should read fundamental text books of epidemiology listed at the end of this book. Basic data management concepts are also required as the first few chapters deal with data structure and data manipulation.

Experience in using data entry software, such as EpiInfo and EpiData, may also be beneficial. Although these can be used for entering longitudinal data, their facilities for managing data of a longitudinal nature are limited. So-called relational database software, such as Microsoft Access, is designed for data management and manipulation but cannot do complicated statistical analyses. Data manipulation is more efficient if the same software is used for both purposes. Documentation is also simplified since it can be integrated within the command file. R can be used exclusively for both.

To get the full benefit from this book, readers should be acquainted with R software. Readers are recommended to follow each section in conjunction with typing the commands that follow since the theoretical parts are often followed by an example. The output on the computer screen can be observed and compared with that in the book and the explanations can then be read in order to integrate the learning of concepts and practice of data analysis simultaneously.

There are several textbooks and tutorials written for R available from the Internet. "Analysis of Epidemiological Data Using R and Epicalc", which can be freely downloaded from the CRAN website (http://cran.r-project.org/) and also from the WHO website (http://apps.who.int/tdr/svc/publications/training-guideline-publications/analysis-epidemiological-data/), is strongly recommended as preliminary reading.

That book not only explains how the functions in R and Epicalc are used but also introduces the concept of variable management in R, especially how to avoid confusion between a free vector and a variable of the same name inside a data frame. Epicalc also enables the data frame as well as the variables to be labelled, which subsequently leads to more understandable output in tables and graphs. The current version of Epicalc has been developed to respond to the needs of longitudinal data analysis in addition to the existing ordinary data exploration features.

Similar to the previous book, Epicalc functions are typed in italics, in contrast to functions from other R packages, which appear in normal font type. Each function is briefly explained when it is first used so that readers who have never used Epicalc before can catch up quickly, but these explanations are probably not enough to substitute for learning from the preceding book.

Students of Epidemiology may recall that a "longitudinal study" is first introduced to them as a cohort study or a follow-up study. The exposure variable is determined at the time of subject recruitment. Subjects are followed until the final outcome occurs or until the study ends. The cohort study design taught to students is often over-simplified; that is, repeated observations of exposure and outcome, as well as of other explanatory variables, are often ignored. The simple 2-by-2 table for a cohort design is limited to diseases with a short and homogeneous incubation period, such as food poisoning. For chronic diseases that have multiple risk factors and multiple stages of disease manifestation, the incubation period can be long and may even vary. The denominator for calculation of risk is person-time, not just persons. The statistical techniques employed are often survival analysis and Poisson regression and its variants. Data sets used for teaching students usually contain a person-time variable for grouped data or the period of follow-up (time first seen and time last seen) of the subjects. This is quite different from real-life practice, where individual subjects are followed up at regular intervals until the event occurs or the study terminates. While epidemiologists agree that this is a longitudinal study, statisticians in general probably do not consider analysis of time-to-event data to be longitudinal data analysis.

However, in addition to just following up and waiting for the failure event to occur, follow-up records allow analysis of transitions. The state of an outcome can be more than the classical dichotomy (diseased vs. non-diseased); it can be different states of the disease or of health. A transition is the change of state from one point of time to the next. In biology, measurements are mostly taken as continuous data. Transition in this case may mean the difference in the outcome measure between two adjacent time points.

When the follow-up time is short and the number of variables is small, the data for a longitudinal study can easily be stored in the so-called 'wide' format. In wide form, a person appears in only one record, with measurements of the same set of variables, usually taken at different times, stored in separate columns. The wide form has serious limitations when the number of repeated visits is large and each visit has a relatively large number of measurements. It is also inefficient when certain persons have only a few visits while others have a large number of visits. In this situation it is more efficient to store the data in the so-called 'long' format. In long form, a person can appear in more than one record, corresponding to each of their visits. Measurements for each variable are stored in a single column, with an additional column denoting time included to distinguish separate visits. When the data are stored in long form, the number of visits does not need to be the same for every person.

As most well-designed follow-up studies have a fixed time interval, longitudinal studies may be considered as multiple cross-sectional studies over time. The relationship between the outcome and the exposure variables is measured at the same cross-sectional time. For each cross-sectional study, there is an equation explaining this relationship. For repeated measurements of the outcome and the exposure, the relationships can therefore be modelled with generalized estimating equations (GEE). The estimated values are, more or less, the average values of the relationship; the model is therefore also called a population-average model. The equations are averaged into a single one, with the correlations among the repetitions taken into account.

Instead of focusing on and time-averaging the relationship, the effects of the exposure variables may be separated into two components. The first component is fixed, and these exposure variables are called fixed effects since their effects are shared across individual subjects. The second component consists of variables whose effects may vary from one individual to another in a random fashion, and these are called random effects. Modelling data in this two-component fashion is called random effects modelling, or a random coefficient model, or a mixed effects model, because the model contains a mixture of random and fixed effects.

Finally, as an individual changes his/her exposure and outcome over time, instead of looking at the status at each time point, one can consider his/her transition from one point of time to the next and the relationship between the transition of the outcome and the transition of the exposure. While the outcome statuses of an individual over time are correlated, so that the relationship needs adjustment for this correlation, the transitions from one time point to the next are usually not correlated, and analysis of transitions (transition modelling) is therefore simpler than the above two approaches. For a continuous outcome variable, the magnitude of change is modelled against the exposure variable in the concurrent transition or the preceding state. When the outcome is a dichotomous variable, the quantity of interest is not confined to failure but is a transition probability, which can be multi-directional. Modelling of transition probabilities is called transition modelling or Markov modelling. Markov models predict the probability of the current outcome from the preceding status. This can also be called auto-regressive modelling, as the outcome is regressed on its own previous value.


Table of Contents

Chapter 1: Data formats _____________________________________________ 8 

Chapter 2: Exploration and graphical display ___________________________ 14 

Chapter 3: Area under the curve (AUC) ________________________________ 26 

Chapter 4: Individual growth rate ____________________________________ 35 

Chapter 5: Within subject, across time comparison _______________________ 46 

Chapter 6: Analysis of missing records ________________________________ 55 

Chapter 7: Modelling longitudinal data ________________________________ 65 

Chapter 8: Mixed models ___________________________________________ 75 

Chapter 9: Transition models ________________________________________ 84 

Solutions to exercises ______________________________________________ 91 


Chapter 1: Data formats

Wide form and long form

Data entry phase

The decision to choose between wide form and long form should be made at the design stage. As mentioned before, when the number of visits is small and constant for each subject, it may be better to enter the data in wide form. Since all variables are in the same record, quality control by comparison of variables is straightforward. For example, one can easily ensure that the date of the first visit comes before that of the second visit.

Data entry software can issue a warning to the user if a certain value differs too much from a preceding value. These consistency checks are difficult to implement when the data are entered in long form because values of the same variable, but from different times, are not entered in the same record.
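Such a check can also be repeated after data entry. As a minimal sketch in base R (the data frame and variable names 'visits', 'date1' and 'date2' are purely illustrative and do not come from any data set used in this book), records in which the second visit precedes the first can be listed with:

> # hypothetical wide-form data: one record per subject, one column per visit date
> visits <- data.frame(id=1:3,
    date1=as.Date(c("2010-01-05", "2010-02-01", "2010-03-10")),
    date2=as.Date(c("2010-01-19", "2010-01-25", "2010-03-24")))
> visits[visits$date2 < visits$date1, ]   # rows violating the visit-order rule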

Normalization of data

Some variables, especially baseline characteristics of the subjects, are usually fixed, for example, date of birth, sex and place of birth. These data should be entered only once. In a database, such a set of data is stored as a table. The baseline table, which has one record per subject, can be linked with the follow-up data set through an identification (ID) field. This field is also called a key field. The ID field in the baseline table must be unique; in other words, there must not be any duplication of ID in the baseline table. In the follow-up table, ID certainly can be duplicated, but the combination of ID and time of follow-up must be unique since each follow-up of a subject should be recorded only once. A good database design would require a unique ID in the baseline table and a unique ID+time combination (a compound key) in the follow-up table. The database design must also ensure that all the IDs in the follow-up table are present in the baseline table.

The baseline table is sometimes considered as the mother table and, correspondingly, the follow-up table is considered as the child table. A follow-up record without a corresponding ID in the baseline table is called an 'orphan' record. It indicates poor quality control in the data entry system. In order to ensure such integrity (absence of orphan records), relational database software, such as Microsoft Access, is required. EpiData can also be used to serve this purpose. Such data integrity can be ensured if and only if the records in the follow-up table can be entered only through an existing baseline record.
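Once exported to R, the same integrity rules can be verified after the fact. The following is only a sketch, assuming two hypothetical data frames, 'baseline' and 'followup', each containing an 'id' column and the latter also a 'time' column (these names are illustrative, not part of any data set used in this book).

> any(duplicated(baseline$id))                  # must be FALSE: unique ID in the mother table
> any(duplicated(followup[, c("id", "time")]))  # must be FALSE: unique ID+time in the child table
> setdiff(followup$id, baseline$id)             # IDs of 'orphan' follow-up records, ideally none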

Hierarchical data

Data in which the relationship between the baseline data and the follow-up data has a hierarchy is called hierarchical data. It is also known as multi-level data because each follow-up record is considered as level 1, whereas each subject is considered as level 2. The former is nested within the latter; the data are therefore also called nested data. Apart from longitudinal studies, hierarchical data can also be nested based on social or spatial relationships. For example, subjects can be nested within families and families can be nested within villages, and so forth.

When upper level variables affect the outcome at the individual level, the variables are sometimes called contextual determinants. For example, the nutritional status of a child is not only influenced by his/her immune status but also by the child rearing behaviours of the family and the hygiene conditions (waste disposal, water supply) of the community.

Several software packages can be used to analyze hierarchical data. Analysis of this multi-level data is well covered by a number of packages in R and will be discussed in subsequent sections.

Complex relational database

Relational data is often, but not necessarily, hierarchical in fashion. For example, several patients are often seen by one doctor. However, a patient can also be seen by more than one doctor. The same matched patient-doctor pair could also occur on several visits. The quality of care that the patient receives may be determined by the characteristics of the patient, for example their age and sex, by those of the doctor, for example their qualification and experience, and by the interaction between the two. So far, there are very few statistical software packages capable of analyzing this type of complex data. The package lme4 in R (by Bates and Maechler) is a pioneer in this area. We will not discuss this highly complicated relational data since it is outside of our current interest.


Examples of longitudinal data in long form

The datasets package in R contains a large number of data sets from longitudinal studies. All of these are in the long format. The list can be viewed by typing:

> data(package="datasets")

Among these, ChickWeight is a data frame containing 578 rows and 4 columns, from an experiment on the effect of diet on the early growth of chicks.

> class(ChickWeight)
[1] "nfnGroupedData" "nfGroupedData"  "groupedData"    "data.frame"

In addition to being a data frame, the object is also a special kind of data frame which was modified from an ordinary data frame in order to make it suitable for analysis using functions from the nlme package. To view the first 6 records you can type:

> head(ChickWeight)
Grouped Data: weight ~ Time | Chick
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

The dataset actually contains a formula, which models the chick’s weight using each chick's age (in days).
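For readers who wish to see this formula, it is stored with the object itself; as a small aside, displaying it should be possible with the following command.

> formula(ChickWeight)
weight ~ Time | Chick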

To view the variable names, classes and descriptions you can type:

> library(epicalc)
> des(ChickWeight)
No. of observations = 578
  Variable  Class    Description
1 weight    numeric
2 Time      numeric
3 Chick     ordered
4 Diet      factor

There are 4 variables, of which the third variable, 'Chick', is the identification variable.

> use(ChickWeight)
> Chick[1:30]
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 <
3 < 1 < 12 < 2 < 5 < 14 < 7 < 24 < 30 < 22 < 23 < 27 < 28 < 26 < 25 < ... < 48


The display looks fine but the ordering for the levels is unusual. Now let's do a cross-tabulation with the 'Time' variable.

> table(Chick, Time)
     Time
Chick 0 2 4 6 8 10 12 14 16 18 20 21
   18 1 1 0 0 0  0  0  0  0  0  0  0
   16 1 1 1 1 1  1  1  0  0  0  0  0
   15 1 1 1 1 1  1  1  1  0  0  0  0
   13 1 1 1 1 1  1  1  1  1  1  1  1
   9  1 1 1 1 1  1  1  1  1  1  1  1
========= remaining lines omitted =========

The sequence of 'Chick' in the rows is not in numeric order. This is because it is an ordered "factor" class object. It was classed this way in order to fit in with the structure of "groupedData" required by the nlme package. We will learn about how to convert this class of object into a normal integer in the next chapter. The other point to note is that Chick "18", which appears in the first row, has only 2 visits at times 0 and 2. The data is unbalanced. This is a common finding in longitudinal data. It will be discussed in detail in subsequent chapters.

Reshaping longitudinal data frame

Many other data sets, such as Indometh and Theoph, which contain pharmacokinetic data, have similar class and structure. This long form data can be reshaped to wide form if necessary. The first commands to reshape the Indometh data to wide form actually come from the help page for the reshape function.

> wide <- reshape(Indometh, v.names="conc", idvar="Subject",
    timevar="time", direction="wide")
> des(wide)
No. of observations = 6
   Variable   Class    Description
1  Subject    ordered
2  conc.0.25  numeric
3  conc.0.5   numeric
4  conc.0.75  numeric
5  conc.1     numeric
6  conc.1.25  numeric
7  conc.2     numeric
8  conc.3     numeric
9  conc.4     numeric
10 conc.5     numeric
11 conc.6     numeric
12 conc.8     numeric


In the above example there is only one variable with repeated measures. In reality, a data set can contain many sets of repeated measurements. As a simple illustration, study the following commands carefully.

> exposure1 <- c(1:9, NA)
> exposure2 <- 11:20
> exposure3 <- 21:30
> outcome1 <- 101:110
> outcome2 <- 111:120
> outcome3 <- 121:130
> data.wide <- data.frame(ID=letters[1:10], exposure1, exposure2,
    exposure3, outcome1, outcome2, outcome3)
> data.wide
   ID exposure1 exposure2 exposure3 outcome1 outcome2 outcome3
1   a         1        11        21      101      111      121
2   b         2        12        22      102      112      122
3   c         3        13        23      103      113      123
4   d         4        14        24      104      114      124
5   e         5        15        25      105      115      125
6   f         6        16        26      106      116      126
7   g         7        17        27      107      117      127
8   h         8        18        28      108      118      128
9   i         9        19        29      109      119      129
10  j        NA        20        30      110      120      130

Note the missing value for the 'exposure1' variable in the last row (ID = "j"). Now let's reshape this data frame to long format.

> data1.long <- reshape(data.wide, idvar="ID", varying=list(2:4, 5:7),
    v.names=c("exposure", "outcome"), direction="long")
> data1.long
    ID time exposure outcome
a.1  a    1        1     101
b.1  b    1        2     102
c.1  c    1        3     103
d.1  d    1        4     104
e.1  e    1        5     105
f.1  f    1        6     106
g.1  g    1        7     107
h.1  h    1        8     108
i.1  i    1        9     109
j.1  j    1       NA     110
a.2  a    2       11     111
b.2  b    2       12     112
c.2  c    2       13     113
d.2  d    2       14     114
e.2  e    2       15     115
f.2  f    2       16     116
g.2  g    2       17     117
h.2  h    2       18     118
i.2  i    2       19     119
j.2  j    2       20     120
========= remaining lines omitted =========


Note that the new 'time' variable in the long format is generated from the suffixes of the 'exposure' and 'outcome' variables in the wide format. Also, the new 'exposure' variable in the long format corresponds to the 2nd to 4th variables in the wide format and the 'outcome' variable in the long format corresponds to the 5th to 7th variables in the wide format. These sets of variables must be matched correctly in the 'varying' argument. Note also that the value of 'exposure1' for ID "j" is still missing. Now suppose that the exposure and outcome variables are adjacent to each other in the wide data frame.

> data2.wide <- data.frame(ID=letters[1:10], exposure1, outcome1,
    exposure2, outcome2, exposure3, outcome3)
> data2.wide
   ID exposure1 outcome1 exposure2 outcome2 exposure3 outcome3
1   a         1      101        11      111        21      121
2   b         2      102        12      112        22      122
3   c         3      103        13      113        23      123
4   d         4      104        14      114        24      124
5   e         5      105        15      115        25      125
6   f         6      106        16      116        26      126
7   g         7      107        17      117        27      127
8   h         8      108        18      118        28      128
9   i         9      109        19      119        29      129
10  j        NA      110        20      120        30      130

The command to reshape to long format is almost identical to the previous command.

> data2.long <- reshape(data2.wide, idvar="ID", varying=list(c(2,4,6),
    c(3,5,7)), v.names=c("exposure", "outcome"), direction="long")
> data2.long

The positions of the variables in the 'varying' list need to be changed accordingly to match the order of the 'v.names' argument. Note that the row names of the resulting data frame are formed from the combination of the 'ID' and 'time' variables. They are therefore unique.

Exercise In what format (wide or long) is the data frame Theoph provided by R? Reshape it to the other format. Explain how the variable 'Subject' is arranged in the new format.


Chapter 2: Exploration and graphical display

In this chapter, we will go into more details of data frames that have class "groupedData".

> library(epicalc)
> zap()
> use(Indometh)
> des()
No. of observations = 66
  Variable  Class    Description
1 Subject   ordered
2 time      numeric
3 conc      numeric

There are 66 records and 3 variables, the first of which has class "ordered", with the other 2 being "numeric". There are no descriptive labels attached to the variables.

> summ()
No. of observations = 66
  Var. name  obs.  mean  median  s.d.   min.  max.
1 Subject    66    3.5   3.5     1.721  1     6
2 time       66    2.89  2       2.46   0.25  8
3 conc       66    0.59  0.34    0.63   0.05  2.72

The minimum value of 'Subject' is 1 and the maximum is 6 but that does not necessarily mean that there are 6 subjects in total. We have to check with tabulation.

> tab1(Subject)
Subject :
        Frequency  Percent  Cum. percent
1              11     16.7          16.7
4              11     16.7          33.3
2              11     16.7          50.0
5              11     16.7          66.7
6              11     16.7          83.3
3              11     16.7         100.0
  Total        66    100.0         100.0


The table above indicates that there are 6 subjects, each contributing 11 records. The order of 'Subject' in this table is not sorted from lowest to highest because the variable is an ordered factor and the levels have been preset to this order.
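The preset order itself can be displayed directly; as a quick check, the following command should list the levels of 'Subject' in the same sequence as the table above.

> levels(Subject)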

To make sure that time of measurement of the drug concentration is systematic for all subjects a cross-tabulation can be carried out.

> table(time, Subject)
      Subject
time   1 4 2 5 6 3
  0.25 1 1 1 1 1 1
  0.5  1 1 1 1 1 1
  0.75 1 1 1 1 1 1
  1    1 1 1 1 1 1
  1.25 1 1 1 1 1 1
  2    1 1 1 1 1 1
  3    1 1 1 1 1 1
  4    1 1 1 1 1 1
  5    1 1 1 1 1 1
  6    1 1 1 1 1 1
  8    1 1 1 1 1 1

Time increases by 0.25 units until 1.25. Then it increases by a step of 1 until time 8, except that time 7 is missing. This kind of tabulation is a good routine practice in longitudinal data exploration. All cells in the table are filled with 1s, indicating the uniqueness of each combination of 'Subject' and 'time'. Note that there is no time 0. When the data set is small, eyeball scanning of such a tabulation may be adequate. When the data set is large, it may be better to check whether there are any missing follow-up times or duplicated records by typing the following command.

> table(table(time, Subject))
 1
66

This command tabulates all values of the above table and finds that there are 66 cells all having a common value of 1.

Alternatively, missing follow up times can be checked by typing:

> any( table(time, Subject) < 1 )
[1] FALSE

Duplication of records having the same person with the same time point could be checked by typing:

> any( table(time, Subject) > 1 )
[1] FALSE


Graphing longitudinal data

There are two main methods for graphing the relationship between concentration and time for each subject. The first method is to employ trellis plots, which give one small plotting frame for each subject. The second method is to employ the epicalc package, which has graphical functions that show all the data in the one plotting frame.

The simplest plotting command for this data is coplot.

> coplot(conc ~ time | Subject, data = Indometh)

The upper part of the graph indicates the order of the Subject corresponding to the panels in the plot. The plot is read from left-to-right and bottom-to-top, so the bottom left panel corresponds to Subject 1 and top right panel corresponds to Subject 3. To visualize the position of Subject in each panel we can replace the open circles with the actual Subject number as follows:

> coplot(conc ~ time | Subject, data=Indometh, pch=as.character(Subject))


The coplot function is designed to show the relationship between two variables conditional on the value of a third variable, in this case subject. Instead of using coplot, we can use the xyplot function from the lattice package.

> library(lattice)
> xyplot(conc ~ time | Subject, data=Indometh)
> xyplot(conc ~ time | Subject, type="b", data=Indometh)

The last command plots connecting lines in each frame instead of just open circles.

Since the class of the data frame is "groupedData", we can also call the nlme library, which has a default plot method for this class of data frame.

> library(nlme)
> plot(Indometh)


The result is similar to the xyplot command; however the labels for the X and Y axes, which are stored as attributes in the data frame, are used here.

> attributes(Indometh)

In order to add the "groupedData" class to an ordinary data frame, we must employ the groupedData function from the nlme package.

Let's create Sitka.gp from the Sitka data, which comes from the MASS package.

> library(nlme)
> data(Sitka, package="MASS")
> Sitka.gp <- groupedData(size ~ Time | tree, data=Sitka,
    labels=list(x="Time (Days since 1 Jan 1988)",
                y="Log(Height x diameter^2)"))
> plot(Sitka.gp)


In order to show all the subjects in the same plotting frame, let's return to the Indometh data set. If there were only one subject, the concentration could simply be plotted as a line graph against time.

> plot(conc ~ time, subset=Subject==1, type="l", data=Indometh)

The above command produces a plot of the pharmacokinetic curve for the first subject. We can further proceed with the second and third subjects.

> lines(conc ~ time, subset=Subject==2, type="l", data=Indometh)
> lines(conc ~ time, subset=Subject==3, type="l", data=Indometh)

Each of the above lines commands adds one line to the existing graph for the second and the third subjects, respectively. One can repeat the same process until all six subjects have had their curves displayed.

The problem encountered so far is that the maximum value of the Y axis defined by the first subject is too low for subsequent subjects. To prevent this, the initial plot command should include a 'ylim' argument so that subsequent curves with higher concentrations can still be accommodated. The remaining lines commands can then follow as above.

Obviously, if there are too many subjects, the command would be too tedious to run. It may be better to exploit a for loop.

> plot(conc ~ time, subset=Subject==1, ylim=c(0, max(conc, na.rm=TRUE)),
    xlab="", ylab="", type="l", data=Indometh)
> for(i in 2:6) lines(conc ~ time, subset=Subject==i, col=i,
    data=Indometh)

That completes the majority of the requirements for the graph. Readers can further proceed with putting axis labels, a legend, title, etc.
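As a minimal sketch of such annotation (the label text and legend position are arbitrary choices, not prescribed by the data set), the standard base-graphics functions could be used after the loop above:

> title(xlab="Time (hr)", ylab="Concentration (mcg/ml)")
> legend("topright", legend=paste("Subject", 1:6), col=1:6, lty=1)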

The above process could be carried out with two epicalc commands.

> use(Indometh)
> followup.plot(id=Subject, time=time, outcome=conc,
    main="Pharmacokinetics of Indomethicin")

The resultant graph is more or less the same as the previous commands using the for loop construct. Note that the colours are automatically chosen based on the Subject number.


Plots of aggregated values

The examples in the help page of the followup.plot function explore the Sitka data set and give some ideas for the colour of the lines indicating the treatment group.

> data(Sitka, package="MASS")
> use(Sitka)
> followup.plot(id=tree, time=Time, outcome=size, by=treat,
    main="Growth Curves for Sitka Spruce Trees in 1988")

The control group, represented by solid black lines, tends to have larger trees than the ones grown in the ozone-enriched chambers. This can be more clearly seen with the following command.

> aggregate.plot(x=size, by=Time, group=treat)


It is clear that the mean tree size of the ozone group was somewhat smaller at the start and distinctively smaller at the end of the follow-up period. If the argument 'return.output' is set to TRUE then the numerical results are shown as well.

> aggregate.plot(x=size, by=Time, group=treat, return=TRUE)
   grouping time mean.size lower95ci upper95ci
1   control  152  4.166000  3.837764  4.494236
2     ozone  152  4.059630  3.898858  4.220401
3   control  174  4.629600  4.333247  4.925953
4     ozone  174  4.467037  4.310078  4.623996
5   control  201  5.037200  4.760472  5.313928
6     ozone  201  4.849074  4.693651  5.004498
7   control  227  5.438400  5.161541  5.715259
8     ozone  227  5.180926  5.014109  5.347743
9   control  258  5.654400  5.372244  5.936556
10    ozone  258  5.313148  5.145238  5.481058


Dichotomous longitudinal outcome variable

All the above examples have outcome variables on a continuous scale. Let's explore a data set which has a dichotomous outcome variable.

> data(bacteria, package="MASS")
> use(bacteria)
> des()
No. of observations = 220
  Variable  Class    Description
1 y         factor
2 ap        factor
3 hilo      factor
4 week      integer
5 ID        factor
6 trt       factor

The data set comes from a study testing the presence of the bacteria H. influenzae in children with otitis media in the Northern Territory of Australia. The outcome is in the variable 'y' and the follow-up period is represented by the 'week' variable. Some follow-up times are missing, as shown by the following command.

> table(table(week, ID))
  0   1
 30 220
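If one wants to know exactly which subject-week combinations were missed, a small sketch using base R is to locate the empty cells of the cross-tabulation (the object name 'missed' is arbitrary).

> missed <- which(table(week, ID) == 0, arr.ind=TRUE)
> head(missed)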

It is not appropriate to draw a follow-up plot because the outcome here is dichotomous. In order to show the different prevalence of bacteria over the follow-up period by treatment group, we have to use the aggregate.plot command.

> aggregate.plot(x=y, by=week, group=trt)

The 95% confidence intervals of the prevalences overlap due to the relatively small sample sizes in the three treatment groups. We can also use 'ap' (active vs placebo) as the group variable instead of 'trt'.


> aggregate.plot(x=y, by=week, group=ap)


The prevalence of bacteria in the active treatment group declined steadily until week 6 when the difference is the highest. The two groups tend to have a closer prevalence again after 11 weeks.

Exercise

• Explore the Theoph data set again from the datasets package.
• How many subjects are there? How many times does each subject appear?
• Were the subjects assessed on drug levels in exactly the same pattern of time?
• Plot the concentration of this drug of each individual over time using followup.plot and coplot.
• Note that from followup.plot, the colours of the lines are all the same. Why?
• How could we change the colour? How can it be more colourful?
• Was the weight of each subject stable?
• Divide the subjects into 2 groups, one below 70 kg and one greater than or equal to 70 kg. Create a variable called 'Wtgr' based on this weight division and use the followup.plot command to draw a graph similar to the one on page 20.
• Is there a tendency that the heavier group has a higher level of drug concentration over time?


Chapter 3: Area under the curve (AUC)

The area under the plasma (serum, or blood) concentration versus time curve (AUC) has a number of important uses in toxicology, biopharmaceutics and pharmacokinetics. In pharmacokinetics, drug AUC values can be used to determine other pharmacokinetic parameters, such as clearance or bioavailability.

The Theoph data set has a starting point at time zero, a good reason to compute area under the time-concentration curve (AUC).

> library(epicalc)
> class(Theoph)
[1] "nfnGroupedData" "nfGroupedData"  "groupedData"    "data.frame"

The data frame has the same class as those from previous chapters.

> use(Theoph)
> class(.data)   # same as above

The area under the curve is computed by summing the trapezoids formed by each pair of adjacent time points and their base. From elementary geometry, the area of one such trapezoid is equal to the average height of the two points multiplied by the width of the base. The records of the first subject in the Theoph data frame are shown below.

> .data[Subject==1, c(1,4,5)]
   Subject  Time  conc
1        1  0.00  0.74
2        1  0.25  2.84
3        1  0.57  6.57
4        1  1.12 10.50
5        1  2.02  9.66
6        1  3.82  8.58
7        1  5.10  8.36
8        1  7.03  7.47
9        1  9.05  6.89
10       1 12.12  5.94
11       1 24.37  3.28

For subject 1, the area under the curve of the first trapezoid is (0.25 – 0.00) × (2.84 + 0.74)/2 = 0.4475 unit. This is then added to all the trapezoids belonging to the same subject. The second one is (0.57 – 0.25) × (6.57 + 2.84)/2 = 1.5056 and so on.


The final summation result is 148.92.
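The same summation can be reproduced for subject 1 with one line of base R, as a check on the arithmetic above (assuming the Theoph data are still in use, so that 'Time', 'conc' and 'Subject' are accessible). The widths of the bases are multiplied by the average heights of the adjacent points and summed, which should return the value 148.92 quoted above.

> t1 <- Time[Subject==1]
> c1 <- conc[Subject==1]
> sum(diff(t1) * (head(c1, -1) + tail(c1, -1))/2)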

In epicalc, this computation can be done for each subject using the following command:

> auc.data <- auc(conc=conc, time=Time, id=Subject)
> auc.data
   Subject       auc
1        1 148.92305
2        2  91.52680
3        3  99.28650
4        4 106.79630
5        5 121.29440
6        6  73.77555
7        7  90.75340
8        8  88.55995
9        9  86.32615
10      10 138.36810
11      11  80.09360
12      12 119.97750

The function auc is based on the above principle of summation of trapezoids.

We can also compute the AUC one subject at a time, omitting the 'id' argument, since the default value of 'id' is NULL.

> auc(conc=conc[Subject==1], time=Time[Subject==1])
[1] 148.9230
> auc(conc=conc[Subject==2], time=Time[Subject==2])
[1] 91.5268
> auc(conc=conc[Subject==3], time=Time[Subject==3])
[1] 99.2865

The above three lines just confirm the same results as the preceding command.

Since the 'Subject' variable is not ordered numerically, let's create an integer variable, say 'subject' (small s), that has the same value as 'Subject' (capital S). The command is:

> subject <- as.integer(Subject)[order(Subject)]

In order to create an integer vector from 'Subject', which is an ordered factor, the values are coerced to integer first and then sorted in ascending order.

> pack()
> class(.data)
[1] "data.frame"

The epicalc command pack has changed the class of .data to a simple data frame. The original Theoph data frame remains intact. It can be used for more complicated analyses later.


Now run the auc command once more.

> auc.data <- auc(conc=conc, time=Time, id=subject)

For those who are serious about pharmacokinetic studies, it is advisable to install and load the PK package. This package gives more details and a greater variety of AUC estimates than epicalc. The auc function in epicalc is simply based on a trapezoid summation as described above. The auc function from the PK package also gives this value. In addition, it provides a better estimate of the area under the time-concentration curve, based on estimation of the unobserved concentrations within each time interval. The right tail of the curve can be extended to infinity to better reflect the total amount of drug that the subject was exposed to. However, the auc function in epicalc allows the user to compute the AUC separately for each subject, making it more convenient for data management.

The auc.data data frame contains AUC for each subject. It will be merged with other data frames for further analysis.

Medically speaking, the AUC reflects the speed of redistribution (into different compartments of the body), destruction and excretion of the drug by the individual subject. Thus the subject who destroyed/excreted the drug fastest was the 6th subject and the slowest was the first one. From now on we will try to find the relationship between AUC and the individual characteristics of the subjects.

One can check the variability of the values of various variables across subjects using the following command.

> aggregate(x=.data[,2:5], by=list(subject=subject), FUN="sd")
   subject Wt Dose     Time     conc
1        1  0    0 7.273320 3.034533
2        2  0    0 7.269680 3.027389
3        3  0    0 7.234519 2.684222
4        4  0    0 7.329930 2.921623
5        5  0    0 7.278871 3.537344
6        6  0    0 7.151650 2.180382
7        7  0    0 7.250075 2.485130
8        8  0    0 7.233919 2.455296
9        9  0    0 7.244127 2.716175
10      10  0    0 7.108937 3.050539
11      11  0    0 7.223542 2.552839
12      12  0    0 7.235402 3.499246

There are certain interesting points here.

The first argument of the function is the data frame to aggregate. Unlike the aggregate.numeric function, which can apply several statistical summaries on a single variable after splitting the data into subsets, the aggregate function can apply only one summary statistic to multiple variables in the data frame.

One may try changing the "by" argument to "list(subject = Subject)" to see the ordering of the subjects (results omitted). Here, using 'subject' instead of 'Subject' allows the data to be displayed in ascending subject ID order.
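For completeness, that command would be as follows (output omitted here, as in the text).

> aggregate(x=.data[,2:5], by=list(subject=Subject), FUN="sd")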

Variables 'Subject' (capital S), 'Wt' and 'Dose' have zero sd. This means that there is no variation within subjects. Since the values of 'Subject', 'Wt' and 'Dose' of the same person do not change at all, their standard deviations are all zero. The standard deviations of 'Time' are all relatively similar indicating that the time of drawing blood for drug assay was probably set to be synchronized for all subjects. However, they are not exactly the same. The synchronization process is not perfect. Let's check the variation graphically.

> summ(Time)
 obs.  mean   median  s.d.  min.  max.
 132   5.895  3.53    6.93  0     24.65

The first 11 points are in the same vertical line, that is, at time zero. Later on, the timing of drug drawing was not so synchronised. Variation in time of drawing blood causes the stacks of points to jitter.

Now modify the above aggregate command as follows to obtain the mean weight and dose for each subject.


> WtDose.data <- aggregate(.data[,2:3], by=list(subject=subject),
    FUN="mean")
> WtDose.data
   subject   Wt Dose
1        1 79.6 4.02
2        2 72.4 4.40
3        3 70.5 4.53
4        4 72.7 4.40
5        5 54.6 5.86
6        6 80.0 4.00
7        7 64.6 4.95
8        8 70.5 4.53
9        9 86.4 3.10
10      10 58.2 5.50
11      11 65.0 4.92
12      12 60.5 5.30

Now we merge the WtDose.data data frame with auc.data.

> merge.data <- merge(WtDose.data, auc.data)
> merge.data
   subject   Wt Dose       auc
1        1 79.6 4.02 148.92305
2        2 72.4 4.40  91.52680
3        3 70.5 4.53  99.28650
4        4 72.7 4.40 106.79630
5        5 54.6 5.86 121.29440
6        6 80.0 4.00  73.77555
7        7 64.6 4.95  90.75340
8        8 70.5 4.53  88.55995
9        9 86.4 3.10  86.32615
10      10 58.2 5.50 138.36810
11      11 65.0 4.92  80.09360
12      12 60.5 5.30 119.97750

The merge command is often used in longitudinal data management. In this exercise both data frames share one common field, namely 'subject'. The merge command detected this and used it to join the records from both data frames.

> intersect(names(WtDose.data), names(auc.data))
[1] "subject"

By default, only records sharing the common field in both data frames are returned. In this case all of the records are returned and no subject was omitted.
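Had some subjects appeared in only one of the two data frames, they would have been dropped by default. As an aside, the standard 'all' arguments of merge can keep such unmatched records if needed; for example:

> merge(WtDose.data, auc.data, all=TRUE)     # keep unmatched records from both data frames
> merge(WtDose.data, auc.data, all.x=TRUE)   # keep all records from the first data frame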

From now on, the data can be analysed quite easily.

> use(merge.data)
> des()
> summ()
> summ(subject)
 obs.  mean  median  s.d.  min.  max.
 12    6.5   6.5     3.61  1     12


> summ(Wt)
 obs.  mean    median  s.d.  min.  max.
 12    69.583  70.5    9.5   54.6  86.4
> summ(Dose)
 obs.  mean   median  s.d.  min.  max.
 12    4.626  4.53    0.75  3.1   5.86
> summ(auc)
 obs.  mean     median  s.d.   min.    max.
 12    103.807  95.407  23.65  73.776  148.923

All the variables are quite uniformly distributed. Let's create a two-way scatter plot.

> plot(Wt, auc, type="n", xlab="Wt (kg)",
    ylab="Area under time-concentration curve (hour-mg/L)")
> text(Wt, auc, labels=subject)

[Figure: scatter plot of the area under the time-concentration curve (hour-mg/L) against Wt (kg), with each point labelled by its subject number 1-12.]

There is a slight negative correlation between AUC and Wt. Heavier persons tended to destroy/excrete the drug faster than lighter ones, causing the drug to have a smaller AUC. One exception is the first subject, who has the highest AUC (remember, 148 units!) and yet was among the heaviest. This person would need special investigation for this outlying nature (e.g. maybe due to disease or genetic make-up).


Now we plot AUC against dose.

> plot(Dose, auc, type="n", xlab="Dose (mg/kg)",
    ylab="Area under time-concentration curve (hour-mg/L)")
> text(Dose, auc, labels=subject)

[Figure: scatter plot of the area under the time-concentration curve (hour-mg/L) against Dose (mg/kg), with each point labelled by its subject number 1-12.]

As medically expected, there is a positive correlation between dose and AUC. Persons who received a higher dose tended to have a higher AUC because the drug reaches a higher concentration and stays longer in the blood. Again, the first subject is the exception. This subject received a dose of 4.02 units, yet retained as much as 148 units of AUC.

> plot(Dose, Wt)
> text(x=4, y=65, labels=paste("Corr. coef. =", round(cor(Dose, Wt), 3)),
    col="blue", cex=1.5)

Heavy subjects were more likely to be given lower doses. There are no outliers here since both variables were controlled by the protocol. Such a high correlation indicates a potential confounding situation. We should clarify whether dose or weight had a stronger effect on AUC.


[Figure: scatter plot of Wt against Dose, annotated with "Corr. coef. = -0.99".]

> regress.display(lm(auc ~ Dose + Wt, data=.data), crude=TRUE)
Linear regression predicting auc
      crude coeff.(95%CI)   adj. coeff.(95%CI)      P(t-test)  P(F-test)
Dose  11.99 (-8.65,32.62)   73.19 (-73.25,219.64)   0.287      0.23
Wt    -0.83 (-2.48,0.82)    4.86 (-6.65,16.38)      0.364      0.364

No. of observations = 12

Since all the P values from the t-tests and F-tests in the crude and adjusted analyses are above 0.05, the conclusion is that neither dose nor weight significantly determines the AUC.

Remember that we have an outlier. Let's exclude it and repeat the analysis.

> data.not1 <- .data[-1,]
> regress.display(lm(auc ~ Dose + Wt, data=data.not1), crude=TRUE)
Linear regression predicting auc
      crude coeff.(95%CI)   adj. coeff.(95%CI)      P(t-test)  P(F-test)
Dose  18.02 (3.71,32.32)    -23.27 (-142.12,95.59)  0.664      0.023
Wt    -1.49 (-2.62,-0.37)   -3.35 (-12.93,6.22)     0.443      0.443

No. of observations = 11


This model suggests that after excluding the outlier, in the crude analysis, dose had a significant positive effect on AUC whereas weight had a significant negative effect. This is judged by their 95% confidence intervals not including zero. In the adjusted analysis, where both independent variables are included in the model, the t-test on both variables suggests that they are not significant factors. However, dose, and not weight, is significant by analysis of variance (F-test). We conclude that dose (in mg/kg) is more important than patient weight in prolonging high levels of oral theophylline.

In summary, this chapter gives you an example of computing the area under the curve (AUC) as the outcome variable. This approach removes the need to do sophisticated statistical modelling.

Summary

In more advanced approaches, the level of drug in the body at any given time can be modelled as a function of several underlying parameters. The nlme function in the package of the same name was created to address this problem. Use of that package is beyond the scope of this book.

Exercise

Read in the Sitka data set from the MASS package. Try to find out whether ozone exposure reduced the area under the time-size curve. For simplicity, keep only records without any missing values.


Chapter 4: Individual growth rate

In the preceding chapter, we calculated the area under the time-concentration curve. This method of analysis is justified in pharmacokinetics as it reflects the ability of a person to destroy and/or excrete the drug. In this chapter we will analyse the Sitka data set again, comparing the tree growth rates in each treatment group.

> library(epicalc)
> zap()
> data(Sitka, package="MASS")
> use(Sitka)
> des()

No. of observations = 395
  Variable  Class    Description
1 size      numeric
2 Time      numeric
3 tree      integer
4 treat     factor

Note that the 'tree' variable is equivalent to 'Subject' in the Indometh and Theoph data sets. Here, its class is "integer". Note also that the data frame is not in "groupedData" format.

> summ()
No. of observations = 395
  Variable  Class    Description
1 size      numeric
2 Time      numeric
3 tree      integer
4 treat     factor

Let's check whether there is any duplication of measurement on the same tree at the same time.

> table(Time, tree)

The output (omitted) is not too extensive for this data set and shows that there are 5 different values for 'Time' and 79 different trees. All cells have counts of 1 indicating no duplication. For larger data sets, the following command may be better.

> table(table(Time, tree))
  1
395

Since each cell in the previous table contains only the number one, the mean for that cell would be the size of the tree at that point of time.


> tapply(size, list(time=Time, tree=tree), mean)

The orientation of the table is the same. Again, the output is rather large and is omitted here. Each column represents the size of the tree over time. Let's create some follow-up plots.

> followup.plot(id=tree, time=Time, outcome=size)

Unlike the follow-up plots of the pharmacokinetic studies, in which there are only 6 lines for Indometh and 12 for Theoph, there are 79 lines in this plot, one for each tree. These plots are sometimes called "spaghetti-plots" due to the crossing lines.

Subsets can be achieved with the 'by' argument.

> followup.plot(id=tree, time=Time, outcome=size, by=treat)


There is a tendency that trees grown in ozone-rich chambers (red dashed lines) are smaller than those in the control group (black lines).

Time is days since 1 Jan 1988. However, it is not clear when the experiment started.

Unlike pharmacokinetic studies, which start at a zero level of the drug, in the Sitka study measurement started at some unknown time after the trees had begun to grow. Therefore the AUC may be an invalid outcome measure.

Let's try to subtract the size at the first measurement (day 152) from each tree's subsequent measurements and then calculate the AUC. First we must create a 'visit' index for each tree.

Indexing visits

Let us make sure that the data are properly sorted by 'tree' and 'Time'.

> sortBy(tree, Time)

Next, count the number of records contributed by each tree (in R, this is called 'run length encoding' since the lengths of each element that appears repeatedly are encoded into a list). The function is rle.

> list1 <- rle(tree)
> list1
Run Length Encoding
  lengths: int [1:79] 5 5 5 5 5 5 5 5 5 5 ...
  values : int [1:79] 1 2 3 4 5 6 7 8 9 10 ...

The object 'list1' has two elements, namely 'lengths' and 'values'. The first element shows that there are 5 visits for each tree. The second gives the value of the tree. Note that the function rle takes only an atomic vector as its argument. In this case we do not have any problem as 'tree' is a vector. If it were a factor, the corresponding command would be:

> list1 <- rle(as.vector(tree))

R has a function called sapply, which applies a function to each element of a vector.

> visit <- unlist(sapply(X=list1$lengths, FUN=function(x) 1:x,
    simplify=FALSE))
> visit

The above rle and sapply commands are complicated and not easy to remember. The epicalc package has a function called markVisits for this purpose.

> visit <- markVisits(id=tree, time=Time)

This is the required index vector for visits that we can pack into our data frame.

> pack()

Note that the marking of visits may not be well synchronized with the 'Time' variable, for a couple of reasons. Firstly, the exact time may not be repeated, as seen in the Theoph data. Secondly, data may not be collected at the scheduled time. For example, if patients are supposed to come weekly but do not show up in the second week, then their 3rd week visit becomes their second visit, their 4th week visit becomes their third visit, and so on. We can simply check the consistency between 'Time' and 'visit' as follows.

> table(Time, visit)
     visit
Time   1  2  3  4  5
  152 79  0  0  0  0
  174  0 79  0  0  0
  201  0  0 79  0  0
  227  0  0  0 79  0
  258  0  0  0  0 79

All the numbers (79) in the diagonal indicate perfect consistency.


> head(.data, 10)
   size Time tree treat visit
1  4.51  152    1 ozone     1
2  4.98  174    1 ozone     2
3  5.41  201    1 ozone     3
4  5.90  227    1 ozone     4
5  6.15  258    1 ozone     5
6  4.24  152    2 ozone     1
7  4.20  174    2 ozone     2
8  4.68  201    2 ozone     3
9  4.92  227    2 ozone     4
10 4.96  258    2 ozone     5

Creating an index variable allows easier manipulation of the follow-up records.

Our current task is to subtract the tree size for each tree at visit 1 from its size at all subsequent visits. This can be achieved with the following commands.

> tmp <- by(.data, INDICES=tree, FUN=function(x) x$size -
    x$size[x$visit==1])
> size.change <- sapply(tmp, as.numeric)

In analysis of longitudinal data, the by function should be one of the most commonly used. The first argument is the data frame used. The second argument, INDICES, is the factor that splits the data into subsets and then a function, FUN, is applied to each subset. The FUN argument can be any valid R function or a user defined one, which is the case here. This approach using by and sapply will be repeatedly used throughout our lessons.

The first command above splits the data frame into each individual tree and subtracts the tree size at the first visit from the tree sizes at all subsequent visits. The result is saved into a temporary object, 'tmp'. This object is a type of list and is not very useful unless it is converted to a vector or matrix using the sapply function.

In the second command above, the function as.numeric is applied to each element of 'tmp'. The result is a matrix containing the numeric differences between the tree sizes at each visit and those from the first visit. There are 5 visits for each tree and there are 79 trees, so the result is a 5 (visits) by 79 (trees) dimension matrix.

> dim(size.change)
[1]  5 79

We need to convert this matrix into one long vector and integrate it as a variable into the current data frame.

> size.change <- as.numeric(size.change)
> pack()


Now, let's look at the summary of this new variable.

> summ(size.change)
 obs.  mean   median  s.d.  min.   max.
 395   0.747  0.76    0.55  -0.52  2.1

Note that there are a few negative values (size decreased over time), in addition to a number of zeros corresponding to the first visit, with the remaining values positive (tree size increased). We temporarily ignore the records with negative tree growth, compute the AUC of the size differences, and then compare the values of this variable between the two treatment groups.

> auc.tree <- auc(size.change, time=Time, id=tree)

Here the 'tree' variable is used as the subject identification. The resulting data frame can now be merged with a data frame containing a subset of records from the first visit.

> visit1 <- subset(.data, subset=visit==1, select=c("tree", "size",
    "treat"))
> auc.visit1 <- merge(auc.tree, visit1)

Before using auc.visit1, since we have made many changes to .data, let's make a copy of it for future use.

> .data -> Sitka1

Now auc.visit1 can be used.

> use(auc.visit1)
> des()


No. of observations = 79
  Variable  Class    Description
1 tree      integer
2 auc       numeric
3 size      numeric
4 treat     factor

A summary of the AUC for each treatment group can be shown.

> summ(auc, by=treat)
For treat = control
 obs.  mean   median  s.d.    min.   max.
 25    93.78  91.66   22.414  62.93  141.5
For treat = ozone
 obs.  mean   median  s.d.    min.    max.
 54    82.29  82.14   24.991  -31.41  122.1

Surprisingly, the minimum AUC value for the treatment group is negative, which means one or more trees actually got smaller! To check which one(s) type:

> tree[auc < 0]
[1] 15

This can also be illustrated graphically.

> use(Sitka1)
> followup.plot(id=tree, time=Time, outcome=size, stress=15,
    stress.col=2, stress.width=3)

This unlikely finding was perhaps due to a measurement error made at the first visit of this tree.

More details about how to detect abnormal records where the size became smaller are given in the example of the followup.plot command.

We will omit this tree and then test the hypothesis that the AUC is different between the two treatment groups.

> use(auc.visit1)
> keepData(.data, subset=auc>0)
> tableStack(auc, by=treat)
           control      ozone        Test stat.              P value
auc                                  t-test (76 df) = 1.88   0.064
  mean(SD)  93.8 (22.4)  84.4 (19.6)

The conditions satisfy the requirements for a t-test and the difference in AUC is not statistically significant.


Individual growth rates

Apart from AUC, we can compute and compare the growth rates of trees in the two treatment groups.

Assuming that tree size is a linear function of time, for each individual tree, the intercept would be its expected size at time 0 and the coefficient for 'Time' would be the growth rate. We return to the original Sitka data set and use the function by to get the coefficients for each individual tree.

> use(Sitka)
> tmp <- by(.data, INDICES=tree, FUN=function(x) lm(size ~ Time,
    data=x))

Each element of 'tmp' contains the results of a linear model predicting the tree size from 'Time' using only the data records of each tree. We then use the sapply function to extract the coefficients of each model out from the 'tmp' object.

> tree.coef <- sapply(tmp, FUN=coef)

The class of this new object is a matrix.


> class(tree.coef)
[1] "matrix"
> dim(tree.coef)
[1]  2 79

This matrix has 2 rows and 79 columns. Our objective here is to create a data frame containing three variables. One variable must contain the unique tree id. The other two variables should consist of the initial tree sizes and the growth rates for each tree, which we have already obtained from the individual linear models.

We can convert the matrix above to a data frame using the as.data.frame function, but first it needs to be transposed (columns to rows) using the function t.

> tree.growth <- as.data.frame( t(tree.coef) )
> des(tree.growth)

No. of observations = 79
  Variable     Class    Description
1 (Intercept)  numeric
2 Time         numeric

The names of the variables created from linear modelling should be changed to something more appropriate. The 'Time' variable represents the individual growth rates obtained from the linear model.

> names(tree.growth)[2] <- "growth.rate"

We now add the 'tree' variable into this data frame.

> tree.growth$tree <- 1:79

This variable will be used to link with the visit1 data frame created previously.

> tree.growth <- merge(tree.growth, visit1)
> use(tree.growth)
> des()

No. of observations = 79
  Variable     Class    Description
1 tree         integer
2 (Intercept)  numeric
3 growth.rate  numeric
4 size         numeric
5 treat        factor

Now we have a data frame with 79 records, one record for each tree, containing each tree's own intercept, growth rate and treatment. We can now test the hypothesis of different growth rates between trees in the two different treatment groups.


> tableStack(vars=2:3, by=treat, decimal=3)
              control        ozone          Test stat.                P value
(Intercept)                                 t-test (77 df) = 1.0213   0.310
  mean(SD)    2.122 (1.108)  2.343 (0.783)
growth.rate                                 t-test (77 df) = 2.7682   0.007
  mean(SD)    0.014 (0.003)  0.012 (0.003)

At time zero, the two groups are not significantly different. However, the growth rate of trees in the treatment group (0.012 units per day) is significantly lower than that of trees in the control group (0.014 units per day).

Note that in this experiment, we do not know when ozone treatment started. If treatment was given late in the growth curve, then the validity of using the linearly increasing tree sizes (from time zero) as the outcome in the models would be in doubt.

Summary

In summary, without Epicalc, manipulation of data within each subject of a longitudinal data set requires several complicated functions, such as rle and sapply, to create an index variable within the same subject, here called 'visit'. The markVisits function from epicalc can simplify this task.

Measurements from the first visit can be subtracted from the other visits in the same individual to see the change from baseline. In the Sitka tree example, no baseline data was given, so the first visit records were used to represent the baseline. The linear growth rate of each individual can be computed using functions by and sapply. These two functions, when used together, are very powerful. Analysts of longitudinal data should get acquainted with them. They will be encountered extensively in subsequent chapters.


Exercises

Based on the experience of the above examples, check whether there are any missing records in the Xerop data set. Use the markVisits function to create a 'visit' index which indicates the order of visit for each subject. Check the consistency of this visit index and the 'time' variable.

Use the Sitka data set to compute coefficients for predicted quadratic growth curve for each tree. Determine which components of the growth curve are significantly predicted by ozone.


Chapter 5: Within subject, across time comparison

In the previous two chapters, we computed single summary outcome variables, such as the area under the curve (AUC) and the growth rate. By using this strategy, the complications of statistical models for repeated observations on the same subjects can be avoided. We used the functions by and sapply to create linear models of growth for each tree. In fact, there are more important applications of these functions – namely, within-subject, across-time comparison.

If you had run the example code in the help page for the followup.plot function, you would have found that some trees became smaller. Whether this is naturally possible or whether it was due to human error during data collection and/or data entry is not known. The technical challenge that we are facing is how to detect the records that have decreasing tree sizes. > library(epicalc) > zap() > data(Sitka, package="MASS") > use(Sitka)

Since there are no missing visits, we just mark the visits. > visit <- markVisits(id=tree, time=Time) > pack()

(See detailed explanation from the preceding chapter).

> tmp <- by(data=.data, INDICES=tree, FUN = function(x)
+   x$size[x$visit==2] - x$size[x$visit==1])

The data frame is split into subsets for each unique value of 'tree'. Within each subset, the tree size at the first visit is then subtracted from the size at the second visit and the result is stored in a temporary object called 'tmp'.

> diff2from1 <- sapply(tmp, FUN=as.numeric)

To find which tree(s) got smaller type:


> which(diff2from1 < 0)
 2 15 
 2 15 

They were the second and the fifteenth trees. The numbers on the top row are 'names' of the values, which are on the bottom row.

Similarly, one can find records where the tree size at the third visit is smaller than the size at the second etc.
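The same pattern could be reused for any pair of adjacent visits. A hedged sketch for the third versus the second visit is given below; the object name 'diff3from2' is illustrative only.

> # Hedged sketch: trees whose size at visit 3 is smaller than at visit 2
> tmp <- by(data=.data, INDICES=tree, FUN = function(x)
+   x$size[x$visit==3] - x$size[x$visit==2])
> diff3from2 <- sapply(tmp, FUN=as.numeric)
> which(diff3from2 < 0)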

Lag measurements

A more efficient method is to compare the current tree size at time t of each tree with its size at time t-1. A lag vector of sizes can be created using the strategy shown in the example of followup.plot. It is further discussed here.

The data frame first needs to be sorted by 'tree' and 'Time'. > sortBy(tree, Time)

Then we create a vector which has one visit lag. > size.lag.1 <- lagVar(var=size, id=tree, visit=visit, lag.unit=1)

The last argument 'lag.unit' has a default value of 1 if omitted. For a lag of 2 type: > size.lag.2 <- lagVar(var=size, id=tree, visit=visit, lag.unit=2)

A 'lag.unit' value of -1 will use the next visit. > size.next.1 <- lagVar(var=size, id=tree, visit=visit, lag.unit=-1) > pack()

These newly created lags can be seen from the following command.

> head(.data, 10)
   size Time tree treat visit size.lag.1 size.lag.2 size.next.1
1  4.51  152    1 ozone     1         NA         NA        4.98
2  4.98  174    1 ozone     2       4.51         NA        5.41
3  5.41  201    1 ozone     3       4.98       4.51        5.90
4  5.90  227    1 ozone     4       5.41       4.98        6.15
5  6.15  258    1 ozone     5       5.90       5.41          NA
6  4.24  152    2 ozone     1         NA         NA        4.20
7  4.20  174    2 ozone     2       4.24         NA        4.68
8  4.68  201    2 ozone     3       4.20       4.24        4.92
9  4.92  227    2 ozone     4       4.68       4.20        4.96
10 4.96  258    2 ozone     5       4.92       4.68          NA

Note that the first visit has neither a 'size.lag.1' nor 'size.lag.2' value. For 'size.next.1', the value is the tree size at the next (second) visit. At the second visit, 'size.lag.1' is the tree size from the first visit, etc. Now the trees that became smaller


at any point of time can be easily identified.

> .data[which(size.lag.1 > size),]
    size Time tree treat visit size.lag.1 size.lag.2 size.next.1
7   4.20  174    2 ozone     2       4.24         NA        4.68
72  4.08  174   15 ozone     2       4.60         NA        4.17
94  4.62  227   19 ozone     4       4.76       3.93        4.64
135 5.32  258   27 ozone     5       5.44       4.70          NA
180 4.60  258   36 ozone     5       4.62       4.42          NA
270 5.02  258   54 ozone     5       5.03       4.55          NA

There are six records, corresponding to six different trees. We can now create a variable for the change in size between two adjacent visits on the same tree. > size.change <- size - size.lag.1 > pack()

All visits of trees which became smaller can now be shown. > smaller.trees <- tree[which(size.change<0)] > .data[tree %in% smaller.trees,] size Time tree treat visit size.lag.1 size.lag.2 size.next.1 size.change 6 4.24 152 2 ozone 1 NA NA 4.20 NA 7 4.20 174 2 ozone 2 4.24 NA 4.68 -0.04 8 4.68 201 2 ozone 3 4.20 4.24 4.92 0.48 9 4.92 227 2 ozone 4 4.68 4.20 4.96 0.24 10 4.96 258 2 ozone 5 4.92 4.68 NA 0.04 ======================= trees 15, 19, 27 and 36 omitted ===================== 266 3.72 152 54 ozone 1 NA NA 4.16 NA 267 4.16 174 54 ozone 2 3.72 NA 4.55 0.44 268 4.55 201 54 ozone 3 4.16 3.72 5.03 0.39 269 5.03 227 54 ozone 4 4.55 4.16 5.02 0.48 270 5.02 258 54 ozone 5 5.03 4.55 NA -0.01

The records can now be inspected and corrected if needed. A summary of the change is shown as follows. > summ(size.change) obs. mean median s.d. min. max. 316 0.332 0.33 0.18 -0.52 0.87


The leftmost outlying value was probably a serious error in measurement. The upper part of the graph suggests that there are many missing values. In fact, the statistical output from the command shows that there are only 316 non-missing values. The remaining 395-316 = 79 missing records are from the first measurements which did not have any preceding measurement. > summ(size.change, by=Time)


The time intervals between measurements are 22, 27, 26 and 31 days, which slightly increased over time. Noticeable from the graph is that the growth of the trees actually slowed down over time. This can be seen more clearly with: > boxplot(size.change ~ Time)

Despite the plots, one must realize that the variable 'size' is the logarithm of the actual tree size. The untransformed values would in fact show accelerated growth.

In subsequent chapters, we will go deeper into using the change from the previous visit, rather than the absolute value, as the main outcome variable. For now, let's finish with tracking changes in a dichotomous outcome over time.

Follow up of dichotomous outcome variable

So far, we have used longitudinal data with continuous outcomes. Let's explore the bacteria data set once again, which comes from a longitudinal study in which the outcome is dichotomous. > zap() > data(bacteria, package="MASS") > use(bacteria) > des() No. of observations = 220


  Variable  Class     Description
1 y         factor
2 ap        factor
3 hilo      factor
4 week      integer
5 ID        factor
6 trt       factor

In this data set, the time variable is 'week', an integer, and the subject variable is 'ID', which is a "factor". > summ() No. of observations = 220 Var. name obs. mean median s.d. min. max. 1 y 220 1.805 2 0.397 1 2 2 ap 220 1.436 1 0.497 1 2 3 hilo 220 1.445 1 0.498 1 2 4 week 220 4.45 4 3.85 0 11 5 ID 220 25.218 24 14.582 1 50 6 trt 220 1.845 2 0.835 1 3

There are 220 records. Let's check whether there are any IDs missing in any of the follow-up periods. > length(unique(ID)) [1] 50 > table(table(ID)) 2 3 4 5 3 5 11 31

There are a total of 50 subjects. Three people came only twice, five people came 3 times, 11 people came four times and 31 came to every follow-up visit.

Now we check for duplicates. > table(ID, week) week ID 0 2 4 6 11 X01 1 1 1 0 1 X02 1 1 0 1 1 X03 1 1 1 1 1 X04 1 1 1 1 1 X05 1 1 1 1 1 X06 1 1 1 0 1 X07 1 1 1 1 1 X08 1 1 1 1 1 X09 1 1 1 1 1 ================= remaining lines omitted ============= > any(table(ID, week) > 1) [1] FALSE


No person had a duplicate record in any follow-up visit. To assess the total number of subjects who attended in each week type: > colSums(table(ID, week)) 0 2 4 6 11 50 44 42 40 44

All 50 subjects attended week 0. The numbers declined to 44, 42 and 40 at weeks 2, 4 and 6, respectively. At the final follow-up visit (week 11), 44 persons attended.

We now create a 'visit' index from the 'week' variable. > visit <- week > pack() > recode(visit, old.value = c(0,2,4,6,11), new.value = 1:5) > table(week, visit) # just to check

    visit
week  1  2  3  4  5
  0  50  0  0  0  0
  2   0 44  0  0  0
  4   0  0 42  0  0
  6   0  0  0 40  0
  11  0  0  0  0 44

It is a good idea to see the change of bacteria status from one week to the next. Let's start with the change from the first to the second week. > next.y <- lagVar(var=y, id=ID, visit=visit, lag.unit=-1) > pack()

Before continuing, we keep a copy of the data frame for later use. > .data -> data1 > head(.data,10) y ap hilo week ID trt visit next.y 1 y p hi 0 X01 placebo 1 y 2 y p hi 2 X01 placebo 2 y 3 y p hi 4 X01 placebo 3 <NA> 4 y p hi 11 X01 placebo 5 <NA> 5 y a hi 0 X02 drug+ 1 y 6 y a hi 2 X02 drug+ 2 <NA> 7 n a hi 6 X02 drug+ 4 y 8 y a hi 11 X02 drug+ 5 <NA> 9 y a lo 0 X03 drug 1 y 10 y a lo 2 X03 drug 2 y

Note that the first ID, X01, did not attend in week 6, thus the value of 'next.y' for week 4 is missing. Similarly, the second ID, X02, did not attend in week 4. The value of 'next.y' for this subject for week 2 is missing. In order to cross-tabulate the outcome variable, 'y', at visit 1 and visit 2, type: > keepData(subset=visit==1) > addmargins(table(y, next.y))


     next.y
y      n  y Sum
  n    0  4   4
  y    4 36  40
  Sum  4 40  44

Out of 4 persons who did not have the bacteria ('y' = "n") in their first visit, all of them changed to "y". Out of 40 subjects who did have the bacteria ('y' = "y"), 4 persons changed to "n". > mcnemar.test(table(y, next.y)) # P value = 1

There is no significant discrepancy between the transition states. To compare the outcome at the second and third visits, we must return to data1 and keep only the records for the second visit.

> use(data1)
> keepData(subset = visit==2)
> addmargins(table(y, next.y))
     next.y
y      n  y Sum
  n    3  0   3
  y    8 26  34
  Sum 11 26  37

Note that at this transition, the total number of records is now 37, not 50. There were clearly more people who changed from "y" at the second visit to "n" at the next visit than in the opposite direction. In other words, the bacteria tended to disappear during the second transition period (from week 2 to week 4), and this imbalance toward clearance of the bacteria is statistically significant by McNemar's test.

> mcnemar.test(table(y, next.y))
McNemar's chi-squared = 6.125, df = 1, p-value = 0.01333

We can continue in this manner until there is no further transition.
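The remaining transitions can be inspected in the same way. A hedged base-R sketch is given below; it loops over the visit index using the copy 'data1' created above, rather than the use/keepData steps shown in the text, and is only one possible way to automate the comparison.

> # Hedged sketch: tabulate and test every visit-to-visit transition in data1
> for (v in 1:4) {
+   sub <- data1[data1$visit == v, ]
+   cat("\nTransition from visit", v, "to visit", v + 1, "\n")
+   print(addmargins(table(sub$y, sub$next.y)))
+   print(mcnemar.test(table(sub$y, sub$next.y)))
+ }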

Reshaping to wide format

Reshaping the data frame from long to wide form gives the same results.

> wide <- reshape(bacteria, idvar="ID", v.names="y", timevar="week",
+                 direction="wide")
> head(wide)
   ap hilo  ID     trt y.0 y.2  y.4 y.11  y.6
1   p   hi X01 placebo   y   y    y    y <NA>
5   a   hi X02   drug+   y   y <NA>    y    n
9   a   lo X03    drug   y   y    y    y    y
14  p   lo X04 placebo   y   y    y    y    y
19  p   lo X05 placebo   y   y    y    y    y
24  a   lo X06    drug   y   y    y    y <NA>

The value of the 'week' variable from the original data set (in long form) became the suffix for the repeated variables in wide form. The variable 'y.11' appears before 'y.6' because the first person (X01) did not attend the sixth week appointment. The outcome value in 'y.6' for this person is therefore <NA>. Then,

> with(wide, addmargins(table(y.0, y.2)))
     y.2
y.0    n  y Sum
  n    0  4   4
  y    4 36  40
  Sum  4 40  44
> with(wide, addmargins(table(y.2, y.4)))
     y.4
y.2    n  y Sum
  n    3  0   3
  y    8 26  34
  Sum 11 26  37

The results are exactly the same as what we obtained using the preceding method.

In conclusion, there are two methods for comparing values across time within the same person. The first method created a 'visit' variable which was modified from 'week' and shifted the values by using the lagVar function. The second method reshaped the data to wide format before making the comparison. This method is more straightforward for tabulation but not useful for transition modelling.

Exercise

Read in the Xerop data from Epicalc. Explore the pattern of visiting times. Were they evenly distributed? Track changing status of respiratory infection (respinfect), xerop and stunting over the visits.


Chapter 6: Analysis of missing records

The preceding chapter looked at the changing status of the subject. This chapter pays attention to missing records during follow-up.

Missing records are different from missing values within variables of existing records. For missing records, the analyst should first highlight any pattern to the research team, even though the data themselves are not available. Later, as with the analysis of missing values, the missing records should be checked to see whether they are missing at random or whether there is some underlying cause. Some analysts prefer to impute missing values with their 'best guess'. For a follow-up study focusing on a single outcome variable, with all other variables (such as demographic and clinical prognostic factors) being fixed, and where the statistical methods used allow no missing data, such imputation can be cost-effective. However, when more than one variable (especially both a changing exposure and a changing outcome) is being monitored, and the statistical methods can accommodate incomplete data, imputation becomes less important.

Handling missing values properly involves complicated techniques that are beyond the scope of this book. Readers are advised to consult other sources for more details on this topic.

Based on the above arguments, data management is the most important technique to deal with missing records. We will examine methods for identifying, refilling and highlighting the pattern of missing records.

Identifying missing records

Let's return to the bacteria data set, which tests the presence of the bacteria H. influenzae in children with otitis media in the Northern Territory of Australia, and identify the missing records. > library(MASS) > library(epicalc) > zap()


> use(bacteria)   # This data set is in the MASS package
> des()

No. of observations = 220
  Variable  Class     Description
1 y         factor
2 ap        factor
3 hilo      factor
4 week      integer
5 ID        factor
6 trt       factor

> summ()
No. of observations = 220
  Var. name obs. mean   median s.d.   min. max.
1 y         220  1.805  2      0.397  1    2
2 ap        220  1.436  1      0.497  1    2
3 hilo      220  1.445  1      0.498  1    2
4 week      220  4.45   4      3.85   0    11
5 ID        220  25.218 24     14.582 1    50
6 trt       220  1.845  2      0.835  1    3

There are 220 records. The most important variables for identification of missing visits are 'ID' and 'week'. Since the class of the 'ID' variable is "factor" the output from the summ command is not meaningful, particularly the minimum and maximum values. All we can say is that there are 50 distinct values.
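A quick, hedged check of this count (assuming the data are in use as above):

> length(levels(ID))   # number of distinct children; should be 50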

The 'week' variable is an integer and ranges from 0 to 11. Let's view the distribution more closely. > table(week) week 0 2 4 6 11 50 44 42 40 44

The follow up interval is 2 weeks up until week 6 with a final visit at week 11. There were 50 children who attended in week 0 (all children in the study attended the initial visit). The number declined to between 40 and 44 over the following 11 weeks. There is no obvious pattern for the missing records.

Let's now explore the visit frequency of all 50 children. > table(ID) ID X01 X02 X03 X04 X05 X06 X07 X08 X09 X10 X11 X12 X13 X14 X15 X16 X17 X18 4 4 5 5 5 4 5 5 5 2 5 5 5 3 5 5 5 5 X19 X20 X21 Y01 Y02 Y03 Y04 Y05 Y06 Y07 Y08 Y09 Y10 Y11 Y12 Y13 Y14 Z01 5 4 5 5 5 5 4 3 4 5 4 4 5 5 2 4 3 3 Z02 Z03 Z05 Z06 Z07 Z09 Z10 Z11 Z14 Z15 Z19 Z20 Z24 Z26 5 5 5 2 5 4 5 5 5 3 5 4 5 5


We can further tabulate this table to obtain a frequency of total visits for all children. > table(table(ID)) 2 3 4 5 3 5 11 31

Of the 5 scheduled visits, three children came twice, five came three times, 11 came four times and 31 attended all five. No child came just once; that is, every child returned for treatment at least once during the 11-week study. The full vector of frequencies for 0 up to 5 visits is therefore [0, 0, 3, 5, 11, 31].
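A hedged sketch of how this full frequency vector, and the attendance probability used below, could be computed in R (assuming the data are still in use):

> # Hedged sketch: frequencies of 0 to 5 visits per child, and overall attendance probability
> table(factor(table(ID), levels=0:5))
 0  1  2  3  4  5 
 0  0  3  5 11 31 
> sum(table(ID)) / (50 * 5)
[1] 0.88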

Since there are 220 records out of the possible maximum 250 (50 subjects × 5 visits), the overall probability of a child attending at each visit is 220/250 = 0.88. The question is whether or not the observed distribution is actually random.

If we can prove that the observed data follows some known distribution, then we can conclude that the missing records have no pattern and are therefore missing at random.

Proving randomness requires an understanding of certain statistical theory. The binomial distribution is a discrete probability distribution of the number of successes in a sequence of n independent experiments (or trials), each of which has a success with probability equal to some constant, p. For n = 5 trials each with a probability of success equal to 0.88 the associated probabilities for each number of successes (from 0 up to 5) is given by: > p <- dbinom(x=0:5, size=5, prob=0.88); p [1] 0.00002488 0.00091238 0.01338163 0.09813197 0.35981722 0.52773192

The probability of observing zero attendances out of 5 visits is 0.00002, while for five attendances the probability is more than half. These probabilities refer to a single child. For the 50 children in the study, the expected numbers of children with each total number of visits would be:

> 50 * p
[1]  0.00124416  0.04561920  0.66908160  4.90659840 17.99086080 26.38659584

Let's compare these expected numbers with the observed numbers from the bacteria study.

> data.frame(week=0:5, p=p, expected=50*p, observed=c(0,0,3,5,11,31))
  week            p    expected observed
1    0 0.0000248832  0.00124416        0
2    1 0.0009123840  0.04561920        0
3    2 0.0133816320  0.66908160        3
4    3 0.0981319680  4.90659840        5
5    4 0.3598172160 17.99086080       11
6    5 0.5277319168 26.38659584       31


The observed numbers appear fairly close to the expected numbers. In our study, 31 out of 50 children attended all 5 visits. If the distribution of total visits followed a binomial distribution, with p = 0.88, we would expect this number to be 26. To test whether the whole vector of observed frequencies fits well with the above expected frequencies from a binomial distribution, we employ the chi-squared goodness-of-fit test. > chisq.test(x=c(0,0,3,5,11,31), p=p) Chi-squared test for given probabilities data: c(0, 0, 3, 5, 11, 31)

X-squared = 11.6921, df = 5, p-value = 0.03926

Warning message:In chisq.test(c(0, 0, 3, 5, 11, 31), p = p, : Chi-squared approximation may be incorrect

The p-value is significant; however the warning suggests that the test may not be appropriate, most likely because many of the expected frequencies are less than 5. Let's view the help page for this function. > help(chisq.test)

The argument 'simulate.p.value' determines whether to compute the p-value by Monte Carlo simulation. The default is "FALSE". We need to specify "TRUE".

> chisq.test(c(0,0,3,5,11,31), p=p, simulate = TRUE)

        Chi-squared test for given probabilities with simulated p-value
        (based on 2000 replicates)

data:  c(0, 0, 3, 5, 11, 31)
X-squared = 11.6921, df = NA, p-value = 0.06497

The result indicates that there is not enough evidence to conclude that the observed data do not come from a binomial distribution. Thus we conclude that the observed distribution of missing visits is missing at random.

Let's now check whether treatment ('trt') is associated with missing records. Note that treatment is fixed for each subject throughout the study, which can be checked firstly with eye-ball scanning of the cross-tabulation. > table(ID, trt) trt ID placebo drug drug+ X01 4 0 0 X02 0 0 4 X03 0 5 0 X04 5 0 0 X05 5 0 0 X06 0 4 0 ===== lines omitted ====

Within each row, only one cell should be greater than zero; the others should all be zero.


> table(ID, trt) > 0
     trt
ID    placebo  drug drug+
  X01    TRUE FALSE FALSE
  X02   FALSE FALSE  TRUE
  X03   FALSE  TRUE FALSE
  X04    TRUE FALSE FALSE
  X05    TRUE FALSE FALSE
  X06   FALSE  TRUE FALSE
===== lines omitted =====

If no child changed treatment, then the sum of each row should not be more than 1. > any(rowSums(table(ID, trt) > 0) > 1) [1] FALSE

The conclusion is that nobody changed treatment during the study.

Similarly, we can explore the other intervention given (active encouragement to comply with treatment). > any(rowSums(table(ID, hilo) > 0) > 1) [1] FALSE

No child had their level of encouragement to comply with treatment changed during the study.

At baseline (week 0) we have shown that all children attended for treatment. Let's explore the treatment allocation for that first week, which is stated in the help page for the data set as being randomized. > table(trt[week==0]) placebo drug drug+ 21 14 15

The treatment allocation is not completely balanced, but in any case we would like to see if this distribution is more or less the same throughout the 5 weeks of follow-up. > table(week, trt) trt week placebo drug drug+ 0 21 14 15 2 20 13 11 4 18 12 12 6 17 11 12 11 20 12 12

For each subsequent visit (weeks 2 to 11), the number of children receiving the 3 different treatments appear to be quite similar, indicating a randomness to the missing records in terms of treatment group. We can test the hypothesis that for


each follow up the distribution is no different to the first week by using the chi-square test.

> chisq.test(table(trt[week==2]),  p=table(trt[week==0])/50, simulate=TRUE)
> chisq.test(table(trt[week==4]),  p=table(trt[week==0])/50, simulate=TRUE)
> chisq.test(table(trt[week==6]),  p=table(trt[week==0])/50, simulate=TRUE)
> chisq.test(table(trt[week==11]), p=table(trt[week==0])/50, simulate=TRUE)

All four tests have large P values indicating that for each follow-up visit, the number of children who attended for treatment was very close to the expected value set by that at the first visit (week 0).

We can continue exploring the distribution of missing records for each variable in this manner. However, since the number of variables we can explore at one time is limited, we cannot investigate whether an interaction between variables exists. For example, we have shown that the distribution of missing records is neither associated with week nor with treatment. There is, however, a possibility that non-attendees (the missing records) occurred earlier in one treatment group compared to another.

A complete exploration of predictors for missingness can be done only if the missing records are actually present in the data set, in which case an indicator variable is required to specify if a record was not present in the original data.

To add missing records into a longitudinal data set, we first create a data frame containing all the possible combinations of 'ID' and 'week' by reshaping the data to wide format. > wide <- reshape(.data, idvar="ID", timevar="week", v.names="y",

direction="wide")

The warning message appears because we did not specify the treatment variables in the command. In the wide data frame the values of the variables will come from the first record of each ID. Since we have shown that these variables are fixed (constant) with the ID, we can safely ignore the warning. > head(wide) ap hilo ID trt y.0 y.2 y.4 y.11 y.6 1 p hi X01 placebo y y y y <NA> 5 a hi X02 drug+ y y <NA> y n 9 a lo X03 drug y y y y y 14 p lo X04 placebo y y y y y 19 p lo X05 placebo y y y y y 24 a lo X06 drug y y y y <NA>


Now we can reshape this data frame back to long format. As explained in chapter 2, most of the arguments of the function can be omitted, since the data frame was created from a previous reshape command. The result is a data frame containing 250 records, containing all unique combinations of ID and week. > long <- reshape(wide, direction="long") > des(long) No. of observations = 250 Variable Class Description 1 ap factor 2 hilo factor 3 ID factor 4 trt factor 5 week integer 6 y factor

Note that the order of the variables is alphabetical. This is a side-effect of the reshape command. > summ(long) No. of observations = 250 Var. name obs. mean median s.d. min. max. 1 ap 250 1.42 1 0.495 1 2 2 hilo 250 1.44 1 0.497 1 2 3 ID 250 25.5 25.5 14.46 1 50 4 trt 250 1.88 2 0.842 1 3 5 week 250 4.6 4 3.78 0 11 6 y 220 1.805 2 0.397 1 2

The 'y' variable contains only 220 observations, which comes from the original data.

The final step is to add an indicator variable, specifying whether the subject attended at each corresponding follow up visit. These are in fact the records in which the 'y' variable is not missing. > long$attend <- !is.na(long$y)

An alternative is to use Epicalc's addMissingRecords function. This method is simpler and requires only one command.

> bacteria.new <- addMissingRecords(dataFrame=bacteria, id=ID, visit=week, outcome=y)

The new records can be inspected easily.


> head(bacteria.new, 10)
    ID week    y ap hilo     trt present
1  X01    0    y  p   hi placebo       1
2  X01    2    y  p   hi placebo       1
3  X01    4    y  p   hi placebo       1
4  X01    6 <NA>  p   hi placebo       0
5  X01   11    y  p   hi placebo       1
6  X02    0    y  a   hi   drug+       1
7  X02    2    y  a   hi   drug+       1
8  X02    4 <NA>  a   hi   drug+       0
9  X02    6    n  a   hi   drug+       1
10 X02   11    y  a   hi   drug+       1

Note the reordering of the variables and the addition of the 'present' variable, which indicates whether the record was present in the original dataset.

Either data set is now ready to analyse. > use(long) > tableStack(vars=c(1,2,4,5), by=attend, vars.to.factor=week) FALSE TRUE Test stat. P value ap Chisq. (1 df) = 1.49 0.222 a 21 (70) 124 (56.4) p 9 (30) 96 (43.6) hilo Chisq. (1 df) = 0.08 0.784 hi 18 (60) 122 (55.5) lo 12 (40) 98 (44.5) trt Chisq. (2 df) = 3.21 0.201 placebo 9 (30) 96 (43.6) drug 8 (26.7) 62 (28.2) drug+ 13 (43.3) 62 (28.2) week Chisq. (4 df) = 10.61 0.031 0 0 (0) 50 (22.7) 2 6 (20) 44 (20) 4 8 (26.7) 42 (19.1) 6 10 (33.3) 40 (18.2) 11 6 (20) 44 (20)

The output suggests that attendance (and therefore non-attendance) in each week was not random. Let's run a logistic regression model. The 'week' variable needs to be converted to a factor first so that a comparison of attendance between the first visit and each remaining visit can be done.

> week <- factor(week)
> pack()
> glm1 <- glm(attend ~ trt + hilo + week, family = binomial, data = .data)


> logistic.display(glm1)

Logistic regression predicting attend crude OR(95%CI) adj. OR(95%CI) P(Wald) P(LRtest) trt: ref.=placebo 0.202 drug 0.73 (0.27,1.98) 0.86 (0.24,3.12) 0.814 drug+ 0.45 (0.18,1.11) 0.38 (0.13,1.16) 0.089 hilo: lo vs hi 1.2 (0.55,2.62) 0.74 (0.18,3.02) 0.678 0.679 week: ref.=0 0.003 2 0 (0,Inf) 0 (0,Inf) 0.986 4 0 (0,Inf) 0 (0,Inf) 0.985 6 0 (0,Inf) 0 (0,Inf) 0.985 11 0 (0,Inf) 0 (0,Inf) 0.986 Log-likelihood = -81.9821 No. of observations = 250 AIC value = 179.9642

Odds ratios comparing the follow-up weeks to the first week are zero because there was no missing visit in the first week. The likelihood ratio test, however, confirms that there is a significant difference in missing records among the visits. The adjusted odds ratio for compliance ('hilo') is quite different from the crude odds ratio (0.74 adjusted versus 1.2 crude). This is because compliance is associated with the type of treatment ('trt'). In fact, 'trt' is simply a combination of the 'ap' variable (active/placebo) and 'hilo' (high/low encouragement to comply with treatment). Nonetheless, neither variable is statistically significant.
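The possibility raised earlier, that non-attendance might occur earlier in one treatment group than another, corresponds to a week-by-treatment interaction. A hedged sketch of how this could be examined, assuming 'glm1' and the factored 'week' variable created above are still available:

> # Hedged sketch: does the pattern of missingness over the weeks differ by treatment?
> glm2 <- glm(attend ~ trt + hilo + week + trt:week, family = binomial, data = .data)
> anova(glm1, glm2, test = "Chisq")   # likelihood ratio test for the interaction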

Our final conclusion is that missing records are not a purely random phenomenon; they increased significantly after week 0. Neither treatment nor compliance, however, is associated with missingness.


Summary

Missing records are almost unavoidable when the sample size and the number of follow-up visits are large. When they do occur, it is important to investigate the reasons and to ensure that they are missing at random. The most effective method is to fill in the missing visits based on subjects and time of follow-up using addMissingRecords. By creating a 'missing' indicator variable as the outcome variable, cross tabulation (tableStack) and logistic regression (glm) can help to identify determinants of missing records.

Exercise

Load the Xerop data set from Epicalc. This is a dataset from an Indonesian study on vitamin A deficiency and risk of respiratory infection in 275 children.

• At each scheduled visit, determine how many records are missing.
• Identify and remove the duplicated records based on the combination of 'id' and 'time', then repeat question 1.
• Including the duplicated records that you removed, use Epicalc's addMissingRecords function to create a new data frame containing a complete set of records.
• Was season associated with non-attendance?
• Determine whether or not vitamin A deficiency and/or respiratory infection preceded non-attendance.
• Find the determinant(s) of the missing records.


Chapter 7: Modelling longitudinal data

One of the purposes of modelling is to make the most efficient use of the explanatory variables in the data set, by keeping only those explanatory variables that predict the dependent variable while trying to minimise the errors (residuals). The ordinary least squares technique minimises the sum of squares of the residuals. Linear modelling in R is achieved using the lm function, where the outcome variable is on a continuous scale. Maximum likelihood estimation (MLE, or sometimes just ML) performs numerical iteration to obtain the coefficient estimates that produce the maximum likelihood. This procedure is done in R using the glm command, which stands for generalized linear model. The outcome variable, however, can be continuous, dichotomous or a count.
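For a continuous (Gaussian) outcome, least squares and maximum likelihood give the same coefficient estimates, so lm and glm agree. A hedged sketch using the Sitka data from MASS:

> # Hedged sketch: lm (least squares) and glm (maximum likelihood) agree for a Gaussian outcome
> data(Sitka, package="MASS")
> coef(lm(size ~ Time, data=Sitka))
> coef(glm(size ~ Time, family=gaussian, data=Sitka))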

In the previous chapters, the relationship of the outcome and predicting variables is based on an assumption of independence among the records. Exceptions include, for example, situations where the data are analysed using non-linear models, such as conditional logistic regression, where the data are stratified and conditionally analysed in sets.

In longitudinal studies, data on one individual can be measured more than one time. Thus records belonging to one individual appear in the dataset more than once. Measurements from the same individual are not independent from each other. Analysis of this type of data using (general) linear models will therefore give erroneous results. There are three main choices of modelling here:

1. Population average models, or marginal models, or generalized estimating equation (GEE) models

2. Random coefficients or random effects models, or hierarchical models

3. Transition or Markov models.

Population average models are so called because they only focus on the average relationship among repeated measures. A number of individuals are measured on the outcome and exposure repeatedly. Repetitions may arise from measurements of the same individuals several times or from measurement of the same sets of variables on several individuals. The models don't take into account the source of repetition. They just find the average relationship. This relationship comes from


averaging across repeated times and within repeated persons. This is why the model is called a marginal model. Remember that when we apply the addmargins function to a table, we add, at the rightmost column, the sum of each row and, at the bottom row, the sum of each column. The margins thus focus on the overall effect of rows and columns and ignore what is inside. In marginal models, we are not able to estimate the outcome for an individual person, but we can still predict the outcome value of a new subject given that person's covariate values. This prediction is based on the average effects mentioned above. The modelling technique is called "generalized estimating equations", probably because the final model is based on several generalized linear model equations that share the same set of coefficients. GEE models require more parameters to be estimated than the ordinary GLM methodology: the correlation structure among the residuals of the different rounds of observation must be specified. The choice of possible correlation structures will be discussed in later sections.

Random coefficients models take a different approach. Subjects are assumed to be a random, representative sample of a large population. Members of this population share the same set of coefficients, called 'fixed effects'. Each member also has his or her own random variation around these coefficients, called 'random coefficients' or 'random effects'. Regardless of the number of subjects, each random coefficient requires only one parameter (its variance), making the model parsimonious. This can be compared with stratified analysis, where the number of parameters increases by the number of strata less one. For example, if we have 10 strata and all strata share the other coefficients, the number of stratification parameters increases (compared with the non-stratified analysis) by as much as 9. In a random coefficients model, these 10 subjects are instead assumed to be a random sample from the population; if they share all the other coefficients, allowing a random baseline (intercept term) for the 10 subjects adds only one parameter. In other words, random coefficients models are quite similar to stratified models, except that the number of parameters is usually greatly reduced.
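The contrast between one parameter per stratum and a single variance parameter can be illustrated with the Sitka trees, where each of the 79 trees plays the role of a stratum. This is a hedged sketch, not part of the analyses in this book:

> # Hedged sketch: fixed "strata" intercepts versus a single random-intercept variance
> library(nlme)
> data(Sitka, package="MASS")
> fixed.fit  <- lm(size ~ Time + factor(tree), data=Sitka)        # 78 extra intercept parameters
> random.fit <- lme(size ~ Time, random = ~ 1 | tree, data=Sitka) # 1 extra parameter: intercept SD
> length(coef(fixed.fit))    # 80 fixed coefficients in total
> VarCorr(random.fit)        # intercept and residual variances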

The other name for these types of models is 'mixed models' because they are a mixture of fixed coefficients and random ones. While marginal models predict the outcome of each person based on an average set of predicting coefficients, mixed models use both fixed coefficients common to all subjects and random coefficients specific to each individual in order to predict the outcome of that particular individual. For a model with only one random coefficient, the random coefficient would be the variation of intercept of each individual with other coefficients (or slopes) being fixed. The output of the model should show the standard deviation of the intercepts. If this is large, it would mean that there is a large level of baseline variation of the subjects under study. Random coefficients models also allow for random slopes. A model with random slopes means that different subjects may be


differently affected by the independent variable. This is similar to interaction of strata in stratified analysis discussed in our previous book1. Finally, while marginal models focus on correlations among residuals of different times of observations, random coefficient models are more interested in correlation of the residuals of different random variables. For example, if the intercepts have a positive correlation with the slopes, the lines on the upper part (high value of intercept) of the graph would be steeper than those in the lower part. When there is only one random variable, i.e. a random intercept, then the correlation is not of main concern.

Occasionally, analysts use conditional fixed effects models. This is an extreme case where random terms disappear. Comparison is made within the same person or the same matched set, just like in a matched case-control study. For longitudinal studies, the outcome of main concern using a fixed effects model is not at any individual time point but the difference between two time points in the same person. This is confined to before-after studies, or studies looking at the difference between two sides of the same organ, such as eyes or kidneys, within the same person. This type of model is of limited use and will be omitted in future discussion.

Finally, transition models focus on the transition of states. They are used to predict the outcome of a set of subjects who share the same previous status as well as other independent variables. Such a model thus has two simultaneous interests. First, it is interested in the effect of the preceding outcome status on the current one after adjustment for other covariates. Second, it demonstrates the effects of those covariates after adjustment for the previous outcome. A simple transition model looks at the effect of the previous outcome in only one or a few preceding rounds. Autoregressive (AR) models, often employed by economists, look further back at more preceding lags, since financial data may have longer-term effects than medical outcomes. Economic models often further aggregate the outcomes of individual time points into a 'moving average' to obtain a more stable outcome value.

As moving average at one time point is highly correlated with its neighbours, moving average outcomes are almost always associated with the autoregressive approach. The two components make up a new name autoregressive moving average analysis (ARMA). This is rarely used in epidemiology and will not be included in future discussion. ___________________________________________________________________ 1 Analysis of Epidemiological Data using R and Epicalc


Packages in R that are used for different modelling are shown in the following table:

Model                                         Package   Function
Marginal models (GEE)                         gee       gee
                                              geepack   geeglm
Random coefficient models
  Linear mixed models                         nlme      lme
  Generalized linear mixed models             MASS      glmmPQL
  Generalized linear mixed models,
    more advanced models                      lme4      lmer
Transition models                             stats     glm

Let's try these packages and functions with the longitudinal data that we have previously explored. Let's use the gee function in the gee package to model the Sitka dataset.

> library(epicalc)
> data(Sitka, package="MASS")
> use(Sitka)
> library(gee)
> gee.in <- gee(size ~ Time, id=tree, data=.data)
> summary(gee.in)

GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA gee S-function, version 4.13 modified 98/01/27 (1998) Model: Link: Identity Variance to Mean Relation: Gaussian Correlation Structure: Independent Call: gee(formula = size ~ Time, id = tree, data = .data) Summary of Residuals: Min 1Q Median 3Q Max -2.02609732 -0.37956123 0.06948273 0.41669270 1.30948273


Coefficients: Estimate Naive S.E. Naive z Robust S.E. Robust z (Intercept) 2.27324430 0.1768643245 12.85304 0.1003348470 22.65658 Time 0.01268548 0.0008591845 14.76456 0.0003719549 34.10488 Estimated Scale Parameter: 0.4108594 Number of Iterations: 1 Working Correlation [,1] [,2] [,3] [,4] [,5] [1,] 1 0 0 0 0 [2,] 0 1 0 0 0 [3,] 0 0 1 0 0 [4,] 0 0 0 1 0 [5,] 0 0 0 0 1

Leaving most arguments to their default values, the link function is 'identity', which means that the original value of outcome variable is not transformed. This is applicable to continuous outcome variables in all models. The family is 'gaussian' by default. The default correlation structure among residuals of different times is "independent". This assumes that there is no association among residuals of different time periods, as shown by zero values in the off-diagonal cells of the working correlation matrix. There are two sets of standard errors produced. The robust ones are based on a conservative computation. The naïve ones give the same results as those from using the glm command. > summary(glm(size ~ Time))$coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 2.27324430 0.1768643245 12.85304 8.319835e-32 Time 0.01268548 0.0008591845 14.76456 1.473822e-39

An independent correlation structure is not observed. > res <- (glm(size ~ Time))$residuals > wide <- reshape(data.frame(tree=tree,res=res,Time=Time),

idvar="tree", v.names="res", timevar="Time", direction="wide") > wide [1:5, ] tree res.152 res.174 res.201 res.227 res.258 1 1 0.30856322 0.4994827 0.58697486 0.7471525 0.6039027 6 2 0.03856322 -0.2805173 -0.14302514 -0.2328475 -0.5860973 11 3 -0.22143678 -0.1205173 -0.03302514 -0.1628475 -0.5160973 16 4 0.15856322 0.2894827 0.27697486 0.1471525 -0.1860973 21 5 0.13856322 0.4694827 0.59697486 0.8171525 0.7339027 > cor(wide[,2:6]) res.152 res.174 res.201 res.227 res.258 res.152 1.0000000 0.9614699 0.9176641 0.8710606 0.8565763 res.174 0.9614699 1.0000000 0.9721038 0.9370675 0.9247401 res.201 0.9176641 0.9721038 1.0000000 0.9653189 0.9494939 res.227 0.8710606 0.9370675 0.9653189 1.0000000 0.9866713 res.258 0.8565763 0.9247401 0.9494939 0.9866713 1.0000000


The correlation coefficients among residuals of different time points are very high. Thus the assumption of independence is not valid.

We first try the most commonly used correlation structure, “exchangeable”, which assumes that the correlation between time points is constant and non-zero. > gee.ex <- gee(size ~ Time, id=tree, data=.data, corstr =

"exchangeable") > summary(gee.ex) GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA gee S-function, version 4.13 modified 98/01/27 (1998) Model: Link: Identity Variance to Mean Relation: Gaussian Correlation Structure: Exchangeable Call: gee(formula = size ~ Time, id = tree, data = .data, corstr =

"exchangeable") Summary of Residuals: Min 1Q Median 3Q Max -2.02609732 -0.37956123 0.06948273 0.41669270 1.30948273 Coefficients: Estimate Naive S.E. Naive z Robust S.E. Robust z (Intercept) 2.27324430 0.0880570302 25.81559 0.1003348470 22.65658 Time 0.01268548 0.0002688318 47.18742 0.0003719549 34.10488 Estimated Scale Parameter: 0.4108594 Number of Iterations: 1 Working Correlation [,1] [,2] [,3] [,4] [,5] [1,] 1.0000000 0.9020987 0.9020987 0.9020987 0.9020987 [2,] 0.9020987 1.0000000 0.9020987 0.9020987 0.9020987 [3,] 0.9020987 0.9020987 1.0000000 0.9020987 0.9020987 [4,] 0.9020987 0.9020987 0.9020987 1.0000000 0.9020987 [5,] 0.9020987 0.9020987 0.9020987 0.9020987 1.0000000

The coefficients of the regression are exactly the same as those obtained from specifying an independent correlation structure. However, the standard errors are smaller.

Since the exchangeable structure forces the working correlation to be constant (and here constantly high) across all time lags, which is not what we observed, the argument 'corstr' should be changed again.

> gee.ar1 <- gee(size ~ Time, id=tree, data=.data, corstr = "AR-M")
> summary(gee.ar1)


GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA gee S-function, version 4.13 modified 98/01/27 (1998) Model: Link: Identity Variance to Mean Relation: Gaussian Correlation Structure: AR-M , M = 1 Call: gee(formula = size ~ Time, id = tree, data = .data, corstr = "AR-M") Summary of Residuals: Min 1Q Median 3Q Max -1.9059082 -0.2816582 0.1728384 0.5287292 1.3964369 Coefficients: Estimate Naive S.E. Naive z Robust S.E. Robust z (Intercept) 2.31297888 0.1123461827 20.58796 0.1003399163 23.05143 Time 0.01199296 0.0004319665 27.76362 0.0003508396 34.18359 Estimated Scale Parameter: 0.4216764 Number of Iterations: 3 Working Correlation [,1] [,2] [,3] [,4] [,5] [1,] 1.0000000 0.9460445 0.8950002 0.8467100 0.8010253 [2,] 0.9460445 1.0000000 0.9460445 0.8950002 0.8467100 [3,] 0.8950002 0.9460445 1.0000000 0.9460445 0.8950002 [4,] 0.8467100 0.8950002 0.9460445 1.0000000 0.9460445 [5,] 0.8010253 0.8467100 0.8950002 0.9460445 1.0000000

Now the working correlation is closer to what we observed from the simple regression. AR-M denotes an autoregressive correlation of order one (M = 1). This means corr(t, t+n) = [corr(t, t+1)]^n. In our example, the correlation between two adjacent visits (the right-hand side of this equation with n = 1) is 0.9460445. When the lag is increased to 2, the correlation is 0.9460445^2 = 0.8950002, and for a lag of 3 it is 0.9460445^3 = 0.84671, etc. The correlation coefficients between visits thus decline slowly with increasing lag. To speed up the decline of the correlation, another argument, 'Mv', can be specified. For example,

> gee.ar2 <- gee(size ~ Time, id=tree, data=.data, corstr = "AR-M", Mv=2)
> summary(gee.ar2)

The results are omitted. Both the coefficients and the standard errors are slightly different from those of gee.ar1, as the working correlations now drop faster with increasing time lag.
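As a quick, hedged check that the AR-1 working correlations really are powers of the lag-1 value, the reported figure can be raised to successive powers and compared with the first row of the working correlation matrix printed above:

> # Hedged sketch: powers of the lag-1 correlation reproduce the AR-1 working correlations
> rho <- 0.9460445
> round(rho^(1:4), 7)   # compare with 0.9460445 0.8950002 0.8467100 0.8010253 above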

Let's try the geepack package.


> library(geepack)
> geeglm.ex <- geeglm(size ~ Time, id=tree, data=.data, corstr = "exchangeable")
> summary(geeglm.ex)

Call: geeglm(formula = size ~ Time, data = .data, id = tree, corstr = "exchangeable")

Coefficients: Estimate Std.err Wald p(>W) (Intercept) 2.27324430 0.0977722507 540.5813 0 Time 0.01268548 0.0003640436 1214.2462 0 Estimated Scale Parameters: Estimate Std.err (Intercept) 0.4087791 0.06377465 Correlation: Structure = exchangeable Link = identity Estimated Correlation Parameters: Estimate Std.err alpha 0.9043941 0.01815002 Number of clusters: 79 Maximum cluster size: 5

The estimated correlation parameter is 0.904, which is slightly larger than the one obtained from the gee package using the same correlation structure (gee.ex). The coefficients and standard errors are also very similar. These slight differences are seen among the results from different software regardless of whether the software is open-source or commercial. The very small differences are quite acceptable.

Let's try a model using an AR-1 correlation structure. > geeglm.ar1 <- geeglm(size ~ Time, id=tree, data=.data, corstr= "ar1") > summary(geeglm.ar1)

Call: geeglm(formula = size ~ Time, data = .data, id = tree, corstr = "ar1") Coefficients: Estimate Std.err Wald p(>W) (Intercept) 2.31280103 0.0980443147 556.4571 0 Time 0.01198546 0.0003445955 1209.7360 0 Estimated Scale Parameters: Estimate Std.err (Intercept) 0.4198992 0.06373226 Correlation: Structure = ar1 Link = identity Estimated Correlation Parameters: Estimate Std.err alpha 0.952948 0.009632563 Number of clusters: 79 Maximum cluster size: 5


Again, this gives practically the same coefficients and robust standard errors as the one from the gee package.

Now that we are acquainted with these packages, the next step is to use them for hypothesis testing. We want to see whether the size of trees is affected by treatment after adjusting for time.

> geeglm1.ar1 <- geeglm(size ~ Time+treat, id=tree, data=.data, corstr="ar1")
> summary(geeglm1.ar1)$coefficient
               Estimate      Std.err        Wald     p(>W)
(Intercept)  2.46484647 0.1657640661  221.105204 0.0000000
Time         0.01198697 0.0003446439 1209.700571 0.0000000
treatozone  -0.22238262 0.1621230081    1.881535 0.1701597

The trees treated with ozone had a non-significantly smaller size throughout the time of follow up. Unsurprisingly, tree sizes increased over time.

As treatment may have different effects over time, we now put in the interaction term.

> geeglm2.ar1 <- geeglm(size ~ Time*treat, id=tree, data=.data, corstr="ar1")
> summary(geeglm2.ar1)$coefficient
                    Estimate      Std.err        Wald       p(>W)
(Intercept)      2.154288609 0.2071115387 108.1930041 0.000000000
Time             0.013501415 0.0005847241 533.1587465 0.000000000
treatozone       0.231863021 0.2331996714   0.9885693 0.320092306
Time:treatozone -0.002219175 0.0007066679   9.8617143 0.001687538

This final model gives a better picture of the ozone effect. The main effect of ozone is not significant, indicating that at Time 0 there was no difference in size between trees in the two treatment groups. The interaction term is strongly significant and its negative coefficient indicates that the growth rate of trees in the ozone-treated group was significantly lower than that of trees in the control group.
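The group-specific growth rates implied by this model can be read off the coefficients; a hedged sketch (assuming 'geeglm2.ar1' is still available):

> # Hedged sketch: growth rates implied by the interaction model
> b <- coef(geeglm2.ar1)
> b["Time"]                          # control group: about 0.0135 per day
> b["Time"] + b["Time:treatozone"]   # ozone group:   about 0.0113 per day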

You may like to try modelling this data using the gee package. The conclusion should be the same.

Now let's model dichotomous outcomes using the GEE methodology. Let's return to the bacteria data set.

> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> infected <- y=="y"
> pack()
> geeglm3.ar1 <- geeglm(infected ~ ap+week, id=ID, data=.data,
+                       corstr = "ar1", family="binomial")


> summary(geeglm3.ar1) Call: geeglm(formula = infected ~ ap + week, family = "binomial", data =

.data, id = ID, corstr = "ar1") Coefficients: Estimate Std.err Wald p(>W) (Intercept) 1.6315020 0.31066152 27.580382 1.506995e-07 app 0.8256061 0.48819185 2.859991 9.080800e-02 week -0.1041665 0.03738504 7.763552 5.331103e-03 Estimated Scale Parameters: Estimate Std.err (Intercept) 0.993334 0.3366146 Correlation: Structure = ar1 Link = identity Estimated Correlation Parameters: Estimate Std.err alpha 0.1921146 0.08664026 Number of clusters: 50 Maximum cluster size: 5

The output indicates that there is no evidence of a difference in bacterial infection between those receiving active treatment and those receiving placebo (p-value = 0.09). On the other hand, the probability of infection decreased over time.

Interaction terms could also be explored, but none of the covariates becomes significant. The results are omitted here.

Results using the gee function from the gee package are similar.

Exercise Try modelling the respiratory data set from the geepack package using the methods described in this chapter. Compare the output from functions in the different packages.


Chapter 8: Mixed models

The previous chapter demonstrated how to use the GEE methodology to model the Sitka and bacteria data sets. In this chapter we will use mixed modelling techniques to model the same data sets and compare the results.

> library(epicalc)
> library(MASS)
> use(Sitka)
> glmmPQL1 <- glmmPQL(fixed = size ~ Time * treat, random= ~ 1 | tree,

data=.data, family="gaussian") > summary(glmmPQL1) Linear mixed-effects model fit by maximum likelihood Data: .data AIC BIC logLik NA NA NA Random effects: Formula: ~1 | tree (Intercept) Residual StdDev: 0.6003342 0.1932339 Variance function: Structure: fixed weights Formula: ~invwt Fixed effects: size ~ Time * treat Value Std.Error DF t-value p-value (Intercept) 2.1217179 0.15374924 314 13.799860 0.0000 Time 0.0141472 0.00046278 314 30.569975 0.0000 treatozone 0.2216775 0.18596433 77 1.192043 0.2369 Time:treatozone -0.0021385 0.00055975 314 -3.820480 0.0002 Correlation: (Intr) Time tretzn Time -0.609 treatozone -0.827 0.504 Time:treatozone 0.504 -0.827 -0.609 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -2.7050855 -0.4925306 0.1176727 0.5752986 4.3408330 Number of Observations: 395


Number of Groups: 79

The glmmPQL function comes from the MASS package. It fits generalized linear mixed models using the penalized quasi-likelihood technique.

The random component is ~1 | tree. The 1 denotes the first coefficient of the model, the intercept term. The sign | denotes "given" or "clustered by" or "grouped by". So the random component for this data is the random intercepts of the trees. In other words, the model assumes that all trees have the same coefficients for 'time' and 'Treat' as well as the interaction term. The only difference is in their intercepts, which are assumed to be a random variable (or random coefficients), thus there is no need for any additional coefficient.

The standard deviation of the random intercepts is 0.6003 units, which is quite large compared with the residual standard deviation of 0.193. This means that the baseline (intercept) values of the trees varied considerably compared with the variation of growth within the same tree.
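The individual intercept deviations estimated by the model can be inspected directly; a hedged sketch (ranef comes from the nlme machinery underlying glmmPQL):

> # Hedged sketch: predicted random intercepts, one per tree
> re <- ranef(glmmPQL1)
> head(re)          # deviation of each tree's intercept from the overall intercept
> sd(re[, 1])       # empirical spread; shrunken, so somewhat below the model SD of 0.60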

The fixed effects are 'Time', 'treat' and their interaction. Since the main objective of the analysis is to compare the size of trees in each group over time, the fixed effects are more important here than the random effects. The results are not too dissimilar from those we obtained using the GEE methodology: the significant interaction term indicates that the growth rate over time differs between the ozone-treated and control trees.

The correlation section and the standardised residuals are complicated, not very important, and can be ignored.

Now we add a random slope to the model. > glmmPQL2 <- glmmPQL(fixed = size ~ Time * treat, random= ~ Time |

tree, data=.data, family="gaussian") > summary(glmmPQL2) Linear mixed-effects model fit by maximum likelihood Data: .data AIC BIC logLik NA NA NA Random effects: Formula: ~Time | tree Structure: General positive-definite, Log-Cholesky parametrization StdDev Corr (Intercept) 0.790968484 (Intr) Time 0.002487428 -0.649 Residual 0.162608831 Variance function: Structure: fixed weights


Formula: ~invwt Fixed effects: size ~ Time * treat Value Std.Error DF t-value p-value (Intercept) 2.1217179 0.17806707 314 11.915274 0.0000 Time 0.0141472 0.00063379 314 22.321781 0.0000 treatozone 0.2216775 0.21537748 77 1.029251 0.3066 Time:treatozone -0.0021385 0.00076658 314 -2.789663 0.0056 Correlation: (Intr) Time tretzn Time -0.729 treatozone -0.827 0.603 Time:treatozone 0.603 -0.827 -0.729 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -2.3090537 -0.6772047 0.1037030 0.5585733 3.2705870 Number of Observations: 395 Number of Groups: 79

The random effect is now "~ Time | tree" which allows for a different effect of 'Time' on each tree. From the output, the intercept term has a larger standard deviation (0.79 compared with 0.6003 in the random intercept model). Individual slopes are strongly negatively correlated with the intercepts (Corr = -0.649). Within the same treatment group, trees with the larger initial size tended to have a flatter growth.

We now try the lme function from the nlme package. > library(nlme) > lme1 <- lme(fixed = size ~ Time * treat, random = ~ 1 | tree, data

=.data) > summary(lme1) Linear mixed-effects model fit by REML Data: .data AIC BIC logLik 175.5170 199.3293 -81.75852 Random effects: Formula: ~1 | tree (Intercept) Residual StdDev: 0.6082011 0.1938483 Fixed effects: size ~ Time * treat Value Std.Error DF t-value p-value (Intercept) 2.1217179 0.15439225 314 13.742386 0.0000 Time 0.0141472 0.00046190 314 30.628557 0.0000 treatozone 0.2216775 0.18674206 77 1.187078 0.2388 Time:treatozone -0.0021385 0.00055868 314 -3.827802 0.0002


Correlation: (Intr) Time tretzn Time -0.606 treatozone -0.827 0.501 Time:treatozone 0.501 -0.827 -0.606 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -2.6949604 -0.4923652 0.1176021 0.5733781 4.3279068 Number of Observations: 395 Number of Groups: 79

The results from using the lme function are very close to those from using glmmPQL.

> lme2 <- lme(fixed = size ~ Time*treat, random= ~ Time | tree, data=.data)
> summary(lme2)

After adding a random time component, the results are very close to those of 'glmmPQL2' in all aspects.

An advantage of using lme is the availability of the AIC, BIC and log-likelihood, which indicate the relative goodness of fit. We can therefore use these to compare different models.

> anova(lme1, lme2)
     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
lme1     1  6 175.5170 199.3293 -81.75852
lme2     2  8 146.1218 177.8714 -65.06088 1 vs 2 33.39528  <.0001

The model 'lme2' has two degrees of freedom more than 'lme1' but it has a much smaller AIC value. Thus 'lme2' is significantly better than 'lme1'. The random slope model fits better than one with a random intercept alone for this data set.

Note that the lme function was designed for linear mixed effects modelling. It is confined to linear models where the outcome variable is on a continuous scale only. Therefore, there is no "family" argument. The function lme has an option to use maximum likelihood (ML) or restricted maximum likelihood (REML) methods.
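A hedged sketch of switching estimation methods: models can be refitted with method = "ML" via update, which is generally preferred when comparing models that differ in their fixed effects (here lme1 and lme2 differ only in the random part, so the REML comparison above is also acceptable):

> # Hedged sketch: refit with maximum likelihood rather than the default REML
> lme1.ml <- update(lme1, method = "ML")
> lme2.ml <- update(lme2, method = "ML")
> anova(lme1.ml, lme2.ml)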

A more recent package called lme4 allows a choice of "family" as well as a more versatile nesting procedure. The formula syntax is slightly different.

> library(lme4)
> lmer1 <- lmer(size ~ Time*treat + (1|tree), family="gaussian", data=.data)
> summary(lmer1)

Linear mixed model fit by REML Formula: size ~ Time * treat + (1 | tree)


Data: .data AIC BIC logLik deviance REMLdev 175.5 199.4 -81.76 130.2 163.5 Random effects: Groups Name Variance Std.Dev. tree (Intercept) 0.369909 0.60820 Residual 0.037577 0.19385 Number of obs: 395, groups: tree, 79 Fixed effects: Estimate Std. Error t value (Intercept) 2.1217179 0.1543913 13.742 Time 0.0141472 0.0004619 30.629 treatozone 0.2216775 0.1867409 1.187 Time:treatozone -0.0021385 0.0005587 -3.828 Correlation of Fixed Effects: (Intr) Time tretzn Time -0.606 treatozone -0.827 0.501 Time:tretzn 0.501 -0.827 -0.606

The results are the same as those from using the lme function, except that p-values for the coefficients are not displayed. One can, however, obtain 95% confidence intervals by drawing a Markov Chain Monte Carlo (MCMC) sample and computing a 95% highest posterior density (HPD) interval.

> tmp <- mcmcsamp(lmer1, n=1000)
> HPDinterval(tmp)

$fixef lower upper (Intercept) 1.790535519 2.4503748846 Time 0.012564591 0.0155585309 treatozone -0.152775369 0.6803190421 Time:treatozone -0.004096989 -0.0002541869 attr(,"Probability") [1] 0.95 ========= further lines omitted ==========

Both the lower and upper limits of the 95% CI for 'Time' are positive, whereas those of the interaction term are both negative, indicating the statistical significance of these two terms. Now we add the random slope component.

> lmer2 <- lmer(size ~ Time*treat + (Time|tree), family="gaussian", data=.data)
> anova(lmer1, lmer2)

Data: .data Models: lmer1: size ~ Time * treat + (1 | tree) lmer2: size ~ Time * treat + (Time | tree)


Df AIC BIC logLik Chisq Chi Df Pr(>Chisq) lmer1 6 142.201 166.074 -65.100 lmer2 8 114.102 145.933 -49.051 32.099 2 1.071e-07 ***

The conclusion is the same as before. The model with a random slope is significantly better than that with random intercept alone.

For dichotomous outcomes, we turn to the bacteria data set.

> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> infected <- y=="y"
> pack()

> glmmPQL.1 <- glmmPQL(infected ~ ap + week, random = ~ 1|ID, data =

.data, family="binomial") > summary(glmmPQL.1) Linear mixed-effects model fit by maximum likelihood Data: .data AIC BIC logLik NA NA NA Random effects: Formula: ~1 | ID (Intercept) Residual StdDev: 1.347478 0.7881903 Variance function: Structure: fixed weights Formula: ~invwt Fixed effects: infected ~ ap + week Value Std.Error DF t-value p-value (Intercept) 2.0352357 0.3816667 169 5.332495 0.0000 app 1.0082124 0.5326217 48 1.892924 0.0644 week -0.1450321 0.0390851 169 -3.710677 0.0003 Correlation: (Intr) app app -0.485 week -0.536 -0.047 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -4.2940431 0.2019745 0.3111024 0.5620260 2.0447455 Number of Observations: 220 Number of Groups: 50

The coefficients are fairly different from those obtained using GEE in the previous chapter; however, the conclusion is the same: treatment is no better than placebo for controlling infection, and the probability of infection decreases with time.


Modelling using the lmer function gives similar results.
> lmer1 <- lmer(infected ~ ap + week + (1|ID), family="binomial",
data=.data)
> summary(lmer1)

Generalized linear mixed model fit by the Laplace approximation
Formula: infected ~ ap + week + (1 | ID)
   Data: .data
   AIC BIC logLik deviance
 206.4 220  -99.2    198.4
Random effects:
 Groups Name        Variance Std.Dev.
 ID     (Intercept) 1.4012   1.1837
Number of obs: 220, groups: ID, 50
Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.09745    0.41334   5.074 3.89e-07 ***
app          1.07571    0.55431   1.941  0.05230 .
week        -0.14440    0.04833  -2.988  0.00281 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
     (Intr) app
app  -0.418
week -0.621 -0.063

Note that when the outcome variable is dichotomous, the lmer function also provides z values and p-values, so 95% confidence intervals of the coefficients can be computed.
> coefs <- attr(summary(lmer1), "coefs")
> coefs
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept)  2.0974492 0.41333783  5.074419 3.886823e-07
app          1.0757068 0.55431021  1.940622 5.230411e-02
week        -0.1444006 0.04833126 -2.987726 2.810615e-03
> ci95 <- data.frame(lower=coefs[,1] - 1.96*coefs[,2], upper=coefs[,1]
+1.96*coefs[,2])
> ci95
                  lower       upper
(Intercept)  1.28730708  2.90759139
app         -0.01074125  2.16215476
week        -0.23912984 -0.04967129
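Since the model is fitted on the logit scale, these limits are log odds ratios; as a small follow-up (not shown in the original output) they can be exponentiated to give 95% confidence intervals for the odds ratios:
> exp(ci95)   # the same intervals expressed as odds ratios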

One advantage of using lmer in hypothesis testing is the availability of the anova function.

There are 3 levels of treatment. In order to determine whether or not there is any treatment effect we can try the following model.


> lmer2 <- lmer(infected ~ trt+week + (1|ID), family="binomial", data=.data)

> summary(lmer2)

====== lines omitted =====
Fixed effects:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    3.1440     0.5290   5.943 2.79e-09 ***
trtdrug       -1.3202     0.6252  -2.111  0.03473 *
trtdrug+      -0.7955     0.6401  -1.243  0.21400
week          -0.1437     0.0483  -2.975  0.00293 **
====== lines omitted =====

Treatment with the drug (with a low level of encouragement to comply) is significantly better than the placebo (referent group). However, to test whether this variable as a whole is significant we use the anova function.
> lmer3 <- lmer(infected ~ week + (1|ID), family="binomial", data =
.data)
> anova(lmer2, lmer3)

Data: .data
Models:
lmer3: infected ~ week + (1 | ID)
lmer2: infected ~ trt + week + (1 | ID)
      Df     AIC     BIC   logLik  Chisq Chi Df Pr(>Chisq)
lmer3  3 208.204 218.385 -101.102
lmer2  5 207.771 224.739  -98.885 4.4337      2     0.1090

The output indicates that there is not enough evidence to show that treatment (with or without encouragement to comply) has a significant effect on infection.

In summary, this chapter has shown that generalized linear mixed effects modelling, in which the outcome variable may be continuous or dichotomous, can be performed with glmmPQL in the MASS package and with lmer in the lme4 package, with similar results. The lmer command has the additional advantage of being able to test whether the random effects should be simple random intercepts or should also include random slopes. It is also useful for testing hypotheses about variables with more than two levels of exposure. However, package lme4, which contains the lmer function, is still under development and some changes are expected in future versions.
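Indeed, in later versions of lme4 the 'family' argument was removed from lmer, and generalized models such as those above are fitted with the glmer function instead. A hedged sketch of the newer syntax (not the version used in this book; 'glmer1' is an illustrative name):
> glmer1 <- glmer(infected ~ ap + week + (1|ID), family=binomial, data=.data)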

Exercise

Load the BMD data from Epicalc. It contains data from a clinical trial of three doses of a drug which is thought to affect bone density levels in post-menopausal women. Treatment 1 is the lowest dosage of the drug and Treatment 2 is the highest dosage.


Because it is always preferable to have patients on the lowest effective dosage of a drug, the interest in this trial is focussed on whether Treatment 1 is significantly different from Treatments 2 and 3.

Subjects were post-menopausal women, who were randomly allocated to one of three doses (that is, this is a completely randomized design). Bone mineral densities were measured at the start of the trial, and at 12 and 24 months after the trial commenced.

1. Is there a beneficial treatment effect on bone density at the hip, after correcting for the covariates given?

2. Anecdotal evidence is that a side-effect of the treatment is a gain in weight (or increase in BMI). Do these data provide evidence for this theory?


Chapter 9: Transition models

This chapter discusses the last type of modelling of longitudinal data: transition models.

Transition, the change of status from one point in time to the next, is, as the name suggests, the centre of interest. We have already explored and analysed the bacteria data set, where the time points did not increase regularly. In this chapter, let's try to analyse a more complicated data set – Xerop, which is concerned with the relationship between vitamin A deficiency and respiratory infection among children.
> library(epicalc)
> zap()
> data(Xerop)
> use(Xerop)
> des()
(subset)
No. of observations = 1200
   Variable     Class   Description
 1 id           integer
 2 respinfect   integer
 3 age.month    integer
 4 xerop        integer
 5 sex          factor
 6 ht.for.age   integer
 7 stunted      integer
 8 time         integer
 9 baseline.age integer
10 season       factor
> summ()
(subset)
No. of observations = 1200
   Var. name    obs. mean      median s.d.      min.   max.
 1 id           1200 168402.32 166150 103202.68 121013 1725110
 2 respinfect   1200 0.09      0      0.29      0      1
 3 age.month    1200 3.2       3      20.25     -32    50
 4 xerop        1200 0.05      0      0.21      0      1
 5 sex          1200 1.41      1      0.492     1      2
 6 ht.for.age   1200 0.91      1      5.85      -23    25
 7 stunted      1200 0.12      0      0.33      0      1
 8 time         1200 3.42      3      1.76      1      6
 9 baseline.age 1200 -4.05     -3     19.63     -32    44
10 season       1200 2.488     2      0.922     1      4

The outcome of interest is contained in the variable 'respinfect'. The main independent variable is 'xerop'. There is a maximum of 6 scheduled visits ('time') for each child. Three variables ('age.month', 'ht.for.age' and 'baseline.age') contain negative values. They were transformed by a previous analyst for unknown reasons.
> length(table(id))
[1] 275

There are 275 children in the data set. Check for missing and duplicate visits.
> T <- table(id, time)
> sum(T == 0)
[1] 452
> sum(T > 1)
[1] 2

There are 452 visits missed, and in 2 records the combination of 'id' and 'time' is duplicated. We can use the following command to list the ids of the duplicated records:
> id[which(duplicated(cbind(id, time)))]
[1] 161013 161013

Now we can list the records of the child whose 'id' is equal to 161013. (The output has been edited to fit on the page).
> .data[id==161013,]
        id respinf age.month xerop sex ht.for.age stunted time base.age
496 161013       0        -1     0   0          2       0    1       -1
497 161013       0         2     0   0          3       0    2       -1
498 161013       0         5     0   0          2       0    3       -1
499 161013       0         8     0   0          3       0    4       -1
500 161013       0        11     0   0          2       0    1       11
501 161013       0        14     1   0          1       0    2       11

The duplicated records are rows 500 and 501, where 'time' of 1 and 2 is repeated from rows 496 and 497. This is likely to be a human error arising during data entry. Inspection of the 'age.month' variable, which has a constant increment of 3, would suggest that the 'time' variable for the duplicated records should be changed to 5 and 6, respectively. This subject was vitamin A deficient ('xerop' = 1) at the last visit and had a lower height for age but did not yet have stunted growth. Their baseline age was -1 in the first four visits but noted to be 11 in the last two. We now have the dilemma of either deleting these two records or changing the times to 5 and 6. Let's choose the first option just for illustration purposes.
> data.new <- .data[-c(500,501),]
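Had we chosen the second option instead, the repeated visit numbers could simply be recoded; a sketch, assuming as above that the offending records occupy rows 500 and 501 of .data:
> .data$time[c(500, 501)] <- c(5, 6)             # relabel the repeated visits as 5 and 6
> anyDuplicated(cbind(.data$id, .data$time))     # should now return 0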


> use(data.new)
> anyDuplicated(cbind(id, time)) # final check
[1] 0

The two duplicated records have been removed.
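A quick check of the number of rows (a sketch only) confirms that exactly two records were dropped from the original 1200:
> nrow(.data)   # expected to be 1198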

A cross-sectional relationship between vitamin A deficiency and the infection can now be assessed.
> cc(respinfect, xerop)
           xerop
respinfect    0  1 Total
     0     1044 47  1091
     1      100  7   107
     Total 1144 54  1198
OR = 1.55
95% CI = 0.58 3.58
Chi-squared = 1.13 , 1 d.f. , P value = 0.288
Fisher's exact test (2-sided) P value = 0.323

Vitamin A deficiency is associated with a 55 percent increase in the odds of infection. However, this increase does not reach statistical significance.

We expect that it should take some amount of time before vitamin A deficiency could have any effect on infection. To create lag variables for respiratory infection and vitamin A deficiency we use the lagVar command from epicalc.
> respinfect.lag1 <- lagVar(respinfect, id=id, visit=time, lag.unit=1)
> xerop.lag1 <- lagVar(xerop, id=id, visit=time, lag.unit=1)
> pack()

Now we tabulate current infection against preceding vitamin A status.
> cc(respinfect, xerop.lag1)
           xerop.lag1
respinfect   0  1 Total
     0     754 31   785
     1      62  8    70
     Total 816 39   855
OR = 3.13
95% CI = 1.19 7.35
Chi-squared = 8.26 , 1 d.f. , P value = 0.004
Fisher's exact test (2-sided) P value = 0.011

The risk of infection is 3 times higher if the child had vitamin A deficiency at the preceding visit. We should check whether this association is confounded by preceding infection status using the Mantel Haenszel method.
> mhor(respinfect, xerop.lag1, respinfect.lag1)
Stratified analysis by respinfect.lag1


                    OR lower lim. upper lim. P value
respinfect.lag1 0 3.39     1.1886       8.46 0.01145
respinfect.lag1 1 1.65     0.0305      19.45 0.52669
M-H combined      3.03     1.3348       6.89 0.00534
M-H Chi2(1) = 7.76 , P value = 0.005
Homogeneity test, chi-squared 1 d.f. = 0.33 , P value = 0.567

The adjusted odds ratio (3.03) and the crude odds ratio (3.13) are quite close, indicating minimal confounding by past infection.

Logistic regression gives similar results. > glm1 <- glm(respinfect ~ xerop.lag1 + respinfect.lag1,

family="binomial", data=.data)
> logistic.display(glm1)
Logistic regression predicting respinfect

                        crude OR(95%CI)  adj. OR(95%CI)   P(Wald) P(LR-test)
xerop.lag1: 1 vs 0      3.14 (1.38,7.12) 3.06 (1.34,6.97) 0.008   0.015
respinfect.lag1: 1 vs 0 1.88 (0.92,3.84) 1.82 (0.88,3.75) 0.104   0.124

Log-likelihood = -237.9799
No. of observations = 855
AIC value = 481.9597

This model suggests that during the transition from previous visit to current visit, susceptibility to respiratory infection is enhanced by vitamin A deficiency and not by the presence of the preceding infection.

We can add other putative risk factors to see if the odds ratio changes.
> glm2 <- glm(respinfect ~ xerop + xerop.lag1 + respinfect.lag1 +
season + sex + age.month, family="binomial", data=.data)
> logistic.display(glm2)

The variables 'xerop' and 'sex' are not significant and so we run a new model with them omitted.
> glm3 <- glm(respinfect ~ xerop.lag1 + respinfect.lag1 + season +
age.month, family="binomial", data=.data)
> logistic.display(glm3, decimal=1)

Logistic regression predicting respinfect

                        crude OR(95%CI)  adj. OR(95%CI)   P(Wald) P(LR)
xerop.lag1: 1 vs 0      3.1 (1.4,7.1)    4.2 (1.7,10.2)   0.002   0.004
respinfect.lag1: 1 vs 0 1.9 (0.9,3.8)    1.61 (0.8,3.4)   0.223   0.239
season: ref.=1                                                    < 0.001
   2                    4.0 (1.7,9.5)    4.6 (1.9,11.2)   < 0.001
   3                    1.9 (0.8,4.3)    1.9 (0.8,4.6)    0.133
   4                    1.3 (0.5,3.6)    1.3 (0.5,3.7)    0.588
age.month (cont. var)   0.98 (0.97,0.99) 0.97 (0.96,0.99) <0.001  < 0.001

Log-likelihood = -223.9393
No. of observations = 855
AIC value = 461.8786

Both season and age are significant determinants of respiratory infection. With these variables in the model, the odds ratio for preceding vitamin A deficiency increases from 3.1 to 4.2, and the odds ratio for preceding infection decreases slightly from 1.8 to 1.6, which again is not significant.

In summary, transition models are suitable for cohort studies with a regular follow-up interval and with exposure and outcome status that change over time. Transition modelling is statistically relatively simple but needs careful data management and exploration. Keeping the preceding outcome status in the model ensures that 'carry-over' effects are adjusted for and that the correlation over time is taken care of. Demonstrating that preceding exposure is associated with the current health outcome provides stronger support for causation than the usual cross-sectional analysis.


Exercise

Analyse the bacteria data set from the MASS package.

1) Create a variable called 'visit' using the markVisits function.

2) Ignore the differences among the visits and proceed with the transition analysis.
• Compute the crude odds ratio between the outcome 'y' and active treatment 'ap'.
• Compute the odds ratio between 'y' and 'ap', but this time adjusting for preceding outcome status using the Mantel Haenszel method. Is the conclusion the same?
• Compute the adjusted odds ratio using logistic regression.
• In the logistic regression model, replace 'ap' with 'trt' and explain the findings.

3) The conclusion from a transition model is different from those using marginal (GEE) and mixed models. Which one is the most sensible?

4) Draw up a table to summarize your understanding of the three main modelling techniques for longitudinal data. The columns should be the three techniques (marginal model, mixed model, transition model). The rows should include the following items.
• Underlying interest
• Data manipulation needed
• Interpretation of the main and the minor coefficients and other parameters
• Advantages and disadvantages
• Indication and limitation of application


Further Reading

The following are useful reference books on longitudinal data analysis.

Diggle, Heagerty, Liang & Zeger. Analysis of Longitudinal Data. Oxford University Press.

Hedeker & Gibbons. Longitudinal Data Analysis. Wiley.


Solutions to exercises

Chapter 1
> library(epicalc)
> des(Theoph) # This dataset has a "lazy loading" attribute
> help(Theoph)

The help page describes the variables in this dataset. The data appear to be in long format, since there are no repeating variables, and there is a "time" variable. To confirm this, we can determine if the Subject id is duplicated.
> head(Theoph)
Grouped Data: conc ~ Time | Subject
<environment: R_EmptyEnv>
   Subject   Wt Dose  Time  conc
1        1 79.6 4.02  0.00  0.74
2        1 79.6 4.02  0.25  2.84
3        1 79.6 4.02  0.57  6.57
4        1 79.6 4.02  1.12 10.50
5        1 79.6 4.02  2.02  9.66
6        1 79.6 4.02  3.82  8.58
7        1 79.6 4.02  5.10  8.36
8        1 79.6 4.02  7.03  7.47
9        1 79.6 4.02  9.05  6.89
10       1 79.6 4.02 12.12  5.94
11       1 79.6 4.02 24.37  3.28
12       2 72.4 4.40  0.00  0.00
13       2 72.4 4.40  0.27  1.72
14       2 72.4 4.40  0.52  7.91
15       2 72.4 4.40  1.00  8.31
> use(Theoph)
> any(duplicated(Subject))
[1] TRUE
> any(duplicated(cbind(Subject, Time)))
[1] FALSE

The variable "Time" appears to be inconsistent across subjects, as can be seen from:
> tab1(Time)


Reshaping this data to wide form using this "time" variable is possible, but would result in a useless data set.
> Theoph.wide <- reshape(Theoph, direction="wide", idvar="Subject",
timevar="Time", v.names="conc")
> des(Theoph.wide); head(Theoph.wide) # output not shown

Chapter 2
> library(epicalc)
> use(Theoph)
> des()

No. of observations = 132
  Variable Class   Description
1 Subject  ordered
2 Wt       numeric
3 Dose     numeric
4 Time     numeric
5 conc     numeric

> summ()

No. of observations = 132

  Var. name obs. mean  median s.d.  min. max.
1 Subject   132  6.5   6.5    3.465 1    12
2 Wt        132  69.58 70.5   9.13  54.6 86.4
3 Dose      132  4.63  4.53   0.72  3.1  5.86
4 Time      132  5.89  3.53   6.93  0    24.65
5 conc      132  4.96  5.28   2.87  0    11.4

> table(Subject)

Subject
 6  7  8 11  3  2  4  9 12 10  1  5
11 11 11 11 11 11 11 11 11 11 11 11

For a small data set such as this one, we can easily see that there are 12 subjects, each having 11 records. For large data sets, the following may be better.
> length(unique(Subject))
[1] 12
> tab1(table(Subject))
table(Subject) :
        Frequency Percent Cum. percent
11             12     100          100
  Total        12     100          100

Assess the timing of drug measurements using the tab1 and summ commands.
> tab1(Time)
Time :
      Frequency Percent Cum. percent
0            12     9.1          9.1
0.25          5     3.8         12.9
0.27          3     2.3         15.2
0.3           2     1.5         16.7
0.35          1     0.8         17.4
0.37          1     0.8         18.2
====== remaining lines omitted ======
> table(Time, Subject) # output not shown
> summ(Time)

[Figure: sorted dot plot from summ(Time), titled "Distribution of Time"; subjects sorted by the x-axis (Time) values, x-axis roughly 0 to 25 hours.]

The jittering of the stacks of points indicates that the time of drawing blood was not perfectly synchronised for all subjects. It appears as if some attempt was made to draw the blood at specific intervals for each subject, namely at 15 and 30 minutes, and then at 1, 2, 3.5, 5, 7, 9, 12 and 24 hours after the start of the study; however this was not achieved exactly.
> followup.plot(id=Subject, time=Time, outcome=conc, xlab="Time (hrs)",
ylab="Concentration (mg/L)", las=1)
> title(main="Pharmacokinetics of theophylline")


[Figure: follow-up plot titled "Pharmacokinetics of theophylline"; x-axis Time (hrs) 0–25, y-axis Concentration (mg/L) 0–10, one line per subject.]

The concentration rises sharply after the first dose, then drops gradually over time.
> coplot(conc~Time|Subject, panel=lines, type="b", data=Theoph)

[Figure: coplot of conc against Time, one panel per Subject (panels given by Subject: 6, 7, 8, 11, 3, 2, 4, 9, 12, 10, 1, 5).]


Multicoloured lines can be achieved as follows (graph not shown).
> followup.plot(Subject, Time, conc, line.col="multicolour")

For the examination of the subjects' weights over time, we can use the aggregate command. If the standard deviation of each subject's weight is zero, then that would indicate stability.
> aggregate(Wt, by=list(Subject=Subject), FUN=sd)
   Subject sd.Wt
1        6     0
2        7     0
3        8     0
4       11     0
5        3     0
6        2     0
7        4     0
8        9     0
9       12     0
10      10     0
11       1     0
12       5     0

A weight group variable can be created using the cut command.
> Wt.gp <- cut(Wt, br=c(0, 70, 100), labels=c("<70 kg", "70+ kg"))
> pack()
> followup.plot(Subject, Time, conc, by=Wt.gp, las=1,
main="Pharmacokinetics of theophylline")

[Figure: follow-up plot titled "Pharmacokinetics of theophylline", lines grouped by Wt.gp (<70 kg, 70+ kg); x-axis Time 0–25, y-axis conc 0–10.]


A comparison may perhaps be better visualised by aggregating the concentrations for each subject within the two weight groups at suitably chosen time points.
> aggregate.plot(conc, by=Time, group=Wt.gp, lwd=2, lty=c(1,3), las=1)
> title(ylab="Concentration (mg/L)", xlab="Time (hour)")

Note that because the time of drawing blood is not exact for each subject, the aggregate.plot command will group the time points into 4 "bins" by default. You may like to experiment with the "bin" arguments to see what effect they have on the graph.
> aggregate.plot(x=conc, by=Time, group=Wt.gp, bin.method="quantile")
> aggregate.plot(x=conc, by=Time, group=Wt.gp, bin.time=11,
bin.method="fixed")

In order to use the scheduled times, which were 15 minutes, 30 minutes, 1 hour and then 2, 3, 5, 7, 9, 12 and 24 hours after drug administration, a new vector is needed.
> visit <- markVisits(Subject, Time)
> pack()
> aggregate.plot(conc, by=visit, group=Wt.gp)

In this graph, the distances between the time points do not reflect the actual times in hours. We need to change the visit times.
> scheduled.visit <- c(0,0.25,0.5,1,2,3,5,7,9,12,24)
> recode(visit, 1:11, scheduled.visit)
> aggregate.plot(conc, by=visit, group=Wt.gp, lwd=2, lty=c(1,3), las=1)

[Figure: "Mean and 95% CI of conc by visit and Wt.gp", with separate lines for the <70 kg and 70+ kg groups.]

Those weighing over 70kg start to have lower theophylline concentrations after 2 hours, but the difference is negligible at the end of the study.


Chapter 3
> zap()
> use(Sitka)
> auc.data <- auc(conc=size, time=Time, id=tree)
> treat.data <- reshape(Sitka, direction="wide", idvar="tree",
timevar="Time", v.names="size")[,1:2]
> data <- merge(auc.data, treat.data, by="tree")
> use(data)
> des()
> tableStack(auc, by=treat)
            control      ozone        Test stat.             P value
auc                                   t-test (77 df) = 1.45  0.151
  mean(SD)  535.4 (72.6) 512.6 (61.1)

Trees treated with ozone have a lower AUC on average; however, the difference is not significant.

Chapter 4
> zap()
> use(Xerop)
> des(); summ()
> table(table(id, time))
   0    1    2
 452 1196    2

The output indicates that there are 452 missing records; however, there are also 2 duplicates. These must be removed before continuing the analysis.
> id.dup <- id[duplicated(cbind(id, time))]
> .data[id %in% id.dup,]
> keepData(subset=!duplicated(cbind(id,time)))
> sortBy(id, time)
> visit <- markVisits(id, time)
> pack()
> table(time, visit)

The newly created 'visit' variable is not consistent with the 'time' variable due to the missing records.
> zap()
> use(Sitka)
> tmp <- by(.data, INDICES=tree,
FUN=function(x) lm(size ~ Time+I(Time^2), data=x))
> tree.coef <- sapply(tmp, FUN=coef)
> tree.growth <- as.data.frame( t(tree.coef) )
> use(tree.growth)


> des()
> tableStack(vars=2:4, by=treat, decimal=2)
              control            ozone              Test stat.   P value
(Intercept)                                         Ranksum test 0.5728
  median(IQR) -0.38 (-3.21,0.53) -1.01 (-2.51,0.16)
Time                                                Ranksum test 0.6849
  median(IQR) 0.05 (0.03,0.06)   0.05 (0.04,0.06)
I(Time^2)                                           Ranksum test 0.4966
  median(IQR) 0 (0,0)            0 (0,0)

None of the coefficients of the quadratic growth model differ significantly between the ozone and control groups.

Chapter 5
> zap()
> use(Xerop)
> des(); summ()
> table(time)
time
  1   2   3   4   5   6
230 214 177 183 195 201
> length(unique(id))
[1] 275

The visit times are not evenly distributed. Of the 275 subjects, only 230 came to the first visit (time=1). Subsequent visits are imbalanced. The two duplicates are now removed.
> keepData(subset=!duplicated(cbind(id,time)),
select=c(id, baseline.age, sex, respinfect, xerop, stunted, time))
> Xerop.wide <- reshape(.data, idvar="id", v.names=c("respinfect",
"xerop", "stunted"), timevar="time", direction="wide")
> summ(Xerop.wide)
> table(time)

Note that the first 3 variables ('id', 'baseline.age' and 'sex') all have 275 non-missing values, since we omitted them from the 'v.names' argument of the reshape command. The others have varying numbers of missing values, and the frequencies should match those from the last tabulation of the 'time' variable.
> with(Xerop.wide, addmargins(table(respinfect.1, respinfect.2)))
            respinfect.2
respinfect.1   0   1 Sum
         0   162   8 170
         1    21   2  23
         Sum 183  10 193


> with(Xerop.wide, mcnemar.test(table(respinfect.1, respinfect.2)))

        McNemar's Chi-squared test with continuity correction

data:  table(respinfect.1, respinfect.2)
McNemar's chi-squared = 4.9655, df = 1, p-value = 0.02586

There is a significant change in respiratory infection from visit 1 to visit 2. There were 23 who had respiratory infection at the start of the study, and only 10 at visit 2.

The change from visit 2 to visit 3 is not significant, as evidenced by the following commands.
> with(Xerop.wide, addmargins(table(respinfect.2, respinfect.3)))
            respinfect.3
respinfect.2   0 1 Sum
         0   148 8 156
         1     4 1   5
         Sum 152 9 161
> with(Xerop.wide, mcnemar.test(table(respinfect.2, respinfect.3)))

        McNemar's Chi-squared test with continuity correction

data:  table(respinfect.2, respinfect.3)
McNemar's chi-squared = 0.75, df = 1, p-value = 0.3865

Continuing in this fashion, you will discover that from visit 4 to visit 5, respiratory infection actually increases. Vitamin A deficiency (xerop) does not change significantly during any of the transitional periods. Stunting only changes significantly between visits 3 and 4.
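For example, the comparison between visits 4 and 5 follows exactly the same pattern (a sketch; the column names follow the reshape naming convention used above):
> with(Xerop.wide, addmargins(table(respinfect.4, respinfect.5)))
> with(Xerop.wide, mcnemar.test(table(respinfect.4, respinfect.5)))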

Chapter 6
> zap()
> data(Xerop)
> use(Xerop)
> id.dup <- id[duplicated(cbind(id, time))]
> .data[id %in% id.dup,]
> keepData(subset=!duplicated(cbind(id,time)))
> Xerop.all <- addMissingRecords(.data, id, time,
outcome=c("season","respinfect", "xerop","stunted"))
> use(Xerop.all)
> summ()
> tableStack(vars=season, by=present)
         0          1          Test stat.          P value
season                         Chisq. (3 df) = 22  < 0.001
  1       92 (20.4) 183 (15.3)
  2      126 (27.9) 424 (35.4)
  3      136 (30.1) 414 (34.6)
  4       98 (21.7) 177 (14.8)


Seasons 1 and 4 had significantly lower attendance rates than the other 2 seasons.
> sortBy(id, time)
> present.next <- lagVar(present, id, time, lag = -1)
> pack()
> logistic.display(glm(present.next ~ xerop+respinfect, data=.data,
family=binomial))
Logistic regression predicting present.next

                    adj. OR(95%CI)   P(Wald's test) P(LR-test)
xerop: 1 vs 0       0.71 (0.34,1.5)  0.37           0.385
respinfect: 1 vs 0  0.87 (0.48,1.59) 0.659          0.663

Log-likelihood = -407.6334
No. of observations = 997
AIC value = 821.2667

There is no evidence that non-attendance was due to vitamin A deficiency or respiratory infection from the previous visit.
> logistic.display(glm(present ~ season + baseline.age + sex, data=.data,
family=binomial), crude=FALSE)
Logistic regression predicting present

                adj. OR(95%CI)         P(Wald's test) P(LR-test)
season: ref.=1                                        < 0.001
   2            1.7 (1.23,2.34)        0.001
   3            1.54 (1.12,2.11)       0.008
   4            0.91 (0.64,1.29)       0.589
baseline.age    0.9907 (0.9853,0.9962) < 0.001        < 0.001
sex: 1 vs 0     1.04 (0.83,1.3)        0.724          0.724

Log-likelihood = -952.3859
No. of observations = 1650
AIC value = 1916.7717

Both season and age at baseline are significant predictors of attendance. As baseline age increases, the chance of attending decreases.

Chapter 7
> library(geepack)
> library(epicalc)
> zap()
> data(respiratory)
> use(respiratory)
> des(); summ()
> table(table(id, visit))


The output is interesting. According to the help page, two centers were involved in this study. Each center assigned a running number for the id, thus explaining the duplicates. We need to create a new id variable.
> id2 <- paste(center,id,sep="")
> label.var(id2, "patient ID")
> table(id2, visit)

Now there are no duplicates. This dataset is not exactly in wide or long form. The outcome is respiratory status, which was measured on the 111 patients at baseline and at 4 subsequent visits. Thus, the outcome is saved in two separate variables, and we need to reshape the dataset so that it is contained in only one variable. First create a dataset containing just the records from the baseline.
> Baseline <- .data[visit==1,]

Next, change the values of the 'visit' variable all to 0 and create a new outcome variable called 'resp'.
> Baseline$visit <- 0
> Baseline$resp <- Baseline$baseline

Finally, do the same to the original dataset and append them together.
> .data$resp <- .data$outcome
> data <- rbind(Baseline, .data)
> use(data)
> sortBy(id2, visit)
> head(.data,30)

There should now be 555 records, with 111 patients having their respiratory status measured at 5 visits. Modelling can now proceed.
> resp.ex <- geeglm(outcome~treat+center+age+sex, id=id2, data=.data,
family="binomial", corstr = "exchangeable")
> summary(resp.ex)

Only 'treatment' and 'center' are significant. Patients from center "2", as well as those given the active treatment, were more likely to have a “good” respiratory status. Note that the coding for the outcome variable is 1=good, 0=poor.

Chapter 8

Chapter 9
> library(geepack)
> library(epicalc)
> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> des(); summ()
> sortBy(ID, week)


> visit <- markVisits(ID, week)
> pack()
> table(week, visit)

Note the differences between the two variables, due to the missing records.
> cc(y, ap)

The risk of infection is 2.3 times higher for the placebo group.
> y.lag1 <- lagVar(y, ID, visit)
> pack()
> mhor(y, ap, y.lag1)

The risk of bacterial infection is still 2.3 times higher in the placebo group after adjusting for previous infection. We conclude that the active treatment is protective against the bacteria regardless of whether the child was infected at the preceding visit or not.
> logistic.display(glm(y ~ y.lag1+ap, family=binomial, data=.data))

Note that the crude odds ratio for 'ap' is not the same as the one obtained above with the cc command. This is because the records from the first visit have no preceding infection status and are therefore excluded from both the adjusted analysis and the logistic regression model. Assessing confounding from the glm output, where the crude and adjusted odds ratios are based on the same records, is therefore more accurate than comparing the results of the cc and mhor commands. In this case there is moderate confounding by previous infection status.
> logistic.display(glm(y ~ y.lag1+trt, family=binomial, data=.data))

There are three treatment groups now. One group was given no active treatment at all (placebo). The second and third groups were given the active treatment and were further randomised to receive active encouragement to comply with treatment (drug+) or not (drug).

The effect (OR) of 'y.lag1' is similar to that in the model which included 'ap'. However, while the likelihood ratio test (LR-test) for 'ap' is significant (p=0.038), it is not for 'trt' (p=0.096). This is because the risks for the "drug" and "drug+" groups are rather close. The LR-test reports the effect of 'trt' as a whole; it tells us that there is not sufficient evidence that the three treatment groups have different risks of infection after adjusting for preceding infection. The Wald's tests of this set of variables tell a slightly different story. The group receiving treatment with low compliance (drug) has a significantly lower risk of infection compared to the placebo group, but treatment with high compliance is no better than placebo (one may wonder whether the data contain a coding error). However, the odds ratio is still fairly strong in the protective direction (0.5), with quite a wide 95% CI. We conclude that the sample size (170 valid observations) may not be large enough.