STAT:5201 Applied Statistics II
Random Coefficient Model (a.k.a. growth curve)
Adapted from John Fox's example in Appendix: Linear Mixed Models, found in An R and S-PLUS Companion to Applied Regression.
Exercise trend over time
The data consist of the exercise histories of 138 teenaged girls hospitalized for eating disorders and of a group of 93 control subjects. The variables are:
subject: a factor with subject id codes.
age: age in years.
exercise: hours per week of exercise.
group: a factor indicating whether the subject is a ‘patient’ or ‘control’.
> library(car)
> data(Blackmore)
> head(Blackmore)
subject age exercise group
1 100 8.00 2.71 patient
2 100 10.00 1.94 patient
3 100 12.00 2.36 patient
4 100 14.00 1.54 patient
5 100 15.92 8.63 patient
6 101 8.00 0.14 patient
> dim(Blackmore)
[1] 945 4
> length(unique(Blackmore$subject[Blackmore$group=="patient"]))
[1] 138
> length(unique(Blackmore$subject[Blackmore$group=="control"]))
[1] 93
Fox transformed the response variable for numerous reasons (described in the text) as log2(exercise + 5/60).
> Blackmore$log.exercise <- log(Blackmore$exercise + 5/60, 2)
> attach(Blackmore)
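As a quick sanity check on the transform (a base-R sketch, not part of Fox's code): the 5/60 offset corresponds to five minutes, so a subject reporting 0 hours per week maps to a finite value rather than to negative infinity.

```r
# The 5/60 offset (five minutes) keeps zero-exercise records finite:
log2(0 + 5/60)    # -3.584963, the value seen for zero-hour records
log2(1 + 5/60)    # slightly above 0; one hour/week sits near y = 0
```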
Investigating the data with plots (in R).
Use a random sample of 20 girls from each group for trend plotting. The ‘groupedData’ object from the nlme package is used to form the trellis plots.
> library(nlme)
> chosen.pat.IDs=sample(unique(subject[group=="patient"]), 20)
> chosen.pat.20=groupedData(log.exercise ~ age | subject,
data=Blackmore[is.element(subject,chosen.pat.IDs),])
> chosen.con.IDs=sample(unique(subject[group=="control"]), 20)
> chosen.con.20=groupedData(log.exercise ~ age | subject,
data=Blackmore[is.element(subject,chosen.con.IDs),])
> print(plot(chosen.con.20, main="Control Subjects",xlab="Age",ylab="log2 Exercise",
ylim=1.2*range(chosen.con.20$log.exercise, chosen.pat.20$log.exercise),
layout=c(5,4),aspect=1),position=c(0, 0, 0.5, 1), more=T)
> print(plot(chosen.pat.20, main="Patients",xlab="Age",ylab="log2 Exercise",
ylim=1.2*range(chosen.con.20$log.exercise, chosen.pat.20$log.exercise),
layout=c(5,4),aspect=1),position=c(0.5, 0, 1, 1))
[Trellis plots: log2 Exercise vs. Age, one panel per subject; left set titled "Control Subjects", right set titled "Patients".]
The groupedData object is automatically plotted in order of average exercise. The subjects with the highest exercise values are in the top row, and the subjects with the lowest exercise values are in the bottom row.
Investigating the data with plots (in SAS).
After I created subsetted data sets of subjects from each group called ‘control1’ and ‘patient1’ (8 subjects per group), I used the PROC SGPANEL procedure to plot the individual trajectories. Here I’ve asked for a linear regression line for each subject, but you can simply connect the observed points using the series statement instead of the reg statement.
proc sgpanel data=control1;
title 'Control Subjects';
panelby subject/columns=4 rows=2;
reg x=age y=log_exercise;
rowaxis min=-4 max=4;
colaxis values=(8, 10, 12, 14, 16);
run;
proc sgpanel data=patient1;
<similar coding>
You can also plot the overlay of these individual lines using PROC SGPLOT...
proc sgplot data=control1;
title 'Subset of Control Subjects';
reg x=age y=log_exercise/group=subject;
run;
Investigating the subject-specific parameter estimates (in R).
Fox formally fits a linear regression to each subject (231 separately fit models) in order to investigate the variability and correlation in the slope and intercept estimates from a graphical perspective. The predictor age is transformed to represent age after the start of the study, or age − 8. He points out that the random coefficients model (fitted to all the data) fits a ‘unified’ model that treats slopes and intercepts as random effects, and in that case the estimated random effects u are estimated using BLUPs (best linear unbiased predictors). For a model with independent random subject effects (i.e. just a random intercept), the BLUPs are actually shrinkage estimators and fall between the individual observed values and the overall mean values. Formally, the BLUPs are estimated as u = GZ′Σ⁻¹(y − Xβ), where Σ = var(y) = ZGZ′ + R.
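To make the shrinkage idea concrete, here is a toy base-R sketch for the random-intercept case; all of the numbers are made up for illustration, not the Blackmore estimates. For a subject with n_i observations, the BLUP scales the subject's observed deviation from the overall mean by n_i*sigma2.u / (n_i*sigma2.u + sigma2.e).

```r
# Toy shrinkage illustration (random intercept only; hypothetical numbers)
sigma2.u <- 2.0    # between-subject variance (from G)
sigma2.e <- 1.5    # residual variance (from R)
n.i      <- 4      # observations on subject i
ybar.i   <- 1.8    # subject i's observed mean
mu.hat   <- 0.5    # estimated overall mean
shrink <- n.i * sigma2.u / (n.i * sigma2.u + sigma2.e)  # about 0.84
u.i    <- shrink * (ybar.i - mu.hat)
u.i    # about 1.09: between 0 (complete pooling) and 1.3 (no pooling)
```

More observations on a subject (larger n.i) push the shrinkage factor toward 1, so the BLUP moves toward the subject's own mean.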
Before moving to a mixed model, we consider truly fitting a separate line to each subject (so not a random coefficients model). Again, the nlme package is utilized, here by employing the lmList function:
> pat.list=lmList(log.exercise ~ I(age - 8) | subject, subset = group=="patient",
data=Blackmore)
> con.list=lmList(log.exercise ~ I(age - 8) | subject, subset = group=="control",
data=Blackmore)
> pat.coef=coef(pat.list)
> con.coef=coef(con.list)
> par(mfrow=c(1,2))
> boxplot(pat.coef[,1], con.coef[,1], main="Intercepts", names=c("Patients","Controls"))
> boxplot(pat.coef[,2], con.coef[,2], main="Slope", names=c("Patients","Controls"))
[Side-by-side boxplots of the estimated intercepts and slopes, patients vs. controls.]
The intercept represents the level of exercise at the start of the study. As expected, there is a great deal of variation in both the intercepts and the slopes. The median intercepts are similar for patients and controls, but there is somewhat more variation among patients. The slopes are higher on average for patients than for controls, and the slopes tend to be positive (suggesting their exercise increases over time).
It makes sense to also plot the relationship between the estimated intercept and slope parameters. The dataEllipse function is in the car package.
> plot(c(-5,4),c(-1.2,1.2),xlab="intercept",ylab="slope",type="n",
main="(Individual) Estimates of slope and intercept")
> points(con.coef[,1],con.coef[,2],col=1)
> points(pat.coef[,1],pat.coef[,2],col=2)
> abline(v=0)
> abline(h=0)
> legend(-4.5,-.7,c("Controls","Patients"),col=c(1,2),pch=c(1,1))
> dataEllipse(con.coef[,1],con.coef[,2],levels=c(.5,.95),add=TRUE,
plot.points=FALSE,col=1)
> dataEllipse(pat.coef[,1],pat.coef[,2],levels=c(.5,.95),add=TRUE,
plot.points=FALSE,col=2)
[Scatterplot titled "(Individual) Estimates of slope and intercept": slope vs. intercept, with 50% and 95% data ellipses for Controls and Patients.]
Recall that we are on the log scale, base 2, for our response, so y = 0 coincides with 1 hour of exercise a week (this is the scale for the intercept, but not the slope). It looks like the two groups have a reasonably similar correlation structure for the slope and intercept. It also looks like the patients have a shifted distribution such that they tend to have higher slopes.
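A slope on this log2 scale has a multiplicative interpretation; a quick base-R illustration (the slope value 0.25 is chosen arbitrarily for the example):

```r
# y = log2(exercise + 5/60), so y = 0 is 2^0 = 1 hour/week, and a slope of
# b multiplies (exercise + 5/60) by 2^b per year of age:
2^0       # 1 hour/week at y = 0
2^0.25    # a slope of 0.25 => about a 19% increase per year
```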
Fitting the random coefficients model (in SAS)
This model allows for a random slope and intercept for each subject (which are allowed to be correlated). The population-level mean structure allows for separate lines for each group (control and patient). The predictor age is transformed to represent age after the start of the study, or age − 8.
data Blackmore; set Blackmore;
age_trans = age-8;
run;
proc mixed data=Blackmore;
class subject group;
model log_exercise = group age_trans group*age_trans/solution ddfm=satterth;
random intercept age_trans/subject=subject type=un gcorr;
run;
The Mixed Procedure
Dimensions
Covariance Parameters 4
Columns in X 6
Columns in Z Per Subject 2
Subjects 231
Max Obs Per Subject 5
Estimated G Correlation Matrix
Row Effect subject Col1 Col2
1 Intercept 100 1.0000 -0.2808
2 age_trans 100 -0.2808 1.0000
Covariance Parameter Estimates
Standard Z
Cov Parm Subject Estimate Error Value Pr Z
UN(1,1) subject 2.0839 0.2901 7.18 <.0001
UN(2,1) subject -0.06681 0.03698 -1.81 0.0708
UN(2,2) subject 0.02716 0.007975 3.41 0.0003
Residual 1.5478 0.09743 15.89 <.0001
We see that the correlation between b0i and b1i is estimated to be negative (ρ̂ = −0.2808) and marginally significant, with p = 0.0708.
Solution for Fixed Effects
Standard
Effect group Estimate Error DF t Value Pr > |t|
Intercept -0.6300 0.1487 230 -4.24 <.0001
group control 0.3540 0.2353 234 1.50 0.1338
group patient 0 . . . .
age_trans 0.3039 0.02386 196 12.73 <.0001
age_trans*group control -0.2399 0.03941 221 -6.09 <.0001
age_trans*group patient 0 . . . .
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
group 1 234 2.26 0.1338
age_trans 1 221 87.16 <.0001
age_trans*group 1 221 37.05 <.0001
The groups do not have significantly different intercepts (average exercise values at the start of the study, at age 8), but they do have significantly different slopes, with the patient group having a higher slope than the control group.
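The fitted group-specific lines can be read directly off the ‘Solution for Fixed Effects’ table above (patient is the reference level in this SAS parameterization); a small base-R check:

```r
# Assemble the two fitted lines from the fixed-effect estimates
b0     <- -0.6300   # intercept (patient group, at age 8)
g.con  <-  0.3540   # control shift in intercept
b1     <-  0.3039   # slope (patient group)
g1.con <- -0.2399   # control shift in slope
patient.line <- c(intercept = b0,         slope = b1)           # -0.630, 0.304
control.line <- c(intercept = b0 + g.con, slope = b1 + g1.con)  # -0.276, 0.064
```

The control slope of about 0.064 is much smaller than the patient slope of about 0.304, matching the significant age_trans*group interaction.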
I can capture the estimated BLUPs, u = GZ′Σ⁻¹(y − Xβ) where Σ = var(y) = ZGZ′ + R, using the ODS output and the solution option in the random statement:
ods output SolutionR=blups;
proc mixed data=Blackmore covtest;
class subject group;
model log_exercise = group age_trans group*age_trans/ddfm=satterth;
random intercept age_trans/subject=subject type=un gcorr solution;
run; /* Solution for the random effects are BLUPs*/
ods output close;
proc print data=blups (obs=10);
run;
StdErr
Obs Effect subject Estimate Pred DF tValue Probt
1 Intercept 100 1.0095 0.7092 235 1.42 0.1560
2 age_trans 100 -0.05272 0.1261 69.8 -0.42 0.6771
3 Intercept 101 -2.1614 0.7094 256 -3.05 0.0026
4 age_trans 101 0.01287 0.1221 79.5 0.11 0.9163
5 Intercept 102 0.9339 0.7161 266 1.30 0.1933
6 age_trans 102 0.1258 0.1353 53.1 0.93 0.3567
7 Intercept 103 0.9283 0.7101 250 1.31 0.1923
8 age_trans 103 0.02691 0.1413 44.5 0.19 0.8498
9 Intercept 104 1.1407 0.7177 273 1.59 0.1131
10 age_trans 104 -0.03742 0.1332 56.7 -0.28 0.7798
proc export data=blups
outfile="blups.csv"
dbms=csv replace;
run;
Below I’ve plotted the estimated BLUPs for the random slopes against the estimated slopes from the separately fit regression lines (in absolute values).
[Scatterplot titled "Absolute values of slopes": |BLUP of slope| vs. |separately fit slope (individual regression)|, both axes from 0 to 1.2.]
Fitting the random coefficients model (in R)
Using the lme function in the nlme package, we see the same estimates for the covariance parameters:
> lme.1=lme(log.exercise~I(age-8)*group, random=~I(age-8)|subject, data=Blackmore)
> summary(lme.1)
Linear mixed-effects model fit by REML
Data: Blackmore
Random effects:
Formula: ~I(age - 8) | subject
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 1.4435580 (Intr)
I(age - 8) 0.1647954 -0.281
Residual 1.2440951
Var(Intercept) = 1.4435580^2 = 2.0839
Var(slope) = 0.1647954^2 = 0.0272
Corr(Intercept, slope) = −0.0668 / (0.1647954 × 1.4435580) = −0.281
Var(Residual) = 1.2440951^2 = 1.5478
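The agreement with PROC MIXED can be verified by hand in base R, squaring the standard deviations reported by lme (lme reports SDs and a correlation; SAS reports variances and a covariance under type=un):

```r
# Convert the lme SD/correlation output to SAS variance/covariance form
sd.int   <- 1.4435580
sd.slope <- 0.1647954
rho      <- -0.281
sd.res   <- 1.2440951
round(sd.int^2, 4)                  # 2.0839  = SAS UN(1,1)
round(sd.slope^2, 4)                # 0.0272  ~ SAS UN(2,2) = 0.02716
round(rho * sd.int * sd.slope, 4)   # -0.0668 ~ SAS UN(2,1) = -0.06681
round(sd.res^2, 4)                  # 1.5478  = SAS Residual
```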
Reference: Pinheiro, J.C. & Bates, D.M. 2000. Mixed-Effects Models in S and S-PLUS. New York: Springer.
SAS Hint when using PROC IMPORT (Error in trying to import data)
proc import out=Blackmore
datafile="Blackmore.csv"
dbms=csv
replace;
run;
LOG BOX SHOWS ERROR READING IN DATA...
NOTE: Invalid data for subject in line 442 1-4.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+
442 207a,8,0.17,control,-1.980891177 32
subject=. age=8 exercise=0.17 group=control log_exercise=-1.980891177
_ERROR_=1 _N_=441
What’s happening in the data on line 442?
> Blackmore[438:443,]
subject age exercise group log.exercise
206 8.00 0.89 control -0.03899413
206 10.00 0.89 control -0.03899413
206 12.08 1.03 control 0.15488560
207a 8.00 0.17 control -1.98089118 <--- The subject ID is suddenly
207a 10.00 0.00 control -3.58496250 no longer numeric
207a 14.00 1.15 control 0.30256277
By default, PROC IMPORT uses a certain number of observations in each column to decide whether the variable is numeric or character. Here, by the time subject 207a is read in, SAS had already decided the column is numeric, and then decides that a value of ‘207a’ (character) is not an appropriate entry. Thus, it considers it incorrect and calls it missing. The solution is to force SAS to look farther down the column before making this decision, using ‘guessingrows=X’ for some large X.
proc import out=Blackmore
datafile="Blackmore.csv"
dbms=csv
replace;
guessingrows=1000;
run;
NOTE: WORK.Blackmore data set was successfully created.
NOTE: The data set WORK.Blackmore has 945 observations and 5 variables.