Regression calibration for Cox regression under ... · Cox regression under heteroscedastic error 4...

Augustin, Döring, Rummel:

Regression calibration for Cox regression underheteroscedastic measurement error - Determining riskfactors of cardiovascular diseases from error-pronenutritional replication data

Sonderforschungsbereich 386, Paper 345 (2003)

Online unter: http://epub.ub.uni-muenchen.de/

Projektpartner

http://www.stat.uni-muenchen.de/

http://www.gsf.de/

http://www.mpg.de/

http://www.tum.de/

Regression calibration for Coxregression under heteroscedastic

measurement error — Determining riskfactors of cardiovascular diseases fromerror-prone nutritional replication data

T. Augustin∗

MunichDepartment of Statistics

A. DoringNeuherberg

GSF–National Research Center

for Environment and Health

D. RummelMunich

Department of Statistics

Abstract

For instance nutritional data are often subject to se-vere measurement error, and an adequate adjustment ofthe estimators is indispensable to avoid deceptive conclu-sions. This paper discusses and extends the method ofregression calibration to correct for measurement error inCox regression. Special attention is paid to the modellingof quadratic predictors, the role of heteroscedastic mea-surement error, and the efficient use of replicated mea-surements of the surrogates. The method is used to ana-lyze data from the German part of the MONICA cohort

∗Corresponding author: T. Augustin, Department of Statistics, Universityof Munich, Ludwigstr. 33, D-80539 Munich, Germany, [email protected]

1

Cox regression under heteroscedastic error 2

study on cardiovascular diseases. The results corroboratethe importance of taking into account measurement errorcarefully.

Keywords: Error-in-variables; replication data; heteroscedasticmeasurement error in quadratic variables; Cox model; regressioncalibration; MONICA/KORA study.

1 Introduction

A widespread problem in applying regression analysis is the pres-ence of measurement error. Often the variables of interest, calledideal variables or gold standard, cannot be observed directly ormeasured correctly, and one has to be satisfied with so called sur-rogates (often also named indicators or proxies), i.e., with some-how related, but different variables. If one ignores the differencebetween the ideal variables in the model and the observable vari-ables and just plugs in the surrogates instead of the variables(‘naive estimation’), then all the estimators must be suspectedto be severely biased. Error-in-variables modelling provides amethodology, which is serious about that fact. Based on an errormodel describing the relation between ideal variables and surro-gates, it develops procedures to adjust for the measurement error.Recent surveys on measurement error modelling, also containingmany examples from different fields of application, include Chengand van Ness [1], who concentrate on linear models, and Car-roll, Ruppert, and Stefanski [2], Stefanski [3], and Van Huffel andLemmerling [4], who are concerned with non-linear models.

It should be stressed explicitly that the topic of measurement er-ror is not simply a matter of sloppy research; quite often the ‘truevalue’ is unascertainable eo ipso. A typical example is the record-ing of the protein intakes in surveys on eating habits and theirinfluence on certain diseases. Though much attention is paid tothe high quality of the questionnaire and the subsequent proce-dures, a considerable random distortion in the data can not beavoided. Below we analyze data from the WHO MONICA Augs-burg substudy on the surveillance of dietary intake, see [5, 6].This study, which is embedded into the WHO MONICA project


(MONItoring of trends and determinants in CArdiovascular dis-ease), is concerned with the question whether changes in dietaryintake can explain trends in the incidence and mortality of cardiacinfarctions. Indeed severe error is present in the measurements ofthe animal and plant protein intake from a seven day food diary,and so applying Cox regression without adjusting for the errorcould lead to wrong conclusions. The MONICA Augsburg studyis currently continued as the KORA study (Cooperative healthresearch in the area of Augsburg).

Recently the quality of Swedish nutrition data was investigated in[7], where the reproducibility of food frequency measurements ofa sample of respondents to the Swedish MONICA study was con-sidered. It may be mentioned that, if such local studies were com-bined and compared, additional measurement error would arise:It is quite important to take into account the variation in theseaggregated observations (cf. [8]).

Covariate measurement error correction in Cox regression is cur-rently an area of intensive and fruitful research (see, in particular,[9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] — as well as the surveyand comparison of basic approaches in [20]).

In this paper we rely on a variant of the regression calibrationapproach, which is one of the most universal methods to correctfor measurement error (see [2, Chapter 3] for a general descrip-tion). Its basic idea is to run a standard analysis where the un-observable variables are replaced by values predicted from theobservable ones. For Cox regression, regression calibration typemethods were introduced by Prentice ([21]) and were studied anddeveloped further in [22, 23, 24, 9] and [17].

Here we adapt and extend this method taking into account threegeneral methodological issues, which also deserve special attentionin the data analyzed below:

• Heteroscedastic measurement error. Recent research in nu-tritional epidemiology strongly suggests that the measure-ment error must be expected to vary considerably amongthe different study participants (cf., e.g., [25, pp. 33-48]).

• The presence of replication data. The protein intake mea-


surements are based on diaries, where all food intake hadto be recorded in great detail for seven days. Taking for ev-ery individual the errors in these measurements as indepen-dently and identically distributed gives us the opportunityto estimate the error variances.

• The non-linearity of the influence. Pre-studies showed thatthe effect of protein intake on morbidity and mortality couldbe nonlinear: both types of extreme intakes, very high aswell as very low intakes, could be detrimental, and so itis of great importance to work with quadratic predictors.While introducing non-linearity in the covariates does notencounter much difficulty in the error-free situation, undermeasurement error it is often hard or even impossible to han-dle non-linear terms. (For the problems already arising inthe linear polynomial model see, e.g., [26]. For some modelsa general result [27, Theorem 1] can be used to prove eventhe non-existence of a so-called corrected score function.)

As shown below, the convenience of regression calibration is main-tained in this extended setting; still the core parts of the estima-tion can be done by standard software packages. Applying thiscorrection method shows a complex relationship between naiveand corrected estimates. After having adjusted for measurementerror, some of the estimates change substantially, others do not.Sometimes there is a high deattenuation, sometimes the absolutevalues even get smaller. Since, however, regression calibration isknown to be only an approximative correction method, reducingthe bias but not necessarily producing consistent estimators, weunderstand our analysis more as a motivation for further method-ological development than as the last word on the topic.

The paper is organized as follows: The next section describes ourmodelling of the replication data. Section 3 adapts the idea ofregression calibration to replication data and to quadratic pre-dictors. The application to the MONICA data is reported inSection 4, while Section 5 concludes by sketching some topics forfurther research.


2 Survival Data with Replicated Co-

variate Measurements

2.1 The Main Setting

Let n be the sample size and T1, . . . , Tn the lifetimes, which maybe subject to noninformative independent censorship in the senseof, e.g., [28]. For every i = 1, . . . , n we split the vector of covari-ates into a vector Xi and a vector Zi. All error-prone variablesare collected in Xi, while Zi consists of the correctly measuredvariables. Let all elements of Xi be measured on a metrical scale,Zi may contain metrical and categorical covariates in 0/1-coding.Both types of covariates should not be time-varying. With the ap-plication below in mind, we additionally consider another vector,

denoted by X➁i , which contains the squared elements of Xi.

We assume that Cox’s ([29]) proportional hazard model describesthe relationship between the lifetimes and the covariates; the in-dividual hazard rate λ(t|Xi, Zi) has the form

λ(t|Xi, Zi) = λ0(t) · exp(β′

1Xi + β′2X

➁i + β′

ZZi

), (1)

with the unspecified baseline hazard rate λ0(t) and the regressionparameter vector β = (β′

1, β′2, β

′Z)′.

For Xi, i.e., plant and animal protein in the application discussedbelow, replicated measurements Wi1, . . . ,Wik, k > 1 (later on,k=7) are available for every unit i. We assume them to follow theadditive error model

Wij = Xi + Uij, j = 1, . . . , k, i = 1, . . . , n, (2)

and make the usual assumptions: The errors (Uij), j = 1, . . . , k,i = 1, . . . , n should have zero mean (see however the last para-graphs of this section) and they should be independent amongeach other as well as of X1, . . . , Xn and T1, . . . , Tn. It will proveimportant to allow for heteroscedasticity of the errors, where, for ifixed, Ui1, . . . , Uik are i.i.d., but the covariance matrix Σi may vary


among the units i = 1, . . . , n. The common covariance matrix inthe homoscedastic case will be denoted by Σ .

In a naive analysis, for every unit i, the individual average

W i :=1

k

k∑j=1

Wij

would function as the surrogate for Xi. Additionally defining

U i :=1

k

k∑j=1

Uij

leads us back to the classical error model

W i = Xi + U i, i = 1, . . . , n, (3)

with E(U i) = 0 and V(U i) = 1k·Σi . The particular attractiveness

of replication data is based on the fact that the measurementerror variances can be estimated from the data. Therefore, incontrast to most cases relying on the classical error model, it ispossible here to avoid additional assumptions, which are quiteoften difficult to justify.

2.2 A Note on Systematic Measurement Error

Before addressing this topic, the assumption E(Uij) = 0 deservessome attention. If it is violated, i.e. if systematic measurementerror with E(Uij) = a �= 0 with a unknown is present, then it be-comes important to distinguish whether the covariates act merelylinearly or also in a nonlinear way. In order to bring out thispoint most clearly, concentrate on the following special case: Xi

is one-dimensional, there are no error-free covariates Zi, and thereis only a deterministic error a so that (3) reads as

W i = Xi + a, i = 1, . . . , n.

In the case of no quadratic influence, where β2 ≡ 0 a priori,Relation (1) can be written as

λ0(t) · exp(β1W i) = λ0(t) · exp(β1a + β1Xi)(4)

=: λ∗0(t) · exp(β1Xi) .


Therefore, the naive partial likelihood estimator based on replac-ing Xi by W i still estimates β1 consistently, and a bias only occursin the estimation of λ0(t), where the naive standard methods es-timate λ∗

0(t) = λ0(t) · exp(β1a) instead of λ0(t) itself. If, however,quadratic terms are taken into account, then we have to consider

λ0(t) · exp(β1W i + β2W2

i )

= λ0(t) · exp(β1a + β1Xi + β2X2i + 2β2aXi + β2a

2)

=: λ∗∗0 (t) · exp(β1Xi + β2X

2i + 2β2aXi) ,

and also inconsistencies in the estimation of the regression param-eters must be expected.

3 Regression Calibration under Repli-

cation Data

3.1 The Basic Concept

Regression calibration (cf., in particular, [2, Chapter 3]) is an uni-versally applicable, easy-to-handle method to adjust for measure-ment error. The main idea is to utilize the surrogate W i, togetherwith the error-free variable Zi, to predict the corresponding valueof the unobservable variable Xi, and then to proceed with a stan-dard analysis where Xi is replaced by its prediction Xi.

Applying this concept, the vector (X′i , Z

′i)

′ of covariates is as-sumed to be i.i.d., with unknown mean vector (μ′

X , μ′Z)′ and un-

known covariance matrix(ΣX,X ΣX,Z

Σ′X,Z ΣZ,Z

).

Based on Relation (3), the best linear prediction of Xi given Wi

and Zi is

Xi = μX + (ΣX,X ΣX,Z)(5)

×(

ΣX,X + 1kΣi ΣX,Z

Σ′X,Z ΣZ,Z

)−1(W i − μX

Zi − μZ

).


If additionally Xi, Ui and Zi are Gaussian then (5) is exactly theconditional expectation of Xi given Wi and Zi.

Under replication data all nuisance parameters in (5), i.e. theparameters μX , μZ , ΣX and ΣZ of the distribution of (X ′

i, Z′i)

′ aswell as the measurement error variances Σi, can be estimatedpowerfully from the data. We firstly adopt the procedure for thehomoscedastic case (Σi ≡ Σ), taken from [2, p. 47f.], and thenpresent the generalization to the heteroscedastic case.

3.2 The Case of Homoscedastic MeasurementError

Equation (3) immediately suggests the overall mean

W :=1

n

n∑i=1

W i (6)

as an unbiased estimator for μX ; analogously μZ is estimated byZ := 1

n

∑ni=1 Zi.

In order to derive estimators for the other parameters, it is illu-minating to embed the situation under homoscedastic measure-ment error into the theory of design of experiments. Then (2) isreinterpreted as a one-factorial model with a random effect (e.g.,Toutenburg ([30], pp. 147-150)), yielding the estimators

Σ =1

n(k − 1)·

n∑i=1

k∑j=1

(Wij − W i

) (Wij − W i

)′(7)

ΣX,X =

(1

n − 1·

n∑i=1

(W i − W

) (W i − W

)′) − 1

k· Σ (8)

ΣX,Z =1

n − 1·

n∑i=1

(W i − W

) (Zi − Z

)′(9)

ΣZ,Z =1

n − 1·

n∑i=1

(Zi − Z

) (Zi − Z

)′. (10)


3.3 The Case of Heteroscedastic MeasurementError

Under heteroscedastic measurement error, the relation

V(Uij) = V(Uij|Xi) = V(Xi|Xi) + V(Uij|Xi)

= V(Xi + Uij|Xi)

= V(Wij|Xi)

plays a central role. It provides

Σi =1

(k − 1)·

k∑j=1

(Wij − W i

) (Wij − W i

)′, i = 1, . . . , n, (11)

as an estimator for the error covariance matrices Σi at the indi-vidual level. ΣX,Z and ΣZ,Z are estimated in the same way as in(9) and in (10). To get an idea how to estimate ΣX,X it is helpfulto apply the covariance decomposition formula

Cov(W i[l1],W i[l2]) = Cov(E

(W i[l1] |Xi

), E

(W i[l2]

∣∣ Xi

))+ E

(Cov

(W i[l1],W i[l2]

∣∣ Xi

))to every pair (W i[l1],W i[l2]) of components of W i. For the covari-ance matrices this finally yields, in somewhat informal notation,the relation

V(W i) = V(E(W i|Xi)

)+E

(V(W i|Xi)

)= V(Xi)+E

(V(U i|Xi)

),

which suggests to generalize (8) by using the pooled version

ΣX,X =1

n·

n∑i=1

((W i − W

) (W i − W

)′ − 1

k· Σi

). (12)

3.4 Calibrating the Quadratic Part

The most consequential way to deal with the quadratic part X➁i is

to replace every component (Xi[l])2 of X➁

i by the square(Xi[l]

)2

of the corresponding component Xi[l] of Xi. Alternatively to thisprocedure, which is also pursued in the analysis below, one could


prefer to calibrate (Xi[l])2 ‘directly’ by an appropriate approxi-

mation to E((Xi[l])2|Wi, Zi). By means of the relation

E

((Xi[l])

2|Wi, Zi

)=

(E

(Xi[l]|Wi, Zi

))2

+ V(Xi[l]|Wi, Zi)

≈(Xi[l]

)2

+ V(Xi[l]|Wi, Zi)

and arguments very similar to (4), both approaches lead to thesame estimator for β as long as V(Xi[l]|Wi, Zi) does not depend oni. This is the case, for instance, if under homoscedastic Gaussianmeasurement error (X ′

i, Z′i)

′ are Gaussian, too.

4 Application to the MONICA Data

4.1 The Data

Within the WHO MONICA project (MONItoring of trends anddeterminants in CArdiovascular disease) also the influence of nu-trition was considered. We analyze data from a panel of the WHOMONICA substudy on the surveillance of dietary intake, con-ducted in 1984/1985 in Southern Germany, which is currentlycontinued as the KORA study (Cooperative health research inthe area of Augsburg), see [5, 6]. A subpopulation of 899 malerespondents, aged from 45 to 65, filled in a comprehensive diary.For seven consecutive days all meals had been listed in detail. Byusing a nutritional data base also containing standard recipes, nu-tritional variables were derived from the raw data given in every-day units like ladle or gram of certain ingredients. Among otherquestions the role of plant protein intake (PLANT in the tablesbelow) and animal protein intake (ANIMAL) was investigated.Though high attention has been paid to the exactness of the mea-surement procedure, substantial error in the calculation of proteinintake is unavoidable, and so we applied the correction methodsdeveloped above to adjust for it.

By a mortality and morbidity follow-up for more than 10 years,the respondents’ first cardiac infarctions (total number 71 of 858


observations) and deaths (114 cases of 892 observations1) hadbeen registered.2 The main interest focused on the influenceprotein intake had on the response variable which was definedas age at the event. In the analysis also confounders were in-corporated, namely cholesterol (mg/dl) (CHOL), daily alcoholconsumption (g/day) (ALC) as continuous variables, as well ashypertension (HYPER) and smoking3 (SMOKER) as categori-cal variables (1=yes, 0=no). The measurement error in thesevariables may be expected to be quite low compared to that inthe protein intakes, and so the confounders were treated as errorfree. The estimated regression coefficients are written in the formβ[V ARIABLE], i.e., β[PLANT ], β[ANIMAL], etc.

4.2 The Results

Table 1 and Table 2 summarize the results of naive and correctedproportional hazards regression. The first two columns belong tothe naive analysis, which used the seven-days averages of calcu-lated animal protein intake and of calculated plant protein intakeas surrogates for the true corresponding intake. They containthe naive estimates and the p-values based on them.4 Column 3and 5 report the corrected estimates after having adjusted for ho-moscedastic measurement error by the methods of Subsection 3.2,and for heteroscedastic measurement error along the lines of Sub-section 3.3, respectively. In Column 4 and 6 also “approximativep-values” are given, which, however, have to be used with particu-lar reservation here. They are based on the standard errors whichusual software calculates after every Xi was replaced by the corre-

1The number of overall observations slightly differs for the two events,because for some units there was no information about morbidity, but itcould be found out whether they died or survived the follow-up period.

2The median of the follow-up times with respect to the occurrence ofinfarction was 2302 days for the cases and 3996 days for the censored ob-servations. The median of the follow-up times concerning the death eventwas 2598.5 days for the cases and 4006 days for the censored observations,respectively.

3In this analysis persons who are currently smoking or are ex-smokerswere summarized into the smoker category.

4It may be noted explicitly that not only the naive estimators of theregression parameters are inconsistent, but also the estimators of the standarderror.


sponding Xi; they are only meant to give a very rough impressionand should not be taken literally. Correct estimators for the stan-dard error of regression calibration estimators are not straightfor-wardly found (cf. [2, Section 3.5 and Subsection 3.12.2]), and sowe used those easy available values as a rule of thumb to judgethe significance. Though they are not correct, they still shouldgive an impression of the correct magnitude.

In order to illustrate the overall influence of animal and plantprotein intake on morbidity and mortality, it is helpful to look atthe functions

f(x) = β[ANIMAL] · x + β[(ANIMAL)2] · x2 (13)

g(y) = β[PLANT ] · y + β[(PLANT )2] · y2 . (14)

They describe the effect of the animal protein intake x, and ofthe plant protein intake y, respectively, on the predictor in thehazard function in (1). The domains of x and y are chosen suchthat they cover approximately the whole range of the observedvalues. These functions are plotted below in Figure 3, where thedotted and dashed line corresponds to the naive estimation. Theresults, after having adjusted for homoscedastic or heteroscedas-tic measurement error, are plotted by thin and thick solid lines,respectively.

4.2.1 The Naive Analysis

For the naive analysis the seven-days averages of calculated an-imal protein intake and of calculated plant protein intake wereused as surrogates for the true corresponding intake in a propor-tional hazards regression. The naive analysis judges the linearand quadratic terms for animal protein to be significant at thefive percent level, and cholesterol to have a highly significant in-fluence on morbidity. For mortality the estimates β[PLANT ] andβ[(PLANT )2] are significant at least at the ten percent level, andhypertension becomes highly significant.


naiv

ees

tim

atio

nho

mos

ceda

stic

erro

rhe

tero

sced

asti

cer

ror

esti

mat

ep-

valu

ees

tim

ate

p-va

lue

esti

mat

ep-

valu

eA

NIM

AL

−5.6

2·1

0−5

0.03

66−1

.07·1

0−4

0.04

24−5

.49·1

0−5

0.27

85P

LA

NT

−2.7

7·1

0−5

0.78

87−1

.32·1

0−5

0.92

98−6

.60·1

0−5

0.64

15(A

NIM

AL

)24.

68·1

0−10

0.01

009.

09·1

0−10

0.01

615.

27·1

0−10

0.17

43(P

LA

NT

)21.

47·1

0−11

0.99

38−4

.10·1

0−10

0.88

224.

51·1

0−10

0.86

97C

HO

L8.

28·1

0−3

0.00

088.

14·1

0−3

0.00

107.

81·1

0−3

0.00

15H

YP

ER

4.60

·10−

10.

0627

4.70

·10−

10.

0578

4.71

·10−

10.

0573

SM

OK

ER

8.79

·10−

10.

0328

8.68

·10−

10.

0344

8.35

·10−

10.

0406

AL

C8.

00·1

0−5

0.98

315.

01·1

0−5

0.98

95−2

.03·1

0−5

0.99

57

Tab

le1:

Est

imat

esfo

rth

ein

fluen

ceon

mor

bid

ity


naiv

ees

tim

atio

nho

mos

ceda

stic

erro

rhe

tero

sced

asti

cer

ror

esti

mat

ep-

valu

ees

tim

ate

p-va

lue

esti

mat

ep-

valu

eA

NIM

AL

−1.0

1·1

0−5

0.72

96−1

.54·1

0−5

0.78

62−6

.43·1

0−6

0.89

32P

LA

NT

−1.1

5·1

0−4

0.06

04−1

.61·1

0−4

0.06

69−1

.72·1

0−4

0.03

39(A

NIM

AL

)24.

55·1

0−11

0.83

588.

23·1

0−11

0.85

013.

41·1

0−11

0.92

98(P

LA

NT

)22.

07·1

0−9

0.04

472.

94·1

0−9

0.05

003.

16·1

0−9

0.03

23C

HO

L7.

34·1

0−4

0.73

876.

77·1

0−4

0.75

975.

78·1

0−4

0.79

34H

YP

ER

5.43

·10−

10.

0047

5.41

·10−

10.

0049

5.42

·10−

10.

0049

SM

OK

ER

6.79

·10−

10.

0231

6.77

·10−

10.

0236

6.97

·10−

10.

0204

AL

C3.

00·1

0−3

0.29

243.

06·1

0−3

0.28

322.

86·1

0−3

0.31

62

Tab

le2:

Est

imat

esfo

rth

ein

fluen

ceon

mor

tality


The decisive question following the naive analysis now is: arethese result still valid if one takes into account the substantialmeasurement error which is naturally inherent in the protein in-take?

4.2.2 Adjusting for Homoscedastic Measurement Error

First the homoscedastic error model is considered. In order to ob-tain corrected estimates the regression calibration method basedon (5) and the estimators from (7) to (10) are applied. Column 3and 4 of Table 1 report the corrected estimates for the influenceon morbidity. In comparison to the naive estimates the effects ofanimal protein are estimated about twice as high; this results inthe thin solid line in Figure 3a) below. The point of minimal risk(x=59038) is about the same as in the naive analysis (x=60079),and also the zeros are equal in essence, but the curve is muchsteeper. β[PLANT ] is half as high as the naive estimate. Nowβ[(PLANT )2] has a negative sign, too. The corresponding func-tion g(y), which is depicted as the thin solid line in Figure 3b), isconcave and decreasing in y in a monotone way: the higher theplant intake the higher is the reduction of the risk by an additionalunit of intake.

The role of the confounders is more or less the same. The esti-mated strong influence of hypertension and smoking is confirmed.The regression parameter for alcohol intake changes its sign, butit remains insignificant.

Turning to mortality (cf. Table 2), the absolute values of the re-gression parameters of the linear and the quadratic terms in theprotein variables become higher by factors between 1.4 and 1.8,the effects of the confounders remain unchanged in essence. Fig-ure 3c) and Figure 3d) show the corresponding curves, which areof the same shape as those from the naive analysis, but run steeperagain.


4.2.3 Adjusting for Heteroscedastic Measurement Error

As discussed above, the presence of replication data also allows,for every unit i, i = 1, . . . , n, to estimate the covariance matrix Σi

of the error variable in animal protein intake and in plant proteinintake at the individual level (cf. Equation (11)). Even if onetakes into account that only seven observations are available toestimate Σi, the variation in the estimated variances (Figure 1 andFigure 2) is high enough that a detailed study of heteroscedasticmeasurement error is promising.

Figure 1: Estimated individual error variances for animal proteinintake: overall and detail figure.

The last two columns in Table 1 refer to the corrected estimatesfor morbidity, the corresponding curves are shown by the thicksolid lines in Figure 3a) and Figure 3b).

Compared to the analysis assuming homoscedastic measurementerror, the absolute values of the estimates of the regression coef-ficients for the linear and the quadratic terms in animal proteinintake are attenuated, indeed they are even closer to the resultsfrom the naive analysis. The curve grows flatter (cf. Figure 3a),the point of minimal risk and the second zero are shifted to theleft: from about x=60000 to x=52408, and from about x=120000to x=104098, respectively. In contrast to this, the quadratic na-ture of the influence of plant protein becomes much clearer. The


Figure 2: Estimated individual error variances for plant proteinintake: overall and detail figure.

regression coefficient for the quadratic term now again has a pos-itive sign, its value is about 30 times as high as in the naiveanalysis.

As can also be seen in Figure 3b), the risk is still decreasing withincreasing plant protein intake, but now the curve is clearly con-vex: the relative gain in risk reduction becomes the smaller thehigher the intake is, and there would be a border value (outsidethe domain of the data, at y=73241), where further intake wouldincrease the risk again. Correcting for heteroscedastic measure-ment error in the estimation of mortality confirms the resultsobtained from the homoscedastic error model for plant proteinintake (cf. also Figure 3d)). The absolute values of the estimatedcoefficients of animal protein intake are by the factor 2.4 lower,which results in a much flatter curve in Figure 3c).


Morbidity and Animal Protein

–2

0

2

4

6

20000 40000 60000 80000 100000 120000 140000 160000

Morbidity and Plant Protein

–2.5

–2

–1.5

–1

–0.5

0

0.5

1

10000 20000 30000 40000 50000 60000

Mortality and Animal Protein

–0.6

–0.4

–0.2

0

0.2

20000 40000 60000 80000 100000 120000 140000 160000

Mortality and Plant Protein

–2

–1.5

–1

–0.5

0

0.5

1

10000 20000 30000 40000 50000 60000

Figure 3: Estimated overall influence of animal/plant protein in-take on morbidity and mortality (cf. (13) and (14)), calculatedfrom the naive estimates (dotted line), from the estimates af-ter having corrected for homoscedastic measurement error (thinsolid line), and from the estimates after having corrected for het-eroscedastic measurement error (thick solid line), respectively.

It is also worth mentioning that morbidity and mortality differwith respect to the consequences a certain amount of protein in-take has. High plant protein intake considerably reduces the riskof cardiac infarction, but increases the risk of death. In the caseof animal protein the intake which minimizes the risk of death(x=94263 for the heteroscedastic error model) has already a ratherhigh risk for cardiac infarction.

5 Concluding Remarks

We discussed an extended version of regression calibration to cor-rect for possibly heteroscedastic measurement error in Cox re-


gression with a quadratic predictor when replication data areavailable. This method was applied to a part of the MONICAAugsburg survey to study the influence of eating habits on car-diovascular diseases.

It has become clear how important it is to take into account mea-surement error carefully. In particular under heteroscedastic mea-surement error there is a complex relationship between naive andcorrected estimation, which may alter the estimates substantially.Nevertheless, the results reported here must be taken only as afirst step towards a comprehensive analysis, suggesting and mo-tivating further research in several directions. Four topics shouldbe mentioned explicitly:

First of all, it must not be forgotten that the regression calibra-tion method is only an approximate method, reducing the bias ofnaive analysis but not necessarily producing consistent estimators.Furthermore, the parameter estimates have to be interpreted inrelative terms because correct estimators for their standard errorsare missing. To derive such appropriate estimators is demanding(cf. [2, Section 3.5 and Subsection 3.12.2]), an interesting alter-native would be bootstrapping.

Secondly, alternative correction methods should be applied, inorder to justify, or to correct, the preliminary results obtainedhere. Of special interest here is a so-to-say dynamic regressioncalibration procedure, developed by [17], where at every failuretime only those units are taken into account which still are underrisk (cf. also [31]).

Another powerful method to correct for homoscedastic measure-ment error in the Cox model was developed by Nakamura ([32])and extended to heteroscedastic error by [19]. However, prior toapplying this method, further theoretic development is needed, inorder to be able to model the quadratic influence of the covari-ates. The inherent restriction to linear predictors is also the mainhurdle for an application of Huang and Wang’s nonparametricfunctional correction method (cf. [15]), which would provide anappealing alternative to utilize replicated measurements.

There are good reasons to doubt the assumption made above that


the measurement error should be independent of the true proteinintake, and so more complex error models deserve special atten-tion (cf., e.g., [33, 34, 35]).

The third issue to keep in mind is that valuable insights in thedata may be gained by applying accelerated failure time mod-els instead of Cox’s proportional hazards model. Techniques formeasurement error correction in such survival models have notyet received much attention. One of the very rare exceptions is[36] where Nakmura illustrates his general method of so-calledcorrected score functions with members of the exponential fam-ily. His approach is generalized to possibly censored Weibull dis-tributed lifetimes in [37]. A procedure to correct for covariatemeasurement error in the nonparametric log-linear lifetime modelis suggested by [38], while [39, Chapter 5f.] proposes two methodsfor corrected quasi-likelihood estimation in arbitrary parametricaccelerated failure time models. As discussed there, the latterapproaches need some non-standard treatment of censored obser-vations, but have, on the other hand, the advantage of being ableto take also error-prone lifetimes into account.

The last item to be mentioned is the most difficult one: Eatinghabits may change! Even if the Xi to be measured by the diarycould be determined exactly, this measurement would only stemfrom a cursory glance at a process developing over time. Morbid-ity and mortality is also affected by the intake before as well asafter the recording of the eating habits. This leads to the superpo-sition of the heteroscedastic measurement error treated here witha complex kind of measurement error where a time-dependent co-variate is only observed at a certain time point. (Compare forthis also [40], considering a Cox model where a time dependentcovariate is only observed irregularly.)

Acknowledgements

We are very grateful to Helmut Kuchenhoff and Hans Schneeweißfor many helpful comments. Thomas Augustin and David Rum-mel thank the Deutsche Forschungsgemeinschaft (DFG) for finan-cial support.


6 References

1. Cheng C-L, van Ness JW. Statistical regression with mea-surement error. Arnold: London, 1999.

2. Carroll RJ, Ruppert D, Stefanski LA. Measurement errorin nonlinear models. Chapman and Hall: London, 1995.

3. Stefanski LA. Measurement error models. Journal of theAmerican Statistical Association 2000; 95(452): 1353-1358.

4. Van Huffel S, Lemmerling P (eds). Total least squares anderrors-in-variables modeling: analysis, algorithms and ap-plications. Kluwer: Dordrecht, 2002.

5. Doring A, Kußmaul B. Ernahrungsdeterminanten des Herz-infarktrisikos. Report GSF-Fe-7629. GSF — National Re-search Center for Environment and Health: Neuherberg,1997.

6. Winkler G, Doring A, Keil U. Selected nutrient intakes ofmiddle-aged men in Southern Germany: Results from theWHO MONICA Augsburg Dietary Survey of 1984/ 1985.Annals of Nutrition and Metabolism 1991; 35(5): 284-291.

7. Johansson I, Hallmans G, Wikman A, Biessy C, Riboli E,Kaaks R. Validation and calibration of food-frequency ques-tionnaire measurement in the Northern Sweden health anddisease cohort. Public Health Nutrition 2002; 5(3): 487-496.

8. Kulathinal SB, Kuulasmaa K, Gasbarra D. Estimation ofan errors-in-variables regression model when the variancesof the measurement errors vary between the observations.Statistics in Medicine 2002; 21(8): 1089-1101.

9. Wang CY, Hsu L, Feng ZD, Prentice RL. Regression cali-bration in failure time regression. Biometrics 1997; 53(1):131-145.

10. Hu P, Tsiatis A, Davidian M. Estimating the parameters inthe Cox model when covariate variables are measured witherror. Biometrics 1998; 54(4): 1407-1419.


11. Buzas JS. Unbiased scores in proportional hazards regres-sion with covariate measurement error. Journal of Statisti-cal Planning and Inference 1998; 67(2): 247-257.

12. Kong FH, Huang W, Li X. Estimating survival curves un-der proportional hazards model with covariate measurementerrors. Scandinavian Journal of Statistics 1998; 25(4): 573-587.

13. Kong FH. Adjusting regression attenuation in the Cox pro-portional hazards model, Journal of Statistical Planning andInference 1999; 79(1): 31-44.

14. Kong FH, Gu M. Consistent estimation in Cox proportionalhazards model with covariate measurement errors. Statis-tica Sinica 1999; 9(4): 953-969.

15. Huang Y, Wang CY. Cox regression with accurate covari-ates unascertainable: A nonparametric-correction approach.Journal of the American Statistical Association 2000; 95(452):1209-1219.

16. Zhou H, Wang CY. Failure time regression with continu-ous covariates measured with error. Journal of the RoyalStatistical Society Series B 2000; 62(4): 657-665.

17. Xie SX, Wang CY, Prentice RL. A risk set calibration methodfor failure time regression by using a covariate reliabilitysample. Journal of the Royal Statistical Society Series B2001; 63(4): 855-870.

18. Hu C, Lin DY. Cox regression with covariate measurementerror. Scandinavian Journal of Statistics 2002; 29(4): 637-655.

19. Augustin T. An exact corrected log-likelihood function forCox’s proportional hazards model under measurement errorand some extensions. To appear in: Scandinavian Journalof Statistics 2003.

20. Augustin T, Schwarz R. Cox’s proportional hazards modelunder covariate measurement error — A review and com-parison of methods. In: S. Van Huffel and P. Lemmerling


(eds): Total Least Squares and Errors-in-Variables Model-ing: Analysis, Algorithms and Applications. Kluwer: Dor-drecht, 2002; pp 175-184.

21. Prentice RL. Covariate measurement errors and parameterestimation in a failure time regression model. Biometrika1982; 69(2): 331-342.

22. Pepe MS, Self SG, Prentice RL. Further results in covariatemeasurement errors in cohort studies with time to responsedata. Statistics in Medicine 1989; 8(9): 1167-1178.

23. Clayton DG. Models for the analysis of cohort and case-control studies with inaccurately measured exposures. In:JH Dwyer, M Feinleib, P Lipsert et al. (eds): StatisticalModels for Longitudinal Studies of Health. Oxford Univer-sity Press: New York, 1991; pp 301-331.

24. Hughes MD. Regression dilution in the proportional hazardsmodel. Biometrics 1993; 49(4): 1056-1066.

25. Willett W. Nutritional epidemiology. Oxford University Press:New York, 19982.

26. Cheng C-L, Schneeweiß H. The polynomial regression witherrors in the variables. Journal of the Royal Statistical So-ciety Series B 1998; 60(1): 189-199.

27. Stefanski LA. Unbiased estimation of a nonlinear function ofa normal mean with application to measurement error mod-els. Communications in Statistics — Theory and Methods1989; 18(12): 4335-4358.

28. Kalbfleisch JD, Prentice RL. The Statistical analysis of fail-ure time data. Wiley: New York, 20022.

29. Cox DR. Regression models and life tables (with discussion).Journal of the Royal Statistical Society Series B 1972; 34:187-220.

30. Toutenburg H. Statistical analysis of designed experiments.Springer: New York, 20022.


31. Wang CY, Xie SX, Prentice RL. Recalibration based on anapproximate relative risk estimator in cox regression withmissing covariates. Statistica Sinica 2001; 11(4): 1081-1104.

32. Nakamura T. Proportional hazards model with covariatessubject to measurement error. Biometrics 1992; 48(3): 829-838.

33. Heitmann BL, Lissner L. Dietary underreporting by obeseindividuals — is it specific or non-specific? British MedicalJournal 1995; 311(7011): 986-989.

34. Prentice RL. Measurement error and results from analyticepidemiology: dietary fat and breast cancer. Journal of theNational Cancer Institute 1996; 88(23): 1738-1747.

35. Carroll RJ, Freedman LS, Kipnis V, Li L. A new classof measurement error models, with applications to dietarydata. Canadian Journal of Statistics 1998; 26(3): 467-477.

36. Nakamura T. Corrected score functions for errors-in-variablesmodels: Methodology and application to generalized linearmodels. Biometrika 1990; 77(1): 127-137.

37. Gimenez P, Bolfarine H, Colosimo EA. Estimation in Weibullregression model with measurement error. Communicationsin Statistics — Theory and Methods 1999; 28(2): 495-510.

38. Wang Q. Estimation of linear error-in-covariables modelswith validation data under random censorship. Journal ofMultivariate Analysis 2000; 74(2): 245-266.

39. Augustin T. Survival analysis under measurement error.Habilitation (post-doctorial) thesis. Department of Statis-tics: University of Munich, 2002.

40. de Bruijne MHJ, le Cessie S, Kluin-Neemans HC, van Houwe-lingen HC. On the use of Cox regression in the presence ofan irregularly observed time-dependent covariate. Statisticsin Medicine 2001; 20(24): 3817-3829.

Regression calibration for Cox regression under ... · Cox regression under heteroscedastic error 4...

Documents

Transcript of Regression calibration for Cox regression under ... · Cox regression under heteroscedastic error 4...