Inference Using Conditional Logistic Regression with Missing Covariates
Author(s): Stuart R. Lipsitz, Michael Parzen, Marian Ewell
Source: Biometrics, Vol. 54, No. 1 (Mar., 1998), pp. 295-303
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2534015


BIOMETRICS 54, 295-303 March 1998

SHORTER COMMUNICATIONS

EDITOR: LOUISE M. RYAN

Inference Using Conditional Logistic Regression with Missing Covariates

Stuart R. Lipsitz,1,* Michael Parzen,2 and Marian Ewell3

1Department of Biostatistics, Harvard School of Public Health and Dana-Farber Cancer Institute, 44 Binney Street, Boston, Massachusetts 02115, U.S.A.

2Graduate School of Business, University of Chicago, 1101 East 58th Street, Chicago, Illinois 60637, U.S.A.

3Emmes Corporation, 11325 Seven Locks Road, Potomac, Maryland 20854, U.S.A.

SUMMARY

When there are many nuisance parameters in a logistic regression model, a popular method for eliminating them is conditional logistic regression. Unfortunately, another common problem in a logistic regression analysis is missing covariate data. With many nuisance parameters to eliminate and missing covariates, many investigators exclude any subject with missing covariates and then use conditional logistic regression, an approach often called a complete-case analysis. In this article, we derive a modified conditional logistic regression that is appropriate with covariates that are missing at random. Performing a conditional logistic regression with only the complete cases is convenient with existing statistical packages, but it may give biased estimates if the data are not missing completely at random.

Key words: Complete-case analysis; Missing at random; Missing completely at random; Missing covariate data; Nuisance parameters.

*Corresponding author's email address: stuart@jimmy.harvard.edu

1. Introduction

The logistic regression model is one of the most widely used models in biostatistics. Conditional logistic regression has been proposed as a general technique to use when the usual logistic regression fails, either because of sparse data or a proliferation of nuisance parameters. For example, in a matched case-control study, conditional logistic regression (Breslow et al., 1978; Breslow and Day, 1980) has been proposed to eliminate the nuisance matching effects. Unfortunately, values of the covariates of interest may be missing for some subjects, either by accident or study design. For example, consider data from Eastern Cooperative Oncology Group liver cancer clinical trials (Falkson, Cnaan, and Simson, 1990; Falkson et al., 1994). Our primary interest here is to model the probability of jaundice when entering the clinical trial as a function of performance status (a measure of overall health, dichotomized as good or poor) and the biochemical marker gammaglobulin, classified as normal or abnormal. The subjects were randomized into the clinical trials in 22 different institutions, and there may also be an institution effect in the logistic regression that we would like to condition out. Also, the variables jaundice and performance status are observed for all 181 subjects in the study, but the biochemical marker is observed for only 104 of the subjects. Obtaining the gammaglobulin status requires a blood test and complicated laboratory analysis and may not be performed on all subjects.

Gibbons and Hosmer (1991) applied three commonly used methods for handling missing covariate data to conditional logistic regression with matched pairs data. The first of these methods is complete-case analysis, i.e., using only subjects with no missing data. The second method substitutes the mean covariate value for the missing covariate. The third method uses the predicted value, with or without random error, obtained from regressing the observed values of a covariate on the other covariates. Unfortunately, when these methods are used with conditional logistic regression, they suffer from the same biases as in usual regression problems unless the data are missing completely at random (Rubin, 1976). For unconditional logistic regression, Vach and Schumacher (1993) compare methods for consistent estimation with covariates missing at random (Rubin, 1976), which is a weaker assumption than missing completely at random. One of these methods estimates the probability of an individual having complete data and uses it in the estimation procedure. We extend this approach to conditional logistic regression with covariates missing at random. In particular, we estimate the probability of an individual having complete data and use it in a modification of the usual conditional logistic regression estimating equations.

Suppose we partition the logistic regression parameter vector as $\beta' = [\beta_N', \beta_I']$, where $\beta_N$ are the nuisance parameters that we wish to eliminate and $\beta_I$ are the parameters of interest. With no missing data, the conditional logistic regression likelihood eliminates the nuisance parameters $\beta_N$. In formulating our proposed conditional likelihood with missing covariates, we first restrict the data to include only subjects with complete covariate information. Our conditional likelihood eliminates $\beta_N$ and is thus a function of the parameter vector $\beta_I$; however, it is also a function of all of the complete cases' probabilities of having complete data. Since they are usually unknown, these probabilities of having complete data also must be estimated from the data. We heuristically show that the estimate of $\beta_I$ is consistent, asymptotically normal and unbiased, with both known and estimated probabilities of complete data.

2. The Logistic Regression Model

Suppose the $i$th subject ($i = 1, \ldots, n$) in the study has Bernoulli response $Y_i$ and $p$-dimensional covariate vector $x_i$, and the subjects are independent. The logistic regression model for subject $i$ is

$$E(Y_i \mid x_i) = \mathrm{pr}(Y_i = 1 \mid x_i) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}, \tag{1}$$

where $\beta$ is a $p$-dimensional parameter vector. When there are missing covariates, we introduce the indicator $R_i$, which equals 1 when subject $i$ has no missing data and 0 otherwise.

In our conditional likelihood, we will use only subjects with complete covariate information, i.e., only subjects with $R_i = 1$. In particular, we are looking at subjects with distribution $\{Y_i \mid R_i = 1, x_i\} \sim \mathrm{Bern}(p_i)$, where, applying Bayes' rule,

$$p_i = \mathrm{pr}(Y_i = 1 \mid R_i = 1, x_i) = \frac{\mathrm{pr}(R_i = 1 \mid Y_i = 1, x_i)\,\mathrm{pr}(Y_i = 1 \mid x_i)}{\mathrm{pr}(R_i = 1 \mid Y_i = 0, x_i)\,\mathrm{pr}(Y_i = 0 \mid x_i) + \mathrm{pr}(R_i = 1 \mid Y_i = 1, x_i)\,\mathrm{pr}(Y_i = 1 \mid x_i)}. \tag{2}$$

Let

$$\pi_{ij} = \mathrm{pr}(R_i = 1 \mid Y_i = j, x_i), \tag{3}$$

which can depend on both $Y_i$ and $x_i$. Then, substituting (3) and (1) in (2), we get

$$p_i = p_i(\eta_i, \beta) = \mathrm{pr}(Y_i = 1 \mid R_i = 1, x_i) = \frac{\pi_{i1}\exp(x_i'\beta)}{\pi_{i0} + \pi_{i1}\exp(x_i'\beta)} = \frac{\exp(\eta_i + x_i'\beta)}{1 + \exp(\eta_i + x_i'\beta)}, \tag{4}$$

where $\eta_i = \log(\pi_{i1}/\pi_{i0})$. This is the same result Breslow and Cain (1988) obtained for two-stage case-control studies. Assuming $\eta_i$ is known, it is an offset in the logistic regression model. In this section, we assume that $\eta_i$ is known. In the following section, we describe estimating equations when $\eta_i$ is unknown.
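The step from (2) to (4) can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names and the numeric values of the completion probabilities and linear predictor are hypothetical, chosen only to show that the Bayes' rule form and the offset form give the same probability.

```python
import math

def p_complete_bayes(pi1, pi0, lin):
    """pr(Y = 1 | R = 1, x) computed with Bayes' rule, as in (2)."""
    p1 = 1.0 / (1.0 + math.exp(-lin))  # pr(Y = 1 | x) from model (1)
    return pi1 * p1 / (pi0 * (1.0 - p1) + pi1 * p1)

def p_complete_offset(pi1, pi0, lin):
    """The same probability written as a logistic model with offset eta, as in (4)."""
    eta = math.log(pi1 / pi0)
    return 1.0 / (1.0 + math.exp(-(eta + lin)))

# Hypothetical values: subjects with Y = 1 are complete more often than those with Y = 0.
pi1, pi0, lin = 0.8, 0.4, -0.5
print(p_complete_bayes(pi1, pi0, lin), p_complete_offset(pi1, pi0, lin))
```

When the two completion probabilities are equal, the offset is zero and both functions reduce to the ordinary logistic probability in (1).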


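Without loss of generality, suppose the first $m = \sum_{i=1}^n R_i < n$ subjects have complete covariate information. The distribution for these $m$ independent complete cases is

$$f(y_1, \ldots, y_m \mid R_1 = \cdots = R_m = 1) = \prod_{i=1}^m p_i^{y_i}[1 - p_i]^{1 - y_i}. \tag{5}$$

Suppose, as before, we partition the logistic regression parameter vector as $\beta' = [\beta_N', \beta_I']$, where $\beta_N$ are the nuisance parameters we wish to eliminate and $\beta_I$ are the parameters of interest, and we similarly partition $x_i' = [x_{Ni}', x_{Ii}']$ so that

$$p_i = \frac{\exp(\eta_i + x_{Ni}'\beta_N + x_{Ii}'\beta_I)}{1 + \exp(\eta_i + x_{Ni}'\beta_N + x_{Ii}'\beta_I)}. \tag{6}$$

Then, after some algebra, (5) equals

$$\exp\left\{\sum_{i=1}^m \eta_i y_i + \beta_N' \sum_{i=1}^m x_{Ni} y_i + \beta_I' \sum_{i=1}^m x_{Ii} y_i + \sum_{i=1}^m \log[1 - p_i]\right\}. \tag{7}$$

Equation (7) is proportional to the joint distribution of $T_N = \sum_{i=1}^m x_{Ni} Y_i$, $T_I = \sum_{i=1}^m x_{Ii} Y_i$, and $T_0 = \sum_{i=1}^m \eta_i Y_i$, which is

$$f(t_0, t_N, t_I \mid \beta) = c(t_0, t_N, t_I) \exp\left\{t_0 + \beta_N' t_N + \beta_I' t_I + \sum_{i=1}^m \log[1 - p_i]\right\}, \tag{8}$$

where $c(t_0, t_N, t_I)$ is the number of patterns of $(Y_1, \ldots, Y_m)$ in which $T_0 = t_0$, $T_N = t_N$, and $T_I = t_I$. Note that the sufficient statistics for $\beta_N$ and $\beta_I$ in (8) are $T_N$ and $T_I$.

In (8), if we condition on $T_N$, the sufficient statistic that is the coefficient of the nuisance parameter vector $\beta_N$, then we eliminate $\beta_N$, i.e.,

$$f(t_0, t_I \mid t_N) = \frac{c(t_0, t_N, t_I) \exp\left\{t_0 + \beta_N' t_N + \beta_I' t_I + \sum_{i=1}^m \log[1 - p_i]\right\}}{\sum_{t_0^*, t_I^*} c(t_0^*, t_N, t_I^*) \exp\left\{t_0^* + \beta_N' t_N + \beta_I' t_I^* + \sum_{i=1}^m \log[1 - p_i]\right\}} = \frac{c(t_0, t_N, t_I) \exp\{t_0 + \beta_I' t_I\}}{\sum_{t_0^*, t_I^*} c(t_0^*, t_N, t_I^*) \exp\{t_0^* + \beta_I' t_I^*\}}, \tag{9}$$

where the sums in the denominators run over all values $(t_0^*, t_I^*)$ of $(T_0, T_I)$ that are compatible with $T_N = t_N$.

In particular, (9) is the conditional likelihood, which we denote by $L_C(\beta_I)$. Suppose we denote the first derivative of the conditional log-likelihood by

$$S_C(\beta_I) = \frac{\partial}{\partial \beta_I} \log[L_C(\beta_I)] = \frac{\partial}{\partial \beta_I} \log\left[\frac{c(t_0, t_N, t_I) \exp\{t_0 + \beta_I' t_I\}}{\sum_{t_0^*, t_I^*} c(t_0^*, t_N, t_I^*) \exp\{t_0^* + \beta_I' t_I^*\}}\right]. \tag{10}$$

The maximum conditional likelihood estimate $\hat\beta_I$ is the solution to $S_C(\hat\beta_I) = 0$. Under regularity conditions, expectation and differentiation can be exchanged and, using the usual proof from inference, the derivative of the conditional log-likelihood has expected value 0 at the true $\beta_I$, i.e., $E[S_C(\beta_I)] = 0$. Then, using method of moments ideas for unbiased estimating equations, the conditional maximum likelihood estimator $\hat\beta_I$ will be consistent and asymptotically normal, with asymptotic variance consistently estimated with the inverse of the negative second derivative matrix,

$$\widehat{\mathrm{var}}(\hat\beta_I) = \left\{-\frac{\partial^2}{\partial \beta_I \, \partial \beta_I'} \log[L_C(\hat\beta_I)]\right\}^{-1}. \tag{11}$$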
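To make the conditional likelihood in (9) concrete, here is a small sketch for the special case in which the nuisance parameters are one intercept per stratum, so that conditioning on the nuisance sufficient statistic fixes the number of subjects with Y = 1 in each stratum. The function names, the data layout, the scalar parameter of interest, and the golden-section maximizer are all illustrative assumptions rather than the authors' implementation, and the brute-force enumeration is only practical for small strata.

```python
import math
from itertools import combinations

def cond_loglik(beta, strata):
    """Conditional log-likelihood with known offsets and one nuisance intercept per stratum.

    strata: list of strata; each stratum is a list of (y, x, eta) tuples with
    binary response y, scalar covariate of interest x, and offset eta.
    """
    total = 0.0
    for stratum in strata:
        scores = [eta + beta * x for (_, x, eta) in stratum]
        n_cases = sum(y for (y, _, _) in stratum)
        observed = sum(s for s, (y, _, _) in zip(scores, stratum) if y == 1)
        # Denominator: every response pattern with the same number of cases,
        # i.e. the same value of the nuisance sufficient statistic.
        denom = sum(math.exp(sum(scores[i] for i in idx))
                    for idx in combinations(range(len(stratum)), n_cases))
        total += observed - math.log(denom)
    return total

def argmax_1d(f, lo=-10.0, hi=10.0, tol=1e-8):
    """Golden-section search for the maximizer of a concave function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) < f(d):
            a = c  # the maximizer lies to the right of c
        else:
            b = d  # the maximizer lies to the left of d
    return (a + b) / 2.0

# Hypothetical matched pairs with all offsets 0: six pairs with an exposed case
# and unexposed control, three pairs the other way round.
pairs = [[(1, 1, 0.0), (0, 0, 0.0)]] * 6 + [[(1, 0, 0.0), (0, 1, 0.0)]] * 3
beta_hat = argmax_1d(lambda b: cond_loglik(b, pairs))
print(beta_hat)
```

With all offsets equal to zero this reduces to the usual complete-case conditional likelihood, and for these matched pairs the maximizer should be close to log(6/3) = log 2. Nonzero offsets enter only through the `eta` term in each score.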

The simplest alternative to our method is a complete-case analysis. The conditional logistic regression likelihood using complete cases is identical to (9), except that $t_0 = 0$, which means that, to get (asymptotically) unbiased estimates of $\beta_I$ using complete cases, the true $\pi_{i1}$ and $\pi_{i0}$ must satisfy $\eta_i = \log(\pi_{i1}/\pi_{i0}) = 0$ or, equivalently, $\pi_{i1} = \pi_{i0}$. Now we discuss the possible bias in the complete-case analysis. Suppose the subjects with $R_i = 1$ are a completely random sample from all sampled individuals; then

$$\pi_{ij} = \mathrm{pr}(R_i = 1 \mid Y_i = j, x_i) = \mathrm{pr}(R_i = 1) = \pi, \tag{12}$$

i.e., $\pi_{i1} = \pi_{i0} = \pi$ and $\eta_i = 0$. In Rubin's (1976) terminology, the data are missing completely at random. In this case, using conditional logistic regression with only the complete cases will give (asymptotically) unbiased estimates of $\beta_I$. Suppose that, given $x_i$, $R_i$ and $Y_i$ are conditionally independent; then $\pi_{i1} = \pi_{i0}$ and, again, $\eta_i = 0$. Here one will again get (asymptotically) unbiased estimates using conditional logistic regression with complete cases. Next, suppose, as in a case-control study (Breslow and Day, 1980), $\pi_{ij}$ depends only on $Y_i$ and not on $x_i$. Using results from case-control studies (Breslow and Day, 1980), one can show that only the intercept in the logistic regression model is not estimable with complete cases. Thus, as long as the intercept is included as a nuisance parameter in $\beta_N$, we will still get (asymptotically) unbiased estimates of $\beta_I$ using conditional logistic regression with complete cases. However, often the probability that covariates are missing depends on both the response $Y_i$ and the covariates $x_i$, so that a complete-case analysis will be biased, and one will need to put $t_0$ in the conditional likelihood. Except in some designed studies, $\eta_i$ (and thus $T_0$) is unknown and must be estimated and plugged into the conditional likelihood, as discussed in the following section.
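The case-control-type situation can be spelled out in one worked step. This is a sketch under two stated assumptions: the completion probability depends on the response only, and the intercept is among the nuisance parameters, so that the number of subjects with response 1 is a component of the nuisance sufficient statistic. The starred symbol is a dummy summation value introduced here for illustration.

```latex
\pi_{ij} = \pi_j
\quad\Longrightarrow\quad
\eta_i = \log\frac{\pi_1}{\pi_0} \equiv \eta ,
\qquad
t_0 = \sum_{i=1}^m \eta_i y_i = \eta \sum_{i=1}^m y_i .

% Conditioning on t_N fixes \sum_i y_i, so every response pattern in the
% denominator of (9) has the same t_0, and \exp\{t_0\} cancels:
\frac{c(t_0, t_N, t_I)\exp\{t_0 + \beta_I' t_I\}}
     {\sum_{t_I^*} c(t_0, t_N, t_I^*)\exp\{t_0 + \beta_I' t_I^*\}}
=
\frac{c(t_0, t_N, t_I)\exp\{\beta_I' t_I\}}
     {\sum_{t_I^*} c(t_0, t_N, t_I^*)\exp\{\beta_I' t_I^*\}} .
```

The right-hand side is the complete-case conditional likelihood, which is why the offset can be ignored in this special case.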

3. Estimating Equations When $\pi_{ij}$ Is Unknown

In large samples, the parameter estimate $\hat\beta_I$ is unbiased and normally distributed, with an asymptotic variance that depends on whether $\eta_i$ is known or estimated from the data. In practice, the probabilities $\pi_{ij}$ are unknown and must be estimated. We do so by specifying a parametric model for

$$\mathrm{pr}(R_i = 1 \mid Y_i = y_i, x_i, \alpha) = \pi_{iy_i}(x_i, \alpha) = \pi_{iy_i}(\alpha),$$

which is a function of $Y_i$. Typically, the logit of $\pi_{iy_i}$ is taken to be a linear function of $Y_i$ and the components of $x_i$. However, given that not all components of $x_i$ are observed, we make a missing at random (Rubin, 1976) assumption that

$$\pi_{iy_i}(\alpha) = \pi_{iy_i}(x_{i,\mathrm{all}}, \alpha), \tag{13}$$

where $x_{i,\mathrm{all}}$ are the components of $x_i$ that are observed on all subjects. The dependence of the score on $\pi_{iy_i}(\alpha)$ is emphasized by adding the $\alpha$ parameter to (10), i.e., $S_C(\alpha, \beta_I)$. In particular, $T_0$ in (10) is a function of $\eta_i$ and thus $\alpha$. For fixed $\alpha$, let $\hat\beta_I(\alpha)$ solve $S_C(\alpha, \hat\beta_I) = 0$. When the $\eta_i$'s in $T_0$ are unknown, we estimate $\pi_{ij}$ by $\pi_{ij}(\hat\alpha)$, where $\hat\alpha$ denotes the maximum likelihood estimate of $\alpha$. The distribution of $R_i$ given $Y_i$, $x_i$ is Bernoulli with probability $\pi_{iy_i}$. Given that individuals are independent, $\hat\alpha$ solves the usual binary regression score equations

$$T(\hat\alpha) = \sum_{i=1}^n \frac{\partial}{\partial \alpha} \log\left\{\pi_{iy_i}(\alpha)^{r_i}[1 - \pi_{iy_i}(\alpha)]^{(1 - r_i)}\right\} = 0. \tag{14}$$

Note that, even though only the $m$ complete cases are used in $S_C(\alpha, \beta_I)$, all $n$ cases are used in (14). We then solve $S_C(\hat\alpha, \hat\beta_I) = 0$ to obtain $\hat\beta_I(\hat\alpha)$. In general, then, $[\hat\alpha', \hat\beta_I']'$ is the solution to

$$U(\hat\alpha, \hat\beta_I) = \begin{bmatrix} T(\hat\alpha) \\ S_C(\hat\alpha, \hat\beta_I) \end{bmatrix} = 0. \tag{15}$$

Under the assumption that the models for $p_i$ and $\pi_{iy_i}$ are correctly specified, $\hat\beta_I$ is consistent and asymptotically normal. A consistent estimate of the asymptotic variance of $\hat\beta_I$ in (15) does not have a simple form as in (11) but also depends on the distribution of $\hat\alpha$. Since, unconditionally, the $n$ subjects are independent, a consistent estimate of the variance of $\hat\beta_I$ can be obtained using the bootstrap or jackknife (Efron, 1982), in which the basic resampling unit is the individual.
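The first stage of this procedure, and the individual-level bootstrap, can be sketched as follows. Note the simplifying assumption: instead of a parametric logistic model for the completion probabilities, this sketch uses a saturated model in the response and one fully observed categorical covariate, whose maximum likelihood estimates are just the observed proportions of complete cases in each cell. The record layout and function names are hypothetical.

```python
import math
import random
from collections import defaultdict

def estimate_offsets(records):
    """Return the estimated offset log(pi_hat_1 / pi_hat_0) for each record.

    records: list of dicts with binary 'y', fully observed 'x_all' (hashable),
    and 'r' (1 for a complete case, 0 otherwise). All n subjects are used.
    """
    n_cell = defaultdict(int)
    n_complete = defaultdict(int)
    for rec in records:
        key = (rec["y"], rec["x_all"])
        n_cell[key] += 1
        n_complete[key] += rec["r"]

    def pi_hat(y, x_all):
        key = (y, x_all)
        if n_complete[key] == 0:
            raise ValueError(f"no complete cases in cell {key}")
        return n_complete[key] / n_cell[key]

    return [math.log(pi_hat(1, rec["x_all"]) / pi_hat(0, rec["x_all"]))
            for rec in records]

def bootstrap_se(records, estimator, n_boot=200, seed=0):
    """Bootstrap standard error of estimator(records), resampling individuals."""
    rng = random.Random(seed)
    n = len(records)
    draws = []
    for _ in range(n_boot):
        sample = [records[rng.randrange(n)] for _ in range(n)]
        draws.append(estimator(sample))
    mean = sum(draws) / n_boot
    return math.sqrt(sum((d - mean) ** 2 for d in draws) / (n_boot - 1))

# Hypothetical data: one covariate level, 3 of 4 subjects with y = 1 are
# complete, and 2 of 4 subjects with y = 0 are complete.
records = (
    [{"y": 1, "x_all": 0, "r": 1}] * 3 + [{"y": 1, "x_all": 0, "r": 0}] * 1
    + [{"y": 0, "x_all": 0, "r": 1}] * 2 + [{"y": 0, "x_all": 0, "r": 0}] * 2
)
eta_hat = estimate_offsets(records)
print(eta_hat[0])
print(bootstrap_se(records, lambda s: sum(rec["r"] for rec in s) / len(s)))
```

In a full analysis the `estimator` passed to `bootstrap_se` would refit both stages on each resample, so that the variability of the estimated missingness model is carried into the standard error; the proportion of complete cases is used here only to keep the example short.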

4. Example

We consider data from the Eastern Cooperative Oncology Group liver cancer clinical trials (Falkson et al., 1990, 1994). We are interested in the logistic regression model for the probability of jaundice as a function of performance status and the biochemical marker gammaglobulin, i.e.,

$$\mathrm{logit}[\mathrm{pr}(\mathrm{jaun}_i = 1 \mid \mathrm{ps}_i, \mathrm{gam}_i, \beta)] = \sum_{k=1}^{22} \gamma_k I_{ik} + \beta_p \mathrm{ps}_i + \beta_g \mathrm{gam}_i, \tag{16}$$

where $\mathrm{jaun}_i$ equals 1 if the patient has jaundice and equals 0 otherwise, $\mathrm{ps}_i$ equals 1 if the performance status is good and equals 0 otherwise, and $\mathrm{gam}_i$ equals 1 if the biochemical marker gammaglobulin is abnormal and equals 0 otherwise. For these two covariates, the higher risk level is coded as 1 and the lower risk level as 0. Since there may be institution effects, we also let $I_{ik}$ be 1 if subject $i$ is from institution $k$ and 0 otherwise.


Table 1
Estimates for the missing data model (for $R_i$)

Effect               Estimate   SE     Z-value   p-value
Intercept            -0.87      0.41   -2.13     0.033
Jaundice              1.02      0.35    2.34     0.019
Performance status    0.83      0.35    2.90     0.004

We want to use conditional logistic regression to eliminate the $\gamma_k$'s in (16). However, even though all 181 subjects have jaundice and performance status observed, only 104 out of the 181 have gammaglobulin observed. Obtaining the gammaglobulin status requires a blood test and complicated laboratory analysis and thus is more likely to be incomplete. Before estimating $\beta_p$ and $\beta_g$ in (16) using our modified conditional logistic regression, we first need to estimate $\eta_i$. To do this, we model the probability of being a complete case as a function of the observed data $(\mathrm{jaun}_i, \mathrm{ps}_i)$. The best fitting model was

$$\mathrm{logit}[\mathrm{pr}(R_i = 1 \mid \mathrm{jaun}_i, \mathrm{ps}_i, \alpha)] = \alpha_0 + \alpha_1 \mathrm{ps}_i + \alpha_2 \mathrm{jaun}_i. \tag{17}$$

We tried an interaction between $\mathrm{jaun}_i$ and $\mathrm{ps}_i$, but it was far from significant. The estimate of $\alpha$ is given in Table 1; sicker patients (i.e., those with jaundice and/or poor performance status) are significantly more likely to have gammaglobulin measured. Our intuition is that, in this study, the sicker patients go to their physician more often and thus are more likely to have the blood test for gammaglobulin status.
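As an illustration of how the fitted missing data model turns into offsets, the sketch below plugs the Table 1 estimates into model (17) and computes the predicted probability of being a complete case for each combination of jaundice and performance status, together with the implied offset. The function names are illustrative, and the values are computed from the rounded estimates in Table 1, so they are approximate.

```python
import math

# Estimates from Table 1 for the missing data model (17).
ALPHA = {"intercept": -0.87, "jaun": 1.02, "ps": 0.83}

def prob_complete(jaun, ps, alpha=ALPHA):
    """Predicted pr(R = 1 | jaun, ps) under the fitted model (17)."""
    lin = alpha["intercept"] + alpha["jaun"] * jaun + alpha["ps"] * ps
    return 1.0 / (1.0 + math.exp(-lin))

def offset(ps, alpha=ALPHA):
    """Estimated offset log(pi_hat_1 / pi_hat_0) for a subject with the given ps."""
    return math.log(prob_complete(1, ps, alpha) / prob_complete(0, ps, alpha))

for jaun in (0, 1):
    for ps in (0, 1):
        print(jaun, ps, round(prob_complete(jaun, ps), 3))
print(round(offset(0), 3), round(offset(1), 3))
```

Because the response enters (17) only as a main effect, the estimated offset varies across subjects only through performance status.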

We note here that we have an institution effect in the model of interest for jaundice in (16) but not for the probability of being a complete case in (17). In a preliminary analysis in which we tested for no institution effects using a statistic appropriate for small samples (Lipsitz et al., 1996), we found a significant institution effect when the outcome was jaundice (p = 0.00005), but found no institution effect when the outcome was missingness (p = 0.382). This test collapses over the other variables except for the outcome and the institution. Also, the test should be interpreted with caution when the outcome is jaundice since the data actually need to be missing completely at random for the test to be unbiased. However, these preliminary analyses led to the models given in (16) and (17).

From the modified conditional logistic regression estimate of $\beta$ (using a bootstrap variance estimate) in Table 2, we see that the estimated odds ratio for jaundice versus performance status is $e^{0.98} \approx 2.7$, and the estimated odds ratio for jaundice versus gammaglobulin is $e^{0.31} \approx 1.4$. Using conditional logistic regression with complete cases, we see that the estimated odds ratio for jaundice versus performance status is $e^{1.18} \approx 3.3$, and the estimated odds ratio for jaundice versus gammaglobulin is $e^{0.31} \approx 1.4$. Neither of the covariates is significant at the 5% level for either the modified or complete-case method. Treating the estimates from the modified conditional logistic regression as correct, an estimate of the relative bias is defined as $(\hat\beta - \hat\beta_{cc})/\hat\beta$, where $\hat\beta$ and $\hat\beta_{cc}$ denote estimates based on the modified conditional logistic regression and complete cases, respectively. The estimated relative bias for the gammaglobulin effect is 0% and for the performance status effect is 20%. This bias is comparable to what we found in the simulations in the following section.
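The odds ratios and relative biases quoted above follow directly from the Table 2 estimates. The short check below reproduces the arithmetic; the dictionary layout is only for illustration.

```python
import math

# (modified, complete-case) log odds ratio estimates from Table 2.
estimates = {
    "performance status": (0.98, 1.18),
    "gammaglobulin": (0.31, 0.31),
}

for effect, (modified, complete_case) in estimates.items():
    rel_bias = (modified - complete_case) / modified
    print(f"{effect}: OR modified = {math.exp(modified):.1f}, "
          f"OR complete-case = {math.exp(complete_case):.1f}, "
          f"relative bias = {rel_bias:.0%}")
```

The relative bias for performance status comes out negative under the stated definition; the 20% quoted in the text is its magnitude.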

Table 2
Modified conditional logistic regression estimates of $\beta$ (complete-case estimate in parentheses)

Effect               Estimate      SE              Z-value       p-value
Performance status   0.98 (1.18)   0.959 (0.980)   1.07 (1.27)   0.310 (0.229)
Gammaglobulin        0.31 (0.31)   0.736 (0.736)   0.42 (0.42)   0.675 (0.675)

5. A Small Simulation Study

We performed a small simulation study based on the liver cancer example discussed above, with jaundice as the binary response and performance status and gammaglobulin as covariates. We used the covariate data from the 181 subjects in the study, filling in the missing gammaglobulin values by sampling from the conditional Bernoulli distribution of gammaglobulin given performance status and jaundice. This gave us a single dataset with all covariates observed. This covariate data remained fixed over all simulations; in different simulations, we deleted values of gammaglobulin using different missing data mechanisms. Each simulation consisted of 1000 replications and compared the modified conditional logistic regression to the complete-case conditional logistic regression. We only report bias and mean square error here because bootstrap standard errors and thus coverage probabilities of confidence intervals are too computationally intensive.
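The two summaries reported in the simulation tables, empirical bias and mean square error over replications, can be computed as in the sketch below. The function name and the replicate values are hypothetical and are not results from the paper.

```python
def bias_and_mse(estimates, truth):
    """Empirical bias and mean square error of replicate estimates of one parameter."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - truth
    mse = sum((e - truth) ** 2 for e in estimates) / n
    return bias, mse

# Hypothetical replicate estimates of a coefficient whose true value is -1.
bias, mse = bias_and_mse([-1.20, -0.90, -1.05], truth=-1.0)
print(bias, mse)
```

Here the mean square error is taken about the true value, so it combines the squared bias and the sampling variance of the estimator.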

In each simulation, the true logistic model for the response $\mathrm{jaun}_i$ given $\mathrm{ps}_i$ and $\mathrm{gam}_i$ was

$$\mathrm{logit}[\mathrm{pr}(\mathrm{jaun}_i = 1 \mid \mathrm{ps}_i, \mathrm{gam}_i, \beta)] = \beta_0 + \beta_p \mathrm{ps}_i + \beta_g \mathrm{gam}_i = -0.25 - \mathrm{ps}_i + \mathrm{gam}_i. \tag{18}$$

In all simulations, the model for $\mathrm{jaun}_i$ was specified and estimated as in (18). For simplicity, we did not add an institution effect to (18); conditional logistic regression can still be used to get consistent estimates of $\beta_p$ and $\beta_g$. We did not misspecify the model for $\mathrm{jaun}_i$ because our main interest was determining the bias in estimating $\beta$ when the missing data mechanism is misspecified. When the model for $\mathrm{jaun}_i$ is misspecified (e.g., by leaving out an important covariate), there will be bias in $\hat\beta$. We expect this bias to be on the order of the bias found in misspecifying the model when no data are missing.

In addition, we specified various missingness models. For the first simulation, we estimated the correct missingness model in order to assess the performance of the modified conditional logistic regression when the missingness probabilities are correctly modeled. In particular, we sampled the missingness indicator $R_i$ from a Bernoulli distribution with

$$\mathrm{logit}[\mathrm{pr}(R_i = 1 \mid \mathrm{jaun}_i, \mathrm{ps}_i, \mathrm{gam}_i, \alpha)] = \alpha_0 + \alpha_1 \mathrm{ps}_i + \alpha_2 \mathrm{jaun}_i = -0.7 + \mathrm{ps}_i + \mathrm{jaun}_i \tag{19}$$

and set $\mathrm{gam}_i$ missing if $R_i = 0$. Then we estimated $\beta$ using the modified and complete-case conditional logistic regression. The results, shown in Table 3, show the advantages of the modified conditional logistic regression when the missingness model is correctly specified; there is negligible bias in the estimates of both $\beta_p$ and $\beta_g$. Using complete-case conditional logistic regression, the estimate of the gammaglobulin effect, which corresponds to the variable with missing data, appears unbiased, whereas the performance status effect appears highly biased (-33%).
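The deletion step of this first simulation can be sketched as follows. The coefficients are those of model (19); the record layout, function names, and the three example subjects are hypothetical stand-ins for the 181 subjects in the study.

```python
import math
import random

def prob_complete(jaun, ps, a0=-0.7, a1=1.0, a2=1.0):
    """pr(R = 1 | jaun, ps) under the missingness model (19)."""
    return 1.0 / (1.0 + math.exp(-(a0 + a1 * ps + a2 * jaun)))

def delete_gam(subjects, rng):
    """Return copies of the subjects with gam set to None wherever R = 0."""
    out = []
    for s in subjects:
        r = 1 if rng.random() < prob_complete(s["jaun"], s["ps"]) else 0
        out.append({**s, "r": r, "gam": s["gam"] if r == 1 else None})
    return out

# Hypothetical fully observed records; each replication would draw a new
# set of missingness indicators for the same fixed covariate data.
subjects = [
    {"jaun": 0, "ps": 0, "gam": 1},
    {"jaun": 1, "ps": 0, "gam": 0},
    {"jaun": 1, "ps": 1, "gam": 1},
]
replicate = delete_gam(subjects, random.Random(1))
print(replicate)
```

Under (19) the probability of being complete does not involve gammaglobulin itself, so the deletion mechanism is missing at random given jaundice and performance status.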

The modified estimates perform well when the missing data model is correctly specified, but the performance in the face of model misspecification is also of interest. In the second simulation, we added a fixed institution effect to the missingness model in (19), i.e.,

$$\mathrm{logit}[\mathrm{pr}(R_i = 1 \mid \mathrm{jaun}_i, \mathrm{ps}_i, \mathrm{gam}_i, \alpha)] = \sum_{k=1}^{22} (\alpha_{0k}) I_{ik} + \alpha_0 + \alpha_1 \mathrm{ps}_i + \alpha_2 \mathrm{jaun}_i = \sum_{k=1}^{22} (-0.04 \cdot k) I_{ik} + \mathrm{ps}_i + \mathrm{jaun}_i, \tag{20}$$

where $I_{ik}$ is 1 if subject $i$ is from institution $k$ and 0 otherwise. Note the effect of institution $k$ is $(-0.04 \cdot k)$. When using the modified method, in the missingness model, we only estimated a common intercept (no institution effect), thereby underspecifying the missing data model compared to the true model. The results of this simulation are given in Table 4 and are very similar to those in Table 3. The bias appears in the complete-case estimate of the performance status effect with a -23% bias. Thus, at least for this simulation, misspecifying the missing data model leads to minimal bias using the modified method, and it is apparently more important to put effects due to jaundice and performance status in the posed missingness model than to put in an institution effect.

Table 3
Empirical estimates of bias with sample size 181 and true missing data model
$\mathrm{logit}[\mathrm{pr}(R_i = 1 \mid \mathrm{jaun}_i, \mathrm{ps}_i, \mathrm{gam}_i, \alpha)] = -0.7 + \mathrm{ps}_i + \mathrm{jaun}_i$

           Method          Performance status   Gammaglobulin
Estimate   Modified        -1.051               1.047
           Complete case   -1.332               1.047
Bias       Modified        -0.051               0.047
           Complete case   -0.332               0.047
MSE        Modified         0.592               0.318
           Complete case    0.742               0.318

Table 4
Empirical estimates of bias with sample size 181 and true missing data model
$\mathrm{logit}[\mathrm{pr}(R_i = 1 \mid \mathrm{jaun}_i, \mathrm{ps}_i, \mathrm{gam}_i, \alpha)] = \sum_{k=1}^{22} (-0.04 \cdot k) I_{ik} + \mathrm{ps}_i + \mathrm{jaun}_i$

           Method          Performance status   Gammaglobulin
Estimate   Modified        -1.026               1.043
           Complete case   -1.232               1.043
Bias       Modified        -0.026               0.043
           Complete case   -0.232               0.043
MSE        Modified         0.250               0.226
           Complete case    0.323               0.226

For the third simulation, we formed the true missingness model by adding a gammaglobulin effect to (19), giving

$$\mathrm{logit}[\mathrm{pr}(R_i = 1 \mid \mathrm{jaun}_i, \mathrm{ps}_i, \mathrm{gam}_i, \alpha)] = \alpha_0 + \alpha_1 \mathrm{ps}_i + \alpha_2 \mathrm{jaun}_i + \alpha_3 \mathrm{gam}_i = -0.7 + \mathrm{ps}_i + \mathrm{jaun}_i + \mathrm{gam}_i. \tag{21}$$

However, we fit (19), which means we are underspecifying the missing data model in the modified conditional logistic regression. Note, in practice, with the data observed, we cannot fit (21) since we will not know $\mathrm{gam}_i$ when $R_i = 0$. The simulation results in Table 5 show that the modified and complete-case estimates of the gammaglobulin effect are identical and have little bias. The modified estimate of the performance status effect has much less bias than the complete-case estimate. Thus, overall, the modified estimate performs better than the complete-case estimate.

In a simulation not shown here, the true missing data model added an interaction between $\mathrm{ps}_i$ and $\mathrm{jaun}_i$ to (19), but we fit the model in (19). The results of the simulation were almost identical to Table 3. Thus, even though the missingness mechanism was underspecified in the simulation, the modified method performed well, leading to estimates with little bias. Also, in simulations not shown, when the missing data are missing completely at random, i.e., $\mathrm{pr}[R_i = 1] = \pi$, the modified estimate with the overspecified missingness model in (19) and the complete-case estimate gave almost identical results. We also performed simulations in which the correct link for the missingness model was a logit link, and we fit both logit and probit links. The resulting bias in estimates of $\beta$ was almost identical for either link. Apparently, this is because the predicted $\pi_{ij}$'s are almost identical from either link. This suggests that misspecification of the link for the missingness model makes little difference and is much less important than the predictors we put in the model for $\pi_{ij}$.

As seen in Tables 3-5, the estimated effect of the variable that is sometimes missing, gammaglobulin, was identical using the modified and complete-case methods. We attempted algebraically to see if this is true because the conditional likelihood factors into a part for estimating the effect of performance status and a part for estimating the effect of gammaglobulin, but we could find no such factorization, although it must be true. Thus, the modified method is best at reducing biases encountered in variables other than the one missing.

Table 5
Empirical estimates of bias with sample size 181 and true missing data model
$\mathrm{logit}[\mathrm{pr}(R_i = 1 \mid \mathrm{jaun}_i, \mathrm{ps}_i, \mathrm{gam}_i, \alpha)] = -0.5 + 0.5\,\mathrm{ps}_i + 0.5\,\mathrm{jaun}_i + 0.5\,\mathrm{gam}_i$

           Method          Performance status   Gammaglobulin
Estimate   Modified        -1.026               1.043
           Complete case   -1.232               1.043
Bias       Modified        -0.026               0.043
           Complete case   -0.232               0.043
MSE        Modified         0.642               0.325
           Complete case    0.670               0.325


To use the modified conditional logistic regression method, one needs to be able to estimate the probability of being observed with some precision. As such, one should use the rough rules for the validity of the parameter estimates in the logistic regression model for $R_i$. In particular, the number of subjects with $R_i = 1$ divided by the number of parameters (not including the intercept) as well as the number with $R_i = 0$ divided by the number of parameters (not including the intercept) should both be greater than or equal to 10 (Peduzzi et al., 1996). This is satisfied for the liver cancer dataset used in this paper. We expect this rough rule of thumb is a little too strict since we are not really interested in the validity of the logistic regression parameter estimates but in the validity of the estimates of the $\pi_{ij}$'s, which are on a different scale than the logistic regression parameters and are less sensitive to outliers. This is a topic for further research. Nevertheless, when this rough rule of thumb does not hold, which may occur when the percentage of missing data is extremely high or low, we cannot estimate the parameters of the missing data model with validity, and the complete-case estimate may be preferred. Further, in practice, we have found that, when the percentage of missing data is low, say less than 5%, the complete-case estimate has little bias. Thus, the modified approach is most appropriate when there are many nuisance parameters to eliminate in the model for $Y_i$ and the percentage of missing data is moderate so that an investigator can estimate the missing data model with some precision.

In simulations in which the percentage of missing data was sufficient to estimate the $\pi_{ij}$'s, we found our modified estimator performed no worse with respect to bias, and usually better, than the complete-case estimate. This was true whether the model for $\pi_{ij}$ was underspecified or overspecified. Although the modified method performed better than the complete-case method, we suggest that, to prevent biases in the estimate of $\beta$ from underfitting the model for $\pi_{ij}$, one keep in the model for $\pi_{ij}$ any variable that is significant at the 0.20 level.
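The rough rule of thumb above is easy to check mechanically. The sketch below is one possible encoding of it; the function name and argument names are hypothetical, and the counts for the example come from the liver cancer data as described in the text (104 complete cases out of 181, with jaundice and performance status as the two non-intercept terms in the missingness model).

```python
def missingness_model_is_estimable(n_complete, n_incomplete, n_params, threshold=10):
    """Rough events-per-parameter check for the logistic model for R_i.

    n_params is the number of parameters excluding the intercept. Both the
    complete and the incomplete counts per parameter must reach the threshold.
    """
    return min(n_complete, n_incomplete) / n_params >= threshold

# Liver cancer example: 104 complete cases, 181 - 104 = 77 incomplete cases,
# two non-intercept parameters in model (17).
print(missingness_model_is_estimable(104, 181 - 104, 2))
```

With very little or very much missing data, one of the two counts becomes small and the check fails, which is the situation in which the text suggests the complete-case estimate may be preferred.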

ACKNOWLEDGEMENTS

We are very grateful for the support provided by grants CA 57253 and CA 55576 from the NIH and the helpful comments from the referees and associate editor.

RÉSUMÉ

When there are several nuisance parameters in a logistic regression model, a popular approach is to eliminate them by conditional logistic regression. Unfortunately, another common problem in logistic regression is missing covariates. When the two problems occur together, many analysts discard the observations with missing covariates and then use conditional logistic regression, also called "complete-case analysis." In this note, we present a conditional logistic regression that is appropriate when the covariates are missing at random. Carrying out a conditional logistic regression on the complete cases alone is convenient with existing software, but may prove biased when the missing data are not random.

REFERENCES

Breslow, N. E. and Cain, K. C. (1988). Logistic regression for two-stage case-control data. Biometrika 75, 11-20.

Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research, Volume 1, The Analysis of Case-Control Studies. Lyon: World Health Organization.

Breslow, N. E., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978). Estimation of multiple relative risk functions in matched case-control studies. American Journal of Epidemiology 108, 299-307.

Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. SIAM, Monograph 38.

Falkson, G., Cnaan, A., and Simson, I. W. (1990). A randomized Phase II study of acivicin and 4'deoxydoxorubicin in patients with hepatocellular carcinoma in an Eastern Cooperative Oncology Group Study. American Journal of Clinical Oncology 13, 510-515.

Falkson, G., Lipsitz, S., Borden, E., Simson, I. W., and Haller, D. (1994). A ECOG randomized Phase II study of beta interferon and Menogoril. American Journal of Clinical Oncology 18, 287-292.

Gibbons, L. B. and Hosmer, D. W. (1991). Conditional logistic regression with missing data. Communications in Statistics, Part B 20, 109-120.

Lipsitz, S. R., Dear, K. B. G., Laird, N. M., and Molenberghs, G. (1996). Moment estimating equations and a test for homogeneity in meta-analysis. Technical Report, Department of Biostatistics, Harvard School of Public Health, Boston.

Peduzzi, P., Concato, J., Kemper, E., Holford, T., and Feinstein, A. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology 49, 1373-1379.

Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581-592.

Vach, W. and Schumacher, M. (1993). Logistic regression with incompletely observed categorical covariates: A comparison of three approaches. Biometrika 80, 353-362.

Received April 1996; revised April 1997; accepted July 1997.