Estimating the relative recurrence risk ratio using a global cross-ratio model


Chris Wallace¹* and David Clayton²

¹London School of Hygiene and Tropical Medicine, IDEU, London, UK
²Cambridge Institute for Medical Research, Cambridge, UK

The relative recurrence risk ratio $\lambda_R$ (and particularly the sibling recurrence risk ratio, $\lambda_S$) is often of interest to those wanting to quantify the genetic contribution towards risk of disease or to discriminate between different genetic models. However, estimating $\lambda_R$ for complex diseases for which genetic and environmental risk factors are both involved is not straightforward. Ignoring environmental factors may lead to inflated estimates of $\lambda_R$. We present a marginal model which uses a copula function to model the association in cumulative incidence rates between pairs of relatives. This model is applicable to present-state data and allows estimation of risk of disease in a pair of relatives (and hence $\lambda_R$), given measured environmental covariates. We apply the model to leprosy among sibling pairs from the Karonga district, Malawi. If risk factors are ignored, the apparent $\lambda_S$ in this population is over 3. Accounting for known nongenetic risk factors reduces it to just under 2. Genet Epidemiol 25:293–302, 2003. © 2003 Wiley-Liss, Inc.

Key words: marginal models; copula models; sibling recurrence risk; present state data

Grant sponsor: British Leprosy Relief Association; Grant sponsor: Wellcome Trust; Grant sponsor: International Federation of Anti-Leprosy Organisations; Grant sponsor: WHO/UNDP/World Bank; Grant sponsor: British Medical Research Council

*Correspondence to: Chris Wallace, London School of Hygiene and Tropical Medicine, IDEU (2nd floor), Keppel St., London WC1E 7HT, UK. E-mail: [email protected]

Received 5 November 2002; accepted 6 June 2003

Published online in Wiley InterScience (www.interscience.wiley.com).

DOI: 10.1002/gepi.10270

INTRODUCTION

RELATIVE RECURRENCE RISK

The relative recurrence risk is defined as the risk of disease in relatives (of type $R$, e.g., siblings) of affected individuals. The relative recurrence risk ratio, $\lambda_R$, is defined as the ratio of the relative recurrence risk to the risk of disease in the general population. Mathematically, let $D_1$ and $D_2$ denote the disease state of two relatives ($D_i = 1$ if affected, 0 otherwise). Then

$$\lambda_R = \frac{P(D_1 = 1 \mid D_2 = 1)}{P(D_1 = 1)} = \frac{P(D_1 = 1 \text{ and } D_2 = 1)}{P(D_1 = 1)\,P(D_2 = 1)}.$$
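The two forms of the ratio above can be illustrated with a small calculation. The following is a minimal sketch only: the pair counts are hypothetical, the function name is ours, and age and covariates are ignored, so it gives a crude ratio rather than the model-based estimate developed in this paper.

```python
def crude_recurrence_risk_ratio(n_both, n_one, n_neither):
    """Crude recurrence risk ratio from counts of relative pairs.

    n_both    : pairs with both members affected
    n_one     : pairs with exactly one member affected
    n_neither : pairs with neither member affected
    Age and covariates are ignored (illustrative assumption).
    """
    n_pairs = n_both + n_one + n_neither
    # P(D1 = 1 and D2 = 1)
    p_both = n_both / n_pairs
    # P(D = 1): each pair contributes two individuals
    p_affected = (2 * n_both + n_one) / (2 * n_pairs)
    # Joint form: P(both) / (P(D1 = 1) P(D2 = 1))
    joint_form = p_both / (p_affected ** 2)
    # Conditional form: P(D1 = 1 | D2 = 1) / P(D1 = 1)
    p_recurrence = (2 * n_both) / (2 * n_both + n_one)
    conditional_form = p_recurrence / p_affected
    return joint_form, conditional_form


# Hypothetical counts, chosen only to make the arithmetic easy to follow
joint_form, conditional_form = crude_recurrence_risk_ratio(20, 360, 9620)
print(joint_form, conditional_form)
```

Both forms return the same value, as the identity in the definition requires.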

Interest in this measure, and in the sibling recurrence risk ratio ($\lambda_S$) in particular, was increased by three seminal papers by Risch [1990a–c]. If any locus is shown to affect a genetic disease, a locus-specific $\lambda_S$ can be calculated, and Risch [1990a] showed that the power to detect linkage is a function of $\lambda_S$. $\lambda_S$ is also commonly used in exclusion mapping, where regions of chromosomes are excluded from a genome scan on the basis that they do not confer a $\lambda_S$ of at least, say, 1.5 [e.g., Duffy et al., 2001].

Generally, the overall (rather than locus-specific) $\lambda_R$ has been estimated by sampling relatives of affected individuals to estimate risk among relatives of cases, and taking the ratio of this to an established population risk. Note that if the apparent $\lambda_R$ is greater than 1, this may reflect shared genetic or nongenetic factors between relatives. Guo [2000] showed that ignoring environmental factors can inflate estimates of $\lambda_R$. In this paper, we show how we may account for known environmental factors and estimate any residual increase in risk among relatives of cases. This residual risk is due to unmeasured shared risk factors. If all environmental factors are measured (here denoted by the covariate vectors $X_1$ and $X_2$), then the residual risk must be the “genetic relative recurrence risk ratio”:

$$\lambda_R = \frac{P(D_1 = 1 \mid D_2 = 1, X_1, X_2)}{P(D_1 = 1 \mid X_1)} = \frac{P(D_1 = 1 \text{ and } D_2 = 1 \mid X_1, X_2)}{P(D_1 = 1 \mid X_1)\,P(D_2 = 1 \mid X_2)} \tag{1}$$


COPULAS AND MARGINAL MODELS

The simple argument used in Equation (1) is suitable for traits that are observable from birth. However, many diseases are not like this; subjects unaffected at one timepoint may not remain unaffected. When modeling such phenomena, we must consider the distribution of time of onset. Familial aggregation of disease is reflected as association in the bivariate distribution of onset times [Clayton, 1978]. It will generally be most natural to interpret “time” as age in this context. A general approach to modeling such bivariate distributions is provided by copula models.

Copulas are functions that join (“couple”) multivariate distributions to their univariate margins. They can also be thought of as multivariate distribution functions with margins that are uniform on $(0, 1)$. Consider two individuals with onset times $T_1$ and $T_2$ with distribution functions $F_1(t_1)$ and $F_2(t_2)$ and joint distribution function $H(t_1, t_2)$. Sklar's theorem states that there exists a copula $C$, unique on $[\mathrm{Range}\,F_1] \times [\mathrm{Range}\,F_2]$ (which is $[0, 1] \times [0, 1]$ when $T_1$ and $T_2$ are continuous), such that for all $(t_1, t_2)$,

$$H(t_1, t_2) = C(F_1(t_1), F_2(t_2)).$$

Copulas are useful when the form of the marginal distributions is known, but the joint distribution is not, because they allow the creation of a joint distribution with given margins. They are used in marginal models to model the joint distribution, allowing for dependence between the observations, and are discussed in detail by Nelsen [1999].
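To make the idea of building a joint distribution from given margins concrete, here is a minimal sketch. It uses the standard one-parameter form of Clayton's copula, $C(u, v) = (u^{-a} + v^{-a} - 1)^{-1/a}$ with $a > 0$, and exponential margins; the margins, rates, and parameter value are illustrative assumptions, not quantities from this paper.

```python
import math


def clayton_copula(u, v, a):
    """Clayton copula C(u, v) for association parameter a > 0."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -a + v ** -a - 1.0) ** (-1.0 / a)


def exp_cdf(t, rate):
    """Exponential distribution function, used here as an assumed margin."""
    return 1.0 - math.exp(-rate * t)


def joint_cdf(t1, t2, rate1, rate2, a):
    """H(t1, t2) = C(F1(t1), F2(t2)), as in Sklar's theorem."""
    return clayton_copula(exp_cdf(t1, rate1), exp_cdf(t2, rate2), a)


# Hypothetical onset-time margins and association parameter
h = joint_cdf(2.0, 3.0, rate1=0.1, rate2=0.2, a=2.0)
independent = exp_cdf(2.0, 0.1) * exp_cdf(3.0, 0.2)
# Letting t2 grow recovers the first margin: H(t1, inf) = F1(t1)
margin_check = joint_cdf(2.0, 1e6, rate1=0.1, rate2=0.2, a=2.0)
print(h, independent, margin_check)
```

With a positive association parameter the joint probability exceeds the product of the margins, while each margin is left unchanged.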

In many studies, it is possible to record time of onset for observed cases of disease; these are incidence studies. Other studies (including the one considered here) record only present state, or prevalence data. In both cases, removal of subjects from the study population before (or, for present-state studies, after) disease onset, termed censoring, has the potential to distort findings. In incidence data, we have right censoring of onset times for those subjects who do not develop disease during the study. In prevalence data, we also have left censoring of onset times for cases of disease.

Copula models were proposed previously in the study of disease data from family case-control studies, where families are recruited through proband cases and controls [e.g., Li et al., 1998; Chatterjee et al., 2001; Chatterjee and Shih, 2001]. Up to now, work on the copula approach has dealt with incidence data. Li et al. [1998] reanalyzed data from a lung cancer study [Schwartz et al., 1996], using a framework proposed by Whittemore [1995]. This framework took account of the case-control design, using proportional hazards margins and Clayton's copula [Clayton, 1978], to structure the joint survival function for each set of relatives. The authors found apparently significant evidence of familial aggregation of disease if covariates such as smoking history were not included in the model, but no significant evidence once these covariates were included.

More recently, Chatterjee et al. [2001] analyzed data from the Washington Ashkenazi Study [Wacholder et al., 1998] of breast and ovarian cancer in which probands were genotyped for known BRCA1 and BRCA2 mutations. They used nonparametric marginal survival functions for the margins, and used copula functions to construct the joint survival function for all possible pairs of relatives (not including probands), allowing the association to vary between carriers and noncarriers of BRCA1 and BRCA2 mutations.

The authors fitted this model using three different copula functions: those of Clayton [1978] and Frank [1979], and the positive stable model [Hougaard, 1986], and found broadly similar results for all three. From the fitted model, they calculated the cumulative risk of breast cancer for a woman, given her BRCA1/2 carrier status and whether she has a first-degree relative with breast cancer. They also showed how a relative recurrence risk may be estimated from this, and found the first-degree recurrence risk ratio to be about 2.0 among noncarriers, but not significantly above unity among carriers.

Chatterjee and Shih [2001] also proposed a bivariate cure-mixture model for analysis of the same data. The mixture model measures two types of association: between age at onset (as above), and between overall lifetime susceptibility. This is done by categorising individuals as either susceptible or not. The authors used the pairwise odds ratio, $\gamma$, to measure association between overall disease susceptibility and formulated the joint survival distribution, conditional on both individuals being susceptible, as a copula function of marginal survival distributions. The same three copula functions were used, and the authors again found the results to be comparable between all three, providing evidence for a strong and significant association in overall susceptibility between pairs of relatives ($\gamma \approx 2.8$). However, there was only weak evidence for association between age at onset once this association in overall susceptibility was accounted for.

Shih [1998] proposed a bivariate discrete survival distribution which can accommodate covariates in the margins and yields a constant odds ratio at any grid point. The author also showed how this might be extended to multivariate data, using the same margins, and pairwise odds ratios and conditional odds ratios to model the associations. This was applied to data on heart disease in sisters from a longitudinal study, and gave an estimated constant odds ratio of 4.1 for death from coronary heart disease (a recurrence risk ratio was not calculated). This is a marginal model, with discretized hazard function margins, and the association was measured using odds ratios. Although the author did not use a copula function to model the association, the model is similar to the one which will be proposed here.

In all the above studies, age-at-onset data were available, and were used to construct joint survival distributions. In prevalence studies, such data are not available. It is also possible to view prevalence data simply as multivariate binary response data. This leads to methods which are numerically equivalent to those proposed below, but we chose to approach the problem from the copula perspective for two reasons. Firstly, this approach relates the analysis of prevalence data to the same class of models which now dominates the analysis of incidence data, permitting a unified approach. Secondly, we have a strong interest in estimation of indices based on recurrence risks (e.g., $\lambda_S$), and such indices are only useful when defined in terms of cumulative incidence. The interpretation of parameters of our model in this way requires a copula approach, together with some assumptions concerning the censoring process (see Appendix A). For a careful discussion of the problems of interpretation of present-state data in terms of incidence, see Keiding [1991].

MOTIVATING EXAMPLE

Leprosy is a disease caused by infection with Mycobacterium leprae. Infection is necessary for disease, but it is thought that only about 10% of infections lead to clinical disease, which may be manifested across a spectrum from paucibacillary (PB) to multibacillary (MB) disease.

Development of disease depends not only on infection, but also varies according to age (age-specific risk typically peaks in teenagers and young adults) and vaccination history [BCG vaccination reduces the odds of disease by a factor of 2; Ponnighaus et al., 1994a]. There is also evidence that host genetics affect the development of disease. Many linkage and association studies showed the involvement of the HLA region [e.g., Shaw et al., 2001], and recently, strong evidence was found in linkage analyses of sibling pairs from Indian and Vietnamese populations for susceptibility loci on chromosomes 6, 10, and 20 [Mira et al., 2003; Siddiqui et al., 2001; Tosh et al., 2002]. Estimation of $\lambda_R$ by the usual case-control methods would be expected to give biased results, since nongenetic risk factors, particularly exposure to the infectious agent, tend to cluster in families.

The Karonga Prevention Study (KPS) conducted two total population surveys between 1979 and 1989 in the Karonga district, Northern Malawi, described by Ponnighaus et al. [1987]. Current or past leprosy cases were identified by paramedical leprosy control assistants, and data were also collected about familial relationships and nongenetic factors known to affect risk of disease. These data, in contrast to those in the family case-control studies described above, are essentially present-state data: generally it is only known whether disease onset occurred before the individual was first examined, and not the age at onset.

PLACKETT'S COPULA AND GLOBAL CROSS-RATIO MODELS

The models discussed above made use of incidence data and constructed likelihoods in terms of joint survival functions. This is an efficient approach when such data are available, but not possible in this study. Instead we model present-state data, and the most natural choice for margins is the logistic function, which allows us to model cumulative incidence (the probability of a person ever having had disease) as a function of covariates (including age at examination). We chose to employ Plackett's copula to model the association between pairs of relatives. Other choices of copula functions would be possible, but Plackett's is a natural generalization of the logistic margins, parameterized in terms of an odds ratio.

Plackett [1965] described a class of bivariate distributions with given margins and a single parameter which measures the degree of association. Let $X$, $Y$ have marginal distribution functions $F_X(x)$, $G_Y(y)$ and a joint distribution function $H_{X,Y}(x, y; \theta)$ (for simplicity, denoted $F$, $G$, and $H$) satisfying

$$\theta = \frac{H(1 - F - G + H)}{(F - H)(G - H)} \tag{2}$$

where $\theta$ is constant.

Dale [1986] extended this model to deal with ordinal bivariate responses (the global cross-ratio model, CRM), in which the data are summarized by an $r \times c$ contingency table. The table can be dichotomized at any $(r', c')$ for $1 \le r' < r$, $1 \le c' < c$, and a series of $(r - 1)(c - 1)$ global odds ratios $\theta_{r'c'}$ is calculated. Dale [1986] allowed the marginal probabilities to depend on covariate vectors $Z_1$, $Z_2$ (through a logit link) and $\theta_{r'c'}$ to depend both on $Z_1$ and $Z_2$ and on the cutpoint $(r', c')$ (through a log link). This is a departure from the model of Plackett [1965], which requires that the odds ratio be constant no matter where the dichotomization occurs; Plackett's model is therefore sometimes known as the constant global cross-ratio model.
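The dichotomization described above can be sketched in a few lines. This is an illustrative implementation only: the function name and the example table are hypothetical, and no covariates or link functions are involved.

```python
def global_odds_ratios(table):
    """Global odds ratio at every cutpoint (r', c') of an r x c table.

    `table` is a list of rows of counts for two ordinal responses.
    Cutpoints are 1-based: rows 1..r' against rows r'+1..r, and
    columns 1..c' against columns c'+1..c.
    """
    r, c = len(table), len(table[0])
    ratios = {}
    for rp in range(1, r):
        for cp in range(1, c):
            # Collapse the table into four quadrants around the cutpoint
            a = sum(table[i][j] for i in range(rp) for j in range(cp))
            b = sum(table[i][j] for i in range(rp) for j in range(cp, c))
            cc = sum(table[i][j] for i in range(rp, r) for j in range(cp))
            d = sum(table[i][j] for i in range(rp, r) for j in range(cp, c))
            ratios[(rp, cp)] = (a * d) / (b * cc)
    return ratios


# Hypothetical 3 x 3 table of counts
example = [[10, 5, 2],
           [4, 8, 6],
           [1, 5, 12]]
ratios = global_odds_ratios(example)
for cutpoint, value in sorted(ratios.items()):
    print(cutpoint, round(value, 2))
```

A 3 x 3 table yields (3 - 1)(3 - 1) = 4 global odds ratios. Under the constant model these would all estimate a single value, whereas the extended model lets them differ by cutpoint.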

METHODS

ESTIMATING $\theta$ USING MAXIMUM LIKELIHOOD

Plackett [1965] avoided estimation of $\theta$ by maximum likelihood because it would be a “tedious numerical process,” and suggested the frequency estimate

$$\theta^{+} = ad/bc$$

where $a$, $b$, $c$, and $d$ are the counts in the four quadrants of a contingency table. $\theta^{+}$ is asymptotically normal, with mean $\theta$ and a variance estimated consistently by

$$V(\theta^{+}) = (\theta^{+})^2 (1/a + 1/b + 1/c + 1/d).$$

Choosing the quadrants so that the dividing lines are at the medians of the margins minimizes this variance. But this estimate requires that $\theta$ is held constant, and in many cases we only observe the discrete realization of the distribution and can do nothing to improve the variance of the estimate. We instead use maximum likelihood methods, as did Dale [1986].
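As a worked illustration of the frequency estimate and its variance, the sketch below applies the two formulas to the sibling-pair counts reported in Table I. Table I does not order the members of a pair, so splitting the 5,122 discordant pairs evenly between the two off-diagonal quadrants is an assumption made here, as are the function and variable names. The variance treats pairs as independent, which the paper itself does not assume.

```python
import math


def plackett_frequency_estimate(a, b, c, d):
    """Frequency estimate theta+ = ad/bc and its estimated variance."""
    theta_plus = (a * d) / (b * c)
    variance = theta_plus ** 2 * (1 / a + 1 / b + 1 / c + 1 / d)
    return theta_plus, variance


# Table I counts: 207 affected/affected, 5,122 affected/unaffected,
# 189,620 unaffected/unaffected. Discordant pairs split evenly (assumption).
theta_plus, variance = plackett_frequency_estimate(207, 2561, 2561, 189620)
std_error = math.sqrt(variance)
print(round(theta_plus, 2), round(std_error, 2))
```

The resulting crude odds ratio is roughly 5.98, close to the condition 1 entry of Table II.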

In this paper, we consider present-state bivariate data: cumulative incidence of disease for pairs of relatives. Let $(T_1, T_2)$ be the times at which the members of a pair of relatives become affected by disease, and let the pair be observed at time $(t_1, t_2)$. Let $(d_1, d_2)$ be the disease states for the pair such that

$$d_i = \begin{cases} 1 & T_i \le t_i \\ 0 & T_i > t_i \end{cases}, \quad i = 1, 2.$$

Assume that the disease is such that if people develop disease before they are seen, they either remain affected or show signs of past disease. The cumulative incidence, $u_i(t_i) = P(T_i \le t_i)$, $i = 1, 2$, is the marginal distribution function of the time of disease onset. In our example, time is age, which has been grouped (into 5-year bands) so that time is a discrete variable. Choosing a logistic link function for the margins, we can write

$$\log \frac{u_i(t_i)}{1 - u_i(t_i)} = \alpha + \beta_{b(t_i)}, \quad i = 1, 2$$

to estimate $u_i(t_i)$, where we use $b(t_i)$ to categorize time (time is divided into $n_b$ bands, and $b(t) \in \{1, \ldots, n_b\}$ indicates which of these $n_b$ bands $t$ belongs to). Then $\beta_{b(t_i)}$ is the baseline log-odds for disease for an individual in age band $b(t_i)$. We can incorporate covariates ($x_i$) using

$$\log \frac{u_i(t_i)}{1 - u_i(t_i)} = \alpha + \beta_{b(t_i)} + \gamma x_i = \eta_i, \quad i = 1, 2.$$

Assuming there is no cohort effect, we expect $\beta_{b(t_i)}$ to be nondecreasing with respect to $t_i$ (cumulative incidence does not decrease with age). Note that the above equation takes a similar form to Equation (1) in Rossini and Tsiatis [1996]. In that application of a proportional odds regression model to univariate data, a monotonicity restriction was placed on $\beta_{b(t_i)}$. A similar restriction could be applied here, but visual inspection of the data showed that such a restriction was unnecessary.
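The logistic margin with age bands can be sketched as follows. All parameter values are hypothetical, the helper names are ours, and fixing the first band's coefficient at zero as a reference level is an assumption made here for identifiability, not something stated in the text.

```python
import math


def age_band(age, width=5):
    """Band index b(t) in {1, 2, ...} for an age t, using 5-year bands."""
    return int(age // width) + 1


def cumulative_incidence(age, x, alpha, beta, gamma, width=5):
    """Logistic margin: logit u(t) = alpha + beta[b(t)] + gamma . x."""
    eta = alpha + beta[age_band(age, width)]
    eta += sum(g * xi for g, xi in zip(gamma, x))
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical parameters: band 1 is the reference level (assumption)
alpha = -6.0
beta = {1: 0.0, 2: 0.8, 3: 1.5, 4: 1.9}
gamma = [-0.7, 0.9]  # e.g. coefficients for two binary covariates

for age in (3, 8, 13, 18):
    u = cumulative_incidence(age, [1, 0], alpha, beta, gamma)
    print(age, age_band(age), round(u, 5))
```

Because the band coefficients here are nondecreasing, the fitted cumulative incidence is nondecreasing in age, which is the property discussed above.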

If we suspected a cohort effect, this might introduce nonmonotonicity in $\beta_{b(t_i)}$ (e.g., if older cohorts were less likely than younger cohorts to ever suffer from disease). With data measured at only one timepoint, we would not be able to disentangle any cohort effect from the age effect, and $\beta_{b(t_i)}$ would be the baseline log-odds for disease for an individual in the cohort formed by those in age band $b(t_i)$. However, this would in turn introduce nonmonotonicity in $u_i(t_i)$, which would not be acceptable since, under a copula model, $u_i(t_i)$ estimates a distribution function. Note that a cohort effect in the other direction (older cohorts having a higher lifetime risk of disease than younger cohorts) would be unlikely to be detectable (we do not know the lifetime risk of disease for younger cohorts), but would not violate the assumptions necessary for a copula model.

Rossini and Tsiatis [1996] also showed that, for small sample sizes (50–400), the choice of bandwidth for $b(t_i)$ is important, with too few bands leading to bias in estimates of $\gamma$. In our application, the sample size is very large, and we chose 5-year bands based on visual inspection of the data, but more careful consideration of bandwidth would be required for smaller datasets.

In the absence of censoring,

$$\theta(t_1, t_2) = \frac{P(T_1 \le t_1, T_2 \le t_2)\,P(T_1 > t_1, T_2 > t_2)}{P(T_1 \le t_1, T_2 > t_2)\,P(T_1 > t_1, T_2 \le t_2)}. \tag{3}$$

Under Plackett's copula, this cross ratio is constant, i.e., $\theta(t_1, t_2) = \theta$ for all $(t_1, t_2)$, and we will write $\theta$ in place of $\theta(t_1, t_2)$. Assume that the joint distribution function $\delta(t_1, t_2) = H(t_1, t_2; \theta)$ satisfies Equation (2). Since $\theta \in [0, \infty)$, another natural link function is

$$\log \theta = \gamma z = \nu \tag{4}$$

where $z$ denotes a joint covariate vector for the pair. Note that $z$ need not contain any covariates, and setting $z = 1$ gives Plackett's copula. Solving Equation (2) gives the joint cumulative incidence (the probability that both members of a pair are affected before time $(t_1, t_2)$) as

$$\delta(t_1, t_2) = C_\theta(u_1, u_2) = \frac{1 + (\theta - 1)(u_1 + u_2) - \sqrt{[1 + (\theta - 1)(u_1 + u_2)]^2 - 4 u_1 u_2 \theta(\theta - 1)}}{2(\theta - 1)}$$

for $\theta \ne 1$ (with $\delta = u_1 u_2$ when $\theta = 1$), where we write $u_i$ in place of $u_i(t_i)$ for simplicity. We will also write $\delta$ in place of $\delta(t_1, t_2)$ for the same reason, but note that, while $\theta$ is constant, $u_i$ and $\delta$ (and hence $\lambda_R$) depend on $(t_1, t_2)$. The log likelihood of such an observation is

$$L(d_1, d_2, x_1, x_2, z) = d_1 d_2 \log \delta + d_1 (1 - d_2) \log(u_1 - \delta) + (1 - d_1) d_2 \log(u_2 - \delta) + (1 - d_1)(1 - d_2) \log(1 - u_1 - u_2 + \delta)$$

and using Equation (1), we can estimate $\lambda_R$ by $\hat{\lambda}_R = \hat{\delta} / (\hat{u}_1 \hat{u}_2)$.

CENSORING

As mentioned above, the time individuals remain in a study may be censored, and this can distort findings of both incidence and prevalence studies. $P(T \le t)$ is the cumulative incidence of disease, but we observe only prevalence, which is conditional on the availability of subjects for study. Thus, for the first member of a pair observed at $(t_1, t_2)$, we observe disease not with probability $u_1 = P(T_1 \le t_1)$ but with probability $u_1^* = P(T_1 \le t_1 \mid C_1 > t_1, C_2 > t_2)$, where $C_i$ denotes the censoring time of member $i$. This is similar for the second member. The probability that both are affected, given that they are observed, takes the same form as in the uncensored case, but is now written in terms of these conditional probabilities. The parameter $\theta$ in the likelihood for the copula model now refers to

$$\theta^* = \frac{P(T_1 \le t_1, T_2 \le t_2 \mid C_1 > t_1, C_2 > t_2)\,P(T_1 > t_1, T_2 > t_2 \mid C_1 > t_1, C_2 > t_2)}{P(T_1 \le t_1, T_2 > t_2 \mid C_1 > t_1, C_2 > t_2)\,P(T_1 > t_1, T_2 \le t_2 \mid C_1 > t_1, C_2 > t_2)}.$$

In Appendix A we show that, while the requirements for censoring such that marginal survival times may be estimated correctly from present-state data are quite restrictive [Keiding, 1991], the necessary conditions for $\theta^*(t_1, t_2) = \theta(t_1, t_2)$ are less so.

STANDARD ERRORS

For a rare disease, there will be substantially more doubly unaffected relative pairs than pairs with one or both members affected. In this situation, it may speed analysis to take a sample of the doubly unaffected pairs and use weights accordingly when calculating the likelihood.

Also note that not all pairs will be independent. A sibling trio can be split into three pairs, but the affection status of the third pair is completely determined by the status of the first two pairs. However, including only independent pairs could introduce a bias (a sib trio with two affected members could form two affected/unaffected pairs, or one affected/unaffected and one affected/affected pair). An alternative to splitting sibships into all possible pairs would be to calculate the likelihood for the sibship itself. However, Plackett's copula does not extend easily to more than two dimensions [Molenberghs and Lesaffre, 1994], and preliminary investigations showed that what might be expected to be natural parameterizations of the three-dimensional model do not lead to valid copula functions.
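The closed form of Plackett's copula and the weighted pair log likelihood used in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example margins, and the weight of 5 (standing for a 1-in-5 sample of doubly unaffected pairs) are assumptions.

```python
import math


def plackett(u1, u2, theta):
    """Joint cumulative incidence delta = C_theta(u1, u2)."""
    if abs(theta - 1.0) < 1e-12:
        return u1 * u2  # independence when the odds ratio is 1
    s = 1.0 + (theta - 1.0) * (u1 + u2)
    disc = s * s - 4.0 * u1 * u2 * theta * (theta - 1.0)
    return (s - math.sqrt(disc)) / (2.0 * (theta - 1.0))


def pair_log_likelihood(d1, d2, u1, u2, theta, weight=1.0):
    """Weighted log likelihood contribution of one pair of relatives."""
    delta = plackett(u1, u2, theta)
    if d1 and d2:
        p = delta                    # both affected
    elif d1:
        p = u1 - delta               # only the first member affected
    elif d2:
        p = u2 - delta               # only the second member affected
    else:
        p = 1.0 - u1 - u2 + delta    # neither affected
    return weight * math.log(p)


def recurrence_risk_ratio(u1, u2, theta):
    """Fitted lambda_R = delta / (u1 * u2)."""
    return plackett(u1, u2, theta) / (u1 * u2)


# Hypothetical margins and odds ratio
u1, u2, theta = 0.02, 0.03, 2.0
delta = plackett(u1, u2, theta)
# A sampled doubly unaffected pair stands in for five such pairs
ll_sampled = pair_log_likelihood(0, 0, u1, u2, theta, weight=5.0)
print(delta, recurrence_risk_ratio(u1, u2, theta), ll_sampled)
```

Summing such contributions over pairs gives a weighted log likelihood that can be passed to a general-purpose optimizer. Because sampling weights and overlapping pairs violate independence, naive standard errors from this likelihood would not be reliable.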

When using sampling, or nonindependent observations, standard errors may be inaccurate. To take account of this, we use robust estimates of standard errors, which allow us to relax assumptions about the independence of observations. In particular, we used clustered robust estimates, which allow for observations from the same family to be correlated [for more details, see Huber, 1967; White, 1980, 1982; Royall, 1986].

CONFIDENCE INTERVALS

It is not so easy to calculate standard errors for $\lambda$ and $\delta$ because they are functions of more than one nonindependent parameter. Instead, we can use simulation, but as this is time-consuming, it is advisable to proceed to this step only after the best-fit model has been determined. After fitting the model, we have a variance-covariance matrix for all parameters. From this we can calculate the variance-covariance matrix $\Sigma$ for $\xi = (\eta_1, \eta_2, \nu)'$; the expected values for these parameters are already known, i.e., $\mu = (\hat{\eta}_1, \hat{\eta}_2, \hat{\nu})'$. We use Cholesky decomposition to simulate trivariate normals from $N(\mu, \Sigma)$, and for each realization $\xi_i$ of $\xi$, calculate $y_i = (\lambda, \delta)'$. The empirical 95% confidence interval is then given by the 2.5% and 97.5% centiles of $y$.
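The simulation procedure above can be sketched with the standard library alone. The mean vector and covariance matrix below are hypothetical, the function names are ours, and the simple centile rule is one of several reasonable choices.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def plackett(u1, u2, theta):
    """Joint cumulative incidence under Plackett's copula."""
    if abs(theta - 1.0) < 1e-12:
        return u1 * u2
    s = 1.0 + (theta - 1.0) * (u1 + u2)
    disc = s * s - 4.0 * u1 * u2 * theta * (theta - 1.0)
    return (s - math.sqrt(disc)) / (2.0 * (theta - 1.0))


def cholesky(sigma):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(sigma)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                low[i][j] = (sigma[i][j] - s) / low[j][j]
    return low


def centile(sorted_values, q):
    return sorted_values[int(round(q * (len(sorted_values) - 1)))]


def simulate_intervals(mu, sigma, n_sim=5000, seed=2003):
    """Empirical 95% intervals for lambda and delta."""
    rng = random.Random(seed)
    low = cholesky(sigma)
    lam_draws, delta_draws = [], []
    for _ in range(n_sim):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        # xi = mu + L z is a draw from N(mu, sigma)
        eta1, eta2, nu = (mu[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                          for i in range(3))
        u1, u2, theta = expit(eta1), expit(eta2), math.exp(nu)
        delta = plackett(u1, u2, theta)
        lam_draws.append(delta / (u1 * u2))
        delta_draws.append(delta)
    lam_draws.sort()
    delta_draws.sort()
    return ((centile(lam_draws, 0.025), centile(lam_draws, 0.975)),
            (centile(delta_draws, 0.025), centile(delta_draws, 0.975)))


# Hypothetical fitted values (eta1, eta2, nu) and their covariance matrix
mu = [-4.0, -4.0, math.log(2.0)]
sigma = [[0.010, 0.002, 0.0],
         [0.002, 0.010, 0.0],
         [0.0, 0.0, 0.010]]
lam_ci, delta_ci = simulate_intervals(mu, sigma)
print(lam_ci, delta_ci)
```

Fixing the seed makes the intervals reproducible. In practice the covariance matrix would come from the robust variance estimate of the fitted model.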

INTERPRETATION OF $\theta$

$\theta$ can be thought of as the ratio of the odds of disease, given that someone has a relative of type $R$ with disease, to the odds of disease, given that they have a relative (of type $R$) who is not affected. It is easier to make inferences about $\theta$ than about $\lambda_R$, because $\theta$ is a parameter in our model, while $\lambda_R$ can only be a fitted value and so will vary according to an individual's marginal probability of disease. Note that this makes sense: if a disease is affected by environmental factors, then the relative risk of disease will vary according to those factors. Also, $\hat{\lambda}$ is limited by $\hat{\theta}$ (see Appendix B) and will approach $\theta$ for rare diseases, so

$$\begin{cases} \hat{\lambda} \in [\hat{\theta}, 1) & \hat{\theta} < 1 \\ \hat{\lambda} = 1 & \hat{\theta} = 1 \\ \hat{\lambda} \in (1, \hat{\theta}] & \hat{\theta} > 1. \end{cases}$$

However, estimates of $\lambda_R$, for example, for individuals with specific covariates, are easier to interpret and should also be reported.
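The bounds above can also be seen directly from Equation (2). The following is a sketch of one possible argument, not necessarily the one given in Appendix B. It writes the joint cumulative incidence as $\delta = \lambda u_1 u_2$ and assumes $0 < \delta < \min(u_1, u_2)$. Substituting into Equation (2) with $H = \delta$, $F = u_1$, and $G = u_2$ gives

$$\theta = \frac{\delta (1 - u_1 - u_2 + \delta)}{(u_1 - \delta)(u_2 - \delta)} = \lambda \, \frac{1 - u_1 - u_2 + \lambda u_1 u_2}{(1 - \lambda u_1)(1 - \lambda u_2)}.$$

The numerator and denominator of the last fraction differ by

$$(1 - u_1 - u_2 + \lambda u_1 u_2) - (1 - \lambda u_1)(1 - \lambda u_2) = (\lambda - 1)(u_1 + u_2 - \delta),$$

and $u_1 + u_2 - \delta > 0$ under the stated assumption. The fraction therefore exceeds 1 when $\lambda > 1$, equals 1 when $\lambda = 1$, and is below 1 when $\lambda < 1$, so $\lambda$ always lies between 1 and $\theta$. As $u_1, u_2 \to 0$ the fraction tends to 1, so $\lambda$ approaches $\theta$ for rare diseases.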

Plackett [1965] required that $\theta$ be constant, while Dale [1986] allowed it to vary. Consider a disease that is influenced partly by genetics and partly by environment, and suppose that two genetic types exist in the population, susceptible and resistant. If both types are susceptible to disease at high environmental risk levels, but only genetically susceptible individuals are susceptible to disease at low environmental risk levels, we would expect $\theta$ to be higher among low-exposure groups, since the affected members of these groups would be mostly genetically susceptible. This could be seen as genetic risk factors modifying the effect of nongenetic risk factors. In this situation, it may be beneficial to prefer people who are affected, despite low environmental risks, for inclusion in genetic analysis studies. This is similar to the argument used by researchers who focus genetic studies on those who have particularly early onset of some disease. On the other hand, $\theta$ may be constant across levels of environmental risks. In this situation, there is no clear preference about who should be included in genetic studies. These two situations can be distinguished if we use $z$ from Equation (4) to dichotomize pairs according to their predicted nongenetic risks.

APPLICATION TO LEPROSY DATA

To measure cumulative incidence, we need to be able to detect signs of past leprosy in individuals who may have self-healed. In fact, a graph of cumulative incidence by age (unpublished data) shows that it is nondecreasing between ages 0–75, but drops sharply after that. There is no evidence that older members of this population are less likely to have ever had leprosy; it is far more plausible that this is due to a combination of underascertainment, self-healing, and early mortality. Although leprosy itself is unlikely to cause early death [in this population, the risk ratio for early mortality is not significantly different from 1; Chirwa, 2001], it is associated with low socioeconomic status, which is itself associated with higher mortality [Ponnighaus et al., 1994b]. We excluded 2,473 individuals aged 75 or older because it is likely that cumulative leprosy incidence is not accurately measured in this group. The remaining 170,279 individuals were used in this analysis.

There is strong evidence that extended, close contact with a leprosy case favors transmission of M. leprae, and that sharing a household substantially increases risk of disease, particularly when a household is shared with an MB case [Fine et al., 1997]. Contact is measured by two covariates: whether an individual shared a household with a PB or an MB case during either of the population surveys. Multiple observations of the same individual cannot be accommodated by the proposed model. Instead, we use only one observation per individual, chosen as the latest observation for those who had never been leprosy cases, and the first observation at which clinical leprosy was recorded for affected individuals.

The covariates included in the model, then, are BCG vaccination status, sex, and household contact with PB or MB cases. In total, there are 194,949 sibling pairs on the KPS database. The affection status of these pairs is shown in Table I. In order to speed up calculation, we took a 1-in-5 sample of the doubly unaffected pairs, and weighted those observations accordingly when calculating the likelihood.

RESULTS

Covariates are grouped into two groups, according to whether they tend to cluster in families:

1. BCG status, sex.
2. Household contact.

Group 2 covariates have a tendency for familial clustering, while group 1 covariates do not. The model was fitted under the following five sets of conditions:

1. No covariates.
2. (a) Group 1 covariates; $\theta$ constant.
   (b) Group 1 covariates; $\theta$ allowed to vary.
3. (a) Group 1 and 2 covariates; $\theta$ constant.
   (b) Group 1 and 2 covariates; $\theta$ allowed to vary.

$\theta$ was allowed to vary by dichotomizing individuals according to their marginal predicted risks under conditions 2(a) and 3(a). The cutoff chosen was the median marginal risk among affecteds estimated in conditions 2(a) and 3(a). It was hoped that this would provide the most power by placing equal numbers of affecteds in each group. Pairs were then categorized as low-low, low-high, or high-high. When estimating the model, no significant difference could be found between the low-high and high-high pairs, so the two groups were combined into a single high-risk group. The estimates $\hat{\theta}$ under each set of conditions are shown in Table II. If all covariates are ignored, $\hat{\theta} \approx 3$. This falls only very slightly when sex and BCG status are included, but falls significantly when household contact is also included. Thus the apparent $\lambda_S$ is inflated if nongenetic factors which tend to cluster in families (household contact in this example) are ignored, as predicted by Guo [2000].
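The dichotomization of pairs described above can be sketched as follows. The predicted risks are hypothetical, the function names are ours, and treating a risk strictly above the cutoff as "high" is an assumption, since the text does not say how ties are handled.

```python
import statistics


def risk_cutoff(predicted_risks, affected):
    """Median predicted marginal risk among affected individuals."""
    return statistics.median(
        [r for r, a in zip(predicted_risks, affected) if a])


def classify_pair(risk1, risk2, cutoff):
    """Label a pair by how many members exceed the cutoff."""
    n_high = int(risk1 > cutoff) + int(risk2 > cutoff)
    return ("low-low", "low-high", "high-high")[n_high]


def pair_covariate(risk1, risk2, cutoff):
    """Binary z for the log link on theta: 0 = low-low, 1 = otherwise.

    Low-high and high-high pairs are merged into one high-risk group,
    following the text.
    """
    return 0 if classify_pair(risk1, risk2, cutoff) == "low-low" else 1


# Hypothetical predicted risks and affection status for five individuals
risks = [0.01, 0.05, 0.02, 0.08, 0.03]
affected = [0, 1, 0, 1, 1]
cutoff = risk_cutoff(risks, affected)
print(cutoff, classify_pair(0.01, 0.08, cutoff), pair_covariate(0.01, 0.02, cutoff))
```

The resulting indicator would play the role of $z$ in the log link for the odds ratio, giving one value of the odds ratio for low-risk pairs and another for high-risk pairs.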

When contact is accounted for (conditions 3(a) and 3(b)), θ is constant across pairs predicted to be high- and low-risk according to measured covariates. When contact is ignored (conditions 2(a) and 2(b)), θ is not only higher overall, but also varies significantly between pairs predicted to be high- and low-risk according to group 1 covariates. Under this condition, there are two risk factors not included: close contact and genetic risks. Since θ is constant when contact is accounted for, this variation is most likely to be due to contact history modifying the effect of other nongenetic risk factors. This is expected, because exposure to M. leprae is necessary for disease.

θ (rather than λ_S) was used to discriminate between conditions, because there is a range of fitted values for λ_S, bounded by θ and 1, while θ takes a maximum of two values, allowing us to discriminate more clearly. For interpretation, though, λ_S is more useful. Under the preferred condition (3(a)), the fitted values are broadly similar, ranging from 1.47 to 2.01, with a median of 1.91. A histogram of all fitted values is shown in Figure 1.
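To illustrate why a single θ yields a range of fitted λ_S values, the sketch below computes the joint probability of disease from two marginal risks and a cross-ratio, then divides by the product of the margins. It assumes the standard bivariate Plackett form; the paper's extended Plackett copula and its fitted margins are not reproduced here, and the function names and marginal risks are hypothetical.

```python
import math


def plackett_joint(u1, u2, theta):
    """Joint probability that both relatives are affected, given marginal
    risks u1, u2 and a global cross-ratio theta (standard Plackett form)."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    if abs(theta - 1.0) < 1e-12:
        return u1 * u2                        # independence
    s = 1.0 + (theta - 1.0) * (u1 + u2)
    disc = s * s - 4.0 * theta * (theta - 1.0) * u1 * u2
    return (s - math.sqrt(disc)) / (2.0 * (theta - 1.0))


def recurrence_risk_ratio(u1, u2, theta):
    """lambda = P(both affected) / (P(first affected) * P(second affected))."""
    return plackett_joint(u1, u2, theta) / (u1 * u2)


# Hypothetical marginal risks, with theta fixed at the condition 3(a) estimate.
for u in (0.001, 0.01, 0.05, 0.2):
    print(u, round(recurrence_risk_ratio(u, u, 2.06), 3))
```

With θ held fixed, pairs with small marginal risks have λ close to θ, while pairs with larger marginal risks have λ closer to 1, so the fitted values vary with age and covariates even when θ does not.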

TABLE I. Affection Status of Sibling Pairs Used in Study

Affection status          Number    Proportion (%)
Affected/affected            207          0.11
Affected/unaffected        5,122          2.63
Unaffected/unaffected    189,620         97.27
Total                    194,949        100.00
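As a rough arithmetic check on the scale of association in Table I, the crude recurrence risk ratio and cross-ratio can be computed directly from the pair counts. This sketch ignores age and all covariates and assumes that discordant pairs split evenly between the two possible orderings of siblings, so the values need not coincide with the model-based estimates.

```python
# Pair counts taken from Table I.
both, one, neither = 207, 5122, 189620
total = both + one + neither

# Assumption: discordant pairs are split evenly between the two orderings.
p11 = both / total                 # both siblings affected
p12 = p21 = one / (2 * total)      # exactly one sibling affected
p22 = neither / total              # neither sibling affected

u = p11 + p12                      # crude marginal risk of either sibling
crude_lambda = p11 / (u * u)       # P(both) / (P(first) * P(second))
crude_theta = (p11 * p22) / (p12 * p21)

print(f"crude lambda_S = {crude_lambda:.2f}, crude theta = {crude_theta:.2f}")
```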

TABLE II. Estimates of θ Under Each Condition^a

                       Estimate of θ (95% CI)
Condition   θ constant           Low-risk pairs       High-risk pairs
1           5.99 (5.01, 7.15)
2           3.03 (2.49, 3.68)    5.74 (4.08, 8.08)    2.34 (1.93, 2.85)
3           2.06 (1.73, 2.44)    3.07 (2.01, 4.70)    1.87 (1.56, 2.24)

^a Condition 1 contains no covariates, condition 2 all covariates except household contact, and condition 3 all covariates. For conditions 2(a) and 3(a), θ was held constant. For conditions 2(b) and 3(b), θ was allowed to vary between pairs who have high and low environmental risk.


DISCUSSION

We propose a fully parametric marginal model with logistic margins and the joint distribution specified by an extended Plackett copula, from which λ_S fitted values may be found. This differs from previous approaches in this area, which made use of case-control family data including age at onset, mainly because of the different nature of the available data. Previous approaches used copula functions to specify the joint survival function. This was not possible in our application. The finding by Chatterjee and Shih [2001] that, in the case of breast cancer at least, there was significant association between relatives' overall susceptibility to disease but only weak evidence of association between their ages at onset, suggests that the lack of age-at-onset data should not prevent the model from fitting well.

The model proposed in this paper is similar to another marginal model approach, not involving copulas, proposed by Shih [1998]. Data were available from a longitudinal survey, and she also modeled associations between observations from related individuals with a constant odds ratio. Again, repeated measures on individuals at regular time intervals were available (and attention was restricted to those seen 10 times or more). This allowed a discretized survival distribution to be fitted. However, Shih [1998] used a two-stage estimation procedure, first fitting the margins assuming independence, and then substituting these fitted values into a pseudolikelihood which was maximized. Our likelihood can be maximized simultaneously with respect to marginal and joint distribution parameters, which allows standard errors of parameters to be accurately estimated.

Another difference between our approach and others is the treatment of three or more relatives. The framework of Li et al. [1998] models the joint survival distribution for the proband and all his/her relatives, while Chatterjee et al. [2001] and Chatterjee and Shih [2001] considered only the joint survival distribution among pairs of relatives (not including the proband). Although Shih [1998] modeled associations in larger relative sets using conditional odds ratios, she found maximum likelihood estimation infeasible, and instead used a pseudolikelihood estimation procedure that depended only on pairwise parameters. We found that the Plackett copula did not easily generalize to three or more dimensions, and split larger relative sets into all possible pairs, adjusting the standard errors appropriately. This enabled us to use fully parametric maximum likelihood estimation. This model also makes clear that for a complex disease, where risk of disease is subject to both genetic and nongenetic factors, λ_R will not be constant, but will depend on an individual's age and environmental risk factors.
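Splitting larger relative sets into all possible pairs can be sketched as below. The data layout and function name are hypothetical. The family identifier is carried along with each pair because pairs drawn from the same sibship are not independent; one common way to adjust the standard errors, assumed here rather than taken from the paper, is to treat the family as the cluster in a robust variance estimate.

```python
from itertools import combinations


def sibship_pairs(sibships):
    """Expand each sibship into all unordered pairs of its members.

    sibships : dict mapping a family id to a list of individual ids.
    Returns a list of (family_id, member_i, member_j) tuples.
    """
    rows = []
    for family_id, members in sibships.items():
        for i, j in combinations(members, 2):
            rows.append((family_id, i, j))
    return rows


# A sibship of k members contributes k * (k - 1) / 2 pairs.
example = {"fam1": ["a", "b", "c"], "fam2": ["d", "e"], "fam3": ["f"]}
for row in sibship_pairs(example):
    print(row)
```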

Our results do not provide evidence for strong genetic susceptibility to leprosy in Karonga. They are consistent with a hypothesis that susceptibility to leprosy is under the control of many factors, the strongest of which may be nongenetic, with host genetics playing a small but significant role. However, we recognize that there was likely to be undetected household contact [Chirwa, 2001], given that our measure of contact was based on just two surveys over 10 years. Given the potentially long incubation period of leprosy (measured in years and even decades) and that we are considering cumulative incidence, we will certainly have missed earlier household contact, which will be important in older cases. It is therefore likely that our estimates of λ_S would be further reduced if contact histories could be accurately accounted for.

The example here used sibling pairs, but the method works equally well for other relatives for whom data are available. Our example data come from a complete population survey, but sampled data could be used by introducing appropriate sampling weights. Collection of such data may be preliminary to genotyping, and if genetic risks do appear to modify environmental risks, estimating θ in this way may be useful to target genotyping of particular pairs (those with apparently high residual risk) who are more likely to share genetic susceptibilities.

Fig. 1. Histogram of fitted values of λ_S under preferred condition 3(a).


ELECTRONIC DATABASE INFORMATION

A Stata package which implements the model described in this paper may be found at http://www-gene.cimr.cam.ac.uk/clayton/software/stata/.

ACKNOWLEDGMENTS

The data analyzed here were collected within the context of the Karonga Prevention Study (KPS), which has been supported by the British Leprosy Relief Association and the Wellcome Trust, with contributions from the International Federation of Anti-Leprosy Organisations and the WHO/UNDP/World Bank Special Programme for Research and Training in Tropical Diseases. We acknowledge the special contributions of Drs. J.M. Ponnighaus, D.K. Warndorff, and A.C. Crampin, and more than 50 field staff for the field work, Prof. A.V.S. Hill and Dr. J. Fitness for helpful discussions, and the Health Sciences Research Committee of Malawi for encouraging the KPS program. C.W. was supported by the British Medical Research Council.

REFERENCES

Chatterjee N, Shih J. 2001. A bivariate cure-mixture approach for modeling familial association in diseases. Biometrics 57:779–786.

Chatterjee N, Shih J, Hartge P, Brody L, Tucker M, Wacholder S. 2001. Association and aggregation analysis using kin-cohort designs with applications to genotype and family history data from the Washington Ashkenazi Study. Genet Epidemiol 21:123–138.

Chirwa TF. 2001. Effect of household dynamics on risk of disease associated with household contact. Ph.D. thesis, University of London.

Clayton DG. 1978. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65:141–151.

Dale JR. 1986. Global cross-ratio models for bivariate, discrete, ordered responses. Biometrics 42:909–917.

Duffy DL, Montgomery GW, Hall J, Mayne C, Healey SC, Brown J, Boomsma DI, Martin NG. 2001. Human twinning is not linked to the region of chromosome 4 syntenic with the sheep twinning gene FecB. Am J Med Genet 100:182–186.

Fine PE, Sterne J, Ponnighaus JM, Bliss L, Saul J, Chihana A, Munthali M, Warndorff D. 1997. Household and dwelling contact as risk factors for leprosy in northern Malawi. Am J Epidemiol 146:91–102.

Frank M. 1979. On the simultaneous associativity of F(x, y) and x + y − F(x, y). Aequationes Math 19:194–226.

Guo S-W. 2000. Familial aggregation of environmental risk factors and familial aggregation of disease. Am J Epidemiol 151:1121–1131.

Hougaard P. 1986. A class of multivariate failure time distributions. Biometrika 73:671–678.

Huber P. 1967. The behaviour of maximum likelihood estimates under non-standard conditions. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1. Berkeley, CA: University of California Press. p 221–233.

Keiding N. 1991. Age-specific incidence and prevalence: a statistical perspective. J R Stat Soc A 154:371–412.

Li H, Yang P, Schwartz AG. 1998. Analysis of age of onset data from case-control family studies. Biometrics 54:1030–1039.

Mira MT, Alcaïs A, Thuc NV, Thai VH, Huong NT, Ba NN, Verner A, Hudson TJ, Abel L, Schurr E. 2003. Chromosome 6q25 is linked to susceptibility to leprosy in a Vietnamese population. Nat Genet 33:412–415.

Molenberghs G, Lesaffre E. 1994. Marginal modeling of correlated ordinal data using a multivariate Plackett distribution. J Am Stat Assoc 89:633–644.

Nelsen RB. 1999. An introduction to copulas. Lecture notes in statistics, volume 139. New York: Springer-Verlag.

Plackett R. 1965. A class of bivariate distributions. J Am Stat Assoc 60:516–522.

Ponnighaus J, Fine PE, Bliss L, Sliney I, Bradley D, Rees R. 1987. The Lepra Evaluation Project (LEP), an epidemiological study of leprosy in northern Malawi. I: methods. Lepr Rev 58:359–375.

Ponnighaus JM, Fine PE, Sterne JA, Bliss L, Wilson RJ, Malema SS. 1994a. Incidence rates of leprosy in Karonga district, northern Malawi: patterns by age, sex, BCG status and classification. Int J Lepr Other Mycobact Dis 62:10–23.

Ponnighaus JM, Fine PE, Sterne JA, Malema SS, Bliss L, Wilson RJ. 1994b. Extended schooling and good housing conditions are associated with reduced risk of leprosy in rural Malawi. Int J Lepr Other Mycobact Dis 62:345–352.

Risch N. 1990a. Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet 46:222–228.

Risch N. 1990b. Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am J Hum Genet 46:229–241.

Risch N. 1990c. Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. Am J Hum Genet 46:242–253.

Rossini A, Tsiatis A. 1996. A semiparametric proportional odds regression model for the analysis of current status data. J Am Stat Assoc 91:713–721.

Royall R. 1986. Model robust confidence intervals using maximum likelihood estimators. Int Stat Rev 54:221–226.

Schwartz AG, Yang P, Swanson GM. 1996. Familial risk of lung cancer among nonsmokers and their relatives. Am J Epidemiol 144:554–562.

Shaw M, Donaldson I, Collins A, Peacock C, Lins-Lainson Z, Shaw J, Ramos F, Silveira F, Blackwell J. 2001. Association and linkage of leprosy phenotypes with HLA class II and tumour necrosis factor genes. Genes Immun 2:196–204.

Shih JH. 1998. Modeling multivariate discrete failure time data. Biometrics 54:1115–1128.

Siddiqui MR, Meisner S, Tosh K, Balakrishnan K, Ghei S, Fisher SE, Golding M, Narayan NPS, Sitaraman T, Sengupta U, Pitchappan R, Hill AV. 2001. A major susceptibility locus for leprosy in India maps to chromosome 10p13. Nat Genet 27:439–441.

Tosh K, Meisner S, Siddiqui MR, Balakrishnan K, Ghei S, Golding M, Sengupta U, Pitchappan RM, Hill AV. 2002. A region of chromosome 20 is linked to leprosy susceptibility in a South Indian population. J Infect Dis 186:1190–1193.


Wacholder S, Hartge P, Struewing JP, Pee D, McAdams M, Brody L, Tucker M. 1998. The kin-cohort study for estimating penetrance. Am J Epidemiol 148:623–630.

White H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48:817–838.

White H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50:1–25.

Whittemore AS. 1995. Logistic regression of family data from case-control studies. Biometrika 82:57–67.

APPENDIX A

CENSORING

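The conditional independence assumption states that $C_1 \mid T_1$ is conditionally independent of $C_2$ and $T_2$, and $C_2 \mid T_2$ is conditionally independent of $C_1$ and $T_1$, i.e.,

$$P(C_1, C_2 \mid T_1, T_2) = P(C_1 \mid T_1)\,P(C_2 \mid T_2).$$

If we assume conditional independence, then $\theta$ is not changed in the presence of censoring. Consider

$$
\begin{aligned}
p_{11} &= P(T_1 \le t_1, T_2 \le t_2 \mid C_1 > t_1, C_2 > t_2) \\
&= \frac{P(C_1 > t_1, C_2 > t_2 \mid T_1 \le t_1, T_2 \le t_2)\,P(T_1 \le t_1, T_2 \le t_2)}{P(C_1 > t_1, C_2 > t_2)} \\
&= \frac{P(C_1 > t_1 \mid T_1 \le t_1)\,P(C_2 > t_2 \mid T_2 \le t_2)\,P(T_1 \le t_1, T_2 \le t_2)}{P(C_1 > t_1, C_2 > t_2)}.
\end{aligned}
$$

Similarly,

$$p_{22} = P(T_1 > t_1, T_2 > t_2 \mid C_1 > t_1, C_2 > t_2) = \frac{P(C_1 > t_1 \mid T_1 > t_1)\,P(C_2 > t_2 \mid T_2 > t_2)\,P(T_1 > t_1, T_2 > t_2)}{P(C_1 > t_1, C_2 > t_2)},$$

$$p_{12} = P(T_1 \le t_1, T_2 > t_2 \mid C_1 > t_1, C_2 > t_2) = \frac{P(C_1 > t_1 \mid T_1 \le t_1)\,P(C_2 > t_2 \mid T_2 > t_2)\,P(T_1 \le t_1, T_2 > t_2)}{P(C_1 > t_1, C_2 > t_2)},$$

and

$$p_{21} = P(T_1 > t_1, T_2 \le t_2 \mid C_1 > t_1, C_2 > t_2) = \frac{P(C_1 > t_1 \mid T_1 > t_1)\,P(C_2 > t_2 \mid T_2 \le t_2)\,P(T_1 > t_1, T_2 \le t_2)}{P(C_1 > t_1, C_2 > t_2)}.$$

Each censoring factor appears once in the numerator and once in the denominator of the cross-ratio, so these factors cancel. Then

$$\theta = \frac{p_{11}\,p_{22}}{p_{12}\,p_{21}} = \frac{P(T_1 \le t_1, T_2 \le t_2)\,P(T_1 > t_1, T_2 > t_2)}{P(T_1 \le t_1, T_2 > t_2)\,P(T_1 > t_1, T_2 \le t_2)}$$

as in Equation (3).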
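The cancellation of the censoring terms can be checked numerically. In the sketch below, all probabilities are made-up illustrative values: `q` holds the joint onset probabilities, and `a` and `b` hold each relative's probability of remaining uncensored given his or her own onset status, which is the form that conditional independence imposes.

```python
import numpy as np

# q[i, j]: joint probability of the onset cells; index 0 means T <= t,
# index 1 means T > t (hypothetical values summing to 1).
q = np.array([[0.002, 0.018],
              [0.020, 0.960]])

# a[i] = P(C1 > t1 | T1 in cell i), b[j] = P(C2 > t2 | T2 in cell j).
a = np.array([0.6, 0.9])
b = np.array([0.5, 0.8])


def cross_ratio(p):
    """Cross-ratio p11 * p22 / (p12 * p21) of a 2 x 2 probability table."""
    return p[0, 0] * p[1, 1] / (p[0, 1] * p[1, 0])


joint = np.outer(a, b) * q        # P(onset cell, C1 > t1, C2 > t2)
p = joint / joint.sum()           # condition on both being uncensored

print(cross_ratio(q), cross_ratio(p))
```

The two printed values agree up to rounding, even though the conditional cell probabilities in `p` differ from those in `q`.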

APPENDIX B

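LIMITS FOR λ

With $u_1$ and $u_2$ the marginal probabilities of disease and $\delta$ the joint probability that both relatives are affected, note that

$$\lambda = \frac{\delta}{u_1 u_2}$$

and

$$\theta = \frac{\delta - u_1\delta - u_2\delta + \delta^2}{u_1 u_2 - u_1\delta - u_2\delta + \delta^2} = \frac{\delta - \varepsilon}{u_1 u_2 - \varepsilon},$$

where $\varepsilon = u_1\delta + u_2\delta - \delta^2$. Then if $\lambda > 1$, we have $\delta > u_1 u_2$ and, since $\varepsilon > 0$,

$$\delta(u_1 u_2 - \varepsilon) = u_1 u_2 \delta - \delta\varepsilon < u_1 u_2 \delta - u_1 u_2 \varepsilon = u_1 u_2 (\delta - \varepsilon),$$

so that

$$\lambda = \frac{\delta}{u_1 u_2} < \frac{\delta - \varepsilon}{u_1 u_2 - \varepsilon} = \theta$$

and hence $\lambda \in [1, \theta]$. (Conversely, if $\lambda < 1$, then $\lambda \in [\theta, 1]$.) $\lambda$ will approach $\theta$ when the disease is rare, i.e., when $u_1$, $u_2$, and $\delta$ are small and $\varepsilon \to 0$.

Also, when there is no interaction, i.e., $\delta = u_1 u_2$, then $\theta = 1$ and $\lambda = 1$.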
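The bounds can be verified numerically over a grid of admissible joint probabilities. This is a sketch with hypothetical function names; it takes the marginal disease probabilities u1 and u2 and the joint probability delta as inputs, and restricts delta to the open interval between max(0, u1 + u2 - 1) and min(u1, u2) so that all four cell probabilities are positive.

```python
def lam_theta(u1, u2, delta):
    """Recurrence risk ratio and cross-ratio from margins and joint probability."""
    eps = u1 * delta + u2 * delta - delta ** 2
    lam = delta / (u1 * u2)
    theta = (delta - eps) / (u1 * u2 - eps)
    return lam, theta


def within_bounds(u1, u2, delta, tol=1e-12):
    """True if lambda lies between 1 and theta (in either order)."""
    lam, theta = lam_theta(u1, u2, delta)
    low, high = min(1.0, theta), max(1.0, theta)
    return low - tol <= lam <= high + tol


def check_grid(u1, u2, steps=200):
    """Check the bounds for interior values of the joint probability."""
    lo, hi = max(0.0, u1 + u2 - 1.0), min(u1, u2)
    return all(
        within_bounds(u1, u2, lo + (hi - lo) * k / steps)
        for k in range(1, steps)
    )


print(check_grid(0.01, 0.05), check_grid(0.5, 0.3))
print(lam_theta(0.001, 0.001, 2e-6))   # rare disease: lambda close to theta
```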
