
Letter to the Editor

A Conditional-on-Exchangeable-Parental-Genotypes Likelihood That Remains Unbiased at the Causal Locus Under Multiple-Affected-Sibling Ascertainment

Peter Kraft,1,2 Hsin-ju Hsieh,3 Heather J. Cordell,6 and Janet Sinsheimer3–5

1 Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts
2 Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
3 Department of Biostatistics, University of California, Los Angeles
4 Department of Biomathematics, University of California, Los Angeles
5 Department of Human Genetics, University of California, Los Angeles
6 Department of Medical Genetics, University of Cambridge, Cambridge, United Kingdom

Contract grant sponsor: NIH; Contract grant numbers: MH059532-05, MH66001, and MH59490; Contract grant sponsor: Wellcome Trust; Contract grant sponsor: Juvenile Diabetes Research Foundation.
Correspondence to: Peter Kraft, Ph.D., Harvard School of Public Health, 665 Huntington Avenue, Building 2, Room 109, Boston, MA 02115. E-mail: [email protected]
Received 31 August 2004; Accepted 7 December 2004
Published online 14 February 2005 in Wiley InterScience (www.interscience.wiley.com)
DOI: 10.1002/gepi.20069

In a recent report in Genetic Epidemiology, Cordell et al. [2004] introduced a general formulation of the case/pseudocontrol likelihood for case-parent trios that enables joint estimation of relative risks for offspring genotype, maternal genotype, parent-of-origin effects, and maternal-fetal-genotype-interaction effects. Cordell [2004] subsequently showed that a pseudolikelihood analysis of nuclear families with multiple affected siblings (namely, treating each sibling as if she were independent of her siblings, so that the pseudolikelihood contribution for each nuclear family is a product of the case/pseudocontrol likelihoods for each sibling) produces biased results when there is a true maternal-genotype, parent-of-origin, or maternal-fetal-genotype-interaction effect.

Here we present an alternative formulation that extends Cordell et al.'s [2004] conditional-on-exchangeable-parental-genotypes (CEPG) likelihood to nuclear families with multiple affected siblings. We show that our formulation eliminates the bias seen at a true causal locus using the pseudolikelihood analysis. We also describe how this formulation can be fit using standard statistical software, preserving an attractive feature of the case/pseudocontrol framework.

The case/pseudocontrol likelihood for case-parent trio data actually predates the Transmission Disequilibrium Test [Self et al., 1991; Spielman et al., 1993]. It is based on the retrospective likelihood $\Pr(G_o \mid G_p, D_o)$, that is, the probability of the affected offspring's genotype $G_o$ given her parents' genotypes $G_p$ and her affected status $D_o = 1$. Assuming a log-linear model for disease risk, $\Pr(D_o = 1 \mid G_o) \propto R_{G_o}$, where $R_{G_o}$ is the relative risk of disease for individuals with genotype $G_o$ relative to some baseline genotype (usually homozygous wild type), this likelihood simplifies to:

\[
\frac{R_{G_o}}{\sum_{G_o^* \mid G_p} R_{G_o^*}} \qquad [1]
\]

The summation in the denominator is over all four possible offspring genotypes, under the Mendelian model of independent and equally likely transmission of parental alleles. This likelihood is identical to the standard conditional logistic likelihood for a case-control study where each case is matched to three hypothetical controls, each with one of the three other possible genotypes, given the parents' genotypes; hence the name case/pseudocontrol likelihood.
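As an illustration only, likelihood [1] can be sketched in a few lines of Python. The function name `trio_likelihood`, the coding of genotypes as counts of mutant alleles, and the representation of each parent as a pair of alleles are assumptions made for this sketch; they are not taken from the letter or from the authors' software.

```python
from itertools import product

def trio_likelihood(child, mother, father, R):
    """Sketch of likelihood [1] for one case-parent trio (hypothetical helper).

    child: number of mutant alleles carried by the affected offspring (0, 1, or 2)
    mother, father: parental genotypes as pairs of alleles (0 = wild type, 1 = mutant)
    R: relative risks indexed by mutant-allele count, with R[0] = 1 as baseline
    """
    # The four equally likely Mendelian transmissions: one allele from each parent.
    possible = [m + f for m, f in product(mother, father)]
    if child not in possible:
        raise ValueError("offspring genotype is incompatible with the parents")
    # Numerator: the case. Denominator: the case plus its three pseudocontrols.
    return R[child] / sum(R[g] for g in possible)

# Example: heterozygous child of two heterozygous parents, with R1 = 3 and R2 = 6.
R = {0: 1.0, 1: 3.0, 2: 6.0}
print(trio_likelihood(1, (0, 1), (0, 1), R))  # 3 / (1 + 3 + 3 + 6)
```

The denominator enumerates transmissions rather than distinct genotypes, so a genotype that can arise in two ways appears twice, matching the sum over four equally likely offspring genotypes described above.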

Genetic Epidemiology 29: 87–90 (2005)

© 2005 Wiley-Liss, Inc.

As originally formulated, the case/pseudocontrol likelihood does not readily incorporate maternal-genotype or maternal-fetal-genotype interaction effects, as the case and all pseudocontrols are matched on these factors. For example, writing $\Pr(D_o = 1 \mid G_o) \propto R_{G_o} S_{G_m}$, the relative risk due to the maternal genotype, $S_{G_m}$, cancels from the numerator and denominator. In order to test for these effects, several authors suggested relaxing the conditioning in [1] [Cordell, 2004; Cordell et al., 2004; Kraft et al., 2004; Kraft and Wilson, 2002; Sinsheimer et al., 2003]. Instead of conditioning on the observed parental genotypes, they conditioned on the parental mating type (MT) and assumed exchangeable parental genotypes, that is, $\Pr(G_m = g_m, G_f = g_f \mid MT) = \Pr(G_m = g_f, G_f = g_m \mid MT)$. Assuming a log-linear model that incorporates offspring and maternal main effects as well as maternal-fetal interaction terms ($a_{G_o G_m}$) leads to the following likelihood:

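\[
\frac{R_{G_o} S_{G_m} a_{G_o G_m}}
     {\sum_{G_o^* \mid G_p} R_{G_o^*} S_{G_m} a_{G_o^* G_m} + \sum_{G_o^* \mid G_p} R_{G_o^*} S_{G_f} a_{G_o^* G_f}}
= \frac{R_{G_o} S_{G_m} a_{G_o G_m}}
       {\sum_{G_o^*, G_m^* \mid G_p} R_{G_o^*} S_{G_m^*} a_{G_o^* G_m^*}} \qquad [2]
\]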

This formulation retains the heuristic case/pseudocontrol interpretation. Each case is now matched to seven pseudocontrols: three with the nontransmitted genotypes and a mother who has the same genotype as the case's mother, and four with the four possible genotypes given the parental mating type and a mother who has the same genotype as the case's father.
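The eight terms in the denominator of [2] (the case plus seven pseudocontrols) can be enumerated directly. The following Python sketch is illustrative only: the name `cepg_likelihood`, the allele-count coding, and the dictionary `a` keyed by (offspring, maternal) genotype pairs are assumptions of this example, and parent-of-origin effects are not modeled.

```python
from itertools import product

def cepg_likelihood(child, mother, father, R, S, a):
    """Sketch of likelihood [2] for one trio (hypothetical helper).

    child: offspring mutant-allele count; mother, father: pairs of alleles (0/1).
    R, S: offspring and maternal relative risks indexed by allele count.
    a: interaction terms keyed by (offspring count, maternal count); missing keys mean 1.
    """
    def risk(g_o, g_m):
        return R[g_o] * S[g_m] * a.get((g_o, g_m), 1.0)

    # Four equally likely offspring genotypes given the mating type.
    transmissions = [m + f for m, f in product(mother, father)]
    # Exchangeability: either observed parental genotype may play the maternal role.
    maternal = [sum(mother), sum(father)]
    denominator = sum(risk(g_o, g_m) for g_m in maternal for g_o in transmissions)
    return risk(child, sum(mother)) / denominator

# Example: a maternal effect only (S1 = 2, S2 = 3), heterozygous mother, wild-type father.
R = {0: 1.0, 1: 1.0, 2: 1.0}
S = {0: 1.0, 1: 2.0, 2: 3.0}
print(cepg_likelihood(1, (0, 1), (0, 0), R, S, {}))  # 2 / (4 * 2 + 4 * 1)
```

Unlike [1], the maternal relative risk no longer cancels when the two parents have different genotypes, which is what makes the maternal parameters estimable.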

To extend this to families with multiple affected siblings, Cordell [2004] formed a pseudolikelihood for each nuclear family by treating the siblings as if they were independent, that is, by multiplying terms of the form [2] for each affected sibling $i = 1, \ldots, I$:

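\[
\prod_i \frac{R_{G_{oi}} S_{G_{mi}} a_{G_{oi} G_{mi}}}
             {\sum_{G_{oi}^*, G_{mi}^* \mid G_p} R_{G_{oi}^*} S_{G_{mi}^*} a_{G_{oi}^* G_{mi}^*}}
= \frac{\prod_i R_{G_{oi}} S_{G_{mi}} a_{G_{oi} G_{mi}}}
       {\sum_{G_{o1}^*, \ldots, G_{oI}^*, G_{m1}^*, \ldots, G_{mI}^* \mid G_p} \prod_i R_{G_{oi}^*} S_{G_{mi}^*} a_{G_{oi}^* G_{mi}^*}} \qquad [3]
\]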

(Here there are $4^I 2^I$ terms in the denominator: a sum over offspring genotypes and a sum over maternal genotypes for each offspring. All but one of these terms are counter-factual.) As expected, Cordell [2004] showed that when the measured locus was linked to a causal locus, and hence the siblings' genotypes were not independent, the pseudolikelihood [3] produced biased estimates and had inflated Type I error rates.
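The equality of the two forms of [3], and the count of $4^I 2^I$ denominator terms, can be checked numerically. The Python sketch below is illustrative: the helper names and the allele-count coding are assumptions of this example, and the relative-risk values are simply the Scenario 4 values from Table I used as convenient inputs.

```python
from itertools import product
from math import isclose, prod

def risk(g_o, g_m, R, S, a):
    # Log-linear relative risk for one (offspring, maternal) genotype pair.
    return R[g_o] * S[g_m] * a.get((g_o, g_m), 1.0)

def per_sibling_configs(mother, father):
    # 4 transmissions x 2 maternal roles = 8 (offspring, maternal) pairs per sibling.
    transmissions = [m + f for m, f in product(mother, father)]
    maternal = [sum(mother), sum(father)]
    return list(product(transmissions, maternal))

def pseudo_denominators(n_sibs, mother, father, R, S, a):
    """Return (product form, expanded sum, number of expanded terms) for [3]."""
    configs = per_sibling_configs(mother, father)
    single = sum(risk(g_o, g_m, R, S, a) for g_o, g_m in configs)
    joint_terms = [prod(risk(g_o, g_m, R, S, a) for g_o, g_m in joint)
                   for joint in product(configs, repeat=n_sibs)]
    return single ** n_sibs, sum(joint_terms), len(joint_terms)

R = {0: 1.0, 1: 1.5, 2: 2.0}
S = {0: 1.0, 1: 1.8, 2: 2.0}
a = {(1, 1): 0.5}
product_form, expanded, n_terms = pseudo_denominators(3, (0, 1), (0, 0), R, S, a)
print(n_terms == 4**3 * 2**3, isclose(product_form, expanded))
```

The expanded sum ranges over every combination of per-sibling maternal genotypes, so each sibling's hypothetical mother is chosen separately in this formulation.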

Somewhat surprisingly, the pseudolikelihood [3] also produced biased estimates at the true locus when there was a maternal-genotype or maternal-fetal-genotype interaction effect. Cordell [2004] showed that in this case, affected siblings' genotypes are not independent. Further intuition for the dependence among siblings' genotypes can be gleaned from the denominator of the pseudolikelihood [3]. The multiple summation over maternal genotypes for each offspring allows for terms where siblings have mothers with different genotypes, an impossible situation if they are indeed siblings, as $G_{mi} \equiv G_m$ and $G_{mi}^* = G_m^*$. This last observation suggests directly extending likelihood [2] to the multiple-affected-siblings case, taking care to retain the dependence among siblings due to their shared mother:

\[
\frac{\prod_i R_{G_{oi}} S_{G_m} a_{G_{oi} G_m}}
     {\sum_{G_{o1}^*, \ldots, G_{oI}^*, G_m^* \mid G_p} \prod_i R_{G_{oi}^*} S_{G_m^*} a_{G_{oi}^* G_m^*}}. \qquad [4]
\]

This likelihood can, in principle, still be fit using standard software for conditional logistic regression. Each nuclear family contributes as a matched case-control set, where the "case" has a relative risk equal to the product of the affected siblings' relative risks, and each of the $(4^I \times 2) - 1$ "controls" has a relative risk equal to the product of the relative risks for one of the counter-factual genotype configurations (although many of these configurations will be identical). Kraft et al. [2004] present an extension of likelihood [4] that accommodates birth-order effects, genotypes from unaffected siblings (or siblings with unknown phenotypes), and missing parental genotypes. The latter extension drops the conditioning on parental mating type and hence requires further assumptions about the population distribution of parental genotypes and the missingness process.
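To make the matched-set construction concrete, the following Python sketch evaluates one family's contribution to likelihood [4] and counts its pseudocontrols. It is a minimal illustration under assumed conventions (genotypes coded as mutant-allele counts, parents as allele pairs, the hypothetical name `family_likelihood`); it is not the SAS macro described in this letter.

```python
from itertools import product
from math import prod

def family_likelihood(children, mother, father, R, S, a):
    """Sketch of likelihood [4] for one nuclear family (hypothetical helper).

    children: mutant-allele counts of the affected siblings.
    Returns (likelihood contribution, number of pseudocontrols).
    """
    def risk(g_o, g_m):
        return R[g_o] * S[g_m] * a.get((g_o, g_m), 1.0)

    transmissions = [m + f for m, f in product(mother, father)]
    maternal = [sum(mother), sum(father)]
    # The "case": product of the observed siblings' relative risks, shared mother.
    case = prod(risk(g, sum(mother)) for g in children)
    # One maternal genotype is shared by all siblings in every configuration.
    configurations = [prod(risk(g, g_m) for g in sibs)
                      for g_m in maternal
                      for sibs in product(transmissions, repeat=len(children))]
    return case / sum(configurations), len(configurations) - 1

# Example: two heterozygous affected siblings, heterozygous mother, wild-type father.
R = {0: 1.0, 1: 1.0, 2: 1.0}
S = {0: 1.0, 1: 2.0, 2: 3.0}
likelihood, n_controls = family_likelihood((1, 1), (0, 1), (0, 0), R, S, {})
print(likelihood, n_controls)  # 4 / (16 * 4 + 16 * 1), with (4**2 * 2) - 1 controls
```

In a conditional logistic regression, each configuration would become one row of a stratum, with covariates equal to the summed log-relative-risk indicators across siblings; the sketch evaluates the likelihood directly instead.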

We wrote a SAS macro to fit likelihood [4] and conducted a small simulation study under conditions similar to a simulation study presented in Cordell [2004] (see Table I). We simulated a causal locus with a mutant allele frequency of 20% and calculated parental and offspring genotypes under four penetrance models. In the first, only the children's genotypes were associated with disease (the relative risk for children carrying one mutant allele was R1 = 3; for those carrying two alleles, R2 = 6). In the second, only maternal genotypes contributed additional risk (children with mothers with one mutant allele had a relative risk of S1 = 2; for those with mothers with two copies, S2 = 3). In the third, children with exactly one copy of the mutant allele were at decreased risk, but only if their mother also had exactly one copy of the mutant allele (i.e., a11 = 0.5, all other parameters equal to 1). The final model included offspring and maternal genotype main effects as well as a maternal-fetal interaction effect. We calculated the bias as the average of the parameter estimates minus the true parameter value (for computational convenience we estimated log-transformed parameters). In contrast to the pseudolikelihood approach, we could not detect any bias using the likelihood [4]; 95% confidence intervals also had appropriate coverage.
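The bias and coverage summaries reported in the tables can be computed from replicate estimates along the following lines. This Python sketch assumes Wald-type intervals on the log scale (estimate plus or minus 1.96 standard errors); the function name `bias_and_coverage` and the toy inputs are invented for illustration and are not simulation output from this study.

```python
import math

def bias_and_coverage(log_estimates, std_errors, true_value, z=1.96):
    """Average bias and interval coverage on the log scale (hypothetical helper).

    log_estimates: estimated log relative risks, one per simulation replicate.
    std_errors: the corresponding standard errors.
    true_value: the true relative risk on its natural scale.
    """
    true_log = math.log(true_value)
    # Bias: average log estimate minus the true log parameter.
    bias = sum(log_estimates) / len(log_estimates) - true_log
    # Coverage: fraction of nominal 95% Wald intervals containing the truth.
    covered = sum(1 for est, se in zip(log_estimates, std_errors)
                  if est - z * se <= true_log <= est + z * se)
    return bias, covered / len(log_estimates)

# Toy input: three made-up replicates for a parameter whose true value is 3.
truth = math.log(3.0)
estimates = [truth - 0.1, truth + 0.1, truth + 0.3]
print(bias_and_coverage(estimates, [0.1, 0.1, 0.1], 3.0))
```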

Likelihood [4] still assumes that siblings' disease states are independent, conditional on their genotypes and their mother's genotype. This assumption will not hold if the observed locus is linked to a causal locus. In that case, we expect that likelihood [4] would produce biased estimates and anti-conservative tests (as would any other method that assumes conditional independence among siblings). Table II presents results of a simulation study where the measured locus is not associated with outcome but is in perfect linkage with the causal locus. When the offspring's genotypes at the causal locus influence disease risk, parameter confidence intervals are anti-conservative (the variance in parameter estimates is underestimated), leading to inflated Type I error rates for standard Wald tests. To preserve proper test size, a robust variance estimator such as the "sandwich" estimator or the Huber-White correction could be used, as discussed in several studies [Cordell, 2004; Kraft and Siegmund, 2000; Siegmund et al., 2000] and elsewhere. When only the maternal genotype at the causal locus influences disease risk and the measured locus and the causal locus are in linkage equilibrium, however, the confidence intervals have the correct coverage. This is because under this scenario, the offspring's observed genotypes are independent conditional on the observed maternal genotype.

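TABLE I. Average bias and 95% coverage using the CEPG pseudolikelihood approach to multiple affected siblings (Expression [3]) and the CEPG likelihood explicitly accounting for shared parents (Expression [4])(a)

Scenarios 1 and 2

| Number of sibs | Parameter | Scenario 1: Truth | Scenario 1: Bias(b), Cordell [3] | Scenario 1: Bias, new CEPG [4] | Scenario 1: Coverage(b), new CEPG [4] | Scenario 2: Truth | Scenario 2: Bias, Cordell [3] | Scenario 2: Bias, new CEPG [4] | Scenario 2: Coverage, new CEPG [4] |
|---|---|---|---|---|---|---|---|---|---|
| 2 | R1 | 3 | 0.01 | 0.00 | 0.97 | 1 | 0.01 | -0.01 | 0.95 |
| 2 | R2 | 6 | 0.01 | 0.01 | 0.97 | 1 | 0.00 | -0.01 | 0.95 |
| 2 | S1 | 1 | -0.01 | 0.00 | 0.95 | 2 | 0.71 | 0.01 | 0.95 |
| 2 | S2 | 1 | 0.00 | 0.00 | 0.94 | 3 | 1.13 | 0.03 | 0.94 |
| 2 | a11 | 1 | 0.00 | 0.00 | 0.94 | 1 | -0.01 | 0.00 | 0.93 |
| 3 | R1 | 3 | 0.01 | 0.01 | 0.94 | 1 | 0.02 | -0.01 | 0.95 |
| 3 | R2 | 6 | 0.02 | 0.01 | 0.96 | 1 | 0.01 | -0.01 | 0.95 |
| 3 | S1 | 1 | -0.01 | 0.01 | 0.94 | 2 | 1.41 | 0.00 | 0.96 |
| 3 | S2 | 1 | -0.02 | 0.00 | 0.95 | 3 | 2.20 | 0.01 | 0.95 |
| 3 | a11 | 1 | -0.01 | 0.00 | 0.93 | 1 | -0.02 | 0.00 | 0.95 |

Scenarios 3 and 4

| Number of sibs | Parameter | Scenario 3: Truth | Scenario 3: Bias, Cordell [3] | Scenario 3: Bias, new CEPG [4] | Scenario 3: Coverage, new CEPG [4] | Scenario 4: Truth | Scenario 4: Bias, Cordell [3] | Scenario 4: Bias, new CEPG [4] | Scenario 4: Coverage, new CEPG [4] |
|---|---|---|---|---|---|---|---|---|---|
| 2 | R1 | 1 | 0.00 | 0.00 | 0.97 | 1.5 | 0.00 | 0.01 | 0.96 |
| 2 | R2 | 1 | -0.06 | 0.01 | 0.95 | 2 | 0.02 | 0.00 | 0.95 |
| 2 | S1 | 1 | -0.29 | -0.01 | 0.95 | 1.8 | 0.25 | 0.00 | 0.95 |
| 2 | S2 | 1 | -0.03 | 0.00 | 0.94 | 2 | 0.64 | 0.01 | 0.96 |
| 2 | a11 | 0.5 | 0.00 | 0.00 | 0.95 | 0.5 | -0.02 | -0.01 | 0.96 |
| 3 | R1 | 1 | 0.00 | 0.00 | 0.99 | 1.5 | -0.03 | 0.00 | 0.95 |
| 3 | R2 | 1 | 0.00 | -0.05 | 0.97 | 2 | 0.01 | -0.01 | 0.94 |
| 3 | S1 | 1 | -0.58 | 0.00 | 0.95 | 1.8 | 0.47 | 0.01 | 0.96 |
| 3 | S2 | 1 | -0.13 | 0.01 | 0.94 | 2 | 1.29 | 0.01 | 0.97 |
| 3 | a11 | 0.5 | 0.00 | 0.00 | 0.98 | 0.5 | 0.02 | -0.01 | 0.96 |

(a) Based on 500 replicate simulations of studies with 500 (333) nuclear families with exactly two (three) affected siblings. Baseline probability of disease = 5%. Relative risk parameters and causal-locus allele frequency are described in the text.
(b) Log parameter values were fit; bias is the difference between the average estimate of the log parameter and its true value; coverage is the observed coverage of the nominal 95% confidence intervals.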


ACKNOWLEDGMENTS

This work was supported by NIH grants MH059532-05 (P.K.), MH66001 (H.J.H. and J.S.S.), and MH59490 (J.S.S.), as well as the Wellcome Trust and the Juvenile Diabetes Research Foundation (H.C.).

ELECTRONIC DATABASE INFORMATION

The SAS macro that we used to fit likelihood [4] is available from the first author at: http://www.hsph.harvard.edu/faculty/kraft/soft.htm.

REFERENCES

Cordell HJ. 2004. Properties of case/pseudocontrol analysis for genetic association studies: effects of recombination, ascertainment, and multiple affected offspring. Genet Epidemiol 26:186–205.

Cordell HJ, Barratt BJ, Clayton DG. 2004. Case/pseudocontrol analysis in genetic association studies: a unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions, and parent-of-origin effects. Genet Epidemiol 26:167–185.

Kraft P, Siegmund K. 2000. Testing linkage disequilibrium in sibships using conditional logistic regression with robust variance estimators. Genet Epidemiol 19:257.

Kraft P, Wilson M. 2002. Family-based association tests incorporating parental genotypes. Am J Hum Genet 71:1238–1239.

Kraft P, Palmer C, Woodward J, Turunen J, Minassian S, Paunio T, Lonnqvist J, Peltonen L, Sinsheimer J. 2004. RHD maternal-fetal genotype incompatibility and schizophrenia: extending the MFG test to include multiple siblings and birth order. Eur J Hum Genet 12:192–198.

Self S, Longton G, Kopecky K, Liang K. 1991. On estimating HLA/disease association with application to a study of aplastic anemia. Biometrics 47:53–61.

Siegmund K, Langholz B, Kraft P, Thomas D. 2000. Testing linkage disequilibrium in sibships. Am J Hum Genet 67:244–248.

Sinsheimer J, Palmer C, Woodward J. 2003. The maternal-fetal genotype incompatibility test: detecting genotype combinations that increase risk for disease. Genet Epidemiol 24:1–13.

Spielman R, McGinnis R, Ewens W. 1993. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52:506–516.

TABLE II. Average bias and 95% coverage using CEPG likelihood [4] when measured locus is linked to the causal locus(a)

Scenario 1

| Number of sibs | Parameter (observed locus) | Causal locus R1 = 3, R2 = 6: Bias(b) | Causal locus R1 = 3, R2 = 6: Coverage(b) | Causal locus R1 = 4, R2 = 16: Bias | Causal locus R1 = 4, R2 = 16: Coverage |
|---|---|---|---|---|---|
| 2 | R1 | -0.01 | 0.93* | 0.00 | 0.94 |
| 2 | R2 | -0.01 | 0.94 | -0.01 | 0.92* |
| 2 | S1 | 0.01 | 0.95 | 0.00 | 0.96 |
| 2 | S2 | -0.01 | 0.94 | 0.00 | 0.94 |
| 2 | a11 | 0.01 | 0.94 | 0.00 | 0.93* |
| 3 | R1 | 0.00 | 0.95 | 0.00 | 0.89*** |
| 3 | R2 | 0.00 | 0.94 | -0.01 | 0.91** |
| 3 | S1 | 0.00 | 0.96 | 0.00 | 0.91** |
| 3 | S2 | 0.01 | 0.95 | 0.00 | 0.96 |
| 3 | a11 | 0.00 | 0.96 | 0.00 | 0.90*** |

Scenario 2

| Number of sibs | Parameter (observed locus) | Causal locus S1 = 2, S2 = 3: Bias(b) | Causal locus S1 = 2, S2 = 3: Coverage(b) | Causal locus S1 = 3, S2 = 9: Bias | Causal locus S1 = 3, S2 = 9: Coverage |
|---|---|---|---|---|---|
| 2 | R1 | 0.00 | 0.95 | 0.01 | 0.94 |
| 2 | R2 | -0.02 | 0.96 | 0.01 | 0.96 |
| 2 | S1 | 0.00 | 0.94 | 0.00 | 0.95 |
| 2 | S2 | 0.00 | 0.95 | 0.00 | 0.93 |
| 2 | a11 | 0.00 | 0.94 | -0.01 | 0.95 |
| 3 | R1 | 0.00 | 0.94 | 0.00 | 0.95 |
| 3 | R2 | 0.00 | 0.94 | 0.00 | 0.95 |
| 3 | S1 | 0.00 | 0.96 | 0.00 | 0.95 |
| 3 | S2 | 0.00 | 0.95 | 0.00 | 0.95 |
| 3 | a11 | 0.00 | 0.94 | 0.00 | 0.95 |

(a) Based on 500 replicate simulations of studies with 500 (333) nuclear families with exactly two (three) affected siblings; baseline probability of disease = 5%. Observed locus is in perfect linkage (θ = 0) and linkage equilibrium (δ = 0) with the causal locus; thus the true parameter values at the observed locus are all 1.
(b) Log parameter values were fit; bias is the difference between the average estimate of the log parameter and its true value; coverage is the observed coverage of the nominal 95% confidence intervals.
*Significantly different from 0.95, P = 0.05; **P = 0.01; ***P = 0.001.
