Accounting for Haplotype Uncertainty in Matched Association Studies: A Comparison of Simple and Flexible Techniques

Peter Kraft,1,2,* David G. Cox,1 Randi A. Paynter,1 David Hunter,1,3,4 and Immaculata De Vivo1,4

1 Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts
2 Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
3 Department of Nutrition, Harvard School of Public Health, Boston, Massachusetts
4 Channing Laboratory, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts

Population-based case-control studies measuring associations between haplotypes of single nucleotide polymorphisms (SNPs) are increasingly popular, in part because haplotypes of a few “tagging” SNPs may serve as surrogates for variation in relatively large sections of the genome. Due to current technological limitations, haplotypes in cases and controls must be inferred from unphased genotypic data. Using individual-specific inferred haplotypes as covariates in standard epidemiologic analyses (e.g., conditional logistic regression) is an attractive analysis strategy, as it allows adjustment for nongenetic covariates, provides omnibus and haplotype-specific tests of association, and can estimate haplotype and haplotype × environment interaction effects. In principle, some adjustment for the uncertainty in inferred haplotypes should be made. Via simulation, we compare the performance (bias and mean squared error of haplotype and haplotype × environment interaction effect estimates) of several analytic strategies using inferred haplotypes in the context of matched case-control data. These strategies include using only the most likely haplotype assignment, the expectation substitution approach described by Stram et al. ([2003b] Hum. Hered. 55:179–190) and others, and an improper version of multiple imputation. For relatively uncomplicated haplotype structures and moderate haplotype relative risks (≤ 2), all methods performed comparably well (small bias with appropriately sized confidence intervals). For larger relative risks, the most likely haplotype and multiple imputation strategies showed noticeable bias towards the null; the expectation substitution strategy still performed well. When there was more uncertainty in the inferred haplotypes, the most likely and multiple imputation strategies showed even more bias towards the null, while the expectation substitution method had slightly smaller than nominal confidence intervals for larger relative risks (≥ 5). An application to progesterone-receptor haplotypes and endometrial cancer further illustrates that the performance of all these methods depends on how well the observed haplotypes “tag” the unobserved causal variant. Genet. Epidemiol. 28:261–272, 2005. © 2005 Wiley-Liss, Inc.

Key words: haplotypes; population-based matched case-control data; gene-environment interaction

*Correspondence to: Peter Kraft, 665 Huntington Avenue, Building 2, Room 109, Boston, MA 02115. E-mail: [email protected]
Received 17 February 2004; Accepted 11 October 2004
Published online 6 January 2005 in Wiley InterScience (www.interscience.wiley.com)
DOI: 10.1002/gepi.20061

INTRODUCTION

Population-based case-control studies measuring associations between haplotypes of tightly linked diallelic markers are increasingly popular, for a number of reasons. First, the discovery that regions of high linkage disequilibrium and limited haplotype diversity span the human genome [Daly et al., 2001; Johnson et al., 2001] suggests that a few SNPs may serve as accurate surrogates for variation in relatively large sections of the genome, leading to considerable savings in genotyping costs [Stram et al., 2003a]. Second, haplotypes can also be more powerful to detect linkage disequilibrium with a causal locus than individual SNPs, in part because the haplotypes act as multiallelic markers [Akey et al., 2001; Chapman and Wijsman, 1998; Morris and Kaplan, 2002; Pritchard, 2001]. Finally, haplotypes themselves may be causally relevant due to the epistatic interaction of multiple loci on the same chromosome. Distinct analytic strategies may be required to maximize power to detect linkage disequilibrium or sensibly model epistatic effects, e.g., an “allelic” or “genotypic” model [Fallin et al., 2002; Schaid, 2002], but in any case, due to current technological limitations, haplotypes must generally be inferred from unphased genotypic data.

One well-established method for estimating haplotype frequencies from a random population
sample is the Expectation-Maximization (EM) algorithm [Excoffier and Slatkin, 1995; Fallin and Schork, 2000; Qin et al., 2002]. This is particularly attractive in the context of tightly linked markers, where the number of observed haplotypes is usually much smaller than the number of theoretically possible haplotypes. Very rare haplotypes can be excluded from the calculations, greatly reducing the number of parameters to be estimated. The EM algorithm also produces the probability that an individual has a given haplotype pair as a natural by-product [Chiano and Clayton, 1998]. As discussed below, these probabilities can be used to define a haplotype covariate to be included in a standard analysis, such as logistic regression [Stram et al., 2003a,b; Zaykin et al., 2002].

It is tempting to assign each subject her most
likely haplotype pair and then analyze the data as if the haplotypes were observed. There are several potential problems with this approach. First, ignoring the uncertainty in the assigned haplotype pair can lead to overly narrow confidence intervals and bias, depending on the degree of misclassification. This is essentially a measurement error problem [Carroll, 1998; Thomas et al., 1993]. If misclassification rates differ between cases and controls (i.e., if there is differential misclassification), then tests of haplotype-effect parameter estimates can be biased. Estimates of haplotype × environment effects may also be biased (either towards or away from the null).

Second, the haplotype probabilities for a subject
conditional on their genotypes depend on the population haplotype frequencies, which in this case are estimated using the case-control sample. Taking the most likely haplotype pair (or any function of the haplotype probabilities) as observed ignores the additional uncertainty due to variation in haplotype frequency estimates. This is an example of shared measurement error [Stram and Kopecky, 2003]; changes in haplotype frequency estimates affect the haplotype probability assignments for all subjects simultaneously. When there is a low degree of haplotype complexity (high linkage disequilibrium with a few common haplotypes), ignored uncertainty in haplotype frequency estimates may not greatly influence haplotype risk estimates, since the inferred haplotype probabilities for most genotypes will not vary greatly for a range of frequency estimates (e.g., there may be relatively many genotypes homozygous for all markers, for
which haplotypes can always be inferred with certainty). On the other hand, when there is a high degree of haplotype complexity (low linkage disequilibrium and many equifrequent haplotypes), changes in haplotype frequency estimates may change inferred subject-specific haplotype probabilities and hence change the haplotype risk estimates.

Finally, the standard EM algorithm calculates
haplotype frequencies and subject-specific haplotype probabilities without reference to subjects’ case-control status, i.e., it ignores ascertainment. As discussed further in Methods, this can affect haplotype probability assignments if haplotype variation is associated with variation in disease risk. Furthermore, the EM algorithm assumes the haplotypes in the sampled population are in Hardy-Weinberg equilibrium; this may not be the case for a case-enriched sample.

Several analytic strategies have been proposed
to account for uncertainty in haplotype assignment and uncertainty in haplotype frequency estimates and ascertainment. Schaid et al. [2002] discuss a score test from a generalized linear model for the composite null hypothesis of no haplotype effect. As haplotype frequencies and subject-specific haplotype probabilities are calculated under the null, ascertainment is not an issue; further, they present an estimator for the variance of the score that accounts for the variance in frequency estimates. They deal with uncertainty in haplotype assignment by assigning each subject her expected haplotype score. For example, if the linear predictor for outcome involves a dominant effect for haplotype A, α + β I[A present], and a subject carries haplotype A with 80% probability, then the linear predictor for that subject would be α + β × 0.8. This technique comes from the measurement error literature, where it is known as “regression substitution” [Stram et al., 2003a; Thomas et al., 1993] or the “replacement method” [Carroll, 1998].

In order to estimate haplotype risks away from
the null, several methods have been proposed that jointly estimate haplotype risk parameters and haplotype frequencies while accounting for ascertainment. Epstein and Satten [2003] present a retrospective likelihood for case-control data that integrates over the unobserved phase (haplotype) assignments; Zhao et al. [2003] propose a similar estimating equations approach, although under a rare disease assumption they calculate haplotype frequencies only using controls. Stram et al.
[2003b] make use of case and control sampling fractions (which would be known in nested case-control or case-cohort studies) to adjust for the sampling scheme. Stram et al. [2003b] also show that the regression-substitution approach can be accurate relative to the computationally more intensive algorithm that accounts for both haplotype measurement error and ascertainment.

The latter two approaches consider unmatched
case-control designs (or frequency-matched with large strata). Here we consider n:m matched case-control studies where n and m are small, such as would usually be analyzed by conditional logistic regression. We compare the performance of six analytic strategies via simulation, in terms of their ability to estimate haplotype main effects and haplotype-environment interactions. We also apply these strategies to data from a nested case-control study of endometrial cancer and variation in the progesterone receptor gene (PGR). All six strategies first use the EM algorithm to estimate haplotype frequencies in the pooled sample of cases and controls and calculate subject-specific haplotype probabilities. They differ in how these haplotype probability distributions are used in subsequent association analysis.

The first strategy (MHAP) assigns each subject
her most likely haplotype and then analyzes the data as if the haplotypes had been observed. The next two strategies (MH70 and MH90) are similar, except we restrict the analysis to those subjects whose most likely haplotypes are at least 70% (90%) probable. The fourth strategy (IHAP) accounts for uncertainty in haplotype assignment using a naive (or “improper” [Little and Rubin, 2002]) version of multiple imputation. That is, a number of replicate data sets are generated by randomly assigning a haplotype to each subject according to her haplotype probabilities, conditional on her genotypes (but not conditional on her disease status, which is why this is an improper approach). Haplotype risk estimates are calculated by averaging the replicate-specific estimates; their variances are estimated from the within-replicate and across-replicate variances of the estimates. The fifth strategy (EHAP) uses regression substitution, assigning each subject her expected haplotype score. The final strategy (JHAP) approximates the joint likelihood for haplotype risk parameters and haplotype frequencies by averaging the usual conditional logistic likelihood over all possible haplotype assignments, weighted by their estimated probabilities under the null. This likelihood is then maximized over the haplotype risk parameters alone. (The full joint likelihood would be maximized over both the haplotype risk parameters and the haplotype frequencies simultaneously.)

Although none of these strategies accounts for
uncertainty in haplotype frequency estimates or ascertainment, most had negligible bias and appropriate confidence-interval coverage in situations with moderate haplotype risk and low haplotype complexity. The strategies using the most likely haplotypes or multiple imputation all showed noticeable bias towards the null for larger relative risks, while the EHAP and JHAP methods performed well even with strong haplotype effects. However, for situations with greater haplotype complexity and larger relative risks (≥ 5) the confidence intervals for the EHAP and JHAP methods had smaller than nominal coverage.

The retrospective likelihood proposed by Epstein and Satten [2003] does account for both uncertainty in haplotype frequency estimates and ascertainment by jointly fitting haplotype risk and haplotype frequency parameters. Although originally presented for unmatched case-control studies, it can be applied to matched data. In principle, it can also be extended to account for haplotype × environment interaction (Glen Satten, personal communication). We do not evaluate the performance of the retrospective likelihood, however, as fitting it requires specialized software, and the extension to haplotype × environment interaction models has yet to be implemented.

METHODS

SUBJECT-SPECIFIC HAPLOTYPE PROBABILITY CALCULATION

Given a set of distinct haplotypes $h_1, \ldots, h_H$ and their frequencies $q = (q_1, \ldots, q_H)$, we calculate the probability that an individual with genotypes $G$ has the haplotype pair $H = (H_1, H_2)$ using Bayes’ Rule:

$$
\Pr\nolimits_q(H = (H_1, H_2) \mid G) = \frac{\Pr(G \mid H)\,\Pr\nolimits_q(H)}{\sum_{H^*} \Pr(G \mid H^*)\,\Pr\nolimits_q(H^*)}. \tag{1}
$$

Here Pr(G|H) is an indicator function: 1 if the genotype G is consistent with the haplotypes H, 0 otherwise. The sum in the denominator is thus
over all haplotype pairs consistent with genotype G. In practice, we calculate (1) using the estimates $\hat{q}$, derived using the EM algorithm applied to the pooled sample of cases and controls. In fact, the EM algorithm uses the probabilities (1) in the “E-step,” and most haplotype frequency estimation programs (e.g., PROC HAPLOTYPE in SAS and PL-EM [Qin et al., 2002]) return these subject-specific haplotype distributions.

Note that expression (1) and the EM algorithm assume the sample comes from a population in Hardy-Weinberg equilibrium (HWE), in particular $\Pr(H = (H_i, H_j)) = q_i q_j$. Even if the haplotypes are in HWE in the general population, the case-control sampling scheme may force the sampled haplotypes out of HWE. In the case of rare disease, initial EM estimation of haplotype frequencies $\hat{q}$ could be restricted to control haplotypes, which are likely to be in HWE. However, this approach may miss some of the haplotype diversity in cases (e.g., some haplotypes may only appear in cases). Alternatively, haplotype frequencies in cases $\hat{q}_1$ and controls $\hat{q}_0$ could be estimated separately, with $\hat{q}_1$ then used to calculate the haplotype probabilities in cases and $\hat{q}_0$ used in controls. Haplotypes in cases need not be in HWE, although they will be if the risk of disease is a log-additive function of haplotype counts (as in the simulation model below). Regardless, if the number of cases is small, $\hat{q}_1$ could be highly variable, potentially compounding the problem of over-optimistic confidence intervals and leading to an increased Type I error. Methods that jointly estimate population haplotype frequencies and haplotype risk parameters are more appropriate. For example, ascertainment could be accounted for explicitly by using Pr(H|G, subject is sampled) instead of (1) in the joint likelihood described below, or implicitly by using the retrospective likelihood Pr(G|D), and then estimating haplotype risk parameters and $\hat{q}$ simultaneously. Both the joint likelihood approach (also outlined in Stram et al. [2003b]) and the retrospective approach [Epstein and Satten, 2003] present computational challenges and are not implemented here.
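
As an illustration of expression (1) and of the EM iteration built on it, the following Python sketch computes subject-specific haplotype-pair probabilities under HWE and uses them as the E-step of a bare-bones EM. It is a minimal sketch, not the PROC HAPLOTYPE or PL-EM implementation: the function names, the coding of a genotype as a tuple of per-locus counts of the "1" allele, and the fixed iteration count are all assumptions made for the example. The frequencies are the four PGR-like values from Table I that are consistent with the chosen genotype; they need not sum to one because (1) renormalizes.

```python
from itertools import product


def pair_posterior(genotype, freqs):
    """Expression (1): Pr(H | G) under HWE for one subject.

    genotype: tuple of per-locus counts (0, 1, or 2) of the "1" allele (assumed coding).
    freqs: dict mapping haplotype strings such as "0101" to frequencies q.
    Returns a dict mapping unordered haplotype pairs to probabilities.
    """
    weights = {}
    for h1, h2 in product(freqs, repeat=2):
        # Pr(G | H) is an indicator: the pair must reproduce every locus count.
        if all(int(a) + int(b) == g for a, b, g in zip(h1, h2, genotype)):
            pair = tuple(sorted((h1, h2)))
            # Summing both orderings gives 2*q1*q2 for heterozygous pairs, q1^2 otherwise.
            weights[pair] = weights.get(pair, 0.0) + freqs[h1] * freqs[h2]
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("genotype is not consistent with any haplotype pair")
    return {pair: w / total for pair, w in weights.items()}


def em_frequencies(genotypes, haplotypes, n_iter=50):
    """Bare-bones EM for haplotype frequencies (fixed iteration count, no convergence test)."""
    q = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haplotypes}
        for g in genotypes:
            # E-step: expected haplotype counts from the pair probabilities (1).
            for (h1, h2), p in pair_posterior(g, q).items():
                counts[h1] += p
                counts[h2] += p
        # M-step: each subject contributes two haplotypes.
        q = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return q


# Double heterozygote at loci 2 and 4: consistent with 0000/0101 or 0001/0100.
pgr_like = {"0000": 0.3265, "0001": 0.1327, "0100": 0.0306, "0101": 0.0408}
posterior = pair_posterior((0, 1, 0, 1), pgr_like)
```

With these inputs the pair 0000/0101 receives a little over three-quarters of the probability, so a most-likely-pair rule would assign it outright, while an expectation-based rule would carry the remaining uncertainty into the covariate.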

HAPLOTYPE RISK MODELING

If the subject-specific haplotypes were observed, they could be analyzed as any other observed covariate in a standard epidemiological analysis. In the case of n:m matched case-control studies, we use conditional logistic regression. This involves maximizing the following likelihood:

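$$
L(\beta; H) = \prod_i L_i(\beta; H_i)
= \prod_i \frac{\prod_j \exp\left[\beta' Z_{ij}(H_{ij})\, D_{ij}\right]}{\sum_{D_i^* :\, |D_i^*| = |D_i|} \prod_j \exp\left[\beta' Z_{ij}(H_{ij})\, D_{ij}^*\right]}. \tag{2}
$$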

Here $D_i = (D_{i1}, \ldots, D_{iJ_i})'$ is a vector of disease indicators for subjects $j = 1, \ldots, J_i$ in the $i$th matched set, $H_{ij}$ is the haplotype pair for subject $ij$ (analogously, $H_i$ are the haplotype data for stratum $i$), $Z_{ij}(H_{ij})$ is a numerically coded vector of covariates, and $\beta$ is a vector of log odds-ratio (OR) parameters. For example, to simultaneously estimate the OR associated with a given haplotype A under a dominant model, the OR associated with a dichotomous exposure X, and the OR for the haplotype × exposure interaction (departure from a multiplicative OR model), $Z_{ij}(H_{ij})$ would be set to:

$$
Z_{ij}(H_{ij}) = \begin{pmatrix} I[A \text{ present in } H_{ij}] \\ X_{ij} \\ X_{ij} \times I[A \text{ present in } H_{ij}] \end{pmatrix}.
$$

In our simulation studies and applications we consider two models. The first includes only an additive main effect for a particular haplotype A, i.e.,

$$
Z_{ij}(H_{ij}) = \text{number of copies of } A \text{ in } H_{ij}. \tag{3}
$$

The second includes a main effect for haplotype A as well as an interaction between A and the dichotomous matching exposure X:

$$
Z_{ij}(H_{ij}) = \begin{pmatrix} \#[A \text{ in } H_{ij}] \\ X_{ij} \times \#[A \text{ in } H_{ij}] \end{pmatrix}. \tag{4}
$$

The main effect for X cannot be estimated, because cases and controls are matched on X; variation in $H_{ij}$ within matched sets makes it possible to estimate the interaction parameter.

For the first four analysis strategies (MHAP, MH70, MH90, and IHAP), the covariate vectors (3) and (4) are calculated given fixed estimates for $\{H_{ij}\}$. Maximum likelihood estimates for $\beta$ and their variances are calculated from (2), treating the $H_{ij}$ as known. For the IHAP strategy, replicate haplotype data $H_{ij}^{(r)}$, $r = 1, \ldots, R$, are sampled from (1), and $\hat{\beta}_r$ and $\widehat{\mathrm{var}}(\hat{\beta}_r)$ are calculated for each replicate data set. (We use $R = 20$.) The final estimate for $\beta$ is $\sum_r \hat{\beta}_r / R$; its variance is estimated by $(1/R) \sum_r \widehat{\mathrm{var}}(\hat{\beta}_r) + ((R + 1)/R)\,\mathrm{var}(\hat{\beta})$, where $\mathrm{var}(\hat{\beta})$ is the variance of the $\hat{\beta}_r$ across replicates [Little and Rubin, 2002].
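
The IHAP steps described above (sample a haplotype pair for each subject from (1), fit the model to each replicate, then combine) can be sketched in Python as follows. This is only an illustration: `draw_assignment` and `combine_replicates` are hypothetical helper names, the model-fitting step is omitted, and the across-replicate variance is taken to be the sample variance with an R − 1 denominator, which is the usual multiple-imputation convention but is an assumption here.

```python
import random
from statistics import mean, variance


def draw_assignment(posterior, rng):
    """Sample one haplotype pair from a subject's distribution (1).

    posterior: dict mapping haplotype pairs to Pr(H | G).
    rng: a random.Random instance, passed in so that replicates are reproducible.
    """
    pairs = list(posterior)
    probs = [posterior[pair] for pair in pairs]
    return rng.choices(pairs, weights=probs, k=1)[0]


def combine_replicates(estimates, variances):
    """Combine R replicate-specific estimates and their estimated variances.

    Returns the averaged estimate and
    (1/R) * sum(var_r) + ((R + 1) / R) * var(beta_r).
    """
    n_rep = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)
    # Assumption: the across-replicate variance uses the R - 1 denominator.
    between = variance(estimates)
    return pooled, within + (n_rep + 1) / n_rep * between


# Hypothetical log-OR estimates and variances from R = 2 replicates.
pooled, total_var = combine_replicates([0.4, 0.6], [0.01, 0.03])
```

Because the pairs are drawn without reference to disease status, the replicates tend to dilute any true haplotype effect, which is the reason the text calls this version of multiple imputation improper.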

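For the regression substitution strategy EHAP, $Z_{ij}(H_{ij})$ in (2) is replaced by its expectation over the distribution (1), or

$$
E Z_{ij} = \sum_H Z_{ij}(H)\, \frac{\Pr(G_{ij} \mid H)\,\Pr\nolimits_{\hat{q}}(H)}{\sum_{H^*} \Pr(G_{ij} \mid H^*)\,\Pr\nolimits_{\hat{q}}(H^*)}.
$$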

Maximum likelihood estimates for $\beta$ and their variances are again calculated from (2), taking the $E Z_{ij}$ as fixed. A SAS macro that calculates these expected scores under various inheritance models and tests for haplotype and haplotype × environment interaction effects is available from the authors.

The joint likelihood strategy JHAP approximates the appropriate missing-data marginal likelihood (where the missing data is the phase information in H) by taking the expectation of (2) over (1):

$$
L_J(\beta) = \prod_i \left\{ \sum_{H_i} L_i(\beta; H_i)\, \Pr(H_i \mid G_i) \right\}
= \prod_i \left\{ \sum_{H_i} L_i(\beta; H_i) \prod_j \frac{\Pr(G_{ij} \mid H_{ij})\,\Pr\nolimits_{\hat{q}}(H_{ij})}{\sum_{H^*} \Pr(G_{ij} \mid H^*)\,\Pr\nolimits_{\hat{q}}(H^*)} \right\}. \tag{5}
$$

This new approximate likelihood $L_J(\beta)$ is then maximized to estimate $\beta$; the variance of these estimates is estimated using the inverse information, as usual. This likelihood is similar to the “cohort likelihood” described by Stram et al. [2003b], except we do not maximize (5) with respect to both $\beta$ and $q$. Instead, we fix $q$ at its estimate under the null. Note that (similar to the “cohort likelihood”), the likelihood (5) misspecifies the distribution of $\Pr(H_i \mid G_i)$, as it assumes the sample is in HWE. This may not be true, especially for case-enriched samples.
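
To make the difference between EHAP and JHAP concrete, the Python sketch below evaluates one matched set's contribution to (2) in both ways. It is a minimal illustration under simplifying assumptions that are not taken from the authors' software: a single case per set (1:m matching), a scalar additive covariate as in (3), and hypothetical function names. Each entry of `posteriors` holds one subject's distribution (1) as a dict from haplotype pairs to probabilities.

```python
import math
from itertools import product


def copies_of(hap, pair):
    """Additive coding (3): number of copies of `hap` in a haplotype pair."""
    return sum(1 for h in pair if h == hap)


def expected_score(posterior, hap):
    """EHAP covariate: expectation of the additive score over distribution (1)."""
    return sum(p * copies_of(hap, pair) for pair, p in posterior.items())


def set_likelihood(beta, scores, case_index=0):
    """Conditional logistic contribution (2) for one set with exactly one case."""
    terms = [math.exp(beta * z) for z in scores]
    return terms[case_index] / sum(terms)


def ehap_set_likelihood(beta, posteriors, hap, case_index=0):
    """EHAP: plug each subject's expected score into (2)."""
    scores = [expected_score(post, hap) for post in posteriors]
    return set_likelihood(beta, scores, case_index)


def jhap_set_likelihood(beta, posteriors, hap, case_index=0):
    """JHAP: one factor of (5), averaging (2) over all joint haplotype assignments."""
    total = 0.0
    for combo in product(*(post.items() for post in posteriors)):
        # Subjects are treated as independent, so the joint weight is a product.
        weight = math.prod(p for _, p in combo)
        scores = [copies_of(hap, pair) for pair, _ in combo]
        total += weight * set_likelihood(beta, scores, case_index)
    return total


# One case with an uncertain pair and three controls known to be non-carriers.
uncertain = {("A", "B"): 0.8, ("B", "B"): 0.2}
non_carrier = {("B", "B"): 1.0}
matched_set = [uncertain, non_carrier, non_carrier, non_carrier]
```

Here `expected_score(uncertain, "A")` is 0.8, echoing the 80% example in the Introduction. At beta = 0 both likelihoods equal 1/4 for this set; they coincide whenever every pair is known with certainty and otherwise diverge as beta moves away from the null. The enumeration in `jhap_set_likelihood` grows multiplicatively with the number of uncertain subjects per set, which is manageable only because matched sets are small.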

SIMULATION STUDIES

Haplotypes and genotypes for four diallelic loci were generated from two haplotype distributions, shown in Table I. The first roughly corresponds to haplotype frequencies for the progesterone receptor (PGR) gene in a sample of endometrial cancer cases and matched controls (described below). The second distribution assumes all 16 possible haplotypes are equally frequent. We chose this distribution because the amount of uncertainty in inferred haplotypes is much greater than the uncertainty under the PGR-like distribution (see Fig. 1). For the PGR-like distribution the $r_h^2$ (a measure of how much information the unphased
genotypes contain about phased haplotypes that runs from 0 for no information to 1 for perfect information [Stram et al., 2003a]) ranged from 0.72 to 0.98. For the equifrequent haplotypes the $r_h^2$ was 0.54.

The probability that a subject with haplotype pair H developed disease was set to

$$
\Pr(D \mid H) = B \times RR_H^{Z(H)} \times RR_X^{X} \times RR_I^{Z(H) \times X},
$$

where Z(H) is the number of copies of a specified haplotype, and X is a dichotomous exposure. B is the probability that an unexposed non-carrier develops disease (sporadic incidence); $RR_H$, $RR_X$, and $RR_I$ are haplotype-main effect, exposure-main effect, and interaction relative risks, respectively. [We chose this log-linear distribution because it yields computationally convenient sampling probabilities Pr(H|D). For rare disease, the odds ratios from the logistic model accurately estimate the relative risks in the log-linear model. Note that under this model, the case and control groups are each in HWE, although the combined sample is not.] Under the PGR-like model, we varied the frequency of the at-risk haplotype by choosing either haplotype 0000 (freq. 32%, $r_h^2$ 0.97) or haplotype 0101 (freq. 4%, $r_h^2$ 0.83) to be the at-risk haplotype. For each set of simulation parameters, we generated 200 replicate studies of 200 1:3 exposure-matched case-control sets. Three-quarters of the sets were exposed (X = 1); the remainder were unexposed. This mimics the structure of the endometrial cancer data set analyzed by Paynter et al. (2004, unpublished data) where “exposure” is menopausal status.
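
The disease model above is simple enough to write out directly. The Python sketch below evaluates Pr(D|H) and, as a consequence worked out here rather than stated in this form in the text, the haplotype frequencies among cases: because the risk is multiplicative in the number of copies of the at-risk haplotype, Pr(H|D) factors over the two haplotypes, so case haplotype pairs stay in HWE with the at-risk frequency reweighted by the per-copy relative risk. Function and argument names are hypothetical, and no clipping of probabilities to [0, 1] is applied.

```python
def disease_probability(copies, exposed, baseline, rr_h, rr_x, rr_i):
    """Pr(D | H) = B * RR_H^Z * RR_X^X * RR_I^(Z * X).

    copies: Z(H), the number of copies (0, 1, or 2) of the at-risk haplotype.
    exposed: X, coded 0 or 1.
    """
    return baseline * rr_h**copies * rr_x**exposed * rr_i**(copies * exposed)


def case_haplotype_frequencies(freqs, at_risk, per_copy_rr):
    """Haplotype frequencies among cases under the multiplicative model.

    Assumption (derived for this sketch): Pr(h | D) is proportional to
    q_h * per_copy_rr for the at-risk haplotype and to q_h otherwise.
    """
    weights = {h: q * (per_copy_rr if h == at_risk else 1.0) for h, q in freqs.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# PGR-like frequencies from Table I (haplotypes with no listed frequency are omitted).
pgr_like = {
    "0000": 0.3265, "0001": 0.1327, "0100": 0.0306, "0101": 0.0408,
    "1000": 0.1633, "1010": 0.0408, "1100": 0.0204, "1110": 0.2449,
}
# Unexposed carrier of one copy, rare-disease baseline, RR_H = 2.
risk_carrier = disease_probability(copies=1, exposed=0, baseline=5.7e-4,
                                   rr_h=2.0, rr_x=1.0, rr_i=1.0)
case_freqs = case_haplotype_frequencies(pgr_like, at_risk="0101", per_copy_rr=2.0)
```

For exposed subjects the per-copy relative risk passed to `case_haplotype_frequencies` would be the product of the haplotype and interaction relative risks. Sampling case haplotypes independently from the reweighted frequencies is one way to exploit the "computationally convenient" Pr(H|D) the text mentions, though the paper does not describe its sampler in this detail.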

TABLE I. Simulated haplotype frequencies

Haplotype   PGR-like   Equifrequent
0000        0.3265     0.0625
0001        0.1327     0.0625
0010        -          0.0625
0011        -          0.0625
0100        0.0306     0.0625
0101        0.0408     0.0625
0110        -          0.0625
0111        -          0.0625
1000        0.1633     0.0625
1001        -          0.0625
1010        0.0408     0.0625
1011        -          0.0625
1100        0.0204     0.0625
1101        -          0.0625
1110        0.2449     0.0625
1111        -          0.0625

We varied B, $RR_H$, $RR_X$, and $RR_I$ to simulate a range of incidence rates and haplotype effects: B was set to 5.7/10,000 for rare disease and 0.1 for common; $RR_H$ and $RR_X$ ranged from 1 to 5; and $RR_I$ ranged from 1 to 2. We analyzed each data set using the true haplotypes and the six methods outlined above. Models with no interaction ($RR_I = 1$) were analyzed using the main effects model (3); models with interaction were analyzed using the main effects and interaction model (4).

DATA APPLICATION

Genotype information for seven SNPs spaced approximately 10,000 base pairs apart spanning the entire PGR gene was obtained from a panel of 45 individuals from Applied Biosystems. All seven SNPs had a minor allele frequency greater than 18%. Haplotypes were reconstructed using PHASE [Stephens and Donnelly, 2003; Stephens et al., 2001] and haplotype tagging SNPs (htSNPs) were selected using BEST [Sebastiani et al., 2003]. Eight haplotypes were described, all with a frequency greater than 2% in this sample set. Four of the seven SNPs were necessary to tag these eight haplotypes.

Fig. 1. Distribution of the probability of carrying the less common at-risk haplotype 0101 conditional on observed unphased genotypes G for heterozygote carriers and non-carriers. Under the MHAP scheme, which assigns each subject her most likely haplotype pair, carriers are misclassified as non-carriers 4.2% of the time for PGR-like haplotype frequencies (Table I); non-carriers are misclassified as carriers 1.4% of the time. Under the IHAP scheme, which samples H from Pr(H|G), the misclassification rate in carriers increases to 16.9%; the misclassification rate in non-carriers is essentially unchanged. For equally frequent haplotypes, the MHAP and IHAP schemes have the same misclassification rates (all H consistent with G are equally likely): 45.8% in carriers, 6.1% in non-carriers.

These four htSNPs (Applied Biosystems assay
numbers P97855189, P97871567, P97911821, and P97929780) were genotyped in 216 cases and 657 matched controls from a nested case-control study within the prospective Nurses’ Health Study (NHS) Cohort to examine the relationship between PGR haplotypes and endometrial cancer. Controls were matched to cases on age at blood draw, menopausal status at blood draw, postmenopausal hormone use at blood draw, time of day at blood draw, date of blood draw, and fasting status at blood draw. Genotyping was carried out using TaqMan primer and probe sets available commercially from Applied Biosystems. A previous study using the NHS Cohort [De Vivo et al., 2002] established that the PGR +331G/A SNP (not
one of the four htSNPs) was functional and significantly associated with endometrial cancer among women with BMI > 28. We also assess the correlation between the +331G/A SNP and the htSNP haplotypes using the haplotype $r^2$ described in Weale et al. [2003].

Note that this application is intended as an illustration of statistical methods for haplotype association rather than a substantive contribution to the genetic epidemiology of endometrial cancer.

RESULTS

SIMULATION STUDY

For PGR-like haplotype frequencies and moderate haplotype relative risks ($RR_H \le 1.5$) all methods performed comparably well in terms of bias and variance estimation, although MH90 estimates showed higher variation and hence had a higher mean squared error (see Fig. 2 and Table II for rare disease; results for common disease were similar but are not shown). This higher variation is due to the smaller sample size; MH90 does not use about 10% of subjects because they do not meet the 90% cut-off (MH70 excluded about 2%). Estimates for all methods based on unphased genotype data showed slightly higher variation than estimates based on phased haplotypes, as expected. However, for larger relative risks, the methods that treat the most likely haplotype pairs as if they were observed show noticeable bias towards the null, with average bias as large as 14%. The MH90 estimates appear to remain unbiased, although due to their large variance the MH90 estimates still have the largest mean squared error. These trends are more pronounced for less common (4%) at-risk haplotypes, but they can also be observed for common at-risk haplotypes (Table II). The variance in estimates for common at-risk haplotypes is smaller than for less common at-risk haplotypes, as expected (smaller effective sample size for less common at-risk haplotypes). Results for estimates of gene × environment interaction are similar (Table II). EHAP and JHAP estimates are comparable to those from analyses of the true, phased haplotypes in terms of bias, showing only a slight increase in variability. MHAP and IHAP are biased towards the null, and this bias increases with effect size.

Fig. 2. Boxplots of parameter estimate errors (estimate − true value) from 200 replicate simulations of a rare disease with a less common (4%) at-risk haplotype (model with no gene × environment interaction) and PGR-like haplotype frequencies. TRUE are the estimates using phase-known haplotypes. The other methods use unphased genotype data and are described in the text.

For equal haplotype frequencies, the EHAP
and JHAP methods again perform well when thesimulated relative risk is moderate (approxi-mately 2) (Table II), while the MHAP andIHAP methods show stronger bias towards thenull (average bias as large as 35%). MH70 andMH90 estimates failed to converge in somereplicates (2 to 10%), as they excluded a largenumber of subjects (approximately 68% for bothmethods) and hence suffered from small samplesize. Under this extreme equal haplotype fre-quency model, many subjects cannot be assigneda haplotype pair with great certainty (Fig. 1); thisis why MH70 and MH90 exclude so manysubjects.For larger relative risks, the confidence intervals

from the EHAP and JHAP methods have slightlysmaller than nominal coverage, apparently be-cause the standard variance estimators (which donot take haplotype uncertainty in to account)underestimate the true variance in the parameterestimates.
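The summary measures used in this section (bias, empirical variance, average estimated variance, mean squared error, and confidence-interval coverage) can be computed from replicate results along the following lines. This is a minimal sketch, not the code used for the simulations: the function name `summarize`, the dictionary keys, and the five toy replicates are hypothetical, and the mean squared error is taken here as the average squared difference between estimate and true value, so that it equals the squared bias plus the empirical variance rescaled by (n − 1)/n.

```python
import math
import statistics


def summarize(estimates, variance_estimates, true_beta, z=1.959964):
    """Summarize replicate estimates of a single parameter.

    estimates          -- parameter estimates, one per replicate
    variance_estimates -- model-based variance estimate from each replicate
    true_beta          -- parameter value used to simulate the data
    z                  -- normal quantile for a two-sided 95% interval
    """
    n = len(estimates)
    errors = [b - true_beta for b in estimates]
    bias = statistics.fmean(errors)
    emp_var = statistics.variance(estimates)             # sample variance of the estimates
    mean_est_var = statistics.fmean(variance_estimates)  # average reported variance
    mse = statistics.fmean([e * e for e in errors])      # mean squared error
    # A replicate "covers" when the true value lies inside estimate +/- z * SE.
    covered = sum(
        1
        for e, v in zip(errors, variance_estimates)
        if abs(e) <= z * math.sqrt(v)
    )
    return {
        "bias": bias,
        "var": emp_var,
        "mean_est_var": mean_est_var,
        "mse": mse,
        "cover": covered / n,
    }


# Hypothetical replicate results (illustrative only, not taken from the paper).
estimates = [0.9, 1.1, 1.4, 0.6, 1.0]
variance_estimates = [0.01] * 5
result = summarize(estimates, variance_estimates, true_beta=1.0)
print(result)
```

In this toy input the reported variances (0.01) are much smaller than the empirical variance of the estimates (0.085), so only three of the five intervals contain the true value. This is the same mechanism, in exaggerated form, that is suggested above for the below-nominal coverage of EHAP and JHAP at larger relative risks.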

DATA APPLICATION

Table III shows the correlation between individual PGR tag-SNP haplotypes and the functional +331 G/A SNP that has been previously associated with endometrial cancer among overweight women [De Vivo et al., 2002]. The r_s^2 [Stram et al., 2003a] between the inferred tag-SNP haplotypes and the functional SNP was 0.66. In the current sample, after adjusting for potential confounders (see footnote a to Table IV), the odds ratio (OR) for disease in carriers of the +331 A allele with BMI > 30 relative to non-carriers with BMI > 30 was 3.93 (95% C.I. 1.41–10.97). As haplotype 3 was most highly correlated with the +331 G/A SNP, it was designated the at-risk haplotype in further analyses (ORs for haplotype 3 were calculated relative to all other haplotypes combined). Adjusted ORs for carriers of haplotype 3 were lower than the corresponding ORs for +331 A allele carriers in all models tested, due to the imperfect correlation of this haplotype with the risk allele (Table IV). However, the relative performance of the haplotype analyses was similar to that seen in the simulations. The EHAP OR had the narrowest confidence interval and was the only OR to reach significance at the 0.05 level. The MH90 estimate showed the least attenuation relative to the +331 G/A OR but had the widest confidence interval; the MHAP OR showed the most attenuation.

TABLE II. Estimate performance: rare disease, select models^a

a. PGR-like frequencies

Method   Bias     var(β̂)   E[v̂ar(β̂)]   MSE      Cover

Uncommon haplotype, RR_H = 1.5, RR_X = RR_I = 1.0
True     −0.006   0.082    0.068       0.082    0.925
EHAP     −0.002   0.094    0.081       0.094    0.945
JHAP     −0.006   0.093    0.080       0.092    0.945
MHAP     −0.051   0.074    0.063       0.076    0.920
MH70     −0.056   0.078    0.069       0.081    0.920
MH90     −0.045   0.173    0.142       0.174    0.945
IHAP     −0.066   0.069    0.080       0.073    0.960

Uncommon haplotype, RR_H = 5.0, RR_X = RR_I = 1.0
True     −0.034   0.042    0.047       0.043    0.955
EHAP     −0.026   0.051    0.053       0.051    0.955
JHAP     −0.048   0.093    0.093       0.095    0.950
MHAP     −0.158   0.041    0.044       0.066    0.890
MH70     −0.158   0.041    0.044       0.066    0.890
MH90     −0.155   0.043    0.045       0.066    0.890
IHAP     −0.219   0.035    0.050       0.083    0.855

Uncommon haplotype, RR_H = RR_I = 2.0, RR_X = 5.0
True     −0.002   0.071    0.065       0.070    0.940
EHAP      0.006   0.0787   0.075       0.077    0.955
JHAP     −0.002   0.079    0.076       0.078    0.965
MHAP     −0.132   0.064    0.060       0.081    0.910
MH70     −0.131   0.066    0.062       0.083    0.915
MH90     −0.007   0.133    0.129       0.132    0.965
IHAP     −0.171   0.055    0.071       0.084    0.920

Common haplotype, RR_H = 1.5, RR_X = RR_I = 1.0
True      0.005   0.017    0.015       0.017    0.935
EHAP      0.004   0.016    0.015       0.016    0.945
JHAP      0.004   0.016    0.015       0.016    0.950
MHAP      0.000   0.016    0.015       0.016    0.940
MH70      0.002   0.016    0.015       0.016    0.950
MH90      0.004   0.016    0.016       0.016    0.960
IHAP     −0.007   0.016    0.015       0.016    0.945

Common haplotype, RR_H = 5.0, RR_X = RR_I = 1.0
True      0.007   0.028    0.025       0.028    0.950
EHAP      0.005   0.029    0.025       0.029    0.935
JHAP      0.013   0.029    0.025       0.029    0.940
MHAP      0.017   0.029    0.025       0.030    0.945
MH70      0.014   0.029    0.026       0.029    0.945
MH90      0.010   0.031    0.027       0.031    0.945
IHAP     −0.018   0.028    0.025       0.028    0.925

Common haplotype, RR_H = RR_I = 2.0, RR_X = 5.0
True      0.015   0.027    0.028       0.027    0.960
EHAP      0.009   0.028    0.028       0.028    0.950
JHAP      0.015   0.028    0.029       0.028    0.955
MHAP      0.018   0.029    0.029       0.029    0.960
MH70      0.017   0.029    0.030       0.029    0.965
MH90      0.016   0.029    0.031       0.029    0.965
IHAP     −0.017   0.026    0.029       0.027    0.945

b. Equal frequencies

Method   Bias     var(β̂)   E[v̂ar(β̂)]   MSE      Cover

RR_H = 2.0, RR_X = RR_I = 1.0
True      0.025   0.044    0.043       0.044    0.95
EHAP      0.064   0.088    0.081       0.091    0.95
JHAP      0.044   0.083    0.077       0.084    0.96
MHAP     −0.290   0.059    0.051       0.142    0.71
MH70      0.040   0.308    0.306       0.306    0.99
MH90      0.094   0.365    0.349       0.370    0.99
IHAP     −0.308   0.031    0.064       0.125    0.86

RR_H = 5.0, RR_X = RR_I = 1.0
True     −0.009   0.035    0.036       0.035    0.960
EHAP      0.025   0.070    0.055       0.070    0.910
JHAP     −0.551   0.031    0.025       0.334    0.090
MHAP     −0.265   0.065    0.051       0.134    0.710
MH70     −0.068   0.157    0.192       0.161    0.940
MH90      0.282   1.656    0.328       1.719    0.939
IHAP     −0.544   0.023    0.040       0.318    0.160

RR_H = RR_I = 2.0, RR_X = 5.0
True      0.000   0.066    0.054       0.065    0.910
EHAP      0.033   0.124    0.089       0.124    0.920
JHAP      0.001   0.104    0.094       0.103    0.939
MHAP     −0.609   0.033    0.037       0.404    0.110
MH70     −0.025   0.364    0.385       0.361    0.902
MH90     −0.024   0.436    0.447       0.432    0.913
IHAP     −0.575   0.032    0.062       0.363    0.340

a Bias is the average difference between parameter estimates and true parameters; var(β̂) is the sample variance of the parameter estimates; E[v̂ar(β̂)] is the sample average of the variance estimates; MSE is the sample average of the squared difference between parameter estimates and true parameters; Cover is the percentage of replicates where the 95% confidence interval contained the true parameters. The various methods (True, EHAP, etc.) are explained in the text.

It should be noted that we performed these analyses with prior knowledge of the risk allele and the haplotype that was most correlated with it. Absent a priori knowledge of the risk haplotype, a standard omnibus analysis would fit a model with six “dummy” variables (one for each haplotype above 5% prevalence except the most common haplotype, which would serve as the referent). ORs from the EHAP implementation of this model are listed in Table IV. The OR for haplotype 3 is no longer statistically significant, as some power is lost by simultaneously estimating haplotype-effect parameters for the other five haplotypes. This example illustrates the “double whammy” that can arise in analyses of haplotype-tagging SNPs. First, the causal variant may not be perfectly correlated with one of the observed haplotypes, leading to a decrease in effective sample size. Second, a straightforward haplotype analysis will increase the number of parameters to be estimated (increase test degrees of freedom), further reducing power. (For dominant or recessive haplotype-effect models, the choice of the reference haplotype also affects power.) Methods to overcome these pitfalls using novel designs (e.g., better methods for choosing “tagging” SNPs) or analyses (e.g., grouping haplotypes based on evolutionary history) are currently under development. Pending future developments, a two-stage analysis seems reasonable. First, test overall gene-disease association using an omnibus multiple-haplotype test under an additive model, setting the most common haplotype as the referent and pooling very rare haplotypes (or excluding them altogether, if their cumulative frequency is too small to permit accurate parameter estimation). Second, if the omnibus test is significant, explore models using individual haplotypes (such as haplotype 3 in the previous example).
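The covariate coding for the omnibus model described above (most common haplotype as referent, one additive term per haplotype above 5% frequency, rare haplotypes pooled) can be sketched as follows. The haplotype frequencies are those listed in Table III; the function names `omnibus_coding` and `design_row`, the strict `>` comparison against the 5% threshold, and the example subject are assumptions made for illustration and do not reproduce the software used for the analysis.

```python
# Estimated haplotype frequencies as listed in Table III.
FREQ = {
    "H1": 0.342, "H2": 0.106, "H3": 0.073, "H4": 0.001,
    "H5": 0.002, "H6": 0.06, "H7": 0.001, "H8": 0.171,
    "H9": 0.181, "H10": 0.001, "H11": 0.062,
}


def omnibus_coding(freq, min_freq=0.05):
    """Split haplotypes into referent, individually modeled, and pooled rare."""
    referent = max(freq, key=freq.get)  # most common haplotype
    modeled = [h for h, f in freq.items() if h != referent and f > min_freq]
    rare = [h for h in freq if h != referent and h not in modeled]
    return referent, modeled, rare


def design_row(expected_copies, modeled, rare):
    """Additive-model covariates for one subject.

    expected_copies maps haplotype -> expected number of copies (0 to 2).
    Returns one value per modeled haplotype plus a final pooled-rare term;
    the referent haplotype is omitted.
    """
    row = [expected_copies.get(h, 0.0) for h in modeled]
    row.append(sum(expected_copies.get(h, 0.0) for h in rare))
    return row


referent, modeled, rare = omnibus_coding(FREQ)
print(referent, modeled, rare)

# Hypothetical subject: one copy of H1, and 0.8 / 0.2 expected copies of H3 / H8.
subject = {"H1": 1.0, "H3": 0.8, "H8": 0.2}
print(design_row(subject, modeled, rare))
```

With the Table III frequencies this selects H1 as the referent and six individually modeled haplotypes (H2, H3, H6, H8, H9, H11), matching the six “dummy” variables mentioned above, with H4, H5, H7, and H10 pooled.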

DISCUSSION

We have shown that two methods that account for uncertainty in haplotype assignments based on unphased genotypes, the regression substitution (EHAP) and simplified joint likelihood (JHAP) approaches, provide accurate estimates of haplotype relative risk and haplotype × environment interaction in matched case-control studies across a range of reasonable relative risks (1–5) when the haplotype uncertainty is small (r_h^2 > 0.8). These two methods were accurate even though they (1) rely on haplotype probability estimates that do not account for ascertainment and (2) do not account for uncertainty in haplotype frequency estimates. The EHAP and JHAP estimates were more variable than estimates based on the true, phased haplotypes. In practice, this can be accounted for at the design stage by increasing the sample size relative to the anticipated haplotype uncertainty [Stram et al., 2003a; Weale et al., 2003]. Our results are consistent with other simulation studies [Zhao et al., 2003] and likelihood comparisons [Stram et al., 2003b] in the context of unmatched case-control studies.

Conversely, methods that assign subjects their most likely haplotype and then analyze the data as if the haplotypes were observed (MHAP, MH70) provide biased estimates (towards the null). Intuitively, this is because the apparent exposure categories (imputed haplotypes) are diluted with members actually from the other categories. In fact, for both PGR-like frequencies and equal haplotype frequencies, carriers of the at-risk haplotype are more likely to be misclassified as non-carriers than vice versa, washing out the true haplotype effect. Restricting the analysis to subjects whose haplotypes could be determined with a high degree of certainty (MH90) increased the accuracy of this approach but greatly increased the variance of the estimates (due to reduction in sample size), leading to large mean squared error. Furthermore, there is no guarantee that this restricted approach will increase accuracy under general models for haplotype relative risk (beyond the simple additive model considered here), because it selectively excludes predicted heterozygotes (homozygotes are always predicted with 100% certainty). Improper multiple imputation (IHAP) remains biased towards the null, for the same reason as MHAP and MH70: the risk estimates themselves are based on misclassified haplotypes. Under PGR-like haplotype frequencies, the magnitude of differential misclassification of carriers and non-carriers actually increases by randomly imputing haplotypes (Fig. 1), leading to increased bias towards the null. The correct multiple imputation approach requires that haplotypes be sampled conditional on both G and outcome, D [Little and Rubin, 2002]. This in turn requires accurate estimates of the penetrance model parameters β. Methods that impute missing data while estimating β are generally much more complicated than the naive multiple imputation approach sketched here. As an alternative, the full, joint likelihood (the ascertainment-corrected likelihood, marginal over unknown phase) could be maximized.

TABLE III. Correlation between the PGR +331 G/A SNP and PGR haplotypes^a

Haplotype   SNP1   SNP2   SNP3   SNP4   Frequency   r^2_{+331}
H1          0      0      0      0      0.342       0.03
H2          1      0      0      0      0.106       0.01
H3          0      1      0      0      0.073       0.85
H4          1      1      0      0      0.001       <0.01
H5          0      0      1      0      0.002       <0.01
H6          1      0      1      0      0.06        <0.01
H7          0      1      1      0      0.001       <0.01
H8          1      1      1      0      0.171       0.01
H9          0      0      0      1      0.181       0.01
H10         1      0      0      1      0.001       <0.01
H11         0      1      0      1      0.062       <0.01

a SNP1–SNP4 are haplotype tagging SNPs chosen as described in the text. r^2_{+331} is the correlation between the given haplotype and the +331 G/A SNP [Weale et al., 2003]. Frequencies differ from those in Table I, because the simulated frequencies were based on estimates from preliminary data. Boldface haplotype (H3) had highest correlation with the +331 G/A SNP.

TABLE IV. Odds ratios of endometrial cancer for progesterone receptor haplotypes

                                   Odds Ratio (95% C.I.)^a
Single SNP (+331 A carriers)       3.93 (1.41–10.97)
Single haplotype (H3 carriers)^b
  MHAP                             2.64 (0.95, 7.31)
  MH90                             3.28 (0.88, 12.23)
  EHAP                             2.85 (1.16, 6.95)
Multiple haplotypes^c
  H2                               0.92 (0.33, 2.55)
  H3                               1.74 (0.51, 5.98)
  H6                               0.72 (0.19, 2.77)
  H8                               0.54 (0.23, 1.26)
  H9                               0.63 (0.25, 1.62)
  H11                              0.35 (0.10, 1.26)

a Analyzed using conditional logistic regression, adjusting for smoking pack years, age at menarche, parity and age at first birth, familial history of uterine and colon cancer, BMI at diagnosis (categorized as BMI < 30 kg/m^2 and BMI ≥ 30), weight gain since age 18, and including a BMI × PGR interaction effect. Reported odds ratios are for carriers with BMI ≥ 30 relative to non-carriers with BMI ≥ 30.
b MH70 analysis identical to MHAP analysis (no subjects excluded). Although both the IHAP and JHAP methods can be extended to account for covariates, flexible software implementing these extensions is not currently available; these methods were not applied to the endometrial cancer data set.
c Haplotype effect parameters fit simultaneously using an EHAP model with seven “dummy” variables: one for each of the listed haplotypes and a seventh pooling all rare haplotypes (OR not shown). Odds ratios are thus relative to the most common haplotype, H1.

For “well-tagged” haplotypes (high r_h^2), the EHAP and JHAP methods appear to be accurate. However, we emphasize that as the ability of the measured SNPs to “tag” the unmeasured causal variant decreases or the uncertainty in inferred haplotypes increases (i.e., as r_h^2 or r_s^2 decreases or the amount of missing genotype data increases), the performance of these methods will deteriorate. Our limited simulation studies show that even with an r_h^2 of 0.54 the EHAP and JHAP methods had confidence intervals with smaller than nominal coverage.

The EHAP and JHAP methods can model individual-specific covariates in the familiar generalized linear model framework. Furthermore, EHAP and JHAP easily accommodate matched designs (e.g., as in this report), whereas methods that compare haplotype frequencies in cases and controls ignore any matching. We note that the matching factors can indeed be ignored if they do not confound the exposure of interest [Rothman and Greenland, 1998]. In many (but not all) cases, it may be reasonable to assume haplotypes are independent of the matching factors in the general population. Thus, if haplotype main effects are of primary interest, the matching can be ignored and the data analyzed using the method of Fallin et al. [2001]. Prospective likelihood approaches such as those proposed by Lake et al. [2003] and the estimating equations approach of Zhao et al. [2003] could also be used, provided they adjust for matching-factor main effects. (These methods still make some assumptions about HWE, although simulation studies reported in Lake et al. [2003] give some evidence that departures from HWE do not greatly change odds ratio estimates.) However, if the interaction between haplotypes and an environmental covariate correlated with a matching factor is of interest, then the matching factor confounds the interaction and cannot be ignored.

The retrospective likelihood approach [Epstein and Satten, 2003] is attractive, because it uses both cases and controls to estimate haplotype frequencies, without assuming the cases are in HWE, and it simultaneously estimates haplotype frequencies and haplotype odds ratios. However, it requires specialized software that currently cannot test or estimate haplotype × environment interactions. The EHAP method is easy to implement in standard statistical software and readily accounts for haplotype × environment interaction.

The goal of our simulation studies was to evaluate the effect of haplotype misclassification on estimates of haplotype and haplotype × environment interaction effects. We assumed the true at-risk variant is a haplotype, consistent with previous simulation studies that investigated the power of association studies using haplotype-tagging SNPs [Thompson et al., 2003]. These simulations did not address the important issue of model misspecification, such as might arise when the true disease variant is a single unobserved SNP. In that case, the variant may lie on several tag-SNP haplotypes and not be strongly correlated with any. Even when the causal variant is specific to one tag-SNP haplotype, a substantial proportion of the chromosomes with that tag-SNP haplotype may not carry the causal variant, leading to a loss in effective sample size and power, as in the PGR example. This issue is related to the discussion of whether tagging SNPs should be chosen to predict multi-marker haplotypes or single SNPs [Chapman et al., 2003; Stram et al., 2003a; Weale et al., 2003; Lazarus et al., 2003], which requires further empirical study and is beyond the scope of this report.
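The contrast drawn in this section between expectation substitution and assignment of the most likely haplotype pair can be made concrete with a small sketch. It assumes each subject comes with a set of posterior probabilities over haplotype pairs compatible with the observed genotype, and it uses additive coding (number of copies of the at-risk haplotype). The function names, the three example subjects, and their probabilities are hypothetical; the snippet illustrates how the covariates differ and is not the implementation used in this report.

```python
import random


def copies(pair, at_risk):
    """Number of copies (0, 1, or 2) of the at-risk haplotype in a pair."""
    return sum(1 for h in pair if h == at_risk)


def ehap_covariate(posterior, at_risk):
    """Expectation substitution: expected copy number under the posterior."""
    return sum(p * copies(pair, at_risk) for pair, p in posterior.items())


def most_likely_covariate(posterior, at_risk, min_prob=0.0):
    """Most likely pair treated as observed (MHAP when min_prob is 0).

    With min_prob = 0.7 or 0.9 this mimics MH70 / MH90: subjects whose best
    pair falls below the cut-off are excluded (returned as None).
    """
    pair, p = max(posterior.items(), key=lambda kv: kv[1])
    if p < min_prob:
        return None
    return copies(pair, at_risk)


def ihap_covariate(posterior, at_risk, rng):
    """Improper imputation: draw one pair from the posterior, ignoring outcome."""
    pairs = list(posterior)
    weights = [posterior[pair] for pair in pairs]
    return copies(rng.choices(pairs, weights=weights, k=1)[0], at_risk)


# Hypothetical posteriors for three subjects (at-risk haplotype: H3).
subjects = {
    "A": {("H1", "H3"): 0.95, ("H2", "H9"): 0.05},
    "B": {("H1", "H3"): 0.55, ("H2", "H9"): 0.45},
    "C": {("H1", "H8"): 0.60, ("H3", "H9"): 0.40},
}

rng = random.Random(0)
for name, post in subjects.items():
    print(
        name,
        ehap_covariate(post, "H3"),
        most_likely_covariate(post, "H3"),
        most_likely_covariate(post, "H3", min_prob=0.9),
        ihap_covariate(post, "H3", rng),
    )
```

Subject C shows the misclassification mechanism discussed above: the most likely pair carries no copy of H3, so MHAP codes the subject as a non-carrier even though the posterior gives a 0.40 chance of carriage, whereas the EHAP covariate retains that information as 0.40. Subjects B and C are dropped under a 90% cut-off.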


ACKNOWLEDGMENTS

The authors thank Dan Stram, Donna Speigelman, and two reviewers for helpful comments and discussion.

REFERENCES

Akey J, Jin L, Xiong M. 2001. Haplotypes vs. single marker linkage disequilibrium tests: what do we gain? Eur J Hum Genet 9:291–300.

Carroll R. 1998. Measurement error in epidemiologic studies. In: Armitage P, Colton T, editors. Encyclopedia of biostatistics. New York: John Wiley. p 2491–2519.

Chapman JM, Cooper JD, Todd JA, Clayton DG. 2003. Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered 56:18–31.

Chapman N, Wijsman E. 1998. Genome screens using linkage disequilibrium tests: optimal marker characteristics and feasibility. Am J Hum Genet 63:1872–1885.

Chiano M, Clayton D. 1998. Fine genetic mapping using haplotype analysis and the missing data problem. Ann Hum Genet 62:55–60.

Daly M, Rioux J, Schaffner S, Hudson T, Lander E. 2001. High-resolution haplotype structure in the human genome. Nature Genet 29:229–232.

De Vivo I, Huggins GS, Hankinson SE, Lescault PJ, Boezen M, Colditz GA, Hunter DJ. 2002. A functional polymorphism in the promoter of the progesterone receptor gene associated with endometrial cancer risk. Proc Natl Acad Sci USA 99:12263–12268.

Epstein M, Satten G. 2003. Inference on haplotype effects in case-control studies using unphased genotype data. Am J Hum Genet 73(6):1316–1329.

Excoffier L, Slatkin M. 1995. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12:921–927.

Fallin D, Schork N. 2000. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid data. Am J Hum Genet 67:947–959.

Fallin D, Cohen A, Essioux L, Chumakov I, Blumenfeld M, Cohen D, Schork N. 2001. Genetic analysis of case/control data using estimated haplotype frequencies: application to APOE locus variation and Alzheimer’s disease. Genome Res 11:143–151.

Fallin D, Beaty T, Liang K, Chen W. 2002. Power comparisons for genotypic vs. allelic TDT methods with > 2 alleles. Genet Epidemiol 23:458–461.

Johnson G, Esposito L, Barratt B, Smith A, Heward J, Di Genova G, Ueda H, Cordell H, Eaves I, Dudbridge F, Twells RCJ, Payne F, Hughes W, Nutland S, Stevens H, Carr P, Tuomilehto-Wolf E, Tuomilehto J, Gough SCL, Clayton DG, Todd JA. 2001. Haplotype tagging for the identification of common disease genes. Nature Genet 29:233–237.

Lake S, Lyon H, Tantisira K, Silverman E, Weiss S, Laird N, Schaid D. 2003. Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum Hered 55:56–65.

Little R, Rubin D. 2002. Statistical analysis with missing data. Hoboken, NJ: John Wiley & Sons.

Lazarus R, Silverman E, Raby B, Kwiatkowski D, Weiss ST. 2003. Choosing subsets of SNP for genotyping in association studies: htSNP and ldSNP. Am J Hum Genet 73:A387.

Morris R, Kaplan N. 2002. On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genet Epidemiol 23:221–233.

Pritchard J. 2001. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet 69:124–137.

Qin Z, Niu T, Liu J. 2002. Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet 70:157.

Rothman KJ, Greenland S. 1998. Matching. In: Rothman KJ, Greenland S, editors. Modern epidemiology. Philadelphia: Lippincott Williams & Williams. p 147–162.

Schaid D. 2002. Reply to Fallin et al. Genet Epidemiol 23:462–464.

Schaid D, Rowland C, Tines D, Jacobson R, Poland G. 2002. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 70:425–434.

Sebastiani P, Lazarus R, Weiss ST, Kunkel LM, Kohane IS, Ramoni MF. 2003. Minimal haplotype tagging. Proc Natl Acad Sci USA 100:9900–9905.

Stephens M, Donnelly P. 2003. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73:1162–1169.

Stephens M, Smith N, Donnelly P. 2001. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 68:978–989.

Stram D, Kopecky K. 2003. Power and uncertainty analysis of epidemiologic studies of radiation-related disease risk in which dose estimates are based on a complex dosimetry system: some observations. Radiat Res 160:408–417.

Stram D, Haiman C, Hirschhorn J, Altshuler D, Kolonel L, Henderson B, Pike M. 2003a. Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Hum Hered 55:27–36.

Stram D, Pearce C, Bretsky P, Freedman M, Hirschhorn J, Altshuler D, Kolonel L, Henderson B, Thomas D. 2003b. Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered 55:179–190.

Thomas DC, Stram D, Dwyer J. 1993. Exposure measurement error: Influence on exposure-disease relationships and methods of correction. Ann Rev Pub Health 14:69–93.

Thompson D, Stram D, Goldgar D, Witte JS. 2003. Haplotype tagging single nucleotide polymorphisms and association studies. Hum Hered 56:48–55.

Weale M, Depondt C, MacDonald S, Smith A, Lai P, Shorvon S, Wood N, Goldstein D. 2003. Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage-disequilibrium mapping. Am J Hum Genet 73:551–565.

Zaykin D, Westfall P, Young S, Karnoub M, Wagner M, Ehm M. 2002. Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered 53:79–91.

Zhao L, Li S, Khalid N. 2003. A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am J Hum Genet 72:1231–1250.
