
Copyright © 2010 by the Genetics Society of America
DOI: 10.1534/genetics.110.121665

Conditions Under Which Genome-Wide Association Studies Will be Positively Misleading

Alexander Platt,*,1 Bjarni J. Vilhjálmsson* and Magnus Nordborg*,†

*Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089 and †Gregor Mendel Institute, Austrian Academy of Sciences, 1030 Vienna, Austria

Manuscript received May 31, 2010
Accepted for publication August 18, 2010

ABSTRACT

Genome-wide association mapping is a popular method for using natural variation within a species to generate a genotype–phenotype map. Statistical association between an allele at a locus and the trait in question is used as evidence that variation at the locus is responsible for variation of the trait. Indirect association, however, can give rise to statistically significant results at loci unrelated to the trait. We use a haploid, three-locus, binary genetic model to describe the conditions under which these indirect associations become stronger than any of the causative associations in the organism, even to the point of representing the only associations present in the data. These indirect associations are the result of disequilibrium between multiple factors affecting a single trait. Epistasis and population structure can exacerbate the problem but are not required to create it. From a statistical point of view, indirect associations are true associations rather than the result of stochastic noise: they will not be ameliorated by increasing sampling size or marker density and can be reproduced in independent studies.

Genome-wide association mapping is a powerful tool that leverages the natural variation of a trait in a population to identify genetic factors that influence the trait. The theory is that due to the large number of recombination events in the genetic history of the population, only markers in tight linkage disequilibrium with loci responsible for the trait variation will exhibit significant statistical association with the trait. There are two ways in which genome-wide association mapping will fail by identifying loci that are not responsible for the variation in the trait (i.e., false positives): stochastic noise can generate an association in a sample that is not present in the larger population, or patterns of correlation among loci and factors causing trait variation can create indirect associations between markers and traits where no causal relation exists. While the former can be well quantified and managed with traditional sampling theory and replication, genomic control, and properly specified error terms in statistical models, these techniques do little to address the latter. As the association is true and not a statistical aberration, all accurate tests of association will point to the same noncausative loci; increasing sample sizes and marker densities will only heighten the misleading results, and these results can be reproduced in all follow-up studies.

It has long been recognized that population structure can cause these kinds of spurious, nonrandom associations (Li 1969; Lander and Schork 1994), and considerable effort has been devoted to addressing this problem statistically (Devlin and Roeder 1999; Pritchard et al. 2000; Price et al. 2006; Yu et al. 2006). However, attention has almost exclusively focused on the case where a noncausal marker is falsely identified as causal (or closely linked to a causal polymorphism) because both it and the trait are correlated with a single unobserved variable (e.g., geographic origin in a structured population). The effect of including multiple causal loci has not adequately been considered.

That this matters has been demonstrated by two recent articles. Dickson et al. (2010) used simulations to show that the presence of two or more rare causal variants in disequilibrium that can themselves not be detected due to lack of statistical power can produce spurious associations that are only distantly linked to the causal polymorphisms, and Atwell et al. (2010) showed that negative disequilibrium between two causal polymorphisms in the gene FRIGIDA interfered with the ability to find either of them but created strong signals at several distantly linked markers in a genome-wide association study in Arabidopsis thaliana.

To understand these cases we need a model with at least three variables: a noncausal marker and two background, unobserved factors. Here we present the simplest possible model, a haploid model of three binary loci, and use it to illustrate what conditions give rise to misleading genome-wide association mapping results.

Available freely online through the author-supported open access option.

1 Corresponding author: Gregor Mendel Institute, Austrian Academy of Sciences, 1030 Vienna, Austria. E-mail: [email protected]

Genetics 186: 1045–1052 (November 2010)


MODEL AND RESULTS

The simplest model possible: Table 1 defines the model: $C$ denotes the causative locus we are trying to identify; $L$ is a latent variable, be it a second locus or an environmental factor, that may also influence the organism's phenotype; and $N$ is a noncausal marker locus. Parameters $a, \ldots, h$ are the population frequencies of all the possible "genotypes" (Table 2). $\beta_C$ and $\beta_L$ represent the additive component of the influence on phenotype of the designated causative allele and state of the latent variable, respectively. $\beta_{LC}$ is an epistatic term defined as the deviation from additivity of the combined effects of $L$ and $C$. Without loss of generality, the causative alleles and latent variables are labeled so that $\beta_C$ and $\beta_L$ are both $\geq 0$ and the noncausal marker is labeled so that $\mathrm{cov}(N, P) \geq 0$. In every case we consider the phenotype, $P$, to be fully determined by $L$ and $C$. There is no stochastic noise included in our analyses.

With this model we can describe simple traits with only a single factor influencing the phenotype by setting $\beta_L$ and $\beta_{LC}$ to 0. A trait governed by purely additive contributions from two factors is modeled by letting $\beta_C$ and $\beta_L$ vary freely but keeping $\beta_{LC}$ at 0. Varying $\beta_{LC}$ gives us a wide range of epistatic effects. Positive values of $\beta_{LC}$ give us synergistic epistasis and negative values are antagonistic.

In association mapping we are looking for nonindependence between alleles and phenotypes. Nonindependence can be quantified in many ways. Our analytical work focuses on covariance between proposed factors and observed phenotypes. A significantly nonzero covariance indicates an association between the trait and the marker being examined. The hope is that this indicates that the associated locus contributes biologically to variation for the trait or is very closely linked to a locus that does. In our model, we want the covariance between the causal polymorphism and the trait, $\mathrm{cov}(C, P)$, to be high (or we will not be able to detect the causal association), and we want the covariance between the noncausal marker and the trait, $\mathrm{cov}(N, P)$, to be high if and only if the marker is tightly linked to the causal polymorphism. We do not want $\mathrm{cov}(N, P) > \mathrm{cov}(C, P)$ lest we misidentify the noncausal marker as causal. The covariance between the latent variable and the trait, $\mathrm{cov}(L, P)$, finally, is just a nuisance from the point of view of identifying the causal polymorphism. For our model, we have

$$\mathrm{cov}(N, P) = \beta_C D_{NC} + \beta_L D_{NL} + \beta_{LC}(D_{NLC} - \rho_N D_{LC}) \tag{1}$$

$$\mathrm{cov}(C, P) = \beta_C \rho_C (1 - \rho_C) + \beta_L D_{LC} + \beta_{LC}(g + h)(1 - \rho_C) \tag{2}$$

$$\mathrm{cov}(L, P) = \beta_L \rho_L (1 - \rho_L) + \beta_C D_{LC} + \beta_{LC}(g + h)(1 - \rho_L). \tag{3}$$

By looking at these covariance terms in various settings we illustrate when we can expect association mapping to be misleading. For clarity, we focus on expectations and do not consider the stochastic error introduced by finite sample sizes.
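Equations 1–3 can be checked numerically against a direct computation of E[XP] - E[X]E[P] from the genotype table. The Python sketch below does this under stated assumptions: the genotype frequencies and effect sizes are arbitrary illustrative values, and the function names (`direct_cov`, `model_cov`) are hypothetical helpers, not part of the original analysis.

```python
from itertools import product

# Genotypes as (L, C, N) tuples, in the order a, b, ..., h of Table 1.
GENOTYPES = list(product((0, 1), repeat=3))

def phenotype(L, C, beta_C, beta_L, beta_LC):
    """Phenotype fully determined by L and C (no stochastic noise)."""
    return beta_C * C + beta_L * L + beta_LC * L * C

def direct_cov(freqs, beta_C, beta_L, beta_LC):
    """Return (cov(N,P), cov(C,P), cov(L,P)) computed as E[XP] - E[X]E[P]."""
    P = [phenotype(L, C, beta_C, beta_L, beta_LC) for L, C, N in GENOTYPES]
    mean_P = sum(fr * p for fr, p in zip(freqs, P))
    covs = []
    for idx in (2, 1, 0):  # tuple positions of N, C, L
        mean_X = sum(fr * gt[idx] for fr, gt in zip(freqs, GENOTYPES))
        mean_XP = sum(fr * gt[idx] * p for fr, gt, p in zip(freqs, GENOTYPES, P))
        covs.append(mean_XP - mean_X * mean_P)
    return tuple(covs)

def model_cov(freqs, beta_C, beta_L, beta_LC):
    """Return the same three covariances from Equations 1-3 and Table 2."""
    a, b, c, d, e, f, g, h = freqs
    rho_L, rho_C, rho_N = e + f + g + h, c + d + g + h, b + d + f + h
    D_NC = d + h - rho_N * rho_C
    D_NL = f + h - rho_N * rho_L
    D_LC = g + h - rho_L * rho_C
    D_NLC = h - rho_N * rho_L * rho_C
    cov_N = beta_C * D_NC + beta_L * D_NL + beta_LC * (D_NLC - rho_N * D_LC)
    cov_C = beta_C * rho_C * (1 - rho_C) + beta_L * D_LC + beta_LC * (g + h) * (1 - rho_C)
    cov_L = beta_L * rho_L * (1 - rho_L) + beta_C * D_LC + beta_LC * (g + h) * (1 - rho_L)
    return cov_N, cov_C, cov_L

# Arbitrary illustrative frequencies (a..h, summing to 1) and effects.
example_freqs = (0.10, 0.05, 0.20, 0.15, 0.10, 0.10, 0.05, 0.25)
example_effects = dict(beta_C=1.0, beta_L=0.5, beta_LC=-0.75)

direct = direct_cov(example_freqs, **example_effects)
model = model_cov(example_freqs, **example_effects)
print(all(abs(x - y) < 1e-12 for x, y in zip(direct, model)))
```

Working from the eight genotype frequencies rather than from simulated individuals keeps the check in expectation, matching the noise-free setting of the model.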

TABLE 1

Model specification

Latent variable   Causative polymorphism   Noncausal marker   Phenotype                            Genotype frequency
0                 0                        0                  $0$                                  $a$
0                 0                        1                  $0$                                  $b$
0                 1                        0                  $\beta_C$                            $c$
0                 1                        1                  $\beta_C$                            $d$
1                 0                        0                  $\beta_L$                            $e$
1                 0                        1                  $\beta_L$                            $f$
1                 1                        0                  $\beta_C + \beta_L + \beta_{LC}$     $g$
1                 1                        1                  $\beta_C + \beta_L + \beta_{LC}$     $h$

The model is defined as a "genotype" of three binary factors, $L$, $C$, and $N$. Every combination of these factors perfectly describes a phenotype $P$ and occurs with a frequency indicated by $a, \ldots, h$. Table 2 defines some useful parameterizations.

TABLE 2

Parameterization

Symbol      Description                        Definition
$\rho_L$    Frequency of variable L            $e + f + g + h$
$\rho_C$    Frequency of allele C              $c + d + g + h$
$\rho_N$    Frequency of allele N              $b + d + f + h$
$D_{NC}$    Disequilibrium between N and C     $d + h - \rho_N \rho_C$
$D_{NL}$    Disequilibrium between N and L     $f + h - \rho_N \rho_L$
$D_{LC}$    Disequilibrium between L and C     $g + h - \rho_L \rho_C$
$D_{NLC}$   Three-locus disequilibrium         $h - \rho_N \rho_L \rho_C$

Reparameterizing the model in terms of the frequencies of individual factors and the disequilibrium between them facilitates biological understanding of what creates associations between factors and phenotypes.

Simple traits: Setting $\beta_L = 0$ and $\beta_{LC} = 0$ we describe a trait that is influenced only by a single causative polymorphism. In this case Equations 1 and 2 reduce to

$$\mathrm{cov}(N, P) = \beta_C D_{NC}$$

and

$$\mathrm{cov}(C, P) = \beta_C \rho_C (1 - \rho_C),$$

respectively. The causative allele will give the most significant results when its effect on the phenotype is large and it is at an intermediate frequency in the sample. The noncausal marker will give significant results when the effect of the causative allele is large and there is disequilibrium between the two loci. In expectation, however, the noncausal marker should not give a more significant result than the causative polymorphism. Indeed,

$$\mathrm{cov}(N, P) \leq \mathrm{cov}(C, P) \tag{4}$$

expands to

$$\beta_C (d + h - \rho_N \rho_C) \leq \beta_C \rho_C (1 - \rho_C),$$

which simplifies to

$$(c + g - b - f)\,\rho_C \leq c + g.$$

This is always true as $c$, $g$, $b$, $f$, and $\rho_C$ are all defined on the interval [0, 1]. While disequilibrium can generate significant results for noncausal markers, with sufficient sample size the most significant results can be expected to be for the causative polymorphism or, if it is not present in the marker panel, the marker in greatest disequilibrium with it.
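The simplification above can be reconstructed from the definitions in Table 2. The following steps are a sketch of the omitted algebra, assuming $\beta_C > 0$ so that it can be divided out, and writing $\rho_C$ and $\rho_N$ for the frequencies of alleles $C$ and $N$:

$$
\begin{aligned}
d + h - \rho_N \rho_C &\leq \rho_C - \rho_C^2 \\
\rho_C (\rho_C - \rho_N) &\leq \rho_C - (d + h) \\
(c + g - b - f)\,\rho_C &\leq c + g,
\end{aligned}
$$

using $\rho_C - \rho_N = (c + d + g + h) - (b + d + f + h) = c + g - b - f$ and $\rho_C - (d + h) = c + g$. The last line holds because $c + g - b - f \leq c + g$ and $0 \leq \rho_C \leq 1$: if the factor $c + g - b - f$ is negative the left-hand side is nonpositive, and otherwise it is at most $c + g$.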

Thus, while false positives, in the sense of significantly associated but unlinked noncausal markers, may exist (especially if population structure induces long-distance linkage disequilibrium across the genome), sufficiently powered association studies should always also locate the causal polymorphism if it exists. However, for traits with more than one contributing factor there is no such guarantee. This is the problem we turn to next. (Association studies can of course always be misleading if no causal polymorphism exists but noncausal markers covary with a nongenetic latent variable: this is readily seen by setting $\beta_C = 0$ and $\beta_{LC} = 0$ in our model.)

Complex traits: When two or more factors contribute to variation in a trait, association studies may be misleading in the sense that noncausal markers can be expected to be more strongly associated than either causal polymorphism. To see this we consider several scenarios beginning with causative factors with only additive effects.

Additive effects, strong latent variable: In an extreme case where effects are additive ($\beta_{LC} = 0$), but $\beta_L \gg \beta_C$, Equations 1 and 2 can be approximated by

$$\mathrm{cov}(N, P) = \beta_L D_{NL}$$

and

$$\mathrm{cov}(C, P) = \beta_L D_{LC}.$$

Under these conditions the causative polymorphism acts like a noncausal marker and the most significant signals will come from whichever one has the greatest disequilibrium with the latent variable that is responsible for most of the variation in the phenotype. If the latent variable is another genetic locus, this is not a problematic result as we have simply approximated the previously described case of a simple genetic trait. If the latent variable is an exogenous factor, however, we now see that we may erroneously ascribe its effect to a genetic locus that happens to be correlated with it.

Equivalent additive factors: Less trivially, setting $\beta_{LC} = 0$ and $\beta_L = \beta_C = \beta$ describes a trait controlled equally by two factors and gives us covariance terms

$$\mathrm{cov}(N, P) = \beta (D_{NC} + D_{NL}) \tag{5}$$

$$\mathrm{cov}(C, P) = \beta [D_{LC} + \rho_C (1 - \rho_C)] \tag{6}$$

$$\mathrm{cov}(L, P) = \beta [D_{LC} + \rho_L (1 - \rho_L)]. \tag{7}$$

In this case, the noncausal marker is expected to have a more significant result than the causative allele whenever

$$D_{NC} + D_{NL} > D_{LC} + \rho_C (1 - \rho_C), \tag{8}$$

which makes it intuitive to see how rare causative alleles can give rise to the kind of "synthetic" association described by Dickson et al. (2010). The term involving $\rho_C$ on the right becomes small, leaving ample opportunity for the two disequilibrium terms on the left to swamp out the one disequilibrium term on the right. The specific pattern described in that article is one where the latent variable is a second causative genetic variant at a locus. This creates strong negative covariance between the two causative factors and eliminates the opportunity for genetic interactions to play any role. In this case the only haplotypes that occur with appreciable frequencies correspond in our model to $a$, $b$, $d$, and $f$. Setting all other haplotype frequencies to 0 in Equation 8 and simplifying shows that under these conditions the strongest association will be expected at the noncausal locus whenever $\rho_N < 1 - bd/f$. For this scenario to cause problematic results, the noncausal marker cannot be too common; otherwise it cannot be in sufficiently strong linkage disequilibrium with the rare causative loci.
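The reduction of Equation 8 to a condition on the marker frequency can be checked numerically. The Python sketch below evaluates both sides of Equation 8 with only haplotypes a, b, d, and f present and compares the outcome with the simplified condition; the frequencies are hypothetical illustrative values and the function names are not from the original article.

```python
def eq8_sides(b, d, f):
    """Both sides of Equation 8 when c = e = g = h = 0 (a = 1 - b - d - f)."""
    rho_L, rho_C, rho_N = f, d, b + d + f
    D_NC = d - rho_N * rho_C       # d + h - rho_N * rho_C with h = 0
    D_NL = f - rho_N * rho_L       # f + h - rho_N * rho_L with h = 0
    D_LC = -rho_L * rho_C          # g + h - rho_L * rho_C with g = h = 0
    return D_NC + D_NL, D_LC + rho_C * (1 - rho_C)

def marker_beats_causal(b, d, f):
    """True when the noncausal marker has the larger expected covariance."""
    lhs, rhs = eq8_sides(b, d, f)
    return lhs > rhs

def simplified_condition(b, d, f):
    """The condition rho_N < 1 - b*d/f stated in the text."""
    return (b + d + f) < 1 - b * d / f

# Hypothetical frequencies: rare causal variants tagged by a rare marker...
rare_marker = dict(b=0.02, d=0.04, f=0.04)
# ...versus the same kind of variants with a very common marker allele.
common_marker = dict(b=0.80, d=0.05, f=0.05)

print(marker_beats_causal(**rare_marker), simplified_condition(**rare_marker))      # True True
print(marker_beats_causal(**common_marker), simplified_condition(**common_marker))  # False False
```

In the first case the marker allele is carried almost only by haplotypes with a causal variant, so it aggregates the signal of both rare variants; in the second the marker allele is mostly found on haplotypes without a causal variant.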

Epistasis: There are limits to the degree of confounding possible when interactions are purely additive. Within the restriction of additivity, even when the strongest signal in an association study is coming from a noncausal locus, we should expect at least one of the truly causative factors to exhibit at least some association. This is because the covariance between the noncausal marker and the phenotype will never be larger than the sum of the covariance between the causative locus and the phenotype and the latent variable and the phenotype. From Equations 1–3,

$$\mathrm{cov}(N, P) \leq \mathrm{cov}(C, P) + \mathrm{cov}(L, P)$$

expands to

$$\beta_C D_{NC} + \beta_L D_{NL} \leq \beta_C \rho_C (1 - \rho_C) + \beta_C D_{LC} + \beta_L \rho_L (1 - \rho_L) + \beta_L D_{LC}.$$

From Equation 4 it follows that

$$\beta_C D_{NC} \leq \beta_C \rho_C (1 - \rho_C),$$

which is also true if you replace all the $C$'s with $L$'s. Doing so and substituting lets us cancel and get

$$0 \leq (\beta_C + \beta_L) D_{LC},$$

which is always true.


A nonzero interaction term does away with this upper bound for $\mathrm{cov}(N, P)$, however. Consider, for example, the case where $\beta_C = \beta_L = \beta$ but $\beta_{LC} = -\beta$ (negative epistasis: either causative allele is sufficient for the phenotype), $a = b = c = e = h = 0$, and $d = f = g = 1/3$ (negative covariance between the two causal factors). In this example, $\mathrm{cov}(C, P) = \mathrm{cov}(L, P) = 0$, but $\mathrm{cov}(N, P) = 2\beta/3$. In other words, the noncausal marker can have an arbitrarily large covariance with the trait even though there is no association for any of the truly causative factors, no matter how powerful the study.

Simulated example: To illustrate the behavior of our model using real polymorphism data, we use the data of Atwell et al. (2010), who carried out a genome-wide association study using 216,130 single-nucleotide polymorphism (SNP) markers in a set of 199 inbred lines of A. thaliana. The sample is characterized by complex population structure (Platt et al. 2010), which makes it ideal for illustrative purposes. Many traits are strongly correlated with latitude in A. thaliana. This can come about through geographically distributed causative genetic polymorphism of large effect, the combined effect of many causative polymorphisms of small effect, or nongenetic confounding factors. We performed two sets of simulations. A first causative locus is picked at random from the 216,130 SNPs and a random allele is assigned an effect. The second causative factor is then either a SNP or a binary environmental factor where both possibilities for an effect allele are used. This is repeated for 10% of the SNPs in the data set and a new trait is generated, resulting in ~43,200 nonconstant traits for each of the sets of simulations. For the first set, the traits are correlated with the population structure of the organism, and the second causative variable is a latent indicator variable that identifies each individual as having been collected north of 50° latitude, a line that lies midway between London and Paris, and that divides the sample roughly in half. In the second set of simulations, the second causative variable is another randomly selected SNP.

Phenotypes were calculated for three different trait architectures, letting $\beta_C = \beta_L = \beta$ with differing degrees of interaction (Table 3). Setting $\beta_{LC} = 0$ gives a purely additive model. With $\beta_{LC} = -\beta$ we get an "or" model where either causative factor is sufficient to create phenotypic change. When describing two genetic loci, this model can reflect the interaction between loss-of-function mutations in different genes in a common pathway. With an environmental cofactor this represents a canalized trait whose genetic variation is revealed phenotypically only in certain environments. As described above, this kind of negative epistasis can give rise to situations where only the noncausal marker is correlated with the phenotype. Setting $\beta_{LC} = -2\beta$ gives us an "xor" model where individuals with zero and two labeled factors share a common phenotype but are different from those with only one (regardless of which one it is). Genetically, this model can reflect the interaction between a compensatory pair of mutations, such as one in a transcription factor and one in a binding site. As an environmental effect this scenario occurs whenever there are trade-offs between responses in different environments. Pathogen resistance is one example. Functional resistance genes can increase seed production where pathogens are present but reduce it where they are not (Korves and Bergelson 2004).
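The three trait architectures differ only in the interaction term, which the following Python sketch makes explicit. It assumes a common effect size of 1 for both factors (a hypothetical value chosen for readability), and the names `ARCHITECTURES` and `phenotype` are illustrative.

```python
BETA = 1.0  # hypothetical common effect size for both causative factors

# Interaction term beta_LC for each architecture described in the text.
ARCHITECTURES = {
    "additive": 0.0,
    "or": -BETA,
    "xor": -2.0 * BETA,
}

def phenotype(latent, causal, beta_LC, beta=BETA):
    """Noise-free phenotype for binary factor states `latent` and `causal`."""
    return beta * latent + beta * causal + beta_LC * latent * causal

# Phenotypes for (latent, causal) = (0,0), (0,1), (1,0), (1,1).
table = {
    name: [phenotype(L, C, beta_LC) for L in (0, 1) for C in (0, 1)]
    for name, beta_LC in ARCHITECTURES.items()
}
for name, values in table.items():
    print(name, values)
# additive [0.0, 1.0, 1.0, 2.0]
# or [0.0, 1.0, 1.0, 1.0]
# xor [0.0, 1.0, 1.0, 0.0]
```

Under "or" the double carrier is indistinguishable from a single carrier, and under "xor" it is indistinguishable from a noncarrier, which is what removes the marginal signal of each individual factor.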

TABLE 3

Simulated phenotypes

"Genotype"                                  Phenotype
Latent variable   Causative polymorphism    Additive: $\beta_{LC} = 0$   or: $\beta_{LC} = -\beta$   xor: $\beta_{LC} = -2\beta$
North             0                         $0$                          $0$                         $0$
South             0                         $\beta$                      $\beta$                     $\beta$
North             1                         $\beta$                      $\beta$                     $\beta$
South             1                         $2\beta$                     $\beta$                     $0$

Model is shown for generating phenotypes from data with one causative genetic locus and a nongenetic, geographic factor that is treated as a latent variable.

For each simulated phenotype we performed a genome-wide association study using the nonparametric Wilcoxon rank sum test on every marker. For the first set of simulations, where the latent variable is a North–South split, Figure 1, A–C, shows how far down in the list of associated markers one would have to go to find the correct locus. In the purely additive simulations there are few problems (Figure 1A). The correct locus is easily identified as one of the very strongest results in almost all cases, with the vast majority of exceptions being associated with cases where the causative locus has a very low minor allele frequency. The or model exhibits greater confounding (Figure 1B). The locus is perfectly identified less than half of the time and is sometimes missed even when the minor allele frequency is intermediate. The correct locus was essentially never found in the xor model regardless of the minor allele frequency (Figure 1C). Measurements of the distance between the causative locus and the locus with the lowest P-value followed the same pattern. When the causative locus is among the highest ranked SNPs, it is near the locus with the lowest P-value. As its rank falls, it tends to be farther and farther away, and by the time it is not within the top 1000 SNPs it is often on the wrong chromosome.
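A single-marker scan of the kind described above can be sketched in a few lines of Python. This is an illustration only: it uses the normal approximation to the Wilcoxon rank sum statistic with a tie correction (the source does not say whether an exact or approximate test was used), the toy genotype matrix is made up, and the function names are hypothetical.

```python
import math
from collections import Counter

def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0  # marker is monomorphic in the sample
    n = n1 + n2
    counts = Counter(list(x) + list(y))
    # Tied values share the average (mid) rank of the positions they occupy.
    midrank, seen = {}, 0
    for value in sorted(counts):
        t = counts[value]
        midrank[value] = seen + (t + 1) / 2
        seen += t
    u = sum(midrank[v] for v in x) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u == 0:
        return 1.0  # constant phenotype: no evidence of association
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))

def causative_rank(genotypes, phenotypes, causative):
    """1-based rank of marker `causative` when markers are sorted by p-value."""
    pvalues = []
    for marker in genotypes:
        carriers = [p for g, p in zip(marker, phenotypes) if g == 1]
        others = [p for g, p in zip(marker, phenotypes) if g == 0]
        pvalues.append(rank_sum_pvalue(carriers, others))
    return 1 + sum(p < pvalues[causative] for p in pvalues)

# Toy data: 3 binary markers scored in 8 haploid individuals (made-up values).
toy_genotypes = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
toy_phenotypes = [float(g) for g in toy_genotypes[0]]  # marker 0 is causal
print(causative_rank(toy_genotypes, toy_phenotypes, causative=0))  # 1
```

The rank returned here corresponds to the quantity summarized in Figure 1, A–C: the number of markers that must be examined before the causative one is reached.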

Figure 1, D–F, shows the distribution of maximum distances to the causative SNP for all markers with association greater than or equal to that of the causative locus. It is evident that when the causative marker is not the most significant, a very distant marker usually is. This is true even in the simple additive case. In the xor model the causative marker is not significant most of the time.

Turning to the simulations with two randomly chosen causative loci, Figure 2, A–C, shows the P-value rank distribution of the two causative alleles, both the top ranking and the second ranking. A true causative locus is essentially always found in the additive case (Figure 2A), and the more weakly associated locus is often among the most significant ones. For the epistatic or and xor models a true causative locus is missed one time in eight and two times in five, respectively (Figure 2, B and C). The rank of the second-ranking causative locus also becomes lower in the epistatic models. Figure 2, D–F, shows the distribution of maximum distances to the nearest causative SNP for all markers with association greater than that of the second-ranking causative locus. This demonstrates that there are often unlinked loci with greater significance than the second-ranking causative locus, even when both causative loci are significant. This is a particularly serious problem in the epistatic models (see also Table 4).

Figure 1. Simulation results for a geographical latent variable, a North–South split. (A–C) Rank of the causative SNP: illustration of how many markers had a stronger association than the causative SNP in a given analysis under (A) the "additive" genetic model, (B) the "or" model, and (C) the "xor" model. Colors indicate the minor allele frequency of the causative SNP. (D–F) Maximum distance to the causative SNP of all SNPs with greater or equal association than the causative SNP under (D) the additive genetic model, (E) the or model, and (F) the xor model. Colors indicate whether the causative marker was found to be significant at the Bonferroni threshold. Only results where at least one SNP was found significant were included in the analysis.

DISCUSSION

Causes of confounding: We used a very simple three-locus model to clarify the conditions under which genome-wide association studies are expected to be reproducibly misleading. We believe there are three distinct problem sources: correlation between causal factors and (unlinked) noncausal markers, more than a single causal factor (especially if the factors themselves are correlated), and epistasis (i.e., nonlinear interactions between causal factors in determining the phenotype).

Consider each in turn.

Correlation with unlinked markers: Correlation between causal factors and unlinked, noncausal markers (note that all noncausal markers are unlinked if the causal factors are nongenetic) violates the basic assumption of genome-wide association studies (GWAS) and causes false positives.

Population structure, by definition, causes genome-wide correlations between alleles (linkage disequilibrium), which can easily lead to genome-wide occurrence of false positives (Rosenberg and Nordborg 2006), a problem that has long been recognized (Li 1969; Lander and Schork 1994) and for which many statistical solutions have been proposed (Devlin and Roeder 1999; Pritchard et al. 2000; Price et al. 2006; Yu et al. 2006).

However, it is important to realize that associations at unlinked, noncausal markers can also arise because of pleiotropy. Consider, for example, a scenario in which one polymorphism affects both skin and eye color and another affects just skin color. If skin color variation is locally adaptive, then selection causes correlation (linkage disequilibrium) between the two loci. A GWAS for eye color would detect associations at both loci, even though one of them has nothing to do with this trait. Unlike false positives caused by population structure, these types of false positives would not occur at random throughout the genome: they would occur only at noncausal markers correlated with causal factors through selection on pleiotropic traits. This might make them less common; it would certainly make them more difficult to eliminate through statistical methods.

More than a single causative factor: Whenever a trait is controlled by more than a single factor, it is possible that the strongest associations in the data are indirect ones. As biologically uninformative as these associations are, they are true associations and will respond as such to statistical tests, gaining significance with increased sampling and reproducing in multiple data sets.

Figure 2. Simulation results for two causative SNPs, where both are chosen at random. (A–C) Rank of the top-ranking causative SNP (blue) and the second-ranking causative SNP (orange) under (A) the "additive" genetic model, (B) the "or" model, and (C) the "xor" model. (D–F) Maximum distance to nearest causative SNP among SNPs with greater association than the more weakly associated causative SNP under (D) the additive genetic model, (E) the or model, and (F) the xor model. Colors indicate whether two, one, or none of the causative SNPs were found significant at the Bonferroni threshold. Only results where at least one SNP was found significant were included in the analysis.

Without any population structure, strong indirect associations can arise at loci that are genetically linked to two or more causative factors, even if the causative factors are in equilibrium with each other. This linkage-only case has been well documented in linkage mapping literature (Haley and Knott 1992; Martinez and Curnow 1992). Here, two genetically linked quantitative trait loci combine to produce a false or "ghost" peak of association between them. In the presence of natural selection it is no longer necessary for the indirectly associated marker to be linked to more than one causative locus (as in the ghost peak version) as correlations will already exist between the causative factors. A marker linked to one is likely to be in disequilibrium with all of them. With population structure or selection and pleiotropy, however, these indirect associations can be far removed from all causative factors.

Epistasis: When the causative loci interact epistatically, it is possible that the only loci exhibiting any association with the phenotype are noncausal. While it has long been recognized that epistatically interacting loci may be difficult to find due to lack of marginal effect (Eaves 1994), correlated noncausal loci can serve as excellent markers for the joint state of several causative loci working in concert.
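A minimal numerical sketch of this situation, under assumptions that are not taken from the simulations in this article: two hypothetical causative loci `c1` and `c2` at equal frequencies and in equilibrium, a trait equal to their exclusive-or, and a noncausal marker `n` that matches the joint state in an arbitrarily chosen 90% of individuals.

```python
from itertools import product
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    var_x = sum((u - mean_x) ** 2 for u in x)
    var_y = sum((v - mean_y) ** 2 for v in y)
    return cov / sqrt(var_x * var_y)


def build_population(per_class=100, n_match=90):
    """Hypothetical haploid sample with an xor trait.

    Each of the four (c1, c2) classes has per_class individuals; the
    noncausal marker equals the trait in n_match of them and differs
    in the rest (both numbers are illustrative assumptions).
    """
    c1, c2, marker, trait = [], [], [], []
    for allele_1, allele_2 in product((0, 1), repeat=2):
        value = allele_1 ^ allele_2  # xor model: no marginal effect at either locus
        for i in range(per_class):
            c1.append(allele_1)
            c2.append(allele_2)
            marker.append(value if i < n_match else 1 - value)
            trait.append(value)
    return c1, c2, marker, trait


c1, c2, marker, trait = build_population()
r_c1 = pearson(c1, trait)
r_c2 = pearson(c2, trait)
r_marker = pearson(marker, trait)

print(f"corr(c1, trait)     = {r_c1:.3f}")
print(f"corr(c2, trait)     = {r_c2:.3f}")
print(f"corr(marker, trait) = {r_marker:.3f}")
```

In this construction neither causative locus shows any single-locus association with the trait, while the noncausal marker does, because it tracks the joint state of the two causative loci.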

Tests for association based on multilocus haplotypes (or that model explicit interaction terms) will improve results but not completely ameliorate the problem. While we have mostly been describing the factors L, C, and N as single loci, they can just as easily represent arbitrarily complex combinations of loci (and external factors). A statistician who perfectly models the trait architecture, and knows that he or she has done so, will have effectively recast the complex trait as a simple trait (albeit with complex inputs). It would be guaranteed that no noncausal marker complex will have a stronger association than the causative factor complex, but there is nothing stopping noncausal marker complexes from having associations just as strong as the causative ones. Even simple noncausal markers may have associations as strong as the causative marker complex, which would mislead any sort of model-selection algorithm.

Conclusions: Our purpose in writing this article was to clarify the conditions under which GWAS are expected to be reproducibly misleading. As our simulation results demonstrate, severe problems may arise when we attempt to model traits that are really due to multiple, possibly correlated, possibly epistatically interacting factors using single-locus models that assume that unlinked, noncausal markers are not correlated with the causal factors. Not only do we face the well-known problem of false positives across the genome, but also we see that the strongest associations may appear on chromosomes completely devoid of causative loci and that the true positives may be undetectable.

In this light, the common practice of “correcting for population structure” may be misguided. The real goal should be correcting for the confounding effects of multiple causative factors. Some of the techniques currently employed as population structure correction actually do this very well. The mixed-model approach (Yu et al. 2006), for instance, can be interpreted as removing the effect of a large number of unlinked selectively neutral factors, each with an uninterestingly small effect on the studied trait (Kang et al. 2010). Approaches such as structured analysis (Pritchard et al. 2000) and principal components analysis (Price et al. 2006), on the other hand, aid in correcting for the correlations among multiple causative factors only to the extent that clustering on global patterns of genetic variation approximates the distributions of the individual causative factors. Attempting to correct for population structure directly, as opposed to correcting for correlations among multiple causative factors, runs the risk of eliminating the effects of the largest, most interesting loci from the study. This will happen whenever alleles at those loci have a distribution similar to the genomic patterns of correlation. Such factors can easily and accurately be identified as being associated, although they will be in disequilibrium with many noncausal loci, making them difficult to locate with any precision.
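The risk of removing a large-effect locus can be sketched with a deliberately crude stand-in for structure correction. The example below is an assumption-laden illustration, not any of the cited methods: a hypothetical two-group `structure` label plays the role of the global pattern of variation, the trait is fully determined by one causal locus, the causal allele matches the group label in an arbitrarily chosen 90% of individuals, and "correction" is simply subtraction of group means from the trait.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    var_x = sum((u - mean_x) ** 2 for u in x)
    var_y = sum((v - mean_y) ** 2 for v in y)
    return cov / sqrt(var_x * var_y)


def build_population(per_group=100, n_match=90):
    """Hypothetical sample: the causal allele follows the group label in
    n_match of per_group individuals; the trait equals the causal allele."""
    structure, causal = [], []
    for group in (0, 1):
        for i in range(per_group):
            structure.append(group)
            causal.append(group if i < n_match else 1 - group)
    trait = list(causal)
    return structure, causal, trait


def remove_group_means(values, groups):
    """Crude stand-in for structure correction: subtract each group's mean."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return [value - totals[group] / counts[group]
            for value, group in zip(values, groups)]


structure, causal, trait = build_population()
adjusted = remove_group_means(trait, structure)

r_before = pearson(causal, trait)     # association before adjustment
r_after = pearson(causal, adjusted)   # association after removing group means

print(f"corr(causal, trait)          = {r_before:.2f}")
print(f"corr(causal, adjusted trait) = {r_after:.2f}")
```

The closer the distribution of the causal allele is to the group label, the more of its effect is absorbed by the adjustment; in the limit where the two coincide, no within-group variation remains for the locus to explain.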

TABLE 4

Summary of simulation results

                               At least one significant? (a)   Top-ranking causal? (b)    Distant noncausal found? (c)
Model                          Additive   or      xor          Additive   or      xor     Additive   or      xor
Latent North–South variable    1.00       1.00    0.86         0.70       0.49    0.00    0.23       0.43    1.00
Two causal loci                1.00       1.00    0.94         0.96       0.80    0.86    0.25       0.76    0.81

(a) Fraction of results with at least one significant SNP (at a Bonferroni-corrected threshold of 0.05) and that were used for subsequent analysis.

(b) Fraction of results in which the top-ranking association was a causal polymorphism (the causal polymorphism in the case of a latent variable).

(c) Fraction of results in which a SNP more strongly associated with the phenotype than a causal polymorphism (the causal polymorphism in the case of a latent variable) was >50 kb away from the nearest causal polymorphism.


This is not to say, however, that the presence of any of these confounding attributes of complex traits dooms a genome-wide association study to failure. All of them, multiple factors, natural selection, epistasis, and population structure, contribute to confounding in quantitative ways and in amounts that will be greatly influenced by their specific details. A carefully constructed human case–control study, for instance, may not suffer from appreciable population structure and would therefore introduce an imprecision only in the location of the cause of the associations. Larger, population-based cohort studies, however, may soon find themselves running into the kinds of large-scale population structure inherent in the human species (Freedman et al. 2004; Novembre et al. 2008). The results may still be mostly accurate if natural selection is weak and the additive effects of the majority of the causative loci are large, but may become questionable when considering highly polygenic traits under strong selection. Genome-wide association studies applied to other organisms, however, may be considerably more problematic. The very worst situation is likely to arise in species that have undergone strong local adaptation or have experienced artificial selection to create numerous different phenotypes. In these cases the correlated effects of population structure and selection may well be expected to swamp any remaining causative associations with rampant and excessive indirect associations spread all across the genome. Organisms like A. thaliana may be intermediate, with confounding ranging from almost nonexistent to extremely problematic depending on the architecture of the trait. In organisms with high levels of confounding, it is necessary to proceed with caution and treat identified associations as hypotheses for follow-up confirmatory studies (Atwell et al. 2010).

It is also worth noting that these indirectly associated sites confound not just the scientist attempting to discover the map between phenotype and genotype, but similarly interfere with the process of natural selection as well. In the example of epistasis described above, in which marginal effects of the causal factors are completely missing, any selection applied to the trait in question would change the allele frequency (producing a partial selective sweep) only at the noncausal, neutral locus, not at any of the loci that actually contribute to the phenotype. Where natural selection has an advantage over the scientist is that the scientist is generally restricted to a snapshot of a population and its patterns of disequilibrium. Natural selection is a process that unfolds over successive generations and may have the opportunity to break apart disadvantageous correlations. Scientists can mimic this process in some cases by performing experimental crosses, genetic transformations, or pedigree- or family-based analyses and thereby disrupting the extant patterns of disequilibrium, although this is often not feasible in clinical studies.

We thank David Conti, Sergey Nuzhdin, Paul Marjoram, Juan Pablo Lewinger, Thomas Turner, Quingrun Zhang, and Quan Long for helpful discussions. This work was supported by the National Science Foundation (DEB-0723935), the National Institutes of Health (P50 HG002790), and the Austrian Academy of Sciences.

LITERATURE CITED

Atwell, S., Y. S. Huang, B. J. Vilhjalmsson, G. Willems, M. Horton et al., 2010 Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465: 627–631.

Devlin, B., and K. Roeder, 1999 Genomic control for association studies. Biometrics 55: 997–1004.

Dickson, S. P., K. Wang, I. Krantz, H. Hakonarson and D. B. Goldstein, 2010 Rare variants create synthetic genome-wide associations. PLoS Biol. 8: e1000294.

Eaves, L. J., 1994 Effect of genetic architecture on the power of human linkage studies to resolve the contribution of quantitative trait loci. Heredity 72: 175–192.

Freedman, M. L., D. Reich, K. L. Penney, G. J. McDonald, A. A. Mignault et al., 2004 Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36: 388–393.

Haley, C. S., and S. A. Knott, 1992 Maximum-likelihood mapping of quantitative trait loci using full-sib families. Genetics 132: 1211–1222.

Kang, H. M., J. H. Sul, S. K. Service, N. A. Zaitlen, S.-y. Kong et al., 2010 Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42: 348–354.

Korves, T., and J. Bergelson, 2004 A novel cost of r gene resistance in the presence of disease. Am. Nat. 163: 489–504.

Lander, E. S., and N. J. Schork, 1994 Genetic dissection of complex traits. Science 265: 2037–2048.

Li, C. C., 1969 Population subdivision with respect to multiple alleles. Ann. Hum. Genet. 33: 23–29.

Martinez, O., and R. N. Curnow, 1992 Estimating the locations and the sizes of the effects of quantitative trait loci using flanking markers. Theor. Appl. Genet. 85: 480–488.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Platt, A., M. Horton, Y. S. Huang, Y. Li, A. E. Anastasio et al., 2010 The scale of population structure in Arabidopsis thaliana. PLoS Genet. 6: e1000843.

Price, A. L., N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick et al., 2006 Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38: 904–909.

Pritchard, J. K., M. Stephens, N. A. Rosenberg and P. Donnelly, 2000 Association mapping in structured populations. Am. J. Hum. Genet. 67: 170–181.

Rosenberg, N., and M. Nordborg, 2006 A general population-genetic model for the production by population structure of spurious genotype-phenotype associations in discrete, admixed, or spatially distributed populations. Genetics 173: 1665–1678.

Yu, J., G. Pressoir, W. Briggs, I. Vroh Bi, M. Yamasaki et al., 2006 A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38: 203–208.

Communicating editor: F. Zou
