2000, Pritchard, Inference of Population Structure, Multilocus, Data


Copyright © 2000 by the Genetics Society of America

Inference of Population Structure Using Multilocus Genotype Data

Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly

Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom

Manuscript received September 23, 1999
Accepted for publication February 18, 2000

ABSTRACT

We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci, e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.

Corresponding author: Jonathan Pritchard, Department of Statistics, University of Oxford, 1 S. Parks Rd., Oxford OX1 3TG, United Kingdom. E-mail: [email protected]

Genetics 155: 945-959 (June 2000)

In applications of population genetics, it is often useful to classify individuals in a sample into populations. In one scenario, the investigator begins with a sample of individuals and wants to say something about the properties of populations. For example, in studies of human evolution, the population is often considered to be the unit of interest, and a great deal of work has focused on learning about the evolutionary relationships of modern populations (e.g., Cavalli et al. 1994). In a second scenario, the investigator begins with a set of predefined populations and wishes to classify individuals of unknown origin. This type of problem arises in many contexts (reviewed by Davies et al. 1999). A standard approach involves sampling DNA from members of a number of potential source populations and using these samples to estimate allele frequencies in each population at a series of unlinked loci. Using the estimated allele frequencies, it is then possible to compute the likelihood that a given genotype originated in each population. Individuals of unknown origin can be assigned to populations according to these likelihoods (Paetkau et al. 1995; Rannala and Mountain 1997).

In both situations described above, a crucial first step is to define a set of populations. The definition of populations is typically subjective, based, for example, on linguistic, cultural, or physical characters, as well as the geographic location of sampled individuals. This subjective approach is usually a sensible way of incorporating diverse types of information. However, it may be difficult to know whether a given assignment of individuals to populations based on these subjective criteria represents a natural assignment in genetic terms, and it would be useful to be able to confirm that subjective classifications are consistent with genetic information and hence appropriate for studying the questions of interest. Further, there are situations where one is interested in "cryptic" population structure, i.e., population structure that is difficult to detect using visible characters, but may be significant in genetic terms. For example, when association mapping is used to find disease genes, the presence of undetected population structure can lead to spurious associations and thus invalidate standard tests (Ewens and Spielman 1995). The problem of cryptic population structure also arises in the context of DNA fingerprinting for forensics, where it is important to assess the degree of population structure to estimate the probability of false matches (Balding and Nichols 1994, 1995; Foreman et al. 1997; Roeder et al. 1998).

Pritchard and Rosenberg (1999) considered how genetic information might be used to detect the presence of cryptic population structure in the association mapping context. More generally, one would like to be able to identify the actual subpopulations and assign individuals (probabilistically) to these populations. In this article we use a Bayesian clustering approach to tackle this problem. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Our method attempts to assign individuals to populations on the basis of their genotypes, while simultaneously estimating population allele frequencies. The method can be applied to various types of markers [e.g., microsatellites, restriction fragment length polymorphisms (RFLPs), or single nucleotide polymorphisms (SNPs)], but it assumes that the marker



loci are unlinked and at linkage equilibrium with one another within populations. It also assumes Hardy-Weinberg equilibrium within populations. (We discuss these assumptions further in BACKGROUND ON CLUSTERING METHODS and the DISCUSSION.)

Our approach is reminiscent of that taken by Smouse et al. (1990), who used the EM algorithm to learn about the contribution of different breeding populations to a sample of salmon collected in the open ocean. It is also closely related to the methods of Foreman et al. (1997) and Roeder et al. (1998), who were concerned with estimating the degree of cryptic population structure to assess the probability of obtaining a false match at DNA fingerprint loci. Consequently they focused on estimating the amount of genetic differentiation among the unobserved populations. In contrast, our primary interest lies in the assignment of individuals to populations. Our approach also differs in that it allows for the presence of admixed individuals in the sample, whose genetic makeup is drawn from more than one of the K populations.

In the next section we provide a brief description of clustering methods in general and describe some advantages of the model-based approach we take. The details of the models and algorithms used are given in MODELS AND METHODS. We illustrate our method with several examples in APPLICATIONS TO DATA: both on simulated data and on sets of genotype data from an endangered bird species and from humans. INCORPORATING POPULATION INFORMATION describes how our method can be extended to incorporate geographic information into the inference process. This may be useful for testing whether particular individuals are migrants or to assist in classifying individuals of unknown origin (as in Rannala and Mountain 1997, for example). Background on the computational methods used in this article is provided in the APPENDIX.

BACKGROUND ON CLUSTERING METHODS

Consider a situation where we have genetic data from a sample of individuals, each of whom is assumed to have originated from a single unknown population (no admixture). Suppose we wish to cluster together individuals who are genetically similar, identify distinct clusters, and perhaps see how these clusters relate to geographical or phenotypic data on the individuals. There are broadly two types of clustering methods we might use:

1. Distance-based methods. These proceed by calculating a pairwise distance matrix, whose entries give the distance (suitably defined) between every pair of individuals. This matrix may then be represented using some convenient graphical representation (such as a tree or a multidimensional scaling plot) and clusters may be identified by eye.

2. Model-based methods. These proceed by assuming that observations from each cluster are random draws from some parametric model. Inference for the parameters corresponding to each cluster is then done jointly with inference for the cluster membership of each individual, using standard statistical methods (for example, maximum-likelihood or Bayesian methods).

Distance-based methods are usually easy to apply and are often visually appealing. In the genetics literature, it has been common to adapt distance-based phylogenetic algorithms, such as neighbor-joining, to clustering multilocus genotype data (e.g., Bowcock et al. 1994). However, these methods suffer from many disadvantages: the clusters identified may be heavily dependent on both the distance measure and graphical representation chosen; it is difficult to assess how confident we should be that the clusters obtained in this way are meaningful; and it is difficult to incorporate additional information such as the geographic sampling locations of individuals. Distance-based methods are thus more suited to exploratory data analysis than to fine statistical inference, and we have chosen to take a model-based approach here.

The first challenge when applying model-based methods is to specify a suitable model for observations from each cluster. To make our discussion more concrete we introduce very briefly some of our model and notation here; a fuller treatment is given later. Assume that each cluster (population) is modeled by a characteristic set of allele frequencies. Let X denote the genotypes of the sampled individuals, Z denote the (unknown) populations of origin of the individuals, and P denote the (unknown) allele frequencies in all populations. (Note that X, Z, and P actually represent multidimensional vectors.) Our main modeling assumptions are Hardy-Weinberg equilibrium within populations and complete linkage equilibrium between loci within populations. Under these assumptions each allele at each locus in each genotype is an independent draw from the appropriate frequency distribution, and this completely specifies the probability distribution $\Pr(X \mid Z, P)$ (given later in Equation 2). Loosely speaking, the idea here is that the model accounts for the presence of Hardy-Weinberg or linkage disequilibrium by introducing population structure and attempts to find population groupings that (as far as possible) are not in disequilibrium. While inference may depend heavily on these modeling assumptions, we feel that it is easier to assess the validity of explicit modeling assumptions than to compare the relative merits of more abstract quantities such as distance measures and graphical representations. In situations where these assumptions are deemed unreasonable then alternative models should be built.

Having specified our model, we must decide how to perform inference for the quantities of interest (Z and P). Here, we have chosen to adopt a Bayesian approach,



by specifying models (priors) $\Pr(Z)$ and $\Pr(P)$, for both Z and P. The Bayesian approach provides a coherent framework for incorporating the inherent uncertainty of parameter estimates into the inference procedure and for evaluating the strength of evidence for the inferred clustering. It also eases the incorporation of various sorts of prior information that may be available, such as information about the geographic sampling location of individuals.

Having observed the genotypes, X, our knowledge about Z and P is then given by the posterior distribution

$$\Pr(Z, P \mid X) \propto \Pr(Z)\,\Pr(P)\,\Pr(X \mid Z, P). \tag{1}$$

While it is not usually possible to compute this distribution exactly, it is possible to obtain an approximate sample $(Z^{(1)}, P^{(1)}), (Z^{(2)}, P^{(2)}), \ldots, (Z^{(M)}, P^{(M)})$ from $\Pr(Z, P \mid X)$ using Markov chain Monte Carlo (MCMC) methods described below (see Gilks et al. 1996b, for more general background). Inference for Z and P may then be based on summary statistics obtained from this sample (see Inference for Z, P, and Q below). A brief introduction to MCMC methods and Gibbs sampling may be found in the APPENDIX.

MODELS AND METHODS

We now provide a more detailed description of our modeling assumptions and the algorithms used to perform inference, beginning with the simpler case where each individual is assumed to have originated in a single population (no admixture).

The model without admixture: Suppose we genotype N diploid individuals at L loci. In the case without admixture, each individual is assumed to originate in one of K populations, each with its own characteristic set of allele frequencies. Let the vector X denote the observed genotypes, Z the (unknown) populations of origin of the individuals, and P the (unknown) allele frequencies in the populations. These vectors consist of the following elements,

$(x_l^{(i,1)}, x_l^{(i,2)})$ = genotype of the ith individual at the lth locus, where i = 1, 2, ..., N and l = 1, 2, ..., L;

$z^{(i)}$ = population from which individual i originated;

$p_{klj}$ = frequency of allele j at locus l in population k, where k = 1, 2, ..., K and j = 1, 2, ..., $J_l$,

where $J_l$ is the number of distinct alleles observed at locus l, and these alleles are labeled 1, 2, ..., $J_l$.

Given the population of origin of each individual, the genotypes are assumed to be generated by drawing alleles independently from the appropriate population frequency distributions,

$$\Pr(x_l^{(i,a)} = j \mid Z, P) = p_{z^{(i)} l j} \tag{2}$$

independently for each $x_l^{(i,a)}$. (Note that $p_{z^{(i)} l j}$ is the frequency of allele j at locus l in the population of origin of individual i.)

Assume that before observing the genotypes we have no information about the population of origin of each individual and that the probability that individual i originated in population k is the same for all k,

$$\Pr(z^{(i)} = k) = 1/K, \tag{3}$$

independently for all individuals. (In cases where some populations may be more heavily represented in the sample than others, this assumption is inappropriate; it would be straightforward to extend our model to deal with such situations.)

We follow the suggestion of Balding and Nichols (1995) (see also Foreman et al. 1997 and Rannala and Mountain 1997) in using the Dirichlet distribution to model the allele frequencies at each locus within each population. The Dirichlet distribution $\mathcal{D}(\lambda_1, \lambda_2, \ldots, \lambda_J)$ is a distribution on allele frequencies $p = (p_1, p_2, \ldots, p_J)$ with the property that these frequencies sum to 1. We use this distribution to specify the probability of a particular set of allele frequencies $p_{kl\cdot}$ for population k at locus l,

$$p_{kl\cdot} \sim \mathcal{D}(\lambda_1, \lambda_2, \ldots, \lambda_{J_l}), \tag{4}$$

independently for each k, l. The expected frequency of allele j is proportional to $\lambda_j$, and the variance of this frequency decreases as the sum of the $\lambda_j$ increases. We take $\lambda_1 = \lambda_2 = \cdots = \lambda_{J_l} = 1.0$, which gives a uniform distribution on the allele frequencies; alternatives are discussed in the DISCUSSION.

MCMC algorithm (without admixture): Equations 2, 3, and 4 define the quantities $\Pr(X \mid Z, P)$, $\Pr(Z)$, and $\Pr(P)$, respectively. By setting $\theta = (\theta_1, \theta_2) = (Z, P)$ and letting $\pi(Z, P) = \Pr(Z, P \mid X)$ we can use the approach outlined in Algorithm A1 to construct a Markov chain with stationary distribution $\Pr(Z, P \mid X)$ as follows:

Algorithm 1: Starting with initial values $Z^{(0)}$ for Z (by drawing $Z^{(0)}$ at random using (3) for example), iterate the following steps for m = 1, 2, ....

Step 1. Sample $P^{(m)}$ from $\Pr(P \mid X, Z^{(m-1)})$.

Step 2. Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)})$.

Informally, step 1 corresponds to estimating the allele frequencies for each population assuming that the population of origin of each individual is known; step 2 corresponds to estimating the population of origin of each individual, assuming that the population allele frequencies are known. For sufficiently large m and c, $(Z^{(m)}, P^{(m)}), (Z^{(m+c)}, P^{(m+c)}), (Z^{(m+2c)}, P^{(m+2c)}), \ldots$ will be approximately independent random samples from $\Pr(Z, P \mid X)$. The distributions required to perform each step are given in the APPENDIX.
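As an illustration of Algorithm 1 (the case without admixture), the following is a minimal Python sketch, not the authors' software. The conditional distributions are deferred to the appendix, so the sketch assumes the standard conjugate forms implied by (2)-(4): a Dirichlet update from allele counts for step 1, and a categorical draw proportional to prior (3) times likelihood (2) for step 2. The function names, the 0-based allele coding, and the data layout `X[i][l] = (a1, a2)` are hypothetical choices made for this example.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one vector from Dirichlet(alphas) by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def gibbs_no_admixture(X, K, n_alleles, n_iter, rng, lam=1.0):
    """Sketch of Algorithm 1.

    X[i][l] is a pair of 0-based allele indices for individual i at locus l;
    n_alleles[l] is the number of distinct alleles at locus l (J_l);
    lam is the common Dirichlet parameter in (4).
    Returns a list of (Z, P) pairs, one per iteration.
    """
    N, L = len(X), len(n_alleles)
    Z = [rng.randrange(K) for _ in range(N)]  # Z^(0) drawn using (3)
    samples = []
    for _ in range(n_iter):
        # Step 1: P | X, Z. Prior (4) plus per-population allele counts
        # gives a Dirichlet for each population k and locus l (assumed form).
        P = []
        for k in range(K):
            P_k = []
            for l in range(L):
                counts = [lam] * n_alleles[l]
                for i in range(N):
                    if Z[i] == k:
                        for a in X[i][l]:
                            counts[a] += 1
                P_k.append(sample_dirichlet(counts, rng))
            P.append(P_k)
        # Step 2: Z | X, P. The uniform prior (3) cancels, so the weight of
        # population k is the product of (2) over loci and allele copies.
        for i in range(N):
            logw = [
                sum(math.log(P[k][l][a]) for l in range(L) for a in X[i][l])
                for k in range(K)
            ]
            top = max(logw)  # subtract the maximum for numerical stability
            weights = [math.exp(w - top) for w in logw]
            Z[i] = rng.choices(range(K), weights=weights)[0]
        samples.append((list(Z), P))
    return samples
```

In use, one would discard an initial burn-in and keep every c-th draw, as described in the text; the sketch returns every iteration and leaves that thinning to the caller.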




The model with admixture: We now expand our model to allow for admixed individuals by introducing a vector Q to denote the admixture proportions for each individual. The elements of Q are

$q_k^{(i)}$ = proportion of individual i's genome that originated from population k.

It is also necessary to modify the vector Z to replace the assumption that each individual i originated in some unknown population $z^{(i)}$ with the assumption that each observed allele copy $x_l^{(i,a)}$ originated in some unknown population $z_l^{(i,a)}$:

$z_l^{(i,a)}$ = population of origin of allele copy $x_l^{(i,a)}$.

We use the term "allele copy" to refer to an allele carried at a particular locus by a particular individual.

Our primary interest now lies in estimating Q. We proceed in a manner similar to the case without admixture, beginning by specifying a probability model for (X, Z, P, Q). Analogues of (2) and (3) are

$$\Pr(x_l^{(i,a)} = j \mid Z, P, Q) = p_{z_l^{(i,a)} l j} \tag{5}$$

and

$$\Pr(z_l^{(i,a)} = k \mid P, Q) = q_k^{(i)}, \tag{6}$$

with (4) being used to model P as before. To complete our model we need to specify a distribution for Q, which in general will depend on the type and amount of admixture we expect to see. Here we model the admixture proportions $q^{(i)} = (q_1^{(i)}, \ldots, q_K^{(i)})$ of individual i using the Dirichlet distribution

$$q^{(i)} \sim \mathcal{D}(\alpha, \alpha, \ldots, \alpha) \tag{7}$$

independently for each individual. For large values of $\alpha$ ($\gg 1$), this models each individual as having allele copies originating from all K populations in equal proportions. For very small values of $\alpha$ ($\ll 1$), it models each individual as originating mostly from a single population, with each population being equally likely. As $\alpha \to 0$ this model becomes the same as our model without admixture (although the implementation of the MCMC algorithm is somewhat different). We allow $\alpha$ to range from 0.0 to 10.0 and attempt to learn about $\alpha$ from the data (specifically we put a uniform prior on $\alpha \in [0, 10]$ and use a Metropolis-Hastings update step to integrate out our uncertainty in $\alpha$). This model may be considered suitable for situations where little is known about admixture; alternatives are discussed in the DISCUSSION.

MCMC algorithm (with admixture): The following algorithm may be used to sample from $\Pr(Z, P, Q \mid X)$.

Algorithm 2: Starting with initial values $Z^{(0)}$ for Z (by drawing $Z^{(0)}$ at random using (3) for example), iterate the following steps for m = 1, 2, ....

Step 1. Sample $P^{(m)}, Q^{(m)}$ from $\Pr(P, Q \mid X, Z^{(m-1)})$.

Step 2. Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)}, Q^{(m)})$.

Step 3. Update $\alpha$ using a Metropolis-Hastings step.

Informally, step 1 corresponds to estimating the allele frequencies for each population and the admixture proportions of each individual, assuming that the population of origin of each allele copy in each individual is known; step 2 corresponds to estimating the population of origin of each allele copy, assuming that the population allele frequencies and the admixture proportions are known. As before, for sufficiently large m and c, $(Z^{(m)}, P^{(m)}, Q^{(m)}), (Z^{(m+c)}, P^{(m+c)}, Q^{(m+c)}), (Z^{(m+2c)}, P^{(m+2c)}, Q^{(m+2c)}), \ldots$ will be approximately independent random samples from $\Pr(Z, P, Q \mid X)$. The distributions required to perform each step are given in the APPENDIX.

Inference: Inference for Z, P, and Q: We now discuss how the MCMC output can be used to perform inference on Z, P, and Q. For simplicity, we focus our attention on Q; inference for Z or P is similar.

Having obtained a sample $Q^{(1)}, \ldots, Q^{(M)}$ (using suitably large burn-in m and thinning interval c) from the posterior distribution of $Q = (q_1, \ldots, q_N)$ given X using the MCMC method, it is desirable to summarize the information contained, perhaps by a point estimate of Q. A seemingly obvious estimate is the posterior mean

$$E(q_i \mid X) \approx \frac{1}{M} \sum_{m=1}^{M} q_i^{(m)}. \tag{8}$$

However, the symmetry of our model implies that the posterior mean of $q_i$ is (1/K, 1/K, ..., 1/K) for all i, whatever the value of X. For example, suppose that there are just two populations and 10 individuals and that the genotypes of these individuals contain strong information that the first 5 are in one population and the second 5 are in the other population. Then either

$$q_1 \ldots q_5 \approx (1, 0) \text{ and } q_6 \ldots q_{10} \approx (0, 1) \tag{9}$$

or

$$q_1 \ldots q_5 \approx (0, 1) \text{ and } q_6 \ldots q_{10} \approx (1, 0), \tag{10}$$

with these two "symmetric modes" being equally likely, leading to the expectation of any given $q_i$ being (0.5, 0.5). This is essentially a problem of nonidentifiability caused by the symmetry of the model [see Stephens (2000b) for more discussion].

In general, if there are K populations then there will be K! sets of symmetric modes. Typically, MCMC schemes find it rather difficult to move between such modes, and the algorithms we describe will usually explore only one of the symmetric modes, even when run for a very large number of iterations. Fortunately this does not bother us greatly, since from the point of view of clustering all the symmetric modes are the same [compare the clusterings corresponding to (9) and (10)]. If our sampler explores only one symmetric mode then the sample means (8) will be very poor estimates of the posterior means for the $q_i$, but will be much better estimates of the modes of the $q_i$, which in this case turn out to be a much better summary of the information in the data. Ironically then, the poor mixing of the MCMC sampler between the symmetric modes gives the asymptotically useless estimator (8) some practical value. Where the MCMC sampler succeeds in moving between symmetric modes, or where it is desired to combine results from samples obtained using different starting points (which may involve combining results corresponding to different modes), more sophisticated methods [such as those described by Stephens (2000b)] may be required.
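The behavior of estimator (8) under the symmetric modes (9) and (10) can be seen in a small numerical sketch. The sampled values below are invented for illustration and are not output from the method; `posterior_mean` is a hypothetical helper name. Averaging draws from a single mode gives an estimate near that mode, whereas pooling draws from both label permutations collapses the estimate to (0.5, 0.5).

```python
def posterior_mean(samples):
    """Equation (8): average the sampled q vectors for one individual."""
    M = len(samples)
    K = len(samples[0])
    return [sum(s[k] for s in samples) / M for k in range(K)]


# Hypothetical draws of q for individual 1 from a chain that stays in mode (9).
one_mode = [(0.96, 0.04), (0.98, 0.02), (0.97, 0.03), (0.99, 0.01)]

# The same draws with the two population labels swapped, i.e., mode (10).
other_mode = [(b, a) for a, b in one_mode]

stuck_estimate = posterior_mean(one_mode)                # near the mode (1, 0)
mixed_estimate = posterior_mean(one_mode + other_mode)   # approximately (0.5, 0.5)
print(stuck_estimate, mixed_estimate)
```

The pooled estimate carries no information about cluster membership, which is why combining runs that sit in different modes calls for relabeling methods of the kind cited above.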




Inference for the number of populations: The problem of inferring the number of clusters, K, present in a data set is notoriously difficult. In the Bayesian paradigm the way to proceed is theoretically straightforward: place a prior distribution on K and base inference for K on the posterior distribution

$$\Pr(K \mid X) \propto \Pr(X \mid K)\,\Pr(K). \tag{11}$$

However, this posterior distribution can be peculiarly dependent on the modeling assumptions made, even where the posterior distributions of other quantities (Q, Z, and P, say) are relatively robust to these assumptions. Moreover, there are typically severe computational challenges in estimating $\Pr(X \mid K)$. We therefore describe an alternative approach, which is motivated by approximating (11) in an ad hoc and computationally convenient way.

Arguments given in the APPENDIX (Inference on K, the number of populations) suggest estimating $\Pr(X \mid K)$ using

$$\Pr(X \mid K) \approx \exp(-\hat{\mu}/2 - \hat{\sigma}^2/8), \tag{12}$$

where

$$\hat{\mu} = \frac{1}{M} \sum_{m=1}^{M} -2 \log \Pr(X \mid Z^{(m)}, P^{(m)}, Q^{(m)}) \tag{13}$$

and

$$\hat{\sigma}^2 = \frac{1}{M} \sum_{m=1}^{M} \left(-2 \log \Pr(X \mid Z^{(m)}, P^{(m)}, Q^{(m)}) - \hat{\mu}\right)^2. \tag{14}$$

We use (12) to estimate $\Pr(X \mid K)$ for each K and substitute these estimates into (11) to approximate the posterior distribution $\Pr(K \mid X)$.

In fact, the assumptions underlying (12) are dubious at best, and we do not claim (or believe) that our procedure provides a quantitatively accurate estimate of the posterior distribution of K. We see it merely as an ad hoc guide to which models are most consistent with the data, with the main justification being that it seems to give sensible answers in practice (see next section for examples). Notwithstanding this, for convenience we continue to refer to "estimating" $\Pr(K \mid X)$ and $\Pr(X \mid K)$.

APPLICATIONS TO DATA

We now illustrate the performance of our method on both simulated data and real data (from an endangered bird species and from humans). The analyses make use of the methods described in The model with admixture.

Simulated data: To test the performance of the clustering method in cases where the "answers" are known, we simulated data from three population models, using standard coalescent techniques (Hudson 1990). We assumed that sampled individuals were genotyped at a series of unlinked microsatellite loci. Data were simulated under the following models.

Model 1: A single random-mating population of constant size.

Model 2: Two random-mating populations of constant effective population size 2N. These were assumed to have split from a single ancestral population, also of size 2N, at a time N generations in the past, with no subsequent migration.

Model 3: Admixture of populations. Two discrete populations of equal size, related as in model 2, were fused to produce a single random-mating population. Samples were collected after two generations of random mating in the merged population. Thus, individuals have i grandparents from population 1, and $4 - i$ grandparents from population 2, with probability $\binom{4}{i}/16$, where $i \in \{0, 1, \ldots, 4\}$. All loci were simulated independently.

We present results from analyzing data sets simulated under each model. Data set 1 was simulated under model 1, with 5 microsatellite loci. Data sets 2A and 2B were simulated under model 2, with 5 and 15 microsatellite loci, respectively. Data set 3 was simulated under model 3, with 60 loci (preliminary analyses with fewer loci showed this to be a much harder problem than models 1 and 2). Microsatellite mutation was modeled by a simple stepwise mutation process, with the mutation parameter $4N\mu$ set at 16.0 per locus (i.e., the expected variance in repeat scores within populations was 8.0). We did not make use of the assumed mutation model in analyzing the simulated data.

Our analysis consists of two phases. First, we consider the issue of model choice, i.e., how many populations are most appropriate for interpreting the data. Then, we examine the clustering of individuals for the inferred number of populations.

Choice of K for simulated data: For each model, we ran a series of independent runs of the Gibbs sampler for each value of K (the number of populations) between 1 and 5. The results presented are based on runs of $10^6$ iterations or more, following a burn-in period of at least 30,000 iterations. To choose the length of the burn-in period, we printed out $\log(\Pr(X \mid P^{(m)}, Q^{(m)}))$, and several other summary statistics during the course of a series of trial runs, to estimate how long it took to reach (approximate) stationarity. To check for possible problems with mixing, we compared the estimates of $\Pr(X \mid K)$ and other summary statistics obtained over several independent runs of the Gibbs sampler, starting from different initial points. In general, substantial differences between runs can indicate that either the runs should be longer to obtain more accurate estimates or that independent runs are getting stuck in different modes in the parameter space. (Here, we consider the K! modes that arise from the nonidentifiability of the K populations to be equivalent, since they arise from permuting the K population labels.)
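The ad hoc estimate in (12)-(14) and its substitution into (11) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the input is assumed to be a list of sampled values of log Pr(X | Z, P, Q) from one run at a given K, and the prior on K is assumed uniform, as in the simulated-data analysis. The computation is kept on the log scale because the likelihoods themselves would underflow.

```python
import math


def log_pr_x_given_k(log_liks):
    """Log of the estimate (12), from sampled log Pr(X | Z, P, Q) values."""
    deviances = [-2.0 * ll for ll in log_liks]
    M = len(deviances)
    mu_hat = sum(deviances) / M                                # equation (13)
    sigma2_hat = sum((d - mu_hat) ** 2 for d in deviances) / M  # equation (14)
    return -mu_hat / 2.0 - sigma2_hat / 8.0


def posterior_over_k(log_pr_by_k):
    """Equation (11) with a uniform prior on K, normalized stably."""
    top = max(log_pr_by_k.values())
    weights = {k: math.exp(v - top) for k, v in log_pr_by_k.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}
```

For example, calling `posterior_over_k({1: -110.0, 2: -100.0, 3: -105.0})` with these made-up log estimates puts nearly all of the mass on K = 2. As the text cautions, such numbers are a rough guide to which values of K are consistent with the data rather than accurate posterior probabilities.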




We found that in most cases we obtained consistent estimates of $\Pr(X \mid K)$ across independent runs. However, when analyzing data set 2A with K = 3, the Gibbs sampler found two different modes. This data set actually contains two populations, and when K is set to 3, one of the populations expands to fill two of the three clusters. It is somewhat arbitrary which of the two populations expands to fill the extra cluster: this leads to two modes of slightly different heights. The Gibbs sampler did not manage to move between the two modes in any of our runs.

In Table 1 we report estimates of the posterior probabilities of values of K, assuming a uniform prior on K between 1 and 5, obtained as described in Inference for the number of populations. We repeat the warning given there that these numbers should be regarded as rough guides to which models are consistent with the data, rather than accurate estimates of the posterior probabilities. In the case where we found two modes (data set 2A, K = 3), we present results based on the mode that gave the higher estimate of $\Pr(X \mid K)$.

TABLE 1

Estimated posterior probabilities of K, for simulated data sets 1, 2A, 2B, and 3 (denoted X1, X2A, X2B, and X3, respectively)

K    P(K|X1)    P(K|X2A)    P(K|X2B)    P(K|X3)
1    1.0        0.0         0.0         0.0
2    0.0        0.21        0.999       1.0
3    0.0        0.58        0.0009      0.0
4    0.0        0.21        0.0         0.0
5    0.0        0.0         0.0         0.0

The numbers should be regarded as a rough guide to which models are consistent with the data, rather than accurate estimates of posterior probabilities.

With all four simulated data sets we were able to correctly infer whether or not there was population structure (K = 1 for data set 1 and K > 1 otherwise). In the case of data set 2A, which consisted of just 5 loci, there is not a clear estimate of K, as the posterior probability is consistent with both the correct value, K = 2, and also with K = 3 or 4. However, when the number of loci was increased to 15 (data set 2B), virtually all of the posterior probability was on the correct number of populations, K = 2.

Data set 3 was simulated under a more complicated model, where most individuals have mixed ancestry. In this case, the population was formed by admixture of two populations, so the "true" clustering is with K = 2, and Q estimating the number of grandparents from each of the two original populations, for each individual. Intuitively it seems that another plausible clustering would be with K = 5, individuals being assigned to clusters according to how many grandparents they have from each population. In biological terms, the solution with K = 2 is more natural and is indeed the inferred value of K for this data set using our ad hoc guide [the estimated value of $\Pr(X \mid K)$ was higher for K = 5 than for K = 3, 4, or 6, but much lower than for K = 2]. However, this raises an important point: the inferred value of K may not always have a clear biological interpretation (an issue that we return to in the DISCUSSION).

Clustering of simulated data: Having considered the problem of estimating the number of populations, we now examine the performance of the clustering algorithm in assigning particular individuals to the appropriate populations. In the case where the populations are discrete, the clustering performs very well (Figure 1), even with just 5 loci (data set 2A), and essentially perfectly with 15 loci (data set 2B).

Figure 1. Summary of the clustering results for simulated data sets 2A and 2B, respectively. For each individual, we computed the mean value of $q_1^{(i)}$ (the proportion of ancestry in population 1), over a single run of the Gibbs sampler. The dashed line is a histogram of mean values of $q_1^{(i)}$ for individuals from population 0; the solid line is for individuals from population 1.

The case with admixture (Figure 2) appears to be more difficult, even using many more loci. However, the clustering algorithm did manage to identify the population structure appropriately and estimated the ancestry of individuals with reasonable accuracy. Part of the reason that this problem is difficult is that it is hard to estimate the original allele frequencies (before admixture) when almost all the individuals (7/8) are admixed. A more fundamental problem is that it is difficult to get accurate estimates of $q^{(i)}$ for particular individuals because (as can be seen from the y-axis of Figure 2) for any given individual, the variance of how many of its alleles are actually derived from each population can be substantial (for intermediate q). This property means that even if the allele frequencies were known, it would still be necessary to use a considerable number of loci to get accurate estimates of q for admixed individuals.




Figur e 2.—Summary of the clustering results for simulateddata set 3. Each point plots the estimated value of q (i )

1 (theproportion of an cestry in po pulation 1) fo r a particular indi-vidual against the fraction of their alleles that were actuallyderived from population 1 (across the 60 loci genotyped).The five clusters (from left to right) are for individuals with0, 1, . . . , 4 grandparents in population 1, respectively.

can be substantial (for intermediate q). This property means that even if the allele frequencies were known, it would still be necessary to use a considerable number of loci to get accurate estimates of q for admixed individuals.

Data from the Taita thrush: We now present results from applying our method to genotype data from an endangered bird species, the Taita thrush, Turdus helleri. Individuals were sampled at four locations in southeast Kenya [Chawia (17 individuals), Ngangao (54), Mbololo (80), and Yale (4)]. Each individual was genotyped at seven microsatellite loci (Galbusera et al. 2000).

This data set is a useful test for our clustering method, because the geographic samples are likely to represent distinct populations. These locations represent fragments of indigenous cloud forest, separated from each other by human settlements and cultivated areas. Yale, which is a very small fragment, is quite close to Ngangao. Extensive data on ringed and radio-tagged birds over a 3-year period indicate low migration rates (Galbusera et al. 2000).

Figure 3. Neighbor-joining tree of individuals in the T. helleri data set. Each tip represents a single individual. C, M, N, and Y indicate the populations of origin (Chawia, Mbololo, Ngangao, and Yale, respectively). Using the labels, it is possible to group the Chawia and Mbololo individuals into (somewhat) distinct clusters, as marked. However, it would not be possible to identify these clusters if the population labels were not available. Individuals who appear to be misclassified are marked *. One of these individuals [marked (*)] was also identified by our own algorithm as a possible migrant. The tree was constructed using the program Neighbor included in Phylip (Felsenstein 1993). The pairwise distance matrix was computed as follows (Mountain and Cavalli-Sforza 1997). For each pair of individuals, we added 1/L for each locus at which they had no alleles in common, 1/2L for each locus at which they had one allele in common (e.g., AA:AB or AB:AC), and 0 for each locus at which they had two alleles in common (e.g., AA:AA or AB:AB), where L is the number of loci compared.
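The distance rule given in the legend of Figure 3 can be written out in a few lines. The following Python sketch is an illustration only, not the code used to build the tree: the function names, the representation of a genotype as a pair of allele labels, and the choice to skip loci with missing data (marked None) when counting L are all assumptions made here.

```python
from collections import Counter


def shared_alleles(g1, g2):
    """Count alleles shared by two diploid genotypes, with multiplicity (0, 1, or 2)."""
    # Multiset intersection: AA vs AB shares one A; AB vs AB shares two alleles.
    return sum((Counter(g1) & Counter(g2)).values())


def pairwise_distance(ind1, ind2):
    """Distance between two individuals following the rule in the Figure 3 legend.

    Each individual is a list of genotypes, one (allele, allele) tuple per locus.
    A locus with missing data is given as None and is not compared (assumption).
    """
    compared = [(a, b) for a, b in zip(ind1, ind2) if a is not None and b is not None]
    L = len(compared)
    if L == 0:
        raise ValueError("no loci can be compared for this pair")
    # 0 shared alleles -> 1/L, 1 shared -> 1/(2L), 2 shared -> 0.
    return sum((2 - shared_alleles(a, b)) / (2 * L) for a, b in compared)


# Hypothetical genotypes at three loci.
bird_a = [("A", "A"), ("A", "B"), ("A", "A")]
bird_b = [("A", "B"), ("A", "B"), ("B", "B")]
d = pairwise_distance(bird_a, bird_b)
print(d)
```

In this made-up example the two individuals share one, two, and zero alleles at the three loci, so the distance is 1/6 + 0 + 1/3. A matrix of such distances is what a neighbor-joining program would take as input.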

As discussed in BACKGROUND ON CLUSTERING METHODS, it is currently common to use distance-based clustering methods to visualize genotype data of this kind. To permit a comparison between that type of approach and our own method, we begin by showing a neighbor-joining tree of the bird data (Figure 3). Inspection of the tree reveals that the Chawia and Mbololo individuals represent (somewhat) distinct clusters. Several individuals (marked by asterisks) appear to be classified with other groups. The four Yale individuals appear to fall within the Ngangao group [a view that is supported by summary statistics of divergence showing the Yale and Ngangao to be very closely related (Table 2)].

TABLE 2

Summary statistics of variation within and between geographic groups

            Chawia   Mbololo   Ngangao   Yale
Chawia        5.1
Mbololo       7.1      5.6
Ngangao       3.1      1.6       5.5
Yale          1.9      2.3       0.1      6.0

Diagonal, variance in repeat scores within groups; below diagonal, square of mean difference in repeat scores between populations [$(\delta\mu)^2$; Goldstein and Pollock 1997, Equation C3].

The tree illustrates several shortcomings of distance-based clustering methods. First, it would not be possible (in this case) to identify the appropriate clusters if the labels were missing. Second, since the tree does not use a formal probability model, it is difficult to ask statistical questions about features of the tree, for example: Are the individuals marked with asterisks actually migrants, or are they simply misclassified by chance? Is there evidence of population structure within the Ngangao group (which appears from the tree to be quite diverse)?

We now apply our clustering method to these data.

Choice of K, for Taita thrush data: To choose an appropriate value of K for modeling the data, we ran a series of independent runs of the Gibbs sampler at a range of values of K. After running numerous medium-length runs to investigate the behavior of the Gibbs sampler (using the diagnostics described in Choice of K for simulated data), we again chose to use a burn-in period of 30,000 iterations and to collect data for 10^6 iterations. We ran three to five independent simulations of this length for each K between 1 and 5 and found that the independent runs produced highly consistent results. At K = 5, a run of 10^6 steps takes 70 min on our desktop machine.

Using the approach described in Inference for the number of populations, we estimated Pr(X|K) for K = 1, 2, . . . , 5 and corresponding values of Pr(K|X) for a uniform prior on K = 1, 2, . . . , 5. (In fact, this data set contains a lot of information about K, so that inference is relatively robust to choice of prior on K, and other priors, such as taking Pr(K) proportional to Poisson(1) for K > 0, would give virtually indistinguishable results.) From the estimates of Pr(K|X), shown in the last column of Table 3, it is clear that the models with K = 1 or 2 are completely insufficient to model the data and that the model with K = 3 is substantially better than models with larger K. Given these results, we now focus our subsequent analysis on the model with three populations.

TABLE 3

Inferring the value of K, the number of populations, for the T. helleri data

K    log P(X|K)    P(K|X)
1      -3144       0
2      -2769       0
3      -2678       0.993
4      -2683       0.007
5      -2688       0.00005

The values in the last column assume a uniform prior for K (K ∈ {1, . . . , 5}).

Clustering results for Taita thrush data: Figure 4 shows a plot of the clustering results for the individuals in the sample, assuming that there are three populations (as inferred above). We did not use (and indeed, did not know) the sampling locations of individuals when we obtained these results. Our clustering algorithm seems to have performed very well, with just a few individuals (labeled 1–4) falling somewhat outside the obvious clusters. All of the points in the extreme corners (some of which may be difficult to resolve on the picture) are correctly assigned. The four Yale individuals were assigned to the Ngangao cluster, consistent with the neighbor-joining tree and the $(\delta\mu)^2$ distances. We return to this data set in INCORPORATING POPULATION INFORMATION to consider the question of whether the individuals that seem not to cluster tightly with others sampled from the same location are the product of migration.

Figure 4. Summary of the clustering results for the T. helleri data assuming three populations. Each point shows the mean estimated ancestry for an individual in the sample. For a given individual, the values of the three coefficients in the ancestry vector $q^{(i)}$ are given by the distances to each of the three sides of the equilateral triangle. After the clustering was performed, the points were labeled according to sampling location. Numbers 1–4 are individuals who appear to be possible outliers (see text). For clarity, the four Yale individuals (who fall into the Ngangao cluster) are not plotted. We were not told the sampling locations of individuals until after we obtained these results.

Application to human data: The next data set, taken from Jorde et al. (1995), includes data from 30 biallelic restriction site polymorphisms, genotyped in 72 Africans (Sotho, Tsonga, Nguni, Biaka and Mbuti Pygmies, and San) and 90 Europeans (British and French).

Application of our MCMC scheme with K = 2 indicates the presence of two very distinct clusters, corresponding to the Africans and Europeans in the sample (Figure 5). The model with K = 2 has vastly higher posterior probability than the model with K = 1.

Figure 5. Summary of the clustering results for the data set of Africans and Europeans taken from Jorde et al. (1995). For each individual, we computed the mean value of $q_1^{(i)}$ (the proportion of ancestry in population 1), over a single run of the Gibbs sampler. The dashed line is a histogram of mean values of $q_1^{(i)}$ for individuals of European origin; the solid line is for individuals of African origin.

Additional runs of the MCMC scheme with the models K = 3, 4, and 5 suggest that those models may be somewhat better than K = 2. This may reflect the presence of population structure within the continental groupings, although in this case the additional populations do not form discrete clusters and so are difficult to interpret.

Again it is interesting to contrast our clustering results with the neighbor-joining tree of these data (Figure 6). While our method finds it quite easy to separate the two continental groups into the correct clusters, it would not be possible to use the neighbor-joining tree to detect distinct clusters if the labels were not present. The data set of Jorde also contains a set of individuals of Asian origin (which are more closely related to Europeans than are Africans). Neither the neighbor-joining method nor our method differentiates between the Europeans and Asians with great accuracy using this data set.

Figure 6. Neighbor-joining tree of individuals in the data set of Jorde et al. (1995). Each tip represents a single individual. A and E indicate that individuals were African or European, respectively. The tree was constructed as in Figure 3.

INCORPORATING POPULATION INFORMATION

The results presented so far have focused on testing how well our method works. We now turn our attention to some further applications of this method.

Our clustering results (Figure 4) confirm that the three main geographic groupings in the thrush data set (Chawia, Mbololo, and Ngangao) represent three genetically distinct populations. The geographic labels correspond very closely to the genetic clustering in all but a handful of cases (1–4 in Figure 4). Individual 2 is also identified as a possible outlier on the neighbor-joining tree (Figure 3). Given this, it is natural to ask whether these apparent outliers are immigrants (or descendants of recent immigrants) from other populations. For example, given the genetic data, how probable is it that individual 1 is actually an immigrant from Chawia?

To answer this sort of question, we need to extend our algorithm to incorporate the geographic labels. By doing this, we break the symmetry of the labels, and we can ask specifically whether a particular individual is a migrant from Chawia (say). In essence our approach (described more formally in the next section) is to assume that each individual originated, with high probability, in the geographical region in which it was sampled, but to allow some small probability that it is an immigrant (or has immigrant ancestry). Note that this model is also suitable for situations in which individuals are classified according to some characteristic other than sampling location (physical appearance, for example). "Immigrants" in this situation would be individuals whose genetic makeup suggests they were misclassified. Thus, while we speak of "immigrants" and "immigrant ancestry," in some contexts these terms may relate to something other than changes in physical location.

Provided that geographic labels usually correspond to population membership, using the geographic information will clearly improve our accuracy at assigning individuals to clusters; it will also improve our estimates of P, thus also giving us greater precision in assignment of individuals who do not have geographic information. However, in practice we suggest that before making use of such information, users of our method should first cluster the data without using the geographic labels, to check that the genetically defined clusters do in fact agree with geographic labels. We return to this issue in the DISCUSSION.

Rannala and Mountain (1997) also considered the problem of detecting immigrants and individuals with recent immigrant ancestors, taking a somewhat similar approach to that used here. However, rather than considering all individuals simultaneously, as we do here, they test each individual in the sample, one at a time, as a possible immigrant, assuming that all the other individuals are not immigrants. This approach will have reduced power to detect immigrants if the sample contains several immigrants from one population to another. In contrast, our approach can cope well with this kind of situation.

Model with prior population information: To incorporate geographic information, we use the following model. Our primary goal is to identify individuals who are immigrants, or who have recent immigrant ancestry, in the last G generations, say, where G = 0 is the present generation. [In practice there will only be substantial power to detect immigration for small G; cf. Rannala and Mountain (1997).]

First, we code each of the geographic locations by a (unique) integer between 1 and K, where K would usually be set equal to the number of locations. Using this coding, let $g^{(i)}$ represent the geographic sampling location of individual i. Now, let $\nu$ be the probability that an individual is an immigrant to population $g^{(i)}$ or has an immigrant ancestor in the last G generations. Otherwise, with probability $1 - \nu$, the individual is considered to be purely from population $g^{(i)}$. While in principle one could place a prior on $\nu$ and learn about it from the data as part of the MCMC scheme, in our current implementation the user must specify a fixed value for $\nu$; we give some guidelines in the next section.

Assuming that migration is rare, we can use the approximation that each individual has at most one immigrant ancestor in the last G generations (where G is suitably small). Then, assuming a constant migration rate, the probability of an immigrant ancestor in generation t (0 ≤ t ≤ G) is proportional to $2^t$, where t = 0 indicates that the individual migrated in the present generation. Thus, we set the prior on $q^{(i)}$ to be

$$q^{(i)}_{g^{(i)}} = 1, \qquad q^{(i)}_k = 0 \quad (k \neq g^{(i)}) \tag{15}$$

with probability $1 - \nu$ and

$$q^{(i)}_{g^{(i)}} = 1 - 2^{-t}, \qquad q^{(i)}_j = 2^{-t}, \qquad q^{(i)}_k = 0 \quad (k \neq g^{(i)}, j) \tag{16}$$

for each $j \neq g^{(i)}$ with probability

$$\frac{2^t \nu}{(K - 1) \sum_{T=0}^{G} 2^T}, \tag{17}$$

where $t \in \{0, \ldots, G\}$. As before, $q^{(i)}_l \geq 0$ for $l \in \{1, \ldots, K\}$, and $\sum_l q^{(i)}_l = 1$.

Again, we can sample from Pr(Q|X) using Algorithm 2. In this case, however, since there are a small number of possible values of $q^{(i)}$, we update $q^{(i)}$ by sampling directly from the posterior probability of $q^{(i)} \mid X, P$, rather than conditional on Z.

Note that in this framework, it is easy to include individuals for whom there is no geographic information by using the same prior and update steps as before (Equations 7 and A10).

Testing for migrants in the Taita thrush data: To apply our method, we must first specify a value for $\nu$. In this case, based on mark-release-recapture data from these populations (Galbusera et al. 2000), migration seems relatively rare, and so $\nu$ is likely to be small. We performed analyses for $\nu$ = 0.05 and $\nu$ = 0.1; a summary of the results is shown in Table 4. Individuals 2 and 3 have moderate posterior probabilities of having migrant ancestry, but these probabilities are perhaps smaller than might be expected from examining Figure 4. This is due to a combination of the low prior probability for migration (from the mark-release-recapture data) and, perhaps more importantly, the fact that there is a limited amount of information in seven loci, so that the uncertainty associated with the position of the points marked 1, 2, 3, and 4 in Figure 4 may be quite large. A more definite conclusion could be obtained by typing more loci.

It is interesting to note that our conclusions here differ from those obtained on this data set using the package IMMANC (Rannala and Mountain 1997). IMMANC indicates that three individuals (1, 2, and 3 here) show significant evidence of immigrant ancestry at the 0.01 significance level (Galbusera et al. 2000). However, IMMANC does not make a multiple comparisons correction; such a correction would bring those results into line with ours.

We anticipate that our method might also be applied in situations where there is little data to help make an informed choice of $\nu$. In such situations we suggest analyzing the data using several different values of $\nu$, to see whether the conclusions are robust to choice of $\nu$. The range of sensible values for $\nu$ will depend on the context, but typically we suggest values in the range 0.001–0.1 might be appropriate. Sensitivity to choice of $\nu$ indicates that the amount of information in the data is insufficient to draw strong conclusions.

DISCUSSION

We have described a method for using multilocus genotype data to learn about population structure and assign individuals (probabilistically) to populations.




TABLE 4

Testing whether particular individuals are immigrants or have recent immigrant ancestors

             Geographic   Possible           No immigrant               Immigrant   Immigrant
Individual   origin       source     ν       ancestry       Immigrant   parent      grandparent
1            Ngangao      Chawia     0.05    0.869          0.008       0.052       0.063
                                     0.10    0.739          0.019       0.106       0.123
2            Ngangao      Mbololo    0.05    0.673          0.029       0.126       0.168
                                     0.10    0.472          0.046       0.203       0.273
3            Mbololo      Ngangao    0.05    0.649          0.002       0.179       0.165
                                     0.10    0.464          0.003       0.271       0.253
4            Mbololo      Chawia     0.05    0.891          0.000       0.007       0.082
                                     0.10    0.791          0.000       0.014       0.157

The individuals are labeled as shown in Figure 4. "No immigrant ancestry" gives the probability that the ancestry of each individual is exclusively in the geographic origin population; the following columns show the probabilities that each individual has the given amount of ancestry in the possible source population. The rows do not add to 1 because there are small probabilities associated with individuals having ancestry in the third population.
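The prior of Equations 15-17 is simple enough to tabulate directly, which helps in reading the posterior probabilities of Table 4 against their starting point. The Python sketch below is illustrative and is not part of the structure package: the function name is hypothetical, K = 3 matches the thrush analysis, and G = 2 is an assumption inferred from the columns of Table 4 (immigrant, immigrant parent, immigrant grandparent). For each generation t it returns the prior probability of a single immigrant ancestor from one particular source population, together with the fraction of ancestry, 2^(-t), contributed by that ancestor.

```python
from fractions import Fraction


def migrant_prior(nu, K, G):
    """Tabulate the prior on ancestry described by Equations 15-17 (sketch).

    Returns (p_none, events). p_none is the prior probability of no immigrant
    ancestry. events maps t -> (probability, fraction), where probability is
    that of one immigrant ancestor t generations back from ONE given source
    population and fraction = 2**-t is the ancestry derived from that source.
    Pass nu as a Fraction or a string such as "0.05" to keep exact arithmetic.
    """
    nu = Fraction(nu)
    # Normalizing constant of Equation 17: (K - 1) * sum_{T=0}^{G} 2^T.
    norm = (K - 1) * sum(2 ** T for T in range(G + 1))
    events = {t: (nu * 2 ** t / norm, Fraction(1, 2 ** t)) for t in range(G + 1)}
    return 1 - nu, events


# Assumed settings: nu = 0.05, three populations, ancestors up to grandparents.
p_none, events = migrant_prior("0.05", K=3, G=2)
total = p_none + (3 - 1) * sum(p for p, _ in events.values())

print("no immigrant ancestry:", p_none)
for t, (p, frac) in events.items():
    print("t =", t, "prior =", p, "ancestry from source =", frac)
print("total prior mass:", total)
```

Under these assumed settings the prior mass on each immigrant category is small (1/280, 1/140, and 1/70 per source population for t = 0, 1, 2), and the total over both possible source populations plus the no-immigrant case is 1.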

Our method also provides a novel approach to testing for the presence of population structure (K > 1).

Our examples demonstrate that the method can accurately cluster individuals into their appropriate populations, even using only a modest number of loci. In practice, the accuracy of the assignments depends on a number of factors, including the number of individuals (which affects the accuracy of the estimate for P), the number of loci (which affects the accuracy of the estimate for Q), the amount of admixture, and the extent of allele-frequency differences among populations.

We anticipate that our method will be useful for identifying populations and assigning individuals in situations where there is little information about population structure. It should also be useful in problems where cryptic population structure is a concern, as a way of identifying subpopulations. Even in situations where there is nongenetic information that can be used to define populations, it may be useful to use the approach developed here to ensure that populations defined on an extrinsic basis reflect the underlying genetic structure.

As described in INCORPORATING POPULATION INFORMATION we have also developed a framework that makes it possible to combine genetic information with prior information about the geographic sampling location of individuals. Besides being used to detect migrants, this could also be used in situations where there is strong prior population information for some individuals, but not for others. For example, in hybrid zones it may be possible to identify some individuals who do not have mixed ancestry and then to estimate q for the rest (M. Beaumont, D. Gotelli, E. M. Barett, A. C. Kitchener, M. J. Daniels, J. K. Pritchard and M. W. Bruford, unpublished results). The advantage of using a clustering approach in such cases is that it makes the method more robust to the presence of misclassified individuals and should be more accurate than if only preclassified individuals are used to estimate allele frequencies (cf. Smouse et al. 1990).

Another type of application where the geographic information might be of value is in evolutionary studies of population relationships. Such analyses frequently make use of summary statistics based on population allele frequencies [e.g., $F_{ST}$ and $(\delta\mu)^2$]. In situations where the population allele frequencies might be affected by recent immigration or where population classifications are unclear, such summary statistics could be calculated directly from the population allele frequencies P estimated by the Gibbs sampler.

There are several ways in which the basic model that we have described here might be modified to produce better performance in particular cases. For example, in MODELS AND METHODS and APPLICATIONS TO DATA we assumed relatively noninformative priors for q. However, in some situations, there might be quite a bit of information about likely values of q, and the estimation procedure could be improved by using that information. For example, in estimating admixture proportions for African Americans, it would be possible to improve the estimation procedure by making use of existing information about the extent of European admixture (e.g., Parra et al. 1998).

A second way in which the basic model can be modified involves changing the way in which the allele frequencies P are estimated. Throughout this article, we have assumed that the allele frequencies in different populations are uncorrelated with one another. This is a convenient approximation for populations that are not extremely closely related and, as we have seen, can produce accurate clustering. However, loosely speaking, the model of uncorrelated allele frequencies says that we do not normally expect to see populations with very similar allele frequencies. This property has the result that the clustering algorithm may tend to merge subpopulations that share similar frequencies. An alternative,




which we have implemented in our software package, is to permit allele frequencies to be correlated across populations (APPENDIX, Model with correlated allele frequencies). In a series of additional simulations, we have found that this allows us to perform accurate assignments of individuals in very closely related populations, though possibly at the cost of making us likely to overestimate K.

Our basic model might also be modified to allow for linkage among marker loci. Normally, we would not expect to see linkage disequilibrium within subpopulations, except between markers that are extremely close together. This means that in situations where there is little admixture, our assumption of independence among loci will be quite accurate. However, we might expect to see strong correlations among linked loci when there is recent admixture. This occurs because an individual who is admixed will inherit large chromosomal segments from one population or another. Thus, when the map order of marker loci is known, it should be possible to improve the accuracy of the estimation for such individuals by modeling the inheritance of these segments.

In this article we have devoted considerable attention to the problem of inferring K. This is an important practical problem from the standpoint of model choice. We need to have some way of deciding which clustering model is most appropriate for interpreting the data. However, we stress that care should be taken in the interpretation of the inferred value of K. To begin with, due to the very high dimensionality of the parameter space, we found it difficult to obtain reliable estimates of Pr(X|K) and have chosen to use a fairly ad hoc procedure that we have found gives sensible results in practice. Second, it has been observed that in Bayesian model-based clustering, the posterior distribution of K tends to be quite dependent on the priors and modeling assumptions, even though estimates of the other parameters (e.g., P and Q here) may be reasonably robust (see Richardson and Green 1997; Stephens 2000a, for example).

There are also biological reasons to be careful interpreting K. The population model that we have adopted here is obviously an idealization. We anticipate that it will be flexible enough to permit appropriate clustering for a wide range of population structures. However, as we pointed out in our discussion of data set 3 (Choice of K for simulated data), clusters may not necessarily correspond to "real" populations. As another example, imagine a species that lives on a continuous plane, but has low dispersal rates, so that allele frequencies vary continuously across the plane. If we sample at K distinct locations, we might infer the presence of K clusters, but the inferred number K is not biologically interesting, as it was determined purely by the sampling scheme. All that can usefully be said in such a situation is that the migration rates between the sampling locations are not high enough to make the population act as a single unstructured population.

In summary, we find that the method described here can produce highly accurate clustering and sensible choices of K, both for simulated data and for real data from humans and from the Taita thrush. In the latter example, we find it particularly encouraging that using a relatively small number of loci (seven) we can detect a very strong signal of population structure and assign individuals appropriately.

The algorithms described in this article have been implemented in a computer software package structure, which is available at http://www.stats.ox.ac.uk/~pritch/home.html.

We thank Peter Galbusera and Lynn Jorde for allowing us to use their data, Augie Kong for a helpful discussion, Daniel Falush for suggesting comparison with neighbor-joining trees, Steve Brooks and Trevor Sweeting for helpful discussions on inferring K, and Eric Anderson for his extensive comments on an earlier version of the manuscript. This work was supported by National Institutes of Health grant GM19634 and by a Hitchings-Elion fellowship from Burroughs-Wellcome Fund to J.K.P., by a grant from the University of Oxford and a Wellcome Trust Fellowship (057416) to M.S., and by grants GR/M14197 and 43/MMI09788, from the Engineering and Physical Sciences Research Council and Biotechnology and Biological Sciences Research Council, respectively, to P.D. The work was initiated while the authors were resident at the Isaac Newton Institute for Mathematical Sciences, Cambridge, UK.

LITERATURE CITED

Balding, D. J., and R. A. Nichols, 1994 DNA profile match probability calculations: how to allow for population stratification, relatedness, database selection and single bands. Forensic Sci. Int. 64: 125–140.
Balding, D. J., and R. A. Nichols, 1995 A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96: 3–12.
Bowcock, A. M., A. Ruiz-Linares, J. Tomfohrde, E. Minch, J. Kidd et al., 1994 High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368: 455–457.
Cavalli-Sforza, L. L., P. Menozzi and A. Piazza, 1994 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ.
Chib, S., 1995 Marginal likelihood from the Gibbs output. J. Am. Stat. Assoc. 90: 1313–1321.
Chib, S., and E. Greenberg, 1995 Understanding the Metropolis-Hastings algorithm. Am. Stat. 49: 327–335.
Davies, N., F. X. Villablanca and G. K. Roderick, 1999 Determining the source of individuals: multilocus genotyping in nonequilibrium population genetics. TREE 14: 17–21.
DiCiccio, T., R. Kass, A. Raftery and L. Wasserman, 1997 Computing Bayes factors by posterior simulation and asymptotic approximations. J. Am. Stat. Assoc. 92: 903–915.
Ewens, W. J., and R. S. Spielman, 1995 The transmission/disequilibrium test: history, subdivision, and admixture. Am. J. Hum. Genet. 57: 455–464.
Felsenstein, J., 1993 PHYLIP (phylogeny inference package) version 3.5c. Technical report, Department of Genetics, University of Washington, Seattle.
Foreman, L., A. Smith and I. Evett, 1997 Bayesian analysis of DNA profiling data in forensic identification applications. J. R. Stat. Soc. A 160: 429–469.
Galbusera, P., L. Lens, E. Waiyaki, T. Schenck and E. Mattysen, 2000 Effective population size and gene flow in the globally,




critically endangered Taita thrush, Turdus helleri. Conserv. Genet. (in press).
Gilks, W. R., S. Richardson and D. J. Spiegelhalter, 1996a Introducing Markov chain Monte Carlo, pp. 1–19 in Markov Chain Monte Carlo in Practice, edited by W. R. Gilks, S. Richardson and D. J. Spiegelhalter. Chapman & Hall, London.
Gilks, W. R., S. Richardson and D. J. Spiegelhalter (Editors), 1996b Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
Goldstein, D. B., and D. Pollock, 1997 Launching microsatellites: a review of mutation processes and methods of phylogenetic inference. J. Hered. 88: 335–342.
Green, P. J., 1995 Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82: 711–732.
Hudson, R. R., 1990 Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, Vol. 7, edited by D. Futuyma and J. Antonovics. Oxford University Press, Oxford.
Jorde, L. B., M. J. Bamshad, W. S. Watkins, R. Zenger, A. E. Fraley et al., 1995 Origins and affinities of modern humans: a comparison of mitochondrial and nuclear genetic data. Am. J. Hum. Genet. 57: 523–538.
Mountain, J. L., and L. L. Cavalli-Sforza, 1997 Multilocus genotypes, a tree of individuals, and human evolutionary history. Am. J. Hum. Genet. 61: 705–718.
Paetkau, D., W. Calvert, I. Stirling and C. Strobeck, 1995 Microsatellite analysis of population structure in Canadian polar bears. Mol. Ecol. 4: 347–354.
Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.
Pritchard, J. K., and N. A. Rosenberg, 1999 Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet. 65: 220–228.
Raftery, A. E., 1996 Hypothesis testing and model selection, pp. 163–188 in Markov Chain Monte Carlo in Practice, edited by W. R. Gilks, S. Richardson and D. J. Spiegelhalter. Chapman & Hall, London.
Rannala, B., and J. L. Mountain, 1997 Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94: 9197–9201.
Richardson, S., and P. J. Green, 1997 On Bayesian analysis of mixtures with an unknown number of components. J. R. Stat. Soc. Ser. B 59: 731–792.
Roeder, K., M. Escobar, J. B. Kadane and I. Balazs, 1998 Measuring heterogeneity in forensic databases using hierarchical Bayes models. Biometrika 85: 269–287.
Smouse, P. E., R. S. Waples and J. A. Tworek, 1990 A genetic mixture analysis for use with incomplete source population-data.

stationary distribution $\pi(\theta)$. This is often surprisingly straightforward using standard methods devised for this purpose, such as the Metropolis-Hastings algorithm (e.g., Chib and Greenberg 1995) and Gibbs sampling (e.g., Gilks et al. 1996a), which we describe in more detail below. Intuitively, if the Markov chain $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots$ has stationary distribution $\pi(\theta)$, then $\theta^{(m)}$ will be approximately distributed as $\pi(\theta)$ provided m is sufficiently large. This can be formalized and shown to be true provided the Markov chain satisfies certain technical conditions (ergodicity) that hold for the Markov chains considered in this article. Furthermore, for sufficiently large c, $\theta^{(m)}, \theta^{(m+c)}, \theta^{(m+2c)}, \ldots$ will be reasonably independent samples from $\pi(\theta)$. The value of m used is often referred to as the burn-in period of the chain; c is often referred to as the thinning interval.

In general it is very difficult to know how large m and c should be. The values required to obtain reliable results depend heavily on the amount of correlation between successive states of the Markov chain. If successive states are relatively uncorrelated (that is, if the chain moves quickly between reasonably different values of $\theta$), then the chain is said to mix well, and relatively small values of m and c will suffice. Conversely, if the chain mixes badly (sometimes known as being sticky, as the chain will tend to get stuck moving among very similar values of $\theta$), then very large values of m and c will be required, possibly rendering the method impracticable. One strategy for investigating whether m and c are sufficiently large, and the strategy we adopt here, is to simulate several realizations of the Markov chain, each starting from a different value of $\theta^{(0)}$. If m and c are sufficiently large, then the results obtained should be independent of $\theta^{(0)}$ and should therefore be similar for the different runs. Substantial differences among the results obtained for the different runs indicate that m and c are too small. It is then necessary either to increase m and c or (if this makes the method computationally

Can. J. Fish. Aquat. Sci. 47: 620– 634. infeasible) to construct a Markov chain with better mix-Spieg el ha l t er , D. J., N. G. Best an d B. P. Car l in, 1999 Bayesian

ing p roperties. In the exam ples presented in th is articledeviance, the effective number of parameters, and the compari-son of arbitrarily complex models. Available from http:// we ha ve chosen c ϭ 1.www.mrc-bsu.cam.ac.uk/publications/preslid.shtml. G ibbs sampling is a metho d o f constructing a Markov

St ephens, M., 2000a Bayesian analysis of mixtures with an unknownchain with stationary d istribution () , wh ic h h a snumber of com ponents—an alternative to reversible jump meth-

ods. Ann. Stat. (in press). proved particularly useful for clustering problems. Sup-St eph ens, M., 2000b Dealing with label-switching in mixture mod- pose that may be partitioned into ϭ (1 , . . . , r ),

els. J. R. Stat. Soc. Ser. B (in press).and that although it is not possible to simulate from

Communicating editor: M. K. Uyen oya ma () directly, it is possible to simulate a random value

of i directly from the full conditional distribution

(i | 1, 2, . . . , i Ϫ1, i ϩ 1, . . . , r ) for i ϭ 1, 2, . . . ,APPENDIX r. Then th e following a lgorithm may be used to simulate

a Markov chain with stationary distribution () :MCMC methodsand Gibbs sampling:Al gor it h m A1: Starting with initial values (0) ϭ

MCMC methods are extremely useful for obtaining ((0)1 , . . . , (0)

r ) , iterate the f oll owin g steps for m ϭ 1,(approximate) samples from a probability distribution, 2, . . . .(), say, which cannot be simulated from directly [in

Step 1. Sample (m )1 from (1|

(m Ϫ1)2 , (m Ϫ1)

3 , . . . , (m Ϫ1)r ) .our case ϭ (Z , P , Q ) a nd () ϭ Pr( Z , P , Q |X )]. The

idea is to construct a Markov chain (0), (1), (2), . . . with Step 2. Sample (m )2 from (2|

(m )1 , (m Ϫ1)

3 , . . . , (m Ϫ1)r ) .




⋮

Step r. Sample $\theta_r^{(m)}$ from $\pi(\theta_r \mid \theta_1^{(m)}, \theta_2^{(m)}, \ldots, \theta_{r-1}^{(m)})$.

It is easy to show that if $\theta^{(m-1)} \sim \pi(\theta)$, then $\theta^{(m)} \sim \pi(\theta)$, and so $\pi(\theta)$ is the stationary distribution of this Markov chain.

Inference on K, the number of populations

We now provide further details regarding our approach to choosing K (see Inference for the number of populations).

The simplest way of estimating $\Pr(X \mid K)$ is the so-called harmonic mean estimator

$$\frac{1}{\Pr(X \mid K)} = \int \frac{\Pr(Z, P, Q \mid X, K)}{\Pr(X \mid Z, P, Q, K)} \, dZ \, dP \, dQ \approx \frac{1}{M} \sum_{m=1}^{M} \frac{1}{\Pr(X \mid Z^{(m)}, P^{(m)}, Q^{(m)}, K)}. \tag{A1}$$

This estimator is notoriously unstable, often having infinite variance, and is thus of little use in practice. One theoretically attractive alternative involves estimating $\Pr(P, Q \mid X)$ for some P, Q (Chib 1995; Raftery 1996). However, our own implementation of versions of this approach has turned out to be computationally infeasible, due to the very high-dimensional parameter space of our problem. While alternative approaches to estimating $\Pr(X \mid K)$, such as variable-dimension MCMC methods (Green 1995; Stephens 2000a) or importance sampling (DiCiccio et al. 1997), may lead to computationally feasible algorithms, the high-dimensional parameter space makes designing efficient versions of these schemes rather challenging. For this reason we take a more ad hoc approach, which begins by considering the Bayesian deviance

$$D(Z, P, Q) = -2 \log \Pr(X \mid Z, P, Q). \tag{A2}$$

The conditional mean and variance of D given X are easily estimated using

$$E(D(Z, P, Q) \mid X) \approx \frac{1}{M} \sum_{m=1}^{M} -2 \log \Pr(X \mid Z^{(m)}, P^{(m)}, Q^{(m)}) = \hat{\mu}, \text{ say}, \tag{A3}$$

and

$$\mathrm{Var}(D(Z, P, Q) \mid X) \approx \frac{1}{M} \sum_{m=1}^{M} \left(-2 \log \Pr(X \mid Z^{(m)}, P^{(m)}, Q^{(m)}) - \hat{\mu}\right)^2 = \hat{\sigma}^2, \text{ say}. \tag{A4}$$

If we make the (admittedly dubious) assumption that the conditional distribution of D given X is normal, then it follows from (A1) that

$$-2 \log(\Pr(X \mid K)) \approx \hat{\mu} + \hat{\sigma}^2/4. \tag{A5}$$

(Replacing the assumption of normality with the assumption of being gamma-distributed may be more asymptotically justifiable and gives similar results.) We then use this to estimate the posterior distribution of K from (11). An alternative interpretation of this method is that model selection is based on penalizing the mean of the Bayesian deviance by a quarter of its variance (cf. Spiegelhalter et al. 1999, who suggested investigating model fit using a different penalization of the mean of the Bayesian deviance).

Details of the MCMC algorithms

Algorithm A2: Step 1 may be performed by simulating $p_{kl\cdot}$ independently for each (k, l), from

$$p_{kl\cdot} \mid X, Z \sim \mathcal{D}(\lambda_1 + n_{kl1}, \ldots, \lambda_{J_l} + n_{klJ_l}), \tag{A6}$$

where

$$n_{klj} = \#\{(i, a) : x_l^{(i,a)} = j \text{ and } z^{(i)} = k\} \tag{A7}$$

is the number of copies of allele j at locus l observed in individuals assigned (by Z) to population k.

Step 2 may be performed by simulating $z^{(i)}$, independently for each i, from

$$\Pr(z^{(i)} = k \mid X, P) = \frac{\Pr(x^{(i)} \mid P, z^{(i)} = k)}{\sum_{k'=1}^{K} \Pr(x^{(i)} \mid P, z^{(i)} = k')}, \tag{A8}$$

where $\Pr(x^{(i)} \mid P, z^{(i)} = k) = \prod_{l=1}^{L} p_{kl x_l^{(i,1)}} \, p_{kl x_l^{(i,2)}}$. Note that Equation A8 makes an implicit assumption that an equal fraction of the sample is drawn from each population. Alternatively, it might be natural to introduce an additional parameter for the fraction of the sample drawn from each population.

Algorithm A3: Step 1 may be performed by updating P and Q independently. Updating P is achieved as before, using Equation A6 but where the definition (A7) of $n_{klj}$ is modified in the obvious way to

$$n_{klj} = \#\{(i, a) : x_l^{(i,a)} = j \text{ and } z_l^{(i,a)} = k\}. \tag{A9}$$

Updating Q involves simulating from

$$q^{(i)} \mid X, Z \sim \mathcal{D}(\alpha + m_1^{(i)}, \ldots, \alpha + m_K^{(i)}), \tag{A10}$$

where $m_k^{(i)}$ is the number of allele copies in individual i that originated (according to Z) in population k:

$$m_k^{(i)} = \#\{(l, a) : z_l^{(i,a)} = k\}. \tag{A11}$$

Step 2 may be performed by simulating $z_l^{(i,a)}$, independently for each i, a, l, from

$$\Pr(z_l^{(i,a)} = k \mid X, P) = \frac{q_k^{(i)} \Pr(x_l^{(i,a)} \mid P, z_l^{(i,a)} = k)}{\sum_{k'=1}^{K} q_{k'}^{(i)} \Pr(x_l^{(i,a)} \mid P, z_l^{(i,a)} = k')}, \tag{A12}$$

where $\Pr(x_l^{(i,a)} \mid P, z_l^{(i,a)} = k) = p_{kl x_l^{(i,a)}}$.

Step 3 may be performed by simulating a proposal $\alpha'$, from a normal distribution with mean $\alpha$, and some variance $\sigma_\alpha^2$. The proposal is automatically rejected if $\alpha' \le 0$, and otherwise it is accepted with the appropriate Metropolis-Hastings probability.
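Step 3 of Algorithm A3 can be illustrated with a short Python sketch. It is not the authors' code: the function names, the proposal standard deviation `prop_sd`, and the uniform prior on $\alpha$ over (0, `alpha_max`] are assumptions made for illustration. Because the normal proposal is symmetric and the assumed prior is flat, the acceptance probability reduces to the ratio of the symmetric Dirichlet densities of the $q^{(i)}$ evaluated at $\alpha'$ and at $\alpha$.

```python
import math
import numpy as np

def log_dirichlet_sym(q, alpha):
    """Summed log density of the rows of q under Dirichlet(alpha, ..., alpha)."""
    n, k = q.shape
    norm = math.lgamma(k * alpha) - k * math.lgamma(alpha)
    return n * norm + (alpha - 1.0) * np.log(q).sum()

def update_alpha(alpha, q, rng, prop_sd=0.05, alpha_max=10.0):
    """One Metropolis-Hastings update of alpha given admixture proportions q (N x K)."""
    proposal = rng.normal(alpha, prop_sd)
    # Reject automatically outside the support of the (assumed) uniform prior.
    if proposal <= 0.0 or proposal > alpha_max:
        return alpha
    log_ratio = log_dirichlet_sym(q, proposal) - log_dirichlet_sym(q, alpha)
    # Accept with probability min(1, exp(log_ratio)).
    if log_ratio >= 0.0 or rng.uniform() < math.exp(log_ratio):
        return proposal
    return alpha
```

A call such as `alpha = update_alpha(alpha, q, np.random.default_rng(0))` would be made once per sweep, after Q has been updated.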


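As a minimal sketch of Algorithm A2 (the model without admixture), the Python function below alternates the updates (A6) and (A8). The data layout and all names are assumptions for illustration: genotypes are held in an integer array `x` of shape (N, L, 2) with alleles coded 0, ..., J-1, every locus is taken to have the same number of alleles J, and all $\lambda_j$ equal `lam`. The arguments `burn_in` and `thin` play the roles of m and c in the discussion of MCMC methods.

```python
import numpy as np

def gibbs_no_admixture(x, K, J, n_iter=300, burn_in=100, thin=1, lam=1.0, seed=0):
    """Gibbs sampler for Z and P; returns the fraction of kept sweeps in which
    each individual was assigned to each population (array of shape N x K)."""
    rng = np.random.default_rng(seed)
    N, L, _ = x.shape
    z = rng.integers(K, size=N)              # random starting assignment Z^(0)
    loci = np.arange(L)
    visits = np.zeros((N, K))
    kept = 0
    for it in range(1, n_iter + 1):
        # Step 1: allele counts n[k, l, j] as in (A7)
        n = np.zeros((K, L, J))
        for a in range(2):
            np.add.at(n, (z[:, None], loci[None, :], x[:, :, a]), 1)
        # Draw p_kl. from the Dirichlet in (A6) by normalising gamma variates
        g = rng.gamma(lam + n)
        p = g / g.sum(axis=2, keepdims=True)
        # Step 2: log Pr(x^(i) | P, z^(i) = k), summed over loci and allele copies
        logp = np.log(p)
        loglik = np.zeros((N, K))
        for a in range(2):
            loglik += logp[:, loci[None, :], x[:, :, a]].sum(axis=2).T
        prob = np.exp(loglik - loglik.max(axis=1, keepdims=True))
        prob /= prob.sum(axis=1, keepdims=True)           # equation (A8)
        z = np.array([rng.choice(K, p=prob[i]) for i in range(N)])
        # Keep every thin-th sweep after the burn-in period
        if it > burn_in and (it - burn_in) % thin == 0:
            visits[np.arange(N), z] += 1
            kept += 1
    return visits / kept
```

Running the function several times with different `seed` values, and comparing the returned assignment frequencies up to a relabelling of the populations, mirrors the convergence check described in the Appendix.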


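The step from (A1) to (A5) can be spelled out. By (A2), $1/\Pr(X \mid Z, P, Q) = \exp(D/2)$, so (A1) says that $1/\Pr(X \mid K)$ is the posterior mean of $\exp(D/2)$. If D is normal with mean $\mu$ and variance $\sigma^2$, that mean equals $\exp(\mu/2 + \sigma^2/8)$, and taking $-2\log$ of its reciprocal gives (A5). The Python sketch below computes this ad hoc estimate from a vector of sampled log-likelihoods, together with the harmonic mean estimate (A1) for comparison. The function names are illustrative, and the normalisation over K assumes a uniform prior on K.

```python
import numpy as np

def log_evidence_adhoc(loglik):
    """ln Pr(X|K) ~= -mu/2 - sigma^2/8, using (A3), (A4) and (A5)."""
    dev = -2.0 * np.asarray(loglik, dtype=float)   # Bayesian deviance (A2)
    mu = dev.mean()                                # (A3)
    sigma2 = dev.var()                             # (A4), divisor M
    return -mu / 2.0 - sigma2 / 8.0

def log_evidence_harmonic(loglik):
    """ln Pr(X|K) from the harmonic mean estimator (A1), evaluated in log space."""
    ll = np.asarray(loglik, dtype=float)
    # log[(1/M) * sum_m exp(-ll_m)] is the log of 1/Pr(X|K); negate it.
    return -(np.logaddexp.reduce(-ll) - np.log(ll.size))

def posterior_K(log_evidences):
    """Normalise Pr(X|K) over the candidate values of K (uniform prior assumed)."""
    le = np.asarray(log_evidences, dtype=float)
    w = np.exp(le - le.max())
    return w / w.sum()
```

Here `loglik` would hold $\log \Pr(X \mid Z^{(m)}, P^{(m)}, Q^{(m)})$ for the M retained sweeps of one run at a given K, and `posterior_K` would be applied to the estimates obtained for each K in turn.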
Model with correlated allele frequencies

For very closely related populations it is natural to assume that allele frequencies are correlated across populations. For completeness, we describe a model that is implemented in the program structure, allowing allele-frequency correlations.

Recall that we model allele frequencies by $p_{kl\cdot} \sim \mathcal{D}(\lambda_1, \lambda_2, \ldots, \lambda_{J_l})$. For all the results presented in this article, we took $\lambda_1 = \lambda_2 = \cdots = \lambda_{J_l} = 1.0$, which gives a uniform distribution on allele frequencies, where $J_l$ is the number of alleles at locus l. To model closely related populations, we consider an alternative model, where

$$p_{kl\cdot} \sim \mathcal{D}(f^{(l)} J_l \mu_1^{(l)}, f^{(l)} J_l \mu_2^{(l)}, \ldots, f^{(l)} J_l \mu_{J_l}^{(l)}). \tag{A13}$$

Here, $\mu_i^{(l)}$ is the mean sample frequency of allele i at locus l, and $f^{(l)} > 0$ determines the strength of the correlations across populations at locus l. When $f^{(l)}$ is large, the allele frequencies in all populations tend to be similar to the mean allele frequencies in the sample. In our implementation of this model, we placed a gamma prior on each $f^{(l)}$ and used a Metropolis-Hastings update step. The proposal $f^{(l)\prime}$ was chosen from a normal with mean $f^{(l)}$ and some variance $\sigma_f^2$. It was automatically rejected if $f^{(l)\prime} \le 0$.

There are several possible alternative models to considering a factor f for each locus. One would be to consider a factor for each population, and another would be to give each type of locus (e.g., SNPs and dinucleotide and trinucleotide repeats) a shared value of f.
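The effect of $f^{(l)}$ in (A13) can be seen by simulation. The Python sketch below draws allele frequencies at a single locus for many populations and reports how far they fall, on average, from the sample means; the function name, the example frequencies `mu`, and the values of f are hypothetical choices made for illustration.

```python
import numpy as np

def draw_correlated_freqs(mu, f, K, rng):
    """Draw p_kl. for K populations at one locus from the Dirichlet in (A13)."""
    mu = np.asarray(mu, dtype=float)
    J = mu.size                                  # number of alleles at the locus
    return rng.dirichlet(f * J * mu, size=K)     # array of shape (K, J)

rng = np.random.default_rng(1)
mu = np.array([0.6, 0.3, 0.1])                   # hypothetical mean sample frequencies
for f in (1.0, 10.0, 1000.0):
    p = draw_correlated_freqs(mu, f, K=2000, rng=rng)
    spread = np.abs(p - mu).mean()               # average distance from mu
    print(f"f = {f:7.1f}   mean |p - mu| = {spread:.4f}")
```

The printed spread shrinks as f grows, which is the behavior described for large $f^{(l)}$. When f = 1 and every $\mu_i^{(l)} = 1/J_l$, the parameters in (A13) all equal 1 and the model reduces to the uniform prior used elsewhere in the article.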
