Identification of local selective sweeps in human populations since the exodus from Africa

ASA JOHANSSON and ULF GYLLENSTEN

Department of Genetics and Pathology, Rudbeck laboratory, Uppsala University, Uppsala, Sweden

Johansson, A. and Gyllensten, U. 2008. Identification of local selective sweeps in human populations since the exodus from

Africa. Hereditas 145: 126–137. Lund, Sweden. eISSN 1601-5223. Received January 25, 2008. Accepted March 17, 2008

Selection on the human genome has been studied using comparative genomics and SNP architecture in the lineage leading to

modern humans. In connection with the African exodus and colonization of other continents, human populations have

adapted to a range of different environmental conditions. Using a new method that jointly analyses haplotype block length

and allele frequency variation (FST) within and between populations, we have identified chromosomal regions that are

candidates for having been affected by local selection. Based on 1.6 million SNPs typed in 71 individuals of African

American, European American and Han Chinese descent, we have identified a number of genes and non-coding regions that

are candidates for having been subjected to local positive selection during the last 100 000 years. Among these genes are

those involved in skin pigmentation (SLC24A5) and diet adaptation (LCT). The list of genes implicated in these local

selective sweeps overlaps partly with those implicated in other studies of human populations using other methods, but shows

little overlap with those postulated to have been under selection in the 5–7 myr since the divergence of the ancestors of human

and chimpanzee. Our analysis provides focal points in the genome for detailed studies of evolutionary events that have

shaped human populations as they explored different regions of the world.

Ulf Gyllensten, Dept of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, SE-571 85, Uppsala, Sweden.

E-mail: [email protected]

Comparisons of the human and chimpanzee genomes

have been used to identify genes subjected to selection

on the lineage leading to modern humans (CLARK

et al. 2003; BUSTAMANTE et al. 2005; NIELSEN et al.

2005). Such comparative genomic approaches address

genetic changes that have occurred during the 5–7 myr

since humans and chimpanzees shared a common

ancestor. However, modern humans emerged in Africa

less than 200 kyr ago (STRINGER and ANDREWS 1988)

and only began to colonize other continents 50–80 kyr

ago. To specifically target genes that have been

subjected to selection in association with more recent

evolutionary events, such as the African exodus and

the separation of Caucasian and Asian populations,

genomic analyses based on population comparisons

are needed. Some genetic adaptations to environmen-

tal conditions have been described in humans. For

instance, populations depending on agriculture often

have a high tolerance to lactose, associated with a

mutation in the lactase gene (HOLLOX et al. 2001;

ENATTAH et al. 2002). Also, variation at the Duffy

blood group locus has been associated with malaria

resistance (HAMBLIN and DI RIENZO 2000; HAMBLIN

et al. 2002). Of the three Duffy alleles (FY*A,

FY*B and FY*0), homozygotes for FY*0 have been

associated with resistance to malaria. This allele

has almost reached fixation in sub-Saharan popula-

tions but remains rare in Asian and European

populations. However, the number of loci identified

that are candidates for local selection is limited and

new methods are needed to scan the human genome

for evidence of selection.

Positive (directional) selection on an allele is ex-

pected to increase its frequency in the affected

population. Differences in allele frequency between

populations, estimated by Wright's FST (WRIGHT

1950), have therefore been used to indicate genes

under selection (BEAUMONT and BALDING 2004;

STORZ 2005). A high FST indicates positive selection

whereas a low FST indicates that the loci are subject to

purifying or balancing selection. Genome-wide

searches based on allele frequency differences have

resulted in a number of genes and gene categories

postulated to be under selection (AKEY et al. 2002).

However, FST estimates for individual sites have a high

variance, complicating their use as a sole indicator of

regions under selection (WEIR et al. 2005). Positive

selection is expected to increase both the frequency of

the affected site and of linked sites and selective

sweeps may therefore result in the presence of long

haplotype blocks. This is utilized in hitchhike mapping

for the identification of genes under selection (HARR

et al. 2002; SCHLOTTERER 2003). However, since both

recombination frequency and population structure

contribute to haplotype block variability, reduced

haplotype variability per se is not a strong indication

of selective events. Local selective sweeps have also

been studied using a number of methods based on the

DOI: 10.1111/j.2008.0018-0661.02054.x

haplotype or LD architecture (SABETI et al. 2002;

HANCHARD et al. 2006; VOIGHT et al. 2006;

WANG et al. 2006), and such methods have been

proposed to be more powerful for detecting selective

events than methods based on the nucleotide diversity

(HANCHARD et al. 2006), such as Tajima’s D-test, Fu

and Li’s D-test and Fay and Wu’s H-test (TAJIMA

1989; FU and LI 1993; FAY and WU 2000). While most

methods focus on haplotype patterns within a popula-

tion, a haplotype-based method was recently presented

for cross-population comparisons to detect alleles that

have reached near-fixation (SABETI et al. 2007).

Here, we apply a new cross-population method to

search for genomic regions that have been subjected to

local positive selection. Our method considers both

the haplotype block length and allele frequency

variation between different populations as well as

between genomic regions. Selection may have occurred

at a number of time points in human history, such

as in Africa prior to the separation of the major

population groups (Fig. 1, branch 2), after the exodus

of non-African populations (branch 3), prior to the

separation of Asian and European populations

(branch 4) or after the separation of European

(branch 5) and Asian (branch 6) populations. Com-

parison between populations makes it possible to

study the genetic changes associated with some of

these events. We have used available SNP and haplo-

type data from three major human populations,

African (American), European (American) and Asian

(Han Chinese) (HINDS et al. 2005), to address selective

sweeps that occurred during the last 50–80 kyr of

human evolution. Our results are compared to those

of recent studies that have assessed selection in the

human genome using other methods.

MATERIAL AND METHODS

We used a publicly available dataset consisting of

1.6 million SNPs evenly distributed across the genome

and genotyped in 71 individuals by Perlegen Sciences,

representing 23 African Americans (AA), 24 Eur-

opean Americans (EA) and 24 Han Chinese (HC)

(HINDS et al. 2005). Haplotypes had been inferred

separately for each of the three sample sets using the HAP program (HALPERIN and ESKIN 2004) and

partitioned into blocks with limited diversity (HINDS

et al. 2005). These blocks were defined as sets of SNPs

for which at least 80% of the inferred haplotypes could

be grouped into common patterns with a population

frequency of at least 5%. Using this definition, 235 663

blocks have previously been identified in African

Americans, 109 913 blocks in European Americans and 89 994 blocks in Han Chinese (HINDS et al. 2005).

For identifying chromosomal regions subjected to

positive selection, we only considered the autosomal

chromosomes.

Block length and FST values

For each haplotype block, consisting of at least two

SNPs, we estimated the block length using Build 35

(ver. 35 of the human genome sequence annotations).

Blocks for which the SNP order did not agree between

Build 35 and Build 34 (ver. 34 of the human genome

sequence annotations which was used by Perlegen)

were removed from further analysis. In addition, blocks

shorter than 500 bases were removed. A block FST in a population was defined as the average pairwise FST

(WRIGHT 1950) for all individual SNPs in that block

between that population and another population,

resulting in two block FST values for each haplotype

block. For instance, for a block defined in African

Americans the first block FST is the average FST

between African Americans and European Americans

for all SNPs in the African American block. The second FST is the average FST between African

Americans and Han Chinese for the same SNPs.

Genes included in a haplotype block were identified

through their reference sequence position according to

Build 35.
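The block FST defined above can be sketched in a few lines of Python. This is only an illustration: the text cites WRIGHT (1950) without giving the estimator, so the formula used here, FST = (HT - HS)/HT for a biallelic SNP with equal weighting of the two populations, is an assumption, as are the function names and the allele frequencies in the example.

```python
def snp_fst(p1, p2):
    """Pairwise FST for one biallelic SNP from allele frequencies p1 and p2.

    Assumed estimator: FST = (HT - HS) / HT with equal population weights.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)          # heterozygosity of the pooled sample
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:                             # SNP monomorphic in both samples
        return 0.0
    return (h_total - h_within) / h_total


def block_fst(freqs_pop, freqs_other):
    """Average the per-SNP FST over all SNPs in one haplotype block."""
    values = [snp_fst(p, q) for p, q in zip(freqs_pop, freqs_other)]
    return sum(values) / len(values)


# Hypothetical allele frequencies for a three-SNP block defined in AA.
aa = [0.10, 0.50, 0.90]
ea = [0.60, 0.50, 0.20]
hc = [0.70, 0.40, 0.10]
fst_aa_ea = block_fst(aa, ea)   # first block FST (AA vs EA)
fst_aa_hc = block_fst(aa, hc)   # second block FST (AA vs HC)
```

Each block therefore carries two numbers, one per comparison population, which is what the FST ratios in the next section take as input.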

FR for haplotype blocks

Our aim was to search for regions that differ in block

allele frequency (high FST) between African and non-

African populations, between Asian and both other

populations and between European and both other

populations. To study whether each population differs

from both the other two populations, we calculated

the following FST-ratios (FRs) for each haplotype

block (i),

Fig. 1. Different stages when selection could have acted during human evolution. The numbers on the branches indicate time points when selection could have occurred, such as on the lineage leading to Chimpanzee (branch 1), in African populations prior to the separation of the three populations (branch 2), after the exodus of non-African populations (branch 3), in non-African populations prior to the separation of Asian and European populations (branch 4), and after the separation of European (branch 5) and Asian (branch 6) populations.

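$$FR_i^{AA} = \log\left(\frac{F_{ST,i}^{AA-EA}}{\mathrm{median}\left(F_{ST}^{AA-EA}\right)} \times \frac{F_{ST,i}^{AA-HC}}{\mathrm{median}\left(F_{ST}^{AA-HC}\right)}\right),$$

$$FR_i^{EA} = \log\left(\frac{F_{ST,i}^{EA-AA}}{\mathrm{median}\left(F_{ST}^{EA-AA}\right)} \times \frac{F_{ST,i}^{EA-HC}}{\mathrm{median}\left(F_{ST}^{EA-HC}\right)}\right),$$

$$FR_i^{HC} = \log\left(\frac{F_{ST,i}^{HC-AA}}{\mathrm{median}\left(F_{ST}^{HC-AA}\right)} \times \frac{F_{ST,i}^{HC-EA}}{\mathrm{median}\left(F_{ST}^{HC-EA}\right)}\right),$$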

where AA refers to African Americans, EA to

European Americans, and HC to Han Chinese,

respectively. The individual block FST values were

divided by the median to be compared to the genome in general and log transformed. An FST ratio around

zero indicates an average block FST and a low

(negative) FR indicates very low differentiation be-

tween the study population and both of the other

populations. A high FR reflects large differentiation

but does not indicate the specific population(s) in

which a potential selective sweep has occurred.
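A minimal Python sketch of the FR calculation follows. It assumes that the two median-normalized block FST values are multiplied inside the logarithm and that the natural logarithm is used (the text does not state the base). The function name and the example values are hypothetical.

```python
import math
from statistics import median


def fst_ratios(fst_a, fst_b):
    """FR for each block from its two block FST values.

    fst_a[i] and fst_b[i] are the block FST values of block i against the
    two other populations. Each value is divided by the genome-wide median
    of its comparison, the two ratios are multiplied, and the product is
    log transformed (natural log assumed). Blocks with a block FST of zero
    would need separate handling, since log(0) is undefined.
    """
    med_a = median(fst_a)
    med_b = median(fst_b)
    return [math.log((a / med_a) * (b / med_b)) for a, b in zip(fst_a, fst_b)]


# Hypothetical block FST values for five blocks defined in AA.
fst_aa_ea = [0.05, 0.10, 0.10, 0.20, 0.40]
fst_aa_hc = [0.04, 0.08, 0.12, 0.08, 0.32]
fr_aa = fst_ratios(fst_aa_ea, fst_aa_hc)
```

In this toy input the second block sits at the median of both comparisons and gets an FR of zero, while the last block is four times the median in both comparisons and gets the largest FR.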

Population-specific extended haplotype blocks

To identify in which population a selective sweep has

occurred, we compared the length of haplotype blocks

between populations for every site. For each block

defined in a population, the corresponding blocks in

the two other populations were identified as the blocks

with the largest overlap. If no overlapping block was

found in the other populations, the block was removed from further analysis. We calculated a ratio between the lengths of the haplotype block (i) in the different populations as:

$$LR_i^{AA} = \log\left(\frac{\text{block length(AA)}_i^2}{\text{block length(EA)}_i \times \text{block length(HC)}_i} \Big/ \mathrm{median}\left(\frac{\text{block length(AA)}^2}{\text{block length(EA)} \times \text{block length(HC)}}\right)\right)$$

$$LR_i^{EA} = \log\left(\frac{\text{block length(EA)}_i^2}{\text{block length(AA)}_i \times \text{block length(HC)}_i} \Big/ \mathrm{median}\left(\frac{\text{block length(EA)}^2}{\text{block length(AA)} \times \text{block length(HC)}}\right)\right)$$

$$LR_i^{HC} = \log\left(\frac{\text{block length(HC)}_i^2}{\text{block length(AA)}_i \times \text{block length(EA)}_i} \Big/ \mathrm{median}\left(\frac{\text{block length(HC)}^2}{\text{block length(AA)} \times \text{block length(EA)}}\right)\right)$$

To make the ratios comparable across populations, each ratio was divided by the median and log transformed. The log transformed LR is expected to be zero for regions exhibiting an average proportion of block length within populations. A low (negative) LR indicates that the haplotype block, in the population studied, is relatively short compared to the blocks in the other populations. A high LR in one population indicates a larger haplotype block relative to the other populations and that the region has potentially undergone a selective sweep in that population.

Simulation of LR distribution

We used the software SelSim (SPENCER and COOP 2004) to simulate datasets using different strengths of selection on an SNP, varying the effective population size and using different fractions of derived to ancestral alleles. The number of chromosomes used in each simulation was similar to the number of chromosomes in each of our three human populations from which the empirical data was obtained (n = 48). To examine the effect of variation in selection coefficient (s = 0, s = 0.01, s = 0.02, s = 0.04), we used an effective population size of Ne = 10 000 and a proportion of derived to ancestral alleles of 44/4. To study the effect of the effective population size, we used Ne = 1000, Ne = 5000 and Ne = 10 000, a proportion of derived to ancestral alleles of 44/4 and selection coefficients of s = 0 and s = 0.04. In all simulations the recombination rate was set to 1 cM/Mb, the number of SNPs to 200 and the density to 1 SNP per kb. For each set of simulations, we made 10 000 permutations. The length of haplotype blocks surrounding the SNP under selection was determined using the same criteria as used for the population data: 80% of the sequences should be grouped into haplotypes with a frequency of at least 5% in the dataset (HINDS et al. 2005). For each of the simulated replicates, the block length was determined using a script in MATLAB in the following way:

1) Start at the SNP under selection.

2) Step 1 SNP in each direction from the starting point.

3) Determine if the block including one or both of the new SNPs follows the definition of a haplotype block.

4) If the SNPs are included in the same block as the SNP under selection: repeat from step 2. If one SNP does not belong to the same block as the SNP under selection, no further SNPs in that direction from the start point will be evaluated.

5) When none of the new SNPs are included in the block, the block length is defined by the distance between the last SNPs, in each direction from the starting point.
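The five steps above can be sketched in Python as follows. The original script was written in MATLAB and is not reproduced here, so this is one possible reading of the procedure: the two new SNPs are first tested together, then one at a time, and a direction is closed as soon as its SNP fails. Function names, the toy haplotypes and the SNP positions are hypothetical.

```python
from collections import Counter


def is_block(haps, lo, hi, min_pattern_freq=0.05, min_coverage=0.80):
    """Block definition: at least 80% of the haplotypes fall into patterns
    that each have a frequency of at least 5% over SNPs lo..hi (inclusive)."""
    patterns = Counter(h[lo:hi + 1] for h in haps)
    n = len(haps)
    covered = sum(c for c in patterns.values() if c / n >= min_pattern_freq)
    return covered / n >= min_coverage


def extend_block(haps, focal):
    """Grow a block outwards from the focal SNP, one SNP per side per step."""
    n_snps = len(haps[0])
    lo = hi = focal
    open_left, open_right = True, True
    while open_left or open_right:
        new_lo = lo - 1 if open_left and lo > 0 else lo
        new_hi = hi + 1 if open_right and hi < n_snps - 1 else hi
        open_left = open_left and new_lo != lo      # closed at chromosome edge
        open_right = open_right and new_hi != hi
        if not (open_left or open_right):
            break
        if is_block(haps, new_lo, new_hi):          # all new SNPs join the block
            lo, hi = new_lo, new_hi
        elif open_left and is_block(haps, new_lo, hi):
            lo, open_right = new_lo, False          # only the left SNP joins
        elif open_right and is_block(haps, lo, new_hi):
            hi, open_left = new_hi, False           # only the right SNP joins
        else:
            open_left = open_right = False          # neither SNP joins: stop
    return lo, hi


def make_toy_haplotypes():
    """40 hypothetical chromosomes: a low-diversity 3-SNP core with one
    flanking SNP on each side that splits the core patterns."""
    core_counts = [("000", 16), ("111", 16), ("001", 2),
                   ("010", 2), ("100", 2), ("110", 2)]
    haps = []
    for core, count in core_counts:
        for k in range(count):
            flank = "1" if k == 0 else "0"
            haps.append(flank + core + flank)
    return haps


haps = make_toy_haplotypes()
positions = [0, 1000, 2000, 3000, 4000]             # 1 SNP per kb, as simulated
lo, hi = extend_block(haps, focal=2)
block_length_bp = positions[hi] - positions[lo]
```

In the toy data the three core SNPs form a block, and adding either flanking SNP drops the share of haplotypes in common patterns to 75%, so the extension stops on both sides.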

LRs were calculated for the simulated haplotype

blocks, similar to the empirical population data. Each

simulated block was compared to two randomly chosen blocks from the dataset simulated with s = 0:

$$LR_i = \log\left(\frac{\text{block length}_i^2}{\text{block length}(s=0) \times \text{block length}(s=0)}\right)$$

Blocks used for calculating the LRs were resampled

100 000 times with replacement.
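The resampling step can be illustrated with a short Python sketch. It assumes the natural logarithm and draws the simulated block and the two neutral reference blocks independently with replacement. The function name, the seed and the block lengths are hypothetical and not taken from the SelSim output.

```python
import math
import random


def simulated_lrs(selected_lengths, neutral_lengths, n_resamples=100_000, seed=1):
    """LR of simulated blocks against two randomly drawn neutral (s = 0) blocks."""
    rng = random.Random(seed)                  # fixed seed for reproducibility
    lrs = []
    for _ in range(n_resamples):
        focal = rng.choice(selected_lengths)   # block from the dataset of interest
        ref_a = rng.choice(neutral_lengths)    # first neutral reference block
        ref_b = rng.choice(neutral_lengths)    # second neutral reference block
        lrs.append(math.log(focal ** 2 / (ref_a * ref_b)))
    return lrs


# Hypothetical block lengths in base pairs.
neutral = [4000, 5000, 6000, 5000, 7000]
selected = [15000, 20000, 18000, 25000]
lr_values = simulated_lrs(selected, neutral, n_resamples=1000)
mean_lr = sum(lr_values) / len(lr_values)
```

With these made-up lengths every selected block is longer than every neutral block, so all LR values are positive. Real simulated distributions overlap.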

Combining methods for studying genes under positive

selection

We used the iHS method by Voight and colleagues

(VOIGHT et al. 2006) to scan our top 1% candidate

genes for being subjected to positive selection. For all

SNPs in each block, the ancestral state was identified

using the UCSC Human Genome database (Mar. 2006

assembly) where possible. For all SNPs with an identified ancestral state, an unstandardized iHS was

calculated as the ratio between the integrated EHH

(extended haplotype homozygosity) score of the

ancestral and the derived allele (VOIGHT et al. 2006).

Since our genomic regions are selected as candidates

for positive selection, we cannot use our empirical

distribution to standardize the iHS. Instead, we used

the already described, almost linear, relationship between the cutoff for the top 1% tail of the

unstandardized iHS distribution and the allele fre-

quency (VOIGHT et al. 2006, Fig. 4) to estimate what

fraction of SNPs for each of our genes are within the

top 1% genome-wide empirical distribution of un-

standardized iHSs. Large blocks containing more than

one gene were divided to assign an independent iHS to

each gene. The significance for each gene was calculated as the significance of observing a certain fraction

of SNPs within the 1% tail of the genome-wide

distribution, compared to the 1% expected by chance,

using the χ² statistic.
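As a rough illustration of this test, the Python sketch below compares the number of SNPs of a gene that fall in the 1% tail with the number expected by chance, using a χ² goodness-of-fit statistic with one degree of freedom. The text only states that a χ² statistic was used, so the exact form of the test, the function name and the counts are assumptions.

```python
import math


def tail_enrichment_p(n_snps, n_in_tail, expected_fraction=0.01):
    """Chi-square goodness-of-fit test (1 df) of the observed number of SNPs
    in the top tail against the number expected by chance."""
    expected_in = n_snps * expected_fraction
    expected_out = n_snps - expected_in
    observed_out = n_snps - n_in_tail
    chi2 = ((n_in_tail - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical gene: 200 SNPs with an ancestral state, 10 of them in the 1% tail.
chi2, p = tail_enrichment_p(200, 10)
```

The statistic does not distinguish enrichment from depletion, so in practice one would also check that the observed fraction is above 1%.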

RESULTS

Distribution of LR

We have used the length ratio (LR) to measure the

length of a haplotype block relative to the length of

the corresponding block in other populations and

relative to the genome average. The behavior of the LR

under different selection models was studied by

simulation using the SelSim software (SPENCER and

COOP 2004). In these simulations, the length of each

block was calculated relative to a randomly chosen

block from the dataset evolving under neutrality (s = 0), and relative to the average block length in the

dataset. When keeping all other parameters constant in the simulations and varying the selection coefficient

between s = 0, s = 0.01, s = 0.02 and s = 0.04, selection

will result in a shift of the block length in the expected

direction (towards longer blocks) (Fig. 2A). For

example, using s = 0.01 will result in a mean LR = 1.7. The actual LR distributions for the SNP data

from African American (AA), European American

(EA) and Han Chinese (HC) are quite similar to the simulated data under neutrality (s = 0). This is consistent

with the prediction that most of the variation in

haplotype block length observed in the human gen-

ome is the result of random genetic drift rather than

selection. The influence of selection on a genomic

region does not only depend on the selection coeffi-

cient but also on the effective population size. When

considering sites that are evolving under neutrality (s = 0), varying the effective population size between

Ne = 1000, Ne = 5000 and Ne = 10 000 does not affect

the LR distribution (Fig. 2B), while the strongest

effect from adding a selective advantage for one allele

(s = 0.04) is, not surprisingly, seen in populations

with the largest effective size. As shown by the

simulations, we would expect longer haplotype blocks

with reduced variability when an allele is favored by positive selection (Fig. 2A). Under certain conditions,

positive selection may therefore result in extended

haplotype blocks, supporting the use of block length

as a variable for the identification of genomic regions

that have been subjected to local positive selection.

Distribution of FST ratios

As a second parameter, we calculated the FST ratio

(FR). The populations for which SNP data are available share a relatively recent common ancestry

(less than 50 000 years) and most of the SNP variation

is shared between populations. Similar to the LR, the

FR for each block in a population is the pairwise FST

between that population and each other population,

and relative to the genome average. If a gene has

been under recent positive selection in only one

population, we would expect a large FR, since the FST between that population and both others would

be larger than average. The distribution of FR values is

very similar between populations, with a somewhat

lower fraction of the blocks with FR�0 in the Han

Chinese (Fig. 2C).

Block LR and FR distributions

In order to identify genomic regions with deviating

patterns that are candidates for having been affected


by selection, we combined the LR and the FR values

for each population and plotted these against each

other. Blocks that are only affected by positive

selection in Africans are expected to have a high LRAA and a high FRAA (Fig. 3a), while blocks that

are only under selection in European Americans are

expected to have a high LREA and a high FREA

(Fig. 3b) and, finally, blocks under selection in Asians

are expected to have a high LRHC and high FRHC

(Fig. 3c). To determine the statistical significance of

deviating LR and FR values, the distributions of these

parameters have to be addressed. Both the LR and FR values are approximately normally distributed, with

LR possibly having a slight negative skew (Fig. 2A,

2C). Since we are only interested in outliers with a high

value, only positive values are considered in determin-

ing the significance levels. From the empirical dis-

tributions, each block was assigned a p-value

describing the likelihood of the observed value considering

the distribution of values. For each block and each population, p-values for both LRs and FRs were

assessed separately. Even though FR and LR are not

completely uncorrelated (r² = 0.013), we computed a

combined p-value by multiplying the p-values for the

FR and the LR of each block.
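A minimal Python sketch of this step is given below. It assumes a purely rank-based empirical p-value, the fraction of blocks with a value at least as high as the observed one, which may differ from the exact procedure used for the empirical distributions. The combined value is the product of the two p-values, as described in the text. Function names and input values are hypothetical.

```python
from bisect import bisect_left


def empirical_upper_p(values):
    """Upper-tail empirical p-value: fraction of values >= each value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(n - bisect_left(ordered, v)) / n for v in values]


def combined_p(lr_values, fr_values):
    """Multiply the empirical p-values of LR and FR block by block."""
    p_lr = empirical_upper_p(lr_values)
    p_fr = empirical_upper_p(fr_values)
    return [a * b for a, b in zip(p_lr, p_fr)]


# Hypothetical LR and FR values for four blocks in one population.
lr = [1.0, 2.0, 3.0, 4.0]
fr = [1.0, 2.0, 3.0, 4.0]
p_combined = combined_p(lr, fr)
```

The block with both the highest LR and the highest FR gets the smallest combined value. Because the product treats the two statistics as independent, the weak correlation reported above is ignored in this sketch as well.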

FR and LR in genic and non-genic regions

We first examined the distribution of FR and LR

values relative to some general features in the genome.

Across all populations, 55% (155 558/284 953) of the

blocks included at least part of a gene, whereas 45% (129 395/284 953) consisted of only intergenic regions.

The median FR is lower for blocks containing genes,

both when each population is considered separately

and when all populations are analyzed together

(Table 1). The fraction of significant (p < 0.01) blocks

containing genes is higher in all but the African

American population (Table 1). Since a low FR

indicates a small differentiation between populations relative to the genome average, the higher average FR

may indicate more drift in intergenic regions. If

positive selection is acting specifically on genic rather

than intergenic regions, this could result in a higher

number of significant FR values for the blocks

containing genes, as is seen in our data. In contrast

to FR, a low LR reflects blocks that are relatively

shorter in a population, whereas blocks with an LR around zero indicate similarity in block distribution

among populations. Therefore, the median value of

LR for blocks in genic and intergenic regions is not of

interest. The fraction of significant combined p-values

Fig. 2. A–C. LR and FR distributions. In all panels the y-axis shows the fraction of blocks and the x-axis shows LR (A, B) or FR (C). (A) LR distribution. LRs for haplotype blocks as a function of selection coefficient are calculated for simulated datasets with different selection coefficients (s = 0, s = 0.01, s = 0.02 and s = 0.04) for an effective population size of Ne = 10 000. The solid grey lines represent the LR distributions for the empirical data from the African American, European American and Han Chinese population. (B) LR distribution for haplotype blocks as a function of effective population size. LRs are calculated for simulated datasets with Ne = 1000, Ne = 5000 or Ne = 10 000 and s = 0 or s = 0.04. The three solid lines (almost overlapping) represent the simulated dataset without selection for three different effective population sizes. The other lines represent the different effective population sizes and s = 0.04. (C) FR distribution. Empirical FST ratios (FRs) for the blocks in African Americans, European Americans and Han Chinese.


(p < 0.05) is higher for blocks with genes as compared

to blocks without genes for Han Chinese and Eur-

opean Americans (Table 1).

Candidate genes for positive selection

All genes in a haplotype block were assessed for their

FR and LR and the combined (FR and LR) p-value calculated for that block. Among the total number of

284 953 blocks (100 481, 100 450 and 84 022 for AA,

EA and HC respectively), 55% included at least part of

a gene or predicted gene. Since both block length and

gene density vary across the genome, the number of

genes per block will vary and some genes will be part

of more than one block. In order to determine a gene-

specific LR and FR, we identified all blocks to which

each gene found in a population belonged and chose

the largest LR, FR, and combined p-value to repre-

sent the gene. These analyses include 22 737, 22 807

and 22 665 genes for AA, EA and HC, respectively.

After removing duplicates, some genes were still part

of the same block, and the final analysis resulted in

15 855, 15 715 and 15 577 blocks containing the genes

Fig. 3. a–c. Length ratio (LR) vs FST ratio (FR) for the three human populations studied. Haplotype blocks including a gene or part of a gene and exhibiting a positive LR and FR in (a) African American, (b) European American and (c) Han Chinese. The red dots represent blocks with genes with a combination of LR and FR values that are significant at a genome-wide level of p < 0.01.

Table 1. Analysis of the distribution of median FR, fraction of significant FR blocks and combined probabilities of LR and FR in genic and non-genic regions¹.

Comparison                          Population
                                    African         European        Han             All
                                    American (AA)   American (EA)   Chinese (HC)

I². FR median
  Genic                             −0.054          −0.063          −0.072          −0.062
  Non-genic                         −0.083          −0.073          −0.096          −0.083
  P-value                           4.5 × 10⁻⁴      3.8 × 10⁻²      1.5 × 10⁻²      3.85 × 10⁻⁶

II³. Fraction of significant FR
  Genic                             0.0100          0.0110          0.0114          0.0108
  Non-genic                         0.0100          0.0087          0.0083          0.0091
  P-value                           ns              1.43 × 10⁻⁴     2.72 × 10⁻⁶     1.86 × 10⁻⁶

III⁴. Combined P-value
  Genic                             0.0441          0.0575          0.0617          0.0540
  Non-genic                         0.0485          0.0548          0.0551          0.0527
  P-value                           0.0005          0.040           2.43 × 10⁻⁵     0.072

1) Abbreviations: ns = not significant; FR = FST ratio.
2) I. The median FR was calculated for blocks containing at least part of a gene and for blocks not containing any part of a gene. p-values are calculated by comparing the median between the two groups of blocks using the Mann-Whitney rank sum test.
3) II. The fraction of FRs with p < 0.01 were calculated for blocks containing at least part of a gene and for blocks not containing any part of a gene. p-values were calculated by comparing the number of significant genes within each group using χ² statistics.
4) III. Combined LR and FR probabilities. The fraction of combined values of p < 0.05 were calculated for blocks containing at least part of a gene and for blocks not containing any part of a gene. p-values were calculated by comparing the number of significant genes within each group using χ² statistics.


for each population, respectively. Using a criterion of

p = 0.01 for genome-wide significance (0.01/15 855 = 6.31 × 10⁻⁷, 0.01/15 715 = 6.36 × 10⁻⁷ and 0.01/15 577 = 6.42 × 10⁻⁷ for AA, EA and HC, respectively)

on the combined p-value for LR and FR

resulted in a list of 31 genes (Table 2) located in 23

blocks (9, 1, 13 for each population). The number of

significant blocks differs widely between populations.

This could be due to a bias in the dataset, but

considering the similarity between the distributions

of LR and FR between populations (Fig. 2, 3), it is more likely to be due to stochastic variation. The

individual genes on these two lists will be addressed

further in the discussion.
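The genome-wide thresholds quoted above follow from dividing the significance level by the number of gene-containing blocks tested in each population (a Bonferroni-type correction). The short Python sketch below reproduces the arithmetic. The function and variable names are illustrative only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a genome-wide level alpha."""
    return alpha / n_tests


# Number of gene-containing blocks tested per population, as given in the text.
blocks_tested = {"AA": 15855, "EA": 15715, "HC": 15577}
thresholds = {pop: bonferroni_threshold(0.01, n) for pop, n in blocks_tested.items()}

for pop, threshold in thresholds.items():
    print(f"{pop}: {threshold:.2e}")
# AA: 6.31e-07
# EA: 6.36e-07
# HC: 6.42e-07
```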

Correlation with other studies of genes under positive

selection

A number of studies have searched for genes under

positive selection, either those that have been affected

by selection on the lineage leading to modern humans

or those affected more recently during human evolu-

tion (CLARK et al. 2003; BUSTAMANTE et al. 2005;

NIELSEN et al. 2005; VOIGHT et al. 2006; WANG et al.

2005). The loci implicated in studies of selection on the

lineage leading to modern humans show little overlap

with the genes on our lists. However, when comparing

with studies of more recent events of selection, the

overlap is larger. There is a total of 34 genes (GNA14,

PQLC1, MYH9, MYEF2, LOC400369, SLC12A1,

AQP1, LOC90193, PRKG2, HERC1, LOC439940,

EDAR, SULT1C2, APOL4, PTF1A, C10orf115,

C10orf67, MKRN2, RAF1, FLJ11036, C9orf82,

C8orf7, EPHB1, SULT1C1, DAPK2, UBXD2, LCT,

CBARA1, CCDC2, LRRC19, GCC2, MGC10701,

LIMS1, R3HDM) located in 25 different regions

that overlap between our top 1% candidates and the

top 250 regions suggested to be under selection by

VOIGHT et al. (2006). This overlap is not surprising

given that the methods used have some similarities.

There is also an overlap of 12 genes (CLSPN, EIF2C4,

EIF2C1, EIF2C3, KCNH7, USP3, HERC1, EDAR,

SULT1C2, DAPK2, GCC2, LIMS1) between our top

1% candidates and the 181 genes proposed by Carlson

and colleagues (CARLSON et al. 2005). This is a higher

overlap than expected by chance (5 overlapping

regions out of 48 proposed by Carlson et al. as

compared to 1% expected by chance, p = 0.0139, χ²

test), even though Carlson and colleagues used an

Table 2. Genes identified by our method, proposed to have been under positive selection in different human populations¹.

Pop   Chromosome   FR     LR     p-value²         Genes
AA    6            5.85   2.33   2.15 × 10⁻⁶      TCBA1
AA    7            5.89   2.68   1.90 × 10⁻⁶      AIP1
AA    8            6.44   2.88   6.85 × 10⁻⁸      CSMD1
AA    8            4.61   3.53   7.24 × 10⁻⁷      PSD3
AA    11           5.78   2.17   2.92 × 10⁻⁶      NELL1
AA    14           5.10   3.94   1.84 × 10⁻⁶      NPAS3
AA    15           9.15   2.16   7.02 × 10⁻¹⁴     CYP19A1
AA    16           5.21   2.03   8.92 × 10⁻⁷      LOC441745
AA    20           4.24   2.71   2.92 × 10⁻⁶      NFATC2
EA    9            2.78   3.97   2.70 × 10⁻⁶      C9orf121
HC    2            4.44   3.02   1.09 × 10⁻⁶      CPS1
HC    2            3.87   3.28   2.47 × 10⁻⁶      EDAR³
HC    2            4.61   2.79   1.32 × 10⁻⁶      LOC375295 FUCA1P
HC    5            3.55   3.49   2.65 × 10⁻⁶      GALNT10
HC    5            4.17   3.25   1.17 × 10⁻⁶      GRIA1
HC    8            4.37   2.88   2.13 × 10⁻⁶      LOC439940
HC    9            2.65   4.22   8.97 × 10⁻⁷      LOC401539
HC    10           3.64   3.64   1.19 × 10⁻⁶      LOC389997 LOC387703
HC    13           4.29   3.64   1.97 × 10⁻⁶      ATP8A2
HC    14           4.30   3.24   1.42 × 10⁻⁶      ACTN1
HC    14           4.30   3.07   1.42 × 10⁻⁶      RPS29P1 WDR22 DDX18P1
HC    15           4.63   3.69   2.83 × 10⁻⁷      HERC1³
HC    19           4.16   3.17   1.60 × 10⁻⁶      FUT2 FLJ36070 RASIP1 MGC34799 FUT1

1) Abbreviations: Pop = population; FR = FST ratio; LR = length ratio; AA = African American; EA = European American; HC = Han Chinese.
2) All p-values are genome wide significant p < 0.01.
3) Overlaps with genes suggested to have been under selection in earlier genome-scans based on other methods (CARLSON et al. 2005; VOIGHT et al. 2006).


approach based on Tajima’s D that is conceptually

different from ours.

Combining methods for studying genes under positive

selection

The somewhat low overlap of genes under positive

selection for different identification methods is not

surprising. Even though many of the methods are

somewhat similar, different populations and SNP data

have been used, and both the size of the window used

to scan the genome for particular features and the

allele frequency spectra vary. In an attempt to study

the overlap between the genes identified by our method and those determined by Voight and colleagues

(VOIGHT et al. 2006), we analyzed all our top 1%

blocks using the iHS method (VOIGHT et al. 2006).

Out of the top 1% regions for AA, EA and HC

populations (159, 157 and 156, respectively), 19, 46

and 42 regions are identified as candidates for selection

(p < 0.05) using the iHS method, clearly higher

than the number expected by chance. However, if we apply a genome-wide significance level of p = 0.01,

only 4, 29 and 17 regions, for AA, EA and HC

respectively, remain as candidates for positive selection

(Table 3).

DISCUSSION

Many studies of selection in the human genome have

been based on comparisons of the human and chimpanzee genomes and have therefore addressed

selection over 5–7 myr. We have focused on events that

occurred in association with the evolution of major

human population groups over the last 100 000 years.

The genetic differentiation of human populations is

the result of both selective and stochastic forces. Many

adaptations, such as those resulting from spatial and

temporal variation in climate, exposure to pathogens and diet, may have been restricted to particular

populations and are therefore likely to remain un-

detected by comparative genomic studies.

Positive selection acting on a locus is expected to

result in a more rapid fixation of alleles and conse-

quently less variation around the site under selection.

A means to identify chromosomal regions subjected to

positive selection is therefore to examine the pattern of DNA polymorphism combined with the haplotype

structure. The FST for individual nucleotide sites often

has a high variance (AKEY et al. 2002; WEIR et al.

2005) but since variation at neighboring sites is often

correlated, we instead estimated the FST for haplotype

blocks. Positive selection for an allele in a large

population is expected to result in regions with

extended haplotypes due to lower levels of genetic

variation than expected by random genetic drift.

Therefore, one characteristic of genomic regions under

recent positive or negative selection is large haplotype

blocks of reduced diversity and a high correlation

Table 3. The genes among our top 1% list of candidates that are proposed to have been under positive selection when evaluated using the iHS method.

Pop 1)  P-value   Genes
AA      2.80E-03  AIP1
AA      1.91E-02  CNTN5
AA      2.37E-04  CSMD1
AA      1.97E-03  RORA
EA      2.25E-06  A2BP1
EA      1.04E-10  ACTBP4
EA      1.08E-06  AIP1
EA      3.54E-02  APBA2
EA      2.32E-04  ARHGAP26
EA      1.20E-02  BNC2
EA      3.76E-05  DCC
EA      2.84E-04  DKFZP566N034
EA      4.73E-02  DOCK4
EA      2.62E-03  FLJ10159
EA      2.01E-03  KIAA0861
EA      2.83E-02  KIAA1889
EA      7.15E-02  KIAA2026
EA      2.09E-08  LCT
EA      6.56E-03  LOC284788
EA      1.65E-05  LOC440867
EA      7.41E-04  LOC90193
EA      9.72E-02  MYEF2, SLC24A5, LOC400369
EA      1.56E-06  PEPP2
EA      1.27E-03  PPP2R2B
EA      1.30E-03  PRDM10
EA      5.60E-06  PRKG2
EA      1.94E-04  PSD3
EA      3.25E-14  R3HDM
EA      7.52E-04  SGCZ
EA      8.93E-04  SLC12A1
EA      1.89E-02  SMYD3
EA      5.05E-03  TCBA1
EA      6.42E-06  UBXD2
HC      3.45E-03  ATP1B3P1, PAPOLG, LOC130865
HC      4.01E-03  C2orf23
HC      1.10E-02  C6orf176
HC      1.80E-02  C8orf21
HC      4.86E-06  DAB1
HC      1.07E-02  EPHB1
HC      9.60E-03  FHIT
HC      1.89E-03  FLJ11036, MKRN2, RAF1
HC      4.13E-03  LOC131368
HC      4.49E-05  LOC442008
HC      2.47E-06  LRPPRC
HC      2.97E-02  MGC42105
HC      1.06E-02  NAV2
HC      2.42E-04  SEMA3E
HC      5.48E-12  SMC6L1, FLJ40869, LOC343930
HC      5.34E-02  TCBA1
HC      2.45E-06  TRPC6

1) Abbreviations: Pop = population; AA = African American; EA = European American; HC = Han Chinese.


between the FST of proximate SNPs. The genome is

known to contain recombination hotspots located

between regions with higher LD (ALTSHULER et al.

2005). By focusing on haplotype blocks with reduced diversity (regardless of LD) rather than studying

extended haplotypes, the results should be less sensitive

to variation in the recombination rate between

populations.
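
The idea of estimating FST per haplotype block rather than per SNP can be sketched as follows. The exact estimator used in this study is not given in this section, so the sketch assumes a simple heterozygosity-based FST, (HT - HS) / HT, with the numerator and denominator each summed over all SNPs in the block and with equally weighted populations. The function name and the input layout are hypothetical.

```python
def block_fst(freqs_by_pop):
    """Heterozygosity-based FST for one block (illustrative assumption).

    freqs_by_pop: one list per population, each holding the frequency of
    the same reference allele at every SNP in the block.
    Populations are weighted equally.
    """
    n_pops = len(freqs_by_pop)
    n_snps = len(freqs_by_pop[0])
    ht_sum = 0.0  # expected heterozygosity of the pooled population
    hs_sum = 0.0  # mean expected heterozygosity within populations
    for j in range(n_snps):
        ps = [pop[j] for pop in freqs_by_pop]
        p_bar = sum(ps) / n_pops
        ht_sum += 2 * p_bar * (1 - p_bar)
        hs_sum += sum(2 * p * (1 - p) for p in ps) / n_pops
    if ht_sum == 0:
        return 0.0  # block is monomorphic in all populations
    return (ht_sum - hs_sum) / ht_sum

# Three SNPs in one block, two populations (made-up frequencies)
print(block_fst([[0.10, 0.20, 0.15], [0.80, 0.90, 0.85]]))
```

Summing over the block before taking the ratio pools information from correlated neighboring sites, which is one way to reduce the high variance of single-SNP estimates noted above.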

False and true positive rates for genes under selection

Approaches to identify genes under positive selection have to consider the relative contributions of genetic

drift and natural selection on the genetic variability

pattern. Most studies focus on the top 1% of

candidates, but are unable to distinguish between the

alternative explanation of positive selection or neutral

evolution. In our simulations of the length of blocks

with reduced haplotype diversity (LR), we observe

that a selective sweep with a selection coefficient s=0.01 will result in an average LR=1.7 (Fig. 4).

However, when no selection is acting (s=0), about

10% of the simulated data exhibited an LR>1.7.

Therefore, it is not possible to distinguish with

certainty between blocks that are under selection and

those evolving under neutrality. The number of genes

affected by positive selection in the human genome

is unknown, but it has been suggested that as many as 3% of genes have been subjected to recent positive

selection (EBERLE et al. 2006). If we assume that 3% is

the true fraction, it follows that roughly 660 of the

about 22 000 genes in the human genome have been

subjected to positive selection. Assuming that those

660 genes have a selection coefficient of s=0.01 and

using LR>3 as the threshold for identification of

genes under positive selection, then 1% of the neutrally

evolving genes (1% × 97% × 22 000 = 213) and 22% of

the genes truly under selection (22% × 660 = 145) will

be suggested to be candidates for positive selection (Fig. 4). Using more stringent criteria for the LR

cutoff will result in a higher ratio of true to false

positives, and also a larger number of false negatives.

For example, using LR>4.5 results in 0.1% (21 genes)

of the neutrally evolving genes and 5.9% (45 genes) of

the positively selected genes. Assuming a selection

coefficient of s=0.01 on 3% of the genes in the human

genome may also be an overestimation, resulting in an even higher ratio of false positive to true positive

selected genes. Similar to these estimations, the

frequency of false positives among top candidates

for positive selection is probably high for most

methods developed for scanning the genome for

positive selection. At the same time, most of the

genes that have undergone positive selection have

not been detected. This is the most likely explanation for the low extent of overlap of candidate genes

between different studies and methods (BISWAS and

AKEY 2006; SABETI et al. 2006, 2007), even though the

power to detect almost complete selective sweeps has

been shown to be high (SABETI et al. 2007). Many

selective events are likely to be weak and their

signatures can easily be eradicated by genetic drift.

Using a combined approach based on several genomic characteristics in searching for genes subjected to

recent positive selection is one approach for reducing

the number of false positives as well as false negative

genes.
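
The arithmetic behind the LR>3 example can be written out as a short sketch. The inputs are the figures quoted above (22 000 genes, 3% under selection, 1% of neutral genes and 22% of selected genes exceeding the threshold); the function name and the reported false-positive share are our own illustration and do not come from the original analysis.

```python
def expected_candidates(n_genes, frac_selected, fp_rate, tp_rate):
    """Expected false and true positives among candidates at one threshold.

    fp_rate: fraction of neutrally evolving genes exceeding the threshold.
    tp_rate: fraction of truly selected genes exceeding the threshold.
    Returns (false positives, true positives, share of false positives).
    """
    n_selected = n_genes * frac_selected
    n_neutral = n_genes - n_selected
    false_pos = fp_rate * n_neutral
    true_pos = tp_rate * n_selected
    return false_pos, true_pos, false_pos / (false_pos + true_pos)

# LR>3 threshold with the values quoted in the text
fp, tp, fp_share = expected_candidates(22_000, 0.03, 0.01, 0.22)
print(f"false positives ~ {fp:.0f}, true positives ~ {tp:.0f}, "
      f"false-positive share ~ {fp_share:.0%}")
```

With these inputs about 213 neutral genes and 145 selected genes pass the threshold, so roughly six in ten candidates would be false positives even though the threshold admits only 1% of neutral genes.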

Genes within our top 1% candidates likely to have been

under selection

Our analysis of the haplotype LR and FR in three

major populations resulted in the identification of a

number of genes that are candidates for having been affected by selection, even though their p-values do

not reach genome-wide significance. The lactase gene

(LCT) appears on the top 1% list of candidates for

positive selection in European Americans. The LCT

gene is likely to have been under positive selection due

to the increased nutrition when consuming dairy

products, which were introduced to humans during

cattle domestication in the Near East, about 9 000 years ago. Another interesting candidate gene is

MCPH1, involved in regulation of brain size. Our

analysis indicates that this gene has been under selection

in European Americans, and it has earlier been

suggested to be both under negative selection on the

human lineage (BUSTAMANTE et al. 2005) and under

positive selection in Caucasians (EVANS et al. 2005).

The polymorphism in MCPH1 proposed to be under

Fig. 4. Distribution of LR for neutrally evolving sites compared to sites with a selection coefficient of s=0.01. For the neutrally evolving genes, only 1% will exhibit an LR above 3 compared to 22% of the genes with a selection coefficient of 0.01.


selection in Caucasians was estimated to have arisen

approximately 37 000 years ago and simulations indicate

that it has been increasing in frequency too

rapidly to be compatible with neutral drift (EVANS

et al. 2005). One further interesting gene for which

positive selection is indicated in European Americans

is AIM1 (MATP). Polymorphism at AIM1 is associated

with normal variation in human pigmentation

(dark hair, skin and eye color in Caucasians) (GRAF

et al. 2005). A recent study also suggested that the

AIM1 gene has been subjected to positive selection

(SOEJIMA et al. 2006). Skin pigmentation is a well-known

example of genetic adaptation in humans

(CAVALLI-SFORZA et al. 1996). Both AIM1 and a

gene called OCA2, which is also among our top

candidates for being under selection in Europeans,

are implicated in different types of oculocutaneous

albinism. A mutation in OCA2 causes the most

prevalent type of oculocutaneous albinism throughout

the world and occurs at much lower frequency in

Europeans than in Africans (LEE et al. 1994).

SLC24A5, which is among our top candidates in

European Americans, has also been shown to be

involved in skin pigmentation and the allele associated

with light pigmentation is almost fixed in European

populations (LAMASON et al. 2005). A number of coat

color genes have been described in mice and we

identified 51 genes within our top 1% candidates

as orthologues to genes associated with variation in

coat color in mice (http://albinismdb.med.umn.edu/

genes.htm). Among these 51 genes, 5 (RAB27A, DCT,

EGFR, ATRN, MATP) are among our top candidates.

This is a significantly higher number than expected by

chance for 51 randomly chosen genes (p = 0.029, χ² test),

indicating that many of the genes involved in

human skin pigmentation have been under selection in

different populations. Recently a number of studies

have focused on genes involved in skin pigmentation in

humans (MCEVOY et al. 2006; LAO et al. 2007; MYLES

et al. 2007), also indicating that many of those genes

have been subjected to positive selection.

In addition to the genes investigated in our study, a

number of regions lacking recognized genes were also

as likely to have been under selection as the genes

discussed (data not shown). The results for some of

these regions may reflect stochastic variation, but in

general, this supports the notion that some non-coding

DNA and intergenic sequences are under

selection for functional reasons (ANDOLFATTO 2005).

One example of such a case is the LCT gene. The most

likely mutation to have been under selection in

association with the LCT gene is situated 14 kb

upstream of the LCT gene (ENATTAH et al. 2002).

In the case of LCT, the haplotype block is large

enough to include both the gene and the upstream

region, which might not be the case for other similar

situations.

Genome-wide significance of positive selection

As seen in our and other studies, most methods face

the problem that using stringent statistical thresholds

to avoid false positives results in high numbers of false

negatives. When applying thresholds for genome-wide

significance we identify 31 candidates for having been

under selection (Table 2). As an alternative means of

enriching for candidates, we applied the method based

on iHS (VOIGHT et al. 2006) to our list of top 1% candidates, which resulted in 50 regions with a

significant p-value (Table 3). The overlap between

the regions identified by our method using a genome-wide

significance threshold (Table 2) and the application

of the iHS method to our 1% top candidates is

restricted to AIP1 and CSMD1 in the African

American population. In addition, TCBA1, which is

on the top 1% list for all three populations, is found among the genome-wide significant genes in African

Americans using our method, but among the significant

genes for both European Americans and Han

Chinese when applying the iHS method. Surprisingly,

HERC1 and EDAR (Table 2), which show genome-wide

significance using our method, were not significant

when we applied the iHS method, even though

they were listed as significant by other groups using this test on other datasets (VOIGHT et al. 2006; SABETI

et al. 2007). This discrepancy probably reflects the

sensitivity of the results to the dataset used for

identifying the candidate regions. The limited sample

sizes available for these kinds of studies not only

decrease the power to identify genes under selection,

but also make the findings hard to replicate in another

sample.

Overlap with other studies

The rather small overlap between the loci indicated to

be under selection in our study and those pinpointed

previously in between-species comparisons is not

surprising. First, there is a large difference in the

time perspective of the selective events. Local adaptive

selection that affects one or a few human populations

may be quite distinct from the selective pressure that shaped modern humans from archaic forms of Homo.

Second, since environmental factors vary between

locations it is not expected that selection will affect

all populations equally. Therefore, some of the apparent

differences between the results of comparative

genomic and population genetic approaches may be

due to differences in the samples included. For

example, LPP has been suggested to be under negative


selection on the human lineage (BUSTAMANTE et al.

2005) but is among our top candidates for having been

under positive selection in Han Chinese. However,

Bustamante and colleagues used only European and

African samples to represent humans in their analysis

and if local selection has only occurred on LPP in the

Han Chinese this would not have been detected in their

study. We observe a larger overlap with studies

focusing on individual human populations. CARLSON

et al. (2005) based their study on the Tajima’s D

statistic and used a sliding window technique to

identify candidate regions. Tajima’s D is conceptually

unrelated to our method and it is therefore interesting

that we have a larger overlap than expected by chance

between our list of genes and that of CARLSON et al.

(2005). Since we are considering both the allele

frequency differences between human populations

and the deviation in length of haplotype blocks, we

will preferentially identify genes that have been under

selection after the separation of the three major

human groups.

In summary, we have developed and applied a new

method for identifying candidate loci for being under

selection at different time points during the evolution

of human populations. To identify the specific loci and

polymorphisms under selection requires further studies

of the sequence variability and natural history of

different populations. None of the methods available

for identifying genes under selection is resistant to

errors, but the use of both haplotype architecture and

population differentiation increases our ability to

identify genomic regions that have been affected by

non-random forces. These methodologies provide

focal points in the genome for future studies of the

evolutionary events that have shaped modern human

populations as they explored different parts of the

world. One continuation of these studies is to evaluate

the top candidates by resequencing the genes in a

number of populations. Resequencing is still both

time-consuming and expensive but with the rapid

progress in high-resolution genotyping of individuals

from different populations (ALTSHULER et al. 2005;

HINDS et al. 2005) and the availability of new

techniques for high throughput genomic resequencing

(BENNETT et al. 2005; MARGULIES et al. 2005), the

amount of genome data is growing exponentially and

thereby also the potential for further evaluation of

genes that have been under selection in humans.

Acknowledgements – This study was supported by grants from the Swedish Natural Sciences Research Council. AJ is affiliated to The Linnaeus Centre for Bioinformatics, Uppsala University, Sweden.

REFERENCES

Akey, J. M., Zhang, G., Zhang, K. et al. 2002. Interrogating a high-density SNP map for signatures of natural selection. – Genome Res. 12: 1805–1814.

Altshuler, D., Brooks, L. D., Chakravarti, A. et al. 2005. A haplotype map of the human genome. – Nature 437: 1299–1320.

Andolfatto, P. 2005. Adaptive evolution of non-coding DNA in Drosophila. – Nature 437: 1149–1152.

Beaumont, M. A. and Balding, D. J. 2004. Identifying adaptive genetic divergence among populations from genome scans. – Mol. Ecol. 13: 969–980.

Bennett, S. T., Barnes, C., Cox, A. et al. 2005. Toward the 1,000 dollars human genome. – Pharmacogenomics 6: 373–382.

Biswas, S. and Akey, J. M. 2006. Genomic insights into positive selection. – Trends Genet. 22: 437–446.

Bustamante, C. D., Fledel-Alon, A., Williamson, S. et al. 2005. Natural selection on protein-coding genes in the human genome. – Nature 437: 1153–1157.

Carlson, C. S., Thomas, D. J., Eberle, M. A. et al. 2005. Genomic regions exhibiting positive selection identified from dense genotype data. – Genome Res. 15: 1553–1565.

Cavalli-Sforza, L. L., Menozzi, P. and Piazza, A. 1996. The history and geography of human genes. – Princeton Univ. Press.

Clark, A. G., Glanowski, S., Nielsen, R. et al. 2003. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. – Science 302: 1960–1963.

Eberle, M. A., Rieder, M. J., Kruglyak, L. et al. 2006. Allele frequency matching between SNPs reveals an excess of linkage disequilibrium in genic regions of the human genome. – PLoS Genet. 2: e142.

Enattah, N. S., Sahi, T., Savilahti, E. et al. 2002. Identification of a variant associated with adult-type hypolactasia. – Nat. Genet. 30: 233–237.

Evans, P. D., Gilbert, S. L., Mekel-Bobrov, N. et al. 2005. Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans. – Science 309: 1717–1720.

Fay, J. C. and Wu, C. I. 2000. Hitchhiking under positive Darwinian selection. – Genetics 155: 1405–1413.

Fu, Y. X. and Li, W. H. 1993. Statistical tests of neutrality of mutations. – Genetics 133: 693–709.

Graf, J., Hodgson, R. and Van Daal, A. 2005. Single nucleotide polymorphisms in the MATP gene are associated with normal human pigmentation variation. – Hum. Mutat. 25: 278–284.

Halperin, E. and Eskin, E. 2004. Haplotype reconstruction from genotype data using imperfect phylogeny. – Bioinformatics 20: 1842–1849.

Hamblin, M. T. and Di Rienzo, A. 2000. Detection of the signature of natural selection in humans: evidence from the Duffy blood group locus. – Am. J. Hum. Genet. 66: 1669–1679.

Hamblin, M. T., Thompson, E. E. and Di Rienzo, A. 2002. Complex signatures of natural selection at the Duffy blood group locus. – Am. J. Hum. Genet. 70: 369–383.

Hanchard, N. A., Rockett, K. A., Spencer, C. et al. 2006. Screening for recently selected alleles by analysis of human haplotype similarity. – Am. J. Hum. Genet. 78: 153–159.

Harr, B., Kauer, M. and Schlotterer, C. 2002. Hitchhiking mapping: a population-based fine-mapping strategy for adaptive mutations in Drosophila melanogaster. – Proc. Natl Acad. Sci. USA 99: 12949–12954.

Hinds, D. A., Stuve, L. L., Nilsen, G. B. et al. 2005. Whole-genome patterns of common DNA variation in three human populations. – Science 307: 1072–1079.

Hollox, E. J., Poulter, M., Zvarik, M. et al. 2001. Lactase haplotype diversity in the Old World. – Am. J. Hum. Genet. 68: 160–172.

Lamason, R. L., Mohideen, M. A., Mest, J. R. et al. 2005. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. – Science 310: 1782–1786.

Lao, O., De Gruijter, J. M., Van Duijn, K. et al. 2007. Signatures of positive selection in genes associated with human skin pigmentation as revealed from analyses of single nucleotide polymorphisms. – Ann. Hum. Genet. 71: 354–369.

Lee, S. T., Nicholls, R. D., Schnur, R. E. et al. 1994. Diverse mutations of the P gene among African-Americans with type II (tyrosinase-positive) oculocutaneous albinism (OCA2). – Hum. Mol. Genet. 3: 2047–2051.

Margulies, M., Egholm, M., Altman, W. E. et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. – Nature 437: 376–380.

McEvoy, B., Beleza, S. and Shriver, M. D. 2006. The genetic architecture of normal variation in human pigmentation: an evolutionary perspective and model. – Hum. Mol. Genet. 15 Spec. No. 2: R176–181.

Myles, S., Somel, M., Tang, K. et al. 2007. Identifying genes underlying skin pigmentation differences among human populations. – Hum. Genet. 120: 613–621.

Nielsen, R., Bustamante, C., Clark, A. G. et al. 2005. A scan for positively selected genes in the genomes of humans and chimpanzees. – PLoS Biol. 3: e170.

Sabeti, P. C., Reich, D. E., Higgins, J. M. et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. – Nature 419: 832–837.

Sabeti, P. C., Schaffner, S. F., Fry, B. et al. 2006. Positive natural selection in the human lineage. – Science 312: 1614–1620.

Sabeti, P. C., Varilly, P., Fry, B. et al. 2007. Genome-wide detection and characterization of positive selection in human populations. – Nature 449: 913–918.

Schlotterer, C. 2003. Hitchhiking mapping - functional genomics from the population genetics perspective. – Trends Genet. 19: 32–38.

Soejima, M., Tachida, H., Ishida, T. et al. 2006. Evidence for recent positive selection at the human AIM1 locus in a European population. – Mol. Biol. Evol. 23: 179–188.

Spencer, C. C. and Coop, G. 2004. SelSim: a program to simulate population genetic data with natural selection and recombination. – Bioinformatics 20: 3673–3675.

Storz, J. F. 2005. Using genome scans of DNA polymorphism to infer adaptive population divergence. – Mol. Ecol. 14: 671–688.

Stringer, C. B. and Andrews, P. 1988. Genetic and fossil evidence for the origin of modern humans. – Science 239: 1263–1268.

Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. – Genetics 123: 585–595.

Voight, B. F., Kudaravalli, S., Wen, X. et al. 2006. A map of recent positive selection in the human genome. – PLoS Biol. 4: e72.

Wang, E. T., Kodama, G., Baldi, P. et al. 2006. Global landscape of recent inferred Darwinian selection for Homo sapiens. – Proc. Natl Acad. Sci. USA 103: 135–140.

Weir, B. S., Cardon, L. R., Anderson, A. D. et al. 2005. Measures of human population structure show heterogeneity among genomic regions. – Genome Res. 15: 1468–1476.

Wright, S. 1950. Genetic structure of populations. – Br. Med. J. 4669: 36.

