Identification of local selective sweeps in human populations since the exodus from Africa
ASA JOHANSSON and ULF GYLLENSTEN
Department of Genetics and Pathology, Rudbeck laboratory, Uppsala University, Uppsala, Sweden
Johansson, A. and Gyllensten, U. 2008. Identification of local selective sweeps in human populations since the exodus from
Africa. Hereditas 145: 126–137. Lund, Sweden. eISSN 1601-5223. Received January 25, 2008. Accepted March 17, 2008
Selection on the human genome has been studied using comparative genomics and SNP architecture in the lineage leading to
modern humans. In connection with the African exodus and colonization of other continents, human populations have
adapted to a range of different environmental conditions. Using a new method that jointly analyses haplotype block length
and allele frequency variation (FST) within and between populations, we have identified chromosomal regions that are
candidates for having been affected by local selection. Based on 1.6 million SNPs typed in 71 individuals of African
American, European American and Han Chinese descent, we have identified a number of genes and non-coding regions that
are candidates for having been subjected to local positive selection during the last 100 000 years. Among these genes are
those involved in skin pigmentation (SLC24A5) and diet adaptation (LCT). The list of genes implicated in these local
selective sweeps overlaps partly with those implicated in other studies of human populations using other methods, but shows
little overlap with those postulated to have been under selection in the 5–7 myr since the divergence of the ancestors of human
and chimpanzee. Our analysis provides focal points in the genome for detailed studies of evolutionary events that have
shaped human populations as they explored different regions of the world.
Ulf Gyllensten, Dept of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, SE-571 85, Uppsala, Sweden.
E-mail: [email protected]
Comparisons of the human and chimpanzee genomes
have been used to identify genes subjected to selection
on the lineage leading to modern humans (CLARK
et al. 2003; BUSTAMANTE et al. 2005; NIELSEN et al.
2005). Such comparative genomic approaches address
genetic changes that have occurred during the 5–7 myr
since humans and chimpanzees shared a common
ancestor. However, modern humans emerged in Africa
less than 200 kyr ago (STRINGER and ANDREWS 1988)
and only began to colonize other continents 50–80 kyr
ago. To specifically target genes that have been
subjected to selection in association with more recent
evolutionary events, such as the African exodus and
the separation of Caucasian and Asian populations,
genomic analyses based on population comparisons
are needed. Some genetic adaptations to environmen-
tal conditions have been described in humans. For
instance, populations depending on agriculture often
have a high tolerance to lactose, associated with a
mutation in the lactase gene (HOLLOX et al. 2001;
ENATTAH et al. 2002). Also, variation at the Duffy
blood group locus has been associated with malaria
resistance (HAMBLIN and DI RIENZO 2000; HAMBLIN
et al. 2002). Of the three Duffy alleles (FY*A,
FY*B and FY*0), homozygotes for FY*0 have been
associated with resistance to malaria. This allele
has almost reached fixation in sub-Saharan popula-
tions but remains rare in Asian and European
populations. However, the number of loci identified
that are candidates for local selection is limited and
new methods are needed to scan the human genome
for evidence of selection.
Positive (directional) selection on an allele is ex-
pected to increase its frequency in the affected
population. Differences in allele frequency between
populations, estimated by Wright's FST (WRIGHT
1950), have therefore been used to indicate genes
under selection (BEAUMONT and BALDING 2004;
STORZ 2005). A high FST indicates positive selection
whereas a low FST indicates that the loci are subject to
purifying or balancing selection. Genome-wide
searches based on allele frequency differences have
resulted in a number of genes and gene categories
postulated to be under selection (AKEY et al. 2002).
However, FST estimates for individual sites have a high
variance, complicating its use as a sole indicator of
regions under selection (WEIR et al. 2005). Positive
selection is expected to increase both the frequency of
the affected site and of linked sites and selective
sweeps may therefore result in the presence of long
haplotype blocks. This is utilized in hitchhike mapping
for the identification of genes under selection (HARR
et al. 2002; SCHLOTTERER 2003). However, since both
recombination frequency and population structure
contribute to haplotype block variability, reduced
haplotype variability per se is not a strong indication
of selective events. Local selective sweeps have also
been studied using a number of methods based on the
DOI: 10.1111/j.2008.0018-0661.02054.x
haplotype or LD architecture (SABETI et al. 2002;
HANCHARD et al. 2006; VOIGHT et al. 2006;
WANG et al. 2006), and such methods have been
proposed to be more powerful for detecting selective
events than methods based on the nucleotide diversity
(HANCHARD et al. 2006), such as Tajima’s D-test, Fu
and Li’s D-test and Fay and Wu’s H-test (TAJIMA
1989; FU and LI 1993; FAY and WU 2000). While most
methods focus on haplotype patterns within a popula-
tion, a haplotype-based method was recently presented
for cross-population comparisons to detect alleles that
have reached near-fixation (SABETI et al. 2007).
Here, we apply a new cross-population method to
search for genomic regions that have been subjected to
local positive selection. Our method considers both
the haplotype block length and allele frequency
variation between different populations as well as
between genomic regions. Selection may have occurred
at a number of time points in the human history, such
as in Africa prior to the separation of the major
population groups (Fig. 1, branch 2), after the exodus
of non-African populations (branch 3), prior to the
separation of Asian and European populations
(branch 4) or after the separation of European
(branch 5) and Asian (branch 6) populations. Com-
parison between populations makes it possible to
study the genetic changes associated with some of
these events. We have used available SNP and haplo-
type data from three major human populations,
African (American), European (American) and Asian
(Han Chinese) (HINDS et al. 2005), to address selective
sweeps that occurred during the last 50–80 kyr of
human evolution. Our results are compared to those
of recent studies that have assessed selection in the
human genome using other methods.
MATERIAL AND METHODS
We used a publicly available dataset consisting of
1.6 million SNPs evenly distributed across the genome
and genotyped in 71 individuals by Perlegen Sciences,
representing 23 African Americans (AA), 24 Eur-
opean Americans (EA) and 24 Han Chinese (HC)
(HINDS et al. 2005). Haplotypes had been inferred
separately for each of the three sample sets using the HAP program (HALPERIN and ESKIN 2004) and
partitioned into blocks with limited diversity (HINDS
et al. 2005). These blocks were defined as sets of SNPs
for which at least 80% of the inferred haplotypes could
be grouped into common patterns with a population
frequency of at least 5%. Using this definition, 235 663
blocks have previously been identified in African
Americans, 109 913 blocks in European Americans and 89 994 blocks in Han Chinese (HINDS et al. 2005).
For identifying chromosomal regions subjected to
positive selection, we only considered the autosomal
chromosomes.
Block length and FST values
For each haplotype block, consisting of at least two
SNPs, we estimated the block length using Build 35
(ver. 35 of the human genome sequence annotations).
Blocks for which the SNP order did not agree between
Build 35 and Build 34 (ver. 34 of the human genome
sequence annotations which was used by Perlegen)
were removed from further analysis. In addition, blocks
shorter than 500 bases were removed. A block FST in a population was defined as the average pairwise FST
(WRIGHT 1950) for all individual SNPs in that block
between that population and another population,
resulting in two block FST values for each haplotype
block. For instance, for a block defined in African
Americans the first block FST is the average FST
between African Americans and European Americans
for all SNPs in the African American block. The second FST is the average FST between African
Americans and Han Chinese for the same SNPs.
Genes included in a haplotype block were identified
through their reference sequence position according to
Build 35.
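The block FST described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text cites WRIGHT (1950) without giving the estimator, so the per-SNP formula used here, (H_T - H_S)/H_T from allele frequencies with equal population weights, is an assumption, and the function names `snp_fst` and `block_fst` are hypothetical.

```python
def snp_fst(p1, p2):
    """Two-population FST for one biallelic SNP, from allele frequencies.

    Assumed estimator: (H_T - H_S) / H_T with equal population weights.
    """
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_t == 0.0:
        return 0.0  # SNP is monomorphic in both populations
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # mean within
    return (h_t - h_s) / h_t


def block_fst(freqs_pop, freqs_other):
    """Average pairwise FST over all SNPs of one haplotype block."""
    values = [snp_fst(a, b) for a, b in zip(freqs_pop, freqs_other)]
    return sum(values) / len(values)


# Hypothetical allele frequencies for a three-SNP block in two populations.
example = block_fst([0.9, 0.8, 0.5], [0.1, 0.3, 0.5])
```

Each block defined in one population yields two such values, one against each of the other two populations.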
Fig. 1. Different stages when selection could have acted during human evolution. The numbers on the branches indicate time points when selection could have occurred, such as on the lineage leading to chimpanzee (branch 1), in African populations prior to the separation of the three populations (branch 2), after the exodus of non-African populations (branch 3), in non-African populations prior to the separation of Asian and European populations (branch 4), and after the separation of European (branch 5) and Asian (branch 6) populations.

FR for haplotype blocks
Our aim was to search for regions that differ in block
allele frequency (high FST) between African and non-
African populations, between Asian and both other
populations and between European and both other
populations. To study whether each population differs
from both the other two populations, we calculated
the following FST-ratios (FRs) for each haplotype
block (i),

\[
FR^{AA}_{i} = \log\left( \frac{F^{AA\text{-}EA}_{ST,i}}{\operatorname{median}\left(F^{AA\text{-}EA}_{ST}\right)} \times \frac{F^{AA\text{-}HC}_{ST,i}}{\operatorname{median}\left(F^{AA\text{-}HC}_{ST}\right)} \right);
\]

\[
FR^{EA}_{i} = \log\left( \frac{F^{EA\text{-}AA}_{ST,i}}{\operatorname{median}\left(F^{EA\text{-}AA}_{ST}\right)} \times \frac{F^{EA\text{-}HC}_{ST,i}}{\operatorname{median}\left(F^{EA\text{-}HC}_{ST}\right)} \right);
\]

\[
FR^{HC}_{i} = \log\left( \frac{F^{HC\text{-}AA}_{ST,i}}{\operatorname{median}\left(F^{HC\text{-}AA}_{ST}\right)} \times \frac{F^{HC\text{-}EA}_{ST,i}}{\operatorname{median}\left(F^{HC\text{-}EA}_{ST}\right)} \right);
\]

where AA refers to African Americans, EA to
European Americans, and HC to Han Chinese,
respectively. The individual block FST values were
divided by the median to be compared to the genome in general and log transformed. An FST ratio around
zero indicates an average block FST and a low
(negative) FR indicates very low differentiation between
the study population and both of the other
populations. A high FR reflects large differentiation
but does not indicate the specific population(s) in
which a potential selective sweep has occurred.
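As a rough sketch of the FR calculation, assuming the two block FST values per block are already available as parallel lists and are strictly positive: the natural logarithm is an assumption (the text only says "log"), and `fst_ratios` is a hypothetical name.

```python
import math
from statistics import median


def fst_ratios(fst_vs_first, fst_vs_second):
    """FR for every block of one population.

    fst_vs_first / fst_vs_second: block FST against each of the two other
    populations, in the same block order. Each value is scaled by the
    genome-wide median of its comparison before the log is taken.
    """
    med_first = median(fst_vs_first)
    med_second = median(fst_vs_second)
    return [
        math.log((a / med_first) * (b / med_second))
        for a, b in zip(fst_vs_first, fst_vs_second)
    ]


# Hypothetical block FST values for three blocks.
fr = fst_ratios([0.1, 0.2, 0.4], [0.05, 0.1, 0.2])
```

A block whose two FST values both equal their medians gets FR = 0, matching the interpretation given in the text.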
Population-specific extended haplotype blocks
To identify in which population a selective sweep has
occurred, we compared the length of haplotype blocks
between populations for every site. For each block
defined in a population, the corresponding blocks in
the two other populations were identified as the blocks
with the largest overlap. If no overlapping block was
found in the other populations, the block was removed from further analysis. We calculated a ratio between
the lengths of the haplotype block (i) in the different
populations as:

\[
LR^{AA}_{i} = \log\left( \frac{\text{block length(AA)}_{i}^{2}}{\text{block length(EA)}_{i} \times \text{block length(HC)}_{i}} \Bigg/ \operatorname{median}\left( \frac{\text{block length(AA)}^{2}}{\text{block length(EA)} \times \text{block length(HC)}} \right) \right)
\]

\[
LR^{EA}_{i} = \log\left( \frac{\text{block length(EA)}_{i}^{2}}{\text{block length(AA)}_{i} \times \text{block length(HC)}_{i}} \Bigg/ \operatorname{median}\left( \frac{\text{block length(EA)}^{2}}{\text{block length(AA)} \times \text{block length(HC)}} \right) \right)
\]

\[
LR^{HC}_{i} = \log\left( \frac{\text{block length(HC)}_{i}^{2}}{\text{block length(AA)}_{i} \times \text{block length(EA)}_{i}} \Bigg/ \operatorname{median}\left( \frac{\text{block length(HC)}^{2}}{\text{block length(AA)} \times \text{block length(EA)}} \right) \right)
\]

To make the ratios comparable across populations,
each ratio was divided by the median and log
transformed. The log transformed LR is expected to
be zero for regions exhibiting an average proportion of
block length within populations. A low (negative) LR
indicates that the haplotype block, in the population
studied, is relatively short compared to the blocks
in the other populations. A high LR in one population
indicates a larger haplotype block relative to the other
populations and that the region has potentially undergone
a selective sweep in that population.
Simulation of LR distribution
We used the software SelSim (SPENCER and COOP
2004) to simulate a dataset using different strengths of
selection on an SNP, varying the effective population
size and using different fractions of derived to
ancestral alleles. The number of chromosomes used
in each simulation was similar to the number of chromosomes in each of our three human populations
from which the empirical data was obtained (n = 48).
To examine the effect of variation in selection
coefficient (s = 0, s = 0.01, s = 0.02, s = 0.04), we used
an effective population size of Ne = 10 000 and a
proportion of derived to ancestral alleles of 44/4. To
study the effect of the effective population size, we
used Ne = 1000, Ne = 5000 and Ne = 10 000, a proportion of derived to ancestral alleles of 44/4 and selection
coefficients of s = 0 and s = 0.04. In all simulations the
recombination rate was set to 1 cM/Mb, the number of
SNPs to 200 and the density to 1 SNP per kb. For each
set of simulations, we made 10 000 permutations. The
length of haplotype blocks surrounding the SNP
under selection was determined using the same criteria
as used for the population data: 80% of the sequences should be grouped into haplotypes with a
frequency of at least 5% in the dataset (HINDS et al.
2005). For each of the simulated replicates, the block
length was determined using a script in MATLAB in
the following way:
1) Starting at the SNP under selection.
2) Step 1 SNP in each direction from starting point.
3) Determine if the block including one or both of
the new SNPs follows the definition of a haplotype
block.
4) If the SNPs are included in the same block as
the SNP under selection: repeat from step 2. If
one SNP does not belong to the same block as the
SNP under selection, no further SNPs in that
direction from the start point will be evaluated.
5) When none of the new SNPs are included in
the block, the block length is defined by the
distance between the last SNPs, in each direction
from the starting point.
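The stepwise block extension above can be sketched in Python (the original script was written in MATLAB and is not shown in the text). This is one possible reading of steps 1 to 5: haplotypes are assumed to be equal-length strings of allele codes, the order in which the two directions are tried is a choice made here, and the names `is_block`, `extend_block` and `block_length` are hypothetical.

```python
from collections import Counter


def is_block(haps, lo, hi, coverage=0.80, min_freq=0.05):
    """True if at least `coverage` of the haplotypes over SNPs lo..hi fall
    into patterns that each have a frequency of at least `min_freq`."""
    n = len(haps)
    counts = Counter(h[lo:hi + 1] for h in haps)
    in_common = sum(c for c in counts.values() if c / n >= min_freq)
    return in_common / n >= coverage


def extend_block(haps, start):
    """Grow a block outwards from the SNP at index `start`.

    Returns the (first, last) SNP indices of the final block.
    """
    n_snps = len(haps[0])
    lo = hi = start
    left_open = right_open = True
    while left_open or right_open:
        can_left = left_open and lo > 0
        can_right = right_open and hi < n_snps - 1
        # Try to add one SNP on both sides at once.
        if can_left and can_right and is_block(haps, lo - 1, hi + 1):
            lo, hi = lo - 1, hi + 1
            continue
        # Otherwise try each side alone; a failing side is closed for good.
        if can_left and is_block(haps, lo - 1, hi):
            lo -= 1
        else:
            left_open = False
        if can_right and is_block(haps, lo, hi + 1):
            hi += 1
        else:
            right_open = False
    return lo, hi


def block_length(positions, lo, hi):
    """Distance in bases between the outermost SNPs of the block."""
    return positions[hi] - positions[lo]


# Hypothetical data: 48 haplotypes over 6 SNPs, 1 SNP per kb.
haps = [format(i, "06b") for i in range(48)]
positions = [100, 1100, 2100, 3100, 4100, 5100]
first, last = extend_block(haps, 2)
```

With 48 chromosomes a pattern needs at least three copies to reach the 5% frequency, so in this toy dataset the block stops growing once the window is wide enough that every pattern occurs only twice.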
LRs were calculated for the simulated haplotype
blocks, similar to the empirical population data. Each
simulated block was compared to two randomly chosen blocks from the dataset simulated with s = 0:

\[
LR_{i} = \log\left( \frac{\text{block length}_{i}^{2}}{\text{block length}(s=0) \times \text{block length}(s=0)} \right)
\]

Blocks used for calculating the LRs were resampled
100 000 times with replacement.
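A minimal sketch of the two length-ratio calculations described in the text, under stated assumptions: the natural logarithm is used (the text only says "log"), the resampling draws one test block and two neutral blocks per iteration (one reading of "resampled with replacement"), and the function names and example lengths are hypothetical.

```python
import math
import random
from statistics import median


def empirical_lrs(own, other1, other2):
    """LR per block for one population against the two other populations.

    own, other1, other2: lengths of corresponding blocks, same order.
    """
    raw = [a ** 2 / (b * c) for a, b, c in zip(own, other1, other2)]
    med = median(raw)  # genome-wide median of the raw ratio
    return [math.log(r / med) for r in raw]


def simulated_lrs(test_lengths, neutral_lengths, n_draws, seed=0):
    """LR of simulated blocks, each against two random neutral (s = 0) blocks."""
    rng = random.Random(seed)
    lrs = []
    for _ in range(n_draws):
        own = rng.choice(test_lengths)
        ref1 = rng.choice(neutral_lengths)
        ref2 = rng.choice(neutral_lengths)
        lrs.append(math.log(own ** 2 / (ref1 * ref2)))
    return lrs


# Hypothetical block lengths in bases.
emp = empirical_lrs([1.0, 2.0, 4.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
sim = simulated_lrs([2000.0], [1000.0], n_draws=5)
```

Fixing the seed keeps the resampling reproducible; the text used 100 000 draws.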
Combining methods for studying genes under positive
selection
We used the iHS method by Voight and colleagues
(VOIGHT et al. 2006) to scan our top 1% candidate
genes for being subjected to positive selection. For all
SNPs in each block, the ancestral state was identified
using the UCSC Human Genome database (Mar. 2006
assembly) where possible. For all SNPs with an identified ancestral state, an unstandardized iHS was
calculated as the ratio between the integrated EHH
(extended haplotype homozygosity) score of the
ancestral and the derived allele (VOIGHT et al. 2006).
Since our genomic regions are selected as candidates
for positive selection, we cannot use our empirical
distribution to standardize the iHS. Instead, we used
the already described, almost linear, relationship between the cutoff for the top 1% tail of the
unstandardized iHS distribution and the allele fre-
quency (VOIGHT et al. 2006, Fig. 4) to estimate what
fraction of SNPs for each of our genes are within the
top 1% genome-wide empirical distribution of un-
standardized iHSs. Large blocks containing more than
one gene were divided to assign an independent iHS to
each gene. The significance for each gene was calculated as the significance of observing a certain fraction
of SNPs within the 1% tail of the genome-wide
distribution, compared to the 1% expected by chance,
using the χ² statistic.
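The per-gene significance test can be sketched as a one-degree-of-freedom χ² goodness-of-fit test. The exact form of the test is not given in the text, so the two-category version below (SNPs inside versus outside the 1% tail) is an assumption, as are the function name and the example counts.

```python
import math


def tail_enrichment_chi2(n_snps, n_in_tail, expected_fraction=0.01):
    """Chi-square test (1 df) of the observed number of SNPs in the tail
    against the number expected by chance.

    Returns (chi2, p_value).
    """
    exp_in = n_snps * expected_fraction
    exp_out = n_snps - exp_in
    obs_out = n_snps - n_in_tail
    chi2 = ((n_in_tail - exp_in) ** 2 / exp_in
            + (obs_out - exp_out) ** 2 / exp_out)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical gene: 30 of 1000 SNPs fall in the 1% tail (10 expected).
chi2, p = tail_enrichment_chi2(1000, 30)
```

For genes with few SNPs the expected count in the tail is small, and the χ² approximation should then be read with caution.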
RESULTS
Distribution of LR
We have used the length ratio (LR) to measure the
length of a haplotype block relative to the length of
the corresponding block in other populations and
relative to the genome average. The behavior of the LR
under different selection models was studied by
simulation using the SelSim software (SPENCER and
COOP 2004). In these simulations, the length of each
block was calculated relative to a randomly chosen
block from the dataset evolving under neutrality (s = 0), and relative to the average block length in the
dataset. When keeping all other parameters constant in the simulations and varying the selection coefficient
between s = 0, s = 0.01, s = 0.02 and s = 0.04, selection
will result in a shift of the block length in the expected
direction (towards longer blocks) (Fig. 2A). For
example, using s = 0.01 will result in a mean LR = 1.7. The actual LR distributions for the SNP data
from African American (AA), European American
(EA) and Han Chinese (HC) are quite similar to the simulated data under neutrality (s = 0). This is
consistent with the prediction that most of the variation in
haplotype block length observed in the human genome
is the result of random genetic drift rather than
selection. The influence of selection on a genomic
region does not only depend on the selection coefficient
but also on the effective population size. When
considering sites that are evolving under neutrality (s = 0), varying the effective population size between
Ne = 1000, Ne = 5000 and Ne = 10 000 does not affect
the LR distribution (Fig. 2B), while the strongest
effect from adding a selective advantage for one allele
(s = 0.04) is, not surprisingly, seen in populations
with the largest effective size. As shown by the
simulations, we would expect longer haplotype blocks
with reduced variability when an allele is favored by positive selection (Fig. 2A). Under certain conditions,
positive selection may therefore result in extended
haplotype blocks, supporting the use of block length
as a variable for the identification of genomic regions
that have been subjected to local positive selection.
Distribution of FST ratios
As a second parameter, we calculated the FST ratio
(FR). The populations for which SNP data are available share a relatively recent common ancestry
(less than 50 000 years) and most of the SNP variation
is shared between populations. Similar to the LR, the
FR for each block in a population is the pairwise FST
between that population and each other population,
and relative to the genome average. If a gene has
been under recent positive selection in only one
population, we would expect a large FR, since the FST between that population and both others would
be larger than average. The distribution of FR values is
very similar between populations, with a somewhat
lower fraction of the blocks with FR = 0 in the Han
Chinese (Fig. 2C).
Block LR and FR distributions
In order to identify genomic regions that show deviating
patterns and are candidates for having been affected
by selection, we combined the LR and the FR values
for each population and plotted these against each
other. Blocks that are only affected by positive
selection in Africans are expected to have a high LRAA and a high FRAA (Fig. 3a), while blocks that
are only under selection in European Americans are
expected to have a high LREA and a high FREA
(Fig. 3b) and, finally, blocks under selection in Asians
are expected to have a high LRHC and a high FRHC
(Fig. 3c). To determine the statistical significance of
deviating LR and FR values, the distributions of these
parameters have to be addressed. Both the LR and FR values are approximately normally distributed, with
LR possibly having a slight negative skew (Fig. 2A,
2C). Since we are only interested in outliers with a high
value, only positive values are considered in determining
the significance levels. From the empirical
distributions, each block was assigned a p-value
describing the likelihood of the observed value
considering the distribution of values. For each block and each population, p-values for both LRs and FRs were
assessed separately. Even though FR and LR are not
completely uncorrelated (r² = 0.013), we computed a
combined p-value by multiplying the p-values for the
FR and LR for each block.
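The p-value assignment and combination can be illustrated with a short sketch. The text does not spell out how the empirical p-value was derived, so the upper-tail rank definition used here (the fraction of blocks with a value at least as large) is an assumption; the function names and values are hypothetical.

```python
from bisect import bisect_left


def upper_tail_p(values):
    """Empirical upper-tail p-value for each value: the fraction of all
    values that are greater than or equal to it."""
    n = len(values)
    ordered = sorted(values)
    return [(n - bisect_left(ordered, v)) / n for v in values]


def combined_p(p_lr, p_fr):
    """Combined p-value per block: product of the LR and FR p-values."""
    return [a * b for a, b in zip(p_lr, p_fr)]


# Hypothetical LR and FR values for four blocks.
p_lr = upper_tail_p([0.2, 1.5, 3.1, 0.9])
p_fr = upper_tail_p([0.1, 2.0, 4.0, 0.5])
p_comb = combined_p(p_lr, p_fr)
```

Multiplying the two p-values treats LR and FR as approximately independent, which the text supports by the low correlation between them.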
FR and LR in genic and non-genic regions
We first examined the distribution of FR and LR
values relative to some general features in the genome.
Across all populations, 55% (155 558/284 953) of the
blocks included at least part of a gene, whereas 45% (129 395/284 953) consisted of only intergenic regions.
The median FR is lower for blocks containing genes,
both when each population is considered separately
and when all populations are analyzed together
(Table 1). The fraction of significant (p < 0.01) blocks
containing genes is higher in all but the African
American population (Table 1). Since a low FR
indicates a small differentiation between populations relative to the genome average, the higher average FR
may indicate more drift in intergenic regions. If
positive selection is acting specifically on genic rather
than intergenic regions, this could result in a higher
number of significant FR values for the blocks
containing genes, as is seen in our data. In contrast
to FR, a low LR reflects blocks that are relatively
shorter in a population, whereas blocks with a LR around zero indicate similarity in block distribution
among populations. Therefore, the median value of
LR for blocks in genic and intergenic regions is not of
interest. The fraction of significant combined p-values
(p < 0.05) is higher for blocks with genes as compared
to blocks without genes for Han Chinese and European
Americans (Table 1).

Fig. 2. A–C. LR and FR distributions. In each panel the y-axis shows the fraction of blocks; the x-axis shows LR (A, B) or FR (C). (A) LR distribution. LRs for haplotype blocks as a function of selection coefficient are calculated for simulated datasets with different selection coefficients (s = 0, s = 0.01, s = 0.02 and s = 0.04) for an effective population size of Ne = 10 000. The solid grey lines represent the LR distributions for the empirical data from the African American, European American and Han Chinese populations. (B) LR distribution for haplotype blocks as a function of effective population size. LRs are calculated for simulated datasets with Ne = 1000, Ne = 5000 or Ne = 10 000 and s = 0 or s = 0.04. The three solid lines (almost overlapping) represent the simulated dataset without selection for three different effective population sizes. The other lines represent the different effective population sizes and s = 0.04. (C) FR distribution. Empirical FST ratios (FRs) for the blocks in African Americans, European Americans and Han Chinese.
Fig. 3. a–c. Length ratio (LR) vs FST ratio (FR) for the three human populations studied. Haplotype blocks including a gene or part of a gene and exhibiting a positive LR and FR in (a) African Americans, (b) European Americans and (c) Han Chinese. The red dots represent blocks with genes with a combination of LR and FR values that are significant at a genome-wide level of p < 0.01.

Table 1. Analysis of the distribution of median FR, fraction of significant FR blocks and combined probabilities of
LR and FR in genic and non-genic regions¹.

Comparison                          African          European         Han              All
                                    American (AA)    American (EA)    Chinese (HC)
I². FR median
    Genic                           −0.054           −0.063           −0.072           −0.062
    Non-genic                       −0.083           −0.073           −0.096           −0.083
    P-value                         4.5×10⁻⁴         3.8×10⁻²         1.5×10⁻²         3.85×10⁻⁶
II³. Fraction of significant FR
    Genic                           0.0100           0.0110           0.0114           0.0108
    Non-genic                       0.0100           0.0087           0.0083           0.0091
    P-value                         ns               1.43×10⁻⁴        2.72×10⁻⁶        1.86×10⁻⁶
III⁴. Combined P-value
    Genic                           0.0441           0.0575           0.0617           0.0540
    Non-genic                       0.0485           0.0548           0.0551           0.0527
    P-value                         0.0005           0.040            2.43×10⁻⁵        0.072

1) Abbreviations: ns = not significant; FR = FST ratio.
2) I. The median FR was calculated for blocks containing at least part of a gene and for blocks not containing any part of a gene. p-values are calculated by comparing the median between the two groups of blocks using the Mann-Whitney rank sum test.
3) II. The fraction of FRs with p < 0.01 was calculated for blocks containing at least part of a gene and for blocks not containing any part of a gene. p-values were calculated by comparing the number of significant genes within each group using χ² statistics.
4) III. Combined LR and FR probabilities. The fraction of combined values of p < 0.05 was calculated for blocks containing at least part of a gene and for blocks not containing any part of a gene. p-values were calculated by comparing the number of significant genes within each group using χ² statistics.

Candidate genes for positive selection
All genes in a haplotype block were assessed for their
FR and LR and the combined (FR and LR) p-value calculated for that block. Among the total number of
284 953 blocks (100 481, 100 450 and 84 022 for AA,
EA and HC, respectively), 55% included at least part of
a gene or predicted gene. Since both block length and
gene density vary across the genome, the number of
genes per block will vary and some genes will be part
of more than one block. In order to determine a gene-
specific LR and FR, we identified all blocks to which
each gene found in a population belonged and chose
the largest LR, FR, and combined p-value to represent
the gene. These analyses include 22 737, 22 807
and 22 665 genes for AA, EA and HC, respectively.
After removing duplicates, some genes were still part
of the same block, and the final analysis resulted in
15 855, 15 715 and 15 577 blocks containing the genes
for each population, respectively. Using a criterion of
p = 0.01 for genome-wide significance (0.01/15 855 = 6.31×10⁻⁷, 0.01/15 715 = 6.36×10⁻⁷ and 0.01/15 577 = 6.42×10⁻⁷ for AA, EA and HC, respectively)
on the combined p-value for LR and FR
resulted in a list of 31 genes (Table 2) located in 23
blocks (9, 1 and 13 for each population). The number of
significant blocks differs widely between populations.
This could be due to a bias in the dataset, but
considering the similarity between the distributions
of LR and FR between populations (Fig. 2, 3), it is more likely to be due to stochastic variation. The
individual genes on these two lists will be addressed
further in the discussion.
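The genome-wide thresholds quoted above follow from dividing the nominal level by the number of gene-containing blocks tested in each population. A small sketch that reproduces the arithmetic (the function and variable names are hypothetical):

```python
def genome_wide_threshold(alpha, n_tests):
    """Per-test significance threshold after dividing alpha by the
    number of tests."""
    return alpha / n_tests


# Number of gene-containing blocks per population, as given in the text.
n_blocks = {"AA": 15855, "EA": 15715, "HC": 15577}

thresholds = {
    pop: genome_wide_threshold(0.01, n) for pop, n in n_blocks.items()
}

for pop, value in thresholds.items():
    print(f"{pop}: {value:.2e}")
# AA: 6.31e-07
# EA: 6.36e-07
# HC: 6.42e-07
```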
Correlation with other studies of genes under positive
selection
A number of studies have searched for genes under
positive selection, either those that have been affected
by selection on the lineage leading to modern humans
or those affected more recently during human evolu-
tion (CLARK et al. 2003; BUSTAMANTE et al. 2005;
NIELSEN et al. 2005; VOIGHT et al. 2006; WANG et al.
2005). The loci implicated in studies of selection on the
lineage leading to modern humans show little overlap
with the genes on our lists. However, when comparing
with studies of more recent events of selection, the
overlap is larger. There is a total of 34 genes (GNA14,
PQLC1, MYH9, MYEF2, LOC400369, SLC12A1,
AQP1, LOC90193, PRKG2, HERC1, LOC439940,
EDAR, SULT1C2, APOL4, PTF1A, C10orf115,
C10orf67, MKRN2, RAF1, FLJ11036, C9orf82,
C8orf7, EPHB1, SULT1C1, DAPK2, UBXD2, LCT,
CBARA1, CCDC2, LRRC19, GCC2, MGC10701,
LIMS1, R3HDM) located in 25 different regions
that overlap between our top 1% candidates and the
top 250 regions suggested to be under selection by
VOIGHT et al. (2006). This overlap is not surprising
given that the methods used have some similarities.
There is also an overlap of 12 genes (CLSPN, EIF2C4,
EIF2C1, EIF2C3, KCNH7, USP3, HERC1, EDAR,
SULT1C2, DAPK2, GCC2, LIMS1) between our top
1% candidates and the 181 genes proposed by Carlson
and colleagues (CARLSON et al. 2005). This is a higher
overlap than expected by chance (5 overlapping
regions out of 48 proposed by Carlson et al. as
compared to 1% expected by chance, p = 0.0139, χ²
test), even though Carlson and colleagues used an
approach based on Tajima’s D that is conceptually
different from ours.

Table 2. Genes identified by our method, proposed to have been under positive selection in different human
populations¹.

Pop   Chromosome   FR     LR     p-value²      Genes
AA    6            5.85   2.33   2.15×10⁻⁶     TCBA1
AA    7            5.89   2.68   1.90×10⁻⁶     AIP1
AA    8            6.44   2.88   6.85×10⁻⁸     CSMD1
AA    8            4.61   3.53   7.24×10⁻⁷     PSD3
AA    11           5.78   2.17   2.92×10⁻⁶     NELL1
AA    14           5.10   3.94   1.84×10⁻⁶     NPAS3
AA    15           9.15   2.16   7.02×10⁻¹⁴    CYP19A1
AA    16           5.21   2.03   8.92×10⁻⁷     LOC441745
AA    20           4.24   2.71   2.92×10⁻⁶     NFATC2
EA    9            2.78   3.97   2.70×10⁻⁶     C9orf121
HC    2            4.44   3.02   1.09×10⁻⁶     CPS1
HC    2            3.87   3.28   2.47×10⁻⁶     EDAR³
HC    2            4.61   2.79   1.32×10⁻⁶     LOC375295 FUCA1P
HC    5            3.55   3.49   2.65×10⁻⁶     GALNT10
HC    5            4.17   3.25   1.17×10⁻⁶     GRIA1
HC    8            4.37   2.88   2.13×10⁻⁶     LOC439940
HC    9            2.65   4.22   8.97×10⁻⁷     LOC401539
HC    10           3.64   3.64   1.19×10⁻⁶     LOC389997 LOC387703
HC    13           4.29   3.64   1.97×10⁻⁶     ATP8A2
HC    14           4.30   3.24   1.42×10⁻⁶     ACTN1
HC    14           4.30   3.07   1.42×10⁻⁶     RPS29P1 WDR22 DDX18P1
HC    15           4.63   3.69   2.83×10⁻⁷     HERC1³
HC    19           4.16   3.17   1.60×10⁻⁶     FUT2 FLJ36070 RASIP1 MGC34799 FUT1

1) Abbreviations: Pop = population; FR = FST ratio; LR = length ratio; AA = African American; EA = European American; HC = Han Chinese.
2) All p-values are genome-wide significant, p < 0.01.
3) Overlaps with genes suggested to have been under selection in earlier genome-scans based on other methods (CARLSON et al. 2005; VOIGHT et al. 2006).
Combining methods for studying genes under positive
selection
The somewhat low overlap of genes under positive
selection for different identification methods is not
surprising. Even though many of the methods are
somewhat similar, different populations and SNP data
have been used, and both the size of the window used
to scan the genome for particular features and the
allele frequency spectra vary. In an attempt to study
the overlap between the genes identified by our method and those determined by Voight and colleagues
(VOIGHT et al. 2006), we analyzed all our top 1%
blocks using the iHS method (VOIGHT et al. 2006).
Out of the top 1% regions for AA, EA and HC
populations (159, 157 and 156, respectively), 19, 46
and 42 regions are identified as candidates for selection
(p < 0.05) using the iHS method, clearly higher
than the number expected by chance. However, if we apply a genome-wide significance level of p = 0.01,
only 4, 29 and 17 regions, for AA, EA and HC
respectively, remain as candidates for positive selection
(Table 3).
DISCUSSION
Many studies of selection in the human genome have
been based on comparisons of the human and chimpanzee genomes and have therefore addressed
selection over 5–7 myr. We have focused on events that
occurred in association with the evolution of major
human population groups over the last 100 000 years.
The genetic differentiation of human populations is
the result of both selective and stochastic forces. Many
adaptations, such as those resulting from spatial and
temporal variation in climate, exposure to pathogens and diet, may have been restricted to particular
populations and are therefore likely to remain un-
detected by comparative genomic studies.
Positive selection acting on a locus is expected to
result in a more rapid fixation of alleles and conse-
quently less variation around the site under selection.
A means to identify chromosomal regions subjected to
positive selection is therefore to examine the pattern ofDNA polymorphism combined with the haplotype
structure. The FST for individual nucleotide sites often
has a high variance (AKEY et al. 2002; WEIR et al.
2005) but since variation at neighboring sites is often
correlated, we instead estimated the FST for haplotype
blocks. Positive selection for an allele in a large
population is expected to result in regions with
extended haplotypes due to lower levels of genetic
variation than expected by random genetic drift.
Therefore, one characteristic of genomic regions under
recent positive or negative selection is large haplotype
blocks of reduced diversity and a high correlation
between the FST of proximate SNPs. The genome is
known to contain recombination hotspots located
between regions with higher LD (ALTSHULER et al.
2005). By focusing on haplotype blocks with reduced
diversity (regardless of LD) rather than studying
extended haplotypes, the results should be less sensitive
to variation in the recombination rate between
populations.
Table 3. The genes among our top 1% list of candidates that are proposed to have been under positive selection when evaluated using the iHS method.
Pop 1)  P-value   Genes
AA      2.80E-03  AIP1
AA      1.91E-02  CNTN5
AA      2.37E-04  CSMD1
AA      1.97E-03  RORA
EA      2.25E-06  A2BP1
EA      1.04E-10  ACTBP4
EA      1.08E-06  AIP1
EA      3.54E-02  APBA2
EA      2.32E-04  ARHGAP26
EA      1.20E-02  BNC2
EA      3.76E-05  DCC
EA      2.84E-04  DKFZP566N034
EA      4.73E-02  DOCK4
EA      2.62E-03  FLJ10159
EA      2.01E-03  KIAA0861
EA      2.83E-02  KIAA1889
EA      7.15E-02  KIAA2026
EA      2.09E-08  LCT
EA      6.56E-03  LOC284788
EA      1.65E-05  LOC440867
EA      7.41E-04  LOC90193
EA      9.72E-02  MYEF2, SLC24A5, LOC400369
EA      1.56E-06  PEPP2
EA      1.27E-03  PPP2R2B
EA      1.30E-03  PRDM10
EA      5.60E-06  PRKG2
EA      1.94E-04  PSD3
EA      3.25E-14  R3HDM
EA      7.52E-04  SGCZ
EA      8.93E-04  SLC12A1
EA      1.89E-02  SMYD3
EA      5.05E-03  TCBA1
EA      6.42E-06  UBXD2
HC      3.45E-03  ATP1B3P1, PAPOLG, LOC130865
HC      4.01E-03  C2orf23
HC      1.10E-02  C6orf176
HC      1.80E-02  C8orf21
HC      4.86E-06  DAB1
HC      1.07E-02  EPHB1
HC      9.60E-03  FHIT
HC      1.89E-03  FLJ11036, MKRN2, RAF1
HC      4.13E-03  LOC131368
HC      4.49E-05  LOC442008
HC      2.47E-06  LRPPRC
HC      2.97E-02  MGC42105
HC      1.06E-02  NAV2
HC      2.42E-04  SEMA3E
HC      5.48E-12  SMC6L1, FLJ40869, LOC343930
HC      5.34E-02  TCBA1
HC      2.45E-06  TRPC6
1) Abbreviations: Pop = population; AA = African American; EA = European American; HC = Han Chinese.
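The motivation for estimating FST over a haplotype block rather than at single sites can be illustrated with a short sketch. It is not the estimator used in this study, which is not restated in this section. It assumes biallelic SNPs, two equally weighted populations and the textbook definition FST = (HT - HS) / HT, and it pools the SNPs of a block by dividing summed numerators by summed denominators. The allele frequencies are hypothetical.

```python
def fst_components(p1, p2):
    """Return (HT - HS, HT) for one biallelic SNP in two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                    # expected heterozygosity, pooled
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    return h_total - h_within, h_total


def snp_fst(p1, p2):
    """Single-site FST; defined as 0 when the site is monomorphic overall."""
    num, den = fst_components(p1, p2)
    return num / den if den > 0 else 0.0


def block_fst(freqs_pop1, freqs_pop2):
    """Block-level FST as a ratio of sums over all SNPs in the block."""
    num_sum = 0.0
    den_sum = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        num, den = fst_components(p1, p2)
        num_sum += num
        den_sum += den
    return num_sum / den_sum if den_sum > 0 else 0.0


# Hypothetical allele frequencies for four SNPs in one haplotype block
pop_a = [0.90, 0.85, 0.95, 0.60]
pop_b = [0.20, 0.30, 0.25, 0.55]

print("per-SNP FST:", [round(snp_fst(a, b), 3) for a, b in zip(pop_a, pop_b)])
print("block FST:  ", round(block_fst(pop_a, pop_b), 3))
```

Pooling over the block damps the influence of a single outlying SNP, which is the practical reason for preferring a block-level estimate when per-site values have high variance.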
False and true positive rates for genes under selection
Approaches to identify genes under positive selection
have to consider the relative contributions of genetic
drift and natural selection on the genetic variability
pattern. Most studies focus on the top 1% of
candidates, but are unable to distinguish between the
alternative explanations of positive selection and neutral
evolution. In our simulations of the length of blocks
with reduced haplotype diversity (LR), we observe
that a selective sweep with a selection coefficient s = 0.01
will result in an average LR = 1.7 (Fig. 4).
However, when no selection is acting (s = 0), about
10% of the simulated data exhibited an LR > 1.7.
Therefore, it is not possible to distinguish with
certainty between blocks that are under selection and
those evolving under neutrality. The number of genes
affected by positive selection in the human genome
is unknown, but it has been suggested that as many as
3% of genes have been subjected to recent positive
selection (EBERLE et al. 2006). If we assume that 3% is
the true fraction, it follows that roughly 660 of the
about 22 000 genes in the human genome have been
subjected to positive selection. Assuming that those
660 genes have a selection coefficient of s = 0.01 and
using LR > 3 as the threshold for identification of
genes under positive selection, then 1% of the neutrally
evolving genes (1% × 97% × 22 000 = 213) and 22% of
the genes truly under selection (22% × 660 = 145) will
be suggested to be candidates for positive selection
(Fig. 4). Using more stringent criteria for the LR
cutoff will result in a higher ratio of true to false
positives, but also a larger number of false negatives.
For example, using LR > 4.5 results in 0.1% (21 genes)
of the neutrally evolving genes and 5.9% (45 genes) of
the positively selected genes being identified as
candidates. Assuming a selection
coefficient of s = 0.01 on 3% of the genes in the human
genome may also be an overestimation, in which case
the ratio of false positive to true positive
selected genes would be even higher. Similar to these estimations, the
frequency of false positives among top candidates
for positive selection is probably high for most
methods developed for scanning the genome for
positive selection. At the same time, most of the
genes that have undergone positive selection have
not been detected. This is the most likely explanation
for the low extent of overlap of candidate genes
between different studies and methods (BISWAS and
AKEY 2006; SABETI et al. 2006, 2007), even though the
power to detect almost complete selective sweeps has
been shown to be high (SABETI et al. 2007). Many
selective events are likely to be weak and their
signatures can easily be eradicated by genetic drift.
Using a combined approach based on several genomic
characteristics in searching for genes subjected to
recent positive selection is one way of reducing
the number of both false positive and false negative
genes.
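The expected numbers of false and true positives quoted above follow from simple arithmetic, which the sketch below reproduces. It uses only figures stated in the text (22 000 genes, 3% under selection, and the fractions exceeding each LR threshold). The function name expected_candidates and the false discovery proportion it prints are introduced here for illustration. With the 5.9% rate quoted for LR > 4.5, the same arithmetic gives about 39 selected genes rather than the 45 stated in the text, so one of the two figures may contain a slip; the sketch uses the percentages as given.

```python
N_GENES = 22_000          # approximate number of human genes, as in the text
FRACTION_SELECTED = 0.03  # assumed fraction of genes under recent positive selection


def expected_candidates(neutral_rate, selected_rate,
                        n_genes=N_GENES, frac_selected=FRACTION_SELECTED):
    """Return (expected false positives, expected true positives) for one LR cutoff."""
    n_selected = n_genes * frac_selected
    n_neutral = n_genes - n_selected
    return neutral_rate * n_neutral, selected_rate * n_selected


# (cutoff, fraction of neutral genes above it, fraction of selected genes above it)
thresholds = [("LR > 3", 0.01, 0.22), ("LR > 4.5", 0.001, 0.059)]

for label, neutral_rate, selected_rate in thresholds:
    false_pos, true_pos = expected_candidates(neutral_rate, selected_rate)
    fdp = false_pos / (false_pos + true_pos)  # proportion of candidates that are neutral
    print(f"{label}: {false_pos:.0f} false positives, {true_pos:.0f} true positives, "
          f"false discovery proportion {fdp:.2f}")
```

The stricter cutoff lowers the proportion of neutral genes among the candidates, at the cost of missing most of the genes assumed to be under selection.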
Genes within our top 1% candidates likely to have been
under selection
Our analysis of the haplotype LR and FR in three
major populations resulted in the identification of a
number of genes that are candidates for having been
affected by selection, even though their p-values do
not reach genome-wide significance. The lactase gene
(LCT) appears on the top 1% list of candidates for
positive selection in European Americans. The LCT
gene is likely to have been under positive selection due
to the increased nutrition when consuming dairy
products, which were introduced to humans during
cattle domestication in the Near East, about 9 000
years ago. Another interesting candidate gene is
MCPH1, involved in regulation of brain size. Our
analysis indicates that this gene has been under selec-
tion in European Americans, and it has earlier been
suggested to be both under negative selection on the
human lineage (BUSTAMANTE et al. 2005) and under
positive selection in Caucasians (EVANS et al. 2005).
Fig. 4. Distribution of LR for neutrally evolving sites compared to sites with a selection coefficient of s = 0.01. For the neutrally evolving genes, only 1% will exhibit an LR above 3 compared to 22% of the genes with a selection coefficient of 0.01.
The polymorphism in MCPH1 proposed to be under
selection in Caucasians was estimated to have arisen
approximately 37 000 years ago and simulations in-
dicate that it has been increasing in frequency too
rapidly to be compatible with neutral drift (EVANS
et al. 2005). One further interesting gene for which
positive selection is indicated in European Americans
is AIM1 (MATP). Polymorphism at AIM1 is asso-
ciated with normal variation in human pigmentation
(dark hair, skin and eye color in Caucasians) (GRAF
et al. 2005). A recent study also suggested that the
AIM1 gene has been subjected to positive selection
(SOEJIMA et al. 2006). Skin pigmentation is a well-
known example of genetic adaptation in humans
(CAVALLI-SFORZA et al. 1996). Both AIM1 and a
gene called OCA2, which is also among our top
candidates for being under selection in Europeans,
are implicated in different types of oculocutaneous
albinism. A mutation in OCA2 causes the most
prevalent type of oculocutaneous albinism throughout
the world and occurs at much lower frequency in
Europeans than in Africans (LEE et al. 1994).
SLC24A5, which is among our top candidates in
European Americans, has also been shown to be
involved in skin pigmentation and the allele associated
with light pigmentation is almost fixed in European
populations (LAMASON et al. 2005). A number of coat
color genes have been described in mice and we
identified 51 genes
as orthologues of genes associated with variation in
coat color in mice (http://albinismdb.med.umn.edu/genes.htm).
Among these 51 genes, 5 (RAB27A, DCT,
EGFR, ATRN, MATP) are among our top 1% candidates.
This is a significantly higher number than expected by
chance for 51 randomly chosen genes (p = 0.029, χ²
test), indicating that many of the genes involved in
human skin pigmentation have been under selection in
different populations. Recently a number of studies
have focused on genes involved in skin pigmentation in
humans (MCEVOY et al. 2006; LAO et al. 2007; MYLES
et al. 2007), also indicating that many of those genes
have been subjected to positive selection.
In addition to the genes investigated in our study, a
number of regions lacking recognized genes were also
as likely to have been under selection as the genes
discussed (data not shown). The results for some of
these regions may reflect stochastic variation, but in
general, this supports the notion that some non-
coding DNA and intergenic sequences are under
selection for functional reasons (ANDOLFATTO 2005).
One example of such a case is the LCT gene. The most
likely mutation to have been under selection in
association with the LCT gene is situated 14 kb
upstream of the LCT gene (ENATTAH et al. 2002).
In the case of LCT, the haplotype block is large
enough to include both the gene and the upstream
region, which might not be the case for other similar
situations.
Genome-wide significance of positive selection
As seen in our and other studies, most methods face
the problem that using stringent statistical thresholds
to avoid false positives results in high numbers of false
negatives. When applying thresholds for genome-wide
significance we identify 31 candidates for having been
under selection (Table 2). As an alternative means of
enriching for candidates, we applied the method based
on iHS (VOIGHT et al. 2006) to our list of top 1%
candidates, which resulted in 50 regions with a
significant p-value (Table 3). The overlap between
the regions identified by our method using a genome-
wide significance threshold (Table 2) and the applica-
tion of the iHS method to our 1% top candidates, is
restricted to AIP1 and CSMD1 in the African
American population. In addition, TCBA1, which is
on the top 1% list for all three populations, is found
among the genome-wide significant genes in African
Americans using our method, but among the signifi-
cant genes for both European Americans and Han
Chinese when applying the iHS method. Surprisingly,
HERC1 and EDAR (Table 2), which show genome-
wide significance using our method, were not signifi-
cant when we applied the iHS method, even though
they were listed as significant by other groups using
this test on other datasets (VOIGHT et al. 2006; SABETI
et al. 2007). This discrepancy probably reflects the
sensitivity of the results to the dataset used for
identifying the candidate regions. The limited sample
sizes available for these kinds of studies not only
decrease the power to identify genes under selection,
but also make the findings hard to replicate in another
sample.
Overlap with other studies
The rather small overlap between the loci indicated to
be under selection in our study and those pinpointed
previously in between-species comparisons is not
surprising. First, there is a large difference in the
time perspective of the selective events. Local adaptive
selection that affects one or a few human populations
may be quite distinct from the selective pressure that
shaped modern humans from archaic forms of Homo.
Second, since environmental factors vary between
locations it is not expected that selection will affect
all populations equally. Therefore, some of the appar-
ent differences between the results of comparative
genomic and population genetic approaches may be
due to differences in the samples included. For
example, LPP has been suggested to be under negative
selection on the human lineage (BUSTAMANTE et al.
2005) but is among our top candidates for having been
under positive selection in Han Chinese. However,
Bustamante and colleagues used only European and
African samples to represent humans in their analysis
and if local selection has only occurred on LPP in the
Han Chinese, this would not have been detected in their
study. We observe a larger overlap with studies
focusing on individual human populations. CARLSON
et al. (2005) based their study on Tajima’s D
statistic and used a sliding window technique to
identify candidate regions. Tajima’s D is conceptually
unrelated to our method and it is therefore interesting
that we have a larger overlap than expected by chance
between our list of genes and that of CARLSON et al.
(2005). Since we are considering both the allele
frequency differences between human populations
and the deviation in length of haplotype blocks, we
will preferentially identify genes that have been under
selection after the separation of the three major
human groups.
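For readers unfamiliar with the statistic used by CARLSON et al. (2005), the sketch below computes Tajima's D for one window of haplotypes using the standard formulas of TAJIMA (1989). It is not the pipeline of CARLSON et al. or of this study. The function name tajimas_d, the coding of haplotypes as strings of "0" and "1", and the toy data are assumptions made for illustration.

```python
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for equal-length haplotypes coded as strings of '0' and '1'."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("at least 4 haplotypes are needed for a non-zero variance")
    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pi = 0.0  # mean number of pairwise differences
    for site in range(len(haplotypes[0])):
        k = sum(h[site] == "1" for h in haplotypes)
        if 0 < k < n:
            seg_sites += 1
            pi += k * (n - k) / n_pairs
    if seg_sites == 0:
        return 0.0  # no polymorphism in the window; D is set to 0 by convention here
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy windows: an excess of rare variants versus intermediate-frequency variants
rare_variants = ["10000", "01000", "00100", "00010", "00001"]
balanced_variants = ["11", "11", "00", "00"]

print("rare variants:    D =", round(tajimas_d(rare_variants), 2))
print("balanced variants: D =", round(tajimas_d(balanced_variants), 2))
```

An excess of rare variants, as expected after a recent sweep, gives a negative D, while intermediate-frequency variants give a positive D. A sliding-window scan would apply the same function to successive windows along the chromosome.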
In summary, we have developed and applied a new
method for identifying loci that are candidates for having been under
selection at different time points during the evolution
of human populations. To identify the specific loci and
polymorphisms under selection requires further stu-
dies of the sequence variability and natural history of
different populations. None of the methods available
for identifying genes under selection is resistant to
errors, but the use of both haplotype architecture and
population differentiation increases our ability to
identify genomic regions that have been affected by
non-random forces. These methodologies provide
focal points in the genome for future studies of the
evolutionary events that have shaped modern human
populations as they explored different parts of the
world. One continuation of these studies is to evaluate
the top candidates by resequencing the genes in a
number of populations. Resequencing is still both
time-consuming and expensive but with the rapid
progress in high-resolution genotyping of individuals
from different populations (ALTSHULER et al. 2005;
HINDS et al. 2005) and the availability of new
techniques for high throughput genomic resequencing
(BENNETT et al. 2005; MARGULIES et al. 2005), the
amount of genome data is growing exponentially and
thereby also the potential for further evaluation of
genes that have been under selection in humans.
Acknowledgements – This study was supported by grants from the Swedish Natural Sciences Research Council. AJ is affiliated to The Linnaeus Centre for Bioinformatics, Uppsala University, Sweden.
REFERENCES
Akey, J. M., Zhang, G., Zhang, K. et al. 2002. Interrogating a high-density SNP map for signatures of natural selection. – Genome Res. 12: 1805–1814.
Altshuler, D., Brooks, L. D., Chakravarti, A. et al. 2005. A haplotype map of the human genome. – Nature 437: 1299–1320.
Andolfatto, P. 2005. Adaptive evolution of non-coding DNA in Drosophila. – Nature 437: 1149–1152.
Beaumont, M. A. and Balding, D. J. 2004. Identifying adaptive genetic divergence among populations from genome scans. – Mol. Ecol. 13: 969–980.
Bennett, S. T., Barnes, C., Cox, A. et al. 2005. Toward the 1,000 dollars human genome. – Pharmacogenomics 6: 373–382.
Biswas, S. and Akey, J. M. 2006. Genomic insights into positive selection. – Trends Genet. 22: 437–446.
Bustamante, C. D., Fledel-Alon, A., Williamson, S. et al. 2005. Natural selection on protein-coding genes in the human genome. – Nature 437: 1153–1157.
Carlson, C. S., Thomas, D. J., Eberle, M. A. et al. 2005. Genomic regions exhibiting positive selection identified from dense genotype data. – Genome Res. 15: 1553–1565.
Cavalli-Sforza, L. L., Menozzi, P. and Piazza, A. 1996. The history and geography of human genes. – Princeton Univ. Press.
Clark, A. G., Glanowski, S., Nielsen, R. et al. 2003. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. – Science 302: 1960–1963.
Eberle, M. A., Rieder, M. J., Kruglyak, L. et al. 2006. Allele frequency matching between SNPs reveals an excess of linkage disequilibrium in genic regions of the human genome. – PLoS Genet. 2: e142.
Enattah, N. S., Sahi, T., Savilahti, E. et al. 2002. Identification of a variant associated with adult-type hypolactasia. – Nat. Genet. 30: 233–237.
Evans, P. D., Gilbert, S. L., Mekel-Bobrov, N. et al. 2005. Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans. – Science 309: 1717–1720.
Fay, J. C. and Wu, C. I. 2000. Hitchhiking under positive Darwinian selection. – Genetics 155: 1405–1413.
Fu, Y. X. and Li, W. H. 1993. Statistical tests of neutrality of mutations. – Genetics 133: 693–709.
Graf, J., Hodgson, R. and Van Daal, A. 2005. Single nucleotide polymorphisms in the MATP gene are associated with normal human pigmentation variation. – Hum. Mutat. 25: 278–284.
Halperin, E. and Eskin, E. 2004. Haplotype reconstruction from genotype data using imperfect phylogeny. – Bioinformatics 20: 1842–1849.
Hamblin, M. T. and Di Rienzo, A. 2000. Detection of the signature of natural selection in humans: evidence from the Duffy blood group locus. – Am. J. Hum. Genet. 66: 1669–1679.
Hamblin, M. T., Thompson, E. E. and Di Rienzo, A. 2002. Complex signatures of natural selection at the Duffy blood group locus. – Am. J. Hum. Genet. 70: 369–383.
Hanchard, N. A., Rockett, K. A., Spencer, C. et al. 2006. Screening for recently selected alleles by analysis of human haplotype similarity. – Am. J. Hum. Genet. 78: 153–159.
Harr, B., Kauer, M. and Schlotterer, C. 2002. Hitchhiking mapping: a population-based fine-mapping strategy for adaptive mutations in Drosophila melanogaster. – Proc. Natl Acad. Sci. USA 99: 12949–12954.
Hinds, D. A., Stuve, L. L., Nilsen, G. B. et al. 2005. Whole-genome patterns of common DNA variation in three human populations. – Science 307: 1072–1079.
Hollox, E. J., Poulter, M., Zvarik, M. et al. 2001. Lactase haplotype diversity in the Old World. – Am. J. Hum. Genet. 68: 160–172.
Lamason, R. L., Mohideen, M. A., Mest, J. R. et al. 2005. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. – Science 310: 1782–1786.
Lao, O., De Gruijter, J. M., Van Duijn, K. et al. 2007. Signatures of positive selection in genes associated with human skin pigmentation as revealed from analyses of single nucleotide polymorphisms. – Ann. Hum. Genet. 71: 354–369.
Lee, S. T., Nicholls, R. D., Schnur, R. E. et al. 1994. Diverse mutations of the P gene among African-Americans with type II (tyrosinase-positive) oculocutaneous albinism (OCA2). – Hum. Mol. Genet. 3: 2047–2051.
Margulies, M., Egholm, M., Altman, W. E. et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. – Nature 437: 376–380.
McEvoy, B., Beleza, S. and Shriver, M. D. 2006. The genetic architecture of normal variation in human pigmentation: an evolutionary perspective and model. – Hum. Mol. Genet. 15 Spec. No. 2: R176–181.
Myles, S., Somel, M., Tang, K. et al. 2007. Identifying genes underlying skin pigmentation differences among human populations. – Hum. Genet. 120: 613–621.
Nielsen, R., Bustamante, C., Clark, A. G. et al. 2005. A scan for positively selected genes in the genomes of humans and chimpanzees. – PLoS Biol. 3: e170.
Sabeti, P. C., Reich, D. E., Higgins, J. M. et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. – Nature 419: 832–837.
Sabeti, P. C., Schaffner, S. F., Fry, B. et al. 2006. Positive natural selection in the human lineage. – Science 312: 1614–1620.
Sabeti, P. C., Varilly, P., Fry, B. et al. 2007. Genome-wide detection and characterization of positive selection in human populations. – Nature 449: 913–918.
Schlotterer, C. 2003. Hitchhiking mapping – functional genomics from the population genetics perspective. – Trends Genet. 19: 32–38.
Soejima, M., Tachida, H., Ishida, T. et al. 2006. Evidence for recent positive selection at the human AIM1 locus in a European population. – Mol. Biol. Evol. 23: 179–188.
Spencer, C. C. and Coop, G. 2004. SelSim: a program to simulate population genetic data with natural selection and recombination. – Bioinformatics 20: 3673–3675.
Storz, J. F. 2005. Using genome scans of DNA polymorphism to infer adaptive population divergence. – Mol. Ecol. 14: 671–688.
Stringer, C. B. and Andrews, P. 1988. Genetic and fossil evidence for the origin of modern humans. – Science 239: 1263–1268.
Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. – Genetics 123: 585–595.
Voight, B. F., Kudaravalli, S., Wen, X. et al. 2006. A map of recent positive selection in the human genome. – PLoS Biol. 4: e72.
Wang, E. T., Kodama, G., Baldi, P. et al. 2006. Global landscape of recent inferred Darwinian selection for Homo sapiens. – Proc. Natl Acad. Sci. USA 103: 135–140.
Weir, B. S., Cardon, L. R., Anderson, A. D. et al. 2005. Measures of human population structure show heterogeneity among genomic regions. – Genome Res. 15: 1468–1476.
Wright, S. 1950. Genetic structure of populations. – Br. Med. J. 4669: 36.