Breeding for pepper fruit quality: A genetical metabolomics approach
The Genetical Bandwidth Mapping: a spatial and graphical...
Transcript of The Genetical Bandwidth Mapping: a spatial and graphical...
The Genetical Bandwidth Mapping: a spatial and graphical representation of
population genetic structure based on the Wombling method
A. Cercueil1, O. François1 and S. Manel2
1TIMC-TIMB, Faculté de Médecine, F38706 La Tronche cedex, France
2Laboratoire d'Ecologie Alpine, CNRS UMR 5553, Université Joseph Fourier, BP 53, F-
38041 Grenoble cedex 9, France
Corresponding author:
S. Manel
Tel: 33 (0) 4 76 51 42 78
Fax: 33 (0) 4 76 51 42 79
email: [email protected]
Running head: genetical bandwidth mapping
1
SUMMARY
Characterizing the spatial variation of allele frequencies in a population has a wide range of
application in population genetics. This article introduces a new nonparametric method which
provides a two-dimensional representation of a structural parameter called the genetical
bandwidth. The genetical bandwidth is based on the computation of Womble’s systemic
function, and describes the genetic structure around arbitrary spatial locations in a study area.
It corresponds to the shortest distance to areas of significant variation of allele frequencies.
Simulations demonstrate that the method is able to locate genetic boundaries or clines. Two
applications are presented using datasets taken from the literature: the D. melanogaster Adh
F/S locus and the HGDP-CEPH modern human dataset.
Key-words: spatial genetic structure, Wombling, systemic function
2
1- INTRODUCTION
Wombling methods aim at detecting regions of abrupt changes in maps of biological
variables. They have been introduced by Womble (1951), and refined afterwards by Barbujani
et al. (1989) and Bocquet-Appel & Bacro (1994). In the original approach, Womble assumed
the knowledge of surfaces derived from the variables of interest (e.g. allele frequency
surfaces), and computed the gradient of these surfaces. The norms of the gradients were then
averaged to form a new surface called the systemic map (or systemic function). Zones of
rapid changes could therefore be identified as regions given high values by the systemic
function. These regions have been called boundaries.
However the surfaces considered in the original approach are rarely available in real data
analysis. Instead of surfaces the variables of interest are actually measured at scarce
geographic locations. Implementing Wombling methods therefore requires the preliminary
inference of such surfaces from partial informations. Regular experimental designs are usually
addressed with a technique called lattice Wombling. In this approach a lattice tessellates the
space into rectangular regions termed pixels, and the variables of interest are assigned to the
center of the pixel. The rates of change of the systemic function can then be computed either
as first derivatives among adjacent pixels or second partial derivatives (Laplacian) (Fagan et
al. 2003; Fortin & Dale 2006; Jacquez et al. 2000). The computation uses kernel methods that
operate in windows of m pixels where m is a predefined parameter. Irregular experimental
designs are processed with another technique called triangulation Wombling based on
triangular kernels (Fortin 1994). Once the systemic function and the rates of changes are
estimated, the next step is to discriminate between true boundaries and those due to sampling
errors. To do so, Barbujani et al. (1989) assess the significance of the boundaries by using
randomization procedures. These procedures detect whether the highest rates of change are
higher than expected under the absence of structure. Their method has been applied to the
analysis of allele frequencies, and has been able to detect zones of abrupt change within
Human and Drosophila.
In summary, the need for statistical reconstruction from scarce observations generates a
difficult issue in Wombling computer programming. As lattice Wombling assigns the
observations to a regular lattice, it may be subject to bias. On the other hand, triangulation
Wombling estimates the systemic function from three data only, and is then prone to errors
due to statistical variability. The reconstruction issue is closely related to the choice of the
spatial scales at which the estimation procedures (kernels) are implemented. For instance, the
3
rates of change are strongly dependent on the pixel size (Fagan et al. 2003; Jacquez et al.
2000). This has been recently addressed by a method called hierarchical Wombling (Csillag &
Kabos 2002) that computes maps at several scales by varying the pixel size. But hierarchical
Wombling results in as many maps as different pixel sizes are used, and leaves us with the
dilemma to decide which pixel size is the most appropriate to interpret the data.
This article addresses the above decision problem from a different perspective. The
approach introduced here, named the Genetical Bandwidth Mapping (GBM), is a
nonparametric technique that deals with allele frequency. The GBM estimates the systemic
function at several scales in order to provide a local characterization of the genetic structure
around any arbitrary spatial location in a study area. The quantities displayed by the GBM are
not systemic values themselves, but the new quantities called genetical bandwidths. The GBM
avoids the issue of choosing among of multiple maps by computing the local parameters
which in turn can be interpreted as the shortest distances to areas of significant variation in
allele frequencies. This article also evaluates the capabilities of the GBM to detecting and
locating genetic structures such as boundaries or clines using multilocus genotypes.
2- THEORY
We consider a biological population living in a two dimensional habitat, and assume that n
individuals have been sampled from this population according to a random or regular
experimental design. In what follows, each location (xi, yi) is associated with a genetic
observation gi called the multilocus genotype that indicates the presence of specific alleles at
multiple DNA loci.
Typically the coordinates (xi, yi) represent either the location of an individual labelled i at
its instant of observation, or the location of its habitat. The GBM computes a critical
parameter (the genetical bandwidth) at each site of a grid that covers the study area. In the
sequel the grid sites are denoted (x,y), and may differ from the sampled locations (xi yi). This
section presents a formal description of the bandwidths, and describes the statistical principles
underlying the GBM. Mathematical details are deferred to the electronic supplementary
material (ESM1).
4
(a) Definition of bandwidths
In the GBM, gradients of allele frequencies are computed at each grid site. In this grid the
neighbourhood of a site is not defined precisely. Instead each observation receives a
weighting that depends on the distance to the current grid location (x y). More specifically the
observation i at (xi, yi) is given the following Gaussian weight wi(h) = exp(- di2/2h2 ),
where di2 represents the squared Euclidean distance between (xi, yi) and (x, y). This approach
is standard in density estimation (Silvermann 1986). The parameter h is crucial to our method.
It is called the bandwidth (or window size), and controls the exponential decay of the
Gaussian weights.
(b) The Systemic function
In order to test for the absence of local genetic structures, the GBM computes an estimate
of Womble's Systemic function at each grid site. The Systemic function S(x,y) is defined by
the following formula
∑ ∇=j
j yxfyxS ),(),(
where is the gradient of the allele frequency f),( yxf j∇ j with respect to x and y, and the sum
runs over all possible alleles j at all loci. The Systemic function is representative of the total
slope of allele frequency surfaces, and it integrates the dependencies between the genetical
and the spatial data. The allele frequencies and their derivatives at (x y) are unknown
parameters. In the GBM the derivatives are estimated from the presence/absence of alleles at
the sampled locations. The estimation technique makes use of local polynomials based on the
Gaussian weights wi(h) (see Fan & Gijbels 1996 and ESM1). This approach differs from
standard techniques (e.g. Barbujani et al. 1989) where the derivatives are usually estimated
from difference equations. Depending on the bandwidth h, the GBM builds an estimate
S(x,y,h) of S(x,y) for each (x,y). In nonparametric statistics the choice of h is usually based on
the minimization of a statistical error. Optimal choices nevertheless generate difficult
mathematical and practical problems. GBM deals with this particular problem in an original
way where the choice of h is related to critical regions of tests for the absence of structure.
(c) Testing for spatial genetic structure
Testing for genetic structure is at the heart of the GBM. The tests for absence of structure
are repeated at each site (x y) of the grid. Their objective is determining whether the estimated
systemic value S(x,y,h) reflects significant local genetic structure or not. In order to perform
the tests, the values S(x,y,h) are compared to the probability distribution of the Systemic
5
function S(x,y) obtained under the null hypothesis of absence of structure. Such a probability
distribution can be computed from a permutation procedure. The algorithm actually proceeds
with resampling from all possible genotypes, and assigns the sampled genotypes to the
individual locations. For each (x,y), replicates of systemic values are computed as described in
paragraph (b). A p-value is then computed by forming the ratio of the number of replicates
greater than the estimated value S(x, y, h) to their total number. For example figure 1
represents a sample from a population that displays a geographic structure at a two-alleles
haploid locus. The figure illustrates the basic idea underlying the principle of this test, and
suggests that it should be significant for large bandwidths (h = h1) but nonsignificant for
lower values (e.g. h = h2).
(d) Genetical bandwidths
To compute the genetical bandwidths, the type I error of the permutation test must be set
to the value α (typically α = 0.01 or α = 0.05). The tests are repeated at each grid site for
several values of h. The genetical bandwidths are then defined as the largest values of h for
which the tests are nonsignificant, i.e. the p-values are greater than α. To compute these
quantities, large bandwidths corresponding to the diameter of the study area are first tested.
The bandwidths are then decreased by one unit unless the test becomes nonsignificant.
(e) Interpretation of the GBM output
The GBM results in a two dimensional matrix that can be interpreted as a map. Such a
matrix contains the genetical bandwidths computed at each point of the grid covering the
study area. These parameters correspond to the shortest distances to the zones of significant
variation in allele frequencies. Graphical representations of the GBM therefore provide bases
for interpretation of the genetic structure of populations. In this paragraph, we give short
guidelines to the interpretation of such outputs by analysing two typical responses. For sake
of clarity we discuss one-dimensional organisation, i.e. populations that consist of
continuously distributed individuals along a line. The one dimensional hypothesis is actually
more amenable to analysis, and enables reasonable guesses of typical shapes of response.
Understanding one-dimensional shape may be also useful to better interpret cross-sections in
two-dimensional maps. In one dimension two types of response can be expected. We call
these responses the W-shaped and the V-shaped curves (figure 2). The W-shaped response
may be the signal of a cline in allele frequency while the V-shaped response is expected when
differentiated populations are isolated by a nonhabitat or unsampled area. The W-shaped
response is illustrated in figure 2a where the distribution of an haploid diallelic locus (a/A) is
6
displayed. Within areas of low frequency of allele A, the genetical bandwidths decrease as the
distance to the zone of sharp variation. The same phenomenon occurs within areas of high
frequency. Within the transition zone, the genetical bandwidths may undergo erratical
variations due to variability in geographic sampling, and the fact that the allele frequencies
are closer to 50 percent. Figure 2b gives an illustration of the V-shaped response. Two
populations are separated by an empty area, e.g. a physical barrier to gene flows. In this case
the genetical bandwidths vary linearly with the distance to the furthest population.
(f) Software
A documented computer software called GENBMAP implements the GBM in C++, and
provides a graphical interface for visualizing its outputs. The computer program has been
designed for running under the win32 operating system, and it is available from the authors’
web pages (http://www-timc.imag.fr/Olivier.Francois/ or http://www2.ujf-
grenoble.fr/leca/membres/manel.html).
3- SIMULATION STUDY
This section illustrates the behaviour of the GBM when confronted to typical spatial
structures obtained from numerical simulations. Two cases have been simulated: 1) Cline
along a specific direction (longitude), 2) Barrier to gene flows in the 2 islands model.
(a) Experimental design and simulation tools
Random generation of spatial locations and genetic data have been performed using the
statistical software R 2.2.0 (R Development Core Team, 2005). Simulations have been
replicated more than ten times in each case. Sample sizes have been increased from 200 to
1,000 and 5,000 individuals. Maps have been computed using 300 by 300 regular grids
(90,000 sites). The type I errors in permutation tests have been set to α = 0.05 and α = 0.1.
Refining the mesh grid has a direct impact on the running time which increases linearly with
the number of sites in the grid. The number of permutations has been fixed to 200, and then
increased to 500 and 1,000 without major influence on the output of the tests. With 200
permutations, the computing time for one single map approximates half an hour on a 2GHz
processor laptop computer. Additional genetic data have been generated from the 2 and 3
islands models using EASYPOP (Balloux 2001) with similar results (not reported). The
graphical results presented in this section have been chosen as representative of the majority
of the simulated cases.
7
(b) Simulation of clines
Simulated data include 12 unlinked diallelic loci (alleles are coded 0 and 1). At each
locus, the frequencies of 1’s have been varied along the horizontal direction (x coordinate),
and their dependence on x has been considered to be logistic f(x) = 1 / ( 1 + exp( - (x-a)/b ) )
with a and b specific constants set to a = 2500, 3000, 3500 and b = 200, 500 (each of the 6
combinations has been produced twice). Spatial coordinates have been obtained as mixtures
of two independent isotropic Gaussian distributions centered at x1 = 1000 and x2 = 3000 (The
mean y coordinate has been set to y = 1800, and the standard deviation has been taken equal
sd = 1000). These simulations are relevant to a clinal variation of allele frequencies as a
function of longitude or altitude.
Figure 3 displays a central horizontal cross section (y = 900) of the GBM. The cross
section exhibits a W-shaped response locating the beginning of the sharp variation around
1,500 and its end around 4,000 (figure 3). The central area of the GBM corresponds to a wide
region where the allele frequencies vary significantly. In these experiments, increasing the
type I error α had the consequence of producing wider minimal areas. Nevertheless, changing
this parameter did not influence the overall topography of the map.
(c) Simulation of two islands models and sensitivity analysis
The sensitivity of the GBM to simultaneous variation in spatial and genetic differentiation
has been assessed from simulations of the island-model. Individuals have been sampled from
two subpopulations of equal effective sizes Ne. Alleles have been simulated according to the
infinite allele model with constant mutation rate θ = 4µNe = 1 at each locus. Migration rates
M = 4mNe have been varied from 1 to 10. F-statistics (FST) have been computed using Weir
and Cockerham estimates (Weir & Cockerham 1984). In order to assess effects due to the
number of loci and the sample size, we have used L = 5, 20 and 50 loci and n = 20, 50 and
100 individuals. Spatial coordinates have been simulated as the geographic mixture of two
independent Gaussian distributions. Each subpopulation (or island) has its own spatial range,
and the two islands may intersect. In order to measure the amount of intersection between the
two islands, we have introduced the parameter r whose definition comes from discriminant
analysis. It corresponds to the ratio of the within-island variance to the between-islands
variance, and it can be interpreted as a measure of spatial differentiation. For equal within-
variances, bringing islands closer has the effect of increasing r while moving islands away
8
from each other has the reverse effect (figure 4a). For r between 0 and 4, the islands are
clearly separated. For r between 4 and 8, geographic mixing may exist and may be strong.
FST values are considered critical to the detection of population structure when the minima
of the GBM remain undetectable (flat responses). This is said to happen when the ratio of the
maximum of the GBM to its minimum value minus one is less than 5 percent. In figure 4,
these ratios have been obtained from ten replicates of the simulation scenario for each value
of r. Figure 4b represents the variation of the critical FST values as a function of r. For r in the
range (0, 4), the structure in two islands can be detected when FST’s are not lower than a value
around 0.04. In this range the method is relatively insensitive to the number of loci. As the
level of spatial mixture increases (r around 6), the performance rises with the number of loci.
With 20 loci the structure in two islands can be detected for FST values less than 0.05 - 0.06.
But the GBM is clearly sensitive to the sample size (figure 4c), and its use may be precluded
with small data samples even if two spatial clusters can be observed. For n = 20 the two
islands can actually be identified for large FST values (> 0.15) only. However, for n = 100, the
GBM is remarkably efficient at detecting structure even if the two subpopulations are
particularly merged.
4-APPLICATION TO REAL DATA
In this section, two real examples are analyzed. The first has been taken from the literature
as a case of evidence for clinal selection at the Adh F/S locus in D. melanogaster (Berry &
Kreitman 1993). The second dataset is related to the spatial genetic structure of the modern
human population (Rosenberg et al. 2002)
(a) Adh locus
Clinal selection at the Adh F/S locus in D. melanogaster was studied by Berry & Kreitman
(1993) using 113 haplotypes from 44 polymorphic markers in 1533 individual from 25
population sites. Because the original data contains latitudinal information for population
samples, individual locations have been randomly simulated from the 25 site coordinates by
adding a small amount of variability to the site latitude (s.d. = 0.5°E). Since the sites are
located on the East Cost of North America, they can be considered as almost aligned, and
longitudes have been simulated within an artificial range of (-1, +1)°N. A subsample of 1303
individuals has been used in our analysis so that the 14 most represented haplotypes have
been selected (85% of the full dataset at the end of Berry & Kreitman's article). The above
described sampling procedure does not produce bias toward the appearance of clines.
9
Regarding the Adh locus, the two dimensional GBM displays a band pattern which also
reflects the absence of sensitivity to the random sampling of longitudes. Most latitudinal
cross-sections display the same band pattern (not shown), and the curve represented in figure
5 actually corresponds to a single section of the map. This curve exhibits a W-shaped
response locating the beginning of the zone of sharp variation around the latitude 32.4°N and
the end of this zone around the latitude 41.1°E. This result gives evidence that the cline is
correctly retrieved, and that it separates two homogeneous zones in the South and the North of
the study area.
(b) CEPH Human genome dataset
The genetic structure of modern human populations has recently been investigated
without the use of predefined “groups” by Rosenberg et al. (2002). The study was based on
the HGDP-CEPH human genome diversity cell line panel (Cann et al. 2002) using genotypes
at 377 autosomal microsatellite loci in 1056 individuals distributed around the world.
Inference of genetic ancestry was performed by applying a model-based clustering algorithm
implemented in the computer program STRUCTURE (Pritchard et al. 2000) that computes
individual cluster membership coefficients. Rosenberg et al. (2002) identified six main
genetic clusters, five of which correspond to major geographic regions, and subclusters that
often correspond to individual populations.
The GBM has been applied to the Eurasia/East-Asia subset of original data. The studied
dataset contains 451 individuals which originate from 25 populations. The study area ranges
from southwestern Pakistan (latitude 24°N, longitude 66°E) to north-eastern Russia (latitude
64°N, longitude 130°E). Geographic origins and coordinates of sampled populations are
available from the CEPH website (www.cephb.fr/HGDP-CEPH-Panel/). Because the
coordinates of individuals are not known exactly, they have been simulated from the ranges
given for each sampled population. This has been done by adding small amounts of standard
deviation to the geographic coordinates. The GBM has been computed from a regular grid of
resolution 400 by 400, 200 permutations, and type I error set to α = 0.05.
The GBM output is found to be in accordance with the genetic structure of the Pakistan
and East Asia human populations (figure 6). 1) The Pakistan area is associated with high
genetic heterogeneity (and low values of the GBM). The (L) and (U) signals may be
interpreted as genetic barriers to gene flows (a double barrier). The lower separation (L) may
10
be associated with the Kalash sample which was clearly identified as a genetic isolate by
Rosenberg et al. (2002). The upper (U) may be interpreted as the separation between Xibo
and Uygur of northwestern China and the populations from southern Pakistan which speak
indo-european languages. 2) Evidence for the separation Eurasia/East Asia is provided by the
transversal separation (T), and by (U) and (V) which may be the continuation of (U). 3)
Another heterogeneous genetic area is observed in northeastern China (Y). This may be
explained by the separation between Yakut and Japanese which are populations with altaic
languages and the rest of the East-Asian samples.
5- DISCUSSION
The Genetical Bandwidth Mapping provides informations about the spatial genetic
organisation of populations. It displays these informations through the two-dimensional
graphical representation of a local structural parameter. Genetical bandwidths may be
interpreted as the shortest distances to areas of significant variation in allele frequencies. They
may also be viewed as the size of the largest neighbourhoods in which the spatial genetic
structure can be thought of as being homogeneous. The definition of homogeneity used here
entails that allele frequencies undergo very slow spatial variation within the area. Note that
this allows genetic heterogeneity to take place as well. For example, genetic admixture can be
considered as a kind of homogeneous structure provided that allele frequencies are constant
across the population. The previous definition fits well into the framework of Wombling
methods because spatial homogeneity can be measured from the values of the systemic map
S(x,y). However GBM and Wombling differ fundamentally. A major difference is that
Wombling requires the systemic function being estimated using a local parameter (e.g., a
window size). As a consequence, there is usually no unique estimation of S, but as many as
experimented window sizes. The GBM overcomes this issue by estimating critical window
sizes based on several estimates of S(x,y), and hence produces a unique map.
Simulations and sensitivity analyses have shown that GBM is efficient at detecting spatial
structures such as genetic barriers to gene flows or clines. Applications to real data have
provided additional evidence that GBM may be able to identify clinal variation in allele
frequencies (Drosophila Adh locus). Maps for the human population samples are in
accordance with the current knowledge about the genetic structure of Eurasian and Asian
populations. Genetic boundaries associated with population subdivision may induce V-shaped
responses. However the reciprocal assertion may not be true. Actually these typical responses
may not reveal the presence of genetic boundaries unambiguously, as clines may also produce
11
similar signals. In addition, some boundaries may remain undetectable. This happens when
the effect of boundaries operates at scales which are not compatible with the individual
resolution. For instance, a large unsampled area might be associated with absence of signals
although this area may separate two strongly differentiated clusters of aggregated individuals.
What does the GBM bring compared to the already existing methods? The answer requires
drawing up the state-of-the-art of the methods used in spatial genetics. Spatial population
genetics often relies on theoretical models and statistical methods for inference from genetic
data in subdivided populations. Mainly, these methods are concerned with the estimation of
migration or dispersal rates based on neutral models of evolution (Malécot 1968; Rousset
2004; Wright 1943) that have originally demonstrated how F-statistics depends on spatial
dispersal under simplified assumptions. These approaches require predefined populations, and
they do not use spatial information explicitly. Relaxing the prior definition of populations,
some other approaches also look at genetic structure without any spatial information (e.g.
PCA, genealogical tree reconstruction). Perhaps the most widely used is the Bayesian
clustering method introduced by Pritchard et al. (2000) that allows grouping individuals into
population units. Guillot et al. (2005) have introduced a Bayesian model that considers
georeferenced multilocus genotypes. Like Pritchard et al.'s, the method also requires strong
model assumptions such as Hardy-Weinberg and linkage equilibria.
Another stream of theoretical works about spatial genetic structure has traditionally been
built on spatial statistics. These works fall into three categories (see Manel et al. 2003): 1)
matrix methods and the Mantel test (Mantel 1967), 2) spatial autocorrelation statistics (Moran
1950; Sokal & Oden 1978), 3) methods related to identification of boundaries (Monmonier
1973). A criticism addressed to Mantel and autocorrelation methods is that they may indeed
reveal the presence of a specific structure, but they fail to identify its shape or its location
precisely (Barbujani 2000). In constrast, the Monmonier’s algorithm (Monmonier 1973) and
the Wombling method include the analysis of local features. The Monmonier’s algorithm has
however the drawback of setting a number of hard parameters, such as the estimated number
of boundaries. Like autocorrelation, the GBM deals with genetic data and geographic
coordinates simultaneously. But the main difference is that GBM enables locating the spatial
genetic structures. Compared to methods that use the Monmonier's algorithm, GBM avoids
using predefined number of populations, and does not even assume any particular measure of
genetic distances.
12
Wombling has generated a great amount of applied and theoretical works since its
introduction by Womble in 1951 (Womble 1951 ; Barbujani et al. 1989; Bocquet-Appel &
Bacro 1994; Fagan et al. 2003; Fortin 2000; Jacquez et al. 2000). So far, these works have
focussed on estimating systemic maps from various statistical procedures. This article has a
different spirit. Based on Wombling estimation, it has introduced a new structural parameter,
and has given this parameter a biological interpretation. The idea behind all studies of
geographic diversity is that one can proceed from the observed pattern to the underlying
evolutionary process (Barbujani 2000). The first step is to assess the observed pattern of
genetic variation. GBM has proved to be a powerful tool to address this question, and to
provide graphical insights on population structure when no prior information is available.
Acknowledgment
We are grateful O.J. Hardy, G. Guillot and J. Goudet for their helpful comments and
discussions. We wish also thanks J. Rebreyend for his help with the GENBMAP program
interface. This work was conducted while O.F. and S.M. were supported by the IMAG project
AlpB and the ACI project ImpBio.
REFERENCES
Balloux, F. 2001 EASYPOP (version1.7): a computer program for the simulation of
population genetics. J. Hered. 92, 301-302.
Barbujani, G. 2000 Geographic patterns: how to identify them and why? Hum. Biol. 72, 133-
153.
Barbujani, G., Oden, N. L. & Sokal, R. 1989 Detecting regions of abrupt change in maps of
biological variables. Syst. Zool. 38, 376-389.
Berry, A. & Kreitman, M. 1993 Molecular analysis of an allozyme cline: alcohol
deshydrogenase in Drosophila melanogaster on the East Coast of North America.
Genetics 134, 869-893.
Bocquet-Appel, J. P. & Bacro, J. N. 1994 Generalized Wombling. Syst. Biol. 43, 442-448.
Cann, H. M., de Tomas, C., Cazes, L. et al. 2002 A human genome diversity cell line panel.
Science 296: 261-262.
Csillag, F. & Kabos, S. 2002 Wavelets, boundaries and the spatial analysis of landscape
patterns. Ecoscience 9, 177-190.
Fagan, W. F., Fortin, M.-J. & Soykan, C. 2003 Integrating edge detection and dynamic
modelling in quantitative analysis of ecological boundaries. Bioscience 53, 730-783.
13
Fan, J. & Gijbels, I. 1996 Local polynomial modelling and its applications, edn. London :
Chapman & Hall.
Fortin, M.-J. 1994 Edge detection algorithms for two-dimensional ecological data. Ecology
75, 956-965.
Fortin, M.-J., Olson, R. J., Ferson, S. , Iverson, L., Hunsaker, C., Edwards, G., Levine, D.,
Butera, K.; & Klemas, V. 2000. Issues related to the detection of boundaries.
Landscape Ecol. 15, 453-466.
Fortin, M.-J. & Dale, M. 2005 Spatial analysis. A guide for ecologist, eds. Cambridge
University Press: Cambridge.
Guillot, G., Estoup, A., Mortier, F. & Cosson, J. 2005 A spatial statistical model for landscape
genetics. Genetics. 170, 1261-1280.
Jacquez, G., Maruca, S. & Fortin, M.-J. 2000 From fields to objects: a review of geographic
boundary analysis. J. Geograph Syst 2, 221-241.
Malécot, G. 1968 The mathematics of heredity, eds. Freeman & Company: New york (USA).
Manel, S., Bellemain, E., Swenson, J. & François, O. 2004 Assumed and inferred spatial
structure of populations: the scandinavian brown bears revisited. Mol. Ecol. 13, 1327-
1331.
Manel, S., Schwartz, M., Luikart, G. & Taberlet, P. 2003 Landscape genetics: combining
landscape ecology and population genetics. Trends Ecol. Evol. 18, 157-206.
Mantel, N. 1967 The detection of disease clustering and a generalised regression approach.
Cancer Res. 27, 173-220.
Monmonier, M. 1973 Maximum-difference barriers: an alternative numerical regionalization
method. Geogr. Analysis 3, 245-261.
Moran, P. A. 1950 Notes on continuous stochastic phenomena. Biometrika 37: 17-23.
Pritchard, J. K., Stephens, M. & Donnelly, P. 2000 Inference of population structure using
multilocus genotype data. Genetics 155, 945-959.
R Development Core Team. R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria, 2004.
Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A.
& Feldman, M. W. 2002 Genetic structure of human populations. Science 298: 2381-
2385.
Rousset, F. 2004 Genetic structure and selection in subdivided populations. Princeton
University Press: Princeton (USA).
Silvermann B. W. 1986 Density estimation for statistics and data analysis, eds. Chapman &
Hall: New York.
14
Sokal, R. & Oden, N. 1978 Spatial autocorrelation in biology. I-Methodology. Biological
Journal of Linneans Society 10: 199-228.
Weir, B. S. & Cockerham, C. C. 1984. Estimating F-statistics for the analysis of population
structure. Evolution 38, 1358-1370.
Womble, W. 1951 Differential systematics. Science 28, 315-322.
Wright, S. 1943 Isolation by distance. Genetics 28, 114-138.
15
Electronic supplementary material (ESM1) : Mathematical details of the GBM
The estimates of derivatives needed for computing systemic functions have been obtained
according to the nonparametric method called local polynomials (Fan & Gijbels 1996). Let us
explain the method briefly. For sake of simplicity, we assume the observation of a single
locus, and we let (zi) denote the Bernoulli variables that indicate the presence or absence of a
specific allele. Here the subscript i refers to the geographic location, and zi =1 means that an
organism located at the site i carries the studied allele.
We denote by (xi,yi) the spatial coordinates of the observation i and by (x,y) the coordinates
of an arbitrary grid site. In the local polynomials method, a function w weights observations
according to their distance to grid sites. The weight function usually consists of a nonnegative
decreasing function of the Euclidean distance di between the observation (xi,yi) and the grid
site (x,y). In nonparametric statistics Gaussian weights are a standard choice
g(d) = exp(-d2/2).
The bandwidth (denoted h) is a scale parameter that enables the control of weight decay.
This parameter influences the quality of estimation, and is subject to the bias/variance
dilemna. Actually the observation i receives the weight
wi(h) = g(di/h),
When the bandwidth is small, the decay may be fast compared to the scale of spatial data,
and only the observations close to the grid site will be taken into account. In contrast large
bandwidths may lead to an underestimation of local structures.
The local polynomials method attempts to fit a polynomial function P(x,y) to the (unknown)
frequency of the studied allele at every grid site (x,y). The fitted polynomial is a function of
the following form: 22
0 2/12/1),( yxyxyxyxP yyxyxxyx γγγββα +++++=
that solves the minimum square problem 2
1),()()( ∑
=
−−−=n
iiiii zyyxxPhwPE , (1)
where n is the sample size. Solving this problem amounts to find a six dimensional parameter
θ that corresponds to the set of coefficients of the polynomial function P. The solution of the
minimization problem (1) is obtained as the solution of the linear regression. The solution is
given by the following matrix product
)()( 1 XWZXWX tt −=θ (2)
16
where ),...( 1 nt zzZ= ),...( 1 n
t zzZ = , W is a n dimensional diagonal matrix such that wii = w(di/h) and
22
2111
2111
)())(()(1..................
)())(()(1
yyyyxxxxyyxx
yyyyxxxxyyxx
X
nnnnnn
Note that computing the estimated parameter θ requires a number of operations of order O(n).
The derivatives can be estimated as
yxyxxPxxββ≡∂∂
),,(),( hyxyxxP
xx ββ =≡∂∂
and
),,(),( hyxyxyP
yy ββ =≡∂∂
An estimate of the gradient norm in the systemic function can then be given by
yxfββ=∇ 222yxf ββ +=∇
Note that equation (2) can be rewritten as
θ = B Z
where XXB= )()( 1 XWXWXB tt −=
The estimates are thus obtained by making the product of the two matrices B and Z. Note
that the matrix B is dependent on the spatial data only, whereas the genetic data are contained
in Z. This remark is crucial for the implementation of the permutation tests. During the
permutation tests the alleles are resampled without replacement, and the B matrix needs not
be recalculated. The derivatives are therefore reestimated from the application of a single
matrix product. The reduced algorithmic cost of this method gives support for the choice of a
linear regression method instead of a logistic regression method.
17
Figure legends
Figure 1. Illustration of the permutation. The Figure represents individuals (dots) genotyped
at a two-allele locus (A = black dots, a = white dots), and includes geographic structure. The
thick black line indicates the presence of a physical barrier (e.g. a mountain range), and the
cross corresponds to a site (x,y) in the grid. When the permutation test is applied to
individuals at distance h = h1 black and white dots may be mixed up with high probability.
The null hypothesis of absence of structure may then be rejected. In contrast the test may be
non significant when the permutation test is applied to individuals at distance h = h2.
Figure 2. (a) W-shaped response corresponding to a cline in allele frequency at a locus with
two alleles. (b) V-shaped response corresponding to physical barrier to gene flows.
Figure 3. One dimensional cross-section of the GBM corresponding to the response of a cline
in simulated data. Genetical bandwidths range from 200 (min) to 1200 (max).
Figure 4. Simulations of the two-islands model with limited gene flows. (a) Relative locations
of the two islands along the r-axis. (b) Critical FST values below which population subdivision
cannot be detected (as a function of r). The sample size is set to n = 100 individuals. The
numbers of loci are set to L = 5 (triangles), L = 20 (squares), and L = 50 ( circles). (c) Critical
FST values below which population subdivision cannot be detected (as a function of r). The
number of loci is set to L = 20. The sample sizes are n = 20 (circles), n = 50 (triangles), and n
= 100 (squares).
Figure 5. GBM response to the latitudinal cline in frequencies of the Adh locus in the North
American D. melanogaster populations (latitudinal section).
Figure 6. Map produced by the GBM method for the Eurasian and Asian modern human
populations (CEPH dataset). U, T, V are signals of the genetic differentiation between
Eurasian and Asian populations. L may be explained by the genetic isolate Kalash, and Y is
consistent with the separation of populations with languages of altaic origin.
18
Landscape barrier (e.g. montain range)
h2
h1
Figure 1
19
(a) W-shaped response
(b) V-shaped response
Figure 2
20
Figure 3
21
0 4 8 Spatial ratio
Respective lof the two islands
ocations
)
0
0.1
0.2
0
FST
0
0.2
0.4
0
FST
F
(b
4 8
spatial ratio
i
(c)
(a)
4 8
spatial ratio
gure 4
22
0
3
6
25 30 35 40 45 50
latitude (°N)
gene
tical
ban
dwid
th vv
Figure 5
23
Figure 6
24