The Genetical Bandwidth Mapping: a spatial and graphical representation of

population genetic structure based on the Wombling method

A. Cercueil1, O. François1 and S. Manel2

1TIMC-TIMB, Faculté de Médecine, F38706 La Tronche cedex, France

2Laboratoire d'Ecologie Alpine, CNRS UMR 5553, Université Joseph Fourier, BP 53,

F-38041 Grenoble cedex 9, France

Corresponding author:

S. Manel

Tel: 33 (0) 4 76 51 42 78

Fax: 33 (0) 4 76 51 42 79

email: [email protected]

Running head: genetical bandwidth mapping

SUMMARY

Characterizing the spatial variation of allele frequencies in a population has a wide range of

applications in population genetics. This article introduces a new nonparametric method that

provides a two-dimensional representation of a structural parameter called the genetical

bandwidth. The genetical bandwidth is based on the computation of Womble’s systemic

function, and describes the genetic structure around arbitrary spatial locations in a study area.

It corresponds to the shortest distance to areas of significant variation of allele frequencies.

Simulations demonstrate that the method is able to locate genetic boundaries or clines. Two

applications are presented using datasets taken from the literature: the D. melanogaster Adh

F/S locus and the HGDP-CEPH modern human dataset.

Key-words: spatial genetic structure, Wombling, systemic function

1- INTRODUCTION

Wombling methods aim at detecting regions of abrupt changes in maps of biological

variables. They were introduced by Womble (1951), and later refined by Barbujani

et al. (1989) and Bocquet-Appel & Bacro (1994). In the original approach, Womble assumed

the knowledge of surfaces derived from the variables of interest (e.g. allele frequency

surfaces), and computed the gradient of these surfaces. The norms of the gradients were then

averaged to form a new surface called the systemic map (or systemic function). Zones of

rapid changes could therefore be identified as regions given high values by the systemic

function. These regions have been called boundaries.

However, the surfaces considered in the original approach are rarely available in real data

analysis. Instead, the variables of interest are measured at sparse

geographic locations. Implementing Wombling methods therefore requires the preliminary

inference of such surfaces from partial information. Regular experimental designs are usually

addressed with a technique called lattice Wombling. In this approach a lattice tessellates the

space into rectangular regions termed pixels, and the variables of interest are assigned to the

center of the pixel. The rates of change of the systemic function can then be computed either

as first derivatives among adjacent pixels or second partial derivatives (Laplacian) (Fagan et

al. 2003; Fortin & Dale 2005; Jacquez et al. 2000). The computation uses kernel methods that

operate in windows of m pixels where m is a predefined parameter. Irregular experimental

designs are processed with another technique called triangulation Wombling based on

triangular kernels (Fortin 1994). Once the systemic function and the rates of changes are

estimated, the next step is to discriminate between true boundaries and those due to sampling

errors. To do so, Barbujani et al. (1989) assess the significance of the boundaries by using

randomization procedures. These procedures detect whether the highest rates of change are

higher than expected under the absence of structure. Their method has been applied to the

analysis of allele frequencies, and has been able to detect zones of abrupt change within

human and Drosophila populations.

In summary, the need for statistical reconstruction from sparse observations raises a

difficult issue in the implementation of Wombling methods. As lattice Wombling assigns the

observations to a regular lattice, it may be subject to bias. On the other hand, triangulation

Wombling estimates the systemic function from only three data points, and is therefore prone to errors

due to statistical variability. The reconstruction issue is closely related to the choice of the

spatial scales at which the estimation procedures (kernels) are implemented. For instance, the

rates of change are strongly dependent on the pixel size (Fagan et al. 2003; Jacquez et al.

2000). This has been recently addressed by a method called hierarchical Wombling (Csillag &

Kabos 2002) that computes maps at several scales by varying the pixel size. But hierarchical

Wombling produces as many maps as there are pixel sizes, and leaves us with the

dilemma of deciding which pixel size is the most appropriate to interpret the data.

This article addresses the above decision problem from a different perspective. The

approach introduced here, named the Genetical Bandwidth Mapping (GBM), is a

nonparametric technique that deals with allele frequency. The GBM estimates the systemic

function at several scales in order to provide a local characterization of the genetic structure

around any arbitrary spatial location in a study area. The quantities displayed by the GBM are

not systemic values themselves, but new quantities called genetical bandwidths. The GBM

avoids the issue of choosing among multiple maps by computing local parameters,

which in turn can be interpreted as the shortest distances to areas of significant variation in

allele frequencies. This article also evaluates the ability of the GBM to detect and

locate genetic structures such as boundaries or clines using multilocus genotypes.

2- THEORY

We consider a biological population living in a two dimensional habitat, and assume that n

individuals have been sampled from this population according to a random or regular

experimental design. In what follows, each location (xi, yi) is associated with a genetic

observation gi called the multilocus genotype that indicates the presence of specific alleles at

multiple DNA loci.

Typically the coordinates (xi, yi) represent either the location of an individual labelled i at

its instant of observation, or the location of its habitat. The GBM computes a critical

parameter (the genetical bandwidth) at each site of a grid that covers the study area. In the

sequel the grid sites are denoted (x,y), and may differ from the sampled locations (xi, yi). This

section presents a formal description of the bandwidths, and describes the statistical principles

underlying the GBM. Mathematical details are deferred to the electronic supplementary

material (ESM1).

(a) Definition of bandwidths

In the GBM, gradients of allele frequencies are computed at each grid site. In this grid the

neighbourhood of a site is not defined precisely. Instead each observation receives a

weighting that depends on the distance to the current grid location (x, y). More specifically, the

observation i at (xi, yi) is given the Gaussian weight $w_i(h) = \exp(-d_i^2 / 2h^2)$,

where $d_i^2$ represents the squared Euclidean distance between (xi, yi) and (x, y). This approach

is standard in density estimation (Silverman 1986). The parameter h is crucial to our method.

It is called the bandwidth (or window size), and controls the exponential decay of the

Gaussian weights.
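As a minimal sketch of the weighting scheme above, the following Python fragment evaluates the Gaussian weights for one grid site. The function name `gaussian_weights` and the toy coordinates are illustrative assumptions and are not taken from GENBMAP.

```python
import math

def gaussian_weights(points, site, h):
    """Return w_i(h) = exp(-d_i^2 / (2 h^2)) for every sampled point."""
    x, y = site
    return [math.exp(-((xi - x) ** 2 + (yi - y) ** 2) / (2.0 * h ** 2))
            for xi, yi in points]

# Three observations at distances 0, 1 and 2 from the grid site (0, 0).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
weights = gaussian_weights(points, site=(0.0, 0.0), h=1.0)
```

Observations far from the grid site relative to h receive weights close to zero, so h effectively sets the size of the neighbourhood without imposing a hard cut-off.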

(b) The Systemic function

In order to test for the absence of local genetic structures, the GBM computes an estimate

of Womble's Systemic function at each grid site. The Systemic function S(x,y) is defined by

the following formula

$$S(x,y) = \sum_j \left\| \nabla f_j(x,y) \right\|$$

where $\nabla f_j(x,y)$ is the gradient of the allele frequency $f_j$ with respect to x and y, $\|\cdot\|$ denotes its norm, and the sum

runs over all possible alleles j at all loci. The Systemic function is representative of the total

slope of allele frequency surfaces, and it integrates the dependencies between the genetical

and the spatial data. The allele frequencies and their derivatives at (x, y) are unknown

parameters. In the GBM the derivatives are estimated from the presence/absence of alleles at

the sampled locations. The estimation technique makes use of local polynomials based on the

Gaussian weights wi(h) (see Fan & Gijbels 1996 and ESM1). This approach differs from

standard techniques (e.g. Barbujani et al. 1989) where the derivatives are usually estimated

from difference equations. Depending on the bandwidth h, the GBM builds an estimate

S(x,y,h) of S(x,y) for each (x,y). In nonparametric statistics the choice of h is usually based on

the minimization of a statistical error. Optimal choices nevertheless generate difficult

mathematical and practical problems. GBM deals with this particular problem in an original

way where the choice of h is related to critical regions of tests for the absence of structure.
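The estimation step can be illustrated with a short Python sketch. It assumes a local linear (degree-one) fit with the Gaussian weights defined earlier, which is one common instance of local polynomial estimation; the polynomial degree actually used in ESM1 is not stated in this section. The helper names `solve3`, `local_gradient` and `systemic_value` are hypothetical.

```python
import math

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    beta = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = m[r][3] - sum(m[r][c] * beta[c] for c in range(r + 1, 3))
        beta[r] = s / m[r][r]
    return beta

def local_gradient(points, z, site, h):
    """Weighted least-squares fit z ~ b0 + b1*(xi - x) + b2*(yi - y); returns (b1, b2)."""
    x, y = site
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for (xi, yi), zi in zip(points, z):
        u = (1.0, xi - x, yi - y)
        w = math.exp(-((xi - x) ** 2 + (yi - y) ** 2) / (2.0 * h ** 2))
        for r in range(3):
            b[r] += w * u[r] * zi
            for c in range(3):
                a[r][c] += w * u[r] * u[c]
    _, b1, b2 = solve3(a, b)
    return b1, b2

def systemic_value(points, indicators, site, h):
    """Sum over alleles of the norm of the estimated frequency gradient."""
    return sum(math.hypot(*local_gradient(points, z, site, h)) for z in indicators)

# Sanity check on a noiseless planar surface: the fit recovers its slope.
points = [(float(i), float(j)) for i in range(5) for j in range(5)]
plane = [0.1 * px + 0.2 * py for px, py in points]
grad = local_gradient(points, plane, site=(2.0, 2.0), h=1.5)
s_two_alleles = systemic_value(points, [plane, plane], site=(2.0, 2.0), h=1.5)
```

In a real analysis each entry of `indicators` would be the 0/1 presence vector of one allele at the sampled locations; the planar surface is used here only because its gradient is known exactly.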

(c) Testing for spatial genetic structure

Testing for genetic structure is at the heart of the GBM. The tests for absence of structure

are repeated at each site (x, y) of the grid. Their objective is to determine whether the estimated

systemic value S(x,y,h) reflects significant local genetic structure or not. In order to perform

the tests, the values S(x,y,h) are compared to the probability distribution of the Systemic

function S(x,y) obtained under the null hypothesis of absence of structure. Such a probability

distribution can be computed from a permutation procedure. The algorithm actually proceeds

with resampling from all possible genotypes, and assigns the sampled genotypes to the

individual locations. For each (x,y), replicates of systemic values are computed as described in

paragraph (b). A p-value is then computed by forming the ratio of the number of replicates

greater than the estimated value S(x, y, h) to their total number. For example, figure 1

represents a sample from a population that displays a geographic structure at a two-allele

haploid locus. The figure illustrates the basic idea underlying the principle of this test, and

suggests that it should be significant for large bandwidths (h = h1) but nonsignificant for

lower values (e.g. h = h2).
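The permutation procedure can be sketched as follows. This is an illustrative Python fragment, not the GENBMAP implementation: `permutation_p_value` is a hypothetical name, and the toy statistic `half_difference` merely stands in for the systemic value S(x, y, h) so that the example remains self-contained.

```python
import random

def permutation_p_value(statistic, genotypes, n_perm=200, seed=0):
    """Fraction of shuffled replicates whose statistic exceeds the observed one."""
    rng = random.Random(seed)
    observed = statistic(genotypes)
    shuffled = list(genotypes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign genotypes to the fixed locations
        if statistic(shuffled) > observed:
            exceed += 1
    return exceed / n_perm

def half_difference(genotypes):
    """Toy stand-in for S(x, y, h): frequency gap between the two halves of a transect."""
    half = len(genotypes) // 2
    left = sum(genotypes[:half]) / half
    right = sum(genotypes[half:]) / (len(genotypes) - half)
    return abs(left - right)

# Strongly structured sample (allele 0 on one side, allele 1 on the other).
p_structured = permutation_p_value(half_difference, [0] * 20 + [1] * 20)
# Sample without spatial structure (alleles alternate along the transect).
p_mixed = permutation_p_value(half_difference, [0, 1] * 20)
```

The structured sample yields a p-value of zero because no reshuffling can exceed the maximal observed gap, whereas the mixed sample yields a large p-value.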

(d) Genetical bandwidths

To compute the genetical bandwidths, the type I error of the permutation test must be set

to the value α (typically α = 0.01 or α = 0.05). The tests are repeated at each grid site for

several values of h. The genetical bandwidths are then defined as the largest values of h for

which the tests are nonsignificant, i.e. the p-values are greater than α. To compute these

quantities, large bandwidths corresponding to the diameter of the study area are first tested.

The bandwidths are then decreased by one unit until the test becomes nonsignificant.
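The search described in this paragraph can be sketched in a few lines of Python. The function `genetical_bandwidth` and the callback `p_value_at` are hypothetical names; returning 0 when every tested bandwidth is significant is an assumption made here for illustration, as the text does not specify this case.

```python
def genetical_bandwidth(p_value_at, h_max, alpha=0.05, step=1.0):
    """Decrease h from h_max until the test at the site becomes nonsignificant."""
    h = h_max
    while h > 0:
        if p_value_at(h) > alpha:
            return h  # largest tested h with a nonsignificant test
        h -= step
    return 0.0  # assumption: structure was detected at every tested scale

def toy_p_value(h):
    """Toy site: the test is significant whenever the window size h reaches 7."""
    return 0.01 if h >= 7 else 0.5

bandwidth = genetical_bandwidth(toy_p_value, h_max=20.0)
```

In this toy case the search stops at h = 6, the first window size that no longer reaches the zone of variation.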

(e) Interpretation of the GBM output

The GBM results in a two dimensional matrix that can be interpreted as a map. Such a

matrix contains the genetical bandwidths computed at each point of the grid covering the

study area. These parameters correspond to the shortest distances to the zones of significant

variation in allele frequencies. Graphical representations of the GBM therefore provide bases

for interpretation of the genetic structure of populations. In this paragraph, we give short

guidelines to the interpretation of such outputs by analysing two typical responses. For the sake

of clarity we discuss one-dimensional organisation, i.e. populations that consist of

continuously distributed individuals along a line. The one dimensional hypothesis is actually

more amenable to analysis, and enables reasonable guesses of typical shapes of response.

Understanding one-dimensional shapes may also be useful to better interpret cross-sections in

two-dimensional maps. In one dimension two types of response can be expected. We call

these responses the W-shaped and the V-shaped curves (figure 2). The W-shaped response

may be the signal of a cline in allele frequency while the V-shaped response is expected when

differentiated populations are isolated by a nonhabitat or unsampled area. The W-shaped

response is illustrated in figure 2a where the distribution of a haploid diallelic locus (a/A) is

displayed. Within areas of low frequency of allele A, the genetical bandwidths decrease as the

distance to the zone of sharp variation decreases. The same phenomenon occurs within areas of high

frequency. Within the transition zone, the genetical bandwidths may undergo erratic

variations due to variability in geographic sampling, and the fact that the allele frequencies

are closer to 50 percent. Figure 2b gives an illustration of the V-shaped response. Two

populations are separated by an empty area, e.g. a physical barrier to gene flow. In this case

the genetical bandwidths vary linearly with the distance to the furthest population.

(f) Software

A documented computer software called GENBMAP implements the GBM in C++, and

provides a graphical interface for visualizing its outputs. The computer program has been

designed for running under the win32 operating system, and it is available from the authors’

web pages (http://www-timc.imag.fr/Olivier.Francois/ or

http://www2.ujf-grenoble.fr/leca/membres/manel.html).

3- SIMULATION STUDY

This section illustrates the behaviour of the GBM when confronted with typical spatial

structures obtained from numerical simulations. Two cases have been simulated: 1) Cline

along a specific direction (longitude), 2) Barrier to gene flow in the two-island model.

(a) Experimental design and simulation tools

Random generation of spatial locations and genetic data has been performed using the

statistical software R 2.2.0 (R Development Core Team, 2005). Simulations have been

replicated more than ten times in each case. Sample sizes have been increased from 200 to

1,000 and 5,000 individuals. Maps have been computed using 300 by 300 regular grids

(90,000 sites). The type I errors in permutation tests have been set to α = 0.05 and α = 0.1.

Refining the mesh grid has a direct impact on the running time which increases linearly with

the number of sites in the grid. The number of permutations has been fixed to 200, and then

increased to 500 and 1,000 without major influence on the output of the tests. With 200

permutations, the computing time for one single map approximates half an hour on a 2GHz

processor laptop computer. Additional genetic data have been generated from the two-island and

three-island models using EASYPOP (Balloux 2001) with similar results (not reported). The

graphical results presented in this section have been chosen as representative of the majority

of the simulated cases.

(b) Simulation of clines

Simulated data include 12 unlinked diallelic loci (alleles are coded 0 and 1). At each

locus, the frequencies of 1’s have been varied along the horizontal direction (x coordinate),

and their dependence on x has been considered to be logistic f(x) = 1 / ( 1 + exp( - (x-a)/b ) )

with a and b specific constants set to a = 2500, 3000, 3500 and b = 200, 500 (each of the 6

combinations has been produced twice). Spatial coordinates have been obtained as mixtures

of two independent isotropic Gaussian distributions centered at x1 = 1000 and x2 = 3000 (the

mean y coordinate has been set to y = 1800, and the standard deviation has been taken equal

to sd = 1000). These simulations are relevant to a clinal variation of allele frequencies as a

function of longitude or altitude.
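The simulation design above can be sketched in Python (the original simulations were run in R). The function names are hypothetical, and equal mixture weights for the two Gaussian components are an assumption, since the text does not give them.

```python
import math
import random

def logistic_frequency(x, a, b):
    """Frequency of allele 1 at horizontal position x: 1 / (1 + exp(-(x - a) / b))."""
    return 1.0 / (1.0 + math.exp(-(x - a) / b))

def simulate_cline(n, params, seed=0):
    """Draw n individuals with coordinates (x, y) and one 0/1 allele per locus."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        centre = rng.choice((1000.0, 3000.0))  # assumed equal mixture weights
        x = rng.gauss(centre, 1000.0)
        y = rng.gauss(1800.0, 1000.0)
        genotype = [1 if rng.random() < logistic_frequency(x, a, b) else 0
                    for a, b in params]
        sample.append((x, y, genotype))
    return sample

# Six (a, b) combinations, each produced twice, give the 12 loci.
params = [(a, b) for a in (2500.0, 3000.0, 3500.0) for b in (200.0, 500.0)] * 2
sample = simulate_cline(1000, params)
```

Each individual carries one allele per locus, drawn independently across loci, which matches the description of unlinked diallelic loci.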

Figure 3 displays a central horizontal cross section (y = 900) of the GBM. The cross

section exhibits a W-shaped response locating the beginning of the sharp variation around

1,500 and its end around 4,000 (figure 3). The central area of the GBM corresponds to a wide

region where the allele frequencies vary significantly. In these experiments, increasing the

type I error α had the consequence of producing wider minimal areas. Nevertheless, changing

this parameter did not influence the overall topography of the map.

(c) Simulation of two-island models and sensitivity analysis

The sensitivity of the GBM to simultaneous variation in spatial and genetic differentiation

has been assessed from simulations of the island model. Individuals have been sampled from

two subpopulations of equal effective sizes Ne. Alleles have been simulated according to the

infinite allele model with constant mutation rate θ = 4µNe = 1 at each locus. Migration rates

M = 4mNe have been varied from 1 to 10. F-statistics (FST) have been computed using Weir

and Cockerham estimates (Weir & Cockerham 1984). In order to assess effects due to the

number of loci and the sample size, we have used L = 5, 20 and 50 loci and n = 20, 50 and

100 individuals. Spatial coordinates have been simulated as the geographic mixture of two

independent Gaussian distributions. Each subpopulation (or island) has its own spatial range,

and the two islands may intersect. In order to measure the amount of intersection between the

two islands, we have introduced the parameter r whose definition comes from discriminant

analysis. It corresponds to the ratio of the within-island variance to the between-islands

variance, and it can be interpreted as a measure of spatial differentiation. For equal within-

variances, bringing islands closer has the effect of increasing r while moving islands away

from each other has the reverse effect (figure 4a). For r between 0 and 4, the islands are

clearly separated. For r between 4 and 8, geographic mixing may exist and may be strong.

FST values are considered critical to the detection of population structure when the minima

of the GBM remain undetectable (flat responses). A response is considered flat when the ratio of the

maximum of the GBM to its minimum value, minus one, is less than 5 percent. In figure 4,

these ratios have been obtained from ten replicates of the simulation scenario for each value

of r. Figure 4b represents the variation of the critical FST values as a function of r. For r in the

range (0, 4), the structure in two islands can be detected when FST’s are not lower than a value

around 0.04. In this range the method is relatively insensitive to the number of loci. As the

level of spatial mixture increases (r around 6), the performance rises with the number of loci.

With 20 loci the structure in two islands can be detected for FST values less than 0.05 - 0.06.

But the GBM is clearly sensitive to the sample size (figure 4c), and its use may be precluded

with small data samples even if two spatial clusters can be observed. For n = 20 the two

islands can actually be identified for large FST values (> 0.15) only. However, for n = 100, the

GBM is remarkably efficient at detecting structure even if the two subpopulations are

particularly merged.
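The flatness criterion used in this sensitivity analysis (ratio of the maximum of the GBM to its minimum, minus one, below 5 percent) can be written as a one-line check. The function name `is_flat` is hypothetical, and the sketch assumes strictly positive bandwidths.

```python
def is_flat(bandwidths, tolerance=0.05):
    """True when max/min - 1 is below the tolerance (assumes positive values)."""
    lowest, highest = min(bandwidths), max(bandwidths)
    return highest / lowest - 1.0 < tolerance

flat_map = is_flat([100.0, 102.0, 104.0])      # ratio minus one is about 0.04
structured_map = is_flat([40.0, 100.0, 40.0])  # ratio minus one is 1.5
```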

4- APPLICATION TO REAL DATA

In this section, two real examples are analyzed. The first has been taken from the literature

as a case of evidence for clinal selection at the Adh F/S locus in D. melanogaster (Berry &

Kreitman 1993). The second dataset is related to the spatial genetic structure of the modern

human population (Rosenberg et al. 2002).

(a) Adh locus

Clinal selection at the Adh F/S locus in D. melanogaster was studied by Berry & Kreitman

(1993) using 113 haplotypes from 44 polymorphic markers in 1533 individuals from 25

population sites. Because the original data contain latitudinal information for population

samples only, individual locations have been randomly simulated from the 25 site coordinates by

adding a small amount of variability to the site latitude (s.d. = 0.5°). Since the sites are

located on the East Coast of North America, they can be considered as almost aligned, and

longitudes have been simulated within an artificial range of (-1, +1)°. A subsample of 1303

individuals carrying the 14 most represented haplotypes has been used in our analysis

(85% of the full dataset given at the end of Berry & Kreitman's article). The above

described sampling procedure does not produce bias toward the appearance of clines.

Regarding the Adh locus, the two dimensional GBM displays a band pattern which also

reflects the absence of sensitivity to the random sampling of longitudes. Most latitudinal

cross-sections display the same band pattern (not shown), and the curve represented in figure

5 actually corresponds to a single section of the map. This curve exhibits a W-shaped

response locating the beginning of the zone of sharp variation around the latitude 32.4°N and

the end of this zone around the latitude 41.1°N. This result gives evidence that the cline is

correctly retrieved, and that it separates two homogeneous zones in the South and the North of

the study area.

(b) CEPH Human genome dataset

The genetic structure of modern human populations has recently been investigated

without the use of predefined “groups” by Rosenberg et al. (2002). The study was based on

the HGDP-CEPH human genome diversity cell line panel (Cann et al. 2002) using genotypes

at 377 autosomal microsatellite loci in 1056 individuals distributed around the world.

Inference of genetic ancestry was performed by applying a model-based clustering algorithm

implemented in the computer program STRUCTURE (Pritchard et al. 2000) that computes

individual cluster membership coefficients. Rosenberg et al. (2002) identified six main

genetic clusters, five of which correspond to major geographic regions, and subclusters that

often correspond to individual populations.

The GBM has been applied to the Eurasia/East-Asia subset of the original data. The studied

dataset contains 451 individuals which originate from 25 populations. The study area ranges

from southwestern Pakistan (latitude 24°N, longitude 66°E) to north-eastern Russia (latitude

64°N, longitude 130°E). Geographic origins and coordinates of sampled populations are

available from the CEPH website (www.cephb.fr/HGDP-CEPH-Panel/). Because the

coordinates of individuals are not known exactly, they have been simulated from the ranges

given for each sampled population. This has been done by adding random noise with a small standard

deviation to the geographic coordinates. The GBM has been computed from a regular grid of

resolution 400 by 400, 200 permutations, and type I error set to α = 0.05.

The GBM output is found to be in accordance with the genetic structure of the Pakistan

and East Asia human populations (figure 6). 1) The Pakistan area is associated with high

genetic heterogeneity (and low values of the GBM). The (L) and (U) signals may be

interpreted as genetic barriers to gene flow (a double barrier). The lower separation (L) may

be associated with the Kalash sample which was clearly identified as a genetic isolate by

Rosenberg et al. (2002). The upper separation (U) may be interpreted as the separation between Xibo

and Uygur of northwestern China and the populations from southern Pakistan which speak

Indo-European languages. 2) Evidence for the separation Eurasia/East Asia is provided by the

transversal separation (T), and by (U) and (V) which may be the continuation of (U). 3)

Another heterogeneous genetic area is observed in northeastern China (Y). This may be

explained by the separation between Yakut and Japanese, which are populations with Altaic

languages and the rest of the East-Asian samples.

5- DISCUSSION

The Genetical Bandwidth Mapping provides information about the spatial genetic

organisation of populations. It displays this information through the two-dimensional

graphical representation of a local structural parameter. Genetical bandwidths may be

interpreted as the shortest distances to areas of significant variation in allele frequencies. They

may also be viewed as the size of the largest neighbourhoods in which the spatial genetic

structure can be thought of as being homogeneous. The definition of homogeneity used here

entails that allele frequencies undergo very slow spatial variation within the area. Note that

this allows genetic heterogeneity to take place as well. For example, genetic admixture can be

considered as a kind of homogeneous structure provided that allele frequencies are constant

across the population. The previous definition fits well into the framework of Wombling

methods because spatial homogeneity can be measured from the values of the systemic map

S(x,y). However, the GBM and Wombling differ fundamentally. A major difference is that

Wombling requires the systemic function to be estimated using a local parameter (e.g., a

window size). As a consequence, there is usually no unique estimate of S, but as many as

there are window sizes tried. The GBM overcomes this issue by estimating critical window

sizes based on several estimates of S(x,y), and hence produces a unique map.

Simulations and sensitivity analyses have shown that GBM is efficient at detecting spatial

structures such as genetic barriers to gene flow or clines. Applications to real data have

provided additional evidence that GBM may be able to identify clinal variation in allele

frequencies (Drosophila Adh locus). Maps for the human population samples are in

accordance with the current knowledge about the genetic structure of Eurasian and Asian

populations. Genetic boundaries associated with population subdivision may induce V-shaped

responses. However, the converse may not be true: these typical responses

may not reveal the presence of genetic boundaries unambiguously, as clines may also produce

similar signals. In addition, some boundaries may remain undetectable. This happens when

the effect of boundaries operates at scales which are not compatible with the individual

resolution. For instance, a large unsampled area might be associated with absence of signals

although this area may separate two strongly differentiated clusters of aggregated individuals.

What does the GBM bring compared to the already existing methods? The answer requires

drawing up the state-of-the-art of the methods used in spatial genetics. Spatial population

genetics often relies on theoretical models and statistical methods for inference from genetic

data in subdivided populations. Mainly, these methods are concerned with the estimation of

migration or dispersal rates based on neutral models of evolution (Malécot 1968; Rousset

2004; Wright 1943) that originally demonstrated how F-statistics depend on spatial

dispersal under simplified assumptions. These approaches require predefined populations, and

they do not use spatial information explicitly. Relaxing the prior definition of populations,

some other approaches also look at genetic structure without any spatial information (e.g.

PCA, genealogical tree reconstruction). Perhaps the most widely used is the Bayesian

clustering method introduced by Pritchard et al. (2000) that allows grouping individuals into

population units. Guillot et al. (2005) have introduced a Bayesian model that considers

georeferenced multilocus genotypes. Like Pritchard et al.'s, the method also requires strong

model assumptions such as Hardy-Weinberg and linkage equilibria.

Another stream of theoretical works about spatial genetic structure has traditionally been

built on spatial statistics. These works fall into three categories (see Manel et al. 2003): 1)

matrix methods and the Mantel test (Mantel 1967), 2) spatial autocorrelation statistics (Moran

1950; Sokal & Oden 1978), 3) methods related to identification of boundaries (Monmonier

1973). A criticism addressed to Mantel and autocorrelation methods is that they may indeed

reveal the presence of a specific structure, but they fail to identify its shape or its location

precisely (Barbujani 2000). In contrast, Monmonier’s algorithm (Monmonier 1973) and

the Wombling method include the analysis of local features. Monmonier’s algorithm has,

however, the drawback of requiring a number of hard parameters, such as the estimated number

of boundaries. Like autocorrelation, the GBM deals with genetic data and geographic

coordinates simultaneously. But the main difference is that GBM enables locating the spatial

genetic structures. Compared to methods that use Monmonier's algorithm, the GBM avoids

using a predefined number of populations, and does not even assume any particular measure of

genetic distances.

Wombling has generated a great amount of applied and theoretical works since its

introduction by Womble in 1951 (Womble 1951 ; Barbujani et al. 1989; Bocquet-Appel &

Bacro 1994; Fagan et al. 2003; Fortin et al. 2000; Jacquez et al. 2000). So far, these works have

focussed on estimating systemic maps from various statistical procedures. This article has a

different spirit. Based on Wombling estimation, it has introduced a new structural parameter,

and has given this parameter a biological interpretation. The idea behind all studies of

geographic diversity is that one can proceed from the observed pattern to the underlying

evolutionary process (Barbujani 2000). The first step is to assess the observed pattern of

genetic variation. GBM has proved to be a powerful tool to address this question, and to

provide graphical insights on population structure when no prior information is available.

Acknowledgment

We are grateful to O.J. Hardy, G. Guillot and J. Goudet for their helpful comments and

discussions. We also wish to thank J. Rebreyend for his help with the GENBMAP program

interface. This work was conducted while O.F. and S.M. were supported by the IMAG project

AlpB and the ACI project ImpBio.

REFERENCES

Balloux, F. 2001 EASYPOP (version 1.7): a computer program for the simulation of

population genetics. J. Hered. 92, 301-302.

Barbujani, G. 2000 Geographic patterns: how to identify them and why? Hum. Biol. 72,

133-153.

Barbujani, G., Oden, N. L. & Sokal, R. 1989 Detecting regions of abrupt change in maps of

biological variables. Syst. Zool. 38, 376-389.

Berry, A. & Kreitman, M. 1993 Molecular analysis of an allozyme cline: alcohol

dehydrogenase in Drosophila melanogaster on the East Coast of North America.

Genetics 134, 869-893.

Bocquet-Appel, J. P. & Bacro, J. N. 1994 Generalized Wombling. Syst. Biol. 43, 442-448.

Cann, H. M., de Tomas, C., Cazes, L. et al. 2002 A human genome diversity cell line panel.

Science 296, 261-262.

Csillag, F. & Kabos, S. 2002 Wavelets, boundaries and the spatial analysis of landscape

patterns. Ecoscience 9, 177-190.

Fagan, W. F., Fortin, M.-J. & Soykan, C. 2003 Integrating edge detection and dynamic

modelling in quantitative analysis of ecological boundaries. Bioscience 53, 730-783.

Fan, J. & Gijbels, I. 1996 Local polynomial modelling and its applications. London:

Chapman & Hall.

Fortin, M.-J. 1994 Edge detection algorithms for two-dimensional ecological data. Ecology

75, 956-965.

Fortin, M.-J., Olson, R. J., Ferson, S., Iverson, L., Hunsaker, C., Edwards, G., Levine, D.,

Butera, K. & Klemas, V. 2000 Issues related to the detection of boundaries.

Landscape Ecol. 15, 453-466.

Fortin, M.-J. & Dale, M. 2005 Spatial analysis: a guide for ecologists. Cambridge:

Cambridge University Press.

Guillot, G., Estoup, A., Mortier, F. & Cosson, J. 2005 A spatial statistical model for landscape

genetics. Genetics 170, 1261-1280.

Jacquez, G., Maruca, S. & Fortin, M.-J. 2000 From fields to objects: a review of geographic

boundary analysis. J. Geogr. Syst. 2, 221-241.

Malécot, G. 1968 The mathematics of heredity. New York: Freeman & Company.

Manel, S., Bellemain, E., Swenson, J. & François, O. 2004 Assumed and inferred spatial

structure of populations: the Scandinavian brown bears revisited. Mol. Ecol. 13,

1327-1331.

Manel, S., Schwartz, M., Luikart, G. & Taberlet, P. 2003 Landscape genetics: combining

landscape ecology and population genetics. Trends Ecol. Evol. 18, 157-206.

Mantel, N. 1967 The detection of disease clustering and a generalised regression approach.

Cancer Res. 27, 173-220.

Monmonier, M. 1973 Maximum-difference barriers: an alternative numerical regionalization

method. Geogr. Analysis 3, 245-261.

Moran, P. A. 1950 Notes on continuous stochastic phenomena. Biometrika 37, 17-23.

Pritchard, J. K., Stephens, M. & Donnelly, P. 2000 Inference of population structure using

multilocus genotype data. Genetics 155, 945-959.

R Development Core Team 2004 R: a language and environment for statistical computing. R

Foundation for Statistical Computing: Vienna (Austria).

Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A.

& Feldman, M. W. 2002 Genetic structure of human populations. Science 298, 2381-2385.

Rousset, F. 2004 Genetic structure and selection in subdivided populations. Princeton

University Press: Princeton (USA).

Silverman, B. W. 1986 Density estimation for statistics and data analysis. Chapman &

Hall: New York.

Sokal, R. & Oden, N. 1978 Spatial autocorrelation in biology. I. Methodology. Biological

Journal of the Linnean Society 10, 199-228.

Weir, B. S. & Cockerham, C. C. 1984 Estimating F-statistics for the analysis of population

structure. Evolution 38, 1358-1370.

Womble, W. 1951 Differential systematics. Science 114, 315-322.

Wright, S. 1943 Isolation by distance. Genetics 28, 114-138.

Electronic supplementary material (ESM1): Mathematical details of the GBM

The derivatives needed to compute systemic functions were estimated with the nonparametric method of local polynomials (Fan & Gijbels 1996), which we describe briefly. For the sake of simplicity, we assume that a single locus is observed, and we let $(z_i)$ denote the Bernoulli variables that indicate the presence or absence of a specific allele. The subscript $i$ refers to the geographic location, and $z_i = 1$ means that the organism located at site $i$ carries the studied allele.

We denote by $(x_i, y_i)$ the spatial coordinates of observation $i$ and by $(x, y)$ the coordinates of an arbitrary grid site. In the local polynomials method, a function $w$ weights observations according to their distance to grid sites. The weight function is usually a nonnegative decreasing function of the Euclidean distance $d_i$ between the observation $(x_i, y_i)$ and the grid site $(x, y)$. In nonparametric statistics, Gaussian weights are a standard choice:

\[ g(d) = \exp(-d^2/2). \]

The bandwidth (denoted $h$) is a scale parameter that controls how fast the weights decay. It influences the quality of the estimation and is subject to the bias/variance dilemma. Observation $i$ receives the weight

\[ w_i(h) = g(d_i/h). \]

When the bandwidth is small, the weights decay fast compared to the scale of the spatial data, and only the observations close to the grid site are taken into account. In contrast, large bandwidths may lead to an underestimation of local structures.
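As an illustration of how the bandwidth controls the weights, here is a minimal Python sketch. The function names (`gaussian_weight`, `weights_at`) and the coordinates are hypothetical and chosen for this example only; the sketch assumes nothing beyond the Gaussian weight function and the scaling by h described above.

```python
import math

def gaussian_weight(d, h):
    """Weight w_i(h) = g(d_i / h) with g(d) = exp(-d^2 / 2)."""
    u = d / h
    return math.exp(-0.5 * u * u)

def weights_at(site, points, h):
    """Weights of all observations (x_i, y_i) relative to one grid site (x, y)."""
    x, y = site
    return [gaussian_weight(math.hypot(xi - x, yi - y), h) for xi, yi in points]

# Made-up coordinates: one observation on the grid site, one at distance 1,
# and one at distance 5.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
narrow = weights_at((0.0, 0.0), points, h=0.5)  # only close observations count
wide = weights_at((0.0, 0.0), points, h=5.0)    # distant observations still count
```

With the small bandwidth the observation at distance 5 receives a negligible weight, whereas with the large bandwidth it still contributes substantially, which is the smoothing effect described above.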

The local polynomials method attempts to fit a polynomial function P(x,y) to the (unknown)

frequency of the studied allele at every grid site (x,y). The fitted polynomial is a function of

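the following form:

\[ P(x,y) = \alpha_0 + \beta_x x + \beta_y y + \tfrac{1}{2}\gamma_{xx} x^2 + \gamma_{xy} x y + \tfrac{1}{2}\gamma_{yy} y^2 \]

that solves the least-squares problem

\[ E(P) = \sum_{i=1}^{n} w_i(h) \left( P(x_i - x, y_i - y) - z_i \right)^2 , \qquad (1) \]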

where $n$ is the sample size. Solving this problem amounts to finding a six-dimensional parameter $\theta$ that corresponds to the set of coefficients of the polynomial function $P$. The solution of the minimization problem (1) is the solution of a weighted linear regression. It is given by the following matrix product

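\[ \theta = (X^t W X)^{-1} X^t W Z \qquad (2) \]

where $Z^t = (z_1, \ldots, z_n)$, $W$ is an $n$-dimensional diagonal matrix such that $w_{ii} = w_i(h) = g(d_i/h)$, and

\[ X = \begin{pmatrix} 1 & (x_1 - x) & (y_1 - y) & (x_1 - x)^2 & (x_1 - x)(y_1 - y) & (y_1 - y)^2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & (x_n - x) & (y_n - y) & (x_n - x)^2 & (x_n - x)(y_n - y) & (y_n - y)^2 \end{pmatrix} . \]

With these columns, the last three entries of $\theta$ are $\tfrac{1}{2}\gamma_{xx}$, $\gamma_{xy}$ and $\tfrac{1}{2}\gamma_{yy}$.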

Note that computing the estimated parameter θ requires a number of operations of order O(n).

The derivatives can be estimated as

\[ \frac{\partial P}{\partial x}(x,y) \equiv \beta_x = \beta_x(x,y,h) \]

and

\[ \frac{\partial P}{\partial y}(x,y) \equiv \beta_y = \beta_y(x,y,h). \]

An estimate of the gradient norm in the systemic function can then be given by

\[ \|\nabla f\|^2 = \beta_x^2 + \beta_y^2 . \]
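The fit at a single grid site can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the data are synthetic, noiseless frequencies (rather than Bernoulli indicators) chosen so that the recovered coefficients can be checked by eye.

```python
import numpy as np

def design_matrix(site, coords):
    """Rows [1, dx, dy, dx^2, dx*dy, dy^2] with dx = x_i - x and dy = y_i - y."""
    dx = coords[:, 0] - site[0]
    dy = coords[:, 1] - site[1]
    return np.column_stack([np.ones_like(dx), dx, dy, dx**2, dx * dy, dy**2])

def local_quadratic_fit(site, coords, z, h):
    """Solve theta = (X^t W X)^{-1} X^t W Z at one grid site."""
    X = design_matrix(site, coords)
    d = np.hypot(X[:, 1], X[:, 2])        # Euclidean distances d_i
    w = np.exp(-0.5 * (d / h) ** 2)       # Gaussian weights g(d_i / h)
    XtW = X.T * w                         # equals X^t W for W = diag(w)
    return np.linalg.solve(XtW @ X, XtW @ z)

def squared_gradient_norm(theta):
    """beta_x^2 + beta_y^2, using entries 1 and 2 of theta."""
    return theta[1] ** 2 + theta[2] ** 2

# Synthetic check (assumed data): a planar "frequency" surface without noise.
rng = np.random.default_rng(0)
coords = rng.uniform(-1.0, 1.0, size=(200, 2))
z = 0.5 + 0.3 * coords[:, 0] - 0.1 * coords[:, 1]
theta = local_quadratic_fit((0.0, 0.0), coords, z, h=0.5)
grad2 = squared_gradient_norm(theta)
```

Because the synthetic surface is exactly planar, the fitted slopes should match the coefficients used to generate it and the quadratic terms should vanish up to rounding. Solving the normal equations with `np.linalg.solve` avoids forming the matrix inverse explicitly.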

Note that equation (2) can be rewritten as

θ = B Z

where $B = (X^t W X)^{-1} X^t W$.

The estimates are thus obtained as the product of the two matrices $B$ and $Z$. The matrix $B$ depends on the spatial data only, whereas the genetic data are contained in $Z$. This remark is crucial for the implementation of the permutation tests. During the permutation tests, the alleles are resampled without replacement, so the matrix $B$ need not be recalculated, and the derivatives are re-estimated with a single matrix product per permutation. This reduced algorithmic cost supports the choice of a linear regression method over a logistic regression method.
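The computational point made above, that B is built once and reused for every permutation, can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the paper: the test statistic (squared gradient norm at a single site), the permutation of all individuals rather than a local subset, the p-value convention, and all function names are choices made for this example.

```python
import numpy as np

def hat_rows(site, coords, h):
    """B = (X^t W X)^{-1} X^t W; depends on the spatial coordinates only."""
    dx = coords[:, 0] - site[0]
    dy = coords[:, 1] - site[1]
    X = np.column_stack([np.ones_like(dx), dx, dy, dx**2, dx * dy, dy**2])
    w = np.exp(-0.5 * (dx**2 + dy**2) / h**2)   # Gaussian weights g(d_i / h)
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW)        # shape (6, n)

def permutation_pvalue(B, z, n_perm=999, seed=0):
    """Fraction of data sets (permuted plus observed) whose squared gradient
    norm is at least as large as the observed one."""
    rng = np.random.default_rng(seed)

    def stat(v):
        beta = B[1:3] @ v                       # (beta_x, beta_y) in one product
        return float(np.sum(beta ** 2))

    observed = stat(z)
    # B is reused unchanged: only the allele indicators are shuffled.
    exceed = sum(stat(rng.permutation(z)) >= observed for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

# Assumed toy data: the allele is carried only by individuals east of x = 0.
rng = np.random.default_rng(1)
coords = rng.uniform(-1.0, 1.0, size=(200, 2))
z = (coords[:, 0] > 0).astype(float)
B = hat_rows((0.0, 0.0), coords, h=0.5)
p = permutation_pvalue(B, z)
```

Each permutation costs one small matrix-vector product, whereas a logistic regression would require an iterative fit for every resampled data set.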

Figure legends

Figure 1. Illustration of the permutation test. The figure shows individuals (dots) genotyped at a two-allele locus (A = black dots, a = white dots) and includes geographic structure. The thick black line indicates the presence of a physical barrier (e.g. a mountain range), and the cross corresponds to a site (x,y) in the grid. When the permutation test is applied to individuals at distance h = h1, black and white dots may be mixed up with high probability. The null hypothesis of absence of structure may then be rejected. In contrast, the test may be non-significant when it is applied to individuals at distance h = h2.

Figure 2. (a) W-shaped response corresponding to a cline in allele frequency at a locus with two alleles. (b) V-shaped response corresponding to a physical barrier to gene flow.

Figure 3. One-dimensional cross-section of the GBM corresponding to the response of a cline

in simulated data. Genetical bandwidths range from 200 (min) to 1200 (max).

Figure 4. Simulations of the two-islands model with limited gene flow. (a) Relative locations

of the two islands along the r-axis. (b) Critical FST values below which population subdivision

cannot be detected (as a function of r). The sample size is set to n = 100 individuals. The

numbers of loci are set to L = 5 (triangles), L = 20 (squares), and L = 50 (circles). (c) Critical

FST values below which population subdivision cannot be detected (as a function of r). The

number of loci is set to L = 20. The sample sizes are n = 20 (circles), n = 50 (triangles), and n

= 100 (squares).

Figure 5. GBM response to the latitudinal cline in frequencies of the Adh locus in the North

American D. melanogaster populations (latitudinal section).

Figure 6. Map produced by the GBM method for the Eurasian and Asian modern human populations (CEPH dataset). U, T and V are signals of the genetic differentiation between Eurasian and Asian populations. L may be explained by the genetic isolate Kalash, and Y is consistent with the separation of populations with languages of Altaic origin.

[Figure 1 image: landscape barrier (e.g. mountain range); distances h1 and h2.]

[Figure 2 image: (a) W-shaped response; (b) V-shaped response.]

[Figure 3 image.]

[Figure 4 image: (a) respective locations of the two islands; (b) and (c) FST plotted against the spatial ratio.]

[Figure 5 image: genetical bandwidth plotted against latitude (°N).]

[Figure 6 image.]
