The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto...

15
The Evolutionary History of Nebraska Deer Mice: Local Adaptation in the Face of Strong Gene Flow Susanne P. Pfeifer, †,1,2 Stefan Laurent, †,1 Vitor C. Sousa, †,3,4 Catherine R. Linnen, †,5 Matthieu Foll, 1 Laurent Excoffier,* ,3 Hopi E. Hoekstra,* ,6 and Jeffrey D. Jensen* ,1,2 1 School of Life Sciences, Ecole Polytechnique F ed erale de Lausanne, Lausanne, Switzerland 2 School of Life Sciences, Center for Evolution & Medicine, Arizona State University, Tempe, AZ 3 Institute of Ecology & Evolution, University of Berne, Berne, Switzerland 4 Centre for Ecology, Evolution and Environmental Changes, Faculdade de Ci ^ encias, Universidade de Lisboa, Lisboa, Portugal 5 Department of Biology, University of Kentucky, Lexington, KY 6 Department of Organismic & Evolutionary Biology and Molecular & Cellular Biology, Museum of Comparative Zoology, Howard Hughes Medical Institute, Harvard University, Cambridge, MA These authors contributed equally to this work. *Corresponding authors: E-mails: laurent.excoffi[email protected]; [email protected]; [email protected]. Associate editor: Nadia Singh Abstract The interplay of gene flow, genetic drift, and local selective pressure is a dynamic process that has been well studied from a theoretical perspective over the last century. Wright and Haldane laid the foundation for expectations under an island- continent model, demonstrating that an island-specific beneficial allele may be maintained locally if the selection coefficient is larger than the rate of migration of the ancestral allele from the continent. Subsequent extensions of this model have provided considerably more insight. Yet, connecting theoretical results with empirical data has proven challenging, owing to a lack of information on the relationship between genotype, phenotype, and fitness. Here, we examine the demographic and selective history of deer mice in and around the Nebraska Sand Hills, a system in which variation at the Agouti locus affects cryptic coloration that in turn affects the survival of mice in their local habitat. We first genotyped 250 individuals from 11 sites along a transect spanning the Sand Hills at 660,000 single nucleotide polymorphisms across the genome. Using these genomic data, we found that deer mice first colonized the Sand Hills following the last glacial period. Subsequent high rates of gene flow have served to homogenize the majority of the genome between populations on and off the Sand Hills, with the exception of the Agouti pigmentation locus. Furthermore, mutations at this locus are strongly associated with the pigment traits that are strongly correlated with local soil coloration and thus responsible for cryptic coloration. Key words: population genetics, cryptic coloration, adaptation. Introduction Characterizing the conditions under which a population may adapt to a new environment, despite ongoing gene flow with its ancestral population, remains a question of central importance in population and ecological genetics. Hence, many studies have attempted to characterize selection–migration dynamics. Early studies focused on the distributions of phenotypes using ecological data to estimate migration rates (e.g., Clarke and Murray 1962; Endler 1977; Cook and Mani 1980); whereas later studies, focused primarily on Mendelian traits, documented pat- terns of genetic variation at a causal locus and contrasted that with putatively neutral loci to estimate selection and migration strengths (e.g., Ross and Keller 1995; Hoekstra et al. 2004; Stinchcombe et al. 2004, and see review of Linnen and Hoekstra 2009). Only recently have such studies expanded to examine whole-genome responses (see Jones et al. 2012; Feulner et al. 2015). These selection–migration dynamics are well grounded in theory. Haldane (1930) first noted that migration may disrupt the adaptive process if selection is not sufficiently strong to maintain a locally beneficial allele. Bulmer (1972) went on to describe a two-allele two-deme model in which such a loss is expected to occur if m/s > a/(1 a), where m is the rate of migration, s is the selection coefficient in population 1, and a is the ratio of selection coefficients between populations. More recent related work suggests that in order for a beneficial allele to reach fixation, s must be much larger than m (e.g., Lenormand 2002; Yeaman and Otto 2011, and see review of Tigano and Friesen 2016). In addition to fixation probabilities, the migration load induced by the influx of locally deleterious mutations entering Article Fast Track ß The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Open Access 792 Mol. Biol. Evol. 35(4):792–806 doi:10.1093/molbev/msy004 Advance Access publication January 15, 2018 Downloaded from https://academic.oup.com/mbe/article-abstract/35/4/792/4810448 by Harvard University user on 07 May 2018

Transcript of The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto...

Page 1: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

The Evolutionary History of Nebraska Deer Mice LocalAdaptation in the Face of Strong Gene Flow

Susanne P Pfeiferdagger12 Stefan Laurentdagger1 Vitor C Sousadagger34 Catherine R Linnendagger5 Matthieu Foll1

Laurent Excoffier3 Hopi E Hoekstra6 and Jeffrey D Jensen12

1School of Life Sciences Ecole Polytechnique Federale de Lausanne Lausanne Switzerland2School of Life Sciences Center for Evolution amp Medicine Arizona State University Tempe AZ3Institute of Ecology amp Evolution University of Berne Berne Switzerland4Centre for Ecology Evolution and Environmental Changes Faculdade de Ciencias Universidade de Lisboa Lisboa Portugal5Department of Biology University of Kentucky Lexington KY6Department of Organismic amp Evolutionary Biology and Molecular amp Cellular Biology Museum of Comparative Zoology HowardHughes Medical Institute Harvard University Cambridge MAdaggerThese authors contributed equally to this work

Corresponding authors E-mails laurentexcoffierieeunibech hoekstraoebharvardedu jeffreydjensenasuedu

Associate editor Nadia Singh

Abstract

The interplay of gene flow genetic drift and local selective pressure is a dynamic process that has been well studied froma theoretical perspective over the last century Wright and Haldane laid the foundation for expectations under an island-continent model demonstrating that an island-specific beneficial allele may be maintained locally if the selectioncoefficient is larger than the rate of migration of the ancestral allele from the continent Subsequent extensions ofthis model have provided considerably more insight Yet connecting theoretical results with empirical data has provenchallenging owing to a lack of information on the relationship between genotype phenotype and fitness Here weexamine the demographic and selective history of deer mice in and around the Nebraska Sand Hills a system in whichvariation at the Agouti locus affects cryptic coloration that in turn affects the survival of mice in their local habitat Wefirst genotyped 250 individuals from 11 sites along a transect spanning the Sand Hills at 660000 single nucleotidepolymorphisms across the genome Using these genomic data we found that deer mice first colonized the Sand Hillsfollowing the last glacial period Subsequent high rates of gene flow have served to homogenize the majority of thegenome between populations on and off the Sand Hills with the exception of the Agouti pigmentation locusFurthermore mutations at this locus are strongly associated with the pigment traits that are strongly correlated withlocal soil coloration and thus responsible for cryptic coloration

Key words population genetics cryptic coloration adaptation

Introduction

Characterizing the conditions under which a populationmay adapt to a new environment despite ongoing geneflow with its ancestral population remains a question ofcentral importance in population and ecological geneticsHence many studies have attempted to characterizeselectionndashmigration dynamics Early studies focused onthe distributions of phenotypes using ecological data toestimate migration rates (eg Clarke and Murray 1962Endler 1977 Cook and Mani 1980) whereas later studiesfocused primarily on Mendelian traits documented pat-terns of genetic variation at a causal locus and contrastedthat with putatively neutral loci to estimate selection andmigration strengths (eg Ross and Keller 1995 Hoekstraet al 2004 Stinchcombe et al 2004 and see review ofLinnen and Hoekstra 2009) Only recently have such

studies expanded to examine whole-genome responses(see Jones et al 2012 Feulner et al 2015)

These selectionndashmigration dynamics are well groundedin theory Haldane (1930) first noted that migration maydisrupt the adaptive process if selection is not sufficientlystrong to maintain a locally beneficial allele Bulmer (1972)went on to describe a two-allele two-deme model in whichsuch a loss is expected to occur if msgt a(1 a) wherem is the rate of migration s is the selection coefficient inpopulation 1 and a is the ratio of selection coefficientsbetween populations More recent related work suggeststhat in order for a beneficial allele to reach fixation s mustbe much larger than m (eg Lenormand 2002Yeaman and Otto 2011 and see review of Tigano andFriesen 2016)

In addition to fixation probabilities the migration loadinduced by the influx of locally deleterious mutations entering

Article

FastT

rack

The Author(s) 2018 Published by Oxford University Press on behalf of the Society for Molecular Biology and EvolutionThis is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License(httpcreativecommonsorglicensesby-nc40) which permits non-commercial re-use distribution and reproduction in anymedium provided the original work is properly cited For commercial re-use please contact journalspermissionsoupcom Open Access792 Mol Biol Evol 35(4)792ndash806 doi101093molbevmsy004 Advance Access publication January 15 2018Downloaded from httpsacademicoupcommbearticle-abstract3547924810448

by Harvard University useron 07 May 2018

the population has also been well studied Haldane (1957)found that the number of selective deaths necessary to main-tain differences between populations is proportional to thenumber of locally maladapted alleles migrating into the pop-ulation Thus the fewer loci necessary for the diverging locallyadaptive phenotype the lower the resulting migration loadMore recently Yeaman and Whitlock (2011) argued that withgene flow the genetic architecture underlying an adaptivephenotype is expected to have fewer and larger-effect allelescompared with neutral expectations under models withoutmigration (ie an exponential distribution of effect sizes [andsee Rafajlovic et al 2017]) As recombination between theselocally beneficial alleles may result in maladapted intermedi-ate phenotypes this model further predicts a genomic clus-tering of the underlying mutations (Maynard Smith 1977Lenormand and Otto 2000) The relative advantage of thislinkage will increase as the population approaches the migra-tion threshold at which local adaptation becomes impossible(Kirkpatrick and Barton 2006) Taken together these studiesmake a number of testable predictions relating migration ratewith the expected number of sites underlying the locally ben-eficial phenotype at the locus in question the average effectsize of the alleles the clustering of beneficial mutations andthe conditions under which a locally beneficial allele may belost maintained or fixed in a population

These theoretical expectations however have proven dif-ficult to evaluate in natural populations owing to a dearth ofsystems with concrete links between genotype phenotypeand fitness In one well-studied system deer mice(Peromyscus maniculatus) inhabiting the light-colored SandHills of Nebraska (formed within the last 8000 years followingthe end of the Wisconsin glacial period [Loope and Swinehart2000]) have evolved lighter coloration than conspecifics onthe surrounding dark soil (Dice 1941 1947) Using a combi-nation of laboratory crosses and hitchhiking-mapping theAgouti gene which encodes a signaling protein known toplay a key role in mammalian pigment-type switching andcolor patterning (Jackson 1994 Mills and Patterson 1994

Barsh 1996 and see review of Manceau et al 2010) hasbeen implicated in adaptive color variation (Linnen et al2009) Moreover using association mapping specific candi-date mutations have been linked to variation in specific pig-ment traits which in turn contribute to differential survivalfrom visually hunting avian predators (Linnen et al 2013)

While previous research in this system has established linksbetween genotype phenotype and fitness this work has fo-cused on a single ecotonal population located near the north-ern edge of the Sand Hills Thus the dynamic interplay ofmigration selection and genetic drift as well as the extentof genotypic and phenotypic structuring among populationsremains unknown To address these questions we have sam-pled hundreds of individuals across a transect spanning theSand Hills and neighboring populations to the north andsouth Combining extensive per-locale soil color measure-ments per-individual phenotyping and 660000 single nucle-otide polymorphisms (SNPs) genome-wide this data setprovides a unique opportunity to evaluate theoretical predic-tions of local adaptation with gene flow on a genomic scale

Results and Discussion

Environmental and Phenotypic Variation in andaround the Nebraska Sand HillsWe collected P maniculatus luteus individuals (Nfrac14 266) aswell as soil samples (Nfrac14 271) from 11 locations spanning a330-km transect across the Sand Hills of Nebraska (fig 1supplementary table 1 Supplementary Material online) Asexpected we found that soil color differed significantly be-tween ldquoonrdquo Sand Hills and ldquooffrdquo Sand Hills locations(F1frac14 3074 Plt 1 1015 fig 2) Similarly five of the sixmouse color traits also differed significantly between thetwo habitats (fig 2) In contrast mice trapped on versus offof the Sand Hills did not differ in other nonpigmentationtraits including total body length tail length hind foot lengthor ear length (see Supplementary Material online) Togetherthese results are consistent with strong divergent selection

FIG 1 Sampling locations on (light brown) and off (dark brown and out of state) of the Nebraska Sand Hills Sampling spanned a 330-kmtransect beginning 120 km south of the Sand Hills and ending 120 km north of the Sand Hills

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

793Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

between habitat types on multiple color traits but not othermorphological traits

We also examined habitat heterogeneity and phenotypicdifferences among sampling sites within habitat types Both

soil brightness and mouse color differed significantly amongsites within habitats (soil F9frac14 80 Pfrac14 19 1010 dorsalbrightness F9frac14 105 Pfrac14 81 1014 dorsal hue F9 frac1434Pfrac14 67 104 ventral brightness F9frac14 103 Pfrac14 14 1013 dorsalndashventral boundary F9frac14 66 Pfrac14 15 108tail stripe F9 frac14 69 Pfrac14 75 109) Additionally site-specific means for three of the color traits were significantlycorrelated with local soil brightness including dorsal bright-ness (tfrac14 44 Pfrac14 00017) dorsalndashventral boundary (tfrac14 39Pfrac14 00034) and tail stripe (tfrac14 31 Pfrac14 0013) In contrastsite-specific means for dorsal hue and ventral brightness didnot correlate with local soil color (dorsal hue tfrac14 2385Pfrac14 0097 ventral brightness tfrac14 032 Pfrac14 077) These resultssuggest that the agents of selection shaping correlations be-tween local soil color and mouse color vary among the traits

The Genetic Architecture of Light Color in Sand HillsMiceGenetic architecture parameter estimates derived from ourassociation mapping analyses suggest that variants in theAgouti locus explain a considerable amount of the observedcolor variation in on versus off Sand Hills populations (table 1)Indeed Agouti SNPs explain 69 of the total phenotypicvariance in dorsal brightness and tail stripe This analysisalso provides estimates of the potential number of causalSNPs as well as the proportion of genetic variance that isattributable to major-effect SNPs These estimates suggestthat dozens of Agouti variants may be contributing to varia-tion in color traits but the credible intervals for SNP numberare wide (table 1) One explanation for such wide intervals isthat it is difficult to disentangle the effects of closely linkedSNPs Nevertheless because the lower bound of the credibleintervals for all but one color trait exceeds 1 our data indicatethat there are multiple Agouti mutations with nonnegligibleeffects on color phenotypes Estimates of PGE (percent ofgenetic variance attributable to major effect mutations) indi-cate that whatever their number these major effect muta-tions explain a considerable percentage of total geneticvariance (eg 83 for dorsal brightness and 87 for dorsalndashventral boundary)

Although our genetic architecture parameter estimatesindicate that large-effect Agouti variants contribute to varia-tion in each of the five color traits the number and location ofthese SNPs vary considerably (fig 3) To interpret theseresults some background on the structure and function ofAgouti is needed Work in Mus musculus has identified twodifferent Agouti isoforms that are under the control of differ-ent promoters The ventral isoform containing noncodingexons 1 A1Arsquo is expressed in the ventral dermis during em-bryogenesis and is required for determining the boundarybetween the dark dorsum and light ventrum (Bultmanet al 1994 Vrieling et al 1994) The hair-cycle isoform con-taining noncoding exons 1B1C is expressed in both the dor-sal and ventral dermis during hair growth and is required forforming light pheomelanin bands on individual hairs(Bultman et al 1994 Vrieling et al 1994) In Peromyscus thesesame isoforms are present as well as additional novel isoforms(Mallarino et al 2017) Isoform-specific changes in Agouti

FIG 2 Soil color and mouse color traits are lighter on the NebraskaSand Hills Box plots for each sampling site were produced usingscaled (range 0ndash1) normal-quantile transformed values for soilbrightness (B2) dorsal brightness (PC1) dorsal hue (PC2) ventralbrightness (PC3) dorsalndashventral (D-V) boundary and tail stripeSoil is shown on the separated top panel and the five pigment traitsare given below In all plots higher values correspond to lighterbrighter color (y-axis) Gray shading indicates off the Sand Hills sitesSoil and five mouse color traits were significantly lighter on the SandHills (dorsal brightness F1 frac14 2186 Plt 1 1015 dorsal hueF1frac14 184 Pfrac14 24 105 ventral brightness F1 frac14 610 Pfrac14 15 1013 dorsalndashventral boundary F1frac14 1046 Plt 1 1015 tail stripeF1 frac14 1702 Plt 1 1015 with the corresponding P-values given tothe right of each panel A sixth color trait (ventral hue [PC4] [F1frac14 11Pfrac14 030]) and four noncolor traits (total body length [F1frac14 011Pfrac14 074] tail length [F1frac14 018 Pfrac14 067] hind foot length [F1frac14 30Pfrac14 0084] and ear length [F1frac14 048 Pfrac14 049]) did not differ signif-icantly between habitats (see supplementary fig 10 and table 11Supplementary Material online)

Pfeifer et al doi101093molbevmsy004 MBE

794Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

expression are associated with changes in the dorsalndashventralboundary (ventral isoform Manceau et al 2011) and thewidth of light bands on individual hairs (hair-cycle isoformLinnen et al 2009) Both isoforms share the same three codingexons (exons 2ndash4) thus protein-coding changes could simul-taneously alter pigmentation patterning and hair banding(Linnen et al 2013)

Across the Agouti locus we detected 160 SNPs that werestrongly associated (posterior inclusion probability [PIP] inthe polygenic BSLMMgt 01) with at least one color trait(see Materials and Methods) By trait the number of SNPsexceeding our PIP threshold was 53 (dorsal brightness) 2(dorsal hue) 16 (ventral brightness) 34 (dorsalndashventralboundary) and 81 (tail stripe) Additionally patterns ofgenotypendashphenotype association were distinct for eachtrait but consistent with Agouti isoform function (fig 3)For example we observed the strongest associations fordorsal brightness around the three coding exons (exons2 3 and 4 located between positions 2506 and 2511 Mbon the scaffold containing the Agouti locus) and upstreamof the transcription start site for the Agouti hair-cycle iso-form (fig 3) In contrast for dorsalndashventral boundary theSNP with the highest PIP was located very near the tran-scription start site for the Agouti ventral isoform (fig 3)Notably a previously identified serine deletion in exon 2(Linnen et al 2009 2013) exceeds our threshold for three ofthe color traits (dorsal brightness dorsalndashventral boundaryand tail stripe) and was elevated above background asso-ciation levels for the remaining two traits (dorsal hue andventral brightness) and multiple candidates from thoseanalyses are also recovered here (supplementary table 2Supplementary Material online) Overall our associationmapping results thus reveal that there are many sites inAgouti that are associated with pigment variation

We also estimated genetic architecture parameters andidentified candidate pigmentation SNPs in the full data set(Agouti SNPs plus all SNPs outside of Agouti that had nomissing data) For all five traits including SNPs outside ofAgouti led to a modest increase in PVE PGE and SNP-number estimates but credible intervals overlapped withthose of the Agouti-only data set for all parameter estimates(table 1 vs supplementary table 3 Supplementary Materialonline) Additionally although the highest-PIP SNPs werefound in Agouti we identified several non-Agouti candidate

SNPs as well (supplementary table 4 Supplementary Materialonline) Together these results suggest that while Agoutiexplains a considerable amount of color variation in miceliving on and around the Nebraska Sand Hills a complete

Table 1 Hyperparameters Describing the Genetic Architecture of Five Color Traits

Trait h PVE rho PGE n_gamma

Dorsal brightness 050 (029 07) 069 (057 081) 062 (031 092) 083 (064 097) 1172 (7 286)Dorsal hue 028 (007 058) 026 (008 045) 032 (002 087) 030 (0 086) 554 (0 264)Ventral brightness 033 (016 051) 039 (023 056) 043 (013 082) 054 (022 088) 806 (3 271)D-V boundary 044 (025 064) 061 (046 073) 075 (045 098) 087 (07 099) 705 (9 256)Tail stripe 051 (032 069) 069 (056 08) 064 (038 089) 079 (061 095) 1239 (16 284)

NOTEmdashAll parameters were estimated using a BSLMM model in GEMMA 2148 Agouti SNPs and a relatedness matrix estimated from genome-wide SNPs Parameter meansand 95 credible intervals (lower bound upper bound) were calculated from ten independent runs per trait Interpretation of hyperparameters is as follows PVE the totalproportion of phenotypic variance that is explained by genotype PGE the proportion of genetic variance explained by sparse (ie major) effects h an approximation used forprior specification of the expected value of PVE rho an approximation used for prior specification of the expected value of PGE n_gamma the expected number of sparse(major) effect SNPs

FIG 3 Association mapping results for Agouti SNPs and five colortraits Each point represents one of 2148 Agouti SNPs included in theassociation mapping analysis with position (on scaffold 16 seeMaterials and Methods) indicated on the x-axis and PIP estimatedusing a BSLMM in GEMMA on the y-axis PIP values which approx-imate the strength of the association between genotype and pheno-type were averaged across ten independent GEMMA runs Black dotsindicate SNPs with PIPgt 01 dashed lines give the location of sixcandidate SNPs identified in a previous association mapping analysisof a single population (associated with tail stripe D-V boundary aswell as dorsal brightness and hue Linnen et al 2013) gray lines givethe location of Agouti exons and arrows indicate two alternativetranscription start sites

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

795Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

accounting of non-Agouti regions associated with color willrequire a whole-genome resequencing approach

The Demographic History of the Sand Hills PopulationInferring the demographic history of this region is of interestin and of itself but also is a requisite step for downstreamselection inference Based on their genetic diversity individ-uals from different sampling locations grouped according totheir location and habitat with a clear separation betweenindividuals from on and off of the Sand Hills (fig 4) Theobserved pattern of isolation by distance (IBD) was furthersupported by a significant correlation between pairwise ge-netic differentiation (FST[1 FST]) and pairwise geographicdistance among sampling sites (Pfrac14 99 104) The pairwiseFST values between sampling locations were low ranging from0008 to 0065 indicating limited genetic differentiationConsistent with both patterns of IBD and reduced differen-tiation among populations individual ancestry proportions(as inferred by sNMF and TESS) were best explained by threepopulation clusters Additionally the ancestry proportionsobtained with three clusters (fig 4) returned low cross-entropy values (supplementary fig 1 SupplementaryMaterial online) particularly when accounting for geographicsampling location again suggesting three distinct groups 1)North of the Sand Hills 2) on the Sand Hills and 3) south of

the Sand Hills The population tree inferred with TreeMix alsoindicated low levels of genetic drift among populations aswell as primary differentiation between north and south pop-ulations with the Sand Hills population occupying an inter-mediate position This pattern is consistent with increaseddifferentiation moving along the transect (fig 4) which islikely owing both to simple IBD as well as the more complexsettlement history inferred below

Given the observed population structure we next investi-gated models of colonization history based on three popula-tions Those inhabiting the Sand Hills and those inhabiting theneighboring regions to the north and to the south This resultedin two different three-population demographic models corre-sponding to different population tree topologies which explic-itly allow for bottlenecks associated with potential founderevents and for gene flow among populations (supplementaryfig 2 Supplementary Material online) Both models appearedequally likely (supplementary table 5 Supplementary Materialonline) and parameter estimates were similar and consistentacross models pointing to a recent divergence time amongpopulations evidence of a bottleneck associated with the col-onization of the Sand Hills and high rates of migration amongall populations (supplementary table 6 SupplementaryMaterial online) Note that the tested models did not specifi-cally impose a bottleneck for the colonization of the Sand Hills

FIG 4 Genetic structure of populations on and off the Sand Hills (A) PCA shows that individuals cluster according to their geographic samplinglocation with a clear separation of individuals sampled on the Sand Hills (SH green) from off the Sand Hills (blue) (B) TreeMix results for the treethat best fit the covariance in allele frequencies across samples (C) An isolation-by-distance model is supported by a significant correlationbetween pairwise genetic differentiation (FST[1 FST]) and pairwise geographic distance among sampling sites Significance of the Mantel testwas assessed through 10000 permutations (D) Spatial interpolation of the ancestry proportion inferred accounting for geographic samplinglocation using TESS3 for Kfrac14 3 groups Individuals are represented as dots and the ancestry of the three groups is represented by a gradient of threecolors (E) Individual admixture proportions (inferred using TESS3) are consistent with IBD and reduced differentiation among populations Thenumber of clusters (K) best explaining the data was assessed as the K value reaching the lowest minimal cross-entropy at Kfrac14 3 All analyses werebased on a subset of 161 individuals showing a mean depth of coverage larger than 8 with all genotypes possessing a minimum coverage of 4sampling one SNP for each 15-kb block resulting in a data set with 5412 SNPs (supplementary table 8 Supplementary Material online)

Pfeifer et al doi101093molbevmsy004 MBE

796Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

as population sizes could have remained high at the onset of thecolonization of the Sand Hills

To better distinguish between the models we comparedthe likelihoods of the two models for bootstrap data setscontaining a single SNP per 15-kb block counting the num-ber of bootstrap replicates with a relative likelihood (based onAkaikersquos information criterion values) larger than 095 for eachmodel Using this approach we identified a model with atopology in which the population on the Sand Hills sharesa more recent common ancestor with the population off theSand Hills to the south (supported in 81100 bootstrap rep-licates supplementary fig 3 Supplementary Material online)This topology and the parameter estimates (supplementarytable 6 Supplementary Material online) are consistent with arecent colonization of the Sand Hills from the south namely1) A recent split within the last4000 years (95 CI 3400ndash7900) 2) a stronger bottleneck associated with the coloniza-tion of the Sand Hills compared with the older split betweennorthern and southern populations and 3) higher levels ofgene flow from south to north including higher migrationrates from the southern population into the Sand Hills (2Nm95 CI 125ndash240) Furthermore this model fits well the mar-ginal site frequency spectrum (SFS) for each sample as well asthe 2D marginal SFS for each pair of populations (supplemen-tary fig 4 Supplementary Material online) Thus this neutraldemographic model nicely explains observed patterns of var-iation outside of the Agouti region and suggests that at leastthe most recent colonization represented by the currentlysampled individuals occurred considerably after the SandHills began to initially form (ie roughly 8000 years ago)

The Selective History of the Sand Hills PopulationDespite the high levels of gene flow inferred in our modelwhich results in low levels of differentiation among popula-tions genome-wide (supplementary fig 5a SupplementaryMaterial online) we observed high levels of differentiationamong populations at the Agouti locusmdashvariation that isfurther correlated with several phenotypic traits (fig 5 sup-plementary figs 5 and 6 Supplementary Material online) Weobserved the highest level of differentiation between micesampled on either side of the northern limit of the SandHills with a 100-kb region within Agouti exhibiting a contin-uous run of elevated genetic differentiation (supplementaryfig 6 Supplementary Material online) This pattern is consis-tent with the expectations of positive selection under a localadaptation regime Within this 100-kb region a subregion of30 kb located in intron 1 (ie between exon 1A and exon 1)displayed maximal differentiation (supplementary fig 6Supplementary Material online) making it difficult to pre-cisely identify the target of selection In contrast to thewide-range differentiation observed at Agouti on the northernside of the cline a single narrow FST-peak of roughly 5 kblocated in intron 1 was observed on the southern limit ofthe Sand Hills (fig 5 supplementary fig 6a SupplementaryMaterial online) Genome-wide comparison confirmed theoverall genetic similarity between light-colored Sand Hillsmice and the dark-colored population to the south withthe exception of a small number of variants putatively driving

adaptive phenotypic divergence at Agouti (fig 5 supplemen-tary fig 6a Supplementary Material online)

To test whether this pattern of differentiation could be theresult of nonselective forces (see Crisci et al 2012 Jensen et al2016) we calculated the HapFLK statistic providing a singlemeasure of genetic differentiation for the three geographiclocalities while controlling for neutral differentiation Beforecalculating genetic differentiation at the target locus (ieAgouti) the HapFLK method computes a neutral distancematrix from the background data (ie the backgroundregions randomly distributed across the genome) that sum-marizes the genetic similarity between populations with re-gard to the extent of genetic drift since divergence from theircommon ancestral population (supplementary fig 7Supplementary Material online) Consistent with the inferreddemographic history of the populations genome-wide back-ground levels of variation suggest that individuals capturedsouth of the Sand Hills are more similar to those inhabitingthe Sand Hills when compared with populations to thenorth The HapFLK profile confirmed the significant geneticdifferentiation at the Agouti locus compared with neutralexpectations (fig 6) Interestingly southern populations sharea significant amount of the haplotypic structure observed inpopulations on the Sand Hills with the exception of thecandidate region itself This is in keeping with high rates ofon-going gene flow occurring between on and off Sand Hillspopulations whereas selection maintains the ancestral Agoutialleles in the populations off the Sand Hills and the derivedalleles conferring crypsis on the Sand Hills

Owing to low levels of linkage disequilibrium characteriz-ing the Agouti region adaptive variants could be mapped on afine genetic scale Specifically there are three regions of in-creased differentiation (fig 6) 1) A narrow 3-kb peak in theHapFLK profile (located in intron 1 supplementary table 7Supplementary Material online) colocalizes with a region ofhigh linkage disequilibrium (supplementary fig 8Supplementary Material online) suggesting a recent selectiveevent 2) a second highest peak of differentiation locatedbetween exon 1A and the duplicated reversed copyexon1Arsquo is the only significant region detected by the CLRtest (supplementary fig 9 Supplementary Material online)and contains strong haplotype structure (fig 6) again sug-gesting the recent action of positive selection and 3) a thirdhighest peak of differentiation surrounds the putatively ben-eficial serine deletion in exon 2 previously described by Linnenet al (2009) showing strong differentiation between on andoff Sand Hills populations in our data set Although severallines of evidence support the role of the serine deletion inadaptation to the Sand Hills the linkage disequilibriumaround this variant is low suggesting that the age of thecorresponding selective event is likely the oldest among thethree candidate regions with subsequent mutation and re-combination events reducing the selective sweep patternHence as one may anticipate this greatly expanded clinaldata set served to identify younger and more geographicallylocalized candidate regions from across the Sand Hills com-pared with previous studies while still generally supportingthe model proposed by Linnen et al (2013) of multiple

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

797Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

independently selected mutations underlying differentaspects of the cryptic phenotype

Predicted Migration Thresholds and the Conditions ofAllele MaintenanceAlthough a number of strong assumptions must be made itis possible to estimate the migration threshold below which alocally beneficial mutation may be maintained in a

population Given our high rates of inferred gene flow it isimportant to consider whether observations are consistentwith theoretical expectations necessary for the maintenanceof alleles Following Yeaman and Whitlock (2011) andYeaman and Otto (2011) this migration threshold is definedin terms of both rates of gene flow as well as fitness effects oflocally beneficial alleles in the matching and in the alternateenvironment Given the parameter values inferred here along

FIG 5 Genetic differentiation across the Agouti locus Genetic differentiation observed between populations on the Sand Hills and populationsnorth of the Sand Hills (top panel) populations on the Sand Hills and populations south of the Sand Hills (middle panel) and populations northand south of the Sand Hills Dotted lines indicate the genome-wide average FST between these comparisons As shown differentiation is generallygreatly elevated across the region with a number of highly differentiated SNPs particularly when comparing on versus off Sand Hills populationswhich correspond to SNPs associated with different aspects of the cryptic phenotype (PIP scores are indicated by the size of the dots as shown inthe legend and significant SNPs are labeled with respect to the associated phenotype) Populations to the north and south of the Sand Hills are alsohighly differentiated in this region likely owing to differing levels of gene flow with the Sand Hills populations yet the SNPs underlying the crypticphenotype are not differentiated between these dark populations

Pfeifer et al doi101093molbevmsy004 MBE

798Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

with estimated population sizes the threshold above whichthe most strongly beneficial mutations may be maintained inthe population is estimated at mfrac14 08mdasha large value indeedgiven the inferred strength of selection (see Materials andMethods for details) For the more weakly beneficial muta-tions identified this value is estimated at mfrac14 007 again wellabove empirically estimated rates Thus our empirical obser-vations are fully consistent with expectation in that gene flowis strong enough to prevent locally beneficial alleles fromfixing on the Sand Hills but not so strong so as to swampout the derived allele despite the high input of ancestral var-iation As such parameter estimates fall in a range consistentwith the long-term maintenance of polymorphic alleles

ConclusionsThe cryptically colored mice of the Nebraska Sand Hills rep-resent one of the few mammalian systems in which aspects ofgenotype phenotype and fitness have been measured andconnected Yet the population genetics of the Sand Hillsregion has remained poorly understood By sequencing hun-dreds of individuals across a cline spanning over 300 km fun-damental aspects of the evolutionary history of thispopulation could be addressed for the first time Utilizinggenome-wide putatively neutral regions the inferred demo-graphic history suggests a relatively recent colonization of theSand Hills from the south well after the last glacial period

Further high rates of gene flow are inferred between light anddark populationsmdashresulting in genome-wide low levels ofgenetic differentiation as well as low levels of phenotypic dif-ferentiation of four traits unrelated to coloration across thecline However the Agouti region differs markedly in this re-gard with high levels of differentiation observed between onand off Sand Hills populations strong haplotype structure andhigh levels of linkage disequilibrium spanning putatively ben-eficial mutations among cryptically colored individuals In ad-dition we found these putatively beneficial mutations to bestrongly associated with the phenotypic traits underlying cryp-sis and these phenotypic traits were found to be in strongassociation with variance in soil color across the cline

Together these results suggest a model in which selectionis acting to maintain the alleles underlying the locally adaptivephenotype on lightdark soil despite substantial gene flowwhich not only prevents the populations from strongly dif-ferentiating from one another but also prevents the crypticgenotypes from reaching local fixation Furthermore return-ing to the theoretical predictions outlined in theIntroduction the mutations underlying the cryptic pheno-type are found to be few in number of large effect size and inclose genomic proximity As described by Yeaman andWhitlock (2011) in models of local adaptation with high mi-gration the establishment of a large effect beneficial allelemay indeed facilitate the accumulation of other locally ben-eficial alleles in the same genomic region owing to the local

FIG 6 Comparison of tests of neutrality and selection across the Agouti locus From the top The exon structure pairwise differences Tajimarsquos DCLR test HapFLK and FLK The dashed lines indicate the significance threshold using parametric bootstrapping of the best-fitting demographicmodel

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

799Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

reduction in effective migration rate (and see Aeschbacherand Burger 2014) Though speculative such a model mayindeed explain the accumulating observations of selectionfor crypsis generally targeting either the Agouti locus or theMc1r locus in a given population rather than both (ie asingle region is targeted by selection rather than two unlinkedregions)

Our results are in keeping with this model with the exon 2serine deletion explaining a large proportion of variance inmultiple traits underlying the cryptic phenotype and withpopulation-genetic patterns suggesting comparatively oldselection acting on this variant In addition owing to thelarge-scale geographic sampling multiple newly identifiedgenomically clustered and smaller effect alleles modulatingindividual traits are also identified which are characterized bystrong patterns of selective sweeps indicative of more recentselection This system thus provides an in-depth picture ofthe dynamic interplay of these population-level processesunderlying this adaptive phenotype and highlights a historycharacterized by remarkably strong local selective pressures aswell as continuous high rates of gene flow with the ancestralfounding populations

Materials and Methods

Population SamplingCollectionWe collected P maniculatus luteus individuals (Nfrac14 266) andcorresponding soil samples (Nfrac14 271) from 11 sites spanninga 330-km transect starting 120 km north of the Sand Hillsand ending 120 km south of the Sand Hills (fig 1 supple-mentary table 1 Supplementary Material online) In total fivesampling locations were on the Sand Hills and six were lo-cated off of the Sand Hills (three in the north and three in thesouth) We collected mice using Sherman live traps and pre-pared flat skins according to standard museum protocolsthese specimens are accessioned in the Harvard UniversityMuseum of Comparative Zoologyrsquos Mammal DepartmentFrom each sample we preserved liver tissue in 100 EtOHfor subsequent DNA extraction Collections were made underthe South Dakota Department of Game Fish and Parks sci-entific collectorrsquos permit 54 (2008) and the Nebraska Gameand Parks Commission Scientific and Educational Permit(2008) 579 subpermit 697-700 This work was approved byHarvard Universityrsquos Institutional Animal Care and UseCommittee

Mouse color measurements For each mouse we measuredstandard morphological traits including total length taillength hind foot length and ear length We also characterizedcolor and color pattern using methods modified from Linnenet al (2013) In brief to quantify color we used a USB2000spectrophotometer (Ocean Optics) to take reflectance meas-urements at three sites across the body (three replicatedmeasurements in the dorsal stripe flank and ventrum) Weprocessed these raw reflectance data using the CLR ColourAnalysis Programs v105 (Montgomerie 2008) Specifically wetrimmed and binned the data to 300ndash700 nm and 1 nm bins

and computed five color summary statistics B2 (mean bright-ness) B3 (intensity) S3 (chroma) S6 (contrast amplitude)and H3 (hue) We then averaged the three measurements foreach body region producing a total of 15 color values (ie fivesummary statistics from each of the three body regions) Toensure values were normally distributed we performed anormal-quantile transformation on each of the 15 color var-iables Using these transformed values we performed a prin-cipal component analysis (PCA) in STATA v130 (StataCorpCollege Station TX) On the basis of the eigenvalues andexamination of the scree plot we selected the first four prin-cipal components which together accounted for 84 of thevariation in color To increase interpretability of the loadingswe performed a VARIMAX rotation on the first four PCs(Tabachnick and Fidell 2001 Montgomerie 2006) After rota-tion PC1 corresponded to the brightnesscontrast (B2 B3S6) of the dorsum PC2 to the chromahue (S3 H3) of thedorsum PC3 to the brightnesscontrast (B2 B3 S6) of theventrum and PC4 to the chromahue (S3 H3) of the ven-trum Therefore we refer to PC1 as ldquodorsal brightnessrdquo PC2 asldquodorsal huerdquo PC3 as ldquoventral brightnessrdquo and PC4 as ldquoventralhuerdquo throughout

To quantify color pattern we took digital images of eachmouse flat skin with a Canon EOS Rebel XTI with a 50-mmCanon lens (Canon USA Lake Success NY) We used thequick selection tool in Adobe Photoshop CS5 (AdobeSystems San Jose CA) to select light and dark areas on eachmouse Specifically we outlined the dorsum (brown portion ofthe dorsum with legs and tails excluded) body (outlined entiremouse with brown and white areas legs and tails excluded)tail stripe (outlined dark stripe on tail only) and tail (outlinedentire tail) We calculated ldquodorsalndashventral boundaryrdquo as(bodymdashdorsum)body and ldquotail striperdquo as (tailmdashtail stripe)tail Thus each measure represents the proportion of a partic-ular region (tail or body) that appeared unpigmented highervalues therefore represent lighter phenotypes

To determine whether color phenotypes (dorsal bright-ness dorsal hue ventral brightness ventral hue dorsalndashventral boundary and tail stripe) differ between on SandHills and off Sand Hills populations we performed a nestedanalysis of variance (ANOVA) on each trait with samplingsite nested within habitat For comparison we also performednested ANOVAs on each of the four noncolor traits (totallength tail length hind foot length and ear length) All anal-yses were performed on normal-quantile-transformed dataUnless otherwise noted we performed all transformationsand statistical tests in R v332

Soil color measurements To characterize soil color we col-lected soil samples in the immediate vicinity of each success-ful Peromyscus capture To measure brightness we pouredeach soil sample into a small petri dish and recorded tenmeasurements using the USB2000 spectrophotometer Asabove we used CLR to trim and bin the data to 300ndash700 nm and 1 nm bins as well as to compute mean brightness(B2) We then averaged these ten values to produce a singlemean brightness value for each soil sample and transformedthe soil-brightness values using a normal-quantile

Pfeifer et al doi101093molbevmsy004 MBE

800Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 2: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

the population has also been well studied Haldane (1957)found that the number of selective deaths necessary to main-tain differences between populations is proportional to thenumber of locally maladapted alleles migrating into the pop-ulation Thus the fewer loci necessary for the diverging locallyadaptive phenotype the lower the resulting migration loadMore recently Yeaman and Whitlock (2011) argued that withgene flow the genetic architecture underlying an adaptivephenotype is expected to have fewer and larger-effect allelescompared with neutral expectations under models withoutmigration (ie an exponential distribution of effect sizes [andsee Rafajlovic et al 2017]) As recombination between theselocally beneficial alleles may result in maladapted intermedi-ate phenotypes this model further predicts a genomic clus-tering of the underlying mutations (Maynard Smith 1977Lenormand and Otto 2000) The relative advantage of thislinkage will increase as the population approaches the migra-tion threshold at which local adaptation becomes impossible(Kirkpatrick and Barton 2006) Taken together these studiesmake a number of testable predictions relating migration ratewith the expected number of sites underlying the locally ben-eficial phenotype at the locus in question the average effectsize of the alleles the clustering of beneficial mutations andthe conditions under which a locally beneficial allele may belost maintained or fixed in a population

These theoretical expectations however have proven dif-ficult to evaluate in natural populations owing to a dearth ofsystems with concrete links between genotype phenotypeand fitness In one well-studied system deer mice(Peromyscus maniculatus) inhabiting the light-colored SandHills of Nebraska (formed within the last 8000 years followingthe end of the Wisconsin glacial period [Loope and Swinehart2000]) have evolved lighter coloration than conspecifics onthe surrounding dark soil (Dice 1941 1947) Using a combi-nation of laboratory crosses and hitchhiking-mapping theAgouti gene which encodes a signaling protein known toplay a key role in mammalian pigment-type switching andcolor patterning (Jackson 1994 Mills and Patterson 1994

Barsh 1996 and see review of Manceau et al 2010) hasbeen implicated in adaptive color variation (Linnen et al2009) Moreover using association mapping specific candi-date mutations have been linked to variation in specific pig-ment traits which in turn contribute to differential survivalfrom visually hunting avian predators (Linnen et al 2013)

While previous research in this system has established linksbetween genotype phenotype and fitness this work has fo-cused on a single ecotonal population located near the north-ern edge of the Sand Hills Thus the dynamic interplay ofmigration selection and genetic drift as well as the extentof genotypic and phenotypic structuring among populationsremains unknown To address these questions we have sam-pled hundreds of individuals across a transect spanning theSand Hills and neighboring populations to the north andsouth Combining extensive per-locale soil color measure-ments per-individual phenotyping and 660000 single nucle-otide polymorphisms (SNPs) genome-wide this data setprovides a unique opportunity to evaluate theoretical predic-tions of local adaptation with gene flow on a genomic scale

Results and Discussion

Environmental and Phenotypic Variation in andaround the Nebraska Sand HillsWe collected P maniculatus luteus individuals (Nfrac14 266) aswell as soil samples (Nfrac14 271) from 11 locations spanning a330-km transect across the Sand Hills of Nebraska (fig 1supplementary table 1 Supplementary Material online) Asexpected we found that soil color differed significantly be-tween ldquoonrdquo Sand Hills and ldquooffrdquo Sand Hills locations(F1frac14 3074 Plt 1 1015 fig 2) Similarly five of the sixmouse color traits also differed significantly between thetwo habitats (fig 2) In contrast mice trapped on versus offof the Sand Hills did not differ in other nonpigmentationtraits including total body length tail length hind foot lengthor ear length (see Supplementary Material online) Togetherthese results are consistent with strong divergent selection

FIG 1 Sampling locations on (light brown) and off (dark brown and out of state) of the Nebraska Sand Hills Sampling spanned a 330-kmtransect beginning 120 km south of the Sand Hills and ending 120 km north of the Sand Hills

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

793Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

between habitat types on multiple color traits but not othermorphological traits

We also examined habitat heterogeneity and phenotypicdifferences among sampling sites within habitat types Both

soil brightness and mouse color differed significantly amongsites within habitats (soil F9frac14 80 Pfrac14 19 1010 dorsalbrightness F9frac14 105 Pfrac14 81 1014 dorsal hue F9 frac1434Pfrac14 67 104 ventral brightness F9frac14 103 Pfrac14 14 1013 dorsalndashventral boundary F9frac14 66 Pfrac14 15 108tail stripe F9 frac14 69 Pfrac14 75 109) Additionally site-specific means for three of the color traits were significantlycorrelated with local soil brightness including dorsal bright-ness (tfrac14 44 Pfrac14 00017) dorsalndashventral boundary (tfrac14 39Pfrac14 00034) and tail stripe (tfrac14 31 Pfrac14 0013) In contrastsite-specific means for dorsal hue and ventral brightness didnot correlate with local soil color (dorsal hue tfrac14 2385Pfrac14 0097 ventral brightness tfrac14 032 Pfrac14 077) These resultssuggest that the agents of selection shaping correlations be-tween local soil color and mouse color vary among the traits

The Genetic Architecture of Light Color in Sand HillsMiceGenetic architecture parameter estimates derived from ourassociation mapping analyses suggest that variants in theAgouti locus explain a considerable amount of the observedcolor variation in on versus off Sand Hills populations (table 1)Indeed Agouti SNPs explain 69 of the total phenotypicvariance in dorsal brightness and tail stripe This analysisalso provides estimates of the potential number of causalSNPs as well as the proportion of genetic variance that isattributable to major-effect SNPs These estimates suggestthat dozens of Agouti variants may be contributing to varia-tion in color traits but the credible intervals for SNP numberare wide (table 1) One explanation for such wide intervals isthat it is difficult to disentangle the effects of closely linkedSNPs Nevertheless because the lower bound of the credibleintervals for all but one color trait exceeds 1 our data indicatethat there are multiple Agouti mutations with nonnegligibleeffects on color phenotypes Estimates of PGE (percent ofgenetic variance attributable to major effect mutations) indi-cate that whatever their number these major effect muta-tions explain a considerable percentage of total geneticvariance (eg 83 for dorsal brightness and 87 for dorsalndashventral boundary)

Although our genetic architecture parameter estimatesindicate that large-effect Agouti variants contribute to varia-tion in each of the five color traits the number and location ofthese SNPs vary considerably (fig 3) To interpret theseresults some background on the structure and function ofAgouti is needed Work in Mus musculus has identified twodifferent Agouti isoforms that are under the control of differ-ent promoters The ventral isoform containing noncodingexons 1 A1Arsquo is expressed in the ventral dermis during em-bryogenesis and is required for determining the boundarybetween the dark dorsum and light ventrum (Bultmanet al 1994 Vrieling et al 1994) The hair-cycle isoform con-taining noncoding exons 1B1C is expressed in both the dor-sal and ventral dermis during hair growth and is required forforming light pheomelanin bands on individual hairs(Bultman et al 1994 Vrieling et al 1994) In Peromyscus thesesame isoforms are present as well as additional novel isoforms(Mallarino et al 2017) Isoform-specific changes in Agouti

FIG 2 Soil color and mouse color traits are lighter on the NebraskaSand Hills Box plots for each sampling site were produced usingscaled (range 0ndash1) normal-quantile transformed values for soilbrightness (B2) dorsal brightness (PC1) dorsal hue (PC2) ventralbrightness (PC3) dorsalndashventral (D-V) boundary and tail stripeSoil is shown on the separated top panel and the five pigment traitsare given below In all plots higher values correspond to lighterbrighter color (y-axis) Gray shading indicates off the Sand Hills sitesSoil and five mouse color traits were significantly lighter on the SandHills (dorsal brightness F1 frac14 2186 Plt 1 1015 dorsal hueF1frac14 184 Pfrac14 24 105 ventral brightness F1 frac14 610 Pfrac14 15 1013 dorsalndashventral boundary F1frac14 1046 Plt 1 1015 tail stripeF1 frac14 1702 Plt 1 1015 with the corresponding P-values given tothe right of each panel A sixth color trait (ventral hue [PC4] [F1frac14 11Pfrac14 030]) and four noncolor traits (total body length [F1frac14 011Pfrac14 074] tail length [F1frac14 018 Pfrac14 067] hind foot length [F1frac14 30Pfrac14 0084] and ear length [F1frac14 048 Pfrac14 049]) did not differ signif-icantly between habitats (see supplementary fig 10 and table 11Supplementary Material online)

Pfeifer et al doi101093molbevmsy004 MBE

794Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

expression are associated with changes in the dorsalndashventralboundary (ventral isoform Manceau et al 2011) and thewidth of light bands on individual hairs (hair-cycle isoformLinnen et al 2009) Both isoforms share the same three codingexons (exons 2ndash4) thus protein-coding changes could simul-taneously alter pigmentation patterning and hair banding(Linnen et al 2013)

Across the Agouti locus we detected 160 SNPs that werestrongly associated (posterior inclusion probability [PIP] inthe polygenic BSLMMgt 01) with at least one color trait(see Materials and Methods) By trait the number of SNPsexceeding our PIP threshold was 53 (dorsal brightness) 2(dorsal hue) 16 (ventral brightness) 34 (dorsalndashventralboundary) and 81 (tail stripe) Additionally patterns ofgenotypendashphenotype association were distinct for eachtrait but consistent with Agouti isoform function (fig 3)For example we observed the strongest associations fordorsal brightness around the three coding exons (exons2 3 and 4 located between positions 2506 and 2511 Mbon the scaffold containing the Agouti locus) and upstreamof the transcription start site for the Agouti hair-cycle iso-form (fig 3) In contrast for dorsalndashventral boundary theSNP with the highest PIP was located very near the tran-scription start site for the Agouti ventral isoform (fig 3)Notably a previously identified serine deletion in exon 2(Linnen et al 2009 2013) exceeds our threshold for three ofthe color traits (dorsal brightness dorsalndashventral boundaryand tail stripe) and was elevated above background asso-ciation levels for the remaining two traits (dorsal hue andventral brightness) and multiple candidates from thoseanalyses are also recovered here (supplementary table 2Supplementary Material online) Overall our associationmapping results thus reveal that there are many sites inAgouti that are associated with pigment variation

We also estimated genetic architecture parameters andidentified candidate pigmentation SNPs in the full data set(Agouti SNPs plus all SNPs outside of Agouti that had nomissing data) For all five traits including SNPs outside ofAgouti led to a modest increase in PVE PGE and SNP-number estimates but credible intervals overlapped withthose of the Agouti-only data set for all parameter estimates(table 1 vs supplementary table 3 Supplementary Materialonline) Additionally although the highest-PIP SNPs werefound in Agouti we identified several non-Agouti candidate

SNPs as well (supplementary table 4 Supplementary Materialonline) Together these results suggest that while Agoutiexplains a considerable amount of color variation in miceliving on and around the Nebraska Sand Hills a complete

Table 1 Hyperparameters Describing the Genetic Architecture of Five Color Traits

Trait h PVE rho PGE n_gamma

Dorsal brightness 050 (029 07) 069 (057 081) 062 (031 092) 083 (064 097) 1172 (7 286)Dorsal hue 028 (007 058) 026 (008 045) 032 (002 087) 030 (0 086) 554 (0 264)Ventral brightness 033 (016 051) 039 (023 056) 043 (013 082) 054 (022 088) 806 (3 271)D-V boundary 044 (025 064) 061 (046 073) 075 (045 098) 087 (07 099) 705 (9 256)Tail stripe 051 (032 069) 069 (056 08) 064 (038 089) 079 (061 095) 1239 (16 284)

NOTEmdashAll parameters were estimated using a BSLMM model in GEMMA 2148 Agouti SNPs and a relatedness matrix estimated from genome-wide SNPs Parameter meansand 95 credible intervals (lower bound upper bound) were calculated from ten independent runs per trait Interpretation of hyperparameters is as follows PVE the totalproportion of phenotypic variance that is explained by genotype PGE the proportion of genetic variance explained by sparse (ie major) effects h an approximation used forprior specification of the expected value of PVE rho an approximation used for prior specification of the expected value of PGE n_gamma the expected number of sparse(major) effect SNPs

FIG 3 Association mapping results for Agouti SNPs and five colortraits Each point represents one of 2148 Agouti SNPs included in theassociation mapping analysis with position (on scaffold 16 seeMaterials and Methods) indicated on the x-axis and PIP estimatedusing a BSLMM in GEMMA on the y-axis PIP values which approx-imate the strength of the association between genotype and pheno-type were averaged across ten independent GEMMA runs Black dotsindicate SNPs with PIPgt 01 dashed lines give the location of sixcandidate SNPs identified in a previous association mapping analysisof a single population (associated with tail stripe D-V boundary aswell as dorsal brightness and hue Linnen et al 2013) gray lines givethe location of Agouti exons and arrows indicate two alternativetranscription start sites

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

795Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

accounting of non-Agouti regions associated with color willrequire a whole-genome resequencing approach

The Demographic History of the Sand Hills PopulationInferring the demographic history of this region is of interestin and of itself but also is a requisite step for downstreamselection inference Based on their genetic diversity individ-uals from different sampling locations grouped according totheir location and habitat with a clear separation betweenindividuals from on and off of the Sand Hills (fig 4) Theobserved pattern of isolation by distance (IBD) was furthersupported by a significant correlation between pairwise ge-netic differentiation (FST[1 FST]) and pairwise geographicdistance among sampling sites (Pfrac14 99 104) The pairwiseFST values between sampling locations were low ranging from0008 to 0065 indicating limited genetic differentiationConsistent with both patterns of IBD and reduced differen-tiation among populations individual ancestry proportions(as inferred by sNMF and TESS) were best explained by threepopulation clusters Additionally the ancestry proportionsobtained with three clusters (fig 4) returned low cross-entropy values (supplementary fig 1 SupplementaryMaterial online) particularly when accounting for geographicsampling location again suggesting three distinct groups 1)North of the Sand Hills 2) on the Sand Hills and 3) south of

the Sand Hills The population tree inferred with TreeMix alsoindicated low levels of genetic drift among populations aswell as primary differentiation between north and south pop-ulations with the Sand Hills population occupying an inter-mediate position This pattern is consistent with increaseddifferentiation moving along the transect (fig 4) which islikely owing both to simple IBD as well as the more complexsettlement history inferred below

Given the observed population structure we next investi-gated models of colonization history based on three popula-tions Those inhabiting the Sand Hills and those inhabiting theneighboring regions to the north and to the south This resultedin two different three-population demographic models corre-sponding to different population tree topologies which explic-itly allow for bottlenecks associated with potential founderevents and for gene flow among populations (supplementaryfig 2 Supplementary Material online) Both models appearedequally likely (supplementary table 5 Supplementary Materialonline) and parameter estimates were similar and consistentacross models pointing to a recent divergence time amongpopulations evidence of a bottleneck associated with the col-onization of the Sand Hills and high rates of migration amongall populations (supplementary table 6 SupplementaryMaterial online) Note that the tested models did not specifi-cally impose a bottleneck for the colonization of the Sand Hills

FIG 4 Genetic structure of populations on and off the Sand Hills (A) PCA shows that individuals cluster according to their geographic samplinglocation with a clear separation of individuals sampled on the Sand Hills (SH green) from off the Sand Hills (blue) (B) TreeMix results for the treethat best fit the covariance in allele frequencies across samples (C) An isolation-by-distance model is supported by a significant correlationbetween pairwise genetic differentiation (FST[1 FST]) and pairwise geographic distance among sampling sites Significance of the Mantel testwas assessed through 10000 permutations (D) Spatial interpolation of the ancestry proportion inferred accounting for geographic samplinglocation using TESS3 for Kfrac14 3 groups Individuals are represented as dots and the ancestry of the three groups is represented by a gradient of threecolors (E) Individual admixture proportions (inferred using TESS3) are consistent with IBD and reduced differentiation among populations Thenumber of clusters (K) best explaining the data was assessed as the K value reaching the lowest minimal cross-entropy at Kfrac14 3 All analyses werebased on a subset of 161 individuals showing a mean depth of coverage larger than 8 with all genotypes possessing a minimum coverage of 4sampling one SNP for each 15-kb block resulting in a data set with 5412 SNPs (supplementary table 8 Supplementary Material online)

Pfeifer et al doi101093molbevmsy004 MBE

796Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

as population sizes could have remained high at the onset of thecolonization of the Sand Hills

To better distinguish between the models we comparedthe likelihoods of the two models for bootstrap data setscontaining a single SNP per 15-kb block counting the num-ber of bootstrap replicates with a relative likelihood (based onAkaikersquos information criterion values) larger than 095 for eachmodel Using this approach we identified a model with atopology in which the population on the Sand Hills sharesa more recent common ancestor with the population off theSand Hills to the south (supported in 81100 bootstrap rep-licates supplementary fig 3 Supplementary Material online)This topology and the parameter estimates (supplementarytable 6 Supplementary Material online) are consistent with arecent colonization of the Sand Hills from the south namely1) A recent split within the last4000 years (95 CI 3400ndash7900) 2) a stronger bottleneck associated with the coloniza-tion of the Sand Hills compared with the older split betweennorthern and southern populations and 3) higher levels ofgene flow from south to north including higher migrationrates from the southern population into the Sand Hills (2Nm95 CI 125ndash240) Furthermore this model fits well the mar-ginal site frequency spectrum (SFS) for each sample as well asthe 2D marginal SFS for each pair of populations (supplemen-tary fig 4 Supplementary Material online) Thus this neutraldemographic model nicely explains observed patterns of var-iation outside of the Agouti region and suggests that at leastthe most recent colonization represented by the currentlysampled individuals occurred considerably after the SandHills began to initially form (ie roughly 8000 years ago)

The Selective History of the Sand Hills PopulationDespite the high levels of gene flow inferred in our modelwhich results in low levels of differentiation among popula-tions genome-wide (supplementary fig 5a SupplementaryMaterial online) we observed high levels of differentiationamong populations at the Agouti locusmdashvariation that isfurther correlated with several phenotypic traits (fig 5 sup-plementary figs 5 and 6 Supplementary Material online) Weobserved the highest level of differentiation between micesampled on either side of the northern limit of the SandHills with a 100-kb region within Agouti exhibiting a contin-uous run of elevated genetic differentiation (supplementaryfig 6 Supplementary Material online) This pattern is consis-tent with the expectations of positive selection under a localadaptation regime Within this 100-kb region a subregion of30 kb located in intron 1 (ie between exon 1A and exon 1)displayed maximal differentiation (supplementary fig 6Supplementary Material online) making it difficult to pre-cisely identify the target of selection In contrast to thewide-range differentiation observed at Agouti on the northernside of the cline a single narrow FST-peak of roughly 5 kblocated in intron 1 was observed on the southern limit ofthe Sand Hills (fig 5 supplementary fig 6a SupplementaryMaterial online) Genome-wide comparison confirmed theoverall genetic similarity between light-colored Sand Hillsmice and the dark-colored population to the south withthe exception of a small number of variants putatively driving

adaptive phenotypic divergence at Agouti (fig 5 supplemen-tary fig 6a Supplementary Material online)

To test whether this pattern of differentiation could be theresult of nonselective forces (see Crisci et al 2012 Jensen et al2016) we calculated the HapFLK statistic providing a singlemeasure of genetic differentiation for the three geographiclocalities while controlling for neutral differentiation Beforecalculating genetic differentiation at the target locus (ieAgouti) the HapFLK method computes a neutral distancematrix from the background data (ie the backgroundregions randomly distributed across the genome) that sum-marizes the genetic similarity between populations with re-gard to the extent of genetic drift since divergence from theircommon ancestral population (supplementary fig 7Supplementary Material online) Consistent with the inferreddemographic history of the populations genome-wide back-ground levels of variation suggest that individuals capturedsouth of the Sand Hills are more similar to those inhabitingthe Sand Hills when compared with populations to thenorth The HapFLK profile confirmed the significant geneticdifferentiation at the Agouti locus compared with neutralexpectations (fig 6) Interestingly southern populations sharea significant amount of the haplotypic structure observed inpopulations on the Sand Hills with the exception of thecandidate region itself This is in keeping with high rates ofon-going gene flow occurring between on and off Sand Hillspopulations whereas selection maintains the ancestral Agoutialleles in the populations off the Sand Hills and the derivedalleles conferring crypsis on the Sand Hills

Owing to low levels of linkage disequilibrium characteriz-ing the Agouti region adaptive variants could be mapped on afine genetic scale Specifically there are three regions of in-creased differentiation (fig 6) 1) A narrow 3-kb peak in theHapFLK profile (located in intron 1 supplementary table 7Supplementary Material online) colocalizes with a region ofhigh linkage disequilibrium (supplementary fig 8Supplementary Material online) suggesting a recent selectiveevent 2) a second highest peak of differentiation locatedbetween exon 1A and the duplicated reversed copyexon1Arsquo is the only significant region detected by the CLRtest (supplementary fig 9 Supplementary Material online)and contains strong haplotype structure (fig 6) again sug-gesting the recent action of positive selection and 3) a thirdhighest peak of differentiation surrounds the putatively ben-eficial serine deletion in exon 2 previously described by Linnenet al (2009) showing strong differentiation between on andoff Sand Hills populations in our data set Although severallines of evidence support the role of the serine deletion inadaptation to the Sand Hills the linkage disequilibriumaround this variant is low suggesting that the age of thecorresponding selective event is likely the oldest among thethree candidate regions with subsequent mutation and re-combination events reducing the selective sweep patternHence as one may anticipate this greatly expanded clinaldata set served to identify younger and more geographicallylocalized candidate regions from across the Sand Hills com-pared with previous studies while still generally supportingthe model proposed by Linnen et al (2013) of multiple

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

797Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

independently selected mutations underlying differentaspects of the cryptic phenotype

Predicted Migration Thresholds and the Conditions ofAllele MaintenanceAlthough a number of strong assumptions must be made itis possible to estimate the migration threshold below which alocally beneficial mutation may be maintained in a

population Given our high rates of inferred gene flow it isimportant to consider whether observations are consistentwith theoretical expectations necessary for the maintenanceof alleles Following Yeaman and Whitlock (2011) andYeaman and Otto (2011) this migration threshold is definedin terms of both rates of gene flow as well as fitness effects oflocally beneficial alleles in the matching and in the alternateenvironment Given the parameter values inferred here along

FIG 5 Genetic differentiation across the Agouti locus Genetic differentiation observed between populations on the Sand Hills and populationsnorth of the Sand Hills (top panel) populations on the Sand Hills and populations south of the Sand Hills (middle panel) and populations northand south of the Sand Hills Dotted lines indicate the genome-wide average FST between these comparisons As shown differentiation is generallygreatly elevated across the region with a number of highly differentiated SNPs particularly when comparing on versus off Sand Hills populationswhich correspond to SNPs associated with different aspects of the cryptic phenotype (PIP scores are indicated by the size of the dots as shown inthe legend and significant SNPs are labeled with respect to the associated phenotype) Populations to the north and south of the Sand Hills are alsohighly differentiated in this region likely owing to differing levels of gene flow with the Sand Hills populations yet the SNPs underlying the crypticphenotype are not differentiated between these dark populations

Pfeifer et al doi101093molbevmsy004 MBE

798Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

with estimated population sizes the threshold above whichthe most strongly beneficial mutations may be maintained inthe population is estimated at mfrac14 08mdasha large value indeedgiven the inferred strength of selection (see Materials andMethods for details) For the more weakly beneficial muta-tions identified this value is estimated at mfrac14 007 again wellabove empirically estimated rates Thus our empirical obser-vations are fully consistent with expectation in that gene flowis strong enough to prevent locally beneficial alleles fromfixing on the Sand Hills but not so strong so as to swampout the derived allele despite the high input of ancestral var-iation As such parameter estimates fall in a range consistentwith the long-term maintenance of polymorphic alleles

ConclusionsThe cryptically colored mice of the Nebraska Sand Hills rep-resent one of the few mammalian systems in which aspects ofgenotype phenotype and fitness have been measured andconnected Yet the population genetics of the Sand Hillsregion has remained poorly understood By sequencing hun-dreds of individuals across a cline spanning over 300 km fun-damental aspects of the evolutionary history of thispopulation could be addressed for the first time Utilizinggenome-wide putatively neutral regions the inferred demo-graphic history suggests a relatively recent colonization of theSand Hills from the south well after the last glacial period

Further high rates of gene flow are inferred between light anddark populationsmdashresulting in genome-wide low levels ofgenetic differentiation as well as low levels of phenotypic dif-ferentiation of four traits unrelated to coloration across thecline However the Agouti region differs markedly in this re-gard with high levels of differentiation observed between onand off Sand Hills populations strong haplotype structure andhigh levels of linkage disequilibrium spanning putatively ben-eficial mutations among cryptically colored individuals In ad-dition we found these putatively beneficial mutations to bestrongly associated with the phenotypic traits underlying cryp-sis and these phenotypic traits were found to be in strongassociation with variance in soil color across the cline

Together these results suggest a model in which selectionis acting to maintain the alleles underlying the locally adaptivephenotype on lightdark soil despite substantial gene flowwhich not only prevents the populations from strongly dif-ferentiating from one another but also prevents the crypticgenotypes from reaching local fixation Furthermore return-ing to the theoretical predictions outlined in theIntroduction the mutations underlying the cryptic pheno-type are found to be few in number of large effect size and inclose genomic proximity As described by Yeaman andWhitlock (2011) in models of local adaptation with high mi-gration the establishment of a large effect beneficial allelemay indeed facilitate the accumulation of other locally ben-eficial alleles in the same genomic region owing to the local

FIG 6 Comparison of tests of neutrality and selection across the Agouti locus From the top The exon structure pairwise differences Tajimarsquos DCLR test HapFLK and FLK The dashed lines indicate the significance threshold using parametric bootstrapping of the best-fitting demographicmodel

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

799Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

reduction in effective migration rate (and see Aeschbacherand Burger 2014) Though speculative such a model mayindeed explain the accumulating observations of selectionfor crypsis generally targeting either the Agouti locus or theMc1r locus in a given population rather than both (ie asingle region is targeted by selection rather than two unlinkedregions)

Our results are in keeping with this model with the exon 2serine deletion explaining a large proportion of variance inmultiple traits underlying the cryptic phenotype and withpopulation-genetic patterns suggesting comparatively oldselection acting on this variant In addition owing to thelarge-scale geographic sampling multiple newly identifiedgenomically clustered and smaller effect alleles modulatingindividual traits are also identified which are characterized bystrong patterns of selective sweeps indicative of more recentselection This system thus provides an in-depth picture ofthe dynamic interplay of these population-level processesunderlying this adaptive phenotype and highlights a historycharacterized by remarkably strong local selective pressures aswell as continuous high rates of gene flow with the ancestralfounding populations

Materials and Methods

Population SamplingCollectionWe collected P maniculatus luteus individuals (Nfrac14 266) andcorresponding soil samples (Nfrac14 271) from 11 sites spanninga 330-km transect starting 120 km north of the Sand Hillsand ending 120 km south of the Sand Hills (fig 1 supple-mentary table 1 Supplementary Material online) In total fivesampling locations were on the Sand Hills and six were lo-cated off of the Sand Hills (three in the north and three in thesouth) We collected mice using Sherman live traps and pre-pared flat skins according to standard museum protocolsthese specimens are accessioned in the Harvard UniversityMuseum of Comparative Zoologyrsquos Mammal DepartmentFrom each sample we preserved liver tissue in 100 EtOHfor subsequent DNA extraction Collections were made underthe South Dakota Department of Game Fish and Parks sci-entific collectorrsquos permit 54 (2008) and the Nebraska Gameand Parks Commission Scientific and Educational Permit(2008) 579 subpermit 697-700 This work was approved byHarvard Universityrsquos Institutional Animal Care and UseCommittee

Mouse color measurements For each mouse we measuredstandard morphological traits including total length taillength hind foot length and ear length We also characterizedcolor and color pattern using methods modified from Linnenet al (2013) In brief to quantify color we used a USB2000spectrophotometer (Ocean Optics) to take reflectance meas-urements at three sites across the body (three replicatedmeasurements in the dorsal stripe flank and ventrum) Weprocessed these raw reflectance data using the CLR ColourAnalysis Programs v105 (Montgomerie 2008) Specifically wetrimmed and binned the data to 300ndash700 nm and 1 nm bins

and computed five color summary statistics B2 (mean bright-ness) B3 (intensity) S3 (chroma) S6 (contrast amplitude)and H3 (hue) We then averaged the three measurements foreach body region producing a total of 15 color values (ie fivesummary statistics from each of the three body regions) Toensure values were normally distributed we performed anormal-quantile transformation on each of the 15 color var-iables Using these transformed values we performed a prin-cipal component analysis (PCA) in STATA v130 (StataCorpCollege Station TX) On the basis of the eigenvalues andexamination of the scree plot we selected the first four prin-cipal components which together accounted for 84 of thevariation in color To increase interpretability of the loadingswe performed a VARIMAX rotation on the first four PCs(Tabachnick and Fidell 2001 Montgomerie 2006) After rota-tion PC1 corresponded to the brightnesscontrast (B2 B3S6) of the dorsum PC2 to the chromahue (S3 H3) of thedorsum PC3 to the brightnesscontrast (B2 B3 S6) of theventrum and PC4 to the chromahue (S3 H3) of the ven-trum Therefore we refer to PC1 as ldquodorsal brightnessrdquo PC2 asldquodorsal huerdquo PC3 as ldquoventral brightnessrdquo and PC4 as ldquoventralhuerdquo throughout

To quantify color pattern we took digital images of eachmouse flat skin with a Canon EOS Rebel XTI with a 50-mmCanon lens (Canon USA Lake Success NY) We used thequick selection tool in Adobe Photoshop CS5 (AdobeSystems San Jose CA) to select light and dark areas on eachmouse Specifically we outlined the dorsum (brown portion ofthe dorsum with legs and tails excluded) body (outlined entiremouse with brown and white areas legs and tails excluded)tail stripe (outlined dark stripe on tail only) and tail (outlinedentire tail) We calculated ldquodorsalndashventral boundaryrdquo as(bodymdashdorsum)body and ldquotail striperdquo as (tailmdashtail stripe)tail Thus each measure represents the proportion of a partic-ular region (tail or body) that appeared unpigmented highervalues therefore represent lighter phenotypes

To determine whether color phenotypes (dorsal bright-ness dorsal hue ventral brightness ventral hue dorsalndashventral boundary and tail stripe) differ between on SandHills and off Sand Hills populations we performed a nestedanalysis of variance (ANOVA) on each trait with samplingsite nested within habitat For comparison we also performednested ANOVAs on each of the four noncolor traits (totallength tail length hind foot length and ear length) All anal-yses were performed on normal-quantile-transformed dataUnless otherwise noted we performed all transformationsand statistical tests in R v332

Soil color measurements To characterize soil color we col-lected soil samples in the immediate vicinity of each success-ful Peromyscus capture To measure brightness we pouredeach soil sample into a small petri dish and recorded tenmeasurements using the USB2000 spectrophotometer Asabove we used CLR to trim and bin the data to 300ndash700 nm and 1 nm bins as well as to compute mean brightness(B2) We then averaged these ten values to produce a singlemean brightness value for each soil sample and transformedthe soil-brightness values using a normal-quantile

Pfeifer et al doi101093molbevmsy004 MBE

800Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 3: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

between habitat types on multiple color traits but not othermorphological traits

We also examined habitat heterogeneity and phenotypicdifferences among sampling sites within habitat types Both

soil brightness and mouse color differed significantly amongsites within habitats (soil F9frac14 80 Pfrac14 19 1010 dorsalbrightness F9frac14 105 Pfrac14 81 1014 dorsal hue F9 frac1434Pfrac14 67 104 ventral brightness F9frac14 103 Pfrac14 14 1013 dorsalndashventral boundary F9frac14 66 Pfrac14 15 108tail stripe F9 frac14 69 Pfrac14 75 109) Additionally site-specific means for three of the color traits were significantlycorrelated with local soil brightness including dorsal bright-ness (tfrac14 44 Pfrac14 00017) dorsalndashventral boundary (tfrac14 39Pfrac14 00034) and tail stripe (tfrac14 31 Pfrac14 0013) In contrastsite-specific means for dorsal hue and ventral brightness didnot correlate with local soil color (dorsal hue tfrac14 2385Pfrac14 0097 ventral brightness tfrac14 032 Pfrac14 077) These resultssuggest that the agents of selection shaping correlations be-tween local soil color and mouse color vary among the traits

The Genetic Architecture of Light Color in Sand HillsMiceGenetic architecture parameter estimates derived from ourassociation mapping analyses suggest that variants in theAgouti locus explain a considerable amount of the observedcolor variation in on versus off Sand Hills populations (table 1)Indeed Agouti SNPs explain 69 of the total phenotypicvariance in dorsal brightness and tail stripe This analysisalso provides estimates of the potential number of causalSNPs as well as the proportion of genetic variance that isattributable to major-effect SNPs These estimates suggestthat dozens of Agouti variants may be contributing to varia-tion in color traits but the credible intervals for SNP numberare wide (table 1) One explanation for such wide intervals isthat it is difficult to disentangle the effects of closely linkedSNPs Nevertheless because the lower bound of the credibleintervals for all but one color trait exceeds 1 our data indicatethat there are multiple Agouti mutations with nonnegligibleeffects on color phenotypes Estimates of PGE (percent ofgenetic variance attributable to major effect mutations) indi-cate that whatever their number these major effect muta-tions explain a considerable percentage of total geneticvariance (eg 83 for dorsal brightness and 87 for dorsalndashventral boundary)

Although our genetic architecture parameter estimatesindicate that large-effect Agouti variants contribute to varia-tion in each of the five color traits the number and location ofthese SNPs vary considerably (fig 3) To interpret theseresults some background on the structure and function ofAgouti is needed Work in Mus musculus has identified twodifferent Agouti isoforms that are under the control of differ-ent promoters The ventral isoform containing noncodingexons 1 A1Arsquo is expressed in the ventral dermis during em-bryogenesis and is required for determining the boundarybetween the dark dorsum and light ventrum (Bultmanet al 1994 Vrieling et al 1994) The hair-cycle isoform con-taining noncoding exons 1B1C is expressed in both the dor-sal and ventral dermis during hair growth and is required forforming light pheomelanin bands on individual hairs(Bultman et al 1994 Vrieling et al 1994) In Peromyscus thesesame isoforms are present as well as additional novel isoforms(Mallarino et al 2017) Isoform-specific changes in Agouti

FIG 2 Soil color and mouse color traits are lighter on the NebraskaSand Hills Box plots for each sampling site were produced usingscaled (range 0ndash1) normal-quantile transformed values for soilbrightness (B2) dorsal brightness (PC1) dorsal hue (PC2) ventralbrightness (PC3) dorsalndashventral (D-V) boundary and tail stripeSoil is shown on the separated top panel and the five pigment traitsare given below In all plots higher values correspond to lighterbrighter color (y-axis) Gray shading indicates off the Sand Hills sitesSoil and five mouse color traits were significantly lighter on the SandHills (dorsal brightness F1 frac14 2186 Plt 1 1015 dorsal hueF1frac14 184 Pfrac14 24 105 ventral brightness F1 frac14 610 Pfrac14 15 1013 dorsalndashventral boundary F1frac14 1046 Plt 1 1015 tail stripeF1 frac14 1702 Plt 1 1015 with the corresponding P-values given tothe right of each panel A sixth color trait (ventral hue [PC4] [F1frac14 11Pfrac14 030]) and four noncolor traits (total body length [F1frac14 011Pfrac14 074] tail length [F1frac14 018 Pfrac14 067] hind foot length [F1frac14 30Pfrac14 0084] and ear length [F1frac14 048 Pfrac14 049]) did not differ signif-icantly between habitats (see supplementary fig 10 and table 11Supplementary Material online)

Pfeifer et al doi101093molbevmsy004 MBE

794Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

expression are associated with changes in the dorsalndashventralboundary (ventral isoform Manceau et al 2011) and thewidth of light bands on individual hairs (hair-cycle isoformLinnen et al 2009) Both isoforms share the same three codingexons (exons 2ndash4) thus protein-coding changes could simul-taneously alter pigmentation patterning and hair banding(Linnen et al 2013)

Across the Agouti locus we detected 160 SNPs that werestrongly associated (posterior inclusion probability [PIP] inthe polygenic BSLMMgt 01) with at least one color trait(see Materials and Methods) By trait the number of SNPsexceeding our PIP threshold was 53 (dorsal brightness) 2(dorsal hue) 16 (ventral brightness) 34 (dorsalndashventralboundary) and 81 (tail stripe) Additionally patterns ofgenotypendashphenotype association were distinct for eachtrait but consistent with Agouti isoform function (fig 3)For example we observed the strongest associations fordorsal brightness around the three coding exons (exons2 3 and 4 located between positions 2506 and 2511 Mbon the scaffold containing the Agouti locus) and upstreamof the transcription start site for the Agouti hair-cycle iso-form (fig 3) In contrast for dorsalndashventral boundary theSNP with the highest PIP was located very near the tran-scription start site for the Agouti ventral isoform (fig 3)Notably a previously identified serine deletion in exon 2(Linnen et al 2009 2013) exceeds our threshold for three ofthe color traits (dorsal brightness dorsalndashventral boundaryand tail stripe) and was elevated above background asso-ciation levels for the remaining two traits (dorsal hue andventral brightness) and multiple candidates from thoseanalyses are also recovered here (supplementary table 2Supplementary Material online) Overall our associationmapping results thus reveal that there are many sites inAgouti that are associated with pigment variation

We also estimated genetic architecture parameters andidentified candidate pigmentation SNPs in the full data set(Agouti SNPs plus all SNPs outside of Agouti that had nomissing data) For all five traits including SNPs outside ofAgouti led to a modest increase in PVE PGE and SNP-number estimates but credible intervals overlapped withthose of the Agouti-only data set for all parameter estimates(table 1 vs supplementary table 3 Supplementary Materialonline) Additionally although the highest-PIP SNPs werefound in Agouti we identified several non-Agouti candidate

SNPs as well (supplementary table 4 Supplementary Materialonline) Together these results suggest that while Agoutiexplains a considerable amount of color variation in miceliving on and around the Nebraska Sand Hills a complete

Table 1 Hyperparameters Describing the Genetic Architecture of Five Color Traits

Trait h PVE rho PGE n_gamma

Dorsal brightness 050 (029 07) 069 (057 081) 062 (031 092) 083 (064 097) 1172 (7 286)Dorsal hue 028 (007 058) 026 (008 045) 032 (002 087) 030 (0 086) 554 (0 264)Ventral brightness 033 (016 051) 039 (023 056) 043 (013 082) 054 (022 088) 806 (3 271)D-V boundary 044 (025 064) 061 (046 073) 075 (045 098) 087 (07 099) 705 (9 256)Tail stripe 051 (032 069) 069 (056 08) 064 (038 089) 079 (061 095) 1239 (16 284)

NOTEmdashAll parameters were estimated using a BSLMM model in GEMMA 2148 Agouti SNPs and a relatedness matrix estimated from genome-wide SNPs Parameter meansand 95 credible intervals (lower bound upper bound) were calculated from ten independent runs per trait Interpretation of hyperparameters is as follows PVE the totalproportion of phenotypic variance that is explained by genotype PGE the proportion of genetic variance explained by sparse (ie major) effects h an approximation used forprior specification of the expected value of PVE rho an approximation used for prior specification of the expected value of PGE n_gamma the expected number of sparse(major) effect SNPs

FIG 3 Association mapping results for Agouti SNPs and five colortraits Each point represents one of 2148 Agouti SNPs included in theassociation mapping analysis with position (on scaffold 16 seeMaterials and Methods) indicated on the x-axis and PIP estimatedusing a BSLMM in GEMMA on the y-axis PIP values which approx-imate the strength of the association between genotype and pheno-type were averaged across ten independent GEMMA runs Black dotsindicate SNPs with PIPgt 01 dashed lines give the location of sixcandidate SNPs identified in a previous association mapping analysisof a single population (associated with tail stripe D-V boundary aswell as dorsal brightness and hue Linnen et al 2013) gray lines givethe location of Agouti exons and arrows indicate two alternativetranscription start sites

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

795Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

accounting of non-Agouti regions associated with color willrequire a whole-genome resequencing approach

The Demographic History of the Sand Hills PopulationInferring the demographic history of this region is of interestin and of itself but also is a requisite step for downstreamselection inference Based on their genetic diversity individ-uals from different sampling locations grouped according totheir location and habitat with a clear separation betweenindividuals from on and off of the Sand Hills (fig 4) Theobserved pattern of isolation by distance (IBD) was furthersupported by a significant correlation between pairwise ge-netic differentiation (FST[1 FST]) and pairwise geographicdistance among sampling sites (Pfrac14 99 104) The pairwiseFST values between sampling locations were low ranging from0008 to 0065 indicating limited genetic differentiationConsistent with both patterns of IBD and reduced differen-tiation among populations individual ancestry proportions(as inferred by sNMF and TESS) were best explained by threepopulation clusters Additionally the ancestry proportionsobtained with three clusters (fig 4) returned low cross-entropy values (supplementary fig 1 SupplementaryMaterial online) particularly when accounting for geographicsampling location again suggesting three distinct groups 1)North of the Sand Hills 2) on the Sand Hills and 3) south of

the Sand Hills The population tree inferred with TreeMix alsoindicated low levels of genetic drift among populations aswell as primary differentiation between north and south pop-ulations with the Sand Hills population occupying an inter-mediate position This pattern is consistent with increaseddifferentiation moving along the transect (fig 4) which islikely owing both to simple IBD as well as the more complexsettlement history inferred below

Given the observed population structure we next investi-gated models of colonization history based on three popula-tions Those inhabiting the Sand Hills and those inhabiting theneighboring regions to the north and to the south This resultedin two different three-population demographic models corre-sponding to different population tree topologies which explic-itly allow for bottlenecks associated with potential founderevents and for gene flow among populations (supplementaryfig 2 Supplementary Material online) Both models appearedequally likely (supplementary table 5 Supplementary Materialonline) and parameter estimates were similar and consistentacross models pointing to a recent divergence time amongpopulations evidence of a bottleneck associated with the col-onization of the Sand Hills and high rates of migration amongall populations (supplementary table 6 SupplementaryMaterial online) Note that the tested models did not specifi-cally impose a bottleneck for the colonization of the Sand Hills

FIG 4 Genetic structure of populations on and off the Sand Hills (A) PCA shows that individuals cluster according to their geographic samplinglocation with a clear separation of individuals sampled on the Sand Hills (SH green) from off the Sand Hills (blue) (B) TreeMix results for the treethat best fit the covariance in allele frequencies across samples (C) An isolation-by-distance model is supported by a significant correlationbetween pairwise genetic differentiation (FST[1 FST]) and pairwise geographic distance among sampling sites Significance of the Mantel testwas assessed through 10000 permutations (D) Spatial interpolation of the ancestry proportion inferred accounting for geographic samplinglocation using TESS3 for Kfrac14 3 groups Individuals are represented as dots and the ancestry of the three groups is represented by a gradient of threecolors (E) Individual admixture proportions (inferred using TESS3) are consistent with IBD and reduced differentiation among populations Thenumber of clusters (K) best explaining the data was assessed as the K value reaching the lowest minimal cross-entropy at Kfrac14 3 All analyses werebased on a subset of 161 individuals showing a mean depth of coverage larger than 8 with all genotypes possessing a minimum coverage of 4sampling one SNP for each 15-kb block resulting in a data set with 5412 SNPs (supplementary table 8 Supplementary Material online)

Pfeifer et al doi101093molbevmsy004 MBE

796Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

as population sizes could have remained high at the onset of thecolonization of the Sand Hills

To better distinguish between the models we comparedthe likelihoods of the two models for bootstrap data setscontaining a single SNP per 15-kb block counting the num-ber of bootstrap replicates with a relative likelihood (based onAkaikersquos information criterion values) larger than 095 for eachmodel Using this approach we identified a model with atopology in which the population on the Sand Hills sharesa more recent common ancestor with the population off theSand Hills to the south (supported in 81100 bootstrap rep-licates supplementary fig 3 Supplementary Material online)This topology and the parameter estimates (supplementarytable 6 Supplementary Material online) are consistent with arecent colonization of the Sand Hills from the south namely1) A recent split within the last4000 years (95 CI 3400ndash7900) 2) a stronger bottleneck associated with the coloniza-tion of the Sand Hills compared with the older split betweennorthern and southern populations and 3) higher levels ofgene flow from south to north including higher migrationrates from the southern population into the Sand Hills (2Nm95 CI 125ndash240) Furthermore this model fits well the mar-ginal site frequency spectrum (SFS) for each sample as well asthe 2D marginal SFS for each pair of populations (supplemen-tary fig 4 Supplementary Material online) Thus this neutraldemographic model nicely explains observed patterns of var-iation outside of the Agouti region and suggests that at leastthe most recent colonization represented by the currentlysampled individuals occurred considerably after the SandHills began to initially form (ie roughly 8000 years ago)

The Selective History of the Sand Hills PopulationDespite the high levels of gene flow inferred in our modelwhich results in low levels of differentiation among popula-tions genome-wide (supplementary fig 5a SupplementaryMaterial online) we observed high levels of differentiationamong populations at the Agouti locusmdashvariation that isfurther correlated with several phenotypic traits (fig 5 sup-plementary figs 5 and 6 Supplementary Material online) Weobserved the highest level of differentiation between micesampled on either side of the northern limit of the SandHills with a 100-kb region within Agouti exhibiting a contin-uous run of elevated genetic differentiation (supplementaryfig 6 Supplementary Material online) This pattern is consis-tent with the expectations of positive selection under a localadaptation regime Within this 100-kb region a subregion of30 kb located in intron 1 (ie between exon 1A and exon 1)displayed maximal differentiation (supplementary fig 6Supplementary Material online) making it difficult to pre-cisely identify the target of selection In contrast to thewide-range differentiation observed at Agouti on the northernside of the cline a single narrow FST-peak of roughly 5 kblocated in intron 1 was observed on the southern limit ofthe Sand Hills (fig 5 supplementary fig 6a SupplementaryMaterial online) Genome-wide comparison confirmed theoverall genetic similarity between light-colored Sand Hillsmice and the dark-colored population to the south withthe exception of a small number of variants putatively driving

adaptive phenotypic divergence at Agouti (fig 5 supplemen-tary fig 6a Supplementary Material online)

To test whether this pattern of differentiation could be theresult of nonselective forces (see Crisci et al 2012 Jensen et al2016) we calculated the HapFLK statistic providing a singlemeasure of genetic differentiation for the three geographiclocalities while controlling for neutral differentiation Beforecalculating genetic differentiation at the target locus (ieAgouti) the HapFLK method computes a neutral distancematrix from the background data (ie the backgroundregions randomly distributed across the genome) that sum-marizes the genetic similarity between populations with re-gard to the extent of genetic drift since divergence from theircommon ancestral population (supplementary fig 7Supplementary Material online) Consistent with the inferreddemographic history of the populations genome-wide back-ground levels of variation suggest that individuals capturedsouth of the Sand Hills are more similar to those inhabitingthe Sand Hills when compared with populations to thenorth The HapFLK profile confirmed the significant geneticdifferentiation at the Agouti locus compared with neutralexpectations (fig 6) Interestingly southern populations sharea significant amount of the haplotypic structure observed inpopulations on the Sand Hills with the exception of thecandidate region itself This is in keeping with high rates ofon-going gene flow occurring between on and off Sand Hillspopulations whereas selection maintains the ancestral Agoutialleles in the populations off the Sand Hills and the derivedalleles conferring crypsis on the Sand Hills

Owing to low levels of linkage disequilibrium characteriz-ing the Agouti region adaptive variants could be mapped on afine genetic scale Specifically there are three regions of in-creased differentiation (fig 6) 1) A narrow 3-kb peak in theHapFLK profile (located in intron 1 supplementary table 7Supplementary Material online) colocalizes with a region ofhigh linkage disequilibrium (supplementary fig 8Supplementary Material online) suggesting a recent selectiveevent 2) a second highest peak of differentiation locatedbetween exon 1A and the duplicated reversed copyexon1Arsquo is the only significant region detected by the CLRtest (supplementary fig 9 Supplementary Material online)and contains strong haplotype structure (fig 6) again sug-gesting the recent action of positive selection and 3) a thirdhighest peak of differentiation surrounds the putatively ben-eficial serine deletion in exon 2 previously described by Linnenet al (2009) showing strong differentiation between on andoff Sand Hills populations in our data set Although severallines of evidence support the role of the serine deletion inadaptation to the Sand Hills the linkage disequilibriumaround this variant is low suggesting that the age of thecorresponding selective event is likely the oldest among thethree candidate regions with subsequent mutation and re-combination events reducing the selective sweep patternHence as one may anticipate this greatly expanded clinaldata set served to identify younger and more geographicallylocalized candidate regions from across the Sand Hills com-pared with previous studies while still generally supportingthe model proposed by Linnen et al (2013) of multiple

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

797Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

independently selected mutations underlying differentaspects of the cryptic phenotype

Predicted Migration Thresholds and the Conditions ofAllele MaintenanceAlthough a number of strong assumptions must be made itis possible to estimate the migration threshold below which alocally beneficial mutation may be maintained in a

population Given our high rates of inferred gene flow it isimportant to consider whether observations are consistentwith theoretical expectations necessary for the maintenanceof alleles Following Yeaman and Whitlock (2011) andYeaman and Otto (2011) this migration threshold is definedin terms of both rates of gene flow as well as fitness effects oflocally beneficial alleles in the matching and in the alternateenvironment Given the parameter values inferred here along

FIG 5 Genetic differentiation across the Agouti locus Genetic differentiation observed between populations on the Sand Hills and populationsnorth of the Sand Hills (top panel) populations on the Sand Hills and populations south of the Sand Hills (middle panel) and populations northand south of the Sand Hills Dotted lines indicate the genome-wide average FST between these comparisons As shown differentiation is generallygreatly elevated across the region with a number of highly differentiated SNPs particularly when comparing on versus off Sand Hills populationswhich correspond to SNPs associated with different aspects of the cryptic phenotype (PIP scores are indicated by the size of the dots as shown inthe legend and significant SNPs are labeled with respect to the associated phenotype) Populations to the north and south of the Sand Hills are alsohighly differentiated in this region likely owing to differing levels of gene flow with the Sand Hills populations yet the SNPs underlying the crypticphenotype are not differentiated between these dark populations

Pfeifer et al doi101093molbevmsy004 MBE

798Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

with estimated population sizes the threshold above whichthe most strongly beneficial mutations may be maintained inthe population is estimated at mfrac14 08mdasha large value indeedgiven the inferred strength of selection (see Materials andMethods for details) For the more weakly beneficial muta-tions identified this value is estimated at mfrac14 007 again wellabove empirically estimated rates Thus our empirical obser-vations are fully consistent with expectation in that gene flowis strong enough to prevent locally beneficial alleles fromfixing on the Sand Hills but not so strong so as to swampout the derived allele despite the high input of ancestral var-iation As such parameter estimates fall in a range consistentwith the long-term maintenance of polymorphic alleles

ConclusionsThe cryptically colored mice of the Nebraska Sand Hills rep-resent one of the few mammalian systems in which aspects ofgenotype phenotype and fitness have been measured andconnected Yet the population genetics of the Sand Hillsregion has remained poorly understood By sequencing hun-dreds of individuals across a cline spanning over 300 km fun-damental aspects of the evolutionary history of thispopulation could be addressed for the first time Utilizinggenome-wide putatively neutral regions the inferred demo-graphic history suggests a relatively recent colonization of theSand Hills from the south well after the last glacial period

Further high rates of gene flow are inferred between light anddark populationsmdashresulting in genome-wide low levels ofgenetic differentiation as well as low levels of phenotypic dif-ferentiation of four traits unrelated to coloration across thecline However the Agouti region differs markedly in this re-gard with high levels of differentiation observed between onand off Sand Hills populations strong haplotype structure andhigh levels of linkage disequilibrium spanning putatively ben-eficial mutations among cryptically colored individuals In ad-dition we found these putatively beneficial mutations to bestrongly associated with the phenotypic traits underlying cryp-sis and these phenotypic traits were found to be in strongassociation with variance in soil color across the cline

Together these results suggest a model in which selectionis acting to maintain the alleles underlying the locally adaptivephenotype on lightdark soil despite substantial gene flowwhich not only prevents the populations from strongly dif-ferentiating from one another but also prevents the crypticgenotypes from reaching local fixation Furthermore return-ing to the theoretical predictions outlined in theIntroduction the mutations underlying the cryptic pheno-type are found to be few in number of large effect size and inclose genomic proximity As described by Yeaman andWhitlock (2011) in models of local adaptation with high mi-gration the establishment of a large effect beneficial allelemay indeed facilitate the accumulation of other locally ben-eficial alleles in the same genomic region owing to the local

FIG 6 Comparison of tests of neutrality and selection across the Agouti locus From the top The exon structure pairwise differences Tajimarsquos DCLR test HapFLK and FLK The dashed lines indicate the significance threshold using parametric bootstrapping of the best-fitting demographicmodel

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

799Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

reduction in effective migration rate (and see Aeschbacherand Burger 2014) Though speculative such a model mayindeed explain the accumulating observations of selectionfor crypsis generally targeting either the Agouti locus or theMc1r locus in a given population rather than both (ie asingle region is targeted by selection rather than two unlinkedregions)

Our results are in keeping with this model with the exon 2serine deletion explaining a large proportion of variance inmultiple traits underlying the cryptic phenotype and withpopulation-genetic patterns suggesting comparatively oldselection acting on this variant In addition owing to thelarge-scale geographic sampling multiple newly identifiedgenomically clustered and smaller effect alleles modulatingindividual traits are also identified which are characterized bystrong patterns of selective sweeps indicative of more recentselection This system thus provides an in-depth picture ofthe dynamic interplay of these population-level processesunderlying this adaptive phenotype and highlights a historycharacterized by remarkably strong local selective pressures aswell as continuous high rates of gene flow with the ancestralfounding populations

Materials and Methods

Population SamplingCollectionWe collected P maniculatus luteus individuals (Nfrac14 266) andcorresponding soil samples (Nfrac14 271) from 11 sites spanninga 330-km transect starting 120 km north of the Sand Hillsand ending 120 km south of the Sand Hills (fig 1 supple-mentary table 1 Supplementary Material online) In total fivesampling locations were on the Sand Hills and six were lo-cated off of the Sand Hills (three in the north and three in thesouth) We collected mice using Sherman live traps and pre-pared flat skins according to standard museum protocolsthese specimens are accessioned in the Harvard UniversityMuseum of Comparative Zoologyrsquos Mammal DepartmentFrom each sample we preserved liver tissue in 100 EtOHfor subsequent DNA extraction Collections were made underthe South Dakota Department of Game Fish and Parks sci-entific collectorrsquos permit 54 (2008) and the Nebraska Gameand Parks Commission Scientific and Educational Permit(2008) 579 subpermit 697-700 This work was approved byHarvard Universityrsquos Institutional Animal Care and UseCommittee

Mouse color measurements For each mouse we measuredstandard morphological traits including total length taillength hind foot length and ear length We also characterizedcolor and color pattern using methods modified from Linnenet al (2013) In brief to quantify color we used a USB2000spectrophotometer (Ocean Optics) to take reflectance meas-urements at three sites across the body (three replicatedmeasurements in the dorsal stripe flank and ventrum) Weprocessed these raw reflectance data using the CLR ColourAnalysis Programs v105 (Montgomerie 2008) Specifically wetrimmed and binned the data to 300ndash700 nm and 1 nm bins

and computed five color summary statistics B2 (mean bright-ness) B3 (intensity) S3 (chroma) S6 (contrast amplitude)and H3 (hue) We then averaged the three measurements foreach body region producing a total of 15 color values (ie fivesummary statistics from each of the three body regions) Toensure values were normally distributed we performed anormal-quantile transformation on each of the 15 color var-iables Using these transformed values we performed a prin-cipal component analysis (PCA) in STATA v130 (StataCorpCollege Station TX) On the basis of the eigenvalues andexamination of the scree plot we selected the first four prin-cipal components which together accounted for 84 of thevariation in color To increase interpretability of the loadingswe performed a VARIMAX rotation on the first four PCs(Tabachnick and Fidell 2001 Montgomerie 2006) After rota-tion PC1 corresponded to the brightnesscontrast (B2 B3S6) of the dorsum PC2 to the chromahue (S3 H3) of thedorsum PC3 to the brightnesscontrast (B2 B3 S6) of theventrum and PC4 to the chromahue (S3 H3) of the ven-trum Therefore we refer to PC1 as ldquodorsal brightnessrdquo PC2 asldquodorsal huerdquo PC3 as ldquoventral brightnessrdquo and PC4 as ldquoventralhuerdquo throughout

To quantify color pattern we took digital images of eachmouse flat skin with a Canon EOS Rebel XTI with a 50-mmCanon lens (Canon USA Lake Success NY) We used thequick selection tool in Adobe Photoshop CS5 (AdobeSystems San Jose CA) to select light and dark areas on eachmouse Specifically we outlined the dorsum (brown portion ofthe dorsum with legs and tails excluded) body (outlined entiremouse with brown and white areas legs and tails excluded)tail stripe (outlined dark stripe on tail only) and tail (outlinedentire tail) We calculated ldquodorsalndashventral boundaryrdquo as(bodymdashdorsum)body and ldquotail striperdquo as (tailmdashtail stripe)tail Thus each measure represents the proportion of a partic-ular region (tail or body) that appeared unpigmented highervalues therefore represent lighter phenotypes

To determine whether color phenotypes (dorsal bright-ness dorsal hue ventral brightness ventral hue dorsalndashventral boundary and tail stripe) differ between on SandHills and off Sand Hills populations we performed a nestedanalysis of variance (ANOVA) on each trait with samplingsite nested within habitat For comparison we also performednested ANOVAs on each of the four noncolor traits (totallength tail length hind foot length and ear length) All anal-yses were performed on normal-quantile-transformed dataUnless otherwise noted we performed all transformationsand statistical tests in R v332

Soil color measurements To characterize soil color we col-lected soil samples in the immediate vicinity of each success-ful Peromyscus capture To measure brightness we pouredeach soil sample into a small petri dish and recorded tenmeasurements using the USB2000 spectrophotometer Asabove we used CLR to trim and bin the data to 300ndash700 nm and 1 nm bins as well as to compute mean brightness(B2) We then averaged these ten values to produce a singlemean brightness value for each soil sample and transformedthe soil-brightness values using a normal-quantile

Pfeifer et al doi101093molbevmsy004 MBE

800Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 4: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

expression are associated with changes in the dorsalndashventralboundary (ventral isoform Manceau et al 2011) and thewidth of light bands on individual hairs (hair-cycle isoformLinnen et al 2009) Both isoforms share the same three codingexons (exons 2ndash4) thus protein-coding changes could simul-taneously alter pigmentation patterning and hair banding(Linnen et al 2013)

Across the Agouti locus we detected 160 SNPs that werestrongly associated (posterior inclusion probability [PIP] inthe polygenic BSLMMgt 01) with at least one color trait(see Materials and Methods) By trait the number of SNPsexceeding our PIP threshold was 53 (dorsal brightness) 2(dorsal hue) 16 (ventral brightness) 34 (dorsalndashventralboundary) and 81 (tail stripe) Additionally patterns ofgenotypendashphenotype association were distinct for eachtrait but consistent with Agouti isoform function (fig 3)For example we observed the strongest associations fordorsal brightness around the three coding exons (exons2 3 and 4 located between positions 2506 and 2511 Mbon the scaffold containing the Agouti locus) and upstreamof the transcription start site for the Agouti hair-cycle iso-form (fig 3) In contrast for dorsalndashventral boundary theSNP with the highest PIP was located very near the tran-scription start site for the Agouti ventral isoform (fig 3)Notably a previously identified serine deletion in exon 2(Linnen et al 2009 2013) exceeds our threshold for three ofthe color traits (dorsal brightness dorsalndashventral boundaryand tail stripe) and was elevated above background asso-ciation levels for the remaining two traits (dorsal hue andventral brightness) and multiple candidates from thoseanalyses are also recovered here (supplementary table 2Supplementary Material online) Overall our associationmapping results thus reveal that there are many sites inAgouti that are associated with pigment variation

We also estimated genetic architecture parameters andidentified candidate pigmentation SNPs in the full data set(Agouti SNPs plus all SNPs outside of Agouti that had nomissing data) For all five traits including SNPs outside ofAgouti led to a modest increase in PVE PGE and SNP-number estimates but credible intervals overlapped withthose of the Agouti-only data set for all parameter estimates(table 1 vs supplementary table 3 Supplementary Materialonline) Additionally although the highest-PIP SNPs werefound in Agouti we identified several non-Agouti candidate

SNPs as well (supplementary table 4 Supplementary Materialonline) Together these results suggest that while Agoutiexplains a considerable amount of color variation in miceliving on and around the Nebraska Sand Hills a complete

Table 1 Hyperparameters Describing the Genetic Architecture of Five Color Traits

Trait h PVE rho PGE n_gamma

Dorsal brightness 050 (029 07) 069 (057 081) 062 (031 092) 083 (064 097) 1172 (7 286)Dorsal hue 028 (007 058) 026 (008 045) 032 (002 087) 030 (0 086) 554 (0 264)Ventral brightness 033 (016 051) 039 (023 056) 043 (013 082) 054 (022 088) 806 (3 271)D-V boundary 044 (025 064) 061 (046 073) 075 (045 098) 087 (07 099) 705 (9 256)Tail stripe 051 (032 069) 069 (056 08) 064 (038 089) 079 (061 095) 1239 (16 284)

NOTEmdashAll parameters were estimated using a BSLMM model in GEMMA 2148 Agouti SNPs and a relatedness matrix estimated from genome-wide SNPs Parameter meansand 95 credible intervals (lower bound upper bound) were calculated from ten independent runs per trait Interpretation of hyperparameters is as follows PVE the totalproportion of phenotypic variance that is explained by genotype PGE the proportion of genetic variance explained by sparse (ie major) effects h an approximation used forprior specification of the expected value of PVE rho an approximation used for prior specification of the expected value of PGE n_gamma the expected number of sparse(major) effect SNPs

FIG 3 Association mapping results for Agouti SNPs and five colortraits Each point represents one of 2148 Agouti SNPs included in theassociation mapping analysis with position (on scaffold 16 seeMaterials and Methods) indicated on the x-axis and PIP estimatedusing a BSLMM in GEMMA on the y-axis PIP values which approx-imate the strength of the association between genotype and pheno-type were averaged across ten independent GEMMA runs Black dotsindicate SNPs with PIPgt 01 dashed lines give the location of sixcandidate SNPs identified in a previous association mapping analysisof a single population (associated with tail stripe D-V boundary aswell as dorsal brightness and hue Linnen et al 2013) gray lines givethe location of Agouti exons and arrows indicate two alternativetranscription start sites

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

795Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

accounting of non-Agouti regions associated with color willrequire a whole-genome resequencing approach

The Demographic History of the Sand Hills PopulationInferring the demographic history of this region is of interestin and of itself but also is a requisite step for downstreamselection inference Based on their genetic diversity individ-uals from different sampling locations grouped according totheir location and habitat with a clear separation betweenindividuals from on and off of the Sand Hills (fig 4) Theobserved pattern of isolation by distance (IBD) was furthersupported by a significant correlation between pairwise ge-netic differentiation (FST[1 FST]) and pairwise geographicdistance among sampling sites (Pfrac14 99 104) The pairwiseFST values between sampling locations were low ranging from0008 to 0065 indicating limited genetic differentiationConsistent with both patterns of IBD and reduced differen-tiation among populations individual ancestry proportions(as inferred by sNMF and TESS) were best explained by threepopulation clusters Additionally the ancestry proportionsobtained with three clusters (fig 4) returned low cross-entropy values (supplementary fig 1 SupplementaryMaterial online) particularly when accounting for geographicsampling location again suggesting three distinct groups 1)North of the Sand Hills 2) on the Sand Hills and 3) south of

the Sand Hills The population tree inferred with TreeMix alsoindicated low levels of genetic drift among populations aswell as primary differentiation between north and south pop-ulations with the Sand Hills population occupying an inter-mediate position This pattern is consistent with increaseddifferentiation moving along the transect (fig 4) which islikely owing both to simple IBD as well as the more complexsettlement history inferred below

Given the observed population structure we next investi-gated models of colonization history based on three popula-tions Those inhabiting the Sand Hills and those inhabiting theneighboring regions to the north and to the south This resultedin two different three-population demographic models corre-sponding to different population tree topologies which explic-itly allow for bottlenecks associated with potential founderevents and for gene flow among populations (supplementaryfig 2 Supplementary Material online) Both models appearedequally likely (supplementary table 5 Supplementary Materialonline) and parameter estimates were similar and consistentacross models pointing to a recent divergence time amongpopulations evidence of a bottleneck associated with the col-onization of the Sand Hills and high rates of migration amongall populations (supplementary table 6 SupplementaryMaterial online) Note that the tested models did not specifi-cally impose a bottleneck for the colonization of the Sand Hills

FIG 4 Genetic structure of populations on and off the Sand Hills (A) PCA shows that individuals cluster according to their geographic samplinglocation with a clear separation of individuals sampled on the Sand Hills (SH green) from off the Sand Hills (blue) (B) TreeMix results for the treethat best fit the covariance in allele frequencies across samples (C) An isolation-by-distance model is supported by a significant correlationbetween pairwise genetic differentiation (FST[1 FST]) and pairwise geographic distance among sampling sites Significance of the Mantel testwas assessed through 10000 permutations (D) Spatial interpolation of the ancestry proportion inferred accounting for geographic samplinglocation using TESS3 for Kfrac14 3 groups Individuals are represented as dots and the ancestry of the three groups is represented by a gradient of threecolors (E) Individual admixture proportions (inferred using TESS3) are consistent with IBD and reduced differentiation among populations Thenumber of clusters (K) best explaining the data was assessed as the K value reaching the lowest minimal cross-entropy at Kfrac14 3 All analyses werebased on a subset of 161 individuals showing a mean depth of coverage larger than 8 with all genotypes possessing a minimum coverage of 4sampling one SNP for each 15-kb block resulting in a data set with 5412 SNPs (supplementary table 8 Supplementary Material online)

Pfeifer et al doi101093molbevmsy004 MBE

796Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

as population sizes could have remained high at the onset of thecolonization of the Sand Hills

To better distinguish between the models we comparedthe likelihoods of the two models for bootstrap data setscontaining a single SNP per 15-kb block counting the num-ber of bootstrap replicates with a relative likelihood (based onAkaikersquos information criterion values) larger than 095 for eachmodel Using this approach we identified a model with atopology in which the population on the Sand Hills sharesa more recent common ancestor with the population off theSand Hills to the south (supported in 81100 bootstrap rep-licates supplementary fig 3 Supplementary Material online)This topology and the parameter estimates (supplementarytable 6 Supplementary Material online) are consistent with arecent colonization of the Sand Hills from the south namely1) A recent split within the last4000 years (95 CI 3400ndash7900) 2) a stronger bottleneck associated with the coloniza-tion of the Sand Hills compared with the older split betweennorthern and southern populations and 3) higher levels ofgene flow from south to north including higher migrationrates from the southern population into the Sand Hills (2Nm95 CI 125ndash240) Furthermore this model fits well the mar-ginal site frequency spectrum (SFS) for each sample as well asthe 2D marginal SFS for each pair of populations (supplemen-tary fig 4 Supplementary Material online) Thus this neutraldemographic model nicely explains observed patterns of var-iation outside of the Agouti region and suggests that at leastthe most recent colonization represented by the currentlysampled individuals occurred considerably after the SandHills began to initially form (ie roughly 8000 years ago)

The Selective History of the Sand Hills PopulationDespite the high levels of gene flow inferred in our modelwhich results in low levels of differentiation among popula-tions genome-wide (supplementary fig 5a SupplementaryMaterial online) we observed high levels of differentiationamong populations at the Agouti locusmdashvariation that isfurther correlated with several phenotypic traits (fig 5 sup-plementary figs 5 and 6 Supplementary Material online) Weobserved the highest level of differentiation between micesampled on either side of the northern limit of the SandHills with a 100-kb region within Agouti exhibiting a contin-uous run of elevated genetic differentiation (supplementaryfig 6 Supplementary Material online) This pattern is consis-tent with the expectations of positive selection under a localadaptation regime Within this 100-kb region a subregion of30 kb located in intron 1 (ie between exon 1A and exon 1)displayed maximal differentiation (supplementary fig 6Supplementary Material online) making it difficult to pre-cisely identify the target of selection In contrast to thewide-range differentiation observed at Agouti on the northernside of the cline a single narrow FST-peak of roughly 5 kblocated in intron 1 was observed on the southern limit ofthe Sand Hills (fig 5 supplementary fig 6a SupplementaryMaterial online) Genome-wide comparison confirmed theoverall genetic similarity between light-colored Sand Hillsmice and the dark-colored population to the south withthe exception of a small number of variants putatively driving

adaptive phenotypic divergence at Agouti (fig 5 supplemen-tary fig 6a Supplementary Material online)

To test whether this pattern of differentiation could be theresult of nonselective forces (see Crisci et al 2012 Jensen et al2016) we calculated the HapFLK statistic providing a singlemeasure of genetic differentiation for the three geographiclocalities while controlling for neutral differentiation Beforecalculating genetic differentiation at the target locus (ieAgouti) the HapFLK method computes a neutral distancematrix from the background data (ie the backgroundregions randomly distributed across the genome) that sum-marizes the genetic similarity between populations with re-gard to the extent of genetic drift since divergence from theircommon ancestral population (supplementary fig 7Supplementary Material online) Consistent with the inferreddemographic history of the populations genome-wide back-ground levels of variation suggest that individuals capturedsouth of the Sand Hills are more similar to those inhabitingthe Sand Hills when compared with populations to thenorth The HapFLK profile confirmed the significant geneticdifferentiation at the Agouti locus compared with neutralexpectations (fig 6) Interestingly southern populations sharea significant amount of the haplotypic structure observed inpopulations on the Sand Hills with the exception of thecandidate region itself This is in keeping with high rates ofon-going gene flow occurring between on and off Sand Hillspopulations whereas selection maintains the ancestral Agoutialleles in the populations off the Sand Hills and the derivedalleles conferring crypsis on the Sand Hills

Owing to low levels of linkage disequilibrium characteriz-ing the Agouti region adaptive variants could be mapped on afine genetic scale Specifically there are three regions of in-creased differentiation (fig 6) 1) A narrow 3-kb peak in theHapFLK profile (located in intron 1 supplementary table 7Supplementary Material online) colocalizes with a region ofhigh linkage disequilibrium (supplementary fig 8Supplementary Material online) suggesting a recent selectiveevent 2) a second highest peak of differentiation locatedbetween exon 1A and the duplicated reversed copyexon1Arsquo is the only significant region detected by the CLRtest (supplementary fig 9 Supplementary Material online)and contains strong haplotype structure (fig 6) again sug-gesting the recent action of positive selection and 3) a thirdhighest peak of differentiation surrounds the putatively ben-eficial serine deletion in exon 2 previously described by Linnenet al (2009) showing strong differentiation between on andoff Sand Hills populations in our data set Although severallines of evidence support the role of the serine deletion inadaptation to the Sand Hills the linkage disequilibriumaround this variant is low suggesting that the age of thecorresponding selective event is likely the oldest among thethree candidate regions with subsequent mutation and re-combination events reducing the selective sweep patternHence as one may anticipate this greatly expanded clinaldata set served to identify younger and more geographicallylocalized candidate regions from across the Sand Hills com-pared with previous studies while still generally supportingthe model proposed by Linnen et al (2013) of multiple

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

797Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

independently selected mutations underlying differentaspects of the cryptic phenotype

Predicted Migration Thresholds and the Conditions ofAllele MaintenanceAlthough a number of strong assumptions must be made itis possible to estimate the migration threshold below which alocally beneficial mutation may be maintained in a

population Given our high rates of inferred gene flow it isimportant to consider whether observations are consistentwith theoretical expectations necessary for the maintenanceof alleles Following Yeaman and Whitlock (2011) andYeaman and Otto (2011) this migration threshold is definedin terms of both rates of gene flow as well as fitness effects oflocally beneficial alleles in the matching and in the alternateenvironment Given the parameter values inferred here along

FIG 5 Genetic differentiation across the Agouti locus Genetic differentiation observed between populations on the Sand Hills and populationsnorth of the Sand Hills (top panel) populations on the Sand Hills and populations south of the Sand Hills (middle panel) and populations northand south of the Sand Hills Dotted lines indicate the genome-wide average FST between these comparisons As shown differentiation is generallygreatly elevated across the region with a number of highly differentiated SNPs particularly when comparing on versus off Sand Hills populationswhich correspond to SNPs associated with different aspects of the cryptic phenotype (PIP scores are indicated by the size of the dots as shown inthe legend and significant SNPs are labeled with respect to the associated phenotype) Populations to the north and south of the Sand Hills are alsohighly differentiated in this region likely owing to differing levels of gene flow with the Sand Hills populations yet the SNPs underlying the crypticphenotype are not differentiated between these dark populations

Pfeifer et al doi101093molbevmsy004 MBE

798Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

with estimated population sizes the threshold above whichthe most strongly beneficial mutations may be maintained inthe population is estimated at mfrac14 08mdasha large value indeedgiven the inferred strength of selection (see Materials andMethods for details) For the more weakly beneficial muta-tions identified this value is estimated at mfrac14 007 again wellabove empirically estimated rates Thus our empirical obser-vations are fully consistent with expectation in that gene flowis strong enough to prevent locally beneficial alleles fromfixing on the Sand Hills but not so strong so as to swampout the derived allele despite the high input of ancestral var-iation As such parameter estimates fall in a range consistentwith the long-term maintenance of polymorphic alleles

ConclusionsThe cryptically colored mice of the Nebraska Sand Hills rep-resent one of the few mammalian systems in which aspects ofgenotype phenotype and fitness have been measured andconnected Yet the population genetics of the Sand Hillsregion has remained poorly understood By sequencing hun-dreds of individuals across a cline spanning over 300 km fun-damental aspects of the evolutionary history of thispopulation could be addressed for the first time Utilizinggenome-wide putatively neutral regions the inferred demo-graphic history suggests a relatively recent colonization of theSand Hills from the south well after the last glacial period

Further high rates of gene flow are inferred between light anddark populationsmdashresulting in genome-wide low levels ofgenetic differentiation as well as low levels of phenotypic dif-ferentiation of four traits unrelated to coloration across thecline However the Agouti region differs markedly in this re-gard with high levels of differentiation observed between onand off Sand Hills populations strong haplotype structure andhigh levels of linkage disequilibrium spanning putatively ben-eficial mutations among cryptically colored individuals In ad-dition we found these putatively beneficial mutations to bestrongly associated with the phenotypic traits underlying cryp-sis and these phenotypic traits were found to be in strongassociation with variance in soil color across the cline

Together these results suggest a model in which selectionis acting to maintain the alleles underlying the locally adaptivephenotype on lightdark soil despite substantial gene flowwhich not only prevents the populations from strongly dif-ferentiating from one another but also prevents the crypticgenotypes from reaching local fixation Furthermore return-ing to the theoretical predictions outlined in theIntroduction the mutations underlying the cryptic pheno-type are found to be few in number of large effect size and inclose genomic proximity As described by Yeaman andWhitlock (2011) in models of local adaptation with high mi-gration the establishment of a large effect beneficial allelemay indeed facilitate the accumulation of other locally ben-eficial alleles in the same genomic region owing to the local

FIG 6 Comparison of tests of neutrality and selection across the Agouti locus From the top The exon structure pairwise differences Tajimarsquos DCLR test HapFLK and FLK The dashed lines indicate the significance threshold using parametric bootstrapping of the best-fitting demographicmodel

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

799Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

reduction in effective migration rate (and see Aeschbacherand Burger 2014) Though speculative such a model mayindeed explain the accumulating observations of selectionfor crypsis generally targeting either the Agouti locus or theMc1r locus in a given population rather than both (ie asingle region is targeted by selection rather than two unlinkedregions)

Our results are in keeping with this model with the exon 2serine deletion explaining a large proportion of variance inmultiple traits underlying the cryptic phenotype and withpopulation-genetic patterns suggesting comparatively oldselection acting on this variant In addition owing to thelarge-scale geographic sampling multiple newly identifiedgenomically clustered and smaller effect alleles modulatingindividual traits are also identified which are characterized bystrong patterns of selective sweeps indicative of more recentselection This system thus provides an in-depth picture ofthe dynamic interplay of these population-level processesunderlying this adaptive phenotype and highlights a historycharacterized by remarkably strong local selective pressures aswell as continuous high rates of gene flow with the ancestralfounding populations

Materials and Methods

Population SamplingCollectionWe collected P maniculatus luteus individuals (Nfrac14 266) andcorresponding soil samples (Nfrac14 271) from 11 sites spanninga 330-km transect starting 120 km north of the Sand Hillsand ending 120 km south of the Sand Hills (fig 1 supple-mentary table 1 Supplementary Material online) In total fivesampling locations were on the Sand Hills and six were lo-cated off of the Sand Hills (three in the north and three in thesouth) We collected mice using Sherman live traps and pre-pared flat skins according to standard museum protocolsthese specimens are accessioned in the Harvard UniversityMuseum of Comparative Zoologyrsquos Mammal DepartmentFrom each sample we preserved liver tissue in 100 EtOHfor subsequent DNA extraction Collections were made underthe South Dakota Department of Game Fish and Parks sci-entific collectorrsquos permit 54 (2008) and the Nebraska Gameand Parks Commission Scientific and Educational Permit(2008) 579 subpermit 697-700 This work was approved byHarvard Universityrsquos Institutional Animal Care and UseCommittee

Mouse color measurements For each mouse we measuredstandard morphological traits including total length taillength hind foot length and ear length We also characterizedcolor and color pattern using methods modified from Linnenet al (2013) In brief to quantify color we used a USB2000spectrophotometer (Ocean Optics) to take reflectance meas-urements at three sites across the body (three replicatedmeasurements in the dorsal stripe flank and ventrum) Weprocessed these raw reflectance data using the CLR ColourAnalysis Programs v105 (Montgomerie 2008) Specifically wetrimmed and binned the data to 300ndash700 nm and 1 nm bins

and computed five color summary statistics B2 (mean bright-ness) B3 (intensity) S3 (chroma) S6 (contrast amplitude)and H3 (hue) We then averaged the three measurements foreach body region producing a total of 15 color values (ie fivesummary statistics from each of the three body regions) Toensure values were normally distributed we performed anormal-quantile transformation on each of the 15 color var-iables Using these transformed values we performed a prin-cipal component analysis (PCA) in STATA v130 (StataCorpCollege Station TX) On the basis of the eigenvalues andexamination of the scree plot we selected the first four prin-cipal components which together accounted for 84 of thevariation in color To increase interpretability of the loadingswe performed a VARIMAX rotation on the first four PCs(Tabachnick and Fidell 2001 Montgomerie 2006) After rota-tion PC1 corresponded to the brightnesscontrast (B2 B3S6) of the dorsum PC2 to the chromahue (S3 H3) of thedorsum PC3 to the brightnesscontrast (B2 B3 S6) of theventrum and PC4 to the chromahue (S3 H3) of the ven-trum Therefore we refer to PC1 as ldquodorsal brightnessrdquo PC2 asldquodorsal huerdquo PC3 as ldquoventral brightnessrdquo and PC4 as ldquoventralhuerdquo throughout

To quantify color pattern we took digital images of eachmouse flat skin with a Canon EOS Rebel XTI with a 50-mmCanon lens (Canon USA Lake Success NY) We used thequick selection tool in Adobe Photoshop CS5 (AdobeSystems San Jose CA) to select light and dark areas on eachmouse Specifically we outlined the dorsum (brown portion ofthe dorsum with legs and tails excluded) body (outlined entiremouse with brown and white areas legs and tails excluded)tail stripe (outlined dark stripe on tail only) and tail (outlinedentire tail) We calculated ldquodorsalndashventral boundaryrdquo as(bodymdashdorsum)body and ldquotail striperdquo as (tailmdashtail stripe)tail Thus each measure represents the proportion of a partic-ular region (tail or body) that appeared unpigmented highervalues therefore represent lighter phenotypes

To determine whether color phenotypes (dorsal bright-ness dorsal hue ventral brightness ventral hue dorsalndashventral boundary and tail stripe) differ between on SandHills and off Sand Hills populations we performed a nestedanalysis of variance (ANOVA) on each trait with samplingsite nested within habitat For comparison we also performednested ANOVAs on each of the four noncolor traits (totallength tail length hind foot length and ear length) All anal-yses were performed on normal-quantile-transformed dataUnless otherwise noted we performed all transformationsand statistical tests in R v332

Soil color measurements To characterize soil color we col-lected soil samples in the immediate vicinity of each success-ful Peromyscus capture To measure brightness we pouredeach soil sample into a small petri dish and recorded tenmeasurements using the USB2000 spectrophotometer Asabove we used CLR to trim and bin the data to 300ndash700 nm and 1 nm bins as well as to compute mean brightness(B2) We then averaged these ten values to produce a singlemean brightness value for each soil sample and transformedthe soil-brightness values using a normal-quantile

Pfeifer et al doi101093molbevmsy004 MBE

800Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 5: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

accounting of non-Agouti regions associated with color willrequire a whole-genome resequencing approach

The Demographic History of the Sand Hills PopulationInferring the demographic history of this region is of interestin and of itself but also is a requisite step for downstreamselection inference Based on their genetic diversity individ-uals from different sampling locations grouped according totheir location and habitat with a clear separation betweenindividuals from on and off of the Sand Hills (fig 4) Theobserved pattern of isolation by distance (IBD) was furthersupported by a significant correlation between pairwise ge-netic differentiation (FST[1 FST]) and pairwise geographicdistance among sampling sites (Pfrac14 99 104) The pairwiseFST values between sampling locations were low ranging from0008 to 0065 indicating limited genetic differentiationConsistent with both patterns of IBD and reduced differen-tiation among populations individual ancestry proportions(as inferred by sNMF and TESS) were best explained by threepopulation clusters Additionally the ancestry proportionsobtained with three clusters (fig 4) returned low cross-entropy values (supplementary fig 1 SupplementaryMaterial online) particularly when accounting for geographicsampling location again suggesting three distinct groups 1)North of the Sand Hills 2) on the Sand Hills and 3) south of

the Sand Hills The population tree inferred with TreeMix alsoindicated low levels of genetic drift among populations aswell as primary differentiation between north and south pop-ulations with the Sand Hills population occupying an inter-mediate position This pattern is consistent with increaseddifferentiation moving along the transect (fig 4) which islikely owing both to simple IBD as well as the more complexsettlement history inferred below

Given the observed population structure we next investi-gated models of colonization history based on three popula-tions Those inhabiting the Sand Hills and those inhabiting theneighboring regions to the north and to the south This resultedin two different three-population demographic models corre-sponding to different population tree topologies which explic-itly allow for bottlenecks associated with potential founderevents and for gene flow among populations (supplementaryfig 2 Supplementary Material online) Both models appearedequally likely (supplementary table 5 Supplementary Materialonline) and parameter estimates were similar and consistentacross models pointing to a recent divergence time amongpopulations evidence of a bottleneck associated with the col-onization of the Sand Hills and high rates of migration amongall populations (supplementary table 6 SupplementaryMaterial online) Note that the tested models did not specifi-cally impose a bottleneck for the colonization of the Sand Hills

FIG 4 Genetic structure of populations on and off the Sand Hills (A) PCA shows that individuals cluster according to their geographic samplinglocation with a clear separation of individuals sampled on the Sand Hills (SH green) from off the Sand Hills (blue) (B) TreeMix results for the treethat best fit the covariance in allele frequencies across samples (C) An isolation-by-distance model is supported by a significant correlationbetween pairwise genetic differentiation (FST[1 FST]) and pairwise geographic distance among sampling sites Significance of the Mantel testwas assessed through 10000 permutations (D) Spatial interpolation of the ancestry proportion inferred accounting for geographic samplinglocation using TESS3 for Kfrac14 3 groups Individuals are represented as dots and the ancestry of the three groups is represented by a gradient of threecolors (E) Individual admixture proportions (inferred using TESS3) are consistent with IBD and reduced differentiation among populations Thenumber of clusters (K) best explaining the data was assessed as the K value reaching the lowest minimal cross-entropy at Kfrac14 3 All analyses werebased on a subset of 161 individuals showing a mean depth of coverage larger than 8 with all genotypes possessing a minimum coverage of 4sampling one SNP for each 15-kb block resulting in a data set with 5412 SNPs (supplementary table 8 Supplementary Material online)

Pfeifer et al doi101093molbevmsy004 MBE

796Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

as population sizes could have remained high at the onset of thecolonization of the Sand Hills

To better distinguish between the models we comparedthe likelihoods of the two models for bootstrap data setscontaining a single SNP per 15-kb block counting the num-ber of bootstrap replicates with a relative likelihood (based onAkaikersquos information criterion values) larger than 095 for eachmodel Using this approach we identified a model with atopology in which the population on the Sand Hills sharesa more recent common ancestor with the population off theSand Hills to the south (supported in 81100 bootstrap rep-licates supplementary fig 3 Supplementary Material online)This topology and the parameter estimates (supplementarytable 6 Supplementary Material online) are consistent with arecent colonization of the Sand Hills from the south namely1) A recent split within the last4000 years (95 CI 3400ndash7900) 2) a stronger bottleneck associated with the coloniza-tion of the Sand Hills compared with the older split betweennorthern and southern populations and 3) higher levels ofgene flow from south to north including higher migrationrates from the southern population into the Sand Hills (2Nm95 CI 125ndash240) Furthermore this model fits well the mar-ginal site frequency spectrum (SFS) for each sample as well asthe 2D marginal SFS for each pair of populations (supplemen-tary fig 4 Supplementary Material online) Thus this neutraldemographic model nicely explains observed patterns of var-iation outside of the Agouti region and suggests that at leastthe most recent colonization represented by the currentlysampled individuals occurred considerably after the SandHills began to initially form (ie roughly 8000 years ago)

The Selective History of the Sand Hills PopulationDespite the high levels of gene flow inferred in our modelwhich results in low levels of differentiation among popula-tions genome-wide (supplementary fig 5a SupplementaryMaterial online) we observed high levels of differentiationamong populations at the Agouti locusmdashvariation that isfurther correlated with several phenotypic traits (fig 5 sup-plementary figs 5 and 6 Supplementary Material online) Weobserved the highest level of differentiation between micesampled on either side of the northern limit of the SandHills with a 100-kb region within Agouti exhibiting a contin-uous run of elevated genetic differentiation (supplementaryfig 6 Supplementary Material online) This pattern is consis-tent with the expectations of positive selection under a localadaptation regime Within this 100-kb region a subregion of30 kb located in intron 1 (ie between exon 1A and exon 1)displayed maximal differentiation (supplementary fig 6Supplementary Material online) making it difficult to pre-cisely identify the target of selection In contrast to thewide-range differentiation observed at Agouti on the northernside of the cline a single narrow FST-peak of roughly 5 kblocated in intron 1 was observed on the southern limit ofthe Sand Hills (fig 5 supplementary fig 6a SupplementaryMaterial online) Genome-wide comparison confirmed theoverall genetic similarity between light-colored Sand Hillsmice and the dark-colored population to the south withthe exception of a small number of variants putatively driving

adaptive phenotypic divergence at Agouti (fig 5 supplemen-tary fig 6a Supplementary Material online)

To test whether this pattern of differentiation could be theresult of nonselective forces (see Crisci et al 2012 Jensen et al2016) we calculated the HapFLK statistic providing a singlemeasure of genetic differentiation for the three geographiclocalities while controlling for neutral differentiation Beforecalculating genetic differentiation at the target locus (ieAgouti) the HapFLK method computes a neutral distancematrix from the background data (ie the backgroundregions randomly distributed across the genome) that sum-marizes the genetic similarity between populations with re-gard to the extent of genetic drift since divergence from theircommon ancestral population (supplementary fig 7Supplementary Material online) Consistent with the inferreddemographic history of the populations genome-wide back-ground levels of variation suggest that individuals capturedsouth of the Sand Hills are more similar to those inhabitingthe Sand Hills when compared with populations to thenorth The HapFLK profile confirmed the significant geneticdifferentiation at the Agouti locus compared with neutralexpectations (fig 6) Interestingly southern populations sharea significant amount of the haplotypic structure observed inpopulations on the Sand Hills with the exception of thecandidate region itself This is in keeping with high rates ofon-going gene flow occurring between on and off Sand Hillspopulations whereas selection maintains the ancestral Agoutialleles in the populations off the Sand Hills and the derivedalleles conferring crypsis on the Sand Hills

Owing to low levels of linkage disequilibrium characteriz-ing the Agouti region adaptive variants could be mapped on afine genetic scale Specifically there are three regions of in-creased differentiation (fig 6) 1) A narrow 3-kb peak in theHapFLK profile (located in intron 1 supplementary table 7Supplementary Material online) colocalizes with a region ofhigh linkage disequilibrium (supplementary fig 8Supplementary Material online) suggesting a recent selectiveevent 2) a second highest peak of differentiation locatedbetween exon 1A and the duplicated reversed copyexon1Arsquo is the only significant region detected by the CLRtest (supplementary fig 9 Supplementary Material online)and contains strong haplotype structure (fig 6) again sug-gesting the recent action of positive selection and 3) a thirdhighest peak of differentiation surrounds the putatively ben-eficial serine deletion in exon 2 previously described by Linnenet al (2009) showing strong differentiation between on andoff Sand Hills populations in our data set Although severallines of evidence support the role of the serine deletion inadaptation to the Sand Hills the linkage disequilibriumaround this variant is low suggesting that the age of thecorresponding selective event is likely the oldest among thethree candidate regions with subsequent mutation and re-combination events reducing the selective sweep patternHence as one may anticipate this greatly expanded clinaldata set served to identify younger and more geographicallylocalized candidate regions from across the Sand Hills com-pared with previous studies while still generally supportingthe model proposed by Linnen et al (2013) of multiple

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

797Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

independently selected mutations underlying differentaspects of the cryptic phenotype

Predicted Migration Thresholds and the Conditions ofAllele MaintenanceAlthough a number of strong assumptions must be made itis possible to estimate the migration threshold below which alocally beneficial mutation may be maintained in a

population Given our high rates of inferred gene flow it isimportant to consider whether observations are consistentwith theoretical expectations necessary for the maintenanceof alleles Following Yeaman and Whitlock (2011) andYeaman and Otto (2011) this migration threshold is definedin terms of both rates of gene flow as well as fitness effects oflocally beneficial alleles in the matching and in the alternateenvironment Given the parameter values inferred here along

FIG 5 Genetic differentiation across the Agouti locus Genetic differentiation observed between populations on the Sand Hills and populationsnorth of the Sand Hills (top panel) populations on the Sand Hills and populations south of the Sand Hills (middle panel) and populations northand south of the Sand Hills Dotted lines indicate the genome-wide average FST between these comparisons As shown differentiation is generallygreatly elevated across the region with a number of highly differentiated SNPs particularly when comparing on versus off Sand Hills populationswhich correspond to SNPs associated with different aspects of the cryptic phenotype (PIP scores are indicated by the size of the dots as shown inthe legend and significant SNPs are labeled with respect to the associated phenotype) Populations to the north and south of the Sand Hills are alsohighly differentiated in this region likely owing to differing levels of gene flow with the Sand Hills populations yet the SNPs underlying the crypticphenotype are not differentiated between these dark populations

Pfeifer et al doi101093molbevmsy004 MBE

798Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

with estimated population sizes the threshold above whichthe most strongly beneficial mutations may be maintained inthe population is estimated at mfrac14 08mdasha large value indeedgiven the inferred strength of selection (see Materials andMethods for details) For the more weakly beneficial muta-tions identified this value is estimated at mfrac14 007 again wellabove empirically estimated rates Thus our empirical obser-vations are fully consistent with expectation in that gene flowis strong enough to prevent locally beneficial alleles fromfixing on the Sand Hills but not so strong so as to swampout the derived allele despite the high input of ancestral var-iation As such parameter estimates fall in a range consistentwith the long-term maintenance of polymorphic alleles

ConclusionsThe cryptically colored mice of the Nebraska Sand Hills rep-resent one of the few mammalian systems in which aspects ofgenotype phenotype and fitness have been measured andconnected Yet the population genetics of the Sand Hillsregion has remained poorly understood By sequencing hun-dreds of individuals across a cline spanning over 300 km fun-damental aspects of the evolutionary history of thispopulation could be addressed for the first time Utilizinggenome-wide putatively neutral regions the inferred demo-graphic history suggests a relatively recent colonization of theSand Hills from the south well after the last glacial period

Further high rates of gene flow are inferred between light anddark populationsmdashresulting in genome-wide low levels ofgenetic differentiation as well as low levels of phenotypic dif-ferentiation of four traits unrelated to coloration across thecline However the Agouti region differs markedly in this re-gard with high levels of differentiation observed between onand off Sand Hills populations strong haplotype structure andhigh levels of linkage disequilibrium spanning putatively ben-eficial mutations among cryptically colored individuals In ad-dition we found these putatively beneficial mutations to bestrongly associated with the phenotypic traits underlying cryp-sis and these phenotypic traits were found to be in strongassociation with variance in soil color across the cline

Together these results suggest a model in which selectionis acting to maintain the alleles underlying the locally adaptivephenotype on lightdark soil despite substantial gene flowwhich not only prevents the populations from strongly dif-ferentiating from one another but also prevents the crypticgenotypes from reaching local fixation Furthermore return-ing to the theoretical predictions outlined in theIntroduction the mutations underlying the cryptic pheno-type are found to be few in number of large effect size and inclose genomic proximity As described by Yeaman andWhitlock (2011) in models of local adaptation with high mi-gration the establishment of a large effect beneficial allelemay indeed facilitate the accumulation of other locally ben-eficial alleles in the same genomic region owing to the local

FIG 6 Comparison of tests of neutrality and selection across the Agouti locus From the top The exon structure pairwise differences Tajimarsquos DCLR test HapFLK and FLK The dashed lines indicate the significance threshold using parametric bootstrapping of the best-fitting demographicmodel

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

799Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

reduction in effective migration rate (and see Aeschbacherand Burger 2014) Though speculative such a model mayindeed explain the accumulating observations of selectionfor crypsis generally targeting either the Agouti locus or theMc1r locus in a given population rather than both (ie asingle region is targeted by selection rather than two unlinkedregions)

Our results are in keeping with this model with the exon 2serine deletion explaining a large proportion of variance inmultiple traits underlying the cryptic phenotype and withpopulation-genetic patterns suggesting comparatively oldselection acting on this variant In addition owing to thelarge-scale geographic sampling multiple newly identifiedgenomically clustered and smaller effect alleles modulatingindividual traits are also identified which are characterized bystrong patterns of selective sweeps indicative of more recentselection This system thus provides an in-depth picture ofthe dynamic interplay of these population-level processesunderlying this adaptive phenotype and highlights a historycharacterized by remarkably strong local selective pressures aswell as continuous high rates of gene flow with the ancestralfounding populations

Materials and Methods

Population SamplingCollectionWe collected P maniculatus luteus individuals (Nfrac14 266) andcorresponding soil samples (Nfrac14 271) from 11 sites spanninga 330-km transect starting 120 km north of the Sand Hillsand ending 120 km south of the Sand Hills (fig 1 supple-mentary table 1 Supplementary Material online) In total fivesampling locations were on the Sand Hills and six were lo-cated off of the Sand Hills (three in the north and three in thesouth) We collected mice using Sherman live traps and pre-pared flat skins according to standard museum protocolsthese specimens are accessioned in the Harvard UniversityMuseum of Comparative Zoologyrsquos Mammal DepartmentFrom each sample we preserved liver tissue in 100 EtOHfor subsequent DNA extraction Collections were made underthe South Dakota Department of Game Fish and Parks sci-entific collectorrsquos permit 54 (2008) and the Nebraska Gameand Parks Commission Scientific and Educational Permit(2008) 579 subpermit 697-700 This work was approved byHarvard Universityrsquos Institutional Animal Care and UseCommittee

Mouse color measurements For each mouse we measuredstandard morphological traits including total length taillength hind foot length and ear length We also characterizedcolor and color pattern using methods modified from Linnenet al (2013) In brief to quantify color we used a USB2000spectrophotometer (Ocean Optics) to take reflectance meas-urements at three sites across the body (three replicatedmeasurements in the dorsal stripe flank and ventrum) Weprocessed these raw reflectance data using the CLR ColourAnalysis Programs v105 (Montgomerie 2008) Specifically wetrimmed and binned the data to 300ndash700 nm and 1 nm bins

and computed five color summary statistics B2 (mean bright-ness) B3 (intensity) S3 (chroma) S6 (contrast amplitude)and H3 (hue) We then averaged the three measurements foreach body region producing a total of 15 color values (ie fivesummary statistics from each of the three body regions) Toensure values were normally distributed we performed anormal-quantile transformation on each of the 15 color var-iables Using these transformed values we performed a prin-cipal component analysis (PCA) in STATA v130 (StataCorpCollege Station TX) On the basis of the eigenvalues andexamination of the scree plot we selected the first four prin-cipal components which together accounted for 84 of thevariation in color To increase interpretability of the loadingswe performed a VARIMAX rotation on the first four PCs(Tabachnick and Fidell 2001 Montgomerie 2006) After rota-tion PC1 corresponded to the brightnesscontrast (B2 B3S6) of the dorsum PC2 to the chromahue (S3 H3) of thedorsum PC3 to the brightnesscontrast (B2 B3 S6) of theventrum and PC4 to the chromahue (S3 H3) of the ven-trum Therefore we refer to PC1 as ldquodorsal brightnessrdquo PC2 asldquodorsal huerdquo PC3 as ldquoventral brightnessrdquo and PC4 as ldquoventralhuerdquo throughout

To quantify color pattern we took digital images of eachmouse flat skin with a Canon EOS Rebel XTI with a 50-mmCanon lens (Canon USA Lake Success NY) We used thequick selection tool in Adobe Photoshop CS5 (AdobeSystems San Jose CA) to select light and dark areas on eachmouse Specifically we outlined the dorsum (brown portion ofthe dorsum with legs and tails excluded) body (outlined entiremouse with brown and white areas legs and tails excluded)tail stripe (outlined dark stripe on tail only) and tail (outlinedentire tail) We calculated ldquodorsalndashventral boundaryrdquo as(bodymdashdorsum)body and ldquotail striperdquo as (tailmdashtail stripe)tail Thus each measure represents the proportion of a partic-ular region (tail or body) that appeared unpigmented highervalues therefore represent lighter phenotypes

To determine whether color phenotypes (dorsal bright-ness dorsal hue ventral brightness ventral hue dorsalndashventral boundary and tail stripe) differ between on SandHills and off Sand Hills populations we performed a nestedanalysis of variance (ANOVA) on each trait with samplingsite nested within habitat For comparison we also performednested ANOVAs on each of the four noncolor traits (totallength tail length hind foot length and ear length) All anal-yses were performed on normal-quantile-transformed dataUnless otherwise noted we performed all transformationsand statistical tests in R v332

Soil color measurements To characterize soil color we col-lected soil samples in the immediate vicinity of each success-ful Peromyscus capture To measure brightness we pouredeach soil sample into a small petri dish and recorded tenmeasurements using the USB2000 spectrophotometer Asabove we used CLR to trim and bin the data to 300ndash700 nm and 1 nm bins as well as to compute mean brightness(B2) We then averaged these ten values to produce a singlemean brightness value for each soil sample and transformedthe soil-brightness values using a normal-quantile

Pfeifer et al doi101093molbevmsy004 MBE

800Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 6: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

as population sizes could have remained high at the onset of thecolonization of the Sand Hills

To better distinguish between the models we comparedthe likelihoods of the two models for bootstrap data setscontaining a single SNP per 15-kb block counting the num-ber of bootstrap replicates with a relative likelihood (based onAkaikersquos information criterion values) larger than 095 for eachmodel Using this approach we identified a model with atopology in which the population on the Sand Hills sharesa more recent common ancestor with the population off theSand Hills to the south (supported in 81100 bootstrap rep-licates supplementary fig 3 Supplementary Material online)This topology and the parameter estimates (supplementarytable 6 Supplementary Material online) are consistent with arecent colonization of the Sand Hills from the south namely1) A recent split within the last4000 years (95 CI 3400ndash7900) 2) a stronger bottleneck associated with the coloniza-tion of the Sand Hills compared with the older split betweennorthern and southern populations and 3) higher levels ofgene flow from south to north including higher migrationrates from the southern population into the Sand Hills (2Nm95 CI 125ndash240) Furthermore this model fits well the mar-ginal site frequency spectrum (SFS) for each sample as well asthe 2D marginal SFS for each pair of populations (supplemen-tary fig 4 Supplementary Material online) Thus this neutraldemographic model nicely explains observed patterns of var-iation outside of the Agouti region and suggests that at leastthe most recent colonization represented by the currentlysampled individuals occurred considerably after the SandHills began to initially form (ie roughly 8000 years ago)

The Selective History of the Sand Hills PopulationDespite the high levels of gene flow inferred in our modelwhich results in low levels of differentiation among popula-tions genome-wide (supplementary fig 5a SupplementaryMaterial online) we observed high levels of differentiationamong populations at the Agouti locusmdashvariation that isfurther correlated with several phenotypic traits (fig 5 sup-plementary figs 5 and 6 Supplementary Material online) Weobserved the highest level of differentiation between micesampled on either side of the northern limit of the SandHills with a 100-kb region within Agouti exhibiting a contin-uous run of elevated genetic differentiation (supplementaryfig 6 Supplementary Material online) This pattern is consis-tent with the expectations of positive selection under a localadaptation regime Within this 100-kb region a subregion of30 kb located in intron 1 (ie between exon 1A and exon 1)displayed maximal differentiation (supplementary fig 6Supplementary Material online) making it difficult to pre-cisely identify the target of selection In contrast to thewide-range differentiation observed at Agouti on the northernside of the cline a single narrow FST-peak of roughly 5 kblocated in intron 1 was observed on the southern limit ofthe Sand Hills (fig 5 supplementary fig 6a SupplementaryMaterial online) Genome-wide comparison confirmed theoverall genetic similarity between light-colored Sand Hillsmice and the dark-colored population to the south withthe exception of a small number of variants putatively driving

adaptive phenotypic divergence at Agouti (fig 5 supplemen-tary fig 6a Supplementary Material online)

To test whether this pattern of differentiation could be theresult of nonselective forces (see Crisci et al 2012 Jensen et al2016) we calculated the HapFLK statistic providing a singlemeasure of genetic differentiation for the three geographiclocalities while controlling for neutral differentiation Beforecalculating genetic differentiation at the target locus (ieAgouti) the HapFLK method computes a neutral distancematrix from the background data (ie the backgroundregions randomly distributed across the genome) that sum-marizes the genetic similarity between populations with re-gard to the extent of genetic drift since divergence from theircommon ancestral population (supplementary fig 7Supplementary Material online) Consistent with the inferreddemographic history of the populations genome-wide back-ground levels of variation suggest that individuals capturedsouth of the Sand Hills are more similar to those inhabitingthe Sand Hills when compared with populations to thenorth The HapFLK profile confirmed the significant geneticdifferentiation at the Agouti locus compared with neutralexpectations (fig 6) Interestingly southern populations sharea significant amount of the haplotypic structure observed inpopulations on the Sand Hills with the exception of thecandidate region itself This is in keeping with high rates ofon-going gene flow occurring between on and off Sand Hillspopulations whereas selection maintains the ancestral Agoutialleles in the populations off the Sand Hills and the derivedalleles conferring crypsis on the Sand Hills

Owing to low levels of linkage disequilibrium characteriz-ing the Agouti region adaptive variants could be mapped on afine genetic scale Specifically there are three regions of in-creased differentiation (fig 6) 1) A narrow 3-kb peak in theHapFLK profile (located in intron 1 supplementary table 7Supplementary Material online) colocalizes with a region ofhigh linkage disequilibrium (supplementary fig 8Supplementary Material online) suggesting a recent selectiveevent 2) a second highest peak of differentiation locatedbetween exon 1A and the duplicated reversed copyexon1Arsquo is the only significant region detected by the CLRtest (supplementary fig 9 Supplementary Material online)and contains strong haplotype structure (fig 6) again sug-gesting the recent action of positive selection and 3) a thirdhighest peak of differentiation surrounds the putatively ben-eficial serine deletion in exon 2 previously described by Linnenet al (2009) showing strong differentiation between on andoff Sand Hills populations in our data set Although severallines of evidence support the role of the serine deletion inadaptation to the Sand Hills the linkage disequilibriumaround this variant is low suggesting that the age of thecorresponding selective event is likely the oldest among thethree candidate regions with subsequent mutation and re-combination events reducing the selective sweep patternHence as one may anticipate this greatly expanded clinaldata set served to identify younger and more geographicallylocalized candidate regions from across the Sand Hills com-pared with previous studies while still generally supportingthe model proposed by Linnen et al (2013) of multiple

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

797Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

independently selected mutations underlying differentaspects of the cryptic phenotype

Predicted Migration Thresholds and the Conditions ofAllele MaintenanceAlthough a number of strong assumptions must be made itis possible to estimate the migration threshold below which alocally beneficial mutation may be maintained in a

population Given our high rates of inferred gene flow it isimportant to consider whether observations are consistentwith theoretical expectations necessary for the maintenanceof alleles Following Yeaman and Whitlock (2011) andYeaman and Otto (2011) this migration threshold is definedin terms of both rates of gene flow as well as fitness effects oflocally beneficial alleles in the matching and in the alternateenvironment Given the parameter values inferred here along

FIG 5 Genetic differentiation across the Agouti locus Genetic differentiation observed between populations on the Sand Hills and populationsnorth of the Sand Hills (top panel) populations on the Sand Hills and populations south of the Sand Hills (middle panel) and populations northand south of the Sand Hills Dotted lines indicate the genome-wide average FST between these comparisons As shown differentiation is generallygreatly elevated across the region with a number of highly differentiated SNPs particularly when comparing on versus off Sand Hills populationswhich correspond to SNPs associated with different aspects of the cryptic phenotype (PIP scores are indicated by the size of the dots as shown inthe legend and significant SNPs are labeled with respect to the associated phenotype) Populations to the north and south of the Sand Hills are alsohighly differentiated in this region likely owing to differing levels of gene flow with the Sand Hills populations yet the SNPs underlying the crypticphenotype are not differentiated between these dark populations

Pfeifer et al doi101093molbevmsy004 MBE

798Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

with estimated population sizes the threshold above whichthe most strongly beneficial mutations may be maintained inthe population is estimated at mfrac14 08mdasha large value indeedgiven the inferred strength of selection (see Materials andMethods for details) For the more weakly beneficial muta-tions identified this value is estimated at mfrac14 007 again wellabove empirically estimated rates Thus our empirical obser-vations are fully consistent with expectation in that gene flowis strong enough to prevent locally beneficial alleles fromfixing on the Sand Hills but not so strong so as to swampout the derived allele despite the high input of ancestral var-iation As such parameter estimates fall in a range consistentwith the long-term maintenance of polymorphic alleles

ConclusionsThe cryptically colored mice of the Nebraska Sand Hills rep-resent one of the few mammalian systems in which aspects ofgenotype phenotype and fitness have been measured andconnected Yet the population genetics of the Sand Hillsregion has remained poorly understood By sequencing hun-dreds of individuals across a cline spanning over 300 km fun-damental aspects of the evolutionary history of thispopulation could be addressed for the first time Utilizinggenome-wide putatively neutral regions the inferred demo-graphic history suggests a relatively recent colonization of theSand Hills from the south well after the last glacial period

Further high rates of gene flow are inferred between light anddark populationsmdashresulting in genome-wide low levels ofgenetic differentiation as well as low levels of phenotypic dif-ferentiation of four traits unrelated to coloration across thecline However the Agouti region differs markedly in this re-gard with high levels of differentiation observed between onand off Sand Hills populations strong haplotype structure andhigh levels of linkage disequilibrium spanning putatively ben-eficial mutations among cryptically colored individuals In ad-dition we found these putatively beneficial mutations to bestrongly associated with the phenotypic traits underlying cryp-sis and these phenotypic traits were found to be in strongassociation with variance in soil color across the cline

Together these results suggest a model in which selectionis acting to maintain the alleles underlying the locally adaptivephenotype on lightdark soil despite substantial gene flowwhich not only prevents the populations from strongly dif-ferentiating from one another but also prevents the crypticgenotypes from reaching local fixation Furthermore return-ing to the theoretical predictions outlined in theIntroduction the mutations underlying the cryptic pheno-type are found to be few in number of large effect size and inclose genomic proximity As described by Yeaman andWhitlock (2011) in models of local adaptation with high mi-gration the establishment of a large effect beneficial allelemay indeed facilitate the accumulation of other locally ben-eficial alleles in the same genomic region owing to the local

FIG 6 Comparison of tests of neutrality and selection across the Agouti locus From the top The exon structure pairwise differences Tajimarsquos DCLR test HapFLK and FLK The dashed lines indicate the significance threshold using parametric bootstrapping of the best-fitting demographicmodel

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

799Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

reduction in effective migration rate (and see Aeschbacherand Burger 2014) Though speculative such a model mayindeed explain the accumulating observations of selectionfor crypsis generally targeting either the Agouti locus or theMc1r locus in a given population rather than both (ie asingle region is targeted by selection rather than two unlinkedregions)

Our results are in keeping with this model with the exon 2serine deletion explaining a large proportion of variance inmultiple traits underlying the cryptic phenotype and withpopulation-genetic patterns suggesting comparatively oldselection acting on this variant In addition owing to thelarge-scale geographic sampling multiple newly identifiedgenomically clustered and smaller effect alleles modulatingindividual traits are also identified which are characterized bystrong patterns of selective sweeps indicative of more recentselection This system thus provides an in-depth picture ofthe dynamic interplay of these population-level processesunderlying this adaptive phenotype and highlights a historycharacterized by remarkably strong local selective pressures aswell as continuous high rates of gene flow with the ancestralfounding populations

Materials and Methods

Population SamplingCollectionWe collected P maniculatus luteus individuals (Nfrac14 266) andcorresponding soil samples (Nfrac14 271) from 11 sites spanninga 330-km transect starting 120 km north of the Sand Hillsand ending 120 km south of the Sand Hills (fig 1 supple-mentary table 1 Supplementary Material online) In total fivesampling locations were on the Sand Hills and six were lo-cated off of the Sand Hills (three in the north and three in thesouth) We collected mice using Sherman live traps and pre-pared flat skins according to standard museum protocolsthese specimens are accessioned in the Harvard UniversityMuseum of Comparative Zoologyrsquos Mammal DepartmentFrom each sample we preserved liver tissue in 100 EtOHfor subsequent DNA extraction Collections were made underthe South Dakota Department of Game Fish and Parks sci-entific collectorrsquos permit 54 (2008) and the Nebraska Gameand Parks Commission Scientific and Educational Permit(2008) 579 subpermit 697-700 This work was approved byHarvard Universityrsquos Institutional Animal Care and UseCommittee

Mouse color measurements For each mouse we measuredstandard morphological traits including total length taillength hind foot length and ear length We also characterizedcolor and color pattern using methods modified from Linnenet al (2013) In brief to quantify color we used a USB2000spectrophotometer (Ocean Optics) to take reflectance meas-urements at three sites across the body (three replicatedmeasurements in the dorsal stripe flank and ventrum) Weprocessed these raw reflectance data using the CLR ColourAnalysis Programs v105 (Montgomerie 2008) Specifically wetrimmed and binned the data to 300ndash700 nm and 1 nm bins

and computed five color summary statistics B2 (mean bright-ness) B3 (intensity) S3 (chroma) S6 (contrast amplitude)and H3 (hue) We then averaged the three measurements foreach body region producing a total of 15 color values (ie fivesummary statistics from each of the three body regions) Toensure values were normally distributed we performed anormal-quantile transformation on each of the 15 color var-iables Using these transformed values we performed a prin-cipal component analysis (PCA) in STATA v130 (StataCorpCollege Station TX) On the basis of the eigenvalues andexamination of the scree plot we selected the first four prin-cipal components which together accounted for 84 of thevariation in color To increase interpretability of the loadingswe performed a VARIMAX rotation on the first four PCs(Tabachnick and Fidell 2001 Montgomerie 2006) After rota-tion PC1 corresponded to the brightnesscontrast (B2 B3S6) of the dorsum PC2 to the chromahue (S3 H3) of thedorsum PC3 to the brightnesscontrast (B2 B3 S6) of theventrum and PC4 to the chromahue (S3 H3) of the ven-trum Therefore we refer to PC1 as ldquodorsal brightnessrdquo PC2 asldquodorsal huerdquo PC3 as ldquoventral brightnessrdquo and PC4 as ldquoventralhuerdquo throughout

To quantify color pattern we took digital images of eachmouse flat skin with a Canon EOS Rebel XTI with a 50-mmCanon lens (Canon USA Lake Success NY) We used thequick selection tool in Adobe Photoshop CS5 (AdobeSystems San Jose CA) to select light and dark areas on eachmouse Specifically we outlined the dorsum (brown portion ofthe dorsum with legs and tails excluded) body (outlined entiremouse with brown and white areas legs and tails excluded)tail stripe (outlined dark stripe on tail only) and tail (outlinedentire tail) We calculated ldquodorsalndashventral boundaryrdquo as(bodymdashdorsum)body and ldquotail striperdquo as (tailmdashtail stripe)tail Thus each measure represents the proportion of a partic-ular region (tail or body) that appeared unpigmented highervalues therefore represent lighter phenotypes

To determine whether color phenotypes (dorsal bright-ness dorsal hue ventral brightness ventral hue dorsalndashventral boundary and tail stripe) differ between on SandHills and off Sand Hills populations we performed a nestedanalysis of variance (ANOVA) on each trait with samplingsite nested within habitat For comparison we also performednested ANOVAs on each of the four noncolor traits (totallength tail length hind foot length and ear length) All anal-yses were performed on normal-quantile-transformed dataUnless otherwise noted we performed all transformationsand statistical tests in R v332

Soil color measurements To characterize soil color we col-lected soil samples in the immediate vicinity of each success-ful Peromyscus capture To measure brightness we pouredeach soil sample into a small petri dish and recorded tenmeasurements using the USB2000 spectrophotometer Asabove we used CLR to trim and bin the data to 300ndash700 nm and 1 nm bins as well as to compute mean brightness(B2) We then averaged these ten values to produce a singlemean brightness value for each soil sample and transformedthe soil-brightness values using a normal-quantile

Pfeifer et al doi101093molbevmsy004 MBE

800Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 7: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

independently selected mutations underlying differentaspects of the cryptic phenotype

Predicted Migration Thresholds and the Conditions ofAllele MaintenanceAlthough a number of strong assumptions must be made itis possible to estimate the migration threshold below which alocally beneficial mutation may be maintained in a

population Given our high rates of inferred gene flow it isimportant to consider whether observations are consistentwith theoretical expectations necessary for the maintenanceof alleles Following Yeaman and Whitlock (2011) andYeaman and Otto (2011) this migration threshold is definedin terms of both rates of gene flow as well as fitness effects oflocally beneficial alleles in the matching and in the alternateenvironment Given the parameter values inferred here along

FIG 5 Genetic differentiation across the Agouti locus Genetic differentiation observed between populations on the Sand Hills and populationsnorth of the Sand Hills (top panel) populations on the Sand Hills and populations south of the Sand Hills (middle panel) and populations northand south of the Sand Hills Dotted lines indicate the genome-wide average FST between these comparisons As shown differentiation is generallygreatly elevated across the region with a number of highly differentiated SNPs particularly when comparing on versus off Sand Hills populationswhich correspond to SNPs associated with different aspects of the cryptic phenotype (PIP scores are indicated by the size of the dots as shown inthe legend and significant SNPs are labeled with respect to the associated phenotype) Populations to the north and south of the Sand Hills are alsohighly differentiated in this region likely owing to differing levels of gene flow with the Sand Hills populations yet the SNPs underlying the crypticphenotype are not differentiated between these dark populations

Pfeifer et al doi101093molbevmsy004 MBE

798Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

with estimated population sizes the threshold above whichthe most strongly beneficial mutations may be maintained inthe population is estimated at mfrac14 08mdasha large value indeedgiven the inferred strength of selection (see Materials andMethods for details) For the more weakly beneficial muta-tions identified this value is estimated at mfrac14 007 again wellabove empirically estimated rates Thus our empirical obser-vations are fully consistent with expectation in that gene flowis strong enough to prevent locally beneficial alleles fromfixing on the Sand Hills but not so strong so as to swampout the derived allele despite the high input of ancestral var-iation As such parameter estimates fall in a range consistentwith the long-term maintenance of polymorphic alleles

ConclusionsThe cryptically colored mice of the Nebraska Sand Hills rep-resent one of the few mammalian systems in which aspects ofgenotype phenotype and fitness have been measured andconnected Yet the population genetics of the Sand Hillsregion has remained poorly understood By sequencing hun-dreds of individuals across a cline spanning over 300 km fun-damental aspects of the evolutionary history of thispopulation could be addressed for the first time Utilizinggenome-wide putatively neutral regions the inferred demo-graphic history suggests a relatively recent colonization of theSand Hills from the south well after the last glacial period

Further high rates of gene flow are inferred between light anddark populationsmdashresulting in genome-wide low levels ofgenetic differentiation as well as low levels of phenotypic dif-ferentiation of four traits unrelated to coloration across thecline However the Agouti region differs markedly in this re-gard with high levels of differentiation observed between onand off Sand Hills populations strong haplotype structure andhigh levels of linkage disequilibrium spanning putatively ben-eficial mutations among cryptically colored individuals In ad-dition we found these putatively beneficial mutations to bestrongly associated with the phenotypic traits underlying cryp-sis and these phenotypic traits were found to be in strongassociation with variance in soil color across the cline

Together these results suggest a model in which selectionis acting to maintain the alleles underlying the locally adaptivephenotype on lightdark soil despite substantial gene flowwhich not only prevents the populations from strongly dif-ferentiating from one another but also prevents the crypticgenotypes from reaching local fixation Furthermore return-ing to the theoretical predictions outlined in theIntroduction the mutations underlying the cryptic pheno-type are found to be few in number of large effect size and inclose genomic proximity As described by Yeaman andWhitlock (2011) in models of local adaptation with high mi-gration the establishment of a large effect beneficial allelemay indeed facilitate the accumulation of other locally ben-eficial alleles in the same genomic region owing to the local

FIG 6 Comparison of tests of neutrality and selection across the Agouti locus From the top The exon structure pairwise differences Tajimarsquos DCLR test HapFLK and FLK The dashed lines indicate the significance threshold using parametric bootstrapping of the best-fitting demographicmodel

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

799Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

reduction in effective migration rate (and see Aeschbacherand Burger 2014) Though speculative such a model mayindeed explain the accumulating observations of selectionfor crypsis generally targeting either the Agouti locus or theMc1r locus in a given population rather than both (ie asingle region is targeted by selection rather than two unlinkedregions)

Our results are in keeping with this model with the exon 2serine deletion explaining a large proportion of variance inmultiple traits underlying the cryptic phenotype and withpopulation-genetic patterns suggesting comparatively oldselection acting on this variant In addition owing to thelarge-scale geographic sampling multiple newly identifiedgenomically clustered and smaller effect alleles modulatingindividual traits are also identified which are characterized bystrong patterns of selective sweeps indicative of more recentselection This system thus provides an in-depth picture ofthe dynamic interplay of these population-level processesunderlying this adaptive phenotype and highlights a historycharacterized by remarkably strong local selective pressures aswell as continuous high rates of gene flow with the ancestralfounding populations

Materials and Methods

Population SamplingCollectionWe collected P maniculatus luteus individuals (Nfrac14 266) andcorresponding soil samples (Nfrac14 271) from 11 sites spanninga 330-km transect starting 120 km north of the Sand Hillsand ending 120 km south of the Sand Hills (fig 1 supple-mentary table 1 Supplementary Material online) In total fivesampling locations were on the Sand Hills and six were lo-cated off of the Sand Hills (three in the north and three in thesouth) We collected mice using Sherman live traps and pre-pared flat skins according to standard museum protocolsthese specimens are accessioned in the Harvard UniversityMuseum of Comparative Zoologyrsquos Mammal DepartmentFrom each sample we preserved liver tissue in 100 EtOHfor subsequent DNA extraction Collections were made underthe South Dakota Department of Game Fish and Parks sci-entific collectorrsquos permit 54 (2008) and the Nebraska Gameand Parks Commission Scientific and Educational Permit(2008) 579 subpermit 697-700 This work was approved byHarvard Universityrsquos Institutional Animal Care and UseCommittee

Mouse color measurements For each mouse we measuredstandard morphological traits including total length taillength hind foot length and ear length We also characterizedcolor and color pattern using methods modified from Linnenet al (2013) In brief to quantify color we used a USB2000spectrophotometer (Ocean Optics) to take reflectance meas-urements at three sites across the body (three replicatedmeasurements in the dorsal stripe flank and ventrum) Weprocessed these raw reflectance data using the CLR ColourAnalysis Programs v105 (Montgomerie 2008) Specifically wetrimmed and binned the data to 300ndash700 nm and 1 nm bins

and computed five color summary statistics B2 (mean bright-ness) B3 (intensity) S3 (chroma) S6 (contrast amplitude)and H3 (hue) We then averaged the three measurements foreach body region producing a total of 15 color values (ie fivesummary statistics from each of the three body regions) Toensure values were normally distributed we performed anormal-quantile transformation on each of the 15 color var-iables Using these transformed values we performed a prin-cipal component analysis (PCA) in STATA v130 (StataCorpCollege Station TX) On the basis of the eigenvalues andexamination of the scree plot we selected the first four prin-cipal components which together accounted for 84 of thevariation in color To increase interpretability of the loadingswe performed a VARIMAX rotation on the first four PCs(Tabachnick and Fidell 2001 Montgomerie 2006) After rota-tion PC1 corresponded to the brightnesscontrast (B2 B3S6) of the dorsum PC2 to the chromahue (S3 H3) of thedorsum PC3 to the brightnesscontrast (B2 B3 S6) of theventrum and PC4 to the chromahue (S3 H3) of the ven-trum Therefore we refer to PC1 as ldquodorsal brightnessrdquo PC2 asldquodorsal huerdquo PC3 as ldquoventral brightnessrdquo and PC4 as ldquoventralhuerdquo throughout

To quantify color pattern we took digital images of eachmouse flat skin with a Canon EOS Rebel XTI with a 50-mmCanon lens (Canon USA Lake Success NY) We used thequick selection tool in Adobe Photoshop CS5 (AdobeSystems San Jose CA) to select light and dark areas on eachmouse Specifically we outlined the dorsum (brown portion ofthe dorsum with legs and tails excluded) body (outlined entiremouse with brown and white areas legs and tails excluded)tail stripe (outlined dark stripe on tail only) and tail (outlinedentire tail) We calculated ldquodorsalndashventral boundaryrdquo as(bodymdashdorsum)body and ldquotail striperdquo as (tailmdashtail stripe)tail Thus each measure represents the proportion of a partic-ular region (tail or body) that appeared unpigmented highervalues therefore represent lighter phenotypes

To determine whether color phenotypes (dorsal bright-ness dorsal hue ventral brightness ventral hue dorsalndashventral boundary and tail stripe) differ between on SandHills and off Sand Hills populations we performed a nestedanalysis of variance (ANOVA) on each trait with samplingsite nested within habitat For comparison we also performednested ANOVAs on each of the four noncolor traits (totallength tail length hind foot length and ear length) All anal-yses were performed on normal-quantile-transformed dataUnless otherwise noted we performed all transformationsand statistical tests in R v332

Soil color measurements To characterize soil color we col-lected soil samples in the immediate vicinity of each success-ful Peromyscus capture To measure brightness we pouredeach soil sample into a small petri dish and recorded tenmeasurements using the USB2000 spectrophotometer Asabove we used CLR to trim and bin the data to 300ndash700 nm and 1 nm bins as well as to compute mean brightness(B2) We then averaged these ten values to produce a singlemean brightness value for each soil sample and transformedthe soil-brightness values using a normal-quantile

Pfeifer et al doi101093molbevmsy004 MBE

800Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 8: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

with estimated population sizes the threshold above whichthe most strongly beneficial mutations may be maintained inthe population is estimated at mfrac14 08mdasha large value indeedgiven the inferred strength of selection (see Materials andMethods for details) For the more weakly beneficial muta-tions identified this value is estimated at mfrac14 007 again wellabove empirically estimated rates Thus our empirical obser-vations are fully consistent with expectation in that gene flowis strong enough to prevent locally beneficial alleles fromfixing on the Sand Hills but not so strong so as to swampout the derived allele despite the high input of ancestral var-iation As such parameter estimates fall in a range consistentwith the long-term maintenance of polymorphic alleles

ConclusionsThe cryptically colored mice of the Nebraska Sand Hills rep-resent one of the few mammalian systems in which aspects ofgenotype phenotype and fitness have been measured andconnected Yet the population genetics of the Sand Hillsregion has remained poorly understood By sequencing hun-dreds of individuals across a cline spanning over 300 km fun-damental aspects of the evolutionary history of thispopulation could be addressed for the first time Utilizinggenome-wide putatively neutral regions the inferred demo-graphic history suggests a relatively recent colonization of theSand Hills from the south well after the last glacial period

Further high rates of gene flow are inferred between light anddark populationsmdashresulting in genome-wide low levels ofgenetic differentiation as well as low levels of phenotypic dif-ferentiation of four traits unrelated to coloration across thecline However the Agouti region differs markedly in this re-gard with high levels of differentiation observed between onand off Sand Hills populations strong haplotype structure andhigh levels of linkage disequilibrium spanning putatively ben-eficial mutations among cryptically colored individuals In ad-dition we found these putatively beneficial mutations to bestrongly associated with the phenotypic traits underlying cryp-sis and these phenotypic traits were found to be in strongassociation with variance in soil color across the cline

Together these results suggest a model in which selectionis acting to maintain the alleles underlying the locally adaptivephenotype on lightdark soil despite substantial gene flowwhich not only prevents the populations from strongly dif-ferentiating from one another but also prevents the crypticgenotypes from reaching local fixation Furthermore return-ing to the theoretical predictions outlined in theIntroduction the mutations underlying the cryptic pheno-type are found to be few in number of large effect size and inclose genomic proximity As described by Yeaman andWhitlock (2011) in models of local adaptation with high mi-gration the establishment of a large effect beneficial allelemay indeed facilitate the accumulation of other locally ben-eficial alleles in the same genomic region owing to the local

FIG 6 Comparison of tests of neutrality and selection across the Agouti locus From the top The exon structure pairwise differences Tajimarsquos DCLR test HapFLK and FLK The dashed lines indicate the significance threshold using parametric bootstrapping of the best-fitting demographicmodel

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

799Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

reduction in effective migration rate (and see Aeschbacherand Burger 2014) Though speculative such a model mayindeed explain the accumulating observations of selectionfor crypsis generally targeting either the Agouti locus or theMc1r locus in a given population rather than both (ie asingle region is targeted by selection rather than two unlinkedregions)

Our results are in keeping with this model with the exon 2serine deletion explaining a large proportion of variance inmultiple traits underlying the cryptic phenotype and withpopulation-genetic patterns suggesting comparatively oldselection acting on this variant In addition owing to thelarge-scale geographic sampling multiple newly identifiedgenomically clustered and smaller effect alleles modulatingindividual traits are also identified which are characterized bystrong patterns of selective sweeps indicative of more recentselection This system thus provides an in-depth picture ofthe dynamic interplay of these population-level processesunderlying this adaptive phenotype and highlights a historycharacterized by remarkably strong local selective pressures aswell as continuous high rates of gene flow with the ancestralfounding populations

Materials and Methods

Population SamplingCollectionWe collected P maniculatus luteus individuals (Nfrac14 266) andcorresponding soil samples (Nfrac14 271) from 11 sites spanninga 330-km transect starting 120 km north of the Sand Hillsand ending 120 km south of the Sand Hills (fig 1 supple-mentary table 1 Supplementary Material online) In total fivesampling locations were on the Sand Hills and six were lo-cated off of the Sand Hills (three in the north and three in thesouth) We collected mice using Sherman live traps and pre-pared flat skins according to standard museum protocolsthese specimens are accessioned in the Harvard UniversityMuseum of Comparative Zoologyrsquos Mammal DepartmentFrom each sample we preserved liver tissue in 100 EtOHfor subsequent DNA extraction Collections were made underthe South Dakota Department of Game Fish and Parks sci-entific collectorrsquos permit 54 (2008) and the Nebraska Gameand Parks Commission Scientific and Educational Permit(2008) 579 subpermit 697-700 This work was approved byHarvard Universityrsquos Institutional Animal Care and UseCommittee

Mouse color measurements For each mouse we measuredstandard morphological traits including total length taillength hind foot length and ear length We also characterizedcolor and color pattern using methods modified from Linnenet al (2013) In brief to quantify color we used a USB2000spectrophotometer (Ocean Optics) to take reflectance meas-urements at three sites across the body (three replicatedmeasurements in the dorsal stripe flank and ventrum) Weprocessed these raw reflectance data using the CLR ColourAnalysis Programs v105 (Montgomerie 2008) Specifically wetrimmed and binned the data to 300ndash700 nm and 1 nm bins

and computed five color summary statistics B2 (mean bright-ness) B3 (intensity) S3 (chroma) S6 (contrast amplitude)and H3 (hue) We then averaged the three measurements foreach body region producing a total of 15 color values (ie fivesummary statistics from each of the three body regions) Toensure values were normally distributed we performed anormal-quantile transformation on each of the 15 color var-iables Using these transformed values we performed a prin-cipal component analysis (PCA) in STATA v130 (StataCorpCollege Station TX) On the basis of the eigenvalues andexamination of the scree plot we selected the first four prin-cipal components which together accounted for 84 of thevariation in color To increase interpretability of the loadingswe performed a VARIMAX rotation on the first four PCs(Tabachnick and Fidell 2001 Montgomerie 2006) After rota-tion PC1 corresponded to the brightnesscontrast (B2 B3S6) of the dorsum PC2 to the chromahue (S3 H3) of thedorsum PC3 to the brightnesscontrast (B2 B3 S6) of theventrum and PC4 to the chromahue (S3 H3) of the ven-trum Therefore we refer to PC1 as ldquodorsal brightnessrdquo PC2 asldquodorsal huerdquo PC3 as ldquoventral brightnessrdquo and PC4 as ldquoventralhuerdquo throughout

To quantify color pattern we took digital images of eachmouse flat skin with a Canon EOS Rebel XTI with a 50-mmCanon lens (Canon USA Lake Success NY) We used thequick selection tool in Adobe Photoshop CS5 (AdobeSystems San Jose CA) to select light and dark areas on eachmouse Specifically we outlined the dorsum (brown portion ofthe dorsum with legs and tails excluded) body (outlined entiremouse with brown and white areas legs and tails excluded)tail stripe (outlined dark stripe on tail only) and tail (outlinedentire tail) We calculated ldquodorsalndashventral boundaryrdquo as(bodymdashdorsum)body and ldquotail striperdquo as (tailmdashtail stripe)tail Thus each measure represents the proportion of a partic-ular region (tail or body) that appeared unpigmented highervalues therefore represent lighter phenotypes

To determine whether color phenotypes (dorsal bright-ness dorsal hue ventral brightness ventral hue dorsalndashventral boundary and tail stripe) differ between on SandHills and off Sand Hills populations we performed a nestedanalysis of variance (ANOVA) on each trait with samplingsite nested within habitat For comparison we also performednested ANOVAs on each of the four noncolor traits (totallength tail length hind foot length and ear length) All anal-yses were performed on normal-quantile-transformed dataUnless otherwise noted we performed all transformationsand statistical tests in R v332

Soil color measurements To characterize soil color we col-lected soil samples in the immediate vicinity of each success-ful Peromyscus capture To measure brightness we pouredeach soil sample into a small petri dish and recorded tenmeasurements using the USB2000 spectrophotometer Asabove we used CLR to trim and bin the data to 300ndash700 nm and 1 nm bins as well as to compute mean brightness(B2) We then averaged these ten values to produce a singlemean brightness value for each soil sample and transformedthe soil-brightness values using a normal-quantile

Pfeifer et al doi101093molbevmsy004 MBE

800Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 9: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

reduction in effective migration rate (and see Aeschbacherand Burger 2014) Though speculative such a model mayindeed explain the accumulating observations of selectionfor crypsis generally targeting either the Agouti locus or theMc1r locus in a given population rather than both (ie asingle region is targeted by selection rather than two unlinkedregions)

Our results are in keeping with this model with the exon 2serine deletion explaining a large proportion of variance inmultiple traits underlying the cryptic phenotype and withpopulation-genetic patterns suggesting comparatively oldselection acting on this variant In addition owing to thelarge-scale geographic sampling multiple newly identifiedgenomically clustered and smaller effect alleles modulatingindividual traits are also identified which are characterized bystrong patterns of selective sweeps indicative of more recentselection This system thus provides an in-depth picture ofthe dynamic interplay of these population-level processesunderlying this adaptive phenotype and highlights a historycharacterized by remarkably strong local selective pressures aswell as continuous high rates of gene flow with the ancestralfounding populations

Materials and Methods

Population SamplingCollectionWe collected P maniculatus luteus individuals (Nfrac14 266) andcorresponding soil samples (Nfrac14 271) from 11 sites spanninga 330-km transect starting 120 km north of the Sand Hillsand ending 120 km south of the Sand Hills (fig 1 supple-mentary table 1 Supplementary Material online) In total fivesampling locations were on the Sand Hills and six were lo-cated off of the Sand Hills (three in the north and three in thesouth) We collected mice using Sherman live traps and pre-pared flat skins according to standard museum protocolsthese specimens are accessioned in the Harvard UniversityMuseum of Comparative Zoologyrsquos Mammal DepartmentFrom each sample we preserved liver tissue in 100 EtOHfor subsequent DNA extraction Collections were made underthe South Dakota Department of Game Fish and Parks sci-entific collectorrsquos permit 54 (2008) and the Nebraska Gameand Parks Commission Scientific and Educational Permit(2008) 579 subpermit 697-700 This work was approved byHarvard Universityrsquos Institutional Animal Care and UseCommittee

Mouse color measurements For each mouse we measuredstandard morphological traits including total length taillength hind foot length and ear length We also characterizedcolor and color pattern using methods modified from Linnenet al (2013) In brief to quantify color we used a USB2000spectrophotometer (Ocean Optics) to take reflectance meas-urements at three sites across the body (three replicatedmeasurements in the dorsal stripe flank and ventrum) Weprocessed these raw reflectance data using the CLR ColourAnalysis Programs v105 (Montgomerie 2008) Specifically wetrimmed and binned the data to 300ndash700 nm and 1 nm bins

and computed five color summary statistics B2 (mean bright-ness) B3 (intensity) S3 (chroma) S6 (contrast amplitude)and H3 (hue) We then averaged the three measurements foreach body region producing a total of 15 color values (ie fivesummary statistics from each of the three body regions) Toensure values were normally distributed we performed anormal-quantile transformation on each of the 15 color var-iables Using these transformed values we performed a prin-cipal component analysis (PCA) in STATA v130 (StataCorpCollege Station TX) On the basis of the eigenvalues andexamination of the scree plot we selected the first four prin-cipal components which together accounted for 84 of thevariation in color To increase interpretability of the loadingswe performed a VARIMAX rotation on the first four PCs(Tabachnick and Fidell 2001 Montgomerie 2006) After rota-tion PC1 corresponded to the brightnesscontrast (B2 B3S6) of the dorsum PC2 to the chromahue (S3 H3) of thedorsum PC3 to the brightnesscontrast (B2 B3 S6) of theventrum and PC4 to the chromahue (S3 H3) of the ven-trum Therefore we refer to PC1 as ldquodorsal brightnessrdquo PC2 asldquodorsal huerdquo PC3 as ldquoventral brightnessrdquo and PC4 as ldquoventralhuerdquo throughout

To quantify color pattern we took digital images of eachmouse flat skin with a Canon EOS Rebel XTI with a 50-mmCanon lens (Canon USA Lake Success NY) We used thequick selection tool in Adobe Photoshop CS5 (AdobeSystems San Jose CA) to select light and dark areas on eachmouse Specifically we outlined the dorsum (brown portion ofthe dorsum with legs and tails excluded) body (outlined entiremouse with brown and white areas legs and tails excluded)tail stripe (outlined dark stripe on tail only) and tail (outlinedentire tail) We calculated ldquodorsalndashventral boundaryrdquo as(bodymdashdorsum)body and ldquotail striperdquo as (tailmdashtail stripe)tail Thus each measure represents the proportion of a partic-ular region (tail or body) that appeared unpigmented highervalues therefore represent lighter phenotypes

To determine whether color phenotypes (dorsal bright-ness dorsal hue ventral brightness ventral hue dorsalndashventral boundary and tail stripe) differ between on SandHills and off Sand Hills populations we performed a nestedanalysis of variance (ANOVA) on each trait with samplingsite nested within habitat For comparison we also performednested ANOVAs on each of the four noncolor traits (totallength tail length hind foot length and ear length) All anal-yses were performed on normal-quantile-transformed dataUnless otherwise noted we performed all transformationsand statistical tests in R v332

Soil color measurements To characterize soil color we col-lected soil samples in the immediate vicinity of each success-ful Peromyscus capture To measure brightness we pouredeach soil sample into a small petri dish and recorded tenmeasurements using the USB2000 spectrophotometer Asabove we used CLR to trim and bin the data to 300ndash700 nm and 1 nm bins as well as to compute mean brightness(B2) We then averaged these ten values to produce a singlemean brightness value for each soil sample and transformedthe soil-brightness values using a normal-quantile

Pfeifer et al doi101093molbevmsy004 MBE

800Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 10: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

transformation prior to analysis To evaluate whether soilcolor differs consistently between on and off of the SandHills sites we performed an ANOVA with sampling sitenested within habitat Finally to test for a correlation betweencolor traits (see above) and local soil color we regressed themean brightness value for each color trait on the mean soilbrightness for each sampling sitemdashultimately for comparisonwith downstream population genetic inference (see Joostet al 2013)

Library Preparation and SequencingLibrary PreparationTo prepare sequencing libraries we used DNeasy kits (QiagenGermantown MD) to extract DNA from liver samples Wethen prepared libraries following the SureSelect TargetEnrichment Protocol v10 with some modifications In brief10ndash20 lg of each DNA sample was sheared to an average sizeof 200 bp using a Covaris ultrasonicator (Covaris IncWoburn MA) Sheared DNA samples were purified withAgencourt AMPure XP beads (Beckman CoulterIndianapolis IN) and quantified using a Quant-iT dsDNAAssay Kit (ThermoFisher Scientific Waltham MA) We thenperformed end repair and adenylation using 50 ll ligationswith Quick Ligase (New England Biolabs Ipswich MA) andpaired-end adapter oligonucleotides manufactured byIntegrated DNA Technologies (Coralville IA) Each samplewas assigned one of 48 five-base pair barcodes We pooledsamples into equimolar sets of 9ndash12 and performed size se-lection of a 280-bp band (650 bp) on a Pippin Prep with a 2agarose gel cassette We performed 12 cycles of PCR withPhusion High-Fidelity DNA Polymerase (NEB) according tomanufacturer guidelines To permit additional multiplexingbeyond that permitted by the 48 barcodes this PCR step alsoadded one of 12 six-base pair indices to each pool of 12barcoded samples Following amplification and AMPure pu-rification we assessed the quality and quantity of each pool(23 total) with an Agilent 2200 TapeStation (AgilentTechnologies Santa Clara CA) and a Qubit-BR dsDNAAssay Kit (Thermo Fisher Scientific)

Enrichment and SequencingTo enrich sample libraries for both the Agouti locus as wellas more than 1000 randomly distributed genomic regionsand following Linnen et al (2013) we used a MYcroarrayMYbaits capture array (MYcroarray Ann Arbor MI) Thisprobe set includes the 185-kb region containing all knownAgouti coding exons and regulatory elements (based on aP maniculatus rufinus Agouti BAC clone from Kingsley et al2009) andgt1000 nonrepetitive regions averaging 15 kb inlength at random locations across the P maniculatusgenome

After generating 23 indexed pools of barcoded librariesfrom 249 individual Sand Hills mice and one lab-derivednonAgouti control we enriched for regions of interest follow-ing the standard MYbaits v2 protocol for hybridization cap-ture and recovery We then performed 14 cycles of PCR withPhusion High-Fidelity Polymerase and a final AMPure

purification Final quantity and quality was then assessed us-ing a Qubit-HS dsDNA Assay Kit and Agilent 2200TapeStation After enriching our libraries for regions of inter-est we combined them into four pools and sequenced acrosseight lanes of 125-bp paired-end reads using an IlluminaHiSeq2500 platform (Illumina San Diego CA) Of the readpairs 94 could be confidently assigned to individual barc-odes (ie we excluded read data where the ID tags were notidentical between the two reads of a pair [4] as well as readswhere ID tags had low base qualities [2]) Read pair countsper individual ranged from 367759 to 42096507 (median4861819)

Reference AssemblyWe used the P maniculatus bairdii Pman_10 reference as-sembly publicly available from NCBI (RefSeq assembly acces-sion GCF_0005003451) which consists of 30921 scaffolds(scaffold N50 3760915) containing 2630541020 bp rangingfrom 201 to 18898765 bp in length (median scaffold length2048 bp mean scaffold length 85073 bp) to identify variationin the background regions (ie outside of Agouti)Unfortunately the Agouti locus is split over two overlappingscaffolds (ie exon 1 AArsquo is located on scaffoldNW_0065028941 and exons 2ndash4 are located on scaffoldNW_0065013961) in this assembly causing issues in theread mapping at this locus Therefore reads mapped on ei-ther of these two scaffolds were realigned to a novel in-housePeromyscus reference assembly in order to more reliably iden-tify variation at the Agouti locus The in-house reference as-sembly consists of 9080 scaffolds (scaffold N50 13859838)containing 2512380343 bp ranging from 1000 bp to60475073 bp in length (median scaffold length 3275 bpmean scaffold length 276694 bp) In the two reference as-semblies we annotated and masked seven different classes ofrepeats (ie SINE LINE LTR elements DNA elements satel-lites simple repeats and low complexity regions) usingRepeatMasker vOpen-405 (Smit et al 2013)

Sequence AlignmentWe removed contamination from the raw read pairs andtrimmed low quality read ends using cutadapt v18 (Martin2011) and TrimGalore v037 (httpwwwbioinformaticsbabrahamacukprojectstrim_galore) We aligned the pre-processed paired-end reads to the reference assembly usingStampy v1022 (Lunter and Goodson 2011) PCR duplicatesas well as reads that were not properly paired were thenremoved using SAMtools v0119 (Li et al 2009) After clean-ing read pair counts per individual ranged from 220960 to20178669 (median 3062998) We then conducted a multi-ple sequence alignment using the Genome Analysis Toolkit(GATK) IndelRealigner v33 (McKenna et al 2010 DePristoet al 2011 Van der Auwera et al 2013) to improve variantcalls in low-complexity genomic regions adjusting BaseAlignment Qualities at the same time in order to downweight base qualities in regions with high ambiguity (Li2011) Next we merged sample-specific reads across differentlanes thereby removing optical duplicates using SAMtoolsv0119 Following GATKrsquos Best Practice we performed a

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

801Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 11: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

second multiple sequence alignment to produce consistentmappings across all lanes of a sample The resulting data setcontained aligned read data for 249 individuals from 11 dif-ferent sampling locations

Variant Calling and FilteringWe performed an initial variant call using GATKrsquosHaplotypeCaller v33 (McKenna et al 2010 DePristo et al2011 Van der Auwera et al 2013) and then we genotyped allsamples jointly using GATKrsquos GenotypeGVCFs v33 toolPostgenotyping we filtered initial variants using GATKrsquosVariantFiltration v33 removing SNPs using the followingset of criteria (with acronyms as defined by the GATK pack-age) 1) The variant exhibited a low read mapping quality(MQlt 40) 2) the variant confidence was low (QDlt 20)3) the mapping qualities of the reads supporting the referenceallele and those supporting the alternate allele were qualita-tively different (MQRankSumlt125) 4) there was evidenceof a bias in the position of alleles within the reads that supportthem between the reference and alternate alleles(ReadPosRankSumlt80) or 5) there was evidence of astrand bias as estimated by Fisherrsquos exact test (FSgt 600) orthe Symmetric Odds Ratio test (SORgt 40) We removedindels using the following set of criteria 1) The variant con-fidence was low (QDlt 20) 2) there was evidence of a strandbias as estimated by Fisherrsquos exact test (FSgt 2000) or 3)there was evidence of bias in the position of alleles withinthe reads that support them between the reference and al-ternate alleles (ReadPosRankSumlt200) For an in-depthdiscussion of these issues see the recent review of Pfeifer(2017)

To minimize genotyping errors we excluded all variantswith a mean genotype quality of less than 20 (correspondingto P[error]frac14 001) We limited SNPs to biallelic sites usingVCFtools v0112 b (Danecek et al 2011) with the exceptionof one previously identified putatively beneficial triallelic var-iant (Linnen et al 2013) We excluded variants within repet-itive regions of the reference assembly from further analysesas potential misalignment of reads to these regions might leadto an increased frequency of heterozygous genotype callsAdditionally we filtered variants on the basis of the HardyndashWeinberg equilibrium by computing a P-value for HardyndashWeinberg equilibrium for each variant using VCFtoolsv0112 b and excluding variants with an excess of heterozy-gotes (Plt 001) unless otherwise noted Finally we removedsites for which all individuals were fixed for the nonreferenceallele as well as sites with more than 25 missing genotypeinformation

The resulting call set contained 8265 variants on scaffold16 (containing Agouti) and 649300 variants within the ran-dom background regions We polarized variants within Agoutiusing the P maniculatus rufinus Agouti sequence (Kingsleyet al 2009) as the outgroup Genotypes were phased usingBEAGLE v4 (Browning and Browning 2007) The Agouti dataare available at httpsfigsharecoms7ece137411ae5cf8ba56and the random genome-wide (ie non-Agouti) data areavailable at httpsfigsharecomsb312029e47f34268a8ce

Accessible GenomeTo minimize the number of false positives in our data set wesubjected variants to several stringent filter criteria The ap-plication of these filters led to an exclusion of a substantialfraction of genomic regions and thus we generated mask files(using the GATK pipeline described above but excludingvariant-specific filter criteria ie QD FS SOR MQRankSumand ReadPosRankSum) to identify all nucleotides accessibleto variant discovery After filtering only a small fraction of thegenome (ie 63083 bp within Agouti and 6224355 bp of therandom background regions) remained accessible Thesemask files enabled us to obtain an exact number of the mono-morphic sites in the reference assembly which we then usedto both estimate the demographic history of focal popula-tions as well as a control to avoid biases when calculatingsummary statistics

Association MappingTo identify SNPs within Agouti that contribute to variation inmouse coat color we used the Bayesian sparse linear mixedmodel (BSLMM) implemented in the software packageGEMMA v 0941 (Zhou et al 2013) In contrast to single-SNP association mapping approaches (eg Purcell et al 2007)the BSLMM is a polygenic model that simultaneously assessesthe contribution of multiple SNPs to phenotypic variationAdditionally compared with other polygenic models (eglinear mixed models and Bayesian variable selection regres-sion) the BSLMM performs well for a wide range of traitgenetic architectures from a highly polygenic architecturewith normally distributed effect sizes to a ldquosparserdquo model inwhich there are a small number of large-effect mutations(Zhou et al 2013) Indeed one benefit of the BSLMM (andother polygenic models) is that hyperparameters describingtrait genetic architecture (eg number of SNPs relative con-tribution of large- and small-effect variants) can be estimatedfrom the data Importantly the BSLMM also accounts forpopulation structure via inclusion of a kinship matrix as acovariate in the mixed model

For each of the five color traits that differed significantlybetween on Sand Hills and off Sand Hills habitats (ie dorsalbrightness dorsal hue ventral brightness dorsalndashventralboundary and tail stripe see ldquoResults and Discussionrdquo) weperformed ten independent GEMMA runs each consisting of25 million generations with the first five million generationsdiscarded as burn-in Because we were specifically interestedin the contribution of Agouti variants to color variation werestricted our GEMMA analysis to 2148 Agouti SNPs that hadno missing data and a minor allele frequency (MAF)gt 005However to construct a relatedness matrix we used thegenome-wide SNP data set We aimed to maximize the num-ber of individuals kept in the analyses and hence we used lessstringent filtering criteria than for the demographic analysesPrior to construction of this matrix we removed all individ-uals with a mean depth (DP) of coverage lower than 2Following the removal of low-coverage individuals we treatedgenotypes with less than 2 coverage or twice the individualmean DP as missing data and removed SNPs with more than10 missing data across individuals or with a MAFlt 001

Pfeifer et al doi101093molbevmsy004 MBE

802Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 12: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

To remove tightly linked SNPs we sampled one SNP per 1-kbblock choosing whichever SNP had the lowest amount ofmissing data resulting in a data set with 12920 SNPs (sup-plementary table 8 Supplementary Material online) We thencomputed the relatedness matrix using the RBioconductor(release 34) package SNPrelate v180 (Zheng et al 2012) Forall traits we used normal quantile transformed values to en-sure normality and option ldquo-bslmm 1rdquo to fit a standard linearBSLMM to our data

After runs were complete we assessed convergence on thedesired posterior distribution via examination of trace plotsfor each parameter and comparison of results across inde-pendent runs We then summarized PIPs for each SNP foreach trait by averaging across the ten independent runsFollowing (Chaves et al 2016) we used a strict cut-off ofPIPgt 01 to identify candidate SNPs for each color trait (cfGompert et al 2013 Comeault et al 2014) To summarizegenetic architecture parameter estimates for each trait wecalculated the mean and upper and lower bounds of the 95credible interval for each parameter from the combined pos-terior distributions derived from the ten runs

To identify additional candidate regions contributing tocolor variation and to compare genetic architecture param-eter estimates from Agouti to those obtained using the fulldata set (Agouti SNPs and non-Agouti SNPs) we repeated theGEMMA analyses as described above but with a data setconsisting of 8616 SNPs that had no missing data andMAFgt 005

Population StructureWe investigated the structure of populations along theNebraska Sand Hills transect with several complementaryapproaches First we used methods to cluster individualsbased on their genetic similarity including PCA and inferenceof individual ancestry proportions Second on the basis ofSNP allele frequencies we computed pairwise FST betweensampled sites to infer their relationships In addition we in-ferred the population tree best fitting the covariance of allelefrequencies across sites using TreeMix v113 (Pickrell andPritchard 2012) Finally we tested for IBD by comparing thematrix of geographic distances between sites to that of pair-wise genetic differentiation (FST[1 FST] Rousset 1997) us-ing Mantel tests (Sokal 1979 Smouse et al 1986) as well asmethods to infer ancestry proportions accounting for geo-graphic location

For demographic analyses we based all analyses on thebackground regions randomly distributed across the genome(ie excluding Agouti) and applied additional filters to max-imize the quality of the data First we discarded individualswith a mean depth of coverage (DP) across sites lower than8 Second given that the DP was heterogeneous acrossindividuals we treated all genotypes with DPlt 4 or withmore than twice the individual mean DP as missing data toavoid genotype call errors due to low coverage or mappingerrors Third we partitioned each scaffold into contiguousblocks of 15 kb size recording the number of SNPs and ac-cessible sites for each block To minimize regions with spu-rious SNPs (eg due to mis-alignment or repetitive regions)

we only kept blocks with more than 150 bp of accessible sitesand with a median distance among consecutive SNPs largerthan 3 bp This resulted in a data set consisting of 190 indi-viduals with 28 Mb distributed across 11770 blocks with284287 SNPs and a total of 2814532 callable sites (corre-sponding to 2530245 monomorphic sites supplementarytable 9 Supplementary Material online) Fourth to minimizemissing data we only kept SNPs with at least 90 of calledgenotypes and individuals with at least 75 of data acrosssites in the data set Finally since many of the applied meth-ods rely on the assumption of independence among SNPs wegenerated a data set by sampling one SNP per block selectingthe SNP with the lowest amount of missing data

The PCA and the pairwise FST (estimated following Weirand Cockerham 1984) analyses were performed in R using themethod implemented in the Bioconductor (release 34) pack-age SNPrelate v180 (Zheng et al 2012) We inferred the an-cestry proportions of all individuals based on K potentialcomponents using sNMF (Frichot et al 2014) implementedin the RBioconductor release 34 package LEA v160 (Frichotet al 2015) with default settings We examined K values from1 to 12 and selected the K that minimized the cross entropyas suggested by Frichot et al (2014) We performed the PCAFST and sNMF analyses by applying an extra MAF filtergt (12n) where n is the number of individuals such that singletonswere discarded To test for IBD we compared the estimatedFST values to the pairwise geographic distances (measured as astraight line from the most northern sample ie Colome)between sample sites using a Mantel test with significanceassessed by 10000 permutations of the genetic distance ma-trix In addition fine scale population structure wasaccounted for using the TESS3 R package (Caye et al 2016)Given the similar cross-entropy values in sNMF for Kfrac14 1 to 3100 independent runs for each K value were performed with atolerance of 106 and 200 iterations per run

Demographic AnalysesTo uncover the colonization history of the Nebraska SandHills we inferred the demographic history of populationsbased on the joint SFS obtained from the random anony-mous genomic regions using the composite likelihoodmethod implemented in fastsimcoal2 v305 (Excoffier et al2013) In particular we were interested in testing whether theSand Hills populations were founded widely from across theancestral range or whether there was a single colonizationevent We also aimed to quantify the current and past levelsof gene flow among populations We considered models withthree populations corresponding to samples on the Sand Hills(ie Ballard Gudmundsen SHGC Arapaho and Lemoyne) aswell as off of the Sand Hills in the north (ie Colome Dogearand Sharps) and in the south (ie Ogallala Grant andImperial) Specifically we considered two alternative three-population demographic models as those are best supportedby the data to test whether the colonization of the Sand Hillsmost likely occurred from the north or from the south in asingle event or serial events thereby simultaneously quanti-fying the levels of gene flow between populations on and offof the Sand Hills (supplementary fig 2 Supplementary

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

803Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 13: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

Material online) In these models we assumed that coloniza-tion dynamics were associated with founder events (ie bot-tlenecks) and the number and place of which along thepopulation trees were allowed to vary among modelsHowever parameter values corresponding to a no size changemodel (ie no bottleneck) were included for evaluationMutation rates and generation times were taken followingLinnen et al (2009 2013)

We constructed a 3D SFS by pooling individuals from thethree sampling regions (ie on the Sand Hills as well as off ofthe Sand Hills in the north and south) For each of the 11770blocks of 15 kb size 30 individuals (ie ten individuals fromeach of the three geographic regions) were selected such thatall SNPs within a given block exhibited complete genotypeinformation Specifically we selected the 30 individuals withthe least missing data for each block only including SNPs withcomplete genotype information across the chosen individu-als a procedure that maximized the number of SNPs withoutmissing data while keeping the local pattern of linkage dis-equilibrium within each block Note that we sampled geno-types of the same individuals within each block but that atdifferent blocks the sampled individuals might differ For eachblock we computed the number of accessible sites and dis-carded blocks without any SNPs

The resulting 3D-SFS contained a total of 140358 SNPs and2674174 accessible sites (supplementary table 8Supplementary Material online) The number of monomorphicsites was based on the number of accessible sites assuming thatthe proportion of SNPs lost with the extra DP filtering steps notincluded in the mask file was identical for polymorphic andmonomorphic sites Given that there was no closely relatedoutgroup sequence available to determine the ancestral allelefor SNPs within the random background regions we analyzedthe multidimensional folded SFS generated using Arlequinv3522 (Excoffier and Lischer 2010) For each model we esti-mated the parameters that maximized its likelihood by perform-ing 50 optimization cycles (-L 50) approximating the expectedSFS based on 350000 coalescent simulations (-N 350000) andusing as a search range for each parameter the values reported inthe supplementary table 10 Supplementary Material online

The model best fitting the data was selected using Akaikersquosinformation criterion (Akaike 1974) Because our data setlikely contains linked sites the confidence for a given modelis likely to be inflated Thus to compare models we generatedbootstrap replicates with one SNP sampled from each 15-kbblock which we assumed to be independent We estimatedthe likelihood of each bootstrap replicate based on theexpected SFS obtained for each model with the full set ofSNPs (following Bagley et al 2017) Furthermore we obtainedconfidence intervals for the parameter estimates by a non-parametric block-bootstrap approach To account for linkagedisequilibrium we generated 100 bootstrap data sets by sam-pling the 11770 blocks for each bootstrap data set with re-placement in order to match the original data set size Foreach data set we performed two independent runs using theparameters that maximized the likelihood as starting valuesWe then computed the 95 percentile confidence intervals ofthe parameters using the R-package boot v13-18

Selection AnalysesWe used VCFtools v0112 b (Danecek et al 2011) to calculateWeir and Cockerhamrsquos FST (Weir and Cockerham 1984) in theAgouti region We pooled the 11 sampling locations into threepopulations (ie one population on the Sand Hills and twopopulations off of the Sand Hills One in the north and one inthe south see Demographic Analyses) and calculated FST in asliding window (window size 1 kb step size 100 bp) as well ason a per-SNP basis within Agouti to identify highly differen-tiated candidate regions for local adaptation

To control for potential hierarchical population structureas well as past fluctuations in population size we also mea-sured genetic differentiation using the HapFLK method(Fariello et al 2013) HapFLK calculates a global measure ofdifferentiation for each SNP (FLK) or inferred haplotype(HapFLK) after having rescaled allelehaplotype frequenciesusing a population kinship matrix We calculated thekinship matrix using the complete data set excluding thescaffolds containing Agouti We launched the HapFLK soft-ware using 40 independent runs (ndashnfit 40) and ndashK 40 onlykeeping alleles with a MAFgt 005 We obtained the neutraldistribution of the HapFLK statistic by running the softwareon 1000 neutral simulated data sets under our best demo-graphic model

To map potential complete selective sweeps in the Agoutiregion we utilized the CLR method (Nielsen et al 2005) asimplemented in the software Sweepfinder2 (DeGiorgio et al2016) For the analysis we used the P maniculatus rufinusAgouti sequence to identify ancestral and derived allelic states(see Variant Calling and Filtering) We ran Sweepfinder2 withthe ldquondashsurdquo option defining grid-points at every variant andusing a precomputed background SFS We calculated thecutoff value for the CLR statistic using a parametric bootstrapapproach (as proposed by Nielsen et al 2005) For this pur-pose we reran the CLR analysis on 10000 data sets simulatedunder our inferred neutral demographic model for the SandHills populations in order to reduce false-positive rates (seeCrisci et al 2013 Poh et al 2014)

Calculating Conditions of Allele MaintenanceFollowing Yeaman and Whitlock (2011) the threshold forallele maintenance is defined as the migration rate thatsatisfies dfrac141(4 N) which represents the criteria at whichallele frequency changes owing to genetic drift are onthe same order as frequency changes owing to the interplaybetween selection and migration Further in order to extendthis migration threshold to the consideration of aphenotypic trait they define the fitness (W) of phenotype(Z) as

W frac14 1 ndash Uethh Z= 2heth THORNTHORNc

where U is the strength of stabilizing selection h is the locallyadaptive phenotype which takes a positive value in the de-rived population and a negative value in the ancestral popu-lation and c specifies a curvature for the function Furtherthe effect size of the underlying mutation is given as aFollowing Yeaman and Otto (2011) d may then be defined as

Pfeifer et al doi101093molbevmsy004 MBE

804Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 14: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

d frac14 w2thorn 1

2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiw2 4 1 2meth THORNW1Aa

W1AA

W2Aa

W2AA 1

s

where w frac14 1meth THORN W1Aa

W1AAthorn W2Aa

W2AA

Wij is the fitness of allele

j in environment i a is the allele beneficial in the ancestralenvironment and deleterious in the derived environmentand A is the allele that is beneficial in the derived environ-ment Finally Yeaman and Whitlock (2011) define the migra-tion threshold for a particular value of a as

mthreshold frac141

W1Aa

W1AaW1AA 1thorn 14Neth THORN

W2Aa

W2AA 1thorn 14Neth THORNW2Aa

To compare with our empirical observations and inferredvalues we take for the sake of example the identified serinedeletion in exon 2 compared between the Sand Hills and thenorthern population First we set h frac14 61 (ie positive onthe Sand Hills negative off of the Sand Hills) and cfrac14 2 (ie aconvex shape [Yeaman and Whitlock 2011]) Inference per-taining to the strength of selection acting on the crypticphenotype has been made most notably from previous pre-dation experiments in which conspicuously colored pheno-types were attacked significantly more often than those thatwere cryptically colored with a calculated selection index-frac14 0545 (Linnen et al 2013) Furthermore previous crossingexperiments have suggested that the serine deletion is a dom-inant allele (Linnen et al 2009) Finally the most stronglyassociated trait has here been calculated near 05 (ie dorsalbrightness) For the corresponding value of a this suggests amigration threshold of mfrac14 08 Thus for the serine deletionit is readily apparent that our inferred migration rates are wellbelow this expected threshold allowing maintenance (wherethe 95 CI is contained in mlt 00004)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe thank J Larson and K Turner for laboratory assistance EKay E Kingsley and M Manceau for field assistance JDemboski and the Denver Museum of Nature and Sciencefor logistical support the University of Nebraska-Lincoln foruse of facilities andor permission to collect mice at CedarPoint Biological Station Gudmundsen Sandhills Laboratoryand Arapaho Prairie and J Chupasko for curation assistanceThis work was funded by a Swiss National Science FoundationSinergia grant to LE HEH and JDJ

ReferencesAeschbacher S Burger R 2014 The effect of linkage on establishment

and survival of locally beneficial mutations Genetics 197(1) 317ndash336Akaike H 1974 New look at statistical-model identification IEEE Trans

Automat Contr 19(6) 716ndash723Bagley RK Sousa VC Niemiller ML Linnen CR 2017 History geography

and host use shape genome-wide patterns of genetic variation in theredhead pine sawfly Mol Ecol 26(4) 1022ndash1044

Barsh GS 1996 The genetics of pigmentation from fancy genes tocomplex traits Trends Genet 12(8) 299ndash305

Browning SR Browning BL 2007 Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies byuse of localized haplotype clustering Am J Hum Genet 81(5)1084ndash1097

Bulmer MG 1972 The genetic variability of polygenic charactersunder optimizing selection mutation and drift Genet Res 19(1)17ndash25

Bultman SJ Klebig ML Michaud EJ Sweet HO Davisson MT WoychikRP 1994 Molecular analysis of reverse mutations from nonagouti (a)to black-and-tan (a (t)) and white-bellied agouti (Aw) reveals alter-native forms of agouti transcripts Genes Dev 8(4) 481ndash490

Caye K Deist TM Martins H Michel O Francois O 2016 TESS3 fastinference of spatial population structure and genome scans for se-lection Mol Ecol Res 16(2) 540ndash548

Chaves JA Cooper EA Hendry AP Podos J De Leon LF Raeymaekers JAMacMillan WO Uy JA 2016 Genomic variation at the tips of theadaptive radiation of Darwinrsquos finches Mol Ecol 25(21) 5282ndash5295

Clarke B Murray J 1962 Changes of gene-frequency in Cepaea nemoralisthe estimation of selective values Heredity 17(4) 467ndash476

Comeault AA Soria-Carrasco V Gompert Z Farkas TE Buerkle CAParchman TL Nosil P 2014 Genome-wide association mapping ofphenotypic traits subject to a range of intensities of natural selectionin Timema cristinae Am Nat 183(5) 711ndash727

Cook LM Mani GS 1980 A migration-selection model for the morphfrequency variation in the peppered moth over England and WalesBiol J Linn Soc 13(3) 179ndash198

Crisci J Poh Y-P Bean A Simkin A Jensen JD 2012 Recent progress inpolymorphism based population genetic inference J Hered 103(2)287ndash296

Crisci J Poh Y-P Mahajan S Jensen JD 2013 On the impact of equilib-rium assumptions on tests of selection Front Genet 4235

Danecek P Auton A Abecasis G Albers CA Banks E DePristo MAHandsaker RE Lunter G Marth GT Sherry ST 2011 The variantcall format and VCFtools Bioinformatics 27(15) 2156ndash2158

DeGiorgio M Huber CD Hubisz MJ Hellmann I Nielsen R 2016SweepFinder2 increased sensitivity robustness and flexibilityBioinformatics 32(12) 1895ndash1897

DePristo M Banks E Poplin R Garimella KV Maguire JR Hartl CPhilippakis AA del Angel G Rivas MA Hanna M et al 2011 Aframework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43(5) 491ndash498

Dice LR 1941 Variation of the deer-mouse (Peromyscus maniculatus) onthe Sand Hills of Nebraska and adjacent areas Contrib Lab VertebrateBiol Univ Michigan 151ndash19

Dice LR 1947 Effectiveness of selection by owls of deer mice (Peromyscusmaniculatus) which contrast in color with their background ContribLab Vertebrate Biol Univ Michigan 341ndash20

Endler JA 1977 Geographic variation speciation and clines Princeton(NJ) Princeton University Press

Excoffier L Dupanloup I Huerta-Sanchez E Sousa VC Foll M 2013Robust demographic inference from genomic and SNP data PLoSGenet 9(10) e1003905

Excoffier L Lischer HE 2010 Arlequin suite ver 35 a new series ofprograms to perform population genetics analyses under Linuxand Windows Mol Ecol Resour 10(3) 564ndash567

Fariello MI Boitard S Naya H SanCristobal M Servin B 2013 Detectingsignatures of selection through haplotype differentiation among hi-erarchically structured populations Genetics 193(3) 929ndash941

Feulner PGD Chain FJJ Panchal M Huang Y Eizaguirre C Kalbe M LenzTL Samonte IE Stoll M Bornberg-Bauer E et al 2015 Genomics ofdivergence along a continuum of parapatrick population differenti-ation PLoS Genet 11(7) e1005414

Frichot E Francois O OrsquoMeara B 2015 LEA an R package for landscapeand ecological association studies Methods Ecol Evol 6(8) 925ndash929

Frichot E Mathieu F Trouillon T Bouchard G Francois O 2014 Fast andefficient estimation of individual ancestry coefficients Genetics196(4) 973ndash983

Local Adaptation with Gene Flow doi101093molbevmsy004 MBE

805Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1
Page 15: The Evolutionary History of Nebraska Deer Mice: Local ...Lenormand and Otto 2000).Therelativeadvantageofthis linkage will increase as the population approaches the migra-tion threshold,

Gompert Z Lucas LK Nice CC Buerkle CA 2013 Genome divergenceand the genetic architecture of barriers to gene flow betweenLycaeides and L melissa Evolution 67(9) 2498ndash2514

Haldane JBS 1930 Enzymes LondonNew York Longmans Green andCo

Haldane JBS 1957 The cost of natural selection J Genet 55(3) 511ndash524Hoekstra HE Drumm KE Nachman MW 2004 Ecological genetics of

adaptive color polymorphism in pocket mice geographic variationin selected and neutral genes Evolution 58(6) 1329ndash1341

Jackson IJ 1994 Molecular and development genetics of mouse coatcolor Annu Rev Genet 28189ndash217

Jensen JD Foll M Bernatchez L 2016 Introduction the past present andfuture of genomic scans for selection Mol Ecol 25(1) 1ndash4

Jones FC Grabherr MG Chan YF Russell P Mauceli E Johnson JSwofford R Pirun M Zody MC White S et al 2012 The genomicbasis of adaptive evolution in threespine sticklebacks Nature484(7392) 55ndash61

Joost S Vuilleumier S Jensen JD Schoville S Leempoel K Stucki SWidmer I Melodelima C Rolland J Manel S 2013 Uncovering thegenetic basis of adaptive change on the intersection of landscapegenomics and theoretical population genetics Mol Ecol 22(14)3659ndash3665

Kingsley SP Manceau M Wiley CD Hoekstra HE 2009 Melanism inPeromyscus is caused by independent mutations in Agouti PLoS One4(7) e6435

Kirkpatrick M Barton N 2006 Chromosome inversions local adapta-tion and speciation Genetics 173(1) 419ndash434

Lenormand T 2002 Gene flow and the limits to natural selection Cell17(4) 183ndash189

Lenormand T Otto SP 2000 The evolution of recombination in a het-erogeneous environment Genetics 156(1) 423ndash438

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 27(8) 1157ndash1158

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 2009 The Sequence AlignmentMap formatand SAMtools Bioinformatics 25(16) 2078ndash2079

Linnen CR Hoekstra HE 2009 Measuring natural selection on genotypesand phenotypes in the wild Cold Spring Harb Symp Quant Biol74155ndash168

Linnen CR Kingsley EP Jensen JD Hoekstra HE 2009 On the origin andspread of an adaptive allele in Peromyscus mice Science 325(5944)1095ndash1098

Linnen CR Poh Y-P Peterson BK Barrett RDH Larson JG Jensen JDHoekstra HE 2013 Adaptive evolution of multiple traits throughmultiple mutations at a single gene Science 339(6125) 1312ndash1316

Loope DB Swinehart J 2000 Thinking like a dune field Gt Plains Res 105Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitive

and fast mapping of Illumina sequence reads Genome Res 21(6)936ndash939

Mallarino R Linden TA Linnen CR Hoekstra HE 2017 The role ofisoforms in the evolution of cryptic coloration in Peromyscusmice Mol Ecol 26(1) 245ndash258

Manceau M Domingues VS Linnen CR Rosenblum EB Hoekstra HE2010 Convergence in pigmentation at multiple levels mutationsgenes and function Philos Trans R Soc Lond B Biol Sci 365(1552)2439ndash2450

Manceau M Domingues VS Mallarino R Hoekstra HE 2011 The devel-opmental role of Agouti in color pattern evolution Science331(6020) 1062ndash1065

Martin M 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnetjournal 17(1) 10ndash12

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 The GenomeAnalysis Toolkit a MapReduce framework for analyzing next-generation DNA sequencing data Genome Res 20(9) 1297ndash1303

Mills MG Patterson LB 1994 Not just black and white pigment patterndevelopment and evolution in vertebrates Semin Cell Biol 2072ndash81

Montgomerie R 2006 Analyzing colours In Hill GE McGraw KJ editorsBird coloration Volume I Mechanisms and measurementsCambridge (MA) Harvard University Press p 90ndash147

Montgomerie R 2008 CLR Version 105 Kingston Canada QueensUniversity

Nielsen R Williamson S Kim Y Hubisz MJ Clark AG Bustamante C2005 Genomic scans for selective sweeps using SNP data GenomeRes 15(11) 1566ndash1575

Pfeifer SP 2017 From next-generation resequencing reads to a highquality variant dataset Heredity 118(2) 111ndash124

Pickrell JK Pritchard JK 2012 Inference of population splits and mixturesfrom genome-wide allele frequency data PLoS Genet 8(11) e1002967

Poh Y-P Domingues V Hoekstra HE Jensen JD 2014 On the prospect ofidentifying adaptive loci in recently bottlenecked populations a casestudy in beach mice PLoS One 9e110579

Purcell S Neale B Todd-Brown K Thomas L Ferreira MAR Bender DMaller J Sklar P de Bakker PIW Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81(3) 559ndash575

Rafajlovic M Kleinhans D Gulliksson C Fries J Johansson D Ardehed ASundqvist L Pereyra RT Mehlig B Jonsson PR et al 2017 Neutralprocesses forming large clones during colonization of new areas JEvol Biol 30(8) 1544ndash1560

Ross KG Keller L 1995 Joint influence of gene flow and selection on areproductively important genetic polymorphism in the fire antSolenopsis invicta Am Nat 146(3) 325ndash348

Rousset F 1997 Genetic differentiation and estimation of gene flowfrom F-statistics under isolation by distance Genetics 145(4)1219ndash1228

Smit AFA Hubley R Green P 2013ndash2015 RepeatMasker Open-40Available from httpwwwrepeatmaskerorg

Smith JM 1977 Why does the genome not congeal Nature 268(5622)693ndash696

Smouse PE Long JC Sokal RR 1986 Multiple regression and correlationextensions of the Mantel test of matrix correspondence Syst Zool35(4) 627ndash632

Sokal RR 1979 Testing statistical significance of geographic variationpatterns Syst Zool 28(2) 227ndash232

Stinchcombe JR Weinig C Ungerer M Olsen KM Mays C HalldorsdottirSS Purugganan MD Schmitt J 2004 A latitudinal cline in floweringtime in Arabidopsis thaliana modulated by the flowering time geneFRIGIDA Proc Natl Acad Sci U S A 101(13) 4712ndash4717

Tabachnick B Fidell LS 2001 Using multivariate statistics 4th ed BostonAllyn and Bacon

Tigano A Friesen VL 2016 Genomics of local adaptation with gene flowMol Ecol 25(10) 2144ndash2164

Van der Auwera GA Carneiro MO Hartl C Poplin R Del Angel G Levy-Moonshine A Jordan T Shakir K Roazen D Thibault J 2013 FromFastQ data to high-confidence variant calls the Genome AnalysisToolkit best practices pipeline Curr Protoc Bioinformatics4311101ndash111033

Vrieling H Duhl DM Millar SE Miller KA Barsh GS 1994 Differences indorsal and ventral pigmentation result from regional expression ofthe mouse agouti gene Proc Natl Acad Sci U S A 91(12) 5667ndash5671

Weir BS Cockerham CC 1984 Estimating F-statistics for the analysis ofpopulation structure Evolution 38(6) 1358ndash1370

Yeaman S Otto SP 2011 Establishment and maintenance of adaptivegenetic divergence under migration selection and drift Evolution65(7) 2123ndash2129

Yeaman S Whitlock MC 2011 The genetic architecture of adaptationunder migration-selection balance Evolution 65(7) 1897ndash1911

Zheng X Levin D Shen J Gogarten SM Laurie C Weir BS 2012 A high-performance computing toolset for relatedness and principal com-ponent analysis of SNP data Bioinformatics 28(24) 3326ndash3328

Zhou X Carbonetto P Stephens M 2013 Polygenic modelingwith Bayesian sparse linear mixed models PLoS Genet 9(2)e1003264

Pfeifer et al doi101093molbevmsy004 MBE

806Downloaded from httpsacademicoupcommbearticle-abstract3547924810448by Harvard University useron 07 May 2018

  • msy004-TF1