
Understanding Ethnic Differences: The Roles of Geography and Trade

Andrew Dickens†

4 February 2020

The impact of ethnic divisions on economic growth and development is well understood, yet little is known about the source of these divisions. In this study I take the economic importance of ethnic group differences as given, and explore the geographic foundations of group differences. I construct a georeferenced data set to examine the border region of spatially adjacent ethnic groups, together with variation in the set of potentially cultivatable crops at the onset of the Columbian Exchange, to identify how variation in land productivity impacts linguistic differences between adjacent ethnic groups. I find that ethnic groups separated across geographic regions with high variation in land productivity are more similar in language than groups separated across more homogeneous regions. I then provide evidence in support of the proposed mechanism: historical trade was more frequent in these high-variation regions and the frequency of trade served as a social tie between culturally distinct ethnic groups.

†Brock University, Department of Economics, St. Catharines, ON. E-mail: [email protected]. A special thanks to Nippe Lagerlöf for his detailed feedback. This research has also benefited from discussion with Tasso Adamopoulos, Chris Bidner, Arthur Blouin, Greg Casey, James Fenske, Marc Klemp, Oded Galor, Dietrich Vollrath, and seminar participants at Brock University, Wilfrid Laurier University, York University, CEA Annual Meeting (McGill), DEGIT XXIV (Southern Denmark) and ASREC (Lund). I also thank Stelios Michalopoulos for generously sharing data with me.


1 Introduction

It is well understood that history shapes the evolution of culture (Nunn, 2012). The long arc of history then puts great importance on understanding the channels through which culture persists and changes, given the mounting evidence that history and culture impact comparative economic development today (Spolaore and Wacziarg, 2009; Comin et al., 2010; Putterman and Weil, 2010; Chanda et al., 2014). While there is a large literature linking major historical events to episodes of cultural change, there is little empirical evidence on the economic mechanisms that shape culture.¹ The aim of this research is to shed light on an unexplored channel of cultural evolution: inter-ethnic trade.

I hypothesize that the natural process of cultural drift between ancestral ethnic groups is mediated by the frequency of inter-ethnic trade throughout history.² This hypothesis is based on a Ricardian theory of trade, where people living in geographic regions with high variation in land productivity historically relied on various modes of subsistence, thus creating an opportunity for trade. The hypothesis builds on the long-standing observation that historical trade occurred near ecological frontiers (Lovejoy and Baier, 1975; Bates, 1983; Fenske, 2014). Ecological zones of varying productivity promote variety in production, and so the historical gains from trade were greatest when goods from one ecological zone could be traded with goods from another (Bates, 2010). In pre-modern times, trade fostered communication networks and served as a social tie between spatially and culturally distinct ethnic groups (Koetsier, 2019). Indeed, the extent to which related ancestral groups drift apart is, in part, determined by the degree to which related groups are isolated from one another (Cavalli-Sforza and Feldman, 1981). Hence the conjecture that trade shapes culture: the process of cultural drift between ancestral ethnic groups is mitigated by their frequency of contact via trade.

To test this hypothesis, I use the Ethnologue to construct a georeferenced dataset of spatially adjacent ethnic groups from across the world.³ I extract the border segment connecting each adjacent pair, and construct a 100-kilometer buffer zone around each border segment as my unit of observation. Each buffer zone is used as a topographic lens to study the region that links a spatially adjacent group pair (see Figure 1).

¹ There is a large literature documenting channels of cultural persistence and change due to factors rooted deep in history (Voigtlander and Voth, 2012; Giuliano and Nunn, 2013; Alesina et al., 2013; Becker et al., 2016; Guiso et al., 2016, among others).

² The use of the word drift in this context is analogous to genetic drift in the sense that it describes divergence between related populations due to random error (Cavalli-Sforza and Feldman, 1981). This notion of drift differs from natural selection, where cultural change is directional and the result of non-random evolutionary forces.

³ An ethnic group is a social grouping of people that is rooted in the belief of shared ancestry. A basic feature of an ethnic group is some form of a cultural community, often manifested in a common language, where a sense of solidarity within the community unites its members (Fearon, 2003).

Figure 1: Unit of Observation – Border Buffer Zone

Adjacent group pair example: the Shona and Manyika of Zimbabwe. A buffer zone is constructed around the segment of border separating the group pair and serves as the unit of observation.

For the independent variable of interest, I use variation in land productivity within each buffer zone as a proxy for the historical gains from trade. I do this using Galor and Ozak’s (2016) Caloric Suitability Index (CSI), a measure of potential caloric yields under low-level inputs and rain-fed agriculture for each 5′ × 5′ (≈ 100 km²) grid cell across the globe. Cultivation methods based on low-level inputs and rain-fed agriculture reflect traditional and labour-intensive cultivation methods used in the distant past, and the potential yield estimates are based on agro-climatic constraints that are largely independent of human actions. For the purpose of identification, I also measure the change in variation in land productivity that followed the Columbian Exchange, the widespread exchange of crops between the Old and New World after Columbus’ arrival in the Americas in 1492 (Nunn and Qian, 2010). This unexpected change in the availability of agricultural goods provides a historical source of quasi-random variation in a geographic region’s productivity (Galor and Ozak, 2016).

As a measure of cultural similarity, I calculate the linguistic distance between each adjacent group pair using a lexicostatistical measure of distance. Matching languages to each group is straightforward since the Ethnologue maps ethnolinguistic groups, that is, ethnic groups unified by a common language. This approach builds on the idea that ethnolinguistic identity is an important predictor of cultural values, norms and preferences (Desmet et al., 2017).⁴ Sociolinguists Fishman et al. (1985) famously argued that language reveals the associated cultural group’s way of thinking and how they organize experience. This view of language builds on the well-known Sapir-Whorf hypothesis, a conjecture that the language an individual speaks shapes their perspective and understanding of the world, and hence the values and norms that guide their everyday decision making.⁵ Recently, economists have also taken an interest in the Sapir-Whorf hypothesis: language groups that grammatically associate the present and the future display more long-term orientation in economic behaviour (Chen, 2013; Galor et al., 2018), and societies that use gendered language more often exhibit traditional gender roles and have lower female labour force participation (Jakiela and Ozier, 2019). One explanation for this latter finding is that the presence of gendered language inhibits female educational attainment (Galor et al., 2020).

⁴ The view that language and culture are inseparable is common enough that “language and its associated culture” is considered a colloquialism among ethnolinguists (Risager, 2015, p. 87).

⁵ In the field of experimental psychology, evidence of these Sapir-Whorf effects has been identified in conceptions of time (Boroditsky, 2001), space (Levinson, 2003), gender (Boroditsky et al., 2003) and colour (Roberson and Hanley, 2010).

The baseline estimates indicate that a one standard deviation increase in land productivity variation at the onset of the Columbian Exchange results in a 2-4 percentage point decrease in linguistic distance. These results are robust to country fixed effects and various geographic, climatic and disease environment control variables. This evidence is also consistent across the full sample and the “sibling” sample: I narrow my focus to a subset of adjacent group pairs that descend from the same parent language as a way of disentangling the effect of shared ancestry from the effect of geography that I’m interested in. Overall, the estimates highlight the link between historical variations in land productivity and contemporary differences in language.

The key identifying assumption of the baseline model is that the change in land productivity variation in the post-Columbian period is random and independent of all other factors in a buffer zone, conditional on the level of land productivity variation in the pre-Columbian period. Yet large-scale migrations have occurred throughout the post-Columbian era, suggesting that contemporary group locations might differ from their historical locations. To address this concern, I narrow my focus to adjacent group pairs located in contemporary countries that were largely unaffected by post-Columbian migrations. In all cases, across all specifications, the variables of interest remain statistically significant and become larger in magnitude than at baseline. This increase in magnitude is consistent with post-Columbian migrations introducing an attenuation bias due to measurement error.

Next I document evidence in support of the hypothesized trade mechanism. First, I show that the productivity of a tract of land predicts an ethnic group’s historical mode of subsistence. Here, I use Murdock’s (1967) Ethnographic Atlas, which includes latitude and longitude coordinates for the location of 1,265 pre-colonial ethnic groups, and each group’s historical reliance on different modes of subsistence. As a unit of observation, I construct a buffer zone around each group’s latitude-longitude coordinate, and measure land productivity for each buffer zone using Galor and Ozak’s (2016) CSI data. I find that pre-colonial ethnic groups located in high-productivity regions relied on agriculture more than groups located in low-productivity regions, who instead relied more on animal husbandry and fishing as a means of subsistence. Taken together, these findings suggest that groups in geographic regions with high variation in land productivity relied on various modes of subsistence, thus creating an opportunity for trade (Bates, 1983; Fenske, 2014).

Second, I document persistence of the long-run effect of historical trade on cultural drift. I use Old World trade route maps from Michalopoulos et al. (2016, 2018), and measure the distance to the nearest trade route from adjacent group pair buffer zones. I find that adjacent group pairs located near these historical trade routes are more similar in language today than group pairs located far away. This evidence is consistent with the idea that cultural drift is mitigated by frequent interaction (Eggan, 1963; Boyd and Richerson, 1985), where interaction through trade creates and reinforces social ties between spatially and culturally distinct ethnic groups.

These findings contribute to our understanding of the important role geography plays in the emergence and evolution of ethnic groups. In related work, Michalopoulos (2012) links the spatial distribution of ethnic groups across the globe to variations in geographical factors. He finds that ethnic groups develop human capital suitable to their ecological environment, where such skills are non-transferable across ecological boundaries, resulting in the occurrence of ethnic group boundaries near high-variation geographic regions. Ashraf and Galor (2013b) show that genetic diversity within a given population is negatively associated with that population’s migratory distance from East Africa, and in related work Ashraf and Galor (2013a) show that these variations in genetic diversity contribute to spatial patterns of ethnic diversity today. Cervellati et al. (2017) find that ethnic groups are plentiful and remain small in size in ecological environments prone to malaria, since the spread of vector-borne pathogens like malaria is best contained through geographic isolation and minimal inter-group contact. Ahlerup and Olsson (2012) similarly find that isolating geographic features restrict group size and increase ethnic diversity, and also show that diversity is more pronounced in countries where humans settled relatively early.

However, the existing evidence only speaks to the extensive margin (why some regions are more diverse than others) and cannot speak to the intensive margin (why we observe different rates of divergence between ancestrally related groups). Yet the economic effects of intensive margin differences are well documented in the literature (e.g., Spolaore and Wacziarg (2013)). Hence, a subtle yet important question remains unanswered: why are some ethnic groups more dissimilar from each other than others? My principal contribution is evidence that variation in land productivity, through its effect on inter-ethnic trade, explains why ancestrally related ethnic groups diverge in language at different rates. Taken together, the extensive- and intensive-margin evidence suggests that variation in geographical factors results in more ethnically fractionalized regions, but such regions provide an opportunity for trade and thereby mitigate the overall process of drift between divergent groups.

My findings also contribute to our understanding of how geography and history interact and shape the long-term process of economic development. A prime example of this interaction is found in the Columbian Exchange, where Columbus’ arrival in the Americas sparked a widespread exchange of crops, thereby altering the agro-climatic conditions across the Old and New World (Nunn and Qian, 2010). This change in agro-climatic conditions had far-reaching consequences: the introduction of the potato to the Old World fuelled population growth and urbanization rates (Nunn and Qian, 2011), and permanently reduced civil conflicts for nearly two hundred years (Iyigun et al., 2017); the introduction of corn to Africa increased population density and consequently the supply of slaves during the trans-Atlantic slave trade (Cherniwchan and Moreno-Cruz, 2019); and the increased suitability for agricultural investment generated a greater prevalence of long-term orientation (Galor and Ozak, 2016), and the use of the future tense in language (Galor et al., 2018).⁶

My contribution to this literature is evidence that the Columbian Exchange generated variation in land productivity across space, thereby altering the subsistence activities of pre-colonial ethnic groups and the potential gains from inter-ethnic trade. The long-term implication is that the Columbian Exchange altered the co-evolution of language between neighboring ethnic groups through its impact on inter-ethnic trade. These ethnolinguistic group differences have contemporary implications for bilateral trade (Melitz, 2008), cross-country income differences (Spolaore and Wacziarg, 2009), idea flows (Dickens, 2018b), the likelihood of international conflict (Spolaore and Wacziarg, 2016), the distribution of public resources (Dickens, 2018a), infant mortality rates (Gomes, 2018) and more (see Spolaore and Wacziarg (2013)).

The rest of this paper is organized as follows. Section 2 outlines a conceptual framework to interpret the main findings. Section 3 presents the data and how the buffer zones are constructed. Section 4 outlines the empirical model and identification strategy, and presents the empirical results. Section 5 reports evidence in support of the proposed trade mechanism and Section 6 concludes.

⁶ Many other major historical events have interacted with geography in ways that shape the contemporary era. For example, pre-colonial ethnic groups living in geographic regions of Africa with rugged terrain were less exposed to the trans-Atlantic slave trade because they were better able to escape from slave traders (Nunn and Puga, 2012). In this instance, the geography of rugged areas shielded groups from the persistent and negative impact of the slave trade (Nunn, 2008), and consequently these groups exhibit higher levels of trust today (Nunn and Wantchekon, 2011). Another key example of geography and history interacting is found in the timing of the Neolithic Revolution. The east-west orientation of Eurasia advanced the process of agricultural development sooner than in the Americas, where the north-south orientation of the continent inhibited the spread of domesticable plants and animals (Diamond, 1997; Olsson and Hibbs Jr., 2005). This transition to agriculture was similarly propelled by climatic fluctuations, which led foragers to diversify their subsistence activities towards agriculture, conditional on the fact that climatic fluctuations were not so severe that they eliminated the underlying resource base (Ashraf and Michalopoulos, 2015). See Nunn (2009, 2012), and Spolaore and Wacziarg (2013) for an in-depth review of this literature.


2 Conceptual Framework

In the empirical analysis, I find that ethnic groups separated across geographic regions with high variation in land productivity are more similar in language than groups separated across low-variation regions. But this raises the question: why do geographical variations impact linguistic differences at all? In what follows, I provide a conceptual framework to answer this question by connecting insights from economic history to established findings in cultural anthropology, linguistics and evolutionary biology.

A well-documented characteristic of an ecosystem boundary (the transition zone between various ecosystems) is that boundary regions exhibit high levels of biodiversity (Odum, 1971). Regions with high variation in land productivity, then, give life to a wide range of producible goods. However, the connection between productivity and production is more complex. Even in low-productivity regions, where few agricultural goods are produced, non-agricultural modes of subsistence such as pastoralism can thrive. Yet pastoralists are often not entirely self-sufficient, since they rely on connections to other groups and regions through trade (Kardulias, 2015).

Cultural anthropologists have long recognized that, in pre-modern times, inter-group trade proliferated in regions with a range of producible goods because the economic viability of specialization not only depends on the productive capacity of the environment, but equally upon the opportunity for exchange with producers of different goods (Barth, 1961; Bates and Lees, 1977; Bradburd, 1996). Ecosystem boundaries are then regions where land productivity varies considerably, and this variation “promotes diversity in production and, with it, exchange over space” (Bates, 2010, p. 21).

This geographical origin of inter-group trade is Ricardian in nature: differences in land productivity determine comparative advantages, and thus patterns of product specialization for separate groups living in varying ecosystems. Historians have noted a similar mechanism of ecologically-driven trade (Lovejoy and Baier, 1975), and economists studying the origin of the state have written extensively about this Ricardian mechanism, particularly in Africa, where states emerged in regions with ecologically-driven trade to protect growing market economies (Bates, 1983; Fenske, 2014).

The evolutionary impact of inter-ethnic trade on language relates to the extent of contact between trading groups (Centola et al., 2007). The traditional view among linguists is that there are historical and ahistorical reasons for linguistic similarity. The ahistorical reasons include chance and language universals, both of which “should apply equally to all languages, so their effects should be statistically equal for all languages” (Cysouw, 2013, p. 22). Hence, only historical factors are of interest here, which include two primary channels of influence: vertical transfer (i.e., genealogical descent) and horizontal transfer (i.e., cross-cultural borrowing).

Horizontal transfer is the channel through which inter-ethnic trade impacts the evolution of language.⁷ When two populations engage in collaborative activities with shared intentions and goals, a cooperative infrastructure for communication is created. It is upon this infrastructure that linguists believe the evolution of language manifests itself and horizontal transfer occurs (Tomasello, 2008). The linguistic relationship between two populations then evolves according to the extent of collaborative activities.

Trade, a collaborative activity, can be understood through this lens, where shared intentions and goals create a cooperative infrastructure for communication and symbiosis in economic exchange. Trading ethnic groups then face an adaptive advantage when their two languages are similar, suggesting that their phylogenetic relationship is preserved and even strengthened by the frequency at which the groups trade. Turner et al. (2003, p. 452) reiterate this point, writing that “not only are products of diverse regions and ecosystems shared and redistributed when cultural groups meet and mingle, so too are [...] linguistic traits and vocabulary.” In pre-modern times, trade routes strengthened these collaborative activities and served as “cultural highways” linking distinct populations across space (Diene, 2000, p. xiii).

By contrast, the motives for specialization and trade disappear in homogeneous ecological environments, and so does the occurrence of inter-group collaborative activities. The implication of limited collaborative activities between two groups is geographic isolation, which accelerates the natural process of drift. At the extreme, the cultural norms of two related ethnic groups living in complete isolation will diverge entirely due to drift, since changes that occur in one group are not necessarily found in the other group (Eggan, 1963; Boyd and Richerson, 1985). Hence, variation in land productivity, through its effect on inter-ethnic trade, explains why ancestrally related ethnic groups diverge in language at different rates.

3 Data

3.1 Border-Level Dataset

For the majority of the empirical analysis, I rely on a dataset where the geographical unit of observation is the segment of border delimiting spatially adjacent ethnic homelands. I construct this dataset using the Ethnologue’s (16th edition) global mapping of ethnolinguistic groups. The Ethnologue maps ethnolinguistic homelands contained within contemporary country borders, implying that the same group may appear in more than one country, in which case each appearance receives unique treatment in the data.

⁷ The linguistic similarity of two neighboring groups is surely due to both channels. In Section 4, I describe my empirical strategy that holds constant the vertical-transfer channel and thereby isolates the horizontal-transfer effect of inter-ethnic trade.

Figure 2: Spatial Distribution of Border Buffer Zones

To start, I use Geographic Information System (GIS) software to identify all spatially adjacent group pairs across the world with concurrent group-level borders. This amounts to 15,581 unique pairs, where each pair is composed of two groups that each speak a distinct language.⁸ I then use GIS to locate the concurrent segment of border delimiting each adjacent pair. Finally, I construct a buffer zone 100 kilometers in diameter around each border segment as my unit of observation.⁹ As the unit of observation, these buffer zones serve as a topographic lens to study the region that links spatially adjacent group pairs.
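The buffer construction can be illustrated with a minimal sketch, which is not the GIS procedure used in the paper. It assumes, purely for illustration, that the border segment is a polyline whose vertices are already projected to planar coordinates measured in kilometers, and it tests whether a grid-cell centroid lies within 50 kilometers of that polyline. The function names `distance_to_segment` and `in_buffer` are hypothetical.

```python
from math import hypot

def distance_to_segment(p, a, b):
    """Shortest planar distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b coincide.
        return hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def in_buffer(point, border, radius_km=50.0):
    """True if point is within radius_km of any piece of the border polyline.

    Assumes border has at least two vertices, in planar kilometers.
    """
    return any(
        distance_to_segment(point, a, b) <= radius_km
        for a, b in zip(border, border[1:])
    )

# Hypothetical border polyline and grid-cell centroids (kilometers).
border = [(0.0, 0.0), (100.0, 0.0), (100.0, 80.0)]
print(in_buffer((50.0, 30.0), border))   # True: 30 km from the first piece
print(in_buffer((40.0, 60.0), border))   # False: 60 km from both pieces
```

A 50-kilometer radius applied at every point along the border yields a zone 100 kilometers across, which is why the default radius is half the stated diameter. On real geographic data, distances would need to be geodesic or computed in a suitable projection.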

3.1.1 Dependent Variable: Linguistic Distance

Language is one aspect of cultural identity that is commonly used by researchers to study the economics of culture (e.g., Ginsburgh and Weber (2016)). Here, I use a computerized lexicostatistical measure of linguistic distance as an estimate of cultural distance. The measure I use was developed by the Automated Similarity Judgment Program (ASJP), a team of linguists at the Max Planck Institute for Evolutionary Anthropology (Wichmann et al., 2010).

⁸ I limit my search to Ethnologue group pairs with reported non-zero populations that are located within the same continent. The initial search identified 17,129 pairs, yet some groups occupy non-adjacent regions of a country, resulting in 1,548 duplicate pairs. I drop these duplicated pairs from the dataset.

⁹ More specifically, the GIS software constructs a 50-kilometer radius in every direction for each point along the border segment. The continuous application of this procedure results in a buffer zone that traces the concurrent segment of border with a diameter of 100 kilometers.

The starting point of this measure is a set of basic words common across all world languages. For any given language, each word is transcribed according to its pronunciation using a standardized orthography. These transcribed word lists are currently available for over 60 percent of global languages. For any two languages of interest, I calculate the minimum number of edits necessary to translate the spelling of a word from one language to another using a Levenshtein distance algorithm. A lexicostatistical measure of distance between two languages is calculated as a normalized average of these Levenshtein distances. See Dickens (2018a) for an in-depth discussion of this measure and Appendix B for a formal discussion of the estimation procedure.
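As a rough illustration of the idea, and not the exact estimation procedure of Appendix B, the sketch below computes a Levenshtein distance for each shared word, divides it by the length of the longer word, and averages across the word list. The choice of normalization (division by the longer word's length) is an assumption made here for illustration, and the function names and the toy word lists are hypothetical rather than ASJP transcriptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer word's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def linguistic_distance(words_a: dict, words_b: dict) -> float:
    """Average normalized distance over the meanings present in both word lists."""
    shared = sorted(set(words_a) & set(words_b))
    if not shared:
        raise ValueError("the two word lists share no meanings")
    return sum(normalized_distance(words_a[m], words_b[m]) for m in shared) / len(shared)

# Toy word lists keyed by meaning (made up for illustration).
lang_a = {"water": "wasser", "hand": "hant"}
lang_b = {"water": "water", "hand": "hand"}
print(round(linguistic_distance(lang_a, lang_b), 3))  # 0.292
```

The result lies between 0 (identical word lists) and 1 (no characters in common), which matches the percentage interpretation of linguistic distance used in the text.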

For each border buffer zone in my dataset, I use this procedure to estimate the linguistic distance between the underlying ethnolinguistic group pair. As noted, the ASJP team has yet to transcribe word lists for all world languages, meaning that I can only calculate the linguistic distance for 8,402 group pairs. Although this observed sample only accounts for 54 percent of the 15,581 identified pairs in the global sample, the overall set of included language groups accounts for 84 percent of the global population and resides in 89 percent of all countries reported in the Ethnologue. In terms of the included language families, the observed sample is quite representative of the global sample too. For example, the three most prominent language families in the global sample (in percentage terms) include Niger-Congo (23 percent), Austronesian (17 percent) and Indo-European (7 percent). In the observed sample, Niger-Congo languages constitute 26 percent, Austronesian 20 percent and Indo-European 8 percent. Figure 2 maps the distribution of the 8,402 buffer zone centroids that I match to these estimates of linguistic distance as part of the baseline sample.

3.1.2 Independent Variable: Land Productivity Variation

As a measure of land productivity, I use the Caloric Suitability Index (Galor and Ozak, 2016).¹⁰ Data for this measure come from the Global Agro-Ecological Zones (GAEZ) project of the Food and Agriculture Organization (FAO). Galor and Ozak (2016) use GAEZ crop yield estimates for 48 crops to construct an average measure of potential output for each 5′ × 5′ grid cell on earth. For each available crop, yield estimates are based on low-level inputs and rain-fed agriculture that reflect traditional labour-intensive cultivation methods used in the distant past. These crop yield estimates are also based on the agro-climatic constraints of a tract of land that are independent of human action, thus reflecting the potential output of a grid cell rather than the actual output. These restrictions imposed on each crop yield estimate are important because they address the concern that land quality is potentially endogenous to human intervention. The estimated potential output for each crop is then converted into a standardized measure of potential caloric return. Finally, Galor and Ozak (2016) average across all 48 potential crop yields within each 5′ × 5′ grid cell to construct the Caloric Suitability Index (CSI), an average measure of land productivity measured in millions of kilocalories, per hectare, per year.

As a measure of pre-Columbian variation in land productivity, I calculate the standard deviation of pre-1500 CSI grid cells within each buffer zone. I also construct a measure of pre-Columbian land productivity using the mean value of pre-1500 CSI grid cells. Galor and Ozak (2016) make an important distinction between the availability of crops in a grid cell in the pre-1500 CE and post-1500 CE periods. The difference between these two periods reflects the change in land productivity that resulted from the expansion of crops in the post-Columbian period. Hence, for my variable of interest, I calculate the change in land productivity variation in the post-Columbian period as the difference between each buffer zone’s post-1500 standard deviation and pre-1500 standard deviation. I similarly calculate the difference in land productivity as the change in average CSI values between these two time periods.

Figure 3: Example Buffer Zones

Two example buffer zones: the Shona-Tonga buffer zone to the west and the Shona-Manyika buffer zone to the east (both located within the borders of modern-day Zimbabwe).

10 http://ozak.github.io/Caloric-Suitability-Index/

Figure 3 illustrates this approach for the pre-Columbian period, mapping two example buffer zones and pre-1500 CE land productivity. There is far more observable variation in land productivity within the Shona-Manyika buffer zone to the east (standard deviation = 0.27) than there is in the Shona-Tonga buffer zone to the west (standard deviation = 0.13). The Shona and Manyika are also much more similar in language (26 percent dissimilarity) relative to the Shona and Tonga (69 percent dissimilarity). While only suggestive, this finding is consistent with the empirical evidence to follow: ethnolinguistic groups are more similar in language when separated across geographic regions with high variation in land productivity.
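The buffer-zone measures described above can be sketched in a few lines. This is only an illustration: the function name `buffer_zone_measures`, the dictionary keys and the CSI cell values are hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the text does not say which estimator is used.

```python
import numpy as np

def buffer_zone_measures(pre_cells, post_cells):
    """Summarise CSI grid cells that fall inside one buffer zone.

    pre_cells / post_cells: CSI values of the same grid cells under the
    pre-1500 and post-1500 crop sets (hypothetical inputs).
    """
    pre = np.asarray(pre_cells, dtype=float)
    post = np.asarray(post_cells, dtype=float)
    # Assumption: population standard deviation (ddof=0).
    sd_pre, sd_post = pre.std(), post.std()
    return {
        "prod_pre": pre.mean(),               # pre-1500 land productivity
        "var_pre": sd_pre,                    # pre-1500 productivity variation
        "d_prod": post.mean() - pre.mean(),   # change in productivity
        "d_var": sd_post - sd_pre,            # change in productivity variation
    }

# Made-up CSI values for four grid cells in one buffer zone.
measures = buffer_zone_measures(
    pre_cells=[1.0, 1.2, 0.8, 1.0],
    post_cells=[1.0, 1.6, 0.6, 1.2],
)
print(measures)
```

In this toy zone the new crops raise both the mean and the dispersion of potential yields, so `d_prod` and `d_var` are both positive.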


3.1.3 Additional Control Variables

I also collect a variety of other geographic and climatic data as control variables. Temperature and precipitation data come from the WorldClim Global Climate Database. These datasets provide information on temperature and precipitation at a spatial resolution of 5′ × 5′. With these data, I calculate the mean and standard deviation within each buffer zone. I use elevation data from the National Oceanic and Atmospheric Administration (NOAA). These data are available at a 30-arc-second resolution. In addition to measuring the level of elevation within each buffer zone, I calculate the standard deviation of elevation as a measure of ruggedness (Michalopoulos, 2012; Kitamura and Lagerlof, 2019). As a measure of the disease environment, I use the Malaria Ecology Index to construct a measure of the average prevalence of malaria in a buffer zone. These data come from Kiszewski et al. (2004) and are available at a spatial resolution of 0.5° × 0.5°.11

I also construct numerous spatial measures. I use the Coastline dataset from Natural Earth to calculate distances from buffer zone centroids to the nearest coast, and data from NOAA’s Global Self-consistent, Hierarchical, High-resolution Geography Database to calculate distances to the nearest lake, major river and minor river.12 To capture the geographic size of each adjacent group pair, I calculate the total land area occupied by both groups and the geodesic distance between group centroids. I also calculate the absolute difference in latitude and longitude of each pair using their centroid coordinates to account for the spatial orientation of an adjacent pair.
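As a rough illustration of the centroid-based controls, the sketch below computes a great-circle distance with the haversine formula, together with the absolute differences in latitude and longitude. It makes several assumptions: a spherical Earth with a 6,371 km mean radius (a true ellipsoidal geodesic would differ slightly), hypothetical helper names `haversine_km` and `pair_spatial_controls`, and made-up centroid coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def pair_spatial_controls(centroid_i, centroid_j):
    """Distance and orientation controls for one adjacent group pair."""
    (lat_i, lon_i), (lat_j, lon_j) = centroid_i, centroid_j
    dist = haversine_km(lat_i, lon_i, lat_j, lon_j)
    return {
        "log_dist_km": math.log(dist),      # log distance between centroids
        "abs_dlat": abs(lat_i - lat_j),     # north-south separation
        "abs_dlon": abs(lon_i - lon_j),     # east-west separation
    }

# Hypothetical centroids (latitude, longitude) for two neighboring groups.
controls = pair_spatial_controls((-18.0, 30.0), (-19.0, 32.5))
print(controls)
```

The same distance function could be applied to the nearest coast, lake or river point once those locations are extracted from the relevant shapefiles.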

Group-level population data come from the Ethnologue. For an adjacent group pair, I add the total population for each group together. I exclude any group pair where one or both of the language groups have a recorded population of zero.

3.1.4 Distance to Old World Trade Routes and Ports

The locations of Old World trade routes and ports come from Michalopoulos et al. (2016, 2018). Drawing from various sources, these authors construct a database of pre-600 CE Islamic trade routes, the ancient Roman road network, and roughly 2,900 ancient ports and harbours. They also complement these data with the expanded Old World trade route network up to 1800 CE. Using these data, I calculate the geodesic distance between each buffer zone centroid and the nearest location of historical trade. I calculate distances for both the pre-600 CE period and the pre-1800 CE period.

11 See Table A1 and Table A2 in the appendix for the full sample and sibling sample summary statistics.

12 I define major rivers according to GSHHG size categories 1-3, and define minor rivers as those in size categories 4 and 5. For each I use the “full” resolution.

3.2 Pre-Colonial Group-Level Dataset

Murdock’s (1967) Ethnographic Atlas is the primary source of data for the group-level dataset. These data include a large variety of cultural, political and economic characteristics of a large cross-section of pre-colonial ethnic groups across the world. Included in these data are latitude and longitude coordinates for 1,265 groups, which mark the centroid of each group’s traditional homeland. I use these coordinates to construct a buffer zone extending 50 kilometers in every direction from the group centroid, resulting in a buffer zone 100 kilometers in diameter. These buffer zones serve as my unit of observation in this dataset, where each buffer zone is a window into the geographical region surrounding each group’s traditional homeland. Figure A1 in the Appendix maps the spatial distribution of these homelands.

Relevant to this study are the Ethnographic Atlas data detailing each group’s historical reliance on agriculture, animal husbandry and fishing with a 10-point scale roughly approximating percentage deciles. I use these data to assess the geographical origin of these subsistence activities. I do this by calculating the average land productivity of each buffer zone using the CSI data. Once again, I calculate average productivity for the pre-1500 CE and post-1500 CE periods, and take the difference of these two measures as the change in land productivity at the onset of the Columbian Exchange. Appendix Figure A2 provides a closer look at the spatial distribution of African pre-colonial homelands, and includes group-level buffer zones overlaid on the pre-1500 CSI data used to calculate average land productivity in the pre-Columbian period.

4 Empirical Strategy and Estimates

4.1 Identification Strategy

The main empirical challenge I face is to connect historical variations in land productivity, through its effect on inter-ethnic trade, to contemporary differences in language. For this, I rely on the natural experiment associated with the Columbian Exchange: the widespread exchange of goods and crops between the Old and New World in the post-1500 period (Nunn and Qian, 2010). Distinct ecological environments across the world were fundamentally changed by the flow of plants and animals in the post-Columbian era (Frankema, 2015). This coming together of the continents introduced a new set of potential crops for cultivation that, in the context of this paper, provides quasi-random variation in potential land productivity in the historical period.

Figure 4: Phylogenetic Tree of Eritrean Languages

[Tree diagram, Levels 0 to 6. Families and branches shown: Afro-Asiatic (Semitic, Cushitic) and Nilo-Saharan (Eastern Sudanic). Languages shown: Afar, Saho, Bilen, Bedawiyet, Tigrigna, Tigre, Nara and Kunama.]

This figure depicts the language tree for the 8 major languages of Eritrea. Among the 28 possible pairings of these languages, only 2 represent sibling pairs since there are only 2 language pairs that share a parent language on the language tree: Tigre-Tigrigna and Saho-Afar.

Michalopoulos (2012) finds that ethnic groups develop human capital suitable to their ecological environment, where such skills are non-transferable across ecological boundaries. This suggests that, in the pre-1500 period, similar groups may have located next to one another due to their relative ease of communication, and defined the frontier of their territories across geographic regions with high variation in land productivity because of their location-specific human capital. While this would still speak to the geographical origins of linguistic distance, the inter-ethnic trade mechanism would be inconsequential.

I overcome this concern of endogenous group sorting with quasi-random variation resulting from the Columbian Exchange. The identifying source of variation in land productivity comes from an unexpected change in the potential set of crops for cultivation in the post-Columbian period (Galor and Ozak, 2016). The necessary assumption is that the change in land productivity variation in the post-Columbian period is random and independent of all other determinants of linguistic distance, conditional on the level of land productivity variation in the pre-Columbian period. Hence, this quasi-random variation mitigates concern that similar groups sorted into geographic regions defined by high levels of productivity variation.

The other main empirical challenge I face is disentangling the horizontal transmission of culture via trade from the effect of vertical transfer (i.e., genealogical descent). This consideration is important because all related adjacent group pairs exhibit similarity in language due, in part, to genealogical descent.

I overcome this concern by narrowing my focus to sibling ethnolinguistic pairs, those that descend from the same parent language. I define these pairs to be siblings because they share an identical ancestral history, separated only at the most recent cleavage on the Ethnologue phylogenetic tree. As an example, Figure 4 depicts the phylogenetic tree of the 8 major Eritrean languages for which there are 28 possible pairings. Only Tigre-Tigrigna and Saho-Afar represent sibling pairs, since these pairs share 5 out of 6 branches on the tree, the maximum number of shared branches for two distinct languages.13
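The sibling definition can be made concrete with a small sketch that treats two languages as siblings when they share the same parent node. The classification paths below are an assumption: they are an approximate reading of the Eritrean tree, written without the padding to a common depth that the figure uses, and should not be taken as the exact Ethnologue classification. The names `PARENT_PATH` and `is_sibling` are hypothetical.

```python
from itertools import combinations

# Assumed path from the family root down to each language's parent node.
PARENT_PATH = {
    "Tigre":     ("Afro-Asiatic", "Semitic", "South", "Ethiopian", "North"),
    "Tigrigna":  ("Afro-Asiatic", "Semitic", "South", "Ethiopian", "North"),
    "Saho":      ("Afro-Asiatic", "Cushitic", "East", "Saho-Afar"),
    "Afar":      ("Afro-Asiatic", "Cushitic", "East", "Saho-Afar"),
    "Bilen":     ("Afro-Asiatic", "Cushitic", "Central", "Northern"),
    "Bedawiyet": ("Afro-Asiatic", "Cushitic", "North"),
    "Nara":      ("Nilo-Saharan", "Eastern Sudanic", "Eastern"),
    "Kunama":    ("Nilo-Saharan",),
}

def is_sibling(lang_a, lang_b):
    """Two distinct languages are siblings if they share the same parent node."""
    return PARENT_PATH[lang_a] == PARENT_PATH[lang_b]

# Enumerate every unordered pair of the 8 languages and keep the siblings.
pairs = list(combinations(sorted(PARENT_PATH), 2))
sibling_pairs = [pair for pair in pairs if is_sibling(*pair)]

print(len(pairs))
print(sibling_pairs)
```

Under these assumed paths the enumeration yields 28 pairings, of which only Afar-Saho and Tigre-Tigrigna qualify as siblings, in line with the example in the text.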

By narrowing my focus to sibling pairs, an equivalent phylogenetic relationship is guaranteed between each pair, thus holding constant the effects of vertical transfer. This implies that sibling sample estimates more reliably identify the horizontal channel of interest, where variation in land productivity determines the extent of trade, and the frequency of contact via trade shapes the transmission of culture.14

4.2 Empirical Model and Results

Define buffer zone $k$ as the region surrounding the segment of border that separates ethnolinguistic groups $i$ and $j$. I estimate the effect of variations in land productivity in buffer zone $k$ on the linguistic distance between groups $i$ and $j$ in the following way:

$$\mathit{LD}_k = \beta_0 + \beta_{1500}\,\mathit{ProdVar}_k + \beta_{change}\,\Delta \mathit{ProdVar}_k + x_k'\Phi + \lambda_{l_i(k)} + \theta_{l_j(k)} + \delta_{c(k)} + \varepsilon_k. \tag{1}$$

The dependent variable $\mathit{LD}_k$ measures the linguistic distance between neighboring ethnolinguistic groups $i$ and $j$ in buffer zone $k$. $\mathit{ProdVar}_k$ denotes pre-1500 variation in land productivity in buffer zone $k$, and $\Delta \mathit{ProdVar}_k$ denotes the change in land productivity variation in the post-1500 period at the onset of the Columbian Exchange. Here, $x_k$ represents a vector of buffer zone geo-climatic characteristics and spatial control variables.15 $\lambda_{l_i(k)}$ and $\theta_{l_j(k)}$ respectively denote a complete set of language family fixed effects for groups $i$ and $j$ in buffer zone $k$, and $\delta_{c(k)}$ is a complete set of country fixed effects associated with buffer zone $k$. The conceptual framework suggests that $\beta_{1500} < 0$ and $\beta_{change} < 0$.
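To make the structure of equation (1) concrete, the following sketch estimates it by ordinary least squares on simulated data, with dummy variables standing in for the two sets of language family fixed effects and the country fixed effects. Everything here is an assumption for illustration: the data are synthetic, the "true" coefficients are chosen arbitrarily, the control vector $x_k$ is omitted, and no clustered standard errors are computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # number of simulated buffer zones

# Simulated group identifiers (hypothetical): families of groups i and j, and country.
fam_i = rng.integers(0, 5, n)
fam_j = rng.integers(0, 5, n)
country = rng.integers(0, 10, n)

# Simulated regressors: pre-1500 variation and its post-1500 change.
prod_var = rng.normal(0.3, 0.2, n)
d_prod_var = rng.normal(0.1, 0.2, n)

# Arbitrary "true" coefficients used only to generate the outcome.
beta_1500, beta_change = -0.06, -0.10
fixed_effects = 0.05 * fam_i + 0.03 * fam_j + 0.02 * country
ld = (0.5 + beta_1500 * prod_var + beta_change * d_prod_var
      + fixed_effects + rng.normal(0.0, 0.05, n))

def dummies(codes):
    """One indicator column per distinct value in codes."""
    return (codes[:, None] == np.unique(codes)[None, :]).astype(float)

# The dummy sets absorb the intercept; the design matrix is rank deficient,
# but lstsq still returns the unique coefficients on the two regressors.
X = np.column_stack([prod_var, d_prod_var,
                     dummies(fam_i), dummies(fam_j), dummies(country)])
coef, *_ = np.linalg.lstsq(X, ld, rcond=None)
b1500_hat, bchange_hat = coef[0], coef[1]
print(round(b1500_hat, 3), round(bchange_hat, 3))
```

Because the regressors are generated independently of the fixed effects, the two estimates land close to the values used in the simulation; with real data, the fixed effects matter precisely because that independence cannot be assumed.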

4.2.1 Baseline Estimates

Table 1 presents estimates of equation (1). All reported estimates include language family fixed effects to adjust for ancestral differences in adjacent language pairs.16 Column 1 reports within-family estimates for pre-Columbian land productivity variation and the post-Columbian change in land productivity variation. Both estimates enter with the expected negative sign and are statistically significant at the 1 percent level. This says that ethnic groups separated across high-variation regions are more similar in language than groups separated across low-variation regions. In particular, a one standard deviation increase in pre-1500 land productivity variation decreases linguistic distance by 1.7 percentage points, while a standard deviation increase in productivity variation at the onset of the Columbian Exchange implies a 2.0 percentage point decrease in linguistic distance.

These baseline findings are robust to adding land productivity controls (column 2), geography controls (column 3) and spatial controls (column 4). The estimates reported in column 5 include the combined set of these control variables. The coefficient on pre-1500 land productivity variation retains the expected sign, but loses significance at standard levels. By contrast, the coefficient of interest (the effect of land productivity variation resulting from the Columbian Exchange in the post-1500 period) retains statistical significance with the expected negative sign, and is statistically equivalent to the unconditional estimate in column 1. The estimates reported in column 6 include country fixed effects, and the change in land productivity variation retains statistical significance (at the 10 percent level), but is reduced to two-thirds the size of the other baseline estimates. This estimate implies that a standard deviation increase in productivity variation in the post-Columbian period decreases linguistic distance by 1.3 percentage points.17

Overall, the estimated coefficient of interest is stable in magnitude across comparable fixed effects specifications. The insensitivity of this estimate to the set of included control variables suggests that the empirical model is well identified. The only source of sensitivity is whether country fixed effects are included or not, but this is to be expected for a variety of reasons, e.g., how contemporary nation states integrated different ethnolinguistic groups throughout the historical process of nation building.

Table 1: Border-Level Regressions: Baseline Results (100km Buffer Zone)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| ∆ in land productivity variation (post-1500) | -0.100*** | -0.100*** | -0.083*** | -0.111*** | -0.098*** | -0.065* |
| | (0.030) | (0.034) | (0.031) | (0.030) | (0.033) | (0.034) |
| Land productivity variation (pre-1500) | -0.061*** | -0.061*** | -0.047* | -0.043** | -0.040 | -0.029 |
| | (0.022) | (0.022) | (0.025) | (0.021) | (0.026) | (0.027) |
| Land Productivity Controls | No | Yes | No | No | Yes | Yes |
| Geography Controls | No | No | Yes | No | Yes | Yes |
| Spatial Controls | No | No | No | Yes | Yes | Yes |
| Language Family FE | Yes | Yes | Yes | Yes | Yes | Yes |
| Country FE | No | No | No | No | No | Yes |
| Adjusted R2 | 0.25 | 0.25 | 0.26 | 0.26 | 0.28 | 0.37 |
| Observations | 8402 | 8402 | 8402 | 8402 | 8402 | 7291 |

This table establishes the negative and statistically significant effect of variation in land productivity on a language pair’s lexicostatistical linguistic distance. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Land productivity controls include pre-1500 land productivity and the post-1500 change in land productivity. Geography controls include mean elevation, ruggedness, mean temperature and its standard deviation, mean precipitation and its standard deviation, and the prevalence of malaria. Spatial controls include logged distance to the nearest coast, country border, lake, major river and minor river, logged distance between group centroids, the absolute difference in latitude and longitude, logged land area and logged population. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.

4.2.2 Alternative Buffer Zone Size

The baseline estimates are obtained using an arbitrarily-drawn buffer zone of 100 kilometers in width. Here, I test the robustness of the baseline estimates using a 50-kilometer border buffer zone. Table 2 reports these estimates. In all instances, the variable of interest is negative and statistically significant, and similar in magnitude to the baseline estimates in Table 1. Overall, this suggests the baseline result is not an outcome of the arbitrarily-sized buffer zones.18

Table 2: Border-Level Regressions: Baseline Results (50km Buffer Zone)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| ∆ in land productivity variation (post-1500) | -0.115*** | -0.107*** | -0.115*** | -0.133*** | -0.118*** | -0.058* |
| | (0.033) | (0.034) | (0.033) | (0.033) | (0.034) | (0.032) |
| Land productivity variation (pre-1500) | -0.090*** | -0.089*** | -0.091*** | -0.079*** | -0.077*** | -0.033 |
| | (0.024) | (0.024) | (0.024) | (0.023) | (0.023) | (0.023) |
| Land Productivity Controls | No | Yes | No | No | Yes | Yes |
| Geography Controls | No | No | Yes | No | Yes | Yes |
| Spatial Controls | No | No | No | Yes | Yes | Yes |
| Language Family FE | Yes | Yes | Yes | Yes | Yes | Yes |
| Country FE | No | No | No | No | No | Yes |
| Adjusted R2 | 0.25 | 0.25 | 0.25 | 0.26 | 0.27 | 0.37 |
| Observations | 8402 | 8402 | 8402 | 8402 | 8402 | 7290 |

This table establishes the negative and statistically significant effect of variation in land productivity on a language pair’s lexicostatistical linguistic distance using an alternative buffer zone size. The unit of observation is a 50 km buffer zone along the contiguous border segment of each language pair. Control variables are identical to the complete set of baseline control variables used in Table 1. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.

13 The Ethnologue world tree contains 15 levels, which I abstract from here for simplicity.

14 Figure 2 maps the spatial distribution of sibling pairs relative to the baseline sample of group pairs.

15 Including pre-1500 land productivity and the change in land productivity in the post-1500 period following the Columbian Exchange; the malaria suitability index; elevation; ruggedness; precipitation and precipitation variation; temperature and temperature variation; log distance between group i and j centroids; log distance to the nearest coast, country border, lake, major river and minor river; the absolute difference in group i and j latitude and longitude coordinates; log total area of an ethnolinguistic pair; and log population. Appendix Table A1 and A2 report summary statistics for both the full sample and sibling sample.

16 Figure A3 displays estimates of $\beta_{change}$, the change in land productivity variation in the post-Columbian period, for 15 different samples with and without language family fixed effects. As described in Section 2, a key empirical challenge I face is to disentangle the horizontal transmission of culture via trade from the vertical transmission of culture via genealogical descent. While the ancestral relationship between each group pair will impact their level of similarity, the vertical nature of genealogical descent within a single group is orthogonal to geographical variations in the surrounding region. Hence, any estimate that excludes language family fixed effects should be biased towards zero by the diminished signal-to-noise ratio. This point is made clear in Figure A3, where $\beta_{change}$ is statistically insignificant for the full-sample estimate without fixed effects, but becomes negative and significant after narrowing the sample to increasingly related group pairs. By contrast, estimates of $\beta_{change}$ with language family fixed effects are relatively stable in magnitude, and are always negative and significant irrespective of sample.

17 See Table A4 in the appendix for a complete table that includes coefficient estimates for all control variables.

4.2.3 Sibling Sample Estimates

The hypothesized link between geographical variations and linguistic distance is horizontal in nature, where inter-ethnic trade maintains shared linguistic and cultural traits between ancestral groups. While the ancestral relationship between each group pair will impact their level of similarity, the vertical nature of genealogical descent within a single group is independent of geographical variations in the surrounding region.

With this in mind, I narrow my focus to sibling ethnolinguistic pairs, those that descend from the same parent language. By design, sibling-sample estimates hold genealogical differences constant, thereby limiting the extent of noise present in the full-sample estimates due to genealogical descent. In this context, geographical variations matter to the extent that they encourage or limit the opportunity for interaction via inter-ethnic trade.

In Table 3, I reproduce the baseline estimates of Table 1 using the sibling sample. Both pre-Columbian land productivity variation and the post-Columbian change in land productivity variation are estimated to be negative and statistically significant. Consistent with the idea that unadjusted genealogical differences in the full-sample estimates introduce noise, the sibling sample estimates are considerably larger in magnitude, often twice the size of the comparable baseline estimate. Consider the within-country estimates in column 6, the most demanding specification, where a one standard deviation increase in pre-1500 land productivity variation decreases linguistic distance by 4.4 percentage points, while a standard deviation increase in post-Columbian productivity variation implies a 4.2 percentage point decrease in linguistic distance. Overall, Table 3 provides the strongest evidence of a link between land productivity variations and linguistic distance, and illustrates the quantitative importance of isolating the horizontal transmission channel in this context.19

4.2.4 Post-Columbian Migrations

The key identifying assumption of the baseline model is that the change in land productivity variation in the post-Columbian period is random and independent of all other factors in a buffer zone, conditional on the level of land productivity variation in the pre-Columbian period. Yet large-scale migrations have occurred throughout the post-Columbian era. This is problematic because the contemporary location of an ethnolinguistic group might differ from its ancestral location, and so historical changes in land productivity would be independent of contemporary language differences.

18 See Table A5 in the appendix for a complete table that includes coefficient estimates for all control variables.

19 See Table A6 in the appendix for a complete table that includes coefficient estimates for all control variables.

Table 3: Border-Level Regressions: Sibling Sample Results (100km Buffer Zone)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| ∆ in land productivity variation (post-1500) | -0.188*** | -0.197*** | -0.286*** | -0.169*** | -0.248*** | -0.154** |
| | (0.042) | (0.047) | (0.058) | (0.044) | (0.060) | (0.070) |
| Land productivity variation (pre-1500) | -0.088** | -0.088** | -0.200*** | -0.096*** | -0.185*** | -0.140** |
| | (0.034) | (0.034) | (0.056) | (0.035) | (0.059) | (0.068) |
| Land Productivity Controls | No | Yes | No | No | Yes | Yes |
| Geography Controls | No | No | Yes | No | Yes | Yes |
| Spatial Controls | No | No | No | Yes | Yes | Yes |
| Language Family FE | Yes | Yes | Yes | Yes | Yes | Yes |
| Country FE | No | No | No | No | No | Yes |
| Adjusted R2 | 0.28 | 0.28 | 0.29 | 0.30 | 0.30 | 0.39 |
| Observations | 1669 | 1669 | 1669 | 1669 | 1669 | 1497 |

This table establishes the negative and statistically significant effect of variation in land productivity on a language pair’s lexicostatistical linguistic distance for the baseline sibling sample. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Control variables are identical to the complete set of baseline control variables used in Table 1. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.


Historical group-level migration data are unavailable to address this concern. Instead, I minimize the measurement error introduced by post-Columbian migrations by narrowing my focus to adjacent pairs that reside in a country where the majority of people are native to that country. By doing so, I exclude the regions of the world most affected by the wave of post-Columbian migrations (e.g., the Americas).20 Table 4 reports within-country estimates for a variety of specifications, where at least 25, 50 or 75 percent of the population is native to the country of residence, for both the full and sibling samples. In all cases, the variable of interest remains statistically significant and is at least as large in magnitude as at baseline, and typically larger. This increase in magnitude is consistent with post-Columbian migrations introducing an attenuation bias due to measurement error. Altogether, these estimates reassuringly suggest that post-Columbian migrations are inconsequential to the baseline estimates.21

4.2.5 Additional Robustness Checks

In the Online Appendix, I document various other robustness checks. The Ethnologue sometimes maps ethnolinguistic group boundaries as overlapping. This is problematic because a buffer zone will not be uniquely representative of the neighboring pair. In Table A11, I show that full sample and sibling sample estimates increase in magnitude and significance when excluding overlapping buffer zones from the analysis.

I also explore potential bias due to omitted variables. Table A12 reports Altonji et al. (2005) (AET) statistics on how much stronger selection on unobservables must be compared to selection on observables in order to fully explain the results.22 Overall, the reported AET statistics suggest the influence of unobservable characteristics must be 1.35 to 4.26 times stronger than the influence of observables to fully account for my baseline findings. I also report Oster’s (2019) lower bound for the coefficient of interest, under the most conservative assumption of $R^2_{\max} = 1$, and in all instances the coefficient remains negative and sizeable in magnitude.23
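The two statistics can be computed directly from a restricted and a fully controlled regression, using the ratio and lower-bound formulas stated in the paper's footnotes. The sketch below is illustrative only: the function names `aet_ratio` and `oster_bound` are hypothetical, and the coefficient and R-squared values are made up rather than taken from Table A12.

```python
def aet_ratio(beta_full, beta_restricted):
    """AET ratio: beta_F / (beta_R - beta_F)."""
    return beta_full / (beta_restricted - beta_full)

def oster_bound(beta_full, beta_restricted, r2_full, r2_restricted, r2_max=1.0):
    """Lower bound: beta_F - (beta_R - beta_F) * (R2_max - R2_F) / (R2_F - R2_R)."""
    scale = (r2_max - r2_full) / (r2_full - r2_restricted)
    return beta_full - (beta_restricted - beta_full) * scale

# Made-up inputs: a restricted and a fully controlled specification.
beta_r, r2_r = -0.12, 0.10
beta_f, r2_f = -0.09, 0.70

ratio = aet_ratio(beta_f, beta_r)
bound = oster_bound(beta_f, beta_r, r2_f, r2_r)
print(ratio, bound)
```

With these inputs the ratio is about 3, meaning selection on unobservables would need to be roughly three times as strong as selection on observables to explain away the estimate, and the bound of about -0.075 stays negative even with $R^2_{\max} = 1$.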

I also run a variety of placebo tests, where I limit the sample to geographic regions inhospitable to trade (i.e., sample observations above the 90th percentile in elevation or ruggedness), and New World observations where post-Columbian migrations introduce significant noise to the estimates. The hypothesized trade mechanism is expected to disappear in regions inhospitable to trade, hence breaking the link between land productivity variation and linguistic distance. Likewise, any long-run evidence of the trade mechanism should be washed away by significant post-Columbian migration in the New World sample. Table A13 reports the results of these tests, where the baseline result disappears in both the full sample and sibling sample as expected.

20 The ancestral composition of a country’s contemporary population comes from Putterman and Weil (2010).

21 See Table A7 in the appendix for a complete table that includes coefficient estimates for all control variables.

22 To perform this test, I calculate the ratio $\beta_F / (\beta_R - \beta_F)$, where $\beta_F$ is the coefficient estimate of the variable of interest, ∆ in land productivity variation (post-1500), from the specification that includes a full set of controls, while $\beta_R$ is the estimated coefficient of the variable of interest for the restricted set of controls.

23 I calculate this lower bound using Oster’s (2019) formula $\beta^* = \beta_F - (\beta_R - \beta_F) \times \frac{R^2_{\max} - R^2_F}{R^2_F - R^2_R}$, where $\beta_F$ and $\beta_R$ are defined the same as above, $R^2_F$ is the R-squared from the fully-controlled regression and $R^2_R$ is the R-squared from the restricted regression. I set $R^2_{\max} = 1$, the theoretical maximum, even though the presence of measurement error in the dependent variable suggests this is an overly conservative assumption.

Table 4: Border-Level Regressions: Native Population Sensitivity Analysis

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|---|---|---|---|---|---|---|---|---|
| Sample | Full | Full | Full | Full | Sibling | Sibling | Sibling | Sibling |
| ∆ in land productivity variation (post-1500) | -0.065* | -0.072** | -0.073** | -0.078** | -0.154** | -0.159** | -0.154** | -0.175** |
| | (0.034) | (0.036) | (0.036) | (0.038) | (0.070) | (0.071) | (0.071) | (0.073) |
| Land productivity variation (pre-1500) | -0.029 | -0.032 | -0.031 | -0.059* | -0.140** | -0.142** | -0.138** | -0.179** |
| | (0.027) | (0.029) | (0.030) | (0.032) | (0.068) | (0.069) | (0.069) | (0.072) |
| Controls | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Language Family FE | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Country FE | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Adjusted R2 | 0.37 | 0.36 | 0.36 | 0.34 | 0.39 | 0.40 | 0.40 | 0.39 |
| Observations | 7291 | 6723 | 6425 | 5540 | 1497 | 1445 | 1424 | 1273 |
| Native Population Sample Restriction | None | ≥25% | ≥50% | ≥75% | None | ≥25% | ≥50% | ≥75% |

This table establishes the robustness of the baseline estimates to post-Columbian migrations with the full sample (columns 1-4) and sibling sample (columns 5-8). Each sample is restricted to language pairs that reside in a country where at least 25, 50 or 75 percent of the population is native to the residing country. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Control variables are identical to the complete set of baseline control variables used in Table 1. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.

In Table A14, I report estimates of the baseline model with country fixed effects but allow for spatial correlation in the error term. I find that the coefficient of interest remains statistically significant at spatial correlation cutoffs of 100km, 200km, 300km, 400km and 500km.24

5 Mechanism: Inter-Ethnic Trade

So far I have established that ethnic groups separated across geographic regions with high variation in land productivity are more similar in language than groups separated across low-variation regions. A framework for interpreting this finding can be broken down into two parts and is outlined in Section 2: (i) geographical variations in land productivity shape the incentives for inter-ethnic trade in the distant past, and (ii) cultural drift is amplified or muted according to the extent of between-group interaction via trade. I conclude the empirical analysis with an investigation of this two-part causal chain linking land productivity variations to linguistic differences.

5.1 Land Productivity and Historical Modes of Subsistence

Does land productivity predict an ethnic group’s historical mode of subsistence? The extent to which it does is informative because regions with large variations in land productivity are then more likely to rely on various modes of subsistence and thus produce a wide range of tradable goods. I proceed with an investigation of the geographic origins of agricultural versus non-agricultural modes of subsistence using the Ethnographic Atlas, a large cross-section of pre-colonial ethnic groups. As outlined in Section 3, these data include coordinates for each group’s centroid, which I use to construct a buffer zone that is 100 kilometers in diameter (i.e., the buffer zone is circular and extends 50 kilometers in every direction from the group centroid). This buffer zone serves as my geographical unit of observation, which I use to estimate the following empirical model:

$$
\text{Subsistence}_i = \alpha_0 + \alpha_{1500}\,\text{LandProd}_i + \alpha_{change}\,\Delta\text{LandProd}_i + x_i'\phi + \lambda_{l(i)} + \delta_{c(i)} + \varepsilon_i. \tag{2}
$$

Depending on the specification, the dependent variable $\text{Subsistence}_i$ is a measure of ethnic group $i$’s historical share of subsistence derived from agriculture, animal husbandry or fishing. $\text{LandProd}_i$ denotes pre-1500 land productivity within the buffer zone region surrounding group $i$, and $\Delta\text{LandProd}_i$ captures the change in land productivity in the post-Columbian period. $x_i$ is a vector of geo-climatic buffer zone characteristics,²⁵ and $\lambda_{l(i)}$ and $\delta_{c(i)}$ respectively denote language family and country fixed effects.

²⁴ I use Colella et al.’s (2019) Stata package acreg to make these standard error adjustments.

Estimates of equation (2) are reported in Table 5. In columns 1 and 2, the dependent variable is a group’s share of subsistence derived from agriculture. With or without the complete set of control variables, the coefficient estimates for land productivity and its change in the post-Columbian period are positive and statistically significant. This suggests that pre-colonial groups were more likely to rely on agriculture as a primary means of subsistence when residing in a geographic region with high land productivity.

By extension, this finding also implies that a group living in a productive region should rely less on animal husbandry and fishing as a primary means of subsistence. This is indeed what I find when re-estimating equation (2) using a group’s share of subsistence derived from animal husbandry (columns 3 and 4) and fishing (columns 5 and 6) as the dependent variable. In all instances the estimated coefficients are negative as expected, and most are statistically significant.²⁶
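As a rough cross-check on how the significance stars in Table 5 relate to the reported coefficients and standard errors, the following sketch computes two-sided p-values and applies the star thresholds stated in the table notes. It assumes a standard normal approximation for the test statistic, which is a simplification and not necessarily the reference distribution used in the paper; the function names are hypothetical, and the inputs are the rounded values printed in the row for the post-1500 change in land productivity.

```python
import math


def two_sided_p(coef: float, se: float) -> float:
    """Two-sided p-value for H0: coefficient = 0, assuming a standard normal statistic."""
    z = abs(coef / se)
    return math.erfc(z / math.sqrt(2.0))


def stars(p: float) -> str:
    """Star convention from the table notes: * p < 0.10, ** p < 0.05, *** p < 0.01."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""


# (coefficient, standard error) pairs copied from Table 5, columns 1-6,
# row "change in land productivity (post-1500)". Values are rounded as printed.
table5_change_row = [
    (0.763, 0.258), (0.694, 0.251),    # agriculture
    (-0.391, 0.178), (-0.390, 0.173),  # animal husbandry
    (-0.348, 0.224), (-0.444, 0.195),  # fishing
]

for column, (coef, se) in enumerate(table5_change_row, start=1):
    p = two_sided_p(coef, se)
    print(f"column {column}: coef={coef:+.3f}, p={p:.3f} {stars(p)}")
```

Under this approximation the only coefficient in that row falling short of the 10 percent threshold is the fishing estimate without controls (column 5), which matches the statement that most, but not all, estimates are statistically significant.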

Overall, Table 5 tells a story where the productive capacity of a group’s homeland explains their chosen method of subsistence. It also implies that geographic regions with large variations in productivity were characterized by a variety of agricultural and non-agricultural subsistence practices. This line of reasoning is consistent with the historical record outlined in Section 2, which describes high-variation regions as the historical location of product specialization and trade. Altogether, this evidence gives credibility to the inter-ethnic trade mechanism, which is the first part of the causal chain linking land productivity variations to linguistic differences.

5.2 The Long-Run Impact of Historical Trade on Cultural Differences

Cultural anthropologists have written extensively about geographic isolation being a key driver of cultural drift (see Section 2). In other words, the geography of ethnic group borders regulates the process of drift, and consequently impacts the co-evolution of linguistic traits between spatially adjacent ethnic groups. In the previous subsection, I establish that border regions with high variation in land productivity are suitable regions for historical inter-ethnic trade. What remains to be established is the second part of the aforementioned causal chain: the long-run impact of a group’s exposure to inter-ethnic trade on language.

²⁵ Including elevation; ruggedness; precipitation and precipitation variation; temperature and temperature variation; log distance to the nearest coast, country border, lake, major river and minor river; the absolute value of latitude and longitude; and log land area. I drop the malaria ecology index to maintain sample size; however, the results are unchanged by its absence. Appendix Table A3 reports the summary statistics for these variables.

²⁶ See Table A8 in the appendix for a complete table that includes coefficient estimates for all control variables.

Table 5: Ethnographic Atlas Regressions: Land Productivity and Historical Subsistence

Dependent variable: Historical Dependence on Agriculture (columns 1-2), on Animal Husbandry (columns 3-4) and on Fishing (columns 5-6)

| | Agriculture (1) | Agriculture (2) | Animal Husbandry (3) | Animal Husbandry (4) | Fishing (5) | Fishing (6) |
|---|---|---|---|---|---|---|
| ∆ in land productivity (post-1500) | 0.763*** | 0.694*** | -0.391** | -0.390** | -0.348 | -0.444** |
| | (0.258) | (0.251) | (0.178) | (0.173) | (0.224) | (0.195) |
| Land productivity (pre-1500) | 0.891*** | 0.810*** | -0.245** | -0.184* | -0.261** | -0.301*** |
| | (0.142) | (0.141) | (0.105) | (0.100) | (0.105) | (0.086) |
| Geography Controls | No | Yes | No | Yes | No | Yes |
| Spatial Controls | No | Yes | No | Yes | No | Yes |
| Language Family FE | Yes | Yes | Yes | Yes | Yes | Yes |
| Country FE | Yes | Yes | Yes | Yes | Yes | Yes |
| Adjusted R² | 0.67 | 0.68 | 0.63 | 0.64 | 0.46 | 0.57 |
| Observations | 1133 | 1119 | 1133 | 1119 | 1133 | 1119 |

Columns (1) and (2) document a positive association between land productivity of an ethnic group’s homeland and their historical reliance on agriculture, columns (3) and (4) document a negative association between land productivity of a group’s homeland and their historical reliance on animal husbandry, and columns (5) and (6) document a negative association between land productivity of a group’s homeland and their historical reliance on fishing. The unit of observation is a group-specific buffer zone, constructed from Ethnographic Atlas group centroids using a 50 km radius. Geography control variables include elevation, ruggedness, mean precipitation and its standard deviation, mean temperature and its standard deviation, and the prevalence of malaria. Spatial controls include logged distance to the nearest coast, country border, lake, major river and minor river, logged land area and absolute latitude and longitude. Robust standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.

For this part of the empirical analysis, I rely on an empirical model similar to equation (1). I again define buffer zone $k$ as the region surrounding the segment of border that separates ethnolinguistic groups $i$ and $j$. Because I do not observe group-level trade directly, I instead use the location of Old World trade routes, and measure the geodesic distance between buffer zone $k$’s centroid and the nearest trade route for two time periods: pre-600 CE and pre-1800 CE.²⁷ This approach is motivated by the evidence that pre-colonial ethnic groups located near Old World trade routes were more likely to engage in trade (Michalopoulos et al., 2018). I estimate the relationship between historic trade route proximity and contemporary linguistic distance between groups $i$ and $j$ in the following way:

$$
LD_k = \mu_0 + \mu\,\text{Distance}_k + x_k'\rho + \lambda_{l_i(k)} + \theta_{l_j(k)} + \delta_{c(k)} + \varepsilon_k. \tag{3}
$$

The dependent variable $LD_k$ measures the linguistic distance between neighboring ethnolinguistic groups $i$ and $j$ in buffer zone $k$. Depending on the specification, $\text{Distance}_k$ captures the (logged) distance between buffer zone $k$ and the nearest Old World trade route in the pre-600 CE period or the pre-1800 CE period. $x_k$ represents a vector of buffer zone geo-climatic characteristics identical to the complete set of controls included in equation (1), $\lambda_{l_i(k)}$ and $\theta_{l_j(k)}$ respectively denote language family fixed effects for group $i$ and $j$, and $\delta_{c(k)}$ represents country fixed effects. If groups located near trade routes more actively engaged in trade, and inter-ethnic trade mitigates cultural drift, then it is expected that $\mu > 0$.
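To make the construction of the distance regressor concrete, here is a minimal sketch of a logged distance from a buffer-zone centroid to the nearest trade route. It rests on several assumptions that are not taken from the paper: routes are represented only by their vertices, distance is the great-circle (haversine) distance on a sphere of radius 6371 km rather than an ellipsoidal geodesic, distances are in kilometers, and all coordinates and names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def log_distance_to_nearest_route(centroid, route_points):
    """Log of the distance from a buffer-zone centroid to the closest route vertex."""
    lat, lon = centroid
    nearest_km = min(haversine_km(lat, lon, r_lat, r_lon) for r_lat, r_lon in route_points)
    return math.log(nearest_km)  # undefined if the centroid coincides with a vertex


# Hypothetical coordinates, for illustration only.
centroid_k = (0.0, 0.0)
routes_pre_600 = [(0.0, 3.0), (4.0, 4.0)]
routes_pre_1800 = routes_pre_600 + [(0.0, 1.0)]  # assumes the later map nests the earlier one

log_d600 = log_distance_to_nearest_route(centroid_k, routes_pre_600)
log_d1800 = log_distance_to_nearest_route(centroid_k, routes_pre_1800)
delta_log = log_d1800 - log_d600  # non-positive whenever the later network contains the earlier one
print(round(log_d600, 3), round(log_d1800, 3), round(delta_log, 3))
```

The nesting assumption is consistent with the summary statistics in Table A1, where the change in log distance has a maximum of 0.000, but the paper's actual measurement procedure and units may differ from this sketch.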

Table 6 reports estimates of equation (3) for both the full sample and sibling sample. Both pre-600 CE distance and pre-1800 CE distance coefficients are positive as expected, but the positive association between pre-1800 trade route distance and linguistic distance is far more robust. Regardless of specification or sample, pre-1800 distance retains statistical significance at conventional levels (columns 3 and 4), and outperforms pre-600 distance in a horse race regression (column 5).

Yet for 25 percent of the full sample and 17 percent of the sibling sample observations, the distance to the nearest trade route is unchanged between 600 CE and 1800 CE. For these observations, both distance measures are statistically indistinguishable in a regression analysis. To get around this, I also calculate the change in log distance between 600 CE and 1800 CE, and re-estimate equation (3) using pre-600 distance in levels and the post-600 change in distance. These estimates are reported in column 6, where both measures are now positive and statistically significant. For each sample, the coefficient associated with the post-600 change in column 6 is identical to the pre-1800 coefficient in column 5, since these variables are estimated using identical variation. However, in the column 6 estimates, pre-600 observations with unchanged distance in the post-600 period no longer suffer from a collinearity problem, and are estimated to be positive and significant.²⁸
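The equality of the column 5 and column 6 coefficients can be seen from a simple reparameterization. As a sketch, write $D^{600}_k$ and $D^{1800}_k$ for the two log distances and $\mu_{600}$, $\mu_{1800}$ for their column 5 coefficients (notation introduced here for illustration, not taken from the paper), and suppress the controls and fixed effects:

```latex
\begin{aligned}
\mu_{600} D^{600}_k + \mu_{1800} D^{1800}_k
  &= \mu_{600} D^{600}_k + \mu_{1800}\left[ D^{600}_k + \left( D^{1800}_k - D^{600}_k \right) \right] \\
  &= \left( \mu_{600} + \mu_{1800} \right) D^{600}_k + \mu_{1800} \left( D^{1800}_k - D^{600}_k \right).
\end{aligned}
```

The coefficient on the post-600 change is therefore the column 5 pre-1800 coefficient, while the column 6 pre-600 coefficient equals the sum of the two column 5 coefficients. The estimates reported in Table 6 agree with this up to rounding: in Panel A, $-0.000 + 0.013 = 0.013$, and in Panel B, $0.011 + 0.019 = 0.030$ against a reported $0.031$.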

²⁷ I drop all observations associated with New World countries since I rely on Old World trade maps.

²⁸ See Tables A9 and A10 in the appendix for complete tables that include coefficient estimates for all control variables.

Table 6: Border-Level Regressions: Mediating Channel – Historical Old World Trade Routes

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

Panel A: Full Sample

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| Log distance to trade route pre-600 CE | 0.004 | 0.010 | | | -0.000 | 0.013* |
| | (0.006) | (0.007) | | | (0.008) | (0.007) |
| Log distance to trade route pre-1800 CE | | | 0.007* | 0.013*** | 0.013*** | |
| | | | (0.004) | (0.004) | (0.005) | |
| ∆ log distance to trade route post-600 CE (pre-1800 − pre-600) | | | | | | 0.013*** |
| | | | | | | (0.005) |
| Controls | No | Yes | No | Yes | Yes | Yes |
| Language Family FE | Yes | Yes | Yes | Yes | Yes | Yes |
| Country FE | Yes | Yes | Yes | Yes | Yes | Yes |
| Adjusted R² | 0.29 | 0.33 | 0.29 | 0.33 | 0.33 | 0.33 |
| Observations | 6205 | 6205 | 6205 | 6205 | 6205 | 6205 |

Panel B: Sibling Sample

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| Log distance to trade route pre-600 CE | 0.021* | 0.024** | | | 0.011 | 0.031** |
| | (0.012) | (0.012) | | | (0.014) | (0.013) |
| Log distance to trade route pre-1800 CE | | | 0.020*** | 0.022*** | 0.019** | |
| | | | (0.008) | (0.008) | (0.009) | |
| ∆ log distance to trade route post-600 CE (pre-1800 − pre-600) | | | | | | 0.019** |
| | | | | | | (0.009) |
| Controls | No | Yes | No | Yes | Yes | Yes |
| Language Family FE | Yes | Yes | Yes | Yes | Yes | Yes |
| Country FE | Yes | Yes | Yes | Yes | Yes | Yes |
| Adjusted R² | 0.35 | 0.37 | 0.36 | 0.37 | 0.37 | 0.37 |
| Observations | 1389 | 1389 | 1389 | 1389 | 1389 | 1389 |

This table documents a positive association between the linguistic distance of adjacent ethnic groups and their distance to the nearest pre-600 CE and pre-1800 CE trade route. The trade route data map Old World routes, so both the full sample and sibling sample are restricted to observations located in Old World countries. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Control variables are identical to the complete set of baseline control variables used in Table 1. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.

Overall, these data suggest that neighboring ethnic groups historically located near pre-600 and pre-1800 Old World trade routes are more similar in language today than neighboring groups located further away. This finding is consistent with evidence that groups living near historical trade routes engaged in trade more than those living further away (Michalopoulos et al., 2018). While these estimates do not rely on post-Columbian changes, and thus cannot necessarily be interpreted as causal, they are still informative of the channel through which historical trade shapes contemporary differences in language. In particular, the sibling sample estimates suggest that inter-ethnic trade might actually slow the process of cultural drift rather than simply transmit culture across group borders, since by definition this sample only includes group pairs that were part of the same ancestral ethnic group. Upon separation, the rate at which sibling pairs drift apart is dictated by the extent of between-group isolation. The frequency of interaction and trade then weakens this isolation effect, thereby slowing the overall drift process.

6 Concluding Remarks

This study takes the economic importance of ethnic differences as given, and goes a step deeper to explore the origins of these differences. I construct a novel georeferenced dataset to examine the border regions of neighboring ethnolinguistic groups, together with variation in the set of potentially cultivatable crops at the onset of the Columbian Exchange, to estimate how variations in land productivity impact linguistic differences between neighboring groups. I find that groups separated across geographic regions with high variation in land productivity are more similar in language than groups separated across more homogeneous regions.

I argue that inter-ethnic trade is the causal chain connecting geographic variations to linguistic variations. To substantiate this argument, I show that land productivity predicts the subsistence strategies of pre-colonial ethnic groups, suggesting that geographic regions with large variations in land productivity relied on various modes of subsistence, thus providing the opportunity for product specialization and trade. Ecologically-driven trade of this type is also consistent with the historical record (e.g., Bates, 1983). I then show that spatially adjacent ethnic groups historically located near Old World trade routes are more similar in language today than groups located far away. Taking it as given that groups located near Old World trade routes were more likely to engage in trade (Michalopoulos et al., 2018), I conclude that trade in the distant past was instrumental in slowing the rate of drift between neighboring groups, and that this effect is still observable today.

What does this result add to our understanding of the link between ethnolinguistic differences and contemporary patterns of development? It implies that other findings that have been interpreted as effects of ethnolinguistic distance might be rooted in geography. Given the mounting evidence that geography shapes contemporary patterns of economic development, my findings suggest that the exogeneity of ethnolinguistic distance in regression analysis should be questioned in the absence of the appropriate geographic control variables.

These findings also give new perspective to how culture evolves. Cultural groups are adaptive to the geographic environment they inhabit and develop location-specific human capital skills (Michalopoulos, 2012). The persistence of cultural traits is similarly dependent on the surrounding environment: during periods of climatic stability, location-specific traits persist within groups because the “rules of thumb” of past generations remain relevant to the current generation (Giuliano and Nunn, 2017). By the same token, the vertical transmission of location-specific skills from parents to children occurs less frequently when the geographic environment is highly variable. Yet in such variable environments, I find that the horizontal transmission of culture between groups is amplified by way of inter-group trade. Taken together, this evidence suggests that a cultural group’s surrounding geographic environment partially determines whether group members are inclined to pass along existing traits to the next generation or adopt new traits from outside groups. The extent to which cultural groups adopt new traits from outside groups is an interesting avenue of future research, particularly in this era of climate change where cultural norms may become less relevant to changing environmental conditions.

References

Ahlerup, P. and Olsson, O. (2012). The Roots of Ethnic Diversity. Journal of Economic Growth, 17(2):71–102.

Alesina, A., Giuliano, P., and Nunn, N. (2013). On the Origins of Gender Roles: Women and the Plough. The Quarterly Journal of Economics, 128(2):469–530.

Altonji, J. G., Elder, T. E., and Taber, C. R. (2005). Selection on Observed and Unobserved Variables: Assessing the Effectiveness of Catholic Schools. Journal of Political Economy, 113(1):151–184.

Ashraf, Q. and Galor, O. (2013a). Genetic Diversity and the Origins of Cultural Fragmentation. American Economic Review: Papers & Proceedings, 103(3):528–533.

Ashraf, Q. and Galor, O. (2013b). The “Out of Africa” Hypothesis, Human Genetic Diversity, and Comparative Economic Development. American Economic Review, 103(1):1–46.

Ashraf, Q. and Michalopoulos, S. (2015). Climatic Fluctuations and the Diffusion of Agriculture. Review of Economics and Statistics, 97(3):589–609.

Bakker, D., Brown, C. H., Brown, P., Egorov, D., Grant, A., Holman, E. W., Mailhammer, R., Müller, A., Velupillai, V., and Wichmann, S. (2009). Adding Typology to Lexicostatistics: A Combined Approach to Language Classification. Linguistic Typology, 13:167–179.


Barth, F. (1961). Nomads of South Persia: The Basseri Tribe of the Khamseh Confederacy. Little, Brown and Company, Boston.

Bates, D. G. and Lees, S. H. (1977). The Role of Exchange in Productive Specialization. American Anthropologist, 27:824–841.

Bates, R. H. (1983). Essays on the Political Economy of Rural Africa. Cambridge University Press, Cambridge.

Bates, R. H. (2010). Prosperity and Violence: The Political Economy of Development. W. W. Norton & Company, New York.

Becker, S. O., Boeckh, K., Hainz, C., and Woessmann, L. (2016). The Empire Is Dead, Long Live the Empire! Long-Run Persistence of Trust and Corruption in the Bureaucracy. Economic Journal, 126(590):40–74.

Boroditsky, L. (2001). Does Language Shape Thought?: Mandarin and English Speakers’ Conceptions of Time. Cognitive Psychology, 43:1–22.

Boroditsky, L., Schmidt, L. A., and Phillips, W. (2003). Sex, Syntax, and Semantics. In Gentner, D. and Goldin-Meadow, S., editors, Language in Mind: Advances in the Study of Language and Thought. Cambridge University Press, Cambridge.

Boyd, R. and Richerson, P. J. (1985). Culture and the Evolutionary Process. University of Chicago Press, Chicago.

Bradburd, D. A. (1996). Toward an Understanding of the Economics of Pastoralism: The Balance of Exchange Between Pastoralists and Nonpastoralists in Western Iran, 1815-1975. Human Ecology, 24(1):1–38.

Cavalli-Sforza, L. L. and Feldman, M. W. (1981). Cultural Transmission and Evolution: A Quantitative Approach. Princeton University Press, Princeton.

Centola, D., Gonzalez-Avella, J. C., Eguiluz, V. M., and Miguel, M. S. (2007). Homophily, Cultural Drift, and the Co-Evolution of Cultural Groups. Journal of Conflict Resolution, 51(6):905–929.

Cervellati, M., Chiovelli, G., and Esposito, E. (2017). Bite and Divide: Ancestral Exposure to Malaria and the Emergence and Persistence of Ethnic Diversity in Africa. University of Bologna, mimeo.

Chanda, A., Cook, C. J., and Putterman, L. (2014). Persistence of Fortune: Accounting for Population Movements, There Was No Post-Columbian Reversal. American Economic Journal: Macroeconomics, 6(3):1–28.


Chen, M. K. (2013). The Effect of Language on Economic Behavior: Evidence from Savings Rates, Health Behaviors, and Retirement Assets. American Economic Review, 103(2):690–731.

Cherniwchan, J. and Moreno-Cruz, J. (2019). Maize and Precolonial Africa. Journal of Development Economics, 136:137–150.

Colella, F., Lalive, R., Sakalli, S. O., and Thoenig, M. (2019). Inference with Arbitrary Clustering. IZA Discussion Paper 12584.

Comin, D., Easterly, W., and Gong, E. (2010). Was the Wealth of Nations Determined in 1000 BC? American Economic Journal: Macroeconomics, 2(3):65–97.

Cysouw, M. (2013). Disentangling Geography from Genealogy. In Space in Language and Linguistics: Geographical, Interactional, and Cognitive Perspectives, pages 21–37. de Gruyter, Berlin.

Desmet, K., Ortuño-Ortín, I., and Wacziarg, R. (2017). Culture, Ethnicity, and Diversity. American Economic Review, 107(9):2479–2513.

Diamond, J. (1997). Guns, Germs and Steel. W. W. Norton & Company, New York.

Dickens, A. (2018a). Ethnolinguistic Favoritism in African Politics. American Economic Journal: Applied Economics, 10(3):370–402.

Dickens, A. (2018b). Population Relatedness and Cross-Country Idea Flows: Evidence from Book Translations. Journal of Economic Growth, 23(4):367–386.

Diene, D. (2000). Foreword. In The Silk Roads: Highways of Culture and Commerce. Berghahn Books, New York.

Eggan, F. (1963). Cultural Drift and Social Change. Current Anthropology, 4(4):347–355.

Fearon, J. D. (2003). Ethnic and Cultural Diversity by Country. Journal of Economic Growth, 8(2):195–222.

Fenske, J. (2014). Ecology, Trade, and States in Pre-Colonial Africa. Journal of the European Economic Association, 12(3):612–640.

Fishman, J. A., Gertner, M. H., Lowy, E. G., and Milan, W. G. (1985). The Rise and Fall of the Ethnic Revival: Perspectives on Language and Ethnicity. Walter de Gruyter & Company, Berlin.

Frankema, E. (2015). The Biogeographic Roots of World Inequality: Animals, Disease, and Human Settlement Patterns in Africa and the Americas Before 1492. World Development, 70:274–285.


Galor, O. and Özak, Ö. (2016). The Agricultural Origins of Time Preference. American Economic Review, 106(10):3064–3103.

Galor, O., Özak, Ö., and Sarid, A. (2018). Geographical Roots of the Coevolution of Cultural and Linguistic Traits. NBER Working Paper 25289.

Galor, O., Özak, Ö., and Sarid, A. (2020). Linguistic Traits and Human Capital Formation. American Economic Review: Papers & Proceedings (Forthcoming).

Ginsburgh, V. A. and Weber, S. (2016). Linguistic Distances and Ethnolinguistic Fractionalization and Disenfranchisement Indices. In Ginsburgh, V. A. and Weber, S., editors, The Palgrave Handbook of Economics and Language, pages 137–173. Palgrave Macmillan UK, London.

Giuliano, P. and Nunn, N. (2013). The Transmission of Democracy: From the Village to the Nation-State. American Economic Review: Papers & Proceedings, 103(3):86–92.

Giuliano, P. and Nunn, N. (2017). Understanding Cultural Persistence and Change. NBER Working Paper 23617.

Gomes, J. F. (2018). The Health Costs of Ethnic Distance: Evidence from Sub-Saharan Africa. University of Navarra, mimeo.

Guiso, L., Sapienza, P., and Zingales, L. (2016). Long-Term Persistence. Journal of the European Economic Association, 14(6):1401–1436.

Holman, E. W., Wichmann, S., Brown, C. H., Velupillai, V., Müller, A., and Bakker, D. (2009). Explorations in Automated Language Classification. Folia Linguistica, 42(3-4):331–354.

Iyigun, M., Nunn, N., and Qian, N. (2017). The Long-run Effects of Agricultural Productivity on Conflict, 1400-1900. NBER Working Paper 24066.

Jakiela, P. and Ozier, O. (2019). Gendered Language. Center for Global Development Working Paper 500 (January 2019).

Kardulias, P. N. (2015). Introduction: Pastoralism as an Adaptive Strategy. In Kardulias, P. N., editor, The Ecology of Pastoralism. University Press of Colorado, Boulder.

Kiszewski, A., Mellinger, A., Spielman, A., Malaney, P., Sachs, S. E., and Sachs, J. (2004). A Global Index Representing the Stability of Malaria Transmission. American Journal of Tropical Medicine and Hygiene, 70(5):486–498.

Kitamura, S. and Lagerlöf, N.-P. (2019). Geography and State Fragmentation. Journal of the European Economic Association, Forthcoming.


Koetsier, T. (2019). From Early Trade and Communication Networks to the Internet, the Internet of Things and the Global Intelligent Machine (GIM). Advances in Historical Studies, 8(1):1–23.

Levinson, S. C. (2003). Space in Language and Cognition: Explorations in Cognitive Diversity. Cambridge University Press, Cambridge.

Lovejoy, P. E. and Baier, S. (1975). The Desert-Side Economy of Central Sudan. The International Journal of African Historical Studies, 8(4):551–581.

Melitz, J. (2008). Language and Foreign Trade. European Economic Review, 52(4):667–699.

Michalopoulos, S. (2012). The Origins of Ethnolinguistic Diversity. American Economic Review, 102(4):1508–1539.

Michalopoulos, S., Naghavi, A., and Prarolo, G. (2016). Islam, Inequality and Pre-Industrial Comparative Development. Journal of Development Economics, 120:86–98.

Michalopoulos, S., Naghavi, A., and Prarolo, G. (2018). Trade and Geography in the Spread of Islam. Economic Journal, 128(616):3210–3241.

Murdock, G. P. (1967). Ethnographic Atlas. University of Pittsburgh Press, Pittsburgh.

Nunn, N. (2008). The Long-Term Effects of Africa’s Slave Trades. The Quarterly Journal of Economics, 123(1):139–176.

Nunn, N. (2009). The Importance of History for Economic Development. Annual Review of Economics, 1:65–92.

Nunn, N. (2012). Culture and the Historical Process. Economic History of Developing Regions, 27(S1):S108–S126.

Nunn, N. and Puga, D. (2012). Ruggedness: The Blessing of Bad Geography in Africa. Review of Economics and Statistics, 94(1):20–36.

Nunn, N. and Qian, N. (2010). The Columbian Exchange: A History of Disease, Food, and Ideas. Journal of Economic Perspectives, 24(2):163–188.

Nunn, N. and Qian, N. (2011). The Potato’s Contribution to Population and Urbanization: Evidence From A Historical Experiment. The Quarterly Journal of Economics, 126(2):593–650.

Nunn, N. and Wantchekon, L. (2011). The Slave Trade and the Origins of Mistrust in Africa. American Economic Review, 101(7):3221–3252.

Odum, E. P. (1971). Fundamentals of Ecology. Philadelphia, 3rd edition.


Olsson, O. and Hibbs Jr., D. A. (2005). Biogeography and Long-Run Economic Development. European Economic Review, 49(4):909–938.

Oster, E. (2019). Unobservable Selection and Coefficient Stability: Theory and Evidence. Journal of Business & Economic Statistics, 37(2):187–204.

Putterman, L. and Weil, D. N. (2010). Post-1500 Population Flows and the Long-Run Determinants of Economic Growth and Inequality. The Quarterly Journal of Economics, 125(4):1627–1682.

Risager, K. (2015). Linguaculture: the Language-Culture Nexus in Transnational Perspective. In Sharifian, F., editor, The Routledge Handbook of Language and Culture, pages 87–99. New York.

Roberson, D. and Hanley, J. R. (2010). Relatively Speaking: An Account of the Relationship between Language and Thought in the Color Domain. In Malt, B. C. and Wolff, P., editors, Words and the Mind: How Words Can Capture Human Experience, pages 183–198. Oxford University Press, New York.

Spolaore, E. and Wacziarg, R. (2009). The Diffusion of Development. The Quarterly Journal of Economics, 124(2):469–529.

Spolaore, E. and Wacziarg, R. (2013). How Deep Are the Roots of Economic Development? Journal of Economic Literature, 51(2):325–369.

Spolaore, E. and Wacziarg, R. (2016). War and Relatedness. Review of Economics and Statistics, 98(5):925–939.

Swadesh, M. (1952). Lexicostatistical Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society, 96:121–137.

Swadesh, M. (1955). Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics, 21:121–137.

Tomasello, M. (2008). Origins of Human Communication. The MIT Press, Cambridge.

Turner, N. J., Davidson-Hunt, I. J., and O’Flaherty, M. (2003). Living on the Edge: Ecological and Cultural Edges as Sources of Diversity for Social-Ecological Resilience. Human Ecology, 31(3):439–461.

Voigtländer, N. and Voth, H.-J. (2012). Persecution Perpetuated: The Medieval Origins of Anti-Semitic Violence in Nazi Germany. Quarterly Journal of Economics, 127(3):1339–1392.

Wichmann, S., Holman, E. W., Bakker, D., and Brown, C. H. (2010). Evaluating Linguistic Distance Measures. Physica A, 389(17):3632–3639.


FOR ONLINE PUBLICATION ONLY

A Supplementary Material

Figure A1: Spatial Distribution of Pre-Colonial Ethnic Groups

Location of pre-colonial ethnic groups in Murdock’s (1967) Ethnographic Atlas.

Figure A2: Pre-Colonial Group-Level Buffer Zones in Africa

Buffer zones of pre-colonial ethnic groups in Africa. Each buffer has a diameter of 100 km.


Figure A3: Linguistic Distance and Post-1500 Change in Land Productivity Variation

This figure depicts unconditional estimates of $\beta_{change}$ from equation (1), the change in land productivity variation in the post-Columbian period, for 15 different samples with and without language family fixed effects. The dependent variable for each regression is linguistic distance, and the only included covariate is pre-Columbian land productivity variation. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. The full sample estimates come from the baseline sample, and, when moving from left to right, the subsequent estimates come from increasingly smaller subsamples of adjacent language pairs who share the stated number of branches on the global Ethnologue language tree. The most restricted sample, with 14 shared branches, represents the sibling sample used throughout much of the empirical analysis. Intervals reflect 95% confidence levels.


Table A1 reports summary statistics for the baseline (full) sample dataset, Table A2 reports summary statistics for the sibling sample dataset and Table A3 reports summary statistics for the historical dataset. Table A4 is a replication of Table 1, but includes the coefficient estimates for all control variables. Similarly, Table A5 is a replication of Table 2, Table A7 is a replication of Table 4, Table A8 is a replication of Table 5, and Tables A9 and A10 are replications of Table 6 that include all control variable estimates. Table A11 replicates the baseline results but excludes overlapping buffer zones, Table A12 reports tests for selection on unobservables, Table A13 reports results from the placebo tests and Table A14 reproduces the baseline country fixed effects regressions allowing for spatial correlation in the error term.

Table A1: Summary Statistics – Full Sample

Variable                                      Mean      Std dev.  Min       Max       N

Linguistic distance                           0.727     0.177     0.003     0.945     8402
Land productivity variation (pre-1500)        0.236     0.272     0.000     1.616     8402
∆ in land productivity variation (post-1500)  -0.08     0.201     -0.977    0.374     8402
Land productivity (pre-1500)                  1.41      0.752     0.000     5.151     8402
∆ in land productivity (post-1500)            -0.118    0.624     -2.136    1.388     8402
Malaria                                       8.314     8.891     0.000     36.286    8402
Elevation                                     691.06    655.78    -22.303   4929.1    8402
Ruggedness                                    306.50    287.68    0.101     1932.0    8402
Precipitation                                 14.202    7.935     0.011     50.669    8402
Precipitation variation                       1.714     1.795     0.000     19.648    8402
Temperature                                   21.912    6.431     -12.287   29.492    8402
Temperature variation                         1.591     1.481     0.037     10.092    8402
Log distance between group centroids          4.656     1.405     0.000     8.637     8402
Log distance to coast                         4.928     1.839     -3.797    7.697     8402
Log distance to border                        3.032     2.218     -12.096   6.838     8402
Log distance to lake                          4.747     1.034     0.000     7.977     8402
Log distance to major river                   4.259     1.707     -5.554    8.975     8402
Log distance to minor river                   5.663     1.912     -5.147    8.976     8402
Population of language pair                   12.578    3.482     1.609     20.637    8402
Area of language pair (km²)                   9.707     2.705     2.18      15.996    8402
Latitude absolute difference                  1.435     2.523     0.000     30.337    8402
Longitude absolute difference                 2.127     6.94      0.000     340.317   8402
Log distance to trade route pre-600 CE        14.203    1.505     6.964     16.264    8402
Log distance to trade route pre-1800 CE       13.246    1.767     6.964     16.245    8402
∆ log distance to trade route post-600 CE     -0.957    1.157     -6.768    0.000     8402
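As a quick consistency check on the trade-route rows of Table A1, and assuming (the table itself does not state the definition) that the ∆ variable is the pre-1800 CE log distance minus the pre-600 CE log distance, the reported means line up. All three rows share N = 8402, so the mean of the difference equals the difference of the means:

```latex
\[
\underbrace{13.246}_{\text{mean, pre-1800 CE}} \;-\; \underbrace{14.203}_{\text{mean, pre-600 CE}} \;=\; -0.957
\]
```

The same arithmetic holds for the sibling sample means in Table A2 (13.5 − 14.498 = −0.998).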


Table A2: Summary Statistics – Sibling Sample

Variable                                      Mean      Std dev.  Min       Max       N

Linguistic distance                           0.526     0.169     0.003     0.913     1686
Land productivity variation (pre-1500)        0.294     0.305     0.000     1.373     1686
∆ in land productivity variation (post-1500)  -0.15     0.263     -0.977    0.252     1686
Land productivity (pre-1500)                  1.475     0.74      0.000     4.598     1686
∆ in land productivity (post-1500)            -0.248    0.729     -2.064    1.335     1686
Malaria                                       8.899     8.838     0.000     35.624    1686
Elevation                                     680.08    634.62    1.057     4929.1    1686
Ruggedness                                    335.79    294.33    0.295     1894.6    1686
Precipitation                                 16.118    8.202     0.276     48.676    1686
Precipitation variation                       1.907     1.762     0.038     16.69     1686
Temperature                                   22.725    5.161     -6.977    29.21     1686
Temperature variation                         1.717     1.509     0.037     9.981     1686
Log distance between group centroids          3.737     1.185     0.000     8.092     1686
Log distance to coast                         4.405     1.942     -2.944    7.576     1686
Log distance to border                        3.036     1.853     -9.778    6.581     1686
Log distance to lake                          4.817     0.962     0.000     7.411     1686
Log distance to major river                   4.741     1.718     -3.61     7.866     1686
Log distance to minor river                   6.385     1.889     -3.61     8.82      1686
Population of language pair                   10.529    2.873     2.303     20.637    1686
Area of language pair (km²)                   7.996     2.199     2.652     15.971    1686
Latitude absolute difference                  0.515     1.003     0.000     15.229    1686
Longitude absolute difference                 0.596     1.995     0.000     51.597    1686
Log distance to trade route pre-600 CE        14.498    1.304     7.446     16.034    1686
Log distance to trade route pre-1800 CE       13.5      1.66      7.446     16.03     1686
∆ log distance to trade route post-600 CE     -0.998    1.126     -6.768    0.000     1686


Table A3: Summary Statistics – Historical Sample

Variable                                      Mean      Std dev.  Min       Max       N

Historical dependence on agriculture          4.442     2.763     0.000     9.000     1133
Historical dependence on animal husbandry     1.56      1.789     0.000     9.000     1133
Historical dependence on fishing              1.434     1.634     0.000     9.000     1133
Land productivity (pre-1500)                  1.204     0.864     0.000     4.997     1133
∆ in land productivity (post-1500)            0.029     0.573     -2.200    1.403     1133
Elevation                                     704.521   619.491   1.000     4976.3    1133
Ruggedness                                    214.114   204.76    0.000     1292.8    1133
Precipitation                                 105.813   74.296    0.000     495.8     1133
Precipitation variation                       10.684    13.053    0.000     116.0     1133
Temperature                                   194.3     87.04     -165.1    297.9     1132
Temperature variation                         11.345    11.039    0.000     76.94     1132
Log distance to coast                         5.081     1.853     -1.189    7.496     1133
Log distance to border                        3.972     1.697     -2.871    6.817     1133
Log distance to lake                          4.630     1.207     -1.645    8.397     1122
Log distance to major river                   4.310     1.657     -5.384    8.971     1133
Log distance to minor river                   5.300     1.711     -5.384    8.972     1133
Buffer zone area (km²)                        8.809     0.525     3.567     8.968     1133
Absolute latitude                             20.291    16.735    0.000     77.489    1133
Absolute longitude                            65.549    48.903    0.000     179.50    1133


Table A4: Border-Level Regressions: Baseline Results (100km Buffer Zone)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

(1) (2) (3) (4) (5) (6)

∆ in land productivity variation (post-1500) -0.100*** -0.100*** -0.083*** -0.111*** -0.098*** -0.065*(0.030) (0.034) (0.031) (0.030) (0.033) (0.034)

Land productivity variation (pre-1500) -0.061*** -0.061*** -0.047* -0.043** -0.040 -0.029(0.022) (0.022) (0.025) (0.021) (0.026) (0.027)

∆ in land productivity (post-1500) -0.001 -0.001 0.008(0.010) (0.010) (0.010)

Land productivity (pre-1500) -0.001 0.004 0.009(0.006) (0.006) (0.006)

Malaria Ecology Index 0.003*** 0.004*** 0.001**(0.001) (0.001) (0.001)

Elevation -0.000 0.000 0.000(0.000) (0.000) (0.000)

Ruggedness 0.000** 0.000*** 0.000***(0.000) (0.000) (0.000)

Average precipitation -0.000 0.000 -0.001(0.001) (0.001) (0.001)

Precipitation variation -0.002 -0.004** -0.004*(0.002) (0.002) (0.002)

Average temperature 0.000 0.000 0.001(0.002) (0.002) (0.002)

Temperature variation -0.016* -0.015* -0.018**(0.008) (0.008) (0.008)

Log distance between group centroids 0.024*** 0.025*** 0.026***(0.005) (0.004) (0.005)

Log distance to coast -0.005* -0.007** 0.005*(0.003) (0.003) (0.003)

Log distance to border -0.000 -0.001 -0.008***(0.001) (0.001) (0.002)

Log distance to lake 0.006** 0.006* 0.005(0.003) (0.003) (0.003)

Log distance to major river 0.006*** 0.006*** 0.001(0.002) (0.002) (0.002)

Log distance to minor river -0.008*** -0.005* -0.001(0.003) (0.003) (0.003)

Population of language pair 0.001 0.003 0.000(0.002) (0.002) (0.002)

Area of language pair (km2) -0.002 -0.002 0.009**(0.004) (0.003) (0.004)

Latitude difference -0.001 -0.002 -0.003**(0.001) (0.001) (0.001)

Longitude difference 0.000 0.000 -0.000(0.000) (0.000) (0.000)

Language Family FE    Yes     Yes     Yes     Yes     Yes     Yes
Country FE            No      No      No      No      No      Yes

Adjusted R²           0.25    0.25    0.26    0.26    0.28    0.37
Observations          8402    8402    8402    8402    8402    7291

This table establishes the negative and statistically significant effect of variation in land productivity on a language pair’s lexicostatistical linguistic distance. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.


Table A5: Border-Level Regressions: Baseline Results (50km Buffer Zone)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

(1) (2) (3) (4) (5) (6)

∆ in land productivity variation (post-1500) -0.115*** -0.107*** -0.115*** -0.133*** -0.118*** -0.058*(0.033) (0.034) (0.033) (0.033) (0.034) (0.032)

Land productivity variation (pre-1500) -0.090*** -0.089*** -0.091*** -0.079*** -0.077*** -0.033(0.024) (0.024) (0.024) (0.023) (0.023) (0.023)

∆ in land productivity (post-1500) -0.011 -0.016 0.003(0.010) (0.010) (0.013)

Land productivity (pre-1500) -0.006 -0.003 0.012(0.007) (0.007) (0.008)

Malaria Ecology Index 0.000 0.000 -0.000(0.000) (0.000) (0.000)

Elevation 0.000 0.000* 0.000(0.000) (0.000) (0.000)

Ruggedness -0.000** -0.000** -0.000(0.000) (0.000) (0.000)

Average precipitation 0.000 0.001* 0.000(0.000) (0.000) (0.000)

Precipitation variation 0.000 0.000 0.001(0.002) (0.002) (0.002)

Average temperature 0.000 0.000 0.000(0.001) (0.001) (0.001)

Temperature variation 0.015** 0.013** 0.008(0.006) (0.006) (0.006)

Log distance between group centroids 0.023*** 0.024*** 0.026***(0.005) (0.005) (0.005)

Log distance to coast -0.005* -0.004 0.008***(0.003) (0.003) (0.003)

Log distance to border 0.000 0.000 -0.009***(0.001) (0.001) (0.002)

Log distance to lake 0.006** 0.006** 0.006**(0.003) (0.003) (0.003)

Log distance to major river 0.005** 0.005** 0.000(0.002) (0.002) (0.002)

Log distance to minor river -0.008*** -0.008*** -0.003(0.003) (0.003) (0.003)

Population of language pair 0.002 0.002 -0.001(0.002) (0.002) (0.002)

Area of language pair (km2) -0.003 -0.003 0.009***(0.004) (0.004) (0.004)

Latitude difference -0.001 -0.001 -0.003**(0.001) (0.001) (0.001)

Longitude difference 0.000 0.000 -0.000(0.000) (0.000) (0.000)

Language Family FE    Yes     Yes     Yes     Yes     Yes     Yes
Country FE            No      No      No      No      No      Yes

Adjusted R²           0.25    0.25    0.25    0.26    0.27    0.37
Observations          8402    8402    8402    8402    8402    7290

This table establishes the negative and statistically significant effect of variation in land productivity on a language pair’s lexicostatistical linguistic distance. The unit of observation is a 50 km buffer zone along the contiguous border segment of each language pair. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.


Table A6: Border-Level Regressions: Sibling Sample Results (100km Buffer Zone)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

(1) (2) (3) (4) (5) (6)

∆ in land productivity variation (post-1500) -0.188*** -0.197*** -0.286*** -0.169*** -0.248*** -0.154**(0.042) (0.047) (0.058) (0.044) (0.060) (0.070)

Land productivity variation (pre-1500) -0.088** -0.088** -0.200*** -0.096*** -0.185*** -0.140**(0.034) (0.034) (0.056) (0.035) (0.059) (0.068)

∆ in land productivity (post-1500) -0.007 -0.003 0.010(0.016) (0.015) (0.019)

Land productivity (pre-1500) -0.018 -0.009 0.011(0.012) (0.012) (0.014)

Malaria Ecology Index 0.002** 0.002* -0.001(0.001) (0.001) (0.001)

Elevation 0.000 0.000 0.000(0.000) (0.000) (0.000)

Ruggedness 0.000* 0.000* 0.000(0.000) (0.000) (0.000)

Average precipitation -0.001 -0.001 -0.003**(0.001) (0.001) (0.001)

Precipitation variation -0.003 -0.000 -0.002(0.004) (0.005) (0.005)

Average temperature 0.003 0.002 0.007(0.003) (0.003) (0.006)

Temperature variation -0.015 -0.013 -0.008(0.017) (0.017) (0.018)

Log distance between group centroids 0.023** 0.023** 0.017*(0.009) (0.009) (0.009)

Log distance to coast 0.008* 0.005 0.008(0.005) (0.005) (0.007)

Log distance to border 0.000 -0.001 -0.007(0.003) (0.003) (0.006)

Log distance to lake 0.014*** 0.014** 0.011*(0.006) (0.006) (0.006)

Log distance to major river 0.005 0.004 -0.002(0.004) (0.004) (0.005)

Log distance to minor river -0.002 0.001 0.015*(0.006) (0.006) (0.008)

Population of language pair -0.004 -0.005 -0.006(0.003) (0.004) (0.004)

Area of language pair (km2) -0.019*** -0.017*** 0.002(0.006) (0.006) (0.007)

Latitude difference 0.002 0.002 -0.005(0.005) (0.005) (0.005)

Longitude difference 0.003** 0.003** 0.008(0.001) (0.001) (0.006)

Language Family FE    Yes     Yes     Yes     Yes     Yes     Yes
Country FE            No      No      No      No      No      Yes

Adjusted R²           0.28    0.28    0.29    0.30    0.30    0.39
Observations          1669    1669    1669    1669    1669    1497

This table establishes the negative and statistically significant effect of variation in land productivity on a language pair’s lexicostatistical linguistic distance for the baseline sibling sample. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.


Table A7: Border-Level Regressions: Native Population Sensitivity Analysis

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

Full Sample Sibling Sample

(1) (2) (3) (4) (5) (6) (7) (8)

∆ in land productivity variation (post-1500) -0.065* -0.072** -0.073** -0.078** -0.154** -0.159** -0.154** -0.175**(0.034) (0.036) (0.036) (0.038) (0.070) (0.071) (0.071) (0.073)

Land productivity variation (pre-1500) -0.029 -0.032 -0.031 -0.059* -0.140** -0.142** -0.138** -0.179**(0.027) (0.029) (0.030) (0.032) (0.068) (0.069) (0.069) (0.072)

∆ in land productivity (post-1500) 0.008 0.010 0.010 0.006 0.010 0.013 0.012 0.004(0.010) (0.012) (0.012) (0.013) (0.019) (0.020) (0.021) (0.022)

Land productivity (pre-1500) 0.009 0.012* 0.014** 0.017** 0.011 0.012 0.012 0.014(0.006) (0.007) (0.007) (0.008) (0.014) (0.015) (0.016) (0.017)

Malaria Ecology Index 0.001** 0.001* 0.001** 0.001 -0.001 -0.002 -0.001 -0.002(0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)

Elevation 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000*(0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)

Ruggedness 0.000*** 0.000*** 0.000** 0.000** 0.000 0.000 0.000 0.000(0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)

Average precipitation -0.001 -0.001 -0.001 -0.001 -0.003** -0.003** -0.003** -0.003**(0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)

Precipitation variation -0.004* -0.004* -0.004 -0.004 -0.002 -0.001 -0.001 0.001(0.002) (0.002) (0.002) (0.003) (0.005) (0.005) (0.005) (0.005)

Average temperature 0.001 0.001 0.001 0.001 0.007 0.009 0.010 0.011(0.002) (0.003) (0.003) (0.003) (0.006) (0.006) (0.006) (0.007)

Temperature variation -0.018** -0.019** -0.017* -0.024** -0.008 -0.009 -0.010 -0.002(0.008) (0.009) (0.009) (0.012) (0.018) (0.018) (0.018) (0.023)

Log distance between group centroids 0.026*** 0.025*** 0.025*** 0.024*** 0.017* 0.018* 0.019** 0.020*(0.005) (0.005) (0.005) (0.005) (0.009) (0.009) (0.009) (0.010)

Log distance to coast 0.005* 0.005 0.005 0.005 0.008 0.007 0.007 0.008(0.003) (0.003) (0.003) (0.004) (0.007) (0.007) (0.007) (0.007)

Log distance to border -0.008*** -0.009*** -0.009*** -0.010*** -0.007 -0.007 -0.007 -0.007(0.002) (0.002) (0.003) (0.003) (0.006) (0.006) (0.006) (0.006)

Log distance to lake 0.005 0.005 0.006* 0.006* 0.011* 0.011* 0.011* 0.012*(0.003) (0.003) (0.003) (0.004) (0.006) (0.006) (0.006) (0.007)

Log distance to major river 0.001 0.000 0.001 0.000 -0.002 -0.002 -0.001 -0.002(0.002) (0.002) (0.003) (0.003) (0.005) (0.005) (0.005) (0.006)

Log distance to minor river -0.001 -0.001 -0.002 -0.001 0.015* 0.016* 0.014* 0.011(0.003) (0.003) (0.003) (0.004) (0.008) (0.008) (0.008) (0.008)

Population of language pair 0.000 0.001 0.001 0.000 -0.006 -0.006 -0.005 -0.007(0.002) (0.002) (0.002) (0.003) (0.004) (0.004) (0.004) (0.004)

Area of language pair (km2) 0.009** 0.009** 0.009** 0.010** 0.002 -0.000 -0.001 0.002(0.004) (0.004) (0.004) (0.004) (0.007) (0.007) (0.007) (0.007)

Latitude difference -0.003** -0.002 -0.002 -0.001 -0.005 -0.004 -0.004 -0.004(0.001) (0.002) (0.002) (0.002) (0.005) (0.005) (0.005) (0.005)

Longitude difference -0.000 0.000 0.000 0.001 0.008 0.008 0.007 0.007(0.000) (0.001) (0.001) (0.001) (0.006) (0.006) (0.006) (0.007)

Language Family FE                    Yes     Yes     Yes     Yes     Yes     Yes     Yes     Yes
Country FE                            Yes     Yes     Yes     Yes     Yes     Yes     Yes     Yes

Adjusted R²                           0.37    0.36    0.36    0.34    0.39    0.40    0.40    0.39
Observations                          7291    6723    6425    5540    1497    1445    1424    1273

Native Population Sample Restriction  None    ≥25%    ≥50%    ≥75%    None    ≥25%    ≥50%    ≥75%

This table establishes the robustness of the baseline estimates to post-Columbian migrations with the full sample (columns 1-4) and sibling sample (columns 5-8). Each sample is restricted to language pairs that reside in a country where at least 25, 50 or 75 percent of the population is native to the residing country. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.


Table A8: Ethnographic Atlas Regressions: Land Productivity and Historical Subsistence

Dependent Variable: Historical Dependence on Agriculture (columns 1-2), on Animal Husbandry (columns 3-4) and on Fishing (columns 5-6)

(1) (2) (3) (4) (5) (6)

∆ in land productivity (post-1500) 0.763*** 0.694*** -0.391** -0.390** -0.348 -0.444**(0.258) (0.251) (0.178) (0.173) (0.224) (0.195)

Land productivity (pre-1500) 0.891*** 0.810*** -0.245** -0.184* -0.261** -0.301***(0.142) (0.141) (0.105) (0.100) (0.105) (0.086)

Elevation 0.000 0.000*** -0.001***(0.000) (0.000) (0.000)

Ruggedness -0.000 -0.001 0.002**(0.001) (0.001) (0.001)

Precipitation -0.003* -0.003*** 0.004***(0.002) (0.001) (0.001)

Precipitation variation 0.012 0.007 -0.004(0.008) (0.005) (0.004)

Temperature -0.001 0.004* -0.009***(0.003) (0.002) (0.003)

Temperature variation -0.002 0.009 -0.041***(0.022) (0.016) (0.015)

Log distance to coast 0.096 -0.115*** -0.138***(0.061) (0.044) (0.052)

Log distance to border -0.032 0.035 -0.078**(0.043) (0.036) (0.033)

Log distance to lake 0.149*** 0.015 -0.130***(0.054) (0.032) (0.050)

Log distance to major river -0.021 0.061** -0.009(0.049) (0.030) (0.036)

Log distance to minor river -0.057 -0.036 -0.079*(0.064) (0.036) (0.045)

Buffer zone area (km2) 0.241 0.126 -0.550***(0.170) (0.094) (0.167)

Absolute latitude -0.050** 0.039** -0.005(0.022) (0.017) (0.018)

Absolute longitude -0.020*** -0.007* 0.008(0.007) (0.004) (0.006)

Language Family FE    Yes     Yes     Yes     Yes     Yes     Yes
Country FE            Yes     Yes     Yes     Yes     Yes     Yes

Adjusted R²           0.67    0.68    0.63    0.64    0.46    0.57
Observations          1133    1119    1133    1119    1133    1119

Columns (1) and (2) document a positive association between land productivity of an ethnic group’s homeland and their historical reliance on agriculture, columns (3) and (4) document a negative association between land productivity of a group’s homeland and their historical reliance on animal husbandry, and columns (5) and (6) document a negative association between land productivity of a group’s homeland and their historical reliance on fishing. The unit of observation is a group-specific buffer zone, constructed from Ethnographic Atlas group centroids using a 50 km radius. Robust standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.


Table A9: Border-Level Regressions: Historical Old World Trade Routes (Full Sample)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

(1) (2) (3) (4) (5) (6)

Log distance to trade route pre-600 CE 0.004 0.010 0.000 0.013*(0.006) (0.007) (0.008) (0.007)

Log distance to trade route pre-1800 CE 0.007* 0.013*** 0.013**(0.004) (0.004) (0.005)

∆ log distance to trade route post-600 CE 0.013**(0.005)

∆ in land productivity (post-1500) 0.005 0.004 0.004 0.004(0.011) (0.011) (0.011) (0.011)

Land productivity (pre-1500) 0.013* 0.011 0.011 0.011(0.007) (0.007) (0.007) (0.007)

Malaria Ecology Index 0.001* 0.001* 0.001* 0.001*(0.001) (0.001) (0.001) (0.001)

Elevation 0.000 0.000 0.000 0.000(0.000) (0.000) (0.000) (0.000)

Ruggedness 0.000** 0.000** 0.000** 0.000**(0.000) (0.000) (0.000) (0.000)

Average precipitation -0.000 -0.000* -0.000* -0.000*(0.000) (0.000) (0.000) (0.000)

Precipitation variation -0.000* -0.000 -0.000 -0.000(0.000) (0.000) (0.000) (0.000)

Average temperature 0.000 0.000 0.000 0.000(0.000) (0.000) (0.000) (0.000)

Temperature variation -0.002** -0.002** -0.002** -0.002**(0.001) (0.001) (0.001) (0.001)

Log distance between group centroids 0.024*** 0.024*** 0.024*** 0.024***(0.005) (0.005) (0.005) (0.005)

Log distance to coast 0.006* 0.006* 0.006* 0.006*(0.003) (0.003) (0.003) (0.003)

Log distance to border -0.008*** -0.008*** -0.008*** -0.008***(0.003) (0.003) (0.003) (0.003)

Log distance to lake 0.007** 0.006* 0.006* 0.006*(0.003) (0.003) (0.003) (0.003)

Log distance to major river -0.001 0.000 0.000 0.000(0.003) (0.003) (0.003) (0.003)

Log distance to minor river -0.002 -0.003 -0.003 -0.003(0.003) (0.003) (0.003) (0.003)

Population of language pair 0.001 0.001 0.001 0.001(0.002) (0.002) (0.002) (0.002)

Area of language pair (km2) 0.009** 0.009** 0.009** 0.009**(0.004) (0.004) (0.004) (0.004)

Latitude difference -0.001 -0.002 -0.002 -0.002(0.002) (0.002) (0.002) (0.002)

Longitude difference 0.001 0.001 0.001 0.001(0.001) (0.001) (0.001) (0.001)

Language Family FE    Yes     Yes     Yes     Yes     Yes     Yes
Country FE            Yes     Yes     Yes     Yes     Yes     Yes

Adjusted R²           0.29    0.33    0.29    0.33    0.33    0.33
Observations          6205    6205    6205    6205    6205    6205

This table documents a positive association between the linguistic distance of adjacent ethnic groups and their distance to the nearest pre-600 CE and pre-1800 CE trade route. The trade route data map Old World routes, so the full sample is restricted to observations located in Old World countries. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.


Table A10: Border-Level Regressions: Historical Old World Trade Routes (Sibling Sample)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

(1) (2) (3) (4) (5) (6)

Log distance to trade route pre-600 CE 0.021* 0.024** 0.011 0.031**(0.012) (0.012) (0.014) (0.013)

Log distance to trade route pre-1800 CE 0.020*** 0.023*** 0.020**(0.008) (0.009) (0.010)

∆ log distance to trade route post-600 CE 0.020**(0.010)

∆ in land productivity (post-1500) -0.001 -0.002 -0.003 -0.003(0.020) (0.020) (0.020) (0.020)

Land productivity (pre-1500) -0.001 -0.006 -0.006 -0.006(0.015) (0.016) (0.016) (0.016)

Malaria Ecology Index -0.002* -0.002 -0.002 -0.002(0.001) (0.001) (0.001) (0.001)

Elevation 0.000 0.000 0.000 0.000(0.000) (0.000) (0.000) (0.000)

Ruggedness 0.000 0.000 0.000 0.000(0.000) (0.000) (0.000) (0.000)

Average precipitation -0.000** -0.000** -0.000** -0.000**(0.000) (0.000) (0.000) (0.000)

Precipitation variation -0.000 -0.000 -0.000 -0.000(0.001) (0.001) (0.001) (0.001)

Average temperature 0.001** 0.001** 0.001** 0.001**(0.001) (0.001) (0.001) (0.001)

Temperature variation 0.001 0.001 0.001 0.001(0.002) (0.002) (0.002) (0.002)

Log distance between group centroids 0.017* 0.020** 0.020** 0.020**(0.009) (0.010) (0.010) (0.010)

Log distance to coast 0.008 0.009 0.009 0.009(0.007) (0.007) (0.007) (0.007)

Log distance to border -0.005 -0.006 -0.006 -0.006(0.006) (0.006) (0.006) (0.006)

Log distance to lake 0.013** 0.011* 0.012* 0.012*(0.006) (0.006) (0.006) (0.006)

Log distance to major river -0.004 -0.002 -0.002 -0.002(0.005) (0.005) (0.005) (0.005)

Log distance to minor river 0.012 0.010 0.010 0.010(0.008) (0.008) (0.008) (0.008)

Population of language pair -0.004 -0.002 -0.002 -0.002(0.004) (0.004) (0.004) (0.004)

Area of language pair (km2) 0.001 -0.001 -0.001 -0.001(0.007) (0.007) (0.007) (0.007)

Latitude difference -0.005 -0.006 -0.006 -0.006(0.005) (0.005) (0.005) (0.005)

Longitude difference 0.007 0.006 0.006 0.006(0.006) (0.007) (0.007) (0.007)

Language Family FE    Yes     Yes     Yes     Yes     Yes     Yes
Country FE            Yes     Yes     Yes     Yes     Yes     Yes

Adjusted R²           0.35    0.37    0.36    0.37    0.37    0.37
Observations          1389    1389    1389    1389    1389    1389

This table documents a positive association between the linguistic distance of adjacent ethnic groups and their distance to the nearest pre-600 CE and pre-1800 CE trade route. The trade route data map Old World routes, so the sibling sample is restricted to observations located in Old World countries. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.


Table A11: Border-Level Regressions: Overlapping Polygon Sensitivity Analysis

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

                                              Full Sample           Sibling Sample

                                              (1)       (2)         (3)       (4)

∆ in land productivity variation (post-1500)  -0.065*   -0.138***   -0.154**  -0.188**
                                              (0.034)   (0.043)     (0.070)   (0.081)

Land productivity variation (pre-1500)        -0.029    -0.095**    -0.140**  -0.186**
                                              (0.027)   (0.039)     (0.068)   (0.080)

Controls                                      Yes       Yes         Yes       Yes
Language Family FE                            Yes       Yes         Yes       Yes
Country FE                                    Yes       Yes         Yes       Yes

Adjusted R²                                   0.40      0.40        0.44      0.42
Observations                                  7291      5099        1497      1262

Overlapping Polygons Excluded                 No        Yes         No        Yes

This table tests the sensitivity of the baseline estimates by limiting the full and sibling samples to ethnolinguistic pairs that do not overlap with any other groups. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Control variables are identical to the complete set of baseline control variables used in Table 1. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.

Table A12: Border-Level Regressions: Selection on Unobservables

Controls in Restricted Set   Controls in Full Set       Altonji et al. (2005)   Oster (2019) β∗

Panel A: Full Sample

FE                           FE, Prod, Geog, Spatial    -3.69                   -0.38
FE, Prod                     FE, Prod, Geog, Spatial    -4.26                   -0.34

Panel B: Sibling Sample

FE                           FE, Prod, Geog, Spatial    -1.35                   -2.77
FE, Prod                     FE, Prod, Geog, Spatial    -1.51                   -2.64

This table reports Altonji et al.’s (2005) measure of selection on unobservables and Oster’s (2019) β∗ lower bound estimate of the coefficient for the variable of interest: ∆ in land productivity variation (post-1500). The dependent variable in each regression is lexicostatistical linguistic distance. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. FE = language family and country fixed effects, Prod = productivity controls, Geog = geography controls and Spatial = spatial controls. These variables are equivalent to the baseline set described in Table 1.
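For readers unfamiliar with the first statistic, the sketch below illustrates one common way a ratio of this kind is computed in applied work: the estimate with the full control set divided by the gap between the restricted and full estimates. This formula is an assumption about the implementation, since the table note does not spell it out. The function name and the coefficient values are hypothetical and are not taken from the paper's regressions.

```python
def altonji_ratio(beta_restricted: float, beta_full: float) -> float:
    """Selection ratio beta_full / (beta_restricted - beta_full).

    beta_restricted: coefficient estimated with the restricted control set.
    beta_full: coefficient estimated with the full control set.
    """
    gap = beta_restricted - beta_full
    if gap == 0:
        raise ValueError("restricted and full estimates coincide; ratio is undefined")
    return beta_full / gap


# Hypothetical coefficients chosen for illustration only.
example = altonji_ratio(beta_restricted=-0.08, beta_full=-0.10)
print(round(example, 2))
```

In this hypothetical case the ratio is negative because adding controls moves the estimate further from zero rather than toward it.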


Table A13: Border-Level Regressions: Placebo Tests (100km Buffer Zone)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

                                              (1)       (2)               (3)                (4)

Panel A: Full Sample

∆ in land productivity variation (post-1500)  -0.065*   -0.009            -0.044             -0.019
                                              (0.034)   (0.092)           (0.078)            (0.092)

Land productivity variation (pre-1500)        -0.029    0.054             0.044              0.058
                                              (0.027)   (0.044)           (0.046)            (0.068)

Controls                                      Yes       Yes               Yes                Yes
Language Family FE                            Yes       Yes               Yes                Yes
Country FE                                    Yes       Yes               Yes                Yes

Adjusted R²                                   0.37      0.44              0.40               0.51
Observations                                  7291      708               711                1084

Sample Restriction                            None      ≥ 90% Elevation   ≥ 90% Ruggedness   New World

Panel B: Sibling Sample

∆ in land productivity variation (post-1500)  -0.154**  0.205             0.010              0.362
                                              (0.070)   (0.177)           (0.287)            (0.766)

Land productivity variation (pre-1500)        -0.140**  0.122             -0.044             0.278
                                              (0.068)   (0.181)           (0.224)            (0.507)

Controls                                      Yes       Yes               Yes                Yes
Language Family FE                            Yes       Yes               Yes                Yes
Country FE                                    Yes       Yes               Yes                Yes

Adjusted R²                                   0.39      0.49              0.46               0.42
Observations                                  1497      142               145                108

Sample Restriction                            None      ≥ 90% Elevation   ≥ 90% Ruggedness   New World

This table tests the sensitivity of baseline estimates by limiting the full and sibling samples to observations at or above the 90th percentile in elevation (column 2) or ruggedness (column 3), or to New World observations (column 4). The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Control variables are identical to the complete set of baseline control variables used in Table 1. Standard errors are double-clustered at the level of each language group and are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.


Table A14: Border-Level Regressions: Spatially Correlated Standard Errors (100km Buffer Zone)

Dependent Variable: Lexicostatistical Linguistic Distance ∈ (0, 1)

                                              (1)       (2)       (3)       (4)       (5)

Panel A: Full Sample

∆ in land productivity variation (post-1500)  -0.065*   -0.065*   -0.065*   -0.065**  -0.065*
                                              (0.035)   (0.039)   (0.036)   (0.030)   (0.034)

Land productivity variation (pre-1500)        -0.029    -0.029    -0.029    -0.029    -0.029
                                              (0.028)   (0.030)   (0.028)   (0.025)   (0.030)

Centered R²                                   0.40      0.40      0.40      0.40      0.40
Observations                                  7291      7291      7291      7291      7291
Controls                                      Yes       Yes       Yes       Yes       Yes
Language Family FE                            Yes       Yes       Yes       Yes       Yes
Country FE                                    Yes       Yes       Yes       Yes       Yes

Spatial Correlation Cutoff                    100km     200km     300km     400km     500km

Panel B: Sibling Sample

∆ in land productivity variation (post-1500)  -0.154**  -0.154**  -0.154**  -0.154**  -0.154**
                                              (0.073)   (0.075)   (0.075)   (0.071)   (0.072)

Land productivity variation (pre-1500)        -0.140**  -0.140**  -0.140**  -0.140**  -0.140**
                                              (0.068)   (0.071)   (0.071)   (0.071)   (0.067)

Centered R²                                   0.46      0.46      0.46      0.46      0.46
Observations                                  1497      1497      1497      1497      1497
Controls                                      Yes       Yes       Yes       Yes       Yes
Language Family FE                            Yes       Yes       Yes       Yes       Yes
Country FE                                    Yes       Yes       Yes       Yes       Yes

Spatial Correlation Cutoff                    100km     200km     300km     400km     500km

This table establishes that the negative and statistically significant effect of variation in land productivity on a language pair’s lexicostatistical linguistic distance is robust to adjusting for spatial correlation at various distance thresholds, across both the full sample and sibling sample. The unit of observation is a 100 km buffer zone along the contiguous border segment of each language pair. Control variables are identical to the baseline set of control variables used in Table 1. * p < 0.10, ** p < 0.05, *** p < 0.01.


B Computerized Lexicostatistical Similarity

The computerized approach to estimating lexicostatistical distances was developed as part of the Automated Similarity Judgment Program (ASJP), a project run by linguists at the Max Planck Institute for Evolutionary Anthropology. To begin, a list of 40 implied meanings (i.e., words) is compiled for each language so that the lexical similarity of any language pair can be compared. Swadesh (1952) first introduced the notion of a basic list of words believed to be universal across nearly all world languages. When a word is universal across world languages, its implied meaning, and therefore any estimate of linguistic distance, is independent of culture and geography. From here on I refer to this 40-word list as a Swadesh list, as it is commonly called.29

For each language the 40 words are transcribed into a standardized orthography called ASJPcode, a phonetic ASCII alphabet consisting of 34 consonants and 7 vowels. A standardized alphabet restricts variation across languages to phonological differences only. Meanings are then transcribed according to pronunciation before language distances are estimated.

I use a variant of the Levenshtein distance algorithm, which in its simplest form calculates the minimum number of edits (insertions, deletions, or substitutions of a single character) necessary to transform the spelling of a word in one language into its spelling in another. In particular, I use the normalized and divided Levenshtein distance estimator proposed by Bakker et al. (2009).30 Denote $\mathrm{LD}(\alpha_i, \beta_i)$ as the raw Levenshtein distance for word $i$ of languages $\alpha$ and $\beta$. Each word $i$ comes from the aforementioned Swadesh list. Let the length of this list be $M$, so $1 \le i \le M$.31 The algorithm is run to calculate $\mathrm{LD}(\alpha_i, \beta_i)$ for each word in the $M$-word Swadesh list across each language pair. To correct for the fact that longer words will often demand more edits, the distance is normalized according to word length:

$$\mathrm{LDN}(\alpha_i, \beta_i) = \frac{\mathrm{LD}(\alpha_i, \beta_i)}{L(\alpha_i, \beta_i)} \tag{4}$$

where $L(\alpha_i, \beta_i)$ is the length of the longer of the two spellings $\alpha_i$ and $\beta_i$ of word $i$. $\mathrm{LDN}(\alpha_i, \beta_i)$ is the normalized Levenshtein distance, which represents a percentage estimate of dissimilarity between languages $\alpha$ and $\beta$ for word $i$. For each language pair, $\mathrm{LDN}(\alpha_i, \beta_i)$ is calculated for each word of the $M$-word Swadesh list. Then the average lexical distance for each language pair is calculated by averaging across all $M$ words for those two languages. The average distance

29 A recent paper by Holman et al. (2009) shows that the 40-item list employed here, deduced from rigorous testing for word stability across all languages, yields results at least as good as those of the commonly used 100-item list proposed by Swadesh (1955).

30 I use Taraka Rama’s (2013) Python program for string distance calculations.

31 Wichmann et al. (2010) point out that in some instances not every word on the 40-word list exists for a language, but in all cases a minimum of 70 percent of the 40-word list exists.


between two languages is then

$$\mathrm{LDN}(\alpha, \beta) = \frac{1}{M} \sum_{i=1}^{M} \mathrm{LDN}(\alpha_i, \beta_i). \tag{5}$$
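As a rough illustration of equations (4) and (5), the following Python sketch computes the raw Levenshtein distance, its normalization by word length, and the average over an aligned word list. It is not the program referred to in footnote 30. The function names and the toy strings are hypothetical, and the inputs are assumed to be already transcribed into ASJPcode, aligned by meaning, and stripped of any meanings missing in either language.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Raw distance LD: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ldn(a: str, b: str) -> float:
    """Equation (4): LD divided by the length of the longer spelling."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_ldn(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Equation (5): average LDN over the M meanings shared by both lists."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("word lists must be non-empty and aligned by meaning")
    return sum(ldn(a, b) for a, b in zip(words_a, words_b)) / len(words_a)


if __name__ == "__main__":
    # Toy strings for illustration only, not real ASJPcode transcriptions.
    print(mean_ldn(["abc", "de"], ["abd", "de"]))
```

Keeping only two rows of the dynamic-programming table is a design choice that limits memory use; it does not change the resulting distance.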

A second normalization procedure is then adopted to account for phonological similarity that is the result of coincidence. This adjustment is done to correct for accidental similarity in the sound structure of two languages that is unrelated to their historical relationship. The motivation for this step is that no prior assumptions need to be made about historical versus chance relationship. To implement this normalization the defined distance $\mathrm{LDN}(\alpha, \beta)$ is divided by the global distance between the two languages. To see this, first denote the global distance between languages $\alpha$ and $\beta$ as

$$\mathrm{GD}(\alpha, \beta) = \frac{1}{M(M-1)} \sum_{i \neq j}^{M} \mathrm{LD}(\alpha_i, \beta_j), \tag{6}$$

where $\mathrm{GD}(\alpha, \beta)$ is the global (average) distance between two languages excluding all word comparisons of the same meaning. This estimates the similarity of languages $\alpha$ and $\beta$ only in terms of the ordering and frequency of characters, and is independent of meaning. The second normalization procedure is then implemented by weighting equation (5) with equation (6) as follows:

$$\mathrm{LDND}(\alpha, \beta) = \frac{\mathrm{LDN}(\alpha, \beta)}{\mathrm{GD}(\alpha, \beta)}. \tag{7}$$

$\mathrm{LDND}(\alpha, \beta)$ is the final measure of linguistic distance, referred to as the normalized and divided Levenshtein distance (LDND). This measure yields a percentage estimate of the language dissimilarity between $\alpha$ and $\beta$. In instances where two languages have many accidental similarities in terms of ordering and frequency of characters, the second normalization procedure can yield percentage estimates larger than 100 percent by construction, so I divide $\mathrm{LDND}(\alpha, \beta)$ by its maximum value to normalize the measure as a continuous [0, 1] variable.
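The second normalization in equations (6) and (7), followed by the final rescaling, might be sketched as below. This is an illustration under assumptions rather than the procedure actually run: word-level distances are taken as already computed, `same_meaning[i]` is assumed to hold the distance for meaning i from equation (4), and `cross[i][j]` is assumed to hold the distance between word i of one language and word j of the other, so the caller decides whether those cross-meaning entries are raw or length-normalized. All names are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def global_distance(cross: Matrix) -> float:
    """Equation (6): average of cross[i][j] over all pairs with i != j.
    Diagonal entries (same-meaning comparisons) are ignored."""
    m = len(cross)
    if m < 2 or any(len(row) != m for row in cross):
        raise ValueError("cross must be a square matrix with M >= 2")
    total = sum(cross[i][j] for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))


def ldnd(same_meaning: Sequence[float], cross: Matrix) -> float:
    """Equation (7): the average same-meaning distance from equation (5)
    divided by the global distance from equation (6)."""
    if len(same_meaning) != len(cross):
        raise ValueError("inputs must cover the same M meanings")
    gd = global_distance(cross)
    if gd == 0:
        raise ValueError("global distance is zero; the ratio is undefined")
    return (sum(same_meaning) / len(same_meaning)) / gd


def rescale_by_max(values: Sequence[float]) -> List[float]:
    """Divide every pairwise LDND by the sample maximum so the measure
    lies in [0, 1]."""
    top = max(values)
    if top <= 0:
        raise ValueError("maximum must be positive")
    return [v / top for v in values]
```

Because the rescaling divides by the sample maximum, the resulting values depend on which language pairs are in the sample.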


C Data Description and Sources

Ethnolinguistic groups: Georeferenced group data comes from the World Language Mapping System (WLMS). These data map information from each ethnolinguistic group in the Ethnologue to the corresponding polygon. When constructing buffer zones around group borders, I use Goode’s homolosine map projection.
Source: http://www.worldgeodatasets.com/language/

Land productivity: I use the caloric suitability index (CSI) from Galor and Ozak (2016). CSI is a measure of land productivity that reflects the potential caloric output of a grid cell. It’s based on the Global Agro-Ecological Zones (GAEZ) project of the Food and Agriculture Organization (FAO). A variety of related measures are available: in the reported estimates I use both the pre-1500 and post-1500 average land productivity measures that include cells with zero productivity. Land productivity is measured as average pre-1500 CSI within each buffer zone using Goode’s homolosine map projection to minimize area distortions. I similarly calculate post-1500 CSI and calculate the change in land productivity as the difference between post-1500 CSI and pre-1500 CSI within each buffer zone. These measures are then converted from millions of kilocalories to thousands of kilocalories by dividing by 1,000.
Source: http://omerozak.com/csi and ArcGIS calculations.

Variation in land productivity: I use the caloric suitability index (CSI) from Galor and Ozak (2016). CSI is a measure of land productivity that reflects the potential caloric output of a grid cell. It’s based on the Global Agro-Ecological Zones (GAEZ) project of the Food and Agriculture Organization (FAO). Variation in land productivity is measured as the standard deviation of pre-1500 CSI within each buffer zone using Goode’s homolosine map projection. I similarly calculate the post-1500 CSI standard deviation, and calculate the change in variation in land productivity in the post-1500 period using Stata. These measures are then converted from millions of kilocalories to thousands of kilocalories by dividing by 1,000.
Source: http://omerozak.com/csi and ArcGIS calculations.
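The two land productivity entries above reduce each buffer zone to a mean and a standard deviation of its grid-cell CSI values, and then take post-1500 minus pre-1500 differences. A minimal Python sketch of that arithmetic follows, assuming the cell values for one buffer zone have already been extracted (the source does this step in ArcGIS). The function names and cell values are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which variant is used.

```python
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def zone_stats(cells: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of CSI across the grid cells of one
    buffer zone. Cells with zero productivity are kept, as in the text.
    Population standard deviation is an assumption of this sketch."""
    return fmean(cells), pstdev(cells)


def change_in_stats(pre: Sequence[float],
                    post: Sequence[float]) -> Tuple[float, float]:
    """Post-1500 minus pre-1500 change in the zone mean and in the zone
    standard deviation."""
    pre_mean, pre_sd = zone_stats(pre)
    post_mean, post_sd = zone_stats(post)
    return post_mean - pre_mean, post_sd - pre_sd


if __name__ == "__main__":
    # Made-up cell values for a single buffer zone, for illustration only.
    pre_1500 = [0.0, 1200.0, 1800.0, 3000.0]
    post_1500 = [0.0, 2000.0, 2600.0, 5400.0]
    print(zone_stats(pre_1500))
    print(change_in_stats(pre_1500, post_1500))
```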

Elevation: Elevation data comes from the National Geophysical Data Center (NGDC) at the National Oceanic and Atmospheric Administration (NOAA). Goode’s homolosine map projection is used to minimize area distortion.
Source: www.ngdc.noaa.gov/mgg/topo/globe.html and ArcGIS calculations.

Ruggedness: Ruggedness is measured as the standard deviation of the NGDC elevation data. Goode’s homolosine map projection is used to minimize area distortion.
Source: www.ngdc.noaa.gov/mgg/topo/globe.html and ArcGIS calculations.


Precipitation: Precipitation data comes from the WorldClim – Global Climate Database. I measure average precipitation and precipitation variation as the mean and standard deviation within a buffer zone, respectively. Goode’s homolosine map projection is used to minimize area distortion.
Source: http://www.worldclim.org/current and ArcGIS calculations.

Temperature: Temperature data comes from the WorldClim – Global Climate Database. I measure average temperature and temperature variation as the mean and standard deviation within a buffer zone, respectively. Goode’s homolosine map projection is used to minimize area distortion.
Source: http://www.worldclim.org/current and ArcGIS calculations.

Malaria Ecology Index: I sourced the Malaria Ecology Index data from Kiszewski et al. (2004). The index measures the prevalence of malaria for each 0.5 × 0.5 degree grid cell on earth. Goode’s homolosine map projection is used to minimize area distortion.
Source: https://sites.google.com/site/gordoncmccord//datasets and ArcGIS calculations.

Log distance between group centroids: I use ArcGIS to calculate the centroid of each group in an adjacent language group pair. I then calculate the great-circle distance between group centroids using the haversine formula.
Source: Calculated using ArcGIS and Stata.
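For reference, the haversine calculation named above can be sketched in Python as follows. This is a generic implementation and not the Stata code used for the reported estimates. The Earth radius constant and the function names are assumptions of the sketch, and coordinates are taken to be in decimal degrees.

```python
import math

# Mean Earth radius in kilometers. The value is an assumption of this
# sketch; the source does not state which radius is used.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in
    decimal degrees, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against rounding pushing h marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def log_centroid_distance(lat1: float, lon1: float,
                          lat2: float, lon2: float) -> float:
    """Natural log of the great-circle distance between two centroids.
    Undefined when the two centroids coincide."""
    return math.log(haversine_km(lat1, lon1, lat2, lon2))
```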

Log distance to coast: Georeferenced data on coastlines comes from Natural Earth. I use the Fuller projection in ArcGIS to minimize distance distortions, and calculate the distance in kilometers from each buffer zone centroid to the nearest coast. I then calculate the natural log of distance to the coast.
Source: https://www.naturalearthdata.com/downloads/110m-physical-vectors/110m-coastline/ and ArcGIS calculations.

Log distance to border: Georeferenced data on country borders comes from the Digital Chart of the World provided by the WLMS. I use the Fuller projection in ArcGIS to minimize distance distortions, and calculate the distance in kilometers from each buffer zone centroid to the nearest border. I then calculate the natural log of distance to the border.
Source: http://www.worldgeodatasets.com/language/ and ArcGIS calculations.

Log distance to lake: Georeferenced data on lakes comes from the Global Self-consistent Hierarchical High-resolution Geography (GSHHG), Version 2.3.7 June 15, 2017. I project the full resolution Level 2 shapefile (“Lakes”) into the Fuller projection to minimize distance distortions, and calculate the distance in kilometers from each buffer centroid to the nearest lake. I then calculate the natural log of (1 + distance) to adjust for zero-distance observations.
Source: https://www.ngdc.noaa.gov/mgg/shorelines/ and ArcGIS calculations.

Log distance to major river: Georeferenced data on major rivers comes from the Global Self-consistent Hierarchical High-resolution Geography (GSHHG), Version 2.3.7 June 15, 2017. I project the full resolution shapefile for river size categories 1-3 into the Fuller projection to minimize distance distortions, and calculate the distance in kilometers from each buffer centroid to the nearest major river. I then calculate the natural log of distance to a major river.
Source: https://www.ngdc.noaa.gov/mgg/shorelines/ and ArcGIS calculations.

Log distance to minor river: Georeferenced data on minor rivers comes from the Global Self-consistent Hierarchical High-resolution Geography (GSHHG), Version 2.3.7 June 15, 2017. I project the full resolution shapefile for river size categories 4 and 5 into the Fuller projection to minimize distance distortions, and calculate the distance in kilometers from each buffer centroid to the nearest minor river. I then calculate the natural log of distance to a minor river.
Source: https://www.ngdc.noaa.gov/mgg/shorelines/ and ArcGIS calculations.

Area of language pair: Total area of neighboring ethnolinguistic group polygons, measured in square kilometers.
Source: Calculated using ArcGIS.

Population of language pair: Ethnolinguistic group population comes from the WLMS Ethnologue database. Population of language pair is the sum of both groups’ populations.
Source: Calculated using Stata.

Latitude and Longitude difference: Latitude and longitude coordinates for an ethnolinguistic group correspond to the group’s centroid. Differences are calculated by taking the absolute difference between the centroids of a neighboring pair.
Source: Calculated using ArcGIS and Stata.

Distance to nearest Old World trade route (pre-600 and pre-1800 CE): Digitized maps of Old World trade routes come from Michalopoulos et al. (2016, 2018). Great-circle distance is calculated between the border buffer zone centroid and the nearest Old World route using the haversine formula.
Source: Calculated using ArcGIS and Stata.


Historical dependence on agricultural subsistence: 0-9 scale indexing an ethnic group’s historical reliance on agriculture for subsistence. The index equals 0 for groups with 0-5% dependence, 1 for 6-15% dependence, 2 for 16-25% dependence, 3 for 26-35% dependence, 4 for 36-45% dependence, 5 for 46-55% dependence, 6 for 56-65% dependence, 7 for 66-75% dependence, 8 for 76-85% dependence and 9 for 86-100% dependence.
Source: Variable coded in the Ethnographic Atlas as v5 (Murdock, 1967).

Historical dependence on animal husbandry: 0-9 scale indexing an ethnic group’s historical reliance on animal husbandry for subsistence. The index equals 0 for groups with 0-5% dependence, 1 for 6-15% dependence, 2 for 16-25% dependence, 3 for 26-35% dependence, 4 for 36-45% dependence, 5 for 46-55% dependence, 6 for 56-65% dependence, 7 for 66-75% dependence, 8 for 76-85% dependence and 9 for 86-100% dependence.
Source: Variable coded in the Ethnographic Atlas as v4 (Murdock, 1967).

Historical dependence on fishing: 0-9 scale indexing an ethnic group’s historical reliance on fishing for subsistence. The index equals 0 for groups with 0-5% dependence, 1 for 6-15% dependence, 2 for 16-25% dependence, 3 for 26-35% dependence, 4 for 36-45% dependence, 5 for 46-55% dependence, 6 for 56-65% dependence, 7 for 66-75% dependence, 8 for 76-85% dependence and 9 for 86-100% dependence.
Source: Variable coded in the Ethnographic Atlas as v3 (Murdock, 1967).
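The 0-9 coding shared by the three subsistence variables above can be written as a small lookup. The Python sketch below is only an illustration of the bands listed in the text. The function names are hypothetical, and assigning non-integer percentages to the next band up is an assumption, since the text defines the bands for whole percentages only.

```python
from bisect import bisect_left
from typing import Tuple

# Upper bound (in percent) of index values 0 through 8, as listed in the
# text; index 9 covers everything above 85 and up to 100.
_UPPER_BOUNDS = [5, 15, 25, 35, 45, 55, 65, 75, 85]


def dependence_index(percent: float) -> int:
    """Map a percentage dependence on a subsistence mode to the 0-9 index."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    return bisect_left(_UPPER_BOUNDS, percent)


def dependence_band(index: int) -> Tuple[int, int]:
    """Return the (low, high) percentage band represented by an index."""
    if not 0 <= index <= 9:
        raise ValueError("index must lie between 0 and 9")
    low = 0 if index == 0 else _UPPER_BOUNDS[index - 1] + 1
    high = 100 if index == 9 else _UPPER_BOUNDS[index]
    return low, high
```

Since the source variables are already stored as coded index values, `dependence_band` is likely the more useful direction in practice: it recovers the percentage range that a coded value stands for.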
