Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

36
Automated reconstruction of ancient languages using probabilistic models of sound change Alexandre Bouchard-Côté a,1 , David Hall b , Thomas L. Grifths c , and Dan Klein b a Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; b Computer Science Division and c Department of Psychology, University of California, Berkeley, CA 94720 Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review March 19, 2012) One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient lan- guages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algo- rithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruc- tion of a set of protolanguages. Over 85% of the systems recon- structions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Be- ing able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hy- pothesis about the relationship between the function of a sound and its probability of changing that was rst proposed in 1955. ancestral | computational | diachronic R econstruction of the protolanguages from which modern lan- guages are descended is a difcult problem, occupying histor- ical linguists since the late 18th century. To solve this problem linguists have developed a labor-intensive manual procedure called the comparative method (1), drawing on information about the sounds and words that appear in many modern languages to hy- pothesize protolanguage reconstructions even when no written records are available, opening one of the few possible windows to prehistoric societies (2, 3). Reconstructions can help in un- derstanding many aspects of our past, such as the technological level (2), migration patterns (4), and scripts (2, 5) of early societies. Comparing reconstructions across many languages can help reveal the nature of language change itself, identifying which aspects of language are most likely to change over time, a long-standing question in historical linguistics (6, 7). In many cases, direct evidence of the form of protolanguages is not available. Fortunately, owing to the worlds considerable lin- guistic diversity, it is still possible to propose reconstructions by leveraging a large collection of extant languages descended from a single protolanguage. Words that appear in these modern lan- guages can be organized into cognate sets that contain words sus- pected to have a shared ancestral form (Table 1). The key observation that makes reconstruction from these data possible is that languages seem to undergo a relatively limited set of regular sound changes, each applied to the entire vocabulary of a language at specic stages of its history (1). Still, several factors make re- construction a hard problem. For example, sound changes are often context sensitive, and many are string insertions and deletions. In this paper, we present an automated system capable of large-scale reconstruction of protolanguages directly from words that appear in modern languages. This system is based on a probabilistic model of sound change at the level of phonemes, building on work on the reconstruction of ancestral sequences and alignment in computational biology (812). Several groups have recently explored how methods from computational biology can be applied to problems in historical linguistics, but such work has focused on identifying the relationships between languages (as might be expressed in a phylogeny) rather than reconstructing the languages themselves (1318). Much of this type of work has been based on binary cognate or structural matrices (19, 20), which discard all information about the form that words take, simply in- dicating whether they are cognate. Such models did not have the goal of reconstructing protolanguages and consequently use a rep- resentation that lacks the resolution required to infer ancestral phonetic sequences. Using phonological representations allows us to perform reconstruction and does not require us to assume that cognate sets have been fully resolved as a preprocessing step. Rep- resenting the words at each point in a phylogeny and having a model of how they change give a way of comparing different hypothesized cognate sets and hence inferring cognate sets automatically. The focus on problems other than reconstruction in previous computational approaches has meant that almost all existing protolanguage reconstructions have been done manually. How- ever, to obtain more accurate reconstructions for older languages, large numbers of modern languages need to be analyzed. The Proto-Austronesian language, for instance, has over 1,200 de- scendant languages (21). All of these languages could potentially increase the quality of the reconstructions, but the number of possibilities increases considerably with each language, making it difcult to analyze a large number of languages simultaneously. The few previous systems for automated reconstruction of pro- tolanguages or cognate inference (2224) were unable to handle this increase in computational complexity, as they relied on de- terministic models of sound change and exact but intractable algorithms for reconstruction. Being able to reconstruct large numbers of languages also makes it possible to provide quantitative answers to questions about the factors that are involved in language change. We dem- onstrate the potential for automated reconstruction to lead to novel results in historical linguistics by investigating a specic hypothesized regularity in sound changes called functional load. The functional load hypothesis, introduced in 1955, asserts that sounds that play a more important role in distinguishing words are less likely to change over time (6). Our probabilistic reconstruction of hundreds of protolanguages in the Austronesian phylogeny provides a way to explore this question quantitatively, producing compelling evidence in favor of the functional load hypothesis. Author contributions: A.B.-C., D.H., T.L.G., and D.K. designed research; A.B.-C. and D.H. performed research; A.B.-C. and D.H. contributed new reagents/analytic tools; A.B.-C., D.H., T.L.G., and D.K. analyzed data; and A.B.-C., D.H., T.L.G., and D.K. wrote the paper. The authors declare no conict of interest. This article is a PNAS Direct Submission. N.C. is a guest editor invited by the Editorial Board. Freely available online through the PNAS open access option. 1 To whom correspondence should be addressed. E-mail: [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1204678110/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1204678110 PNAS Early Edition | 1 of 6 COMPUTER SCIENCES PSYCHOLOGICAL AND COGNITIVE SCIENCES

description

Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Transcript of Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Page 1: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Automated reconstruction of ancient languages usingprobabilistic models of sound changeAlexandre Bouchard-Côtéa,1, David Hallb, Thomas L. Griffithsc, and Dan Kleinb

aDepartment of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; bComputer Science Division and cDepartment of Psychology,University of California, Berkeley, CA 94720

Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for reviewMarch 19, 2012)

One of the oldest problems in linguistics is reconstructing thewords that appeared in the protolanguages from which modernlanguages evolved. Identifying the forms of these ancient lan-guages makes it possible to evaluate proposals about the natureof language change and to draw inferences about human history.Protolanguages are typically reconstructed using a painstakingmanual process known as the comparative method. We present afamily of probabilistic models of sound change as well as algo-rithms for performing inference in these models. The resultingsystem automatically and accurately reconstructs protolanguagesfrommodern languages. We apply this system to 637 Austronesianlanguages, providing an accurate, large-scale automatic reconstruc-tion of a set of protolanguages. Over 85% of the system’s recon-structions are within one character of the manual reconstructionprovided by a linguist specializing in Austronesian languages. Be-ing able to automatically reconstruct large numbers of languagesprovides a useful way to quantitatively explore hypotheses about thefactors determining which sounds in a language are likely to changeover time. We demonstrate this by showing that the reconstructedAustronesian protolanguages provide compelling support for a hy-pothesis about the relationship between the function of a soundand its probability of changing that was first proposed in 1955.

ancestral | computational | diachronic

Reconstruction of the protolanguages from which modern lan-guages are descended is a difficult problem, occupying histor-

ical linguists since the late 18th century. To solve this problemlinguists have developed a labor-intensivemanual procedure calledthe comparative method (1), drawing on information about thesounds and words that appear in many modern languages to hy-pothesize protolanguage reconstructions even when no writtenrecords are available, opening one of the few possible windowsto prehistoric societies (2, 3). Reconstructions can help in un-derstanding many aspects of our past, such as the technologicallevel (2), migration patterns (4), and scripts (2, 5) of early societies.Comparing reconstructions across many languages can help revealthe nature of language change itself, identifying which aspectsof language are most likely to change over time, a long-standingquestion in historical linguistics (6, 7).In many cases, direct evidence of the form of protolanguages is

not available. Fortunately, owing to the world’s considerable lin-guistic diversity, it is still possible to propose reconstructions byleveraging a large collection of extant languages descended from asingle protolanguage. Words that appear in these modern lan-guages can be organized into cognate sets that contain words sus-pected to have a shared ancestral form (Table 1). The keyobservation that makes reconstruction from these data possible isthat languages seem to undergo a relatively limited set of regularsound changes, each applied to the entire vocabulary of a languageat specific stages of its history (1). Still, several factors make re-construction a hard problem. For example, sound changes are oftencontext sensitive, and many are string insertions and deletions.In this paper, we present an automated system capable of

large-scale reconstruction of protolanguages directly from wordsthat appear in modern languages. This system is based on aprobabilistic model of sound change at the level of phonemes,

building on work on the reconstruction of ancestral sequencesand alignment in computational biology (8–12). Several groupshave recently explored how methods from computational biologycan be applied to problems in historical linguistics, but such workhas focused on identifying the relationships between languages(as might be expressed in a phylogeny) rather than reconstructingthe languages themselves (13–18). Much of this type of work hasbeen based on binary cognate or structural matrices (19, 20), whichdiscard all information about the form that words take, simply in-dicating whether they are cognate. Such models did not have thegoal of reconstructing protolanguages and consequently use a rep-resentation that lacks the resolution required to infer ancestralphonetic sequences. Using phonological representations allows usto perform reconstruction and does not require us to assume thatcognate sets have been fully resolved as a preprocessing step. Rep-resenting the words at each point in a phylogeny and having a modelof how they change give a way of comparing different hypothesizedcognate sets and hence inferring cognate sets automatically.The focus on problems other than reconstruction in previous

computational approaches has meant that almost all existingprotolanguage reconstructions have been done manually. How-ever, to obtain more accurate reconstructions for older languages,large numbers of modern languages need to be analyzed. TheProto-Austronesian language, for instance, has over 1,200 de-scendant languages (21). All of these languages could potentiallyincrease the quality of the reconstructions, but the number ofpossibilities increases considerably with each language, making itdifficult to analyze a large number of languages simultaneously.The few previous systems for automated reconstruction of pro-tolanguages or cognate inference (22–24) were unable to handlethis increase in computational complexity, as they relied on de-terministic models of sound change and exact but intractablealgorithms for reconstruction.Being able to reconstruct large numbers of languages also

makes it possible to provide quantitative answers to questionsabout the factors that are involved in language change. We dem-onstrate the potential for automated reconstruction to lead tonovel results in historical linguistics by investigating a specifichypothesized regularity in sound changes called functional load.The functional load hypothesis, introduced in 1955, asserts thatsounds that play a more important role in distinguishing words areless likely to change over time (6). Our probabilistic reconstructionof hundreds of protolanguages in the Austronesian phylogenyprovides a way to explore this question quantitatively, producingcompelling evidence in favor of the functional load hypothesis.

Author contributions: A.B.-C., D.H., T.L.G., and D.K. designed research; A.B.-C. and D.H.performed research; A.B.-C. and D.H. contributed new reagents/analytic tools; A.B.-C.,D.H., T.L.G., and D.K. analyzed data; and A.B.-C., D.H., T.L.G., and D.K. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. N.C. is a guest editor invited by the EditorialBoard.

Freely available online through the PNAS open access option.1To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1204678110/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1204678110 PNAS Early Edition | 1 of 6

COMPU

TERSC

IENCE

SPS

YCHOLO

GICALAND

COGNITIVESC

IENCE

S

Page 2: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

ModelWe use a probabilistic model of sound change and a Monte Carloinference algorithm to reconstruct the lexicon and phonology ofprotolanguages given a collection of cognate sets from modernlanguages. As in other recent work in computational historicallinguistics (13–18), we make the simplifying assumption that eachword evolves along the branches of a tree of languages, reflectingthe languages’ phylogenetic relationships. The tree’s internalnodes are languages whose word forms are not observed, and theleaves are modern languages. The output of our system is a pos-terior probability distribution over derivations. Each derivationcontains, for each cognate set, a reconstructed transcription ofancestral forms, as well as a list of sound changes describing thetransformation from parent word to child word. This represen-tation is rich enough to answer a wide range of queries that wouldnormally be answered by carrying out the comparative methodmanually, such as which sound changes were most prominentalong each branch of the tree.We model the evolution of discrete sequences of phonemes,

using a context-dependent probabilistic string transducer (8).Probabilistic string transducers efficiently encode a distributionover possible changes that a string might undergo as it changesthrough time. Transducers are sufficient to capture most types ofregular sound changes (e.g., lenitions, epentheses, and elisions) andcan be sensitive to the context in which a change takes place. Mosttypes of changes not captured by transducers are not regular (1) andare therefore less informative (e.g., metatheses, reduplications, andhaplologies). Unlike simple molecular InDel models used in com-putational biology such as the TKF91 model (25), the parameter-ization of our model is very expressive: Mutation probabilities arecontext sensitive, depending on the neighboring characters, andeach branch has its own set of parameters. This context-sensitiveand branch-specific parameterization plays a central role in oursystem, allowing explicit modeling of sound changes.Formally, let τ be a phylogenetic tree of languages, where each

language is linked to the languages that descended from it. Insuch a tree, the modern languages, whose word forms will beobserved, are the leaves of τ. The most recent common ancestorof these modern languages is the root of τ. Internal nodes of thetree (including the root) are protolanguages with unobservedword forms. Let L denote all languages, modern and otherwise.All word forms are assumed to be strings in the InternationalPhonetic Alphabet (IPA).We assume that word forms evolve along the branches of the

tree τ. However, it is usually not the case that a word belongingto each cognate set exists in each modern language—words arelost or replaced over time, meaning that words that appear in theroot languages may not have cognate descendants in the lan-guages at the leaves of the tree. For the moment, we assume thereis a known list of C cognate sets. For each c∈ f1; . . . ;Cg let LðcÞ

denote the subset of modern languages that have a word form inthe cth cognate set. For each set c∈ f1; . . . ;Cg and each languageℓ∈LðcÞ, we denote the modern word form by wcℓ. For cognate setc, only the minimal subtree τðcÞ containing LðcÞ and the root isrelevant to the reconstruction inference problem for that set.Our model of sound change is based on a generative process

defined on this tree. From a high-level perspective, the genera-tive process is quite simple. Let c be the index of the currentcognate set, with topology τðcÞ. First, a word is generated for theroot of τðcÞ, using an (initially unknown) root language model(i.e., a probability distribution over strings). The words that ap-pear at other nodes of the tree are generated incrementally, usinga branch-specific distribution over changes in strings to generateeach word from the word in the language that is its parent in τðcÞ.Although this distribution differs across branches of the tree,making it possible to estimate the pattern of changes involved inthe transition from one language to another, it remains the samefor all cognate sets, expressing changes that apply stochasticallyto all words. The probabilities of substitution, insertion and de-letion are also dependent on the context in which the changeoccurs. Further details of the distributions that were used andtheir parameterization appear in Materials and Methods.The flexibility of our model comes at the cost of having literally

millions of parameters to set, creating challenges not found inmost computational approaches to phylogenetics. Our inferencealgorithm learns these parameters automatically, using estab-lished principles from machine learning and statistics. Specifi-cally, we use a variant of the expectation-maximization algorithm(26), which alternates between producing reconstructions on thebasis of the current parameter estimates and updating the pa-rameter estimates on the basis of those reconstructions. Thereconstructions are inferred using an efficient Monte Carlo in-ference algorithm (27). The parameters are estimated by opti-mizing a cost function that penalizes complexity, allowing us toobtain robust estimates of large numbers of parameters. See SIAppendix, Section 1 for further details of the inference algorithm.If cognate assignments are not available, our system can be

applied just to lists of words in different languages. In this case itautomatically infers the cognate assignments as well as thereconstructions. This setting requires only two modifications tothe model. First, because cognates are not available, we index thewords by their semantic meaning (or gloss) g, and there are thusG groups of words. The model is then defined as in the previouscase, with words indexed as wgℓ. Second, the generation process isaugmented with a notion of innovation, wherein a word wgℓ′ insome language ℓ′ may instead be generated independently fromits parent word wgℓ. In this instance, the word is generated froma language model as though it were a root string. In effect, thetree is “cut” at a language when innovation happens, and so theword begins anew. The probability of innovation in any given

Table 1. Sample of reconstructions produced by the system

*Complete sets of reconstructions can be found in SI Appendix.†Randomly selected by stratified sampling according to the Levenshtein edit distance Δ.‡Levenshtein distance to a reference manual reconstruction, in this case the reconstruction of Blust (42).§The colors encode cognate sets.{We use this symbol for encoding missing data.

2 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Bouchard-Côté et al.

Page 3: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

language is initially unknown and must be learned automaticallyalong with the other branch-specific model parameters.

ResultsOur results address three questions about the performance of oursystem. First, how well does it reconstruct protolanguages? Second,howwell does it identify cognate sets? Finally, how can this approachbe used to address outstanding questions in historical linguistics?

Protolanguage Reconstructions. To test our system, we applied it toa large-scale database of Austronesian languages, the Austrone-sian Basic Vocabulary Database (ABVD) (28). We used a pre-viously established phylogeny for these languages, the Ethnologuetree (29) (we also describe experiments with other trees in Fig. 1).For this first test of our system we also used the cognate setsprovided in the database. The dataset contained 659 languages atthe time of download (August 7, 2010), including a few languagesoutside the Austronesian family and some manually reconstructedprotolanguages used for evaluation. The total data comprised142,661 word forms and 7,708 cognate sets. The goal was to re-construct the word in each protolanguage that corresponded toeach cognate set and to infer the patterns of sound changes alongeach branch in the phylogeny. See SI Appendix, Section 2 forfurther details of our simulations.We used the Austronesian dataset to quantitatively evaluate the

performance of our system by comparing withheld words fromknown languages with automatic reconstructions of those words.The Levenshtein distance between the held-out and reconstructedforms provides a measure of the number of errors in thesereconstructions. We used this measure to show that using morelanguages helped reconstruction and also to assess the overallperformance of our system. Specifically, we compared the system’serror rate on the ancestral reconstructions to a baseline and also tothe amount of divergence between the reconstructions of twolinguists (Fig. 1A). Given enough data, the system can achievereconstruction error rates close to the level of disagreement be-tween manual reconstructions. In particular, most reconstructionsperfectly agree with manual reconstructions, and only a few con-tain big errors. Refer to Table 1 for examples of reconstructions.See SI Appendix, Section 3 for the full lists.We also present in Fig. 1B the effect of the tree topology on

reconstruction quality, reiterating the importance of using in-formative topologies for reconstruction. In Fig. 1C, we show thatthe accuracy of our method increases with the number of ob-served Oceanic languages, confirming that large-scale inferenceis desirable for automatic protolanguage reconstruction: Recon-struction improved statistically significantly with each increase

except from 32 to 64 languages, where the average edit distanceimprovement was 0.05.For comparison, we also evaluated previous automatic re-

construction methods. These previous methods do not scale tolarge datasets so we performed comparisons on smaller subsetsof the Austronesian dataset. We show in SI Appendix, Section 2that our method outperforms these baselines.We analyze the output of our system in more depth in Fig. 2

A–C, which shows the system learned a variety of realistic soundchanges across the Austronesian family (30). In Fig. 2D, we showthe most frequent substitution errors in the Proto-Austronesianreconstruction experiments. See SI Appendix, Section 5 fordetails and similar plots for the most common incorrect inser-tions and deletions.

Cognate Recovery. Previous reconstruction systems (22) requiredthat cognate sets be provided to the system. However, the crea-tion of these large cognate databases requires considerable an-notation effort on the part of linguists and often requires that atleast some reconstruction be done by hand. To demonstrate thatour model can accurately infer cognate sets automatically, weused a version of our system that learns which words are cognate,starting only from raw word lists and their meanings. This systemuses a faster but lower-fidelity model of sound change to infercorrespondences. We then ran our reconstruction system oncognate sets that our cognate recovery system found. See SI Ap-pendix, Section 1 for details.This version of the system was run on all of the Oceanic lan-

guages in the ABVD, which comprise roughly half of the Aus-tronesian languages. We then evaluated the pairwise precision(the fraction of cognate pairs identified by our system that are alsoin the set of labeled cognate pairs), pairwise recall (the fraction oflabeled cognate pairs identified by our system), and pairwise F1measure (defined as the harmonic mean of precision and recall)for the cognates found by our system against the known cognatesthat are encoded in the ABVD. We also report cluster purity,which is the fraction of words that are in a cluster whose knowncognate group matches the cognate group of the cluster. See SIAppendix, Section 2.3 for a detailed description of the metrics.Using these metrics, we found that our system achieved a pre-

cision of 0.844, recall of 0.621, F1 of 0.715, and cluster purity of0.918. Thus, over 9 of 10 words are correctly grouped, and oursystem errs on the side of undergrouping words rather than clus-tering words that are not cognates. Because the null hypothesis inhistorical linguistics is to deem words to be unrelated unlessproved otherwise, a slight undergrouping is the desired behavior.

Rec

onst

ruct

ion

erro

r ra

te

A

0

0.125

0.250

0.375

0.500

Randommodern

Automated reconstruction

Agreement between two linguists

YakanBajoMapun

SamalS ias iInabaknonSubanunSinSubanonSioWesternBukManoboWestManoboIliaManoboDibaManoboAtadManoboAtauManoboTigwManoboKa laManoboSaraBinukid

MaranaoIranunSasakBaliMamanwaIlonggoHiligaynonWarayWaray

ButuanonTausugJoloSurigaonon

AklanonBisCebuanoKa laganMansaka

TagalogAntBikolNagaCTagbanwaKa

TagbanwaAb

BatakPalawPalawanBat

BanggiSWPa lawano

HanunooPalauanS inghi

DayakBakat

KatinganDayakNgaju

Kadorih

MerinaMalaMaanyan

TunjungKerinc i

Me layuSara

Me layuOganLowMalay

Indones ian

BanjareseM

Ma layBahas

Me layuBrun

Minangkaba

PhanRangCh

ChruHainanCham

MokenIbanGayo

Idaan

BunduDusun

TimugonM

ur

KelabitBar

Bintulu

BerawanLon

Belait

KenyahLong

MelanauM

uk

Lahanan

EngganoMal

EngganoBan

Nias

TobaBatak

Madurese

Iv atanBas c

Itbayat

Babuyan

Iraralay

Itbayaten

Is amorong

Ivasay

Imorod

Yami

SambalBoto

Kapampanga

IfugaoBayn

Ka llahanKa

KallahanKe

Inibaloi

Pangas inan

KakidugenI

IlongotKak

IfugaoAmga

IfugaoBata

BontokGuin

BontocGuin

KankanayNo

Balangaw

KalingaGui

ItnegBinon

Gaddang

AttaPamplo

IsnegDibag

AgtaDum

agatCas

IlokanoYogyaOldJavanesW

esternMa layoPolynes ian

BugineseSo

Ma loh

MakassarTaeSToraja

SangirTabu

SangirSangilSaraBantikKa idipang

GorontaloHBolaangM

onPopaliaBonerateW

unaM

unaKa tobuTontem

boan

Tonsea

Bang

gaiW

diBa

ree

Wol

ioM

ori

Kaya

nUm

aJu

Buka

tPu

nanK

ela i

Wam

par

Yabe

m

Num

bam

iSib

Mou

kAr

ia

Lam

ogai

Mul

Amar

a

Seng

seng

Kaul

ongA

uVBi

libil

Meg

iar

Geda

ged

Mat

ukar

Bilia

u

Men

gen

Mal

euKo

veTa

rpia

Sobe

i

Kayu

pula

uKRi

wo

Kairi

ruKis

Wog

eoAli

Auhe

lawa

Salib

a

Suau

Ma i

s in

Gum

awan

a

Diod

io

Mol

ima

Ubir

Gapa

paiw

a

Wed

au

Dobu

an

Mis

ima

Kiliv

ila

Gaba

diDo

uraKu

niRo

ro

Mek

eoLala

Vilir

upu

Mot

u

Mag

oriS

out

Kand

asTang

aS iar

Kuan

ua

Patp

atar

Rov i

ana

S im

boNduk

e

Kus a

gheHoav

a

Lung

gaLuqa

Kubokota

Ughele

Ghanongga

Vangunu

Marovo

MbarekeSolosTaiof

Teop

NehanHapeNis sanHaku

Sis ingga

BabatanaTu

VarisiGhonVaris

iRirioSengga

BabatanaAvVaghua

Babatana

BabatanaLo

BabatanaKaLungaLunga

MonoMonoAluTorauUruava

MonoFauroBilurChekeHolo

MaringeKmaMaringeTatNggaoPoroMaringeLel

ZabanaKiaLaghuSamas

BlablangaBlablangaGKokotaKilokakaYsBanoniMadaraTungagTungKaraWes tNalikTiangTigakL ihirSunglMadakLamas

BarokMaututuNakanaiBilLaka la iVituYapes eMbughotuDhBugotuTalis eMoli

Talis eMalaGhariNdiGhariToloTalis e

Ma langoGhariNggaeMbirao

GhariTandaTalis ePoleGhariNggerTalis eKoo

GhariNginiLengo

LengoGhaimLengoParip

Nggela

ToambaitaKwai

LangalangaKwaio

MbaengguuLau

Mbae le lea

KwaraaeSolFata leka

LauWaladeLauNorthLonggu

SaaUkiNiMa

SaaSaaVillOroha

SaaUlawaSaa

Dorio

AreareMaas

AreareWaia

SaaAuluVil

BauroBaroo

BauroHaunu

Fagani

KahuaMami

SantaCatal

Aros iOneib

Aros iTawatAros i

Kahua

BauroPawaV

FaganiAguf

FaganiRihu

SantaAna

Tawaroga

TannaSouth

Kwamera

Lenakel

SyeErromanUra

ArakiSouthMere i

AmbrymSoutMota

PeteraraMa

PaameseSou

Mwotlap

Raga

SouthEfate

Orkon

Nguna

Namakir

Nati

Nahavaq

NeseTape

AnejomAnei

SakaoPortO

AvavaNeveei

Naman

Puluwatese

PuloAnnanW

oleaian

Satawalese

ChuukeseAKM

ortlockes

Carolinian

ChuukeseSonsoroles

Sa ipanCaroPuloAnnaW

oleai

Mokilese

PingilapesPonapeanKiribatiKus a ie

Marsha lles

NauruNelem

waJawe

CanalaIaai

NengoneDehu

Rotuman

WesternF ij

Maori

TahitiPenrhyn

Tuamotu

Manihiki

RurutuanTahitianM

oRarotongan

TahitianthM

arquesanM

angarevaHawaiian

RapanuiEasSam

oanBellonaTikopia

Emae

AnutaRennellese

FutunaAniwIfiraM

eleMUveaW

estFutunaEas t

VaeakauTauTakuu

TuvaluSikaiana

Kapingamar

NukuoroLuangiua

PukapukaTokelau

UveaEastTongan

NiueFij ianBau

Tean

uVa

noTa

nem

aBu

ma

Asum

boa

Tani

mbi

liNe

mba

oSe

imat

Wuv

ulu

Lou

Naun

aLe

ipon

S iv i

saTi

taL i

kum

Leve

iLo

niu

Mus

sau

Kas i

raIra

hGi

man

Buli

Mor

Min

yaifu

inBi

gaM

isoo

lAs War

open

Num

for

Mar

auAm

baiY

apen

Win

des i

Wan

North

Baba

rDa

wera

Dawe

Dai

Tela

Mas

bua

Imro

ing

Empl

awas

Eas t

Mas

e la

Seril

iCe

ntra

lMas

Sout

hEas

tBW

estD

amar

Tara

ngan

Ba

UjirN

Aru

Ngai

borS

Ar

Gese

rW

atub

ela

E lat

KeiB

es

Hitu

Ambo

n

SouA

man

aTe

Amah

aiPa

uloh

iAl

une

Mur

nate

nAl

Bonf

iaW

erin

ama

Buru

Nam

rol

Sobo

yoM

ambo

ruM

angg

arai

Ngad

haPo

ndok

Kodi

Wej

ewaT

ana

Bim

aSo

aSa

vuW

anuka

ka

Gaur

aNgg

au

Eas tSumban

Baliled

o

Lamboya

L ioF loresT

Ende

NageKambera

PalueNitun

Anakalang

Sekar

PulauArgun

RotiTerm

an

Atoni

Mambai

TetunTerik

Kemak

Lamalera le

LamaholotI

S ika

Kedang

Era i

TalurAputai

Perai

Tugun

Iliun

Serua

NilaTeunKis ar

Roma

Eas tDamar

LetineseYamdena

KeiTanimba

Se laru

KoiwaiIriaChamorro

Lampung

Komering

Sunda

RejangRe ja

TboliTagab

TagabiliKoronadalB

SaranganiB

BilaanKoroBilaanSara

CiuliAtayaSquliqAtay

SediqBunun

ThaoFavorlang

RukaiPaiwanKanakanabu

SaaroaTsouPuyumaKav alanBasaiCentralAmiS iraya

PazehSa is ia t

YakanBajoMapun

SamalS ias iInabaknonSubanunSinSubanonSioWesternBukManoboWestManoboIliaManoboDibaManoboAtadManoboAtauManoboTigwManoboKalaManoboSaraBinukid

MaranaoIranunSasakBaliMamanwaIlonggoHiligaynonWarayWaray

ButuanonTausugJoloSurigaonon

AklanonBisCebuanoKalaganMansaka

TagalogAntBikolNagaCTagbanwaKa

TagbanwaAb

BatakPalawPalawanBat

BanggiSWPalawano

HanunooPalauanS inghi

DayakBakat

KatinganDayakNgaju

Kadorih

MerinaMalaMaanyan

TunjungKerinc i

Me layuSara

Me layuOganLowMalay

Indones ian

BanjareseM

Ma layBahas

MelayuBrun

Minangkaba

PhanRangCh

ChruHainanCham

MokenIbanGayo

Idaan

BunduDusun

TimugonM

ur

KelabitBar

Bintulu

BerawanLon

Belait

KenyahLong

MelanauM

uk

Lahanan

EngganoMa l

EngganoBan

Nias

TobaBatak

Madures e

Iv atanBasc

Itbayat

Babuyan

Iraralay

Itbayaten

Isamorong

Ivasay

Imorod

Yami

SambalBoto

Kapampanga

IfugaoBayn

Ka llahanKa

Ka llahanKe

Inibaloi

Pangas inan

KakidugenI

IlongotKak

IfugaoAmga

IfugaoBata

BontokGuin

BontocGuin

KankanayNo

Balangaw

KalingaGui

ItnegBinon

Gaddang

AttaPamplo

IsnegDibag

AgtaDum

agatCas

IlokanoYogyaOldJavanesW

es ternMa la yoPolynes ian

BugineseSo

Ma loh

Makassa rTaeSToraja

SangirTabu

SangirSangilSa raBantikKaidipang

Goronta loHBolaangM

onPopaliaBonerateW

unaM

unaKa tobuTontem

boan

T on s ea

Bang

g ai W

diBa

ree

Wol

ioM

ori

Kay a

nUm

aJu

Buka

tPu

nanK

e la i

Wam

par

Yabe

m

Num

bam

iSib

Mou

kAr

ia

Lam

oga i

Mul

Amar

a

Seng

seng

Kaul

ongA

uVBi

libil

Meg

iar

Geda

ged

Mat

ukar

Bilia

u

Men

gen

Mal

euKo

veTa

rpia

Sobe

i

Kayu

pula

uKRi

wo

Kairi

ruKis

Wog

eoAli

Auhe

lawa

Salib

a

Suau

Ma i

s in

Gum

awan

a

Diod

io

Mol

ima

Ubir

Gapa

paiw

a

Wed

au

Dobu

an

Mis

ima

Kiliv

ila

Gaba

diDo

uraKu

niRo

ro

Mek

eoLala

Vilir

upu

Mot

u

Mag

oriS

out

Kand

asTang

aSiar

Kuan

ua

Patp

atar

Rov i

ana

S im

boNduk

e

Kus a

gheHoav

a

Lung

gaLuqa

Kubokota

Ughele

Ghanongga

Vangunu

Marovo

MbarekeSolosTaiof

Teop

NehanHapeNis sanHaku

Sis ingga

BabatanaTu

VarisiGhonVaris

iRirioSengga

BabatanaAvVaghua

Babatana

BabatanaLo

BabatanaKaLungaLunga

MonoMonoAluTorauUruava

MonoFauroBilurChekeHolo

MaringeKmaMaringeTatNggaoPoroMaringeLel

ZabanaKiaLaghuSamas

BlablangaBlablangaGKokotaKilokakaYsBanoniMadaraTungagTungKaraWes tNalikTiangTigakL ihirSunglMadakLamas

BarokMaututuNakanaiBilLaka la iVituYapes eMbughotuDhBugotuTalis eMoli

Talis eMalaGhariNdiGhariToloTalis e

Ma langoGhariNggaeMbirao

GhariTandaTalisePoleGhariNggerTalis eKoo

GhariNginiLengo

LengoGhaimLengoParip

Nggela

ToambaitaKwai

LangalangaKwaio

MbaengguuLau

Mbaele lea

KwaraaeSolFata leka

LauWa ladeLauNorthLonggu

SaaUkiNiMa

SaaSaaVillOroha

SaaUlawaSaa

Dorio

AreareMaas

AreareWaia

SaaAuluVil

BauroBaroo

BauroHaunu

Fagani

KahuaMami

SantaCatal

Aros iOneib

Aros iTawatAros i

Kahua

BauroPawaV

FaganiAguf

FaganiRihu

SantaAna

Tawaroga

TannaSouth

Kwamera

Lenakel

SyeErromanUra

ArakiSouthMere i

AmbrymSoutMota

PeteraraMa

PaameseSou

Mwotlap

Raga

SouthEfate

Orkon

Nguna

Namakir

Nati

Nahavaq

NeseTape

AnejomAnei

SakaoPortO

AvavaNeveei

Naman

Puluwatese

PuloAnnanW

oleaian

Satawalese

ChuukeseAKM

ortlockes

Carolinian

ChuukeseSonsoroles

SaipanCaroPuloAnnaW

oleai

Mokiles e

PingilapesPonapeanKiriba tiKus a ie

Marsha lles

NauruNelem

waJawe

CanalaIaai

NengoneDehu

Rotuman

WesternF ij

Maori

TahitiPenrhyn

Tuamotu

Manihiki

RurutuanTahitianM

oRarotongan

TahitianthM

arquesanM

angarevaHawaiian

RapanuiEasSam

oanBellonaTikopia

Emae

AnutaRennellese

FutunaAniwIfiraM

eleMUveaW

estFutunaEas t

VaeakauTauTakuu

Tuva luSikaiana

Kapingamar

NukuoroLuangiua

PukapukaTokelau

Uve aEa stTon gan

NiueF ij ia nB au

T ean

uVa

noT a

nem

aBu

ma

Asum

boa

Tani

mbi

liNe

mba

oSe

imat

Wuv

ulu

L ou

Naun

aLe

ipon

Sivi

saTi

taLi

kum

Leve

iLo

niu

Mus

sau

Kas i

raIra

hGi

man

Buli

Mor

Min

yaifu

inBi

gaM

isoo

lAs War

open

Num

for

Mar

auAm

baiY

apen

Win

des i

Wan

North

Baba

rDa

wera

Dawe

Dai

Tela

Mas

bua

Imro

ing

Empl

awas

Eas t

Mas

e la

Seril

iCe

ntra

lMas

Sout

hEas

tBW

estD

amar

Tara

ngan

Ba

UjirN

Aru

Ngai

borS

Ar

Gese

rW

atub

ela

Ela t

KeiB

es

Hitu

Ambo

n

SouA

man

aTe

Amah

aiPa

uloh

iAl

une

Mur

nate

nAl

Bonf

iaW

erin

ama

Buru

Nam

rol

Sobo

yoM

ambo

ruM

angg

arai

Ngad

haPo

ndok

Kodi

Wej

ewaT

ana

Bim

aSo

aSa

vuW

anuka

ka

Gaur

aNgg

au

Eas tSumban

Baliled

o

Lamboya

L ioF loresT

Ende

NageKambera

PalueNitun

Anakalang

Sekar

PulauArgun

RotiTerm

an

Atoni

Mambai

TetunTerik

Kemak

Lamalera le

LamaholotI

S ika

Kedang

Era i

TalurAputai

Perai

Tugun

Iliun

Serua

NilaTeunKis ar

Roma

Eas tDamar

LetineseYamdena

KeiTanimba

Se laru

KoiwaiIriaChamorro

Lampung

Komering

Sunda

RejangRe ja

TboliTagab

TagabiliKoronadalB

SaranganiB

BilaanKoroBilaanSara

CiuliAtayaSquliqAtay

SediqBunun

ThaoFavorlang

RukaiPaiwanKanakanabu

SaaroaTsouPuyumaKav alanBasaiCentralAmiS iraya

PazehSa is ia t

N/A

B

0 10 20 30 40 50 600.3

0.35

0.4

0.45

0.5

0.55

Number of modern languages

Rec

onst

ruct

ion

erro

r ra

te

C

Rec

onst

ruct

ion

erro

r ra

te

Effect of tree quality

Effect of tree sizeFig. 1. Quantitative validation ofreconstructions and identificationof some important factors influ-encing reconstruction quality. (A)Reconstruction error rates for abaseline (which consists of pickingone modern word at random), oursystem, and the amount of dis-agreement between two linguist’smanual reconstructions. Reconstruc-tion error rates are Levenshteindistances normalized by the meanword form length so that errorscan be compared across languages.Agreement between linguists was computed on only Proto-Oceanic because the dataset used lacked multiple reconstructions for other protolanguages. (B) Theeffect of the topology on the quality of the reconstruction. On one hand, the difference between reconstruction error rates obtained from the system that ranon an uninformed topology (first and second) and rates obtained from the system that ran on an informed topology (third and fourth) is statistically significant.On the other hand, the corresponding difference between a flat tree and a random binary tree is not statistically significant, nor is the difference between usingthe consensus tree of ref. 41 and the Ethnologue tree (29). This suggests that our method has a certain robustness to moderate topology variations. (C) Re-construction error rate as a function of the number of languages used to train our automatic reconstruction system. Note that the error is not expected togo down to zero, perfect reconstruction being generally unidentifiable. The results in A and B are directly comparable: In fact, the entry labeled “Ethnologue”in B corresponds to the green Proto-Austronesian entry in A. The results in A and B and those in C are not directly comparable because the evaluation in Cis restricted to those cognates with at least one reflex in the smallest evaluation set (to make the curve comparable across the horizontal axis of C).

Bouchard-Côté et al. PNAS Early Edition | 3 of 6

COMPU

TERSC

IENCE

SPS

YCHOLO

GICALAND

COGNITIVESC

IENCE

S

Page 4: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Because we are ultimately interested in reconstruction, we thencompared our reconstruction system’s ability to reconstruct wordsgiven these automatically determined cognates. Specifically, wetook every cognate group found by our system (run on theOceanicsubclade) with at least two words in it. Then, we automaticallyreconstructed the Proto-Oceanic ancestor of those words, usingour system. For evaluation, we then looked at the average Lev-enshtein distance from our reconstructions to the known recon-structions described in the previous sections. This time, however,we average per modern word rather than per cognate group, toprovide a fairer comparison. (Results were not substantially dif-ferent when averaging per cognate group.) Compared with re-construction from manually labeled cognate sets, automaticallyidentified cognates led to an increase in error rate of only 12.8%and with a significant reduction in the cost of curating linguisticdatabases. See SI Appendix, Fig. S1 for the fraction of words witheach Levenshtein distance for these reconstructions.

Functional Load. To demonstrate the utility of large-scale recon-struction of protolanguages, we used the output of our system toinvestigate an open question in historical linguistics. The func-tional load hypothesis (FLH), introduced 1955 (6), claims thatthe probability that a sound will change over time is related tothe amount of information provided by a sound. Intuitively, iftwo phonemes appear only in words that are differentiated fromone another by at least one other sound, then one can argue thatno information is lost if those phonemes merge together, be-cause no new ambiguous forms can be created by the merger.A first step toward quantitatively testing the FLH was taken in

1967 (7). By defining a statistic that formalizes the amount ofinformation lost when a language undergoes a certain soundchange—on the basis of the proportion of words that are dis-criminated by each pair of phonemes—it became possible toevaluate the empirical support for the FLH. However, this initial

investigation was based on just four languages and found littleevidence to support the hypothesis. This conclusion was criti-cized by several authors (31, 32) on the basis of the small numberof languages and sound changes considered, although they pro-vided no positive counterevidence.Using the output of our system, we collected sound change

statistics from our reconstruction of 637 Austronesian languages,including the probability of a particular change as estimated by oursystem. These statistics provided the information needed to givea more comprehensive quantitative evaluation of the FLH, usinga much larger sample than previous work (details in SI Appendix,Section 2.4). We show in Fig. 3 A and B that this analysis providesclear quantitative evidence in favor of the FLH. The revealedpattern would not be apparent had we not been able to reconstructlarge numbers of protolanguages and supply probabilities of dif-ferent kinds of change taking place for each pair of languages.

DiscussionWe have developed an automated system capable of large-scalereconstruction of protolanguage word forms, cognate sets, andsound change histories. The analysis of the properties of hun-dreds of ancient languages performed by this system goes farbeyond the capabilities of any previous automated system andwould require significant amounts of manual effort by linguists.Furthermore, the system is in no way restricted to applicationslike assessing the effects of functional load: It can be used asa tool to investigate a wide range of questions about the structureand dynamics of languages.In developing an automated system for reconstructing ancient

languages, it is by no means our goal to replace the carefulreconstructions performed by linguists. It should be emphasizedthat the reconstruction mechanism used by our system ignoresmany of the phenomena normally used in manual recon-structions. We have mentioned limitations due to the transducer

(521)

( 559) ( 750)

Yakan ( 632) ( 91) ( 40)Bajo ( 375)Mapun ( 560)

SamalS ias i ( 230)

Inabaknon ( 620)

( 629) ( 630)

SubanunSin ( 628)

SubanonSio ( 362) ( 123) ( 728) ( 733)

Wes ternBuk ( 370)

ManoboWes t ( 366)

ManoboIlia

( 365)

ManoboDiba

( 27) ( 363)

ManoboAtad ( 364)

ManoboAtau

( 369)

ManoboTigw

( 614)

( 367)

ManoboKala

( 368)

ManoboSara

( 79)

Binukid

( 377)

( 376)

Maranao

( 233)

Iranun

( 43)

( 576)

Sasak

( 42)

Bali

( 405) ( 118)

( 355)

Mamanwa

( 80) ( 121) ( 507)

( 225)

Ilonggo

( 210)

Hiligaynon

( 717)

WarayWaray

( 613) ( 102)

( 103)

Butuanon

( 662)

TausugJolo

( 635)

Surigaonon

( 3)

AklanonBis

( 107)

Cebuano

( 372)

( 252)

Kalagan

( 371)

Mansaka

( 639)

TagalogAnt

( 69)

BikolNagaC

( 253)

( 641)

TagbanwaKa

( 640)

TagbanwaAb

( 495)

( 57)

BatakPalaw

( 494)

PalawanBat

( 47)

Banggi

( 547)

SWPalawano

( 208)

Hanunoo

( 493)

Palauan

( 311)

( 595)

S inghi

( 136)

DayakBakat

( 52)

( 727)

( 615)

( 267)

Katingan

( 137)

DayakNgaju

( 245)

Kadorih

( 151)

( 403)

MerinaMala

( 337)

Maanyan

( 693)

Tunjung

( 350)

( 349)

( 327)

( 279)

Kerinc i

( 400)

MelayuSara

( 398)

Melayu

( 487)

Ogan

( 331)

LowMalay

( 231)

Indones ian

( 48)

BanjareseM

( 348)

MalayBahas

( 399)

MelayuBrun

( 407)

M inangkaba

( 126)

( 125)

( 511)

PhanRangCh

( 130)

Chru

( 206)

HainanCham

( 410)

Moken

( 215)

Iban

( 187)

Gayo

( 477)

( 554)

( 217)

Idaan

( 99)

BunduDusun

( 464)

( 138)

( 677)

TimugonM

ur

( 276)

KelabitBar

( 78)

Bintulu

( 66)

( 65)

BerawanLon

( 62)

Belait

( 278)

KenyahLong

( 396)

( 397)

MelanauM

uk

( 303)

Lahanan

( 633)

( 167)

( 169)

EngganoMal

( 168)

EngganoBan

( 455)

Nias

( 679)

TobaBatak

( 341)

Madurese

( 470) ( 56)

( 55)

( 241)

( 242)

Iv atanBasc

( 237)

Itbayat

( 39)

Babuyan

( 234)

Iraralay

( 238)

Itbayaten

( 235)

Is amorong

( 240)

Iv as ay

( 228)

Imorod

( 752)

Yami

( 113)

( 561)

SambalBoto

( 263)

Kapampanga

( 469)

( 605)

( 619)

( 498)

( 64)

( 256)

( 222)IfugaoBayn

( 257)KallahanKa

( 258)KallahanKe

( 232)

Inibaloi

( 497)

Pangas inan

( 226)

( 251)

KakidugenI

( 227)

IlongotKak

( 110)

( 478)

( 219)

( 220)

IfugaoAmga

( 221)

IfugaoBata

( 90)

( 88)

( 89)BontokGuin

( 87)BontocGuin

( 262)

KankanayNo

( 41)

Balangaw

( 255)

( 254)

KalingaGui

( 239)

ItnegBinon

( 468)

( 216)

( 184)

Gaddang

( 30)

AttaPamplo

( 236)

Is negDibag

( 471)

( 2)

Agta

( 144)

DumagatCas

( 224)

Ilokano

( 243)

( 754)

Yogya

( 488)

OldJavanes

( 735)

Wes ternM

alayoPolynes ian

( 631)

( 611)

( 94)

( 93)BugineseSo

( 354)M

aloh

( 344)

Makassar

( 637)

TaeSToraja

( 568)

( 476)

( 567)SangirTabu

( 566)Sangir

( 565)SangilSara

( 50)

Bantik

( 203) ( 201)

( 248)Kaidipang

( 202)GorontaloH

( 84)BolaangM

on

( 423) ( 691)

( 518)Popalia

( 85)Bonerate

( 739) ( 747)

Wuna

(424)M

unaKatobu (466)

(685)Tontem

boan ( 684)

Tons ea (46)

BanggaiWdi

(51)Baree

(746)

Wolio

(418)

Mori

(270) (271)

KayanUmaJu

(96)

Bukat (529)

PunanKelai

(11

1)

(158)

(523)

(736)

(462)

(213) (715)

Wam

par (749)

Yabem (484)

Numbam

iSib (451)

(713)

(624) (67)

(422)M

ouk (20)

Aria (309)

LamogaiM

ul

(7)

Amara

(500) (585)

Sengseng (268)

KaulongAuV

(61) (475)

(73)Bilibil

(394)M

egiar (188)

Gedaged

(387)M

atukar

(72)

Biliau

(401)

Mengen

(353)

Maleu

(292)

Kove

(575) (574)

(661)

Tarpia

(600)

Sobei

(272)

KayupulauK

(539)

Riwo

(579) (250)

(249)

Kairiru

(358) (284)

Kis

(743)

Wogeo

(4)

Ali

(499)

(480)

(627)

(31)

Auhelawa

(558)

Saliba

(626)

Suau

(343)

Maisin

(463) (205)

Gumawana

(104)

(140)

Diodio

(412)

Molim

a

(17) (16)

(695)

Ubir

(185)

Gapapaiwa

(720)

Wedau

(141)

Dobuan

(508)

(281)

(409)

Misim

a

(280)

Kilivila

(117) (723)

(183)

Gabadi

(482) (143)

Doura

(295)

Kuni

(541)

Roro

(395)

Mekeo

(305)

Lala

(594)

(712)

Vilirupu

(421)

Motu

(342)

MagoriSout

(404)

(448)

(610)

(502)

(261)

Kandas

(655)

Tanga

(590)

Siar

(293)

Kuanua

(501)

Patpatar

(447) (730)

(544)

Roviana

(593)

Simbo

(437)

Nduke

(296)

Kusaghe

(212)

Hoava

(335)

Lungga

(336)

Luqa

(294)

Kubokota

(696)Ughele

(193)Ghanongga

(154)

(706)Vangunu

(382)Marovo

(391)Mbareke

(440)

(602)

Solos

(572)

(646)Taiof

(668)Teop

(438)

(439)NehanHape

(458)Nissan

(207)

Haku

(129)

(597)

Sisingga

(38)

BabatanaTu

(711)

VarisiGhon

(710)

Varisi

(538)

Ririo

(584)

Sengga

(35)

BabatanaAv

(705)

Vaghua

(34)

Babatana

(37)

BabatanaLo

(36)

BabatanaKa

(416)

(334)

LungaLunga

(413)

Mono

(414)

MonoAlu

(686)

Torau

(700)

Uruava

(415)

MonoFauro

(75)

Bilur

(571)

(157)

(128)ChekeHolo

(379)MaringeKma

(381)MaringeTat

(452)NggaoPoro

(380)MaringeLel

(732) (755)ZabanaKia

(302)LaghuSamas

(124)

(82)Blablanga

(83)BlablangaG

(289)Kokota

(282)KilokakaYs

(49)

Banoni

(316)

(340)

Madara

(692)

TungagTung

(265)KaraWest

(431)Nalik

(673)Tiang

(674)Tigak

(324)LihirSungl

(338)

(339)MadakLamas

(53)Barok

(741)

(388)Maututu

(430)NakanaiBil

(304)Lakalai

(714)

Vitu

(753)

Yapese

(11

2)

(618)

(190)

(92) (393)MbughotuDh

(95)Bugotu

(204)

(651)TaliseMoli (650)TaliseMala (195)GhariNdi (194)Ghari (681)Tolo (648)Talise (347)Malango (196)

GhariNggae (392)Mbirao (199)GhariTanda (652)TalisePole (197)GhariNgger (649)TaliseKoo (198)GhariNgini (189)

(319)Lengo (320)LengoGhaim

(321)LengoParip

(453)Nggela

(346)

(345)

(473)

(678)Toambaita

(298)Kwai

(312)Langalanga

(299)Kwaio

(390)Mbaengguu

(313)Lau

(389)

Mbaelelea (301)

KwaraaeSol

(175)

Fataleka (315)

LauWalade

(314)

LauNorth

(328)

Longgu (621)

(551)

SaaUkiNiMa

(550)

SaaSaaVill

(490)

Oroha

(552)

SaaUlawa

(548)

Saa

(142)

Dorio

(18)

AreareMaas

(19)

AreareWaia

(549)

SaaAuluVil

(564)

(58)

BauroBaroo

(59)

BauroHaunu

(172)

Fagani

(247)

KahuaMami

(570)

SantaCatal

(22)

ArosiOneib

(23)

ArosiTawat

(21)

Arosi

(246)

Kahua

(60)

BauroPawaV

(173)

FaganiAguf

(17

4)

FaganiRihu

(56

9)

SantaAna

(66

3)

Tawaroga

(

536)

(70

9)

(65

7)

(65

8)

TannaSouth

(30

0)

Kwamera

(31

8)

Lenakel

(17

1)

(63

6)

SyeErroman

(69

9)

Ura

(46

7)

(72

6)

(15

)

ArakiSouth

(40

2)

Merei

(15

0)

(10)

AmbrymSout

(42

0)

Mota

(51

0)

PeteraraMa

(49

1)

PaameseSou

(42

7)

Mwotlap

(53

1)

Raga

(11

9)

(60

7)

SouthEfate

(48

9)

Orkon

(45

4)

Nguna

(43

2)

Namakir

(35

2)

(43

4)

Nati

(42

9)

Nahavaq

(44

4)

Nese

(65

9)

Tape

(1

2)

Anejom

Anei

(

557)

Saka

oPor

tO

(

351)

(

32)

Avav

a

(

445)

Neve

ei

(

433)

Nam

an

(

522)

(

406)

(

516)

(

520)

(

528)

Pulu

wate

se

(

527)

Pulo

Anna

n

(

745)

Wol

eaia

n

(

577)

Sata

wale

se

(

132)

Chuu

kese

AK

(

419)

Mor

tlock

es

(

106)

Caro

linia

n

(

131)

Chuu

kese

(

603)

Sons

orol

es

(

555)

Saip

anCa

ro

(

526)

Pulo

Anna

(

744)

Wol

eai

(

515)

(

411)

Mok

ilese

(

512)

Ping

ilape

s

(

514)

Pona

pean

(

283)

Kirib

ati

(

297)

Kusa

ie

(

385)

Mar

shal

les

(

436)

Naur

u

(

332)

(

446)

(

474)

(

441)

Nele

mwa

(

244)

Jawe

(

105)

Cana

la

(

214)

Iaai

(

443)

Neng

one

(

139)

Dehu

(

116)

(

725)

(

543)

Rotu

man

(

734)

Wes

tern

Fij

(

146)

(

513)

(

481)

(

156)

(

122)

(64

5) (

374)

Mao

ri

(

642)

Tahi

ti

(

505)

Penr

hyn

(

689)

Tuam

otu

(

361)

Man

ihik

i

(

546)

Ruru

tuan

(

643)

Tahi

tianM

o

(

534)

Raro

tong

an

(

644)

Tahi

tiant

h

(

384)

(

383)

Mar

ques

an

(

359)

Man

gare

va

(

209)

Hawa

iian

(

533)

Rapa

nuiE

as

(

563)

(

562)

Sam

oan

(

182)

(

63)

Bello

na

(

675)

Tiko

pia

(

163)

Emae

(

13)

Anut

a

(

537)

Renn

elle

se

(

180)

Futu

naAn

iw

(

218)

Ifira

Mel

eM

(

703)

Uvea

Wes

t

(

181)

Futu

naEa

st

(

704)

Vaea

kauT

au

(

162)

(

647)

Taku

u

(69

4)Tu

valu

(

592)

Sika

iana

(

264)

Kapi

ngam

ar

(48

3)Nu

kuor

o

(33

3)Lu

angi

ua

(52

4)Pu

kapu

ka

(68

0)To

kela

u

(70

2)Uv

eaEa

st

(68

3)

(68

2)To

ngan

(

459)

Niue

(

177)

Fijia

nBau

(

159)

(

707)

(

666)

Tean

u

(70

8)Va

no

(65

4)Ta

nem

a

(98

)Bu

ma

(

701)

(

26)

Asum

boa

(

656)

Tani

mbi

li

(44

2)Ne

mba

o

(1)

(

738)

(

581)

Seim

at

(74

8)

Wuv

ulu

(

160)

(

616)

(

330)

Lou

(

435)

Naun

a

(37

3)

(15

3)

(31

7)Le

ipon

(

598)

S iv i

s aTi

ta

(

729)

(

325)

L iku

m

(32

3)

Leve

i

(

329)

Loni

u

(

426)

Mus

sau

(

609)

(

608)

(

266)

Kas i

raIra

h

(

200)

Gim

an

(

97)

Buli

(

108)

(

417)

Mor

(

532)

(

408)

Min

yaifu

in

(

68)

Biga

Mis

ool

(

25)

As

(

718)

War

open

(

485)

Num

for

(

120)

(

378)

Mar

au

(

8)

Amba

iYap

en

(

742)

Win

des i

Wan

(

519)

(

33)

(

465)

(

460)

North

Baba

r

(

135)

Dawe

raDa

we

(

134)

Dai

(

612)

(

622)

(

667)

Tela

Mas

bua

(

229)

Imro

ing

(

164)

Empl

awas

(

386)

(

148)

Eas t

Mas

ela

(

588)

Seril

i

(

115)

Cent

ralM

as

(

606)

Sout

hEas

tB

(

724)

Wes

tDam

ar

(

24)

(

660)

Tara

ngan

Ba

(

697)

UjirN

Aru

(

450)

Ngai

borS

Ar

(

114)

(

152)

(

45)

(

192)

(

191)

Gese

r

(

719)

Wat

ubel

a

(

161)

E lat

KeiB

es

(

586)

(

486)

(

587)

(

9)

(

211)

Hitu

Ambo

n

(

604)

SouA

man

aTe

(

6)

Amah

ai

(

503)

Paul

ohi

(

698)

(

5)

Alun

e

(

425)

Mur

nate

nAl

(

86)

Bonf

ia

(

722)

Wer

inam

a

(

101)

Buru

Nam

rol

(

601)

Sobo

yo

(

77)

(

357)

Mam

boru

(

360)

Man

ggar

ai

(

449)

Ngad

ha

(

517)

Pond

ok

(

287)

Kodi

(

721)

Wej

ewaT

ana

(

76)

Bim

a

(

599)

Soa

(

578)

Savu

(

716)

Wan

ukak

a

(

186)

Gaur

aNgg

au

( 1

49)

Eas tSum

ban

( 44

)

Baliledo

( 30

8)

Lamboya

( 16

6)

( 32

6)

L ioF loresT

( 16

5)

Ende

( 42

8)

Nage

( 25

9)

Kambera

( 49

6)

PalueNitun

( 11

)

Anakalang

( 46

1)

( 58

2)

Sekar

( 52

5)

PulauArgun

( 67

6)

( 47

9)

( 73

1)

( 54

2)

RotiTerm

an

( 29

)

Atoni

( 15

5)

( 35

6)

Mambai

( 66

9)

TetunTerik

( 27

7)

Kemak

( 17

8)

( 30

7)

Lamalerale

( 30

6)

LamaholotI

( 59

1)

S ika

( 27

3)

Kedang

(623)

( 74

0)

( 17

0)

Erai

( 65

3)

Talur

( 14

)

Aputai

( 50

6)

Perai

( 69

0)

Tugun

(223)

Iliun

(671)

(457)

(589)

Serua

(456)

Nila

(670)

Teun

(286)

(285)

Kis ar

(540)

Roma

(145)

Eas tDamar

(322)

Letinese

(617)

(275)

(751)

Yamdena

(274)

KeiTanimba

(583)

Selaru

( 288)

KoiwaiIria

( 127)

Chamorro

( 509)

( 310)

Lampung

( 290)

Komering

( 634)

Sunda

( 535)

RejangReja

( 74)

( 664)

( 665)

TboliTagab

( 638)

Tagabili

( 81)

( 291)

KoronadalB

( 573)

SaranganiB

( 70)

BilaanKoro

( 71)

BilaanSara

( 179)

( 28)

( 133)

CiuliAtaya

( 625)

SquliqAtay

( 580)

Sediq

( 100)

Bunun

( 737)

( 672)Thao

( 176)Favorlang

( 545)

Rukai

( 492)

Paiwan

( 688)

( 260)Kanakanabu

( 553)Saaroa

( 687) Tsou

( 530)

Puyuma

( 147) ( 472)

( 269) Kavalan ( 54) Basai

( 109) CentralAmi ( 596) S iraya

( 504) Pazeh ( 556) Sais iat

Proto-Austronesian

(Fig. S2)

(Fig. S3)

(Fig. S5)

(Fig. S4)

f

gdb

nm

k

hv

t

s

qp

z x

pppp g qqqqqe

a

o

i uy

e11

12

1 2

3

8

56

10

4 97

A B

C D

(

543)

Ro

(

734)

Wes

te

3)

(48

1)

(

156)

(

122)

(

645) (

374)

Mao

ri

(

642)

Tahi

ti

(

505)

Penr

hyn

(

689)

Tuam

otu

(

361)

Man

ihik

i

(

546)

Ruru

tuan

(

643)

Tahi

tianM

o

(

534)

Raro

tong

an

(

644)

Tahi

tiant

h

(

384)

(

383)

Mar

ques

an

(

359)

Man

gare

va

(

209)

Hawa

iian

(

533)

Rapa

nuiE

as

(

563)

(56

2)Sa

moa

n

(

182)

(

63)

Bello

na

(

675)

Tiko

pia

(

163)

Emae

(

13)

Anut

a

(53

7)Re

nnel

lese

(

180)

Futu

naAn

iw

(

218)

Ifira

Mel

eM

(

703)

Uvea

Wes

t

(

181)

Futu

naEa

st

(

704)

Vaea

kauT

au

(

162)

(

647)

Taku

u

(69

4)Tu

valu

(

592)

Sika

iana

(

264)

Kapi

ngam

ar

(48

3)Nu

kuor

o

(33

3)Lu

angi

ua

(52

4)Pu

kapu

ka

(68

0)To

kela

u

(70

2)Uv

eaEa

st

(68

3)

(68

2)To

ngan

(

459)

Niue

(177

)Fi

jianB

au

(66

6)Te

an

i iiFraction of substitution errors

Fig. 2. Analysis of the output of our system in more depth. (A) An Austronesian phylogenetic tree from ref. 29 used in our analyses. Each quadrant isavailable in a larger format in SI Appendix, Figs. S2–S5, along with a detailed table of sound changes (SI Appendix, Table S5). The numbers in parenthesesattached to each branch correspond to rows in SI Appendix, Table S5. The colors and numbers in parentheses encode the most prominent sound change alongeach branch, as inferred automatically by our system in SI Appendix, Section 4. (B) The most supported sound changes across the phylogeny, with the width oflinks proportional to the support. Note that the standard organization of the IPA chart into columns and rows according to place, manner, height, andbackness is only for visualization purposes: This information was not encoded in the model in this experiment, showing that the model can recover realisticcross-linguistic sound change trends. All of the arcs correspond to sound changes frequently used by historical linguists: sonorizations /p/ > /b/ (1) and /t/ > /d/(2), voicing changes (3, 4), debuccalizations /f/ > /h/ (5) and /s/ > /h/ (6), spirantizations /b/ > /v/ (7) and /p/ > /f/ (8), changes of place of articulation (9, 10), andvowel changes in height (11) and backness (12) (1). Whereas this visualization depicts sound changes as undirected arcs, the sound changes are actuallyrepresented with directionality in our system. (C) Zooming in a portion of the Oceanic languages, where the Nuclear Polynesian family (i) and Polynesianfamily (ii) are visible. Several attested sound changes such as debuccalization to Maori and place of articulation change /t/ > /k/ to Hawaiian (30) are suc-cessfully localized by the system. (D) Most common substitution errors in the PAn reconstructions produced by our system. The first phoneme in each pairðx; yÞ represents the reference phoneme, followed by the incorrectly hypothesized one. Most of these errors could be plausible disagreements among humanexperts. For example, the most dominant error (p, v) could arise over a disagreement over the phonemic inventory of Proto-Austronesian, whereas vowels arecommon sources of disagreement.

4 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Bouchard-Côté et al.

Page 5: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

formalism but other limitations include the lack of explicit mod-eling of changes at the level of the phoneme inventories used bya language and the lack of morphological analysis. Challengesspecific to the cognate inference task, for example difficulties withpolymorphisms, are also discussed in more detail in SI Appendix.Another limitation of the current approach stems from the as-sumption that languages form a phylogenetic tree, an assumptionviolated by borrowing, dialect variation, and creole languages.However, we believe our system will be useful to linguists inseveral ways, particularly in contexts where there are large num-bers of languages to be analyzed. Examples might include usingthe system to propose short lists of potential sound changes andcorrespondences across highly divergent word forms.An exciting possible application of this work is to use the

model described here to infer the phylogenetic relationshipsbetween languages jointly with reconstructions and cognate sets.This will remove a source of circularity present in most previouscomputational work in historical linguistics. Systems for inferringphylogenies such as ref. 13 generally assume that cognate sets aregiven as a fixed input, but cognacy as determined by linguists is inturn motivated by phylogenetic considerations. The phylogenetictree hypothesized by the linguist is therefore affecting the treebuilt by systems using only these cognates. This problem can beavoided by inferring cognates at the same time as a phylogeny,something that should be possible using an extended version ofour probabilistic model.Our system is able to reconstruct the words that appear in

ancient languages because it represents words as sequences ofsounds and uses a rich probabilistic model of sound change. Thisis an important step forward from previous work applying com-putational ideas to historical linguistics. By leveraging the fullsequence information available in the word forms in modernlanguages, we hope to see in historical linguistics a breakthroughsimilar to the advances in evolutionary biology prompted by thetransition from morphological characters to molecular sequencesin phylogenetic analysis.

Materials and MethodsThis section provides a more detailed specification of our probabilisticmodel. See SI Appendix, Section 1.2 for additional content on the algo-rithm and simulations.

Distributions. The conditional distributions over pairs of evolving strings arespecified using a lexicalized stochastic string transducer (33).

Consider a language ℓ′ evolving to ℓ for cognate set c. Assume we have aword form x =wcℓ′. The generative process for producing y =wcℓ works asfollows. First, we consider x to be composed of characters x1x2 . . . xn, withthe first and last ones being a special boundary symbol x1 =#∈Σ, which isnever deleted, mutated, or created. The process generates y = y1y2 . . . yn in nchunks yi ∈Σ*; i∈ f1; . . . ;ng, one for each xi . The yi s may be a single char-acter, multiple characters, or even empty. To generate yi , we define a mu-tation Markov chain that incrementally adds zero or more characters to aninitially empty yi . First, we decide whether the current phoneme in the topword t = xi will be deleted, in which case yi = e (the probabilities of the

decisions taken in this process depend on a context to be specified shortly).If t is not deleted, we choose a single substitution character in the bottomword. We write S =Σ∪ fζg for this set of outcomes, where ζ is the specialoutcome indicating deletion. Importantly, the probabilities of this multino-mial can depend on both the previous character generated so far (i.e., therightmost character p of yi−1) and the current character in the previousgeneration string (t), providing a way to make changes context sensitive.This multinomial decision acts as the initial distribution of the mutationMarkov chain. We consider insertions only if a deletion was not selected inthe first step. Here, we draw from a multinomial over S , where this time thespecial outcome ζ corresponds to stopping insertions, and the other ele-ments of S correspond to symbols that are appended to yi . In this case, theconditioning environment is t = xi and the current rightmost symbol p in yi .Insertions continue until ζ is selected. We use θS;t;p;ℓ and θI;t;p;ℓ to denote theprobabilities over the substitution and insertion decisions in the currentbranch ℓ′→ ℓ. A similar process generates the word at the root ℓ of a tree orwhen an innovation happens at some language ℓ, treating this word asa single string y1 generated from a dummy ancestor t = x1. In this case, onlythe insertion probabilities matter, and we separately parameterize theseprobabilities with θR;t;p;ℓ. There is no actual dependence on t at the root orinnovative languages, but this formulation allows us to unify the parame-terization, with each θω;t;p;ℓ ∈RjΣj+1, where ω∈ fR; S; Ig. During cognate in-ference, the decision to innovate is controlled by a simple Bernoulli randomvariable ngℓ for each language in the tree. When known cognate groups areassumed, ncℓ is set to 0 for all nonroot languages and to 1 for the rootlanguage. These Bernoulli distributions have parameters νℓ.

Mutation distributions confined in the family of transducers miss certainphylogenetic phenomena. For example, the process of reduplication (as in“bye-bye”, for example) is a well-studied mechanism to derive morpholog-ical and lexical forms that is not explicitly captured by transducers. The samesituation arises in metatheses (e.g., Old English frist > English first). How-ever, these changes are generally not regular and therefore less informative(1). Moreover, because we are using a probabilistic framework, these eventscan still be handled in our system, even though their costs will simply not beas discounted as they should be.

Note also that the generative process described in this section does notallow explicit dependencies to the next character in ℓ. Relaxing this as-sumption can be done in principle by using weighted transducers, but at thecost of a more computationally expensive inference problem (caused by thetransducer normalization computation) (34). A simpler approach is to usethe next character in the parent ℓ′ as a surrogate for the next character in ℓ.Using the context in the parent word is also more aligned to the standardrepresentation of sound change used in historical linguistics, where thecontext is defined on the parent as well.

More generally, dependencies limited to a bounded context on the parentstring can be incorporated in our formalism. By bounded, we mean that itshould be possible to fix an integer k beforehand such that all of themodeled dependencies are within k characters to the string operation. Thecaveat is that the computational cost of inference grows exponentially in k.We leave open the question of handling computation in the face of un-bounded dependencies such as those induced by harmony (35).

Parameterization. Instead of directly estimating the transition probabilities ofthemutationMarkov chain (which could be done, in principle, by taking themto be the parameters of a collection of multinomial distributions) we expressthem as the output of a multinomial logistic regression model (36). This

A B

0 2x10-5 4x10-5 6x10-5 8x10-5 1x10-40

0.2

0.4

0.6

0.8

1

Functional load

Mer

ger

post

erio

r

0 2x10-5 4x10-5 6x10-5 8x10-5 1x10-40

0.2

0.4

0.6

0.8

1

Functional load

Mer

ger

post

erio

r

0

1

102

104

106Fig. 3. Increasing the number of languages we canreconstruct gives new ways to approach questions inhistorical linguistics, such as the effect of functionalload on the probability of merging two sounds. Theplots shown are heat maps where the color encodesthe log of the number of sound changes that fallinto a given two-dimensional bin. Each soundchange x > y is encoded as a pair of numbers in theunit square, ðl;mÞ, as explained in Materials andMethods. To convey the amount of noise one couldexpect from a study with the number of languagesthat King previously used (7), we first show in A theheat map visualization for four languages. Next, weshow the same plot for 637 Austronesian languagesin B. Only in this latter setup is structure clearlyvisible: Most of the points with high probability of merging can be seen to have comparatively low functional load, providing evidence in favor of thefunctional load hypothesis introduced in 1955. See SI Appendix, Section 2.4 for details.

Bouchard-Côté et al. PNAS Early Edition | 5 of 6

COMPU

TERSC

IENCE

SPS

YCHOLO

GICALAND

COGNITIVESC

IENCE

S

Page 6: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

model specifies a distribution over transition probabilities by assigningweights to a set of features that describe properties of the sound changesinvolved. These features provide a more coherent representation of thetransition probabilities, capturing regularities in sound changes that reflectthe underlying linguistic structure.

We used the following feature templates: OPERATION, which identifieswhether an operation in the mutation Markov chain is an insertion, a de-letion, a substitution, a self-substitution (i.e., of the form x > y, x = y), or theend of an insertion event; MARKEDNESS, which consists of language-specificn-gram indicator functions for all symbols in Σ (during reconstruction, onlyunigram and bigram features are used for computational reasons; for cognateinference, only unigram features are used); FAITHFULNESS, which consists ofindicators for mutation events of the form 1 [ x > y ], where x ∈Σ, y ∈ S .Feature templates similar to these can be found, for instance, in the work ofrefs. 37 and 38, in the context of string-to-string transduction models used incomputational linguistics. This approach to specifying the transition proba-bilities produces an interesting connection to stochastic optimality theory (39,40), where a logistic regression model mediates markedness and faithfulnessof the production of an output form from an underlying input form.

Data sparsity is a significant challenge in protolanguage reconstruction.Although the experiments we present here use an order of magnitude morelanguages than previous computational approaches, the increase in observeddata also brings with it additional unknowns in the form of intermediateprotolanguages. Because there is one set of parameters for each language,adding more data is not sufficient to increase the quality of the recon-struction; it is important to share parameters across different branches in thetree to benefit from having observations from more languages. We used thefollowing technique to address this problem: We augment the parameteri-zation to include the current language (or language at the bottom of thecurrent branch) and use a single, global weight vector instead of a set ofbranch-specific weights. Generalization across branches is then achievedby using features that ignore ℓ, whereas branch-specific features dependon ℓ. Similarly, all of the features in OPERATION, MARKEDNESS, andFAITHFULNESS have universal and branch-specific versions.

Using these features and parameter sharing, the logistic regression modeldefines the transition probabilities of the mutation process and the rootlanguage model to be

θω;t;p;ℓ = θω;t;p;ℓðξ; λÞ= expfÆλ; fðω; t;p; ℓ; ξÞægZðω; t;p; ℓ; λÞ × μðω; t; ξÞ; [1]

where ξ∈ S , f : fS; I;Rg×Σ×Σ× L× S →Rk is the feature function (whichindicates which features apply for each event), Æ · ; · æ denotes inner product,and λ∈Rk is a weight vector. Here, k is the dimensionality of the feature spaceof the logistic regression model. In the terminology of exponential families, Zand μ are the normalization function and the reference measure, respectively:

Zðω; t;p; ℓ; λÞ=Xξ′∈S

exp�Æλ; f

�ω; t;p; ℓ; ξ′

�æ�

μðω; t; ξÞ=

8>><>>:

0 if ω= S; t =#; ξ≠#0 if ω=R; ξ= ζ0 if ω≠R; ξ=#1  o:w:

Here, μ is used to handle boundary conditions, ensuring that the resultingprobability distribution is well defined.

During cognate inference, the innovation Bernoulli random variables νgℓare similarly parameterized, using a logistic regression model with two kindsof features: a global innovation feature κglobal ∈R and a language-specificfeature κℓ ∈R. The likelihood function for each νgℓ then takes the form

νgℓ =1

1+ exp�−κglobal − κℓ

�: [2]

ACKNOWLEDGMENTS. This work was supported by Grant IIS-1018733 fromthe National Science Foundation.

1. Hock HH (1991) Principles of Historical Linguistics (Mouton de Gruyter, The Hague,Netherlands).

2. Ross M, Pawley A, Osmond M (1998) The Lexicon of Proto Oceanic: The Culture andEnvironment of Ancestral Oceanic Society (Pacific Linguistics, Canberra, Australia).

3. Diamond J (1999) Guns, Germs, and Steel: The Fates of Human Societies (WW Norton,New York).

4. Nichols J (1999) Archaeology and Language: Correlating Archaeological and LinguisticHypotheses, eds Blench R, Spriggs M (Routledge, London).

5. Ventris M, Chadwick J (1973) Documents in Mycenaean Greek (Cambridge Univ Press,Cambridge, UK).

6. Martinet A (1955) Économie des Changements Phonétiques [Economy of phoneticsound changes] (Maisonneuve & Larose, Paris).

7. King R (1967) Functional load and sound change. Language 43:831–852.8. Holmes I, Bruno WJ (2001) Evolutionary HMMs: A Bayesian approach to multiple

alignment. Bioinformatics 17(9):803–820.9. Miklós I, Lunter GA, Holmes I (2004) A “Long Indel” model for evolutionary sequence

alignment. Mol Biol Evol 21(3):529–540.10. Suchard MA, Redelings BD (2006) BAli-Phy: Simultaneous Bayesian inference of

alignment and phylogeny. Bioinformatics 22(16):2047–2048.11. Liberles DA, ed (2007) Ancestral Sequence Reconstruction (Oxford Univ Press, Ox-

ford, UK).12. Paten B, et al. (2008) Genome-wide nucleotide-level mammalian ancestor recon-

struction. Genome Res 18(11):1829–1843.13. Gray RD, Jordan FM (2000) Language trees support the express-train sequence of

Austronesian expansion. Nature 405(6790):1052–1055.14. Ringe D, Warnow T, Taylor A (2002) Indo-European and computational cladistics.

Trans Philol Soc 100:59–129.15. Evans SN, Ringe D, Warnow T (2004) Inference of Divergence Times as a Statistical In-

verse Problem, McDonald Institute Monographs, eds Forster P, Renfrew C (McDonaldInstitute, Cambridge, UK).

16. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatoliantheory of Indo-European origin. Nature 426(6965):435–439.

17. Nakhleh L, Ringe D, Warnow T (2005) Perfect phylogenetic networks: A new meth-odology for reconstructing the evolutionary history of natural languages. Language81:382–420.

18. Bryant D (2006) Phylogenetic Methods and the Prehistory of Languages, eds Forster P,Renfrew C (McDonald Institute for Archaeological Research, Cambridge, UK), pp111–118.

19. Daumé H III, Campbell L (2007) A Bayesian model for discovering typological im-plications. Assoc Comput Linguist 45:65–72.

20. Dunn M, Levinson S, Lindstrom E, Reesink G, Terrill A (2008) Structural phylogenyin historical linguistics: Methodological explorations applied in Island Melanesia.Language 84:710–759.

21. Lynch J, ed (2003) Issues in Austronesian (Pacific Linguistics, Canberra, Australia).22. Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word

lists in four daughter languages. J Quant Linguist 7:233–244.23. Kondrak G (2002) Algorithms for Language Reconstruction. PhD thesis (Univ of Tor-

onto, Toronto).24. Ellison TM (2007) Bayesian identification of cognates and correspondences. Assoc

Comput Linguist 45:15–22.25. Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum like-

lihood alignment of DNA sequences. J Mol Evol 33(2):114–124.26. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data

via the EM algorithm. J R Stat Soc B 39:1–38.27. Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel

trees. Adv Neural Inf Process Syst 21:177–184.28. Greenhill SJ, Blust R, Gray RD (2008) The Austronesian basic vocabulary database:

From bioinformatics to lexomics. Evol Bioinform Online 4:271–283.29. Lewis MP, ed (2009) Ethnologue: Languages of the World (SIL International, Dallas,

TX, 16th Ed).30. Lyovin A (1997) An Introduction to the Languages of the World (Oxford Univ Press,

Oxford, UK).31. Hockett CF (1967) The quantification of functional load. Word 23:320–339.32. Surendran D, Niyogi P (2006) Competing Models of Linguistic Change. Evolution and

Beyond (Benjamins, Amsterdam).33. Varadarajan A, Bradley RK, Holmes IH (2008) Tools for simulating evolution of aligned

genomic regions with integrated parameter estimation. Genome Biol 9(10):R147.34. Mohri M (2009) Handbook of Weighted Automata, Monographs in Theoretical

Computer Science, eds Droste M, Kuich W, Vogler H (Springer, Berlin).35. Hansson GO (2007) On the evolution of consonant harmony: The case of secondary

articulation agreement. Phonology 24:77–120.36. McCullagh P, Nelder JA (1989) Generalized Linear Models (Chapman & Hall, London).37. Dreyer M, Smith JR, Eisner J (2008) Latent-variable modeling of string transductions

with finite-state methods. Empirical Methods on Natural Language Processing 13:1080–1089.

38. Chen SF (2003) Conditional and joint models for grapheme-to-phoneme conversion.Eurospeech 8:2033–2036.

39. Goldwater S, Johnson M (2003) Learning OT constraint rankings using a maximumentropy model. Proceedings of the Workshop on Variation Within Optimality Theoryeds Spenader J, Eriksson A, Dahl Ö (Stockholm University, Stockholm) pp 113–122.

40. Wilson C (2006) Learning phonology with substantive bias: An experimental andcomputational study of velar palatalization. Cogn Sci 30(5):945–982.

41. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansionpulses and pauses in Pacific settlement. Science 323(5913):479–483.

42. Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesiancomparative linguistics. Inst Linguist Acad Sinica 1:31–94.

6 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Bouchard-Côté et al.

Page 7: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Supporting Information:Automated reconstruction of ancient languages using

probabilistic models of sound change

Alexandre Bouchard-CoteDavid Hall

Thomas L. GriffithsDan Klein

This Supporting Information describes the learning and inference algorithms used by our system (Section 1),details of simulations supporting the accuracy of our system and analyzing the effects of functional load (Section 2),lists of reconstructions (Section 3), lists of sound changes (Section 4), and analysis of frequent errors (Section 5).

1 Learning and inferenceThe generative model introduced in the Materials and Methods section of the main paper sets us up with twoproblems to solve: estimating the values of the parameters characterizing the distribution on sound changes oneach branch of the tree, and inferring the optimal values of the strings representing words in the unobservedprotolanguages. Section 1.1 introduces the full objective function that we need to optimize in order to estimatethese quantities. Section 1.2 describes the Monte Carlo Expectation-Maximization algorithm we used for solvingthe learning problem. We present the algorithm for inferring ancestral word forms in Section 1.4.

1.1 Full objective functionThe generative model specified in the Materials and Methods section of the main paper defines an objective func-tion that we can optimize in order to find good protolanguage reconstructions. This objective function takes theform of a regularized log-likelihood, combining the probability of the observed languages with additional con-straints intended to deal with data sparsity. This objective function can be written concisely if we let Pλ(·),Pλ(·|·)denote the root and branch probability models described in the Materials and Methods section of the paper (withtransition probabilities given by the above logistic regression model), I(c), the set of internal (non-leaf) nodes inτ(c), pa(`), the parent of language `, r(c), the root of τ(c) and W (c) = (Σ∗)|I(c)|. The full objective function isthen

Li(λ, κ) =

C∑c=1

log∑

~w∈W (c)

Pλ(wc,r(c))∏`∈I(c)

Pλ(wc,`|wc,pa(`), nc`)Pκ(nc`|νc`)−||λ||22 + ||κ||22

2σ2(1)

where the second term is a standard L2 regularization penalty intended to reduce over-fitting due to data sparsity(we used σ2 = 1) [1]. The goal of learning is to find the value of λ, the parameters of the logistic regression modelfor the transition probabilities, that maximizes this function.

1

Page 8: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

1.2 A Monte Carlo Expectation-Maximization algorithm for reconstructionOptimization of the objective function given in Equation 1 is done using a Monte Carlo variant of the Expectation-Maximization (EM) algorithm [2]. This algorithm breaks down into two steps, an E step in which the objectivefunction is approximated and an M step in which this approximate objective function is optimized. The M stepis convex and computed using L-BFGS [3] but the E step is intractable [4], in part because it requires solvingthe problem of inferring the words in the protolanguages. We approximate the solution to this inference problemusing a Markov chain Monte Carlo (MCMC) algorithm [5]. This algorithm repeatedly samples words from theprotolanguages until it converges on the distribution implied by our generative model. Since this procedure isguaranteed to find a local maximum of the objective, we ran it with several different random initializations of themodel parameters. The next two subsections provide the details of these two parts of our system.

1.2.1 E step: Inferring the posterior over string reconstructions

In the E step, the inference problem is to compute an expectation under the posterior over strings in a protolanguagegiven observed word forms at the leaves of the tree.1 The typical approach in biological InDel models [6] is touse Gibbs sampling, where the entire string at each node in the tree is repeatedly resampled, conditioned on itsparent and children. We will call this method Single Sequence Resampling (SSR). While conceptually simple, thisapproach suffers from mixing problems in large trees, since it can take a long time for information to propagatefrom one region of the tree to another [6]. Consequently, we use a different MCMC procedure, called AncestryResampling (AR) that alleviates these mixing problems. This method was originally introduced for biologicalapplications [7], but commonalities between the biological and linguistic cases make it possible to use it in ourmodel.

Concretely, the problem with SSR arises when the tree under consideration is large or unbalanced. In this case,it can take a long time for information from the observed languages to propagate to the root of the tree. Indeed,samples at the root will initially be independent of the observations. AR addresses this problem by resampling onethin vertical slice of all sequences at a time, called an ancestry (for the precise definition of the algorithm, see [7]).Slices condition on observed data, avoiding mixing problems, and can propagate information rapidly across thetree. We ran the ancestry resampling algorithm for a number of iterations that increased linearly with the number ofiterations of the EM algorithm that had been completed, resulting in an approximation regime that could allow theEM algorithm to converge to a solution [8]. To speed-up the large experiments, we also used an approximation inAR. This approximation is based on fixing the value of a set-valued auxiliary variables zg , where zg = {wg,`}, and` ranges over the set of all languages (both internal and at the leaves). Conditioning on these variables, samplingwg,`|zg can be done exactly using dynamic programming and rejection sampling.

1.2.2 M step: Convex optimization of the approximate objective

In the M step, we individually update the parameters λ and κ as specified in Equation 1. We show how λ is updatedin this section, κ can be optimized similarly. Let C = (ω, t, p, `) denote local transducer contexts from the spaceC = {S, I,R} × Σ × Σ × L of all such contexts. Let N(C, ξ) be the expected number of times the transitionξ was used in context C in the preceding E-step. Given these sufficient statistics, the estimate of λ is given byoptimizing the expected complete (regularized) log-likelihood O(λ) derived from the original objective functiongiven Equation [1] in the Materials and Methods section of the main paper (ignoring terms that do not involve λ),

O(λ) =∑C∈C

∑ξ∈S

N(C, ξ)[〈λ, f(C, ξ)〉 − log

∑ξ′

exp{〈λ, f(C, ξ′)〉}]− ||λ||

2

2σ2.

1To be precise: the posterior is over both protolanguage strings and the derivations between these strings and the modern words.

2

Page 9: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

We use L-BFGS [3] to optimize this convex objective function. L-BFGS requires the partial derivatives

∂O(λ)

∂λj=∑C∈C

∑ξ∈S

N(C, ξ)[fj(C, ξ)−

∑ξ′

θC(ξ′;λ)fj(C, ξ′)

]− λjσ2

= Fj −∑C∈C

∑ξ∈S

N(C, ·)θC(ξ′;λ)fj(C, ξ′)−λjσ2,

where Fj =∑C∈C

∑ξ∈S N(C, ξ)fj(C, ξ) is the empirical feature vector and N(C, ·) =

∑ξN(C, ξ) is the

number of times context C was used. Fj and N(C, ·) do not depend on λ and thus can be precomputed at thebeginning of the M-step, thereby speeding up each L-BFGS iteration.

1.3 Approximate Expectation-Maximization for cognate inferenceThe procedure for cognate inference is similar: we again operate within the Expectation-Maximization framework,and the M-steps are identical. However, because of different characteristics of the cognate inference problem, wemake a few different choices for approximations for the E-step. The Monte Carlo inference algorithm describedin the preceding section works exceedingly well for reconstructing words for known cognates. However, a MonteCarlo approach to determining which words are cognate requires resampling the innovation variables ng`, whichis likely to lead to slow mixing of the Markov chain.

We therefore take a different approach when doing cognate inference. Here, we restrict our system to notperform inference over all possible reconstructions for all words, but to only use words that correspond to someobserved modern word with the same meaning. The simplification is of course false, but it works well in practice.Specifically, we perform inference on the tree for each gloss using message passing [9], also known as pruningin the computational biology literature [10], where each message µ(wg`) has a non-zero score only when wg` isone of the observed modern word forms in gloss g. The result of inference yields expected alignments counts foreach character in each language along with the expected number of innovations that occur at each language. Theseexpectations can then be used in the convex M-step.

1.4 Ancestral word form reconstructionIn the E step described in the preceding section, a posterior distribution π over ancestral sequences given observedforms is approximated by a collection of samplesX1, X2, . . . XS . In this section, we describe how this distributionis summarized to produce a single output string for each cognate set.

This algorithm is based on a fundamental Bayesian decision theoretic concept: Bayes estimators. Given a lossfunction over strings Loss : Σ∗ × Σ∗ → [0,∞), an estimator is a Bayes estimator if it belongs to the random set:

argminx∈Σ∗

Eπ Loss(x,X) = argminx∈Σ∗

∑y∈Σ∗

Loss(x, y)π(y).

Bayes estimators are not only optimal within the Bayesian decision framework, but also satisfy frequentist opti-mality criteria such as admissibility [11]. In our case, the loss we used is the Levenshtein [12] distance, denotedLoss(x, y) = Lev(x, y) (we discuss this choice in more detail in Section 2).

Since we do not have access to π, but rather to an approximation based on S samples, the objective functionwe use for reconstruction rewrites as follows:

argminx∈Σ∗

∑y∈Σ∗

Lev(x, y)π(y) ≈ argminx∈Σ∗

1

S

S∑s=1

Lev(x,Xs)

= argminx∈Σ∗

S∑s=1

Lev(x,Xs).

3

Page 10: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

The raw samples contain both derivations and strings for all protolanguages, whereas we are only interested inreconstructing words in a single protolanguage. This is addressed by marginalization, which is done in samplingrepresentations by simply discarding the irrelevant information. Hence, the random variables Xs in the aboveequation can be viewed as being string-valued random variables.

Note that the optimum is not changed if we restrict the minimization to be taken on x ∈ Σ∗ such thatm ≤ |x| ≤M where m = mins |Xs|,M = maxs |Xs|. However, even with this simplification, optimization is intractable.As an approximation, we considered only strings built by at most k contiguous substrings taken from the wordforms in X1, X2, . . . , XS . If k = 1, then it is equivalent to taking the min over {Xs : 1 ≤ s ≤ S}. At the otherend of the spectrum, if k = S, it is exact. This scheme is exponential in k, but since words are relatively short, wefound that k = 2 often finds the same solution as higher values of k.

1.5 Finding Cognate GroupsOur cognate model finds cuts to the phylogeny in order to determine which words are cognate with one another.However, this approach cannot straightforwardly handle instances where the evolution of the words did not followstrictly treelike behavior. This limitation applies to polymorphisms—where multiple words for a given meaningare available in a language—and also for borrowing—where the words did not evolve according to the phylogeny.

However, we can modify the inference procedure to capture these kinds of behaviors using a post hoc agglom-erative “merge” procedure. Specifically, we can run our procedure to find an initial set of cognate groups, andthen merge those cognate groups that produce an increase in model score. That is, we create several initial smallsubtrees containing some cognates, and then stitch them together into one or more larger trees. Thus, non-treelikebehaviors like borrowing will be represented as multiple trees that “overlap.” For instance, if two languages eachhave two words for one meaning (say A and B), then the initial stage might find that the two A’s are cognate,leaving the two B’s as singleton cognate groups. However, merging these two words into a single group will likelyproduce a gain in likelihood, and so we can merge them. Note this procedure is unlikely to work well for longdistance borrowings, as the sound changes involved in long distance borrowing are likely to be very different fromthose according to the phylogeny. Nevertheless, we found this procedure to be effective in practice.

2 ExperimentsIn this section, we give more details on the results in the main paper, namely those concerning validation usingreconstruction error rate, and those measuring the effect of the tree topology and the number of languages. Wealso include additional comparisons to other reconstruction methods, as well as cognate inference results. InSection 2.1, we analyze in isolation the effects of varying the set of features, the number of observed languages,the topology, and the number of iterations of EM. In Section 2.2 we compare performance to an oracle and to twoother systems.

Evaluation of all methods was done by computing the Levenshtein distance [12] (uniform-cost edit distance)between the reconstruction produced by each method and the reconstruction produced by linguists. The Leven-shtein distance is the minimum number of substitutions, insertions, or deletions of a phoneme required to transformone word to another. While the Levenshtein distance misses important aspects of phonology (all phoneme substi-tutions are not equal, for instance), it is parameter-free and still correlates to a large extent with linguistic qualityof reconstruction. It is also superior to held-out log-likelihood, which fails to penalize errors in the modelingassumptions, and to measuring the percentage of perfect reconstructions, which ignores the degree of correctnessof each reconstructed word. We averaged this distance across reconstructed words to report a single number foreach method. The statistical significance of all performance differences are assessed using a paired t-test withsignificance level of 0.05.

4

Page 11: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

2.1 Evaluating system performanceWe used the Austronesian Basic Vocabulary Database (ABVD) [13] as the basis for a series of experiments usedto evaluate the performance of our system and the factors relevant to its success. The database, downloaded fromhttp://language.psy.auckland.ac.nz/austronesian/

on August 7, 2010, includes partial cognacy judgments and IPA transcriptions,2 as well as a several reconstructedprotolanguages.

In our main experiments, we used the tree topology induced by the Ethnologue classification [14] in orderto facilitate interpretability of the results (i.e. so that we can give well-known names to clades in the tree inour analyses). Our method does not require specifying branch lengths since our unsupervised learning procedureprovides a more flexible way of estimating the amount of change between points of the phylogenetic tree.

The first claim we verified experimentally is that having more observed languages aids reconstruction of pro-tolanguages. In the results in this section, we used the subset of the languages under Proto-Oceanic (POc) tospeed-up the computations. To test this hypothesis we added observed modern languages in increasing order ofdistance dc to the target reconstruction of POc so that the languages that are most useful for POc reconstructionare added first. This prevents the effects of adding a close language after several distant ones being confused withan improvement produced by increasing the number of languages.

The results are reported in Figure 1(c) of the main paper. They confirm that large-scale inference is desirablefor automatic protolanguage reconstruction: reconstruction improved statistically significantly with each increaseexcept from 32 to 64 languages, where the average edit distance improvement was 0.05.

We then conducted a number of experiments intended to identify the contribution made by different factors itincorporates. We found that all of the following ablations significantly hurt reconstruction: using a flat tree (inwhich all languages are equidistant from the reconstructed root and from each other) instead of the consensus tree,dropping the markedness features, dropping the faithfulness features, and disabling sharing across branches. Theresults of these experiments are shown in Table S.1.

For comparison, we also included in the same table the performance of a semi-supervised system trained byK-fold validation. The system was run K = 5 times, with 1 − K−1 of the POc words given to the system asobservations in the graphical model for each run. It is semi-supervised in the sense that target reconstructions formany internal nodes are not available in the dataset, so they are still not filled.3

2.2 Comparisons against other methodsThe first competing method, PRAGUE was introduced in [15]. In this method, the word forms in a given protolan-guage are reconstructed using a Viterbi multi-alignment between a small number of its descendant languages. Thealignment is computed using hand-set parameters. Deterministic rules characterizing changes between pairs ofobserved languages are extracted from the alignment when their frequency is higher than a threshold, and a proto-phoneme inventory is built using linguistically motivated rules and parsimony. A reconstruction of each observedword is first proposed independently for each language. If at least two reconstructions agree, a majority vote istaken, otherwise no reconstruction is proposed. This approach has several limitations. First, it is not tractable forlarger trees, since the time complexity of their multi-alignment algorithm grows exponentially in the number oflanguages. Second, deterministic rules, while elegant in theory, are not robust to noise: even in experiments withonly four daughter languages, a large fraction of the words could not be reconstructed.

Since PRAGUE does not scale to large datasets, we also built a second, more tractable baseline. This newbaseline system, CENTROID, computes the centroid of the observed word forms in Levenshtein distance. Let

2While most word forms in ABVD are encoded using IPA, there are a few exceptions, for example language family specific conventionssuch as the usage of the okina symbol (’) for glottal stops (P) in some Polynesian languages or ad hoc conventions such as using the bigram‘ng’ for the agma symbol (N). We have preprocessed as many of these exceptions as possible, and probabilist methods are generally robust toreasonable amounts of encoding glitches.

3We also tried a fully supervised system where a flat topology is used so that all of these latent internal nodes are avoided; but it did notperform as well—this is consistent with the -Topology experiment of Table S.1.

5

Page 12: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Lev(x, y) denote the Levenshtein distance between word forms x and y. Ideally, we would like the baselinesystem to return:

argminx∈Σ∗

∑y∈O

Lev(x, y),

where O = {y1, . . . , y|O|} is the set of observed word forms. This objective function is motivated by Bayesiandecision theory [11], and shares similarity to the more sophisticated Bayes estimator described in Section 2. How-ever it replaces the samples obtained by MCMC sampling by the set of observed words. Similarly to the algorithmof Section 2, we also restrict the minimization to be taken on x ∈ Σ(O)∗ such that m ≤ |x| ≤ M and to stringsbuilt by at most k contiguous substrings taken from the word forms in O, where m = mini |yi|,M = maxi |yi|and Σ(O) is the set of characters occurring in O. Again, we found that k = 2 often finds the same solution ashigher values of k. The difference was in all the cases not statistically significant, so we report the approximationk = 2 in what follows.

We also compared against an oracle, denoted ORACLE, which returns

argminy∈O

Lev(y, x∗),

where x∗ is the target reconstruction. We will denote it by ORACLE. This is superior to picking a single closestlanguage to be used for all word forms, but it is possible for systems to perform better than the oracle since it hasto return one of the observed word forms. Of course, this scheme is only available to assess system performanceon held-out data: It cannot make new predictions.

We performed the comparison against a previous system proposed in [15] on the same dataset and experimentalconditions as used in [15]. The PMJ dataset was compiled by [16], who also reconstructed the correspondingprotolanguage. Since PRAGUE is not guaranteed to return a reconstruction for each cognate set, only 55 word formscould be directly compared to our system. We restricted comparison to this subset of the data. This favors PRAGUEsince the system only proposes a reconstruction when it is certain. Still, our system outperformed PRAGUE, withan average distance of 1.60 compared to 2.02 for PRAGUE. The difference is marginally significant, p = 0.06,partly due to the small number of word forms involved.

To get a more extensive comparison, we considered the hybrid system that returns PRAGUE’s reconstructionwhen possible and otherwise back off to the Sundanese (Snd.) modern form, then Madurese (Mad.), Malay (Mal.)and finally Javanic (Jv.) (the optimal back-off order). In this case, we obtained an edit distance of 1.86 using oursystem against 2.33 for PRAGUE, a statistically significant difference.

We also compared against ORACLE and CENTROID in a large-scale setting. Specifically, we compare to theexperimental setup on 64 modern languages used to reconstruct POc described before. Encouragingly, whilethe system’s average distance (1.49) does not attain that of the ORACLE (1.13), we significantly outperform theCENTROID baseline (1.79).

2.3 Cognate recoveryTo test the effectiveness of our cognate recovery system, we ran our system on all of the Oceanic languages in theABVD, which comprises roughly half of the Austronesian languages. We then evaluated the pairwise precision,recall, F1, and purity scores, defined as follows. Let G = {G1, G2, . . . , Gg} denote the known partitions of theforms into cognates, and let F = {F1, F2, . . . , Ff} denote the inferred partitions. Let pairs(F ) denote the set ofunordered pairs of indices in the same partition in F : pairs(F ) = {{i, j} : ∃k s.t. i, j ∈ Fk, i 6= j}, and similarly

6

Page 13: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

for pairs(G). The metrics are defined as follows:

precision =|pairs(G) ∩ pairs(F )|

|pairs(F )|,

recall =|pairs(G) ∩ pairs(F )|

|pairs(G)|,

F1 = 2precision · recall

precision + recall,

purity =1

N

∑f

maxg|Gg ∩ Ff |.

Using these metrics, we found that our system achieved a precision of 84.4, recall of 62.1, F1 of 71.5, andcluster purity of 91.8. Thus, over 9 out of 10 words are correctly grouped, and our system errs on the side of under-grouping words rather than clustering words that are not cognates. Since the null hypothesis in historical linguisticsis to deem words to be unrelated unless proven otherwise, a slight under-grouping is the desired behavior.

Since we are ultimately interested in reconstruction, we then compared our reconstruction system’s ability toreconstruct words given these automatically determined cognates. Specifically, we took every cognate group foundby our system (run on the Oceanic subclade) with at least two words in it. That is, we excluded words that oursystem found to be isolates. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words usingour system (using the auxiliary variables z described in Section 1.2.1).

For evaluation, we then looked at the average edit distance from our reconstructions to the known reconstruc-tions described in the previous sections. This time, however, we average per modern word rather than per cognategroup, to provide a fairer comparison. (Results were not substantially different averaging per cognate group.)

Using known cognates from the ABVD, there was an average reconstruction error of 2.19, versus 2.47 for theautomatically reconstructed cognates, or an increase in error rate of 12.8%. The fraction of words with each Leven-shtein distance for these reconstructions is shown in Figure S.1. While the plots are similar, the automatic cognatesexhibit a slightly longer tail. Thus, even with automatic cognates, the reconstruction system can reconstruct wordsfaithfully in many cases, only failing in a few instances.

2.4 Computation of functional loadsTo measure functional load quantitatively, we used the same estimator as the one used in [17]. This definitionis based on associating a context vector cx,` to each phoneme and language. For a given language with N(`)phoneme tokens, these context vectors are defined as follows: first, fix an enumeration order of all the contextsfound in the corpus, where the context is defined as a pair of phonemes, one at the left and one at the right of aposition. Element i in this enumeration will correspond to component i of the context vectors. Then, the value ofcomponent i in context vector cx,` is set to be the number of time phoneme x occurred in context i and languagel. Finally, King’s definition of functional load FL`(x, y) is the dot product of the two induced context vectors:

FL`(x, y) =1

N(`)2〈cx,`, cy,`〉 =

1

N(`)2

∑i

cx,`(i)× cy,`(i),

where the denominator is simply a normalization that insures FL`(x, y) ≤ 1. Note that if x and y are in comple-mentary distribution in language `, then the two vectors cx,` and cy,` are orthogonal. The functional load is indeedzero in this case.

In Figure 3 of the main paper, we show heat maps where the color encodes the log of the number of soundchanges that fall into a given 2-dimensional bin. Each sound change x > y is encoded as pair of numbers in theunit interval, (l, m), where l is an estimate of the functional load of the pair and m is the posterior fraction of theinstances of the phoneme x that undergo a change to y. We now describe how l, m were estimated. The posterior

7

Page 14: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

fraction m for the merger x > y between languages pa(`) → ` is easily computed from the same expectedsufficient statistics used for parameter estimation:

m`(x > y) =

∑p∈ΣN(S, x, p, `, y)∑

p′∈Σ

∑y′∈ΣN(S, x, p′, `, y′)

.

The estimate of the functional load requires additional statistics, i.e. the expected context vectors cx,` and expectedphoneme token counts N(`), but these can be readily extracted from the output of the MCMC sampler. Theestimate is then:

l`(x, y) =1

N(`)2〈cx,`, cy,`〉.

Finally, the set of points used to construct the heat map is:{(lpa(`)(x, y), m`(x > y)

): ` ∈ L− {root}, x ∈ Σ, y ∈ Σ, x 6= y

}.

3 Reconstruction listsIn Table S.2–S.4, we show the lists of consensus reconstructions produced by our system (the ‘Automatic’ column).For comparison, we also include a baseline (randomly picking one modern word), and the edit distances (when itis greater than zero). In Proto-Oceanic, since two manual reconstructions are available, we include the distancesof the automatic reconstruction to both manual reconstructions (‘P-A’ and ‘B-A’) and the distance between the twomanual reconstructions (‘B-P’).

4 Sound changesIn Figures S.2–S.5, we zoom and rotate each quadrant of the tree shown in Figure 2 of the main paper. Forinformation on the most frequent change of each branch, refer to the row in Table S.5 corresponding to the code inparenthesis attached to each branch. By most frequent, we mean the change with the highest expected count in thelast EM iteration, collapsing contexts for simplicity. In cases where no change is observed with an expected countof more than 1.0, we skip the corresponding entry—this can be caused for example by languages where cognacyinformation was too sparse in the dataset. The functional load, normalized to be in the interval [0, 1] is also shownfor reference.

5 Frequent errorsIn Figure S.6, we analyze the frequent discrepancy between the PAn reconstruction from our system with thosefrom [18]. The most frequent problematic substitutions, insertions, and deletions are shown. These frequencieswere obtained by aligning, after running the system, the reconstructions with the references. The aligner is a pair-HMM trained via EM, using only phoneme identity and gap information [19]. The frequencies were extractedfrom the posterior alignments.

References and Notes[1] Hastie T, Tibshiranim R, Friedman J (2009) The Elements of Statistical Learning (Springer).

[2] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the RoyalStatistical Society. Series B (Methodological) 39:1–38.

8

Page 15: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

[3] Liu DC, Nocedal J, Dong C (1989) On the limited memory BFGS method for large scale optimization. Mathematical Programming45:503–528.

[4] Lunter GA, Miklos I, Song YS, Hein J (2003) An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J.Comp. Biol. 10:869–889.

[5] Tierney L (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22:1701–1728.

[6] Holmes I, Bruno WJ (2001) Evolutionary HMM: a Bayesian approach to multiple alignment. Bioinformatics 17:803–820.

[7] Bouchard-Cote A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Advances in Neural Information ProcessingSystems 21:177–184.

[8] Caffo B, Jank W, Jones G (2005) Ascent-based Monte Carlo EM. Journal of the Royal Statistical Society - Series B 67:235–252.

[9] Judea P (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. (Morgan Kaufmann).

[10] Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368–376.

[11] Robert CP (2001) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (Springer).

[12] Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10.

[13] Greenhill S, Blust R, Gray R (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. EvolutionaryBioinformatics 4:271–283.

[14] Lewis MP, ed (2009) Ethnologue: Languages of the World, Sixteenth edition. (SIL International).

[15] Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Journal of Quanti-tative Linguistics 7:233–244.

[16] Nothofer B (1975) The reconstruction of Proto-Malayo-Javanic (M. Nijhoff).

[17] King R (1967) Functional load and sound change. Language 43:831–852.

[18] Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst. Linguist. Acad. Sinica1:31–94.

[19] Berg-Kirkpatrick T, Bouchard-Cote A, DeNero J, Klein D (2010) Painless unsupervised learning with features. Proceedings of the NorthAmerican Conference on Computational Linguistics pp 582–590.

Condition Edit dist.Unsupervised full system 1.87-FAITHFULNESS 2.02-MARKEDNESS 2.18-Sharing 1.99-Topology 2.06Semi-supervised system 1.75

Table S.1: Effects of ablation of various aspects of our unsupervised system on mean edit distance to POc. -Sharingcorresponds to the restriction to the subset of the features in OPERATION, FAITHFULNESS and MARKEDNESS thatare branch-specific, -Topology corresponds to using a flat topology where the only edges in the tree connectmodern languages to POc. The semi-supervised system is described in the text. All differences (compared to theunsupervised full system) are statistically significant.

9

Page 16: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Table S.2. Proto-Austronesian reconstructions

Gloss Reference Baseline Distance Automatic Distance

tohold(1) *gemgem *higem 3 *gemgemsmoke(1) *qebel *q1v1l 3 *qebeltoscratch(1) *ka Raw *kagaw 1 *ka Rawand(2) *mah *ma 1 *ma 1leg/foot(1) *qaqay *ai 4 *qaqayshoulder(1) *qaba Ra *vala 4 *qaba Rawoman/female(1) *bahi *babinay 4 *vavaian 5left(1) *kawi Ri *kayli 3 *kawi Riday(1) *qalejaw *andew 5 *qalejawmother(1) *tina *qina 1 *tinatosuck(1) *sepsep *sipsip 2 *sepsepsmall(2) *kedi *kedhi 1 *kedinight(1) *be RNi *beyi 2 *be RNitosqueeze(1) *pe Req *perah 3 *pereq 1tohit(1) *palu *mipalo 3 *palurat(1) *labaw *lapo 3 *kulavaw 3you(1) *ikamu *kamo 2 *kamu 1toplant(1) *mula *h1mula 2 *mulatoswell(1) *ba Req *ba Req *ba Reqtosee(1) *kita *kita *kitaone(1) *isa *sa 1 *isatosleep(1) *tudu R *maturug 4 *tudu Rdog(1) *wasu *kahu 2 *vatu 2topound,beat(20) *tutuh *nutu 2 *tutu 1stone(1) *batu *batu *batugreen(1) *mataq *mataq *mataqfather(1) *tama *qama 1 *tamathis(1) *ini *eni 1 *ani 1tooth(1) *nipen *lipon 2 *nipentochoose(1) *piliq *piliP 1 *piliqstar(1) *bituqen *bitun 2 *bituqentobuy(23) *baliw *taiw 2 *taiw 2tovomit(1) *utaq *mutjaq 2 *utaqtowork(1) *qumah *quma 1 *quma 1wide(1) *malawas *Gawig 5 *malabe R 3tocut,hack(1) *ta Raq *ta Raq *ta Raqtofear(1) *matakut *takuP 3 *matakuttolive,bealive(1) *maqudip *maPuNi 3 *maqudipthunder(3) *de RuN *zuN 3 *deruN 1tofly(2) *layap *layap *layaptoshoot(1) *panaq *fanak 2 *panaqname(1) *Najan *ngaza 4 *Najantobuy(1) *beli *poli 2 *beliand(1) *ka *kae 1 *kawhen?(1) *ijan *pirang 3 *pijan 1todig(1) *kalih *kali 1 *kali 1ash(1) *qabu *abu 1 *qabubig(1) *ma Raya *ma Raya *ma Rayatocut,hack(3) *tektek *tutek 2 *tektekroad/path(1) *zalan *dalan 1 *zalantostand(1) *di Ri *diri 1 *di Risand(1) *qenay *one 4 *qenay

continued on next page

10

Page 17: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous page

Gloss Reference Baseline Distance Automatic Distance

meat/flesh(31) *isi *ici 1 *isiwhat?(2) *nanu *anu 1 *anu 1toswell(26) * Ribawa *malifawa 4 *abeh 5bird(2) *qayam *ayam 1 *qayamhand(1) *lima *lime 1 *limain,inside(1) *idalem *dalale 4 *idalemif(2) *nu *no 1 *nutoknow,beknowledgeable(2) *bajaq *mafanaP 5 *mafanaP 5at(20) *di *ri 1 *diwind(2) *bali *feli 2 *beliu 2new(1) *mabaqe Ru *baPru 5 *vaquan 6blood(1) *da Raq *da Raq *da Raqbreast(1) *susu *soso 2 *susui(1) *iaku *ako 2 *iakusalt(1) *qasi Ra *sie 4 *qasi Ratoflow(1) *qalu R *ilir 4 *qalu Rfive(1) *lima *lima *limaat(1) *i *i *iother(1) *duma *duma *dumaleaf(2) *bi Raq *bela 3 *bela 3all(1) *amin *kemon 3 *aminrotten(1) *mabu Raq *mavuk 4 *mabu Ruk 2tocook(1) *tanek *tan@k 1 *tanekhead(1) *qulu *PuGuh 3 *qulumouth(2) *Nusu *Nutu 1 *Nuju 1house(1) * Rumaq *uma 2 * Rumaqif(1) *ka *ke 1 *kaneck(1) *liqe R *liq1g 2 *liqe Rneedle(1) *za Rum *dag1m 3 *za Rumhe/she(1) *siia *ia 2 *siiafruit(1) *buaq *buaq *buaqback(1) *likud *likudeP 2 *likudtochew(2) *qelqel *qmelqel 1 *qmelqel 1salt(2) *timus *timus *timuswe(2) *kami *sikami 2 *kamilong(1) *inaduq *nandu 3 *anaduq 1we(1) *ikita *itam 3 *kita 1three(1) *telu *t1lu 1 *telulake(1) *danaw *ranu 3 *danawtoeat(1) *kaen *kman 2 *kman 2no,not(3) *ini *ini *iniwhere?(1) *inu *sadinno 5 *ainu 1how?(1) *kuja *gagua 4 *kua 1tothink(34) *nemnem *n1mn1m 2 *kinemnem 2who?(2) *siima *cima 2 *tima 2tobite(1) *ka Rat *kagat 1 *ka Rattail(1) *iku R *ikog 2 *iku R

11

Page 18: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Table S.3. Proto-Oceanic reconstructions

Reconstructions Pairwise distances

Gloss Blust (B) Pawley (P) Automatic (A) B-P P-A B-A

fish(1) *ikan *ikan *ikanfive(1) *lima *lima *limawhat?(1) *sapa *saa *sava 1 1 1meat/flesh(1) *pisiko *pisako *kiko 1 3 3star(1) *pituqun *pituqon *vetuqu 1 4 3fog(1) *kaput *kaput *kabu 2 2toscratch(44) *karu *kadru *kadru 1 1shoulder(1) *pa Ra *qapa Ra *vara 2 4 2where?(3) *pai *pea *vea 2 1 3toclimb(2) *sake *sake *cake 1 1toeat(1) *kani *kani *kanitwo(1) *rua *rua *ruadry(11) *maca *masa *mamasa 1 2 3narrow(1) *kopit *kopit *kapi 2 2todig(1) *keli *keli *kelibone(2) *su Ri *su Ri *sui 1 1stone(1) *patu *patu *patuleft(1) *mawi Ri *mawi Ri *mawii 1 1they(1) *ira *ira *sira 1 1toliedown(1) *qinop *qeno *eno 2 1 3tohide(1) *puni *puni *vuni 1 1rope(1) *tali *tali *talismoke(2) *qasu *qasu *qasuwhen?(1) *Naican *Naijan *Nisa 1 3 3we(2) *kamami *kami *kami 2 2this(1) *ne *ani *eni 2 1 2egg(1) *qatolu R *katolu R *tolu 1 3 3stick/wood(1) *kayu *kayu *kai 2 2tosit(16) *nopo *nopo *nofo 1 1toshoot(1) *panaq *pana *pana 1 1liver(1) *qate *qate *qateneedle(1) *sa Rum *sa Rum *sau 2 2feather(1) *pulu *pulu *vulu 1 1topound,beat(2) *tutuk *tuki *tutuk 3 3near(9) *tata *tata *tataheavy(1) *mamat *mapat *mamava 1 3 2year(1) *taqun *taqun *taqu 1 1old(1) *matuqa *matuqa *matuqafire(1) *api *api *avi 1 1tochoose(1) *piliq *piliq *vili 2 2rain(1) *qusan *qusan *usa 2 2togrow(1) *tubuq *tubuq *tubu 1 1tosee(1) *kita *kita *kitatohear(1) *roNo R *roNo R *roNo 1 1tochew(1) *mamaq *mamaq *mama 1 1louse(1) *kutu *kutu *kutuwind(1) *aNin *mataNi *mataNi 4 4bad,evil(1) *saqat *saqat *saqa 1 1hand(1) *lima *lima *limatoflow(1) *tape *tape *tave 1 1night(1) *boNi *boNi *boNi

continued on next page

12

Page 19: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous page

Reconstructions Pairwise distances

Gloss Blust (B) Pawley (P) Automatic (A) B-P P-A B-A

day(5) *qaco *qaco *qaso 1 1tospit(14) *qanusi *qanusi *aNusu 3 3person/humanbeing(1) *taumataq *tamwata *tamata 3 1 2tovomit(1) *mumutaq *mumuta *muta 1 2 3name(1) *Najan *qajan *qasa 1 2 3snake(12) *mwata *mwata *mwataman/male(1) *mwa Ruqane *taumwaqane *mwane 5 5 4tobreathe(1) *manawa *manawa *manawafar(1) *sauq *sauq *sau 1 1tobuy(1) *poli *poli *voli 1 1tovomit(8) *luaq *luaq *lua 1 1tocook(9) *tunu *tunu *tunuthick(3) *matolu *matolu *matoluleg/foot(1) *waqe *waqe *waqetobite(1) *ka Rat *ka Rati *karat 1 2 1leaf(1) *raun *rau *dau 1 1 2sky(1) *laNit *laNit *laNi 1 1todrink(1) *inum *inum *inumtostand(2) *tuqur *taqur *tuqu 1 2 1i(1) *au *au *yau 1 1warm(1) *mapanas *mapanas *mavana 2 2moon(1) *pulan *pulan *vula 2 2how?(1) *kua *kuya *kua 1 1three(1) *tolu *tolu *tolutoplant(2) *tanum *tanom *tan@m 1 1 1mosquito(1) *namuk *namuk *namu 1 1bird(1) *manuk *manuk *manu 1 1four(1) *pani *pat *vati 2 2 2water(2) *wai R *wai R *wai 1 1one(1) *sakai *tasa *sa 3 2 3skin(1) *kulit *kulit *kulittoyawn(1) *mawap *mawap *mawa 1 1he/she(1) *ia *ia *ianose(1) *isuN *ijuN *isu 1 2 1thatch/roof(1) *qatop *qatop *qato 1 1towalk(2) *pano *pano *vano 1 1flower(1) *puNa *puNa *vuNa 1 1dust(1) *qapuk *qapuk *avu 3 3neck(18) * Ruqa * Ruqa *ua 2 2eye(1) *mata *mata *matafather(1) *tama *tamana *tama 2 2tofear(1) *matakut *matakut *matakutroot(2) *waka Ra *waka R *waka 1 1 2tostab,pierce(8) *soka *soka *sokabreast(1) *susu *susu *susutolive,bealive(1) *maqurip *maqurip *maquri 1 1head(1) *qulu *qulu *quluthou(1) *ko *iko *kou 1 2 1fruit(1) *puaq *puaq *vua 2 2

13

Page 20: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Table S.4. Proto-Polynesian reconstructions

Gloss Reference Baseline Distance Automatic Distance

rope(9) *taula *taura 1 *taulashort(11) *pukupuku *puPupuPu 2 *pukupukustrikewithfist *moto *moko 1 *motocoral *puNa *puNa *puNacook *tafu *tahu 1 *tafupainful,sick(1) *masaki *mahaki 1 *masakidownwards *hifo *ifo 1 *hifodive *ruku *uku 1 *rukutoface *haNa *haNa *haNagrasp *kapo *Papo 1 *kapobay *faNa *hana 2 *faNatail *siku *siPu 1 *sikutoblow(6) *pupusi *pupuhi 1 *pupusichannel *awa *ava 1 *awapandanus *fara *fala 1 *faraurinate *mimi *mimi *miminavel *pito *pito *pitogall *Pahu *au 2 *Pahuwave *Nalu *Nalu *Nalusleep *mohe *moe 1 *mohewarm(1) *mafanafana *mafanafana *mafanafanabeak *Nutu *Nutu *Nututooth *nifo *niho 1 *nifodance *saka *haka 1 *sakaleg *waPe *wae 1 *waPedrown *lemo *lemo *lemowater *wai *vai 1 *wainose *isu *isu *isutaro *talo *kalo 1 *talotohide(1) *funi *huna 2 *suna 2upwards *hake *aPe 2 *hakeoverripe *pePe *pee 1 *pePetosit(16) *nofo *noho 1 *nofored *kula *Pula 1 *kulatochew(1) *mama *mama *mamathree(1) *tolu *tolu *tolutosleep(10) *mohe *moe 1 *mohenine *hiwa *iwa 1 *hiwadawn *ata *ata *atafeather(1) *fulu *fulu *fulucanoe *waka *vaPa 2 *wakaleft(11) *sema *hema 1 *semaashes *refu *lehu 2 *refudew *sau *sau *sausmall *riki *riki *rikihouse *fale *hale 1 *falevoice *lePo *leo 1 *lePooctopus *feke *hePe 2 *feketoclimb(2) *kake *aPe 2 *ake 1sea *tahi *kai 2 *tahiday *Paho *ao 2 *Pahobranch *maNa *maNa *maNa

continued on next page

14

Page 21: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous page

Gloss Reference Baseline Distance Automatic Distance

lovepity *PaloPofa *aroha 5 *PaloPofaflyrun *lele *rere 2 *lelethick(3) *matolutolu *matolutolu *matolutoru 1tostrippeel *hisi *hihi 1 *hisisitdwell *nofo *noho 1 *nofoblack *kele *Pele 1 *keletorch *rama *ama 1 *rama

Table S.5. Sound changes

Code Parent Child Functional load Number of occurrences

(1) v (ProtoOcean) > p (AdmiraltyIslands) 0.09884 5.67800(2) a (Northern Dumagat) > 2 (Agta) 5.132e-05 110.26986(3) q (Bisayan) > P (AklanonBis) 0.03289 46.31015(4) a (Schouten) > @ (Ali) 7.158e-05 5.45255(5) e (UlatInai) > a (Alune) 1.00000 1.31832(6) e (SeramStraits) > o (Amahai) 0.05802 6.91108(7) a (SouthwestNewBritain) > e (Amara) 0.19090 15.52212(8) a (CentralWestern) > e (AmbaiYapen) 0.21003 5.74372(9) i (SeramStraits) > a (Ambon) 0.27139 3.04082(10) a (EastVanuatu) > e (AmbrymSout) 0.09085 14.42495(11) s (BimaSumba) > h (Anakalang) 0.03743 10.01598(12) a (Vanuatu) > e (AnejomAnei) 0.08559 14.43646(13) f (Futunic) > p (Anuta) 0.09282 19.96156(14) n (Wetar) > N (Aputai) 0.04788 7.10435(15) t (WestSanto) > r (ArakiSouth) 0.21996 20.80080(16) Not enough data available for reliable sound change estimates(17) d (NorthPapuanMainlandDEntrecasteaux) > t (AreTaupota) 0.06738 2.92590(18) O (Southern Malaita) > o (AreareMaas) 0.01151 1.99913(19) O (Southern Malaita) > o (AreareWaia) 0.01151 1.99971(20) a (Bibling) > e (Aria) 0.29446 4.99706(21) Not enough data available for reliable sound change estimates(22) O (SanCristobal) > o (ArosiOneib) 0.00562 1.99971(23) Not enough data available for reliable sound change estimates(24) b (ProtoCentr) > p (Aru) 0.13711 9.35892(25) a (RajaAmpat) > E (As) 0.05156 4.74243(26) o (Utupua) > u (Asumboa) 0.07562 3.70741(27) a (Central Manobo) > o (AtaTigwa) 0.00291 7.29063(28) s (Formosan) > h (Atayalic) 0.04098 4.31225(29) l (West NuclearTimor) > n (Atoni) 0.23175 7.02772(30) t (Ibanagic) > q (AttaPamplo) 0.47297 7.87084(31) a (Suauic) > e (Auhelawa) 0.18692 7.98362(32) N (MalekulaCentral) > n (Avava) 0.10747 15.49566(33) a (ProtoCentr) > e (Babar) 0.04891 5.61192(34) O (Choiseul) > o (Babatana) 0.01953 8.94227(35) O (Choiseul) > o (BabatanaAv) 0.01953 1.99952(36) O (Choiseul) > o (BabatanaKa) 0.01953 2.99952(37) Not enough data available for reliable sound change estimates(38) O (Choiseul) > o (BabatanaTu) 0.01953 2.99961(39) v (Ivatan) > b (Babuyan) 0.01574 16.97554(40) a (BorneoCoastBajaw) > e (Bajo) 0.04774 21.59109(41) a (NuclearCordilleran) > 2 (Balangaw) 0.00387 37.89097(42) N (BaliSasak) > n (Bali) 0.19166 13.91349(43) e (ProtoMalay) > @ (BaliSasak) 6.064e-04 26.22332(44) h (BimaSumba) > s (Baliledo) 0.03743 7.64100(45) p (East CentralMaluku) > f (BandaGeser) 0.05559 3.52479(46) a (Sulawesi) > o (BanggaiWdi) 0.06991 8.61615(47) a (Palawano) > o (Banggi) 6.735e-04 7.69053(48) e (LocalMalay) > a (BanjareseM) 0.10667 36.24790(49) a (SouthNewIrelandNorthwestSolomonic) > o (Banoni) 0.16106 5.69268(50) r (Sangiric) > d (Bantik) 0.01606 6.99267(51) b (Sulawesi) > w (Baree) 0.05030 10.71565(52) q (ProtoMalay) > P (Barito) 0.04087 15.54090(53) e (Madak) > o (Barok) 0.04576 1.91240(54) u (Northern EastFormosan) > o (Basai) 0.00130 5.88492(55) u (BashiicCentralLuzonNorthernMindoro) > o (Bashiic) 0.02423 63.00884(56) r (NorthernPhilippine) > y (BashiicCentralLuzonNorthernMindoro) 0.01665 3.54936(57) N (Palawano) > n (BatakPalaw) 0.04752 4.94491(58) O (SanCristobal) > o (BauroBaroo) 0.00562 1.99981(59) Not enough data available for reliable sound change estimates(60) Not enough data available for reliable sound change estimates(61) p (Vitiaz) > f (Bel) 0.01771 1.06548(62) u (BerawanLowerBaram) > o (Belait) 0.03125 12.34195(63) f (Futunic) > h (Bellona) 0.01009 15.97130(64) t (Pangasinic) > s (Benguet) 0.09759 1.35377

continued on next page

15

Page 22: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous pageCode Parent Child Functional load Number of occurrences

(65) u (BerawanLowerBaram) > o (BerawanLon) 0.03125 13.55252(66) k (NorthSarawakan) > P (BerawanLowerBaram) 0.15797 10.19426(67) e (SouthwestNewBritain) > i (Bibling) 0.03719 2.29657(68) O (RajaAmpat) > o (BigaMisool) 0.03843 5.00642(69) k (CentralPhilippine) > c (BikolNagaC) 9.303e-05 12.87548(70) a (Blaan) > O (BilaanKoro) 0.01132 23.88863(71) N (Blaan) > n (BilaanSara) 0.08320 18.08416(72) Not enough data available for reliable sound change estimates(73) u (Northern NuclearBel) > o (Bilibil) 0.03687 2.95970(74) a (ProtoMalay) > 1 (Bilic) 0.00777 19.43024(75) O (SouthNewIrelandNorthwestSolomonic) > o (Bilur) 0.00242 1.97890(76) t (BimaSumba) > d (Bima) 0.08028 9.97797(77) b (ProtoCentr) > w (BimaSumba) 0.08312 11.67021(78) a (NorthSarawakan) > e (Bintulu) 0.07029 9.74568(79) 1 (Manobo) > u (Binukid) 0.03862 2.90829(80) 1 (CentralPhilippine) > u (Bisayan) 0.01651 18.53014(81) 1 (Bilic) > a (Blaan) 0.08598 17.06499(82) Not enough data available for reliable sound change estimates(83) Not enough data available for reliable sound change estimates(84) t (GorontaloMongondow) > s (BolaangMon) 0.03159 4.79938(85) o (TukangbesiBonerate) > O (Bonerate) 0.02521 18.76480(86) u (Seram) > i (Bonfia) 0.14937 11.37310(87) u (Bontok) > o (BontocGuin) 0.04335 58.88829(88) a (BontokKankanay) > 1 (Bontok) 0.10466 3.41602(89) q (Bontok) > P (BontokGuin) 8.294e-04 49.64130(90) u (NuclearCordilleran) > o (BontokKankanay) 0.03806 9.39059(91) u (SuluBorneo) > o (BorneoCoastBajaw) 0.05043 3.53556(92) v (GelaGuadalcanal) > p (Bughotu) 0.09530 1.99015(93) a (Bugis) > @ (BugineseSo) 0.00654 17.32980(94) N (SouthSulawesi) > k (Bugis) 0.10991 2.98167(95) s (Bughotu) > h (Bugotu) 0.03395 8.55895(96) e (KayanMurik) > a (Bukat) 0.06056 4.02644(97) a (SouthHalmahera) > e (Buli) 0.05583 3.50307(98) o (Vanikoro) > O (Buma) 0.13019 23.27248(99) a (Sabahan) > o (BunduDusun) 0.05640 1.58162(100) u (Formosan) > o (Bunun) 2.254e-04 11.98634(101) u (CentralMaluku) > o (BuruNamrol) 0.01570 19.92112(102) o (South Bisayan) > u (ButuanTausug) 0.03947 3.20776(103) u (ButuanTausug) > o (Butuanon) 0.03983 30.32254(104) r (NorthPapuanMainlandDEntrecasteaux) > l (Bwaidoga) 0.10631 8.92246(105) a (NewCaledonian) > E (Canala) 0.05974 4.64394(106) N (ProtoChuuk) > n (Carolinian) 0.04042 23.82083(107) u (Bisayan) > o (Cebuano) 0.03656 8.68282(108) l (SouthHalmaheraWestNewGuinea) > r (CenderawasihBay) 0.11907 11.14761(109) u (EastFormosan) > o (CentralAmi) 4.190e-04 12.82795(110) u (SouthCentralCordilleran) > o (CentralCordilleran) 0.02941 3.43974(111) e (ProtoMalay) > @ (CentralEastern) 6.064e-04 32.49321(112) j (ProtoOcean) > s (CentralEasternOceanic) 0.00410 2.00058(113) e (BashiicCentralLuzonNorthernMindoro) > i (CentralLuzon) 0.01063 2.14101(114) R (ProtoCentr) > r (CentralMaluku) 0.02692 12.80932(115) u (MaselaSouthBabar) > E (CentralMas) 0.04503 4.21108(116) N (RemoteOceanic) > g (CentralPacific) 0.00869 11.59225(117) k (Peripheral PapuanTip) > g (CentralPapuan) 0.11515 5.29705(118) u (MesoPhilippine) > o (CentralPhilippine) 0.01374 38.98219(119) x (NortheastVanuatuBanksIslands) > k (CentralVanuatu) 0.02455 3.46857(120) i (CenderawasihBay) > u (CentralWestern) 0.12684 1.99006(121) u (Bisayan) > o (Central Bisayan) 0.03656 7.64073(122) l (East Nuclear Polynesian) > r (Central East Nuclear Polynesian) 0.02709 2.35911(123) a (Manobo) > 1 (Central Manobo) 0.09378 18.50241(124) a (SantaIsabel) > u (Central SantaIsabel) 0.27278 1.22175(125) t (Chamic) > k (ChamChru) 0.34269 7.73602(126) y (Malayic) > i (Chamic) 0.00237 1.73963(127) b (ProtoMalay) > p (Chamorro) 0.15451 10.03113(128) 7 (East SantaIsabel) > g (ChekeHolo) 0.03879 10.10942(129) o (SouthNewIrelandNorthwestSolomonic) > O (Choiseul) 0.00242 8.56855(130) a (ChamChru) > @ (Chru) 0.02139 31.97181(131) a (ProtoChuuk) > e (Chuukese) 0.11108 36.12356(132) a (ProtoChuuk) > e (ChuukeseAK) 0.11108 59.55135(133) @ (Atayalic) > a (CiuliAtaya) 0.02870 6.00504(134) e (North Babar) > o (Dai) 0.02409 5.70633(135) n (North Babar) > l (DaweraDawe) 0.16784 23.27929(136) N (LandDayak) > n (DayakBakat) 0.26785 7.13031(137) N (South West Barito) > n (DayakNgaju) 0.16918 29.67016(138) a (NorthSarawakan) > e (Dayic) 0.07029 5.15706(139) a (LoyaltyIslands) > e (Dehu) 0.20861 4.46889(140) g (Bwaidoga) > y (Diodio) 0.08008 12.32717(141) l (NorthPapuanMainlandDEntrecasteaux) > r (Dobuan) 0.10631 12.30865(142) p (Southern Malaita) > b (Dorio) 8.555e-04 1.99973(143) l (Nuclear WestCentralPapuan) > r (Doura) 0.13489 5.63596(144) u (Northern Dumagat) > o (DumagatCas) 0.01544 18.63984(145) s (SouthwestMaluku) > h (EastDamar) 0.02479 6.75499(146) a (CentralPacific) > o (EastFijianPolynesian) 0.23158 1.54042

continued on next page

16

Page 23: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous pageCode Parent Child Functional load Number of occurrences

(147) c (Formosan) > t (EastFormosan) 0.04224 6.87583(148) b (MaselaSouthBabar) > v (EastMasela) 0.00417 5.01447(149) i (BimaSumba) > u (EastSumban) 0.17438 28.36485(150) b (NortheastVanuatuBanksIslands) > v (EastVanuatu) 0.16115 2.78838(151) e (Barito) > i (East Barito) 0.02495 4.21938(152) p (CentralMaluku) > f (East CentralMaluku) 0.09093 7.47879(153) u (Manus) > o (East Manus) 0.01231 4.11061(154) g (NewGeorgia) > h (East NewGeorgia) 0.02220 5.62094(155) a (NuclearTimor) > e (East NuclearTimor) 0.14170 3.51835(156) l (Nuclear Polynesian) > r (East Nuclear Polynesian) 0.04761 79.14382(157) O (SantaIsabel) > o (East SantaIsabel) 0.02383 4.65059(158) @ (CentralEastern) > o (EasternMalayoPolynesian) 0.00303 18.76994(159) a (RemoteOceanic) > e (EasternOuterIslands) 0.14981 13.43554(160) a (AdmiraltyIslands) > e (Eastern AdmiraltyIslands) 0.16811 7.58136(161) a (BandaGeser) > o (ElatKeiBes) 0.05446 15.50349(162) r (SamoicOutlier) > l (Ellicean) 0.03331 5.08687(163) a (Futunic) > e (Emae) 0.20786 3.97652(164) e (SouthwestBabar) > E (Emplawas) 0.18331 11.05576(165) Not enough data available for reliable sound change estimates(166) i (BimaSumba) > e (EndeLio) 0.04706 5.22061(167) a (Sumatra) > @ (Enggano) 0.00149 1.61296(168) Not enough data available for reliable sound change estimates(169) Not enough data available for reliable sound change estimates(170) f (Wetar) > h (Erai) 0.03346 7.93007(171) a (Vanuatu) > e (Erromanga) 0.08559 16.54778(172) O (SanCristobal) > o (Fagani) 0.00562 1.99961(173) O (SanCristobal) > o (FaganiAguf) 0.00562 1.99952(174) O (SanCristobal) > o (FaganiRihu) 0.00562 1.99690(175) Not enough data available for reliable sound change estimates(176) u (WesternPlains) > o (Favorlang) 0.01234 20.09559(177) s (EastFijianPolynesian) > c (FijianBau) 0.04333 9.93879(178) f (Timor) > w (FloresLembata) 0.01025 7.38989(179) l (ProtoAustr) > r (Formosan) 0.01978 16.25315(180) r (Futunic) > l (FutunaAniw) 0.05835 2.94909(181) r (Futunic) > l (FutunaEast) 0.05835 48.57376(182) l (SamoicOutlier) > r (Futunic) 0.03331 74.90293(183) l (WestCentralPapuan) > r (Gabadi) 0.13055 5.24097(184) u (Ibanagic) > o (Gaddang) 0.01362 18.75014(185) e (Are) > i (Gapapaiwa) 0.06186 4.23744(186) a (BimaSumba) > o (GauraNggau) 0.13543 29.26141(187) a (ProtoMalay) > @ (Gayo) 0.00259 24.49089(188) r (Northern NuclearBel) > z (Gedaged) 0.02021 7.64916(189) c (GelaGuadalcanal) > s (Gela) 0.01050 2.08324(190) k (SoutheastSolomonic) > g (GelaGuadalcanal) 0.01749 19.76693(191) b (GeserGorom) > w (Geser) 0.06536 3.91631(192) f (BandaGeser) > w (GeserGorom) 0.05946 4.12917(193) Not enough data available for reliable sound change estimates(194) g (Guadalcanal) > G (Ghari) 4.995e-04 9.42542(195) Not enough data available for reliable sound change estimates(196) h (Guadalcanal) > G (GhariNggae) 5.670e-05 2.00000(197) O (Guadalcanal) > o (GhariNgger) 0.00601 2.99863(198) Not enough data available for reliable sound change estimates(199) Not enough data available for reliable sound change estimates(200) a (SouthHalmahera) > o (Giman) 0.11966 11.83686(201) a (GorontaloMongondow) > o (Gorontalic) 0.12679 8.78982(202) k (Gorontalic) > P (GorontaloH) 0.00973 18.65349(203) a (Sulawesi) > o (GorontaloMongondow) 0.06991 26.41042(204) s (GelaGuadalcanal) > c (Guadalcanal) 0.01050 3.50693(205) Not enough data available for reliable sound change estimates(206) Not enough data available for reliable sound change estimates(207) N (NehanNorthBougainville) > n (Haku) 0.10208 6.26107(208) 1 (MesoPhilippine) > u (Hanunoo) 0.02599 24.61094(209) t (Marquesic) > k (Hawaiian) 0.31942 56.48048(210) u (Peripheral Central Bisayan) > o (Hiligaynon) 0.02028 1.95625(211) r (Ambon) > l (HituAmbon) 0.03275 16.21990(212) g (West NewGeorgia) > G (Hoava) 0.00152 7.69893(213) a (NorthNewGuinea) > e (HuonGulf) 0.13866 4.84173(214) a (LoyaltyIslands) > e (Iaai) 0.20861 8.25767(215) u (Malayic) > o (Iban) 0.00621 12.93856(216) k (NorthernCordilleran) > q (Ibanagic) 0.40890 11.09862(217) N (Sabahan) > n (Idaan) 0.13834 5.84917(218) e (Futunic) > a (IfiraMeleM) 0.20786 4.95522(219) 1 (NuclearCordilleran) > o (Ifugao) 0.01254 19.55025(220) k (Ifugao) > q (IfugaoAmga) 0.34057 4.76075(221) k (Ifugao) > q (IfugaoBata) 0.34057 17.99754(222) 1 (Kallahan) > o (IfugaoBayn) 0.01101 17.75466(223) n (Wetar) > N (Iliun) 0.04788 10.14036(224) u (NorthernLuzon) > o (Ilokano) 0.02660 29.66945(225) a (Peripheral Central Bisayan) > u (Ilonggo) 0.12642 4.07172(226) a (SouthernCordilleran) > 1 (Ilongot) 0.07296 10.35510(227) N (Ilongot) > n (IlongotKak) 0.06631 9.13131(228) a (Ivatan) > e (Imorod) 0.08734 6.03794

continued on next page

17

Page 24: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous pageCode Parent Child Functional load Number of occurrences

(229) n (SouthwestBabar) > m (Imroing) 0.23281 10.35889(230) u (SamaBajaw) > o (Inabaknon) 0.02973 19.36631(231) a (LocalMalay) > e (Indonesian) 0.10667 3.03924(232) u (Benguet) > o (Inibaloi) 0.03849 64.30170(233) y (MaranaoIranon) > i (Iranun) 0.00959 4.31410(234) a (Ivatan) > e (Iraralay) 0.08734 11.08683(235) l (Ivatan) > d (Isamorong) 0.01340 14.95905(236) t (Ibanagic) > s (IsnegDibag) 0.05600 2.91874(237) h (Ivatan) > x (Itbayat) 0.00445 31.26009(238) o (Ivatan) > u (Itbayaten) 5.820e-04 67.10087(239) u (KalingaItneg) > o (ItnegBinon) 0.02352 60.52923(240) l (Ivatan) > d (Ivasay) 0.01340 14.96335(241) g (Bashiic) > y (Ivatan) 0.02574 5.02873(242) e (Ivatan) > 1 (IvatanBasc) 0.00723 32.77267(243) q (ProtoMalay) > h (Javanese) 0.07129 15.44980(244) a (Northern NewCaledonian) > e (Jawe) 0.13215 8.47484(245) e (West Barito) > o (Kadorih) 7.673e-05 3.80522(246) O (SanCristobal) > o (Kahua) 0.00562 1.99971(247) O (SanCristobal) > o (KahuaMami) 0.00562 1.99952(248) l (Gorontalic) > r (Kaidipang) 0.01053 21.87631(249) k (KairiruManam) > q (Kairiru) 0.00897 10.55125(250) u (Schouten) > i (KairiruManam) 0.17315 1.37804(251) n (Ilongot) > N (KakidugenI) 0.06631 9.38577(252) r (Mansakan) > l (Kalagan) 0.04611 3.61546(253) i (MesoPhilippine) > 1 (Kalamian) 0.02195 3.10459(254) 1 (KalingaItneg) > o (KalingaGui) 0.01076 22.89422(255) a (CentralCordilleran) > i (KalingaItneg) 0.13957 2.25237(256) s (Benguet) > h (Kallahan) 0.02667 15.50239(257) s (Kallahan) > h (KallahanKa) 0.00566 1.92678(258) 1 (Kallahan) > e (KallahanKe) 0.00403 38.49811(259) n (BimaSumba) > N (Kambera) 0.00600 25.28750(260) b (Tsouic) > v (Kanakanabu) 0.01715 4.07361(261) v (PatpatarTolai) > w (Kandas) 0.03857 3.91962(262) u (BontokKankanay) > o (KankanayNo) 0.04380 49.14539(263) n (CentralLuzon) > N (Kapampanga) 0.01456 6.07915(264) t (Ellicean) > d (Kapingamar) 3.974e-05 35.94977(265) v (LavongaiNalik) > f (KaraWest) 0.02487 4.21517(266) a (SouthHalmahera) > e (KasiraIrah) 0.05583 5.86174(267) e (South West Barito) > E (Katingan) 0.00426 27.03064(268) I (Pasismanua) > i (KaulongAuV) 0.01894 6.17672(269) a (Northern EastFormosan) > i (Kavalan) 0.08596 4.99960(270) b (ProtoMalay) > v (KayanMurik) 1.449e-07 6.64852(271) a (KayanMurik) > e (KayanUmaJu) 0.06056 5.93656(272) a (SarmiJayapuraBay) > e (KayupulauK) 0.32087 3.93441(273) a (FloresLembata) > e (Kedang) 0.10748 14.89342(274) b (KeiTanimbar) > B (KeiTanimba) 4.198e-05 4.13306(275) a (SoutheastMaluku) > e (KeiTanimbar) 0.17157 1.74318(276) a (Dayic) > e (KelabitBar) 0.07691 10.95832(277) e (East NuclearTimor) > E (Kemak) 2.255e-04 5.94197(278) i (NorthSarawakan) > e (KenyahLong) 0.03423 6.39511(279) a (LocalMalay) > o (Kerinci) 0.01029 37.28182(280) a (KilivilaLouisiades) > e (Kilivila) 0.14778 5.83519(281) o (Peripheral PapuanTip) > a (KilivilaLouisiades) 0.17286 2.23572(282) o (Central SantaIsabel) > O (KilokakaYs) 0.02707 5.60509(283) N (MicronesianProper) > n (Kiribati) 0.04469 10.23912(284) P (Manam) > k (Kis) 0.07162 3.84299(285) t (KisarRoma) > k (Kisar) 0.05336 24.56857(286) f (SouthwestMaluku) > w (KisarRoma) 0.03321 7.75331(287) a (BimaSumba) > o (Kodi) 0.13543 22.04728(288) l (ProtoCentr) > r (KoiwaiIria) 0.06793 11.95360(289) O (Central SantaIsabel) > o (Kokota) 0.02707 31.07342(290) e (Pesisir) > o (Komering) 0.00325 8.58447(291) a (Blaan) > O (KoronadalB) 0.01132 30.81323(292) s (NgeroVitiaz) > r (Kove) 0.06561 7.13972(293) u (PatpatarTolai) > a (Kuanua) 0.16803 1.98361(294) O (West NewGeorgia) > o (Kubokota) 0.00573 2.00145(295) v (Nuclear WestCentralPapuan) > b (Kuni) 0.11533 6.21112(296) Not enough data available for reliable sound change estimates(297) a (MicronesianProper) > e (Kusaie) 0.12251 14.33579(298) O (Northern Malaita) > o (Kwai) 0.00348 1.99952(299) a (Northern Malaita) > o (Kwaio) 0.18875 8.31055(300) l (Tanna) > r (Kwamera) 0.05929 25.71039(301) N (Northern Malaita) > n (KwaraaeSol) 0.03728 9.80996(302) Not enough data available for reliable sound change estimates(303) e (MelanauKajang) > @ (Lahanan) 0.00231 14.95868(304) i (Willaumez) > e (Lakalai) 0.11589 1.41234(305) r (Nuclear WestCentralPapuan) > l (Lala) 0.13489 4.70309(306) u (FloresLembata) > o (LamaholotI) 0.02895 10.53359(307) Not enough data available for reliable sound change estimates(308) s (BimaSumba) > h (Lamboya) 0.03743 9.36744(309) o (Bibling) > u (LamogaiMul) 0.12203 5.67124(310) a (Pesisir) > @ (Lampung) 0.00929 12.80317

continued on next page

18

Page 25: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous pageCode Parent Child Functional load Number of occurrences

(311) e (ProtoMalay) > a (LandDayak) 0.04499 11.18206(312) Not enough data available for reliable sound change estimates(313) k (Northern Malaita) > g (Lau) 0.07399 10.37078(314) Not enough data available for reliable sound change estimates(315) Not enough data available for reliable sound change estimates(316) r (NewIreland) > l (LavongaiNalik) 0.13377 4.12434(317) a (East Manus) > e (Leipon) 0.13133 15.77294(318) a (Tanna) > e (Lenakel) 0.05393 9.10010(319) Not enough data available for reliable sound change estimates(320) h (Gela) > G (LengoGhaim) 5.895e-05 1.98182(321) Not enough data available for reliable sound change estimates(322) f (SouthwestMaluku) > w (Letinese) 0.03321 8.50360(323) n (West Manus) > N (Levei) 0.18931 43.18724(324) a (LavongaiNalik) > e (LihirSungl) 0.07781 10.60106(325) i (West Manus) > e (Likum) 0.08687 4.08584(326) e (EndeLio) > @ (LioFloresT) 0.00290 10.17585(327) a (Malayan) > e (LocalMalay) 0.09530 18.43396(328) Not enough data available for reliable sound change estimates(329) r (Manus) > P (Loniu) 0.00619 8.93647(330) a (SoutheastIslands) > e (Lou) 0.09902 12.46015(331) i (LocalMalay) > e (LowMalay) 0.03770 2.99404(332) a (RemoteOceanic) > e (LoyaltyIslands) 0.14981 15.79651(333) t (Ellicean) > k (Luangiua) 0.47404 39.73227(334) e (MonoUruava) > a (LungaLunga) 0.13896 3.85273(335) O (West NewGeorgia) > o (Lungga) 0.00573 2.00155(336) O (West NewGeorgia) > o (Luqa) 0.00573 2.00145(337) b (East Barito) > w (Maanyan) 0.07537 6.13215(338) a (NewIreland) > e (Madak) 0.16211 8.90472(339) k (Madak) > g (MadakLamas) 0.05218 9.51948(340) l (LavongaiNalik) > r (Madara) 0.10006 10.95189(341) u (ProtoMalay) > o (Madurese) 0.01585 39.16274(342) l (CentralPapuan) > r (MagoriSout) 0.12605 3.95178(343) a (Nuclear PapuanTip) > i (Maisin) 0.22933 2.17692(344) n (SouthSulawesi) > N (Makassar) 0.18370 10.70075(345) k (MalaitaSanCristobal) > P (Malaita) 0.09152 5.02310(346) v (SoutheastSolomonic) > f (MalaitaSanCristobal) 0.06368 25.62607(347) Not enough data available for reliable sound change estimates(348) N (LocalMalay) > n (MalayBahas) 0.17169 34.31949(349) p (Malayic) > m (Malayan) 0.17398 3.33605(350) q (ProtoMalay) > h (Malayic) 0.07129 31.25704(351) a (Vanuatu) > e (MalekulaCentral) 0.08559 25.93669(352) a (NortheastVanuatuBanksIslands) > e (MalekulaCoastal) 0.08702 31.59187(353) a (Vitiaz) > o (Maleu) 0.11223 11.90975(354) e (Bugis) > a (Maloh) 0.07175 4.70755(355) u (CentralPhilippine) > o (Mamanwa) 0.01883 38.86565(356) u (East NuclearTimor) > a (Mambai) 0.34369 4.99285(357) e (BimaSumba) > a (Mamboru) 0.13262 7.46411(358) k (KairiruManam) > P (Manam) 0.01548 9.58872(359) a (Marquesic) > e (Mangareva) 0.26245 3.72483(360) s (BimaSumba) > c (Manggarai) 2.420e-05 8.94154(361) r (Tahitic) > l (Manihiki) 9.361e-04 13.67126(362) N (SouthernPhilippine) > n (Manobo) 0.07976 13.17836(363) 1 (AtaTigwa) > o (ManoboAtad) 0.03723 66.54667(364) 1 (AtaTigwa) > o (ManoboAtau) 0.03723 66.57068(365) 1 (Central Manobo) > a (ManoboDiba) 0.12386 3.76411(366) g (West Central Manobo) > h (ManoboIlia) 0.03983 13.28241(367) a (South Manobo) > 1 (ManoboKala) 0.12673 7.79055(368) 1 (South Manobo) > 2 (ManoboSara) 2.271e-04 58.20252(369) o (AtaTigwa) > 1 (ManoboTigw) 0.03723 11.93633(370) a (West Central Manobo) > 1 (ManoboWest) 0.13743 4.54113(371) l (Mansakan) > r (Mansaka) 0.04611 18.44830(372) o (CentralPhilippine) > u (Mansakan) 0.01883 21.46241(373) a (Eastern AdmiraltyIslands) > e (Manus) 0.13652 4.75105(374) v (Tahitic) > w (Maori) 0.00255 12.29480(375) l (BorneoCoastBajaw) > w (Mapun) 0.03973 7.82813(376) u (MaranaoIranon) > o (Maranao) 0.00377 54.21909(377) 1 (SouthernPhilippine) > e (MaranaoIranon) 0.00422 7.67944(378) Not enough data available for reliable sound change estimates(379) Not enough data available for reliable sound change estimates(380) Not enough data available for reliable sound change estimates(381) Not enough data available for reliable sound change estimates(382) s (East NewGeorgia) > c (Marovo) 0.00144 4.87889(383) a (Marquesic) > e (Marquesan) 0.26245 10.48977(384) f (Central East Nuclear Polynesian) > h (Marquesic) 0.05235 2.06222(385) a (MicronesianProper) > e (Marshalles) 0.12251 42.07442(386) i (South Babar) > E (MaselaSouthBabar) 0.04148 3.02753(387) e (Northern NuclearBel) > i (Matukar) 0.04103 3.86380(388) t (Willaumez) > f (Maututu) 0.00143 5.97137(389) Not enough data available for reliable sound change estimates(390) Not enough data available for reliable sound change estimates(391) Not enough data available for reliable sound change estimates(392) Not enough data available for reliable sound change estimates

continued on next page

19

Page 26: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous pageCode Parent Child Functional load Number of occurrences

(393) Not enough data available for reliable sound change estimates(394) e (Northern NuclearBel) > i (Megiar) 0.04103 3.96950(395) b (Nuclear WestCentralPapuan) > p (Mekeo) 0.01057 10.53926(396) e (Northwest) > a (MelanauKajang) 0.05269 4.10012(397) a (MelanauKajang) > e (MelanauMuk) 0.05542 8.30760(398) N (LocalMalay) > n (Melayu) 0.17169 15.57376(399) e (LocalMalay) > a (MelayuBrun) 0.10667 57.81798(400) Not enough data available for reliable sound change estimates(401) r (Vitiaz) > l (Mengen) 0.09236 7.76084(402) a (WestSanto) > e (Merei) 0.12570 4.40581(403) u (East Barito) > o (MerinaMala) 0.00538 45.42570(404) p (WesternOceanic) > b (MesoMelanesian) 0.06671 3.25436(405) e (ProtoMalay) > 1 (MesoPhilippine) 0.00192 27.69856(406) s (ProtoMicro) > t (MicronesianProper) 0.14436 10.32542(407) e (Malayan) > a (Minangkaba) 0.09530 36.65853(408) b (RajaAmpat) > p (Minyaifuin) 0.08190 5.44290(409) r (KilivilaLouisiades) > l (Misima) 0.09569 2.85853(410) u (Malayic) > o (Moken) 0.00621 18.67524(411) a (Ponapeic) > O (Mokilese) 0.00863 20.21510(412) k (Bwaidoga) > P (Molima) 0.02226 6.39306(413) r (MonoUruava) > l (Mono) 0.18018 7.90990(414) Not enough data available for reliable sound change estimates(415) Not enough data available for reliable sound change estimates(416) N (SouthNewIrelandNorthwestSolomonic) > n (MonoUruava) 0.07198 4.50934(417) t (CenderawasihBay) > P (Mor) 0.00191 7.90698(418) a (Sulawesi) > o (Mori) 0.06991 15.10951(419) d (ProtoChuuk) > t (Mortlockes) 0.10821 17.95820(420) v (EastVanuatu) > w (Mota) 0.04447 4.78867(421) v (SinagoroKeapara) > h (Motu) 0.01870 13.39891(422) r (Bibling) > x (Mouk) 3.176e-05 16.89672(423) u (Sulawesi) > o (MunaButon) 0.04575 3.73293(424) N (Western Munic) > n (MunaKatobu) 2.731e-04 6.02727(425) d (UlatInai) > r (MurnatenAl) 0.00181 5.95393(426) r (ProtoOcean) > l (Mussau) 0.16171 3.95147(427) a (EastVanuatu) > E (Mwotlap) 0.01111 27.83527(428) r (EndeLio) > z (Nage) 0.02129 1.96583(429) i (MalekulaCoastal) > e (Nahavaq) 0.05885 9.63004(430) n (Willaumez) > l (NakanaiBil) 0.00412 1.02690(431) a (LavongaiNalik) > @ (Nalik) 0.00211 12.96772(432) u (CentralVanuatu) > i (Namakir) 0.18127 21.52549(433) a (MalekulaCentral) > e (Naman) 0.13582 33.25250(434) b (MalekulaCoastal) > p (Nati) 0.01049 19.54611(435) r (SoutheastIslands) > l (Nauna) 0.10938 5.60587(436) a (ProtoMicro) > e (Nauru) 0.13661 5.67814(437) Not enough data available for reliable sound change estimates(438) s (NehanNorthBougainville) > h (Nehan) 0.05608 13.16549(439) N (Nehan) > n (NehanHape) 0.11803 18.93949(440) a (SouthNewIrelandNorthwestSolomonic) > o (NehanNorthBougainville) 0.16106 4.34745(441) e (Northern NewCaledonian) > a (Nelemwa) 0.13215 10.43158(442) o (Utupua) > œ (Nembao) 0.01621 4.98208(443) a (LoyaltyIslands) > e (Nengone) 0.20861 5.30862(444) m (Vanuatu) > n (Nese) 0.35703 18.91500(445) a (MalekulaCentral) > e (Neveei) 0.13582 34.11813(446) e (LoyaltyIslands) > a (NewCaledonian) 0.20861 1.71479(447) k (SouthNewIrelandNorthwestSolomonic) > g (NewGeorgia) 0.08381 5.00348(448) e (MesoMelanesian) > i (NewIreland) 0.06281 3.31247(449) w (BimaSumba) > v (Ngadha) 4.053e-05 14.09575(450) i (Aru) > e (NgaiborSAr) 0.02299 6.77784(451) f (NorthNewGuinea) > w (NgeroVitiaz) 0.01667 3.53470(452) Not enough data available for reliable sound change estimates(453) s (Gela) > h (Nggela) 0.05008 17.07009(454) b (CentralVanuatu) > p (Nguna) 0.06113 6.26520(455) p (Sumatra) > f (Nias) 0.00205 8.90278(456) f (NilaSerua) > h (Nila) 0.01125 4.39065(457) t (TeunNilaSerua) > l (NilaSerua) 0.13648 1.88492(458) u (Nehan) > w (Nissan) 0.02937 8.95963(459) a (Tongic) > e (Niue) 0.18204 4.31952(460) l (North Babar) > n (NorthBabar) 0.16784 6.84401(461) R (ProtoCentr) > r (NorthBomberai) 0.02692 10.12083(462) v (WesternOceanic) > p (NorthNewGuinea) 0.12924 8.71798(463) a (Nuclear PapuanTip) > e (NorthPapuanMainlandDEntrecasteaux) 0.27192 3.13375(464) a (Northwest) > e (NorthSarawakan) 0.05269 5.44048(465) e (Babar) > E (North Babar) 0.03978 1.27682(466) b (Sulawesi) > w (North Minahasan) 0.05030 3.03447(467) u (Vanuatu) > o (NortheastVanuatuBanksIslands) 0.06513 4.83269(468) r (NorthernLuzon) > g (NorthernCordilleran) 0.01688 4.17784(469) m (NorthernPhilippine) > n (NorthernLuzon) 0.15955 5.87812(470) N (ProtoMalay) > n (NorthernPhilippine) 0.10751 18.96525(471) o (NorthernCordilleran) > u (Northern Dumagat) 0.01648 1.25949(472) l (EastFormosan) > n (Northern EastFormosan) 0.14981 5.83985(473) p (Malaita) > b (Northern Malaita) 0.00727 4.99257(474) a (NewCaledonian) > o (Northern NewCaledonian) 0.17212 3.00312

continued on next page

20

Page 27: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous pageCode Parent Child Functional load Number of occurrences

(475) b (Bel) > p (Northern NuclearBel) 0.08564 2.04429(476) N (Sangiric) > n (Northern Sangiric) 0.10504 18.92173(477) q (ProtoMalay) > P (Northwest) 0.04087 30.80231(478) l (CentralCordilleran) > k (NuclearCordilleran) 0.14865 1.01516(479) b (Timor) > f (NuclearTimor) 0.08459 2.80832(480) l (PapuanTip) > n (Nuclear PapuanTip) 0.10099 4.22343(481) N (Polynesian) > n (Nuclear Polynesian) 0.01742 5.34206(482) r (WestCentralPapuan) > l (Nuclear WestCentralPapuan) 0.13055 1.52585(483) t (Ellicean) > d (Nukuoro) 3.974e-05 48.84748(484) f (HuonGulf) > w (NumbamiSib) 0.03605 5.99986(485) t (CenderawasihBay) > k (Numfor) 0.18043 10.67064(486) f (Seram) > h (Nunusaku) 0.02050 9.64666(487) a (LocalMalay) > @ (Ogan) 0.00113 22.90283(488) N (Javanese) > n (OldJavanes) 0.12968 22.81888(489) t (CentralVanuatu) > r (Orkon) 0.18595 20.59925(490) O (Southern Malaita) > o (Oroha) 0.01151 1.99952(491) r (EastVanuatu) > l (PaameseSou) 0.17094 18.23342(492) s (Formosan) > t (Paiwan) 0.10317 11.06529(493) a (ProtoMalay) > e (Palauan) 0.04499 11.70163(494) 1 (Palawano) > @ (PalawanBat) 4.146e-04 36.88428(495) 1 (MesoPhilippine) > u (Palawano) 0.02599 5.66511(496) Not enough data available for reliable sound change estimates(497) u (Pangasinic) > o (Pangasinan) 0.03899 25.70031(498) Not enough data available for reliable sound change estimates(499) a (WesternOceanic) > o (PapuanTip) 0.15356 3.56136(500) N (SouthwestNewBritain) > n (Pasismanua) 0.09977 3.11319(501) v (PatpatarTolai) > h (Patpatar) 0.00182 8.75307(502) a (SouthNewIrelandNorthwestSolomonic) > e (PatpatarTolai) 0.15474 6.20403(503) e (SeramStraits) > i (Paulohi) 0.18832 3.98409(504) u (Formosan) > o (Pazeh) 2.254e-04 6.81642(505) f (Tahitic) > h (Penrhyn) 0.02693 5.16875(506) n (Wetar) > N (Perai) 0.04788 4.13934(507) u (Central Bisayan) > o (Peripheral Central Bisayan) 0.02827 1.93754(508) e (PapuanTip) > a (Peripheral PapuanTip) 0.19334 3.54732(509) q (ProtoMalay) > h (Pesisir) 0.07129 14.80076(510) v (EastVanuatu) > f (PeteraraMa) 0.01242 17.18743(511) a (ChamChru) > 1 (PhanRangCh) 0.00636 12.19166(512) Not enough data available for reliable sound change estimates(513) v (EastFijianPolynesian) > f (Polynesian) 0.05871 17.52595(514) a (Ponapeic) > e (Ponapean) 0.16942 10.52294(515) a (PonapeicTrukic) > e (Ponapeic) 0.13099 19.78210(516) t (MicronesianProper) > d (PonapeicTrukic) 0.04291 17.55398(517) e (BimaSumba) > a (Pondok) 0.13262 6.53158(518) B (TukangbesiBonerate) > F (Popalia) 0.00989 10.98695(519) e (CentralEastern) > @ (ProtoCentr) 0.00248 3.97165(520) o (PonapeicTrukic) > a (ProtoChuuk) 0.09341 2.83945(521) N (ProtoAustr) > n (ProtoMalay) 0.13485 12.99231(522) v (RemoteOceanic) > f (ProtoMicro) 0.04980 13.02087(523) a (EasternMalayoPolynesian) > o (ProtoOcean) 0.09981 7.99270(524) f (SamoicOutlier) > w (Pukapuka) 0.00204 8.90305(525) a (NorthBomberai) > e (PulauArgun) 0.13318 12.93777(526) u (ProtoChuuk) > U (PuloAnna) 8.595e-06 28.37331(527) d (ProtoChuuk) > t (PuloAnnan) 0.10821 26.92081(528) d (ProtoChuuk) > t (Puluwatese) 0.10821 32.33646(529) u (KayanMurik) > o (PunanKelai) 0.00695 11.47865(530) b (Formosan) > v (Puyuma) 0.02080 5.97573(531) s (EastVanuatu) > h (Raga) 0.03550 19.03806(532) r (CenderawasihBay) > l (RajaAmpat) 0.10632 10.89494(533) f (East Nuclear Polynesian) > h (RapanuiEas) 0.05254 6.51440(534) a (Tahitic) > e (Rarotongan) 0.23947 2.99725(535) a (ProtoMalay) > @ (RejangReja) 0.00259 33.94479(536) i (CentralEasternOceanic) > a (RemoteOceanic) 0.30671 1.87121(537) r (Futunic) > g (Rennellese) 0.01560 46.78448(538) O (Choiseul) > o (Ririo) 0.01953 8.10758(539) r (NorthNewGuinea) > z (Riwo) 0.01094 1.23968(540) r (KisarRoma) > R (Roma) 0.00324 9.86839(541) k (Nuclear WestCentralPapuan) > h (Roro) 0.04267 7.79458(542) f (West NuclearTimor) > b (RotiTerman) 0.05510 4.89359(543) t (WestFijianRotuman) > f (Rotuman) 0.07237 18.85436(544) O (West NewGeorgia) > o (Roviana) 0.00573 6.86288(545) u (Formosan) > o (Rukai) 2.254e-04 9.26163(546) k (Tahitic) > P (Rurutuan) 0.00737 41.17228(547) u (Palawano) > o (SWPalawano) 7.014e-04 10.93716(548) f (Southern Malaita) > h (Saa) 0.00833 23.36029(549) Not enough data available for reliable sound change estimates(550) O (Southern Malaita) > o (SaaSaaVill) 0.01151 2.00000(551) O (Southern Malaita) > o (SaaUkiNiMa) 0.01151 1.99961(552) Not enough data available for reliable sound change estimates(553) u (Tsouic) > o (Saaroa) 0.00593 29.89765(554) a (Northwest) > o (Sabahan) 0.01254 7.65235(555) d (ProtoChuuk) > t (SaipanCaro) 0.10821 30.24687(556) u (Formosan) > o (Saisiat) 2.254e-04 32.49754

continued on next page

21

Page 28: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous pageCode Parent Child Functional load Number of occurrences

(557) a (Vanuatu) > E (SakaoPortO) 1.415e-05 7.37912(558) v (Suauic) > h (Saliba) 0.01548 5.59439(559) e (ProtoMalay) > a (SamaBajaw) 0.04499 11.37100(560) a (SuluBorneo) > 1 (SamalSiasi) 0.03121 4.55207(561) u (CentralLuzon) > o (SambalBoto) 0.03570 51.44942(562) k (SamoicOutlier) > P (Samoan) 7.913e-05 40.57180(563) r (Nuclear Polynesian) > l (SamoicOutlier) 0.04761 2.52276(564) Not enough data available for reliable sound change estimates(565) h (Northern Sangiric) > r (SangilSara) 0.01859 10.57112(566) q (Northern Sangiric) > P (Sangir) 0.17663 33.66747(567) P (Northern Sangiric) > q (SangirTabu) 0.17663 5.11841(568) a (Sulawesi) > e (Sangiric) 0.06360 9.41251(569) l (SanCristobal) > r (SantaAna) 0.20803 20.89020(570) O (SanCristobal) > o (SantaCatal) 0.00562 1.99971(571) s (SouthNewIrelandNorthwestSolomonic) > h (SantaIsabel) 0.01828 6.35726(572) l (NehanNorthBougainville) > n (SaposaTinputz) 0.13005 8.23827(573) a (Blaan) > u (SaranganiB) 0.10239 1.65720(574) o (SarmiJayapuraBay) > a (Sarmi) 0.30507 3.74591(575) l (NorthNewGuinea) > r (SarmiJayapuraBay) 0.09033 9.81639(576) a (BaliSasak) > @ (Sasak) 0.07763 7.22450(577) a (ProtoChuuk) > e (Satawalese) 0.11108 19.48892(578) a (BimaSumba) > e (Savu) 0.13262 10.66529(579) n (NorthNewGuinea) > N (Schouten) 0.12461 3.55253(580) a (Atayalic) > u (Sediq) 0.15623 4.94540(581) f (Western AdmiraltyIslands) > h (Seimat) 0.03838 5.52135(582) u (NorthBomberai) > i (Sekar) 0.07251 14.57543(583) b (SoutheastMaluku) > h (Selaru) 0.00154 5.27609(584) Not enough data available for reliable sound change estimates(585) E (Pasismanua) > e (Sengseng) 0.00635 6.51904(586) r (East CentralMaluku) > l (Seram) 0.17834 9.26050(587) l (Nunusaku) > r (SeramStraits) 0.08036 17.65607(588) e (MaselaSouthBabar) > E (Serili) 0.03391 27.97054(589) f (NilaSerua) > w (Serua) 0.09929 7.71214(590) N (PatpatarTolai) > n (Siar) 0.11079 7.91352(591) n (FloresLembata) > N (Sika) 0.12345 10.33270(592) f (Ellicean) > h (Sikaiana) 0.01500 18.73830(593) O (West NewGeorgia) > o (Simbo) 0.00573 3.68466(594) a (CentralPapuan) > o (SinagoroKeapara) 0.25340 1.00322(595) a (LandDayak) > o (Singhi) 0.01654 17.32791(596) u (EastFormosan) > o (Siraya) 4.190e-04 8.28489(597) O (Choiseul) > o (Sisingga) 0.01953 13.70360(598) Not enough data available for reliable sound change estimates(599) r (BimaSumba) > z (Soa) 0.00401 6.96601(600) a (Sarmi) > e (Sobei) 0.23073 10.06460(601) k (CentralMaluku) > P (Soboyo) 0.00549 7.46123(602) a (NehanNorthBougainville) > e (Solos) 0.10574 4.14147(603) Not enough data available for reliable sound change estimates(604) l (Ambon) > r (SouAmanaTe) 0.03275 3.58352(605) r (NorthernLuzon) > l (SouthCentralCordilleran) 0.03231 7.74272(606) n (MaselaSouthBabar) > l (SouthEastB) 0.25859 6.79647(607) a (CentralVanuatu) > e (SouthEfate) 0.06242 9.08441(608) v (SouthHalmaheraWestNewGuinea) > p (SouthHalmahera) 0.04413 6.86021(609) u (EasternMalayoPolynesian) > i (SouthHalmaheraWestNewGuinea) 0.15907 2.91215(610) i (NewIreland) > u (SouthNewIrelandNorthwestSolomonic) 0.17058 3.74428(611) u (Sulawesi) > o (SouthSulawesi) 0.04575 8.90368(612) t (Babar) > k (South Babar) 0.08233 5.96986(613) l (Bisayan) > y (South Bisayan) 0.06356 5.93967(614) a (Manobo) > 1 (South Manobo) 0.09378 14.76928(615) y (West Barito) > i (South West Barito) 0.00602 4.16231(616) o (Eastern AdmiraltyIslands) > u (SoutheastIslands) 0.01939 2.52762(617) R (ProtoCentr) > r (SoutheastMaluku) 0.02692 16.51716(618) r (CentralEasternOceanic) > l (SoutheastSolomonic) 0.18698 6.93482(619) t (SouthCentralCordilleran) > b (SouthernCordilleran) 0.17112 1.97141(620) e (ProtoMalay) > 1 (SouthernPhilippine) 0.00192 17.69275(621) l (Malaita) > n (Southern Malaita) 0.09412 3.02851(622) o (South Babar) > u (SouthwestBabar) 0.05720 5.18676(623) b (Timor) > w (SouthwestMaluku) 0.05085 6.39151(624) i (Vitiaz) > u (SouthwestNewBritain) 0.23570 3.84871(625) u (Atayalic) > o (SquliqAtay) 0.00681 13.92923(626) k (Suauic) > P (Suau) 0.02509 20.06806(627) r (Nuclear PapuanTip) > l (Suauic) 0.04206 7.42557(628) 1 (Subanun) > o (SubanonSio) 0.03138 42.58704(629) a (SouthernPhilippine) > 1 (Subanun) 0.05979 13.92604(630) g (Subanun) > d (SubanunSin) 0.15626 6.97087(631) e (ProtoMalay) > a (Sulawesi) 0.04499 11.63780(632) u (SamaBajaw) > o (SuluBorneo) 0.02973 3.89781(633) e (ProtoMalay) > o (Sumatra) 0.00512 23.81901(634) N (ProtoMalay) > n (Sunda) 0.10751 21.82474(635) u (South Bisayan) > o (Surigaonon) 0.03947 24.45484(636) a (Erromanga) > e (SyeErroman) 0.11628 12.15031(637) t (SouthSulawesi) > P (TaeSToraja) 0.06897 3.10286(638) N (Tboli) > n (Tagabili) 0.10214 23.38929

continued on next page

22

Page 29: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous pageCode Parent Child Functional load Number of occurrences

(639) u (CentralPhilippine) > o (TagalogAnt) 0.01883 10.16369(640) k (Kalamian) > q (TagbanwaAb) 0.42879 5.41452(641) q (Kalamian) > k (TagbanwaKa) 0.42879 17.73591(642) u (Tahitic) > o (Tahiti) 0.19238 6.98869(643) e (Tahitic) > a (TahitianMo) 0.23947 3.68949(644) a (Tahitic) > e (Tahitianth) 0.23947 2.71372(645) f (Central East Nuclear Polynesian) > h (Tahitic) 0.05235 6.11830(646) v (SaposaTinputz) > f (Taiof) 0.04497 6.79178(647) l (Ellicean) > r (Takuu) 0.01827 30.88407(648) Not enough data available for reliable sound change estimates(649) Not enough data available for reliable sound change estimates(650) Not enough data available for reliable sound change estimates(651) Not enough data available for reliable sound change estimates(652) h (Guadalcanal) > G (TalisePole) 5.670e-05 1.97576(653) f (Wetar) > h (Talur) 0.03346 6.74401(654) e (Vanikoro) > a (Tanema) 0.28248 10.81075(655) a (PatpatarTolai) > e (Tanga) 0.10990 12.52910(656) e (Utupua) > u (Tanimbili) 0.10880 6.71919(657) a (Vanuatu) > @ (Tanna) 0.01056 13.80984(658) r (Tanna) > l (TannaSouth) 0.05929 26.23479(659) a (Vanuatu) > e (Tape) 0.08559 36.62961(660) Not enough data available for reliable sound change estimates(661) f (Sarmi) > p (Tarpia) 0.09285 4.76042(662) o (ButuanTausug) > u (TausugJolo) 0.03983 38.80244(663) Not enough data available for reliable sound change estimates(664) a (Bilic) > O (Tboli) 0.01846 18.79298(665) O (Tboli) > o (TboliTagab) 0.02446 2.78683(666) O (Vanikoro) > o (Teanu) 0.13019 24.94671(667) l (SouthwestBabar) > n (TelaMasbua) 0.17076 14.35748(668) s (SaposaTinputz) > h (Teop) 0.06456 11.64051(669) N (East NuclearTimor) > n (TetunTerik) 0.02632 3.88184(670) t (TeunNilaSerua) > P (Teun) 0.01477 8.79107(671) e (SouthwestMaluku) > E (TeunNilaSerua) 0.00128 6.36734(672) b (WesternPlains) > f (Thao) 0.01723 9.66201(673) u (LavongaiNalik) > a (Tiang) 0.23417 7.41175(674) a (LavongaiNalik) > o (Tigak) 0.10194 2.85222(675) r (Futunic) > l (Tikopia) 0.05835 3.75175(676) R (ProtoCentr) > r (Timor) 0.02692 17.70656(677) e (Dayic) > o (TimugonMur) 0.00467 20.83722(678) s (Northern Malaita) > 8 (Toambaita) 0.01236 5.00367(679) k (Sumatra) > h (TobaBatak) 0.01209 10.18636(680) s (SamoicOutlier) > h (Tokelau) 0.02620 9.71242(681) g (Guadalcanal) > h (Tolo) 0.04461 9.43634(682) a (Tongic) > o (Tongan) 0.28401 6.49634(683) s (Polynesian) > h (Tongic) 0.04938 24.85227(684) l (North Minahasan) > d (Tonsea) 0.05604 2.89546(685) b (North Minahasan) > w (Tontemboan) 0.12446 9.61430(686) n (MonoUruava) > l (Torau) 0.17242 4.80739(687) a (Tsouic) > o (Tsou) 0.00357 26.52983(688) d (Formosan) > c (Tsouic) 0.02777 7.43830(689) n (Tahitic) > N (Tuamotu) 0.00257 17.01166(690) a (Wetar) > i (Tugun) 0.34504 3.34154(691) a (MunaButon) > o (TukangbesiBonerate) 0.19256 6.52913(692) a (LavongaiNalik) > e (TungagTung) 0.07781 8.41577(693) u (Barito) > o (Tunjung) 0.00125 6.26453(694) s (Ellicean) > h (Tuvalu) 0.01029 12.42513(695) p (Are) > f (Ubir) 0.00167 1.99937(696) Not enough data available for reliable sound change estimates(697) p (Aru) > f (UjirNAru) 0.01141 10.59153(698) h (Nunusaku) > b (UlatInai) 0.07007 6.57134(699) o (Erromanga) > e (Ura) 0.05761 9.45115(700) l (MonoUruava) > r (Uruava) 0.18018 8.71608(701) a (EasternOuterIslands) > o (Utupua) 0.19429 6.30089(702) s (SamoicOutlier) > h (UveaEast) 0.02620 28.16233(703) r (Futunic) > l (UveaWest) 0.05835 27.69390(704) r (Futunic) > l (VaeakauTau) 0.05835 41.42212(705) u (Choiseul) > @ (Vaghua) 3.654e-05 19.13794(706) Not enough data available for reliable sound change estimates(707) a (EasternOuterIslands) > e (Vanikoro) 0.33088 4.43031(708) o (Vanikoro) > e (Vano) 0.10677 4.13392(709) o (RemoteOceanic) > u (Vanuatu) 0.11711 14.34763(710) o (Choiseul) > O (Varisi) 0.01953 6.43868(711) Not enough data available for reliable sound change estimates(712) r (SinagoroKeapara) > l (Vilirupu) 0.17037 8.15153(713) a (NgeroVitiaz) > e (Vitiaz) 0.13267 4.47748(714) s (MesoMelanesian) > d (Vitu) 0.02619 6.27238(715) t (HuonGulf) > r (Wampar) 0.06174 5.81284(716) s (BimaSumba) > h (Wanukaka) 0.03743 14.17816(717) Not enough data available for reliable sound change estimates(718) v (CenderawasihBay) > w (Waropen) 0.08865 4.74200(719) r (GeserGorom) > l (Watubela) 0.17521 13.68799(720) l (AreTaupota) > r (Wedau) 0.04316 3.56623

continued on next page

23

Page 30: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

continued from previous pageCode Parent Child Functional load Number of occurrences

(721) i (BimaSumba) > e (WejewaTana) 0.04706 11.31057(722) u (Seram) > v (Werinama) 3.739e-05 7.78342(723) t (CentralPapuan) > k (WestCentralPapuan) 0.13745 14.10036(724) a (ProtoCentr) > o (WestDamar) 0.02891 9.89554(725) f (CentralPacific) > h (WestFijianRotuman) 0.00402 4.95883(726) k (NortheastVanuatuBanksIslands) > h (WestSanto) 0.01801 2.59859(727) a (Barito) > e (West Barito) 0.06510 2.67923(728) a (Central Manobo) > 1 (West Central Manobo) 0.12386 21.47034(729) a (Manus) > e (West Manus) 0.16506 7.10396(730) O (NewGeorgia) > o (West NewGeorgia) 0.00631 2.83443(731) r (NuclearTimor) > l (West NuclearTimor) 0.14621 3.15074(732) Not enough data available for reliable sound change estimates(733) 1 (West Central Manobo) > e (WesternBuk) 4.647e-04 76.71920(734) s (WestFijianRotuman) > c (WesternFij) 0.00926 4.99871(735) Not enough data available for reliable sound change estimates(736) a (ProtoOcean) > e (WesternOceanic) 0.12139 2.58873(737) d (Formosan) > s (WesternPlains) 0.04913 4.48311(738) s (AdmiraltyIslands) > h (Western AdmiraltyIslands) 0.01688 4.34207(739) a (MunaButon) > o (Western Munic) 0.19256 11.48592(740) t (SouthwestMaluku) > k (Wetar) 0.09657 3.25493(741) r (MesoMelanesian) > l (Willaumez) 0.14060 7.57978(742) b (CentralWestern) > v (WindesiWan) 0.16193 3.04357(743) P (Manam) > k (Wogeo) 0.07162 8.86144(744) k (ProtoChuuk) > g (Woleai) 0.00179 35.44518(745) a (ProtoChuuk) > e (Woleaian) 0.11108 60.39379(746) a (Sulawesi) > o (Wolio) 0.06991 8.71005(747) w (Western Munic) > v (Wuna) 0.00508 2.73532(748) t (Western AdmiraltyIslands) > P (Wuvulu) 0.00203 11.89389(749) a (HuonGulf) > e (Yabem) 0.13370 6.51629(750) a (SamaBajaw) > e (Yakan) 0.04299 27.73964(751) a (KeiTanimbar) > e (Yamdena) 0.18822 17.50706(752) o (Bashiic) > u (Yami) 0.00368 5.79971(753) a (ProtoOcean) > i (Yapese) 0.27352 4.64937(754) a (Javanese) > e (Yogya) 0.08208 4.09979(755) o (West SantaIsabel) > O (ZabanaKia) 0.02351 13.69682

24

Page 31: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

Figure S.1: Percentage of words with varying levels of Levenshtein distance. Known Cognates (gold) were hand-annotated by linguists, while Automatic Cognates were found by our system.

25

Page 32: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

(521)

( 559)

( 750)Yakan ( 632) ( 91) ( 40)Bajo ( 375)Mapun

( 560)

SamalS ias i

( 230)

Inabaknon ( 620)

( 629) ( 630)

SubanunSin

( 628)

SubanonSio ( 362) ( 123)

( 728) ( 733)

Wes ternBuk ( 370)

ManoboWes t

( 366)

ManoboIlia

( 365)

ManoboDiba

( 27) ( 363)

ManoboAtad

( 364)

ManoboAtau

( 369)

ManoboTigw

( 614)

( 367)

ManoboKala

( 368)

ManoboSara

( 79)

Binukid

( 377)

( 376)

Maranao

( 233)

Iranun

( 43)

( 576)

Sasak

( 42)

Bali

( 405) ( 118)

( 355)

Mamanwa

( 80) ( 121) ( 507)

( 225)

Ilonggo

( 210)

Hiligaynon

( 717)

WarayWaray

( 613)

( 102)

( 103)

Butuanon

( 662)

TausugJolo

( 635)

Surigaonon

( 3)

AklanonBis

( 107)

Cebuano

( 372)

( 252)

Kalagan

( 371)

Mansaka

( 639)

TagalogAnt

( 69)

BikolNagaC

( 253)

( 641)

TagbanwaKa

( 640)

TagbanwaAb

( 495)

( 57)

BatakPalaw

( 494)

PalawanBat

( 47)

Banggi

( 547)

SWPalawano

( 208)

Hanunoo

( 493)

Palauan

( 311)

( 595)

S inghi

( 136)

DayakBakat

( 52)

( 727)

( 615)

( 267)

Katingan

( 137)

DayakNgaju

( 245)

Kadorih

( 151)

( 403)

MerinaM

ala

( 337)

Maanyan

( 693)

Tunjung

( 350)

( 349)

( 327)

( 279)

Kerinc i

( 400)

MelayuSara

( 398)

Melayu

( 487)

Ogan

( 331)

LowMalay

( 231)

Indones ian

( 48)

BanjareseM

( 348)

MalayBahas

( 399)

MelayuBrun

( 407)

Minangkaba

( 126)

( 125)

( 511)

PhanRangCh

( 130)

Chru

( 206)

HainanCham

( 410)

Moken

( 215)

Iban

( 187)

Gayo

( 477)

( 554)

( 217)

Idaan

( 99)

BunduDusun

( 464)

( 138)

( 677)

TimugonM

ur

( 276)

KelabitBar

( 78)

Bintulu

( 66)

( 65)

Bera

wanLon

( 62)

Bela

it

( 278)

KenyahLong

( 396)

( 397)

Mela

nauM

uk

( 303)

Lahanan

( 633)

( 167)

( 169)

EngganoM

al

( 168)

EngganoBan

( 455)

Nia

s

( 679)

TobaBata

k

( 341)

Madure

se

( 470) ( 56)

( 55)

( 241)

( 242)

Ivata

nBasc

( 237)

Itbayat

( 39)

Babuyan

( 234)

Irara

lay

( 238)

Itbayate

n

( 235)

Isam

oro

ng

( 240)

Ivasay

( 228)

Imoro

d

( 752)

Yam

i

( 113)

( 561)

Sam

balB

oto

( 263)

Kapam

panga

( 469)

( 605)

( 619)

( 498)

( 64)

( 256)

( 222)Ifu

gaoBayn

( 257)Kalla

hanKa

( 258)Kalla

hanKe

( 232)

Inib

alo

i

( 497)

Pangasin

an

( 226)

( 251)

Kakid

ugenI

( 227)

IlongotK

ak

( 110)

( 478)

( 219)

( 220)

IfugaoAm

ga

( 221)

IfugaoBata

( 90)

( 88)

( 89)Bonto

kGuin

( 87)Bonto

cGuin

( 262)

KankanayNo

( 41)

Bala

ngaw

( 255)

( 254)

Ka

ling

aG

ui

( 239)

Itne

gB

ino

n

( 468)

( 216)

( 184)

Ga

dd

an

g

( 30)

Atta

Pa

mp

lo

( 236)

Isn

eg

Dib

ag

( 471)

( 2)

Ag

ta

( 144)

Du

ma

ga

tCa

s

( 224)

Ilok

an

o

( 243)

( 754)

Yo

gy

a

( 488)

Old

Jav

an

es

( 735)

We

ste

rnM

ala

yo

Po

lyn

es

ian

(631)

( 611)

( 94)

( 93)

Bu

gin

es

eS

o

( 354)

Ma

loh

( 344)

Ma

ka

ss

ar

( 637)

Ta

eS

To

raja

(568)

(476)

(567)

Sa

ng

irTa

bu

(566)

Sa

ng

ir

(565)

Sa

ng

ilSa

ra

(50)

Ba

ntik

(203)

(201)

(248)

Ka

idip

an

g

(202)

Go

ron

talo

H

(84)

Bo

laa

ng

Mo

n

(423)

(691)

(518)

Po

pa

lia (85)

Bo

ne

rate

(739)

(747)

Wu

na

(424)

Mu

na

Ka

tob

u

(466)

(685)

To

nte

mb

oa

n (684)

To

ns

ea

(46)

Ba

ng

ga

iWd

i (51)

Ba

ree

(746)

Wo

lio (418)

Mo

ri

(270)

(271)

Ka

ya

nU

ma

Ju

(96)

Bu

ka

t

(529)

Pu

na

nK

ela

i

(

111)

(158)

(523)

(736)

(462)

(213)

(715)

Wa

mp

ar

(749)

Ya

be

m (484)

Nu

mb

am

iSib

(451) (7

13)

(624)

(67)

(422)

Mo

uk

(20)

Aria

(309)

La

mo

ga

iMu

l

(7)

Am

ara

(500)

(585)

Se

ng

se

ng

(268)

Ka

ulo

ng

Au

V

(61) (475)

(73)

Bilib

il (394)

Me

gia

r

(188)

Ge

da

ge

d

(387)

Ma

tuk

ar

(72)

Bilia

u

(401)

Me

ng

en

(353)

Ma

leu

(292)

Ko

ve

(575)

(574)

(661)

Ta

rpia

(600)

So

be

i

(272)

Ka

yu

pu

lau

K

(539)

Riw

o

(579) (250)

(249)

Kairiru

(358) (284)

Kis

(743)

Wogeo

(4)

Ali

(499)

(480)

(627)

(31)

Auhela

wa

(558)

Salib

a

(626)

Suau

(343)

Mais

in

(463)

(205)

Gum

awana

(104)

(140)

Dio

dio

(412)

Molim

a

(17) (16)

(695)

Ubir

(185)

Gapapaiw

a

(720)

Wedau

(141)

Dobuan

(508)

(281)

(409)

Mis

ima

(280)

Kiliv

ila

(117) (723)

(183)

Gabadi

(482) (143)

Doura

(295)

Kuni

(541)

Roro

(395)

Mekeo

(305)

Lala

(594)

(712)

Viliru

pu

(421)

Motu

(342)

MagoriS

out

(404)

(448)

(610)

(502)

(261)

Kandas

(655)

Tanga

(590)

Sia

r

(293)

Kuanua

(501)

Patp

ata

r

(447) (730)

(544)

Rovia

na

(593)

Sim

bo

(437)

Nduke

(296)

Kusaghe

(212)

Hoava

(335)

Lungga

(336)

Luqa

(294)

Kubokota

(696)

Ughele

(193)

Ghanongga

(154)

(706)

Vangunu

(382)

Marovo

(391)

Mbareke

(440)

(602)

Solos

(572)

(646)

Taiof

(668)

Teop

(438)

(439)

NehanHape

(458)

Nissan

(207)

Haku

(129)

(597)

Sisingga

(38)

BabatanaTu

(711)

VarisiGhon

(710)

Varisi

(538)

Ririo

(584)

Sengga

(35)

BabatanaAv

(705)

Vaghua

(34)

Babatana

(37)

BabatanaLo

(36)

BabatanaKa

(416)

(334)

LungaLunga

(413)

Mono

(414)

MonoAlu

(686)

Torau

(700)

Uruava

(415)

MonoFauro

(75)

Bilur

(571)

(157)

(128)ChekeHolo

(379)MaringeKma

(381)MaringeTat

(452)NggaoPoro

(380)MaringeLel

(732)

(755)ZabanaKia

(302)LaghuSamas

(124)

(82)Blablanga

(83)BlablangaG

(289)Kokota

(282)KilokakaYs

(49)

Banoni

(316)

(340)

Madara

(692)

TungagTung

(265)

KaraWest

(431)

Nalik

(673)

Tiang

(674)

Tigak

(324)

LihirSungl

(338)

(339)MadakLamas

(53)Barok

(741)

(388)

Maututu

(430)

NakanaiBil

(304)Lakalai

(714)

Vitu

(753)

Yapese

(

112)

(618)

(190)

(92)

(393)MbughotuDh

(95)Bugotu

(204)

(651)TaliseMoli

(650)TaliseMala

(195)GhariNdi (194)Ghari (681)Tolo (648)

Talise (347)

Malango (196)

GhariNggae (392)Mbirao (199)GhariTanda (652)TalisePole (197)GhariNgger (649)TaliseKoo (198)

GhariNgini (189) (319)

Lengo (320)

LengoGhaim (321)

LengoParip (453)

Nggela

(346)

(345)

(473)

(678)

Toambaita (298)

Kwai (312)

Langalanga

(299)

Kwaio (390)

Mbaengguu

(313)

Lau (389)

Mbaelelea

(301)

KwaraaeSol

(175)

Fataleka

(315)

LauWalade

(314)

LauNorth

(328)

Longgu (621)

(551)

SaaUkiNiMa

(550)

SaaSaaVill

(490)

Oroha

(552)

SaaUlawa

(548)

Saa

(142)

Dorio

(18)

AreareMaas

(19)

AreareWaia

(54

9)

SaaAuluVil

(56

4)

(58

)

BauroBaroo

(59

)

BauroHaunu

(17

2)

Fagani

(24

7)

KahuaMami

(57

0)

SantaCatal

(22

)

ArosiOneib

(23

)

ArosiTawat

(21

)

Arosi

(24

6)

Kahua

(60

)

BauroPawaV

(17

3)

FaganiAguf

(17

4)

FaganiRihu

(56

9)

SantaAna

(66

3)

Tawaroga

(

536)

(

709)

(

657)

(

658)

TannaSouth

(

300)

Kwamera

(

318)

Lenakel

(

171)

(

636)

SyeErrom

an

(

699)

Ura

(

467)

(

726)

(

15)

ArakiS

outh

(

402)

Mere

i

(

150)

(

10)

Ambry

mSout

(

420)

Mota

(

510)

Petera

raM

a

(

491)

PaameseSou

(

427)

Mwotla

p

(

531)

Raga

(

119)

(

607)

SouthEfa

te

(

489)

Orkon

(

454)

Nguna

(

432)

Namakir

(

352)

(

434)

Nati

(

429)

Nahavaq

(

444)

Nese

(

659)

Tape

(

12)

Anejom

Anei

(

557)

Saka

oPor

tO

(

351)

(

32)

Avav

a

(

445)

Neve

ei

(

433)

Nam

an

(

522)

(

406)

(

516)

(

520)

(

528)

Pulu

wat

ese

(

527)

Pulo

Annan

(

745)

Wole

aia

n

(

577)

Sata

wale

se

(

132)

ChuukeseAK

(

419)

Mortlo

ckes

(

106)

Caro

linia

n

(

131)

Chuukese

(

603)

Sonsoro

les

(

555)

Saip

anCaro

(

526)

Pulo

Anna

(

744)

Wole

ai

(

515)

(

411)

Mokilese

(

512)

Pin

gilapes

(

514)

Ponapean

(

283)

Kirib

ati

(

297)

Kusaie

(

385)

Mars

halles

(

436)

Nauru

(

332)

(

446)

(

474)

(

441)

Nele

mwa

(

244)

Jawe

(

105)

Canala

(

214)

Iaai

(

443)

Nengone

(

139)

Dehu

(

116)

(

725)

(

543)

Rotu

man

(

734)

Weste

rnFij

(

146)

(

513)

(

481)

(

156)

(

122)

(

645)

(

374)

Maori

(

642)

Tahiti

(

505)

Penrh

yn

(

689)

Tuam

otu

(

361)

Manih

iki

(

546)

Ruru

tuan

(

643)

Tahitia

nM

o

(

534)

Raro

tongan

(

644)

Ta

hit

ian

th

(

384)

(

383)

Ma

rqu

es

an

(

359)

Ma

ng

are

va

(

209)

Ha

wa

iia

n

(

533)

Ra

pa

nu

iEa

s

(563)

(56

2)

Sa

mo

an

(

182)

(

63)

Be

llo

na

(

675)

Tik

op

ia

(

163)

Em

ae

(

13)

An

uta

(

537)

Re

nn

ell

es

e

(

180)

Fu

tun

aA

niw

(

218)

Ifir

aM

ele

M

(

703)

Uv

ea

We

st

(

181)

Fu

tun

aE

as

t

(704)

Va

ea

ka

uT

au

(162)

(647)

Ta

ku

u (694)

Tu

va

lu (592)

Sik

aia

na

(264)

Ka

pin

ga

ma

r

(483)

Nu

ku

oro

(333)

Lu

an

giu

a (524)

Pu

ka

pu

ka

(680)

To

ke

lau

(702)

Uv

ea

Ea

st

(683)

(682)

To

ng

an

(459)

Niu

e (177)

Fij

ian

Ba

u

(159)

(707)

(666)

Te

an

u (708)

Va

no

(654)

Ta

ne

ma

(98)

Bu

ma

(701)

(26)

As

um

bo

a (656)

Ta

nim

bil

i (442)

Ne

mb

ao

(1)

(738)

(581)

Se

ima

t

(748)

Wu

vu

lu (160)

(616)

(330)

Lo

u (435)

Na

un

a

(373)

(153)

(317)

Le

ipo

n (598)

Siv

isa

Tit

a

(

729)

(

325)

Lik

um

(

323)

Le

ve

i

(

329)

Lo

niu

(

426)

Mu

ss

au

(

609)

(

608)

(

266)

Ka

sir

aIr

ah

(

200)

Gim

an

(

97)

Bu

li

(

108)

(

417)

Mo

r

(

532)

(

408)

Min

ya

ifu

in

(

68)

Big

aM

iso

ol

(

25)

As

(

718)

Wa

rop

en

(

485)

Nu

mfo

r

(

120)

(

378)

Ma

rau

(

8)

Am

ba

iYa

pe

n

(

742)

Win

desiW

an

(

519)

(

33)

(

465)

(

460)

Nort

hBabar

(

135)

Dawera

Dawe

(

134)

Dai

(

612)

(

622)

(

667)

Tela

Masbua

(

229)

Imro

ing

(

164)

Em

pla

was

(

386)

(

148)

EastM

asela

(

588)

Serili

(

115)

CentralM

as

(

606)

South

EastB

(

724)

WestD

am

ar

(

24)

(

660)

Tara

nganBa

(

697)

UjirN

Aru

(

450)

Ngaib

orS

Ar

(

114)

(

152)

(

45)

(

192)

(

191)

Geser

(

719)

Watu

bela

(

161)

Ela

tKeiB

es

(

586)

(

486)

(

587)

(

9)

(

211)

HituAm

bon

(

604)

SouAm

anaTe

(

6)

Am

ahai

(

503)

Paulo

hi

(

698)

(

5)

Alu

ne

(

425)

Murn

ate

nAl

(

86)

Bonfia

(

722)

Werinam

a

(

101)

Buru

Nam

rol

(

601)

Soboyo

(

77)

(

357)

Mam

boru

(

360)

Manggara

i

(

449)

Ngadha

(

517)

Pondok

(

287)

Kodi

(

721)

Weje

waTana

(

76)

Bim

a

(

599)

Soa

(

578)

Savu

(

716)

Wan

ukak

a

(

186)

Gaura

Nggau

(

149)

Eas tSum

ban

(

44)

Baliledo

(

308)

Lamboya

(

166)

(

326)

L ioF lo

resT

(

165)

Ende

(

428)

Nage

(

259)

Kambera

(

496)

PalueNitu

n

(

11)

Anakalang

(

461)

(

582)

Sekar

(

525)

PulauArg

un

(

676)

(

479)

(

731)

(

542)

RotiTerm

an

(

29)

Atoni

(

155)

(

356)

Mam

bai

(

669)

TetunTerik

(

277)

Kemak

(

178)

(

307)

Lamale

rale

(

306)

Lamaholo

tI

(

591)

S ika

(

273)

Kedang

( 62

3)

( 74

0)

(

170)

E rai

(

653)

Talur

( 14

)

Aputai

( 50

6)

Perai

( 69

0)

Tugun

( 22

3)

Iliun

( 67

1)

( 45

7)

( 58

9)

Serua

( 45

6)

Nila

( 67

0)

Teun

( 28

6)

( 28

5)

Kis ar

( 54

0)

Roma

( 14

5)

Eas tDamar

( 32

2)

Letinese

( 61

7)

( 27

5)

( 75

1)

Yamdena

( 27

4)

KeiTanimba

( 58

3)

Selaru

( 28

8)

KoiwaiIria

( 127)

Chamorro

( 509)

(310)

Lampung

( 290)

Komering

( 634)

Sunda

( 535)

RejangReja

( 74)

( 664)

( 665)

TboliTagab

( 638)

Tagabili

( 81)

( 291)

KoronadalB

( 573)

SaranganiB

( 70)

BilaanKoro

( 71)

BilaanSara

( 179)

( 28)

( 133)

CiuliAtaya

( 625)

SquliqAtay

( 580)

Sediq

( 100)

Bunun

( 737)

( 672)

Thao

( 176)

Favorlang

( 545)

Rukai

( 492)

Paiwan

( 688)

( 260)Kanakanabu

( 553)Saaroa

( 687)Tsou

( 530)

Puyuma

( 147)

( 472) ( 269) Kavalan ( 54) Basai

( 109) CentralAmi ( 596) S iraya

( 504) Pazeh ( 556) Sais iat

!

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

!

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

Figure S.2: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parenthesis attached to each branch.

26

Page 33: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

(521)

(559)

(750)

Ya

ka

n (632)

(91)

(40)

Ba

jo (375)

Ma

pu

n (560)

Sa

ma

lSia

si

(230)

Ina

ba

kn

on

(620)

(629)

(630)

Su

ba

nu

nS

in

(628)

Su

ba

no

nS

io (362)

(123)

(728)

(733)

We

ste

rnB

uk

(370)

Ma

no

bo

We

st

(366)

Ma

no

bo

Ilia

(365)

Ma

no

bo

Dib

a

(27)

(363)

Ma

no

bo

Ata

d

(364)

Ma

no

bo

Ata

u

(

369)

Ma

no

bo

Tig

w

(

614)

(

367)

Ma

no

bo

Ka

la

(

368)

Ma

no

bo

Sa

ra

(

79)

Bin

uk

id

(

377)

(

376)

Ma

ran

ao

(

233)

Ira

nu

n

(

43)

(

576)

Sa

sa

k

(

42)

Ba

li

(

405)

(

118)

(

355)

Ma

ma

nw

a

(

80)

(

121)

(

507)

(

225)

Ilo

ng

go

(

210)

Hil

iga

yn

on

(

717)

Wa

ray

Wa

ray

(

613)

(

102)

(

103)

Bu

tua

no

n

(

662)

Ta

us

ug

Jolo

(

635)

Su

rig

ao

no

n

(

3)

Akla

nonBis

(

107)

Cebuano

(

372)

(

252)

Kala

gan

(

371)

Mansaka

(

639)

Tagalo

gAnt

(

69)

Bik

olN

agaC

(

253)

(

641)

TagbanwaKa

(

640)

TagbanwaAb

(

495)

(

57)

Bata

kPala

w

(

494)

Pala

wanBat

(

47)

Banggi

(

547)

SW

Pala

wano

(

208)

Hanunoo

(

493)

Pala

uan

(

311)

(

595)

Sin

ghi

(

136)

DayakBakat

(

52)

(

727)

(

615)

(

267)

Katingan

(

137)

DayakNgaju

(

245)

Kadorih

(

151)

(

403)

MerinaM

ala

(

337)

Maanyan

(

693)

Tunju

ng

(

350)

(

349)

(

327)

(

279)

Kerinci

(

400)

Mela

yuSara

(

398)

Mela

yu

(

487)

Ogan

(

331)

LowM

ala

y

(

231)

Indonesia

n

(

48)

Banja

reseM

(

348)

Mala

yBahas

(

399)

Mela

yuBru

n

(

407)

Min

angkaba

(

126)

(

125)

(

511)

PhanRangCh

(

130)

Chru

(

206)

Hai

nanC

ham

(

410)

Mok

en

(

215)

Iban

(

187)

Gayo

(

477)

(

554)

(

217)

Idaa

n

(

99)

BunduDusun

(

464)

(

138)

(

677)

TimugonM

ur

(

276)

KelabitB

ar

(

78)

Bintu

lu

(

66)

(

65)

BerawanLon

(

62)

Belait

(

278)

KenyahLong

(

396)

(

397)

Mela

nauMuk

(

303)

Lahanan

(

633)

(

167)

(

169)

EngganoMal

(

168)

EngganoBan

(

455)

Nias

(

679)

TobaBatak

(

341)

Madure

se

( 47

0)

(

56)

(

55)

(

241)

(

242)

Ivata

nBas c

(

237)

Itbayat

(

39)

Babuyan

(

234)

Irara

lay

(

238)

Itbayate

n

(

235)

Isam

orong

(

240)

Ivasay

(

228)

Imoro

d

( 75

2)

Yami

( 11

3)

( 56

1)

SambalBoto

( 26

3)

Kapampanga

( 46

9)

( 60

5)

( 61

9)

( 49

8)

( 64

)

( 25

6)

( 22

2)Ifu

gaoBayn

( 25

7)Kalla

hanKa

( 25

8)Kalla

hanKe

( 23

2)

Inibaloi

( 49

7)

Pangas inan

( 22

6)

( 25

1)

KakidugenI

( 22

7)

IlongotKak

( 11

0)

( 47

8)

( 21

9)

( 22

0)

IfugaoAmga

( 22

1)

IfugaoBata

( 90

)

( 88

)

( 89

)BontokGuin

( 87

)BontocGuin

( 26

2)

KankanayNo

( 41)

Balangaw

( 255)

(254)

KalingaGui

( 239)

ItnegBinon

( 468)

( 216)

( 184)

Gaddang

( 30)

AttaPamplo

( 236)

Is negDibag

( 471)

( 2)

Agta

( 144)

DumagatCas

( 224)

Ilokano

( 243)

( 754)

Yogya

( 488)

OldJavanes

( 735)

Wes ternMalayoPolynes ian

( 631)

( 611)

( 94)

( 93)

BugineseSo

( 354)Maloh

( 344)

Makassar

( 637)

TaeSToraja

( 568)

( 476)

( 567)SangirTabu

( 566)Sangir

( 565)SangilSara

( 50)

Bantik

( 203)

( 201)

( 248) Kaidipang

( 202) GorontaloH

( 84)BolaangMon

( 423)

( 691) ( 518) Popalia ( 85) Bonerate

( 739) ( 747) Wuna ( 424) MunaKatobu

( 466) ( 685) Tontemboan ( 684)

Tonsea ( 46)BanggaiWdi ( 51)

Baree ( 746)

Wolio ( 418)

Mori

( 270) ( 271)

KayanUmaJu

( 96)

Bukat

( 529)

PunanKelai (111)

(158)

(523)

( 736)

( 462)

( 213) ( 715)

Wampar ( 749)

Yabem ( 484)

NumbamiS ib ( 451) ( 713)

( 624)

( 67) ( 422)

Mouk ( 20)

Aria ( 309)

LamogaiMul

( 7)

Amara

( 500) ( 585)

Sengseng

( 268)

KaulongAuV

( 61) ( 475) ( 73)

Bilibil ( 394)

Megiar

( 188)

Gedaged

( 387)

Matukar

( 72)

Biliau

( 401)

Mengen

( 353)

Maleu

( 292)

Kove

( 575)

( 574)

( 661)

Tarpia

( 600)

Sobei

( 272)

KayupulauK

( 539)

Riwo

( 579) ( 250) ( 249)

Kairiru

( 358) ( 284)

Kis

( 743)

Wogeo

( 4)

Ali

( 499)

( 480)

( 627)

( 31)

Auhelawa

( 558)

Saliba

( 626)

Suau

( 343)

Mais in

( 463)

( 205)

Gumawana

( 104)

( 140)

Diodio

( 412)

Molima

( 17) ( 16)

( 695)

Ubir

( 185)

Gapapaiwa

( 720)

Wedau

( 141)

Dobuan

( 508)

( 281)

( 409)

M is ima

( 280)

Kiliv ila

( 117) ( 723) ( 183)

Gabadi

( 482) ( 143)

Doura

( 295)

Kuni

( 541)

Roro

( 395)

Mekeo

( 305)

Lala

( 594)

( 712)

Vilirupu

( 421)

Motu

( 342)

MagoriSout

( 404)

( 448) ( 610)

( 502)

( 261)

Kandas

( 655)

Tanga

( 590)

S iar

( 293)

Kuanua

( 501)

Patpatar

( 447) ( 730)

( 544)

Rov iana

( 593)

S imbo

( 437)

Nduke

( 296)

Kusaghe

( 212)

Hoava

( 335)

Lungga

( 336)

Luqa

( 294)

Kubokota

( 696)

Ughele

( 193)

Ghanongga

( 154)

( 706)

Vangunu

( 382)

Marovo

( 391)

Mbare

ke

( 440) ( 602)

Solo

s

( 572)

( 646)

Taio

f

( 668)

Teop

( 438)

( 439)

NehanHape

( 458)

Nis

san

( 207)

Haku

( 129)

( 597)

Sis

ingga

( 38)

Babata

naTu

( 711)

Varis

iGhon

( 710)

Varis

i

( 538)

Ririo

( 584)

Sengga

( 35)

Babata

naAv

( 705)

Vaghua

( 34)

Babata

na

( 37)

Babata

naLo

( 36)

Babata

naKa

( 416)

( 334)

LungaLunga

( 413)

Mono

( 414)

MonoAlu

( 686)

Tora

u

( 700)

Uru

ava

( 415)

MonoFauro

( 75)

Bilu

r

( 571)

( 157)

( 128)ChekeHolo

( 379)M

arin

geKm

a

( 381)M

arin

geTat

( 452)NggaoPoro

( 380)M

arin

geLel

( 732)

( 755)ZabanaKia

( 302)LaghuSam

as

( 124)

( 82)Bla

bla

nga

( 83)Bla

bla

ngaG

( 289)K

ok

ota

( 282)K

ilok

ak

aY

s

( 49)

Ba

no

ni

( 316)

( 340)

Ma

da

ra

( 692)

Tu

ng

ag

Tu

ng

( 265)

Ka

raW

es

t

( 431)

Na

lik

( 673)

Tia

ng

( 674)

Tig

ak

( 324)

Lih

irSu

ng

l

( 338)

( 339)M

ad

ak

La

ma

s

( 53)

Ba

rok

( 741)

( 388)

Ma

utu

tu

( 430)

Na

ka

na

iBil

( 304)

La

ka

lai

(714)

Vitu

(753)

Ya

pe

se

(112)

(618)

(190)

(92)

(393)

Mb

ug

ho

tuD

h

(95)

Bu

go

tu

(204)

(651)

Ta

lise

Mo

li

(650)

Ta

lise

Ma

la

(195)

Gh

ariN

di

(194)

Gh

ari

(681)

To

lo (648)

Ta

lise

(347)

Ma

lan

go

(196)

Gh

ariN

gg

ae

(392)

Mb

irao

(199)

Gh

ariT

an

da

(652)

Ta

lise

Po

le (197)

Gh

ariN

gg

er

(649)

Ta

lise

Ko

o (198)

Gh

ariN

gin

i (189)

(319)

Le

ng

o (320)

Le

ng

oG

ha

im (321)

Le

ng

oP

arip

(453)

Ng

ge

la

(346)

(345)

(473)

(678)

To

am

ba

ita (298)

Kw

ai

(312)

La

ng

ala

ng

a

(299)

Kw

aio

(390)

Mb

ae

ng

gu

u

(313)

La

u (3

89)

Mb

ae

lele

a

(301)

Kw

ara

ae

So

l

(175)

Fa

tale

ka

(315)

La

uW

ala

de

(314)

La

uN

orth

(328)

Lo

ng

gu

(621)

(551)

Sa

aU

kiN

iMa

(550)

Sa

aS

aa

Vill

(490)

Oro

ha

(552)

Sa

aU

law

a

(548)

Sa

a

(142)

Do

rio

(18)

Are

are

Ma

as

(19)

Are

are

Waia

(549)

SaaAulu

Vil

(564)

(58)

Bauro

Baro

o

(59)

Bauro

Haunu

(172)

Fagani

(247)

KahuaM

am

i

(570)

Santa

Cata

l

(22)

Aro

siO

neib

(23)

Aro

siT

awat

(21)

Aro

si

(246)

Kahua

(60)

Bauro

PawaV

(173)

FaganiA

guf

(174)

FaganiR

ihu

(569)

Santa

Ana

(663)

Tawaro

ga

(536)

(709)

(657)

(658)

TannaSouth

(300)

Kwam

era

(318)

Lenakel

(171)

(636)

SyeErro

man

(699)

Ura

(467)

(726)

(15)

Ara

kiS

outh

(402)

Mere

i

(150) (10)

Am

bry

mSout

(420)

Mota

(510)

Pete

rara

Ma

(491)

Paam

eseSou

(427)

Mwotla

p

(531)

Raga

(119)

(607)

South

Efa

te

(489)

Ork

on

(454)

Nguna

(432)

Nam

akir

(352)

(434)

Nati

(429)

Nahavaq

(444)

Nese

(659)

Tape

(12)

AnejomAnei

(557)

SakaoPortO

(351)

(32)

Avava

(445)

Neveei

(433)

Naman

(522)

(406)

(516)

(520)

(528)

Puluwatese

(527)

PuloAnnan

(745)

Woleaian

(577)

Satawalese

(132)

ChuukeseAK

(419)

Mortlockes

(106)

Carolinian

(131)

Chuukese

(603)

Sonsoroles

(555)

SaipanCaro

(526)

PuloAnna

(744)

Woleai

(515)

(411)

Mokilese

(512)

Pingilapes

(514)

Ponapean

(283)

Kiribati

(297)

Kusaie

(385)

Marshalles

(436)

Nauru

(332)

(446)

(474)

(441)

Nelemwa

(244)

Jawe

(105)

Canala

(214)

Iaai

(443)

Nengone

(139)

Dehu

(116)

(725)

(543)

Rotuman

(734)

WesternFij

(146) (513)

(481)

(156) (122)

(645)

(374)Maori

(642)Tahiti

(505)Penrhyn

(689)Tuamotu

(361)Manihiki

(546)Rurutuan

(643)TahitianMo

(534)Rarotongan

(644)Tahitianth

(384)

(383)Marquesan

(359)Mangareva

(209)Hawaiian

(533)

RapanuiEas

(563)

(562)

Samoan

(182)

(63)Bellona

(675)Tikopia

(163)Emae

(13)Anuta

(537)Rennellese

(180)FutunaAniw

(218)IfiraMeleM

(703)UveaWest

(181)FutunaEast

(704)VaeakauTau

(162)

(647)Takuu (694)

Tuvalu (592)

Sikaiana (264)

Kapingamar

(483)Nukuoro

(333)Luangiua

(524)Pukapuka (680)Tokelau (702)

UveaEast

(683) (682)

Tongan (459)

Niue (177)FijianBau

(159) (707)

(666)Teanu (708)Vano (654)Tanema (98)

Buma (701) (26)

Asumboa (656)

Tanimbili (442)

Nembao (1)

(738) (581)

Seimat

(748)

Wuvulu (160)

(616) (330)

Lou (435)

Nauna (373)

(153) (317)

Leipon (598)

SivisaTita

(729) (325)

Likum (323)

Levei

(329)

Loniu

(426)

Mussau

(609)

(608)

(266)

KasiraIrah

(200)

Giman

(97)

Buli

(108)

(417)

Mor

(532) (408)

Minyaifuin

(68)

BigaMisool

(25)

As

(718)

Waropen

(485)

Numfor

(120)

(378)

Marau

(8)

AmbaiYapen

(742)

WindesiWan (

519)

(33

)

(465)

(46

0)

NorthBabar

(13

5)

DaweraDawe

(13

4)

Dai

(61

2)

(62

2)

(66

7)

TelaMasbua

(22

9)

Imroing

(16

4)

Emplawas

(38

6)

(14

8)

EastMasela

(58

8)

Serili

(11

5)

CentralM

as

(60

6)

SouthEastB

(72

4)

WestDamar

(24

)

(66

0)

TaranganBa

(69

7)

UjirNAru

(45

0)

NgaiborSAr

(

114)

(

152)

(

45)

(

192)

(19

1)

Geser

(

719)

Watu

bela

(

161)

ElatK

eiBes

(

586)

(

486)

(

587)

(

9)

(

211)

HituAm

bon

(

604)

SouAmanaTe

(

6)

Amahai

(

503)

Paulohi

(

698)

(

5)

Alune

(

425)

Murn

atenAl

(

86)

Bonfia

(

722)

Werin

ama

(

101)

BuruNam

rol

(

601)

Soboyo

(

77)

(

357)

Mam

boru

(

360)

Manggara

i

(

449)

Ngadha

(

517)

Pondok

(

287)

Kodi

(

721)

Weje

waTana

(

76)

Bima

(

599)

Soa

(

578)

Savu

(

716)

Wanukaka

(

186)

Gaura

Nggau

(

149)

East

Sum

ban

(

44)

Balil

edo

(

308)

Lam

boya

(

166)

(

326)

LioF

lore

sT

(

165)

Ende

(

428)

Nage

(

259)

Kam

bera

(

496)

Palu

eNitun

(

11)

Anakala

ng

(

461)

(

582)

Sekar

(

525)

Pula

uArg

un

(

676)

(

479)

(

731)

(

542)

RotiTerm

an

(

29)

Ato

ni

(

155)

(

356)

Mam

bai

(

669)

Tetu

nTerik

(

277)

Kem

ak

(

178)

(

307)

Lam

ale

rale

(

306)

Lam

aholo

tI

(

591)

Sik

a

(

273)

Kedang

(

623)

(

740)

(

170)

Era

i

(

653)

Talu

r

(

14)

Aputa

i

(

506)

Pera

i

(

690)

Tugun

(

223)

Iliu

n

(

671)

(

457)

(

589)

Seru

a

(

456)

Nila

(

670)

Teun

(

286)

(

285)

Kis

ar

(

540)

Rom

a

(

145)

EastD

am

ar

(

322)

Letinese

(

617)

(

275)

(

751)

Yam

dena

(

274)

KeiT

anim

ba

(

583)

Sela

ru

(

288)

Koiw

aiIria

(

127)

Cham

orr

o

(

509)

(

310)

La

mp

un

g

(

290)

Ko

me

rin

g

(

634)

Su

nd

a

(

535)

Re

jan

gR

eja

(

74)

(

664)

(

665)

Tb

oli

Ta

ga

b

(

638)

Ta

ga

bil

i

(

81)

(

291)

Ko

ron

ad

alB

(

573)

Sa

ran

ga

niB

(

70)

Bil

aa

nK

oro

(

71)

Bil

aa

nS

ara

(179)

(

28)

(

133)

Ciu

liA

tay

a

(

625)

Sq

uli

qA

tay

(

580)

Se

diq

(

100)

Bu

nu

n

(737)

(

672)

Th

ao

(176)

Fa

vo

rla

ng

(545)

Ru

ka

i

(492)

Pa

iwa

n

(688)

(260)

Ka

na

ka

na

bu

(553)

Sa

aro

a

(687)

Ts

ou

(530)

Pu

yu

ma

(147)

(472)

(269)

Ka

va

lan

(54)

Ba

sa

i (109)

Ce

ntr

alA

mi

(596)

Sir

ay

a (504)

Pa

ze

h (556)

Sa

isia

t !

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

!

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

Figure S.3: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parenthesis attached to each branch.

27

Page 34: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

( 521)

(559)

(750)Yakan (632) (91) (40)Bajo (375)Mapun

(560)

SamalSiasi

(230)

Inabaknon (620)

(629) (630)

SubanunSin

(628)

SubanonSio (362) (123)

(728) (733)

WesternBuk (370)

ManoboWest

(366)

ManoboIlia

(365)

ManoboDiba

(27) (363)

ManoboAtad

(364)

ManoboAtau

(369)

ManoboTigw

(614)

(367)

ManoboKala

(368)

ManoboSara

(79)

Binukid

(377)

(376)

Maranao

(233)

Iranun

(43)

(576)

Sasak

(42)

Bali

(40

5) (118)

(355)

Mamanwa

(80) (121) (507)

(225)

Ilonggo

(210)

Hiligaynon

(717)

WarayWaray

(613)

(102)

(103)

Butuanon

(662)

TausugJolo

(635)

Surigaonon

(3)

AklanonBis

(10

7)

Cebuano

(37

2)

(25

2)

Kalagan

(37

1)

Mansaka

(63

9)

TagalogAnt

(69

)

BikolNagaC

(25

3)

(64

1)

TagbanwaKa

(64

0)

TagbanwaAb

(49

5)

(57

)

BatakPalaw

(49

4)

PalawanBat

(47

)

Banggi

(54

7)

SWPalawano

(20

8)

Hanunoo

(49

3)

Palauan

(31

1)

(59

5)

Singhi

(13

6)

DayakBakat

(

52)

(

727)

(

615)

(

267)

Katingan

(

137)

DayakNgaju

(

245)

Kadorih

(

151)

(

403)

Merin

aMala

(

337)

Maanyan

(

693)

Tunjung

(

350)

(

349)

(

327)

(

279)

Kerinci

(

400)

Mela

yuSara

(

398)

Mela

yu

(

487)

Ogan

(

331)

LowMala

y

(

231)

Indonesia

n

(

48)

Banjare

seM

(

348)

Mala

yBahas

(

399)

Mela

yuBrun

(

407)

Min

angkaba

(

126)

(

125)

(

511)

PhanRangCh

(

130)

Chru

(

206)

HainanCham

(

410)

Moken

(

215)

Iban

(

187)

Gayo

(

477)

(

554)

(

217)

Idaa

n

(

99)

Bund

uDus

un

(

464)

(

138)

(

677)

Tim

ugon

Mur

(

276)

Kela

bitB

ar

(

78)

Bint

ulu

(

66)

(

65)

Bera

wanLon

(

62)

Bela

it

(

278)

KenyahLong

(

396)

(

397)

Mela

nauM

uk

(

303)

Lahanan

(

633)

(

167)

(

169)

EngganoM

al

(

168)

EngganoBan

(

455)

Nia

s

(

679)

TobaBata

k

(

341)

Madure

se

(

470)

(56

)

(

55)

(

241)

(

242)

Ivata

nBasc

(

237)

Itbayat

(

39)

Babuyan

(

234)

Irara

lay

(

238)

Itbayate

n

(

235)

Isam

oro

ng

(

240)

Ivasay

(

228)

Imoro

d

(

752)

Yam

i

(

113)

(

561)

Sam

balB

oto

(

263)

Kapam

panga

(

469)

(

605)

(

619)

(

498)

(

64)

(

256)

(

222)

IfugaoBayn

(

257)

KallahanKa

(

258)

KallahanKe

(

232)

Inib

alo

i

(

497)

Pangasin

an

(

226)

(

251)

Kakid

ugenI

(

227)

IlongotK

ak

(

110)

(

478)

(

219)

(

220)

IfugaoAm

ga

(

221)

IfugaoBata

(

90)

(

88)

(

89)

Bonto

kGuin

(

87)

Bonto

cGuin

(

262)

KankanayNo

(

41)

Bala

ngaw

(

255)

(

254)

Ka

lin

ga

Gu

i

(

239)

Itn

eg

Bin

on

(

468)

(

216)

(

184)

Ga

dd

an

g

(

30)

Att

aP

am

plo

(

236)

Isn

eg

Dib

ag

(

471)

(

2)

Ag

ta

(

144)

Du

ma

ga

tCa

s

(

224)

Ilo

ka

no

(

243)

(

754)

Yo

gy

a

(

488)

Old

Jav

an

es

(

735)

We

ste

rnM

ala

yo

Po

lyn

es

ian

(631)

(

611)

(

94)

(

93)

Bu

gin

es

eS

o

(

354)

Ma

loh

(

344)

Ma

ka

ss

ar

(

637)

Ta

eS

To

raja

(568)

(476)

(567)

Sa

ng

irT

ab

u

(566)

Sa

ng

ir

(565)

Sa

ng

ilS

ara

(50)

Ba

nti

k

(203)

(201)

(248)

Ka

idip

an

g

(202)

Go

ron

talo

H

(84)

Bo

laa

ng

Mo

n

(423)

(691)

(518)

Po

pa

lia

(85)

Bo

ne

rate

(739)

(747)

Wu

na

(424)

Mu

na

Ka

tob

u

(466)

(685)

To

nte

mb

oa

n (684)

To

ns

ea

(46)

Ba

ng

ga

iWd

i (51)

Ba

ree

(746)

Wo

lio

(418)

Mo

ri

(270)

(271)

Ka

ya

nU

ma

Ju

(96)

Bu

ka

t

(529)

Pu

na

nK

ela

i

( 111)

( 158)

( 523)

(

736)

(

462)

(213)

(715)

Wa

mp

ar

(749)

Ya

be

m (484)

Nu

mb

am

iSib

(

451)

(

713)

(624)

(67)

(422)

Mo

uk

(20)

Ari

a (309)

La

mo

ga

iMu

l

(

7)

Am

ara

(

500)

(

585)

Se

ng

se

ng

(

268)

Ka

ulo

ng

Au

V

(

61)

(

475)

(

73)

Bil

ibil

(

394)

Me

gia

r

(

188)

Ge

da

ge

d

(

387)

Ma

tuk

ar

(

72)

Bil

iau

(

401)

Me

ng

en

(

353)

Ma

leu

(

292)

Ko

ve

(

575)

(

574)

(

661)

Ta

rpia

(

600)

So

be

i

(

272)

Ka

yu

pu

lau

K

(

539)

Riw

o

(

579)

(

250)

(

249)

Kair

iru

(

358)

(

284)

Kis

(

743)

Wogeo

(

4)

Ali

(

499)

(

480)

(

627)

(

31)

Auhela

wa

(

558)

Saliba

(

626)

Suau

(

343)

Mais

in

(

463)

(

205)

Gum

awana

(

104)

(

140)

Dio

dio

(

412)

Molim

a

(

17)

(

16)

(

695)

Ubir

(

185)

Gapapaiw

a

(

720)

Wedau

(

141)

Dobuan

(

508)

(

281)

(

409)

Mis

ima

(

280)

Kiliv

ila

(

117)

(

723)

(

183)

Gabadi

(

482)

(

143)

Doura

(

295)

Kuni

(

541)

Roro

(

395)

Mekeo

(

305)

Lala

(

594)

(

712)

Vilirupu

(

421)

Motu

(

342)

MagoriSout

(

404)

(

448)

(

610)

(

502)

(

261)

Kandas

(

655)

Tanga

(

590)

Sia

r

(

293)

Kuanua

(

501)

Patp

ata

r

(

447)

(

730)

(

544)

Rovia

na

(

593)

Sim

bo

(

437)

Nduke

(

296)

Kusa

ghe

(

212)

Hoav

a

(

335)

Lung

ga

(

336)

Luqa

(

294)

Kubo

kota

(

696)

Ughele

(

193)

Ghanongga

(

154)

(

706)

Vangunu

(

382)

Maro

vo

(

391)

Mbare

ke

(

440)

(

602)

Solos

(

572)

(

646)

Taiof

(

668)

Teop

(

438)

(

439)

NehanHape

(

458)

Niss an

(

207)

Haku

(

129)

(

597)

S isin

gga

(

38)

BabatanaTu

(

711)

VarisiG

hon

(

710)

Varisi

(

538)

Ririo

(

584)

Sengga

(

35)

BabatanaAv

(

705)

Vaghua

(

34)

Babatana

(

37)

BabatanaLo

(

36)

BabatanaKa

( 41

6)

( 33

4)

LungaLunga

( 41

3)

Mono

( 41

4)

MonoAlu

( 68

6)

Torau

( 70

0)

Uruava

( 41

5)

MonoFauro

( 75

)

Bilur

( 57

1)

( 15

7)

( 12

8)ChekeHolo

( 37

9)Marin

geKma

( 38

1)Marin

geTat

( 45

2)NggaoPoro

( 38

0)Marin

geLel

( 73

2)

( 75

5)ZabanaKia

( 30

2)LaghuSamas

(124)

( 82

)Blablanga

( 83) BlablangaG

(289) Kokota

( 282) KilokakaYs

( 49)

Banoni

( 316)

( 340)

Madara

( 692)

TungagTung

( 265)

KaraWes t

( 431)

Nalik

( 673)

Tiang

( 674)

Tigak

( 324)

L ihirSungl

( 338)

( 339)MadakLamas

( 53)Barok

( 741)

( 388)

Maututu

( 430)

NakanaiBil

( 304)Lakalai

( 714)

Vitu

( 753)

Yapese

( 112)

( 618)

( 190)

( 92)

( 393) MbughotuDh

( 95) Bugotu

( 204)

( 651) Talis eMoli

( 650) Talis eMala

( 195) GhariNdi ( 194) Ghari ( 681) Tolo ( 648) Talis e ( 347) Malango ( 196)

GhariNggae ( 392)

Mbirao ( 199)GhariTanda ( 652)Talis ePole ( 197)GhariNgger ( 649)Talis eKoo ( 198)

GhariNgini ( 189) ( 319)

Lengo ( 320)

LengoGhaim ( 321)

LengoParip ( 453)

Nggela

( 346)

( 345)

( 473)

( 678)

Toambaita ( 298)

Kwai ( 312)

Langalanga

( 299)

Kwaio ( 390)

Mbaengguu

( 313)

Lau ( 389)

Mbaelelea

( 301)

KwaraaeSol

( 175)

Fataleka

( 315)

LauWalade

( 314)

LauNorth

( 328)

Longgu ( 621)

( 551)

SaaUkiNiMa

( 550)

SaaSaaVill

( 490)

Oroha

( 552)

SaaUlawa

( 548)

Saa

( 142)

Dorio

( 18)

AreareMaas

( 19)

AreareWaia

( 549)

SaaAuluVil

( 564)

( 58)

BauroBaroo

( 59)

BauroHaunu

( 172)

Fagani

( 247)

KahuaMami

( 570)

SantaCatal

( 22)

Aros iOneib

( 23)

Aros iTawat

( 21)

Aros i

( 246)

Kahua

( 60)

BauroPawaV

( 173)

FaganiAguf

( 174)

FaganiRihu

( 569)

SantaAna

( 663)

Tawaroga

( 536)

( 709)

( 657)

( 658)

TannaSouth

( 300)

Kwamera

( 318)

Lenakel

( 171)

( 636)

SyeErroman

( 699)

Ura

( 467)

( 726)

( 15)

ArakiSouth

( 402)

Merei

( 150) ( 10)

Ambrym

Sout

( 420)

Mota

( 510)

PeteraraMa

( 491)

PaameseSou

( 427)

Mwotlap

( 531)

Raga

( 119)

( 607)

SouthEfate

( 489)

Orkon

( 454)

Nguna

( 432)

Namakir

( 352)

( 434)

Nati

( 429)

Nahavaq

( 444)

Nese

( 659)

Tape

( 12)

AnejomAnei

( 557)

SakaoPortO

( 351)

( 32)

Avava

( 445)

Neveei

( 433)

Naman

( 522)

( 406)

( 516)

( 520)

( 528)

Puluwatese

( 527)

Pulo

Annan

( 745)

Wole

aia

n

( 577)

Sata

wale

se

( 132)

ChuukeseAK

( 419)

Mortlo

ckes

( 106)

Caro

linia

n

( 131)

Chuukese

( 603)

Sonsoro

les

( 555)

Saip

anCaro

( 526)

Pulo

Anna

( 744)

Wole

ai

( 515)

( 411)

Mokile

se

( 512)

Pin

gila

pes

( 514)

Ponapean

( 283)

Kirib

ati

( 297)

Kusaie

( 385)

Mars

halle

s

( 436)

Nauru

( 332)

( 446)

( 474)

( 441)

Nele

mwa

( 244)

Jawe

( 105)

Canala

( 214)

Iaai

( 443)

Nengone

( 139)

Dehu

( 116)

( 725)

( 543)

Rotu

man

( 734)

Weste

rnFij

( 146) ( 513)

( 481)

( 156) ( 122) ( 645)

( 374)M

aori

( 642)Tahiti

( 505)Penrh

yn

( 689)Tuam

otu

( 361)M

anih

iki

( 546)Ruru

tuan

( 643)Tahitia

nM

o

( 534)Raro

tongan

( 644)T

ah

itian

th

( 384)

( 383)M

arq

ue

sa

n

( 359)M

an

ga

rev

a

( 209)H

aw

aiia

n

( 533)

Ra

pa

nu

iEa

s

(563) ( 562)

Sa

mo

an

( 182)

( 63)B

ello

na

( 675)T

iko

pia

( 163)E

ma

e

( 13)A

nu

ta

( 537)R

en

ne

lles

e

( 180)

Fu

tun

aA

niw

( 218)

IfiraM

ele

M

( 703)

Uv

ea

We

st

( 181)

Fu

tun

aE

as

t

(704)

Va

ea

ka

uT

au

(162)

(647)

Ta

ku

u (694)

Tu

va

lu (592)

Sik

aia

na

(264)

Ka

pin

ga

ma

r

(483)

Nu

ku

oro

(333)

Lu

an

giu

a (524)

Pu

ka

pu

ka

(680)

To

ke

lau

(702)

Uv

ea

Ea

st

(683)

(682)

To

ng

an

(459)

Niu

e (177)

Fijia

nB

au

(159)

(707)

(666)

Te

an

u (708)

Va

no

(654)

Ta

ne

ma

(98)

Bu

ma

(701)

(26)

As

um

bo

a (656)

Ta

nim

bili

(442)

Ne

mb

ao

(1)

(738)

(581)

Se

ima

t

(748)

Wu

vu

lu (160)

(616)

(330)

Lo

u (435)

Na

un

a (3

73)

(153)

(317)

Le

ipo

n (598)

Siv

isa

Tita

(729)

(325)

Lik

um

(323)

Le

ve

i

(329)

Lo

niu

(426)

Mu

ss

au

(609)

(608)

(266)

Ka

sira

Irah

(200)

Gim

an

(97)

Bu

li

(108)

(417)

Mo

r

(532) (408)

Min

ya

ifuin

(68)

Big

aM

iso

ol

(25)

As

(718)

Wa

rop

en

(485)

Nu

mfo

r

(120)

(378)

Ma

rau

(8)

Am

ba

iYa

pe

n

(742)

Win

desiW

an

(519)

(33) (465)

(460)

North

Babar

(135)

Dawera

Dawe

(134)

Dai

(612)

(622)

(667)

Tela

Masbua

(229)

Imro

ing

(164)

Em

pla

was

(386)

(148)

EastM

asela

(588)

Serili

(115)

Centra

lMas

(606)

South

EastB

(724)

WestD

am

ar

(24)

(660)

Tara

nganBa

(697)

UjirN

Aru

(450)

Ngaib

orS

Ar

(114) (152)

(45)

(192)

(191)

Geser

(719)

Watu

bela

(161)

Ela

tKeiB

es

(586) (486)

(587) (9)

(211)

Hitu

Am

bon

(604)

SouAm

anaTe

(6)

Am

ahai

(503)

Paulo

hi

(698)

(5)

Alu

ne

(425)

Murn

ate

nAl

(86)

Bonfia

(722)

Werin

am

a

(101)

Buru

Nam

rol

(601)

Soboyo

(77)

(357)

Mam

boru

(360)

Manggara

i

(449)

Ngadha

(517)

Pondok

(287)

Kodi

(721)

Weje

waTana

(76)

Bima

(599)

Soa

(578)

Savu

(716)

Wanukaka

(186)

GauraNggau

(149)

EastSumban

(44)

Baliledo

(308)

Lamboya

(166)

(326)

LioFloresT

(165)

Ende

(428)

Nage

(259)

Kambera

(496)

PalueNitun

(11)

Anakalang

(461)

(582)

Sekar

(525)

PulauArgun

(676)

(479)

(731)

(542)

RotiTerman

(29)

Atoni

(155)

(356)

Mam

bai

(669)

TetunTerik

(277)

Kemak

(178)

(307)

Lamalerale

(306)

LamaholotI

(591)

Sika

(273)

Kedang

(623)

(740)

(170)

Erai

(653)

Talur

(14)

Aputai

(506)

Perai

(690)

Tugun

(223)

Iliun

(671)

(457)

(589)

Serua

(456)

Nila

(670)

Teun

(286)

(285)

Kisar

(540)

Roma

(145)

EastDamar

(322)

Letinese

(617)

(275)

(751)

Yamdena

(274)

KeiTanimba

(583)

Selaru

(288)

KoiwaiIria

(127)

Chamorro

(509)

(310)

Lampung

(290)

Komering

(634)

Sunda

(535)

RejangReja

(74)

(664)

(665)

TboliTagab

(638)

Tagabili

(81)

(291)

KoronadalB

(573)

SaranganiB

(70)

BilaanKoro

(71)

BilaanSara

(179)

(28)

(133)

CiuliAtaya

(625)

SquliqAtay

(580)

Sediq

(100)

Bunun

(737)

(672)

Thao

(176)

Favorlang

(545)

Rukai

(492)

Paiwan

(688)

(260)Kanakanabu

(553)Saaroa

(687)Tsou

(530)

Puyuma

(147)

(472) (269)Kavalan (54)Basai

(109)CentralAmi (596)Siraya

(504)Pazeh (556)Saisiat

!

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

!

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

Figure S.4: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parenthesis attached to each branch.

28

Page 35: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

(521)

(559)

(750)

Ya

ka

n (632)

(91)

(40)

Ba

jo (375)

Ma

pu

n (560)

Sa

ma

lSia

si

(230)

Ina

ba

kn

on

(620)

(629)

(630)

Su

ba

nu

nS

in

(628)

Su

ba

no

nS

io (362)

(123)

(728)

(733)

We

ste

rnB

uk

(370)

Ma

no

bo

We

st

(366)

Ma

no

bo

Ilia

(365)

Ma

no

bo

Dib

a

(27)

(363)

Ma

no

bo

Ata

d

(364)

Ma

no

bo

Ata

u

(369)

Ma

no

bo

Tig

w

(614)

(367)

Ma

no

bo

Ka

la

(368)

Ma

no

bo

Sa

ra

(79)

Bin

uk

id

(377)

(376)

Ma

ran

ao

(233)

Iran

un

(43)

(576)

Sa

sa

k

(42)

Ba

li

(405) (118)

(355)

Ma

ma

nw

a

(80) (121)

(507) (225)

Ilon

gg

o

(210)

Hilig

ay

no

n

(717)

Wa

ray

Wa

ray

(613)

(102)

(103)

Bu

tua

no

n

(662)

Ta

us

ug

Jolo

(635)

Su

riga

on

on

(3)

Akla

nonBis

(107)

Cebuano

(372)

(252)

Kala

gan

(371)

Mansaka

(639)

Tagalo

gAnt

(69)

Bik

olN

agaC

(253)

(641)

TagbanwaKa

(640)

TagbanwaAb

(495)

(57)

Bata

kPala

w

(494)

Pala

wanBat

(47)

Banggi

(547)

SW

Pala

wano

(208)

Hanunoo

(493)

Pala

uan

(311)

(595)

Sin

ghi

(136)

DayakBakat

(52)

(727)

(615)

(267)

Katin

gan

(137)

DayakNgaju

(245)

Kadorih

(151)

(403)

Merin

aM

ala

(337)

Maanyan

(693)

Tunju

ng

(350)

(349)

(327)

(279)

Kerin

ci

(400)

Mela

yuSara

(398)

Mela

yu

(487)

Ogan

(331)

LowM

ala

y

(231)

Indonesia

n

(48)

Banja

reseM

(348)

Mala

yBahas

(399)

Mela

yuBru

n

(407)

Min

angkaba

(126)

(125)

(511)

PhanRangCh

(130)

Chru

(206)

HainanC

ham

(410)

Moken

(215)

Iban

(187)

Gayo

(477)

(554)

(217)

Idaan

(99)

BunduDusun

(464)

(138)

(677)

TimugonM

ur

(276)

KelabitBar

(78)

Bintulu

(66)

(65)

BerawanLon

(62)

Belait

(278)

KenyahLong

(396)

(397)

MelanauM

uk

(303)

Lahanan

(633)

(167)

(169)

EngganoMal

(168)

EngganoBan

(455)

Nias

(679)

TobaBatak

(341)

Madurese

(470)

(56)

(55)

(241)

(242)

IvatanBasc

(237)

Itbayat

(39)

Babuyan

(234)

Iraralay

(238)

Itbayaten

(235)

Isamorong

(240)

Ivasay

(228)

Imorod

(752)

Yami

(113)

(561)

SambalBoto

(263)

Kapampanga

(469)

(605)

(619)

(498)

(64)

(256)

(222)IfugaoBayn

(257)KallahanKa

(258)KallahanKe

(232)

Inibaloi

(497)

Pangasinan

(226)

(251)

KakidugenI

(227)

IlongotKak

(110)

(478)

(219)

(220)

IfugaoAmga

(221)

IfugaoBata

(90)

(88)

(89)BontokGuin

(87)BontocGuin

(262)

KankanayNo

(41)

Balangaw

(255)

(254)

KalingaGui

(239)

ItnegBinon

(468)

(216)

(184)

Gaddang

(30)

AttaPamplo

(236)

IsnegDibag

(471)

(2)

Agta

(144)

DumagatCas

(224)

Ilokano

(243)

(754)

Yogya

(488)

OldJavanes

(735)

WesternMalayoPolynesian

(631)

(611)

(94)

(93)

BugineseSo

(354)Maloh

(344)

Makassar

(637)

TaeSToraja

(568)

(476)

(567)SangirTabu

(566)Sangir

(565)SangilSara

(50)

Bantik

(203)

(201)

(248)Kaidipang

(202)GorontaloH

(84)BolaangMon

(423)

(691) (518)Popalia (85)Bonerate

(739) (747)Wuna (424)

MunaKatobu

(466) (685)

Tontemboan (684)Tonsea (46)BanggaiWdi (51)

Baree (746)

Wolio (418)

Mori

(270) (271)

KayanUmaJu

(96)

Bukat

(529)

PunanKelai

(11

1)

(

158)

(523)

(

736)

(462)

(213) (715)

Wampar (749)

Yabem (484)

NumbamiSib (451) (713)

(624)

(67) (422)

Mouk (20)

Aria (309)

LamogaiMul

(7)

Amara

(500) (585)

Sengseng

(268)

KaulongAuV

(61) (475) (73)

Bilibil (394)

Megiar

(188)

Gedaged

(387)

Matukar

(72)

Biliau

(401)

Mengen

(353)

Maleu

(292)

Kove

(575)

(574)

(661)

Tarpia

(600)

Sobei

(272)

KayupulauK

(539)

Riwo

(57

9)

(250)

(249)

Kairiru

(35

8)

(284)

Kis

(74

3)

Wogeo

(4)

Ali

(49

9)

(48

0)

(62

7)

(31

)

Auhelawa

(55

8)

Saliba

(62

6)

Suau

(34

3)

Maisin

(46

3)

(20

5)

Gumawana

(10

4)

(14

0)

Diodio

(41

2)

Molima

(17

)

(16)

(69

5)

Ubir

(18

5)

Gapapaiwa

(72

0)

Wedau

(14

1)

Dobuan

(

508)

(

281)

(40

9)

Misima

(

280)

Kilivila

(

117)

(

723)

(

183)

Gabadi

(

482)

(14

3)

Doura

(

295)

Kuni

(

541)

Roro

(

395)

Mekeo

(

305)

Lala

(

594)

(

712)

Viliru

pu

(

421)

Motu

(

342)

MagoriS

out

(

404)

(

448)

(61

0)

(

502)

(

261)

Kandas

(

655)

Tanga

(

590)

Siar

(

293)

Kuanua

(

501)

Patpata

r

(

447)

(

730)

(

544)

Roviana

(

593)

Simbo

(

437)

Nduke

(

296)

Kusaghe

(

212)

Hoava

(

335)

Lungga

(

336)

Luqa

(

294)

Kubo

kota

(

696)

Ughe

le

(

193)

Ghan

ongg

a

(

154)

(

706)

Vang

unu

(

382)

Mar

ovo

(

391)

Mbare

ke

(

440)

(

602)

Solo

s

(

572)

(

646)

Taio

f

(

668)

Teop

(

438)

(

439)

NehanHape

(

458)

Nis

san

(

207)

Haku

(

129)

(

597)

Sis

ingga

(

38)

Babata

naTu

(

711)

VarisiG

hon

(

710)

Varisi

(

538)

Ririo

(

584)

Sengga

(

35)

Babata

naAv

(

705)

Vaghua

(

34)

Babata

na

(

37)

Babata

naLo

(

36)

Babata

naKa

(

416)

(

334)

LungaLunga

(

413)

Mono

(

414)

MonoAlu

(

686)

Tora

u

(

700)

Uru

ava

(

415)

MonoFauro

(

75)

Bilur

(

571)

(

157)

(

128)

ChekeHolo

(

379)

MaringeKm

a

(

381)

MaringeTat

(

452)

NggaoPoro

(

380)

MaringeLel

(

732)

(

755)

ZabanaKia

(

302)

LaghuSam

as

(

124)

(

82)

Bla

bla

nga

(

83)

Bla

bla

ngaG

(

289)

Ko

ko

ta

(

282)

Kil

ok

ak

aY

s

(

49)

Ba

no

ni

(

316)

(

340)

Ma

da

ra

(

692)

Tu

ng

ag

Tu

ng

(

265)

Ka

raW

es

t

(

431)

Na

lik

(

673)

Tia

ng

(

674)

Tig

ak

(

324)

Lih

irS

un

gl

(

338)

(

339)

Ma

da

kL

am

as

(

53)

Ba

rok

(

741)

(

388)

Ma

utu

tu

(

430)

Na

ka

na

iBil

(

304)

La

ka

lai

(714)

Vit

u

(753)

Ya

pe

se

(

112)

(

618)

(190)

(92)

(393)

Mb

ug

ho

tuD

h

(95)

Bu

go

tu

(204)

(651)

Ta

lis

eM

oli

(650)

Ta

lis

eM

ala

(195)

Gh

ari

Nd

i (194)

Gh

ari

(681)

To

lo (648)

Ta

lis

e (347)

Ma

lan

go

(196)

Gh

ari

Ng

ga

e (392)

Mb

ira

o (199)

Gh

ari

Ta

nd

a (652)

Ta

lis

eP

ole

(197)

Gh

ari

Ng

ge

r (649)

Ta

lis

eK

oo

(198)

Gh

ari

Ng

ini

(189)

(319)

Le

ng

o (320)

Le

ng

oG

ha

im (321)

Le

ng

oP

ari

p (453)

Ng

ge

la

(

346)

(

345)

(

473)

(678)

To

am

ba

ita

(298)

Kw

ai

(312)

La

ng

ala

ng

a

(299)

Kw

aio

(

390)

Mb

ae

ng

gu

u

(

313)

La

u

(389)

Mb

ae

lele

a

(

301)

Kw

ara

ae

So

l

(

175)

Fa

tale

ka

(

315)

La

uW

ala

de

(

314)

La

uN

ort

h

(

328)

Lo

ng

gu

(

621)

(

551)

Sa

aU

kiN

iMa

(

550)

Sa

aS

aa

Vil

l

(

490)

Oro

ha

(

552)

Sa

aU

law

a

(

548)

Sa

a

(

142)

Do

rio

(

18)

Are

are

Ma

as

(

19)

Are

are

Waia

(

549)

SaaAulu

Vil

(

564)

(

58)

Bauro

Baro

o

(

59)

Bauro

Haunu

(

172)

Fagani

(

247)

KahuaM

am

i

(

570)

Santa

Cata

l

(

22)

Aro

siO

neib

(

23)

Aro

siT

awat

(

21)

Aro

si

(

246)

Kahua

(

60)

Bauro

PawaV

(

173)

FaganiA

guf

(

174)

FaganiR

ihu

(

569)

Santa

Ana

(

663)

Tawaro

ga

( 53

6)

(

709)

(

657)

(

658)

TannaSouth

(

300)

Kwam

era

(

318)

Lenakel

(

171)

(

636)

SyeErrom

an

(

699)

Ura

(

467)

(

726)

(

15)

Ara

kiS

outh

(

402)

Mere

i

(

150)

(

10)

Am

bry

mSout

(

420)

Mota

(

510)

Pete

rara

Ma

(

491)

Paam

eseSou

(

427)

Mwotlap

(

531)

Raga

(

119)

(

607)

South

Efa

te

(

489)

Ork

on

(

454)

Nguna

(

432)

Nam

akir

(

352)

(

434)

Nati

(

429)

Nah

avaq

(

444)

Nese

(

659)

Tape

(

12)

Anej

omAn

ei

(

557)

Saka

oPor

tO

(

351)

(

32)

Avava

(

445)

Neveei

(

433)

Naman

(

522)

(

406)

(

516)

(

520)

(

528)

Puluwate

se

(

527)

PuloAnnan

(

745)

Wole

aian

(

577)

Satawale

se

(

132)

ChuukeseAK

(

419)

Mortl

ockes

(

106)

Carolin

ian

(

131)

Chuukese

(

603)

Sonsorole

s

(

555)

SaipanCaro

(

526)

PuloAnna

(

744)

Wole

ai

(

515)

(

411)

Mokile

s e

(

512)

P ingila

pes

(

514)

Ponapean

(

283)

Kiribati

(

297)

Kusaie

(

385)

Mars

halles

(

436)

Nauru

( 33

2)

( 44

6)

( 47

4)

( 44

1)

Nelemwa

( 24

4)

Jawe

( 10

5)

Canala

( 21

4)

Iaai

( 44

3)

Nengone

( 13

9)

Dehu

( 116)

( 72

5)

( 54

3)

Rotuman

( 73

4)

Wes ternF ij

( 146) ( 513)

( 481)

( 15

6)

( 122)

( 64

5)

( 37

4) Maori

( 64

2) Tahiti

( 50

5) Penrhyn

( 68

9) Tuamotu

( 36

1) Manihiki

( 54

6) Rurutuan

( 64

3) TahitianMo

( 534) Rarotongan

(644) Tahitia

nth

( 384)

( 383) Marquesan

( 359) Mangareva

( 209) Hawaiian

( 533)

RapanuiEas

( 563)

( 562)

Samoan

( 182)

( 63) Bellona

( 675) Tikopia

( 163) Emae

( 13) Anuta

( 537) Rennelles e

( 180) FutunaAniw

( 218) IfiraMeleM

( 703) UveaWes t

( 181) FutunaEas t

( 704) VaeakauTau

( 162)

( 647) Takuu ( 694) Tuvalu ( 592) S ikaiana ( 264) Kapingamar

( 483) Nukuoro ( 333) Luangiua

( 524) Pukapuka ( 680) Tokelau ( 702) UveaEas t

( 683) ( 682) Tongan ( 459) Niue

( 177)F ijianBau

( 159) ( 707)

( 666)Teanu ( 708)Vano ( 654)Tanema ( 98)

Buma ( 701) ( 26)

Asumboa ( 656)

Tanimbili ( 442)

Nembao ( 1)

( 738) ( 581)

Seimat

( 748)

Wuvulu ( 160)

( 616) ( 330)

Lou ( 435)

Nauna ( 373)

( 153) ( 317)

Leipon ( 598)

S iv is aTita

( 729) ( 325)

L ikum ( 323)

Levei

( 329)

Loniu

( 426)

Mussau

( 609)

( 608)

( 266)

Kas ira Irah

( 200)

Giman

( 97)

Buli

( 108)

( 417)

Mor

( 532) ( 408)

M inyaifuin

( 68)

BigaMisool

( 25)

As

( 718)

Waropen

( 485)

Numfor

( 120)

( 378)

Marau

( 8)

AmbaiYapen

( 742)

W indes iWan ( 519)

( 33) ( 465)

( 460)

NorthBabar

( 135)

DaweraDawe

( 134)

Dai

( 612)

( 622)

( 667)

TelaMasbua

( 229)

Imroing

( 164)

Emplawas

( 386)

( 148)

Eas tMasela

( 588)

Serili

( 115)

CentralMas

( 606)

SouthEas tB

( 724)

Wes tDamar

( 24)

( 660)

TaranganBa

( 697)

UjirNAru

( 450)

NgaiborSAr

( 114) ( 152)

( 45)

( 192)

( 191)

Geser

( 719)

Watubela

( 161)

E latKeiBes

( 586) ( 486)

( 587) ( 9)

( 211)

HituAmbon

( 604)

SouAmanaTe

( 6)

Amahai

( 503)

Paulohi

( 698)

( 5)

Alune

( 425)

MurnatenAl

( 86)

Bonfia

( 722)

Werinam

a

( 101)

BuruNamrol

( 601)

Soboyo

( 77)

( 357)

Mam

boru

( 360)

Manggarai

( 449)

Ngadha

( 517)

Pondok

( 287)

Kodi

( 721)

WejewaTana

( 76)

Bima

( 599)

Soa

( 578)

Savu

( 716)

Wanukaka

( 186)

GauraNggau

( 149)

Eas tSumban

( 44)

Baliledo

( 308)

Lamboya

( 166)

( 326)

L ioF loresT

( 165)

Ende

( 428)

Nage

( 259)

Kam

bera

( 496)

Palu

eNitu

n

( 11)

Anakala

ng

( 461)

( 582)

Sekar

( 525)

Pula

uArg

un

( 676)

( 479)

( 731)

( 542)

RotiT

erm

an

( 29)

Ato

ni

( 155)

( 356)

Mam

bai

( 669)

Tetu

nTerik

( 277)

Kem

ak

( 178)

( 307)

Lam

ale

rale

( 306)

Lam

aholo

tI

( 591)

Sik

a

( 273)

Kedang

( 623)

( 740)

( 170)

Era

i

( 653)

Talu

r

( 14)

Aputa

i

( 506)

Pera

i

( 690)

Tugun

( 223)

Iliun

( 671)

( 457)

( 589)

Seru

a

( 456)

Nila

( 670)

Teun

( 286)

( 285)

Kis

ar

( 540)

Rom

a

( 145)

EastD

am

ar

( 322)

Letin

ese

( 617)

( 275)

( 751)

Yam

dena

( 274)

KeiT

anim

ba

( 583)

Sela

ru

( 288)

Koiw

aiIria

( 127)

Cham

orro

( 509)

( 310)

La

mp

un

g

( 290)

Ko

me

ring

( 634)

Su

nd

a

( 535)

Re

jan

gR

eja

( 74)

( 664)

( 665)

Tb

oliT

ag

ab

( 638)

Ta

ga

bili

( 81)

( 291)

Ko

ron

ad

alB

( 573)

Sa

ran

ga

niB

( 70)

Bila

an

Ko

ro

( 71)

Bila

an

Sa

ra

(179)

( 28)

( 133)

Ciu

liAta

ya

( 625)

Sq

uliq

Ata

y

( 580)

Se

diq

( 100)

Bu

nu

n

(737)

( 672)

Th

ao

(176)

Fa

vo

rlan

g

(545)

Ru

ka

i

(492)

Pa

iwa

n

(688)

(260)

Ka

na

ka

na

bu

(553)

Sa

aro

a

(687)

Ts

ou

(530)

Pu

yu

ma

(147)

(472)

(269)

Ka

va

lan

(54)

Ba

sa

i (109)

Ce

ntra

lAm

i (596)

Sira

ya

(504)

Pa

ze

h (556)

Sa

isia

t

!

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

!

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

Figure S.5: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parenthesis attached to each branch.

29

Page 36: Ancient Lenguage Reconstructed PNAS 2013 Bouchard Côté w App

qaʀnkstpumŋrwio

others

0 0.075 0.150 0.225 0.300

(p, v)(o, ə)(r, d)(ʀ, r)

(p, b)(w, o)(i, e)(c, s)(u, i)(e, i)

(b, p)(o, w)(i, u)

(k, g)(j, s)

others

0 0.1 0.2 0.3 0.4

Fraction of substitution errors

ai

metrskulŋpqvn

others

0 0.075 0.150 0.225 0.300

Fraction of insertion errors

A B C

Fraction of deletion errors

Figure S.6: Most common substitution errors, insertion errors, and deletion errors in the PAn reconstructionsproduced by our system. In (A), the first phoneme in each pair (x, y) represents the reference phoneme, followedby the incorrectly hypothesized one. In (B), each phoneme corresponds to a phoneme present in the automaticreconstruction but not in the reference, and vice-versa in (C).

30