
Proceedings of VarDial, pages 110–119, Minneapolis, MN, June 7, 2019. © 2019 Association for Computational Linguistics


    Toward a deep dialectological representation of Indo-Aryan

Chundra A. Cathcart
Department of Comparative Linguistics
University of Zurich
[email protected]

    Abstract

This paper presents a new approach to disentangling inter-dialectal and intra-dialectal relationships within the Indo-Aryan subgroup of Indo-European. I draw upon admixture models and deep generative models to tease apart historic language contact and language-specific behavior in the overall patterns of sound change displayed by Indo-Aryan languages. I show that a "deep" model of Indo-Aryan dialectology sheds some light on questions regarding inter-relationships among the Indo-Aryan languages, and performs better than a "shallow" model in terms of certain qualities of the posterior distribution (e.g., entropy of posterior distributions), and outline future pathways for model development.

    1 Introduction

At the risk of oversimplifying, quantitative models of language relationship fall into two broad categories. At a wide, family-level scale, phylogenetic methods adopted from computational biology have had success in shedding light on the histories of genetically related but significantly diversified speech varieties (Bouckaert et al., 2012). At a shallower level, the subfield of dialectometry has used a wide variety of chiefly distance-based methodologies to analyze variation among closely related dialects with similar lexical and typological profiles (Nerbonne and Heeringa, 2001), though this work also emphasizes the importance of hierarchical linguistic relationships and the use of abstract, historically meaningful features (Prokić and Nerbonne, 2008; Nerbonne, 2009). It is possible, however, that neither methodology is completely effective for language groups of intermediate size, particularly those where certain languages have remained in contact to an extent that blurs the phylogenetic signal, but have experienced great enough diversification that dialectometric approaches are not appropriate. This paper presents a new approach to disentangling inter-dialectal and intra-dialectal relationships within one such group, the Indo-Aryan subgroup of Indo-European.

Indo-Aryan presents many interesting puzzles. Although all modern Indo-Aryan (henceforth NIA) languages descend from Sanskrit or Old Indo-Aryan (henceforth OIA), their subgrouping and dialectal interrelationships remain somewhat poorly understood (for surveys of assorted problems, see Emeneau 1966; Masica 1991; Toulmin 2009; Smith 2017; Deo 2018). This is partly due to the fact that these languages have remained in contact with each other, and this admixture has complicated our understanding of the languages' history. Furthermore, while most NIA languages have likely gone through stages closely resembling attested Middle Indo-Aryan (MIA) languages such as Prakrit or Pali, no NIA language can be taken with any certainty to be a direct descendant of an attested MIA variety, further shrouding the historical picture of their development.

The primary goal of the work described in this paper is to build, or work towards building, a model of Indo-Aryan dialectology that incorporates realistic assumptions regarding historical linguistics and language change. I draw upon admixture models and deep generative models to tease apart historic language contact and language-specific behavior in the overall patterns of sound change displayed by Indo-Aryan languages. I show that a "deep" model of Indo-Aryan dialectology sheds some light on questions regarding inter-relationships among the Indo-Aryan languages, and performs better than a "shallow" model in terms of certain qualities of the posterior distribution (e.g., entropy of posterior distributions). I provide a comparison with other metrics, and outline future pathways for model development.

    2 Sound Change

The notion that sound change proceeds in a regular and systematic fashion is a cornerstone of the comparative method of historical linguistics. When we consider cognates such as Greek pherō and Sanskrit bharā(mi) 'I carry', we observe regular sound correspondences (e.g., ph:bh) which allow us to formulate sound changes that have operated during the course of each language's development from their shared common ancestor. Under ideal circumstances, these are binary yes/no questions (e.g., Proto-Indo-European *bh > Greek ph). At other times, there is some noise in the signal: for instance, OIA kṣ is realized as kh in most Romani words (e.g., akṣi- 'eye' > jakh), but also as čh (kṣurikā- > čhuri 'knife'), according to Matras (2002, 41). This is undoubtedly due to relatively old language contact (namely lexical borrowing) between prehistoric Indo-Aryan dialects, as opposed to different conditioning environments which trigger a change kṣ > kh in some phonological contexts but kṣ > čh in others. The idea that Indo-Aryan speech varieties borrowed forms from one another on a large scale is well established (Turner, 1975 [1967], 406), as is often the case in situations where closely related dialects have developed in close geographic proximity to one another (cf. Bloomfield, 1933, 461–495). An effective model of Indo-Aryan dialectology must be able to account for this sort of admixture. Phylogenetic methods and distance-based methods provide indirect information regarding language contact (e.g., in the form of uncertain tree topologies), but do not explicitly model intimate borrowing.

A number of studies have used mixed-membership models such as the Structure model (Pritchard et al., 2000) in order to explicitly model admixture between languages (Reesink et al., 2009; Syrjänen et al., 2016). Under this approach, individual languages receive their linguistic features from latent ancestral components with particular feature distributions. A key assumption of the Structure model is the relative invariance and stability of the features of interest (e.g., allele frequencies, linguistic properties). However, sound change is a highly recurrent process, with many telescoped and intermediate changes, and it is not possible to treat sound changes that have operated as stable, highly conservative features.1

Intermediate stages between OIA and NIA languages are key for capturing similarities in cross-linguistic behavior, and we require a model that teases apart dialect group-specific trends and language-level ones. Consider the following examples:

• Assamese /x/, the reflex of OIA s, ś, ṣ, is thought to develop from intermediate *ś (Kakati, 1941, 224). This isogloss would unite it with languages like Bengali, which show /ʃ/ for OIA s, ś, ṣ.

• Some instances of NIA bh likely come from an earlier *mh (Tedesco 1965, 371; cf. Oberlies 2005, 48).

• The Marathi change ch > s affects certain words containing MIA *ch < OIA kṣ as well as OIA ch (Masica, 1991, 457); ch ∼ kh < OIA kṣ variation is of importance to MIA and NIA dialectology (compare the Romani examples given above).

In all of these examples, a given NIA language shows the effects of chronologically deep behavior which serves as an isogloss uniting it with other NIA languages, but this trend is masked by subsequent language-specific changes.2 Work on probabilistic reconstruction of proto-word forms explicitly appeals to intermediate chronological stages where linguistic data are unobserved (Bouchard-Côté et al., 2007); however, unlike the work cited, this paper does not assume a fixed phylogeny, and hence I cannot adopt many of the simplifying conventions that the authors use.

    3 Data

I extracted all modern Indo-Aryan forms from Turner's (1962–1966) Comparative Dictionary of the Indo-Aryan Languages (henceforth CDIAL),3 along with the Old Indo-Aryan headwords (henceforth ETYMA) from which these reflexes descend.

1Cathcart (to appear) circumvents this issue in a mixed-membership model of Indo-Aryan dialectology by considering only sound changes thought a priori in the literature to be relatively stable and of importance to dialectology.

2Some similar-looking sound changes can be shown to be chronologically shallow. For instance, the presence of ṣ for original kh in Old Braj, taken by most scholars to represent a legitimate sound change and not just an orthographic idiosyncrasy, affects Persian loans such as ṣaracu 'expense' ← Modern Persian xirč (McGregor, 1968, 125). This orthographic behavior is found in Old Gujarati as well (Baumann, 1975, 9). For further discussion of this issue, see Strnad 2013, 16ff.

    3Available online at http://dsal.uchicago.edu/dictionaries/soas/



Transcriptions of the data were normalized and converted to the International Phonetic Alphabet (IPA). Systematic morphological mismatches between OIA etyma and reflexes were accounted for, including stripping the endings from all verbs, since citation forms for OIA verbs are in the 3sg present, while most NIA reflexes give the infinitive. I matched each dialect with corresponding languoids in Glottolog (Hammarström et al., 2017) containing geographic metadata, resulting in the merger of several dialects. I excluded cognate sets with fewer than 10 forms, yielding 33,231 modern Indo-Aryan forms. I preprocessed the data, first converting each segment into its respective sound class, as described by List (2012), and subsequently aligning each converted OIA/NIA string pair via the Needleman-Wunsch algorithm, using the Expectation-Maximization method described by Jäger (2014), building off of work by Wieling et al. (2012). This yields alignments of the following type: e.g., OIA /a:ntra/ 'entrails' > Nepali /a:n∅ro/, where ∅ indicates a gap where the "cursor" advances for the OIA string but not the Nepali string. Gaps on the OIA side are ignored, yielding a one-to-many OIA-to-NIA alignment; this ensures that all aligned cognate sets are of the same length.
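To illustrate the resulting data representation, the sketch below (my own, with hypothetical names and a "#" boundary symbol) unpacks one aligned OIA/NIA pair into the trigram-to-segment rewrite pairs used by the models in the next section; the sound-class conversion and EM-weighted alignment themselves are not reproduced here.

```python
# Minimal sketch (not the paper's code): unpack an aligned OIA/NIA pair into
# (OIA trigram, NIA output) training examples, assuming the alignment step has
# already produced two equal-length sequences in which "-" marks a gap on the
# NIA side. "#" is used here as a word-boundary symbol.
def to_training_pairs(oia_aligned, nia_aligned, boundary="#"):
    assert len(oia_aligned) == len(nia_aligned)
    padded = [boundary] + list(oia_aligned) + [boundary]
    pairs = []
    for t, nia_seg in enumerate(nia_aligned, start=1):
        trigram = tuple(padded[t - 1:t + 2])   # OIA segment with left/right context
        pairs.append((trigram, nia_seg))       # NIA segment(s) aligned to this timepoint
    return pairs

# OIA /a:ntra/ 'entrails' > Nepali /a:n-ro/, with the OIA t aligned to a gap:
print(to_training_pairs(["a:", "n", "t", "r", "a"], ["a:", "n", "-", "r", "o"]))
```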

    4 Model

The basic family of models this paper employs is the Bayesian mixture model, which assumes that each word in each language is generated by one of K latent dialect components. Like Structure (and similar methodologies like Latent Dirichlet Allocation), this model assumes that different elements in the same language can be generated by different dialect components. Unlike the most basic type of Structure model, which assumes a two-level data structure consisting of (1) languages and (2) the features they contain, our model assumes a three-level hierarchy, where (1) languages contain (2) words, which display the operation of different (3) sound changes; latent variable assignment happens at the word level.

I contrast the behavior of a DEEP model with that of a SHALLOW model. The deep model draws inspiration from Bayesian deep generative models (Ranganath et al., 2015), which incorporate intermediate latent variables that mimic the architecture of a neural network. This structure allows us to posit an intermediate representation between the sound patterns in the OIA etymon and the sound patterns in the NIA reflex, allowing the model to pick up on shared dialectal similarities between forms in languages as opposed to language-specific idiosyncrasies. The shallow model, which serves as a baseline of sorts, conflates dialect group-level and language-level trends; it contains a flat representation of all of the sound changes taking place between a NIA word and its ancestral OIA etymon, and in this sense is halfway between a Structure model and a Naïve Bayes classifier (with a language-specific rather than global prior over component membership).

    4.1 Shallow model

Here, I describe the generative process for the shallow model, assuming W OIA etyma, L languages, K dialect components, I unique OIA inputs, O unique NIA outputs, and aligned OIA-NIA word pair lengths T_w : w ∈ {1, ..., W}. For each OIA etymon, an input x_{w,t} at time point t ∈ {1, ..., T_w} consists of a trigram centered at the timepoint in question (e.g., ntr in OIA /a:ntra/ 'entrails'), and the NIA reflex's output y_{w,l,t} contains the segment(s) aligned with timepoint t (e.g., Nepali ∅). x_{w,t} : t = 0 is the left word boundary, while x_{w,t} : t = T_w + 1 is the right word boundary. Accordingly, sound change in the model can be viewed as a rewrite rule of the type A > B / C _ D. The model has the following parameters:

• Language-level weights over dialect components: U_{l,k}; l ∈ {1, ..., L}, k ∈ {1, ..., K}

• Dialect component-level weights over sound changes: W_{k,i,o}; k ∈ {1, ..., K}, i ∈ {1, ..., I}, o ∈ {1, ..., O}

    The generative process is as follows:

For each OIA etymon x_w, w ∈ {1, ..., W}:
  For each language l ∈ {1, ..., L} in which the etymon survives, containing a reflex y_{w,l}:
    Draw a dialect component assignment z_{w,l} ∼ Categorical(f(U_{l,·}))
    For each time point t ∈ {1, ..., T_w}:
      Draw a NIA sound y_{w,l,t} ∼ Categorical(f(W_{z_{w,l}, x_{w,t}, ·}))
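The following sketch (not the paper's code) illustrates this generative process for a single etymon and language, assuming the weight arrays U (of shape L × K) and W (of shape K × I × O) listed above, with OIA trigrams and NIA outputs coded as integers.

```python
# Illustrative sketch of the shallow model's generative process (section 4.1).
# U: (L x K), W: (K x I x O); f is the softmax function; names are mine.
import numpy as np

rng = np.random.default_rng(0)

def f(v):                                      # softmax
    e = np.exp(v - v.max())
    return e / e.sum()

def generate_reflex(x_w, l, U, W):
    """Sample a dialect component and an NIA reflex for etymon x_w in language l."""
    z = rng.choice(U.shape[1], p=f(U[l]))                        # z_{w,l} ~ Categorical(f(U_{l,.}))
    y = [rng.choice(W.shape[2], p=f(W[z, x_t])) for x_t in x_w]  # y_{w,l,t} ~ Categorical(f(W_{z,x_{w,t},.}))
    return z, y
```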


All weights in U and W are drawn from a Normal distribution with a mean of 0 and standard deviation of 10; f(·) represents the softmax function (throughout this paper), which transforms these weights to probability simplices. The generative process yields the following joint likelihood of the OIA etyma x and NIA reflexes y (with the discrete latent variables z marginalized out):

$$P(\mathbf{x}, \mathbf{y} \mid U, W) = \prod_{w=1}^{W} \prod_{l=1}^{L} \sum_{k=1}^{K} \left[ f(U_{l,k}) \prod_{t=1}^{T_w} f(W_{k, x_{w,t}, \cdot})_{y_{w,l,t}} \right] \qquad (1)$$

As readers will note, this model weights all sound changes equally, and makes no attempt to distinguish between dialectologically meaningful changes and noisy, idiosyncratic changes.
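As a concreteness check, the sum over k in equation (1) can be computed directly for a single etymon/reflex pair; the sketch below (my own, with illustrative variable names) does so with NumPy, using the same U and W arrays as above.

```python
# Sketch of equation (1) for one etymon/reflex pair, summing out the latent
# dialect component. U: (L x K), W: (K x I x O); x_w and y_wl are integer-coded
# OIA trigrams and NIA outputs (names are mine, not the paper's).
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def pair_likelihood(x_w, y_wl, l, U, W):
    total = 0.0
    for k in range(U.shape[1]):
        per_t = [softmax(W[k, x_t])[y_t] for x_t, y_t in zip(x_w, y_wl)]
        total += softmax(U[l])[k] * np.prod(per_t)  # f(U_{l,k}) * prod_t f(W_{k,x_{w,t},.})_{y_{w,l,t}}
    return total
```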

4.2 Deep model

The deep model, like the shallow model, is a mixture model, and as such retains the language-level weights U over dialect component membership. Unlike the shallow model, in which the likelihood of an OIA etymon and NIA reflex under a component assignment z = k depends on a flat representation of edit probabilities between OIA trigrams and NIA unigrams associated with dialect component k, here I attempt to add some depth to this representation of sound change by positing a hidden layer of dimension J between each x_{w,t} and y_{w,l,t}. The goal is to mimic a "noisy" reconstruction of an intermediate stage between OIA and NIA represented by dialect group k. This reconstruction is not an explicit, linguistically meaningful string (as in Bouchard-Côté et al. 2007, 2008, 2013); furthermore, it is re-generated for each individual reflex of each etymon, and not shared across data points (such a model would introduce deeply nested dependencies between variables, and enumerating all possible reconstructions would be computationally infeasible).

For parsimony's sake, I employ a simple Recurrent Neural Network (RNN) architecture to capture rightward dependencies (Elman, 1990). Figure 1 gives a visual representation of the network, unfolded in time. This model exchanges W, the dialect component-level weights over sound changes, for the following parameters:

• Dialect component-level weights governing hidden layer unit activations by OIA sounds: W^x_{k,i,j}; k ∈ {1, ..., K}, i ∈ {1, ..., I}, j ∈ {1, ..., J}

• Dialect component-level weights governing hidden layer unit activations by previous hidden layers: W^h_{k,i,j}; k ∈ {1, ..., K}, i ∈ {1, ..., J}, j ∈ {1, ..., J}

• Language-level weights governing NIA output activations by hidden layer units: W^y_{l,j,o}; l ∈ {1, ..., L}, j ∈ {1, ..., J}, o ∈ {1, ..., O}

For a given mixture component z = k, the activation of the hidden layer at time t, h_t, depends on two sets of parameters, each associated with component k: the weights W^x_{k,x_{w,t},·}, associated with the OIA input at time t, and W^h_k, the weights associated with the previous hidden layer h_{t-1}'s activations, for all t > 1. Given a hidden layer h_t, the weights W^y_l can be used to generate a probability distribution over possible outcomes in NIA language l. The forward pass of this network can be viewed as a generative process, denoted y_{w,l} ∼ RNN(x_w, W^x_k, W^h_k, W^y_l), under the parameters for component k and language l; under such a process, the likelihood of y_{w,l} can be computed as follows:

$$P_{\mathrm{RNN}}(y_{w,l} \mid x_w, W^x_k, W^h_k, W^y_l) = \prod_{t=1}^{T_w} f(h_t^\top W^y_l)_{y_{w,l,t}} \qquad (2)$$

where

$$h_t = \begin{cases} f(W^x_{k,x_{w,t},\cdot}), & \text{if } t = 1 \\ f(h_{t-1}^\top W^h_k \oplus W^x_{k,x_{w,t},\cdot}), & \text{if } t > 1 \end{cases} \qquad (3)$$
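A minimal sketch of the forward pass defined by equations (2) and (3) for one component k and language l follows; it assumes ⊕ denotes elementwise addition and uses illustrative parameter shapes (W^x: I × J, W^h: J × J, W^y: J × O), neither of which is stated explicitly above.

```python
# Sketch of the forward pass in equations (2)-(3): softmax activations throughout;
# the "+" corresponds to the ⊕ of equation (3) under my elementwise-addition reading.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def p_rnn(x_w, y_wl, Wx, Wh, Wy):
    """Likelihood P_RNN(y_{w,l} | x_w) under component/language-specific weights."""
    lik, h = 1.0, None
    for x_t, y_t in zip(x_w, y_wl):
        h = softmax(Wx[x_t]) if h is None else softmax(h @ Wh + Wx[x_t])  # eq. (3)
        lik *= softmax(h @ Wy)[y_t]                                       # eq. (2)
    return lik
```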

The generative process for this model is nearly identical to the process described in the previous sections; however, after the dialect component assignment (z_{w,l} ∼ Categorical(f(U_{l,·}))) is drawn, the NIA string y_{w,l} is sampled from RNN(x_w, W^x_{z_{w,l}}, W^h_{z_{w,l}}, W^y_l). The joint likelihood of the OIA etyma x and NIA reflexes y (with the discrete latent variables z marginalized out) is the following:

$$P(\mathbf{x}, \mathbf{y} \mid U, W^x, W^h, W^y) = \prod_{w=1}^{W} \prod_{l=1}^{L} \sum_{k=1}^{K} \left[ f(U_{l,k}) \, P_{\mathrm{RNN}}(y_{w,l} \mid x_w, W^x_k, W^h_k, W^y_l) \right] \qquad (4)$$


The same N(0, 10) prior as above is placed over U, W^x, W^h, W^y. J, the dimension of the hidden layer, is fixed at 100. This model bears some similarities to the mixture of RNNs described by Kim et al. (2018).

I have employed a simple RNN (rather than a more state-of-the-art architecture) for several reasons. The first is that I am interested in the consequences of expanding a flat mixture model to contain a simple, slightly deeper architecture. Additionally, the hidden layer of a simple RNN can be activated by a softmax function, which is more desirable from the perspective of representing sound change as a categorical or multinomial distribution, since all layer unit activations sum to one; Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures, by contrast, traditionally use sigmoid or hyperbolic tangent functions to activate the hidden layer. Furthermore, long-distance dependencies are not particularly widespread in Indo-Aryan sound change, lessening the need for more complex architectures. At the same time, the RNN is a crude approximation to the reality of language change. RNNs and related models draw a single arc between a hidden layer at time t and the corresponding output. It is perhaps not appropriate to envision this single dependency unless the dimensionality of the hidden layer is large enough to absorb potential contextual information that is crucial to sound change. To put it simply, emission probabilities in sound change are sharper than the transitions common in most NLP applications (e.g., sentence prediction), and it may not be correct to envision y_t as depending on h_t alone.

Figure 1: The network unfolded in time. NIA outputs y_1, ..., y_{T_w} depend on hidden layers; hidden layer activations are dependent on dialect component-specific parameters, while activations of the output layer are dependent on individual NIA language-specific parameters.

I fit the models via MAP estimation and monitor convergence by observing the trace of the log posterior (Figure 2).

The flat model fails to pick up on any major differences between languages, finding virtually identical posterior values of f(U_l), the language-level distribution over dialect component membership, for all l ∈ {1, ..., L}. According to the MAP configuration, each language draws forms from the same dialect group with > .99 probability, essentially undergoing a sort of "component collapse" that latent variable models sometimes encounter (Bowman et al., 2015; Dinh and Dumoulin, 2016). It is likely that bundling together sound change features leads to component-level distributions over sound changes with high entropy that are virtually indistinguishable from one another.5 While this particular result is disappointing in the lack of information it provides, I observe some properties of our models' posterior values in order to diagnose problems that can be addressed in future work (discussed below).

The deep model, on the other hand, infers highly divergent language-level posterior distributions over cluster membership. Since these distributions are not identical across initializations due to the label-switching problem, I compute the Jensen-Shannon divergence between the language-level posterior distributions over cluster membership for each pair of languages in our sample for each initialization. I then average these divergences across initializations.

5I made several attempts to run this model with different specifications, including different prior distributions, but achieved the same result.

https://github.com/chundrac/IA_dial/VarDial2019



Figure 2: Log posteriors for shallow model (left) and deep model (right) for 10000 iterations over three random initializations.

These averaged divergences are then scaled to three dimensions using multidimensional scaling. Figure 3 gives a visualization of these transformed values via the red-green-blue color vector, plotted on a map; languages with similar component distributions display similar colors. With a few exceptions (which may be artifacts of the fact that certain languages have only a small number of data points associated with them), a noticeable divide can be seen between languages of the main Indo-Aryan speech region on the one hand and, on the other, languages of northwestern South Asia (dark blue), the Dardic languages of Northern Pakistan, and the Pahari languages of the Indian Himalayas, though this division is not clear cut. Romani and other Indo-Aryan varieties spoken outside of South Asia show affiliation with multiple groups. While Romani dialects are thought to have a close genetic affinity with Hindi and other Central Indic languages, Romani was likely in contact with languages of northwestern South Asia during the course of its speakers' journey out of South Asia (Hamp, 1987; Matras, 2002). However, this impressionistic evaluation is by no means a confirmation that the deep model has picked up on linguistically meaningful differences between speech varieties. In the following sections, some comparison and evaluation metrics and checks are deployed in order to assess the quality of these models' behavior.
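A sketch of this scaling step, under the assumption that the per-initialization membership matrices are stacked in an array P of shape (initializations × languages × K), might look as follows; it uses standard SciPy/scikit-learn calls, and the helper name is mine.

```python
# Sketch of the map-coloring step: average pairwise Jensen-Shannon divergences
# between language-level posterior membership distributions, scale them to three
# dimensions with MDS, and rescale each dimension to [0, 1] for use as RGB values.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS

def rgb_from_memberships(P):
    n_init, n_lang, _ = P.shape
    D = np.zeros((n_lang, n_lang))
    for r in range(n_init):
        for a in range(n_lang):
            for b in range(n_lang):
                D[a, b] += jensenshannon(P[r, a], P[r, b]) ** 2  # squared distance = divergence
    D /= n_init                                                  # average over initializations
    coords = MDS(n_components=3, dissimilarity="precomputed").fit_transform(D)
    return (coords - coords.min(0)) / (coords.max(0) - coords.min(0))
```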

    5.1 Entropy of distributions

I measure the average entropy of the model's posterior distributions in order to gauge the extent to which the models are able to learn sparse, informative distributions over sound changes, hidden state activations, or other parameters concerning transitions through the model architecture. Normalized entropy is used in order to make entropies of distributions of different dimension comparable; a distribution's entropy can be normalized by dividing by its maximum possible entropy.
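For concreteness, a minimal implementation of this normalized entropy (my own sketch, not the paper's code) is:

```python
# Normalized entropy: the entropy of a categorical distribution divided by its
# maximum possible entropy, log of the number of outcomes.
import numpy as np

def normalized_entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    h = -np.sum(p * np.log(p + eps))
    return h / np.log(len(p))   # maximum entropy of a K-outcome distribution is log K
```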

As mentioned above, our data set consists of OIA trigrams and the NIA segment corresponding to the second segment in the trigram, representing rewrite rules operating between OIA and the NIA languages in our sample. It is often the case that more than one NIA reflex is attested for a given OIA trigram. As such, the sound changes that have operated in an NIA language can be represented as a collection of categorical distributions, each summing to one. I calculate the average of the normalized entropies of these sound change distributions as a baseline against which to compare entropy values for the models' parameters. The pooled average of the normalized entropies across all languages is .11, while the average of averages for each language is .063.

For the shallow model, the parameter of interest is f(W), the dialect component-level collection of distributions over sound changes, the mean normalized entropy of which, averaged across initializations but pooled across components within each initialization, is 0.91 (raw values range from 0.003 to 1). For the deep model, the average entropy of the dialect-level distributions over hidden-layer activations, f(W^x), is only slightly lower, at 0.86 (raw values range from close to 0 to 1).

For each k ∈ {1, ..., K}, I compute the forward pass of RNN(x_w, W^x_k, W^h_k, W^y_l) for each etymon w and each language l in which the etymon survives, using the inferred values for W^x_k, W^h_k, W^y_l, and compute the entropy of each f(h_t^⊤ W^y_l), yielding an average of .74 (raw values range from close to 0 to 1). While these values are still very high, it is clear that the model with a hidden layer has learned sparser, potentially more meaningful distributions than the flat approach, and that increasing the dimensionality of the hidden layer will likely bring about even sparser, more meaningful distributions. The entropies cited here are considerably higher than the average entropy of languages' sound change distributions, but the latter distributions do little to tell us about the internal clustering of the languages.

5.2 Comparison with other linguistic distance metrics

Here, I compare the cluster membership inferred by this paper's models against other measures of linguistic distance. Each method yields a pairwise inter-language distance metric, which can be compared against a non-linguistic measure.



    Figure 3: Dialect group makeup of languages in sample under deep model

I measure the correlation (Spearman's ρ) between each linguistic distance measure and both great circle geographic distance and patristic distance according to the Glottolog phylogeny.

5.2.1 Levenshtein distance

Borin et al. (2014) measure the normalized Levenshtein distances (i.e., the edit distance between two strings divided by the length of the longer string) between words for the same concept in pairs of Indo-Aryan languages, and find that average normalized Levenshtein distance correlates significantly with patristic distances in the Ethnologue tree. This paper's dataset is not organized by semantic meaning, so for comparability, I measure the average normalized Levenshtein distance between cognates in pairs of Indo-Aryan languages, which picks up on phonological divergence between dialects, as opposed to both phonological and lexical divergence.
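A compact sketch of this measure for a single cognate pair, using the usual dynamic-programming edit distance (function names are mine, not the paper's):

```python
# Normalized Levenshtein distance between two segment sequences: edit distance
# divided by the length of the longer string (cf. Borin et al., 2014).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ldn(a, b):
    return levenshtein(a, b) / max(len(a), len(b))
```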

5.2.2 Jensen-Shannon divergence

Each language in our dataset attests one or more (due to language contact, analogy, etc.) outcomes for a given OIA trigram, yielding a collection of sound change distributions, as described above. For each pair of languages, I compute the Jensen-Shannon divergence between sound change distributions for all OIA trigrams that are continued in both languages, and average these values. This gives a measure of pairwise average diachronic phonological divergence between languages.
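A sketch of this computation for one language pair, assuming each language's sound changes are stored as counts of NIA outcomes per OIA trigram (the data structure here is an illustrative assumption):

```python
# Average Jensen-Shannon divergence between two languages' sound change
# distributions over shared OIA trigrams. Each language is a dict of the form
# {trigram: {NIA outcome: count}}; scipy's jensenshannon returns a distance,
# so it is squared to recover the divergence.
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_jsd(lang_a, lang_b):
    divs = []
    for tri in set(lang_a) & set(lang_b):
        outcomes = sorted(set(lang_a[tri]) | set(lang_b[tri]))
        p = np.array([lang_a[tri].get(o, 0) for o in outcomes], float)
        q = np.array([lang_b[tri].get(o, 0) for o in outcomes], float)
        divs.append(jensenshannon(p / p.sum(), q / q.sum()) ** 2)
    return float(np.mean(divs))
```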

5.2.3 LSTM Autoencoder

Rama and Çöltekin (2016) and Rama et al. (2017) develop an LSTM-based method for representing the phonological structure of individual word forms across closely related speech varieties. Each string is fed to a unidirectional or bidirectional LSTM autoencoder, which learns a continuous latent multidimensional representation of the sequence. This embedding is then used to reconstruct the input sequence. The latent values in the embedding provide information that can be used to compute dissimilarity (in the form of cosine or Euclidean distance) between strings or across speech varieties (by averaging the latent values for all strings in each dialect or language). I use the bidirectional LSTM autoencoder described in the work cited in order to learn an 8-dimensional latent representation for all NIA forms in the dataset, training the model over 20 epochs on batches of 32 data points using the Adam optimizer to minimize the categorical cross-entropy between the input sequence and the NIA reconstruction predicted by the model. I use the learned model parameters to generate a latent representation for each form. The latent representations are averaged across forms within each language, and pairwise linguistic Euclidean distances are computed between each averaged representation.
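A Keras-style sketch of a bidirectional LSTM autoencoder of this kind is given below; the input dimensions and decoder size are illustrative assumptions rather than the settings of Rama and Çöltekin (2016), and only the 8-dimensional latent representation, 20 epochs, batch size 32, Adam, and categorical cross-entropy follow the description above.

```python
# Sketch of a bidirectional LSTM autoencoder over one-hot segment sequences.
from tensorflow.keras.layers import Input, Bidirectional, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

T, V, LATENT = 20, 60, 8                     # max length, segment inventory size (assumed), latent dim
inp = Input(shape=(T, V))                    # one-hot encoded NIA string
z = Bidirectional(LSTM(LATENT // 2))(inp)    # 4 units per direction, concatenated -> 8-dim embedding
dec = RepeatVector(T)(z)                     # broadcast the embedding across timesteps
dec = LSTM(64, return_sequences=True)(dec)   # decoder (size is an arbitrary choice)
out = TimeDistributed(Dense(V, activation="softmax"))(dec)

autoencoder = Model(inp, out)
encoder = Model(inp, z)                      # used afterwards to extract latent representations
autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
# autoencoder.fit(X, X, epochs=20, batch_size=32)     # X: (N, T, V) one-hot NIA forms
# lang_vec = encoder.predict(X_lang).mean(axis=0)     # average latent values per language
```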


                 Geographic    Genetic
    Shallow JSD  −0.01         −0.03
    Deep JSD      0.147*        0.008
    LDN           0.346*        0.013
    Raw JSD       0.302*       −0.051*
    LSTM AE       0.158*       −0.068*
    LSTM ED       0.084*        0.0001

Table 1: Spearman's ρ values for correlations between each linguistic distance metric (JSD = Jensen-Shannon Divergence, LDN = Levenshtein Distance Normalized, AE = Autoencoder, ED = Encoder-Decoder) and geographic and genetic distance. Asterisks represent significant correlations.


5.2.4 LSTM Encoder-Decoder

For the sake of completeness, I use an LSTM encoder-decoder to learn a continuous representation for every OIA-NIA string pair. This model is very similar to the LSTM autoencoder, except that it takes an OIA input and reconstructs an NIA output, instead of taking an NIA form as input and reconstructing the same string. I train the model as described above.

    5.3 Correlations

Table 1 gives correlation coefficients (Spearman's ρ) between linguistic distance metrics and non-linguistic distance metrics. In general, correlations with Glottolog patristic distance are quite poor. This is surprising for Levenshtein Distance Normalized, given the high correlation with patristic distance reported by Borin et al. (2014). Given that the authors measured Levenshtein distance between identical concepts in pairs of languages, and not cognates, as I do here, it is possible that lexical divergence carries a stronger genetic signal than phonological divergence, at least in the context of Indo-Aryan (it is worth noting that I did not balance the tree, as described by the authors; it is not clear that this would have yielded any improvement). On the other hand, the Levenshtein distance measured in this paper correlates significantly with great circle distance, indicating a strong geographic signal. Average Jensen-Shannon divergence between pairs of languages' sound change distributions shows a strong association with geographic distance as well.

Divergence/distance measures based on the deep model, the LSTM Autoencoder, and the LSTM Encoder-Decoder show significant correlations with geospatial distance, albeit lower ones. It is not entirely clear what accounts for this disparity. Intuitively, we expect chronologically shallower features to correlate with geographic distance. It is possible that the LSTM and RNN architectures are picking up on chronologically deeper information, and show a low geographic signal for this reason, though this highly provisional idea is not borne out by any genetic signal.

It is not clear how to assess the meaning of these correlations at this stage. Nevertheless, deep architectures provide an interesting direction for future research into sound change and language contact, as they have the potential to disaggregate a great deal of information regarding interacting forces in language change that is censored when raw distance measures are computed directly from the data.

    6 Outlook

This paper explored the consequences of adding hidden layers to models of dialectology where the languages have experienced too much contact for phylogenetic models to be appropriate, but have diversified to the extent that traditional dialectometric approaches are not applicable. While the model requires some refinement, its results point in a promising direction. Modifying prior distributions could potentially produce more informative results, as could tweaking hyperparameters of the learning algorithms employed. Additionally, it is likely that the model will benefit from hidden layers of higher dimension J, as well as bidirectional approaches; despite the misgivings regarding LSTMs and GRUs stated above, future work will probably benefit from incorporating these and related architectures (e.g., attention). Finally, the models used in this paper assumed discrete latent variables, attempting to be faithful to the traditional historical linguistic notion of intimate borrowing between discrete dialect groups. However, continuous-space models may provide a more flexible framework for addressing the questions asked in this paper (cf. Murawaki, 2015).

This paper provides a new way of looking at dialectology and linguistic affiliation; with refinement and expansion, it is hoped that this and related models can further our understanding of the history of the Indo-Aryan speech community and can generalize to new linguistic scenarios. It is further hoped that methodologies of this sort can join forces with similar tools designed to investigate the interaction of regularly conditioned sound change and chronologically deep language contact in individual languages' histories.

References

George Baumann. 1975. Drei Jaina-Gedichte in Alt-Gujarātī: Edition, Übersetzung, Grammatik, und Glossar. Franz Steiner, Wiesbaden.

Leonard Bloomfield. 1933. Language. Holt, Rinehart and Winston, New York.

Lars Borin, Anju Saxena, Taraka Rama, and Bernard Comrie. 2014. Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics. In Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3137–3144.

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110:4224–4229.

Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 887–896, Prague. Association for Computational Linguistics.

Alexandre Bouchard-Côté, Percy S. Liang, Dan Klein, and Thomas L. Griffiths. 2008. A probabilistic approach to language change. In Advances in Neural Information Processing Systems, pages 169–176.

R. Bouckaert, P. Lemey, M. Dunn, S. J. Greenhill, A. V. Alekseyenko, A. J. Drummond, R. D. Gray, M. A. Suchard, and Q. D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. Proceedings of the Twentieth Conference on Computational Natural Language Learning (CoNLL).

Chundra Cathcart. to appear. A probabilistic assessment of the Indo-Aryan Inner-Outer Hypothesis. Journal of Historical Linguistics.

Ashwini Deo. 2018. Dialects in the Indo-Aryan landscape. In Charles Boberg, John Nerbonne, and Dominic Watt, editors, The Handbook of Dialectology, pages 535–546. John Wiley & Sons, Oxford.

Laurent Dinh and Vincent Dumoulin. 2016. Training neural Bayesian nets. http://www.iro.umontreal.ca/bengioy/cifar/NCAP2014-summerschool/slides/Laurent_dinh_cifar_presentation.pdf.

Jeffrey Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Murray B. Emeneau. 1966. The dialects of Old Indo-Aryan. In Jaan Puhvel, editor, Ancient Indo-European dialects, pages 123–138. University of California Press, Berkeley.

Harald Hammarström, Robert Forkel, and Martin Haspelmath. 2017. Glottolog 3.3. Max Planck Institute for the Science of Human History.

Eric P. Hamp. 1987. On the sibilants of Romani. Indo-Iranian Journal, 30(2):103–106.

Gerhard Jäger. 2014. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. In Quantifying Language Dynamics, pages 155–204. Brill.

Banikanta Kakati. 1941. Assamese, its formation and development. Government of Assam, Gauhati.

Yoon Kim, Sam Wiseman, and Alexander M. Rush. 2018. A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Johann-Mattis List. 2012. SCA: Phonetic alignment based on sound classes. In M. Slavkovik and D. Lassiter, editors, New directions in logic, language, and computation, pages 32–51. Springer, Berlin, Heidelberg.

Colin P. Masica. 1991. The Indo-Aryan languages. Cambridge University Press, Cambridge.

Yaron Matras. 2002. Romani: A Linguistic Introduction. Cambridge University Press, Cambridge.

R. S. McGregor. 1968. The language of Indrajit of Orcha. Cambridge University Press, Cambridge.

Yugo Murawaki. 2015. Continuous space representations of linguistic typology and their application to phylogenetic inference. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 324–334.

John Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1):175–198.

John Nerbonne and Wilbert Heeringa. 2001. Computational comparison and classification of dialects. Dialectologia et Geolinguistica, 9:69–83.



Thomas Oberlies. 2005. A historical grammar of Hindi. Leykam, Graz.

Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.

Jelena Prokić and John Nerbonne. 2008. Recognising groups among dialects. International Journal of Humanities and Arts Computing, 2(1-2):153–172.

Taraka Rama and Çağrı Çöltekin. 2016. LSTM autoencoders for dialect analysis. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 25–32.

Taraka Rama, Çağrı Çöltekin, and Pavel Sofroniev. 2017. Computational analysis of Gondi dialects. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 26–35.

Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. 2015. Deep exponential families. In Artificial Intelligence and Statistics, pages 762–771.

Ger Reesink, Ruth Singer, and Michael Dunn. 2009. Explaining the linguistic diversity of Sahul using population models. PLoS Biology, 7:e1000241.

Caley Smith. 2017. The dialectology of Indic. In Jared Klein, Brian Joseph, and Matthias Fritz, editors, Handbook of Comparative and Historical Indo-European Linguistics, pages 417–447. De Gruyter, Berlin, Boston.

Jaroslav Strnad. 2013. Morphology and Syntax of Old Hindi. Brill, Leiden.

Kaj Syrjänen, Terhi Honkola, Jyri Lehtinen, Antti Leino, and Outi Vesakoski. 2016. Applying population genetic approaches within languages: Finnish dialects as linguistic populations. Language Dynamics and Change, 6:235–283.

Paul Tedesco. 1965. Turner's Comparative Dictionary of the Indo-Aryan Languages. Journal of the American Oriental Society, 85:368–383.

Matthew Toulmin. 2009. From linguistic to sociolinguistic reconstruction: the Kamta historical subgroup of Indo-Aryan. Pacific Linguistics, Research School of Pacific and Asian Studies, The Australian National University, Canberra.

Ralph L. Turner. 1962–1966. A comparative dictionary of the Indo-Aryan languages. Oxford University Press, London.

Ralph L. Turner. 1975 [1967]. Geminates after long vowel in Indo-Aryan. In R. L. Turner: Collected Papers 1912–1973, pages 405–415. Oxford University Press, London.

Martijn Wieling, Eliza Margaretha, and John Nerbonne. 2012. Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40(2):307–314.