Lemon-aid: using Lemon to aid quantitative historical linguistic analysis


Data description & conversion | Application and examples

Steven Moran & Martin Brümmer

University of Zurich & University of Leipzig

1 Steven Moran & Martin Brümmer Lemon-aid: using lemon for QuantHistLing


Goals

Convert dictionary and wordlist data into the Lexicon Model for Ontologies, aka lemon
Leverage Linked Data (LD) to combine disparate lexical resources (50+) from the QuantHistLing research unit
Resulting LD resources provide researchers with:

more linguistic data in the Linguistic Linked Open Data Cloud (LLOD)
a translation graph to query across the underlying lexicons and dictionaries to extract semantically-aligned wordlists


Talk map

Data description & conversion
Ontological model
Application and examples


Overview | Data | Conversion

QuantHistLing project

Quantitative Historical Linguistics (QuantHistLing) research unit aims to uncover and clarify phylogenetic relationships between native South American languages using quantitative methods

http://quanthistling.info/

There are two main objectives of the project:
Digitization of lexical resources on South American languages, and
the development of computer-assisted methods and algorithms to quantitatively analyze the digitized data


QuantHistLing data

QuantHistLing aims to digitize around 500 works, most of which are currently only available in print and many of which are the only resources available for the languages that they describe

http://quanthistling.info/index.php?id=resources


Simple data output format that contains metadata (prefixed with “@”) and tab-delimited lexical output


QuantHistLing data format

The first row following the metadata contains the data header with the fields: QLCID, HEAD, HEAD DOCULECT, TRANSLATION, TRANSLATION DOCULECT
They correspond respectively to the internal QLC unique identifier, the headword in the dictionary, the doculect of the headword (or in other words the language which this particular document describes), the translation for the given headword, and the doculect that the translation is given in
For each resource a data dump with the same format is provided by the project
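A minimal sketch of reading such a dump, assuming metadata lines look like "@key: value" and that the header row uses underscores in the two-word field names (both are assumptions; the slide only states the "@" prefix and the tab-delimited layout). The sample rows are invented for illustration and do not come from a QuantHistLing source.

```python
import csv
import io


def parse_qlc(text):
    """Split a QLC-style dump into a metadata dict and a list of row dicts."""
    metadata = {}
    data_lines = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):
            # Metadata line, assumed here to have the shape "@key: value".
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        else:
            data_lines.append(line)
    # The first non-metadata line is the header; the rest are entries.
    reader = csv.DictReader(
        io.StringIO("\n".join(data_lines)),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # keep quote characters in entries untouched
    )
    return metadata, list(reader)


# Invented sample in the assumed layout.
sample = (
    "@source: example_dictionary\n"
    "@date: 2013\n"
    "QLCID\tHEAD\tHEAD_DOCULECT\tTRANSLATION\tTRANSLATION_DOCULECT\n"
    "ex/1\twasi\tdoculect_a\tcasa\tspanish\n"
    "ex/2\tyaku\tdoculect_a\tagua\tspanish\n"
)

metadata, rows = parse_qlc(sample)
print(metadata)
print(rows[0]["HEAD"], "->", rows[0]["TRANSLATION"])
```

Reading the field names from the header row, rather than hard-coding them, keeps the sketch usable if the real dumps spell the fields differently.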


Conversion from QHL-to-LD

We convert the QLC data into Linked Data that conforms to the Lemon model with a simple Python script
Lemon is an ontological model for modeling lexicons and machine-readable dictionaries for linking to the Semantic Web and the Linked Data cloud

http://lemon-model.net/

Lemon developers are also active in the W3C Ontology-Lexica Community Group

Goal is to “develop models for the representation of lexica (and machine readable dictionaries) relative to ontologies”
http://www.w3.org/community/ontolex/


Implementation of QHL data in Lemon


According to the Lemon model, senses should link to an ontological entity like a DBpedia resource
However, we only have the strings representing the word forms, which is not enough data to reliably link to other knowledge bases
We argue that it is linguistically correct to link the senses of the entries (instead of their word forms) to their respective translations
So the ‘sense’ resources serve a purpose, even if they don’t link to the meaning of the entry
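The modeling choice above can be sketched as plain triples. This is an illustration under assumptions, not the authors' conversion script: the base URI and the `translation` property are hypothetical, the column names assume underscores, and the lemon namespace is written as commonly cited (check it against lemon-model.net before relying on it). Literals are kept as (string, doculect) pairs instead of language-tagged RDF literals.

```python
from urllib.parse import quote

LEMON = "http://lemon-model.net/lemon#"   # namespace as commonly cited
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/qhl/"          # hypothetical base URI
TRANSLATION = BASE + "vocab#translation"  # hypothetical linking property


def entry_triples(row):
    """Return (subject, predicate, object) triples for one dictionary row."""
    entry = BASE + "entry/" + quote(row["QLCID"], safe="")
    form = entry + "#form"
    sense = entry + "#sense"
    return [
        (entry, RDF_TYPE, LEMON + "LexicalEntry"),
        (entry, LEMON + "canonicalForm", form),
        (form, LEMON + "writtenRep", (row["HEAD"], row["HEAD_DOCULECT"])),
        (entry, LEMON + "sense", sense),
        # The sense, not the word form, carries the link to the translation.
        (sense, TRANSLATION, (row["TRANSLATION"], row["TRANSLATION_DOCULECT"])),
    ]


# Invented example row.
row = {
    "QLCID": "ex/1",
    "HEAD": "wasi",
    "HEAD_DOCULECT": "doculect_a",
    "TRANSLATION": "casa",
    "TRANSLATION_DOCULECT": "spanish",
}

triples = entry_triples(row)
for triple in triples:
    print(triple)
```

Attaching the translation to the sense resource means two entries from different dictionaries can later be connected through their senses without claiming anything about an external ontology.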


Application | Examples | Conclusion

Application

A major goal in historical-comparative linguistics is the identification of cognates, i.e. sets of words in genealogically related languages that have been derived from a common word or root (e.g. English ‘is’, German ‘ist’, Latin ‘est’, from Indo-European ‘esti’).
Modeling dictionaries and lexicons in a pivot ontology using overlaps in translations is one way to merge several resources into one RDF graph for querying and extracting semantically-aligned wordlists, which can then be used as input into computational historical linguistics tools such as LingPy (List and Moran, 2013).
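The idea of pivoting on overlapping translations can be sketched without any RDF machinery. The following is a simplified illustration, assuming rows with `HEAD`, `HEAD_DOCULECT` and `TRANSLATION` keys (assumed column names) and exact string matches on lower-cased translations; the rows themselves are invented.

```python
from collections import defaultdict


def aligned_wordlist(rows):
    """Group headwords by shared translation, then by doculect."""
    pivot = defaultdict(lambda: defaultdict(set))
    for row in rows:
        gloss = row["TRANSLATION"].strip().lower()
        pivot[gloss][row["HEAD_DOCULECT"]].add(row["HEAD"])
    # Keep only translations attested in at least two doculects,
    # since only those align words across resources.
    return {
        gloss: {doc: sorted(heads) for doc, heads in by_doc.items()}
        for gloss, by_doc in pivot.items()
        if len(by_doc) >= 2
    }


# Invented rows from two hypothetical dictionaries.
rows = [
    {"HEAD": "wasi", "HEAD_DOCULECT": "doculect_a", "TRANSLATION": "casa"},
    {"HEAD": "uta", "HEAD_DOCULECT": "doculect_b", "TRANSLATION": "Casa"},
    {"HEAD": "yaku", "HEAD_DOCULECT": "doculect_a", "TRANSLATION": "agua"},
]

wordlist = aligned_wordlist(rows)
print(wordlist)
```

Each key of the result is one concept slot of a semantically aligned wordlist; the per-doculect forms under it are the candidates a cognate-detection tool would compare.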


As a first step, we have converted the QHL data into RDF and it is available online through a SPARQL endpoint.

http://linked-data.org/sparql/ (preliminary)
http://quanthistlist.info/lod/ (coming soon)
a dump is available at http://linked-data.org/datasets/

Querying the combined dictionaries and lexicons is straightforward


Return all triples

returns over 3.8 million triples
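A query of this kind is presumably the generic match-everything SPARQL pattern. The sketch below is an assumed reconstruction, not the authors' query text: it builds a GET request URL following the standard SPARQL protocol convention of passing the query in a `query` parameter. The endpoint is the preliminary one named on the slides and may no longer be reachable, so nothing is fetched here.

```python
from urllib.parse import urlencode

# Preliminary endpoint named on the slides; availability is not guaranteed.
ENDPOINT = "http://linked-data.org/sparql/"

# Match every triple in the store.
ALL_TRIPLES = "SELECT ?s ?p ?o WHERE { ?s ?p ?o }"

# Counting is usually more practical than fetching millions of rows.
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"


def sparql_get_url(endpoint, query):
    """Build a SPARQL protocol GET URL for the given query string."""
    return endpoint + "?" + urlencode({"query": query})


print(sparql_get_url(ENDPOINT, ALL_TRIPLES))
print(sparql_get_url(ENDPOINT, COUNT_TRIPLES))
```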


Pairs of languages in the translation graph that contain written forms for the lexical sense “casa”


Languages in the translation graph that contain written forms for the lexical sense “casa”

66 language pairs
The marked entry shows that word forms are not normalized and contain data that can be analyzed further


Queries

Can be easily extended to incorporate entire wordlists, such as the Swadesh list (Swadesh, 1952) or Leipzig-Jakarta list (Tadmor et al., 2010)
The combination of disparate data from many dictionaries and lexicons is a first step in a computational historical linguistics pipeline

Results are given in the source documents’ orthographic representations
They must be normalized into an interlingual pivot, such as the International Phonetic Alphabet, if phonetic or phonemic analysis is to be applied to the data
This normalization is the next step before producing phonetic alignments and cognate judgements based on metrics and algorithms for calculating lexical similarity
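One common way to normalize source orthography is a per-source mapping from graphemes to IPA, applied with greedy longest-match segmentation. The sketch below shows that technique under stated assumptions: the profile is a hypothetical toy mapping, not taken from any QuantHistLing source, and unmapped characters are passed through unchanged.

```python
def normalize(form, profile):
    """Segment `form` greedily by longest grapheme and map each to IPA."""
    # Try longer graphemes first so that "ch" wins over "c" + "h".
    graphemes = sorted((g for g in profile if g), key=len, reverse=True)
    segments = []
    i = 0
    while i < len(form):
        for grapheme in graphemes:
            if form.startswith(grapheme, i):
                segments.append(profile[grapheme])
                i += len(grapheme)
                break
        else:
            # No profile entry matched: keep the character as it is.
            segments.append(form[i])
            i += 1
    return segments


# Hypothetical orthography profile for one source document.
profile = {"ch": "tʃ", "sh": "ʃ", "ñ": "ɲ", "y": "j"}

print(normalize("chaki", profile))
```

Returning a list of segments rather than a joined string keeps the output ready for alignment algorithms, which typically operate on segment sequences.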


Conclusion

From data being digitized and extracted from print resources, we are creating machine-readable lexicons that are both interoperable with each other (we link semantic senses using the Lemon ontology model) and with other linguistic sources
We also interlink the resulting dictionary resources with other language resources in the LLOD via ISO 639-3 codes


Future work - simplify!

Build algorithms that identify semantically similar translation-pairs from terse translations

Identify that doculect translations like
“coarsely grind”
“grind up, crush well”
“grind lightly (chili pepper, millet for a quick snack)”
“grind lightly (groundnuts) with stones”

for different languages can be mapped to a simpler form such as “to crush/grind” for initial comparative analysis
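As a very rough illustration of what such simplification could start from, the sketch below strips parenthetical notes and looks for words shared by all translations. This shared-token heuristic is an assumption made for illustration, not the algorithm the authors plan to build, and it would miss synonyms such as “crush”.

```python
import re


def content_words(gloss):
    """Lower-cased words of a translation, ignoring parenthetical notes."""
    without_notes = re.sub(r"\([^)]*\)", " ", gloss)
    return set(re.findall(r"[a-z]+", without_notes.lower()))


def shared_core(glosses):
    """Words that occur in every translation of the group."""
    word_sets = [content_words(g) for g in glosses]
    return set.intersection(*word_sets) if word_sets else set()


glosses = [
    "coarsely grind",
    "grind up, crush well",
    "grind lightly (chili pepper, millet for a quick snack)",
    "grind lightly (groundnuts) with stones",
]

print(shared_core(glosses))  # {'grind'}
```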


Future work - annotate!

Use NLP Interchange Format (Hellmann et al., 2012) to keep track of where information in the dictionaries comes from - or in other words, use NIF combined with Lemon to annotate the QHL data sources for provenance


Future work - link!

Link to further resources that contain linguistic and non-linguistic information

Typological data and geographic variables that may provide useful information for determining the genealogical and geographical relatedness of languages


Many thanks!

Organizers and participants of LDL-2013
QuantHistLing Research Unit - University of Marburg (Michael Cysouw, PI)
University of Zurich, University of Marburg and University of Leipzig
