PanLex

Panlingual Lexical Collaboration

Jonathan Pool

University of Washington Computational Linguistics Laboratory

22 April 2008

Task 1.

You encounter the word “प्रधानाध्यापक”.

What language is it in?

What does it mean?

Task 2.

You encounter the word list at http://www.geonames.de/peace.html.

Is its content already in PanLex?

If not, how can you contribute it?

Pre-Demo

PanLex (http://panlex.org/cgi-bin/panlex13.cgi)

Goals

Facilitate panlingual:

1. Lexical collaboration.

2. Lexical intertranslatability.

3. Discursive intertranslatability.

4. Vigor.

Goal 1: Facilitate panlingual lexical collaboration

How?

Strategy

1. Assemble valuable panlingual data.

2. Make the data accessible.

3. Invite contributions to the data.

4. Localize the interface panlingually.

Tactic 1: Assemble valuable panlingual data.

How?

Tactics

Borrow data from TransGraph.

Expression (lexeme) equivalences from 357 dictionaries.

13 multilingual, 344 bilingual.

1050 languages.

2.5 million expressions.

8 million expression tokens.

Accept (mainly) TransGraph’s lightweight schema.

An expression is just a string in a language.

A meaning is just a source-specific ID.

A denotation is just a source assigning a meaning to an expression.

A translation is just 2+ denotations with the same meaning.
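The four definitions above can be sketched as a small data model. This is an illustrative Python sketch, not the TransGraph or PanLex implementation: the class names, the source name "example-dictionary", the meaning IDs 1 and 2, and the extra English entry "grammar" are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    """An expression is just a string in a language."""
    language: str  # language code such as "eng"
    text: str


@dataclass(frozen=True)
class Meaning:
    """A meaning is just a source-specific ID."""
    source: str      # hypothetical source name
    meaning_id: int  # hypothetical ID, meaningful only within its source


@dataclass(frozen=True)
class Denotation:
    """A denotation is a source assigning a meaning to an expression."""
    meaning: Meaning
    expression: Expression


def translations(denotations):
    """Group expressions by meaning and keep meanings with 2+ denotations."""
    by_meaning = defaultdict(list)
    for d in denotations:
        by_meaning[d.meaning].append(d.expression)
    return {m: exprs for m, exprs in by_meaning.items() if len(exprs) >= 2}


m = Meaning(source="example-dictionary", meaning_id=1)
denotations = [
    Denotation(m, Expression("eng", "linguistics")),
    Denotation(m, Expression("ell", "γλωσσολογία")),
    Denotation(m, Expression("tur", "dilbilim")),
    Denotation(m, Expression("est", "keeleteadus")),
    # A meaning with a single denotation yields no translation.
    Denotation(Meaning("example-dictionary", 2), Expression("eng", "grammar")),
]
groups = translations(denotations)
for meaning, exprs in groups.items():
    print(meaning.meaning_id, [f"{e.language}:{e.text}" for e in exprs])
```

Because a meaning is scoped to its source, two sources that agree on a translation would contribute two separate meanings under this model; nothing is merged across sources.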

TransGraph Data

Example

eng: linguistics

ell: γλωσσολογία

tur: dilbilim

est: keeleteadus

6290415713

Tactic 2: Make the data accessible.

How?

Tactics

Open-source (PostgreSQL) database (vs. TransGraph).

Perl CGI-DBI application to query and modify the data.

Domain "panlex.org" to access the application.

All data exposed (vs. PanImages).

Data retrievable interactively and by plain-text or XML file export.
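As a rough illustration of the two export styles named above, the following Python sketch writes the same denotations as tab-separated plain text and as XML. The column order, the XML element and attribute names, and the sample rows are assumptions made for this example; the slides do not specify PanLex's actual export formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows: (source, meaning ID, language code, expression text).
ROWS = [
    ("example-dictionary", 1, "eng", "linguistics"),
    ("example-dictionary", 1, "tur", "dilbilim"),
]


def export_plain_text(rows):
    """Return one tab-separated line per denotation."""
    lines = ["\t".join(str(field) for field in row) for row in rows]
    return "\n".join(lines) + "\n"


def export_xml(rows):
    """Return the same rows as an XML string (element names are made up)."""
    root = ET.Element("denotations")
    for source, meaning_id, language, text in rows:
        dn = ET.SubElement(root, "denotation",
                           source=source, meaning=str(meaning_id))
        ex = ET.SubElement(dn, "expression", language=language)
        ex.text = text
    return ET.tostring(root, encoding="unicode")


plain = export_plain_text(ROWS)
xml_text = export_xml(ROWS)
print(plain, end="")
print(xml_text)
```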

Tactic 3: Invite contributions to the data.

How?

Tactics

User contributions nondestructive.

Not a Wiki, not moderated.

Contributable data:

[Language varieties (vs. TransGraph languages).]

Expressions.

Sources.

Denotations.

Contribution modes:

Batch (file upload; plain-text or XML).

Incremental (interactive editing).
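One way to read "nondestructive" is that a contribution only ever adds denotations under the contributor's own source and never edits or deletes anyone else's. The Python sketch below follows that reading, which is an assumption rather than something the slides spell out. The upload format (meaning ID, language code, and expression separated by tabs), the class and function names, and the sample data are hypothetical.

```python
def parse_batch(text):
    """Parse a hypothetical upload: 'meaning_id<TAB>language<TAB>expression' per line."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_number}: expected 3 tab-separated fields")
        meaning_id, language, expression = (f.strip() for f in fields)
        rows.append((meaning_id, language, expression))
    return rows


class DenotationStore:
    """Append-only store: contributions add records and never alter existing ones."""

    def __init__(self):
        self._records = []  # (source, meaning_id, language, expression)

    def contribute(self, source, rows):
        """Add rows under the contributor's own source; return how many were new."""
        existing = set(self._records)
        added = 0
        for meaning_id, language, expression in rows:
            record = (source, meaning_id, language, expression)
            if record not in existing:
                self._records.append(record)
                existing.add(record)
                added += 1
        return added

    def records(self):
        """Return an immutable snapshot of everything contributed so far."""
        return tuple(self._records)


store = DenotationStore()
# Incremental mode: a couple of records entered one call at a time.
store.contribute("source-a", [("1", "eng", "peace"), ("1", "deu", "Frieden")])
before = store.records()
# Batch mode: a second contributor uploads a plain-text file.
added = store.contribute("source-b", parse_batch("1\teng\tpeace\n1\tfra\tpaix\n"))
print(added, len(store.records()))
```

The second contributor's "peace" is stored alongside the first one's rather than replacing it, since the two records belong to different sources.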

Tactic 4: Localize the interface panlingually.

How?

Tactics

In vivo localization.

Interface entirely lemmatic.

Therefore, PanLex can translate the interface.

Translation core: developer-attested translations.

Translation periphery: election with sources voting.
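The core and periphery split could work roughly as follows. This Python sketch assumes a simple plurality rule in which each source casts one vote for each translation it attests; the slides do not state the actual voting rule, and the function name, tie-breaking choice, and sample data are hypothetical.

```python
from collections import Counter


def elect_translation(lemma, target_language, core, attestations):
    """Pick a translation of an interface lemma.

    core: dict mapping (lemma, language) to a developer-attested translation.
    attestations: iterable of (source, lemma, language, translation) tuples.
    """
    # Translation core: a developer-attested translation wins outright.
    if (lemma, target_language) in core:
        return core[(lemma, target_language)]

    # Translation periphery: each source votes once per candidate it attests.
    ballots = {
        (source, translation)
        for source, src_lemma, language, translation in attestations
        if src_lemma == lemma and language == target_language
    }
    votes = Counter(translation for _, translation in ballots)
    if not votes:
        return None  # no source offers a candidate
    # Break ties alphabetically so the result is deterministic (arbitrary choice).
    best = max(votes.values())
    return min(t for t, n in votes.items() if n == best)


core = {("language", "deu"): "Sprache"}
attestations = [
    ("s1", "expression", "fra", "expression"),
    ("s2", "expression", "fra", "expression"),
    ("s3", "expression", "fra", "locution"),
    ("s3", "language", "deu", "Zunge"),
]
print(elect_translation("language", "deu", core, attestations))
print(elect_translation("expression", "fra", core, attestations))
```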

Evaluation

Test 1 (expert user):

15 query and modification tasks with test questions.

Failures and comments inspired interface changes.

Test 2 (expert user):

Found, formatted, checked, and uploaded data from:

Nepali-Esperanto dictionary.

English-Yiddish dictionary.

Eight-language medical glossary.

Future Work

Coverage

Add dictionaries.

Recruit user-added dictionaries.

Add source types:

Thesauri.

WordNets.

Library subject headings.

Locale repositories.

Monolingual resources.

Export additions to TransGraph.

[Bar chart, axis 0 to 450,000; 22 languages.]

Labels, in extracted order: *eng: English; hun: magyar; *fra: français; *deu: Deutsch; ces: čeština; hrv: hrvatski; *tur: Türkçe; spa: español; est: eesti; ita: italiano; *epo: Esperanto; ara: العربية; fin: suomi; jpn: 日本語; nld: Nederlands; por: português; *rus: русский; bre: brezhoneg; srp: српски; ron: română; kur: kurdî; swe: svenska.

Bar values, ascending: 24,495; 25,315; 26,516; 27,072; 29,436; 36,779; 38,237; 46,921; 50,247; 54,439; 56,122; 62,928; 72,861; 73,628; 82,503; 92,146; 96,735; 110,623; 135,505; 172,435; 264,927; 428,550.

[Bar chart, axis 0 to 24,000; 39 languages.]

Labels, in extracted order: isl: íslenska; sqi: shqipe; pol: polski; *nob: bokmål; nci: Classical Nahuatl; nah: nawatlahtolli; bel: беларуская; cat: català; lat: latine; gle: Gaeilge; dan: dansk; oci: lenga occitana; plt: Plateau Malagasy; tuk: türkmen; slk: slovenčina; slv: slovenščina; oji: ᐊᓂᔑᓇᐯ; chy: Tsétsêhéstaestse; lij: lengua lígure; zho: 漢語; eus: euskara; frp: lenga arpitana; ell: ελληνικά; glv: chengey Vannin; glg: galego; cym: Cymraeg; mlt: Malti; art: ISO 639; afr: Afrikaans; nep: नेपाली; yid: ייִדיש; heb: עברית; kor: 한국어; ltz: Lëtzebuergesch Sprooch; ang: Englisce sprǣc; ind: bahasa Indonesia; fao: føroyskt; aym: aymar aru; pap: Papiamentu.

Bar values, ascending: 4,798; 4,960; 5,181; 5,365; 5,979; 6,370; 6,750; 6,867; 6,868; 7,005; 7,518; 7,655; 7,675; 7,876; 7,903; 8,224; 8,780; 8,850; 8,902; 9,092; 9,240; 9,380; 9,554; 9,819; 10,049; 10,051; 10,593; 12,562; 13,003; 13,595; 14,891; 15,614; 17,132; 17,245; 18,012; 18,378; 19,356; 20,032; 20,513.

[Bar chart, axis 0 to 5,000; 39 languages.]

Labels, in extracted order: lit: lietuvių; hbs: Serbo-Croatian; bul: български; gla: Gàidhlig na h-Alba; yua: yukatek; yor: èdè Yorùbá; fro: Old French; quz: Cusco Quechua; cor: yeth Kernewek; pqm: Malecite-Passamaquoddy; fry: Frysk; nds: Plattdüütsche Sprook; vie: tiếng Việt; hmn: Hmoob; qul: North Bolivian Quechua; ido: Ido; lav: latviešu; bos: bosanski; tel: తెలుగు; roh: lingua rumantscha; ina: interlingua; urd: اردو; ary: Moroccan Arabic; ukr: українська; kab: ثاقبايليث; pcd: langue picarde; fas: فارسی; tgl: Tagalog; cos: lingua corsa; got: gutiska razda; mly: Bahasa Melayu; tpi: Tok Pisin; swa: kiswahili; msa: bahasa Melayu; tha: ภาษาไทย; nap: lengua nnapulitana; pes: فارسی; hin: हिन्दी; qus: Santiago del Estero Quichua.

Bar values, ascending: 1,525; 1,543; 1,587; 1,616; 1,689; 1,737; 1,809; 1,958; 1,968; 2,012; 2,083; 2,185; 2,239; 2,271; 2,296; 2,391; 2,485; 2,536; 2,649; 2,666; 2,758; 2,857; 2,870; 2,912; 2,919; 3,008; 3,121; 3,204; 3,340; 3,463; 3,463; 3,757; 3,776; 3,960; 4,103; 4,321; 4,335; 4,482; 4,747.

Future Work

Features

More query functions.

User SQL entry.

Usability

Test and improve interface.

Non-expert interface.

Standards

Lemmatic forms (e.g., English “to”).

Multiword lexemes.