Language Identification: A neural network approach

Alberto Simões (1), José João Almeida (2), Simon D. Byers (3)

(1) CEHUM, Minho's [email protected]

(2) CCTC, Minho's [email protected]

(3) AT&T Labs, Bedminster [email protected]

SLATE 2014, 19-20 June 2014


In which languages are these texts?

Malgranda Sablodezerto estas dezerto de Okcidenta Aŭstralio

Esperanto

Po nepavykusių pirmųjų bandymų su kukurūzais

Lithuanian



俄罗斯眼下不具备航母建造、停泊和维护所需的基础设施和条件

Simplified Chinese

임금체계 개편은 기본적으로 노사 합의 또는

Korean



جلوگیری کردند. گروه دوم هم به

Persian

আবেদনকারীদের পক্ষে শুনানি করেন ফিদা

Bengali



ဦးသနိး္စနိအ္စိုးရရ �ဲဝန�္ကးီအမာ်းစဟုာ စစဗ္ုိလန္�ဲ

စစဗ္ိုလလ္ထူြကေ္တြ

Burmese

આ રસ મ લ િનચોડી સારીરી િમકસ કરો અ લાસમ

Gujarati


Approaches

Using a dictionary of words for each language:
  - Problem: the number of word forms!

Using language features:
  - compute unigrams, bigrams, trigrams, …;
  - compute short words;
  - compute word beginnings or terminations;

Then use language models:
  - Naïve Bayes;
  - Hidden Markov Models (HMM);
  - Support Vector Machines (SVM);
  - Neural Networks (NN);


Motivation for a new tool

lack of a decent identification tool for Perl;

use of the Chrome Language Detection library is limited:
  - how to add new languages?
  - how to restrict results to specific languages?

there are tools for other programming languages:
  - language interoperability can be a hassle;
  - not clear how to add new languages;


Why use a Neural Network?

learn how Neural Networks work!

an approach where:
  - training is tedious and slow;
  - identification is easy to implement;
  - identification is efficient when BLAS is available;

therefore:
  - it is possible to use the trained data in different programming languages;
  - it is easy to restrict the analysis to a set of languages;
  - identification probabilities are comparable;
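
As a minimal sketch of what restricting the analysis to a set of languages could look like, assume the network produces one score per language. The function name `identify` and the score values below are hypothetical and only illustrate the idea; they are not taken from the tool itself.

```python
def identify(scores, allowed=None):
    """Return the best-scoring language, optionally restricted to `allowed`.

    `scores` maps a language code to the network output for that language
    (made-up values here; real ones would come from the trained network).
    """
    if allowed is None:
        candidates = dict(scores)
    else:
        candidates = {lang: s for lang, s in scores.items() if lang in allowed}
    if not candidates:
        raise ValueError("no candidate languages left after restriction")
    # Scores are comparable across languages, so a plain argmax is enough.
    return max(candidates, key=candidates.get)


scores = {"pt": 0.48, "es": 0.51, "gl": 0.40}  # hypothetical output activations
print(identify(scores))
print(identify(scores, allowed={"pt", "gl"}))
```

Because the restriction only filters the output scores, the trained weights do not need to change.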


Neural Network Architecture

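[Figure: feed-forward network with three layers. Input layer (features): x1, x2, x3, …, xn. Hidden layer: a(2)1, a(2)2, a(2)3, …, a(2)s2. Output layer: y1, y2, …, yK. Weights: Θ(1) between input and hidden layer, Θ(2) between hidden and output layer.]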
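
The identification step for such a three-layer network can be sketched as a forward pass. This is a minimal illustration under assumptions the slides do not state: a sigmoid activation and a bias unit prepended to each layer. The weight values are toy numbers, not trained data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, inputs):
    """One layer: prepend the bias unit, then apply sigmoid to each weighted sum."""
    with_bias = [1.0] + list(inputs)
    return [sigmoid(sum(w * v for w, v in zip(row, with_bias))) for row in theta]


def forward(theta1, theta2, x):
    """x (n features) -> hidden a2 (s2 units) -> output y (K languages)."""
    a2 = layer(theta1, x)     # theta1: s2 rows of n + 1 weights
    return layer(theta2, a2)  # theta2: K rows of s2 + 1 weights


# Toy weights for illustration only: n = 2, s2 = 2, K = 2.
theta1 = [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
theta2 = [[0.0, 2.0, -2.0], [0.0, -2.0, 2.0]]
y = forward(theta1, theta2, [1.0, 0.0])
best = y.index(max(y))  # index of the most likely language
print(y, best)
```

Only two matrix-vector products and an element-wise function are involved, which is why identification is easy to port and benefits from BLAS.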


Preparing Training Data

texts from the TED website;
more than 105 languages available!
English texts were matched against an English dictionary;
OOV (out-of-vocabulary) items are removed from the English texts and from the other-language texts (trying to remove named entities written in their English form from the other texts).

Example

…began spoken word poet Sarah Kay, in a talk that inspired two standing ovations at TED2011. She tells the story of her metamorphosis, from a wide-eyed teenager soaking in verse at New York's Bowery Poetry Club to a teacher connecting kids with the power of self-expression through Project V.O.I.C.E., and gives two breathtaking performances of "B" and "Hiroshima."
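
The out-of-vocabulary filtering described above could be sketched as follows. The helper names `oov_items` and `remove_items`, the toy dictionary and the Portuguese sentence are all hypothetical; the slides do not say how the matching was actually implemented.

```python
import re


def oov_items(english_text, dictionary):
    """Words of the English text that are not in the English dictionary."""
    words = re.findall(r"[^\W\d_]+", english_text)
    return {w for w in words if w.lower() not in dictionary}


def remove_items(text, items):
    """Delete every whole-word occurrence of the given items from a text."""
    if not items:
        return text
    longest_first = sorted(items, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in longest_first) + r")\b"
    without = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", without).strip()


# Toy dictionary and texts, for illustration only.
dictionary = {"poet", "began", "spoken", "word", "at", "the", "club"}
english = "began spoken word poet Sarah Kay at the Bowery club"
portuguese = "a poeta Sarah Kay no clube Bowery"

oov = oov_items(english, dictionary)
english_clean = remove_items(english, oov)
portuguese_clean = remove_items(portuguese, oov)
print(sorted(oov), english_clean, portuguese_clean, sep="\n")
```

The same OOV set is applied to the English text and to its translations, so names left in their English form disappear from every language.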


Two kinds of Features

Used Alphabet
  - Which computer characters are used in the text?
  - Are they usually used in Asian, Arabic or Latin text?

Used Sequences of Characters
  - Which unigrams, bigrams or trigrams are used?
  - Which are most common for each language?


Alphabet Features

Count the number of Unicode characters in the following classes:
C1   Latin characters, only a-z, without diacritics;
C2   Cyrillic characters (0x0410-0x042F and 0x0430-0x044F);
C3   Hiragana and Katakana characters (0x3040-0x30FF);
C4   Hangul characters (0xAC00-0xD7AF, 0x1100-0x11FF, 0x3130-0x318F, 0xA960-0xA97F and 0xD7B0-0xD7FF);
C5   Kanji characters (0x4E00-0x9FAF);
C6   Simplified Chinese characters (2877 hand defined characters);
C7   Traditional Chinese characters (2663 hand defined characters);
C8   Arabic characters (0x0600-0x06FF);
C9   Thai characters (0x0E00-0x0E7F);
C10  Greek characters (0x0370-0x03FF and 0x1F00-0x1FFF).
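
A minimal sketch of computing these class features from the listed code point ranges is shown below. Two assumptions are made: C6 and C7 are left out because the hand-defined character lists are not given in the slides, and each count is divided by the number of non-space characters to obtain a proportion.

```python
# Code point ranges as listed above. C6 and C7 are omitted because the
# hand-defined character lists are not available in the slides.
CLASSES = {
    "C1": [(0x0061, 0x007A)],                    # Latin a-z only
    "C2": [(0x0410, 0x042F), (0x0430, 0x044F)],  # Cyrillic
    "C3": [(0x3040, 0x30FF)],                    # Hiragana and Katakana
    "C4": [(0xAC00, 0xD7AF), (0x1100, 0x11FF), (0x3130, 0x318F),
           (0xA960, 0xA97F), (0xD7B0, 0xD7FF)],  # Hangul
    "C5": [(0x4E00, 0x9FAF)],                    # Kanji
    "C8": [(0x0600, 0x06FF)],                    # Arabic
    "C9": [(0x0E00, 0x0E7F)],                    # Thai
    "C10": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)], # Greek
}


def alphabet_features(text):
    """Proportion of non-space characters falling in each class (assumption)."""
    chars = [c for c in text if not c.isspace()]
    counts = dict.fromkeys(CLASSES, 0)
    for c in chars:
        cp = ord(c)
        for name, ranges in CLASSES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
                break
    total = len(chars)
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


feats = alphabet_features("привет мир")
mixed = alphabet_features("abc αβγ")
print(feats["C2"], mixed["C1"], mixed["C10"])
```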


Binarization of Alphabet Features

In order to reduce entropy in the NN, alphabet features are binarized using a set of rules:

set C1 ⇐ C1 > 0.20
set C2 ⇐ C2 > 0.20
set C3 ⇐ C3 > 0.20
set C4 ⇐ C4 > 0.20
set C6 ⇐ C5 > 0.30 ∧ C6 > C7
set C7 ⇐ C5 > 0.30 ∧ C6 < C7
set C8 ⇐ C8 > 0.20
set C9 ⇐ C9 > 0.20
set C10 ⇐ C10 > 0.20

where
set Ci ⇔ Ci ← 1 ∧ ∀ j ≠ i: Cj ← 0
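
The rules above can be sketched in code as follows. The slides do not say what happens when several rules fire or when none does; this sketch assumes the first rule that fires wins and that all flags stay at 0 otherwise.

```python
def binarize(c):
    """c maps 'C1'..'C10' to proportions; returns a dict of 0/1 flags."""
    rules = [
        ("C1", c["C1"] > 0.20),
        ("C2", c["C2"] > 0.20),
        ("C3", c["C3"] > 0.20),
        ("C4", c["C4"] > 0.20),
        ("C6", c["C5"] > 0.30 and c["C6"] > c["C7"]),
        ("C7", c["C5"] > 0.30 and c["C6"] < c["C7"]),
        ("C8", c["C8"] > 0.20),
        ("C9", c["C9"] > 0.20),
        ("C10", c["C10"] > 0.20),
    ]
    flags = dict.fromkeys(c, 0)
    for name, fires in rules:
        if fires:
            flags[name] = 1  # "set Ci": Ci <- 1, every other Cj stays 0
            break            # assumption: the first matching rule wins
    return flags


zero = {f"C{i}": 0.0 for i in range(1, 11)}
chinese = binarize({**zero, "C5": 0.6, "C6": 0.5, "C7": 0.1})
latin = binarize({**zero, "C1": 0.9})
print(chinese, latin, sep="\n")
```

Note that C5 never ends up set: it only acts as the gate that decides between C6 and C7.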


Trigram Features

Why Trigrams?

bigrams would be too small when comparing very close languages like Portuguese and Spanish;

tetragrams would be too big for some languages (like Asian ones), where some glyphs represent words or morphemes;

as punctuation and numbers were removed, and spaces normalized, trigrams are also able to capture the end or beginning of words, as well as single-character words that appear surrounded by spaces.


Trigram Features: example

Für mich war das eine neue Erkenntnis. Und ich denke, mit der Zeit, in den kommenden Jahren, Wir haben Künstler, aber leider haben wir sie noch nicht entdeckt. Der visuelle Ausdruck ist nur eine Form kultureller Integration. Wir haben erkannt, dass seit kurzem immer mehr Leutea

Top occurring trigrams
en␣ 0.02299   er␣ 0.02682   ␣de 0.01533   abe 0.01533   der 0.01149
hab 0.01149   ich 0.01149   ir␣ 0.01149   it␣ 0.01149   r␣h 0.01149
␣wi 0.01149   ben 0.01149   ch␣ 0.01149   den 0.01149   wir 0.01149
␣ha 0.01149   ine 0.00766   ler 0.00766   lle 0.00766   n␣k 0.00766
mme 0.00766   ne␣ 0.00766   nnt 0.00766   r␣l 0.00766   r␣m 0.00766
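
A minimal sketch of how such relative trigram frequencies could be computed is given below. Lowercasing and padding the text with one space on each side are assumptions (the example table suggests lowercase trigrams, but the slides do not spell out either step); plain spaces are used where the table prints ␣.

```python
import re
import unicodedata
from collections import Counter


def compute_trigrams(text):
    """Relative frequency of each character trigram in the cleaned text."""
    # Keep letters and combining marks; punctuation, digits and symbols
    # become spaces (lowercasing is an assumption of this sketch).
    kept = "".join(
        ch if unicodedata.category(ch)[0] in "LM" else " "
        for ch in text.lower()
    )
    cleaned = re.sub(r"\s+", " ", kept).strip()
    if not cleaned:
        return {}
    padded = f" {cleaned} "  # assumption: lets edge words show their boundaries
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


freqs = compute_trigrams("Wir haben, wir 2 haben!")
for trigram, value in sorted(freqs.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(trigram), round(value, 5))
```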


Trigram Features: Merging

features ← {};
for L ∈ ℒ do
    trigrams ← ∅;
    for file ∈ Files_L do
        T ← computeTrigrams(file);    // Str → ℕ
        T ← mostOccurring(T);         // Top 30 trigrams
        for t ∈ keys(T) do
            trigrams[t] ← trigrams[t] + 1;
    T ← mostOccurring(T);
    features ← features ∪ keys(trigrams);
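
The merging procedure can be sketched in Python as below. This is one reading of the pseudocode, not the authors' implementation: the trigram extractor is passed in as a parameter, the demo uses a simplified hypothetical extractor (`raw_trigrams`), and the second `mostOccurring` call is left out because the final line takes the union over all keys of `trigrams`.

```python
from collections import Counter


def most_occurring(freqs, top_n=30):
    """Keep the top_n entries with the highest values (ties broken by key)."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_n])


def merge_features(files_by_lang, compute_trigrams, top_n=30):
    """Union, over all languages, of the trigrams that are top_n in some file."""
    features = set()
    for lang, files in files_by_lang.items():
        trigrams = Counter()  # number of files in which a trigram is in the top
        for text in files:
            top = most_occurring(compute_trigrams(text), top_n)
            trigrams.update(top.keys())
        features |= set(trigrams)
    return sorted(features)


def raw_trigrams(text):
    """Hypothetical stand-in extractor: raw counts, no text cleaning."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


files = {"xx": ["abcabc", "abcd"], "yy": ["xyzxyz"]}
merged = merge_features(files, raw_trigrams, top_n=2)
print(merged)
```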


Training Data Matrix (excerpt)

        Alphabet Features       Trigram Features
Lang.   Latin  Greek  Cyril.    ␣pa     ới␣     par     nia     ест     ати.    ата
PT      1      0      0         0.0041  0       0.0038  0.0001  0       0       0
PT      1      0      0         0.0039  0       0.0036  0       0       0       0
RU      0      0      1         0       0       0       0       0.0020  0.0004  0.0003
RU      0      0      1         0       0       0       0       0.0026  0.0005  0.0002
UK      0      0      1         0       0       0       0       0.0003  0.0034  0.0001
UK      0      0      1         0       0       0       0       0.0003  0.0026  0.0001
VI      1      0      0         0       0.0028  0       0       0       0       0
VI      1      0      0         0       0.0029  0       0.0001  0       0       0
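
One row of such a matrix could be assembled as in the sketch below, which concatenates the binarized alphabet flags with the trigram frequencies in a fixed column order. The function name `build_row` and the three-trigram column list are illustrative assumptions; the numbers are taken from the first PT row of the excerpt.

```python
def build_row(alphabet_flags, trigram_freqs, alphabet_names, trigram_features):
    """Feature vector: alphabet flags first, then one frequency per trigram."""
    row = [alphabet_flags.get(name, 0) for name in alphabet_names]
    row += [trigram_freqs.get(t, 0.0) for t in trigram_features]
    return row


alphabet_names = ["Latin", "Greek", "Cyril."]
trigram_features = [" pa", "par", "ест"]  # shortened column list

row = build_row(
    {"Latin": 1},
    {" pa": 0.0041, "par": 0.0038},
    alphabet_names,
    trigram_features,
)
print(row)
```

Trigrams missing from a text simply get 0, which is why most cells in the excerpt are zero.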


Experiment 1: 25 languages

Arabic (AR)
Bulgarian (BG)
German (DE)
Modern Greek (EL)
Spanish (ES)
Persian (FA)
French (FR)
Hebrew (HE)
Hungarian (HU)
Italian (IT)
Japanese (JA)
Korean (KO)
Dutch (NL)
Polish (PL)
Portuguese (PT)
Brazilian Portuguese (PT-BR)
Romanian (RO)
Russian (RU)
Serbian (SR)
Thai (TH)
Turkish (TR)
Ukrainian (UK)
Vietnamese (VI)
Traditional Chinese (ZH-TW)
Simplified Chinese (ZH-CN)


Exp 1: Training and Test Sets

         Training Set (30 files/lang)               Test Set (21 files/lang)
Lang.    Smallest   Largest        x̄        σ      Smallest  Largest      x̄      σ
ar         871921    969387   907562    21392           863     4618   2366   1210
bg         988450   1087435  1027581    23663           660     2099   1091    378
de         588200    653508   618463    16475           677     3890   1554    842
el         773265    885770   841203    22653           550     3297   1590    705
es         578806    651240   617341    17637           897     3850   2342    935
fa         651807    766206   697212    28994           600     5221   1338    967
fr         639582    705675   673414    15377           936     4088   1879    689
he         806098    877218   836222    20545           559     3649   1586    878
hu         406271    454506   431797    13131           729     6045   2175   1356
it         588147    643252   616391    14348          1260     6607   2991   1370
ja         538033    606053   569956    18871           323      785    495    133
ko         737118    817651   773168    20550           530     1603    780    233
nl         533497    580313   557724    14033           552     1949   1115    381
pl         521184    591299   551259    17938           435     3092   1605    694
pt-br      596158    643215   617734    14028           920     3189   1953    589
pt         338272    378872   355800    10605           486     5875   2031   1169
ro         592714    650375   616051    15442           718     3254   1438    695
ru        1019789   1144200  1069884    31232           662     2470   1444    526
sr         349389    433221   379344    20560           834     6493   1813   1263
th         529484    601244   565082    18551           334     3242   1396    734
tr         494191    549998   524271    12774           332     5390   1559   1121
uk         370785    434683   395312    16641           299    15435   2430   3553
vi         470057    541930   510409    17246           680     6237   1555   1359
zh-cn      536438    595027   562728    14457           495     6331   1695   1559
zh-tw      514993    588860   542879    16000           270     1721    925    428


Exp 1: Accuracy

Language            1500 iters.   4000 iters.   Note
ar, bg, de          100%          100%
el, es, fa          100%          100%
fr, he, hu          100%          100%
it, ja, ko          100%          100%
nl, pl              100%          100%
pt                  5%            52%           wrongly classified as pt-br
pt-br               100%          76%           wrongly classified as pt
ro, ru, sr          100%          100%
th, tr, uk          100%          100%
vi, zh-cn, zh-tw    100%          100%


Exp 1: Comparison of PT variants

[Figure: comparison of the two Portuguese variants, one panel per variant (PT, PT-BR)]


Experiment 2: 55 languages

Afrikaans

Albanian

Arabic

Bulgarian

Bengali

Catalan

Czech

Danish

German

Modern Greek

English

Esperanto

Spanish

Estonian

Persian

Finnish

French

Galician

Gujarati

Hebrew

Hindi

Hungarian

Armenian

Indonesian

Italian

Japanese

Georgian

Kannada

Korean

Kurdish

Lithuanian

Latvian

Macedonian

Malayalam

Marathi

Burmese

Nepali

Dutch

Polish

Portuguese

Romanian

Russian

Slovak

Slovenian

Somali

Serbian

Swedish

Tamil

Thai

Turkish

Ukrainian

Urdu

Vietnamese

Chinese (simplified)

Chinese (traditional)


Exp 2: Results

55 languages,
1,126 features,
Θ(l) take 11 MB on disk (binary format),
running 7500 iterations of the learning algorithm,
taking 6574 minutes and 50.386 seconds (more than 4.5 days),
still 21 test files per language,
46 seconds to run over the 1155 test files,
accuracy of 99.740%,
mis-identifications:
  - 2 Bulgarian texts detected as Macedonian,
  - 1 Danish text detected as Dutch.
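
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated above.

```python
languages, files_per_language = 55, 21
test_files = languages * files_per_language  # total number of test files

errors = 2 + 1  # Bulgarian -> Macedonian (2), Danish -> Dutch (1)
accuracy = 100 * (test_files - errors) / test_files

training_minutes = 6574 + 50.386 / 60
training_days = training_minutes / (60 * 24)

print(test_files)
print(f"{accuracy:.3f}%")
print(f"{training_days:.2f} days")
```

Three errors in 1155 files give the stated 99.740%, and the training time works out to a little over four and a half days.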


Conclusions

Up to 96% accuracy when testing few languages, including two Portuguese variants;
Over 99.7% accuracy for 55 languages;
NNs are able to grow, but training time grows excessively;
The choice of features is relevant (if we know a specific detail will be good to distinguish a language, add it to the network!);
The results obtained are not "deterministic". Although the same proportion of results is expected, the random initialization of the network may lead to somewhat different results for different numbers of iterations.


Future Work

Reduce the number of trigrams per language and include unigrams;
Compute distribution differences between near languages;
Experiment with training a different neural network for each alphabet;
Include a regularization coefficient (λ ≠ 0);
Experiment with Deep Neural Networks;
Test language identification on short texts (namely Twitter tweets).

