Language technology for morphologically rich languages

Trond Trosterud
Giellatekno, Centre for Saami Language Technology

http://giellatekno.uit.no/

September 5, 2017

Contents

A very subjective history of language technology

A model for all the other languages

Conclusion

A very subjective history of language technology

▶ The computers came with the Cold War
  Our task was to build MT from Russian to English
▶ First attempt (ask the cryptographers):
  ▶ Machine translation seen as a noisy channel?
▶ Second attempt (ask the linguists):
  ▶ Generative grammar promised to ... generate grammatical sentences
▶ 1966: The ALPAC report
  ▶ We (the linguists) had failed

Some of the critique is still valid

▶ Bar-Hillel 1960:
  ▶ Little John was looking for his toy box. Finally he found it. The box was in the pen.
▶ Google Translate 2017:
  ▶ Lille John var på utkikk etter sin leketøyboks. Til slutt fant han det. Boksen var i pennen.
    (pennen = the writing pen, not the playpen)

The post-ALPAC world of formal linguistics 1

▶ Not that much MT for a long while, but:
▶ Formal linguistics
  ▶ Until 1980: Chomskyan generative grammar
  ▶ After 1980: Chomsky went for “Universal Grammar”
    (= left the field of grammar modelling)
  ▶ Alternative generative models (LFG, HPSG)
    ▶ did not result in robust parsers

The post-ALPAC world of formal linguistics 2

▶ An alternative approach to morphophonology
  ▶ C. Douglas Johnson 1972: Formal Aspects of Phonological Description
    rewrite rules ( A → B | C _ D ) as finite-state transducers
  ▶ Kimmo Koskenniemi 1983: Rewrite rules as parallel relations
  ▶ Around 1990: Xerox builds efficient compilers
▶ The word form problem was solved
  (we will return to the relevance this has for tomorrow’s shared task)
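
A minimal sketch of what "rewrite rules as finite-state transducers" means in practice. It hard-codes one toy rule, a → b | c _ d, as a three-state transducer that holds back its output until the right-hand context is known. The rule, the state names and the function name are illustrative assumptions, not taken from Johnson, Koskenniemi or the Xerox tools.

```python
def apply_rule(word: str) -> str:
    """Apply the toy rule a -> b | c _ d with a three-state transducer."""
    out = []
    state = "start"  # states: start, after_c, pending (an 'a' is held back)
    for ch in word:
        if state == "pending":
            # The held-back 'a' surfaces as 'b' only if the next symbol is 'd'.
            out.append("b" if ch == "d" else "a")
            out.append(ch)
            state = "after_c" if ch == "c" else "start"
        elif state == "after_c" and ch == "a":
            # Left context matched: delay the output until we see what follows.
            state = "pending"
        else:
            out.append(ch)
            state = "after_c" if ch == "c" else "start"
    if state == "pending":
        # End of word: no right context, so the 'a' stays as it is.
        out.append("a")
    return "".join(out)


if __name__ == "__main__":
    for w in ["cad", "cab", "cacad", "bad"]:
        print(w, "->", apply_rule(w))
```

The state is bounded and the input is read once from left to right, which is why such rules can be compiled into transducers. Real compilers build these machines from the rule notation instead of having them written by hand.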

In came the nineties

▶ Finally, the linguists had broken the code:
  we came up with a technology combining robustness and depth
  1. Finite-state transducers had solved analysis / generation
  2. Constraint grammar solved the homonymy problem
▶ Disambiguating ambiguity in context:
  John tries to walk the walk
  ==> context-sensitive disambiguation rules
  (Fred Karlsson, Pasi Tapanainen, Eckhard Bick)
▶ Our moment in the limelight:
  The British National Corpus was annotated by
  finite-state transducers and Constraint Grammar
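
A toy sketch of context-sensitive disambiguation for the example sentence above. Every word first gets all its possible readings, and rules then remove readings that do not fit the context. The lexicon, the tag names and the four rules are hypothetical illustrations written in plain Python. This is not Constraint Grammar rule syntax, and real grammars have thousands of rules.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: each word form maps to all its possible readings.
LEXICON: Dict[str, Set[str]] = {
    "john": {"PROPN"},
    "tries": {"V", "N"},
    "to": {"INFMARK", "PREP"},
    "walk": {"V", "N"},
    "the": {"DET"},
}


def analyse(sentence: str) -> List[Set[str]]:
    """Look up every word and return its (possibly ambiguous) readings."""
    return [set(LEXICON[w.lower()]) for w in sentence.split()]


def disambiguate(cohorts: List[Set[str]]) -> List[Set[str]]:
    """Apply removal rules until nothing changes; never remove the last reading."""
    cohorts = [set(c) for c in cohorts]
    changed = True
    while changed:
        changed = False
        for i, readings in enumerate(cohorts):
            prev = cohorts[i - 1] if i > 0 else set()
            nxt = cohorts[i + 1] if i + 1 < len(cohorts) else set()
            to_remove: Set[str] = set()
            if prev == {"DET"}:                # no verb right after a determiner
                to_remove.add("V")
            if prev == {"PROPN"}:              # toy rule: no noun right after a name
                to_remove.add("N")
            if prev == {"V"} and "V" in nxt:   # "to" between verbs is the infinitive marker
                to_remove.add("PREP")
            if prev == {"INFMARK"}:            # no noun right after the infinitive marker
                to_remove.add("N")
            remaining = readings - to_remove
            if remaining and remaining != readings:
                cohorts[i] = remaining
                changed = True
    return cohorts


if __name__ == "__main__":
    sentence = "John tries to walk the walk"
    for word, readings in zip(sentence.split(), disambiguate(analyse(sentence))):
        print(word, sorted(readings))
```

The first "walk" ends up as a verb and the second as a noun, because each rule looks only at readings that are already unambiguous in the neighbouring words.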

Then two things happened:

1. The inventors of these techniques commercialised them
   and lifted them out of the common development
   (thus there were no open compilers or grammars,
   but grammar checkers for MS Word, annotating gold corpora
   for statistical models)

2. Computers got faster and the algorithms better

Statistical methods won the day

▶ “Every time I fire a linguist my system improves”
▶ Morphology is handled by lists
▶ Different types of processing are handled via machine learning
▶ Performance went down, but algorithms were open;
  good data were closed!

A side note on language typology

▶ We know this quote: “Take a language like, say,
▶ But languages are not like English

There is a growing interest in extending the scope of language technology

▶ (cf. this workshop)
▶ A natural choice (?): extend the model we had for English to these other languages
▶ So far, not too many success stories on this front
  (there are taggers, but not that many end-user applications)
  ▶ No spellchecker for any North American languages
  ▶ Very few languages have grammar checkers
  ▶ Far worse MT into Finnish than into other EU languages
  ▶ Bad MT between, say, Swedish and Norwegian
▶ In short, a paucity of working solutions for the majority of languages

Meanwhile, in the grammatical camp:

▶ We have been extending the domain of the rules from the nineties
▶ Adding:
  ▶ grammatical functions
  ▶ semantic roles
  ▶ dependency relations
▶ into a both robust and deep analysis
  (dependency annotation at > 95%)
▶ and we have got open compilers
▶ ... but our time in the limelight is over

A model for all the other languages

So, the limelight is gone, but here I am, on another stage

▶ As witnessed by the growing concern and a growing number of workshops:
  The morphologically rich languages are not that easy
▶ Identifying the morphemes is not enough
▶ Perhaps we should have a second look at what happened in the nineties

My answer: A viable model for “all the other languages”

▶ Each language needs a team
  ▶ Programmer (shared)
  ▶ Computational linguist (shared)
  ▶ Linguist
  ▶ ... and eventually a native speaker (and preferably linguist)
▶ Here is the thing:
  For every language, there is a linguist
  who has devoted his or her life to it.
  Language technology has something to offer:
  ==> a test bed for his or her grammatical model
▶ Each team would share the common infrastructure
  The Linux model, as it were

But can we repeat the Linux model for language technology?

▶ It turns out we can
▶ cf. two examples
  ▶ http://giellatekno.uit.no/doc/lang/
  ▶ http://wiki.apertium.org/

Language technology in practice

1. Common, scaleable infrastructure
2. Language models
3. A pipeline for making practical applications

Common, scaleable infrastructure

Figure: A schematic overview of the Giella infrastructure

Figure: Circumpolar languages in the Giella infrastructure

The language models

Figure: A composed finite-state transducer gives the accusative of North Saami gussa, ‘cow’

Figure: Automaton and transducer

Figure: Combining writing systems

The interface to applications

▶ Morphological analysers
  ▶ are turned into spellcheckers for LibreOffice and MS Office
  ▶ There are two North American spell checkers
▶ + dictionaries
  ▶ gives click-in-text e-dictionaries
▶ + lexical selection and transfer rules
  ▶ gives machine translation
▶ Also: Keyboards for all platforms
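
A rough sketch of the first step in the list above, how an analyser can double as a spellchecker. A word is correct if the analyser accepts it, and suggestions are accepted words one edit away. The word set, the function names and the one-edit error model are assumptions made for illustration. The set stands in for a compiled transducer, and this is not the Giella implementation.

```python
import string
from typing import List, Set

# Hypothetical stand-in for a compiled morphological analyser:
# a real system would query the transducer instead of a Python set.
ACCEPTED_FORMS: Set[str] = {"walk", "walks", "walked", "walking"}


def is_accepted(word: str) -> bool:
    """True if the (stand-in) analyser recognises the word form."""
    return word in ACCEPTED_FORMS


def edits1(word: str, alphabet: str = string.ascii_lowercase) -> Set[str]:
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    replaces = {left + c + right[1:] for left, right in splits if right for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletes | transposes | replaces | inserts


def suggest(word: str) -> List[str]:
    """Return no suggestions for accepted words, else accepted one-edit neighbours."""
    if is_accepted(word):
        return []
    return sorted(w for w in edits1(word) if is_accepted(w))


if __name__ == "__main__":
    print(suggest("wakl"))
    print(suggest("walkd"))
```

For a morphologically rich language the set of word forms cannot be listed in practice, which is why the acceptor has to be a transducer built from a lexicon and rules.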

Compilation is standardised

▶ All this with one command:
  make APPLICATION for LANGUAGE

Tools

▶ http://gtweb.uit.no/korp
▶ http://sanit.oahpa.no
▶ http://giellatekno.uit.no/doc/infra/GettingStarted.html

Figure: The S curve

Rule-based machine translation

▶ Bar-Hillel’s critique was flawed (and unfair towards Google Translate)
  ▶ His example was not authentic
  ▶ and the purpose of the MT program he envisaged was not stated
▶ One important point: Rule-based systems may correct errors
▶ Another point: We may get efficient text production systems between closely related languages

The Apertium languages A-K

Aekyom, Afrikaans, Albanian, Arabic, Aragonese, Armenian, Assamese, Asturian, Avaric, Aymara, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bislama, Breton, Bulgarian, Buriat, Catalan, Cebuano, Central Kurdish, Chinese, Chukot, Church Slavic, Chuvash, Corsican, Crimean Tatar, Cusco Quechua, Czech, Danish, Dargwa, Dhivehi, Dolgan, Domung, Dutch, Eastern Apurímac Quechua, Eastern Mari, English, Erzya, Esperanto, Estonian, Evenki, Faroese, Finnish, French, Gagauz, Galician, Ganda, Georgian, German, Gilaki, Guarani, Gujarati, Haitian, Halh Mongolian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Igbo, Inari Sami, Indonesian, Ingush, Interlingua, Interlingua (International Auxiliary Language Association), Iranian Persian, Irish, Italian, Kara-Kalpak, Karachay-Balkar, Karelian, Kashmiri, Kashubian, Kazakh, Khakas, Kirghiz, Komi, Komi-Zyrian, Korean, Kumyk, Kven Finnish

The Apertium languages L-Z

Lao, Latin, Latvian, Lingala, Lithuanian, Liv, Livvi, Lower Sorbian, Luang, Lule Sami, Luxembourgish, Macedo-Romanian, Macedonian, Malay, Malayalam, Maltese, Manx, Marathi, Mari, Medumba, Modern Greek, Moksha, Morisyen, Nanai, Neapolitan, Nepali, Nogai, Northern Kurdish, Northern Sami, Norwegian, Norwegian Bokmål, Norwegian Nynorsk, Occitan, Ossetian, Ottoman Turkish, Panjabi, Peripheral Mongolian, Persian, Polish, Portuguese, Romanian, Romansh, Rundi, Russian, Sanskrit, Sardinian, Scots, Scottish Gaelic, Serbo-Croatian, Sicilian, Sindhi, Sinhala, Slovak, Slovenian, Southern Altai, Southern Sami, Spanish, Spanish Sign Language, Standard Latvian, Swahili, Swati, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Tetum, Thai, Turkish, Turkmen, Tuvinian, Udmurt, Uighur, Ukrainian, Upper Sorbian, Urdu, Uzbek, Vietnamese, Vlax Romani, Võro, Wayuu, Welsh, Western Frisian, Western Mari, Wolaytta, Xhosa, Xibe, Yakut, Yiddish, Yoruba, Zulu

Am I against statistical models?

▶ No.
▶ Admittedly, my motto is “Don’t guess, if you know”
▶ But I imagine you can make better guesses the more you know
▶ So, check whether your language is on the Apertium or Giella lists above before you start guessing

The place for statistical models

▶ Language is complex, and many facets are learned via big data
  (caveat: not that big for minority languages)
▶ So, in a way, I suggest “business as usual”,
  but with a sounder foundation

Don’t guess if you know?

▶ An apropos to the forthcoming shared task
▶ There is no need to look for the morphemes in the languages of the shared task
▶ All of them have been analysed already (open source)
  and we even distinguish between their different grammatical interpretations
▶ So I really would like to see what you could achieve standing on our shoulders

Conclusion

▶ The challenge posed by morphologically rich languages has been solved
▶ The fact that the solution isn’t fashionable at the moment should not prevent us from making use of it

▶ More fashionable approaches are welcome to take it from there