Morphological Normalization and Collocation Extraction

K.U. Leuven, Leuven, 2008-05-08


Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb

Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences


Seminar at the K. U. Leuven, Department of Computing Science
Leuven, 2008-05-08


Morphological Normalization

Jan Šnajder, Marko Tadić
University of Zagreb



Talk overview
who are we?
what are we doing?
morphological processing: normalization
lemmatization vs. stemming
Mollex: a system for normalization of Croatian
usage in document indexing and text classification
collocations as features
collocation extraction by co-occurrence measures
usage of genetic programming


Who are we?
University of Zagreb, Croatia
  founded 1669, 52,500 undergraduate students
two faculties on the same mission
  build systems that will develop and enable the usage of language resources and tools for Croatian


Who are we? 2
Faculty of Humanities and Social Sciences
Institute / Department of Linguistics
  dealing with basic computational linguistic tasks for Croatian
  compiling and processing large-scale language resources
    Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
  tagger, lemmatizer, chunker, parser, NERC system


Who are we? 3
Faculty of Electrical Engineering and Computing
Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
Knowledge Technologies Laboratory
Group deals with
  text preprocessing techniques for Croatian for machine learning procedures
  dimensionality reduction and document clustering in the vector space model + visualisation
  automatic indexing of documents
  intelligent, language-specific information retrieval and extraction


What are we doing?
working jointly on several research projects
AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
RMJT: Computational Linguistic Models and Language Technologies for Croatian (national research programme, two of five projects)
  Croatian language resources and their annotation, 2007-2011, prof. Marko Tadić
  Knowledge discovery in textual data, 2007-2011, prof. Bojana Dalbelo Bašić
CADIAL: Computer Aided Document Indexing for Accessing Legislation
  joint Flemish-Croatian project, 2007-2009
  prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić


Morphological processing
computational linguistic / NLP task
important for inflectionally rich languages, e.g. a Croatian noun has 14 word-forms (7 cases, 2 numbers):

Case  Singular   Plural
N     student    studenti
G     studenta   studenata
D     studentu   studentima
A     studenta   studente
V     studentu   studenti
L     studentu   studentima
I     studentom  studentima

unlike an English noun with 2 (3?) word-forms (2 numbers + possessive?):

Sg: student    Pl: students    Poss: (student's)

present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...


Morphological processing 2
three basic subtasks in inflection processing
  1. generation of (all) word-forms (WFs) of a lexeme
  2. analysis of WFs, i.e. recognizing the values of the morphosyntactic categories of a WF in text
  3. recognizing which lexeme(s) a WF belongs to
the last one helps us avoid the problem of data sparseness in many text processing tasks, e.g. information retrieval, text mining, document indexing
normalization: conflating the morphological variants of a word to a single representative form
two main ways to do that
  1. linguistically motivated: lemmatization
  2. computationally motivated: stemming


Morphological processing 3
lemmatization
  replacing the WF with its proper base WF, usually called the lemma
  e.g. mapping the theoretical maximum of (e.g. 14) WFs to 1 lemma
  lexicon-based
    large lexicons of all (generated) WFs needed
    preparation expensive in time and manpower
    mostly realized by databases
  algorithm-based
    mostly FSTs (finite-state transducers): compact, efficient, fast
    lexicon of lemmas and their inflectional patterns needed anyway


Morphological processing 4
stemming
  reducing the WF from the end by truncating the possible endings
  does not have to respect the linguistic boundaries
    vuk+Ø > *vu+kØ
    vuk+a > *vu+ka
    vuč+e > *vu+če
  reducing all the WFs to a common beginning
  problems where there are many morphonological adaptations
    sla+ti > *?+slati
    šalj+em > *?+šaljem
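The two failure modes on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the talk's system: `ENDINGS` is a hypothetical, heavily shortened ending list, and `naive_stem` and `common_beginning` are illustrative names.

```python
from os.path import commonprefix

# Hypothetical ending list for illustration only; not a real Croatian stemmer.
ENDINGS = ("ima", "om", "em", "ti", "a", "e", "u", "i")


def naive_stem(word_form, min_stem=2):
    """Strip the longest listed ending, keeping at least `min_stem` characters."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word_form.endswith(ending) and len(word_form) - len(ending) >= min_stem:
            return word_form[: -len(ending)]
    return word_form


def common_beginning(word_forms):
    """Longest prefix shared by all word-forms (character-wise)."""
    return commonprefix(list(word_forms))


# Cutting at the ending keeps vuk/vuka together but separates the vocative vuče.
stems_vuk = [naive_stem(wf) for wf in ("vuk", "vuka", "vuče")]

# Conflating all three forces a cut inside the stem; for slati/šaljem
# nothing is shared at all.
prefix_vuk = common_beginning(["vuk", "vuka", "vuče"])
prefix_slati = common_beginning(["slati", "šaljem"])
```

Here `stems_vuk` contains two different stems, `prefix_vuk` is the non-linguistic `vu`, and `prefix_slati` is empty, which is the `*?` case shown above.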


Morphological normalization
Croatian language (like most Slavic languages) is morphologically complex
  elaborate inflectional and derivational morphology
  problematic for most NLP applications
  requires the use of substantial linguistic knowledge
our lexicon-based approach to normalization
  is somewhere in between lemmatization and stemming
  suitable for other inflectionally complex languages


Croatian Morphology
1. high degree of affixation
  word-forms are obtained by suffixation, prefixation, phonological alternations, stem extension
  inflection
    nouns: declension (7 cases, 2 numbers)
    verbs: conjugation (tenses, persons, numbers, genders)
    adjectives: declension (7 cases, 2 numbers, 3 genders), comparison (3 degrees), and definiteness
  derivation
    a large number of rules for deriving nouns from verbs, verbs from nouns, possessive adjectives, ...


Croatian Morphology 2
inflection examples
  adjective: brz, brza, brzi, brzima, brzih, brzoj, brze, brzim, brzog, brzoga, brz, brza, brzo, brzom, brzomu, brži, bržeg, brža, brži, bržima, bržih, bržoj, brže, bržim, bržem, bržima, najbrži, bržeg, najbrža, najbržima, najbržih, najbrže, najbržim, najbrži, najbržoj, ...
  noun: brzina, brzinom, brzine, brzinama, brzinu, brzina, brzini
  adjective: brzinski, brzinskom, brzinske, brzinskih, brzinska, brzinskoj, brzinsko, brzinskog, brzinskoga, …
  adverb: brzo, brže, najbrže, brzinski
derivation examples
  brz > brzina > brzinski > …


Croatian Morphology 3
2. high degree of homography
  vode = voda (water) | voditi (to lead) | vod (a platoon)
  requires disambiguation (POS/MSD tagging; MSD = morphosyntactic description)
3. affix ambiguity
  many ambiguous suffixation rules
    e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i
    e.g. bolnic-a / bolnic-om vs. brodolom / brodolom-a
  possible mismatches at inflectional level
    narančast / narančast-om vs. ruž / ruž-om (not ruža)
  possible mismatches at derivational level
    e.g. kralj / kralj-ica vs. stan / stan-ica
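Affix ambiguity can be made concrete by enumerating every analysis a suffix rule set allows for one word-form. This is a sketch under stated assumptions: the two paradigms in `PARADIGMS` are toy suffix sets invented for illustration, and `candidate_analyses` is a hypothetical helper, not part of the system described in the talk.

```python
# Two toy paradigms; suffix sets are heavily simplified assumptions.
PARADIGMS = {
    "fem_a": {"lemma_suffix": "a", "suffixes": ("a", "e", "i", "u", "om")},
    "masc_zero": {"lemma_suffix": "", "suffixes": ("", "a", "u", "om", "e")},
}


def candidate_analyses(word_form):
    """Return every (lemma, paradigm) pair whose suffixes could yield word_form."""
    candidates = set()
    for name, paradigm in PARADIGMS.items():
        for suffix in paradigm["suffixes"]:
            if word_form.endswith(suffix) and len(word_form) > len(suffix):
                stem = word_form[: len(word_form) - len(suffix)]
                candidates.add((stem + paradigm["lemma_suffix"], name))
    return sorted(candidates)


# "brodolom" is a masculine lemma, yet the -om ending also fits the
# instrumental of a non-existent feminine noun "brodola".
analyses = candidate_analyses("brodolom")
```

Without further evidence (other attested forms, tagging), the rules alone cannot tell the correct masculine reading from the spurious feminine one.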


Lexicon based normalization
lexicon-based morphological normalisation
  a morphological lexicon associates with each WF its morphological norm (lemma, stem, ...) and, optionally, an MSD
  incorporates linguistic knowledge and thus avoids the aforementioned pitfalls
drawbacks
  made by linguists, expensive and time-consuming
  problems with coverage (neologisms, jargon, …)
our approach
  rule-based acquisition of large-coverage morphological lexica from raw (unannotated) corpora
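At lookup time a morphological lexicon behaves like a map from word-forms to one or more norms. The sketch below assumes a tiny hand-written `LEXICON` with coarse part-of-speech labels standing in for real MSD tags; the entries and the `normalize` function are illustrative only.

```python
# Hypothetical lexicon fragment: word-form -> list of (lemma, coarse tag).
# The coarse tags are placeholders for real MSDs.
LEXICON = {
    "vode": [("voda", "noun"), ("voditi", "verb"), ("vod", "noun")],
    "brzinom": [("brzina", "noun")],
}


def normalize(word_form, lexicon=LEXICON):
    """Return the sorted list of possible lemmas; unknown forms pass through."""
    entries = lexicon.get(word_form.lower())
    if not entries:
        return [word_form]  # coverage gap: leave the word-form unchanged
    return sorted({lemma for lemma, _tag in entries})


ambiguous = normalize("vode")      # homograph: several lemmas
unambiguous = normalize("Brzinom")
unknown = normalize("blogu")       # neologism not in the lexicon
```

The pass-through branch is where the coverage problem mentioned above shows up: any word-form missing from the lexicon stays unnormalized.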


Our approach
1. acquisition of inflectional lexicon
  input: raw corpora and sets of inflectional and derivational rules in a convenient (grammar-book-like) formalism
2. normalisation of word-forms
  inflectional (lemmatization)
  inflectional + derivational
    comparable to stemming (but more precise)
advantages
  can be used as both a lemmatizer (with MSD) and a stemmer (with variable degree of conflation)
  provides good lexicon coverage
  requires only limited linguistic expertise


Morphology representation
e.g. noun inflectional paradigm: vojnik (soldier)

Case  Singular   Plural
N     vojnik-Ø   vojnic-i
G     vojnik-a   vojnik-a
D     vojnik-u   vojnic-ima
A     vojnik-a   vojnik-e
V     vojnič-e   vojnic-i
L     vojnik-u   vojnic-ima
I     vojnik-om  vojnic-ima


Morphology representation 2
defines inflectional and derivational rules
uses functions as building blocks:
  A) condition functions
  B) string transformation functions
each defined using a higher-order function, e.g.
  sfx
  sfx('a')
  sfx('a')('vojnik') = 'vojnika'
  sfx('e') ∘ alt(pal)
  (sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
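The building blocks above translate almost literally into closures. The talk's implementation is in Haskell; the Python below is only a sketch for illustration. The names `compose`, `PAL` and `SIB` and the reduced alternation tables (final k, g, h only) are assumptions, not the authors' definitions.

```python
from functools import reduce


def sfx(suffix):
    """Higher-order function: returns a transformation appending `suffix`."""
    return lambda stem: stem + suffix


def alt(table):
    """Returns a transformation rewriting the stem-final letter via `table`."""
    def transform(stem):
        if stem and stem[-1] in table:
            return stem[:-1] + table[stem[-1]]
        return stem
    return transform


def compose(*functions):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(functions), x)


# Reduced alternation tables (assumption: stem-final k, g, h only).
PAL = {"k": "č", "g": "ž", "h": "š"}  # palatalization
SIB = {"k": "c", "g": "z", "h": "s"}  # sibilarization

genitive = sfx("a")("vojnik")
vocative = compose(sfx("e"), alt(PAL))("vojnik")
plural = compose(sfx("i"), alt(SIB))("vojnik")
```

Because every rule is an ordinary function, new alternations or affix types can be added without touching the rest of the machinery.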


Morphology representation 3

Case  Singular   Plural
N     vojnik-Ø   vojnic-i
G     vojnik-a   vojnik-a
D     vojnik-u   vojnic-ima
A     vojnik-a   vojnik-e
V     vojnič-e   vojnic-i
L     vojnik-u   vojnic-ima
I     vojnik-om  vojnic-ima

(s.ends('k','g','h')(s) consGroup(s),
 {null, sfx('a'), sfx('u'), sfx('om'), sfx('e') ∘ alt(pal), sfx('i') ∘ alt(sib), sfx('ima') ∘ alt(sib), sfx('e')})
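A paradigm in this notation is a pair: a condition on the stem and a set of transformations. The sketch below generates the word-forms of vojnik from such a pair. It is an illustration in Python, not the authors' Haskell code. The slide's condition also involves `consGroup`, whose exact role is not recoverable from the transcript, so the sketch assumes the `ends` part only; `word_forms` and `VOJNIK_PARADIGM` are hypothetical names.

```python
def sfx(suffix):
    return lambda stem: stem + suffix


def alt(table):
    """Rewrite the stem-final letter according to `table`, if listed."""
    return lambda stem: stem[:-1] + table.get(stem[-1], stem[-1]) if stem else stem


def compose(f, g):
    return lambda x: f(g(x))


def ends(*letters):
    """Condition function: does the stem end in one of `letters`?"""
    return lambda stem: stem.endswith(letters)


def null(stem):
    return stem


PAL = {"k": "č", "g": "ž", "h": "š"}
SIB = {"k": "c", "g": "z", "h": "s"}

# (condition, transformations); the condition is a simplification (assumption).
VOJNIK_PARADIGM = (
    ends("k", "g", "h"),
    [null, sfx("a"), sfx("u"), sfx("om"), compose(sfx("e"), alt(PAL)),
     compose(sfx("i"), alt(SIB)), compose(sfx("ima"), alt(SIB)), sfx("e")],
)


def word_forms(stem, paradigm):
    """Apply every transformation if the stem satisfies the condition."""
    condition, transformations = paradigm
    if not condition(stem):
        return set()
    return {transform(stem) for transform in transformations}


forms = word_forms("vojnik", VOJNIK_PARADIGM)
```

The eight transformations produce the eight distinct word-forms of the 14-slot table; a stem such as student fails the condition and yields nothing.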


Morphology representation 4
suitable also for more complex paradigms

(c,
 {null, sfx('a'), sfx('u'), ..., sfx('ima')}
 {sfx('og'), sfx('om'), ..., sfx('ima')}
 {sfx('i') ∘ alt(jot), sfx('eg') ∘ alt(jot), ..., sfx('ima') ∘ alt(jot)}
 {sfx('i') ∘ alt(jot) ∘ pfx('naj'), ..., sfx('ima') ∘ alt(jot) ∘ pfx('naj')})
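The same building blocks cover adjective comparison once a prefix function and a jotation table are added. This sketch uses only the transformations spelled out on the slide and leaves the elided ones out. The group names, the `JOT` table and the use of the adjective brz are assumptions made for illustration.

```python
from functools import reduce


def sfx(suffix):
    return lambda stem: stem + suffix


def pfx(prefix):
    return lambda stem: prefix + stem


def alt(table):
    return lambda stem: stem[:-1] + table.get(stem[-1], stem[-1]) if stem else stem


def compose(*functions):
    """Right-to-left composition, as with the ∘ operator."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(functions), x)


def null(stem):
    return stem


JOT = {"z": "ž", "s": "š", "d": "đ", "t": "ć"}  # reduced jotation table

# Group names are an interpretation of the four sets on the slide.
GROUPS = {
    "positive_indefinite": [null, sfx("a"), sfx("u"), sfx("ima")],
    "positive_definite": [sfx("og"), sfx("om"), sfx("ima")],
    "comparative": [compose(sfx(s), alt(JOT)) for s in ("i", "eg", "ima")],
    "superlative": [compose(sfx(s), alt(JOT), pfx("naj")) for s in ("i", "ima")],
}

forms = {name: [t("brz") for t in ts] for name, ts in GROUPS.items()}
all_forms = {wf for group in forms.values() for wf in group}
```

Each degree of comparison reuses the suffix functions and only adds one more composed step, which is why the formalism stays compact for large paradigms.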


Morphology representation 5
advantages
  resembles the morphology description found in traditional grammar books
  requires a minimum amount of linguistic knowledge
  highly expressive: arbitrary higher-order functions (HOFs) can be defined
  can be applied to other morphologically similar languages
implemented in Haskell
  purely functional programming language
  requires minimum programming skills


Lexicon acquisition
uses inflectional rules + raw corpora to extract lemmas and their paradigms
uses frequency counts of WFs attested in the corpus
much of the ambiguity is resolved by language-dependent heuristics
  plausibility, priority
linguistic quality is not vital
  word-form conflation rather than generation
  human intervention is not required
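The acquisition idea can be sketched as hypothesize-and-verify over corpus counts: guess a stem and paradigm for each word-form, generate the paradigm, and keep the guess best supported by attested forms. Everything concrete here is an assumption: the toy `PARADIGMS`, the `min_attested` threshold, and the "lemma must itself be attested" filter, which stands in for the plausibility and priority heuristics that the slides do not spell out.

```python
from collections import Counter

# Toy paradigms as plain suffix lists; the first suffix builds the lemma.
PARADIGMS = {
    "masc_zero": ("", "a", "u", "om", "e", "i", "ima"),
    "fem_a": ("a", "e", "i", "u", "om", "ama"),
}


def hypotheses(word_form):
    """Yield (stem, paradigm name) pairs that could have produced word_form."""
    for name, suffixes in PARADIGMS.items():
        for suffix in suffixes:
            stem = word_form[: len(word_form) - len(suffix)]
            if word_form.endswith(suffix) and len(stem) >= 2:
                yield stem, name


def acquire(counts, min_attested=3):
    """Map attested word-forms to lemmas using frequency-backed hypotheses."""
    lexicon = {}
    for word_form in counts:
        best = None
        for stem, name in hypotheses(word_form):
            suffixes = PARADIGMS[name]
            lemma = stem + suffixes[0]
            attested = {stem + s for s in suffixes if stem + s in counts}
            # Stand-in heuristic: lemma attested and enough supporting forms.
            if lemma not in counts or len(attested) < min_attested:
                continue
            score = (len(attested), sum(counts[wf] for wf in attested))
            if best is None or score > best[0]:
                best = (score, lemma, attested)
        if best is not None:
            _score, lemma, attested = best
            for wf in attested:
                lexicon.setdefault(wf, lemma)
    return lexicon


# Made-up frequency counts, for illustration only.
corpus_counts = Counter({"brzina": 5, "brzine": 4, "brzinu": 3, "brzinom": 2,
                         "student": 6, "studenta": 3, "studentu": 2})
lexicon = acquire(corpus_counts)
```

Note that the result only has to conflate word-forms correctly; as the slide says, the hypothesized paradigm need not be linguistically perfect for that.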


Results
example lexicon
  acquired from a 20 Mw newspaper corpus
  based on 90 inflectional and >300 derivational rules
  contains ca. 42,000 lemmas associated with over 500,000 WFs
performance
  linguistic quality: F1 = 88% per type
  coverage: 96% per type and 98% per token
  understemming = 7%, overstemming < 4%
  can be improved further by manual editing
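The difference between per-type and per-token coverage is easy to see in code: types count each distinct word-form once, tokens weight it by frequency. The following is a minimal sketch with made-up toy data; the `coverage` function is illustrative and not the evaluation procedure used for the figures above.

```python
from collections import Counter


def coverage(tokens, lexicon_forms):
    """Return (per-type, per-token) coverage of a token list by a lexicon."""
    counts = Counter(tokens)
    covered = [wf for wf in counts if wf in lexicon_forms]
    per_type = len(covered) / len(counts)
    per_token = sum(counts[wf] for wf in covered) / sum(counts.values())
    return per_type, per_token


# Toy data (not from the talk): one rare word-form is missing from the lexicon.
tokens = ["vode", "vode", "vode", "brzina", "blogu"]
per_type, per_token = coverage(tokens, {"vode", "brzina"})
```

Missing word-forms tend to be rare ones, so per-token coverage usually comes out higher than per-type coverage, as it does in the reported 98% versus 96%.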


Derivational normalization
inflectional lexicon is partitioned into equivalence classes based on derivational rules
degree of normalisation depends on the number of derivational rules used
problem with semantics
  context, degrees
  derivation is not as semantically regular as inflection
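Partitioning the lemmas into derivational classes can be sketched with a union-find structure: two lemmas are merged whenever a rule derives one from the other and both are in the lexicon. The three rules below are hypothetical simplifications chosen to match the examples in this talk; they are not the >300 rules of the actual system.

```python
def derivational_classes(lemmas, rules):
    """Partition lemmas into classes linked by derivational rules."""
    parent = {lemma: lemma for lemma in lemmas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for lemma in lemmas:
        for rule in rules:
            derived = rule(lemma)
            if derived in parent:
                parent[find(derived)] = find(lemma)

    classes = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), set()).add(lemma)
    return sorted(sorted(members) for members in classes.values())


# Hypothetical derivational rules (lemma -> derived lemma).
def to_noun_ina(lemma):
    return lemma + "ina"


def to_adj_ski(lemma):
    return (lemma[:-1] if lemma.endswith("a") else lemma) + "ski"


def to_noun_ica(lemma):
    return lemma + "ica"


LEMMAS = ["brz", "brzina", "brzinski", "kralj", "kraljica", "stan", "stanica"]

few_rules = derivational_classes(LEMMAS, [to_noun_ina, to_adj_ski])
more_rules = derivational_classes(LEMMAS, [to_noun_ina, to_adj_ski, to_noun_ica])
```

Adding the -ica rule merges kralj with kraljica, but it also merges stan with stanica, which are semantically unrelated; this is the semantic irregularity noted above.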


References and applications
Reference
  Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing and Management, 2008. (in press)
Applied in document indexing
  projects AIDE & CADIAL: www.cadial.org
  Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / J. Van Nieuwenhove & P. Popelier (eds). Brugge : Die Keure, 2008. pp. 107-117.
Applied in text classification
  Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus. // Information Processing and Management, 44 (2008), 1; 325-339.


Thank you for your attention!
