Morphological Normalization and Collocation Extraction
K.U. Leuven, Leuven, 2008-05-08
Morphological Normalization and Collocation Extraction
Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb
Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
Seminar at the K. U. Leuven, Department of Computing Science
Leuven, 2008-05-08
Morphological Normalization
Jan Šnajder, Marko Tadić
University of Zagreb
Talk overview
- who are we?
- what are we doing?
- morphological processing: normalization
  - lemmatization vs. stemming
  - Mollex: a system for normalization of Croatian
  - usage in document indexing and text classification
- collocations as features
  - collocation extraction by co-occurrence measures
  - usage of genetic programming
Who are we?
- University of Zagreb, Croatia
  - founded 1669, 52,500 undergraduate students
- two faculties on the same mission
  - build systems that will develop and enable the use of language resources and tools for Croatian
Who are we? 2
- Faculty of Humanities and Social Sciences
  - Institute / Department of Linguistics
- dealing with basic computational linguistic tasks for Croatian
  - compiling and processing large-scale language resources
    - Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
  - tagger, lemmatizer
  - chunker, parser
  - NERC system
Who are we? 3
- Faculty of Electrical Engineering and Computing
  - Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
  - Knowledge Technologies Laboratory
- the group deals with
  - text preprocessing techniques for Croatian for machine learning procedures
  - dimensionality reduction and document clustering in the vector space model + visualisation
  - automatic indexing of documents
  - intelligent, language-specific information retrieval and extraction
What are we doing?
- working jointly on several research projects
  - AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
  - RMJT: Computational Linguistic Models and Language Technologies for Croatian (national research programme, two of five projects)
    - Croatian language resources and their annotation, 2007-2011, prof. Marko Tadić
    - Knowledge discovery in textual data, 2007-2011, prof. Bojana Dalbelo Bašić
  - CADIAL: Computer Aided Document Indexing for Accessing Legislation
    - joint Flemish-Croatian project, 2007-2009
    - prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić
Morphological processing
- computational linguistic / NLP task
- important for inflectionally rich languages, e.g. a Croatian noun has 14 word-forms (7 cases, 2 numbers):

    Case  Singular   Plural
    N     student    studenti
    G     studenta   studenata
    D     studentu   studentima
    A     studenta   studente
    V     studentu   studenti
    L     studentu   studentima
    I     studentom  studentima

- unlike an English noun, which has 2 (3?) word-forms (2 numbers + possessive?):

    Sg: student    Pl: students    Poss: (student's)

- present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...
Morphological processing 2
- three basic subtasks in inflection processing
  1. generation of (all) word-forms (WFs) of a lexeme
  2. analysis of WFs, i.e. recognizing the values of the morphosyntactic categories of a WF in text
  3. recognizing which lexeme(s) a WF belongs to
- the last one helps us avoid the problem of data sparseness in many text processing tasks, e.g. information retrieval, text mining, document indexing
- normalization: conflating the morphological variants of a word to a single representative form
- two main ways to do that
  1. linguistically motivated: lemmatization
  2. computationally motivated: stemming
Morphological processing 3
- lemmatization
  - replacing the WF with its proper base WF, usually called the lemma
  - e.g. mapping the theoretical maximum of (e.g. 14) WFs to 1 lemma
- lexicon based
  - large lexicons of all (generated) WFs needed
  - preparation expensive in time and manpower
  - mostly realized by databases
- algorithmic based
  - mostly finite-state transducers (FST): compact, efficient, fast
  - a lexicon of lemmas and their inflectional patterns is needed anyway
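The lexicon-based variant can be pictured as a plain lookup table from word-forms to lemmas. The following is a minimal Python sketch under that assumption; the table `LEXICON` and the function `lemmatize` are hypothetical names, and the entries are just the distinct forms of "student" from the paradigm shown earlier, not data from the actual system.

```python
# Toy word-form -> lemma table. A real lexicon would be generated from
# lemmas and their inflectional patterns; these entries are illustrative.
LEXICON = {
    "student": "student", "studenta": "student", "studentu": "student",
    "studentom": "student", "studenti": "student", "studenata": "student",
    "studentima": "student", "studente": "student",
}

def lemmatize(word_form, lexicon=LEXICON):
    """Return the lemma, or the word-form itself if the lexicon lacks it."""
    return lexicon.get(word_form.lower(), word_form)

tokens = ["Studentima", "studenata", "fakultet"]
normalized = [lemmatize(t) for t in tokens]
print(normalized)
```

The fallback for unknown forms ("fakultet" here) is where the coverage problem of hand-built lexicons shows up.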
Morphological processing 4
- stemming
  - reducing the WF from the end by truncating the possible endings
  - does not have to respect the linguistic boundaries
      vuk+Ø > *vu+kØ
      vuk+a > *vu+ka
      vuč+e > *vu+če
  - reducing all the WFs to a common beginning
  - problems where there are many morphonological adaptations
      sla+ti > *?+slati
      šalj+em > *?+šaljem
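To make the two problems above concrete, here is a small Python sketch of a naive suffix-stripping stemmer. The ending list `ENDINGS` and both function names are assumptions made for illustration; this is not a real Croatian stemmer. Stripping endings leaves "vuk" and "vuč" apart, and the only string that conflates all forms is their common beginning, which is "vu" for the first group and empty for slati / šaljem.

```python
import os

# Hypothetical ending list, for illustration only.
ENDINGS = ["ima", "om", "em", "ti", "a", "e", "i", "u"]

def naive_stem(word_form, endings=ENDINGS, min_stem=2):
    """Strip the longest matching ending, keeping at least min_stem characters."""
    for ending in sorted(endings, key=len, reverse=True):
        if word_form.endswith(ending) and len(word_form) - len(ending) >= min_stem:
            return word_form[: -len(ending)]
    return word_form

def common_beginning(word_forms):
    """Longest shared prefix: the only 'stem' that conflates every form."""
    return os.path.commonprefix(list(word_forms))

for group in (["vuk", "vuka", "vuče"], ["slati", "šaljem"]):
    stems = [naive_stem(wf) for wf in group]
    print(group, "->", stems, "| common beginning:", repr(common_beginning(group)))
```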
Morphological normalization
- Croatian (like most Slavic languages) is morphologically complex
  - elaborate inflectional and derivational morphology
  - problematic for most NLP applications
  - requires the use of substantial linguistic knowledge
- our lexicon-based approach to normalization
  - is somewhere in between lemmatization and stemming
  - suitable for other inflectionally complex languages
Croatian Morphology
1. high degree of affixation
  - word-forms are obtained by suffixation, prefixation, phonological alternations, stem extension
  - inflection
    - nouns: declension (7 cases, 2 numbers)
    - verbs: conjugation (tenses, persons, numbers, genders)
    - adjectives: declension (7 cases, 2 numbers, 3 genders), comparison (3 degrees), and definiteness
  - derivation
    - a large number of rules for deriving nouns from verbs, verbs from nouns, possessive adjectives, ...
Croatian Morphology 2
- inflection examples
  - adjective: brz, brza, brzi, brzima, brzih, brzoj, brze, brzim, brzog, brzoga, brz, brza, brzo, brzom, brzomu, brži, bržeg, brža, brži, bržima, bržih, bržoj, brže, bržim, bržem, bržima, najbrži, bržeg, najbrža, najbržima, najbržih, najbrže, najbržim, najbrži, najbržoj, ...
  - noun: brzina, brzinom, brzine, brzinama, brzinu, brzina, brzini
  - adjective: brzinski, brzinskom, brzinske, brzinskih, brzinska, brzinskoj, brzinsko, brzinskog, brzinskoga, …
  - adverb: brzo, brže, najbrže, brzinski
- derivation examples
  - brz > brzina > brzinski > …
Croatian Morphology 3
2. high degree of homography
  - vode = voda (water) | voditi (to lead) | vod (a platoon)
  - requires disambiguation (POS/MSD tagging)
3. affix ambiguity
  - many ambiguous suffixation rules
    - e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i
    - e.g. bolnic-a / bolnic-om vs. brodolom / brodolom-a
  - possible mismatches at inflectional level
    - narančast / narančast-om vs. ruž / ruž-om (not ruža)
  - possible mismatches at derivational level
    - e.g. kralj / kralj-ica vs. stan / stan-ica
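Homography means that a word-form lookup cannot return a single lemma. A minimal Python sketch, assuming a hypothetical list of (word-form, lemma) pairs built around the "vode" example, shows why a tagger is needed afterwards to pick one reading in context:

```python
from collections import defaultdict

# Illustrative (word-form, lemma) pairs; "vode" is the homograph from the slide.
ENTRIES = [
    ("vode", "voda"),     # water
    ("vode", "voditi"),   # to lead
    ("vode", "vod"),      # a platoon
    ("vodimo", "voditi"),
]

def build_index(entries):
    """Group lemmas by word-form so that homographs keep all their readings."""
    index = defaultdict(set)
    for word_form, lemma in entries:
        index[word_form].add(lemma)
    return index

def candidate_lemmas(word_form, index):
    """All lemmas a word-form may belong to; choosing one needs POS/MSD context."""
    return sorted(index.get(word_form, set()))

index = build_index(ENTRIES)
print(candidate_lemmas("vode", index))
print(candidate_lemmas("vodimo", index))
```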
Lexicon-based normalization
- lexicon-based morphological normalisation
  - a morphological lexicon associates each WF with its morphological norm (lemma, stem, ...) and, optionally, an MSD
  - incorporates linguistic knowledge and thus avoids the aforementioned pitfalls
- drawbacks
  - made by linguists, expensive and time-consuming
  - problems with coverage (neologisms, jargon, …)
- our approach
  - rule-based acquisition of large-coverage morphological lexica from raw (unannotated) corpora
Our approach
1. acquisition of inflectional lexicon
  - input: raw corpora and sets of inflectional and derivational rules in a convenient (grammar-book-like) formalism
2. normalisation of word-forms
  - inflectional (lemmatization)
  - inflectional + derivational
    - comparable to stemming (but more precise)
- advantages
  - can be used as both a lemmatizer (with MSD) and a stemmer (with variable degree of conflation)
  - provides good lexicon coverage
  - requires only limited linguistic expertise
Morphology representation
- e.g. noun inflectional paradigm: vojnik (soldier)

    Case  Singular   Plural
    N     vojnik-Ø   vojnic-i
    G     vojnik-a   vojnik-a
    D     vojnik-u   vojnic-ima
    A     vojnik-a   vojnik-e
    V     vojnič-e   vojnic-i
    L     vojnik-u   vojnic-ima
    I     vojnik-om  vojnic-ima
Morphology representation 2
- defines inflectional and derivational rules
- uses functions as building blocks:
  A) condition functions
  B) string transformation functions
- each defined using a higher-order function, e.g.
    sfx
    sfx('a')
    sfx('a')('vojnik') = 'vojnika'
    sfx('e') ∘ alt(pal)
    (sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
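The building blocks above translate almost literally into closures. The sketch below uses Python purely for illustration (the slides state that the actual system is written in Haskell); `sfx` and `alt` follow the slide's names, while `compose` and the table `PAL`, which covers only the k/g/h palatalization, are assumptions of this sketch.

```python
def sfx(suffix):
    """Higher-order function: returns a transformation that appends `suffix`."""
    return lambda s: s + suffix

# First palatalization of stem-final consonants (assumed minimal table).
PAL = {"k": "č", "g": "ž", "h": "š"}

def alt(table):
    """Returns a transformation that alternates the final character per `table`."""
    def transform(s):
        if s and s[-1] in table:
            return s[:-1] + table[s[-1]]
        return s
    return transform

def compose(*functions):
    """compose(f, g)(x) == f(g(x)): the rightmost function is applied first."""
    def composed(x):
        for f in reversed(functions):
            x = f(x)
        return x
    return composed

print(sfx("a")("vojnik"))
print(compose(sfx("e"), alt(PAL))("vojnik"))
```

Because every rule is an ordinary function, new alternations can be added as data (another table) without touching the rule machinery.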
Morphology representation 3

    Case  Singular   Plural
    N     vojnik-Ø   vojnic-i
    G     vojnik-a   vojnik-a
    D     vojnik-u   vojnic-ima
    A     vojnik-a   vojnik-e
    V     vojnič-e   vojnic-i
    L     vojnik-u   vojnic-ima
    I     vojnik-om  vojnic-ima

    (s.ends('k','g','h')(s) consGroup(s),
     {null, sfx('a'), sfx('u'), sfx('om'), sfx('e') ∘ alt(pal), sfx('i') ∘ alt(sib), sfx('ima') ∘ alt(sib), sfx('e')})
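A paradigm in this formalism is a pair of a condition and a set of transformations. As a hedged Python sketch (again only an illustration of the idea, not the original Haskell code), the vojnik paradigm can be applied as follows. The names `ends`, `compose`, `word_forms` and the tables `PAL` and `SIB` are assumptions of this sketch, and the condition models only the `ends('k','g','h')` part of the rule, since the `consGroup` part is not fully recoverable from the slide.

```python
def sfx(suffix):
    return lambda s: s + suffix

def alt(table):
    # Alternate the stem-final character according to `table`.
    return lambda s: s[:-1] + table[s[-1]] if s and s[-1] in table else s

def compose(f, g):
    # compose(f, g)(x) == f(g(x))
    return lambda s: f(g(s))

def ends(*endings):
    # Condition function: does the stem end in one of `endings`?
    return lambda s: s.endswith(endings)

null = lambda s: s                      # identity: the lemma itself
PAL = {"k": "č", "g": "ž", "h": "š"}    # palatalization
SIB = {"k": "c", "g": "z", "h": "s"}    # sibilarization

NOUN_KGH = (
    ends("k", "g", "h"),
    [null, sfx("a"), sfx("u"), sfx("om"),
     compose(sfx("e"), alt(PAL)),
     compose(sfx("i"), alt(SIB)),
     compose(sfx("ima"), alt(SIB)),
     sfx("e")],
)

def word_forms(stem, paradigm):
    """Apply every transformation if the condition accepts the stem."""
    condition, transformations = paradigm
    if not condition(stem):
        return []
    return [t(stem) for t in transformations]

print(word_forms("vojnik", NOUN_KGH))
```

The eight transformations yield the eight distinct forms that fill the fourteen case/number slots of the table.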
Morphology representation 4
- suitable also for more complex paradigms
    (c, {null, sfx('a'), sfx('u'), ..., sfx('ima')}
        {sfx('og'), sfx('om'), ..., sfx('ima')}
        {sfx('i') ∘ alt(jot), sfx('eg') ∘ alt(jot), ..., sfx('ima') ∘ alt(jot)}
        {sfx('i') ∘ alt(jot) ∘ pfx('naj'), ..., sfx('ima') ∘ alt(jot) ∘ pfx('naj')})
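The comparative and superlative groups of such an adjective paradigm differ only by one extra composed step. A short Python sketch, assuming a partial jotation table `JOT` and a hypothetical `compose` helper (neither is taken from the original implementation), derives the superlative transformations from the comparative ones by composing with `pfx('naj')`:

```python
def sfx(suffix):
    return lambda s: s + suffix

def pfx(prefix):
    return lambda s: prefix + s

def alt(table):
    # Alternate the stem-final character according to `table`.
    return lambda s: s[:-1] + table[s[-1]] if s and s[-1] in table else s

def compose(*functions):
    # compose(f, g, h)(x) == f(g(h(x))): rightmost function first.
    def composed(x):
        for f in reversed(functions):
            x = f(x)
        return x
    return composed

JOT = {"z": "ž", "s": "š", "d": "đ", "t": "ć"}  # assumed subset of jotation

comparative = [compose(sfx(e), alt(JOT)) for e in ("i", "eg", "ima")]
superlative = [compose(t, pfx("naj")) for t in comparative]

forms = [t("brz") for t in comparative + superlative]
print(forms)
```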
Morphology representation 5
- advantages
  - resembles the morphology descriptions found in traditional grammar books
  - requires a minimum amount of linguistic knowledge
  - highly expressive: arbitrary higher-order functions (HOFs) can be defined
  - can be applied to other morphologically similar languages
- implemented in Haskell
  - purely functional programming language
  - requires minimum programming skills
Lexicon acquisition
- uses inflectional rules + raw corpora to extract lemmas and their paradigms
- uses frequency counts of WFs attested in the corpus
- much of the ambiguity is resolved by language-dependent heuristics
  - plausibility, priority
- linguistic quality is not vital
  - word-form conflation rather than generation
  - human intervention is not required
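The acquisition idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the authors' procedure: each attested word-form is tried as a candidate lemma under each paradigm, and a guess is accepted when enough of its generated forms are attested in the corpus. The two toy paradigms, the helper `rsfx`, the threshold `min_attested`, and the corpus counts are all assumptions; the plausibility and priority heuristics mentioned above are not reproduced.

```python
from collections import Counter

def sfx(suffix):
    return lambda s: s + suffix

def rsfx(old, new):
    """Hypothetical helper: replace the ending `old` with `new`."""
    def transform(s):
        return s[: len(s) - len(old)] + new if s.endswith(old) else s
    return transform

null = lambda s: s

# Two toy paradigms, each a (condition, transformations) pair over a lemma.
PARADIGMS = {
    "noun-masc": (lambda s: not s.endswith("a"),
                  [null, sfx("a"), sfx("u"), sfx("om"), sfx("i"), sfx("e"), sfx("ima")]),
    "noun-fem-a": (lambda s: s.endswith("a"),
                   [null, rsfx("a", "e"), rsfx("a", "i"), rsfx("a", "u"),
                    rsfx("a", "om"), rsfx("a", "ama")]),
}

def acquire(counts, paradigms, min_attested=3):
    """Map word-forms to lemmas for each (lemma, paradigm) guess with enough support."""
    lexicon = {}
    for candidate in counts:
        for condition, transformations in paradigms.values():
            if not condition(candidate):
                continue
            forms = {t(candidate) for t in transformations}
            attested = [wf for wf in forms if wf in counts]
            if len(attested) >= min_attested:
                for wf in forms:
                    lexicon.setdefault(wf, candidate)
    return lexicon

# Made-up frequency counts standing in for a raw corpus.
corpus = Counter({"student": 50, "studenta": 30, "studentu": 10, "studentima": 5,
                  "brzina": 20, "brzine": 15, "brzinu": 7, "brzinom": 4, "vode": 9})
lexicon = acquire(corpus, PARADIGMS)
print(sorted(set(lexicon.values())))
```

Note that "studenta" is also tried as a feminine lemma but finds only two attested forms, so the wrong guess is rejected. In this sketch the counts serve only as an attestation test; weighting guesses by frequency would be a natural refinement.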
Results
- example lexicon
  - acquired from a 20-million-word (20 Mw) newspaper corpus
  - based on 90 inflectional and >300 derivational rules
  - contains ca. 42,000 lemmas associated with over 500,000 WFs
- performance
  - linguistic quality F1 = 88% per type
  - coverage 96% per type and 98% per token
  - understemming = 7%
  - overstemming < 4%
- can be improved further by manual editing
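The difference between per-type and per-token coverage can be illustrated with a short Python sketch. The definitions used here (share of distinct word-forms found in the lexicon versus share of running words found in the lexicon) are the usual ones and are assumed rather than taken from the slides; the token list and lexicon are made up.

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (per-type, per-token) coverage of `tokens` by `lexicon`."""
    counts = Counter(tokens)
    covered = [wf for wf in counts if wf in lexicon]
    per_type = len(covered) / len(counts)
    per_token = sum(counts[wf] for wf in covered) / sum(counts.values())
    return per_type, per_token

tokens = ["student", "studenta", "student", "brzina", "blog", "student"]
lexicon = {"student", "studenta", "brzina"}

per_type, per_token = coverage(tokens, lexicon)
print(f"per type: {per_type:.2f}, per token: {per_token:.2f}")
```

Frequent word-forms are more likely to be in the lexicon, which is why per-token coverage tends to come out higher than per-type coverage.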
Derivational normalization
- inflectional lexicon is partitioned into equivalence classes based on derivational rules
- degree of normalisation depends on the number of derivational rules used
- problem with semantics
  - context, degrees
  - derivation is not as semantically regular as inflection
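Partitioning lemmas into equivalence classes can be sketched with a union-find structure: two lemmas end up in the same class whenever a derivational rule maps one onto the other. The Python below is an illustration under assumptions; the three rules, the helper `rsfx`, and the lemma list (reusing brz / brzina / brzinski and the kralj / stan examples from earlier slides) are not the project's actual rule set.

```python
def sfx(suffix):
    return lambda s: s + suffix

def rsfx(old, new):
    """Hypothetical helper: replace the ending `old` with `new`."""
    def transform(s):
        return s[: len(s) - len(old)] + new if s.endswith(old) else s
    return transform

def derivational_classes(lemmas, rules):
    """Merge lemmas linked by any rule and return the sorted classes."""
    parent = {lemma: lemma for lemma in lemmas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for lemma in lemmas:
        for rule in rules:
            derived = rule(lemma)
            if derived != lemma and derived in parent:
                parent[find(derived)] = find(lemma)

    classes = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), set()).add(lemma)
    return sorted(sorted(c) for c in classes.values())

LEMMAS = ["brz", "brzina", "brzinski", "kralj", "kraljica", "stan", "stanica"]
few_rules = [sfx("ina")]
more_rules = [sfx("ina"), rsfx("a", "ski"), sfx("ica")]

fine = derivational_classes(LEMMAS, few_rules)
coarse = derivational_classes(LEMMAS, more_rules)
print(fine)
print(coarse)
```

With more rules the classes get coarser, which is the variable degree of conflation. The same `-ica` rule that correctly links kralj and kraljica also links stan and stanica, the derivational mismatch noted earlier, so the semantic problem shows up as overstemming.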
References and applications
- Reference
  - Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing and Management, 2008. (in press)
- Applied in document indexing projects AIDE & CADIAL
  - www.cadial.org
  - Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / J. Van Nieuwenhove & P. Popelier (eds). Brugge : Die Keure, 2008. pp. 107-117.
- Applied in text classification
  - Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information Processing and Management, 44 (2008), 1; 325-339.
Thank you for your attention!