Grammar development in XLE - unideb.huweb.unideb.hu/...Linguistics_2012/.../morphology1.pdf ·...
-
Gábor Csernyi Department of English Linguistics
University of Debrecen
http://ieas.unideb.hu/csernyi
-
„The quest for an efficient method for the analysis and generation of word-forms is no longer an academic research topic, although morphological analyzers still remain to be written for all but the commercially most important languages.”
(Karlsson & Karttunen 1997)
Morphological processing (both analysis and generation) is an important component of many (sub)fields of natural language processing: text-to-speech systems, machine translation, information retrieval, etc.
-
Morphology: the linguistic field concerned with the study of the internal structure of words.
What is a word? A finite sequence of letters built over a finite set of symbols (i.e. the alphabet).
Word form vs. lexeme: opens, opener (word forms) vs. OPEN (lexeme).
-
Inflection: form variations of the same word, accompanied by changes in grammatical features:
number [singular / plural], e.g. house - houses;
person [1st / 2nd / 3rd], e.g. think - thinks;
tense [past / present (/ future)], e.g. call - called;
gender [feminine / masculine / neuter];
case [accusative / dative / genitive / locative / etc.].
-
Creating new words: teach [V] → teacher [N]; record [V] → record [N]
-
Words are built from morphemes.
Morpheme:
the smallest meaningful unit of language;
carries lexical meaning or indicates grammatical features (e.g. present, 3rd person, singular).
-
Types of morphemes:
Free: can stand alone as a word (e.g. cat, book).
Bound: must be attached to another morpheme (root/base/stem); carries no (lexical) meaning on its own (e.g. -ed, -s, un-).
Another classification of morphemes:
Lexical morphemes: open-class.
  lexeme: all the forms with the same meaning;
  lemma: the form that conventionally represents the lexeme (source: Wikipedia).
Grammatical morphemes (function words): closed-class; carry grammatical meaning/function (source: WordNet).
-
Root: „The primary lexical unit of a word, which carries the most significant aspects of semantic content and cannot be reduced into smaller constituents” (source: Wikipedia)
Stem: the unit to which an inflectional ending is added.
Example: misconducts
stem: misconduct
inflection(al suffix): -s
root: conduct
derivational prefix: mis-
-
Base: an element to which affixes (inflectional/derivational) are added.
Example: misconducts
base: misconduct
inflection(al suffix): -s
base / root: conduct
derivational prefix: mis-
-
Any process whereby a new form is produced from a base.
Characterising factors: productivity, regularity.
-
Affixation: attaching an affix to a base.
Affix types:
prefix: before the base, e.g. undo;
suffix: after the base, e.g. dogs;
infix: inside the base, e.g. ?passer-by;
circumfix: around the base.
-
Compounding: forming compounds by putting two or more words together.
Note: orthography issues:
words separated (white space between words), e.g. red wine;
words hyphenated, e.g. time-consuming;
words joined without white space, e.g. bedtime.
-
Derivation: forming a new word that differs from the original base in grammatical category and/or meaning.
Note: conversion (i.e. zero derivation) is a subtype.
Characteristics:
rule-based process;
productive (certain derivational affixes connect to certain base forms);
output open to other (derivational) processes, e.g. product → productive → productivity.
-
Cliticisation, e.g. they’ve
Reduplication, e.g. ?very very expensive
Internal change, e.g. sink - sank - sunk
Suppletion, e.g. good - better/best
-
Clipping, abbreviation, acronymy, e.g. advertisement → ad; manuscript → MS; frequently asked questions → FAQ
Blending, e.g. motor + hotel → motel
Backformation, e.g. beggar → beg
-
Analysis: identifying the morphemes of a word form and deriving its morphosyntactic features, e.g. elephants: elephant + s, i.e. elephant +Noun +Pl. > morphological analyzer/parser
Generation: producing all possible word forms of a root/stem. > morphological generator
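The two directions can be illustrated with a minimal sketch in Python. The table entries, the tag names other than +Noun and +Pl, and the function names analyze and generate are assumptions made for illustration; a real analyzer would not store every form like this.

```python
# Toy table: surface form -> list of (lemma, tags) readings.
# All entries are illustrative assumptions, not taken from a real lexicon.
ANALYSES = {
    "elephant": [("elephant", ["+Noun", "+Sg"])],
    "elephants": [("elephant", ["+Noun", "+Pl"])],
    "opens": [("open", ["+Verb", "+Pres", "+3sg"])],
    "opened": [("open", ["+Verb", "+Past"])],
}


def analyze(word_form):
    """Analysis: surface form -> lemma plus morphosyntactic tags."""
    return [" ".join([lemma, *tags]) for lemma, tags in ANALYSES.get(word_form, [])]


def generate(lemma):
    """Generation: lemma -> every word form stored for that lemma."""
    return sorted(
        form
        for form, readings in ANALYSES.items()
        if any(entry_lemma == lemma for entry_lemma, _ in readings)
    )


print(analyze("elephants"))
print(generate("open"))
```

Analysis and generation read the same table in opposite directions, which is why one resource can serve both tasks.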
-
Segmentation: how can the morphemes in a word form be identified, and how should the word form be segmented?
Morphographemics: how can alternating forms (spelling changes, e.g. carry vs. carried) be identified and accounted for?
Morphotactics: are there any constraints on the position/order of morphemes (to constitute a valid word form)?
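The morphographemics question can be made concrete with a small Python sketch of the carry vs. carried alternation. The function name attach_ed and the two rules it encodes are simplifying assumptions; consonant doubling (stop, stopped) and irregular verbs are deliberately left out.

```python
VOWELS = set("aeiou")


def attach_ed(stem):
    """Attach the past-tense suffix -ed, applying two simplified spelling rules."""
    if stem.endswith("e"):
        return stem + "d"  # like -> liked
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS:
        return stem[:-1] + "ied"  # carry -> carried (consonant + y)
    return stem + "ed"  # call -> called, play -> played


for verb in ["call", "carry", "play", "like"]:
    print(verb, attach_ed(verb))
```

Plain concatenation would give the wrong form for carry, so the spelling change has to be stated somewhere, either in rules such as these or in the lexicon.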
-
Basic techniques:
Lexicon with full forms:
  all possible word forms are stored in the lexicon;
  pattern matching as lexical lookup.
  Problems: large lexicon, slow processing; redundancy; what about language creativity and productivity?
Lemma lexicon:
  lexicon for lemmas (+ morphosyntactic interpretations);
  lexicon for affixes: affixes + morphotactics;
  lexical lookup: finding lemma + affix(es) sequences.
  Problems: morphographemics, suppletion.
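The lemma-lexicon technique can be sketched in Python as follows. The lexicon contents, the tag names and the function name lookup are hypothetical; the morphotactics is reduced to a single constraint (which category a suffix attaches to).

```python
# Lemma lexicon: lemma -> category (illustrative entries only).
LEMMAS = {"call": "Verb", "carry": "Verb", "house": "Noun", "elephant": "Noun"}

# Affix lexicon with a minimal morphotactic constraint:
# each suffix lists the categories it may attach to and the tag it contributes.
SUFFIXES = {
    "": [("Noun", "+Sg"), ("Verb", "+Base")],
    "s": [("Noun", "+Pl"), ("Verb", "+3sg")],
    "ed": [("Verb", "+Past")],
}


def lookup(word_form):
    """Try every split point and keep lemma + suffix pairs the lexicons allow."""
    analyses = []
    for cut in range(len(word_form), 0, -1):
        stem, suffix = word_form[:cut], word_form[cut:]
        category = LEMMAS.get(stem)
        if category is None:
            continue
        for attaches_to, tag in SUFFIXES.get(suffix, []):
            if attaches_to == category:
                analyses.append(f"{stem} +{category} {tag}")
    return analyses


print(lookup("houses"))
print(lookup("called"))
print(lookup("carried"))
```

The last call returns an empty list: carried cannot be split into a listed lemma plus a listed suffix, which is exactly the morphographemics problem named above.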
-
Two-level morphology:
Introduced by Kimmo Koskenniemi (Koskenniemi 1983, Two-level morphology: a general computational model for word-form recognition and production).
Features:
  language-independent;
  can be used for both analysis and generation;
  two parallel levels: surface level, lexical level;
  symbol-to-symbol correspondence between the two levels:
    lexical level: elephant +Noun +Pl
    surface level: elephants
  transducers are responsible for the mapping between the surface and the lexical level;
  main components: rule component with two-level rules; lexical component (lemmas and affixes in the form of continuation classes).
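The symbol-to-symbol correspondence can be sketched in Python as a list of aligned lexical:surface pairs. This is a toy under stated assumptions: the paths are enumerated by hand instead of being compiled from two-level rules and a lexicon, "0" stands for an empty symbol, and the names PATHS, analyze and generate are made up for the example.

```python
EPS = "0"  # zero symbol: nothing is realised on that level


def identity(text):
    """Pairs in which the lexical and the surface symbol are the same letter."""
    return [(ch, ch) for ch in text]


# Each path is a sequence of (lexical symbol, surface symbol) pairs.
PATHS = [
    identity("elephant") + [("+Noun", EPS), ("+Sg", EPS)],
    identity("elephant") + [("+Noun", EPS), ("+Pl", "s")],
    identity("carr") + [("y", "i"), ("+Verb", "e"), ("+Past", "d")],
]


def surface(path):
    """Project a path onto the surface level."""
    return "".join(s for _, s in path if s != EPS)


def lexical(path):
    """Project a path onto the lexical level (tags are set off by spaces)."""
    return "".join(l if len(l) == 1 else " " + l for l, _ in path if l != EPS)


def analyze(word_form):
    return [lexical(p) for p in PATHS if surface(p) == word_form]


def generate(lexical_form):
    return [surface(p) for p in PATHS if lexical(p) == lexical_form]


print(analyze("elephants"))
print(generate("carry +Verb +Past"))
```

Because every path carries both levels at once, the same data answers analysis and generation queries, and the y:i pair shows where a spelling change sits in the correspondence.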
-
Approaches making use of two-level morphology:
Finite-state morphological transducers:
  popular because of their speed and small size;
  used for both analysis and generation.
Unification-based morphology:
  lexicon: allomorphs of the same base form;
  sequences of affixes analyzed as a whole (Gestalt view);
  two-level morphology used to account for spelling changes; also, continuation classes;
  unifiability checks.
-
Machine learning approaches to morphological processing
Making generalizations: the machine discovers the morphological rules automatically.
Two types:
Supervised: morphological rules are learned from examples.
Unsupervised: the structure of word forms is identified in large, unannotated texts.
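A very rough flavour of the unsupervised setting is given by the Python sketch below. It is a crude heuristic chosen for illustration, not a published algorithm: a word ending counts as a suffix candidate whenever the remaining stem also occurs as a word. The function name suffix_candidates, the length limits and the word list are assumptions.

```python
from collections import Counter


def suffix_candidates(words, max_len=3, min_stem=3):
    """Count endings whose removal leaves another attested word."""
    vocab = set(words)
    counts = Counter()
    for word in vocab:
        for k in range(1, max_len + 1):
            stem, suffix = word[:-k], word[-k:]
            if len(stem) >= min_stem and stem in vocab:
                counts[suffix] += 1
    return counts


corpus = ["call", "calls", "called", "open", "opens", "opened",
          "house", "houses", "walk", "walked"]
counts = suffix_candidates(corpus)
print(counts["s"], counts["ed"])
```

No annotation is used: the candidates -s and -ed emerge only from the distribution of forms in the word list, and a real system would need far more data and a way to filter accidental matches.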
-
1. Humor (High-speed Unification Morphology)
unification-based system
developed by MorphoLogic
-
2. HunMorph
open-source tool
C/C++ runtime layer toolkit, an extension to MySpell (a reimplementation of the Ispell spellchecker)
affix-stripping methods
language-independent runtime environment
language-specific dictionaries (dictionary, affix)
-
3. Xerox FST for Hungarian