Lecture 3, 7/27/2005 Natural Language Processing 1
CS60057 Speech & Natural Language Processing
Autumn 2005
Lecture 4
28 July 2005
Morphology
Study of the rules that govern the combination of morphemes.
Inflection: same word, different syntactic information -- run/runs/running, book/books
Derivation: new word, different meaning; often a different part of speech, but not always -- possible/possibly/impossible, happy/happiness
Compounding: new word, each part is itself a word -- blackboard, bookshelf; lAlakamala, banabAsa (Bengali)
Morphology Level: The Mapping
Formally: A+ → 2^(L, C1, C2, ..., Cn), i.e., each non-empty phoneme string maps to a set of (lemma, categories) tuples
A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes)
L is the set of possible lemmas, uniquely identified
Ci are morphological categories, such as:
grammatical number, gender, case, person, tense, negation, degree of comparison, voice, aspect, ...
tone, politeness, ...
part of speech (not quite a morphological category, but ...)
A, L and the Ci are obviously language-dependent
Bengali/Hindi Inflectional Morphology
Certain languages encode in morphology much of what other languages encode in syntax
Some of the inflectional suffixes that nouns can take:
singular/plural markers
gender markers
possessive markers
case markers: different karakas
Inflectional suffixes that verbs can take:
Hindi: tense, aspect, modality, person, gender, number
Bengali: tense, aspect, modality, person
There is a fixed order among inflectional suffixes (morphotactics): chhelederke, baigulokei
Bengali/Hindi Derivational Morphology
Derivational morphology is very rich.
English Inflectional Morphology
Nouns have simple inflectional morphology:
plural -- cat / cats
possessive -- John / John's
Verbs have slightly more complex, but still relatively simple, inflectional morphology:
past form -- walk / walked
past participle form -- walk / walked
gerund -- walk / walking
singular third person -- walk / walks
Verbs can be categorized as:
main verbs
modal verbs -- can, will, should
primary verbs -- be, have, do
Regular and irregular verbs: walk / walked -- go / went
Regulars and Irregulars
Some words misbehave (refuse to follow the rules):
Mouse/mice, goose/geese, ox/oxen Go/went, fly/flew
The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.
Regular and Irregular Verbs
Regulars… Walk, walks, walking, walked, walked
Irregulars Eat, eats, eating, ate, eaten Catch, catches, catching, caught, caught Cut, cuts, cutting, cut, cut
Derivational Morphology
Quasi-systematicity
Irregular meaning change
Changes of word class
Some English derivational affixes:
-ation : transport / transportation
-er : kill / killer
-ness : fuzzy / fuzziness
-al : computation / computational
-able : break / breakable
-less : help / helpless
un- : do / undo
re- : try / retry
renationalizationability
Derivational Examples
Verb/Adj to Noun
-ation : computerize → computerization
-ee : appoint → appointee
-er : kill → killer
-ness : fuzzy → fuzziness
Derivational Examples
Noun/Verb to Adj
-al : computation → computational
-able : embrace → embraceable
-less : clue → clueless
Compute
Many paths are possible... Starting with compute:
compute → computer → computerize → computerization
compute → computation → computational
compute → computer → computerize → computerizable
compute → computee
Parts of A Morphological Processor
For a morphological processor, we need at least the following:
Lexicon: the list of stems and affixes, together with basic information about them such as their main categories (noun, verb, adjective, ...) and their sub-categories (regular noun, irregular noun, ...).
Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow which other classes of morphemes inside a word.
Orthographic Rules (Spelling Rules): rules used to model the changes that occur in a word, normally when two morphemes combine.
Lexicon
A lexicon is a repository for words (stems). They are grouped according to their main categories:
noun, verb, adjective, adverb, ...
They may also be divided into sub-categories:
regular nouns, irregular-singular nouns, irregular-plural nouns, ...
The simplest way to create a morphological parser would be to put all possible words (together with their inflections) into the lexicon. We do not do this because the number of word forms is huge (for Turkish it is theoretically infinite).
Morphotactics
Which morphemes can follow which morphemes.
Lexicon:
regular-noun   irreg-pl-noun   irreg-sg-noun   plural
fox            geese           goose           -s
cat            sheep           sheep
dog            mice            mouse
Simple English Nominal Inflection (Morphotactic Rules) -- an FSA with states 0, 1, 2 (1 and 2 accepting):
0 --reg-noun--> 1
1 --plural (-s)--> 2
0 --irreg-sg-noun--> 2
0 --irreg-pl-noun--> 2
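As a quick sketch (not from the slides; the word lists and function name are hypothetical), the morphotactic rules above can be encoded directly. Note that, just like the letter-level machine discussed on the next slide, this naive concatenation happily accepts the wrong form "foxs" -- spelling rules are needed to fix that.

```python
# Minimal sketch of the nominal-inflection FSA above.
# Word lists are illustrative stand-ins for the lexicon classes.
REG_NOUNS = {"fox", "cat", "dog"}
IRREG_SG_NOUNS = {"goose", "sheep", "mouse"}
IRREG_PL_NOUNS = {"geese", "sheep", "mice"}

def accepts(word):
    """Recognize a word against the FSA (states 1 and 2 accepting)."""
    if word in IRREG_SG_NOUNS or word in IRREG_PL_NOUNS:
        return True   # 0 -> 2 via irreg-sg-noun / irreg-pl-noun
    if word in REG_NOUNS:
        return True   # 0 -> 1 via reg-noun
    if word.endswith("s") and word[:-1] in REG_NOUNS:
        return True   # 0 -> 1 -> 2 via reg-noun then plural -s
    return False
```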
Combine Lexicon and Morphotactics
(Figure: the FSA above with each word-class arc expanded letter by letter -- f-o-x, c-a-t, d-o-g, s-h-e-e-p, g-o-o-s-e / g-e-e-s-e, m-o-u-s-e / m-i-c-e -- followed by the plural -s arc.)
This FSA only says yes or no; it does not give a lexical representation, and it accepts wrong words (e.g. foxs).
FSAs and the Lexicon
This will actually require a new kind of machine: the Finite State Transducer (FST)
We will give a quick overview
First we'll capture the morphotactics: the rules governing the ordering of affixes in a language
Then we'll add in the actual words
Simple Rules
Adding the Words
Derivational Rules
Parsing/Generation vs. Recognition
Recognition is usually not quite what we need:
Usually if we find some string in the language we need to find the structure in it (parsing)
Or we have some structure and we want to produce a surface form (production/generation)
Example: from "cats" to "cat +N +PL"
Why care about morphology?
`Stemming' in information retrieval: might want to search for "aardvark" and find pages with both "aardvark" and "aardvarks"
Morphology in machine translation: need to know that the Spanish words quiero and quieres are both related to querer `want'
Morphology in spell checking: need to know that misclam and antiundoggingly are not words despite being made up of word parts
Can’t just list all words
Turkish word Uygarlastiramadiklarimizdanmissinizcasin `(behaving) as if you are among those whom we could not civilize’
Uygar `civilized’ + las `become’ + tir `cause’ + ama `not able’ + dik `past’ + lar ‘plural’+ imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Finite State Transducers
The simple story:
Add another tape
Add extra symbols to the transitions
On one tape we read "cats", on the other we write "cat +N +PL"
Transitions
c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing on the other
+PL:s means read +PL and write an s
For the "cats" example: c:c a:a t:t +N:ε +PL:s
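A minimal sketch of walking these arcs (hypothetical arc-list encoding, not from the slides; a real FST would also check that it halts in a final state):

```python
# Arcs for the "cats" example: (state, input, output, next_state); "" is ε.
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),    # +N:ε  -- write nothing
    (4, "+PL", "s", 5),  # +PL:s -- write an s
]

def transduce(symbols):
    """Read the lexical tape symbol by symbol, writing the surface tape."""
    state, out = 0, []
    for sym in symbols:
        for (s, i, o, t) in ARCS:
            if s == state and i == sym:
                out.append(o)
                state = t
                break
        else:
            return None  # no matching arc: reject
    return "".join(out)
```

Running `transduce(["c", "a", "t", "+N", "+PL"])` yields `"cats"`; inverting the arc labels would give the parsing direction instead.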
Lexical to Intermediate Level
FST Properties
FSTs are closed under: union, inversion, and composition.
union: the union of two regular relations is also a regular relation
inversion: the inversion of an FST simply switches the input and output labels; this means that the same FST can be used for both directions of a morphological processor
composition: if T1 is an FST from I1 to O1 and T2 is an FST from O1 to O2, then the composition of T1 and T2 (T1 ∘ T2) maps from I1 to O2
We use these properties of FSTs in the creation of the FST for a morphological processor.
A FST for Simple English Nominals
The lexical-level FST has one stem arc per noun class from the start state, each followed by +N:ε and then a class-appropriate ending:
reg-noun        +N:ε    +SG:#  or  +PL:^s#
irreg-sg-noun   +N:ε    +SG:#
irreg-pl-noun   +N:ε    +PL:#
FST for stems
An FST for stems, which maps roots to their root class:
reg-noun   irreg-pl-noun     irreg-sg-noun
fox        g o:e o:e s e     goose
cat        sheep             sheep
dog        m o:i u:є s:c e   mouse
Here fox stands for f:f o:o x:x. When these two transducers are composed, we have an FST which maps lexical forms to intermediate forms of words for simple English noun inflection.
The next thing we should handle is to design the FSTs for the orthographic rules, and then combine all these transducers.
Multi-Level Multi-Tape Machines
A frequently used FST idiom, called a cascade, is to have the output of one FST read in as the input to a subsequent machine.
So, to handle spelling we use three tapes: lexical, intermediate and surface.
We need one transducer to work between the lexical and intermediate levels, and a second (a bunch of FSTs) to work between the intermediate and surface levels to patch up the spelling.
lexical:        dog +N +PL
intermediate:   dog^s#
surface:        dogs
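The cascade for this example can be sketched as two rewrite stages (helper names are invented; stage 2 here only erases the boundary symbols, since no spelling rule applies to dog):

```python
# Hypothetical two-stage cascade for "dog +N +PL".
def lexical_to_intermediate(lexical):
    """Stage 1 (lexicon FST): spell +PL as ^s, drop +N/+SG, append #."""
    return (lexical.replace("+N", "")
                   .replace("+PL", "^s")
                   .replace("+SG", "") + "#")

def intermediate_to_surface(intermediate):
    """Stage 2 (orthographic FSTs): no spelling rule fires for dog,
    so just erase the morpheme boundary ^ and word boundary #."""
    return intermediate.replace("^", "").replace("#", "")
```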
Lexical to Intermediate FST
Orthographic Rules
We need FSTs to map the intermediate level to the surface level. For each spelling rule we will have one FST, and these FSTs run in parallel.
Some English spelling rules:
consonant doubling -- a 1-letter consonant is doubled before ing/ed -- beg/begging
E deletion -- silent e is dropped before ing and ed -- make/making
E insertion -- e is added after s, z, x, ch, sh before s -- watch/watches
Y replacement -- y changes to ie before s, and to i before ed -- try/tries
K insertion -- for verbs ending with vowel + c we add k -- panic/panicked
We represent these rules using two-level morphology rules: a => b / c __ d (rewrite a as b when it occurs between c and d).
FST for E-Insertion Rule
E-insertion rule: ε => e / {x, s, z} ^ __ s#
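A sketch of applying this rule as a string rewrite on the intermediate tape (the regex stands in for the rule FST; the function name is hypothetical):

```python
import re

def to_surface(intermediate):
    """Apply ε => e / {x,s,z} ^ __ s#, then erase boundary symbols."""
    # Insert e between x/s/z followed by ^ and the suffix s#.
    s = re.sub(r"([xsz])\^s#", r"\1^es#", intermediate)
    return s.replace("^", "").replace("#", "")
```

For example, `to_surface("fox^s#")` gives `"foxes"`, while `to_surface("cat^s#")` gives `"cats"` because the left context does not match.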
Generating or Parsing with FST Lexicon and Rules
Accepting Foxes
Intersection
We can intersect all the rule FSTs to create a single FST.
The intersection algorithm just takes the Cartesian product of states: for each state qi of the first machine and qj of the second machine, we create a new state qij. For input symbol a, if the first machine would transition to state qn and the second machine would transition to qm, the new machine transitions to qnm.
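The Cartesian-product construction can be sketched as follows (hypothetical dict-based encoding: delta maps (state, symbol) to the next state; all names are invented for illustration):

```python
def intersect(delta1, start1, finals1, delta2, start2, finals2):
    """Product construction: new state (q1, q2) steps on symbol a
    exactly when both machines can step on a."""
    delta = {}
    for (q1, a), n1 in delta1.items():
        for (q2, b), n2 in delta2.items():
            if a == b:
                delta[((q1, q2), a)] = (n1, n2)  # q_ij --a--> q_nm
    finals = {(f1, f2) for f1 in finals1 for f2 in finals2}
    return delta, (start1, start2), finals

def run(delta, start, finals, s):
    """Run a DFA encoded as delta[(state, symbol)] = next_state."""
    state = start
    for c in s:
        if (state, c) not in delta:
            return False
        state = delta[(state, c)]
    return state in finals
```

For instance, intersecting a machine that accepts strings ending in a with one that accepts even-length strings yields a machine accepting even-length strings ending in a.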
Composition
A cascade can turn out to be somewhat of a pain:
it is hard to manage all the tapes
it fails to take advantage of the restricting power of the machines
So it is better to compile the cascade into a single large machine.
Create a new state (x, y) for every pair of states x є Q1 and y є Q2. The transition function of the composition is defined as follows:
δ((x,y), i:o) = (v,z) iff there exists c such that δ1(x, i:c) = v and δ2(y, c:o) = z
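This definition translates almost directly into code (hypothetical encoding, not from the slides: each transition is a tuple (state, input, output, next)):

```python
def compose(t1, t2):
    """delta((x,y), i:o) = (v,z) iff there is a c with
    delta1(x, i:c) = v and delta2(y, c:o) = z."""
    result = set()
    for (x, i, c1, v) in t1:      # T1 reads i, writes intermediate c1
        for (y, c2, o, z) in t2:  # T2 reads intermediate c2, writes o
            if c1 == c2:          # the intermediate symbols must agree
                result.add(((x, y), i, o, (v, z)))
    return result
```

For example, composing a T1 arc +PL:^s with a T2 arc ^s:s yields a single arc +PL:s, eliminating the intermediate tape.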
Intersect Rule FSTs
lexical tape
  |  LEXICON-FST
intermediate tape
  |  FST1 ... FSTn
surface tape
=> intersect the parallel rule FSTs into one machine: FSTR = FST1 ∧ ... ∧ FSTn
Compose Lexicon and Rule FSTs
lexical tape
  |  LEXICON-FST
intermediate tape
  |  FSTR = FST1 ∧ ... ∧ FSTn
surface tape
=> compose the two levels into a single machine, LEXICON-FST ∘ FSTR, which maps the lexical tape directly to the surface level
Porter Stemming
Some applications (e.g. some information retrieval applications) do not need a whole morphological processor; they only need the stem of the word.
A stemming algorithm (such as the Porter stemming algorithm) is a lexicon-free FST: it is just a cascade of rewrite rules.
Stemming algorithms are efficient, but they may introduce errors because they do not use a lexicon.
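A toy Porter-style cascade (a handful of illustrative rules, not the real Porter algorithm) shows both the idea and the kind of error a lexicon-free stemmer makes:

```python
# Ordered suffix-stripping rules, applied first-match-wins.
RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies",  "i"),   # ponies   -> poni
    ("ing",  ""),    # motoring -> motor
    ("ed",   ""),    # plastered -> plaster
    ("s",    ""),    # cats     -> cat
]

def stem(word):
    """Strip the first matching suffix; no lexicon is consulted."""
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word
```

Because there is no lexicon, stem("sing") wrongly strips -ing and returns "s" -- exactly the kind of error noted above.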