1 CSC 9010- NLP - 3: Morphology, Finite State Transducers CSC 9010 Natural Language Processing...
-
Upload
blaise-dixon -
Category
Documents
-
view
217 -
download
0
Transcript of 1 CSC 9010- NLP - 3: Morphology, Finite State Transducers CSC 9010 Natural Language Processing...
1CSC 9010- NLP - 3: Morphology, Finite State Transducers
CSC 9010Natural Language Processing
Lecture 3: Morphology, Finite State Transducers
Paula MatuszekMary-Angela Papalaskari
Presentation slides adapted from: Marti Hearst (some following Dorr and Habash) http://www.sims.berkeley.edu/courses/is290-2/f04/ andJim Martin: http://www.cs.colorado.edu/~martin/csci5832.html
2CSC 9010- NLP - 3: Morphology, Finite State Transducers
Today
Elementary MorphologyComputational morphologyFinite State TransducersLexicon-only schemesRule-only schemesLab: Introduction to NLTK
3CSC 9010- NLP - 3: Morphology, Finite State Transducers
MorphologyMorphology:
The study of the way words are built up from smaller meaning units.Morphemes:
The smallest meaningful unit in the grammar of a language.Contrasts:
Derivational vs. InflectionalRegular vs. IrregularConcatinative vs. Templatic (root-and-pattern)
A useful resource:Glossary of linguistic terms by Eugene Looshttp://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
4CSC 9010- NLP - 3: Morphology, Finite State Transducers
Examples (English)
“unladylike”3 morphemes, 4 syllables
un- ‘not’lady ‘(well behaved) female adult human’-like ‘having the characteristics of’
Can’t break any of these down further without distorting the meaning of the units
“technique”1 morpheme, 2 syllables
“dogs”2 morphemes, 1 syllable
-s, a plural marker on nouns
5CSC 9010- NLP - 3: Morphology, Finite State Transducers
Morpheme DefinitionsRoot
The portion of the word that:– is common to a set of derived or inflected forms, if any, when all affixes
are removed – is not further analyzable into meaningful elements– carries the principal portion of meaning of the words
StemThe root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
AffixA bound morpheme that is joined before, after, or within a root or stem.
Clitica morpheme that functions syntactically like a word, but does not appear as an independent phonological word
– Spanish: un beso, las aguas– English: Hal’s (genetive marker) – Proto-European: Kwe -que (Latin), te (Greek), and –ca (Sanskrit)
6CSC 9010- NLP - 3: Morphology, Finite State Transducers
Inflectional vs. Derivational
Word ClassesParts of speech: noun, verb, adjectives, etc.Word class dictates how a word combines with morphemes to form new words
Inflection:Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
– Doesn’t change the word class– Usually produces a predictable, non-idiosyncratic change of
meaning.
Derivation:The formation of a new word or inflectable stem from another word or stem.
7CSC 9010- NLP - 3: Morphology, Finite State Transducers
Inflectional Morphology
Adds: tense, number, person, mood, aspect
Word class doesn’t changeWord serves new grammatical roleExamples
come is inflected for person and number:The pizza guy comes at noon.
las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by –s
las manzanas rojas (‘the red apples’)
8CSC 9010- NLP - 3: Morphology, Finite State Transducers
Derivational Morphology
Nominalization (formation of nouns from other parts of speech, primarily verbs in English):
computerizationappointeekillerfuzziness
Formation of adjectives (primarily from nouns) computationalcluelessEmbraceable
Diffulcult cases:building from which sense of “build”?
9CSC 9010- NLP - 3: Morphology, Finite State Transducers
Concatinative MorphologyMorpheme+Morpheme+Morpheme+…Stems: also called lemma, base form, root, lexeme
hope+ing hoping hop hopping
AffixesPrefixes: AntidisestablishmentarianismSuffixes: AntidisestablishmentarianismInfixes: hingi (borrow) – humingi (borrower) in TagalogCircumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languagesuygarlaştıramadıklarımızdanmışsınızcasına (Turkish)uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casınaBehaving as if you are among those whom we could not cause to become civilized
10CSC 9010- NLP - 3: Morphology, Finite State Transducers
Templatic MorphologyRoots and Patterns
Example: Hebrew verbsRoot:
– Consists of 3 consonants CCC– Carries basic meaning
Template:– Gives the ordering of consonants and vowels– Specifies semantic information about the verb
Active, passive, middle voiceExample:
– lmd (to learn or study) CaCaC -> lamad (he studied) CiCeC -> limed (he taught) CuCaC -> lumad (he was taught)
11CSC 9010- NLP - 3: Morphology, Finite State Transducers
Nouns and Verbs (in English)
Nouns have simple inflectional morphologycatcat+s, cat+’s
Verbs have more complex morphology
12CSC 9010- NLP - 3: Morphology, Finite State Transducers
Nouns and Verbs (in English)
NounsHave simple inflectional morphologyCat/CatsMouse/Mice, Ox, Oxen, Goose, Geese
VerbsMore complex morphologyWalk/WalkedGo/Went, Fly/Flew
13CSC 9010- NLP - 3: Morphology, Finite State Transducers
Regular (English) Verbs
Morphological Form Classes Regularly Inflected Verbs
Stem walk merge try map
-s form walks merges tries maps
-ing form walking merging trying mapping
Past form or –ed participle walked merged tried mapped
14CSC 9010- NLP - 3: Morphology, Finite State Transducers
Irregular (English) Verbs
Morphological Form Classes Irregularly Inflected Verbs
Stem eat catch cut
-s form eats catches cuts
-ing form eating catching cutting
Past form ate caught cut
-ed participle eaten caught cut
15CSC 9010- NLP - 3: Morphology, Finite State Transducers
“To love” in Spanish
16CSC 9010- NLP - 3: Morphology, Finite State Transducers
Syntax and Morphology
Phrase-level agreementSubject-Verb
– John studies hard (STUDY+3SG)
Noun-Adjective– Las vacas hermosas
Sub-word phrasal structuresנויספרבש
נו+ים+ספר+ב+ש
That+in+book+PL+Poss:1PLWhich are in our books
17CSC 9010- NLP - 3: Morphology, Finite State Transducers
Phonology and Morphology
Script Limitations
Spoken English has 14 vowels– heed hid hayed head had hoed hood who’d hide
how’d taught Tut toy enough
English Alphabet has 5– Use vowel combinatios: far fair fare– Consonantal doubling (hopping vs. hoping)
18CSC 9010- NLP - 3: Morphology, Finite State Transducers
Computational MorphologyApproaches
Lexicon onlyRules onlyLexicon and Rules
– Finite-state Automata– Finite-state Transducers
SystemsWordNet’s morphyPCKimmo
– Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay
– Accurate but complex– http://www.sil.org/pckimmo/
Two-level morphology– Commercial version available from InXight Corp.
BackgroundChapter 3 of Jurafsky and MartinA short history of Two-Level Morphology
– http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/
19CSC 9010- NLP - 3: Morphology, Finite State Transducers
Computational MorphologyWORD STEM (+FEATURES)*
cats cat +N +PLcat cat +N +SGcities city +N +PLgeese goose +N +PLducks (duck +N +PL) or
(duck +V +3SG)merging merge +V +PRES-PARTcaught (catch +V +PAST-PART) or
(catch +V +PAST)
20CSC 9010- NLP - 3: Morphology, Finite State Transducers
FSAs and the Lexicon
First we’ll capture the morphotacticsThe rules governing the ordering of affixes in a language.
Then we’ll add in the actual words
21CSC 9010- NLP - 3: Morphology, Finite State Transducers
Simple Rules
22CSC 9010- NLP - 3: Morphology, Finite State Transducers
Adding the Words
23CSC 9010- NLP - 3: Morphology, Finite State Transducers
Derivational Rules
24CSC 9010- NLP - 3: Morphology, Finite State Transducers
Parsing/Generation vs. Recognition
Recognition is usually not quite what we need. Usually if we find some string in the language we need to find the structure in it (parsing)Or we have some structure and we want to produce a surface form (production/generation)
ExampleFrom “cats” to “cat +N +PL” and back
Morphological analysis
25CSC 9010- NLP - 3: Morphology, Finite State Transducers
Finite State Transducers
The simple storyAdd another tapeAdd extra symbols to the transitions
On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.
26CSC 9010- NLP - 3: Morphology, Finite State Transducers
FSTs
27CSC 9010- NLP - 3: Morphology, Finite State Transducers
Transitions
c:c means read a c on one tape and write a c on the other+N:ε means read a +N symbol on one tape and write nothing on the other+PL:s means read +PL and write an s
c:c a:a t:t +N:ε +PL:s
28CSC 9010- NLP - 3: Morphology, Finite State Transducers
Ambiguity
Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
Didn’t matter which path was actually traversed
In FSTs the path to an accept state does matter since different paths represent different parses and different outputs will result
29CSC 9010- NLP - 3: Morphology, Finite State Transducers
Ambiguity
What’s the right parse forUnionizableUnion-ize-ableUn-ion-ize-able
Each represents a valid path through the derivational morphology machine.
30CSC 9010- NLP - 3: Morphology, Finite State Transducers
Ambiguity
There are a number of ways to deal with this problem
Simply take the first output foundFind all the possible outputs (all paths) and return them all (without choosing)Bias the search so that only one or a few likely paths are explored
31CSC 9010- NLP - 3: Morphology, Finite State Transducers
The Gory Details
Of course, its not as easy as “cat +N +PL” <-> “cats”
As we saw earlier there are geese, mice and oxenBut there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
Cats vs Dogs
Multi-tape machines
32CSC 9010- NLP - 3: Morphology, Finite State Transducers
Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
33CSC 9010- NLP - 3: Morphology, Finite State Transducers
Lexical to Intermediate Level
34CSC 9010- NLP - 3: Morphology, Finite State Transducers
Intermediate to Surface
The add an “e” rule as in fox^s# <-> foxes
35CSC 9010- NLP - 3: Morphology, Finite State Transducers
Foxes
36CSC 9010- NLP - 3: Morphology, Finite State Transducers
Foxes
37CSC 9010- NLP - 3: Morphology, Finite State Transducers
FST Review
FSTs allow us to take an input and deliver a structure based on itOr… take a structure and create a surface formOr take a structure and create another structure
In many applications its convenient to decompose the problem into a set of cascaded transducers where
The output of one feeds into the input of the next.We’ll see this scheme again for deeper semantic processing.
38CSC 9010- NLP - 3: Morphology, Finite State Transducers
Overall Plan
39CSC 9010- NLP - 3: Morphology, Finite State Transducers
Lexicon-only Morphology
acclaim acclaim $N$
acclaim acclaim $V+0$
acclaimed acclaim $V+ed$
acclaimed acclaim $V+en$
acclaiming acclaim $V+ing$
acclaims acclaim $N+s$
acclaims acclaim $V+s$
acclamation acclamation $N$
acclamations acclamation $N+s$
acclimate acclimate $V+0$
acclimated acclimate $V+ed$
acclimated acclimate $V+en$
acclimates acclimate $V+s$
acclimating acclimate $V+ing$
• The lexicon lists all surface level and lexical level pairs
• No rules …
• Analysis/Generation is easy
• Very large for English
• What about
•Arabic or
•Turkish or
• Chinese?
40CSC 9010- NLP - 3: Morphology, Finite State Transducers
Stemming vs Morphology
Sometimes you just need to know the stem of a word and you don’t care about the structure.In fact you may not even care if you get the right stem, as long as you get a consistent string.This is stemming… it most often shows up in IR applications
41CSC 9010- NLP - 3: Morphology, Finite State Transducers
Stemming in IR
Run a stemmer on the documents to be indexedRun a stemmer on users queriesMatch
This is basically a form of hashing
Example: Computerizationization -> -ize computerizeize -> ε computer
42CSC 9010- NLP - 3: Morphology, Finite State Transducers
Porter StemmerStep 4: Derivational Morphology I: Multiple Suffixes (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli - > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible
43CSC 9010- NLP - 3: Morphology, Finite State Transducers
Porter
No lexicon neededBasically a set of staged sets of rewrite rules that strip suffixesHandles both inflectional and derivational suffixes
Doesn’t guarantee that the resulting stem is really a stem (see first bullet)
Lack of guarantee doesn’t matter for IR
44CSC 9010- NLP - 3: Morphology, Finite State Transducers
Porter StemmerErrors of Omission
European Europeanalysis analyzesmatrices matrixnoise noisyexplain explanation
Errors of Commissionorganization organdoing doegeneralization genericnumerical numerousuniversity universe
45CSC 9010- NLP - 3: Morphology, Finite State Transducers
Soundex
You work as the Villanova telephone operator. Someone calls looking for:
Dr Papalarsky or Dr Matuzka
???????? What do you type as your query string?
46CSC 9010- NLP - 3: Morphology, Finite State Transducers
Soundex
1. Keep the first letter2. Drop non-initial occurrences of vowels, h, w
and y3. Replace the remaining letters with numbers
according to group (e.g.. b, f, p, and v -> 14. Replace strings of identical numbers with a
single number (333 -> 3)5. Drop any numbers beyond a third one
47CSC 9010- NLP - 3: Morphology, Finite State Transducers
Soundex
Effect is to map (hash) all similar sounding transcriptions to the same code.Structure your directory so that it can be accessed by code as well as by correct spellingUsed for census records, phone directories, author searches in libraries etc.