Natural Language Processing
Morphological Analysis
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
Natural Language Processing 1(11)
What’s in a word?
• Word processing so far:
  • Tokenization – segmenting sentences into words
  • Part-of-speech tagging – classifying words grammatically
• Words have structure:
  • runs, ran and running are inflected forms of the verb run
  • unfriendly is derived from friendly, which is derived from friend
  • induce, product and reduction have the same root duc
• Morphological analysis – exploring the structure of words
Why does morphology matter?
• Information retrieval:
  • A query for phones should match both phone and phones
• Language modeling:
  • If we have seen scrutinize, we can predict scrutinized
• Machine translation:
  • Swedish bilen corresponds to English the car
Language Variation
[Figure: chart of the number of unique word forms in 10k sentences for English, French, Dutch, Italian, Portuguese, Spanish, Danish, Swedish, German, Greek and Finnish; axis ticks at 0, 15, 30, 45 and 60]
Morphology
• Words are built up of minimal meaningful elements called morphemes:
  • played = play-ed
  • cats = cat-s
  • unfriendly = un-friend-ly
• Two types of morphemes:
  • Stems: play, cat, friend
  • Affixes: -ed, -s, un-, -ly
• Two main types of affixes:
  • Prefixes precede the stem: un-
  • Suffixes follow the stem: -ed, -s, -ly
Language Variation
[Figure: languages classified as mainly prefixing, equal prefixing and suffixing, or mainly suffixing]
Quiz 1
• Which of the following are morphemes of unbelievable?
  1. un
  2. unbe
  3. evable
  4. able
Inflection
• Inflection relates different forms of the same word

Lemma   Singular   Plural
cat     cat        cats
dog     dog        dogs
knife   knife      knives
sheep   sheep      sheep
mouse   mouse      mice

• Note:
  • The lemma is the canonical form found in dictionaries
  • Affixation sometimes involves spelling changes (knife – knives)
  • Inflection does not always involve affixation (mouse – mice)
Word Formation
• Morphological processes can be used to form new words
• Derivation = stem + affix
  friend + -ly = friendly
  un- + friendly = unfriendly
  unfriendly + -ness = unfriendliness
• Compounding = stem + stem
  järn (iron) + väg (road) = järnväg (railway)
  järnväg + korsning (crossing) = järnvägskorsning (railway crossing)
  järnvägskorsning + olycka (accident) = järnvägskorsningsolycka (railway crossing accident)
Morphological Analysis
• Morphological analysis:
  • token → lemma + part of speech + grammatical features
• Examples:
  • cats → cat+N+plur
  • played → play+V+past
  • katternas → katt+N+plur+def+gen
• Often non-deterministic (more than one solution):
  • plays → play+N+plur
  • plays → play+V+3sg
• Lemmatization:
  • token → lemma
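The mapping from tokens to analyses can be illustrated with a toy full-form lexicon. This is only a minimal sketch: the table `ANALYSES` and the function names `analyze` and `lemmatize` are hypothetical, and the entries are just the examples from this slide. It shows that one token may receive several analyses while still having a single lemma.

```python
from collections import defaultdict

# Toy full-form lexicon (hypothetical; entries taken from the slide examples).
# Each token maps to every analysis it can receive.
ANALYSES = defaultdict(list)
for token, analysis in [
    ("cats", "cat+N+plur"),
    ("played", "play+V+past"),
    ("katternas", "katt+N+plur+def+gen"),
    ("plays", "play+N+plur"),
    ("plays", "play+V+3sg"),
]:
    ANALYSES[token].append(analysis)


def analyze(token):
    """Return all analyses of a token (empty list if the token is unknown)."""
    return list(ANALYSES.get(token, []))


def lemmatize(token):
    """Return the candidate lemmas: the part of each analysis before the first '+'."""
    return {analysis.split("+")[0] for analysis in analyze(token)}


print(analyze("plays"))    # two analyses: noun plural and verb 3sg
print(lemmatize("plays"))  # one lemma, because both analyses share it
```

A real analyzer would not list every word form; the finite state methods in the next lecture avoid that.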
Quiz 2
• Which of the following pairs are cases of inflection?
  1. play – played
  2. play – player
  3. play – playing
  4. play – playground
Natural Language Processing
Finite State Morphology
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
Finite State Morphology
• Morphological analysis:
  • token → lemma + part of speech + grammatical features
• Finite state morphology:
  • Efficient implementation using finite state automata
  • Start with recognition, add output later
Finite State Automata
[Figure: recap of a finite state automaton with START and END states and transitions labeled with the symbols a, b and c. Source: Sharon Goldwater, ANLP Lecture 3, slide 3]
• States: start, end, intermediate
• Transitions between states
• Can be viewed as emitting or recognizing strings
One Word
[Figure: basic finite state automaton with start state S, end state E and a single transition that emits the word walk. Source: Sharon Goldwater, ANLP Lecture 3, slide 4]
• Start state
• End state
• Transition that emits word/stem walk
One Word and One Inflection
[Figure: automaton S → 1 → E; the first transition emits walk and the second emits +ed, giving walked. Source: Sharon Goldwater, ANLP Lecture 3, slide 5]
• Intermediate state
• First transition emits stem walk
• Second transition emits -ed
One Word and Multiple Inflections
[Figure: automaton S → 1 → E with walk on the first transition and three transitions +ed, +ing and +s between 1 and E; the three different paths give walks, walked, walking. Source: Sharon Goldwater, ANLP Lecture 3, slide 6]
• Multiple affix transitions
• Three paths: walks, walked, walking
Multiple Words and Multiple Inflections
[Figure: automaton S → 1 → E with three stem transitions walk, report and laugh from S to 1 and three affix transitions +ed, +ing and +s from 1 to E, giving laughs, laughed, laughing, walks, walked, walking, reports, reported, reporting. Source: Sharon Goldwater, ANLP Lecture 3, slide 7]
• Multiple stems
• Multiple paths: laughs, ..., walked, ..., reporting
• Implements regular verb morphology
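The automaton on this slide can be sketched as a small recognizer. This is an illustrative sketch rather than a reference implementation: the transition table `TRANSITIONS` and the function `accepts` are hypothetical names, states are named after the diagram (S, 1, E), and each transition label is treated as a whole morpheme.

```python
# Transitions of the automaton: (state, label) -> next state.
# States "S", "1" and "E" follow the diagram; labels are whole morphemes.
TRANSITIONS = {
    ("S", "walk"): "1",
    ("S", "report"): "1",
    ("S", "laugh"): "1",
    ("1", "ed"): "E",
    ("1", "ing"): "E",
    ("1", "s"): "E",
}
START, FINAL = "S", {"E"}


def accepts(word, state=START):
    """Return True if some path from `state` to a final state spells `word`."""
    if not word:
        return state in FINAL
    for (source, label), target in TRANSITIONS.items():
        # Follow a transition only if its label is a prefix of the remaining input.
        if source == state and word.startswith(label):
            if accepts(word[len(label):], target):
                return True
    return False


for form in ["walked", "reporting", "laughs", "walk", "runs"]:
    print(form, accepts(form))
```

As drawn, the automaton has no path for the bare stem, so `accepts("walk")` is false; a fuller model would add a transition for the uninflected form.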
Composition
• Constructing an FSA gets very complicated
• Build components as separate FSAs
  • L = FSA for lexicon (lemmas)
  • D = FSA for derivational morphology (optional)
  • I = FSA for inflectional morphology
• Compose L + D + I using standard algorithms
  • Each component can be composed in turn
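One way to read L + D + I is as concatenation of the component automata, and the sketch below assumes that reading. Everything in it is hypothetical: the dictionary representation, the helper names `make_fsa`, `concatenate` and `accepts`, and the toy contents of L, D and I. An empty label stands for an optional component, and each FSA is assumed to be acyclic with one start and one final state.

```python
def make_fsa(labels):
    """Two-state FSA with one arc per label from state 0 to state 1."""
    return {"start": 0, "final": 1, "arcs": {(0, label, 1) for label in labels}}


def concatenate(first, second):
    """Chain two FSAs by merging first's final state with second's start state.

    Assumes integer states, a single start and final state per FSA,
    and no cycles through the merged state.
    """
    n_states = 1 + max(max(src, dst) for src, _, dst in first["arcs"])

    def rename(state):
        # Second's start becomes first's final; other states are shifted.
        return first["final"] if state == second["start"] else state + n_states

    arcs = set(first["arcs"])
    arcs |= {(rename(src), label, rename(dst)) for src, label, dst in second["arcs"]}
    return {"start": first["start"], "final": rename(second["final"]), "arcs": arcs}


def accepts(fsa, word, state=None):
    """Return True if some path from `state` to the final state spells `word`."""
    state = fsa["start"] if state is None else state
    if not word and state == fsa["final"]:
        return True
    return any(
        word.startswith(label) and accepts(fsa, word[len(label):], dst)
        for src, label, dst in fsa["arcs"]
        if src == state
    )


L = make_fsa(["walk", "report"])  # lexicon (lemmas)
D = make_fsa(["", "er"])          # optional derivational suffix
I = make_fsa(["", "s"])           # optional inflectional suffix
MORPHOLOGY = concatenate(concatenate(L, D), I)

print(accepts(MORPHOLOGY, "reporters"))
```

Keeping L, D and I separate means a new lemma is added to L only, without touching the affix components.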
Finite State Transducers
• FSAs can be used as morphological recognizers
• A morphological analyzer should produce output:
  • walked → walk+V+past
  • reporting → report+V+prog
• Use a finite-state transducer (FST)
• Replace symbols with input-output pairs x : y
FST for Verbs
[Figure: FST for regular verbs: a verb-reg arc for the stem, followed by +V:, followed by one of +3sg:s, +prog:ing or +past:ed. Source: Sharon Goldwater, ANLP Lecture 3, slide 21]
where x means x:x and x: means x:ε.
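The transducer for regular verbs can be sketched by pairing an analysis-side symbol with a surface-side string on every arc. This is a minimal sketch under stated assumptions: the stem list `REGULAR_VERBS` is a hypothetical stand-in for verb-reg, the names `ARCS` and `analyze` are illustrative, the empty string plays the role of ε, and the -s arc uses the +3sg tag from the walks → walk+V+3sg examples in these slides.

```python
REGULAR_VERBS = ["walk", "report", "laugh"]  # hypothetical verb-reg lexicon

# Arcs as (source, upper, lower, target): upper is the analysis side,
# lower is the surface side, and "" plays the role of epsilon.
ARCS = [(0, stem, stem, 1) for stem in REGULAR_VERBS] + [
    (1, "+V", "", 2),
    (2, "+3sg", "s", 3),
    (2, "+prog", "ing", 3),
    (2, "+past", "ed", 3),
]
START, FINAL = 0, 3


def analyze(surface, state=START):
    """Return every upper-side string whose lower side spells `surface`."""
    results = []
    if not surface and state == FINAL:
        results.append("")
    for source, upper, lower, target in ARCS:
        # Consume the lower (surface) string and collect the upper (analysis) symbol.
        if source == state and surface.startswith(lower):
            for rest in analyze(surface[len(lower):], target):
                results.append(upper + rest)
    return results


print(analyze("walked"))
print(analyze("reports"))
```

Reading the same arcs in the other direction, from upper to lower, would turn the analyzer into a generator of surface forms.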
Disambiguation
• FSTs often produce multiple analyses for a single form:
  • walks → walk+V+3sg
  • walks → walk+N+plur
• Can be combined with statistical taggers for disambiguation
Quiz
• What analysis does the FST above give for the word walking?
  1. walk+V+3sg
  2. walk+V+past
  3. walk+V+prog
Natural Language Processing
Stemming
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
Stemming
• Stemming = find the stem by stripping off affixes
  • play → play
  • replayed → re-play-ed
  • computerized → comput-er-ize-d
• Simplified morphological analysis
  • Group tokens that contain the same stem
  • Usually no distinction between inflection and derivation
  • Useful for certain types of application
• Not the same as lemmatization

Word         Stem     Lemma
played       play     play
replayed     play     replay
unfriendly   friend   unfriendly
The Porter Stemmer
• The Porter stemmer
  • Widely used stemming algorithm for English
  • Ported to other languages as well
• Methodology:
  • A sequence of steps strips off successive layers of affixes
  • Only the first matching rule in each step is applied
  • Later steps may “clean up” unfortunate side effects
Example: Step 1
Rule                     Condition    Example              Exception
1.1  (X)-sses → -ss                   caresses → caress
1.2  (X)-ies → -i                     ponies → poni
1.3  (X)-ss → -ss                     caress → caress
1.4  (X)-s → ε           if VC ∈ X    cats → cat           bus ↛ bu
• Rule 1.4 removes inflectional -s
• Stem is required to contain a VC sequence
• Rules 1.1–1.3 catch specific patterns that would lead to errors
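The Step 1 rules in the table can be sketched as an ordered chain of suffix tests. This is a simplified illustration, not the full Porter stemmer: the function name `step1` is hypothetical, and vowels are assumed to be just a, e, i, o, u (the treatment of y is ignored).

```python
import re

# A vowel followed by a consonant (assumption: vowels are a, e, i, o, u only).
VC = re.compile(r"[aeiou][^aeiou]")


def step1(word):
    """Apply only the first rule whose suffix matches, as in the table."""
    if word.endswith("sses"):
        return word[:-2]   # 1.1: -sses -> -ss
    if word.endswith("ies"):
        return word[:-2]   # 1.2: -ies -> -i
    if word.endswith("ss"):
        return word        # 1.3: -ss -> -ss (blocks rule 1.4)
    if word.endswith("s"):
        stem = word[:-1]
        # 1.4: strip -s only if the remainder X contains a VC sequence.
        return stem if VC.search(stem) else word
    return word


for word in ["caresses", "ponies", "caress", "cats", "bus"]:
    print(word, "->", step1(word))
```

Because rule 1.3 is tested before rule 1.4, caress is left alone instead of being cut to cares.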
Example: Steps 2a and 2b
Rule                      Condition            Example                 Exception
2a.1  (X)-eed → -ee       if V ∈ X             agreed → agree          feed ↛ fee
2a.2  (X)-ed → ε          if V ∈ X             plastered → plaster     bled ↛ bl
2a.3  (X)-ing → ε         if V ∈ X             motoring → motor        sing ↛ s
2b.1  -at → -ate                               conflat(ed) → conflate
2b.2  -bl → -ble                               troubl(ed) → trouble
2b.3  -iz → -ize                               siz(ed) → size
2b.4  -CC → C             if C ∉ {l, s, z}     putt(ing) → put         fall(ing) ↛ fal
2b.5  -C₁VC₂ → C₁VC₂e     if C₂ ∉ {w, x, y}    fil(ing) → file         fail(ing) ↛ faile
• Step 2a handles verb inflections (-ed, -ing)
• Step 2b “cleans up” exceptional cases
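The interplay of Steps 2a and 2b can be sketched as follows. This is a simplified sketch under several assumptions: the function names are hypothetical, vowels are taken to be a, e, i, o, u only, Step 2b is run only after rule 2a.2 or 2a.3 has removed a suffix (as the (ed) and (ing) notation in the examples suggests), and rule 2b.5 is left out for brevity.

```python
VOWELS = set("aeiou")  # assumption: y is never treated as a vowel here


def has_vowel(text):
    return any(ch in VOWELS for ch in text)


def step2b(stem):
    """Clean up a stem left behind by rule 2a.2 or 2a.3 (rules 2b.1 to 2b.4)."""
    for suffix in ("at", "bl", "iz"):
        if stem.endswith(suffix):
            return stem + "e"                  # 2b.1, 2b.2, 2b.3
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and stem[-1] not in VOWELS | set("lsz")):
        return stem[:-1]                       # 2b.4: double consonant -> single
    return stem


def step2(word):
    """Apply the first rule of Step 2a whose suffix matches, then Step 2b."""
    if word.endswith("eed"):
        # 2a.1: -eed -> -ee if the part before -eed contains a vowel.
        return word[:-1] if has_vowel(word[:-3]) else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # 2a.2 / 2a.3: strip the suffix only if the stem contains a vowel.
            return step2b(stem) if has_vowel(stem) else word
    return word


for word in ["agreed", "feed", "plastered", "conflated", "putting", "falling"]:
    print(word, "->", step2(word))
```

Testing -eed before -ed is what keeps feed from being passed on to rule 2a.2, since only the first matching rule in a step is applied.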
Example: Step 5
Rule                     Condition    Example
5.1  (X)-icate → -ic     if VC ∈ X    triplicate → triplic
5.2  (X)-ative → ε       if VC ∈ X    formative → form
5.3  (X)-alize → -al     if VC ∈ X    formalize → formal
5.4  (X)-iciti → -ic     if VC ∈ X    electriciti → electric
5.5  (X)-ical → -ic      if VC ∈ X    electrical → electric
5.6  (X)-ful → ε         if VC ∈ X    hopeful → hope
5.7  (X)-ness → ε        if VC ∈ X    goodness → good
• Step 5 handles (some) derivational endings
• Some rules presuppose earlier steps (electricity → electriciti)
• Rules for all steps can be found in Jurafsky & Martin
Example Outputs
• Successful:
  • computers → computer → comput
  • computing → comput
  • singing → sing
  • controlling → controll → control
  • generalizations → generalization → generalize → general
• Unsuccessful:
  • elephants → elephant → eleph
  • doing → do → doe
From Stemming to Lemmatization
• Advantages of stemming:
  • Simple and efficient
  • No lexicon required
• Similar techniques can be used for lightweight lemmatization:
  • Porter-style rules can handle regular inflection
  • Irregular inflection can be added as more specific special cases
    1. mice → mouse
    2. women → woman
    ...
    101. (X)-s → ε
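The ordered rule list above can be sketched as a first-match rewrite system. This is only an illustration: `RULES` and `lemmatize` are hypothetical names, and the list contains just the two special cases and the general rule shown on the slide, so the elided rules in between are not filled in.

```python
import re

# Ordered rules as (pattern, replacement). Specific exceptions come first and
# the general suffix rule comes last; only the first matching rule is applied.
RULES = [
    (re.compile(r"^mice$"), "mouse"),
    (re.compile(r"^women$"), "woman"),
    # further special cases would be listed here
    (re.compile(r"s$"), ""),  # general rule: (X)-s -> ε
]


def lemmatize(token):
    """Rewrite the token with the first rule that matches, else return it unchanged."""
    for pattern, replacement in RULES:
        if pattern.search(token):
            return pattern.sub(replacement, token)
    return token


for token in ["mice", "women", "cats", "dog"]:
    print(token, "->", lemmatize(token))
```

The ordering carries the logic: an irregular form is rewritten by its own rule before the general -s rule is ever tried.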
Quiz
Rule               Condition
(X)-sses → -ss
(X)-ies → -i
(X)-ss → -ss
(X)-s → ε          if VC ∈ X
• Which tokens are lemmatized correctly by the rules above?
  1. dog
  2. dogs
  3. bus
  4. buses