
Part-of-speech tagging

A simple but useful form of linguistic analysis

Christopher Manning


Parts of Speech

• Perhaps starting with Aristotle in the West (384–322 BCE), there was the idea of having parts of speech
  • a.k.a. lexical categories, word classes, "tags", POS

• The idea that there are 8 parts of speech, which is still with us, comes from Dionysius Thrax of Alexandria (c. 100 BCE)
  • But actually his 8 aren't exactly the ones we are taught today

• Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun

• School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection


Open class (lexical) words:
  • Nouns
    – Proper: IBM, Italy
    – Common: cat / cats, snow
  • Verbs (Main): see, registered
  • Adjectives: old, older, oldest
  • Adverbs: slowly
  • … more

Closed class (functional) words:
  • Verbs (Modals): can, had
  • Prepositions: to, with
  • Particles: off, up
  • Determiners: the, some
  • Conjunctions: and, or
  • Pronouns: he, its
  • … more

Also on the slide: Numbers (122,312; one) and Interjections (Ow, Eh)


Open vs. Closed classes

• Closed:
  • determiners: a, an, the
  • pronouns: she, he, I
  • prepositions: on, under, over, near, by, …
  • Why "closed"? Because these classes rarely admit new members
• Open:
  • Nouns, Verbs, Adjectives, Adverbs


POS Tagging

• Words often have more than one POS: back
  • The back door = JJ
  • On my back = NN
  • Win the voters back = RB
  • Promised to back the bill = VB

• The POS tagging problem is to determine the POS tag for a particular instance of a word.


POS Tagging

• Input: Plays well with others
• Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
• Output: Plays/VBZ well/RB with/IN others/NNS
• Uses:
  • Text-to-speech (how do we pronounce "lead"?)
  • Can write regexps like (Det) Adj* N+ over the output for phrases, etc. (see the sketch below)
  • As input to or to speed up a full parser
  • If you know the tag, you can back off to it in other tasks
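To make the regexp idea concrete, here is a minimal sketch in Python; the tag-to-character encoding and the example sentence are mine, not from the slides:

```python
import re

# Output of a POS tagger as word/TAG pairs (example sentence is illustrative).
tagged = "a/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT dog/NN"
pairs = [w.rsplit("/", 1) for w in tagged.split()]

# Map each tag to one character so that token index == string index.
def code(tag):
    if tag == "DT":
        return "D"               # determiner
    if tag.startswith("JJ"):
        return "A"               # adjective (JJ, JJR, JJS)
    if tag.startswith("NN"):
        return "N"               # noun (NN, NNS, NNP, NNPS)
    return "O"                   # anything else

codes = "".join(code(t) for _, t in pairs)

# The slide's (Det) Adj* N+ becomes the regex D?A*N+ over the code string.
for m in re.finditer(r"D?A*N+", codes):
    print([w for w, _ in pairs[m.start():m.end()]])
# ['a', 'quick', 'brown', 'fox']
# ['the', 'dog']
```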

[Table on slide: Penn Treebank POS tags]


POS tagging performance

• How many tags are correct? (Tag accuracy)
  • About 97% currently
  • But the baseline is already 90%
• Baseline is the performance of the stupidest possible method (sketched below):
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
• Partly easy because
  • Many words are unambiguous
  • You get points for them (the, a, etc.) and for punctuation marks!
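A minimal sketch of that baseline method (the toy training data and function names are illustrative):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_frequent_tag):
    """Tag every word with its most frequent tag; unknown words become NN."""
    return [(w, most_frequent_tag.get(w, "NN")) for w in words]

train = [[("the", "DT"), ("back", "NN"), ("door", "NN")],
         [("promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
         [("on", "IN"), ("my", "PRP$"), ("back", "NN")]]
model = train_baseline(train)
print(tag_baseline(["the", "back", "room"], model))
# [('the', 'DT'), ('back', 'NN'), ('room', 'NN')] -- 'room' is unseen -> NN
```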


Deciding on the correct part of speech can be difficult even for people

• Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG

• All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN

• Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD


How difficult is POS tagging?

• About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech

• But they tend to be very common words. E.g., that:
  • I know that he is honest = IN
  • Yes, that play was nice = DT
  • You can't go that far = RB

• 40% of the word tokens are ambiguous


Part-of-speech tagging revisited

A simple but useful form of linguistic analysis

Christopher Manning


Sources of information

• What are the main sources of information for POS tagging?
  • Knowledge of neighboring words:

    Bill  saw    that  man  yesterday
    NNP   NN     DT    NN   NN
    VB    VB(D)  IN    VB   NN

  • Knowledge of word probabilities:
    • man is rarely used as a verb…

• The latter proves the most useful, but the former also helps


More and Better Features: Feature-based tagger

• Can do surprisingly well just looking at a word by itself:
  • Word: the → DT
  • Lowercased word: Importantly → importantly → RB
  • Prefixes: unfathomable → un- → JJ
  • Suffixes: Importantly → -ly → RB
  • Capitalization: Meridian → CAP → NNP
  • Word shapes: 35-year → d-x → JJ
• Then build a maxent (or whatever) model to predict the tag (a feature-extraction sketch follows below):
  • Maxent P(t|w): 93.7% overall / 82.6% unknown
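A sketch of such word-internal features as a Python function; the feature names and the exact shape encoding are my assumptions:

```python
import re

def word_features(word):
    """Features computable from the word alone, as listed above."""
    # Word shape: uppercase runs -> X, lowercase runs -> x, digit runs -> d.
    shape = re.sub(r"\d+", "d", re.sub(r"[a-z]+", "x", re.sub(r"[A-Z]+", "X", word)))
    return {
        "word=" + word: 1,
        "lower=" + word.lower(): 1,
        "prefix2=" + word[:2]: 1,    # e.g. un-fathomable -> 'un'
        "suffix2=" + word[-2:]: 1,   # e.g. important-ly  -> 'ly'
        "is_cap": int(word[0].isupper()),
        "shape=" + shape: 1,         # e.g. 35-year -> 'd-x'
    }

print(word_features("35-year"))   # includes 'shape=d-x'
print(word_features("Meridian"))  # includes 'is_cap': 1
```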


Overview: POS Tagging Accuracies

• Rough accuracies (overall / unknown words):
  • Most freq tag: ~90% / ~50%
  • Trigram HMM: ~95% / ~55%
  • Maxent P(t|w): 93.7% / 82.6%
  • TnT (HMM++): 96.2% / 86.0%
  • MEMM tagger: 96.9% / 86.9%
  • Bidirectional dependencies: 97.2% / 90.0%
  • Upper bound: ~98% (human agreement)
• Most errors are on unknown words


How to improve supervised results?

• Build better features!

• They/PRP left/VBD as/IN soon/RB as/IN he/PRP arrived/VBD ./.
  • here the first as is mistagged IN; the correct tag is RB
  • We could fix this with a feature that looked at the next word
• Intrinsic/NNP flaws/NNS remained/VBD undetected/VBN ./.
  • here Intrinsic is mistagged NNP; the correct tag is JJ
  • We could fix this by linking capitalized words to their lowercase versions


Tagging Without Sequence Information

• Baseline model: predict the tag t0 from the current word w0 alone
• Three Words model: predict t0 from the window w-1, w0, w1

Model      Features   Token Acc.   Unknown Acc.   Sentence Acc.
Baseline    56,805     93.69%       82.61%         26.74%
3Words     239,767     96.57%       86.78%         48.27%

• Using words only in a straight classifier works as well as a basic (HMM or discriminative) sequence model!!


Summary of POS Tagging

For tagging, the change from generative to discriminative model does not by itself result in great improvement

One profits from models that allow specifying dependence on overlapping features of the observation, such as spelling, suffix analysis, etc.

An MEMM allows integration of rich features of the observations, but can suffer strongly from assuming independence from following observations; this effect can be relieved by adding dependence on following words

This additional power (of the MEMM, CRF, and perceptron models) has been shown to result in improvements in accuracy

The higher accuracy of discriminative models comes at the price of much slower training


Introduction to Natural Language Processing (600.465)

Tagging, Tagsets, and Morphology

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic


The task of (Morphological) Tagging

• Formally: A+ → T
  – A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes); often: phonemes ~ letters
  – T is the set of tags (the "tagset")
• Recall the 6 levels of language description: phonetics ... phonology ... morphology ... syntax ... meaning ... (morphology is a step aside)
• Recall: A+ → 2^(L,C1,C2,...,Cn) → T, where the first step is morphology and the second step is tagging; tagging is disambiguation (~ "select")


Tagging Examples

• Word form: A+ → 2^(L,C1,C2,...,Cn) → T
  – He always books the violin concert tickets early.
    • MA: books → {(book-1,Noun,Pl,-,-), (book-2,Verb,Sg,Pres,3)}
    • tagging (disambiguation): ... → (Verb,Sg,Pres,3)
  – ...was pretty good. However, she did not realize...
    • MA: However → {(however-1,Conj/coord,-,-,-), (however-2,Adv,-,-,-)}
    • tagging: ... → (Conj/coord,-,-,-)
  – [æ n d] [g i v] [i t] [t u:] [j u:] ("and give it to you")
    • MA: [t u:] → {(to-1,Prep), (two,Num), (to-2,Part/inf), (too,Adv)}
    • tagging: ... → (Prep)


Tagsets

• General definition:
  – tag ~ (c1,c2,...,cn)
  – often thought of as a flat list T = {ti}, i = 1..n, with some assumed 1:1 mapping T ↔ (C1,C2,...,Cn)
• English tagsets (see MS):
  – Penn Treebank (45 tags) (VBZ: Verb,Pres,3,Sg; JJR: Adj. Comp.)
  – Brown Corpus (87), CLAWS C5 (62), London-Lund (197)


Other Language Tagsets

• Differences:
  – size (10 .. 10k tags)
  – categories covered (POS, Number, Case, Negation, ...)
  – level of detail
  – presentation (short names vs. structured ("positional"); a decoding sketch follows below)
• Example:
  – Czech: AGFS3----1A----
    with one character per position: POS, SUBPOS, GENDER, NUMBER, CASE, POSSG, POSSN, PERSON, TENSE, DCOMP, NEG, VOICE, VAR
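A tiny sketch of decoding such a positional tag; the position order follows the label list above, and the two unlabeled RESERVE slots are my assumption:

```python
# One category per character position; '-' means "not applicable".
CATEGORIES = ["POS", "SUBPOS", "GENDER", "NUMBER", "CASE", "POSSG", "POSSN",
              "PERSON", "TENSE", "DCOMP", "NEG", "VOICE",
              "RESERVE1", "RESERVE2", "VAR"]  # RESERVE slots: my assumption

def decode_positional(tag):
    """Map each character of a positional tag to its category name."""
    return {cat: ch for cat, ch in zip(CATEGORIES, tag) if ch != "-"}

print(decode_positional("AGFS3----1A----"))
# {'POS': 'A', 'SUBPOS': 'G', 'GENDER': 'F', 'NUMBER': 'S',
#  'CASE': '3', 'DCOMP': '1', 'NEG': 'A'}
```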


Tagging Inside Morphology

• Do tagging first, then morphology:
  – Formally: A+ → T → (L,C1,C2,...,Cn)
• Rationale:
  – have |T| < |(L,C1,C2,...,Cn)| (thus, less work for the tagger) and keep the mapping T → (L,C1,C2,...,Cn) unique
• Possible for some languages only ("English-like")
• The same effect within the "regular" A+ → 2^(L,C1,C2,...,Cn) → T:
  – a mapping R: (C1,C2,...,Cn) → T_reduced,
  – then a (new) unique mapping U: A+ × T_reduced → (L,T)


Lemmatization

• Full morphological analysis: MA: A+ → 2^(L,C1,C2,...,Cn)
  (recall: a lemma l ∈ L is a lexical unit (~ a dictionary entry reference))
• Lemmatization: reduced MA (see the sketch below):
  – L: A+ → 2^L; w → {l : (l,t1,t2,...,tn) ∈ MA(w)}
  – again, we need to disambiguate (want: A+ → L) (a special case of word sense disambiguation, WSD)
  – "classic" tagging does not deal with lemmatization (it assumes lemmatization is done afterwards somehow)
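A small sketch of this reduction over a toy MA dictionary (the entries are illustrative):

```python
# Toy morphological analyzer: word form -> set of (lemma, categories...) tuples.
MA = {
    "books": {("book-1", "Noun", "Pl"), ("book-2", "Verb", "Sg", "Pres", "3")},
    "booked": {("book-2", "Verb", "Past")},
}

def lemmas(word):
    """Reduced MA: keep only the lemma of each analysis."""
    return {analysis[0] for analysis in MA.get(word, set())}

print(lemmas("books"))  # {'book-1', 'book-2'} -- still needs disambiguation
```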


Morphological Analysis: Methods

• Word form list
  • books: book-2/VBZ, book-1/NNS
• Direct coding
  • endings: verbreg:s/VBZ, nounreg:s/NNS, adje:er/JJR, ...
  • (main) dictionary: book/verbreg, book/nounreg, nic/adje:nice
• Finite state machinery (FSM)
  • many "lexicons", with continuation links: reg-root-lex → reg-end-lex
  • phonology included but (often) clearly separated
• CFG, DATR, Unification, ...
  • address linguistic rather than computational phenomena
  • in fact, better suited for morphological synthesis (generation)


Word Lists

• Works for English
  – the "input" problem: repetitive hand coding
• Implementation issues:
  – search trees
  – hash tables (Perl!)
  – a (letter) trie: one letter per edge, with entries attached to the nodes, e.g. a → a,Art; a-t → at,Prep; a-n-t → ant,NN; a-n-d → and,Conj (a sketch follows below)
  – Minimization?
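A minimal letter-trie sketch, using the a/at/ant/and entries from the figure:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.entries = []    # analyses stored at this node, e.g. ('at', 'Prep')

def insert(root, word, info):
    node = root
    for letter in word:
        node = node.children.setdefault(letter, TrieNode())
    node.entries.append(info)

def lookup(root, word):
    node = root
    for letter in word:
        if letter not in node.children:
            return []
        node = node.children[letter]
    return node.entries

root = TrieNode()
for w, tag in [("a", "Art"), ("at", "Prep"), ("ant", "NN"), ("and", "Conj")]:
    insert(root, w, (w, tag))

print(lookup(root, "and"))  # [('and', 'Conj')]
print(lookup(root, "an"))   # [] -- the prefix exists, but no entry hangs there
```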


Word-internal¹ Segmentation (Direct)

• Strip prefixes (un-, dis-, ...)
• Repeat for all plausible endings:
  – Split the rest: root + ending (for every possible ending)
  – Find the root in a dictionary, keep the dictionary information
    • in particular, keep the inflection class (such as reg, noun-irreg-e, ...)
  – Find the ending, check that the inflection+prefix class match
  – If a match is found:
    • Output the root-related info (typically, the lemma(s))
    • Output the ending-related information (typically, the tag(s))
(a sketch of this loop follows below)

¹Word segmentation is a different problem (Japanese, speech in general)
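A minimal sketch of the segmentation loop above; all dictionary entries, class names, and tags are illustrative:

```python
# Toy resources: roots with their inflection classes, and endings per class.
DICTIONARY = {"book": ["nounreg", "verbreg"], "travel": ["verbreg"]}
ENDINGS = {"": {"nounreg": "NN", "verbreg": "VB"},
           "s": {"nounreg": "NNS", "verbreg": "VBZ"},
           "ed": {"verbreg": "VBD"}}
PREFIXES = ["un", "dis"]

def analyze(word):
    analyses = []
    # Strip a (possibly empty) prefix.
    for prefix in [""] + [p for p in PREFIXES if word.startswith(p)]:
        rest = word[len(prefix):]
        # Split the rest into root + ending, for every possible split point.
        for i in range(len(rest), 0, -1):
            root, ending = rest[:i], rest[i:]
            if root in DICTIONARY and ending in ENDINGS:
                for infl_class in DICTIONARY[root]:
                    tag = ENDINGS[ending].get(infl_class)  # class match check
                    if tag:
                        analyses.append((root, tag))  # lemma + tag
    return analyses

print(analyze("books"))  # [('book', 'NNS'), ('book', 'VBZ')]
```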


Finite State Machinery

• Two-level Morphology (TLM) - KIMMO
  – phonology + "morphotactics" (= morphology)
• Both components use finite-state automata:
  – phonology: "two-level rules", converted to FSA
    • e:0 ⇔ _ +:0 e:e r:r (nice+er → nicer)
  – morphology: linked lexicons
    • root-dic: book/"book" → end-noun-reg-dic
    • end-noun-reg-dic: +s/"NNS"
• Integration of the two is possible (and simple)


Finite State Transducer (for phonology)

• An FST is an FSA where
  – symbols are pairs (r:s) from finite alphabets R and S
• "Checking" run:
  – input data: a sequence of pairs; output: Yes/No (accept/do not)
  – use it as an FSA
• Analysis run:
  – input data: a sequence of only s ∈ S (TLM: surface)
  – output: sequences of r ∈ R (TLM: lexical), + lexicon "glosses" (POS tag, etc.)
• Synthesis (generation) run:
  – the same as analysis, except the roles are switched: S ↔ R
(a minimal FST sketch follows below)


Parallel Rules, Zero Symbols

• Parallel Rules:
  – each rule ~ one FST
  – run in parallel
  – if any of them fails, the path fails
• Zero symbols (on one side only, even though 0:0 is o.k.)
  – behave like any other symbol
  – e.g. (nice+er → nicer) uses the pairs e:0 and +:0


The Lexicon (for morphotactics)

• An ordinary FSA ("lexical" alphabet only)
• Used for analysis only
  – additional constraint:
    • the lexical string must pass the linked lexicon list
• Implemented as an FSA; compiled from lists of strings and lexicon links
• Example: a root lexicon holding b-o-o-k ("book") and b-a-n-k ("bank"), linked to an ending lexicon holding +s ("NNS")


TLM Analysis Example

• Bücher:
  • suppose each surface letter corresponds to the same symbol at the lexical level, except that surface ü might be lexically ü as well as u; plus zeroes (+:0), (0:0)
  • Use the FST as before.
  • Use lexicons:
    root:        Buch "book" → end-reg-uml
                 Bündni "union" → end-reg-s
    end-reg-uml: +0 "NNomSg"
                 +er "NNomPl"
  • Trace (lexical:surface, exploring both u and ü for surface ü): B:B → Bu:Bü / Bü:Bü → Buc:Büc / Büc:Büc → Buch:Büch → Buch+e:Büch0e → Buch+er:Büch0er


TLM: Generation

• Do not use the lexicon (well, you have to put the "right" lexical strings together somehow!)
• Start with a lexical string L.
• Generate all possible pairs l:s for every symbol in L.
• Find all (hopefully only 1!) traversals through the FST which end in a final state.
• From all such traversals, print out the sequence of surface letters.


TLM: Some Remarks

• Parallel FSTs (incl. the final lexicon FSA)
  – can be compiled into a (gigantic) FST
  – maybe not so gigantic (XLT - Xerox Language Tools)
• "Double-leveling" the lexicon:
  – allows for generation from (lemma, tag)
  – needs: rules with strings of unequal length
• Rule Compiler
  – Karttunen, Kay
• PC-KIMMO: free version from www.sil.org (Unix, too)


Introduction to Natural Language Processing (600.465)

Tagging (disambiguation): An Overview

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic


Rule-based Disambiguation

• Example after-morphology data (using the Penn tagset):

  I     watch   a     fly   .
  NN    NN      DT    NN    .
  PRP   VB      NN    VB
        VBP           VBP

• Rules using
  – word forms, from context & current position
  – tags, from context and current position
  – tag sets, from context and current position
  – combinations thereof


Example Rules

• Example data again:

  I     watch   a     fly   .
  NN    NN      DT    NN    .
  PRP   VB      NN    VB
        VBP           VBP

• If-then style (a sketch of applying such rules follows below):
  • DTeq,-1,Tag ⇒ NN (implies NNin,0,Set as a condition)
  • PRPeq,-1,Tag and DTeq,+1,Tag ⇒ VBP
  • {DT,NN}sub,0,Set ⇒ DT
  • {VB,VBZ,VBP,VBD,VBG}inc,+1,Tag ⇒ not DT
• Regular expressions:
  • not(<*,*,DT>,<*,*,notNN>)
  • not(<*,*,PRP>,<*,*,notVBP>,<*,*,DT>)
  • not(<*,{DT,NN}sub,notDT>)
  • not(<*,*,DT>,<*,*,{VB,VBZ,VBP,VBD,VBG}>)
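A sketch of applying such categorical rules to the ambiguity lattice; the rule encoding is mine, covers only R1-R3 (R4 adds nothing beyond R1 here), and therefore leaves some ambiguity that further rules would prune:

```python
from itertools import product

# Ambiguity lattice for: I watch a fly .
lattice = [["NN", "PRP"], ["NN", "VB", "VBP"], ["DT", "NN"],
           ["NN", "VB", "VBP"], ["."]]

def r1(path):  # a DT must be followed by NN
    return all(b == "NN" for a, b in zip(path, path[1:]) if a == "DT")

def r2(path):  # PRP ... DT two positions later forces VBP in between
    return all(path[i + 1] == "VBP" for i in range(len(path) - 2)
               if path[i] == "PRP" and path[i + 2] == "DT")

def r3(path):  # a position whose whole tag set is within {DT, NN} must be DT
    return all(t == "DT" for t, tags in zip(path, lattice)
               if set(tags) <= {"DT", "NN"})

# Keep all paths on which every rule says yes.
surviving = [p for p in product(*lattice) if r1(p) and r2(p) and r3(p)]
for p in surviving:
    print(p)
# Four paths survive, among them ('PRP', 'VBP', 'DT', 'NN', '.').
```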


Implementation

• Finite State Automata
  – parallel (each rule ~ one automaton):
    • algorithm: keep all paths on which all automata say yes
  – or compile into a single FSA (intersection)
• Algorithm:
  – a version of Viterbi search, but:
    • no probabilities ("categorical" rules)
    • multiple input:
      – keep track of all possible paths


Example: the FSA

• R1: not(<*,*,DT>,<*,*,notNN>)
• R2: not(<*,*,PRP>,<*,*,notVBP>,<*,*,DT>)
• R3: not(<*,{DT,NN}sub,notDT>)
• R4: not(<*,*,DT>,<*,*,{VB,VBZ,VBP,VBD,VBG}>)

• R1 as an automaton: final state F1 --<*,*,DT>--> final state F2 --<*,*,notNN>--> non-final state N3; "anything else" returns to F1; N3 loops on anything
• R3 as an automaton: final state F1 --<*,{DT,NN}sub,notDT>--> non-final state N2; "anything else" loops on F1; N2 loops on anything


Applying the FSA

• R1: not(<*,*,DT>,<*,*,notNN>)
• R2: not(<*,*,PRP>,<*,*,notVBP>,<*,*,DT>)
• R3: not(<*,{DT,NN}sub,notDT>)
• R4: not(<*,*,DT>,<*,*,{VB,VBZ,VBP,VBD,VBG}>)

Example data:

  I     watch   a     fly   .
  NN    NN      DT    NN    .
  PRP   VB      NN    VB
        VBP           VBP

• R1 blocks a/DT followed by fly/VB or fly/VBP; remains: a/DT fly/NN, or a/NN with any tag for fly
• R2 blocks I/PRP watch/{NN,VB} a/DT; remains e.g.: I/PRP watch/VBP a/DT, and more
• R3 blocks a/NN; remains only: a/DT
• R4 ⊆ R1!


Applying the FSA (Cont.)

• Combine the survivors over the example data:

  I     watch   a     fly   .
  NN    NN      DT    NN    .
  PRP   VB      NN    VB
        VBP           VBP

  – from R1: a/DT fly/NN, or a/NN fly/{NN,VB,VBP}
  – from R2: I/PRP watch/VBP a/DT, among others
  – from R3: a/DT only

• Result:

  I/PRP watch/VBP a/DT fly/NN ./.


Tagging by Parsing

• Build a parse tree from the multiple input:

  I     watch   a     fly   .
  NN    NN      DT    NN    .
  PRP   VB      NN    VB
        VBP           VBP

  [Tree on slide: S over the sentence, with a VP, and an NP over "a fly"]

• Track down the rules: e.g., NP → DT NN: extract (a/DT fly/NN)
• More difficult than tagging itself; results mixed


Statistical Methods (Overview)

• "Probabilistic":
  • HMM
    – Merialdo and many more (XLT)
  • Maximum Entropy
    – DellaPietra et al., Ratnaparkhi, and others
• Rule-based:
  • TBEDL (Transformation-Based, Error-Driven Learning)
    – Brill's tagger
  • Example-based
    – Daelemans, Zavrel, others
• Feature-based (inflective languages)
• Classifier Combination (Brill's ideas)


Introduction to Natural Language Processing (600.465)

HMM Tagging

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic


Review

• Recall:
  – tagging ~ morphological disambiguation
  – tagset VT ⊆ (C1,C2,...,Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
  – the mapping w → {t ∈ VT} exists [just word tagging!]
    • a restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn), where A is the language alphabet and L is the set of lemmas
  – extension to punctuation, sentence boundaries (treated as words)


The Setting

• Noisy Channel setting:

  Input (tags)         The channel         Output (words)
  NNP VBZ DT ...   →   (adds "noise")  →   John drinks the ...

• Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence)
  – p(T|W) = p(W|T) p(T) / p(W)
  – p(W) is fixed (W is given)...
  – argmax_T p(T|W) = argmax_T p(W|T) p(T)


The Model

• Two models (d = |W| = |T| is the sequence length):
  – p(W|T) = ∏i=1..d p(wi|w1,...,wi-1,t1,...,td)
  – p(T) = ∏i=1..d p(ti|t1,...,ti-1)
• Too many parameters (as always)
• Approximation using the following assumptions:
  • words do not depend on the context
  • the tag depends on a limited history: p(ti|t1,...,ti-1) ≅ p(ti|ti-n+1,...,ti-1)
    – an n-gram tag "language" model
  • a word depends on its tag only: p(wi|w1,...,wi-1,t1,...,td) ≅ p(wi|ti)


The HMM Model Definition

• (Almost) the general HMM:
  – output (words) emitted by states (not arcs)
  – states: (n-1)-tuples of tags, if an n-gram tag model is used
  – a five-tuple (S, s0, Y, PS, PY), where:
    • S = {s0,s1,s2,...,sT} is the set of states, s0 is the initial state
    • Y = {y1,y2,...,yV} is the output alphabet (the words)
    • PS(sj|si) is the set of probability distributions of transitions
      – PS(sj|si) = p(ti|ti-n+1,...,ti-1); sj = (ti-n+2,...,ti), si = (ti-n+1,...,ti-1)
    • PY(yk|si) is the set of output (emission) probability distributions
      – another simplification: PY(yk|si) = PY(yk|sj) if si and sj contain the same tag as the rightmost element: PY(yk|si) = p(wi|ti)


Supervised Learning (Manually Annotated Data Available)

• Use MLE:
  – p(wi|ti) = cwt(ti,wi) / ct(ti)
  – p(ti|ti-n+1,...,ti-1) = ctn(ti-n+1,...,ti-1,ti) / ct(n-1)(ti-n+1,...,ti-1)
• Smooth (both!):
  – p(wi|ti): "Add 1" for all possible (tag, word) pairs using a predefined dictionary (thus some zeros are kept!)
  – p(ti|ti-n+1,...,ti-1): linear interpolation,
    • e.g. for a trigram model:
      p'(ti|ti-2,ti-1) = λ3 p(ti|ti-2,ti-1) + λ2 p(ti|ti-1) + λ1 p(ti) + λ0 / |VT|
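A sketch of these MLE estimates with interpolation smoothing; the fixed λ weights and the data format are my assumptions (properly, the weights are estimated, e.g. on heldout data):

```python
from collections import Counter

def train_hmm(tagged_sentences, l3=0.6, l2=0.25, l1=0.1, l0=0.05):
    """MLE for a trigram HMM tagger with linear-interpolation smoothing."""
    emit, uni, big, tri = Counter(), Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["<s>", "<s>"] + [t for _, t in sent]
        for w, t in sent:
            emit[(t, w)] += 1
            uni[t] += 1
        for i in range(2, len(tags)):
            ctx1[tags[i - 1]] += 1
            ctx2[(tags[i - 2], tags[i - 1])] += 1
            big[(tags[i - 1], tags[i])] += 1
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
    n, vt = sum(uni.values()), len(uni)

    def p_word(w, t):  # p(w|t); the slide adds 1 over a dictionary's pairs
        return emit[(t, w)] / uni[t] if uni[t] else 0.0

    def p_tag(t, prev2, prev1):  # interpolated p'(t | prev2, prev1)
        p3 = tri[(prev2, prev1, t)] / ctx2[(prev2, prev1)] if ctx2[(prev2, prev1)] else 0.0
        p2 = big[(prev1, t)] / ctx1[prev1] if ctx1[prev1] else 0.0
        return l3 * p3 + l2 * p2 + l1 * uni[t] / n + l0 / vt

    return p_word, p_tag
```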


Unsupervised Learning

• Completely unsupervised learning is impossible
  – at least if we have the tagset given: how would we associate words with tags?
• Assumed (minimal) setting:
  – tagset known
  – dictionary/morphological analysis available (providing the possible tags for any word)
• Use the Baum-Welch algorithm
  – "tying" of the output distributions (state-emitting only; the same distribution from two states with the same "final" tag)


Comments on Unsupervised Learning

• Initialization of Baum-Welch:
  – if some annotated data are available, use them
  – keep 0 for impossible output probabilities
• Beware of:
  – degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)
  – use heldout data for cross-checking
• Supervised is almost always better


Unknown Words

• "OOV" words (out-of-vocabulary)
  – we do not have a list of possible tags for them
  – and we certainly have no output probabilities
• Solutions:
  – try all tags (uniform distribution)
  – try open-class tags (uniform or unigram distribution)
  – try to "guess" the possible tags (based on the suffix/ending): use different output distributions based on the ending (and/or other factors, such as capitalization)


Running the Tagger

• Use Viterbi (a sketch follows below):
  – remember to handle unknown words
  – single-best and n-best output are possible
• Another option:
  – always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers, no backward pass over the path)
  – introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors)
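A compact log-space Viterbi sketch, written for a bigram tag model for brevity; p_word, p_tag, and the possible_tags hook (dictionary lookup plus an OOV guesser) are assumed to exist, e.g. from the training step sketched earlier:

```python
import math

def viterbi(words, tags, p_word, p_tag, possible_tags):
    """Single-best tag sequence under a bigram HMM, in log space.
    possible_tags(w, tags) returns candidate tags for w (dictionary
    lookup, or an OOV guess such as all open-class tags)."""
    best = {"<s>": (0.0, [])}          # previous tag -> (log prob, path)
    for w in words:
        new_best = {}
        for t in possible_tags(w, tags):
            cands = []
            for prev, (score, path) in best.items():
                pw, pt = p_word(w, t), p_tag(t, prev)
                if pw > 0 and pt > 0:
                    cands.append((score + math.log(pw * pt), path + [t]))
            if cands:
                new_best[t] = max(cands)  # keep the best path ending in t
        best = new_best
    return max(best.values())[1] if best else []
```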


(Tagger) Evaluation

• A must: test data S, previously unseen (in training)
  – change the test data often if at all possible! ("feedback cheating")
  – error-rate based
• Formally:
  – Out(w) = the set of output "items" for an input "item" w
  – True(w) = the single correct output (annotation) for w
  – Errors(S) = Σi=1..|S| [Out(wi) ≠ True(wi)]
  – Correct(S) = Σi=1..|S| [True(wi) ∈ Out(wi)]
  – Generated(S) = Σi=1..|S| |Out(wi)|


Evaluation Metrics

• Accuracy: single output (tagging: each word gets a single tag)
  – Error rate: Err(S) = Errors(S) / |S|
  – Accuracy: Acc(S) = 1 - (Errors(S) / |S|) = 1 - Err(S)
• What if there are multiple (or no) outputs?
  – Recall: R(S) = Correct(S) / |S|
  – Precision: P(S) = Correct(S) / Generated(S)
  – Combination, the F measure: F = 1 / (α/P + (1-α)/R)
    • α is a weight given to precision vs. recall; for α = .5, F = 2PR/(R+P)
(a sketch follows below)
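These metrics as a small sketch (the item representation is illustrative):

```python
def evaluate(outputs, truths, alpha=0.5):
    """outputs: list of sets of proposed tags; truths: list of correct tags."""
    correct = sum(t in out for out, t in zip(outputs, truths))
    generated = sum(len(out) for out in outputs)
    recall = correct / len(truths)
    precision = correct / generated
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f

print(evaluate([{"NN", "VB"}, {"DT"}, set()], ["NN", "DT", "JJ"]))
# (0.666..., 0.666..., 0.666...)
```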


Introduction to Natural Language Processing (600.465)

Maximum Entropy Tagging

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic


The Task, Again

• Recall:
  – tagging ~ morphological disambiguation
  – tagset VT ⊆ (C1,C2,...,Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
  – the mapping w → {t ∈ VT} exists
    • a restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn), where A is the language alphabet and L is the set of lemmas
  – extension to punctuation, sentence boundaries (treated as words)


Maximum Entropy Tagging Model

• General:

  p(y,x) = (1/Z) e^(Σi=1..N λi fi(y,x))

  Task: find λi satisfying the model and the constraints:
  • Ep(fi(y,x)) = di, where
    • di = E'(fi(y,x)) (the empirical expectation, i.e., feature frequency)
• Tagging:

  p(t,x) = (1/Z) e^(Σi=1..N λi fi(t,x))

  • t ∈ Tagset
  • x ~ context (words and tags alike; say, up to three positions R/L)


Features for Tagging

• Context definition
  – two words back and ahead, two tags back, and the current word:
    • xi = (wi-2,ti-2,wi-1,ti-1,wi,wi+1,wi+2)
  – features may ask for any information from this window, e.g.:
    – previous tag is DT
    – previous two tags are PRP$ and MD, and the following word is "be"
    – current word is "an"
    – suffix of current word is "ing"
  • do not forget: the feature also contains ti, the current tag (see the sketch below):
    – feature #45: suffix of current word is "ing" & the tag is VBG ⇔ f45 = 1
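Such binary features can be written as predicates over (tag, context); the Context tuple below follows the slide's window, and f45 is the slide's example feature:

```python
from collections import namedtuple

# Context window from the slide: two words back/ahead, two tags back.
Context = namedtuple("Context", "w_2 t_2 w_1 t_1 w w1 w2")

def f45(t, x):
    """Feature #45: current word ends in 'ing' AND the proposed tag is VBG."""
    return 1 if x.w.endswith("ing") and t == "VBG" else 0

x = Context("is", "VBZ", "quickly", "RB", "running", "home", ".")
print(f45("VBG", x), f45("NN", x))  # 1 0
```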


Feature Selection

• The PC¹ way:
  – (try to) test all possible feature combinations
    • features may overlap, or be redundant; also, they may be general or specific - impossible to select manually
  – greedy selection:
    • add one feature at a time, test if there is a (good) improvement:
      – keep it if yes, return it to the pool of features if not
    • even this is costly, unless some shortcuts are made
      – see Berger & DPs for details
• The other way:
  – use some heuristic to limit the number of features

¹Politically (or, Probabilistically-stochastically) Correct


Limiting the Number of Features

• Always do (regardless of whether you're PC or not):
  – use only contexts which appear in the training data (lossless selection)
• More or less PC, but entails huge savings (in the number of features to estimate the λi weights for):
  – use only features appearing ≥ L times in the data (L ~ 10)
  – use wi-derived features which appear with rare words only
  – do not use all combinations of context (this is even "LC¹")
  – but then, use all of them, and compute the λi only once, using the Generalized Iterative Scaling algorithm

¹Linguistically Correct


Feature Examples (Context)

• From A. Ratnaparkhi (EMNLP, 1996, UPenn):
  – ti = T, wi = X (frequency c > 4):
    • ti = VBG, wi = selling
  – ti = T, wi contains an uppercase character (rare):
    • ti = NNP, tolower(wi) ≠ wi
  – ti = T, ti-1 = Y, ti-2 = X:
    • ti = VBP, ti-2 = PRP, ti-1 = RB
• Other examples of possible features:
  – ti = T, tj is X, where j is the closest left position where Y holds:
    • ti = VBZ, tj = NN, Y ⇔ tj ∈ {NNP, NNS, NN}


Feature Examples (Lexical/Unknown)

• From A. Ratnaparkhi:
  – ti = T, suffix(wi) = X (length of X < 5):
    • ti = JJ, suffix(wi) = eled (traveled, leveled, ...)
  – ti = T, prefix(wi) = X (length of X < 5):
    • ti = JJ, prefix(wi) = well (well-done, well-received, ...)
  – ti = T, wi contains a hyphen:
    • ti = JJ, '-' in wi (open-minded, short-sighted, ...)
• Another possibility, for example:
  – ti = T, wi contains X:
    • ti = NounPl, wi contains an umlaut (ä,ö,ü) (Wörter, Länge, ...)


“Specialized” Word-based Features

• List of words with the most errors (WSJ, Penn Treebank):
  – about, that, more, up, ...
• Add "specialized", detailed features:
  – ti = T, wi = X, ti-1 = Y, ti-2 = Z:
    • ti = IN, wi = about, ti-1 = NNS, ti-2 = DT
  – possible only for relatively high-frequency words
• Slightly better results (also, inconsistent [test] data)


Maximum Entropy Tagging: Results

• For details, see A. Ratnaparkhi
• Base experiment (133k words, < 3% unknown):
  – 96.31% word accuracy
• Specialized features added:
  – 96.49% word accuracy
• Consistent subset (training + test):
  – 97.04% word accuracy (97.13% with specialized features)
• This is the best result on WSJ so far.


Introduction to Natural Language Processing (600.465)

Feature-Based Tagging

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic


The Task, Again

• Recall:
  – tagging ~ morphological disambiguation
  – tagset VT ⊆ (C1,C2,...,Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
  – the mapping w → {t ∈ VT} exists
    • a restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn), where A is the language alphabet and L is the set of lemmas
  – extension to punctuation, sentence boundaries (treated as words)


Feature Selection Problems

• The main problem with Maximum Entropy [tagging]:
  – feature selection (when the number of possible features is in the hundreds of thousands or millions)
  – there is no good way
    • best so far: Berger & DP's greedy algorithm
    • heuristics (cutoff based: ignore low-count features)
• Goal:
  – few but "good" features ("good" ~ high predictive power ~ leading to low final cross entropy)


Feature-based Tagging

• Idea:
  – save on computing the weights (λi)
    • are they really so important?
  – concentrate on feature selection
• Criterion (training):
  – error rate (~ accuracy; borrows from Brill's tagger)
• Model form (probabilistic - the same as for Maximum Entropy):

  p(y|x) = (1/Z(x)) e^(Σi=1..N λi fi(y,x))

  → an exponential (or loglinear) model


Feature Weight (Lambda) Approximation

• Let Y be the sample space from which we predict (tags in our case), and fi(y,x) a binary-valued feature
• Define a "batch of features" and a "context feature":
  – B(x) = {fi; all fi's share the same context x}
  – fB(x)(x') = 1 ⇔df x ⊆ x' (x is part of x')
    • in other words, fB(x) holds wherever the context x is found
• Example:
  – f1(y,x) = 1 ⇔df y = JJ, left tag = JJ
  – f2(y,x) = 1 ⇔df y = NN, left tag = JJ
  – B(left tag = JJ) = {f1, f2} (but not, say, [y = JJ, left tag = DT])


Estimation

• Compute:

  p(y|B(x)) = (1/Z(B(x))) Σd=1..|T| δ(yd,y) fB(x)(xd)

  • the frequency of y relative to all places where any of the B(x) features holds for some y (d ranges over the training data T); Z(B(x)) is the natural normalization factor: Z(B(x)) = Σd=1..|T| fB(x)(xd)
• "Compare" it to the uniform distribution:

  α(y,B(x)) = p(y|B(x)) / (1/|Y|)

  • α(y,B(x)) > 1 where p(y|B(x)) is better than uniform, and vice versa
• If fi(y,x) holds for exactly one y (in a given context x), then we have a 1:1 relation between α(y,B(x)) and the fi(y,x) from B(x), and

  λi = log(α(y,B(x)))

  • NB: this works in constant time, independent of λj, j ≠ i (a sketch follows below)
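A sketch of this estimate computed from counts; the data representation (a context here is just the left tag) is mine:

```python
import math

# Training data: (context, tag) pairs; a context is just the left tag here.
data = [("JJ", "NN"), ("JJ", "NN"), ("JJ", "JJ"), ("DT", "NN")]
TAGS = {"NN", "JJ", "DT"}

def lambda_for(y, context):
    """lambda_i = log alpha(y, B(x)) = log(|Y| * p(y | B(x)))."""
    batch = [t for c, t in data if c == context]   # places where B(x) holds
    p_y_given_B = batch.count(y) / len(batch)
    return math.log(len(TAGS) * p_y_given_B)       # assumes a nonzero count

print(lambda_for("NN", "JJ"))  # log(3 * 2/3) = log 2 -> positive weight
print(lambda_for("JJ", "JJ"))  # log(3 * 1/3) = 0.0 -> same as uniform
```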


What we got

• Substitute:

  p(y|x) = (1/Z(x)) e^(Σi=1..N λi fi(y,x))
         = (1/Z(x)) ∏i=1..N α(y,B(x))^fi(y,x)
         = (1/Z(x)) ∏i=1..N (|Y| p(y|B(x)))^fi(y,x)
         = (1/Z'(x)) ∏i=1..N p(y|B(x))^fi(y,x)
         = (1/Z'(x)) ∏B(x'); x'⊆x p(y|B(x'))   ... Naive Bayes (independence assumption)


The Reality

• Take advantage of the exponential form of the model (do not reduce it completely to Naive Bayes):
  – vary α(y,B(x)) up and down a bit (quickly)
    • captures dependence among features
  – recompute using "true" Maximum Entropy
    • the ultimate solution
  – combine feature batches into one, with a new α(y,B(x'))
    • getting very specific features


Search for Features

• Essentially, a way to get rid of unimportant features:
  – start with a pool of features extracted from the full data
  – remove infrequent features (small threshold, < 2)
  – organize the pool into batches of features
• Selection from the pool P:
  – start with an empty S (the set of selected features)
  – try all features from the pool, compute α(y,B(x)), and compute the error rate over the training data
  – add the best feature batch permanently; stop when no correction is made [complexity: |P| × |S| × |T|]


Adding Features in Blocks, Avoiding the Search for the Best

• Still slow; solution: add the ten (or 5, or 20) best features at a time, assuming they are independent (i.e., that the next best feature would change the error rate the same way as if no intervening addition of a feature had been made).
• Still slow [(|P| × |S| × |T|)/10, or 5, or 20]; solution:
  – add all features improving the error rate by a certain threshold; then gradually lower the threshold down to the desired value; complexity [|P| × log|S| × |T|] if
    • threshold(n+1) = threshold(n) / k, k > 1 (e.g., k = 2)


Types of Features

• Position:
  – current
  – previous, next
  – defined by the closest word with a certain major POS
• Content:
  – word (w), tag (t) - left only, "Ambiguity Class" (AC) of a subtag (POS, NUMBER, GENDER, CASE, ...)
• Any combination of position and content
• Up to three combinations of (position, content)


Ambiguity Classes (AC)

• Also called "pseudowords" (MS, for the word sense disambiguation task); here: "pseudotags"
• An AC (for tagging) is a set of tags (used as an indivisible token).
  – Typically, these are the tags assigned by morphology to a given word:
    • MA(books) [restricted to tags] = {NNS, VBZ}: AC = NNS_VBZ
• Advantage: deterministic
  → looking at the ACs (and words, as before) to the right is allowed


Subtags

• Inflective languages: too many tags → data sparseness
• Make use of the separate categories (remember morphology):
  – tagset VT ⊆ (C1,C2,...,Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
• Predict (and use for context) the individual categories
• Example feature:
  – previous word is a noun, and the current CASE subtag is genitive
• Use separate ACs for subtags, too (ACPOS = N_V)


Combining Subtags

• Apply the separate predictions (POS, NUMBER) to:
  – MA(books) = {(Noun, Pl), (VerbPres, Sg)}
• Now what if the best subtags are
  – Noun for POS
  – Sg for NUMBER?
  • (Noun, Sg) is not possible for books
• Allow only the possible combinations (based on MA)
• Use an independence assumption (Tag = (C1, C2, ..., Cn)):

  (best) Tag = argmax_{Tag ∈ MA(w)} ∏i=1..|Categories| p(Ci|w,x)

  (a sketch follows below)
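A small sketch of this combination; the per-category probabilities are made up for illustration:

```python
# Per-category probabilities, e.g. from separate subtag classifiers.
p = {"POS": {"Noun": 0.6, "VerbPres": 0.4},
     "NUMBER": {"Pl": 0.45, "Sg": 0.55}}

# MA(books): the only combinations morphology allows.
ma_books = [("Noun", "Pl"), ("VerbPres", "Sg")]

def best_tag(analyses, p):
    """argmax over MA(w) of the product of per-category probabilities."""
    def score(tag):
        pos, number = tag
        return p["POS"][pos] * p["NUMBER"][number]
    return max(analyses, key=score)

print(best_tag(ma_books, p))
# ('Noun', 'Pl'): 0.6*0.45 = 0.27 beats ('VerbPres', 'Sg') = 0.22,
# even though Sg is the best NUMBER in isolation (and Noun+Sg is impossible).
```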


Smoothing

• Not needed in general (as usual for exponential models)
  – however, some basic smoothing has the advantage of not learning unnecessary features at the beginning
  – very coarse: based on ambiguity classes
    • assign the most probable tag for each AC, using MLE
    • e.g. NNS for AC = NNS_VBZ
  – last-resort smoothing: unigram tag probability
  – can even be parametrized from the outside
  – also needed during training


Overtraining

• Does not appear in general
  – as usual for exponential models
  – it does appear in relation to the training curve:
    • training accuracy keeps rising, but test accuracy does not go down until very late in the training (singletons do cause overtraining)
  [Figure on slide: training accuracy vs. test accuracy curves]


Results

• Training: Orwell's 1984 (EU, MULTEXT-EAST)
  – translated into many languages, manually annotated with tags and lexical information; ~100k training words
• Test: 20k words from the end of Orwell's 1984
• Results:

Language    Ambiguity   W+Punct Acc.   W only Acc.   Features Learnt
English     38.7        98.39          98.03         1088
Romanian    40.0        97.35          96.71         3055
Czech       46.0        92.96          90.23         2398
Estonian    40.2        95.95          94.19         2412
Hungarian   26.3        97.71          96.94         1853
Slovene     38.0        93.18          91.29         3426