
Lecture 6: POS Tagging Methods

Topics: Taggers; Rule-Based Taggers; Probabilistic Taggers; Transformation-Based Taggers (Brill); Supervised Learning

Readings: Chapter 5.4-?

February 3, 2011

CSCE 771 Natural Language Processing

– 2 – CSCE 771 Spring 2011

Overview

Last Time: Overview of POS Tags

Today: Part-of-Speech Tagging: Parts of Speech, Rule-Based Taggers, Stochastic Taggers, Transformational Taggers

Readings: Chapter 5.4-5.?

History of Tagging

Dionysius Thrax of Alexandria (circa 100 B.C.) wrote a “techne” which summarized the linguistic knowledge of the day. It introduced terminology that is still used 2,000 years later:

Syntax, diphthong, clitic, analogy

It also included eight “parts of speech,” the basis of subsequent POS descriptions of Greek, Latin, and European languages:

Noun, Verb, Pronoun, Preposition, Adverb, Conjunction, Participle, Article

History of Tagging

100 BC: Dionysius Thrax documents eight parts of speech

1959: Harris (U Penn) builds the first tagger, as part of the TDAP parser project

1963: Klein and Simmons, Computational Grammar Coder (CGC): small lexicon (1,500 exceptional words), morphological analyzer, and context disambiguator

1971: Greene and Rubin, TAGGIT: expanded the CGC with more tags (87) and a bigger dictionary; achieved 77% accuracy when applied to the Brown Corpus

1983: Marshall/Garside, CLAWS tagger: probabilistic algorithm using tag bigram probabilities

History of Tagging (continued)

1988: Church, PARTS tagger: extended the CLAWS idea; stored P(tag | word) * P(tag | previous n tags) instead of the P(word | tag) * P(tag | previous n tags) used in HMM taggers

1992: Kupiec, HMM tagger

1994: Schütze and Singer, variable-length Markov models

1994: Jelinek/Magerman, decision trees for the probabilities

1996: Ratnaparkhi, the Maximum Entropy algorithm

1997: Brill, unsupervised version of the TBL algorithm

POS Tagging

Words often have more than one POS: back
  The back door = JJ
  On my back = NN
  Win the voters back = RB
  Promised to back the bill = VB

The POS tagging problem is to determine the POS tag for a particular instance of a word.

How Hard is POS Tagging? Measuring Ambiguity

Two Methods for POS Tagging

1. Rule-based tagging (ENGTWOL)

2. Stochastic: probabilistic sequence models
   HMM (Hidden Markov Model) tagging
   MEMMs (Maximum Entropy Markov Models)

Rule-Based Tagging

Start with a dictionary

Assign all possible tags to words from the dictionary

Write rules by hand to selectively remove tags

Leaving the correct tag for each word.
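The three steps above can be sketched in a few lines of code. This is a toy illustration only: the tiny dictionary and the single elimination rule are simplified assumptions, not a real rule-based tagger.

```python
# Toy sketch of rule-based tagging (illustrative; the tag dictionary and the
# single rule below are simplified assumptions, not a full rule-based system).

# Step 1: a dictionary mapping each word to all of its possible tags.
tag_dict = {
    "she": {"PRP"},
    "promised": {"VBN", "VBD"},
    "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"},
    "the": {"DT"},
    "bill": {"NN", "VB"},
}

def assign_all_tags(sentence):
    """Step 2: attach every dictionary tag to each word."""
    return [(w, set(tag_dict[w])) for w in sentence]

def eliminate_vbn_after_pronoun(tagged):
    """Step 3: a hand-written rule -- drop VBN when VBD is also an option
    and the word follows a sentence-initial pronoun."""
    for i, (word, tags) in enumerate(tagged):
        if i == 1 and {"VBN", "VBD"} <= tags and "PRP" in tagged[0][1]:
            tags.discard("VBN")
    return tagged

sentence = ["she", "promised", "to", "back", "the", "bill"]
lattice = eliminate_vbn_after_pronoun(assign_all_tags(sentence))
for word, tags in lattice:
    print(word, sorted(tags))
```

After the rule fires, "promised" keeps only VBD; the remaining ambiguity (e.g. on "back") would be handled by further hand-written rules.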


Start With a Dictionary

she:       PRP
promised:  VBN, VBD
to:        TO
back:      VB, JJ, RB, NN
the:       DT
bill:      NN, VB

Etc. … for the ~100,000 words of English with more than 1 tag

Assign Every Possible Tag

                        NN
                        RB
          VBN           JJ         VB
PRP       VBD       TO  VB    DT   NN
She       promised  to  back  the  bill

Write Rules to Eliminate Tags

Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

                        NN
                        RB
                        JJ         VB
PRP       VBD       TO  VB    DT   NN
She       promised  to  back  the  bill

(VBN eliminated)

Stage 1 of ENGTWOL Tagging

First Stage: Run words through an FST morphological analyzer to get all parts of speech.

Example: Pavlov had shown that salivation …

Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

Stage 2 of ENGTWOL Tagging

Second Stage: Apply NEGATIVE constraints.

Example: Adverbial “that” rule
Eliminates all readings of “that” except the one in “It isn’t that odd”

Given input: “that”
If
  (+1 A/ADV/QUANT)  ; if next word is adj/adv/quantifier
  (+2 SENT-LIM)     ; following which is E-O-S
  (NOT -1 SVOC/A)   ; and the previous word is not a
                    ; verb like “consider” which
                    ; allows adjective complements
                    ; in “I consider that odd”
Then eliminate non-ADV tags
Else eliminate ADV
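The constraint above can be sketched as a small function. The word-record format, tag names, and verb list here are invented for illustration; they are not ENGTWOL's actual representation.

```python
# Minimal sketch of the adverbial-"that" negative constraint (illustrative;
# tag classes and the SVOC/A verb list are simplified assumptions).

ADJ_ADV_QUANT = {"JJ", "RB", "QUANT"}   # hypothetical adj/adv/quantifier tags
SVOC_A_VERBS = {"consider", "find"}     # verbs allowing adjective complements

def apply_that_rule(words, tags, i):
    """words: sentence tokens; tags: candidate-tag set per token;
    i: index of the token 'that'."""
    next_is_adj = i + 1 < len(words) and tags[i + 1] & ADJ_ADV_QUANT
    next_next_is_eos = i + 2 >= len(words)            # (+2 SENT-LIM)
    prev_not_svoc = i == 0 or words[i - 1] not in SVOC_A_VERBS
    if next_is_adj and next_next_is_eos and prev_not_svoc:
        tags[i] = {"ADV"}              # Then: eliminate non-ADV tags
    else:
        tags[i].discard("ADV")         # Else: eliminate ADV
    return tags

# "It isn't that odd" -> the adverbial reading of "that" is kept
words = ["it", "is", "n't", "that", "odd"]
tags = [{"PRP"}, {"VBZ"}, {"RB"}, {"ADV", "DET", "PRON", "CS"}, {"JJ"}]
print(apply_that_rule(words, tags, 3)[3])
```

In "I consider that odd" the NOT -1 SVOC/A condition fails, so the Else branch fires and the ADV reading is eliminated instead.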


Hidden Markov Model Tagging

Using an HMM to do POS tagging is a special case of Bayesian inference
  Foundational work in computational linguistics
  Bledsoe 1959: OCR
  Mosteller and Wallace 1964: authorship identification

It is also related to the “noisy channel” model that’s the basis for ASR, OCR, and MT

POS Tagging as Sequence Classification

We are given a sentence (an “observation” or “sequence of observations”)
  Secretariat is expected to race tomorrow

What is the best sequence of tags that corresponds to this sequence of observations?

Probabilistic view:
  Consider all possible sequences of tags
  Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn

Getting to HMMs

We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest.

Hat ^ means “our estimate of the best one”

Argmax_x f(x) means “the x such that f(x) is maximized”

Getting to HMMs

This equation is guaranteed to give us the best tag sequence

But how do we make it operational? How do we compute this value?

Intuition of Bayesian classification:
  Use Bayes rule to transform this equation into a set of other probabilities that are easier to compute
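Written out, the Bayes-rule step looks like this (a standard reconstruction consistent with the surrounding discussion; the slide's own equation images did not survive extraction):

```latex
\hat{t}_1^n
  = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
  = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
  = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
```

The denominator P(w_1^n) is the same for every candidate tag sequence, so it can be dropped. With the HMM independence assumptions (each word depends only on its own tag, each tag only on the previous tag), this becomes

```latex
\hat{t}_1^n \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

which is exactly the product of the two kinds of probabilities introduced on the next slides.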


Using Bayes Rule

Likelihood and Prior

Two Kinds of Probabilities

Tag transition probabilities P(ti | ti-1)
  Determiners likely to precede adjectives and nouns:
    That/DT flight/NN
    The/DT yellow/JJ hat/NN
  So we expect P(NN|DT) and P(JJ|DT) to be high, but P(DT|JJ) to be low.

Compute P(NN|DT) by counting in a labeled corpus:
  P(NN|DT) = C(DT, NN) / C(DT)

Two Kinds of Probabilities

Word likelihood probabilities P(wi | ti)
  VBZ (3sg pres verb) likely to be “is”

Compute P(is|VBZ) by counting in a labeled corpus:
  P(is|VBZ) = C(VBZ, is) / C(VBZ)
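Both counting formulas are easy to implement. The corpus below is a tiny made-up example (two sentences from the slide), so the resulting numbers are illustrative only.

```python
# Estimating HMM tag-transition and word-likelihood probabilities by counting
# in a (tiny, made-up) labeled corpus -- illustrative numbers only.
from collections import Counter

corpus = [  # each sentence is a list of (word, tag) pairs
    [("that", "DT"), ("flight", "NN")],
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")],
]

tag_bigrams, tag_counts, word_tag = Counter(), Counter(), Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    tag_counts.update(tags)
    tag_bigrams.update(zip(tags, tags[1:]))
    word_tag.update(sent)

def p_transition(t, t_prev):
    """P(t | t_prev) = C(t_prev, t) / C(t_prev)"""
    return tag_bigrams[(t_prev, t)] / tag_counts[t_prev]

def p_emission(w, t):
    """P(w | t) = C(t, w) / C(t)"""
    return word_tag[(w, t)] / tag_counts[t]

print(p_transition("NN", "DT"))    # C(DT,NN)/C(DT) = 1/2
print(p_emission("flight", "NN"))  # C(NN,flight)/C(NN) = 1/2
```

A real tagger would estimate these from a large tagged corpus such as the Brown Corpus, usually with smoothing for unseen pairs.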


Example: The Verb “race”

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR

People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

How do we pick the right tag?

Disambiguating “race”

Example

P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012

P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032

So we (correctly) choose the verb reading.
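The two products can be checked directly with the slide's numbers:

```python
# Checking the two products from the slide numerically.
p_vb = 0.83 * 0.0027 * 0.00012       # P(VB|TO) P(NR|VB) P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057    # P(NN|TO) P(NR|NN) P(race|NN)
print(f"{p_vb:.2e}")   # ~2.7e-07
print(f"{p_nn:.2e}")   # ~3.2e-10
print(p_vb > p_nn)     # the verb reading wins by about three orders of magnitude
```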


Hidden Markov Models

What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)

Definitions

A weighted finite-state automaton adds probabilities to the arcs
  The probabilities on the arcs leaving any state must sum to one

A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through

Markov chains can’t represent inherently ambiguous problems
  Useful for assigning probabilities to unambiguous sequences

Markov Chain for Weather

Markov Chain for Words

Markov Chain: “First-order observable Markov Model”

A set of states
  Q = q1, q2 … qN; the state at time t is qt

Transition probabilities
  A set of probabilities A = a01 a02 … an1 … ann
  Each aij represents the probability of transitioning from state i to state j
  The set of these is the transition probability matrix A

Current state only depends on the previous state:
  P(qi | q1 … qi-1) = P(qi | qi-1)

Markov Chain for Weather

What is the probability of 4 consecutive rainy days?

Sequence is rainy-rainy-rainy-rainy

I.e., state sequence is 3-3-3-3

P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
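Taking the numbers on the slide as given (start probability 0.2 for the rainy state and self-loop probability 0.6), the computation is just a product:

```python
# Probability of the state sequence rainy-rainy-rainy-rainy under a Markov
# chain, using the slide's numbers: pi_rainy = 0.2, a_rainy->rainy = 0.6.
pi_rainy = 0.2
a_rainy_rainy = 0.6

p = pi_rainy * a_rainy_rainy ** 3   # pi_3 * a33 * a33 * a33
print(round(p, 4))   # 0.0432
```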


HMM for Ice Cream

You are a climatologist in the year 2799

Studying global warming

You can’t find any records of the weather in Baltimore, MD for the summer of 2007

But you find Jason Eisner’s diary

Which lists how many ice creams Jason ate every day that summer

Our job: figure out how hot it was

Hidden Markov Model

For Markov chains, the output symbols are the same as the states
  See hot weather: we’re in state hot

But in part-of-speech tagging (and other things)
  The output symbols are words
  But the hidden states are part-of-speech tags

So we need an extension!

A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states.

This means we don’t know which state we are in.

Hidden Markov Models

States
  Q = q1, q2 … qN

Observations
  O = o1, o2 … oN
  Each observation is a symbol from a vocabulary V = {v1, v2, … vV}

Transition probabilities
  Transition probability matrix A = {aij}
  aij = P(qt = j | qt-1 = i), 1 ≤ i, j ≤ N

Observation likelihoods
  Output probability matrix B = {bi(k)}
  bi(k) = P(Xt = ok | qt = i)

Special initial probability vector π
  πi = P(q1 = i), 1 ≤ i ≤ N
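The components (Q, A, B, π) map directly onto plain data structures. Here they are for the ice-cream example; the specific numbers are illustrative assumptions, since the slide's own parameter figures did not survive extraction.

```python
# The HMM components (Q, A, B, pi) as plain Python data structures, for the
# ice-cream example. The specific numbers are illustrative assumptions.
states = ["HOT", "COLD"]                      # Q
vocab = [1, 2, 3]                             # V: ice creams eaten per day

pi = {"HOT": 0.8, "COLD": 0.2}                # initial probabilities

A = {                                         # A[i][j] = P(q_t = j | q_{t-1} = i)
    "HOT":  {"HOT": 0.7, "COLD": 0.3},
    "COLD": {"HOT": 0.4, "COLD": 0.6},
}

B = {                                         # B[i][k] = P(o_t = k | q_t = i)
    "HOT":  {1: 0.2, 2: 0.4, 3: 0.4},
    "COLD": {1: 0.5, 2: 0.4, 3: 0.1},
}

# Sanity check: pi, each row of A, and each row of B must sum to one.
assert abs(sum(pi.values()) - 1.0) < 1e-9
for s in states:
    assert abs(sum(A[s].values()) - 1.0) < 1e-9
    assert abs(sum(B[s].values()) - 1.0) < 1e-9
```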


Eisner Task

Given
  Ice cream observation sequence: 1,2,3,2,2,2,3 …

Produce
  Weather sequence: H,C,H,H,H,C …

HMM for Ice Cream

Transition Probabilities

Observation Likelihoods

Decoding

OK, now we have a complete model that can give us what we need. Recall that we need to get the tag sequence that maximizes P(t1…tn | w1…wn).

We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea. Luckily, dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here.

The Viterbi Algorithm

Viterbi Example

Viterbi Summary

Create an array
  With columns corresponding to inputs
  Rows corresponding to possible states

Sweep through the array in one pass, filling the columns left to right using our transition probabilities and observation probabilities

Dynamic programming key: we need only store the MAX-probability path to each cell (not all paths).
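The summary above can be sketched directly as code. This is a minimal Viterbi decoder for the ice-cream HMM; the parameter values are illustrative assumptions (the slide's own figures did not survive extraction).

```python
# Minimal Viterbi decoder for the ice-cream HMM sketch. The parameter values
# are illustrative assumptions, not taken from the slides.
states = ["HOT", "COLD"]
pi = {"HOT": 0.8, "COLD": 0.2}
A = {"HOT": {"HOT": 0.7, "COLD": 0.3}, "COLD": {"HOT": 0.4, "COLD": 0.6}}
B = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    # One column per observation, one row per state; each cell keeps only
    # the probability of the single best path reaching it, plus a backpointer.
    v = [{s: (pi[s] * B[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        col = {}
        for s in states:
            best_prev = max(states, key=lambda p: v[-1][p][0] * A[p][s])
            col[s] = (v[-1][best_prev][0] * A[best_prev][s] * B[s][o], best_prev)
        v.append(col)
    # Follow backpointers from the best final state.
    last = max(states, key=lambda s: v[-1][s][0])
    path = [last]
    for col in reversed(v[1:]):
        path.append(col[path[-1]][1])
    return list(reversed(path)), v[-1][last][0]

path, prob = viterbi([3, 1, 3])
print(path, prob)
```

Each column is filled in one left-to-right sweep, and only the max-probability predecessor is stored per cell, exactly as the summary describes; the backpointers then recover the best weather sequence.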


Evaluation

So once you have your POS tagger running, how do you evaluate it?
  Overall error rate with respect to a gold-standard test set
  Error rates on particular tags
  Error rates on particular words
  Tag confusions …

Error Analysis

Look at a confusion matrix

See what errors are causing problems
  Noun (NN) vs. Proper Noun (NNP) vs. Adj (JJ)
  Preterite (VBD) vs. Participle (VBN) vs. Adjective (JJ)
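A confusion matrix is simple to build from gold and predicted tag sequences. The tag sequences below are made up for illustration:

```python
# Building a tag confusion matrix from gold vs. predicted tags (toy data;
# the tag sequences here are made up for illustration).
from collections import Counter

gold = ["NN", "VBD", "JJ", "NNP", "VBN", "NN"]
pred = ["NN", "VBN", "JJ", "NN",  "VBN", "JJ"]

confusion = Counter(zip(gold, pred))   # (gold_tag, predicted_tag) -> count

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy = {accuracy:.2f}")

# The off-diagonal cells are the confusions worth inspecting.
for (g, p), n in sorted(confusion.items()):
    if g != p:
        print(f"gold {g} tagged as {p}: {n}")
```

On real output, the largest off-diagonal cells typically correspond to exactly the confusions listed above (NN/NNP/JJ and VBD/VBN/JJ).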


Evaluation

The result is compared with a manually coded “Gold Standard”
  Typically accuracy reaches 96-97%
  This may be compared with the result for a baseline tagger (one that uses no context)

Important: 100% is impossible even for human annotators.

Summary

Parts of speech

Tagsets

Part-of-speech tagging

HMM Tagging
  Markov Chains
  Hidden Markov Models