
TP2663 Pemprosesan Bahasa Tabii (Natural Language Processing)

Part of Speech Tagging

Part of Speech tagging

Part of speech tagging
Parts of speech
What's POS tagging good for anyhow?
Tag sets
Rule-based tagging
Statistical tagging
"TBL" tagging

Parts of Speech

8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, POS
We'll use POS most frequently

POS examples

N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those


POS Tagging: Definition

The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

WORDS: the koala put the keys on the table
TAGS:  N, V, P, DET

POS Tagging example

WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N

What is POS tagging good for?

Speech synthesis:
How to pronounce "lead"?
INsult vs. inSULT
OBject vs. obJECT
OVERflow vs. overFLOW
DIScount vs. disCOUNT
CONtent vs. conTENT

Stemming for information retrieval:
Knowing a word is a NOUN tells you it gets plurals
Can let a search for "aardvarks" also return "aardvark"

Parsing and speech recognition, etc.:
Possessive pronouns (my, your, her) are likely to be followed by nouns
Personal pronouns (I, you, he) are likely to be followed by verbs

Open and closed class words

Closed class: a relatively fixed membership
Prepositions: of, in, by, …
Auxiliaries: may, can, will, had, been, …
Pronouns: I, you, she, mine, his, them, …
Usually function words (short, common words which play a role in grammar)

Open class: new ones can be created all the time
English has 4: nouns, verbs, adjectives, adverbs
Many languages have all 4, but not all!
In Lakhota and possibly Chinese, what English treats as adjectives act more like verbs.


Open class words

Nouns
Proper nouns (Stanford University, Boulder, Neal Snider, Margaret Jacks Hall). English capitalizes these.
Common nouns (the rest). German capitalizes these.
Count nouns and mass nouns
Count: have plurals, get counted: goat/goats, one goat, two goats
Mass: don't get counted (snow, salt, communism) (*two snows)

Adverbs: tend to modify things
Unfortunately, John walked home extremely slowly yesterday
Directional/locative adverbs (here, home, downhill)
Degree adverbs (extremely, very, somewhat)
Manner adverbs (slowly, slinkily, delicately)

Verbs: in English, have morphological affixes (eat/eats/eaten)

Closed Class Words

Idiosyncratic

Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
numerals: one, two, three, third, …

Prepositions from CELEX

Frequency counts are from the COBUILD 16 million word corpus.

English particles

From Quirk et al. (1985)


Pronouns: CELEX

Conjunctions

Modal Verbs

POS tagging: Choosing a tagset

There are so many parts of speech and potential distinctions we could draw
To do POS tagging, we need to choose a standard set of tags to work with
Could pick a very coarse tagset: N, V, Adj, Adv
More commonly used is a finer-grained set, the Penn Treebank tagset: 45 tags
e.g., PRP$, WRB, WP$, VBG
Even more fine-grained tagsets exist



Using the UPenn tagset

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP …")
Except the preposition/complementizer "to", which gets its own tag TO.

POS Tagging

Part-of-speech tagging is the process of assigning part-of-speech or other lexical class markers to each word in a corpus.
Words often have more than one POS: back

The back door = JJ (adjective)
On my back = NN (noun)
Win the voters back = RB (adverb)
Promised to back the bill = VB (verb)

The POS tagging problem is to determine the POS tag for a particular instance of a word.

How hard is POS tagging? Measuring ambiguity

Unambiguous (1 tag):   35,340 word types
Ambiguous (2-7 tags):   4,100 word types
  2 tags:  3,760
  3 tags:    264
  4 tags:     61
  5 tags:     12
  6 tags:      2
  7 tags:      1

(DeRose, 1988)


3 methods for POS tagging

1. Rule-based tagging (ENGTWOL)

2. Stochastic (= probabilistic) tagging: HMM (Hidden Markov Model) tagging

3. Transformation-based tagging: Brill tagger

Rule-based tagging

Start with a dictionary
Assign all possible tags to words from the dictionary
Write rules by hand to selectively remove tags
Leaving the correct tag for each word

Start with a dictionary

she: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB

Etc. … for the ~100,000 words of English

Use the dictionary to assign every possible tag

She/PRP promised/{VBN,VBD} to/TO back/{VB,JJ,RB,NN} the/DT bill/{NN,VB}


Write rules to eliminate tags

Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

She/PRP promised/VBD to/TO back/{VB,JJ,RB,NN} the/DT bill/{NN,VB}   (VBN eliminated)
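The dictionary-plus-elimination-rules idea above can be sketched in a few lines of Python (the mini dictionary and the single rule are illustrative stand-ins, not ENGTWOL's actual lexicon or rule set):

# Rule-based tagging sketch: assign all dictionary tags, then prune with a rule.
DICT = {
    "she": {"PRP"},
    "promised": {"VBN", "VBD"},
    "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"},
    "the": {"DT"},
    "bill": {"NN", "VB"},
}

def assign_tags(words):
    # Every tag the dictionary allows; unknown words default to NN (an assumption).
    return [set(DICT.get(w.lower(), {"NN"})) for w in words]

def rule_drop_vbn_after_prp(words, tagsets):
    # Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"
    if len(words) > 1 and tagsets[0] == {"PRP"} and {"VBN", "VBD"} <= tagsets[1]:
        tagsets[1].discard("VBN")
    return tagsets

words = "She promised to back the bill".split()
for w, ts in zip(words, rule_drop_vbn_after_prp(words, assign_tags(words))):
    print(w, sorted(ts))   # "promised" is left with VBD; "back" still needs further rules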

Sample ENGTWOL Lexicon

Stage 1 of ENGTWOL Tagging

First Stage: Run words through an FST morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …

Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

Stage 2 of ENGTWOL Tagging

Second Stage: Apply constraints.
Constraints are used in a negative way.
Example: Adverbial "that" rule
Given input: "that"
If
  (+1 A/ADV/QUANT)
  (+2 SENT-LIM)
  (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV tag
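A rough Python sketch of how such a negative constraint could be applied (the tag names and the sentence-limit test are simplified; this is not ENGTWOL's constraint formalism):

# Simplified sketch of the adverbial "that" constraint.
# readings[j] is the set of candidate tags for word j.
def adverbial_that_rule(words, readings, i):
    if words[i].lower() != "that":
        return readings
    next_is_a_adv_quant = i + 1 < len(words) and readings[i + 1] & {"A", "ADV", "QUANT"}  # (+1 A/ADV/QUANT)
    then_sentence_limit = i + 2 >= len(words)            # (+2 SENT-LIM), crude end-of-sentence test
    prev_not_svoc_a = i == 0 or not (readings[i - 1] & {"SVOC/A"})   # (NOT -1 SVOC/A)
    if next_is_a_adv_quant and then_sentence_limit and prev_not_svoc_a:
        readings[i] = {"ADV"}           # eliminate the non-ADV tags
    else:
        readings[i].discard("ADV")      # eliminate the ADV tag
    return readings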


Statistical Tagging

Based on probability theory

A few slides on probability

Experiment (trial): a repeatable procedure with well-defined possible outcomes

Sample space: the complete set of outcomes

Event: any subset of outcomes from the sample space

Random variable: an uncertain outcome in a trial

Probability definitions

Probability: how likely is it to get a certain outcome? The rate of getting that outcome over all trials.

Probability of drawing a spade from 52 well-shuffled playing cards: 13/52 = 1/4 = .25

Distribution: the probabilities associated with each outcome a random variable can take
Each outcome has a probability between 0 and 1
The sum of all outcome probabilities is 1.

Probability and part of speech tags

What's the probability of drawing a 2 from a deck of 52 cards with four 2s?

P(drawing a two) = 4/52 = 1/13 ≈ .077

What's the probability of a random word (from a random dictionary page) being a verb?

P(drawing a verb) = (# of ways to get a verb) / (all words)


Probability and part of speech tags

What's the probability of a random word (from a random dictionary page) being a verb?

P(drawing a verb) = (# of ways to get a verb) / (all words)

How to compute each of these:
All words = just count all the words in the dictionary
# of ways to get a verb = the number of words which are verbs!
If a dictionary has 50,000 entries, and 10,000 are verbs, then P(V) = 10,000/50,000 = 1/5 = .20

Conditional Probability

Written P(A|B).
Let's say A is "it's raining".
Let's say P(A) in drought-stricken California is .01.
Let's say B is "it was sunny ten minutes ago".
P(A|B) means "what is the probability of it raining now, given that it was sunny 10 minutes ago?"
P(A|B) is probably way less than P(A).
Let's say P(A|B) is .0001.

Conditional Probability and Tags

P(Verb) is the probability of a randomly selected word being a verb.
P(Verb|race) is "what's the probability of a word being a verb, given that it's the word 'race'?"
Race can be a noun or a verb.
It's more likely to be a noun.
P(Verb|race) can be estimated by looking at some corpus and asking "out of all the times we saw 'race', how many were verbs?"

In the Brown corpus, race is a noun in 96 of its 98 occurrences, so P(Noun|race) = 96/98 ≈ .98 and P(Verb|race) ≈ .02.

P(V | race) = Count(race is a verb) / total Count(race)
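That estimate is just two counts over a tagged corpus; a minimal Python sketch with made-up toy data:

# Estimate P(tag | word) as Count(word, tag) / Count(word) over a tagged corpus.
from collections import Counter

tagged = [("the", "DT"), ("race", "NN"), ("is", "VBZ"),
          ("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN")]   # toy data

word_tag_counts = Counter(tagged)
word_counts = Counter(w for w, _ in tagged)

def p_tag_given_word(tag, w):
    return word_tag_counts[(w, tag)] / word_counts[w]

print(p_tag_given_word("NN", "race"))   # 2/3 in this toy corpus
print(p_tag_given_word("VB", "race"))   # 1/3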

Bayes’ Theorem

Bayes' Theorem lets us swap the order of dependence between events.
From the definition of conditional probability:

P(A|B) = P(A, B) / P(B)

Bayes' Theorem:

P(A|B) = P(B|A) P(A) / P(B)


By definition, P(A | B) = P(A & B) / P(B) and P(B | A) = P(B & A) / P(A), so P(A | B) / P(B | A) = P(A) / P(B), or P(A | B) = P(B | A) * P(A) / P(B).

Bayes’ Theorem cont

P(Rain | Win) = (P(Win|Rain) * P(Rain)) / P(Win) = (0.5 * 0.3) / 0.2 = 0.75

Bayes’ Theorem cont

P(A|B) = P(B|A) P(A) / P(B)

Suppose a horse Harry ran 100 races and won 20. One would say P(Win) = 0.2. Suppose also that 30 of these races were in wet conditions, and of these, Harry won 15. It looks like Harry does better in the rain: the probability that Harry would win given that it was raining would be 0.5. Conditional probability captures this idea: we write P(Win|Rain) = 0.5.

Two events are said to be independent of each other if the occurrence of one does not affect the probability of the occurrence of the other: P(e | e') = P(e). Using the definition of conditional probability, this is equivalent to saying

P(e & e') = P(e) * P(e').

Independence

Consider a corpus of simple sentences containing 1,273,000 words, with 1,000 uses of the word flies: 400 of them as an N and 600 of them as a V.

P(flies) = 1000 / 1,273,000 = 0.00078554595
P(flies & N) = 400 / 1,273,000 = 0.00031421838
P(flies & V) = 600 / 1,273,000 = 0.00047132757

Estimate P(V | flies):

P(V | flies) = P(flies & V) / P(flies) = 0.00047132757 / 0.00078554595 = 0.6


Most frequent tag

Some ambiguous words have a more frequent tag and a less frequent tag:Consider the word “a” in these 2 sentences:

would/MD prohibit/VB a/DT suit/NN for/IN refund/NN
of/IN section/NN 381/CD (/( a/NN )/) ./.

Which do you think is more frequent?

Counting in a corpus

We could count in a corpus
A corpus: an on-line collection of text, often linguistically annotated
The Brown Corpus: 1 million words from 1961, part-of-speech tagged at U Penn
The results for "a":

FW      3
NN      6
DT  21830

The Most Frequent Tag algorithm

For each word:
Create a dictionary entry with each possible tag for the word
Take a tagged corpus
Count the number of times each tag occurs for that word

Given a new sentence:
For each word, pick the most frequent tag for that word from the corpus.
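A minimal Python sketch of this most-frequent-tag baseline (the corpus format and the NN default for unknown words are assumptions):

# Most Frequent Tag baseline: pick each word's most common tag in a training corpus.
from collections import Counter, defaultdict

def train(tagged_sentences):
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, most_frequent, default="NN"):
    # Unknown words fall back to NN (an assumption, cf. Method 1 for unknown words).
    return [(w, most_frequent.get(w.lower(), default)) for w in words]

train_data = [[("the", "DT"), ("koala", "NN"), ("put", "VBD"),
               ("the", "DT"), ("keys", "NNS"), ("on", "IN"),
               ("the", "DT"), ("table", "NN")]]
model = train(train_data)
print(tag("the keys on the table".split(), model))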

The Most Frequent Tag algorithm: the dictionary

For each word, we said:
"Create a dictionary entry with each possible tag for the word …"

Q: Where does the dictionary come from?
A: One option is to use the same corpus that we use for computing the tags


Using a corpus to build a dictionary

The/DT City/NNP Purchasing/NNP Department/NNP ,/, the/DT jury/NN said/VBD ,/, is/VBZ lacking/VBG in/IN experienced/VBN clerical/JJ personnel/NNS …
From this sentence, the dictionary is:
clerical
department
experienced
in
is
jury
…

Hidden Markov Model (HMM) Taggers

Recall the previous example: P(V | flies) = 0.6, P(N | flies) = 0.4.
Verb was chosen because of the 60% probability. However, if the word flies is preceded by the, it is much more likely to be a noun.
An HMM is used to exploit this contextual information.

Let w1, ..., wT be a sequence of words. We want to find the sequence of lexical categories C1, ..., CT that maximizes

1. P(C1, ..., CT | w1, ..., wT)

Unfortunately, it would take far too much data to generate reasonable estimates for such sequences directly from a corpus, so direct methods cannot be applied.

However, there are approximation methods. To develop them, we must restate the problem using Bayes' rule, which says that this conditional probability equals

HMM Taggers

2. P(C1, ..., CT) * P(w1, ..., wT | C1, ..., CT) / P(w1, ..., wT)

The common denominator P(w1, ..., wT) does not affect the answer, so in effect we want to find a sequence C1, ..., CT that maximizes

3. P(C1, ..., CT) * P (w1, ..., wT | C1, ..., CT)

These probabilities can be approximated by probabilities that are easier to collect, by making some independence assumptions. These independence assumptions are not valid, but the estimates work reasonably well in practice.



Bigrams and Trigrams

The first factor in formula (3), P(C1, ..., CT), can be approximated by a sequence of probabilities based on a limited number of previous categories, say 1 or 2 previous categories. The bigram model uses P(Ci | Ci-1). The trigram model uses P(Ci | Ci-2, Ci-1). Such models are called n-gram models.

Using bigrams, the following approximation can be used:

P(C1, ..., CT) ≅ Πi=1,T P(Ci | Ci-1)

The second factor in formula (3), P(w1, ..., wT | C1, ..., CT), can be approximated by assuming that a word appears in a category independently of the words in the preceding or succeeding categories. It is approximated by:

P(w1, ..., wT | C1, ..., CT) ≅ Πi=1,T P(wi | Ci)

With these approximations, we are down to finding the sequence C1, ..., CT maximizing

Πi=1,T P(Ci | Ci-1) P(wi | Ci)

These probabilities can be estimated from a tagged corpus. For example, the probability that a V follows an N can be estimated as follows:

P(Ci = V | Ci-1 = N) ≅ Count(N at position i-1 and V at position i) / Count(N at position i-1)
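Both kinds of probabilities can be collected in one pass over a tagged corpus; a Python sketch of the maximum-likelihood estimates (no smoothing, corpus format assumed):

# Maximum-likelihood estimates of bigram HMM parameters from a tagged corpus (sketch).
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    trans = defaultdict(Counter)    # trans[prev_tag][tag]  -> Count(prev_tag followed by tag)
    emit = defaultdict(Counter)     # emit[tag][word]       -> Count(word tagged as tag)
    for sent in tagged_sentences:
        prev = "ø"                  # pseudo category at position 0
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    P_trans = {p: {t: c / sum(cs.values()) for t, c in cs.items()} for p, cs in trans.items()}
    P_emit = {t: {w: c / sum(cs.values()) for w, c in cs.items()} for t, cs in emit.items()}
    return P_trans, P_emit

# P_trans["N"]["V"] then approximates P(Ci = V | Ci-1 = N),
# and P_emit["N"]["flies"] approximates P(flies | N).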

Bigrams and Trigrams

To account for the beginning of a sentence, we invent a pseudo category ø at position 0 as the value of C0. Then for the sequence ART N V N using bigrams:

P(ART N V N) ≅ P(ART | ø) x P(N | ART) x P(V | N) x P(N | V)

Bigram Frequencies with a Toy Corpus

Table 7.4 in Allen gives some bigram frequencies for an artificially generated corpus of simple sentences. The corpus consisted of 300 sentences and had words in only four categories: N, V, ART, and P, including 833 nouns, 300 verbs, 558 articles, and 307 prepositions for a total of 1998 words. To deal with the problem of sparse data, any bigram not listed in Figure 7.4 was assumed to have a token probability of 0.0001.

P(Ci = V | Ci-1 = N) ≅ Count(N at position i-1 and V at position i) / Count(N at position i-1)


Bigram Frequencies with a Toy Corpus

Figure 7.7: A network capturing the bigram probabilities (a Markov chain)

Hidden Markov Model (HMM)

The probability that the sequence N V ART N generates the output Flies like a flower is computed as follows:

The probability of the path N V ART N, given Fig. 7.7, is .29 * .43 * .65 * 1 = 0.081. The probability of the output being Flies like a flower is computed from the output probabilities in Fig. 7.6:

P(flies | N) * P(like | V) * P(a | ART) * P(flower | N) = 0.025 * .1 * .36 * .063 ≈ 5.7 E-5

Multiplying these together gives us the likelihood that the HMM would generate the sequence: ≈ 4.6 E-6.
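The same arithmetic, written out (the numbers are the ones quoted above):

# Joint probability of tag path N V ART N generating "Flies like a flower".
transition = 0.29 * 0.43 * 0.65 * 1.0    # P(N|ø) * P(V|N) * P(ART|V) * P(N|ART)
output = 0.025 * 0.1 * 0.36 * 0.063      # P(flies|N) * P(like|V) * P(a|ART) * P(flower|N)
print(transition, output, transition * output)   # ≈ 0.081, ≈ 5.7e-5, ≈ 4.6e-6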



Figure 7.8: Encoding the 256 possible sequences, exploiting the Markov assumption

Viterbi algorithm

Find the four best sequences for the two words Flies like: the best ending with like as a V, the best ending with like as an N, the best as a P, and the best as an ART. Then use this information to find the four best sequences for the words Flies like a, each one ending in a different category. This process is repeated until all the words are accounted for. This algorithm is usually called the Viterbi algorithm (a sketch appears after the lattice below).

[Figure: Viterbi lattice for "Flies like". From the null start node: Flies/V = 0.076 * 0.0001 = 7.6 * 10-6, Flies/N = 0.025 * 0.29 = 0.00725, Flies/P = 0, Flies/ART = 0. Extending to the second word: like/V = 0.00031, like/N = 1.3 * 10-5, like/P = 0.00022, like/ART ≈ 0.]
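A compact Python sketch of Viterbi decoding for such a bigram HMM (the transition and output tables are assumed to be given, e.g., read off Figures 7.6 and 7.7; the 0.0001 floor plays the role of the toy corpus's unseen-bigram probability):

# Viterbi decoding for a bigram HMM tagger (sketch).
# trans[prev][tag] = P(tag | prev); emit[tag][word] = P(word | tag); both assumed given.
def viterbi(words, tags, trans, emit, start="ø", floor=1e-4):
    # best[t] = probability of the best tag sequence for the words so far that ends in tag t
    best = {t: trans.get(start, {}).get(t, floor) * emit.get(t, {}).get(words[0], floor)
            for t in tags}
    back = [{t: None for t in tags}]          # back-pointers, one dict per word
    for w in words[1:]:
        new_best, ptr = {}, {}
        for t in tags:
            # best previous category for a sequence ending in tag t at this word
            p, prev = max((best[q] * trans.get(q, {}).get(t, floor), q) for q in tags)
            new_best[t] = p * emit.get(t, {}).get(w, floor)
            ptr[t] = prev
        best, back = new_best, back + [ptr]
    # follow back-pointers from the best final tag
    tag_seq = [max(best, key=best.get)]
    for ptr in reversed(back[1:]):
        tag_seq.append(ptr[tag_seq[-1]])
    return list(reversed(tag_seq))

# Example call (probability tables omitted here):
# viterbi("flies like a flower".split(), ["N", "V", "ART", "P"], trans, emit)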

Transformation-Based Tagging (Brill Tagging)

A combination of rule-based and stochastic tagging methodologies:

Like rule-based tagging because rules are used to specify tags in a certain environment
Like the stochastic approach because machine learning is used, with a tagged corpus as input

Input:
a tagged corpus
a dictionary (with most frequent tags)


Transformation-Based Tagging (cont.)

Basic idea:
Set the most probable tag for each word as a start value
Change tags according to rules of the type "if word-1 is a determiner and word is a verb, then change the tag to noun", applied in a specific order

Training is done on a tagged corpus:
1. Write a set of rule templates
2. Among the instantiated rules, find the one with the highest score
3. Continue from 2 until some lowest score threshold is passed
4. Keep the ordered set of rules

Rules make errors that are corrected by later rules (see the training-loop sketch below)
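A skeleton of that training loop in Python (a candidate rule here is limited to one simple template, "change tag X to Y when the previous tag is Z"; this is a sketch, not Brill's implementation):

# Sketch of the TBL training loop over one rule template.
# A candidate rule is a triple (from_tag, to_tag, prev_tag).
def apply_rule(rule, tags):
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def tbl_train(initial_tags, gold_tags, candidate_rules, min_gain=1):
    tags, learned = list(initial_tags), []
    while True:
        # score each rule by how many errors it removes (minus any it introduces)
        scored = [(errors(tags, gold_tags) - errors(apply_rule(r, tags), gold_tags), r)
                  for r in candidate_rules]
        gain, best = max(scored, key=lambda x: x[0])
        if gain < min_gain:              # stop when no rule improves tagging enough
            break
        tags = apply_rule(best, tags)    # re-tag the corpus with the chosen rule
        learned.append(best)             # rules are kept in the order they were learned
    return learned, tags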

TBL Rule Application

The tagger labels every word with its most-likely tag.
For example, race has the following probabilities in the Brown corpus:

P(NN|race) = .98
P(VB|race) = .02

Transformation rules then make changes to tags:
"Change NN to VB when previous tag is TO"
… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN
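Applying that single transformation to an already-tagged sentence, in Python:

# Apply "Change NN to VB when previous tag is TO" to a tagged sentence (sketch).
def nn_to_vb_after_to(tagged):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "NN" and out[i - 1][1] == "TO":
            out[i] = (word, "VB")
    return out

sent = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
print(nn_to_vb_after_to(sent))   # race becomes VB; tomorrow stays NN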

TBL: Rule Learning

Two parts to a rule:
Triggering environment
Rewrite rule

The range of triggering environments of templates (from Manning & Schütze 1999: 363):

Schemas 1-9 each mark (with *) which of the neighboring tag positions t(i-3), t(i-2), t(i-1), t(i+1), t(i+2), t(i+3) form the triggering environment for the rewrite at t(i).

TBL: The Tagging Algorithm

Step 1: Label every word with its most likely tag (from the dictionary)
Step 2: Check every possible transformation and select the one which most improves tagging
Step 3: Re-tag the corpus applying the rule
Repeat steps 2-3 until some criterion is reached, e.g., X% correct with respect to the training corpus
RESULT: a sequence of transformation rules


TBL: Rule Learning (cont.)

Problem: we could apply transformations ad infinitum!
Constrain the set of transformations with "templates":
"Replace tag X with tag Y, provided tag Z or word Z' appears in some position"

Rules are learned in an ordered sequence
Rules may interact
Rules are compact and can be inspected by humans

Templates for TBL

TBL: Problems

First 100 rules achieve 96.8% accuracy
First 200 rules achieve 97.0% accuracy
Execution speed: the TBL tagger is slower than the HMM approach
Learning speed: Brill's implementation took over a day (600k tokens)

BUT …
(1) Learns a small number of simple, non-stochastic rules
(2) Can be made to work faster with FSTs
(3) Best performing algorithm on unknown words

Tagging Unknown Words

New words are added to (newspaper) language at 20+ per month
Plus many proper names …
Unknown words increase error rates by 1-2%
Method 1: assume they are nouns
Method 2: assume the unknown words have a probability distribution similar to words occurring only once in the training set
Method 3: use morphological information, e.g., words ending with -ed tend to be tagged VBN (see the sketch below)
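Method 3 can be approximated with simple suffix heuristics, sketched below (the suffix list is illustrative, not a published rule set):

# Guess a tag for an unknown word from its shape and suffix (illustrative heuristics only).
def guess_unknown_tag(word):
    if word[0].isupper():
        return "NNP"          # capitalized: guess proper noun
    if word.endswith("ed"):
        return "VBN"          # words ending with -ed tend to be tagged VBN
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ly"):
        return "RB"
    if word.endswith("s"):
        return "NNS"
    return "NN"               # otherwise fall back to Method 1: assume a noun

print(guess_unknown_tag("blargified"))   # VBN
print(guess_unknown_tag("Snorfburg"))    # NNP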


Using Morphological Information

Evaluation

The result is compared with a manually coded “Gold Standard”

Typically accuracy reaches 96-97%
This may be compared with the result for a baseline tagger (one that uses no context).

Important: 100% is impossible even for human annotators.