Transcript of "Part-of-Speech Tagging", Foundations of Statistical NLP, Chapter 10.

Page 1

Part-of-Speech Tagging

Foundations of Statistical NLP

CHAPTER 10

Page 2

Contents

Markov Model Taggers

Hidden Markov Model Taggers

Transformation-Based Learning of Tags

Tagging Accuracy and Uses of Taggers

Page 3

Markov Model Taggers

Markov properties

Limited horizon:

$$P(X_{i+1} = t^j \mid X_1, \ldots, X_i) = P(X_{i+1} = t^j \mid X_i)$$

Time invariant:

$$P(X_{i+1} = t^j \mid X_i) = P(X_2 = t^j \mid X_1)$$

cf. Wh-extraction (Chomsky): natural language has long-distance dependencies that the limited-horizon property cannot capture; in (b), "which book" depends on "buy" across the whole clause.

a. Should Peter buy a book?

b. Which book should Peter buy?

Page 4

Markov Model Taggers

The probabilistic model

Finding the best tagging $t_{1,n}$ for a sentence $w_{1,n}$:

$$\hat{t}_{1,n} = \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

ex: P(AT NN BEZ IN AT NN | The bear is on the move)

Page 5

Assumptions:
• words are independent of each other
• a word's identity only depends on its tag

$$\arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n})\,P(t_{1,n})}{P(w_{1,n})} = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\,P(t_{1,n})$$

$$
\begin{aligned}
P(w_{1,n} \mid t_{1,n})\,P(t_{1,n})
&= \prod_{i=1}^{n} P(w_i \mid t_{1,n}) \times P(t_n \mid t_{1,n-1})\,P(t_{n-1} \mid t_{1,n-2}) \cdots P(t_2 \mid t_1)\,P(t_1) \\
&= \prod_{i=1}^{n} P(w_i \mid t_i) \times P(t_n \mid t_{n-1})\,P(t_{n-1} \mid t_{n-2}) \cdots P(t_2 \mid t_1)\,P(t_1) \\
&= \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})
\end{aligned}
$$

where $t_0$ is a designated sentence-initial tag, so that $P(t_1) = P(t_1 \mid t_0)$.

Page 6

Markov Model Taggers

Training

for all tags $t^j$ do
    for all tags $t^k$ do
        $P(t^k \mid t^j) := C(t^j, t^k) \,/\, C(t^j)$
    end
end

for all tags $t^j$ do
    for all words $w^l$ do
        $P(w^l \mid t^j) := C(w^l : t^j) \,/\, C(t^j)$
    end
end
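The nested loops above are just relative-frequency counting. A minimal Python sketch of this training step; the (word, tag) input format and the PERIOD start tag are assumptions, the latter following the book's convention of treating the sentence boundary as a tag:

from collections import defaultdict

def train_mm_tagger(tagged_sentences, start_tag="PERIOD"):
    """MLE training for a bigram Markov model tagger.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns transition estimates P(t^k | t^j) and emission estimates P(w^l | t^j).
    """
    tag_count = defaultdict(int)      # C(t^j): occurrences of each tag
    context_count = defaultdict(int)  # occurrences of each tag as a left context
    trans_count = defaultdict(int)    # C(t^j, t^k): tag bigram counts
    emit_count = defaultdict(int)     # C(w^l : t^j): word-tag pair counts

    for sentence in tagged_sentences:
        prev = start_tag              # sentence boundary acts as the initial tag
        for word, tag in sentence:
            tag_count[tag] += 1
            context_count[prev] += 1
            trans_count[(prev, tag)] += 1
            emit_count[(word, tag)] += 1
            prev = tag

    # relative frequencies, exactly as in the pseudocode above
    trans = {bg: c / context_count[bg[0]] for bg, c in trans_count.items()}
    emit = {pair: c / tag_count[pair[1]] for pair, c in emit_count.items()}
    return trans, emit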

Page 7

Tag transition counts $C(t^j, t^k)$ (row = first tag, column = second tag):

First tag     AT     BEZ     IN      NN     VB   PERIOD
AT             0       0      0   48636      0       19
BEZ         1973       0    426     187      0       38
IN         43322       0   1325   17314      0      185
NN          1067    3720  42470   11773    614    21392
VB          6072      42   4758    1476    129     1522
PERIOD      8016      75   4656    1329    954        0

Word emission counts $C(w^l : t^j)$ (row = word, column = tag):

              AT     BEZ     IN      NN     VB   PERIOD
bear           0       0      0      10     43        0
is             0   10065      0       0      0        0
move           0       0      0      36    133        0
on             0       0   5484      0      0        0
president      0       0      0     382      0        0
progress       0       0      0     108      4        0
the        69016       0      0       0      0        0
.              0       0      0       0      0    48809
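To see how the training equations use these counts: both $P(\text{NN} \mid \text{AT})$ and $P(\text{PERIOD} \mid \text{AT})$ share the denominator $C(\text{AT})$, so their ratio is $48636 / 19 \approx 2560$; after an article, a noun is overwhelmingly more likely than a sentence-final period, and a following verb ($C(\text{AT}, \text{VB}) = 0$) is never observed at all.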

Page 8

Markov Model Taggers

Tagging (the Viterbi algorithm)
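The slide gives only the heading. A compact sketch of Viterbi decoding for the model above, assuming the trans/emit dictionaries produced by the training sketch on page 6 (missing entries are treated as probability zero):

def viterbi(words, tags, trans, emit, start="PERIOD"):
    """Find argmax over tag sequences of prod_i P(w_i|t_i) P(t_i|t_{i-1})."""
    # delta[t]: probability of the best tag sequence for words seen so far ending in t
    delta = {t: trans.get((start, t), 0.0) * emit.get((words[0], t), 0.0)
             for t in tags}
    backptrs = []  # backptrs[k][t]: best tag at position k given tag t at position k+1
    for w in words[1:]:
        psi, new_delta = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda tp: delta[tp] * trans.get((tp, t), 0.0))
            psi[t] = best_prev
            new_delta[t] = (delta[best_prev]
                            * trans.get((best_prev, t), 0.0)
                            * emit.get((w, t), 0.0))
        backptrs.append(psi)
        delta = new_delta
    # follow backpointers from the best final tag
    best = max(tags, key=lambda t: delta[t])
    seq = [best]
    for psi in reversed(backptrs):
        seq.append(psi[seq[-1]])
    return list(reversed(seq))

With the page 7 counts converted to probabilities, viterbi("the bear is on the move .".split(), tags, trans, emit) should recover AT NN BEZ IN AT NN PERIOD, since $C(\text{AT}, \text{VB}) = 0$ forces the noun readings of "bear" and "move".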

Page 9

Variations

The models for unknown words:

1. assuming that they can be any part of speech

2. using morphological and orthographic features to make inferences about a word's possible parts of speech

Page 10

$$P(w^l \mid t^j) = \frac{1}{Z}\, P(\text{unknown word} \mid t^j)\, P(\text{capitalized} \mid t^j)\, P(\text{endings/hyph} \mid t^j)$$

Z: normalization constant
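A sketch of how the three factors could be combined in code; the feature extraction and the component probability tables are assumptions for illustration, not the book's implementation:

def p_unknown_word(word, tag, p_unk, p_cap, p_ending, z):
    """P(w^l|t^j) ~ (1/Z) P(unknown|t) P(capitalized|t) P(ending/hyphen|t)."""
    capitalized = word[:1].isupper()
    # crude ending feature: 'hyphen' for hyphenated words, else last three characters
    feature = "hyphen" if "-" in word else word[-3:]
    return (p_unk.get(tag, 0.0)
            * p_cap.get((tag, capitalized), 0.0)
            * p_ending.get((tag, feature), 0.0)) / z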

Page 11

Variation

Trigram taggers

Interpolation

Variable Memory Markov Model (VMMM)

$$P(t_i \mid t_{1,i-1}) = \lambda_1\, P(t_i) + \lambda_2\, P(t_i \mid t_{i-1}) + \lambda_3\, P(t_i \mid t_{i-2,i-1})$$
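A one-liner makes the interpolation concrete; the probability tables and the λ values below are assumptions (in practice the λ's are estimated, e.g. by deleted interpolation, and must be non-negative and sum to 1):

def interp_trigram(t, t_prev, t_prev2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """lambda1*P(t) + lambda2*P(t|t_prev) + lambda3*P(t|t_prev2, t_prev)."""
    l1, l2, l3 = lambdas
    return (l1 * p_uni.get(t, 0.0)
            + l2 * p_bi.get((t_prev, t), 0.0)
            + l3 * p_tri.get((t_prev2, t_prev, t), 0.0))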

Page 12

Variation

Smoothing:

$$P(t^j \mid t^{j-1}) = (1 - \epsilon)\,\frac{C(t^{j-1}, t^j)}{C(t^{j-1})} + \epsilon$$

Reversibility:

$$P(t_{1,n}) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2) \cdots P(t_n \mid t_{n-1}) = \frac{P(t_1, t_2)\,P(t_2, t_3) \cdots P(t_{n-1}, t_n)}{P(t_2)\,P(t_3) \cdots P(t_{n-1})} = P(t_n)\,P(t_{n-1} \mid t_n) \cdots P(t_1 \mid t_2)$$

i.e. a Markov model assigns the same probability to a tag sequence whether it is read left-to-right or right-to-left, so decoding can proceed in either direction.

Smoothing of the lexical probabilities:

$$P(t^j \mid w^l) = \frac{C(t^j, w^l) + 1}{C(w^l) + K^l}$$

$K^l$: the number of possible parts of speech of $w^l$
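A sketch of both smoothing formulas in code; the ε value and the data structures are assumptions for illustration:

def smoothed_transition(t_prev, t, trans_count, tag_count, eps=1e-4):
    """(1 - eps) * C(t_prev, t) / C(t_prev) + eps: every transition keeps nonzero mass."""
    mle = trans_count.get((t_prev, t), 0) / tag_count[t_prev]
    return (1 - eps) * mle + eps

def smoothed_tag_given_word(t, w, tag_word_count, word_count, allowed_tags):
    """(C(t, w) + 1) / (C(w) + K_w), with K_w the number of tags allowed for w."""
    k = len(allowed_tags[w])
    return (tag_word_count.get((t, w), 0) + 1) / (word_count[w] + k)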

Page 13

Variation

Sequence vs. tag by tag

Time flies like an arrow.

a. NN VBZ RB AT NN. P(.) = 0.01

b. NN NNS VB AT NN. P(.) = 0.01

In practice there is no large difference in accuracy between maximizing the probability of the whole tag sequence and choosing the most probable tag for each word individually.

Page 14

Hidden Markov Model Taggers

When we have no tagged training data

Initializing all parameters with dictionary information:

Jelinek's method

Kupiec's method

Page 15

Hidden Markov Model Taggers

Jelinek's method: initializing the HMM with the MLE for $P(w^k \mid t^i)$, assuming that words occur equally likely with each of their possible tags.

$$b_{j.l} = \frac{b^*_{j.l}\,C(w^l)}{\sum_{w^m} b^*_{j.m}\,C(w^m)}$$

$$b^*_{j.l} = \begin{cases} 0 & \text{if } t^j \text{ is not a part of speech allowed for } w^l \\[4pt] \dfrac{1}{T(w^l)} & \text{otherwise} \end{cases}$$

$T(w^l)$: the number of tags allowed for $w^l$
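A sketch of this initialization in Python; the dictionary and count inputs are assumptions about the data format:

def jelinek_init(vocab, allowed_tags, word_count, tags):
    """Initialize HMM emission probabilities b[(t, w)] from a tag dictionary.

    allowed_tags[w]: set of tags the dictionary allows for word w.
    word_count[w]:   C(w), the corpus frequency of w.
    """
    # b*_{j.l}: uniform over the tags allowed for each word, zero elsewhere
    b_star = {(t, w): (1.0 / len(allowed_tags[w]) if t in allowed_tags[w] else 0.0)
              for w in vocab for t in tags}
    b = {}
    for t in tags:
        z = sum(b_star[(t, w)] * word_count[w] for w in vocab)  # per-tag normalizer
        for w in vocab:
            b[(t, w)] = b_star[(t, w)] * word_count[w] / z if z > 0 else 0.0
    return b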

Page 16

Hidden Markov Model Taggers

Kupiec's method: grouping all words with the same set of possible parts of speech into 'metawords' $u_L$, so that parameters are not fine-tuned for each individual word.

$$u_L = \{\, w^l \mid j \in L \Leftrightarrow t^j \text{ is allowed for } w^l \,\}, \qquad L \subseteq \{1, \ldots, T\}$$

$$b_{j.L} = \frac{b^*_{j.L}\,C(u_L)}{\sum_{u_{L'}} b^*_{j.L'}\,C(u_{L'})}$$

$$b^*_{j.L} = \begin{cases} 0 & \text{if } j \notin L \\[4pt] \dfrac{1}{|L|} & \text{otherwise} \end{cases}$$
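The grouping step is a one-liner over the tag dictionary; a sketch, with the same assumed input format as the Jelinek sketch above:

from collections import defaultdict

def kupiec_metawords(allowed_tags):
    """Group words by their set of allowed tags: u_L collects all words whose tag set is L."""
    groups = defaultdict(list)
    for word, tags in allowed_tags.items():
        groups[frozenset(tags)].append(word)
    return groups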

Page 17

Hidden Markov Model Taggers

Training: after initialization, the HMM is trained using the Forward-Backward algorithm.

Tagging: identical to VMM tagging (Viterbi decoding).

! The difference between VMM tagging and HMM tagging lies in how we train the model, not in how we tag.

Page 18

Hidden Markov Model Taggers

The effect of initialization on HMM training: the overtraining problem. Conditions compared:

D0 maximum likelihood estimates from a tagged training corpus

D1 correct ordering only of lexical probabilities

D2 lexical probabilities proportional to overall tag probabilities

D3 equal lexical probabilities for all tags admissible for a word

T0 maximum likelihood estimates from a tagged training corpus

T1 equal probabilities for all transitions

Page 19

Use the Visible Markov Model when:
a sufficiently large tagged training text is available
the training text is similar to the intended text of application

Run Forward-Backward for a few iterations when:
there is no tagged training text, or training and test text are very different
but at least some lexical information is available

Run Forward-Backward for a larger number of iterations when:
there is no lexical information