Transcript of "Part-of-Speech Tagging", Foundations of Statistical NLP, Chapter 10.

Page 1

Part-of-Speech Tagging

Foundations of Statistical NLP

CHAPTER 10

Page 2

Contents

Markov Model Taggers

Hidden Markov Model Taggers

Transformation-Based Learning of Tags

Tagging Accuracy and Uses of Taggers

Page 3

Markov Model Taggers

Markov properties

Limited horizon:

$$P(X_{i+1} = t^j \mid X_1, \ldots, X_i) = P(X_{i+1} = t^j \mid X_i)$$

Time invariant:

$$P(X_{i+1} = t^j \mid X_i) = P(X_2 = t^j \mid X_1)$$

cf. Wh-extraction (Chomsky): natural language has long-distance dependencies that the limited-horizon property cannot capture; in (b), "which book" depends on "buy" across the whole clause.

a. Should Peter buy a book?

b. Which book should Peter buy?

Page 4

Markov Model Taggers

The probabilistic model

Finding the best tagging $t_{1,n}$ for a sentence $w_{1,n}$:

$$\hat{t}_{1,n} = \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

ex: P(AT NN BEZ IN AT NN | The bear is on the move)

Page 5

Assumptions:
• words are independent of each other
• a word's identity only depends on its tag

$$\arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n})\,P(t_{1,n})}{P(w_{1,n})} = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\,P(t_{1,n})$$

$$
\begin{aligned}
P(w_{1,n} \mid t_{1,n})\,P(t_{1,n})
&= \prod_{i=1}^{n} P(w_i \mid t_{1,n}) \times P(t_n \mid t_{1,n-1})\,P(t_{n-1} \mid t_{1,n-2}) \cdots P(t_2 \mid t_1)\,P(t_1) \\
&= \prod_{i=1}^{n} P(w_i \mid t_i) \times P(t_n \mid t_{n-1})\,P(t_{n-1} \mid t_{n-2}) \cdots P(t_2 \mid t_1)\,P(t_1) \\
&= \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})
\end{aligned}
$$

where $t_0$ is a designated sentence-initial tag, so that $P(t_1) = P(t_1 \mid t_0)$.

Page 6

Markov Model Taggers

Training

for all tags $t^j$ do
    for all tags $t^k$ do
        $P(t^k \mid t^j) := C(t^j, t^k) \,/\, C(t^j)$
    end
end

for all tags $t^j$ do
    for all words $w^l$ do
        $P(w^l \mid t^j) := C(w^l : t^j) \,/\, C(t^j)$
    end
end
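The nested loops above are just relative-frequency counting. A minimal Python sketch of this training step; the (word, tag) input format and the PERIOD start tag are assumptions, the latter following the book's convention of treating the sentence boundary as a tag:

from collections import defaultdict

def train_mm_tagger(tagged_sentences, start_tag="PERIOD"):
    """MLE training for a bigram Markov model tagger.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns transition estimates P(t^k | t^j) and emission estimates P(w^l | t^j).
    """
    tag_count = defaultdict(int)      # C(t^j): occurrences of each tag
    context_count = defaultdict(int)  # occurrences of each tag as a left context
    trans_count = defaultdict(int)    # C(t^j, t^k): tag bigram counts
    emit_count = defaultdict(int)     # C(w^l : t^j): word-tag pair counts

    for sentence in tagged_sentences:
        prev = start_tag              # sentence boundary acts as the initial tag
        for word, tag in sentence:
            tag_count[tag] += 1
            context_count[prev] += 1
            trans_count[(prev, tag)] += 1
            emit_count[(word, tag)] += 1
            prev = tag

    # relative frequencies, exactly as in the pseudocode above
    trans = {bg: c / context_count[bg[0]] for bg, c in trans_count.items()}
    emit = {pair: c / tag_count[pair[1]] for pair, c in emit_count.items()}
    return trans, emit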

Page 7

Tag transition counts $C(t^j, t^k)$ (row = first tag, column = second tag):

First tag     AT     BEZ     IN      NN     VB   PERIOD
AT             0       0      0   48636      0       19
BEZ         1973       0    426     187      0       38
IN         43322       0   1325   17314      0      185
NN          1067    3720  42470   11773    614    21392
VB          6072      42   4758    1476    129     1522
PERIOD      8016      75   4656    1329    954        0

Word emission counts $C(w^l : t^j)$ (row = word, column = tag):

              AT     BEZ     IN      NN     VB   PERIOD
bear           0       0      0      10     43        0
is             0   10065      0       0      0        0
move           0       0      0      36    133        0
on             0       0   5484      0      0        0
president      0       0      0     382      0        0
progress       0       0      0     108      4        0
the        69016       0      0       0      0        0
.              0       0      0       0      0    48809
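To see how the training equations use these counts: both $P(\text{NN} \mid \text{AT})$ and $P(\text{PERIOD} \mid \text{AT})$ share the denominator $C(\text{AT})$, so their ratio is $48636 / 19 \approx 2560$; after an article, a noun is overwhelmingly more likely than a sentence-final period, and a following verb ($C(\text{AT}, \text{VB}) = 0$) is never observed at all.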

Page 8

Markov Model Taggers

Tagging (the Viterbi algorithm)
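The slide gives only the heading. A compact sketch of Viterbi decoding for the model above, assuming the trans/emit dictionaries produced by the training sketch on page 6 (missing entries are treated as probability zero):

def viterbi(words, tags, trans, emit, start="PERIOD"):
    """Find argmax over tag sequences of prod_i P(w_i|t_i) P(t_i|t_{i-1})."""
    # delta[t]: probability of the best tag sequence for words seen so far ending in t
    delta = {t: trans.get((start, t), 0.0) * emit.get((words[0], t), 0.0)
             for t in tags}
    backptrs = []  # backptrs[k][t]: best tag at position k given tag t at position k+1
    for w in words[1:]:
        psi, new_delta = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda tp: delta[tp] * trans.get((tp, t), 0.0))
            psi[t] = best_prev
            new_delta[t] = (delta[best_prev]
                            * trans.get((best_prev, t), 0.0)
                            * emit.get((w, t), 0.0))
        backptrs.append(psi)
        delta = new_delta
    # follow backpointers from the best final tag
    best = max(tags, key=lambda t: delta[t])
    seq = [best]
    for psi in reversed(backptrs):
        seq.append(psi[seq[-1]])
    return list(reversed(seq))

With the page 7 counts converted to probabilities, viterbi("the bear is on the move .".split(), tags, trans, emit) should recover AT NN BEZ IN AT NN PERIOD, since $C(\text{AT}, \text{VB}) = 0$ forces the noun readings of "bear" and "move".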

Page 9

Variations

The models for unknown words:

1. assuming that they can be any part of speech

2. using morphological and orthographic features to make inferences about a word's possible parts of speech

Page 10

$$P(w^l \mid t^j) = \frac{1}{Z}\, P(\text{unknown word} \mid t^j)\, P(\text{capitalized} \mid t^j)\, P(\text{endings/hyph} \mid t^j)$$

Z: normalization constant
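A sketch of how the three factors could be combined in code; the feature extraction and the component probability tables are assumptions for illustration, not the book's implementation:

def p_unknown_word(word, tag, p_unk, p_cap, p_ending, z):
    """P(w^l|t^j) ~ (1/Z) P(unknown|t) P(capitalized|t) P(ending/hyphen|t)."""
    capitalized = word[:1].isupper()
    # crude ending feature: 'hyphen' for hyphenated words, else last three characters
    feature = "hyphen" if "-" in word else word[-3:]
    return (p_unk.get(tag, 0.0)
            * p_cap.get((tag, capitalized), 0.0)
            * p_ending.get((tag, feature), 0.0)) / z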

Page 11

Variation

Trigram taggers

Interpolation

Variable Memory Markov Model (VMMM)

$$P(t_i \mid t_{1,i-1}) = \lambda_1\, P(t_i) + \lambda_2\, P(t_i \mid t_{i-1}) + \lambda_3\, P(t_i \mid t_{i-2,i-1})$$
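A one-liner makes the interpolation concrete; the probability tables and the λ values below are assumptions (in practice the λ's are estimated, e.g. by deleted interpolation, and must be non-negative and sum to 1):

def interp_trigram(t, t_prev, t_prev2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """lambda1*P(t) + lambda2*P(t|t_prev) + lambda3*P(t|t_prev2, t_prev)."""
    l1, l2, l3 = lambdas
    return (l1 * p_uni.get(t, 0.0)
            + l2 * p_bi.get((t_prev, t), 0.0)
            + l3 * p_tri.get((t_prev2, t_prev, t), 0.0))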

Page 12

Variation

Smoothing:

$$P(t^j \mid t^{j-1}) = (1 - \epsilon)\,\frac{C(t^{j-1}, t^j)}{C(t^{j-1})} + \epsilon$$

Reversibility:

$$P(t_{1,n}) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2) \cdots P(t_n \mid t_{n-1}) = \frac{P(t_1, t_2)\,P(t_2, t_3) \cdots P(t_{n-1}, t_n)}{P(t_2)\,P(t_3) \cdots P(t_{n-1})} = P(t_n)\,P(t_{n-1} \mid t_n) \cdots P(t_1 \mid t_2)$$

i.e. a Markov model assigns the same probability to a tag sequence whether it is read left-to-right or right-to-left, so decoding can proceed in either direction.

Smoothing of the lexical probabilities:

$$P(t^j \mid w^l) = \frac{C(t^j, w^l) + 1}{C(w^l) + K^l}$$

$K^l$: the number of possible parts of speech of $w^l$
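A sketch of both smoothing formulas in code; the ε value and the data structures are assumptions for illustration:

def smoothed_transition(t_prev, t, trans_count, tag_count, eps=1e-4):
    """(1 - eps) * C(t_prev, t) / C(t_prev) + eps: every transition keeps nonzero mass."""
    mle = trans_count.get((t_prev, t), 0) / tag_count[t_prev]
    return (1 - eps) * mle + eps

def smoothed_tag_given_word(t, w, tag_word_count, word_count, allowed_tags):
    """(C(t, w) + 1) / (C(w) + K_w), with K_w the number of tags allowed for w."""
    k = len(allowed_tags[w])
    return (tag_word_count.get((t, w), 0) + 1) / (word_count[w] + k)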

Page 13

Variation

Sequence vs. tag by tag

Time flies like an arrow.

a. NN VBZ RB AT NN. P(.) = 0.01

b. NN NNS VB AT NN. P(.) = 0.01

In practice there is no large difference in accuracy between maximizing the probability of the whole tag sequence and choosing the most probable tag for each word individually.

Page 14

Hidden Markov Model Taggers

When we have no tagged training data

Initializing all parameters with dictionary information:

Jelinek's method

Kupiec's method

Page 15

Hidden Markov Model Taggers

Jelinek's method: initializing the HMM with the MLE for $P(w^k \mid t^i)$, assuming that words occur equally likely with each of their possible tags.

$$b_{j.l} = \frac{b^*_{j.l}\,C(w^l)}{\sum_{w^m} b^*_{j.m}\,C(w^m)}$$

$$b^*_{j.l} = \begin{cases} 0 & \text{if } t^j \text{ is not a part of speech allowed for } w^l \\[4pt] \dfrac{1}{T(w^l)} & \text{otherwise} \end{cases}$$

$T(w^l)$: the number of tags allowed for $w^l$
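A sketch of this initialization in Python; the dictionary and count inputs are assumptions about the data format:

def jelinek_init(vocab, allowed_tags, word_count, tags):
    """Initialize HMM emission probabilities b[(t, w)] from a tag dictionary.

    allowed_tags[w]: set of tags the dictionary allows for word w.
    word_count[w]:   C(w), the corpus frequency of w.
    """
    # b*_{j.l}: uniform over the tags allowed for each word, zero elsewhere
    b_star = {(t, w): (1.0 / len(allowed_tags[w]) if t in allowed_tags[w] else 0.0)
              for w in vocab for t in tags}
    b = {}
    for t in tags:
        z = sum(b_star[(t, w)] * word_count[w] for w in vocab)  # per-tag normalizer
        for w in vocab:
            b[(t, w)] = b_star[(t, w)] * word_count[w] / z if z > 0 else 0.0
    return b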

Page 16

Hidden Markov Model Taggers

Kupiec's method: grouping all words with the same set of possible parts of speech into 'metawords' $u_L$, so that parameters are not fine-tuned for each individual word.

$$u_L = \{\, w^l \mid j \in L \Leftrightarrow t^j \text{ is allowed for } w^l \,\}, \qquad L \subseteq \{1, \ldots, T\}$$

$$b_{j.L} = \frac{b^*_{j.L}\,C(u_L)}{\sum_{u_{L'}} b^*_{j.L'}\,C(u_{L'})}$$

$$b^*_{j.L} = \begin{cases} 0 & \text{if } j \notin L \\[4pt] \dfrac{1}{|L|} & \text{otherwise} \end{cases}$$
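The grouping step is a one-liner over the tag dictionary; a sketch, with the same assumed input format as the Jelinek sketch above:

from collections import defaultdict

def kupiec_metawords(allowed_tags):
    """Group words by their set of allowed tags: u_L collects all words whose tag set is L."""
    groups = defaultdict(list)
    for word, tags in allowed_tags.items():
        groups[frozenset(tags)].append(word)
    return groups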

Page 17

Hidden Markov Model Taggers

Training: after initialization, the HMM is trained using the Forward-Backward algorithm.

Tagging: identical to VMM tagging (Viterbi decoding).

! The difference between VMM tagging and HMM tagging lies in how we train the model, not in how we tag.

Page 18

Hidden Markov Model Taggers

The effect of initialization on HMM training: the overtraining problem. Conditions compared:

D0 maximum likelihood estimates from a tagged training corpus

D1 correct ordering only of lexical probabilities

D2 lexical probabilities proportional to overall tag probabilities

D3 equal lexical probabilities for all tags admissible for a word

T0 maximum likelihood estimates from a tagged training corpus

T1 equal probabilities for all transitions

Page 19

Use the Visible Markov Model when:
a sufficiently large tagged training text is available
the training text is similar to the intended text of application

Run Forward-Backward for a few iterations when:
there is no tagged training text, or training and test text are very different
but at least some lexical information is available

Run Forward-Backward for a larger number of iterations when:
there is no lexical information