Transcript of CS 479, section 1: Natural Language Processing

Page 1: CS 479, section 1: Natural Language Processing

CS 479, section 1: Natural Language Processing

Lecture #23: Part of Speech Tagging, Hidden Markov Models (cont.)

Thanks to Dan Klein of UC Berkeley for some of the materials used in this lecture.

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

Page 2: CS 479, section 1: Natural Language Processing

Announcements

Project #2, Part 2
Early: today
Due: Monday
Thanks for your ongoing feedback

Coming up: Project #3 – Feature engineering for POS tagging

Reading Report #10 on the tagging paper by Toutanova and Manning
Due: Wednesday

Mid-term exam, Question #4
Optional practice exercises can replace Question #4!

Page 3: CS 479, section 1: Natural Language Processing

Objectives

Use Hidden Markov Models efficiently for POS tagging

Understand the Viterbi algorithm

If time permits, think about using an HMM as a language model

Page 4: CS 479, section 1: Natural Language Processing

Possible Scenarios

Training: labeled data*, unlabeled data, or partially labeled data

Test / run-time: unlabeled data* or partially labeled data

* We're focusing on straightforward supervised learning for now.

Page 5: CS 479, section 1: Natural Language Processing

Disambiguation

Tagging is disambiguation: finding the best tag sequence.

Given an HMM (i.e., distributions for transitions and emissions), we can score any word sequence and tag sequence together (jointly):

What is the score?

Fed/NNP raises/VBZ interest/NN rates/NNS 0.5/CD percent/NN ./. STOP

$P(\vec{t}, \vec{w}) = P(\text{NNP} \mid \diamond) \cdot P(\text{Fed} \mid \text{NNP}) \cdot P(\text{VBZ} \mid \text{NNP}) \cdot P(\text{raises} \mid \text{VBZ}) \cdot P(\text{NN} \mid \text{VBZ}) \cdot P(\text{interest} \mid \text{NN}) \cdots$
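As a concrete (hypothetical) illustration of this scoring, here is a minimal Python sketch, assuming the transition and emission distributions are stored as nested dictionaries `trans[prev_tag][tag]` and `emit[tag][word]`, with `"<s>"` and `"STOP"` as assumed start/stop symbols:

```python
import math

def score_tagging(words, tags, trans, emit, start="<s>", stop="STOP"):
    """Joint log-probability log P(tags, words) under a first-order HMM.

    trans[prev][cur] = P(cur tag | prev tag); emit[tag][word] = P(word | tag).
    """
    logp = 0.0
    prev = start
    for word, tag in zip(words, tags):
        logp += math.log(trans[prev][tag])  # transition P(t_i | t_{i-1})
        logp += math.log(emit[tag][word])   # emission   P(w_i | t_i)
        prev = tag
    logp += math.log(trans[prev][stop])     # final transition into STOP
    return logp
```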

Page 6: CS 479, section 1: Natural Language Processing

Option 1: Enumeration

Exhaustive enumeration: list all possible tag sequences, score each one, and pick the best one (the "Viterbi" state sequence).

Fed raises interest rates …

What’s wrong with this approach? What to do about it?

NNP VBZ NN NNS CD NN    logP = -23
NNP NNS NN NNS CD NN    logP = -29
NNP VBZ VB NNS CD NN    logP = -27
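A brute-force enumerator along these lines is only a few lines of Python (a sketch reusing the hypothetical `score_tagging` helper above); the obvious problem is that it scores |tagset|^n candidate sequences:

```python
from itertools import product

def enumerate_best(words, tagset, trans, emit):
    """Score every possible tag sequence and keep the best (exponential in len(words))."""
    best_tags, best_logp = None, float("-inf")
    for tags in product(tagset, repeat=len(words)):   # |tagset| ** len(words) candidates
        logp = score_tagging(words, list(tags), trans, emit)
        if logp > best_logp:
            best_tags, best_logp = list(tags), logp
    return best_tags, best_logp
```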

Page 7: CS 479, section 1: Natural Language Processing

Option 2: Breadth-First Search

Exhaustive tree-structured search: start with just the single empty tagged sentence; at each derivation step, consider all extensions of previous hypotheses.

What's wrong with this approach? What to do about it?

[Search tree over partial taggings of "Fed raises interest":
<> → Fed:NNP | Fed:VBN | Fed:VBD
Fed:NNP → Fed:NNP raises:NNS | Fed:NNP raises:VBZ
Fed:VBN → Fed:VBN raises:NNS | Fed:VBN raises:VBZ
Fed:NNP raises:NNS → Fed:NNP raises:NNS interest:NN | Fed:NNP raises:NNS interest:VB
…]

Page 8: CS 479, section 1: Natural Language Processing

Possibilities

Best-first search
A* (close cousin: branch and bound)
Also possible: dynamic programming – we'll focus our efforts here for now
Fast, approximate searches

Page 9: CS 479, section 1: Natural Language Processing

Option 3: Dynamic Programming

Recall the main ideas:
1. Typically: solve an optimization problem.
2. Devise a minimal description (address) for any problem instance and sub-problem.
3. Divide problems into sub-problems: define the recurrence to specify the relationship of problems to sub-problems.
4. Check that the optimality property holds: an optimal solution to a problem is built from optimal solutions to sub-problems.
5. Store results – typically in a table – and re-use the solutions to sub-problems in the table as you build up to the overall solution.
6. Back-trace / analyze the table to extract the composition of the final solution.

Page 10: CS 479, section 1: Natural Language Processing

Partial Taggings as Paths

Step #2: Devise a minimal description (address) for any problem instance and sub-problem.

At address (t, i): the optimal tagging of tokens 1 through i, ending in tag t.

Partial taggings from position 1 through i are represented as paths through a trellis of tag states, ending in some tag state.

Each arc connects two tag states and has a weight, which is the combination of a transition probability $P(t_i \mid t_{i-1})$ and an emission probability $P(w_i \mid t_i)$.

[Trellis over "Fed raises interest" with tag states ⋄:0; NNP:1, VBN:1; NNS:2, VBZ:2; NN:3, VB:3. Example arc weight from NNP:1 to VBZ:2: $P(\text{VBZ} \mid \text{NNP}) \cdot P(\text{raises} \mid \text{VBZ})$.]

Page 11: CS 479, section 1: Natural Language Processing

The Recurrence

Step #3: Divide problems into sub-problems: define the recurrence to specify the relationship of problems to sub-problems.

Score of a best path (tagging) up to position i, ending in state s:

$\delta_i(s) = \max_{s'} \, \delta_{i-1}(s') \cdot P(s \mid s') \cdot P(w_i \mid s)$

Step-by-step: for each possible predecessor state s', take the best score of a path ending in s' at position i-1, multiply by the transition probability into s and by the emission probability of word w_i from s, and keep the maximum.

Also store a back-pointer:

$\pi_i(s) = \operatorname{argmax}_{s'} \, \delta_{i-1}(s') \cdot P(s \mid s')$

Page 12: CS 479, section 1: Natural Language Processing

Optimality Property: key idea for the Viterbi algorithm

Step #4: Check that the optimality property holds: an optimal solution to a problem is built from optimal solutions to sub-problems.

What does the optimality property have to say about computing the score of the best tagging ending in the highlighted state?

The optimal tagging ending at address (t, i) is constructed from the optimal taggings ending at the predecessor addresses (t', i-1).

[Same trellis over "Fed raises interest" (states ⋄:0; NNP:1, VBN:1; NNS:2, VBZ:2; NN:3, VB:3), with one state highlighted.]

Page 13: CS 479, section 1: Natural Language Processing

Iterative Algorithm

Step #5: Store results – typically in a table – and re-use the solutions to sub-problems in the table as you build up to the overall solution.

For each state s and word position i, compute $\delta_i(s)$ and $\pi_i(s)$:

For i = 1 to n do:
  For each s in the tag set do:
    $\delta_i(s) = \max_{s'} \, \delta_{i-1}(s') \cdot P(s \mid s') \cdot P(w_i \mid s)$
    $\pi_i(s) = \operatorname{argmax}_{s'} \, \delta_{i-1}(s') \cdot P(s \mid s')$
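One possible Python rendering of this loop (a sketch, not the course's reference implementation), using the same hypothetical dictionary layout for `trans` and `emit` as above and log-probabilities for numerical stability:

```python
import math

def viterbi(words, states, trans, emit, start="<s>"):
    """Best state (tag) sequence under a first-order HMM.

    states: the tag set (or tag-pair states, for the higher-order trick later).
    trans[prev][cur] = P(cur | prev); emit[state][word] = P(word | state).
    """
    n = len(words)
    delta = [{} for _ in range(n + 1)]  # delta[i][s]: best log-score of a path ending in s at i
    back = [{} for _ in range(n + 1)]   # back[i][s]:  pi_i(s), best predecessor of s at i
    delta[0][start] = 0.0
    for i in range(1, n + 1):
        for s in states:
            best_prev, best_score = None, float("-inf")
            for s_prev, prev_score in delta[i - 1].items():
                p = trans.get(s_prev, {}).get(s, 0.0)   # absent entry = disallowed transition
                if p > 0.0 and prev_score + math.log(p) > best_score:
                    best_prev, best_score = s_prev, prev_score + math.log(p)
            p_emit = emit.get(s, {}).get(words[i - 1], 0.0)
            if best_prev is not None and p_emit > 0.0:
                delta[i][s] = best_score + math.log(p_emit)
                back[i][s] = best_prev
    # Step #6: start at the max cell in the final column and follow the back-pointers.
    s = max(delta[n], key=delta[n].get)
    tags = [s]
    for i in range(n, 1, -1):
        s = back[i][s]
        tags.append(s)
    return list(reversed(tags))
```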

Page 14: CS 479, section 1: Natural Language Processing

Viterbi: DP Table

Step #6: Back-trace / analyze the table to extract the composition of the final solution.

[DP table skeleton: one row per tag (VBN, NNP, NN, VB, NNS, VBZ), one column per token (Fed, raises, interest, …).]

Page 15: CS 479, section 1: Natural Language Processing

Viterbi: DP Table

[DP table filled in: a start cell $\delta_0(\diamond)$, $\pi_0(\diamond)$, plus a cell $\delta_i(t)$, $\pi_i(t)$ for each tag t ∈ {VBN, NNP, NN, VB, NNS, VBZ} and each position i = 1, 2, 3 (Fed, raises, interest).]

Page 16: CS 479, section 1: Natural Language Processing

Viterbi: DP Table

[Same filled DP table as on the previous slide.]

Page 17: CS 479, section 1: Natural Language Processing

Viterbi: DP Table

Start at the max and backtrace.

[Same filled DP table, with the backtrace starting from the maximum-score cell in the final column.]

Page 18: CS 479, section 1: Natural Language Processing

Viterbi: DP Table

Start at the max and backtrace: read off the best sequence by following the back-pointers $\pi_i$.

[Same filled DP table, with the full backtrace path marked.]
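In terms of the hypothetical `viterbi` sketch given earlier, reading off the best sequence is just the back-pointer loop at the end; a toy usage (with made-up probabilities, purely for illustration) might look like:

```python
# Toy tables with invented probabilities; not real corpus estimates.
trans = {"<s>": {"NNP": 0.6, "VBZ": 0.1, "NN": 0.3},
         "NNP": {"NNP": 0.1, "VBZ": 0.6, "NN": 0.3},
         "VBZ": {"NNP": 0.1, "VBZ": 0.1, "NN": 0.8},
         "NN":  {"NNP": 0.2, "VBZ": 0.3, "NN": 0.5}}
emit = {"NNP": {"Fed": 0.8,  "raises": 0.1, "interest": 0.1},
        "VBZ": {"Fed": 0.05, "raises": 0.8, "interest": 0.15},
        "NN":  {"Fed": 0.1,  "raises": 0.2, "interest": 0.7}}

words = ["Fed", "raises", "interest"]
print(viterbi(words, ["NNP", "VBZ", "NN"], trans, emit))
# With these toy numbers the backtrace yields ['NNP', 'VBZ', 'NN'].
```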

Page 19: CS 479, section 1: Natural Language Processing

Efficiency

[Same filled DP table over "Fed raises interest".]

Each of the roughly $n \cdot |T|$ cells is filled by maximizing over the $|T|$ possible predecessor tags, so Viterbi runs in $O(n \cdot |T|^2)$ time for a sentence of n tokens and a tag set of size |T| – versus the $O(|T|^n)$ sequences of exhaustive enumeration.

Page 20: CS 479, section 1: Natural Language Processing

Implementation Trick

For higher-order (order > 1) HMMs, we need to facilitate dynamic programming: allow scores to be computed locally, without worrying about the earlier structure of the trellis.

Define one state for each tuple of tags. E.g., for order 2, $s_i = \langle t_{i-1}, t_i \rangle$.

This corresponds to a constrained 1st-order HMM over tag n-grams.

$P(\vec{t}, \vec{w}) = P(t_1, \ldots, t_n, w_1, \ldots, w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i)$

$= \prod_{i=1}^{n} P(\langle t_{i-1}, t_i \rangle \mid \langle t_{i-2}, t_{i-1} \rangle) \, P(w_i \mid \langle t_{i-1}, t_i \rangle) = \prod_{i=1}^{n} P(s_i \mid s_{i-1}) \, P(w_i \mid s_i) = P(\vec{s}, \vec{w})$

Note:
$P(t_i \mid t_{i-1}, t_{i-2}) = P(\langle t_{i-1}, t_i \rangle \mid \langle t_{i-2}, t_{i-1} \rangle) = P(s_i \mid s_{i-1})$
$P(w_i \mid t_i) = P(w_i \mid \langle t_{i-1}, t_i \rangle) = P(w_i \mid s_i)$

Page 21: CS 479, section 1: Natural Language Processing

Implementation Trick

A constrained 1st-order HMM over tag n-grams.

Constraints: consecutive states must agree on their shared tag, i.e. $s_{i-1} = \langle t_{i-2}, t_{i-1} \rangle$ can only transition to states of the form $s_i = \langle t_{i-1}, t_i \rangle$.

[Figure: state chain $s_0 = \langle \cdot, \cdot \rangle \rightarrow s_1 = \langle \cdot, t_1 \rangle \rightarrow s_2 = \langle t_1, t_2 \rangle \rightarrow \cdots \rightarrow s_n = \langle t_{n-1}, t_n \rangle$, with each $s_i$ emitting $w_i$.]
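A sketch of this state-construction trick (my own illustration, not from the slides): given hypothetical trigram transition estimates `trans3[(t_prev2, t_prev1)][t]` and per-tag emissions `emit[t][w]`, build pair states and first-order tables that the earlier `viterbi` sketch can consume, with disallowed transitions simply left out of the table:

```python
from itertools import product

def build_pair_state_hmm(tagset, trans3, emit, boundary="<s>"):
    """Recast an order-2 (trigram) HMM as a first-order HMM over tag-pair states.

    trans3[(t_prev2, t_prev1)][t] = P(t | t_prev1, t_prev2);  emit[t][w] = P(w | t).
    Returns (start_state, states, pair_trans, pair_emit).
    """
    states = list(product([boundary] + list(tagset), tagset))   # s = <t_{i-1}, t_i>
    pair_trans, pair_emit = {}, {}
    for (a, b) in states:
        pair_emit[(a, b)] = emit[b]          # P(w | <a, b>) = P(w | b)
        # Constraint: <a, b> may only transition to states of the form <b, c>.
        pair_trans[(a, b)] = {(b, c): trans3[(a, b)][c] for c in tagset}
    start = (boundary, boundary)
    pair_trans[start] = {(boundary, t): trans3[(boundary, boundary)][t] for t in tagset}
    return start, states, pair_trans, pair_emit
```

With this, decoding would look like `viterbi(words, states, pair_trans, pair_emit, start=start)`, and the resulting pair states map back to tags by taking the second element of each pair.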

Page 22: CS 479, section 1: Natural Language Processing

How Well Does It Work?

Choose the most common tag:
90.3% with a bad unknown-word model
93.7% with a good one!

TnT (Brants, 2000): a carefully smoothed trigram tagger – 96.7% on WSJ text

State of the art is approximately 97.3%.

Noise in the data: many errors in the training and test corpora – probably about 2% guaranteed error from noise (on this data). For example, the same phrase is tagged inconsistently in the gold standard:

chief/NN executive/NN officer/NN
chief/JJ executive/NN officer/NN
chief/JJ executive/JJ officer/NN
chief/NN executive/JJ officer/NN

The/DT average/NN of/IN interbank/NN offered/VBD rates/NNS plummeted/VBD …

Page 23: CS 479, section 1: Natural Language Processing

Option 4: Beam Search

Adopt a simple pruning strategy in the Viterbi trellis. A beam is a set of partial hypotheses; to prune, you can:
keep the top k (use a priority queue),
keep those within a factor of the best score,
keep the widest variety, or use some combination of the above.

Is this OK? Beam search works well in practice. It will give you the optimal answer if the beam is wide enough … and you need optimal answers to validate your beam search.

[Trellis over "Fed raises interest" with tag-pair states: ⟨⋄,⋄⟩:0; ⟨⋄,NNP⟩:1, ⟨⋄,VBN⟩:1; ⟨NNP,NNS⟩:2, ⟨NNP,VBZ⟩:2, ⟨VBN,VBZ⟩:2, ⟨VBN,NNS⟩:2; ⟨NNS,NN⟩:3, ⟨NNS,VB⟩:3, ⟨VBZ,NN⟩:3, ⟨VBZ,VB⟩:3.]
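A sketch of the "keep the top k" variant in Python (same hypothetical `trans`/`emit` layout as before; `heapq.nlargest` stands in for the priority queue mentioned above):

```python
import heapq
import math

def beam_search(words, states, trans, emit, beam_size=5, start="<s>"):
    """Approximate Viterbi decoding: keep only the top-k partial hypotheses per position."""
    beam = [(0.0, [], start)]                 # (log-score, tags so far, last state)
    for word in words:
        candidates = []
        for score, tags, prev in beam:
            for s in states:
                p_t = trans.get(prev, {}).get(s, 0.0)
                p_e = emit.get(s, {}).get(word, 0.0)
                if p_t > 0.0 and p_e > 0.0:
                    candidates.append((score + math.log(p_t) + math.log(p_e), tags + [s], s))
        beam = heapq.nlargest(beam_size, candidates, key=lambda h: h[0])  # prune to top k
    return max(beam, key=lambda h: h[0])[1]   # tags of the best surviving hypothesis
```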

Page 24: CS 479, section 1: Natural Language Processing

HMMs: Basic Problems and Solutions

Given an input sequence of words and an HMM (transition and emission distributions):

Decoding / Tagging: choose the optimal tag sequence for the input using the HMM.
Solution: the Viterbi algorithm.

Evaluation / Computing the marginal probability: compute the marginal probability of the input word sequence under the HMM.
Solution: the Forward algorithm – just use a sum instead of a max in the Viterbi algorithm!

Re-estimation / Unsupervised or Semi-supervised Training: adjust the HMM parameters to maximize the probability of an entire data set, either entirely or partially unlabeled, respectively.
Solution: the Baum-Welch algorithm, also known as the Forward-Backward algorithm – another example of EM.
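As an illustration of "sum instead of max", here is a forward-algorithm sketch in the same hypothetical table layout (a real implementation would work in log space or rescale to avoid underflow):

```python
def forward(words, states, trans, emit, start="<s>"):
    """Marginal probability P(words) under the HMM: sum over all tag sequences."""
    alpha = {start: 1.0}            # alpha[s] = P(words seen so far, current state = s)
    for word in words:
        next_alpha = {}
        for s in states:
            # Same recurrence as Viterbi, with the max over predecessors replaced by a sum.
            total = sum(p * trans.get(prev, {}).get(s, 0.0) for prev, p in alpha.items())
            next_alpha[s] = total * emit.get(s, {}).get(word, 0.0)
        alpha = next_alpha
    return sum(alpha.values())
```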

Page 25: CS 479, section 1: Natural Language Processing

Speech Recognition Revisited

Given an input sequence of acoustic feature vectors and a (composed) HMM:

Decoding / Tagging: choose the optimal word (tag) sequence for the feature sequence.
Solution: the Viterbi algorithm.

Page 26: CS 479, section 1: Natural Language Processing

Next

Better tagging features using Maxent: dealing with unknown words, adjacent words, and longer-distance features

Soon: Named-Entity Recognition

Page 27: CS 479, section 1: Natural Language Processing

Extra

Page 28: CS 479, section 1: Natural Language Processing

How's the HMM as a LM?

Back to tags t for the moment (instead of states s). We have a generative model of tagged sentences:

$P(\vec{t}, \vec{w}) = \prod_i P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i)$

We can turn this into a distribution over sentences by marginalizing out the tag sequences:

$P(\vec{w}) = \sum_{\vec{t}} P(\vec{t}, \vec{w}) = \sum_{\vec{t}} \prod_i P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i)$

Problem: too many sequences! And beam search isn't going to help this time. Why not?

Page 29: CS 479, section 1: Natural Language Processing

How's the HMM as a LM?

POS-tagging HMMs are terrible as LMs!
Don't capture long-distance effects like a parser could.
Don't capture local collocational effects like n-grams.

But other HMM-like LMs can work very well

I bought an ice cream ___

The computer that I set up yesterday just ___

[Figure: an HMM-like LM with hidden classes – START → c1 → c2 → … → cn, with each class c_i emitting word w_i.]