CS 479, section 1: Natural Language Processing
description
Transcript of CS 479, section 1: Natural Language Processing
CS 479, section 1:Natural Language Processing
Lecture #23: Part of Speech Tagging, Hidden Markov Models (cont.)
Thanks to Dan Klein of UC Berkeley for some of the materials used in this lecture.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
Announcements Project #2, Part 2
Early: today Due: Monday Thanks for your ongoing feedback
Coming up: Project #3 Feature engineering for POS tagging
Reading Report #10 on the tagging paper by Toutanova and Manning Due: Wednesday
Mid-term exam Question #4 Optional practice exercises can replace q. #4!
Objectives Use Hidden Markov Models efficiently for POS
tagging
Understand the Viterbi algorithm
If time permits, think about using HMM as a language model
Possible Scenarios
Training Labeled data* Unlabeled data Partially labeled data
Test / Run-time Unlabeled data* Partially labeled data
* We’re focusing on straight-forwardsupervised learning for now.
Disambiguation Tagging is disambiguation: Finding the best tag sequence
Given an HMM (i.e., distributions for transitions and emissions), we can score any word sequence and tag sequence together (jointly):
What is the score?
Fed raises interest rates 0.5 percent . NNP VBZ NN NNS CD NN . STOP
𝑃 (�́� ,�́� )=𝑃 (𝑁𝑁𝑃|) ⋅ 𝑃 (𝐹𝑒𝑑|𝑁𝑁𝑃 )⋅ 𝑃 (𝑉𝐵𝑍|𝑁𝑁𝑃 )⋅ 𝑃 (𝑟𝑎𝑖𝑠𝑒𝑠|𝑉𝐵𝑍 )⋅ 𝑃 (𝑁𝑁|𝑉𝐵𝑍 ) ⋅ 𝑃 (𝑖𝑛𝑡𝑒𝑟𝑒𝑠𝑡|𝑁𝑁 )⋅…
Option 1: Enumeration Exhaustive enumeration: list all possible tag sequences, score
each one, pick the best one (the “Viterbi” state sequence)
Fed raises interest rates …
What’s wrong with this approach? What to do about it?
NNP VBZ NN NNS CD NN
NNP NNS NN NNS CD NN
NNP VBZ VB NNS CD NN
logP = -23
logP = -29
logP = -27
…
Exhaustive tree-structured search Start with just the single empty tagged sentence At each derivation step, consider all extensions of previous hypotheses
What’s wrong with this approach? What to do about it?
Option 2: Breadth-First Search
<>
Fed:NNP
Fed:VBN
Fed:VBD
Fed:NNP raises:NNS
Fed:NNP raises:VBZ
Fed:VBN raises:NNS
Fed:VBN raises:VBZ
Fed raises interest
Fed:NNP raises:NNS interest:NN
Fed:NNP raises:NNS interest:VB
Possibilities Best-first search
A*
Close cousin: Branch and Bound
Also possible: Dynamic programming – we’ll focus our efforts here for now
Fast, approximate searches
Recall the main ideas:1. Typically: Solve an optimization problem2. Devise a minimal description (address) for any problem instance
and sub-problem3. Divide problems into sub-problems: define the recurrence to
specify the relationship of problems to sub-problems4. Check that the optimality property holds: An optimal solution to
a problem is built from optimal solutions to sub-problems.5. Store results – typically in a table – and re-use the solutions to
sub-problems in the table as you build up to the overall solution.6. Back-trace / analyze the table to extract the composition of the
final solution.
Option 3: Dynamic Programming
Partial Taggings as Paths Step #2: Devise a minimal description (address) for any problem instance
and sub-problem At address ( , ): Optimal tagging of tokens 1 through , ending in tag .𝑡 𝑖 𝑖 𝑡
Partial taggings from position 1 through are represented as paths through a trellis of tag states, ending in some tag state.
Each arc connects two tag states and has a weight, which is the combination of: and
:0
NNP:1
VBN:1
NNS:2
VBZ:2
NN:3
VB:3
Fed raises interest𝑷 (𝐕𝐁𝐙∨𝐍𝐍𝐏)⋅𝑷 (𝐫𝐚𝐢𝐬𝐞𝐬∨𝐕𝐁𝐙 )
The Recurrence Step #3: Divide problems into sub-problems: define the
recurrence to specify the relationship of problems to sub-problems
Score of a best path (tagging) up to position ending in state :
Step-by-step:
Also store a back-pointer:
Step #4: Check that the optimality property holds: An optimal solution to a problem is built from optimal solutions to sub-problems.
What does the optimality property have to say about computingthe score of the best tagging ending in the highlighted state?
The optimal tagging for is constructed from optimal taggings for
Optimality PropertyKey idea for the Viterbi algorithm
:0
NNP:1
VBN:1
NNS:2
VBZ:2
NN:3
VB:3
Fed raises interest
Iterative Algorithm Step #5: Store results – typically in a table – and re-
use the solutions to sub-problems in the table as you build up to the overall solution.
For each state and word position , compute and :
For = 1 to do:For each in tag-set do:
Viterbi: DP Table Step #6: Back-trace / analyze the table to extract the
composition of the final solution.
Fed raises interest …
VBN
NNP
NN
VB
NNS
VBZ
…
Viterbi: DP Table
𝜋 0 (⋄ )𝛿0(⋄)
VBN
NNP
NN
VB
NNS
VBZ
…
𝜋 1 (𝑁𝑁𝑆 )𝛿1(𝑁𝑁𝑆)
𝜋 1 (𝑉𝐵𝑍 )𝛿1(𝑉𝐵𝑍 )
𝜋 1 (𝑁𝑁 )𝛿1(𝑁𝑁 )
𝜋 1 (𝑉𝐵 )𝛿1(𝑉𝐵)
Fed raises interest …
𝜋 1 (𝑉𝐵𝑁 )𝛿1(𝑉𝐵𝑁 )
𝜋 1 (𝑁𝑁𝑃 ) 𝛿1(𝑁𝑁𝑃 )
𝜋 2 (𝑁𝑁𝑆 )𝛿2(𝑁𝑁𝑆)
𝜋 2 (𝑉𝐵𝑍 )𝛿2(𝑉𝐵𝑍 )
𝜋 2 (𝑁𝑁 )𝛿2(𝑁𝑁 )
𝜋 2 (𝑉𝐵 )𝛿2(𝑉𝐵)
𝜋2 (𝑉𝐵𝑁 )𝛿2(𝑉𝐵𝑁 )
𝜋 2 (𝑁𝑁𝑃 ) 𝛿2(𝑁𝑁𝑃)
𝜋 3 (𝑁𝑁𝑆 )𝛿3(𝑁𝑁𝑆)
𝜋 3 (𝑉𝐵𝑍 )𝛿3(𝑉𝐵𝑍 )
𝜋 3 (𝑁𝑁 )𝛿3(𝑁𝑁 )
𝜋 3 (𝑉𝐵 )𝛿3(𝑉𝐵 )
𝜋 3 (𝑉𝐵𝑁 )𝛿3(𝑉𝐵𝑁 )
𝜋 3 (𝑁𝑁𝑃 )𝛿3(𝑁𝑁𝑃 )
Viterbi: DP Table
𝜋 0 (⋄ )𝛿0(⋄)
VBN
NNP
NN
VB
NNS
VBZ
…
𝜋 1 (𝑁𝑁𝑆 )𝛿1(𝑁𝑁𝑆)
𝜋 1 (𝑉𝐵𝑍 )𝛿1(𝑉𝐵𝑍 )
𝜋 1 (𝑁𝑁 )𝛿1(𝑁𝑁 )
𝜋 1 (𝑉𝐵 )𝛿1(𝑉𝐵)
Fed raises interest …
𝜋 1 (𝑉𝐵𝑁 )𝛿1(𝑉𝐵𝑁 )
𝜋 1 (𝑁𝑁𝑃 ) 𝛿1(𝑁𝑁𝑃 )
𝜋 2 (𝑁𝑁𝑆 )𝛿2(𝑁𝑁𝑆)
𝜋 2 (𝑉𝐵𝑍 )𝛿2(𝑉𝐵𝑍 )
𝜋 2 (𝑁𝑁 )𝛿2(𝑁𝑁 )
𝜋 2 (𝑉𝐵 )𝛿2(𝑉𝐵)
𝜋2 (𝑉𝐵𝑁 )𝛿2(𝑉𝐵𝑁 )
𝜋 2 (𝑁𝑁𝑃 ) 𝛿2(𝑁𝑁𝑃)
𝜋 3 (𝑁𝑁𝑆 )𝛿3(𝑁𝑁𝑆)
𝜋 3 (𝑉𝐵𝑍 )𝛿3(𝑉𝐵𝑍 )
𝜋 3 (𝑁𝑁 )𝛿3(𝑁𝑁 )
𝜋 3 (𝑉𝐵 )𝛿3(𝑉𝐵 )
𝜋 3 (𝑉𝐵𝑁 )𝛿3(𝑉𝐵𝑁 )
𝜋 3 (𝑁𝑁𝑃 )𝛿3(𝑁𝑁𝑃 )
Viterbi: DP Table
Start at max and backtrace
𝜋 0 (⋄ )𝛿0(⋄)
VBN
NNP
NN
VB
NNS
VBZ
…
𝜋 1 (𝑁𝑁𝑆 )𝛿1(𝑁𝑁𝑆)
𝜋 1 (𝑉𝐵𝑍 )𝛿1(𝑉𝐵𝑍 )
𝜋 1 (𝑁𝑁 )𝛿1(𝑁𝑁 )
𝜋 1 (𝑉𝐵 )𝛿1(𝑉𝐵)
Fed raises interest …
𝜋 1 (𝑉𝐵𝑁 )𝛿1(𝑉𝐵𝑁 )
𝜋 1 (𝑁𝑁𝑃 ) 𝛿1(𝑁𝑁𝑃 )
𝜋 2 (𝑁𝑁𝑆 )𝛿2(𝑁𝑁𝑆)
𝜋 2 (𝑉𝐵𝑍 )𝛿2(𝑉𝐵𝑍 )
𝜋 2 (𝑁𝑁 )𝛿2(𝑁𝑁 )
𝜋 2 (𝑉𝐵 )𝛿2(𝑉𝐵)
𝜋2 (𝑉𝐵𝑁 )𝛿2(𝑉𝐵𝑁 )
𝜋 2 (𝑁𝑁𝑃 ) 𝛿2(𝑁𝑁𝑃)
𝜋 3 (𝑁𝑁𝑆 )𝛿3(𝑁𝑁𝑆)
𝜋 3 (𝑉𝐵𝑍 )𝛿3(𝑉𝐵𝑍 )
𝜋 3 (𝑁𝑁 )𝛿3(𝑁𝑁 )
𝜋 3 (𝑉𝐵 )𝛿3(𝑉𝐵 )
𝜋 3 (𝑉𝐵𝑁 )𝛿3(𝑉𝐵𝑁 )
𝜋 3 (𝑁𝑁𝑃 )𝛿3(𝑁𝑁𝑃 )
Viterbi: DP Table
𝜋 0 (⋄ )𝛿0(⋄)
VBN
NNP
NN
VB
NNS
VBZ
…
𝜋 1 (𝑁𝑁𝑆 )𝛿1(𝑁𝑁𝑆)
𝜋 1 (𝑉𝐵𝑍 )𝛿1(𝑉𝐵𝑍 )
𝜋 1 (𝑁𝑁 )𝛿1(𝑁𝑁 )
𝜋 1 (𝑉𝐵 )𝛿1(𝑉𝐵)
Fed raises interest …
𝜋 1 (𝑉𝐵𝑁 )𝛿1(𝑉𝐵𝑁 )
𝜋 1 (𝑁𝑁𝑃 ) 𝛿1(𝑁𝑁𝑃 )
𝜋 2 (𝑁𝑁𝑆 )𝛿2(𝑁𝑁𝑆)
𝜋 2 (𝑉𝐵𝑍 )𝛿2(𝑉𝐵𝑍 )
𝜋 2 (𝑁𝑁 )𝛿2(𝑁𝑁 )
𝜋 2 (𝑉𝐵 )𝛿2(𝑉𝐵)
𝜋2 (𝑉𝐵𝑁 )𝛿2(𝑉𝐵𝑁 )
𝜋 2 (𝑁𝑁𝑃 ) 𝛿2(𝑁𝑁𝑃)
𝜋 3 (𝑁𝑁𝑆 )𝛿3(𝑁𝑁𝑆)
𝜋 3 (𝑉𝐵𝑍 )𝛿3(𝑉𝐵𝑍 )
𝜋 3 (𝑁𝑁 )𝛿3(𝑁𝑁 )
𝜋 3 (𝑉𝐵 )𝛿3(𝑉𝐵 )
𝜋 3 (𝑉𝐵𝑁 )𝛿3(𝑉𝐵𝑁 )
𝜋 3 (𝑁𝑁𝑃 )𝛿3(𝑁𝑁𝑃 )
Read off the best sequence:
Start at max and backtrace
Efficiency
𝜋 0 (⋄ )𝛿0(⋄)
VBN
NNP
NN
VB
NNS
VBZ
…
𝜋 1 (𝑁𝑁𝑆 )𝛿1(𝑁𝑁𝑆)
𝜋 1 (𝑉𝐵𝑍 )𝛿1(𝑉𝐵𝑍 )
𝜋 1 (𝑁𝑁 )𝛿1(𝑁𝑁 )
𝜋 1 (𝑉𝐵 )𝛿1(𝑉𝐵)
Fed raises interest …
𝜋 1 (𝑉𝐵𝑁 )𝛿1(𝑉𝐵𝑁 )
𝜋 1 (𝑁𝑁𝑃 ) 𝛿1(𝑁𝑁𝑃 )
𝜋 2 (𝑁𝑁𝑆 )𝛿2(𝑁𝑁𝑆)
𝜋 2 (𝑉𝐵𝑍 )𝛿2(𝑉𝐵𝑍 )
𝜋 2 (𝑁𝑁 )𝛿2(𝑁𝑁 )
𝜋 2 (𝑉𝐵 )𝛿2(𝑉𝐵)
𝜋2 (𝑉𝐵𝑁 )𝛿2(𝑉𝐵𝑁 )
𝜋 2 (𝑁𝑁𝑃 ) 𝛿2(𝑁𝑁𝑃)
𝜋 3 (𝑁𝑁𝑆 )𝛿3(𝑁𝑁𝑆)
𝜋 3 (𝑉𝐵𝑍 )𝛿3(𝑉𝐵𝑍 )
𝜋 3 (𝑁𝑁 )𝛿3(𝑁𝑁 )
𝜋 3 (𝑉𝐵 )𝛿3(𝑉𝐵 )
𝜋 3 (𝑉𝐵𝑁 )𝛿3(𝑉𝐵𝑁 )
𝜋 3 (𝑁𝑁𝑃 )𝛿3(𝑁𝑁𝑃 )
Implementation Trick For higher-order (>1) HMMs, we need to facilitate dynamic programming
Allow scores to be computed locally without worrying about earlier structure of the trellis
Define one state for each tuple of tags E.g., for order 2, =
This corresponds to a constrained 1st order HMM over tag n-grams
1 0 1 2 1 2
0 1 1 2 1
1
0 1
1
1 0 1 2 1 11
1
2
( , ) ( , , , ,..., , ) ( | , ) ( |
( , , , , ,..., ,
)
, ) ( , | , ) ( | ,
( | ) ( |
( , , ,..., , )
)
( , )
)
n i i
n
i ii
n
i i i i i i ii
n
i ii
i
n n
i i
n
P t w P t t t t t w P t t t P w t
P t t t t t t t t w P t t t t P w t t
P s s P w s
P s s s s w P s w
1 2 1 2 1 1
1
Note:( | , ) ( , | , ) ( | )
( | ) ( | , ) ( | )i i i i i i i i i
i i i i i i i
P t t t P t t t t P s s
P w t P w t t P w s
Implementation Trick A constrained 1st order HMM over tag n-grams:
Constraints:
<,>s1 s2 sn
w1 w2 wn
s0
< , t1> < t1, t2> < tn-1, tn>
How Well Does It Work? Choose the most common tag
90.3% with a bad unknown word model 93.7% with a good one!
TnT (Brants, 2000): A carefully smoothed trigram tagger 96.7% on WSJ text
State-of-Art is approx. 97.3%
Noise in the data Many errors in the training and test corpora
Probably about 2% guaranteed errorfrom noise (on this data)
NN NN NNchief executive officer
JJ NN NNchief executive officer
JJ JJ NNchief executive officer
NN JJ NNchief executive officer
DT NN IN NN VBD NNS VBDThe average of interbank offered rates plummeted …
Option 4: Beam Search Adopt a simple pruning strategy in the Viterbi trellis:
A beam is a set of partial hypotheses: Keep top k (use a priority queue) Keep those within a factor of the best Keep widest variety or some combination of the above
Is this OK? Beam search works well in practice
Will give you the optimal answer if the beam is wide enough … and you need optimal answers to validate your beam search
,:0
,NNP:1
,VBN:1
NNP,NNS:2
NNP,VBZ:2
VBN,VBZ:2
VBN,NNS:2
NNS,NN:3
NNS,VB:3
VBZ,NN:3
VBZ,VB:3
Fed raises interest
HMMs: Basic Problems and Solutions
Given an input sequence and an HMM , Decoding / Tagging: Choose optimal tag sequence for using
Solution: Viterbi Algorithm
Evaluation / Computing the marginal probability: Compute , the marginal probability of using Solution: Forward Algorithm
Just use a sum instead of max in the Viterbi algorithm!
Re-estimation / Unsupervised or Semi-supervised Training: Adjust to maximize , the probability of an entire data set, either entirely or partially unlabeled, respectively Solution: Baum-Welch algorithm
Also known as the Forward-Backward algorithm Another example of EM
Speech Recognition Revisited
Given an input sequence of acoustic feature vectors and a (composed) HMM , Decoding / Tagging: Choose optimal word
(tag) sequence for feature sequence Solution: Viterbi Algorithm
Next
Better Tagging Features using Maxent Dealing with unknown words Adjacent words Longer-distance features
Soon: Named-Entity Recognition
Extra
Back to tags t for the moment (instead of states s). We have a generative model of tagged sentences:
We can turn this into a distribution over sentences by marginalizing out the tag sequences:
Problem: too many sequences! And beam search isn’t going to help this time. Why not?
1 2( , ) ( | , ) ( | )i i i i ii
P t w P t t t P w t
1 2( ) ( , ) ( | , ) ( | )i i i i it t i
P w P t w P t t t P w t
How’s the HMM as a LM?
How’s the HMM as a LM? POS tagging HMMs are terrible as LMs!
Don’t capture long-distance effects like a parser could Don’t capture local collocational effects like n-grams
But other HMM-like LMs can work very well
I bought an ice cream ___
The computer that I set up yesterday just ___
c2
w1 w2 wnSTART
cnc1