CS 562: Empirical Methods in Natural Language Processing
Nov 2010
Liang Huang ([email protected])
Unit 3: Structured Learning (part 2)
Discriminative Models: Perceptron & CRF
Thursday, November 18, 2010
Discriminative Models
A Quiz on Forest
Generative models
$$\hat{t} = \arg\max_t P(t \mid w) = \arg\max_t \frac{P(w, t)}{P(w)} = \arg\max_t P(w, t)$$
• more work to learn to generate w
• but can handle incomplete data: $P(w) = \sum_t P(w, t)$
The generative story

source-channel / HMM:
$$P(w, t) = P(t) \times P(w \mid t) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n) \times \prod_{i=1}^{n} P(w_i \mid t_i)$$

$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i)$$

DEFICIENT:
$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i) \times P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n)$$
(you generated $t_i$ twice!)
Perceptron is ...
• an extremely simple algorithm
• almost universally applicable
• and works very well in practice
vanilla perceptron (Rosenblatt, 1958)
structured perceptron (Collins, 2002)
the man bit the dog
DT NN VBD DT NN
Generic Perceptron
• online learning: one example at a time
• learning by doing
• find the best output under the current weights
• update weights at mistakes
[diagram: x_i → inference → z_i; compare with gold y_i; update weights w]
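The loop sketched in the diagram can be written down directly; `decode` (the inference box), the feature map `phi`, and the data format below are hypothetical stand-ins, not part of the original slides:

```python
def perceptron_train(data, decode, phi, epochs=5):
    """Generic structured perceptron: decode each example with the
    current weights, and update on mistakes with the feature difference."""
    w = {}  # sparse weight vector: feature -> weight
    for _ in range(epochs):
        for x, y in data:            # gold pair (x_i, y_i)
            z = decode(x, w)         # best output under current w
            if z != y:               # mistake: w += phi(x, y) - phi(x, z)
                for f, v in phi(x, y).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in phi(x, z).items():
                    w[f] = w.get(f, 0.0) - v
    return w
```

On a mistake the update moves the score of the gold output up and the predicted output down, which is all the algorithm ever does.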
Example: POS Tagging
• gold-standard: DT NN VBD DT NN
• the man bit the dog
• current output: DT NN NN DT NN
• the man bit the dog
• assume only two feature classes
• tag bigrams ti-1 ti
• word/tag pairs wi
• weights ++: (NN, VBD) (VBD, DT) (VBD→bit)
• weights --: (NN, NN) (NN, DT) (NN→bit)
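The six weight changes listed above fall out mechanically from w += Φ(x, y) − Φ(x, z); a minimal sketch with just the two feature classes (the tuple encoding of features is illustrative):

```python
def features(words, tags):
    """Count tag-bigram and word/tag features for one tagged sentence."""
    feats = {}
    for i in range(len(words)):
        if i > 0:                        # tag bigram (t_{i-1}, t_i)
            f = (tags[i - 1], tags[i])
            feats[f] = feats.get(f, 0) + 1
        f = (tags[i], words[i])          # word/tag pair (t_i -> w_i)
        feats[f] = feats.get(f, 0) + 1
    return feats

words = "the man bit the dog".split()
gold = "DT NN VBD DT NN".split()
pred = "DT NN NN DT NN".split()

delta = features(words, gold)            # + gold features
for f, v in features(words, pred).items():
    delta[f] = delta.get(f, 0) - v       # - predicted features
changed = {f: v for f, v in delta.items() if v != 0}
```

Shared features such as (DT, the) cancel, so only the six features where the gold and predicted tag sequences disagree get updated.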
Structured Perceptron
[diagram: x_i → inference → z_i; if z_i ≠ y_i, update w ← w + Φ(x_i, y_i) − Φ(x_i, z_i)]
Efficiency vs. Expressiveness
• the inference (argmax) must be efficient
• either the search space GEN(x) is small, or factored
• features must be local to y (but can be global to x)
• e.g. bigram tagger, but look at all input words (cf. CRFs)
inference: $z = \arg\max_{y \in \mathrm{GEN}(x)} w \cdot \Phi(x, y)$
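With features local to adjacent tag pairs, the argmax over GEN(x) is exact Viterbi; a sketch where `score(prev, tag, words, i)` is a hypothetical local scorer, free to inspect all of x but only the adjacent tags of y:

```python
def viterbi(words, tags, score):
    """Exact argmax for a bigram tagger: features are local in y
    (adjacent tag pairs) but may look at the whole input (words, i)."""
    n = len(words)
    # best[i][t] = (best prefix score ending in tag t at i, backpointer)
    best = [{t: (score(None, t, words, 0), None) for t in tags}]
    for i in range(1, n):
        col = {}
        for t in tags:
            col[t] = max((best[i - 1][p][0] + score(p, t, words, i), p)
                         for p in tags)
        best.append(col)
    t = max(tags, key=lambda u: best[-1][u][0])   # best final tag
    seq = [t]
    for i in range(n - 1, 0, -1):                 # follow backpointers
        t = best[i][t][1]
        seq.append(t)
    return seq[::-1]
```

Runtime is O(n |tags|²), so the search space is exponential but the factored argmax stays cheap.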
Averaged Perceptron
• more stable and accurate results
• approximation of voted perceptron (Freund & Schapire, 1999)
$$\bar{w} = \frac{1}{j+1} \sum_{j'=0}^{j} w_{j'}$$
(the average of the weight vectors seen after each example)
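A sketch of the averaging with a dense running sum; the lazy per-feature trick referenced below makes this efficient in practice, but the dense form is shorter. `decode` and `phi` are hypothetical stand-ins as before:

```python
def averaged_perceptron(data, decode, phi, epochs=5):
    """Averaged perceptron: return the mean of the weight vectors
    held after each example, via a running sum."""
    w, wsum, count = {}, {}, 0
    for _ in range(epochs):
        for x, y in data:
            z = decode(x, w)
            if z != y:                       # usual perceptron update
                for f, v in phi(x, y).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in phi(x, z).items():
                    w[f] = w.get(f, 0.0) - v
            for f, v in w.items():           # accumulate current w
                wsum[f] = wsum.get(f, 0.0) + v
            count += 1
    return {f: v / count for f, v in wsum.items()}
```

The inner accumulation costs O(|w|) per example here; the practical trick is to record, per feature, when it last changed and add the elapsed span lazily.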
Averaging Tricks
• Daume (2006, PhD thesis)
Do we need smoothing?
• smoothing is much easier in discriminative models
• just make sure for each feature template, its subset templates are also included
• e.g., to include (t0 w0 w-1) you must also include
• (t0 w0) (t0 w-1) (w0 w-1)
• and maybe also (t0 t-1) because t is less sparse than w
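The subset rule can be generated mechanically; a sketch in which atom names like "t0" are just illustrative labels for template components:

```python
from itertools import combinations

def backoff_templates(template):
    """All non-empty sub-templates of a feature template, so that a
    template like (t0, w0, w-1) is accompanied by all of its subsets."""
    subs = []
    for k in range(1, len(template) + 1):
        subs.extend(combinations(template, k))
    return subs
```

For ("t0", "w0", "w-1") the 2-element subsets are exactly the three pairs listed above; the 1-element subsets come along for free.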
Comparison with Other Models
from HMM to MEMM
HMM: joint distribution; MEMM: locally normalized (per-state conditional)
Label Bias Problem
• bias towards states with fewer outgoing transitions
• a problem with all locally normalized models
Conditional Random Fields
• globally normalized (no label bias problem)
• but training requires expected feature counts
• (related to the fractional counts in EM)
• need to use Inside-Outside algorithm (sum)
• Perceptron just needs Viterbi (max)
Experiments
Experiments: Tagging
• (almost) identical features from (Ratnaparkhi, 1996)
• trigram tagger: current tag ti, previous tags ti-1, ti-2
• current word wi and its spelling features
• surrounding words wi-1, wi+1, wi-2, wi+2, ...
Experiments: NP Chunking
• B-I-O scheme
• features:
• unigram model
• surrounding words and POS tags
Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
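The B-I-O tags above can be derived mechanically from NP chunk spans; a sketch assuming chunks are given as (start, end) token spans with end exclusive (this span representation is an assumption, not from the slides):

```python
def bio_encode(n, spans):
    """Encode NP chunk spans [(start, end), ...] (end exclusive)
    as B-I-O tags over n tokens: B opens a chunk, I continues it,
    O marks tokens outside any chunk."""
    tags = ["O"] * n
    for s, e in spans:
        tags[s] = "B"
        for i in range(s + 1, e):
            tags[i] = "I"
    return tags
```

This reduces chunking to tagging, so the same perceptron/CRF machinery applies unchanged.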
Experiments: NP Chunking
• results:
• (Sha and Pereira, 2003) trigram tagger
• voted perceptron: 94.09% vs. CRF: 94.38%
Other NLP Applications
• dependency parsing (McDonald et al., 2005)
• parse reranking (Collins)
• phrase-based translation (Liang et al., 2006)
• word segmentation
• ... and many many more ...
Theory
Vanilla Perceptron
[three figure-only slides illustrating vanilla perceptron updates]
Convergence Theorem
• Data is separable if and only if perceptron converges
• number of updates is bounded by $(R/\gamma)^2$
• $\gamma$ is the margin; $R = \max_i \|x_i\|$
• This result generalizes to the structured perceptron, with $R = \max_i \|\Phi(x_i, y_i) - \Phi(x_i, z_i)\|$
• Also in Collins' paper: theorems for non-separable cases and generalization bounds
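A quick empirical check of the bound on a tiny separable dataset; the points and labels are made up for illustration. With separator w* = (1, 0) the margin is γ = 1 and R = √5, so the bound allows at most 5 updates:

```python
def perceptron_updates(data):
    """Run the vanilla perceptron to convergence, counting updates."""
    w = [0.0, 0.0]
    updates = 0
    while True:
        mistakes = 0
        for x, y in data:
            if y * (w[0] * x[0] + w[1] * x[1]) <= 0:   # mistake
                w[0] += y * x[0]
                w[1] += y * x[1]
                updates += 1
                mistakes += 1
        if mistakes == 0:                              # a clean pass
            return w, updates

# separable 2-D data: label = sign of the first coordinate
data = [((2.0, 1.0), 1), ((1.5, -1.0), 1),
        ((-2.0, 0.5), -1), ((-1.0, -1.0), -1)]
w, updates = perceptron_updates(data)
```

The bound depends only on R and γ, not on the number of examples, which is the key to the structured generalization.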
Perceptron Conclusion
• a very simple framework that can work with many structured problems and that works very well
• all you need is (fast) 1-best inference
• much simpler than CRFs and SVMs
• can be applied to parsing, translation, etc.
• generalization bounds depend on separability
• not the (exponential) size of the search space
• limitation 1: features local on y (non-local on x)
• limitation 2: no probabilistic interpretation (vs. CRF)
From HMM to CRF
(Sutton and McCallum, 2007) -- recommended reading
Maximum likelihood

$$L = \log \prod_{j=1}^{N} P(t^j \mid w^j) = \log \prod_{j=1}^{N} \frac{1}{Z(w^j)} \exp \sum_{i=1}^{n} \lambda_i f_i(w^j, t^j)$$

$$= \log \prod_{j=1}^{N} \frac{\exp\left(\sum_{i=1}^{n} \lambda_i f_i(w^j, t^j)\right)}{\sum_{t} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(w^j, t)\right)}$$

$$= \sum_{j=1}^{N} \left[ \sum_{i=1}^{n} \lambda_i f_i(w^j, t^j) - \log \sum_{t} \exp \sum_{i=1}^{n} \lambda_i f_i(w^j, t) \right]$$
Optimizing likelihood

$$\frac{\partial L}{\partial \lambda_i} = \sum_{j=1}^{N} \left[ f_i(w^j, t^j) - E[f_i(w^j, t)] \right]$$

Gradient ascent:
$$\lambda_i \leftarrow \lambda_i + \alpha \sum_{j=1}^{N} \left[ f_i(w^j, t^j) - E[f_i(w^j, t)] \right]$$

Stochastic gradient ascent: for each $(w^j, t^j)$:
$$\lambda_i \leftarrow \lambda_i + \alpha \left[ f_i(w^j, t^j) - E[f_i(w^j, t)] \right]$$
Expected feature counts
$$E[f_{V,\text{loves}}] = \frac{1}{Z} \, \alpha[F] \, \mu(V, \text{loves}) \, \beta[G], \qquad Z = \alpha[\text{END}]$$

[diagram: trellis edge from state F to state G, with forward score α[F], edge potential μ(V, loves), and backward score β[G]]
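The expected-count formula above in code: forward-backward on a small dense trellis. The potentials ψ[i][p][t] stand in for μ, states are integer indices rather than named F and G, and the start convention (ψ[0][0][t]) is an assumption of this sketch:

```python
def expected_edge_counts(psi, n_tags, n):
    """Forward-backward on a tag trellis.  psi[i][p][t] is the potential
    of taking tag t at position i after tag p (psi[0][0][t] = start).
    Returns Z and the expected count of every edge (i, p, t)."""
    # forward: alpha[i][t] = total weight of prefixes ending in t at i
    alpha = [[0.0] * n_tags for _ in range(n)]
    for t in range(n_tags):
        alpha[0][t] = psi[0][0][t]
    for i in range(1, n):
        for t in range(n_tags):
            alpha[i][t] = sum(alpha[i - 1][p] * psi[i][p][t]
                              for p in range(n_tags))
    # backward: beta[i][t] = total weight of suffixes after (i, t)
    beta = [[0.0] * n_tags for _ in range(n)]
    for t in range(n_tags):
        beta[n - 1][t] = 1.0
    for i in range(n - 2, -1, -1):
        for t in range(n_tags):
            beta[i][t] = sum(psi[i + 1][t][u] * beta[i + 1][u]
                             for u in range(n_tags))
    Z = sum(alpha[n - 1][t] for t in range(n_tags))
    # E[edge] = alpha[prev] * mu(edge) * beta[next] / Z, as on the slide
    E = {(i, p, t): alpha[i - 1][p] * psi[i][p][t] * beta[i][t] / Z
         for i in range(1, n) for p in range(n_tags) for t in range(n_tags)}
    return Z, E
```

With uniform potentials the expected counts at each position sum to exactly 1, since every path uses exactly one edge per position; that makes a handy sanity check.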
CRF : SGD : Perceptron
• CRF = conditional random fields = structured maxent
• SGD = stochastic gradient descent = online CRF
• CRF => (online) => SGD => (Viterbi) => Perceptron
• update rules (very similar):
• CRF: batch, E[...]
• SGD: one example at a time, expectation E[...]
• Perceptron: one example at a time, and Viterbi