Discriminative Models in NLP - Stanford University
Discriminative Models in NLP
CS224N/Ling237, April 30, 2003
Overview
- Conditional log-linear (maximum entropy) models
  - Toy example
  - Parameter estimation
- Maximum entropy Markov models (discriminative sequence models) for IE
- Some theory of generative/discriminative models for classification
- Empirical comparison (word sense disambiguation and part-of-speech tagging)
Conditional Log-linear Models
- A class of models increasingly popular and successful in machine learning for natural language
- Not generative models, unlike n-gram models, HMMs, and Naive Bayes
- Diverse features can be defined, and the information from them is combined so as to best discriminate between the target classes
- These models have been applied both to single-instance classification (PP attachment, word sense disambiguation, etc.) and to sequence tasks (POS tagging, IE, parsing); we will see examples of some of these
Conditional Log-linear Models
- Given a set of training data (x_1, c_1), …, (x_n, c_n), define features f_j : X × C → R
- Learn a model of the form

  P(c | x) = exp(Σ_{j=1}^{m} λ_j f_j(x, c)) / Σ_{k=1}^{K} exp(Σ_{j=1}^{m} λ_j f_j(x, c_k))

- Now the question is how to choose the features (f_j) and the parameters (λ_j)
Conditional Log-linear Models: Example
- We are searching the Web for documents having to do with Bush's attitudes toward energy conservation (x = documents, c = 1/0)
- Three features useful for such a task might be:
  - f1(doc, c) = 1 iff "President" appears in doc and c = 1 (0 otherwise)
  - f2(doc, c) = 1 iff "conservation" appears in doc and c = 1 (0 otherwise)
  - f3(doc, c) = 1 iff "President" and "conservation" both appear in doc and c = 1 (0 otherwise)
- We can define as many features as we like, depending on the words in the document; features that are conjunctions or disjunctions of other features are often useful
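As a concrete sketch, the three indicator features and the normalized-exponential model can be coded directly. The weight values below are invented for illustration; only the feature definitions come from the slide.

```python
import math

def features(doc, c):
    """The three indicator features from the slide; doc is a set of words, c is 0 or 1."""
    f1 = 1.0 if "President" in doc and c == 1 else 0.0
    f2 = 1.0 if "conservation" in doc and c == 1 else 0.0
    f3 = 1.0 if "President" in doc and "conservation" in doc and c == 1 else 0.0
    return [f1, f2, f3]

def p_class_given_doc(doc, c, lambdas):
    """P(c|doc) = exp(sum_j lambda_j f_j(doc, c)), normalized over both classes."""
    scores = {k: math.exp(sum(l * f for l, f in zip(lambdas, features(doc, k))))
              for k in (0, 1)}
    return scores[c] / sum(scores.values())

doc = {"President", "conservation", "energy"}
lambdas = [0.5, 1.0, 2.0]   # hypothetical weights
print(p_class_given_doc(doc, 1, lambdas))   # exp(3.5)/(1 + exp(3.5)), about 0.97
```

With all three features firing, the class c = 1 gets score exp(0.5 + 1.0 + 2.0) against exp(0) for c = 0, so the model strongly prefers c = 1.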
Fitting the Model
- To find the parameters λ1, λ2, λ3, write out the conditional log-likelihood of the training data and maximize it:

  CLogLik(D) = Σ_{i=1}^{n} log P(c_i | x_i)

- The log-likelihood is concave and has a single maximum; use your favorite numerical optimization package
Fitting the Model: Generalized Iterative Scaling

- A simple optimization algorithm which works when the features are non-negative
- We need to define a slack feature to make the features sum to a constant M over all considered pairs from X × C
- Define

  M = max_{i,c} Σ_{j=1}^{m} f_j(x_i, c)

- Add the new feature

  f_{m+1}(x, c) = M − Σ_{j=1}^{m} f_j(x, c)
Generalized Iterative Scaling
- Compute the empirical expectation for all features:

  E_p̃[f_j] = (1/N) Σ_{i=1}^{N} f_j(x_i, c_i)

- Initialize λ_j = 0, j = 1 … m+1
- Repeat
  - Compute the feature expectations according to the current model:

    E_{p(t)}[f_j] = (1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} P(c_k | x_i) f_j(x_i, c_k)

  - Update the parameters:

    λ_j^(t+1) = λ_j^(t) + (1/M) log( E_p̃[f_j] / E_{p(t)}[f_j] )

- Until converged
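The loop above can be sketched in a few lines. This is a toy illustration with a single made-up binary feature, not an optimized implementation; the slack feature and the update rule follow the slide.

```python
import math

def gis(data, feats, classes, iters=500):
    """data: list of (x, c) pairs; feats: non-negative feature functions f(x, c)."""
    # Slack feature so the features sum to the constant M on every (x, c) pair.
    M = max(sum(f(x, k) for f in feats) for x, _ in data for k in classes)
    all_feats = feats + [lambda x, c: M - sum(f(x, c) for f in feats)]
    lam = [0.0] * len(all_feats)
    n = len(data)

    def p(c, x):
        scores = [math.exp(sum(l * f(x, k) for l, f in zip(lam, all_feats)))
                  for k in classes]
        return scores[classes.index(c)] / sum(scores)

    # Empirical expectations E~[f_j] over the training data.
    emp = [sum(f(x, c) for x, c in data) / n for f in all_feats]
    for _ in range(iters):
        # Model expectations E_p[f_j] under the current parameters.
        mod = [sum(p(k, x) * f(x, k) for x, _ in data for k in classes) / n
               for f in all_feats]
        # GIS update; a feature with zero empirical count is left untouched
        # (its weight should go to minus infinity in the limit).
        lam = [l + math.log(e / md) / M if e > 0 else l
               for l, e, md in zip(lam, emp, mod)]
    return lam, p

# One binary feature: fires when x == 1 and c == 1.
data = [(1, 1), (1, 1), (1, 0), (0, 0)]
lam, p = gis(data, [lambda x, c: float(x == 1 and c == 1)], classes=[0, 1])
print(p(1, 1))   # converges toward 2/3, the empirical P(c=1 | x=1)
```

On this dataset the fitted model matches the empirical conditional P(c=1 | x=1) = 2/3, as maximum conditional likelihood requires; for x = 0 the two classes have identical features, so the model stays at 1/2.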
Conditional Log-linear and Maximum Entropy Models
- These models are also called maximum entropy models
- The model having maximum entropy and satisfying the constraints

  E_p[f_j] = E_p̃[f_j] for all j

  is the same model as the log-linear model of the form we saw before that maximizes the conditional likelihood of the training data
Discriminative Sequence Models: Maximum Entropy Markov Models for IE
A. McCallum, D. Freitag, F. Pereira (2000). MEMMs have the following graphical model:

- The distribution of the next state given the previous state and the current observation is estimated using a maximum entropy model
- Some slides by Fernando Pereira will be used
s1 s2 s3
o1 o2 o3
Application: Q-A Pairs from FAQ
- Task: given an FAQ document, i.e. a sequence of lines of text, classify each line as head, question, answer, or tail.
- Possible HMM model: four states (head, question, answer, tail).
  - States emit every token separately (very sparse)
  - States emit features of lines that are less sparse and helpful for disambiguation
Difficulties with Generative Models: Generating Multiple Features
- Generating lines
- Apparently non-independent but very difficult to model otherwise
- Easy with a conditional log-linear model – reverse the arcs

[Diagram: a state s emitting line features such as "blank", "prev is blank", …]

  P(s' | s, lines) = exp(Σ_{j=1}^{m} λ_j f_j(lines, s, s')) / Σ_{k=1}^{K} exp(Σ_{j=1}^{m} λ_j f_j(lines, s, s_k))
What is Next
- Theory: a general ML perspective on generative and discriminative models
- Examples: traffic lights and word sense disambiguation
- The case of part-of-speech tagging
The Classification Problem
Given a training set of i.i.d. samples T = (X1, Y1), …, (Xn, Yn) of input and class variables drawn from an unknown distribution D(X, Y), estimate a function ĥ(X) that predicts the class from the input variables.

The goal is to come up with a hypothesis ĥ(X) with minimum expected loss (usually 0-1 loss):

  err(ĥ) = Σ_{⟨X,Y⟩ ∈ Ω} D(X, Y) δ(Y ≠ ĥ(X))

Under 0-1 loss, the hypothesis with minimum expected loss is the Bayes optimal classifier:

  h(X) = argmax_{Y} D(Y | X)
Approaches to Solving Classification Problems - I
1. Generative. Try to estimate the probability distribution of the data, D(X, Y):
- specify a parametric model family P_θ(X, Y), θ ∈ Θ
- choose the parameters by maximum likelihood on the training data:

  L(T, θ) = Π_{i=1}^{n} P_θ(X_i, Y_i)

- estimate conditional probabilities by Bayes rule:

  P_θ̂(Y | X) = P_θ̂(X, Y) / P_θ̂(X)

- classify new instances to the most probable class Y according to P_θ̂(Y | X)
Approaches to Solving Classification Problems - I
2. Discriminative. Try to estimate the conditional distribution D(Y | X) from data:
- specify a parametric model family P_θ(Y | X), θ ∈ Θ
- estimate the parameters by maximum conditional likelihood of the training data:

  CL(T, θ) = Π_{i=1}^{n} P_θ(Y_i | X_i)

- classify new instances to the most probable class Y according to P_θ̂(Y | X)

3. Discriminative, distribution-free. Try to estimate ĥ(X) directly from data so that its expected loss will be minimized
Axes for comparison of different approaches
- Asymptotic accuracy
- Accuracy for limited training data
- Speed of convergence to the best hypothesis
- Complexity of training
- Modeling ease
Generative-Discriminative Pairs
Definition: if a generative and a discriminative parametric model family can represent the same set of conditional probability distributions P_θ(Y | X), they are a generative-discriminative pair.
Example: Naïve Bayes and Logistic Regression
Graphical model: Y → X1, Y → X2, with Y ∈ {1, …, K} and X1, X2 ∈ {0, 1}

  P_NB(Y = i | X1, X2) = P(Y=i) P(X1 | Y=i) P(X2 | Y=i) / Σ_{k=1}^{K} P(Y=k) P(X1 | Y=k) P(X2 | Y=k)

  P_LR(Y = i | X1, X2) = exp(λ_{i1} X1 + λ_{i2} X2 + λ_{i0}) / Σ_{k=1}^{K} exp(λ_{k1} X1 + λ_{k2} X2 + λ_{k0})
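The pairing can be checked numerically: starting from arbitrary Naive Bayes parameters (the probability values below are made up), the class-conditional log-odds give logistic-regression weights that produce exactly the same posterior. A minimal sketch:

```python
import math

K = 2
prior = [0.6, 0.4]        # P(Y = i), invented
px1 = [0.7, 0.2]          # P(X1 = 1 | Y = i), invented
px2 = [0.1, 0.8]          # P(X2 = 1 | Y = i), invented

def nb_posterior(x1, x2):
    """P_NB(Y = i | X1, X2) by Bayes rule with the NB factorization."""
    joint = [prior[i]
             * (px1[i] if x1 else 1 - px1[i])
             * (px2[i] if x2 else 1 - px2[i]) for i in range(K)]
    z = sum(joint)
    return [j / z for j in joint]

# NB -> LR mapping: lambda_i1 is the log odds of X1 under class i, and the
# bias lambda_i0 absorbs the prior and the X = 0 factors.
lam = [[math.log(px1[i] / (1 - px1[i])),
        math.log(px2[i] / (1 - px2[i])),
        math.log(prior[i] * (1 - px1[i]) * (1 - px2[i]))] for i in range(K)]

def lr_posterior(x1, x2):
    """P_LR(Y = i | X1, X2) in the softmax form from the slide."""
    scores = [math.exp(lam[i][0] * x1 + lam[i][1] * x2 + lam[i][2])
              for i in range(K)]
    z = sum(scores)
    return [s / z for s in scores]

# The two posteriors agree on every input.
for x1 in (0, 1):
    for x2 in (0, 1):
        assert all(abs(a - b) < 1e-9
                   for a, b in zip(nb_posterior(x1, x2), lr_posterior(x1, x2)))
```

Because each softmax score equals the corresponding NB joint probability term by term, the equality is exact up to floating point; logistic regression simply does not require its weights to come from any such factorization.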
Comparison of Naïve Bayes and Logistic Regression
- The NB assumption that the features are independent given the class is not made by logistic regression
- The logistic regression model is more general because it allows a larger class of probability distributions for the features given the classes:

  P_NB(X1, X2 | Y=i) = P(X1 | Y=i) P(X2 | Y=i)

  P_LR(X1, X2 | Y=i) = P(X1, X2) · exp(λ_{i1} X1 + λ_{i2} X2 + λ_{i0}) / Σ_{k=1}^{K} exp(λ_{k1} X1 + λ_{k2} X2 + λ_{k0})
Example: Traffic Lights
Reality:

- Lights working: P(g, r, w) = 3/7, P(r, g, w) = 3/7
- Lights broken: P(r, r, b) = 1/7

NB model: the class variable "Working?" generates the two observations NS and EW.

- Model assumptions false!
- JL and CL estimates differ…

  JL: P(w) = 6/7        CL: P̃(w) = ε
      P(r|w) = 1/2          P̃(r|w) = 1/2
      P(r|b) = 1            P̃(r|b) = 1
Joint Traffic Lights

- Lights working: the jointly-trained NB model assigns probability 3/14 to each of the four (NS, EW) configurations
- Lights broken: 2/14 to (r, r) and 0 to the rest
- For (g, r) and (r, g), the conditional likelihood of working is 1
- For (r, r), the conditional likelihood of working is > 1/2: incorrectly assigned!
- Accuracy: 6/7
Conditional Traffic Lights

- Lights working: under conditional training, each of the four configurations gets probability ε/4
- Lights broken: (r, r) gets 1 − ε
- The conditional likelihood of working for (g, r) and (r, g) is still 1
- (r, r) is now correctly assigned to broken
- Accuracy: 7/7; the conditional likelihood is perfect (→ 1) while the joint likelihood is low (→ 0)
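The arithmetic of the example can be verified with a short script. This is only a sketch; the data layout mirrors the slide (seven days, two lights that can be "g" or "r", and a hidden working/broken state).

```python
# Data: 7 days. Working lights: (green, red) x3, (red, green) x3.
# Broken lights: (red, red) x1.
data = [("g", "r", "w")] * 3 + [("r", "g", "w")] * 3 + [("r", "r", "b")]

# Joint-likelihood (MLE) Naive Bayes estimates, as on the slide.
p_w = 6 / 7                      # P(working)
p_r_given_w = 1 / 2              # P(red | working), per light
p_r_given_b = 1.0                # P(red | broken), per light

def nb_joint(ns, ew, state):
    """NB joint probability P(state) * P(ns | state) * P(ew | state)."""
    p_state = p_w if state == "w" else 1 - p_w
    p_red = p_r_given_w if state == "w" else p_r_given_b
    pr = lambda colour: p_red if colour == "r" else 1 - p_red
    return p_state * pr(ns) * pr(ew)

# Posterior that the lights are working given both lights red:
num = nb_joint("r", "r", "w")
post_working = num / (num + nb_joint("r", "r", "b"))
print(post_working)   # about 0.6 > 1/2, so (r, r) is misclassified as working

# Accuracy of the jointly-trained model over the 7 days:
correct = sum((nb_joint(ns, ew, "w") > nb_joint(ns, ew, "b")) == (s == "w")
              for ns, ew, s in data)
print(correct, "/", len(data))   # 6 of 7, as on the slide
```

The (r, r) posterior comes out to (3/14) / (3/14 + 2/14) = 3/5, matching the "> 1/2" claim; conditional training, which drives P̃(w) to ε, flips that one decision and reaches 7/7.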
Comparison of Naïve Bayes and Logistic Regression
|  | Naïve Bayes | Logistic Regression |
|---|---|---|
| Model assumptions | independence of features given the class | linear log-odds (see below) |
| Advantages | faster convergence, uses information in P(X), faster training | more robust and accurate because of fewer assumptions |
| Disadvantages | large bias if the independence assumptions are very wrong | harder parameter estimation problem; ignores information in P(X) |
| Better at | training speed, convergence | accuracy |

  linear log-odds: log( P(X1, X2 | Y=i) / P(X1, X2 | Y=j) ) is linear in the features
Some Experimental Comparisons
[Two plots of error vs. training-data size comparing LR and NB: Klein & Manning 2002 (WSD "line" and "hard" data) and Ng & Jordan 2002 (15 datasets from UCI ML)]
Part-of-Speech Tagging: Useful Features

- In most cases, information about surrounding words/tags is a strong disambiguator: "The long fenestration was tiring."
- Useful features:
  - tags of previous/following words, e.g. P(NN|JJ) = 0.45, P(VBP|JJ) = 0.0005
  - identity of the word being tagged and of surrounding words
  - suffix/prefix for unknown words, hyphenation, capitalization
  - longer-distance features
  - others we haven't figured out yet
HMM Tagging Models - I
t1 → t2 → t3
↓    ↓    ↓
w1   w2   w3

Independence assumptions:
- t_i is independent of t_1 … t_{i-2} and w_1 … w_{i-1} given t_{i-1}
- words are independent given their tags

States can be single tags, pairs of successive tags, or variable-length sequences of last tags.

Unknown words (Weischedel et al. 93): the tag state generates features of the unknown word such as capitalization, suffix, and hyphenation.
HMM Tagging Models - Brants 2000
- Highly competitive with other state-of-the-art models
- Trigram HMM with smoothed transition probabilities
- The capitalization feature becomes part of the state: each tag state is split into two, e.g. NN → <NN, cap>, <NN, not cap>
- Suffix features for unknown words:

  P(w | tag) = P(suffix(w) | tag) ≈ P̂(suffix) P̃(tag | suffix) / P̂(tag)

  P̃(tag | suffix_n) = λ_n P̂(tag | suffix_n) + λ_{n-1} P̂(tag | suffix_{n-1}) + … + λ_1 P̂(tag)

  (suffixes of decreasing length suffix_n, suffix_{n-1}, …, suffix_1 back off toward the tag prior)
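The suffix interpolation can be sketched as follows. The corpus counts and interpolation weights below are made up, and the step where Brants derives the λ values from the training data is omitted; this shows only the interpolation itself.

```python
from collections import Counter, defaultdict

def suffix_tag_tables(vocab_counts, max_suffix=4):
    """vocab_counts: {(word, tag): count}. Builds count tables for
    P-hat(tag | suffix_k), k = 0 (empty suffix = tag prior) .. max_suffix."""
    tables = [defaultdict(Counter) for _ in range(max_suffix + 1)]
    for (word, tag), n in vocab_counts.items():
        for k in range(max_suffix + 1):
            suffix = word[len(word) - k:] if k else ""
            tables[k][suffix][tag] += n
    return tables

def p_tag_given_suffix(word, tag, tables, lambdas):
    """Interpolate P-hat(tag | suffix_k) over suffixes of decreasing length;
    lambdas are given longest-suffix first, ending with the tag prior."""
    total = 0.0
    for k, lam in zip(range(len(tables) - 1, -1, -1), lambdas):
        counts = tables[k].get(word[len(word) - k:] if k else "", Counter())
        denom = sum(counts.values())
        if denom:
            total += lam * counts[tag] / denom
    return total

# Invented counts for three seen (word, tag) pairs, then an unseen word.
counts = {("running", "VBG"): 3, ("jumping", "VBG"): 2, ("king", "NN"): 1}
tables = suffix_tag_tables(counts)
print(p_tag_given_suffix("sleeping", "VBG", tables, [0.4, 0.3, 0.2, 0.07, 0.03]))
```

For the unseen word "sleeping", the longer suffixes ("ping", "ing") point strongly to VBG, so the interpolated estimate favors VBG over NN even though the word itself was never observed.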
CMM Tagging Models
t1 t2 t3
w1 w2 w3
Independence assumptions:
- t_i is independent of t_1 … t_{i-2} and w_1 … w_{i-1} given t_{i-1}
- t_i is independent of all following observations
- no independence assumptions on the observation sequence
- dependence of the current tag on previous and future observations can be added; overlapping features of the observation can be taken as predictors
MEMM Tagging Models - II

Ratnaparkhi (1996):
- local distributions are estimated using maximum entropy models
- used previous two tags, current word, previous two words, next two words
- suffix, prefix, hyphenation, and capitalization features for unknown words

| Model | Overall Accuracy | Unknown Words |
|---|---|---|
| MEMM (T. et al 2003) | 97 | 88 |
| MEMM (Ratn 1996) | 96.63 | 85.56 |
| HMM (Brants 2000) | 96.7 | 85.5 |
HMM vs CMM - I: Johnson (2001)

[Table comparing three tagging-model structures over t_j, w_j (the graphical-model diagrams did not survive extraction); the reported accuracies are 95.5%, 94.4%, and 95.3%]
HMM vs CMM - II
- The per-state conditioning of the CMM has been observed to exhibit label bias (Bottou, Lafferty) and observation bias (Klein & Manning)

| Model | Accuracy |
|---|---|
| HMM | 91.23 |
| CMM | 89.22 |
| CMM+ | 90.44 |

[Diagrams: HMM and CMM graphical models over t1 … t3 and w1 … w3]

- Unobserving words with unambiguous tags improved performance significantly
Summary of Tagging Review
- For tagging, the change from a generative to a discriminative model does not by itself result in great improvement
- One profits from discriminative models when specifying dependence on overlapping features of the observation, such as spelling, suffix analysis, etc.
- The CMM model allows integration of rich features of the observations, but suffers strongly from assuming independence from following observations; this effect can be relieved by adding dependence on following words
- This additional power (of the CMM, CRF, and perceptron models) has been shown to result in improvements in accuracy
- The higher accuracy of discriminative models comes at the price of much slower training