Machine Learning in Natural Language Processing

Fernando Pereira, University of Pennsylvania
NASSLLI, June 2002

Thanks to: William Bialek, John Lafferty, Andrew McCallum, Lillian Lee, Lawrence Saul, Yves Schabes, Stuart Shieber, Naftali Tishby
Introduction

Why ML in NLP?
- Examples are easier to create than rules
- Rule writers miss low-frequency cases
- Many factors are involved in language interpretation
- People do it:
  - AI
  - Cognitive science
- Let the computer do it:
  - Moore's law
  - cheap storage
  - lots of data
Classification
- Document topic: politics, business, national, environment
- Word sense: treasury bonds vs. chemical bonds
Analysis
- Tagging: e.g. "causing/VBG symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ"
- Parsing: e.g. the lexicalized tree for "workers dumped sacks into a bin":
  S(dumped) → NP-C(workers) VP(dumped)
  VP(dumped) → V(dumped) NP-C(sacks) PP(into)
  PP(into) → P(into) NP-C(bin)
  NP-C(bin) → D(a) N(bin)
Language Modeling
- Is this a likely English sentence?
- Disambiguate a noisy transcription:
  "It's easy to wreck a nice beach" vs. "It's easy to recognize speech"
- $P(\text{colorless green ideas sleep furiously}) \,/\, P(\text{furiously sleep ideas green colorless}) \approx 2 \times 10^5$
Inference
- Translation: "ligações covalentes" → "covalent bonds"; "obrigações do tesouro" → "treasury bonds"
- Information extraction (identify acquirer and acquired):

  Sara Lee to Buy 30% of DIM

  Chicago, March 3 - Sara Lee Corp said it agreed to buy a 30 percent interest in Paris-based DIM S.A., a subsidiary of BIC S.A., at a cost of about 20 million dollars. DIM S.A., a hosiery manufacturer, had sales of about 2 million dollars.

  The investment includes the purchase of 5 million newly issued DIM shares valued at about 5 million dollars, and a loan of about 15 million dollars, it said. The loan is convertible into an additional 16 million DIM shares, it noted.

  The proposed agreement is subject to approval by the French government, it said.
Machine Learning Approach
- Algorithms that write programs
- Specify:
  - the form of the output programs
  - an accuracy criterion
- Input: a set of training examples
- Output: a program that performs as accurately as possible on the training examples
- But will it work on new examples?
Fundamental Questions
- Generalization: is the learned program useful on new examples?
  - Statistical learning theory: quantifiable tradeoffs between the number of examples, the complexity of the program class, and the generalization error
- Computational tractability: can we find a good program quickly?
  - If not, can we find a good approximation?
- Adaptation: can the program learn quickly from new evidence?
  - Information-theoretic analysis: relationship between adaptation and compression
Learning Tradeoffs
[Figure: error of the best program vs. program class complexity; training error keeps decreasing with complexity, while error on new examples first falls and then rises, illustrating rote learning and overfitting.]
Machine Learning Methods
- Classifiers
  - Document classification
  - Disambiguation
- Structured models
  - Tagging
  - Parsing
  - Extraction
- Unsupervised learning
  - Generalization
  - Structure induction

Jargon
- Instance: event type of interest
  - A document and its class
  - A sentence and its analysis
  - …
- Supervised learning: learn a classification function from hand-labeled instances
- Unsupervised learning: exploit correlations to organize the training instances
- Generalization: how well does it work on unseen data?
- Features: map an instance to a set of elementary events
Classification Ideas
- Represent instances by feature vectors
  - Content
  - Context
- Learn a function from feature vectors to:
  - a class, or
  - a class-probability distribution
- Redundancy is our friend: many weak clues
Structured Model Ideas
- Interdependent decisions
  - Successive parts of speech
  - Parsing/generation steps
  - Lexical choice
    - Parsing
    - Translation
- Combining decisions
  - Sequential decisions
  - Generative models
  - Constraint satisfaction
Unsupervised Learning Ideas
- Clustering: class induction
- Latent variables: "I'm thinking of sports" → more sporty words
- Distributional regularities: know words by the company they keep
- Data compression
- Infer dependencies among variables: structure learning

Methodological Detour
- Empiricist/information-theoretic view: words combine following their associations in previous material
- Rationalist/generative view: words combine according to a formal grammar in the class of possible natural-language grammars
Chomsky's Challenge to Empiricism

(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

"… It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." (Chomsky 1957)
The Return of Empiricism
- Empiricist methods work:
  - Markov models can capture a surprising fraction of the unpredictability in language
  - Statistical information retrieval methods beat the alternatives
  - Statistical parsers are more accurate than competitors based on rationalist methods
  - Machine-learning, statistical techniques come close to human performance in part-of-speech tagging and sense disambiguation
- Just engineering tricks?
Unseen Events
- Chomsky's implicit assumption: any model must assign zero probability to unseen events
  - naïve estimation of Markov model probabilities from frequencies
  - no latent (hidden) events
- Any such model overfits the data: many events are likely to be missing from any finite sample
- ∴ The learned model cannot generalize to unseen data
- ∴ Support for poverty-of-the-stimulus arguments
The Science of Modeling
- Probability estimates can be smoothed to accommodate unseen events
- Redundancy in language supports effective statistical inference procedures
  ∴ the stimulus is richer than it might seem
- Statistical learning theory: the generalization ability of a model class can be measured independently of the model representation
- Beyond Markov models: the effects of latent conditioning variables can be estimated from data
Richness of the Stimulus
- Information about: mutual information
  - between linguistic and non-linguistic events
  - between parts of a linguistic event
- Global coherence: "banks can now sell stocks and bonds"
- Word statistics carry more information than it might seem
  - Markov models in speech recognition
  - Success of the bag-of-words model in information retrieval
  - Statistical machine translation
- How far can these methods go?

Questions
- Generative or discriminative?
- Structured models: local classification or global constraint satisfaction?
- Does unsupervised learning help?
Classification

Generative or Discriminative?
- Generative models
  - Estimate the instance-label distribution $p(x, y)$
- Discriminative models
  - Estimate the label-given-instance distribution $p(y \mid x)$
  - Minimize an upper bound on the training error $\sum_i [\![ f(x_i) \neq y_i ]\!]$
Simple Generative Model
- Binary naïve Bayes: represent instances by sets of binary features
  - Does a word occur in the document?
  - …
- Finite predefined set of classes
- Graphical model: class $C$ with feature children $F_1, F_2, F_3, F_4, \ldots$

$$P(F_1,\ldots,F_n,C) = P(C)\prod_i P(F_i \mid C) \qquad
P(c \mid F_1,\ldots,F_n) \propto P(c)\prod_i P(F_i \mid c)$$
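As a concrete illustration, here is a minimal binary naïve Bayes trainer and classifier in Python. The token-list document representation and the add-one smoothing are illustrative assumptions, not part of the slides.

```python
from collections import Counter, defaultdict
import math

def train_nb(docs, labels):
    """Estimate P(c) and P(f|c) for binary naive Bayes, with add-one smoothing."""
    classes = Counter(labels)
    feat_counts = defaultdict(Counter)   # class -> feature -> #docs containing it
    vocab = set()
    for doc, c in zip(docs, labels):
        for f in set(doc):               # binary features: presence only
            feat_counts[c][f] += 1
            vocab.add(f)
    prior = {c: n / len(labels) for c, n in classes.items()}
    cond = {c: {f: (feat_counts[c][f] + 1) / (classes[c] + 2) for f in vocab}
            for c in classes}
    return prior, cond, vocab

def classify_nb(doc, prior, cond, vocab):
    """Pick argmax_c log P(c) + sum over features of log P(f|c)."""
    present = set(doc) & vocab
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for f in vocab:
            p = cond[c][f]
            score += math.log(p if f in present else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best
```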
Generative Claims
- Easy to train: just count
- Language modeling: probability of observed forms
- More robust:
  - small training sets
  - label noise
- Full advantage of probabilistic methods

Discriminative Models
- Define a functional form for $p(y \mid x; \theta)$
- Binary classification: define a discriminant function $y = \operatorname{sign} h(x;\theta)$
- Adjust the parameters $\theta$ to maximize the probability of the training labels / minimize the training error
Simple Discriminative Forms
- Linear discriminant function:
  $$h(x;\theta_0,\theta_1,\ldots,\theta_n) = \theta_0 + \sum_i \theta_i f_i(x)$$
- Logistic form:
  $$P(+1 \mid x) = \frac{1}{1 + \exp(-h(x;\theta))}$$
- Multi-class exponential form (maxent):
  $$h(x,y;\theta_0,\theta_1,\ldots,\theta_n) = \theta_0 + \sum_i \theta_i f_i(x,y) \qquad
    P(y \mid x;\theta) = \frac{\exp h(x,y;\theta)}{\sum_{y'} \exp h(x,y';\theta)}$$
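A small Python sketch of these forms, assuming dict-valued features and weights (an illustrative convention, not from the slides); subtracting the maximum score before exponentiating is a standard numerical-stability trick.

```python
import math

def score(feats, weights):
    """Linear score h = sum_k theta_k f_k(x); feats and weights are dicts."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def p_logistic(feats, weights, bias=0.0):
    """Logistic form: P(+1|x) = 1 / (1 + exp(-h(x; theta)))."""
    return 1.0 / (1.0 + math.exp(-(bias + score(feats, weights))))

def p_maxent(x, labels, feature_fn, weights):
    """Maxent form: P(y|x) = exp h(x,y) / sum over y' of exp h(x,y')."""
    scores = {y: score(feature_fn(x, y), weights) for y in labels}
    m = max(scores.values())              # stabilize the exponentials
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}
```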
Discriminative Claims
- Focus modeling resources on the instance-to-label mapping
- Avoid restrictive probabilistic assumptions on the instance distribution
- Optimize what you care about
- Higher accuracy

Classification Tasks
- Document categorization
  - News categorization
  - Message filtering
  - Web page selection
- Tagging
  - Named entity
  - Part of speech
  - Sense disambiguation
- Syntactic decisions
  - Attachment
Document Models
- Binary vector: $f_t(d) \equiv [t \in d]$
- Frequency vector:
  $$\mathrm{tf}(d,t) = \text{number of occurrences of } t \text{ in } d \qquad
    \mathrm{idf}(t) = \frac{|D|}{|\{d \in D : t \in d\}|}$$
  raw frequency: $r_t(d) = \mathrm{tf}(d,t)$; TF*IDF: $x_t(d) = \log(1+\mathrm{tf}(d,t)) \log(1+\mathrm{idf}(t))$
- N-gram language model:
  $$p(d \mid c) = p(|d| \mid c) \prod_{i=1}^{|d|} p(d_i \mid d_1 \ldots d_{i-1}; c) \qquad
    p(d_i \mid d_1 \ldots d_{i-1}; c) \approx p(d_i \mid d_{i-n} \ldots d_{i-1}; c)$$
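A minimal sketch of the TF*IDF weighting defined above, under the assumption that documents are given as token lists:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF*IDF weights x_t(d) = log(1 + tf(d,t)) * log(1 + idf(t)), per the slide."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: n_docs / df[t] for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: math.log(1 + tf[t]) * math.log(1 + idf[t]) for t in tf})
    return vectors

docs = [["treasury", "bonds", "rise"], ["chemical", "bonds", "break"]]
print(tfidf_vectors(docs))
```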
Term Weighting and Feature Selection
- Select or weigh the most informative features
- TF*IDF: adjust a term's weight by how document-specific the term is
- Feature selection:
  - Remove low, unreliable counts
  - Mutual information
  - Information gain
  - Other statistics

Documents vs. Vectors (1)
- Many documents have the same binary or frequency vector
- Document multiplicity must be handled correctly in probability models
- Binary naïve Bayes:
  $$p(f \mid c) = \prod_t \left[ f_t\, p(t \mid c) + (1-f_t)(1-p(t \mid c)) \right]$$
- Multiplicity is not recoverable
Documents vs. Vectors (2)
- Document probability (unigram language model):
  $$p(d \mid c) = p(|d| \mid c) \prod_{i=1}^{|d|} p(d_i \mid c)$$
- Raw frequency vector probability, with $r_t = \mathrm{tf}(d,t)$ and $L = \sum_t r_t$:
  $$p(r \mid c) = p(L \mid c)\, L! \prod_t \frac{p(t \mid c)^{r_t}}{r_t!}$$

Documents vs. Vectors (3)
- Unigram model:
  $$p(c \mid d) = \frac{p(c)\, p(|d| \mid c) \prod_{i=1}^{|d|} p(d_i \mid c)}
                      {\sum_{c'} p(c')\, p(|d| \mid c') \prod_{i=1}^{|d|} p(d_i \mid c')}$$
- Vector model:
  $$p(c \mid r) = \frac{p(c)\, p(L \mid c) \prod_t p(t \mid c)^{r_t}}
                      {\sum_{c'} p(c')\, p(L \mid c') \prod_t p(t \mid c')^{r_t}}$$
Linear Classifiers
- Embedding into a high-dimensional vector space
  - Geometric intuitions and techniques
  - Easier separability
    - Increase the dimension with interaction terms
    - Nonlinear embeddings (kernels)
- The Swiss Army knife of classification

Kinds of Linear Classifiers
- Naïve Bayes
- Exponential models
- Large-margin classifiers
  - Support vector machines (SVM)
  - Boosting
- Online methods
  - Perceptron
  - Winnow
Learning Linear Classifiers
- Rocchio:
  $$w_k = \max\left(0,\; \frac{\sum_{x \in c} x_k}{|c|} - \frac{\sum_{x \notin c} x_k}{|D| - |c|}\right)$$
- Widrow-Hoff:
  $$w \leftarrow w - 2\eta (w \cdot x_i - y_i)\, x_i$$
- (Balanced) winnow: $y = \operatorname{sign}(w^+ \cdot x - w^- \cdot x - \theta)$
  - positive error: $w^+ \leftarrow \alpha w^+,\; w^- \leftarrow \beta w^-$, with $\alpha > 1 > \beta > 0$
  - negative error: $w^+ \leftarrow \beta w^+,\; w^- \leftarrow \alpha w^-$
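A sketch of balanced winnow with the multiplicative updates above. Restricting each update to the features active in the misclassified instance is a common convention assumed here, not spelled out on the slide.

```python
def balanced_winnow(data, alpha=1.5, beta=0.5, theta=1.0, epochs=10):
    """Balanced winnow. data: list of (x, y), x a dict of non-negative
    feature values, y in {-1, +1}."""
    feats = {f for x, _ in data for f in x}
    w_pos = {f: 1.0 for f in feats}
    w_neg = {f: 1.0 for f in feats}
    dot = lambda w, x: sum(w[f] * v for f, v in x.items())
    for _ in range(epochs):
        for x, y in data:
            y_hat = 1 if dot(w_pos, x) - dot(w_neg, x) - theta > 0 else -1
            if y_hat != y:
                # promote or demote only the features active in x (assumption)
                up, down = (alpha, beta) if y > 0 else (beta, alpha)
                for f in x:
                    w_pos[f] *= up
                    w_neg[f] *= down
    return w_pos, w_neg
```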
Linear Classification
- Linear discriminant function:
  $$h(x) = w \cdot x + b = \sum_k w_k x_k + b$$
[Figure: two classes of points (x and o) separated by a hyperplane with normal $w$ and offset $b$.]
Margin
- Instance margin: $\gamma_i = y_i (w \cdot x_i + b)$
- Normalized (geometric) margin:
  $$\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right)$$
- Training set margin $\gamma$: the minimum margin over the training instances
[Figure: separating hyperplane with instance margins $\gamma_i$, $\gamma_j$ and training set margin $\gamma$.]
Perceptron Algorithm
- Given:
  - a linearly separable training set $S$
  - a learning rate $\eta > 0$

  w ← 0; b ← 0; R = max_i ||x_i||
  repeat
    for i = 1...N
      if y_i (w · x_i + b) ≤ 0
        w ← w + η y_i x_i
        b ← b + η y_i R²
  until there are no mistakes
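A direct Python transcription of the primal perceptron pseudocode, assuming dense list-valued instances and labels in {-1, +1}:

```python
import math

def perceptron(data, eta=1.0, max_epochs=100):
    """Primal perceptron as on the slide. data: list of (x, y) pairs,
    x a list of floats, y in {-1, +1}; assumes linear separability."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    R = max(math.sqrt(sum(v * v for v in x)) for x, _ in data)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y * R * R
                mistakes += 1
        if mistakes == 0:
            break
    return w, b
```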
Duality
- The final hypothesis is a linear combination of the training points:
  $$w = \sum_i \alpha_i y_i x_i, \qquad \alpha_i \geq 0$$
- Dual perceptron algorithm:

  α ← 0; b ← 0; R = max_i ||x_i||
  repeat
    for i = 1...N
      if y_i (∑_j α_j y_j x_j · x_i + b) ≤ 0
        α_i ← α_i + 1
        b ← b + y_i R²
  until there are no mistakes
Why Maximize the Margin?
- There is a constant $c$ such that for any data distribution $D$ with support in a ball of radius $R$ and any training sample $S$ of size $N$ drawn from $D$:
  $$P\left[\operatorname{err}(h) \leq \frac{c}{N}\left(\frac{R^2}{\gamma^2}\log^2 N + \log\frac{1}{\delta}\right)\right] \geq 1-\delta$$
  where $\gamma$ is the margin of $h$ on $S$.
Canonical Hyperplanes
- Multiple representations for the same hyperplane: $(\lambda w, \lambda b)$ for $\lambda > 0$
- Canonical hyperplane: functional margin $= 1$
- Geometric margin for a canonical hyperplane:
  $$\gamma = \frac{1}{2}\left(\frac{w}{\|w\|} \cdot x^+ - \frac{w}{\|w\|} \cdot x^-\right)
          = \frac{1}{2\|w\|}\left(w \cdot x^+ - w \cdot x^-\right)
          = \frac{1}{\|w\|}$$
Convex Optimization (1)
- Constrained optimization problem:
  $$\min_{w \in \Omega \subseteq \mathbb{R}^n} f(w) \quad \text{subject to} \quad g_i(w) \leq 0,\; h_j(w) = 0$$
- Lagrangian function:
  $$L(w,\alpha,\beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$$
- Dual problem:
  $$\max_{\alpha,\beta}\; \inf_{w \in \Omega} L(w,\alpha,\beta) \quad \text{subject to} \quad \alpha_i \geq 0$$
Convex Optimization (2)
- Kuhn-Tucker conditions: $f$ convex; $g_i$, $h_j$ affine ($h(w) = Aw - b$)
- A solution $w^*, \alpha^*, \beta^*$ must satisfy:
  $$\frac{\partial L(w^*,\alpha^*,\beta^*)}{\partial w} = 0 \qquad
    \frac{\partial L(w^*,\alpha^*,\beta^*)}{\partial \beta} = 0$$
  $$\alpha_i^* g_i(w^*) = 0 \qquad g_i(w^*) \leq 0 \qquad \alpha_i^* \geq 0$$
- Complementarity condition: a parameter is non-zero iff its constraint is active
Maximizing the Margin (1)
- Given a separable training sample:
  $$\min_{w,b} \|w\| = \sqrt{w \cdot w} \quad \text{subject to} \quad y_i(w \cdot x_i + b) \geq 1$$
- Lagrangian:
  $$L(w,b,\alpha) = \frac{1}{2} w \cdot w - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$$
  $$\frac{\partial L(w,b,\alpha)}{\partial w} = w - \sum_i y_i \alpha_i x_i = 0 \qquad
    \frac{\partial L(w,b,\alpha)}{\partial b} = \sum_i y_i \alpha_i = 0$$
Maximizing the Margin (2)
- Dual Lagrangian at the stationary point:
  $$W(\alpha) = L(w^*,b^*,\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j\, x_i \cdot x_j$$
- Dual maximization problem:
  $$\max_\alpha W(\alpha) \quad \text{subject to} \quad \alpha_i \geq 0,\; \sum_i y_i \alpha_i = 0$$
- Maximum margin weight vector:
  $$w^* = \sum_i y_i \alpha_i^* x_i \quad \text{with margin} \quad
    \gamma = \frac{1}{\|w^*\|} = \Big( \sum_{i \in \mathrm{sv}} \alpha_i^* \Big)^{-1/2}$$
Building the Classifier
- Computing the offset (from the primal constraints):
  $$b^* = -\frac{\max_{y_i=-1} w^* \cdot x_i + \min_{y_i=1} w^* \cdot x_i}{2}$$
- Decision function:
  $$h(x) = \operatorname{sgn}\left( \sum_i y_i \alpha_i^* \, x_i \cdot x + b^* \right)$$
Consequences
- The complementarity condition yields the support vectors:
  $$\alpha_i \left[ y_i (w^* \cdot x_i + b) - 1 \right] = 0 \qquad
    \alpha_i > 0 \Rightarrow w^* \cdot x_i + b = y_i$$
- A functional margin of 1 implies minimum geometric margin $\gamma = 1 / \|w^*\|$

General SVM Form
- Margin maximization for an arbitrary kernel $K$:
  $$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j K(x_i, x_j)
    \quad \text{subject to} \quad \alpha_i \geq 0,\; \sum_i y_i \alpha_i = 0$$
- Decision rule:
  $$h(x) = \operatorname{sgn}\left( \sum_i y_i \alpha_i^* K(x_i, x) + b^* \right)$$
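A sketch of the kernel decision rule above; the multipliers alpha_i* are assumed to come from some dual solver, which is not shown here.

```python
def svm_decision(x, support, alphas, b, kernel):
    """SVM decision rule h(x) = sgn(sum_i y_i alpha_i K(x_i, x) + b).
    support: list of (x_i, y_i) pairs; alphas: learned alpha_i* (assumed given)."""
    s = sum(a * y * kernel(xi, x) for (xi, y), a in zip(support, alphas)) + b
    return 1 if s > 0 else -1

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, degree=2, c=1.0):
    """Polynomial kernel: interaction terms without an explicit embedding."""
    return (linear_kernel(u, v) + c) ** degree
```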
Soft Margin
- Handles the non-separable case
- Primal problem (2-norm slack):
  $$\min_{w,b,\xi}\; w \cdot w + C \sum_i \xi_i^2 \quad \text{subject to} \quad
    y_i (w \cdot x_i + b) \geq 1 - \xi_i,\; \xi_i \geq 0$$
- Dual problem:
  $$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j
    \left( x_i \cdot x_j + \frac{1}{C} \delta_{ij} \right)
    \quad \text{subject to} \quad \alpha_i \geq 0,\; \sum_i y_i \alpha_i = 0$$
Conditional Maxent Model
- Model form:
  $$p(y \mid x; \Lambda) = \frac{\exp \sum_k \lambda_k f_k(x,y)}{Z(x;\Lambda)} \qquad
    Z(x;\Lambda) = \sum_y \exp \sum_k \lambda_k f_k(x,y)$$
- Useful properties:
  - Multi-class
  - May use different features for different classes
  - Training is a convex optimization
Duality
- Maximizing the conditional log-likelihood
  $$\tilde\Lambda = \operatorname{argmax}_\Lambda \sum_i \log p(y_i \mid x_i; \Lambda)$$
  and maximizing the conditional entropy
  $$\tilde p = \operatorname{argmax}_p \sum_i \Big[ -\sum_y p(y \mid x_i) \log p(y \mid x_i) \Big]$$
  subject to the constraints
  $$\sum_i f_k(x_i, y_i) = \sum_i \sum_y p(y \mid x_i) f_k(x_i, y)$$
  yield the same solution: $\tilde p(y \mid x) = p(y \mid x; \tilde\Lambda)$
Relationship to (Binary) Logistic Discrimination
$$p(+1 \mid x) = \frac{\exp \sum_k \lambda_k f_k(x,+1)}
                     {\exp \sum_k \lambda_k f_k(x,+1) + \exp \sum_k \lambda_k f_k(x,-1)}
             = \frac{1}{1 + \exp -\sum_k \lambda_k \left( f_k(x,+1) - f_k(x,-1) \right)}
             = \frac{1}{1 + \exp -\sum_k \lambda_k g_k(x)}$$

Relationship to Linear Discrimination
- Decision rule:
  $$\operatorname{sign} \log \frac{p(+1 \mid x)}{p(-1 \mid x)} = \operatorname{sign} \sum_k \lambda_k g_k(x)$$
- Bias term: a parameter for an "always on" feature
- Question: what is the relationship to other trainers for linear discriminant functions?
Solution Techniques (1)
- Generalized iterative scaling (GIS)
- Parameter updates:
  $$\lambda_k \leftarrow \lambda_k + \frac{1}{C} \log
    \frac{\sum_i f_k(x_i, y_i)}{\sum_i \sum_y p(y \mid x_i; \Lambda) f_k(x_i, y)}$$
- Requires that the features add up to a constant independent of instance and label (add a slack feature):
  $$\sum_k f_k(x_i, y) = C \quad \forall i, y$$
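A GIS sketch under the slide's feature-sum assumption (a slack feature is assumed to have been added already, so that the features of every (x, y) sum to the same constant C); the feature_fn interface and data layout are illustrative.

```python
import math
from collections import defaultdict

def gis(data, labels, feature_fn, n_iters=100):
    """Generalized iterative scaling. data: list of (x, y) pairs;
    feature_fn(x, y) -> dict of feature values, assumed to sum to C."""
    C = sum(feature_fn(*data[0]).values())        # constant feature sum
    weights = defaultdict(float)
    observed = defaultdict(float)                 # sum_i f_k(x_i, y_i)
    for x, y in data:
        for k, v in feature_fn(x, y).items():
            observed[k] += v
    for _ in range(n_iters):
        expected = defaultdict(float)             # sum_i sum_y p(y|x_i) f_k(x_i, y)
        for x, _ in data:
            scores = {y: math.exp(sum(weights[k] * v
                                      for k, v in feature_fn(x, y).items()))
                      for y in labels}
            z = sum(scores.values())
            for y in labels:
                p = scores[y] / z
                for k, v in feature_fn(x, y).items():
                    expected[k] += p * v
        for k in observed:
            if expected[k] > 0:
                weights[k] += math.log(observed[k] / expected[k]) / C
    return weights
```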
Solution Techniques (2)
- Improved iterative scaling (IIS)
- Parameter updates $\lambda_k \leftarrow \lambda_k + \delta_k$, where each $\delta_k$ solves
  $$\sum_i f_k(x_i, y_i) = \sum_i \sum_y p(y \mid x_i; \Lambda) f_k(x_i, y)\, e^{\delta_k f^\#(x_i, y)}
    \qquad f^\#(x,y) = \sum_k f_k(x,y)$$
- For binary features this reduces to solving a polynomial with positive coefficients
- Reduces to GIS if the feature sum is constant
Deriving IIS (1)
- Conditional log-likelihood:
  $$\ell(\Lambda) = \sum_i \log p(y_i \mid x_i; \Lambda)$$
- Log-likelihood update:
  $$\begin{aligned}
  \ell(\Lambda+\Delta) - \ell(\Lambda)
    &= \sum_i \Delta \cdot f(x_i, y_i) - \sum_i \log \frac{Z(x_i; \Lambda+\Delta)}{Z(x_i; \Lambda)} \\
    &= \sum_i \Delta \cdot f(x_i, y_i) - \sum_i \log \sum_y \frac{e^{(\Lambda+\Delta) \cdot f(x_i,y)}}{Z(x_i;\Lambda)} \\
    &= \sum_i \Delta \cdot f(x_i, y_i) - \sum_i \log \sum_y p(y \mid x_i; \Lambda)\, e^{\Delta \cdot f(x_i,y)} \\
    &\geq \sum_i \Delta \cdot f(x_i, y_i) + N - \sum_i \sum_y p(y \mid x_i; \Lambda)\, e^{\Delta \cdot f(x_i,y)}
       \quad (\text{since } \log x \leq x - 1) \\
    &\equiv A(\Delta)
  \end{aligned}$$
Deriving IIS (2)
- By Jensen's inequality:
  $$A(\Delta) \geq \sum_i \Delta \cdot f(x_i, y_i) + N
    - \sum_i \sum_y p(y \mid x_i; \Lambda) \sum_k \frac{f_k(x_i,y)}{f^\#(x_i,y)}\, e^{\delta_k f^\#(x_i,y)}
    \equiv B(\Delta)$$
- Maximize the lower bound on the update:
  $$\frac{\partial B(\Delta)}{\partial \delta_k}
    = \sum_i f_k(x_i, y_i) - \sum_i \sum_y p(y \mid x_i; \Lambda) f_k(x_i, y)\, e^{\delta_k f^\#(x_i,y)}$$
Solution Techniques (3)
- GIS is very slow if the slack variable takes large values
- IIS is faster, but still problematic
- Recent suggestion: use standard convex optimization techniques
  - e.g. conjugate gradient
  - Some evidence of faster convergence
Gaussian Prior
- Log-likelihood gradient:
  $$\frac{\partial \ell(\Lambda)}{\partial \lambda_k}
    = \sum_i f_k(x_i, y_i) - \sum_i \sum_y p(y \mid x_i; \Lambda) f_k(x_i, y) - \frac{\lambda_k}{\sigma_k^2}$$
- Modified IIS update $\lambda_k \leftarrow \lambda_k + \delta_k$, where
  $$\sum_i f_k(x_i, y_i)
    = \sum_i \sum_y p(y \mid x_i; \Lambda) f_k(x_i, y)\, e^{\delta_k f^\#(x_i,y)}
      + \frac{\lambda_k + \delta_k}{\sigma_k^2}
    \qquad f^\#(x,y) = \sum_k f_k(x,y)$$
Instance Representation
- Fixed-size instance (PP attachment): binary features
  - Word identity
  - Word class
- Variable-size instance (document classification)
  - Word identity
  - Word relative frequency in the document

Enriching Features
- Word n-grams
- Sparse word n-grams
- Character n-grams (noisy transcriptions: speech, OCR)
- Unknown-word features: suffixes, capitalization
- Feature combinations (cf. n-grams)
[Cartoon caption: "I understood each and every word you said but not the order in which they appeared."]
Structured Models: Finite State

Structured Model Applications
- Language modeling
- Story segmentation
- POS tagging
- Information extraction (IE)
- (Shallow) parsing

Structured Models
- Assign a labeling to a sequence:
  - Story segmentation
  - POS tagging
  - Named entity extraction
  - (Shallow) parsing
Constraint Satisfaction in Structured Models
- Train to minimize the labeling loss:
  $$\hat\theta = \operatorname{argmin}_\theta \sum_i \mathrm{Loss}(x_i, y_i \mid \theta)$$
- Computing the best labeling:
  $$\operatorname{argmin}_y \mathrm{Loss}(x, y \mid \hat\theta)$$
- Efficient minimization requires:
  - A common currency for local labeling decisions
  - An efficient algorithm to combine the decisions
Local Classification Models
- Train to minimize the per-decision loss in context:
  $$\hat\theta = \operatorname{argmin}_\theta \sum_i \sum_{0 \leq j < |x_i|}
    \mathrm{loss}(y_{i,j} \mid x_i, y_i(j); \theta)$$
- Apply by guessing the context and finding each lowest-loss label:
  $$\operatorname{argmin}_{y_j} \mathrm{loss}(y_j \mid x, \hat y(j); \hat\theta)$$
Structured Model Claims
- Constraint satisfaction:
  - Principled
  - Probabilistic interpretation allows model composition
  - Efficient optimal decoding
- Local classification:
  - Wider range of models
  - More efficient training
  - Heuristic decoding comparable to pruning in global models
Example: Language Modeling
- n-gram (order n-1 Markov) model:
  $$P(w_k \mid w_1 \ldots w_{k-1}) \approx P(w_k \mid w_{k-n+1} \ldots w_{k-1})$$
- Example: character n-grams of increasing order (Shannon, Lucky):

  0  XFOML RXKHRJFFJUJ ZLPWCFWKCYJ
  1  OCRO HLI RGWR NMIELWIS EU LL
  2  ON IE ANTSOUTINYS ARE T INCTORE ST
  3  IN NO IST LAT WHEY CRATICT FROURE
  4  ABOVERY UPONDULTS WELL THE CODERST
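A minimal bigram (n = 2) model sketch; the add-alpha smoothing is an illustrative choice (smoothing of unseen events is discussed later), not part of this slide.

```python
import math
from collections import Counter, defaultdict

def train_bigram(sentences, alpha=0.1):
    """Bigram model P(w_k | w_{k-1}) with add-alpha smoothing (assumption)."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for s in sentences:
        words = ["<s>"] + s + ["</s>"]
        unigrams.update(words[:-1])            # counts of conditioning contexts
        for prev, cur in zip(words, words[1:]):
            bigrams[prev][cur] += 1
    vocab_size = len(unigrams) + 1
    def prob(prev, cur):
        return (bigrams[prev][cur] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

def logprob(prob, sentence):
    """Log probability of a sentence under the chain of bigram factors."""
    words = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(prob(p, c)) for p, c in zip(words, words[1:]))
```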
Markov's Unreasonable Effectiveness
- Entropy estimates for English:

  model                                bits/char
  compress                             4.43
  word trigrams (Brown et al. 92)      1.75
  human prediction (Cover & King 78)   1.34

- Local word relations dominate the statistics (Jelinek). Word-prediction example (each row: a model rank, followed by candidate words at that rank for successive positions; the actual words occur at the ranks shown):

  1    The are to know the issues necessary role
  2    This will have this problems data thing
  2    One the understand these the information that
  7    Please need use problem people point
  9    We insert all tools issues
  98   resolve old
  1641 important
Limits of Markov Models
- No dependency structure
- Likelihoods are based on sequencing, not dependency
  (same word-prediction example as above: content words such as "resolve" at rank 98 and "important" at rank 1641 are poorly predicted from the local sequence)
Unseen Events (1)
- What's the probability of unseen events?
- Bias forces nonzero probabilities for some unseen events
- Typical bias: tie the probabilities of related events
  - specific unseen event → general seen event: eat pineapple → eat _
  - event decomposition: event → event1 ∧ event2: eat pineapple → eat _ ∧ _ pineapple
  - factoring via latent variables:
    $$P(\text{eat} \mid \text{pineapple}) \approx \sum_C P(\text{eat} \mid C) \times P(C \mid \text{pineapple})$$
Unseen Events (2)
- Discount the estimates of seen events
- Use the leftover probability for unseen events
- How to allocate the leftover?
  - Back off from the unseen event to less specific seen events: n-gram to (n-1)-gram
  - Hypothesize a hidden cause for unseen events: latent variable models
  - Relate the unseen event to distributionally similar seen events
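One simple instance of leftover allocation is linear interpolation of a specific estimate with a less specific one; a sketch (the mixture weight is illustrative):

```python
def interpolated_prob(w, context, unigram, bigram, lam=0.7):
    """Linear interpolation of bigram and unigram estimates, one way to
    reserve probability mass for unseen events.
    unigram: dict word -> P(word); bigram: dict (prev, word) -> P(word|prev)."""
    return lam * bigram.get((context, w), 0.0) + (1 - lam) * unigram.get(w, 0.0)
```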
Important Detour: Latent Variable Models

Expectation-Maximization (EM)
- Latent (hidden) variable models: $p(y, x, z \mid \Lambda)$, with
  $$p(y, x \mid \Lambda) = \sum_z p(y, x, z \mid \Lambda)$$
- Examples:
  - Mixture models
  - Class-based models (hidden classes)
  - Hidden Markov models
Maximizing Likelihood
- Data log-likelihood:
  $$D = \{(x_1,y_1),\ldots,(x_N,y_N)\} \qquad
    L(D \mid \Lambda) = \sum_i \log p(x_i, y_i) = \sum_{x,y} \tilde p(x,y) \log p(x,y \mid \Lambda)$$
  where $\tilde p(x,y) = |\{i : x_i = x,\, y_i = y\}| / N$
- Find the parameters that maximize the (log-)likelihood:
  $$\hat\Lambda = \operatorname{argmax}_\Lambda \sum_{x,y} \tilde p(x,y) \log p(x,y \mid \Lambda)$$
Convenient Lower Bounds (1)
- Convex function: for $0 \leq \alpha \leq 1$,
  $$f(\alpha x_0 + (1-\alpha) x_1) \leq \alpha f(x_0) + (1-\alpha) f(x_1)$$
- Jensen's inequality: if $f$ is convex and $p$ is a probability density,
  $$f\Big( \sum_x p(x)\, x \Big) \leq \sum_x p(x) f(x)$$
[Figure: a convex curve with the chord between $f(x_0)$ and $f(x_1)$ lying above it.]
Convenient Lower Bounds (2)
[Figure: log-likelihood $L(D \mid \lambda)$ as a function of $\lambda$, with successive lower bounds; alternating E steps ($E_0$, $E_1$) and M steps ($M_0$) climb the likelihood surface.]
Auxiliary Function
- Find a convenient non-negative function that lower-bounds the likelihood increase:
  $$L(D \mid \Lambda') - L(D \mid \Lambda) \geq Q(\Lambda', \Lambda) \geq 0$$
- Maximize the lower bound:
  $$\Lambda_{i+1} = \operatorname{argmax}_{\Lambda'} Q(\Lambda', \Lambda_i)$$
Comments
- The likelihood keeps increasing, but:
  - Can get stuck in a local maximum (or saddle point!)
  - Can oscillate between different local maxima with the same log-likelihood
- If maximizing the auxiliary function is too hard, find some $\Lambda$ that increases the likelihood: generalized EM (GEM)
- The sum over hidden variable values can be exponential if not done carefully (and sometimes cannot be avoided)
Example: Mixture Model
- Base distributions: $p_i(y)$, $1 \leq i \leq m$
- Mixture coefficients: $\lambda_i \geq 0$, $\sum_{i=1}^m \lambda_i = 1$
- Mixture distribution:
  $$p(y \mid \Lambda) = \sum_i \lambda_i p_i(y)$$
Auxiliary Quantities
- Mixture coefficient $\lambda_i$ = prior probability of being in class $i$
- Joint probability: $p(c, y \mid \Lambda) = \lambda_c\, p_c(y)$
- Auxiliary function:
  $$Q(\Lambda', \Lambda) = \sum_y \tilde p(y) \sum_c p(c \mid y, \Lambda)
    \log \frac{p(y, c \mid \Lambda')}{p(y, c \mid \Lambda)}$$
Solution
- E step:
  $$C_i = \sum_y \tilde p(y)\, \frac{\lambda_i p_i(y)}{\sum_j \lambda_j p_j(y)}$$
- M step:
  $$\lambda_i \leftarrow \frac{C_i}{\sum_j C_j}$$
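A sketch of this EM loop for mixture coefficients with fixed base distributions; the uniform initialization and the toy two-coin example are illustrative.

```python
def em_mixture(data, base, n_iters=50):
    """EM for mixture coefficients, as on the slide.
    data: list of observed y values; base: list of functions p_i(y)."""
    m = len(base)
    lam = [1.0 / m] * m                            # uniform start (assumption)
    for _ in range(n_iters):
        C = [0.0] * m
        for y in data:
            mix = sum(l * p(y) for l, p in zip(lam, base))
            for i in range(m):
                C[i] += lam[i] * base[i](y) / mix  # E step: posterior counts
        total = sum(C)
        lam = [c / total for c in C]               # M step: renormalize
    return lam

# Example: two coins with known biases, unknown mixing proportion
heads = lambda y: 0.9 if y == "H" else 0.1
tails = lambda y: 0.2 if y == "H" else 0.8
print(em_mixture(list("HHTHHHTH"), [heads, tails]))
```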
More Finite-State Models

Example: Information Extraction
- Given: the types of entities and relationships we are interested in
  - People, places, organizations, dates, amounts, materials, processes, …
  - Employed by, located in, used for, arrived when, …
- Find all entities and relationships of the given types in the source material
- Collect them in a suitable database
IE Example
[Figure: "Nance, who is also a paid consultant to ABC News, said …", annotated with a person entity ("Nance"), a person-descriptor phrase ("a paid consultant to ABC News"), an organization ("ABC News"), an employee relation between the descriptor and the organization, and co-reference between the person and the descriptor.]
- Relies on:
  - Syntactic structure
  - Phrase classification
IE Methods
- Partial matching:
  - Hand-built patterns
  - Automatically trained hidden Markov models
  - Cascaded finite-state transducers
- Parsing-based:
  - Parse the whole text:
    - Shallow parser (chunking)
    - Automatically induced grammar
  - Classify phrases and phrase relations as the desired entities and relationships
Global Constraint Models
- Train to minimize the labeling loss:
  $$\hat\theta = \operatorname{argmin}_\theta \sum_i \mathrm{Loss}(x_i, y_i \mid \theta)$$
- Computing the best labeling:
  $$\operatorname{argmin}_y \mathrm{Loss}(x, y \mid \hat\theta)$$
- Efficient minimization requires:
  - A common currency for local labeling decisions
  - A dynamic programming algorithm to combine the decisions

Local Classification Models
- Train to minimize the per-symbol loss in context:
  $$\hat\theta = \operatorname{argmin}_\theta \sum_i \sum_{0 \leq j < |x_i|}
    \mathrm{loss}(y_{i,j} \mid x_i, y_i(j); \theta)$$
- Apply by guessing the context and finding each lowest-loss label:
  $$\operatorname{argmin}_{y_j} \mathrm{loss}(y_j \mid x, \hat y(j); \hat\theta)$$
Structured Model Claims
- Global constraint:
  - Principled
  - Probabilistic interpretation allows model composition
  - Efficient optimal decoding
- Local classifier:
  - Wider range of models
  - More efficient training
  - Heuristic decoding comparable to pruning in global models

Generative vs. Discriminative
- Hidden Markov models (HMMs): generative, global
- Conditional exponential models (MEMMs, CRFs): discriminative, global
- Boosting, winnow: discriminative, local

Generative Models
- A stochastic process that generates instance-label pairs
  - Process structure
  - Process parameters
- (Hypothesize the structure)
- Estimate the parameters from training data

Model Structure
- Decompose the generation of instances into elementary steps
- Define dependencies between the steps
- Parameterize the dependencies
- Useful descriptive language: graphical models
Binary Naïve Bayes
- Represent instances by sets of binary features
  - Does a word occur in the document?
  - …
- Finite predefined set of classes
- Graphical model: class $C$ with feature children $F_1, F_2, F_3, F_4, \ldots$
  $$P(F_1,\ldots,F_n,C) = P(C)\prod_i P(F_i \mid C) \qquad
    P(c \mid F_1,\ldots,F_n) \propto P(c)\prod_i P(F_i \mid c)$$
Discrete Hidden Markov Model
- Instances: symbol sequences; labels: class sequences
- Graphical model: a chain $C_0 \to C_1 \to C_2 \to \cdots$ with an emission $X_i$ from each $C_i$
  $$P(X, C) = P(C_0)\, P(X_0 \mid C_0) \prod_i P(C_i \mid C_{i-1})\, P(X_i \mid C_i)$$
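A direct transcription of this joint probability; the two-state tagger numbers below are made up for illustration.

```python
def hmm_joint(start, trans, emit, xs, cs):
    """P(X, C) = P(c_0) P(x_0|c_0) * prod_i P(c_i|c_{i-1}) P(x_i|c_i)."""
    p = start[cs[0]] * emit[cs[0]].get(xs[0], 0.0)
    for i in range(1, len(xs)):
        p *= trans[cs[i - 1]][cs[i]] * emit[cs[i]].get(xs[i], 0.0)
    return p

# Hypothetical two-state tagger
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"ideas": 0.5, "sleep": 0.1}, "V": {"ideas": 0.1, "sleep": 0.6}}
print(hmm_joint(start, trans, emit, ["ideas", "sleep"], ["N", "V"]))
```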
Generating Multiple Features
- Instances: sequences of feature sets
  - Word identity
  - Word properties (e.g. spelling, capitalization)
- Labels: class sequences
- Graphical model: a class chain $C_0 \to C_1 \to \cdots$ in which each $C_i$ emits several features $F_{i,0}, F_{i,1}, F_{i,2}, \ldots$
- Limitation: conditionally independent features

Independence or Intractability
- Trees are good: each node has a single immediate ancestor, and the joint probability is computed in linear time
- But that forces the features to be conditionally independent given the class
- Unrealistic:
  - Suffixes and capitalization
  - "San" and "Francisco" in a document
Score Card
✓ No independence assumptions
✓ Richer features: combinations of existing features
− Optimization problem for the parameters
− Limited probabilistic interpretation
− Insensitive to the input distribution
Information Extraction with HMMs
- Parameters = P(s|s'), P(o|s) for all states in S = {s1, s2, …}
- Observations: words
- Training: maximize the probability of the observations (+ prior)
- For IE, states indicate the "database field"
  [Seymore & McCallum '99; Freitag & McCallum '99]

Problems with HMMs
1. We would prefer a richer representation of text: multiple overlapping features, whole chunks of text
   - Example line features: length of line; line is centered; percent of non-alphabetics; total amount of white space; line contains two verbs; line begins with a number; line is grammatically a question
   - Example word features: identity of word; word is in all caps; word ends in "-tion"; word is part of a noun phrase; word is in bold font; word is on left hand side of page; word is under node X in WordNet
2. HMMs are generative models. Generative models do not easily handle overlapping, non-independent features. We would prefer a conditional model: P({s…}|{o…}).
Solution: Conditional Model
- Hidden Markov model: P(s|s'), P(o|s)
- Maximum entropy Markov model: P(s|o, s'), represented by an exponential model
- For the time being, capture the dependency on s' with |S| independent functions P_s'(s|o)
- Each state contains a "next-state classifier" black box that, given the next observation, produces a probability distribution over the possible next states, P_s'(s|o)
Two Sequence Models
- HMM: transitions P(s|s') and emissions P(o|s); MEMM: conditional next-state distributions P(s|o, s')
[Figure: graphical models; in the HMM, arrows go from s_{t-1} to s_t and from s_t to o_t; in the MEMM, arrows go from s_{t-1} and o_t into s_t.]
- Standard belief propagation: forward-backward procedure
- Viterbi and Baum-Welch follow naturally
Transition Features
- Model P_s'(s|o) in terms of multiple arbitrary overlapping (binary) features
- Example observation predicates:
  - o is the word "apple"
  - o is capitalized
  - o is on a left-justified line
- A feature f depends on both a predicate b and a destination state s:
  $$f_{\langle b,s \rangle}(o, s') = \begin{cases} 1 & \text{if } b(o) \text{ is true and } s' = s \\ 0 & \text{otherwise} \end{cases}$$
Next-State Classifier
- Per-state conditional maxent model:
  $$P_{s'}(s \mid o) = \frac{1}{Z(o, s')} \exp\left( \sum_{\langle b,q \rangle}
    \lambda_{\langle b,q \rangle} f_{\langle b,q \rangle}(o, s) \right)$$
- Training: each state model is trained independently from labeled sequences
Example: Q-A Pairs from a FAQ

  X-NNTP-Poster: NewsHound v1.33
  Archive-name: acorn/faq/part2
  Frequency: monthly

  2.6) What configuration of serial cable should I use?

  Here follows a diagram of the necessary connections for common terminal
  programs to work properly. They are as far as I know the informal standard
  agreed upon by commercial comms software developers for the Arc.

  Pins 1, 4, and 8 must be connected together inside the 9 pin plug. This
  is to avoid the well known serial port chip bugs. The modem's DCD (Data
  Carrier Detect) signal has been re-routed to the Arc's RI (Ring Indicator);
  most modems broadcast a software RING signal anyway, and even then it's
  really necessary to detect it for the modem to answer the call.

  2.7) The sound from the speaker port seems quite muffled. How can I get
  unfiltered sound from an Acorn machine?

  All Acorn machines are equipped with a sound filter designed to remove
  high frequency harmonics from the sound output. To bypass the filter, hook
  into the Unfiltered port. You need to have a capacitor. Look for LM324 (chip
  39) and hook the capacitor like this:
Experimental Data
- 38 files belonging to 7 UseNet FAQs
- Procedure: for each FAQ, train on one file, test on the others; average.
- Example labeled lines:

  <head> X-NNTP-Poster: NewsHound v1.33
  <head> Archive-name: acorn/faq/part2
  <head> Frequency: monthly
  <head>
  <question> 2.6) What configuration of serial cable should I use?
  <answer>
  <answer> Here follows a diagram of the necessary connection
  <answer> programs to work properly. They are as far as I know
  <answer> agreed upon by commercial comms software developers fo
  <answer>
  <answer> Pins 1, 4, and 8 must be connected together inside
  <answer> is to avoid the well known serial port chip bugs. The
Features in Experiments
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30
Models Tested
- ME-Stateless: a single maximum entropy classifier applied to each line independently
- TokenHMM: a fully connected HMM with four states, one for each line category, each of which generates individual tokens (groups of alphanumeric characters and individual punctuation characters)
- FeatureHMM: identical to TokenHMM, except that the lines in a document are first converted to sequences of features
- MEMM: maximum entropy Markov model
Results

  Learner        Segmentation precision   Segmentation recall
  ME-Stateless   0.038                    0.362
  TokenHMM       0.276                    0.140
  FeatureHMM     0.413                    0.529
  MEMM           0.867                    0.681
Label Bias Problem
- Example (after Bottou '91): an FSA with two branches, 0→1→2→3 reading "r i b" and 0→4→5→6 reading "r o b"
- Per-state normalization gives
  $$P(1,2 \mid ro) = P(1 \mid r)\, P(2 \mid o, 1) = P(1 \mid r) \cdot 1
    = P(1 \mid r)\, P(2 \mid i, 1) = P(1,2 \mid ri)$$
- Bias toward states with fewer outgoing transitions
- Per-state normalization does not allow the required score(1,2|ro) << score(1,2|ri)
Proposed Solutions
- Determinization:
  - not always possible
  - state-space explosion
- Fully connected models:
  - lack prior structural knowledge
- Conditional random fields (CRFs):
  - Allow some transitions to vote more strongly than others
  - Whole-sequence rather than per-state normalization
From HMMs to CRFs
- Notation: $s = s_1 \ldots s_n$, $o = o_1 \ldots o_n$
- HMM:
  $$P(s \mid o) \propto \prod_{t=1}^n P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$
- MEMM:
  $$P(s \mid o) = \prod_{t=1}^n P(s_t \mid s_{t-1}, o_t)
    = \prod_{t=1}^n \frac{1}{Z(s_{t-1}, o_t)} \exp\left( \sum_f \lambda_f f(s_t, s_{t-1})
      + \sum_g \mu_g g(s_t, o_t) \right)$$
- CRF:
  $$P(s \mid o) = \frac{1}{Z(o)} \prod_{t=1}^n \exp\left( \sum_f \lambda_f f(s_t, s_{t-1})
      + \sum_g \mu_g g(s_t, o_t) \right)$$
CRF General Form
- The state sequence is a Markov random field conditioned on the observation sequence
- Model form:
  $$P(s \mid o) = \frac{1}{Z(o)} \exp\left( \sum_{t=1}^n \sum_f \lambda_f f(s_{t-1}, s_t, o, t) \right)$$
- Features represent the dependency between successive states, conditioned on the observations
- Dependence on the whole observation sequence $o$ (not possible in HMMs)
Efficient Estimation
- Matrix notation:
  $$M_t(s', s \mid o) = \exp \Lambda_t(s', s \mid o) \qquad
    \Lambda_t(s', s \mid o) = \sum_f \lambda_f f(s', s, o, t)$$
  $$P_\Lambda(s \mid o) = \frac{1}{Z_\Lambda(o)} \prod_t M_t(s_{t-1}, s_t \mid o) \qquad
    Z_\Lambda(o) = \left( M_1(o)\, M_2(o) \cdots M_{n+1}(o) \right)_{\mathrm{start},\mathrm{stop}}$$
- Efficient normalization: forward-backward algorithm
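A sketch of computing $Z_\Lambda(o)$ from the transition-score matrices by accumulating a forward vector rather than forming full matrix products; the start/stop state indexing convention is an assumption.

```python
import numpy as np

def crf_normalizer(Ms):
    """Z(o) = (M_1 M_2 ... M_{n+1})[start, stop].
    Ms: list of (K x K) matrices M_t[s', s] = exp(sum_f lambda_f f(s', s, o, t));
    state 0 plays 'start' and state K-1 plays 'stop' (illustrative convention)."""
    K = Ms[0].shape[0]
    alpha = np.zeros(K)
    alpha[0] = 1.0                       # all mass on the start state
    for M in Ms:
        alpha = alpha @ M                # alpha_t = alpha_{t-1} M_t
    return alpha[K - 1]                  # mass reaching the stop state
```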
Forward-Backward Calculations
- For any path function $G(s) = \sum_t g_t(s_{t-1}, s_t)$:
  $$E_\Lambda G = \sum_s P_\Lambda(s \mid o)\, G(s)
    = \sum_{t,s',s} \frac{\alpha_t(s' \mid o)\, g_{t+1}(s', s)\, M_{t+1}(s', s \mid o)\, \beta_{t+1}(s \mid o)}{Z_\Lambda(o)}$$
- with
  $$\alpha_t(o) = \alpha_{t-1}(o)\, M_t(o) \qquad \beta_t(o) = M_{t+1}(o)\, \beta_{t+1}(o) \qquad
    Z_\Lambda(o) = \alpha_{n+1}(\mathrm{end} \mid o) = \beta_0(\mathrm{start} \mid o)$$
Training
- Maximize $L(\Lambda) = \sum_k \log P_\Lambda(s_k \mid o_k)$
- Log-likelihood gradient:
  $$\frac{\partial L(\Lambda)}{\partial \lambda_f}
    = \sum_k \#f(s_k \mid o_k) - \sum_k E_\Lambda\, \#f(S \mid o_k) \qquad
    \#f(s \mid o) = \sum_t f(s_{t-1}, s_t, o, t)$$
- Methods: iterative scaling, conjugate gradient
- Cost comparable to standard Baum-Welch
Label Bias Experiment
- Data source: a noisy version of the rib/rob automaton above, with P(intended symbol) = 29/30, P(other) = 1/30
- Train both an MEMM and a CRF with identical topologies on data from the source
- Decoding error (2,000 training samples, 500 test): CRF 4.6%, MEMM 42%
Mixed-Order Sources
- Data generated by mixing sparse first- and second-order HMMs with a varying mixing coefficient
- Modeled by a first-order HMM, an MEMM, and a CRF (without contextual or overlapping features)

Part-of-Speech Tagging
- Trained on 50% of the 1.1 million words in the Penn treebank. In this set, 5.45% of the words occur only once and were mapped to "oov".
- Experiments with two different feature sets:
  - traditional: just the words
  - take advantage of the power of conditional models: words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies

POS Tagging Results

  model   error   oov error
  HMM     5.69%   45.99%
  MEMM    6.37%   54.61%
  CRF     5.55%   48.05%
  MEMM+   4.81%   26.99%
  CRF+    4.27%   23.76%
Structured Models: Stochastic Grammars

Beyond Finite State
- Constituency and meaningful relations:
  What sleeps? How does it sleep? What kind of ideas?
- Related sentences:
  Sleep furiously is all that colorless green ideas do.
[Figure: parse tree for "revolutionary new ideas advance slowly": S → NP VP; NP → N'; N' → Adj N' (revolutionary), N' → Adj N' (new), N' → N (ideas); VP → VP Adv; VP → V (advance); Adv = slowly.]
Stochastic Context-Free Grammars
[Figure: the parse tree for "revolutionary new ideas advance slowly", generated by rules with probabilities:]

  S  → NP VP    1.0
  NP → Det N'   0.2
  NP → N'       0.8
  N' → Adj N'   0.3
  N' → N        0.7
  VP → VP Adv   0.4
  VP → V        0.6
  N  → ideas    0.1
  N  → people   0.2
  V  → sleep    0.3
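A sketch that scores a derivation under such a grammar by multiplying rule probabilities, one per derivation step; the lexical probabilities for "advance" and "slowly" are made up to complete the example.

```python
def tree_prob(tree, rules):
    """P(tree) = product of rule probabilities.
    tree = (label, children): children is a list of subtrees, or a terminal string."""
    label, children = tree
    if isinstance(children, str):                 # preterminal -> word
        return rules[(label, (children,))]
    rhs = tuple(child[0] for child in children)
    p = rules[(label, rhs)]
    for child in children:
        p *= tree_prob(child, rules)
    return p

# "ideas advance slowly" with the slide's rules plus hypothetical lexical entries
rules = {
    ("S", ("NP", "VP")): 1.0, ("NP", ("N'",)): 0.8, ("N'", ("N",)): 0.7,
    ("VP", ("VP", "Adv")): 0.4, ("VP", ("V",)): 0.6,
    ("N", ("ideas",)): 0.1, ("V", ("advance",)): 0.1, ("Adv", ("slowly",)): 0.1,
}
tree = ("S", [("NP", [("N'", [("N", "ideas")])]),
              ("VP", [("VP", [("V", "advance")]), ("Adv", "slowly")])])
print(tree_prob(tree, rules))
```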
Stochastic CFG Inference
- Inside-outside algorithm (Baker 79): find rule probabilities that locally maximize the likelihood of a training corpus (an instance of EM)
- Extended inside-outside algorithm: use information about the phrase structure of the training corpus to guide rule probability reestimation
  - Better modeling of phrase structure
  - Improved convergence
  - Improved computational complexity
ML in NLP
SCFG Derivations
=
¥
¥n Context-freeness fi
independence ofderivation steps
S
NP
N'
N
Adj
Adj V
N'
N'
VP
AdvVP
ideas
revolutionary
new
advance
slowly
P( ) S
NP
V
N'
VP
AdvVP
advance
slowly
P( )
Adj
revolutionary
P( ) N'
N
Adj N'
ideas
new
P( )
P(N¢ Æ Adj N¢)
¥
Inside-Outside Reestimation
- Ratios of expected rule frequencies to expected phrase frequencies:
  $$P_{n+1}(\text{N}' \to \text{Adj } \text{N}')
    = \frac{E_n(\#\, \text{N}' \to \text{Adj } \text{N}')}{E_n(\#\, \text{N}')}$$
- Computed from the probabilities of each phrase and phrase context in the training corpus
- Iterate
Problems with I-O Reestimation
- Hill-climbing procedure: sensitive to the initial rule probabilities
- Does not learn grammar structure directly: structure is only implicit in the rule probabilities
- Linguistically inadequate grammars: high mutual-information sequences are grouped into phrases:
  ((What (((is the) cheapest) fare)) ((I can) (get ?)))
  Contrast: Is $300 the cheapest fare?
(Partially) Bracketed Text
- Hand-annotated text with (some) phrase boundaries:
  (((List (the fares (for ((flight) (number 891))))) .)
- Use only derivations compatible with the training bracketing
Predictive Power
[Figure: cross entropy vs. training iteration on training and test corpora, for raw vs. bracketed training text.]
Bracketing Accuracy
- Accuracy criterion: the proportion of phrases in the most likely analysis that are compatible with the tree bank bracketing
[Figure: bracketing accuracy vs. iteration on test data, for raw vs. bracketed training text; training from bracketed text reaches much higher accuracy.]
- Conclusion: structure is not evident from distribution alone
Limitations of SCFGs
- Likelihoods are independent of the particular words
- Markovian assumption on syntactic categories
[Figure: "We need to resolve the issue" under a Markov model vs. a hierarchical model.]

Lexicalization
[Figure: "We need to resolve the issue" again: a Markov model chains adjacent words, while a dependency model links each head word to its dependents.]
Best Current Models
- Representation: surface trees with head-word propagation
- Generative power still context-free
- Model variables: head word, dependency type, argument vs. adjunct, heaviness, slash
- Main challenge: smoothing method for unseen dependencies
- Learned from hand-parsed text (treebank)
- Around 90% constituent accuracy
Lexicalized Tree (Collins 98)
[Parse tree for "workers dumped sacks into a bin": S(dumped) → NP-C(workers) VP(dumped); VP(dumped) → V(dumped) NP-C(sacks) PP(into); PP(into) → P(into) NP-C(bin); NP-C(bin) → D(a) N(bin).]

  Dependency          Direction   Relation
  workers → dumped    L           ⟨NP, S, VP⟩
  sacks → dumped      R           ⟨NP, VP, V⟩
  into → dumped       R           ⟨PP, VP, V⟩
  a → bin             L           ⟨D, NP, N⟩
  bin → into          R           ⟨NP, PP, P⟩
Inducing Representations

Unsupervised Learning
- Latent variable models
  - Model observables from latent variables
  - Search for a "good" set of latent variables
- Information bottleneck
  - Find an efficient compression of some observables
  - … preserving the information about other observables

Do Induced Classes Help?
- Generalization
  - Better statistics for coarser events
- Dimensionality reduction
  - Smaller models
  - Improved classification accuracy
Chomsky's Challenge to Empiricism (revisited)

(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

"… It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." (Chomsky 1957)
Complex Events
- What Chomsky was talking about: Markov models, whose state is just a record of observations
- But statistical models can have hidden state:
  - a representation of past experience
  - uncertainty about the correct grammar
  - uncertainty about the correct interpretation of experience: ambiguity
- Probabilistic relationships involving hidden variables can be induced from observable data alone: the EM algorithm
In "Any Model"?
- Factored bigram model:
  $$P(w_{i+1} \mid w_i) \approx \sum_{c=1}^{16} P(w_{i+1} \mid c) \times P(c \mid w_i) \qquad
    P(w_1 \cdots w_n) \approx P(w_1) \prod_{i=2}^n P(w_i \mid w_{i-1})$$
- Trained for large-vocabulary speech recognition from newswire text by EM:
  $$\frac{P(\text{colorless green ideas sleep furiously})}{P(\text{furiously sleep ideas green colorless})} \approx 2 \times 10^5$$
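A sketch of scoring a sentence under such a factored bigram; the array layout is an illustrative convention, and the EM training of the class distributions is not shown.

```python
import numpy as np

def factored_bigram_logprob(w_seq, P_w_given_c, P_c_given_w, P_first):
    """Factored bigram: P(w'|w) ~= sum_c P(w'|c) P(c|w), as on the slide.
    P_w_given_c: (C x V) array; P_c_given_w: (V x C) array; P_first: (V,) array;
    words are integer ids (the 16-class setup is from the slide)."""
    P_bigram = P_c_given_w @ P_w_given_c          # (V x V) matrix of P(w'|w)
    logp = np.log(P_first[w_seq[0]])
    for prev, cur in zip(w_seq, w_seq[1:]):
        logp += np.log(P_bigram[prev, cur])
    return logp
```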
Distributional Clustering
- Automatic grouping of words according to the contexts in which they appear
- An approach to data sparseness: approximate the distribution of a relatively rare event (word) by the collective distribution of similar events (a cluster)
- Sense ambiguity ≈ membership in several "soft" clusters
- Case study: cluster nouns according to the verbs that take them as direct objects
Training Data
- Universe: two word classes V and N, and a single relation between them (e.g. main verb - head noun of the verb's direct object)
- Data: frequencies $f_{vn}$ of $(v, n)$ pairs extracted from text by parsing or pattern matching
Distributional Representation
- Describe $n \in N$ by the conditional distribution $p(V \mid n)$
[Figure: relative frequencies of the verbs buy, hold, issue, own, purchase, select, sell, tender, trade as contexts for the nouns "stock" and "bond".]
Reminder: Bottleneck Model
- Markov condition: $p(\tilde n \mid v) = \sum_n p(\tilde n \mid n)\, p(n \mid v)$
- Find $p(\tilde N \mid N)$ to maximize the mutual information
  $$I(\tilde N, V) = \sum_{\tilde n, v} p(\tilde n, v) \log \frac{p(\tilde n, v)}{p(\tilde n)\, p(v)}$$
  for fixed $I(\tilde N, N)$
- Solution:
  $$p(\tilde n \mid n) = \frac{p(\tilde n)}{Z_n} \exp\left( -\beta\, D_{\mathrm{KL}}\!\left( p(V \mid n) \,\|\, p(V \mid \tilde n) \right) \right)$$
[Diagram: N → Ñ → V, trading off I(Ñ, N) against I(Ñ, V).]
exp(-bDKL p(V | n) || p(V | ˜ n )( )ML in NLP
n The scale parameter b (“inversetemperature”) determines how much anoun contributes to nearby centroids
Search for Cluster Solutions
b0
n b increases fi clusters split fi hierarchicalclustering
Small Example
- Cluster the 64 most common direct objects of "fire" in the 1988 Associated Press newswire. Sample clusters (word, divergence from the cluster centroid):

  missile 0.835, rocket 0.850, bullet 0.917, gun 0.940
  officer 0.484, aide 0.612, chief 0.649, manager 0.651
  shot 0.858, bullet 0.925, rocket 0.930, missile 1.037
  gun 0.758, missile 0.786, weapon 0.862, rocket 0.875
Mutual Information Ratios
[Figure: I(Ñ, V)/I(Ñ, N) plotted against I(Ñ, N)/H(N), tracing the information curve as β increases.]
Using Clusters for Prediction
- Model verb-object associations through object clusters. The form used in the experiments:
  $$\hat p(v \mid n) = \sum_{\tilde n} p(v \mid \tilde n)\, p(\tilde n \mid n)$$
  A more appropriate alternative is the product form:
  $$\hat p(v \mid n) = \frac{\prod_{\tilde n} p(v \mid \tilde n)^{p(\tilde n \mid n)}}{Z_n}$$
- Depends on β
- Intuition: the associations of a word are a mixture of the associations of the sense classes to which the word belongs
Evaluation
- Relative entropy of held-out data to the asymmetric model
- Decision task: which of two verbs is more likely to take a noun as direct object, estimated from a model trained on data from which the pairs relating the noun to one of the verbs have been deleted
Relative Entropy
[Figure: average relative entropy (bits) vs. number of clusters for three data sets; verb-object pairs from the 1988 AP newswire:]
- Train: 756,721 verb-object pair training set
- Test: 81,240 pair held-out test set
- New: held-out data for 1000 nouns not in the training data
Decision Task
- Which of v1 and v2 is more likely to take object o?
- Held out: 104 (v2, o) pairs; the model must guess from p_c(v2)
- Testing: compare all v1 and v2 such that (v2, o) was held out and (v1, o) occurs
  - All: all test data
  - Exceptional: verb pairs whose frequency ratio with o is reversed from their overall frequency ratio
[Figure: decision error vs. number of clusters for the "all" and "exceptional" test sets.]