Log-Linear Models in NLP
Transcript of Log-Linear Models in NLP
![Page 1: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/1.jpg)
Log-Linear Models in NLP
Noah A. Smith
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
[email protected]
![Page 2: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/2.jpg)
Outline
• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi's tagger
• Conditional random fields
• Smoothing
• Feature Selection
![Page 3: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/3.jpg)
Data
For now, we're just talking about modeling data. No task.
How to assign probability to each shape type?
![Page 4: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/4.jpg)
Maximum Likelihood
[Table: maximum-likelihood probability estimates for each of the 12 shape types; several cells are zero.]
11 degrees of freedom (12 – 1).
How to smooth? Fewer parameters?
![Page 5: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/5.jpg)
Some other kinds of models
Color
Shape Size
Pr(Color, Shape, Size) = Pr(Color) • Pr(Shape | Color) • Pr(Size | Color, Shape)
[Tables: Pr(Color), Pr(Shape | Color), and Pr(Size | Color, Shape), estimated from the data.]
11 degrees of freedom (1 + 4 + 6).
These two are the same!
![Page 6: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/6.jpg)
Some other kinds of models
Color
Shape Size
Pr(Color, Shape, Size) = Pr(Color) • Pr(Shape) • Pr(Size | Color, Shape)
[Tables: Pr(Color), Pr(Shape), and Pr(Size | Color, Shape), estimated from the data.]
9 degrees of freedom (1 + 2 + 6).
![Page 7: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/7.jpg)
Some other kinds of models
Color
Shape Size
Pr(Color, Shape, Size) = Pr(Size) • Pr(Shape | Size) • Pr(Color | Size)
[Tables: Pr(Size), Pr(Shape | Size), and Pr(Color | Size), estimated from the data.]
7 degrees of freedom (1 + 2 + 4).
No zeroes here ...
![Page 8: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/8.jpg)
Some other kinds of models
Color
Shape Size
Pr(Color, Shape, Size) = Pr(Size) • Pr(Shape) • Pr(Color)
[Tables: Pr(Size), Pr(Shape), and Pr(Color), estimated from the data.]
4 degrees of freedom (1 + 2 + 1).
![Page 9: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/9.jpg)
This is difficult.
Different factorizations affect:
• smoothing
• # parameters (model size)
• model complexity
• "interpretability"
• goodness of fit
• ...
Usually, this isn't done empirically, either!
![Page 10: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/10.jpg)
Desiderata
• You decide which features to use.
• Some intuitive criterion tells you how to use them in the model.
• Empirical.
![Page 11: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/11.jpg)
Maximum Entropy
“Make the model as uniform as possible ...
but I noticed a few things that I want to model ...
so pick a model that fits the data on those things.”
![Page 12: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/12.jpg)
Occam’s Razor
One should not increase, beyond what is necessary, the number of entities required to explain anything.
![Page 13: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/13.jpg)
Uniform model
small 0.083 0.083 0.083
small 0.083 0.083 0.083
large 0.083 0.083 0.083
large 0.083 0.083 0.083
![Page 14: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/14.jpg)
Constraint: Pr(small) = 0.625
small 0.104 0.104 0.104
small 0.104 0.104 0.104
large 0.063 0.063 0.063
large 0.063 0.063 0.063
0.625
![Page 15: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/15.jpg)
Pr( , small) = 0.048
small 0.024 0.144 0.144
small 0.024 0.144 0.144
large 0.063 0.063 0.063
large 0.063 0.063 0.063
0.625
0.048
![Page 16: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/16.jpg)
Pr(large, ) = 0.125
small 0.024 0.144 0.144
small 0.024 0.144 0.144
large 0.063 0.063 0.063
large 0.063 0.063 0.063
0.625
?
0.048
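A minimal numerical sketch of this exercise, assuming 12 cells (2 sizes × 2 colors × 3 shapes) and treating the unnamed shape in the second constraint as a hypothetical "shape0": it finds the distribution that maximizes entropy subject to the two constraints and recovers the table values above.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: maximize entropy over 12 cells subject to the two constraints above.
# The first six cells are taken to be the small ones; "shape0" stands in for
# the shape icon in Pr(<shape>, small) = 0.048 (a hypothetical labeling).
n = 12
f_small = np.array([1.0] * 6 + [0.0] * 6)       # fires on the six small cells
f_small_shape0 = np.zeros(n)
f_small_shape0[:2] = 1.0                        # the two small cells of that shape

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log2(p)))        # minimize -H  <=>  maximize H

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},               # sums to 1
    {"type": "eq", "fun": lambda p: p @ f_small - 0.625},          # Pr(small) = 0.625
    {"type": "eq", "fun": lambda p: p @ f_small_shape0 - 0.048},   # Pr(shape0, small) = 0.048
]
res = minimize(neg_entropy, np.full(n, 1.0 / n),
               bounds=[(0.0, 1.0)] * n, constraints=constraints)
print(np.round(res.x, 3))
# ~0.024 for the two constrained small cells, ~0.144 for the other small cells,
# ~0.063 for the six large cells -- the same values as the tables above.
```

The maximizing distribution is uniform within each group of cells that share the same feature values, which is why all the unconstrained small cells end up equal (and likewise the large cells).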
![Page 17: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/17.jpg)
Questions
Does a solution always exist? What to do if it doesn't?
Is there a way to express the model succinctly?
Is there an efficient way to solve this problem?
![Page 18: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/18.jpg)
Entropy
• A statistical measurement on a distribution.
• Measured in bits; ranges from 0 to log2 of the number of outcomes.
• High entropy: close to uniform.
• Low entropy: close to deterministic.
• Concave in p.
[Plot: entropy (in bits) of a two-outcome distribution as a function of p.]
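A tiny sketch of the definition, computing entropy in bits; the example distributions are just for illustration.

```python
import numpy as np

# Sketch: entropy in bits of a distribution p (assumes p sums to 1).
def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                                 # treat 0 * log 0 as 0
    return float(-np.sum(nz * np.log2(nz)))

print(entropy_bits([1 / 12] * 12))                # uniform over 12 types: log2(12) ~ 3.58 bits
print(entropy_bits([1.0, 0.0]))                   # deterministic: 0 bits
```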
![Page 19: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/19.jpg)
The Max Ent Problem
[3-D plot: entropy H as a function of (p1, p2), with its maximum marked.]
![Page 20: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/20.jpg)
The Max Ent Problem
Picking a distribution p, maximize H (the objective function is H), subject to:
• the probabilities sum to 1 ...
• ... and are nonnegative
• n constraints: for each feature, the expected feature value under the model equals the expected feature value from the data.
![Page 21: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/21.jpg)
The Max Ent Problem
[3-D plot: entropy H as a function of (p1, p2).]
![Page 22: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/22.jpg)
About feature constraints
1 if x is small, 0 otherwise
1 if x is a small , 0 otherwise
1 if x is large and light, 0 otherwise
![Page 23: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/23.jpg)
Mathematical Magic
Max
Constrained: one variable per outcome (p); concave in p.
Unconstrained: N variables (θ); concave in θ.
![Page 24: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/24.jpg)
What’s the catch?
The model takes on a specific, parameterized form.
It can be shown that any max-ent model must take this form.
![Page 25: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/25.jpg)
Outline
Maximum Entropy principleLog-linear modelsConditional modeling for
classificationRatnaparkhi’s tagger Conditional random fieldsSmoothingFeature Selection
![Page 26: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/26.jpg)
Log-linear models
log Pr(x) = θ1 f1(x) + θ2 f2(x) + ... + θN fN(x) − log Z — the log of the probability is a linear function of the features.
![Page 27: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/27.jpg)
Log-linear models
Pr(x) = exp(θ1 f1(x) + ... + θN fN(x)) / Z
• The numerator is the unnormalized probability, or weight.
• Z, the sum of the weights over all outcomes, is the partition function.
• One parameter (θi) for each feature.
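A small sketch of these three pieces (weights, partition function, one θ per feature) for a made-up feature table:

```python
import numpy as np

# Sketch of the pieces named above, with an invented feature table:
# each row of F is the feature vector f(x) for one outcome x.
F = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
theta = np.array([0.5, -1.0])          # one parameter (theta_i) for each feature

weights = np.exp(F @ theta)            # unnormalized probabilities, or "weights"
Z = weights.sum()                      # partition function
p = weights / Z                        # the log-linear distribution
print(weights, Z, p)
```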
![Page 28: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/28.jpg)
Mathematical Magic
Max ent problem: constrained; one variable per outcome (p); concave in p.
Log-linear ML problem: unconstrained; N variables (θ); concave in θ.
![Page 29: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/29.jpg)
What does MLE mean?
Choose θ to maximize ∏j Pr(xj) (independence among examples), or equivalently Σj log Pr(xj) (the arg max is the same in the log domain).
![Page 30: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/30.jpg)
MLE: Then and Now
Directed models: concave; constrained (simplex); "count and normalize" (closed-form solution).
Log-linear models: concave; unconstrained; iterative methods.
![Page 31: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/31.jpg)
Iterative Methods
• Generalized Iterative Scaling
• Improved Iterative Scaling
• Gradient Ascent (sketched below)
• Newton/Quasi-Newton Methods
  – Conjugate Gradient
  – Limited-Memory Variable Metric
  – ...
All of these methods are correct and will converge to the right answer; it's just a matter of how fast.
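A sketch of the simplest of these methods, plain gradient ascent, on a toy log-linear MLE problem; the feature table and target expectations are invented. The gradient is the empirical feature expectation minus the model's expectation.

```python
import numpy as np

# Sketch of the simplest iterative method: gradient ascent on the log-linear
# maximum-likelihood objective.  F and the target expectations are invented.
F = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [0.0, 0.0]])
empirical = np.array([0.6, 0.3])              # expected feature values from the data

theta = np.zeros(F.shape[1])
for _ in range(5000):
    p = np.exp(F @ theta)
    p /= p.sum()                              # current model distribution
    grad = empirical - p @ F                  # empirical minus model expectations
    theta += 0.5 * grad                       # all listed methods reach the same optimum

print(theta, p @ F)                           # model expectations match `empirical`
```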
![Page 32: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/32.jpg)
Questions
Does a solution always exist? Yes, if the constraints come from the data.
Is there a way to express the model succinctly? Yes, a log-linear model.
Is there an efficient way to solve this problem? Yes, many iterative methods.
![Page 33: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/33.jpg)
Outline
• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi's tagger
• Conditional random fields
• Smoothing
• Feature Selection
![Page 34: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/34.jpg)
Conditional Estimation
Classification Rule: choose the label ŷ = arg maxy Pr(y | x).
Training Objective: maximize Σj log Pr(yj | xj), where the xj are examples and the yj are labels.
![Page 35: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/35.jpg)
Maximum Likelihood
object
label
![Page 36: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/36.jpg)
Maximum Likelihood
object
label
![Page 37: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/37.jpg)
Maximum Likelihood
object
label
![Page 38: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/38.jpg)
Maximum Likelihood
object
label
![Page 39: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/39.jpg)
Conditional Likelihood
object
label
![Page 40: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/40.jpg)
Remember:
log-linear models
conditional estimation
![Page 41: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/41.jpg)
The Whole Picture
MLE — directed models: "count & normalize"; log-linear models: unconstrained concave optimization.
CLE — directed models: constrained concave optimization; log-linear models: unconstrained concave optimization.
![Page 42: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/42.jpg)
Log-linear models: MLE vs. CLE
MLE: sum over all example types and all labels.
CLE: sum over all labels.
![Page 43: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/43.jpg)
Classification Rule
Pick the most probable label y: ŷ = arg maxy Σk θk fk(x, y).
We don't need to compute the partition function at test time! But it does need to be computed during training.
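A sketch of the rule with invented features and weights: each candidate label is scored by θ·f(x, y) and the arg max is returned without ever normalizing.

```python
import numpy as np

# Sketch of the classification rule with invented features and weights:
# score each candidate label by theta . f(x, y) and take the arg max.
LABELS = ["NN", "VB", "ADV"]
theta = np.array([1.2, 0.8, 0.3])

def features(x, y):
    # hypothetical indicator features for a word x and a candidate label y
    return np.array([
        1.0 if x.endswith("ly") and y == "ADV" else 0.0,
        1.0 if x.endswith("s") and y == "VB" else 0.0,
        1.0 if y == "NN" else 0.0,
    ])

def classify(x):
    scores = {y: float(theta @ features(x, y)) for y in LABELS}
    return max(scores, key=scores.get)        # no partition function needed here

print(classify("begs"), classify("silently"))
```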
![Page 44: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/44.jpg)
Outline
• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi's tagger
• Conditional random fields
• Smoothing
• Feature Selection
![Page 45: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/45.jpg)
Ratnaparkhi’s POS Tagger (1996)
• Probability model:
• Assume unseen words behave like rare words.– Rare words ≡ count < 5
• Training: GIS• Testing/Decoding: beam search
![Page 46: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/46.jpg)
Features: common words
the stories about well-heeled communities and developers
DT NNS IN JJ NNS CC NNS
Feature instances, each paired with the tag IN for "about":
about & IN
stories & IN
the & IN
well-heeled & IN
communities & IN
NNS & IN
DT NNS & IN
![Page 47: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/47.jpg)
Features: rare words
the stories about well-heeled communities and developers
DT NNS IN JJ NNS CC NNS
Feature instances, each paired with the tag JJ for the rare word "well-heeled":
about & JJ
stories & JJ
communities & JJ
and & JJ
IN & JJ
NNS IN & JJ
w... & JJ
we... & JJ
wel... & JJ
well... & JJ
...d & JJ
...ed & JJ
...led & JJ
...eled & JJ
...-... & JJ
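A rough feature-extraction sketch in the spirit of these two slides; the template names are mine, and only part of the full template set is reproduced. The rare-word case swaps the word-identity feature for spelling features (prefixes, suffixes, hyphen).

```python
# Sketch of Ratnaparkhi-style feature extraction (template names are mine, and
# only part of the full template set is shown).  Rare words (count < 5) swap the
# word-identity feature for spelling features: prefixes, suffixes, hyphen.
def tagging_features(words, tags, i, rare):
    w = words[i]
    feats = []
    if not rare:
        feats.append(("word", w))
    else:
        feats += [("prefix", w[:k]) for k in range(1, 5)]
        feats += [("suffix", w[-k:]) for k in range(1, 5)]
        if "-" in w:
            feats.append(("contains-hyphen",))
    # context words and tag history (shared by common and rare words)
    feats += [("prev-word", words[i - 1] if i >= 1 else "<s>"),
              ("next-word", words[i + 1] if i + 1 < len(words) else "</s>"),
              ("prev-tag", tags[i - 1] if i >= 1 else "<s>"),
              ("prev-two-tags", tuple(tags[max(i - 2, 0):i]))]
    return feats

words = "the stories about well-heeled communities and developers".split()
tags = ["DT", "NNS", "IN", "JJ", "NNS", "CC", "NNS"]
print(tagging_features(words, tags, 3, rare=True))   # "well-heeled" treated as rare
```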
![Page 48: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/48.jpg)
The “Label Bias” Problem
born to wealth — VBN TO NN  (4)
born to run — VBN TO VB  (6)
![Page 49: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/49.jpg)
The “Label Bias” Problem
[Trellis of tag-history states over the sentence: ■ → VBN (born) → {VBN, IN} or {VBN, TO} (to) → {IN, NN} (wealth) or {TO, VB} (run).]
born to wealth
Pr(VBN | born) · Pr(IN | VBN, to) · Pr(NN | VBN, IN, wealth) = 1 × .4 × 1
Pr(VBN | born) · Pr(TO | VBN, to) · Pr(VB | VBN, TO, wealth) = 1 × .6 × 1
![Page 50: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/50.jpg)
Is this symptomatic of log-linear models? No!
![Page 51: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/51.jpg)
Tagging Decisions
[Trellis of tagging decisions: at each word, a set of candidate tags (A, B, C, D, ...) branches from the previous decision.]
At each decision point, the total weight is 1.
Choose the path with the greatest weight.
You must choose tag2 = B, even if B is a terrible tag for word2: Pr(tag2 = B | anything at all!) = 1.
You never pay a penalty for it!
![Page 52: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/52.jpg)
Tagging Decisions in an HMM
[The same trellis, scored by an HMM.]
At each decision point, the total weight can be 0.
Choose the path with the greatest weight.
You may choose to discontinue this path if B can't tag word2. Or pay a high cost.
![Page 53: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/53.jpg)
Outline
• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi's tagger
• Conditional random fields
• Smoothing
• Feature Selection
![Page 54: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/54.jpg)
Conditional Random Fields
• Lafferty, McCallum, and Pereira (2001)
• Whole-sentence model with local features:
  Pr(y | x) = (1 / Z(x)) exp( Σi Σk θk fk(yi-1, yi, x, i) )
Simple CRFs as Graphs
PRP$ NN VBZ ADV
My cat begs silently
PRP$ NN VBZ ADV
My cat begs silently
Compare with an HMM:
CRF: weights, added together.
HMM: log-probs, added together.
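A sketch of the comparison: a simple chain CRF scores a tag sequence by adding feature weights on tag–tag and tag–word pairs, where an HMM would add the corresponding log-probabilities. All numbers are invented.

```python
# Sketch of the comparison above with invented weights: an HMM-shaped CRF adds
# feature weights along the path; an HMM adds log-probabilities of the same shape.
trans_w = {("PRP$", "NN"): 1.0, ("NN", "VBZ"): 0.8, ("VBZ", "ADV"): 0.6}
emit_w = {("PRP$", "My"): 0.5, ("NN", "cat"): 1.2,
          ("VBZ", "begs"): 0.9, ("ADV", "silently"): 1.1}

def crf_path_weight(words, tags):
    s = sum(emit_w.get((t, w), 0.0) for t, w in zip(tags, words))       # tag-word features
    s += sum(trans_w.get(pair, 0.0) for pair in zip(tags, tags[1:]))    # tag-tag features
    return s

print(crf_path_weight(["My", "cat", "begs", "silently"],
                      ["PRP$", "NN", "VBZ", "ADV"]))
# An HMM computes the same kind of sum, but with log Pr(tag | prev tag) and
# log Pr(word | tag) in place of unconstrained weights.
```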
![Page 56: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/56.jpg)
What can CRFs do that HMMs can’t?
PRP$ NN VBZ ADV
My cat begs silently
...ly & ADV
...s & VBZ
![Page 57: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/57.jpg)
An Algorithmic Connection
What is the partition function?
Total weight of all paths.
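A sketch of that connection: the partition function is the total weight of all tag paths, which the forward algorithm computes without enumerating paths. The local weights here are random stand-ins for exp(θ·f); tag index 0 doubles as a dummy start tag (an assumption of this sketch).

```python
import itertools
import numpy as np

# Sketch: the partition function Z(x) is the total weight of all tag paths,
# computed by the forward algorithm.  A[i, t', t] is the local weight
# exp(theta . f) for tagging word i with t after t'; random stand-ins here.
# Tag index 0 doubles as a dummy start tag (an assumption of this sketch).
T, n = 3, 4
rng = np.random.default_rng(0)
A = np.exp(rng.normal(size=(n, T, T)))

alpha = A[0, 0, :].copy()                 # forward weights after the first word
for i in range(1, n):
    alpha = alpha @ A[i]                  # alpha_t = sum_t' alpha_t' * weight(t' -> t)
Z = alpha.sum()                           # total weight of all paths

brute = 0.0                               # check by enumerating every path
for tags in itertools.product(range(T), repeat=n):
    w = A[0, 0, tags[0]]
    for i in range(1, n):
        w *= A[i, tags[i - 1], tags[i]]
    brute += w
print(Z, brute)                           # the two agree
```

In practice the same recursion is run in log space so that long sentences don't overflow.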
![Page 58: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/58.jpg)
CRF weight training
• Maximize log-likelihood: Σj [ Σk θk fk(xj, yj) − log Z(xj) ]
• Gradient: observed feature values minus expected feature values under the model.
Z(x), the total weight of all paths: the forward algorithm.
Expected feature values: the forward-backward algorithm.
![Page 59: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/59.jpg)
Forward, Backward, and Expectations
fk is the number of firings; each firing is at some position.
Markovian property
forward weight backward weight
forward weight
![Page 60: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/60.jpg)
Forward, Backward, and Expectations
forward weight backward weight
forward weight to final state = weight of all paths
![Page 61: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/61.jpg)
Forward-Backward’s Clients
Training a CRF: supervised (labeled data); concave; converges to global max; max p(y | x) (conditional training).
Baum-Welch: unsupervised; bumpy; converges to local max; max p(x) (y unknown).
![Page 62: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/62.jpg)
A Glitch
• Suppose we notice that -ly words are always adverbs.
• Call this feature 7.
-ly words are all ADV; this is maximal.
The expectation can't exceed the max (it can't even reach it). The gradient will always be positive.
![Page 63: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/63.jpg)
The Dark Side of Log-Linear Models
![Page 64: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/64.jpg)
Outline
• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi's tagger
• Conditional random fields
• Smoothing
• Feature Selection
![Page 65: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/65.jpg)
Regularization
• θs shouldn't have huge magnitudes
• Model must generalize to test data
• Example: quadratic penalty (sketched below)
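A sketch of the quadratic penalty in action, assuming the Gaussian-prior form −θk²/(2σ²) discussed on the following slides; the toy target is chosen so that the unpenalized MLE would push the weights to infinity (the "glitch" above), while the penalized optimum stays finite.

```python
import numpy as np

# Sketch of the quadratic penalty, using the Gaussian-prior form -theta^2/(2 sigma^2)
# from the following slides.  The toy target is degenerate, so the unpenalized MLE
# would push theta to infinity; the penalized optimum stays finite.
F = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
empirical = np.array([1.0, 0.0])          # feature 1 always fires, feature 2 never does
sigma2 = 1.0

theta = np.zeros(2)
for _ in range(5000):
    p = np.exp(F @ theta)
    p /= p.sum()
    grad = empirical - p @ F - theta / sigma2     # the penalty just shifts the gradient
    theta += 0.2 * grad

print(theta)                              # finite, moderate magnitudes
```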
![Page 66: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/66.jpg)
Bayesian Regularization: Maximum A Posteriori Estimation
![Page 67: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/67.jpg)
Independent Gaussians Prior (Chen and Rosenfeld, 2000)
Independence
Gaussian
0-mean, identical variance.
Quadratic penalty!
![Page 68: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/68.jpg)
Alternatives
• Different variances for different parameters
• Laplacian prior (1-norm) — not differentiable
• Exponential prior (Goodman, 2004) — all θk ≥ 0
• Relax the constraints (Kazama & Tsujii, 2003)
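A sketch of the penalty terms behind these alternatives (constants invented): the Gaussian penalty is smooth at zero, the 1-norm has a cusp there, and the exponential prior constrains θk ≥ 0; the latter two can hold weights exactly at zero.

```python
import numpy as np

# Sketch of the penalty terms behind these alternatives (constants invented).
def gaussian_penalty(theta, sigma2=1.0):
    return np.sum(theta ** 2) / (2.0 * sigma2)    # smooth; gradient 0 at theta_k = 0

def laplacian_penalty(theta, alpha=1.0):
    return alpha * np.sum(np.abs(theta))          # 1-norm; cusp at theta_k = 0

def exponential_penalty(theta, alpha=1.0):
    # Goodman (2004): requires all theta_k >= 0, so the penalty is linear
    return alpha * np.sum(theta)

theta = np.array([0.3, 0.0, -1.2])
print(gaussian_penalty(theta), laplacian_penalty(theta))
```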
![Page 69: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/69.jpg)
Effect of the penalty
[Plot of θk: unsmoothed vs. Goodman's smoothing.]
![Page 70: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/70.jpg)
Kazama & Tsujii’s box constraints
The primal Max Ent problem:
![Page 71: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/71.jpg)
Sparsity
• Fewer features → better generalization
• E.g., support vector machines
• Kazama & Tsujii’s prior, and Goodman’s, give sparsity.
![Page 72: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/72.jpg)
Sparsity
[Plot: penalty as a function of θk for Gaussian smoothing, Goodman's smoothing, and Kazama & Tsujii's smoothing. The latter two have a cusp at θk = 0, where the function is not differentiable; the Gaussian penalty's gradient is 0 there.]
![Page 73: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/73.jpg)
Outline
• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi's tagger
• Conditional random fields
• Smoothing
• Feature Selection
![Page 74: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/74.jpg)
Feature Selection
• Sparsity from priors is one way to pick the features. (Maybe not a good way.)
• Della Pietra, Della Pietra, and Lafferty (1997) gave another way.
![Page 75: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/75.jpg)
Back to the original example.
![Page 76: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/76.jpg)
Nine features.
• f1 = 1 if , 0 otherwise
• f2 = 1 if , 0 otherwise
• f3 = 1 if , 0 otherwise
• f4 = 1 if , 0 otherwise
• f5 = 1 if , 0 otherwise
• f6 = 1 if , 0 otherwise
• f7 = 1 if , 0 otherwise
• f8 = 1 if , 0 otherwise
• f9 = 1 unless some other feature fires; θ9 << 0
θi = log counti
What's wrong here?
![Page 77: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/77.jpg)
The Della Pietras’ & Lafferty’s Algorithm
1. Start out with no features.
2. Consider a set of candidates.
   • Atomic features.
   • Current features conjoined with atomic features.
3. Pick the candidate g with the greatest gain (sketched below):
4. Add g to the model.
5. Retrain all parameters.
6. Go to 2.
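A rough sketch of the gain in step 3 for an unconditional model over a small finite space: adding candidate g with weight α changes the average log-likelihood by α·E_data[g] − log Z_α, and the gain is the best value over α. (Della Pietra et al. solve this one-dimensional problem with Newton-style updates; the grid search here is just for illustration.)

```python
import numpy as np

# Sketch of the gain for one candidate feature g, for an unconditional model over
# a small finite space.  p is the current model, p_data the empirical distribution,
# g a 0/1 feature vector.  (The paper uses Newton-style updates, not a grid.)
def gain(p, p_data, g, alphas=np.linspace(-5.0, 5.0, 2001)):
    target = p_data @ g                           # empirical expectation of g
    best = 0.0                                    # alpha = 0 gives zero gain
    for a in alphas:
        z = np.sum(p * np.exp(a * g))             # new partition function
        best = max(best, a * target - np.log(z))  # change in average log-likelihood
    return best

p = np.full(4, 0.25)                              # current model: uniform
p_data = np.array([0.4, 0.4, 0.1, 0.1])           # empirical distribution
g = np.array([1.0, 1.0, 0.0, 0.0])                # candidate feature
print(gain(p, p_data, g))                         # > 0: adding g would help
```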
![Page 78: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/78.jpg)
Feature Induction: Example
PRP$ NN VBZ ADV
My cat begs silently
Atomic features:
NN VBZ ADV
My cat begs silently
Selected features:
Other candidates:
NN VBZ
PRP$ NN
NN cat
![Page 79: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/79.jpg)
Outline
• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi's tagger
• Conditional random fields
• Smoothing
• Feature Selection
![Page 80: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/80.jpg)
Conclusions
Probabilistic models: robustness; data-oriented; mathematically understood.
Hacks: explanatory power; exploit the expert's choice of features; (can be) more data-oriented.
Log-linear models: the math is beautiful and easy to implement. You pick the features; the rest is just math!
![Page 81: Log-Linear Models in NLP](https://reader035.fdocuments.in/reader035/viewer/2022081505/56815731550346895dc4ceb4/html5/thumbnails/81.jpg)
Thank you!