Maximum Entropy Model
Transcript of Maximum Entropy Model
Maximum Entropy Model
ELN – Natural Language Processing
Slides by: Fei Xia
History
- The concept of maximum entropy can be traced back along multiple threads to Biblical times.
- Introduced to NLP by Berger et al. (1996).
- Used in many NLP tasks: MT, tagging, parsing, PP attachment, LM, …
Outline
- Modeling: intuition, basic concepts, …
- Parameter training
- Feature selection
- Case study
Reference papers
- (Ratnaparkhi, 1997)
- (Ratnaparkhi, 1996)
- (Berger et al., 1996)
- (Klein and Manning, 2003)

These papers use different notations.
Modeling
The basic idea
- Goal: estimate p.
- Choose the p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence"):

H(p) = −Σ_{x∈A×B} p(x) log p(x)

where x = (a, b), a ∈ A, b ∈ B.
Setting
- From training data, collect (a, b) pairs:
  - a: the thing to be predicted (e.g., a class in a classification problem)
  - b: the context
  - Ex: POS tagging: a = NN; b = the words in a window and the previous two tags
- Learn the probability of each (a, b): p(a, b)
Features in POS tagging (Ratnaparkhi, 1996)

context (a.k.a. history); allowable classes

| Condition | Features |
| --- | --- |
| wi is not rare | wi = X & ti = T |
| wi is rare | X is prefix of wi, \|X\| ≤ 4 & ti = T |
| | X is suffix of wi, \|X\| ≤ 4 & ti = T |
| | wi contains a number & ti = T |
| | wi contains an uppercase character & ti = T |
| | wi contains a hyphen & ti = T |
| ∀ wi | ti-1 = X & ti = T |
| | ti-2 ti-1 = X Y & ti = T |
| | wi-1 = X & ti = T |
| | wi-2 = X & ti = T |
| | wi+1 = X & ti = T |
| | wi+2 = X & ti = T |
Maximum Entropy
- Why *maximum* entropy?
- Maximize entropy = minimize commitment.
- Model all that is known and assume nothing about what is unknown:
  - Model all that is known: satisfy a set of constraints that must hold.
  - Assume nothing about what is unknown: choose the most "uniform" distribution, i.e., the one with maximum entropy.
(Maximum) Entropy
- Entropy: the uncertainty of a distribution.
- Quantifying uncertainty ("surprise"): an event x with probability p_x has "surprise" log(1/p_x).
- Entropy: expected surprise (over p):

H(p) = E_p[log(1/p_x)] = −Σ_x p_x log p_x

[Plot: entropy H of a coin flip as a function of p(HEADS).]
Ex1: Coin-flip example (Klein & Manning, 2003)
- Toss a coin: p(H) = p1, p(T) = p2.
- Constraint: p1 + p2 = 1.
- Question: what's your estimate of p = (p1, p2)?
- Answer: choose the p that maximizes H(p) = −Σ_x p(x) log p(x).

[Plot: H as a function of p1, with the point p1 = 0.3 marked.]
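The coin-flip example above can be checked numerically. This small sketch (not part of the slides) sweeps p1 over a grid and confirms that, with only the constraint p1 + p2 = 1, the entropy-maximizing choice is the uniform distribution p1 = 0.5:

```python
import math

def entropy(ps):
    """Shannon entropy H(p) = -sum p log p (terms with p = 0 contribute 0)."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# Sweep p1 over a grid; H(p1, 1 - p1) peaks at the uniform distribution,
# which is the maximum-entropy choice under the single constraint p1 + p2 = 1.
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=lambda p1: entropy([p1, 1 - p1]))
print(best)  # 0.5
```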
Convexity
- x log x is convex, and Σ x log x is convex (a sum of convex functions is convex), so H(p) = −Σ x log x is concave.
- The feasible region of the constrained problem is a linear subspace (which is convex).
- The constrained entropy surface therefore has a unique maximum.
- The maximum likelihood exponential model (dual) formulation is also convex.
Ex2: An MT example (Berger et al., 1996)

Possible translations for the word "in": {dans, en, à, au cours de, pendant}

Constraint:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
An MT example (cont)

Constraints:
- p(dans) + p(en) = 1/5
- p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
p(dans) = p(en) = 1/10; p(à) = p(au cours de) = p(pendant) = 8/30
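The intuitive answer above can be verified by brute force. This sketch (not from the slides) grid-searches over distributions on the five words that satisfy p(dans) + p(en) = 1/5 and keeps the one with the largest entropy; it recovers p(dans) = p(en) = 1/10 and roughly 8/30 for each of the remaining words:

```python
import math

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

# Order of coordinates: [dans, en, a, au cours de, pendant].
# p(dans) ranges over [0, 0.2] (its pair with p(en) must sum to 1/5);
# the remaining 0.8 of probability mass is split among the last three words.
best, best_h = None, -1.0
for i in range(21):                  # p(dans) = i * 0.01
    x = i * 0.01
    for j in range(81):              # p(a) = j * 0.01
        y = j * 0.01
        for k in range(81 - j):      # p(au cours de) = k * 0.01
            z = k * 0.01
            p = [x, 0.2 - x, y, z, 0.8 - y - z]
            h = entropy(p)
            if h > best_h:
                best, best_h = p, h
print([round(v, 3) for v in best])
```

The grid optimum lands on p(dans) = p(en) = 0.1 exactly, and on the closest grid points to 8/30 ≈ 0.267 for the other three.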
An MT example (cont)

Constraints:
- p(dans) + p(en) = 1/5
- p(dans) + p(à) = 1/2
- p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
p(dans) = ?, p(en) = ?, p(à) = ?, p(au cours de) = ?, p(pendant) = ?
(The answer is no longer obvious.)
Ex3: POS tagging (Klein and Manning, 2003)

Let's say we have the following event space:

| NN | NNS | NNP | NNPS | VBZ | VBD |
| --- | --- | --- | --- | --- | --- |

… and the following empirical data:

| 3 | 5 | 11 | 13 | 3 | 1 |
| --- | --- | --- | --- | --- | --- |

Maximize H (unconstrained):

| 1/e | 1/e | 1/e | 1/e | 1/e | 1/e |
| --- | --- | --- | --- | --- | --- |

… but we want probabilities, so add the constraint E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1:

| 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |
| --- | --- | --- | --- | --- | --- |
Ex3 (cont)
- Too uniform!
- N* are more common than V*, so we add the feature fN = {NN, NNS, NNP, NNPS}, with E[fN] = 32/36:

| NN | NNS | NNP | NNPS | VBZ | VBD |
| --- | --- | --- | --- | --- | --- |
| 8/36 | 8/36 | 8/36 | 8/36 | 2/36 | 2/36 |

- … and proper nouns are more frequent than common nouns, so we add fP = {NNP, NNPS}, with E[fP] = 24/36:

| 4/36 | 4/36 | 12/36 | 12/36 | 2/36 | 2/36 |
| --- | --- | --- | --- | --- | --- |

- … we could keep refining the model, e.g., by adding a feature to distinguish singular vs. plural nouns, or verb types.
Ex4: overlapping features (Klein and Manning, 2003)
- Maxent models handle overlapping features.
- Unlike a Naive Bayes model, there is no double counting!
- But they do not automatically model feature interactions.
Modeling the problem
- Objective function: H(p)
- Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p):

p* = argmax_{p∈P} H(p)

- Question: how do we represent constraints?
Features
- A feature (a.k.a. feature function, indicator function) is a binary-valued function on events:

f_j : ε → {0, 1},  ε = A × B

  - A: the set of possible classes (e.g., tags in POS tagging)
  - B: the space of contexts (e.g., neighboring words/tags in POS tagging)
- Example:

f_j(a, b) = 1 if a = DET and curWord(b) = "that"; 0 otherwise.
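A minimal sketch of such a binary feature function (not from the slides; the context layout and the `cur_word` helper are hypothetical stand-ins for the slide's curWord(b)):

```python
def cur_word(b):
    """Return the word currently being tagged from a context dict
    (hypothetical layout: a word list plus the current position)."""
    return b["words"][b["i"]]

def f_j(a, b):
    """1 iff the candidate class is DET and the current word is 'that'."""
    return 1 if a == "DET" and cur_word(b) == "that" else 0

ctx = {"words": ["I", "know", "that", "story"], "i": 2}
print(f_j("DET", ctx), f_j("NN", ctx))  # 1 0
```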
Some notations
- S: finite training sample of events
- p̃(x): observed probability of x in S
- p(x): the model p's probability of x
- f_j: the jth feature
- Observed expectation of f_j (empirical count of f_j): E_{p̃} f_j = Σ_x p̃(x) f_j(x)
- Model expectation of f_j: E_p f_j = Σ_x p(x) f_j(x)
Constraints
- Model's feature expectation = observed feature expectation:

E_p f_j = E_{p̃} f_j

- How to calculate E_{p̃} f_j? E.g., for f_j(a, b) = 1 if a = DET and curWord(b) = "that", 0 otherwise:

E_{p̃} f_j = Σ_x p̃(x) f_j(x) = (1/N) Σ_{i=1}^N f_j(x_i)
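The observed expectation is just the fraction of training events on which the feature fires. A quick illustration over a toy sample (the events and the feature are invented for this sketch):

```python
# E_{p~} f_j = (1/N) * sum_i f_j(x_i) over the training sample,
# i.e., the empirical count of the feature divided by the sample size.
def f_det_that(a, b):
    """Fires when the class is DET and the context word is 'that'."""
    return 1 if a == "DET" and b == "that" else 0

sample = [("DET", "that"), ("NN", "dog"), ("DET", "that"), ("IN", "that")]
N = len(sample)
e_emp = sum(f_det_that(a, b) for a, b in sample) / N
print(e_emp)  # 0.5
```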
Training data → observed events
Restating the problem

The task: find p* s.t.

p* = argmax_{p∈P} H(p)

where

P = {p | E_p f_j = E_{p̃} f_j, j ∈ {1, …, k}}

- Objective function: −H(p)
- Constraints: E_p f_j = E_{p̃} f_j = d_j, j = 1, …, k
- To encode Σ_x p(x) = 1, add a feature f_0(a, b) = 1 for all (a, b); then E_p f_0 = E_{p̃} f_0 = 1.
Questions
- Is P empty?
- Does p* exist?
- Is p* unique?
- What is the form of p*?
- How to find p*?
What is the form of p*? (Ratnaparkhi, 1997)

P = {p | E_p f_j = E_{p̃} f_j, j ∈ {1, …, k}}

Q = {p | p(x) = (1/Z) Π_{j=1}^k α_j^{f_j(x)}, α_j > 0}

Theorem: if p* ∈ P ∩ Q, then

p* = argmax_{p∈P} H(p)

Furthermore, p* is unique.
Using Lagrange multipliers

Minimize A(p):

A(p) = −H(p) − Σ_{j=0}^k λ_j (E_p f_j − d_j)

Set the derivative to zero:

A′(p) = 0
⇒ ∂/∂p(x) [ Σ_x p(x) log p(x) − Σ_{j=0}^k λ_j (Σ_x p(x) f_j(x) − d_j) ] = 0
⇒ log p(x) + 1 − Σ_{j=0}^k λ_j f_j(x) = 0
⇒ log p(x) = Σ_{j=0}^k λ_j f_j(x) − 1
⇒ p(x) = e^{λ_0 − 1} · e^{Σ_{j=1}^k λ_j f_j(x)} = (1/Z) e^{Σ_{j=1}^k λ_j f_j(x)}, where Z = e^{1−λ_0}
Two equivalent forms

p(x) = (1/Z) e^{Σ_{j=1}^k λ_j f_j(x)}

p(x) = (1/Z) Π_{j=1}^k α_j^{f_j(x)}, with λ_j = ln α_j
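The equivalence of the two forms can be checked directly on a toy event space (the features and weights below are made up for the sketch; α_j = e^{λ_j} links the two parameterizations):

```python
import math

# Three events, two binary features each; lambda weights are arbitrary.
feats = {"x1": [1, 0], "x2": [0, 1], "x3": [1, 1]}
lam = [0.7, -0.3]
alpha = [math.exp(l) for l in lam]  # alpha_j = e^{lambda_j}

def unnorm_exp(x):
    """exp(sum_j lambda_j f_j(x))"""
    return math.exp(sum(l * f for l, f in zip(lam, feats[x])))

def unnorm_prod(x):
    """prod_j alpha_j^{f_j(x)}"""
    out = 1.0
    for a, f in zip(alpha, feats[x]):
        out *= a ** f
    return out

Z = sum(unnorm_exp(x) for x in feats)
p_exp = {x: unnorm_exp(x) / Z for x in feats}
p_prod = {x: unnorm_prod(x) / Z for x in feats}
```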
Relation to Maximum Likelihood

The log-likelihood of the empirical distribution p̃ as predicted by a model q is defined as

L(q) = Σ_x p̃(x) log q(x)

Theorem: if p* ∈ P ∩ Q, then

p* = argmax_{q∈Q} L(q)

Furthermore, p* is unique.
Summary (so far)

Goal: find p* in P, which maximizes H(p):

P = {p | E_p f_j = E_{p̃} f_j, j = 1, …, k}

It can be proved that, when p* exists, it is unique.

The model p* in P with maximum entropy is the model in Q

Q = {p | p(x) = (1/Z) Π_{j=1}^k α_j^{f_j(x)}, α_j > 0}

that maximizes the likelihood of the training sample p̃.
Summary (cont)

Adding constraints (features) (Klein and Manning, 2003):
- Lowers the maximum entropy
- Raises the maximum likelihood of the data
- Brings the distribution further from uniform
- Brings the distribution closer to the data
Parameter estimation
Algorithms
- Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972)
- Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)
GIS: setup

Requirements for running GIS:
- Obey the form of the model and the constraints:

p(x) = (1/Z) e^{Σ_{j=1}^k λ_j f_j(x)},  E_p f_j = d_j

- An additional constraint: ∀x, Σ_{j=1}^k f_j(x) = C
- If this does not hold, set C = max_x Σ_{j=1}^k f_j(x) and add a new feature f_{k+1}:

f_{k+1}(x) = C − Σ_{j=1}^k f_j(x)
GIS algorithm
- Compute d_j, j = 1, …, k+1
- Initialize λ_j^{(1)} (any values, e.g., 0)
- Repeat until convergence:
  - for each j:
    - compute E_{p^{(n)}} f_j = Σ_x p^{(n)}(x) f_j(x), where p^{(n)}(x) = (1/Z) e^{Σ_{j=1}^{k+1} λ_j^{(n)} f_j(x)}
    - update λ_j^{(n+1)} = λ_j^{(n)} + (1/C) log( d_j / E_{p^{(n)}} f_j )
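The update rule above can be exercised on a tiny example. This sketch (toy data and features invented for illustration; not from the slides) adds the slack feature f_{k+1}(x) = C − Σ_j f_j(x) and runs the GIS update until the model expectations match the observed ones:

```python
import math

# Three events with two binary features; p_emp plays the role of p~.
events = ["x1", "x2", "x3"]
feats = {"x1": [1, 0], "x2": [0, 1], "x3": [1, 1]}
p_emp = {"x1": 0.5, "x2": 0.3, "x3": 0.2}

C = max(sum(f) for f in feats.values())                      # here C = 2
full = {x: feats[x] + [C - sum(feats[x])] for x in events}   # append slack feature
nfeat = len(full["x1"])

# Observed expectations d_j = E_{p~} f_j
d = [sum(p_emp[x] * full[x][j] for x in events) for j in range(nfeat)]

def model_probs(lam):
    """p(x) = exp(sum_j lam_j f_j(x)) / Z over the toy event space."""
    unnorm = {x: math.exp(sum(l * f for l, f in zip(lam, full[x]))) for x in events}
    Z = sum(unnorm.values())
    return {x: unnorm[x] / Z for x in events}

lam = [0.0] * nfeat  # initialize all weights to 0
for _ in range(500):
    p = model_probs(lam)
    e_model = [sum(p[x] * full[x][j] for x in events) for j in range(nfeat)]
    # GIS update: lam_j <- lam_j + (1/C) * log(d_j / E_p f_j)
    lam = [l + math.log(dj / ej) / C for l, dj, ej in zip(lam, d, e_model)]

p = model_probs(lam)
```

After convergence the model expectations equal the observed ones; on this toy problem the fitted distribution also matches p_emp, since the features pin down all three events.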
Approximation for calculating feature expectation

E_p f_j = Σ_x p(x) f_j(x)
 = Σ_{a∈A, b∈B} p(a, b) f_j(a, b)
 = Σ_{a∈A, b∈B} p(b) p(a|b) f_j(a, b)
 ≈ Σ_{a∈A, b∈B} p̃(b) p(a|b) f_j(a, b)
 = (1/N) Σ_{i=1}^N Σ_{a∈A} p(a|b_i) f_j(a, b_i)
Properties of GIS
- L(p^{(n+1)}) >= L(p^{(n)})
- The sequence is guaranteed to converge to p*.
- Convergence can be very slow.
- The running time of each iteration is O(NPA):
  - N: the training set size
  - P: the number of classes
  - A: the average number of features that are active for a given event (a, b)
Feature selection
- Throw in many features and let the machine select the weights:
  - Manually specify feature templates
  - Problem: too many features
- An alternative: a greedy algorithm:
  - Start with an empty set S
  - Add a feature at each iteration
Notation
- With the feature set S: p_S is the model built from the features in S.
- After adding a feature f: the model becomes p_{S∪{f}}.
- The gain in the log-likelihood of the training data: ΔL(S, f) = L(p_{S∪{f}}) − L(p_S).
Feature selection algorithm (Berger et al., 1996)
- Start with S being empty; thus p_S is uniform.
- Repeat until the gain is small enough:
  - For each candidate feature f:
    - Compute the model p_{S∪{f}} using IIS
    - Calculate the log-likelihood gain
  - Choose the feature with maximal gain, and add it to S

Problem: too expensive
Approximating gains (Berger et al., 1996)
- Instead of recalculating all the weights, calculate only the weight of the new feature.
Training a MaxEnt Model

Scenario #1:
- Define feature templates
- Create the feature set
- Determine the optimum feature weights via GIS or IIS

Scenario #2:
- Define feature templates
- Create the candidate feature set S
- At every iteration, choose the feature from S with max gain and determine its weight (or choose the top-n features and their weights).
Case study
POS tagging (Ratnaparkhi, 1996)
- Notation variation:
  - f_j(a, b): a: class, b: context
  - f_j(h_i, t_i): h_i: history for the ith word, t_i: tag for the ith word
- History: h_i = {w_i, w_{i−1}, w_{i−2}, w_{i+1}, w_{i+2}, t_{i−2}, t_{i−1}}
- Training data:
  - Treat it as a list of (h_i, t_i) pairs.
  - How many pairs are there?
Using a MaxEnt Model
- Modeling
- Training:
  - Define feature templates
  - Create the feature set
  - Determine the optimum feature weights via GIS or IIS
- Decoding
Modeling

P(t_1 … t_n | w_1 … w_n)
 = Π_{i=1}^n p(t_i | t_1 … t_{i−1}, w_1 … w_n)
 ≈ Π_{i=1}^n p(t_i | h_i)

p(t | h) = p(h, t) / Σ_{t′∈T} p(h, t′)
Training step 1: define feature templates

History h_i, Tag t_i

| Condition | Features |
| --- | --- |
| wi is not rare | wi = X & ti = T |
| wi is rare | X is prefix of wi, \|X\| ≤ 4 & ti = T |
| | X is suffix of wi, \|X\| ≤ 4 & ti = T |
| | wi contains a number & ti = T |
| | wi contains an uppercase character & ti = T |
| | wi contains a hyphen & ti = T |
| ∀ wi | ti-1 = X & ti = T |
| | ti-2 ti-1 = X Y & ti = T |
| | wi-1 = X & ti = T |
| | wi-2 = X & ti = T |
| | wi+1 = X & ti = T |
| | wi+2 = X & ti = T |
Step 2: Create the feature set
- Collect all the features from the training data.
- Throw away features that appear fewer than 10 times.
Step 3: determine the feature weights
- GIS
- Training time: each iteration is O(NTA):
  - N: the training set size
  - T: the number of allowable tags
  - A: the average number of features that are active for a given (h, t)
- How many features?
Decoding: beam search
- Generate tags for w_1, find the top N, and set s_{1j} accordingly, j = 1, 2, …, N
- for i = 2 to n (n is the sentence length):
  - for j = 1 to N:
    - generate tags for w_i, given s_{(i−1)j} as the previous tag context
    - append each tag to s_{(i−1)j} to make a new sequence
  - find the N highest-probability sequences generated above, and set s_{ij} accordingly, j = 1, …, N
- Return the highest-probability sequence s_{n1}.
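The decoder above can be sketched in a few lines. Here p(t | h) is faked with a toy conditional table conditioned only on the previous tag (in the real tagger it would come from the MaxEnt model, and the history would include surrounding words); tags, words, and probabilities are invented for the sketch:

```python
import math

TAGS = ["D", "N", "V"]

def p_tag(tag, prev_tag, word):
    """Hypothetical stand-in for the MaxEnt conditional p(t_i | h_i)."""
    table = {("D", None): 0.6, ("N", "D"): 0.7, ("V", "N"): 0.6}
    return table.get((tag, prev_tag), 0.1)

def beam_search(words, beam_size=3):
    # Each hypothesis is (log-probability, tag sequence so far).
    beams = [(0.0, [])]
    for w in words:
        cands = []
        for logp, seq in beams:
            prev = seq[-1] if seq else None
            for t in TAGS:  # extend every surviving sequence with every tag
                cands.append((logp + math.log(p_tag(t, prev, w)), seq + [t]))
        # The extensions compete for the beam_size slots at the next position.
        beams = sorted(cands, reverse=True)[:beam_size]
    return beams[0][1]

print(beam_search(["the", "dog", "barks"]))  # ['D', 'N', 'V']
```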
Beam search
- Beam inference:
  - At each position, keep the top k complete sequences.
  - Extend each sequence in each local way.
  - The extensions compete for the k slots at the next position.
- Advantages:
  - Fast: beam sizes of 3-5 are as good or almost as good as exact inference in many cases.
  - Easy to implement (no dynamic programming required).
- Disadvantage:
  - Inexact: the globally best sequence can fall off the beam.
Viterbi search
- Viterbi inference:
  - Dynamic programming or memoization.
  - Requires a small window of state influence (e.g., only the past two states are relevant).
- Advantages:
  - Exact: the globally best sequence is returned.
- Disadvantages:
  - Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway).
Decoding (cont)
- Tags for words:
  - Known words: use a tag dictionary.
  - Unknown words: try all possible tags.
- Ex: "time flies like an arrow"
- Running time: O(NTAB):
  - N: sentence length
  - B: beam size
  - T: tagset size
  - A: average number of features that are active for a given event
Experiment results
Comparison with other learners
- HMM: MaxEnt uses more context.
- SDT: MaxEnt does not split data.
- TBL: MaxEnt is statistical and provides probability distributions.
MaxEnt Summary
- Concept: choose the p* that maximizes entropy while satisfying all the constraints.
- Max likelihood: p* is also the model, within a model family, that maximizes the log-likelihood of the training data.
- Training: GIS or IIS, which can be slow.
- MaxEnt handles overlapping features well.
- In general, MaxEnt achieves good performance on many NLP tasks.