Learning for Structured Prediction
Linear Methods For Sequence Labeling:
Hidden Markov Models vs Structured Perceptron
Ivan Titov
Last Time: Structured Prediction
1. Selecting a feature representation φ(x, y): it should be sufficient to discriminate correct structures from incorrect ones, and it should be possible to decode with it (see (3))
2. Learning: which error function to optimize on the training set (for example, the margin below), and how to make it efficient (see (3))
3. Decoding: dynamic programming for simpler representations? Approximate search for more powerful ones?
We illustrated all these challenges on the example of dependency parsing.
Prediction: $y = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Margin: $w \cdot \varphi(x, y^*) - \max_{y' \in \mathcal{Y}(x),\, y' \neq y^*} w \cdot \varphi(x, y') > \gamma$
Here $x$ is an input (e.g., a sentence), $y$ is an output (a syntactic tree), and $\varphi(x, y)$ is the feature representation.
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Sequence Labeling Problems
Definition:
Input: sequences of variable length
Output: every position is labeled
Examples:
Part-of-speech tagging
Named-entity recognition, shallow parsing (“chunking”), gesture recognition
from video-streams, …
$x = (x_1, x_2, \ldots, x_{|x|})$, $x_i \in X$
$y = (y_1, y_2, \ldots, y_{|x|})$, $y_i \in Y$
x = John carried a tin can .
y = NNP VBD DT NN NN .
Part-of-speech tagging
Labels:
NNP – proper singular noun; VBD – verb, past tense; DT – determiner; NN – singular noun; MD – modal; . – final punctuation
Consider:
x = John carried a tin can .
y = NNP VBD DT NN NN (or MD?) .
If you just predict the most frequent tag for each word, you will make a mistake here (tagging "can" as MD). In fact, even knowing that the previous word is a noun is not enough:
x = Tin can cause poisoning …
y = NN MD VB NN …
One needs to model interactions between labels to successfully resolve ambiguities, so this should be tackled as a structured prediction problem.
Named Entity Recognition
[ORG Chelsea], despite their name, are not based in [LOC Chelsea], but in neighbouring [LOC Fulham].
Not as trivial as it may seem; consider:
[PERS Bill Clinton] will not embarrass [PERS Chelsea] at her wedding (Chelsea can be a person too!)
Tiger failed to make a birdie in the South Course … (is it an animal or a person?)
Encoding example (BIO encoding):
x = Bill Clinton embarrassed Chelsea at her wedding at Astor Courts
y = B-PERS I-PERS O B-PERS O O O O B-LOC I-LOC
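To make the encoding concrete, here is a minimal sketch (illustrative code, not from the lecture) that converts labeled entity spans into per-token BIO tags:

def spans_to_bio(tokens, spans):
    # spans: list of (start, end, label) with end exclusive
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # tokens inside the entity
    return tags

tokens = "Bill Clinton embarrassed Chelsea at her wedding at Astor Courts".split()
spans = [(0, 2, "PERS"), (3, 4, "PERS"), (8, 10, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-PERS', 'I-PERS', 'O', 'B-PERS', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']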
Vision: Gesture Recognition
Given a sequence of frames in a video annotate each frame with a gesture type:
Types of gestures (figures from Wang et al., CVPR 06): flip back, shrink vertically, expand vertically, double back, point and back, expand horizontally
It is hard to predict gestures from each frame in isolation; you need to exploit relations between frames and gesture types.
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Hidden Markov Models
We will consider the part-of-speech (POS) tagging example.
A "generative" model, i.e.:
Model: introduce a parameterized model $P(x, y \mid \theta)$ of how both words and tags are generated
Learning: use a labeled training set to estimate the most likely parameters $\theta$ of the model
Decoding: $y = \arg\max_{y'} P(x, y' \mid \theta)$
x = John carried a tin can .
y = NNP VBD DT NN NN .
Hidden Markov Models
A simplistic state diagram for noun phrases (N – number of tags, M – vocabulary size):
States correspond to POS tags
Words are emitted independently from each POS tag
Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence
[Figure: state diagram with states $, Det, Adj, Noun; emission lists are attached to the states (e.g., Det: 0.5 "the", 0.5 "a"; Adj: 0.01 "red", 0.01 "hungry", …; Noun: 0.01 "herring", 0.01 "dog", …) and transition probabilities (0.8, 0.5, 0.2, 0.1, 1.0) label the arcs. Example generation: "a hungry dog".]
Hidden Markov Models
Representation as an instantiation of a graphical model (N – number of tags, M – vocabulary size):
States correspond to POS tags
Words are emitted independently from each POS tag
Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence
[Figure: chain-structured graphical model $y^{(1)} \to y^{(2)} \to y^{(3)} \to y^{(4)} \to \cdots$ with an emission arrow $y^{(t)} \to x^{(t)}$ at each position; instantiated as $y$ = (Det, Adj, Noun, …), $x$ = (a, hungry, dog, …). An arrow means that in the generative story $x^{(4)}$ is generated from some $P(x^{(4)} \mid y^{(4)})$.]
Hidden Markov Models: Estimation
N – the number of tags, M – vocabulary size
Parameters (to be estimated from the training set):
Transition probabilities $a_{ji} = P(y^{(t)} = i \mid y^{(t-1)} = j)$: A, an [N x N] matrix
Emission probabilities $b_{ik} = P(x^{(t)} = k \mid y^{(t)} = i)$: B, an [N x M] matrix
Training corpus:
x(1) = (In, an, Oct., 19, review, of, …), y(1) = (IN, DT, NNP, CD, NN, IN, …)
x(2) = (Ms., Haag, plays, Elianti, .), y(2) = (NNP, NNP, VBZ, NNP, .)
…
x(L) = (The, company, said, …), y(L) = (DT, NN, VBD, NNP, .)
How do we estimate the parameters using maximum likelihood estimation? You can probably guess what the estimates should be.
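Before deriving the estimates, it may help to see the generative story in code. A minimal sketch (toy parameter values, "$" as the boundary tag; not from the slides): the joint probability of one tagged sentence is a product of transition and emission probabilities.

trans = {("$", "DT"): 0.8, ("DT", "JJ"): 0.3, ("JJ", "NN"): 0.9, ("NN", "$"): 0.5}
emit = {("DT", "a"): 0.5, ("JJ", "hungry"): 0.01, ("NN", "dog"): 0.01}

def joint_prob(words, tags):
    p = 1.0
    prev = "$"                      # sentences start in the boundary state
    for w, t in zip(words, tags):
        p *= trans[(prev, t)]       # transition P(t | prev)
        p *= emit[(t, w)]           # emission P(w | t)
        prev = t
    return p * trans[(prev, "$")]   # transit back into the $ state

print(joint_prob(["a", "hungry", "dog"], ["DT", "JJ", "NN"]))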
Hidden Markov Models: Estimation
Parameters (to be estimated from the training set):
Transition probabilities $a_{ji} = P(y^{(t)} = i \mid y^{(t-1)} = j)$: A, an [N x N] matrix
Emission probabilities $b_{ik} = P(x^{(t)} = k \mid y^{(t)} = i)$: B, an [N x M] matrix
Training corpus: (x(l), y(l)), l = 1, …, L
Write down the probability of the corpus according to the HMM:
$$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{l=1}^{L} P(x^{(l)}, y^{(l)}) = \prod_{l=1}^{L} a_{\$, y_1^{(l)}} \Big( \prod_{t=1}^{|x^{(l)}|-1} b_{y_t^{(l)}, x_t^{(l)}} \, a_{y_t^{(l)}, y_{t+1}^{(l)}} \Big) \, b_{y_{|x^{(l)}|}^{(l)}, x_{|x^{(l)}|}^{(l)}} \, a_{y_{|x^{(l)}|}^{(l)}, \$}$$
(Select the tag for the first word; repeatedly draw a word from the current state and select the next state; draw the last word; transit into the $ state.)
$$= \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$$
CT(i,j) is the number of times tag i is followed by tag j; here we assume that $ is a special tag which precedes and succeeds every sentence. CE(i,k) is the number of times word k is emitted by tag i.
Hidden Markov Models: Estimation
Maximize:
$$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$$
Equivalently, maximize the logarithm of this:
$$\log P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \sum_{i=1}^{N} \Big( \sum_{j=1}^{N} CT(i,j) \log a_{i,j} + \sum_{k=1}^{M} CE(i,k) \log b_{i,k} \Big)$$
subject to probabilistic constraints:
$$\sum_{j=1}^{N} a_{i,j} = 1, \quad \sum_{k=1}^{M} b_{i,k} = 1, \qquad i = 1, \ldots, N$$
(As before, CT(i,j) is the number of times tag i is followed by tag j, and CE(i,k) is the number of times word k is emitted by tag i.)
Or, we can decompose it into 2N independent optimization tasks:
For transitions, i = 1, …, N:
$$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i,j) \log a_{i,j} \quad \text{s.t.} \quad \sum_{j=1}^{N} a_{i,j} = 1$$
For emissions, i = 1, …, N:
$$\max_{b_{i,1}, \ldots, b_{i,M}} \sum_{k=1}^{M} CE(i,k) \log b_{i,k} \quad \text{s.t.} \quad \sum_{k=1}^{M} b_{i,k} = 1$$
Hidden Markov Models: Estimation
For transitions (some i):
$$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i,j) \log a_{i,j} \quad \text{s.t.} \quad 1 - \sum_{j=1}^{N} a_{i,j} = 0$$
Constrained optimization task; Lagrangian:
$$L(a_{i,1}, \ldots, a_{i,N}, \lambda) = \sum_{j=1}^{N} CT(i,j) \log a_{i,j} + \lambda \Big( 1 - \sum_{j=1}^{N} a_{i,j} \Big)$$
Find the critical points of the Lagrangian by solving the system of equations:
$$\frac{\partial L}{\partial \lambda} = 1 - \sum_{j=1}^{N} a_{i,j} = 0, \qquad \frac{\partial L}{\partial a_{i,j}} = \frac{CT(i,j)}{a_{i,j}} - \lambda = 0 \;\Longrightarrow\; a_{i,j} = \frac{CT(i,j)}{\lambda}$$
$$P(y_t = j \mid y_{t-1} = i) = a_{i,j} = \frac{CT(i,j)}{\sum_{j'} CT(i,j')}$$
Similarly, for emissions:
$$P(x_t = k \mid y_t = i) = b_{i,k} = \frac{CE(i,k)}{\sum_{k'} CE(i,k')}$$
The maximum likelihood solution is just normalized counts of events. It is always like this for generative models if all the labels are visible in training.
I ignore "smoothing" for rare or unseen word–tag combinations; it is outside the scope of the seminar.
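A minimal sketch of this estimator (illustrative code, not from the lecture): collect the counts CT and CE from a tagged corpus and normalize each row, with "$" as the boundary tag and no smoothing.

from collections import defaultdict

def mle_estimate(corpus):
    # corpus: list of (words, tags) pairs
    ct = defaultdict(float)  # ct[(i, j)]: #times tag i is followed by tag j
    ce = defaultdict(float)  # ce[(i, k)]: #times tag i emits word k
    for words, tags in corpus:
        for prev, t in zip(["$"] + tags, tags + ["$"]):
            ct[(prev, t)] += 1
        for w, t in zip(words, tags):
            ce[(t, w)] += 1
    ct_tot, ce_tot = defaultdict(float), defaultdict(float)
    for (i, j), c in ct.items():
        ct_tot[i] += c
    for (i, k), c in ce.items():
        ce_tot[i] += c
    a = {(i, j): c / ct_tot[i] for (i, j), c in ct.items()}  # transitions
    b = {(i, k): c / ce_tot[i] for (i, k), c in ce.items()}  # emissions
    return a, b

a, b = mle_estimate([("a hungry dog".split(), ["DT", "JJ", "NN"])])
print(a[("$", "DT")], b[("NN", "dog")])  # 1.0 1.0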
HMMs as linear models
Decoding:
x = John carried a tin can .
y = ? ? ? ? ? .
$$y = \arg\max_{y'} P(x, y' \mid A, B) = \arg\max_{y'} \log P(x, y' \mid A, B)$$
We will talk about the decoding algorithm slightly later; first, let us generalize the Hidden Markov Model:
$$\log P(x, y' \mid A, B) = \sum_{t=1}^{|x|+1} \log a_{y'_{t-1}, y'_t} + \sum_{t=1}^{|x|} \log b_{y'_t, x_t} = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \log b_{i,k}$$
where $y'_0 = y'_{|x|+1} = \$$, CT(y', i, j) is the number of times tag i is followed by tag j in the candidate y', and CE(x, y', i, k) is the number of times tag i corresponds to word k in (x, y').
But this is just a linear model!!
Scoring: example
(x, y') = John carried a tin can . / NNP VBD DT NN NN .
Unary features (NNP:John, NNP:Mary, …) and edge features (NNP-VBD, NN-., MD-., …):
$$\varphi(x, y') = (1, 0, \ldots, 1, 1, 0, \ldots)$$
$$w_{ML} = (\log b_{NNP,John}, \log b_{NNP,Mary}, \ldots, \log a_{NNP,VBD}, \log a_{NN,.}, \log a_{MD,.}, \ldots)$$
Their inner product is exactly $\log P(x, y' \mid A, B)$:
$$w_{ML} \cdot \varphi(x, y') = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \log b_{i,k}$$
But maybe there are other (and better?) ways to estimate $w$, especially when we know that the HMM is not a faithful model of reality? It is not only a theoretical question! (We'll talk about that in a moment.)
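To make the linear-model view concrete, here is a small sketch (illustrative, with a made-up feature encoding): the count features dotted with log-probability weights reproduce the HMM log-probability.

import math
from collections import Counter

def hmm_features(words, tags):
    # CT(y',i,j) and CE(x,y',i,k) as a sparse vector, "$" = boundary tag
    phi = Counter()
    for prev, t in zip(["$"] + tags, tags + ["$"]):
        phi[("T", prev, t)] += 1        # transition feature
    for w, t in zip(words, tags):
        phi[("E", t, w)] += 1           # emission feature
    return phi

# w_ML: log transition / emission probabilities (toy values)
w_ml = {("T", "$", "DT"): math.log(0.8), ("T", "DT", "NN"): math.log(0.3),
        ("T", "NN", "$"): math.log(0.5),
        ("E", "DT", "a"): math.log(0.5), ("E", "NN", "dog"): math.log(0.01)}

phi = hmm_features(["a", "dog"], ["DT", "NN"])
score = sum(w_ml[f] * v for f, v in phi.items())   # w_ML · φ(x, y')
print(math.isclose(score, math.log(0.8 * 0.5 * 0.3 * 0.01 * 0.5)))  # True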
Feature view
Basically, we define features which correspond to edges in the graph:
[Figure: the chain-structured graphical model $y^{(1)} \to \cdots \to y^{(4)}$ with emissions $x^{(1)}, \ldots, x^{(4)}$; the $x$ nodes are shaded because they are visible (both in training and testing).]
Generative modeling
For a very large dataset (asymptotic analysis):
If the data is generated from some "true" HMM, then (if the training set is sufficiently large) we are guaranteed to have an optimal tagger
Otherwise, (generally) the HMM will not correspond to an optimal linear classifier; the real case is that an HMM is a coarse approximation of reality
Discriminative methods, which minimize the error more directly, are guaranteed (under some fairly general conditions) to converge to an optimal linear classifier
For smaller training sets:
Generative classifiers converge faster to their optimal error [Ng & Jordan, NIPS 01]
[Figure: errors of a generative model vs. a discriminative classifier as a function of the number of training examples, on a regression dataset (predicting housing prices in the Boston area).]
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Perceptron
Let us start with a binary classification problem: $y \in \{+1, -1\}$
For binary classification the prediction rule is: $y = \mathrm{sign}(w \cdot \varphi(x))$ (break ties at 0 in some deterministic way)
Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$:
w = 0                                   // initialize
do
  err = 0
  for l = 1 .. L                        // over the training examples
    if ( y^(l) (w · φ(x^(l))) < 0 )     // if mistake
      w += η y^(l) φ(x^(l))             // update, η > 0
      err++                             // count errors
    endif
  endfor
while ( err > 0 )                       // repeat until no errors
return w
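A runnable version of this pseudocode might look as follows (a minimal sketch; feature vectors are assumed to be precomputed):

import numpy as np

def perceptron(Phi, Y, eta=1.0, max_epochs=100):
    # Phi: (L, d) array of feature vectors phi(x); Y: labels in {+1, -1}
    w = np.zeros(Phi.shape[1])            # initialize
    for _ in range(max_epochs):
        err = 0
        for phi, y in zip(Phi, Y):        # over the training examples
            if y * (w @ phi) <= 0:        # mistake (ties counted as mistakes)
                w += eta * y * phi        # update
                err += 1
        if err == 0:                      # stop when no errors
            break
    return w

# usage: two linearly separable points
w = perceptron(np.array([[1.0, 2.0], [2.0, -1.0]]), np.array([+1, -1]))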
Linear classification
Linearly separable case, "a perfect" classifier (figure adapted from Dan Roth's class at UIUC):
Linear functions are often written as $y = \mathrm{sign}(w \cdot \varphi(x) + b)$, with decision hyperplane $w \cdot \varphi(x) + b = 0$, but we can drop the explicit bias $b$ by assuming that $\varphi(x)_0 = 1$ for any $x$.
Perceptron: geometric interpretation
if ( y^(l) (w · φ(x^(l))) < 0 )   // if mistake
  w += η y^(l) φ(x^(l))           // update
endif
[Figures on slides 23–26 step through the effect of this update on the decision boundary.]
Perceptron: algebraic interpretation
if ( y^(l) (w · φ(x^(l))) < 0 )   // if mistake
  w += η y^(l) φ(x^(l))           // update
endif
We want the update to increase $y^{(l)} (w \cdot \varphi(x^{(l)}))$; if the increase is large enough, there will be no misclassification.
Let's see that this is what happens after the update:
$$y^{(l)} \big( (w + \eta y^{(l)} \varphi(x^{(l)})) \cdot \varphi(x^{(l)}) \big) = y^{(l)} \big( w \cdot \varphi(x^{(l)}) \big) + \eta (y^{(l)})^2 \big( \varphi(x^{(l)}) \cdot \varphi(x^{(l)}) \big)$$
Since $(y^{(l)})^2 = 1$ and the squared norm $\varphi(x^{(l)}) \cdot \varphi(x^{(l)}) > 0$, the score increases.
So, the perceptron update moves the decision hyperplane towards the misclassified $\varphi(x^{(l)})$.
Perceptron
The perceptron algorithm, obviously, can only converge if the training set is linearly separable.
It is guaranteed to converge in a finite number of iterations, depending on how well the two classes are separated (Novikoff, 1962).
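For reference (the standard statement of the bound, not spelled out on the slide): if $\|\varphi(x^{(l)})\| \le R$ for all examples and some unit-norm $w^*$ separates the data with margin $\gamma$, i.e. $y^{(l)} (w^* \cdot \varphi(x^{(l)})) \ge \gamma$ for all $l$, then the perceptron makes at most $(R / \gamma)^2$ mistakes before converging.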
Averaged Perceptron
A small modification:
w = 0, wΣ = 0                        // initialize
for k = 1 .. K                       // fixed number of iterations: do not run until convergence
  for l = 1 .. L                     // over the training examples
    if ( y^(l) (w · φ(x^(l))) < 0 )  // if mistake
      w += η y^(l) φ(x^(l))          // update, η > 0
    endif
    wΣ += w                          // sum of w over the course of training; note: it is after endif
  endfor
endfor
return wΣ / (KL)
Averaging is more stable in training: a vector w which survived more iterations without updates is more similar to the resulting vector wΣ / (KL), as it was added a larger number of times.
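A runnable sketch of the averaged version (illustrative; same assumptions as the binary perceptron sketch above):

import numpy as np

def averaged_perceptron(Phi, Y, eta=1.0, K=10):
    w = np.zeros(Phi.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(K):                    # do not run until convergence
        for phi, y in zip(Phi, Y):        # over the training examples
            if y * (w @ phi) <= 0:        # if mistake
                w += eta * y * phi        # update
            w_sum += w                    # accumulate after the endif
    return w_sum / (K * len(Y))           # (1/KL) * wΣ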
Structured Perceptron
Let us now return to the structured problem: $y = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$:
w = 0                                             // initialize
do
  err = 0
  for l = 1 .. L                                  // over the training examples
    ŷ = argmax_{y' ∈ Y(x^(l))} w · φ(x^(l), y')   // model prediction
    if ( w · φ(x^(l), ŷ) > w · φ(x^(l), y^(l)) )  // if mistake
      w += η (φ(x^(l), y^(l)) − φ(x^(l), ŷ))      // update: pushes the correct sequence up and the incorrectly predicted one down
      err++                                       // count errors
    endif
  endfor
while ( err > 0 )                                 // repeat until no errors
return w
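A minimal sketch in code (illustrative: `decode` stands for whatever argmax decoder is available, e.g. the Viterbi algorithm later in the lecture, and `phi` returns sparse feature counts):

from collections import Counter

def structured_perceptron(data, phi, decode, eta=1.0, epochs=10):
    # data: list of (x, y) pairs; phi(x, y) -> Counter of feature counts;
    # decode(x, w) -> argmax over y of w · phi(x, y)
    w = Counter()
    for _ in range(epochs):
        for x, y in data:                           # over the training examples
            y_hat = decode(x, w)                    # model prediction
            if y_hat != y:                          # if mistake
                for f, v in phi(x, y).items():      # push the correct sequence up
                    w[f] += eta * v
                for f, v in phi(x, y_hat).items():  # push the prediction down
                    w[f] -= eta * v
    return w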
Str. perceptron: algebraic interpretation
if ( w · φ(x^(l), ŷ) > w · φ(x^(l), y^(l)) )  // if mistake
  w += η (φ(x^(l), y^(l)) − φ(x^(l), ŷ))      // update
endif
We want the update to increase $w \cdot \big( \varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y}) \big)$; if the increase is large enough, then $y^{(l)}$ will be scored above $\hat{y}$.
Clearly this is achieved, as this product is increased by $\eta \, \| \varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y}) \|^2$.
(There might be other bad $y' \in \mathcal{Y}(x^{(l)})$, but we will deal with them on the next iterations.)
Structured Perceptron
Positive:
Very easy to implement
Often achieves respectable results
Like other discriminative techniques, does not make assumptions about the generative process
Additional features can be easily integrated, as long as decoding is tractable
Drawbacks:
"Good" discriminative algorithms should optimize some measure which is closely related to the expected testing results; what the perceptron is doing on non-linearly separable data is not clear
However, for the averaged (voted) version there is a generalization bound which characterizes the generalization properties of the perceptron (Freund & Schapire, 98)
Later, we will consider more advanced learning algorithms
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Decoding with the Linear model
Decoding: $y = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Again a linear model with the following edge features (a generalization of an HMM)
In fact, the algorithm does not depend on the features of the input (they do not need to be local)
[Figures: the chain over $y^{(1)}, \ldots, y^{(4)}$, first with per-position inputs $x^{(1)}, \ldots, x^{(4)}$, then with a single input $x$ visible to every edge.]
Decoding with the Linear model
Decoding: $y = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Let's change notation:
Edge scores $f_t(y_{t-1}, y_t, x)$: for the HMM, this roughly corresponds to $\log a_{y_{t-1}, y_t} + \log b_{y_t, x_t}$
Defined at the start too, via the special tag $y_0 = \$$; start/stop symbol information ($) can be encoded with the edge scores as well
Decode: $y = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$
Time complexity?
Decoding: a dynamic programming algorithm – the Viterbi algorithm
Viterbi algorithm
Decoding: $y = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$
Loop invariant (t = 1, …, |x|):
score_t[y] – the score of the highest scoring sequence up to position t ending with tag y
prev_t[y] – the previous tag on this sequence
Init: score_0[$] = 0, score_0[y] = −∞ for other y
Recomputation (t = 1, …, |x|):
$$\mathrm{prev}_t[y] = \arg\max_{y'} \, \mathrm{score}_{t-1}[y'] + f_t(y', y, x)$$
$$\mathrm{score}_t[y] = \mathrm{score}_{t-1}[\mathrm{prev}_t[y]] + f_t(\mathrm{prev}_t[y], y, x)$$
Return: retrace the prev pointers starting from $\arg\max_y \mathrm{score}_{|x|}[y]$
Time complexity: $O(N^2 |x|)$
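A compact implementation sketch (illustrative; `f(t, prev, cur, x)` is the edge-score function $f_t$ and `tags` the tag set, both assumed given):

def viterbi(x, tags, f):
    # returns argmax_y sum_t f(t, y[t-1], y[t], x), with "$" as start tag
    n = len(x)
    score = [{"$": 0.0}]                   # score_0[$] = 0, others -inf
    prev = [{}]
    for t in range(1, n + 1):
        score.append({}); prev.append({})
        for y in tags:
            # best previous tag for sequences ending in y at position t
            best = max(score[t - 1],
                       key=lambda yp: score[t - 1][yp] + f(t, yp, y, x))
            prev[t][y] = best
            score[t][y] = score[t - 1][best] + f(t, best, y, x)
    y_t = max(score[n], key=score[n].get)  # best final tag
    path = [y_t]
    for t in range(n, 1, -1):              # retrace the prev pointers
        y_t = prev[t][y_t]
        path.append(y_t)
    return list(reversed(path))            # O(N^2 |x|) time overall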
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Recap: Sequence Labeling
Hidden Markov Models: how to estimate the parameters (maximum likelihood)
Discriminative models: how to learn with the structured perceptron
Both learning algorithms result in a linear model
How to label with the linear model (Viterbi decoding)
Discriminative vs Generative
Generative models:
Cheap to estimate: simply normalized counts
Hard to integrate complex features: need to come up with a generative story, and this story may be wrong
Do not result in an optimal classifier when model assumptions are wrong (i.e., always)
Discriminative models:
More expensive to learn: need to run decoding (here, Viterbi) during training, and usually multiple times per example
Easy to integrate features: though some features may make decoding intractable
Usually less accurate on small datasets (not necessarily the case when compared with generative models with latent variables)
Reminders
Speakers: slides are due about a week before the talk; meetings with me before/after this point will normally be needed
Reviewers: reviews are accepted only before the day we consider the topic
Everyone: references to the papers to read are at GoogleDocs
These slides (and the previous ones) will be online today; speakers: send me the final version of your slides too
Next time: Lea on Models of Parsing, PCFGs vs general WCFGs (Michael Collins' book chapter)