Learning for Structured Prediction
Linear Methods For Sequence Labeling:
Hidden Markov Models vs Structured Perceptron
Ivan Titov
Last Time: Structured Prediction
1. Selecting a feature representation φ(x, y): it should be sufficient to discriminate correct structures from incorrect ones, and it should be possible to decode with it (see (3))
2. Learning: which error function to optimize on the training set (for example, the margin below), and how to make it efficient (see (3))
3. Decoding: dynamic programming for simpler representations? Approximate search for more powerful ones?
We illustrated all these challenges on the example of dependency parsing.
Prediction: $y = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Margin: $w \cdot \varphi(x, y^*) - \max_{y' \in \mathcal{Y}(x),\, y' \neq y^*} w \cdot \varphi(x, y') > \gamma$
Here $x$ is an input (e.g., a sentence), $y$ is an output (a syntactic tree), and $\varphi(x, y)$ is the feature representation.
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Sequence Labeling Problems
Definition:
Input: sequences of variable length
Output: every position is labeled
Examples:
Part-of-speech tagging
Named-entity recognition, shallow parsing (“chunking”), gesture recognition
from video-streams, …
$x = (x_1, x_2, \ldots, x_{|x|})$, $x_i \in X$
$y = (y_1, y_2, \ldots, y_{|x|})$, $y_i \in Y$
x = John carried a tin can .
y = NNP VBD DT NN NN .
Part-of-speech tagging
Labels:
NNP – proper singular noun; VBD – verb, past tense; DT – determiner; NN – singular noun; MD – modal; . – final punctuation
Consider:
x = John carried a tin can .
y = NNP VBD DT NN NN (or MD?) .
If you just predict the most frequent tag for each word, you will make a mistake here (tagging "can" as MD). In fact, even knowing that the previous word is a noun is not enough:
x = Tin can cause poisoning …
y = NN MD VB NN …
One needs to model interactions between labels to successfully resolve ambiguities, so this should be tackled as a structured prediction problem.
Named Entity Recognition
[ORG Chelsea], despite their name, are not based in [LOC Chelsea], but in neighbouring [LOC Fulham].
Not as trivial as it may seem; consider:
[PERS Bill Clinton] will not embarrass [PERS Chelsea] at her wedding (Chelsea can be a person too!)
Tiger failed to make a birdie in the South Course … (is it an animal or a person?)
Encoding example (BIO encoding):
x = Bill Clinton embarrassed Chelsea at her wedding at Astor Courts
y = B-PERS I-PERS O B-PERS O O O O B-LOC I-LOC
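To make the encoding concrete, here is a minimal sketch (illustrative code, not from the lecture) that converts labeled entity spans into per-token BIO tags:

def spans_to_bio(tokens, spans):
    # spans: list of (start, end, label) with end exclusive
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # tokens inside the entity
    return tags

tokens = "Bill Clinton embarrassed Chelsea at her wedding at Astor Courts".split()
spans = [(0, 2, "PERS"), (3, 4, "PERS"), (8, 10, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-PERS', 'I-PERS', 'O', 'B-PERS', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']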
Vision: Gesture Recognition
Given a sequence of frames in a video annotate each frame with a gesture type:
Types of gestures (figures from Wang et al., CVPR 06): flip back, shrink vertically, expand vertically, double back, point and back, expand horizontally
It is hard to predict gestures from each frame in isolation; you need to exploit relations between frames and gesture types.
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Hidden Markov Models
We will consider the part-of-speech (POS) tagging example.
A "generative" model, i.e.:
Model: introduce a parameterized model $P(x, y \mid \theta)$ of how both words and tags are generated
Learning: use a labeled training set to estimate the most likely parameters $\theta$ of the model
Decoding: $y = \arg\max_{y'} P(x, y' \mid \theta)$
x = John carried a tin can .
y = NNP VBD DT NN NN .
Hidden Markov Models
A simplistic state diagram for noun phrases (N – number of tags, M – vocabulary size):
States correspond to POS tags
Words are emitted independently from each POS tag
Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence
[Figure: state diagram with states $, Det, Adj, Noun; emission lists are attached to the states (e.g., Det: 0.5 "the", 0.5 "a"; Adj: 0.01 "red", 0.01 "hungry", …; Noun: 0.01 "herring", 0.01 "dog", …) and transition probabilities (0.8, 0.5, 0.2, 0.1, 1.0) label the arcs. Example generation: "a hungry dog".]
Hidden Markov Models
Representation as an instantiation of a graphical model (N – number of tags, M – vocabulary size):
States correspond to POS tags
Words are emitted independently from each POS tag
Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence
[Figure: chain-structured graphical model $y^{(1)} \to y^{(2)} \to y^{(3)} \to y^{(4)} \to \cdots$ with an emission arrow $y^{(t)} \to x^{(t)}$ at each position; instantiated as $y$ = (Det, Adj, Noun, …), $x$ = (a, hungry, dog, …). An arrow means that in the generative story $x^{(4)}$ is generated from some $P(x^{(4)} \mid y^{(4)})$.]
Hidden Markov Models: Estimation
N – the number of tags, M – vocabulary size
Parameters (to be estimated from the training set):
Transition probabilities $a_{ji} = P(y^{(t)} = i \mid y^{(t-1)} = j)$: A, an [N x N] matrix
Emission probabilities $b_{ik} = P(x^{(t)} = k \mid y^{(t)} = i)$: B, an [N x M] matrix
Training corpus:
x(1) = (In, an, Oct., 19, review, of, …), y(1) = (IN, DT, NNP, CD, NN, IN, …)
x(2) = (Ms., Haag, plays, Elianti, .), y(2) = (NNP, NNP, VBZ, NNP, .)
…
x(L) = (The, company, said, …), y(L) = (DT, NN, VBD, NNP, .)
How do we estimate the parameters using maximum likelihood estimation? You can probably guess what the estimates should be.
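Before deriving the estimates, it may help to see the generative story in code. A minimal sketch (toy parameter values, "$" as the boundary tag; not from the slides): the joint probability of one tagged sentence is a product of transition and emission probabilities.

trans = {("$", "DT"): 0.8, ("DT", "JJ"): 0.3, ("JJ", "NN"): 0.9, ("NN", "$"): 0.5}
emit = {("DT", "a"): 0.5, ("JJ", "hungry"): 0.01, ("NN", "dog"): 0.01}

def joint_prob(words, tags):
    p = 1.0
    prev = "$"                      # sentences start in the boundary state
    for w, t in zip(words, tags):
        p *= trans[(prev, t)]       # transition P(t | prev)
        p *= emit[(t, w)]           # emission P(w | t)
        prev = t
    return p * trans[(prev, "$")]   # transit back into the $ state

print(joint_prob(["a", "hungry", "dog"], ["DT", "JJ", "NN"]))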
Hidden Markov Models: Estimation
Parameters (to be estimated from the training set):
Transition probabilities $a_{ji} = P(y^{(t)} = i \mid y^{(t-1)} = j)$: A, an [N x N] matrix
Emission probabilities $b_{ik} = P(x^{(t)} = k \mid y^{(t)} = i)$: B, an [N x M] matrix
Training corpus: (x(l), y(l)), l = 1, …, L
Write down the probability of the corpus according to the HMM:
$$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{l=1}^{L} P(x^{(l)}, y^{(l)}) = \prod_{l=1}^{L} a_{\$, y_1^{(l)}} \Big( \prod_{t=1}^{|x^{(l)}|-1} b_{y_t^{(l)}, x_t^{(l)}} \, a_{y_t^{(l)}, y_{t+1}^{(l)}} \Big) \, b_{y_{|x^{(l)}|}^{(l)}, x_{|x^{(l)}|}^{(l)}} \, a_{y_{|x^{(l)}|}^{(l)}, \$}$$
(Select the tag for the first word; repeatedly draw a word from the current state and select the next state; draw the last word; transit into the $ state.)
$$= \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$$
CT(i,j) is the number of times tag i is followed by tag j; here we assume that $ is a special tag which precedes and succeeds every sentence. CE(i,k) is the number of times word k is emitted by tag i.
Hidden Markov Models: Estimation
Maximize:
$$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$$
Equivalently, maximize the logarithm of this:
$$\log P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \sum_{i=1}^{N} \Big( \sum_{j=1}^{N} CT(i,j) \log a_{i,j} + \sum_{k=1}^{M} CE(i,k) \log b_{i,k} \Big)$$
subject to probabilistic constraints:
$$\sum_{j=1}^{N} a_{i,j} = 1, \quad \sum_{k=1}^{M} b_{i,k} = 1, \qquad i = 1, \ldots, N$$
(As before, CT(i,j) is the number of times tag i is followed by tag j, and CE(i,k) is the number of times word k is emitted by tag i.)
Or, we can decompose it into 2N independent optimization tasks:
For transitions, i = 1, …, N:
$$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i,j) \log a_{i,j} \quad \text{s.t.} \quad \sum_{j=1}^{N} a_{i,j} = 1$$
For emissions, i = 1, …, N:
$$\max_{b_{i,1}, \ldots, b_{i,M}} \sum_{k=1}^{M} CE(i,k) \log b_{i,k} \quad \text{s.t.} \quad \sum_{k=1}^{M} b_{i,k} = 1$$
Hidden Markov Models: Estimation
For transitions (some i):
$$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i,j) \log a_{i,j} \quad \text{s.t.} \quad 1 - \sum_{j=1}^{N} a_{i,j} = 0$$
Constrained optimization task; Lagrangian:
$$L(a_{i,1}, \ldots, a_{i,N}, \lambda) = \sum_{j=1}^{N} CT(i,j) \log a_{i,j} + \lambda \Big( 1 - \sum_{j=1}^{N} a_{i,j} \Big)$$
Find the critical points of the Lagrangian by solving the system of equations:
$$\frac{\partial L}{\partial \lambda} = 1 - \sum_{j=1}^{N} a_{i,j} = 0, \qquad \frac{\partial L}{\partial a_{i,j}} = \frac{CT(i,j)}{a_{i,j}} - \lambda = 0 \;\Longrightarrow\; a_{i,j} = \frac{CT(i,j)}{\lambda}$$
$$P(y_t = j \mid y_{t-1} = i) = a_{i,j} = \frac{CT(i,j)}{\sum_{j'} CT(i,j')}$$
Similarly, for emissions:
$$P(x_t = k \mid y_t = i) = b_{i,k} = \frac{CE(i,k)}{\sum_{k'} CE(i,k')}$$
The maximum likelihood solution is just normalized counts of events. It is always like this for generative models if all the labels are visible in training.
I ignore "smoothing" for rare or unseen word–tag combinations; it is outside the scope of the seminar.
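A minimal sketch of this estimator (illustrative code, not from the lecture): collect the counts CT and CE from a tagged corpus and normalize each row, with "$" as the boundary tag and no smoothing.

from collections import defaultdict

def mle_estimate(corpus):
    # corpus: list of (words, tags) pairs
    ct = defaultdict(float)  # ct[(i, j)]: #times tag i is followed by tag j
    ce = defaultdict(float)  # ce[(i, k)]: #times tag i emits word k
    for words, tags in corpus:
        for prev, t in zip(["$"] + tags, tags + ["$"]):
            ct[(prev, t)] += 1
        for w, t in zip(words, tags):
            ce[(t, w)] += 1
    ct_tot, ce_tot = defaultdict(float), defaultdict(float)
    for (i, j), c in ct.items():
        ct_tot[i] += c
    for (i, k), c in ce.items():
        ce_tot[i] += c
    a = {(i, j): c / ct_tot[i] for (i, j), c in ct.items()}  # transitions
    b = {(i, k): c / ce_tot[i] for (i, k), c in ce.items()}  # emissions
    return a, b

a, b = mle_estimate([("a hungry dog".split(), ["DT", "JJ", "NN"])])
print(a[("$", "DT")], b[("NN", "dog")])  # 1.0 1.0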
HMMs as linear models
Decoding:
x = John carried a tin can .
y = ? ? ? ? ? .
$$y = \arg\max_{y'} P(x, y' \mid A, B) = \arg\max_{y'} \log P(x, y' \mid A, B)$$
We will talk about the decoding algorithm slightly later; first, let us generalize the Hidden Markov Model:
$$\log P(x, y' \mid A, B) = \sum_{t=1}^{|x|+1} \log a_{y'_{t-1}, y'_t} + \sum_{t=1}^{|x|} \log b_{y'_t, x_t} = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \log b_{i,k}$$
where $y'_0 = y'_{|x|+1} = \$$, CT(y', i, j) is the number of times tag i is followed by tag j in the candidate y', and CE(x, y', i, k) is the number of times tag i corresponds to word k in (x, y').
But this is just a linear model!!
Scoring: example
(x, y') = John carried a tin can . / NNP VBD DT NN NN .
Unary features (NNP:John, NNP:Mary, …) and edge features (NNP-VBD, NN-., MD-., …):
$$\varphi(x, y') = (1, 0, \ldots, 1, 1, 0, \ldots)$$
$$w_{ML} = (\log b_{NNP,John}, \log b_{NNP,Mary}, \ldots, \log a_{NNP,VBD}, \log a_{NN,.}, \log a_{MD,.}, \ldots)$$
Their inner product is exactly $\log P(x, y' \mid A, B)$:
$$w_{ML} \cdot \varphi(x, y') = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \log b_{i,k}$$
But maybe there are other (and better?) ways to estimate $w$, especially when we know that the HMM is not a faithful model of reality? It is not only a theoretical question! (We'll talk about that in a moment.)
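To make the linear-model view concrete, here is a small sketch (illustrative, with a made-up feature encoding): the count features dotted with log-probability weights reproduce the HMM log-probability.

import math
from collections import Counter

def hmm_features(words, tags):
    # CT(y',i,j) and CE(x,y',i,k) as a sparse vector, "$" = boundary tag
    phi = Counter()
    for prev, t in zip(["$"] + tags, tags + ["$"]):
        phi[("T", prev, t)] += 1        # transition feature
    for w, t in zip(words, tags):
        phi[("E", t, w)] += 1           # emission feature
    return phi

# w_ML: log transition / emission probabilities (toy values)
w_ml = {("T", "$", "DT"): math.log(0.8), ("T", "DT", "NN"): math.log(0.3),
        ("T", "NN", "$"): math.log(0.5),
        ("E", "DT", "a"): math.log(0.5), ("E", "NN", "dog"): math.log(0.01)}

phi = hmm_features(["a", "dog"], ["DT", "NN"])
score = sum(w_ml[f] * v for f, v in phi.items())   # w_ML · φ(x, y')
print(math.isclose(score, math.log(0.8 * 0.5 * 0.3 * 0.01 * 0.5)))  # True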
Feature view
Basically, we define features which correspond to edges in the graph:
[Figure: the chain-structured graphical model $y^{(1)} \to \cdots \to y^{(4)}$ with emissions $x^{(1)}, \ldots, x^{(4)}$; the $x$ nodes are shaded because they are visible (both in training and testing).]
Generative modeling
For a very large dataset (asymptotic analysis):
If the data is generated from some "true" HMM, then (if the training set is sufficiently large) we are guaranteed to have an optimal tagger
Otherwise, (generally) the HMM will not correspond to an optimal linear classifier; the real case is that an HMM is a coarse approximation of reality
Discriminative methods, which minimize the error more directly, are guaranteed (under some fairly general conditions) to converge to an optimal linear classifier
For smaller training sets:
Generative classifiers converge faster to their optimal error [Ng & Jordan, NIPS 01]
[Figure: errors of a generative model vs. a discriminative classifier as a function of the number of training examples, on a regression dataset (predicting housing prices in the Boston area).]
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Perceptron
Let us start with a binary classification problem: $y \in \{+1, -1\}$
For binary classification the prediction rule is: $y = \mathrm{sign}(w \cdot \varphi(x))$ (break ties at 0 in some deterministic way)
Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$:
w = 0                                   // initialize
do
  err = 0
  for l = 1 .. L                        // over the training examples
    if ( y^(l) (w · φ(x^(l))) < 0 )     // if mistake
      w += η y^(l) φ(x^(l))             // update, η > 0
      err++                             // count errors
    endif
  endfor
while ( err > 0 )                       // repeat until no errors
return w
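A runnable version of this pseudocode might look as follows (a minimal sketch; feature vectors are assumed to be precomputed):

import numpy as np

def perceptron(Phi, Y, eta=1.0, max_epochs=100):
    # Phi: (L, d) array of feature vectors phi(x); Y: labels in {+1, -1}
    w = np.zeros(Phi.shape[1])            # initialize
    for _ in range(max_epochs):
        err = 0
        for phi, y in zip(Phi, Y):        # over the training examples
            if y * (w @ phi) <= 0:        # mistake (ties counted as mistakes)
                w += eta * y * phi        # update
                err += 1
        if err == 0:                      # stop when no errors
            break
    return w

# usage: two linearly separable points
w = perceptron(np.array([[1.0, 2.0], [2.0, -1.0]]), np.array([+1, -1]))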
Linear classification
Linearly separable case, "a perfect" classifier (figure adapted from Dan Roth's class at UIUC):
Linear functions are often written as $y = \mathrm{sign}(w \cdot \varphi(x) + b)$, with decision hyperplane $w \cdot \varphi(x) + b = 0$, but we can drop the explicit bias $b$ by assuming that $\varphi(x)_0 = 1$ for any $x$.
Perceptron: geometric interpretation
if ( y^(l) (w · φ(x^(l))) < 0 )   // if mistake
  w += η y^(l) φ(x^(l))           // update
endif
[Figures on slides 23–26 step through the effect of this update on the decision boundary.]
Perceptron: algebraic interpretation
if ( y^(l) (w · φ(x^(l))) < 0 )   // if mistake
  w += η y^(l) φ(x^(l))           // update
endif
We want the update to increase $y^{(l)} (w \cdot \varphi(x^{(l)}))$; if the increase is large enough, there will be no misclassification.
Let's see that this is what happens after the update:
$$y^{(l)} \big( (w + \eta y^{(l)} \varphi(x^{(l)})) \cdot \varphi(x^{(l)}) \big) = y^{(l)} \big( w \cdot \varphi(x^{(l)}) \big) + \eta (y^{(l)})^2 \big( \varphi(x^{(l)}) \cdot \varphi(x^{(l)}) \big)$$
Since $(y^{(l)})^2 = 1$ and the squared norm $\varphi(x^{(l)}) \cdot \varphi(x^{(l)}) > 0$, the score increases.
So, the perceptron update moves the decision hyperplane towards the misclassified $\varphi(x^{(l)})$.
Perceptron
The perceptron algorithm, obviously, can only converge if the training set is linearly separable.
It is guaranteed to converge in a finite number of iterations, depending on how well the two classes are separated (Novikoff, 1962).
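For reference (the standard statement of the bound, not spelled out on the slide): if $\|\varphi(x^{(l)})\| \le R$ for all examples and some unit-norm $w^*$ separates the data with margin $\gamma$, i.e. $y^{(l)} (w^* \cdot \varphi(x^{(l)})) \ge \gamma$ for all $l$, then the perceptron makes at most $(R / \gamma)^2$ mistakes before converging.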
Averaged Perceptron
A small modification:
w = 0, wΣ = 0                        // initialize
for k = 1 .. K                       // fixed number of iterations: do not run until convergence
  for l = 1 .. L                     // over the training examples
    if ( y^(l) (w · φ(x^(l))) < 0 )  // if mistake
      w += η y^(l) φ(x^(l))          // update, η > 0
    endif
    wΣ += w                          // sum of w over the course of training; note: it is after endif
  endfor
endfor
return wΣ / (KL)
Averaging is more stable in training: a vector w which survived more iterations without updates is more similar to the resulting vector wΣ / (KL), as it was added a larger number of times.
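A runnable sketch of the averaged version (illustrative; same assumptions as the binary perceptron sketch above):

import numpy as np

def averaged_perceptron(Phi, Y, eta=1.0, K=10):
    w = np.zeros(Phi.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(K):                    # do not run until convergence
        for phi, y in zip(Phi, Y):        # over the training examples
            if y * (w @ phi) <= 0:        # if mistake
                w += eta * y * phi        # update
            w_sum += w                    # accumulate after the endif
    return w_sum / (K * len(Y))           # (1/KL) * wΣ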
Structured Perceptron
Let us now return to the structured problem: $y = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$:
w = 0                                             // initialize
do
  err = 0
  for l = 1 .. L                                  // over the training examples
    ŷ = argmax_{y' ∈ Y(x^(l))} w · φ(x^(l), y')   // model prediction
    if ( w · φ(x^(l), ŷ) > w · φ(x^(l), y^(l)) )  // if mistake
      w += η (φ(x^(l), y^(l)) − φ(x^(l), ŷ))      // update: pushes the correct sequence up and the incorrectly predicted one down
      err++                                       // count errors
    endif
  endfor
while ( err > 0 )                                 // repeat until no errors
return w
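A minimal sketch in code (illustrative: `decode` stands for whatever argmax decoder is available, e.g. the Viterbi algorithm later in the lecture, and `phi` returns sparse feature counts):

from collections import Counter

def structured_perceptron(data, phi, decode, eta=1.0, epochs=10):
    # data: list of (x, y) pairs; phi(x, y) -> Counter of feature counts;
    # decode(x, w) -> argmax over y of w · phi(x, y)
    w = Counter()
    for _ in range(epochs):
        for x, y in data:                           # over the training examples
            y_hat = decode(x, w)                    # model prediction
            if y_hat != y:                          # if mistake
                for f, v in phi(x, y).items():      # push the correct sequence up
                    w[f] += eta * v
                for f, v in phi(x, y_hat).items():  # push the prediction down
                    w[f] -= eta * v
    return w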
Str. perceptron: algebraic interpretation
if ( w · φ(x^(l), ŷ) > w · φ(x^(l), y^(l)) )  // if mistake
  w += η (φ(x^(l), y^(l)) − φ(x^(l), ŷ))      // update
endif
We want the update to increase $w \cdot \big( \varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y}) \big)$; if the increase is large enough, then $y^{(l)}$ will be scored above $\hat{y}$.
Clearly this is achieved, as this product is increased by $\eta \, \| \varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y}) \|^2$.
(There might be other bad $y' \in \mathcal{Y}(x^{(l)})$, but we will deal with them on the next iterations.)
Structured Perceptron
Positive:
Very easy to implement
Often achieves respectable results
Like other discriminative techniques, does not make assumptions about the generative process
Additional features can be easily integrated, as long as decoding is tractable
Drawbacks:
"Good" discriminative algorithms should optimize some measure which is closely related to the expected testing results; what the perceptron is doing on non-linearly separable data is not clear
However, for the averaged (voted) version there is a generalization bound which characterizes the generalization properties of the perceptron (Freund & Schapire, 98)
Later, we will consider more advanced learning algorithms
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Decoding with the Linear model
Decoding: $y = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Again a linear model with the following edge features (a generalization of an HMM)
In fact, the algorithm does not depend on the features of the input (they do not need to be local)
[Figures: the chain over $y^{(1)}, \ldots, y^{(4)}$, first with per-position inputs $x^{(1)}, \ldots, x^{(4)}$, then with a single input $x$ visible to every edge.]
Decoding with the Linear model
Decoding: $y = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Let's change notation:
Edge scores $f_t(y_{t-1}, y_t, x)$: for the HMM, this roughly corresponds to $\log a_{y_{t-1}, y_t} + \log b_{y_t, x_t}$
Defined at the start too, via the special tag $y_0 = \$$; start/stop symbol information ($) can be encoded with the edge scores as well
Decode: $y = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$
Time complexity?
Decoding: a dynamic programming algorithm – the Viterbi algorithm
Viterbi algorithm
Decoding: $y = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$
Loop invariant (t = 1, …, |x|):
score_t[y] – the score of the highest scoring sequence up to position t ending with tag y
prev_t[y] – the previous tag on this sequence
Init: score_0[$] = 0, score_0[y] = −∞ for other y
Recomputation (t = 1, …, |x|):
$$\mathrm{prev}_t[y] = \arg\max_{y'} \, \mathrm{score}_{t-1}[y'] + f_t(y', y, x)$$
$$\mathrm{score}_t[y] = \mathrm{score}_{t-1}[\mathrm{prev}_t[y]] + f_t(\mathrm{prev}_t[y], y, x)$$
Return: retrace the prev pointers starting from $\arg\max_y \mathrm{score}_{|x|}[y]$
Time complexity: $O(N^2 |x|)$
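A compact implementation sketch (illustrative; `f(t, prev, cur, x)` is the edge-score function $f_t$ and `tags` the tag set, both assumed given):

def viterbi(x, tags, f):
    # returns argmax_y sum_t f(t, y[t-1], y[t], x), with "$" as start tag
    n = len(x)
    score = [{"$": 0.0}]                   # score_0[$] = 0, others -inf
    prev = [{}]
    for t in range(1, n + 1):
        score.append({}); prev.append({})
        for y in tags:
            # best previous tag for sequences ending in y at position t
            best = max(score[t - 1],
                       key=lambda yp: score[t - 1][yp] + f(t, yp, y, x))
            prev[t][y] = best
            score[t][y] = score[t - 1][best] + f(t, best, y, x)
    y_t = max(score[n], key=score[n].get)  # best final tag
    path = [y_t]
    for t in range(n, 1, -1):              # retrace the prev pointers
        y_t = prev[t][y_t]
        path.append(y_t)
    return list(reversed(path))            # O(N^2 |x|) time overall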
Outline
Sequence labeling / segmentation problems: settings and example problems:
Part-of-speech tagging, named entity recognition, gesture recognition
Hidden Markov Model
Standard definition + maximum likelihood estimation
General views: as a representative of linear models
Perceptron and Structured Perceptron
algorithms / motivations
Decoding with the Linear Model
Discussion: Discriminative vs. Generative
Recap: Sequence Labeling
Hidden Markov Models: how to estimate the parameters (maximum likelihood)
Discriminative models: how to learn with the structured perceptron
Both learning algorithms result in a linear model
How to label with the linear model (Viterbi decoding)
Discriminative vs Generative
Generative models:
Cheap to estimate: simply normalized counts
Hard to integrate complex features: need to come up with a generative story, and this story may be wrong
Do not result in an optimal classifier when model assumptions are wrong (i.e., always)
Discriminative models:
More expensive to learn: need to run decoding (here, Viterbi) during training, and usually multiple times per example
Easy to integrate features: though some features may make decoding intractable
Usually less accurate on small datasets (not necessarily the case when compared with generative models with latent variables)
Reminders
Speakers: slides are due about a week before the talk; meetings with me before/after this point will normally be needed
Reviewers: reviews are accepted only before the day we consider the topic
Everyone: references to the papers to read are at GoogleDocs
These slides (and the previous ones) will be online today; speakers: send me the final version of your slides too
Next time: Lea on Models of Parsing, PCFGs vs general WCFGs (Michael Collins' book chapter)