Linear Methods For Sequence Labeling: Hidden Markov Models vs Structured Perceptron

Page 1:

Learning for Structured Prediction

Linear Methods For Sequence Labeling:

Hidden Markov Models vs Structured Perceptron

Ivan Titov

Page 2:

Last Time: Structured Prediction

1. Selecting a feature representation $\phi(x, y)$
It should be sufficient to discriminate correct structures from incorrect ones
It should be possible to decode with it (see (3))

2. Learning: which error function to optimize on the training set, for example the margin condition
$w \cdot \phi(x, y^*) - \max_{y' \in \mathcal{Y}(x),\, y' \neq y^*} w \cdot \phi(x, y') > \gamma$
How to make it efficient (see (3))

3. Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \phi(x, y')$
Dynamic programming for simpler representations?
Approximate search for more powerful ones?

We illustrated all these challenges on the example of dependency parsing.

Here $x$ is an input (e.g., a sentence) and $y$ is an output (e.g., a syntactic tree).

Page 3:

Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Model

Standard definition + maximum likelihood estimation

General views: as a representative of linear models

Perceptron and Structured Perceptron

algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 4:

Sequence Labeling Problems

Definition:

Input: sequences of variable length

Output: every position is labeled

Examples:

Part-of-speech tagging

Named-entity recognition, shallow parsing (“chunking”), gesture recognition from video streams, …

$x = (x_1, x_2, \ldots, x_{|x|})$, $x_i \in \mathcal{X}$
$y = (y_1, y_2, \ldots, y_{|x|})$, $y_i \in \mathcal{Y}$

x = John  carried  a   tin  can  .
y = NNP   VBD      DT  NN   NN   .
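For concreteness, such a labeled pair can be represented as two parallel lists; a minimal sketch (variable names are illustrative, not from the slides):

```python
# A labeled sequence as two parallel lists: one tag per token.
x = ["John", "carried", "a", "tin", "can", "."]
y = ["NNP", "VBD", "DT", "NN", "NN", "."]

assert len(x) == len(y)          # every position is labeled
for token, tag in zip(x, y):
    print(f"{token}\t{tag}")
```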

Page 5:

Part-of-speech tagging

Labels:
NNP – proper singular noun
VBD – verb, past tense
DT – determiner
NN – singular noun
MD – modal
. – final punctuation

Consider:

x = John  carried  a   tin  can          .
y = NNP   VBD      DT  NN   NN (or MD?)  .

If you just predict the most frequent tag for each word, you will make a mistake here (the most frequent tag for "can" is MD, but here it is NN).

In fact, even knowing that the previous word is a noun is not enough:

x = Tin  can  cause  poisoning  …
y = NN   MD   VB     NN         …

One needs to model interactions between labels to successfully resolve ambiguities, so this should be tackled as a structured prediction problem.

Page 6:

Named Entity Recognition

[ORG Chelsea], despite their name, are not based in [LOC Chelsea], but in neighbouring [LOC Fulham].

Not as trivial as it may seem, consider:

[PERS Bill Clinton] will not embarrass [PERS Chelsea] at her wedding   (Chelsea can be a person too!)
Tiger failed to make a birdie in the South Course …   (is it an animal or a person?)

Encoding example (BIO encoding):

x = Bill    Clinton  embarrassed  Chelsea  at  her  wedding  at  Astor  Courts
y = B-PERS  I-PERS   O            B-PERS   O   O    O        O   B-LOC  I-LOC
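The BIO encoding can be produced mechanically from entity spans; a minimal sketch (the helper and data layout are assumptions for illustration, not from the slides):

```python
# Convert labeled entity spans into per-token BIO tags, as in the example above.
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, label) with end exclusive, e.g. (0, 2, "PERS")."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Bill", "Clinton", "embarrassed", "Chelsea", "at", "her",
          "wedding", "at", "Astor", "Courts"]
spans = [(0, 2, "PERS"), (3, 4, "PERS"), (8, 10, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-PERS', 'I-PERS', 'O', 'B-PERS', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
```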

Page 7:

Vision: Gesture Recognition

Given a sequence of frames in a video, annotate each frame with a gesture type.

Types of gestures (panel titles from the figure): Flip back, Shrink vertically, Expand vertically, Double back, Point and back, Expand horizontally.

It is hard to predict gestures from each frame in isolation; you need to exploit relations between frames and gesture types.

Figures from (Wang et al., CVPR 06).

Page 8:

Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Model

Standard definition + maximum likelihood estimation

General views: as a representative of linear models

Perceptron and Structured Perceptron

algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 9:

Hidden Markov Models

We will consider the part-of-speech (POS) tagging example

A “generative” model, i.e.:

Model: introduce a parameterized model $P(x, y \mid \theta)$ of how both words and tags are generated

Learning: use a labeled training set to estimate the most likely parameters $\theta$ of the model

Decoding: $\hat{y} = \arg\max_{y'} P(x, y' \mid \theta)$

x = John  carried  a   tin  can  .
y = NNP   VBD      DT  NN   NN   .

Page 10:

Hidden Markov Models

A simplistic state diagram for noun phrases (N – number of tags, M – vocabulary size):

States correspond to POS tags; words are emitted independently from each POS tag.

Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$ : [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$ : [N x M] matrix

Stationarity assumption: these probabilities do not depend on the position t in the sequence.

[State diagram: states $, Det, Adj, Noun connected by transition probabilities (0.1, 0.2, 0.5, 0.8, 1.0, …), with emission lists attached to each state, e.g. Det: {0.5: the, 0.5: a}, Adj: {0.01: red, 0.01: hungry, …}, Noun: {0.01: herring, 0.01: dog, …}. Example generation: "a hungry dog".]
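To make the generative story concrete, here is a minimal sketch that scores one (x, y) pair under such an HMM. The emission probabilities are read off the diagram above; the transition values are assumed purely for illustration:

```python
# Emissions taken from the diagram; transition values are assumed for illustration only.
trans = {("$", "Det"): 0.8, ("Det", "Adj"): 0.5, ("Adj", "Noun"): 1.0, ("Noun", "$"): 0.5}
emit = {("Det", "a"): 0.5, ("Adj", "hungry"): 0.01, ("Noun", "dog"): 0.01}

def joint_prob(words, tags):
    """P(x, y) = prod_t P(y_t | y_{t-1}) * P(x_t | y_t), with $ as the start/stop state."""
    p, prev = 1.0, "$"
    for w, t in zip(words, tags):
        p *= trans[(prev, t)] * emit[(t, w)]
        prev = t
    return p * trans[(prev, "$")]   # transit into the $ state at the end

print(joint_prob(["a", "hungry", "dog"], ["Det", "Adj", "Noun"]))  # ≈ 1e-05
```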

Page 11:

Hidden Markov Models

Representation as an instantiation of a graphical model (N – number of tags, M – vocabulary size):

States correspond to POS tags; words are emitted independently from each POS tag.

Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$ : [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$ : [N x M] matrix

Stationarity assumption: these probabilities do not depend on the position t in the sequence.

[Graphical model: a chain $y^{(1)} \to y^{(2)} \to y^{(3)} \to y^{(4)}$ with emission edges $y^{(t)} \to x^{(t)}$; in the example $y^{(1)}$ = Det, $y^{(2)}$ = Adj, $y^{(3)}$ = Noun and $x^{(1)}$ = a, $x^{(2)}$ = hungry, $x^{(3)}$ = dog. An arrow means that in the generative story $x^{(4)}$ is generated from some $P(x^{(4)} \mid y^{(4)})$.]

Page 12:

Hidden Markov Models: Estimation

N – the number of tags, M – vocabulary size

Parameters (to be estimated from the training set):
Transition probabilities $a_{ji} = P(y^{(t)} = i \mid y^{(t-1)} = j)$, A – [N x N] matrix
Emission probabilities $b_{ik} = P(x^{(t)} = k \mid y^{(t)} = i)$, B – [N x M] matrix

Training corpus:
x(1) = (In, an, Oct., 19, review, of, …),  y(1) = (IN, DT, NNP, CD, NN, IN, …)
x(2) = (Ms., Haag, plays, Elianti, .),  y(2) = (NNP, NNP, VBZ, NNP, .)
x(L) = (The, company, said, …),  y(L) = (DT, NN, VBD, NNP, .)

How do we estimate the parameters using maximum likelihood estimation?
You can probably guess what these estimates should be.

Page 13:

Hidden Markov Models: Estimation

Parameters (to be estimated from the training set):
Transition probabilities $a_{ji} = P(y^{(t)} = i \mid y^{(t-1)} = j)$, A – [N x N] matrix
Emission probabilities $b_{ik} = P(x^{(t)} = k \mid y^{(t)} = i)$, B – [N x M] matrix

Training corpus: $(x^{(l)}, y^{(l)})$, $l = 1, \ldots, L$

Write down the probability of the corpus according to the HMM:

$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{l=1}^{L} P(x^{(l)}, y^{(l)}) = \prod_{l=1}^{L} a_{\$, y^{(l)}_{1}} \Big( \prod_{t=1}^{|x^{(l)}|-1} b_{y^{(l)}_{t}, x^{(l)}_{t}} \, a_{y^{(l)}_{t}, y^{(l)}_{t+1}} \Big) \, b_{y^{(l)}_{|x^{(l)}|}, x^{(l)}_{|x^{(l)}|}} \, a_{y^{(l)}_{|x^{(l)}|}, \$}$

(select the tag for the first word; draw a word from this state; select the next state; draw the last word; transit into the $ state)

$= \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$

CT(i,j) is the number of times tag i is followed by tag j; here we assume that $ is a special tag which precedes and succeeds every sentence.
CE(i,k) is the number of times word k is emitted by tag i.

Page 14:

Hidden Markov Models: Estimation

Maximize:

$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$

Equivalently, maximize the logarithm of this:

$\log P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \sum_{i=1}^{N} \Big( \sum_{j=1}^{N} CT(i,j) \log a_{i,j} + \sum_{k=1}^{M} CE(i,k) \log b_{i,k} \Big)$

subject to probabilistic constraints:

$\sum_{j=1}^{N} a_{i,j} = 1, \quad \sum_{k=1}^{M} b_{i,k} = 1, \quad i = 1, \ldots, N$

(CT(i,j) is the number of times tag i is followed by tag j; CE(i,k) is the number of times word k is emitted by tag i.)

Or, we can decompose it into 2N optimization tasks:

For transitions, $i = 1, \ldots, N$:
$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i,j) \log a_{i,j}$  s.t.  $\sum_{j=1}^{N} a_{i,j} = 1$

For emissions, $i = 1, \ldots, N$:
$\max_{b_{i,1}, \ldots, b_{i,M}} \sum_{k=1}^{M} CE(i,k) \log b_{i,k}$  s.t.  $\sum_{k=1}^{M} b_{i,k} = 1$

Page 15:

Hidden Markov Models: Estimation

For transitions (some i):

$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i,j) \log a_{i,j}$  s.t.  $1 - \sum_{j=1}^{N} a_{i,j} = 0$

Constrained optimization task, Lagrangian:

$L(a_{i,1}, \ldots, a_{i,N}, \lambda) = \sum_{j=1}^{N} CT(i,j) \log a_{i,j} + \lambda \times \big(1 - \sum_{j=1}^{N} a_{i,j}\big)$

Find critical points of the Lagrangian by solving the system of equations:

$\frac{\partial L}{\partial \lambda} = 1 - \sum_{j=1}^{N} a_{i,j} = 0$
$\frac{\partial L}{\partial a_{i,j}} = \frac{CT(i,j)}{a_{i,j}} - \lambda = 0 \;\Rightarrow\; a_{i,j} = \frac{CT(i,j)}{\lambda}$

$P(y^{(t)} = j \mid y^{(t-1)} = i) = a_{i,j} = \frac{CT(i,j)}{\sum_{j'} CT(i,j')}$

Similarly, for emissions:

$P(x^{(t)} = k \mid y^{(t)} = i) = b_{i,k} = \frac{CE(i,k)}{\sum_{k'} CE(i,k')}$

The maximum likelihood solution is just normalized counts of events. It is always like this for generative models if all the labels are visible in training.

I ignore "smoothing" to handle rare or unseen word-tag combinations; it is outside the scope of the seminar.
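The counting-and-normalizing estimator can be written down directly; a minimal sketch, assuming the corpus is given as (words, tags) pairs and ignoring smoothing:

```python
from collections import defaultdict

# Maximum likelihood estimation of HMM parameters by normalizing counts,
# as derived above; "$" is the start/stop tag.
def estimate_hmm(corpus):
    """corpus: list of (words, tags) pairs."""
    ct = defaultdict(float)   # CT(i, j): tag i followed by tag j
    ce = defaultdict(float)   # CE(i, k): tag i emits word k
    for words, tags in corpus:
        prev = "$"
        for w, t in zip(words, tags):
            ct[(prev, t)] += 1
            ce[(t, w)] += 1
            prev = t
        ct[(prev, "$")] += 1
    ct_tot, ce_tot = defaultdict(float), defaultdict(float)
    for (i, j), c in ct.items():
        ct_tot[i] += c
    for (i, k), c in ce.items():
        ce_tot[i] += c
    a = {(i, j): c / ct_tot[i] for (i, j), c in ct.items()}   # a_ij = CT(i,j) / sum_j' CT(i,j')
    b = {(i, k): c / ce_tot[i] for (i, k), c in ce.items()}   # b_ik = CE(i,k) / sum_k' CE(i,k')
    return a, b

corpus = [(["a", "hungry", "dog"], ["Det", "Adj", "Noun"])]
a, b = estimate_hmm(corpus)
print(a[("$", "Det")], b[("Noun", "dog")])   # 1.0 1.0
```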

Page 16:

HMMs as linear models

Decoding:

x = John  carried  a  tin  can  .
y = ?     ?        ?  ?    ?    .

$\hat{y} = \arg\max_{y'} P(x, y' \mid A, B) = \arg\max_{y'} \log P(x, y' \mid A, B)$

We will talk about the decoding algorithm slightly later; let us first generalize the Hidden Markov Model:

$\log P(x, y' \mid A, B) = \sum_{t=1}^{|x|} \big( \log b_{y'_t, x_t} + \log a_{y'_{t-1}, y'_t} \big) + \log a_{y'_{|x|}, \$}$, with $y'_0 = \$$

$= \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \times \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \times \log b_{i,k}$

where $CT(y', i, j)$ is the number of times tag i is followed by tag j in the candidate y', and $CE(x, y', i, k)$ is the number of times tag i corresponds to word k in (x, y').

But this is just a linear model!!

Page 17:

Scoring: example

For the example x = (John, carried, a, tin, can, .), y' = (NNP, VBD, DT, NN, NN, .):

$\phi(x, y') = (1, 0, \ldots, 1, 1, 0, \ldots)$ — unary features (NNP:John, NNP:Mary, …) followed by edge features (NNP-VBD, NN-., MD-., …); the components are the counts $CE(x, y', i, k)$ and $CT(y', i, j)$.

$w_{ML} = (\log b_{NNP,John}, \log b_{NNP,Mary}, \ldots, \log a_{NN,VBD}, \log a_{NN,.}, \log a_{MD,.}, \ldots)$

Their inner product is exactly $\log P(x, y' \mid A, B)$:

$w_{ML} \cdot \phi(x, y') = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \times \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \times \log b_{i,k}$

But maybe there are other (and better?) ways to estimate $w$, especially when we know that the HMM is not a faithful model of reality?

It is not only a theoretical question! (we'll talk about that in a moment)
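As a sanity check, the equivalence can be made explicit with a small dict-based sketch; the feature names and probability values are assumptions for illustration (the same toy HMM as on the state-diagram slide):

```python
import math
from collections import defaultdict

def features(words, tags):
    """phi(x, y): counts of edge events "prev-tag" (CT) and unary events "tag:word" (CE)."""
    phi = defaultdict(float)
    prev = "$"
    for w, t in zip(words, tags):
        phi[f"{prev}-{t}"] += 1     # edge feature, CT(y', prev, t)
        phi[f"{t}:{w}"] += 1        # unary feature, CE(x, y', t, w)
        prev = t
    phi[f"{prev}-$"] += 1
    return phi

def dot(w, phi):
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

# w_ML holds log-probabilities estimated as on the previous slides (values assumed here)
w_ml = {"$-Det": math.log(0.8), "Det:a": math.log(0.5), "Det-Adj": math.log(0.5),
        "Adj:hungry": math.log(0.01), "Adj-Noun": math.log(1.0),
        "Noun:dog": math.log(0.01), "Noun-$": math.log(0.5)}
phi = features(["a", "hungry", "dog"], ["Det", "Adj", "Noun"])
print(math.exp(dot(w_ml, phi)))   # ≈ 1e-05, the same P(x, y) as in the generative story
```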

Page 18:

Feature view

Basically, we define features which correspond to edges in the graph:

[Graphical model: chain $y^{(1)} \to y^{(2)} \to y^{(3)} \to y^{(4)}$ with emission edges $y^{(t)} \to x^{(t)}$; the $x^{(t)}$ nodes are shaded because they are visible (both in training and testing).]

Page 19:

Generative modeling

For a very large dataset (asymptotic analysis):

If the data is generated from some "true" HMM, then (if the training set is sufficiently large) we are guaranteed to have an optimal tagger.

Otherwise, (generally) the HMM will not correspond to an optimal linear classifier. The real case: the HMM is a coarse approximation of reality.

Discriminative methods, which minimize the error more directly, are guaranteed (under some fairly general conditions) to converge to an optimal linear classifier.

For smaller training sets:

Generative classifiers converge faster to their optimal error [Ng & Jordan, NIPS 01].

[Figure: errors as a function of the number of training examples on a regression dataset (predicting housing prices in the Boston area), comparing a generative model and a discriminative classifier.]

Page 20:

Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Model

Standard definition + maximum likelihood estimation

General views: as a representative of linear models

Perceptron and Structured Perceptron

algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 21:

Perceptron

Let us start with a binary classification problem: $y \in \{+1, -1\}$

For binary classification the prediction rule is: $y = \mathrm{sign}(w \cdot \phi(x))$  (break ties at 0 in some deterministic way)

Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$:

w = 0                                    // initialize
do
    err = 0
    for l = 1 .. L                       // over the training examples
        if ( y(l) (w · φ(x(l))) < 0 )    // if mistake
            w += η y(l) φ(x(l))          // update, η > 0
            err++                        // count errors
        endif
    endfor
while ( err > 0 )                        // repeat until no errors
return w
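A minimal runnable version of this pseudocode, with η = 1 and a toy feature map that already includes a constant component for the bias (the data and names are illustrative):

```python
import numpy as np

def perceptron(phi_X, y, eta=1.0, max_epochs=100):
    """phi_X: (L, d) array of feature vectors phi(x); y: (L,) array of +1/-1 labels."""
    w = np.zeros(phi_X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_x, y_l in zip(phi_X, y):
            if y_l * np.dot(w, phi_x) <= 0:   # mistake (ties are treated as mistakes)
                w += eta * y_l * phi_x        # move the hyperplane towards phi_x
                errors += 1
        if errors == 0:                       # separable data: repeat until no errors
            break
    return w

# Tiny linearly separable example: phi(x) = (1, x1, x2)
phi_X = np.array([[1, 2.0, 1.0], [1, 1.0, 2.0], [1, -1.0, -2.0], [1, -2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron(phi_X, y)
print(np.sign(phi_X @ w))   # [ 1.  1. -1. -1.]
```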

Page 22:

Linear classification

Linearly separable case, a "perfect" classifier:

[Figure: points of two classes in the plane with axes $\phi(x)_1$ and $\phi(x)_2$, separated by the hyperplane $w \cdot \phi(x) + b = 0$ with normal vector $w$. Figure adapted from Dan Roth's class at UIUC.]

Linear functions are often written as $y = \mathrm{sign}(w \cdot \phi(x) + b)$, but we can drop the explicit bias by assuming that $\phi(x)_0 = 1$ for any $x$.

Page 23:

Perceptron: geometric interpretation

if ( y(l) (w · φ(x(l))) < 0 )    // if mistake
    w += η y(l) φ(x(l))          // update
endif

[Figure: a misclassified point $\phi(x^{(l)})$ and the current weight vector $w$; the update adds $\eta\, y^{(l)} \phi(x^{(l)})$ to $w$, rotating the decision boundary towards classifying the point correctly.]

Page 24:

Perceptron: geometric interpretation

[Figure: the next step of the same geometric illustration; the update rule is as on the previous slide.]

Page 25:

Perceptron: geometric interpretation

[Figure: the next step of the same geometric illustration; the update rule is as on the previous slide.]

Page 26:

Perceptron: geometric interpretation

[Figure: the final step of the same geometric illustration; the update rule is as on the previous slide.]

Page 27:

Perceptron: algebraic interpretation

if ( y(l) (w · φ(x(l))) < 0 )    // if mistake
    w += η y(l) φ(x(l))          // update
endif

We want the quantity $y^{(l)}\,(w \cdot \phi(x^{(l)}))$ to increase after the update. If the increase is large enough, then there will be no misclassification. Let's check that this is what happens after the update:

$y^{(l)}\big((w + \eta\, y^{(l)} \phi(x^{(l)})) \cdot \phi(x^{(l)})\big) = y^{(l)}\big(w \cdot \phi(x^{(l)})\big) + \eta\, (y^{(l)})^2 \big(\phi(x^{(l)}) \cdot \phi(x^{(l)})\big)$

Here $(y^{(l)})^2 = 1$ and $\phi(x^{(l)}) \cdot \phi(x^{(l)})$ is a squared norm, i.e. $> 0$.

So the perceptron update moves the decision hyperplane towards the misclassified example $\phi(x^{(l)})$.

Page 28:

Perceptron

The perceptron algorithm, obviously, can only converge if the training set is linearly separable.

It is guaranteed to converge in a finite number of iterations, depending on how well the two classes are separated (Novikoff, 1962).

Page 29:

Averaged Perceptron

A small modification:

w = 0, wP = 0                            // initialize
for k = 1 .. K                           // a fixed number of iterations; do not run until convergence
    for l = 1 .. L                       // over the training examples
        if ( y(l) (w · φ(x(l))) < 0 )    // if mistake
            w += η y(l) φ(x(l))          // update, η > 0
        endif
        wP += w                          // sum of w over the course of training (note: it is after endif)
    endfor
endfor
return wP / (K L)

More stable in training: a weight vector which survived more iterations without updates is more similar to the resulting vector wP / (K L), as it was added a larger number of times.
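A minimal sketch of the same modification in code (same assumed data format as the binary perceptron sketch above):

```python
import numpy as np

def averaged_perceptron(phi_X, y, eta=1.0, K=10):
    """Averaged perceptron: same mistake-driven update, but return the average of w."""
    w = np.zeros(phi_X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(K):                        # fixed number of passes, no convergence check
        for phi_x, y_l in zip(phi_X, y):
            if y_l * np.dot(w, phi_x) <= 0:   # if mistake
                w += eta * y_l * phi_x
            w_sum += w                        # accumulated after the (possible) update
    return w_sum / (K * len(y))
```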

Page 30:

Structured Perceptron

Let us now move to the structured problem: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \phi(x, y')$

Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$:

w = 0                                                 // initialize
do
    err = 0
    for l = 1 .. L                                    // over the training examples
        ŷ = argmax_{y' ∈ Y(x(l))} w · φ(x(l), y')     // model prediction
        if ( w · φ(x(l), ŷ) > w · φ(x(l), y(l)) )     // if mistake
            w += φ(x(l), y(l)) − φ(x(l), ŷ)           // update
            err++                                     // count errors
        endif
    endfor
while ( err > 0 )                                     // repeat until no errors
return w

The update pushes the correct sequence up and the incorrectly predicted one down.
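A minimal sketch of this loop in code; `features(x, y)` (a dict of feature counts) and `decode(w, x)` (the argmax over candidate sequences, e.g. the Viterbi algorithm discussed later) are assumed helpers, not part of the slides:

```python
from collections import defaultdict

def structured_perceptron(data, features, decode, epochs=5):
    """data: list of (x, y) pairs, y being the gold label sequence."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = decode(w, x)                       # model prediction
            if y_hat != y_gold:                        # mistake
                for f, v in features(x, y_gold).items():
                    w[f] += v                          # push the correct sequence up
                for f, v in features(x, y_hat).items():
                    w[f] -= v                          # push the predicted one down
    return dict(w)
```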

Page 31:

Str. perceptron: algebraic interpretation

if ( w · φ(x(l), ŷ) > w · φ(x(l), y(l)) )    // if mistake
    w += φ(x(l), y(l)) − φ(x(l), ŷ)          // update
endif

We want $w \cdot \big(\phi(x^{(l)}, y^{(l)}) - \phi(x^{(l)}, \hat{y})\big)$ to increase after the update. If the increase is large enough, then $y^{(l)}$ will be scored above $\hat{y}$.

Clearly this is achieved, as the product will be increased by $\eta \, \|\phi(x^{(l)}, y^{(l)}) - \phi(x^{(l)}, \hat{y})\|^2$.

There might be other offending $y' \in \mathcal{Y}(x^{(l)})$, but we will deal with them in the next iterations.

Page 32:

Structured Perceptron

Positives:
Very easy to implement
Often achieves respectable results
As other discriminative techniques, it does not make assumptions about the generative process
Additional features can be easily integrated, as long as decoding is tractable

Drawbacks:
"Good" discriminative algorithms should optimize some measure which is closely related to the expected testing results; it is not clear what the perceptron optimizes on non-linearly separable data
However, for the averaged (voted) version there is a generalization bound which characterizes the generalization properties of the perceptron (Freund & Schapire, 1998)
Later, we will consider more advanced learning algorithms

Page 33:

Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Model

Standard definition + maximum likelihood estimation

General views: as a representative of linear models

Perceptron and Structured Perceptron

algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 34:

Decoding with the Linear model

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \phi(x, y')$

Again a linear model with the following edge features (a generalization of an HMM).

In fact, the algorithm does not depend on the features of the input (they do not need to be local).

[Graphical model: chain $y^{(1)} - y^{(2)} - y^{(3)} - y^{(4)}$ with per-position inputs $x^{(1)}, \ldots, x^{(4)}$.]

Page 35:

Decoding with the Linear model

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \phi(x, y')$

Again a linear model with the same edge features (a generalization of an HMM).

In fact, the algorithm does not depend on the features of the input (they do not need to be local).

[Graphical model: chain $y^{(1)} - y^{(2)} - y^{(3)} - y^{(4)}$, where each label can now look at the entire input $x$.]

Page 36:

Decoding with the Linear model

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \phi(x, y')$

Let's change notation:

Edge scores $f_t(y_{t-1}, y_t, x)$: roughly corresponds to $\log a_{y_{t-1}, y_t} + \log b_{y_t, x_t}$
Defined for t = 0 too ("start" feature: $y_0 = \$$); start/stop symbol ($) information can be encoded with them as well.

Decode: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$

Decoding: a dynamic programming algorithm – the Viterbi algorithm.

Page 37:

Viterbi algorithm

Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$

Loop invariant ($t = 1, \ldots, |x|$):
score_t[y] – the score of the highest-scoring sequence up to position t that ends with tag y
prev_t[y] – the previous tag on this sequence

Init: score_0[$] = 0, score_0[y] = −∞ for all other y

Recomputation ($t = 1, \ldots, |x|$):
prev_t[y] = argmax_{y'} score_{t−1}[y'] + f_t(y', y, x)
score_t[y] = score_{t−1}[prev_t[y]] + f_t(prev_t[y], y, x)

Return: retrace the prev pointers starting from argmax_y score_{|x|}[y]

Time complexity: $O(N^2 |x|)$
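A minimal runnable sketch of this recursion; the edge-score function `f` and the toy example at the bottom are assumptions for illustration:

```python
import math

def viterbi(x, tags, f):
    """Viterbi decoding; f(prev_tag, tag, x, t) is the edge score, "$" the start tag."""
    score = {"$": 0.0}
    prev = []                                   # prev[t-1][y] = best previous tag for position t
    for t in range(1, len(x) + 1):
        new_score, new_prev = {}, {}
        for y in tags:
            best_prev, best = None, -math.inf
            for y_prev, s in score.items():     # O(N^2) work per position
                cand = s + f(y_prev, y, x, t)
                if cand > best:
                    best_prev, best = y_prev, cand
            new_score[y], new_prev[y] = best, best_prev
        score, prev = new_score, prev + [new_prev]
    y_t = max(score, key=score.get)             # best tag at the last position
    path = [y_t]
    for t in range(len(x) - 1, 0, -1):          # retrace the prev pointers
        y_t = prev[t][y_t]
        path.append(y_t)
    return list(reversed(path))

# Toy edge score: +1 if the transition is A -> B, +1 if the tag equals the word.
def f(y_prev, y, x, t):
    return (1.0 if (y_prev, y) == ("A", "B") else 0.0) + (1.0 if y == x[t - 1] else 0.0)

print(viterbi(["A", "B", "B"], ["A", "B"], f))  # ['A', 'B', 'B']
```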

Page 38:

Outline

Sequence labeling / segmentation problems: settings and example problems:

Part-of-speech tagging, named entity recognition, gesture recognition

Hidden Markov Model

Standard definition + maximum likelihood estimation

General views: as a representative of linear models

Perceptron and Structured Perceptron

algorithms / motivations

Decoding with the Linear Model

Discussion: Discriminative vs. Generative

Page 39:

Recap: Sequence Labeling

Hidden Markov Models:

How to estimate

Discriminative models

How to learn with structured perceptron

Both learning algorithms result in a linear model

How to label with the linear models

Page 40:

Discriminative vs Generative

Generative models:
Cheap to estimate: simply normalized counts (not necessarily the case for generative models with latent variables)
Hard to integrate complex features: you need to come up with a generative story, and this story may be wrong
Does not result in an optimal classifier when the model assumptions are wrong (i.e., always)

Discriminative models:
More expensive to learn: need to run decoding (here, Viterbi) during training, and usually multiple times per example
Easy to integrate features: though some features may make decoding intractable
Usually less accurate on small datasets

Page 41:

Reminders

Speakers: slides are due about a week before the talk; meetings with me before/after this point will normally be needed

Reviewers: reviews are accepted only before the day we consider the topic

Everyone: references to the papers to read are at GoogleDocs

These slides (and previous ones) will be online today

Speakers: send me the final version of your slides too

Next time: Lea on Models of Parsing, PCFGs vs general WCFGs (Michael Collins' book chapter)