Dynamic Programming for Linear-Time Incremental Parsing


Transcript of Dynamic Programming for Linear-Time Incremental Parsing

Page 1: Dynamic Programming for Linear-Time Incremental Parsing

Dynamic Programming for Linear-Time Incremental Parsing

Liang Huang, Information Sciences Institute

University of Southern California

(Joint work with Kenji Sagae, USC/ICT)

JHU CLSP Seminar September 14, 2010

Page 2: Dynamic Programming for Linear-Time Incremental Parsing

Remembering Fred Jelinek (1932-2010)

Prof. Jelinek hosted my visit and this talk on what turned out to be his last day.

He was very supportive of this work, which is related to his work on structured language models, and I dedicate my work to his memory.

Pages 3-7: Dynamic Programming for Linear-Time Incremental Parsing

Ambiguity and Incrementality

One morning in Africa, I shot an elephant in my pajamas;

how he got into my pajamas I’ll never know.

• NLP is (almost) all about ambiguity resolution

• human beings resolve ambiguity incrementally

Pages 8-12: Dynamic Programming for Linear-Time Incremental Parsing

Ambiguities in Translation

Google translate: carefully slide

Pages 13-14: Dynamic Programming for Linear-Time Incremental Parsing

If you are stolen...

Google translate: Once the theft to the police

Pages 15-16: Dynamic Programming for Linear-Time Incremental Parsing

or even...

clear evidence that NLP is used in real life!

Pages 17-20: Dynamic Programming for Linear-Time Incremental Parsing

Ambiguities in Parsing

I feed cats nearby in the garden ...

• let’s focus on dependency structures for simplicity

• ambiguous attachments of nearby and in

• ambiguity explodes exponentially with sentence length

• must design an efficient (polynomial-time) search algorithm

• typically using dynamic programming (DP), e.g. CKY

Pages 21-23: Dynamic Programming for Linear-Time Incremental Parsing

But full DP is too slow...

I feed cats nearby in the garden ...

• full DP (like CKY) is too slow (cubic-time)

• while human parsing is fast & incremental (linear-time)

• how about incremental parsing then?

• yes, but only with greedy search (accuracy suffers)

• it explores a tiny fraction of possible trees (even w/ beam search)

• can we combine the merits of both approaches?

• a fast, incremental parser with dynamic programming?

• one that explores exponentially many trees in linear time?

Page 24: Dynamic Programming for Linear-Time Incremental Parsing

Linear-Time Incremental DP

• incremental parsing (e.g. shift-reduce): fast (linear-time), but greedy search (Nivre 04; Collins/Roark 04; ...)

• full DP (e.g. CKY): principled search, but slow (cubic-time) (Eisner 96; Collins 99; ...)

• this work: fast shift-reduce parsing with dynamic programming -- principled search in linear time

Pages 25-26: Dynamic Programming for Linear-Time Incremental Parsing

Big Picture

• human + natural languages: psycholinguistics

• computer + natural languages: NLP

• computer + programming languages: compiler theory (LR, LALR, ...)

Pages 27-29: Dynamic Programming for Linear-Time Incremental Parsing

Preview of the Results

• very fast linear-time dynamic programming parser

• best reported dependency accuracy on PTB/CTB

• explores exponentially many trees (and outputs forest)

[scatter plot: parsing time (secs) vs. sentence length for Charniak, Berkeley, MST, and this work]

[log-scale plot: number of trees explored vs. sentence length -- DP: exponential; non-DP beam search: constant]

Page 30: Dynamic Programming for Linear-Time Incremental Parsing

Outline

• Motivation

• Incremental (Shift-Reduce) Parsing

• Dynamic Programming for Incremental Parsing

• Experiments

Pages 31-38: Dynamic Programming for Linear-Time Incremental Parsing

Shift-Reduce Parsing

Example: I feed cats nearby in the garden.

step  action     stack                  queue
0     -          (empty)                I feed cats ...
1     shift      I                      feed cats nearby ...
2     shift      I feed                 cats nearby in ...
3     l-reduce   feed(I)                cats nearby in ...
4     shift      feed(I) cats           nearby in the ...
5a    r-reduce   feed(I, cats)          nearby in the ...
5b    shift      feed(I) cats nearby    in the garden ...

(here feed(I) denotes the tree headed by feed with dependent I)

shift-reduce conflict at step 5: r-reduce (5a) vs. shift (5b)
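To make the trace concrete, here is a minimal Python sketch of the three actions; the function names and the nested-tuple tree representation are illustrative, not the talk's actual implementation.

```python
# A minimal sketch of the shift / l-reduce / r-reduce actions in the
# table above. The nested-tuple tree representation is illustrative.

def shift(stack, queue):
    """Push the next queue word onto the stack."""
    return stack + [queue[0]], queue[1:]

def l_reduce(stack, queue):
    """The second-from-top tree becomes a left dependent of the top tree."""
    s1, s0 = stack[-2], stack[-1]
    return stack[:-2] + [(s0, "<-", s1)], queue

def r_reduce(stack, queue):
    """The top tree becomes a right dependent of the second-from-top tree."""
    s1, s0 = stack[-2], stack[-1]
    return stack[:-2] + [(s1, "->", s0)], queue

# replay steps 1-5a of the trace above
stack, queue = [], "I feed cats nearby in the garden .".split()
for action in (shift, shift, l_reduce, shift, r_reduce):
    stack, queue = action(stack, queue)
print(stack)   # feed heads both I (left) and cats (right), as in step 5a
```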

Page 39: Dynamic Programming for Linear-Time Incremental Parsing

Choosing Parser Actions

• score each action using features f and weights w

• features are drawn from a local window

• abstraction (or signature) of a state -- this inspires DP!

• weights trained by structured perceptron (Collins 02)

... s2 s1 s0 | q0 q1 ...   (← stack | queue →)

features: (s0.w, s0.rc, q0, ...) = (cats, nearby, in, ...)
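A sketch of this linear scoring in Python, continuing the sketch above; the feature templates here are simplified stand-ins for the talk's real templates.

```python
# Sketch of linear action scoring w . f(state, action): features are
# indicator strings drawn from a local window (s0, s1, q0), conjoined
# with the action. Templates are illustrative stand-ins.

def extract_features(stack, queue):
    s0 = stack[-1] if stack else None
    s1 = stack[-2] if len(stack) > 1 else None
    q0 = queue[0] if queue else None
    return (f"s0={s0}", f"s1={s1}", f"q0={q0}", f"s0,q0={s0},{q0}")

def score(weights, feats, action):
    # sum the learned weight of each (feature, action) pair
    return sum(weights.get((f, action), 0.0) for f in feats)
```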

Pages 40-41: Dynamic Programming for Linear-Time Incremental Parsing

Greedy Search

• each state => three new states (shift, l-reduce, r-reduce)

• so the search space is exponential

• greedy search: always pick the best next state
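A sketch of the greedy decoder, reusing the action and scoring helpers from the sketches above; the legality checks are the obvious ones (shift needs a non-empty queue, reduces need two stack items).

```python
# Greedy decoding: follow the single best-scoring legal action at each
# step. Linear-time, but commits to one of exponentially many paths.
# Reuses shift/l_reduce/r_reduce and extract_features/score from above.

ACTIONS = {"shift": shift, "l-reduce": l_reduce, "r-reduce": r_reduce}

def greedy_parse(sentence, weights):
    stack, queue = [], list(sentence)
    while queue or len(stack) > 1:
        legal = [a for a in ACTIONS
                 if (queue if a == "shift" else len(stack) > 1)]
        feats = extract_features(stack, queue)
        best = max(legal, key=lambda a: score(weights, feats, a))
        stack, queue = ACTIONS[best](stack, queue)
    return stack[0]    # the single remaining tree
```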

Page 42: Dynamic Programming for Linear-Time Incremental Parsing

Beam Search

• each state => three new states (shift, l-reduce, r-reduce)

• so the search space is exponential

• beam search: always keep the top-b states
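A sketch of beam search over parser states, again reusing the helpers above; the step count 2n-1 (n shifts, n-1 reduces) is the length of every complete action sequence in this system.

```python
# Beam search: expand every beam state with all legal actions, keep the
# top-b by accumulated score. Still linear-time for fixed b, but only a
# constant number of trees survive. Reuses the helpers sketched above.

import heapq

def beam_parse(sentence, weights, b=8):
    beam = [(0.0, [], list(sentence))]        # (score, stack, queue)
    for _ in range(2 * len(sentence) - 1):    # n shifts + (n-1) reduces
        cands = []
        for sc, stack, queue in beam:
            feats = extract_features(stack, queue)
            if queue:
                cands.append((sc + score(weights, feats, "shift"),
                              *shift(stack, queue)))
            if len(stack) > 1:
                cands.append((sc + score(weights, feats, "l-reduce"),
                              *l_reduce(stack, queue)))
                cands.append((sc + score(weights, feats, "r-reduce"),
                              *r_reduce(stack, queue)))
        beam = heapq.nlargest(b, cands, key=lambda c: c[0])
    return beam[0]                            # best final (score, stack, queue)
```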

Pages 43-47: Dynamic Programming for Linear-Time Incremental Parsing

Dynamic Programming

• each state => three new states (shift, l-reduce, r-reduce)

• key idea of DP: share common subproblems

• merge equivalent states => polynomial space

“graph-structured stack” (Tomita, 1988)

each DP state corresponds to exponentially many non-DP states

[log-scale plot: number of trees explored vs. sentence length -- DP: exponential; non-DP beam search: constant]

Pages 48-60: Dynamic Programming for Linear-Time Incremental Parsing

Merging Equivalent States

• two states are equivalent if they agree on features

• because the same features guarantee the same cost

• back to the shift-reduce conflict after “I feed cats”: the r-reduce branch makes cats a dependent of feed, while the shift branch pushes nearby

• assume features only look at the root of s0: then two states are equivalent if they agree on the root of s0

• as both branches advance, they reach states that agree on s0 (and q0); if features only look at s0 and q0, those states are equivalent and are merged: (local) ambiguity-packing!

• merged states may split again when later reduced; the resulting structure is the graph-structured stack

[diagram: the sh and re branches of the conflict converge into merged states over “... in the garden”, drawn as a graph-structured stack]
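A sketch of the merging step: states are bucketed by a feature signature (here simplified to whatever the features can see, e.g. the root of s0 plus the queue position); within a bucket we keep the best score but remember every predecessor, which is what makes the stack graph-structured. This is an illustration, not the paper's data structures.

```python
# Sketch of DP state merging. A state's signature is what the features
# can see (simplified here); states sharing a signature are merged,
# keeping the best (Viterbi) score but all predecessor links, so the
# merged node can split again on later reduces. Illustrative only.

from collections import defaultdict

def merge(candidates):
    """candidates: iterable of (signature, score, predecessor).
    Returns {signature: (best_score, [predecessors...])}."""
    best = {}
    preds = defaultdict(list)
    for sig, sc, pred in candidates:
        preds[sig].append(pred)          # keep every incoming path
        if sig not in best or sc > best[sig]:
            best[sig] = sc               # keep only the best score
    return {sig: (best[sig], preds[sig]) for sig in best}

# e.g. the two conflict branches end up with the same signature:
merged = merge([(("nearby", 4), -1.2, "branch-5a"),
                (("nearby", 4), -0.7, "branch-5b")])
print(merged)    # {('nearby', 4): (-0.7, ['branch-5a', 'branch-5b'])}
```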

Page 61: Dynamic Programming for Linear-Time Incremental Parsing

Theory: Polynomial-Time DP

• this DP is exact and polynomial-time if features are:

• a) bounded -- for polynomial time

• features can only look at a local window

• b) monotonic -- for correctness (optimal substructure)

• features should draw no more info from trees farther from the stack top than from trees closer to the top

• both are intuitive: a) is always true; b) is almost always true

... s2 s1 s0 | q0 q1 ...   (← stack | queue →)
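One way to picture the monotonicity condition is as a containment requirement on feature templates: the attributes inspected at a deeper stack position must be a subset of those inspected at shallower positions. A toy checker, with made-up attribute names:

```python
# Toy check of the monotonicity condition: the attribute set consulted
# at a deeper stack position must not exceed the set consulted at the
# position above it (s2's ⊆ s1's ⊆ s0's). Attribute names are made up.

def is_monotonic(templates):
    """templates: e.g. {"s0": {"w", "t", "rc"}, "s1": {"w", "t"}, "s2": {"w"}}"""
    chain = ["s0", "s1", "s2"]
    return all(templates.get(deep, set()) <= templates.get(shallow, set())
               for shallow, deep in zip(chain, chain[1:]))

print(is_monotonic({"s0": {"w", "t", "rc"}, "s1": {"w", "t"}, "s2": {"w"}}))  # True
print(is_monotonic({"s0": {"w"}, "s1": {"w", "t"}}))                          # False
```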

Page 62: Dynamic Programming for Linear-Time Incremental Parsing

Theory: Monotonic History

• related: grammar refinement by annotation (Johnson, 1998)

• annotate vertical context history (e.g., parent)

• monotonicity: can’t annotate the grandparent without annotating the parent (otherwise DP would fail)

• our features: left-context history instead of vertical context

• similarly, can’t annotate s2 without annotating s1

• but we can always design a “minimum monotonic superset”

[diagram: vertical context (parent, grandparent) vs. left context on the stack (s0, s1, s2)]

Pages 63-64: Dynamic Programming for Linear-Time Incremental Parsing

Related Work

• Graph-Structured Stack (Tomita 88): Generalized LR

• GSS is just a chart viewed from left to right (e.g. Earley 70)

• this line of work started w/ Lang (1974); stuck since 1990

• b/c an explicit LR table is impossible with modern grammars

• Jelinek (2004) independently rediscovered GSS

• We revived and advanced this line of work in two aspects

• theoretical: implicit LR table based on features

• merge and split on-the-fly; no pre-compilation needed

• monotonic feature functions guarantee correctness (new)

• practical: achieved linear-time performance with pruning

Pages 65-67: Dynamic Programming for Linear-Time Incremental Parsing

Jelinek (2004)

In: M. Johnson, S. Khudanpur, M. Ostendorf, and R. Rosenfeld (eds.): Mathematical Foundations of Speech and Language Processing, 2004

graph-structured stack!

I don’t know anything about this paper...

Pages 68-69: Dynamic Programming for Linear-Time Incremental Parsing

Jelinek (2004)

• structured language model as graph-structured stack

p_SLM(a | has, show)   vs.   p_3gram(a | its, host)

see also (Chelba and Jelinek, 98; 00; Xu, Chelba, Jelinek, 02)

Page 70: Dynamic Programming for Linear-Time Incremental Parsing

Experiments

Page 71: Dynamic Programming for Linear-Time Incremental Parsing

Speed Comparison

• 5 times faster with the same parsing accuracy

[bar chart: time (hours), non-DP vs. DP]

Page 72: Dynamic Programming for Linear-Time Incremental Parsing

Correlation of Search and Parsing

• better search quality <=> better parsing accuracy

[scatter plot: dependency accuracy vs. average model score, DP vs. non-DP]

Page 73: Dynamic Programming for Linear-Time Incremental Parsing

Search Space: Exponential

[log-scale plot: number of trees explored vs. sentence length -- DP: exponential; non-DP: fixed (beam-width)]

Page 74: Dynamic Programming for Linear-Time Incremental Parsing

N-Best / Forest Oracles

[plot: oracle accuracy vs. k -- DP forest oracle (98.15) above DP k-best in forest, above non-DP k-best in beam]

Page 75: Dynamic Programming for Linear-Time Incremental Parsing

Better Search => Better Learning

• DP leads to faster and better learning w/ perceptron

Page 76: Dynamic Programming for Linear-Time Incremental Parsing

Learning Details: Early Updates

• greedy search: update at first error

• beam search: update when gold is pruned (Collins/Roark 04)

• DP search: also update when gold is “merged” (new!)

• b/c we know gold can’t make it to the top again

[plot: learning curves, DP vs. non-DP]
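A self-contained sketch of perceptron learning with early update, on abstract action sequences; the toy prefix features and the beam here are stand-ins for the parser's real states and templates.

```python
# Sketch of structured perceptron with early update: beam-decode the
# action sequence, and the moment the gold prefix falls off the beam,
# promote the gold prefix's features, demote the current best prefix's,
# and stop. Toy prefix features; abstract actions. Illustrative only.

import heapq

ACTIONS = ("shift", "l-reduce", "r-reduce")

def prefix_feats(prefix):
    # toy features: action unigrams and bigrams over the prefix
    fs = [("uni", a) for a in prefix]
    fs += [("bi", x, y) for x, y in zip(prefix, prefix[1:])]
    return fs

def train_early_update(weights, gold, b=4):
    beam = [(0.0, ())]                               # (score, action prefix)
    for t in range(len(gold)):
        cands = [(sum(weights.get(f, 0.0) for f in prefix_feats(p + (a,))),
                  p + (a,))
                 for _, p in beam for a in ACTIONS]
        beam = heapq.nlargest(b, cands, key=lambda c: c[0])
        if gold[: t + 1] not in (p for _, p in beam):
            for f in prefix_feats(gold[: t + 1]):    # promote gold prefix
                weights[f] = weights.get(f, 0.0) + 1.0
            for f in prefix_feats(beam[0][1]):       # demote current best
                weights[f] = weights.get(f, 0.0) - 1.0
            return                                   # early update: stop here
    # gold survived to the end: standard perceptron update (omitted)

# usage: weights = {}; train_early_update(weights, ("shift", "shift", "l-reduce"))
```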

Pages 77-79: Dynamic Programming for Linear-Time Incremental Parsing

Parsing Time vs. Sentence Length

• parsing speed (scatter plot) compared to other parsers

[scatter plot: parsing time (secs) vs. sentence length, with empirical fits -- Charniak O(n^2.5), Berkeley O(n^2.4), MST O(n^2), this work O(n)]

Pages 80-82: Dynamic Programming for Linear-Time Incremental Parsing

Final Results

• much faster than major parsers (even with Python!)

• first linear-time incremental dynamic programming parser

• best reported dependency accuracy on Penn Treebank

                           time (secs)  complexity  trees searched  accuracy
McDonald et al 05 - MST    0.12         O(n^2)      exponential     90.2
Koo et al 08 baseline*     -            O(n^4)      exponential     92.0
Zhang & Clark 08 single    0.11         O(n)        constant        91.4
this work                  0.04         O(n)        exponential     92.1
Charniak 00                0.49         O(n^2.5)    exponential     92.5
Petrov & Klein 07          0.21         O(n^2.4)    exponential     92.4

*at this ACL: Koo & Collins 10: 93.0 with O(n^4)

Page 83: Dynamic Programming for Linear-Time Incremental Parsing

Final Results on Chinese

• also the best parsing accuracy on Chinese

• Penn Chinese Treebank (CTB 5)

• all numbers below use gold-standard POS tags

                            word   non-root   root
Duan et al. 2007            83.9   84.4       73.7
Zhang & Clark 08 (single)   84.3   84.7       76.7
this work                   85.2   85.5       78.3

Pages 84-85: Dynamic Programming for Linear-Time Incremental Parsing

Conclusion

• incremental parsing (e.g. shift-reduce): fast (linear-time), but greedy search

• full dynamic programming (e.g. CKY): principled search, but slow (cubic-time)

• this work: linear-time shift-reduce parsing w/ dynamic programming -- fast and principled

Page 86: Dynamic Programming for Linear-Time Incremental Parsing

Zoom out to Big Picture...

• human + natural languages: psycholinguistics

• computer + programming languages: compiler theory

• computer + natural languages: NLP -- still a long way to go...

Page 87: Dynamic Programming for Linear-Time Incremental Parsing

Thank You

• a general theory of DP for shift-reduce parsing

• as long as features are bounded and monotonic

• fast, accurate DP parser release coming soon:

• http://www.isi.edu/~lhuang

• future work

• adapt to constituency parsing (straightforward)

• other grammar formalisms like CCG and TAG

• integrate POS tagging into the parser

• integrate semantic interpretation

Page 88: Dynamic Programming for Linear-Time Incremental Parsing

How I was invited to give this talk

• Fred attended ACL 2010 in Sweden

• Mark Johnson mentioned this work to him

• Fred saw my co-author Kenji Sagae giving the talk

• but didn’t realize it was Kenji; he thought it was me

• he emailed me (but misspelled my name in the address)

• not getting a reply, he asked Kevin Knight to “forward it to Liang Haung or his student Sagae.”

• Fred complained that my paper is very hard to read: “As you can see, I am completely confused!” And he was right.

• finally he said, “come here to give a talk and explain it.”