Shuffling Non-Constituents

Shuffling Non-Constituents. Jason Eisner, with David A. Smith and Roy Tromble. ACL SSST Workshop (invited talk), June 2008. Topics: a syntactically-flavored reordering model and syntactically-flavored reordering search methods.

Transcript of Shuffling Non-Constituents

Page 1: Shuffling Non-Constituents

Shuffling Non-Constituents

Jason Eisner

ACL SSST Workshop (invited talk), June 2008

with David A. Smith and Roy Tromble

syntactically-flavored reordering model

syntactically-flavored reordering search methods

Page 2: Shuffling Non-Constituents

Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 2

Starting point: Synchronous alignment

Synchronous grammars are very pretty.
But does parallel text actually have parallel structure?
  Depends on what kind of parallel text
    Free translations? Noisy translations?
    Were the parsers trained on parallel annotation schemes?
  Depends on what kind of parallel structure
    What kinds of divergences can your synchronous grammar formalism capture?
    E.g., wh-movement versus wh in situ

Page 3: Shuffling Non-Constituents

Synchronous Tree Substitution Grammar

Two training trees, showing a free translation from French to English.

[Figure: French tree over donnent (“give”), enfants (“kids”), beaucoup (“lots”), d’ (“of”), un (“a”), baiser (“kiss”), à (“to”), Sam; English tree over kiss, kids, Sam, quite, often.]

“beaucoup d’enfants donnent un baiser à Sam”
“kids kiss Sam quite often”

Page 4: Shuffling Non-Constituents

Synchronous Tree Substitution Grammar

Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.

[Figure: the two trees with aligned elementary-tree pairs drawn in orange; nodes are labeled Start, NP and Adv, and unaligned material pairs with null (e.g., null / quite, null / often).]

“beaucoup d’enfants donnent un baiser à Sam”
“kids kiss Sam quite often”

Page 5: Shuffling Non-Constituents

Synchronous Tree Substitution Grammar

Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. A much worse alignment ...

[Figure: the same two trees with a different, much worse alignment drawn in.]

“beaucoup d’enfants donnent un baiser à Sam”
“kids kiss Sam quite often”

Page 7: Shuffling Non-Constituents

Synchronous Grammar = Set of Elementary Trees

[Figure: the aligned tree pair cut apart into its elementary tree pairs, with nodes labeled Start, NP and Adv: Sam / Sam; enfants (“kids”) / kids; beaucoup (“lots”) d’ (“of”) / null; donnent (“give”) un (“a”) baiser (“kiss”) à (“to”) / kiss; null / quite; null / often.]

Page 8: Shuffling Non-Constituents

But many examples are harder

German:  Auf diese Frage habe ich leider keine Antwort bekommen
Gloss:   To this question have I alas no answer received
English: I did not unfortunately receive an answer to this question

[Figure: dependency trees for both sentences with word alignments, including a NULL node.]

Page 9: Shuffling Non-Constituents

But many examples are harder

German:  Auf diese Frage habe ich leider keine Antwort bekommen
Gloss:   To this question have I alas no answer received
English: I did not unfortunately receive an answer to this question

Highlighted in the figure: Displaced modifier (negation)

Page 11: Shuffling Non-Constituents

But many examples are harder

German:  Auf diese Frage habe ich leider keine Antwort bekommen
Gloss:   To this question have I alas no answer received
English: I did not unfortunately receive an answer to this question

Highlighted in the figure: Displaced argument (here, because projective parser)

Page 12: Shuffling Non-Constituents

But many examples are harder

German:  Auf diese Frage habe ich leider keine Antwort bekommen
Gloss:   To this question have I alas no answer received
English: I did not unfortunately receive an answer to this question

Highlighted in the figure: Head-swapping (here, just different annotation conventions)

Page 13: Shuffling Non-Constituents

Free Translation

German:  Tschernobyl könnte dann etwas später an die Reihe kommen
Gloss:   Chernobyl could then something later on the queue come
English: Then we could deal with Chernobyl some time later

[Figure: dependency trees for both sentences with word alignments, including a NULL node.]

Page 14: Shuffling Non-Constituents

Free Translation

German:  Tschernobyl könnte dann etwas später an die Reihe kommen
Gloss:   Chernobyl could then something later on the queue come
English: Then we could deal with Chernobyl some time later

Highlighted in the figure: Probably not systematic (but words are correctly aligned)

Page 15: Shuffling Non-Constituents

Free Translation

German:  Tschernobyl könnte dann etwas später an die Reihe kommen
Gloss:   Chernobyl could then something later on the queue come
English: Then we could deal with Chernobyl some time later

Highlighted in the figure: Erroneous parse

Page 16: Shuffling Non-Constituents

What to do?

Current practice:
  Don’t try to model all systematic phenomena!
  Just use non-syntactic alignments (Giza++).
  Only care about the fragments that recur often
    Phrases or gappy phrases
    Sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
  Use these (gappy) phrases in a decoder
    Phrase based or hierarchical

Page 17: Shuffling Non-Constituents

What to do?

Current practice:
  Use non-syntactic alignments (Giza++)
  Keep frequent phrases for a decoder

But could syntax give us better alignments?
  Would have to be “loose” syntax …

Why do we want better alignments?
  1. Throw away less of the parallel training data
  2. Help learn a smarter, syntactic, reordering model
     Could help decoding: less reliance on LM
  3. Some applications care about full alignments

Page 18: Shuffling Non-Constituents

Quasi-synchronous grammar

How do we handle “loose” syntax?

Translation story:
  Generate target English by a monolingual grammar
    Any grammar formalism is okay
    Pick a dependency grammar formalism for now

[Figure: dependency parse of “I did not unfortunately receive an answer to this question”, annotated with P(PRP | no previous left children of “did”) and P(I | did, PRP).]

parsing: O(n^3)

Page 19: Shuffling Non-Constituents

Quasi-synchronous grammar

How do we handle “loose” syntax?

Translation story:
  Generate target English by a monolingual grammar
  But probabilities are influenced by source sentence
    Each English node is aligned to some source node
    Prefers to generate children aligned to nearby source nodes

[Figure: dependency parse of “I did not unfortunately receive an answer to this question”.]

parsing: O(n^3)

Page 20: Shuffling Non-Constituents

QCFG Generative Story

[Figure: the observed German source “Auf diese Frage habe ich leider keine Antwort bekommen” (plus NULL), aligned to the generated English dependency tree for “I did not unfortunately receive an answer to this question”.]

P(parent-child)
P(breakage)
P(PRP | no previous left children of “did”, habe)
P(I | did, PRP, ich)

aligned parsing: O(m^2 n^3)

Page 21: Shuffling Non-Constituents

What’s a “nearby node”?

Given parent’s alignment, where might child be aligned?

[Figure: the candidate configurations, one of which is the synchronous grammar case.]

+ “none of the above”

Page 22: Shuffling Non-Constituents

Quasi-synchronous grammar

How do we handle “loose” syntax?

Translation story:
  Generate target English by a monolingual grammar
  But probabilities are influenced by source sentence

Useful analogies:
  1. Generative grammar with latent word senses
  2. MEMM
     Generate n-gram tag sequence, but probabilities are influenced by word sequence

[Figure: Target generated conditioned on Source.]

Page 23: Shuffling Non-Constituents

Quasi-synchronous grammar

How do we handle “loose” syntax?

Translation story:
  Generate target English by a monolingual grammar
  But probabilities are influenced by source sentence

Useful analogies:
  1. Generative grammar with latent word senses
  2. MEMM
  3. IBM Model 1
     Source nodes can be freely reused or unused
     Future work: Enforce 1-to-1 to allow good decoding (NP-hard to do exactly)

Page 24: Shuffling Non-Constituents

Some results: Quasi-synch. Dep. Grammar

Alignment (D. Smith & Eisner 2006)
  Quasi-synchronous syntax much better than synchronous
  Maybe also better than IBM Model 4

Question answering (Wang et al. 2007)
  Align question w/ potential answer
  Mean average precision: 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features)

Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing)
  Learn how parsed parallel text influences target dependencies
  Along with many other features! (cf. co-training)
  Unsupervised: German 30% → 69%, Spanish 26% → 65%

Page 25: Shuffling Non-Constituents

Summary of part I

Current practice:
  Use non-syntactic alignments (Giza++)
  Some bits align nicely
  Use the frequent bits in a decoder

Suggestion: Let syntax influence alignments.

So far, loose syntax methods are like IBM Model 1.
  NP-hard to enforce 1-to-1 in any interesting model.

Rest of talk:
  How to enforce 1-to-1 in interesting models?
  Can we do something smarter than beam search?

Page 26: Shuffling Non-Constituents

Shuffling Non-Constituents

Jason Eisner

ACL SSST Workshop, June 2008

with David A. Smith and Roy Tromble

syntactically-flavored reordering model

syntactically-flavored reordering search methods

Page 27: Shuffling Non-Constituents

Motivation

MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!

Page 28: Shuffling Non-Constituents

Permutation search in MT

initial order (French):  1 2 3 4 5 6
  Marie/NNP  ne/NEG  m’/PRP  a/AUX  pas/NEG  vu/VBN

best order (French’):    1 4 2 5 6 3
  Marie/NNP  a/AUX  ne/NEG  pas/NEG  vu/VBN  m’/PRP

easy transduction: Mary hasn’t seen me

Page 29: Shuffling Non-Constituents

Motivation

MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!

Just have to fix that pesky word order.

Framing it this way lets us enforce 1-to-1 exactly at the permutation step. Deletion and fertility > 1 are still allowed in the subsequent transduction.

Page 30: Shuffling Non-Constituents

Often want to find an optimal permutation …

Machine translation:
  Reorder French to French-prime (Brown et al. 1992)
  So it’s easier to align or translate
MT eval:
  How much do you need to rearrange MT output so it scores well under an LM derived from ref translations?
Discourse generation, e.g., multi-doc summarization:
  Order the output sentences (Lapata 2003)
  So they flow nicely
Reconstruct temporal order of events after info extraction
Learn rule ordering or constraint ranking for phonology?
Multi-word anagrams that score well under a LM

Page 31: Shuffling Non-Constituents

Permutation search: The problem

initial order: 1 2 3 4 5 6
best order according to some cost function: 1 4 2 5 6 3

How can we find this needle in the haystack of N! possible permutations?

Page 32: Shuffling Non-Constituents

Traditional approach: Beam search

Approx. best path through a really big FSA
  N! paths: one for each permutation
  only 2^N states

state remembers what we’ve generated so far (but not in what order)

arc weight = cost of picking 5 next if we’ve seen {1,2,4} so far
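The subset-state lattice described above can be sketched for small N. This is a minimal illustration, not the talk's implementation: it searches the 2^N-state lattice exactly instead of with a beam, and it assumes a pairwise cost matrix in which `B[i][j]` is the cost of placing item i before item j, so that an arc weight depends only on the set of items already generated. The function names are hypothetical; the matrix is the example B matrix that appears later in the slides.

```python
from itertools import permutations

# Example pairwise cost matrix (0-indexed): B[i][j] = cost of i preceding j.
B = [[0, 5, -22, 93, 8, 6],
     [12, 0, 8, -31, -6, 54],
     [-7, 41, 0, -9, 24, 82],
     [88, 17, -6, 0, 12, -60],
     [11, -17, 10, -59, 0, 23],
     [5, 4, -12, 6, 55, 0]]

def pairwise_cost(order, B):
    """Sum B[i][j] over all pairs where i precedes j in `order`."""
    return sum(B[order[a]][order[b]]
               for a in range(len(order)) for b in range(a + 1, len(order)))

def best_order_subset_dp(B):
    """Exact shortest path through the 2^N-state lattice (no beam pruning)."""
    n = len(B)
    best = {0: (0, [])}                  # bitmask of items seen -> (cost, order)
    for mask in range(1 << n):           # every predecessor of a mask is smaller
        cost, order = best[mask]
        for j in range(n):
            if mask & (1 << j):
                continue                 # item j was already generated
            arc = sum(B[i][j] for i in order)   # depends only on the set seen
            cand = (cost + arc, order + [j])
            new = mask | (1 << j)
            if new not in best or cand[0] < best[new][0]:
                best[new] = cand
    return best[(1 << n) - 1]

cost, order = best_order_subset_dp(B)
print(cost, [i + 1 for i in order])      # 1-indexed, as on the slides
```

A beam search would keep only the few cheapest states at each path length instead of all of them, trading exactness for speed.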

Page 33: Shuffling Non-Constituents

An alternative: Local search (“hill-climbing”)

The SWAP neighborhood

current order: 1 2 3 4 5 6   cost=22
neighbors shown:
  2 1 3 4 5 6   cost=26
  1 3 2 4 5 6   cost=20
  1 2 4 3 5 6   cost=19
  1 2 3 5 4 6   cost=25

Page 34: Shuffling Non-Constituents

An alternative: Local search (“hill-climbing”)

The SWAP neighborhood

1 2 3 4 5 6   cost=22
1 2 4 3 5 6   cost=19

Page 35: Shuffling Non-Constituents

An alternative: Local search (“hill-climbing”)

Like “greedy decoder” of Germann et al. 2001

The SWAP neighborhood

1 2 3 4 5 6   cost=22 → cost=19 → cost=17 → cost=16 → . . .

Why are the costs always going down? we pick best swap
How long does it take to pick best swap? O(N) if you’re careful
How many swaps might you need to reach answer? O(N^2)
What if you get stuck in a local min? random restarts
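The hill-climbing loop over the SWAP neighborhood might look like the sketch below. It assumes a pairwise (LOP-style) cost, where `B[i][j]` is the cost of i preceding j; under that assumption the effect of one adjacent swap is a single difference `B[y][x] - B[x][y]`, which is one way to get the O(N) "if you're careful" scan. The function names are hypothetical, and random restarts are left out.

```python
# Example pairwise cost matrix (0-indexed): B[i][j] = cost of i preceding j.
B = [[0, 5, -22, 93, 8, 6],
     [12, 0, 8, -31, -6, 54],
     [-7, 41, 0, -9, 24, 82],
     [88, 17, -6, 0, 12, -60],
     [11, -17, 10, -59, 0, 23],
     [5, 4, -12, 6, 55, 0]]

def pairwise_cost(order, B):
    """Sum B[i][j] over all pairs where i precedes j in `order`."""
    return sum(B[order[a]][order[b]]
               for a in range(len(order)) for b in range(a + 1, len(order)))

def swap_deltas(order, B):
    """Cost change of swapping positions p and p+1, for every p (O(N) total)."""
    return [B[order[p + 1]][order[p]] - B[order[p]][order[p + 1]]
            for p in range(len(order) - 1)]

def swap_hill_climb(order, B):
    """Repeatedly apply the best adjacent swap until none lowers the cost."""
    order = list(order)
    while len(order) > 1:
        deltas = swap_deltas(order, B)
        p = min(range(len(deltas)), key=deltas.__getitem__)
        if deltas[p] >= 0:
            break                        # local minimum of the SWAP neighborhood
        order[p], order[p + 1] = order[p + 1], order[p]
    return order

start = list(range(6))
result = swap_hill_climb(start, B)
print(pairwise_cost(start, B), "->", pairwise_cost(result, B))
```

Because every accepted swap strictly lowers the cost, the loop must terminate, but only at a local minimum.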

Page 36: Shuffling Non-Constituents

Larger neighborhood

current order: 1 2 3 4 5 6   cost=22
SWAP neighbors shown:
  2 1 3 4 5 6   cost=26
  1 3 2 4 5 6   cost=20
  1 2 4 3 5 6   cost=19
  1 2 3 5 4 6   cost=25

Page 37: Shuffling Non-Constituents

Larger neighborhood (well-known in the literature; works well)

INSERT neighborhood

[Figure: 1 2 3 4 5 6 (cost=22) and an INSERT neighbor (cost=17).]

Fewer local minima? yes – 3 can move past 4 to get past 5
Graph diameter (max #moves needed)? O(N) rather than O(N^2)
How many neighbors? O(N^2) rather than O(N)
How long to find best neighbor? O(N^2) rather than O(N)

Page 38: Shuffling Non-Constituents

Even larger neighborhood

BLOCK neighborhood

[Figure: 1 2 3 4 5 6 (cost=22) and a BLOCK neighbor (cost=14).]

Fewer local minima? yes – 2 can get past 45 without having to cross 3 or move 3 first
Graph diameter (max #moves needed)? still O(N)
How many neighbors? O(N^3) rather than O(N), O(N^2)
How long to find best neighbor? O(N^3) rather than O(N), O(N^2)

Page 39: Shuffling Non-Constituents

Larger yet: Via dynamic programming??

[Figure: 1 2 3 4 5 6 (cost=22).]

Fewer local minima?
Graph diameter (max #moves needed)? logarithmic
How many neighbors? exponential
How long to find best neighbor? polynomial

Page 40: Shuffling Non-Constituents

Unifying/generalizing neighborhoods so far

1 2 3 4 5 6 7 8

Exchange two adjacent blocks, of max widths w ≤ w’

Move is defined by an (i,j,k) triple

runtime = # neighbors = O(ww’N)

  SWAP:    w=1, w’=1   O(N)
  INSERT:  w=1, w’=N   O(N^2)
  BLOCK:   w=N, w’=N   O(N^3)

everything in this talk can be generalized to other values of w,w’
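To make the (i,j,k) parameterization concrete, here is a small sketch that enumerates the moves for given width limits. The half-open indexing (blocks `order[i:j]` and `order[j:k]`) and the function names are illustrative choices, not taken from the slides; the sketch reads "max widths w ≤ w’" as "the narrower block has width at most w and the wider one at most w’".

```python
def block_moves(n, w, w2):
    """Yield (i, j, k): exchange adjacent blocks [i, j) and [j, k).

    Assumes w <= w2: the narrower block has width <= w, the wider <= w2.
    """
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                a, b = j - i, k - j
                if min(a, b) <= w and max(a, b) <= w2:
                    yield (i, j, k)

def apply_move(order, move):
    """Return a copy of `order` with blocks [i, j) and [j, k) exchanged."""
    i, j, k = move
    return order[:i] + order[j:k] + order[i:j] + order[k:]

n = 6
print(len(list(block_moves(n, 1, 1))))   # SWAP:   grows like N
print(len(list(block_moves(n, 1, n))))   # INSERT: grows like N^2
print(len(list(block_moves(n, n, n))))   # BLOCK:  grows like N^3
print(apply_move([1, 2, 3, 4, 5, 6], (2, 3, 5)))   # 3 moves past the block 4 5
```

The loop above always scans all O(N^3) triples for simplicity; a careful implementation would enumerate only the O(ww’N) triples that satisfy the width limits.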

Page 41: Shuffling Non-Constituents

Very large-scale neighborhoods

What if we consider multiple simultaneous exchanges that are “independent”?

The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000)

[Figure: a lattice over 1 2 3 4 5 6 with arcs that keep an item in place and arcs that swap an adjacent pair (2 1, 3 2, 4 3, 5 4, 6 5); each path through the lattice is one neighbor.]

Lowest-cost neighbor is lowest-cost path

Cost of this arc is Δcost of swapping (4,5), here < 0

Page 42: Shuffling Non-Constituents

Very large-scale neighborhoods

[Figure: the DYNASEARCH lattice over 1 2 3 4 5 6.]

Lowest-cost neighbor is lowest-cost path

Why would this be a good idea?
  Help get out of bad local minima? no; they’re still local minima
  Help avoid getting into bad local minima? yes – less greedy

Example on 1 2 3 4:

B =
   0  -20    0   80
   0    0  -30    0
   0    0    0  -20
   0    0    0    0

  DYNASEARCH (-20 + -20)
  SWAP (-30)

Page 43: Shuffling Non-Constituents

Very large-scale neighborhoods

[Figure: the DYNASEARCH lattice over 1 2 3 4 5 6.]

Lowest-cost neighbor is lowest-cost path

Why would this be a good idea?
  Help get out of bad local minima? no; they’re still local minima
  Help avoid getting into bad local minima? yes – less greedy
  More efficient? yes! – shortest-path algorithm finds the best set of swaps in O(N) time, as fast as best single swap.

Up to N moves as fast as 1 move: no penalty for “parallelism”!

Globally optimizes over exponentially many neighbors (paths).
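The shortest-path computation behind DYNASEARCH can be sketched as a one-dimensional dynamic program over swap positions. The sketch assumes the cost change of each adjacent swap is already known and that changes of non-overlapping swaps simply add up; the function name and the input format are hypothetical. The example uses the per-swap changes from the 4-item slide example (-20, -30, -20).

```python
def dynasearch(deltas):
    """Pick the best set of non-overlapping adjacent swaps.

    deltas[p] is the cost change of swapping positions p and p+1.
    Returns (total change, list of chosen swap positions).
    """
    best = [(0, [])]                     # best[p]: optimum over swap positions < p
    for p, d in enumerate(deltas):
        skip = best[p]                   # do not swap at p
        # Swapping at p rules out a swap at p-1, so extend best[p-1].
        prev_cost, prev_swaps = best[p - 1] if p >= 1 else (0, [])
        take = (prev_cost + d, prev_swaps + [p])
        best.append(min(skip, take, key=lambda t: t[0]))
    return best[-1]

# Adjacent-swap changes for 1 2 3 4 in the slide example: (1,2), (2,3), (3,4).
print(dynasearch([-20, -30, -20]))       # (-40, [0, 2])
```

Greedy SWAP would take the single -30 move and then be stuck; the dynamic program finds the two -20 moves together. Copying the swap lists makes this sketch quadratic in the worst case; keeping backpointers instead would give the O(N) time quoted on the slide.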

Page 44: Shuffling Non-Constituents

Can we extend this idea – up to N moves in parallel by dynamic programming – to neighborhoods beyond SWAP?

1 2 3 4 5 6 7 8

Exchange two adjacent blocks, of max widths w ≤ w’

Move is defined by an (i,j,k) triple

runtime = # neighbors = O(ww’N)

  SWAP:    w=1, w’=1   O(N)
  INSERT:  w=1, w’=N   O(N^2)
  BLOCK:   w=N, w’=N   O(N^3)

Yes. Asymptotic runtime is always unchanged.

Page 45: Shuffling Non-Constituents

Let’s define each neighbor by a “colored tree”

Just like ITG!

[Figure: a binary tree over 1 2 3 4 5 6; a colored node = swap children.]

Page 47: Shuffling Non-Constituents

Let’s define each neighbor by a “colored tree”

Just like ITG!

[Figure: a binary tree over 1 2 3 4 5 6; a colored node = swap children.]

This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.

Page 48: Shuffling Non-Constituents

If that was the optimal neighbor …

1 4 5 6 2 3

… now look for its optimal neighbor

new tree!

Page 49: Shuffling Non-Constituents

If that was the optimal neighbor …

5 6 1 4 2 3

… now look for its optimal neighbor

new tree!

Page 50: Shuffling Non-Constituents

If that was the optimal neighbor …

5 6 1 4 2 3

… now look for its optimal neighbor
… repeat till reach local optimum

Each tree defines a neighbor.
At each step, optimize over all possible trees by dynamic programming (CKY parsing).

Use your favorite parsing speedups (pruning, best-first, …)

Page 51: Shuffling Non-Constituents

Very-large-scale versions of SWAP, INSERT, and BLOCK all by the algorithm we just saw …

1 2 3 4 5 6 7 8

Exchange two adjacent blocks, of max widths w ≤ w’

Move is defined by an (i,j,k) triple

Runtime of the algorithm we just saw was O(N^3) because we considered O(N^3) distinct (i,j,k) triples.

More generally, restrict to only the O(ww’N) triples of interest to define a smaller neighborhood with runtime of O(ww’N). (yes, the dynamic programming recurrences go through)

Page 52: Shuffling Non-Constituents

How many steps to get from here to there?

initial order: 8 4 6 2 5 3 7 1
best order:    1 2 3 4 5 6 7 8

One twisted-tree step?
No: As you probably know, 3 1 4 2 → 1 2 3 4 is impossible.
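A brute-force check of which orders are reachable in a single colored-tree step is sketched below. It relies on the observation that a span can be a tree node only if its values form a contiguous range and it splits into two such spans; the function name is hypothetical and the input is assumed to be a permutation of consecutive integers. This is only a reachability test, not the talk's optimizer.

```python
from functools import lru_cache
from itertools import permutations

def one_step_reachable(perm):
    """True if `perm` can be sorted by one colored (ITG-style) tree."""
    perm = tuple(perm)

    @lru_cache(maxsize=None)
    def ok(lo, hi):                      # can the span perm[lo:hi] be a node?
        if hi - lo == 1:
            return True
        vals = perm[lo:hi]
        if max(vals) - min(vals) != hi - lo - 1:
            return False                 # values are not a contiguous range
        return any(ok(lo, mid) and ok(mid, hi) for mid in range(lo + 1, hi))

    return ok(0, len(perm))

print(one_step_reachable((3, 1, 4, 2)))              # False
print(one_step_reachable((8, 4, 6, 2, 5, 3, 7, 1)))  # False
print(sum(one_step_reachable(p) for p in permutations(range(1, 5))))  # 22
```

Of the 24 orders of four items, only 3 1 4 2 and 2 4 1 3 fail the test, so a single step covers most short permutations but not all.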

Page 53: Shuffling Non-Constituents

Can you get to the answer in one step?

German-English, Giza++ alignment

[Plot annotations:]
  often (yay, big neighborhood)
  not always (yay, local search)
  for longer sentences, usually not

Page 54: Shuffling Non-Constituents

How many steps to the answer in the worst case? (what is diameter of the search space?)

8 4 6 2 5 3 7 1
1 2 3 4 5 6 7 8

claim: only log_2 N steps at worst (if you know where to step)

Let’s sketch the proof!

Page 55: Shuffling Non-Constituents

Quicksort anything into, e.g., 1 2 3 4 5 6 7 8

8 4 6 2 5 3 7 1

[Figure: a right-branching tree over the sequence, partitioning it around a pivot between 4 and 5.]

Page 56: Shuffling Non-Constituents

Quicksort anything into, e.g., 1 2 3 4 5 6 7 8

[Figure: the partially sorted sequence after the first step, covered by a sequence of right-branching trees that partition each block around its own pivot.]

Only log_2 N steps to get to 1 2 3 4 5 6 7 8 …

… or to anywhere!
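The quicksort argument can be simulated directly. In this sketch, which is an illustration and not the slides' construction verbatim, one step partitions every current block around its median. Within a block the smaller values keep their order and the larger values come out reversed, which is what a right-branching tree can produce when each item is sent either to the front or to the back of the remainder. The function names are hypothetical.

```python
import math

def quicksort_step(blocks):
    """One parallel step: split every block around its median value."""
    out = []
    for block in blocks:
        if len(block) <= 1:
            out.append(block)
            continue
        pivot = sorted(block)[len(block) // 2]
        low = [x for x in block if x < pivot]            # order preserved
        high = [x for x in block if x >= pivot][::-1]    # order reversed
        out += [low, high]
    return out

def steps_to_sort(order):
    """Apply steps until every block is a single item; return (result, #steps)."""
    blocks, steps = [list(order)], 0
    while any(len(b) > 1 for b in blocks):
        blocks = quicksort_step(blocks)
        steps += 1
    return [x for b in blocks for x in b], steps

result, steps = steps_to_sort([8, 4, 6, 2, 5, 3, 7, 1])
print(result, steps, math.ceil(math.log2(8)))
```

Each step at least halves the largest block, so the number of steps is the ceiling of log_2 N for distinct items.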

Page 57: Shuffling Non-Constituents

How can we find this needle in the haystack of N! possible permutations?

initial order: 1 2 3 4 5 6
best order according to some cost function: 1 4 2 5 6 3

Defining “best order”
  What class of cost functions can we handle efficiently?
  How fast can we compute a subtree’s cost from its child subtrees?

Page 58: Shuffling Non-Constituents

How can we find this needle in the haystack of N! possible permutations?

Defining “best order”: What class of cost functions?

best order according to some cost function: 1 4 2 5 6 3

A =
    0   15   22   80    5   -7
  -30    0  -76   24   63  -44
   15   28    0  -15   71  -99
   12    8  -31    0   54   -6
    7   -9   41   24    0   82
    6    5  -22    8   93    0

cost = a14 + a42 + a25 + a56 + a63 + a31

“Traveling Salesperson Problem” (TSP)

Page 59: Shuffling Non-Constituents

How can we find this needle in the haystack of N! possible permutations?

Defining “best order”: What class of cost functions?

best order according to some cost function: 1 4 2 5 6 3

B =
    0    5  -22   93    8    6
   12    0    8  -31   -6   54
   -7   41    0   -9   24   82
   88   17   -6    0   12  -60
   11  -17   10  -59    0   23
    5    4  -12    6   55    0

b26 = cost of 2 preceding 6
(add up n(n-1)/2 such costs)
(any order will incur either b26 or b62)

“Linear Ordering Problem” (LOP)
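As a worked check of the two cost definitions, the sketch below scores the order 1 4 2 5 6 3 with the example matrices from the slides. The TSP-style score follows the sum a14 + a42 + a25 + a56 + a63 + a31, including the closing arc back to the first item; the LOP-style score adds b_ij for every pair with i before j. The function names are hypothetical, and the code converts the slides' 1-based item numbers to 0-based indices.

```python
A = [[0, 15, 22, 80, 5, -7],
     [-30, 0, -76, 24, 63, -44],
     [15, 28, 0, -15, 71, -99],
     [12, 8, -31, 0, 54, -6],
     [7, -9, 41, 24, 0, 82],
     [6, 5, -22, 8, 93, 0]]

B = [[0, 5, -22, 93, 8, 6],
     [12, 0, 8, -31, -6, 54],
     [-7, 41, 0, -9, 24, 82],
     [88, 17, -6, 0, 12, -60],
     [11, -17, 10, -59, 0, 23],
     [5, 4, -12, 6, 55, 0]]

def tsp_cost(order, A):
    """Sum of A over consecutive items, plus the arc from last back to first."""
    idx = [x - 1 for x in order]
    return sum(A[idx[p]][idx[(p + 1) % len(idx)]] for p in range(len(idx)))

def lop_cost(order, B):
    """Sum of B[i][j] over all n(n-1)/2 pairs with i placed before j."""
    idx = [x - 1 for x in order]
    return sum(B[idx[p]][idx[q]]
               for p in range(len(idx)) for q in range(p + 1, len(idx)))

order = [1, 4, 2, 5, 6, 3]
print(tsp_cost(order, A))   # 80 + 8 + 63 + 82 - 22 + 15 = 226
print(lop_cost(order, B))   # 130
```

The TSP score looks only at neighbors in the output, while the LOP score looks at every pair regardless of distance, which is why the two are later combined.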

Page 60: Shuffling Non-Constituents

Defining “best order”: What class of cost functions?

TSP and LOP are both NP-complete
In fact, believed to be inapproximable
  hard even to achieve C * optimal cost (any C≥1)

Practical approaches:
  correct answer, typically fast: branch-and-bound, ILP, …
  fast answer, typically close to correct: beam search, this talk, …

Page 61: Shuffling Non-Constituents

Defining “best order”: What class of cost functions?

initial order: 1 2 3 4 5 6
cost of this order: 1 4 2 5 6 3
  1. Does my favorite WFSA like this string of #s? (generalizes TSP)
  2. Non-local pair order ok? E.g., 4 before 3 …? (generalizes LOP)
  3. Non-local triple order ok? E.g., 1…2…3?
Can add these all up …

Page 62: Shuffling Non-Constituents

Costs are derived from source sentence features

initial order (French): 1 2 3 4 5 6
  Marie/NNP  ne/NEG  m’/PRP  a/AUX  pas/NEG  vu/VBN

A =
    0   15   22   80    5   -7
  -30    0  -76   24   63  -44
   15   28    0  -15   71  -99
   12    8  -31    0   54   -6
    7   -9   41   24    0   82
    6    5  -22    8   93    0

B =
    0    5  -22   93    8    6
   12    0    8  -31   -6   54
   -7   41    0   -9   24   82
   88   17   -6    0   12  -60
   11  -17   10  -59    0   23
  -75    4  -12    6   55    0

ne would like to be brought adjacent to the next NEG word

Page 63: Shuffling Non-Constituents

Costs are derived from source sentence features

initial order (French): 1 2 3 4 5 6
  Marie/NNP  ne/NEG  m’/PRP  a/AUX  pas/NEG  vu/VBN

A =
    0   15   22   80    5   -7
  -30    0  -76   24   63  -44
   15   28    0  -15   71  -99
   12    8  -31    0   54   -6
    7   -9   41   24    0   82
    6    5  -22    8   93    0

B =
    0    5  -22   93    8    6
   12    0    8  -31   -6   54
   -7   41    0   -9   24   82
   88   17   -6    0   12  -60
   11  -17   10  -59    0   23
   75    4  -12    6   55    0

The entry 75 is a sum of feature weights:
  50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie)
 +27: words at a distance of 5 shouldn’t swap order
  -2: words with PRP between them ought to swap
= 75

Can also include phrase boundary symbols in the input!
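The way one matrix entry is assembled from feature weights can be written out as a tiny sketch. The feature names below are invented labels for the three features described on the slide, and the dictionary layout is an assumption; only the weights 50, 27 and -2 come from the source.

```python
# Hypothetical feature names; the weights are the ones shown on the slide.
theta = {
    "verb_precedes_subject": 50,
    "swap_at_distance_5": 27,
    "prp_between": -2,
}

def pair_cost(fired_features, theta):
    """Cost of one ordered pair = sum of the weights of the features that fire."""
    return sum(theta[f] for f in fired_features)

# Placing vu (word 6) before Marie (word 1) fires all three features.
b61 = pair_cost(["verb_precedes_subject", "swap_at_distance_5", "prp_between"], theta)
print(b61)   # 75
```

Because the weights are shared across sentences, a new sentence gets its own matrix without any new parameters.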

Page 64: Shuffling Non-Constituents

Costs are derived from source sentence features

initial order (French): 1 2 3 4 5 6
  Marie/NNP  ne/NEG  m’/PRP  a/AUX  pas/NEG  vu/VBN

[Cost matrices A and B as on Page 63.]

FSA costs:
  Distortion model
  Language model – looks ahead to next step! (good finite-state translation into good English?)

Page 65: Shuffling Non-Constituents

Dynamic program must pick the tree that leads to the lowest-cost permutation

initial order: 1 2 3 4 5 6
cost of this order: 1 4 2 5 6 3
  1. Does my favorite WFSA like it as a string?

Page 66: Shuffling Non-Constituents

Scoring with a weighted FSA

[Figure: a WFSA with an initial state and states 1, 2, 3.]

This particular WFSA implements TSP scoring for N=3:
  After you read 1, you’re in state 1
  After you read 2, you’re in state 2
  After you read 3, you’re in state 3
  … and this state determines the cost of the next symbol you read

We’ll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now runtime goes up to O(N^3 Q^3) …)

Page 67: Shuffling Non-Constituents

Including WFSA costs via nonterminals

words:         1     2     3     4     5     6
preterminals:  6→1   4→2   2→3   1→4   I→5   5→6

A possible preterminal for word 2 is an arc in A that’s labeled with 2.

The preterminal 4→2 rewrites as word 2 with a cost equal to the arc’s cost.

Page 68: Shuffling Non-Constituents

Including WFSA costs via nonterminals

words:         1     2     3     4     5     6
preterminals:  6→1   4→2   2→3   1→4   I→5   5→6

[Figure: a colored tree over these preterminals, with internal nodes labeled by state pairs such as 4→3, 1→3, 6→3, I→6 and I→3.]

This constituent’s total cost is the total cost of the best 6→3 path (6 1 4 2 3).

The root I→3 covers the path I 5 6 1 4 2 3: cost of the new permutation

Page 69: Shuffling Non-Constituents

Dynamic program must pick the tree that leads to the lowest-cost permutation

initial order: 1 2 3 4 5 6
cost of this order: 1 4 2 5 6 3
  1. Does my favorite WFSA like it as a string?
  2. Non-local pair order ok? 4 before 3 …?

Page 70: Shuffling Non-Constituents

Incorporating the pairwise ordering costs

1 2 3 4 5 6 7

This puts {5,6,7} before {1,2,3,4}.

So this hypothesis must add costs 5 < 1, 5 < 2, 5 < 3, 5 < 4, 6 < 1, 6 < 2, 6 < 3, 6 < 4, 7 < 1, 7 < 2, 7 < 3, 7 < 4

Uh-oh! So now it takes O(N^2) time to combine two subtrees, instead of O(1) time?

Nope – dynamic programming to the rescue again!

Page 71: Shuffling Non-Constituents

Computing LOP cost of a block move

1 2 3 4 5 6 7

This puts {5,6,7} before {1,2,3,4}.

So we have to add O(N^2) costs just to consider this single neighbor!

[Figure: the grid of pair costs between {1,2,3,4} and {5,6,7}, written as the sum of two overlapping narrower grids, minus their overlap, plus the one remaining corner entry.]

The narrower grids were already computed at earlier steps of parsing.

Reuse work from other, “narrower” block moves … computed new cost in O(1)!
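Putting the pieces together, the sketch below finds the best colored-tree neighbor under a purely pairwise (LOP-style) cost. It is an illustrative reconstruction under stated assumptions, not the authors' code: `B[i][j]` is taken to be the cost of i preceding j, `D[i][j][k]` is the cost change from exchanging the blocks `pi[i:j]` and `pi[j:k]`, and each `D` entry is filled in O(1) from narrower blocks by inclusion-exclusion, as the figure suggests. The function names are hypothetical, and WFSA costs are ignored.

```python
from itertools import permutations

def lop_cost(order, B):
    """Sum B[i][j] over all pairs where i precedes j in `order`."""
    return sum(B[order[a]][order[b]]
               for a in range(len(order)) for b in range(a + 1, len(order)))

def best_itg_neighbor(pi, B):
    """Return (cost change, new order) for the best colored tree over `pi`."""
    n = len(pi)
    # D[i][j][k]: change in cost if blocks pi[i:j] and pi[j:k] are exchanged.
    # Entries with an empty block are never written and stay 0.
    D = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    best = [[0] * (n + 1) for _ in range(n + 1)]     # best change for span [i, k)
    back = [[None] * (n + 1) for _ in range(n + 1)]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            corner = B[pi[k - 1]][pi[i]] - B[pi[i]][pi[k - 1]]
            options = []
            for j in range(i + 1, k):
                # O(1) update from narrower block moves (inclusion-exclusion).
                D[i][j][k] = (D[i][j][k - 1] + D[i + 1][j][k]
                              - D[i + 1][j][k - 1] + corner)
                swap = D[i][j][k] < 0
                options.append((best[i][j] + best[j][k] + min(0, D[i][j][k]),
                                (j, swap)))
            best[i][k], back[i][k] = min(options)

    def build(i, k):
        if k - i == 1:
            return [pi[i]]
        j, swap = back[i][k]
        left, right = build(i, j), build(j, k)
        return right + left if swap else left + right

    return best[0][n], build(0, n)

B = [[0, 5, -22, 93, 8, 6], [12, 0, 8, -31, -6, 54], [-7, 41, 0, -9, 24, 82],
     [88, 17, -6, 0, 12, -60], [11, -17, 10, -59, 0, 23], [5, 4, -12, 6, 55, 0]]
pi = [0, 1, 2, 3, 4, 5]
delta, new_order = best_itg_neighbor(pi, B)
print(delta, new_order, lop_cost(new_order, B))
```

Local search would call `best_itg_neighbor` repeatedly on its own output until the returned change is 0. The three nested loops give the O(N^3) runtime mentioned earlier, with no extra factor for summing pair costs.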

Page 72: Shuffling Non-Constituents

Incorporating 3-way ordering costs

See the initial paper (Eisner & Tromble 2006)

A little tricky, but comes “for free” if you’re willing to accept a certain restriction on these costs
  more expensive without that restriction, but possible

Page 73: Shuffling Non-Constituents

Another option: Markov chain Monte Carlo

Random walk in the space of permutations
  interpret a permutation’s cost as a negative log-probability (up to a constant): p(π) = exp(–cost(π)) / Z
  Sample a permutation from the neighborhood instead of always picking the most probable

Why?
  Simulated annealing might beat greedy-with-random-restarts
  When learning the parameters of the distribution, can use sampling to compute the feature expectations
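The sampling step alone can be sketched as follows, using the small SWAP neighborhood so that candidates can be listed explicitly. Everything here is an assumption for illustration: the function names, the choice to include the current order among the candidates, and the toy cost. The sketch shows only how costs turn into sampling probabilities; a sampler with the exact stationary distribution p(π) would presumably need an additional acceptance correction, and the very large neighborhoods in this talk would need tree-based sampling instead of enumeration.

```python
import math
import random

def swap_neighbors(order):
    """All orders reachable by one adjacent swap."""
    out = []
    for p in range(len(order) - 1):
        nb = list(order)
        nb[p], nb[p + 1] = nb[p + 1], nb[p]
        out.append(nb)
    return out

def neighborhood_distribution(order, cost):
    """Candidates (current order plus neighbors) with probabilities ∝ exp(-cost)."""
    cands = [list(order)] + swap_neighbors(order)
    costs = [cost(c) for c in cands]
    m = min(costs)                                   # subtract min for stability
    weights = [math.exp(-(c - m)) for c in costs]
    z = sum(weights)                                 # local normalizer
    return cands, [w / z for w in weights]

def sample_neighbor(order, cost, rng=random):
    cands, probs = neighborhood_distribution(order, cost)
    return rng.choices(cands, weights=probs, k=1)[0]

# Toy cost: number of inversions (an illustrative stand-in for a real model).
def inversions(order):
    return sum(order[a] > order[b]
               for a in range(len(order)) for b in range(a + 1, len(order)))

cands, probs = neighborhood_distribution([2, 1, 3], inversions)
print(list(zip(cands, [round(p, 3) for p in probs])))
print(sample_neighbor([2, 1, 3], inversions, random.Random(0)))
```

Lowering a temperature that divides the costs would turn the same step into simulated annealing.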

Page 74: Shuffling Non-Constituents

Another option: Markov chain Monte Carlo

Random walk in the space of permutations
  interpret a permutation’s cost as a negative log-probability (up to a constant): p(π) = exp(–cost(π)) / Z
  Sample a permutation from the neighborhood instead of always picking the most probable

How?
  Pitfall: Sampling a permutation ≠ sampling a tree
    Spurious ambiguity: some permutations have many trees
  Solution: Exclude some trees, leaving 1 per permutation
    Normal form has long been known for colored trees
    For restricted colored trees (which limit the size of blocks to swap), we’ve devised a more complicated normal form

Page 75: Shuffling Non-Constituents

Learning the costs

Where do these costs come from?
If we have some examples on which we know the true permutation, could try to learn them

[Cost matrices A and B as on Page 63.]

Page 76: Shuffling Non-Constituents

Learning the costs

Where do these costs come from?
If we have some examples on which we know the true permutation, could try to learn them
More precisely, try to learn these weights θ (the knowledge that’s reused across examples)
  50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie)
  27: words at a distance of 5 shouldn’t swap order
  -2: words with PRP between them ought to swap

[Cost matrices A and B as on Page 63.]

Page 77: Shuffling Non-Constituents

Experimenting with training LOP params (LOP is quite fast: O(n^3) with no grammar constant)

PDS  VMFIN  PPER  ADV  APPR  ART  NN     PTKNEG  VVINF  $.
Das  kann   ich   so   aus   dem  Stand  nicht   sagen  .

B[7,9]

Page 78: Shuffling Non-Constituents

Feature templates for cost of swapping i, j

22 features, plus versions of all of these conjoined with the distance j - i (binned)

Page 79: Shuffling Non-Constituents

Feature templates for cost of swapping i, j

22 features, plus versions of all of these conjoined with the distance j - i (binned)

Only LOP features so far
And they’re unnecessarily simple (don’t examine syntactic constituency)
And input sequence is only words (not interspersed with syntactic brackets)

Page 80: Shuffling Non-Constituents

Learning LOP Costs for MT (interesting, if odd, to try to reorder with only the LOP costs)

Define German’ to be German in English word order
To get German’ for training data, use Giza++ to align all German positions to English positions (disallow NULL)

Pipeline: German → (LOP) → German’ → (MOSES) → English
MOSES baseline: German → (MOSES) → English

Page 81: Shuffling Non-Constituents

Learning LOP Costs for MT (interesting, if odd, to try to reorder with only the LOP costs)

Pipeline: German → (LOP) → German’ → (MOSES) → English
MOSES baseline: German → (MOSES) → English

Easy first try: Naïve Bayes
  Treat each feature in θ as independent
  Count and normalize over the training data
  No real improvement over baseline

Page 82: Shuffling Non-Constituents

Learning LOP Costs for MT (interesting, if odd, to try to reorder with only the LOP costs)

Pipeline: German → (LOP) → German’ → (MOSES) → English
MOSES baseline: German → (MOSES) → English

Easy second try: Perceptron

[Figure: local search iterates 0, 1, . . ., n; labels: local optimum, global optimum, gold standard (*), search error, model error, update.]

Note: Search error can be beneficial, e.g., just take 1 step from identity permutation
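A perceptron update for pairwise ordering costs might look like the sketch below. The feature function, the feature-name strings and the learning rate are all hypothetical; the only thing taken from the slides is the idea that each pair cost is a weighted sum of features and that the weights θ are adjusted when the search result differs from the gold-standard order. Since lower cost is better here, the update raises the cost of what the search returned and lowers the cost of the gold order.

```python
from collections import Counter

def order_features(order, feats):
    """Sum the pair features f(i, j) over all pairs with i placed before j."""
    total = Counter()
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            total.update(feats(order[a], order[b]))
    return total

def cost(theta, order, feats):
    """Model cost of an order: theta dotted with its summed features."""
    return sum(theta.get(f, 0.0) * v
               for f, v in order_features(order, feats).items())

def perceptron_update(theta, gold, predicted, feats, lr=1.0):
    """Raise the cost of the predicted order, lower the cost of the gold order."""
    g, p = order_features(gold, feats), order_features(predicted, feats)
    for f in set(g) | set(p):
        theta[f] = theta.get(f, 0.0) + lr * (p[f] - g[f])
    return theta

# Toy example with a hypothetical POS-pair feature template.
tags = ["NNP", "NEG", "PRP"]
feats = lambda i, j: [f"{tags[i]}<{tags[j]}"]
gold, predicted = [0, 2, 1], [0, 1, 2]
theta = perceptron_update({}, gold, predicted, feats)
print(cost(theta, gold, feats), cost(theta, predicted, feats))   # -1.0 1.0
```

In the setting on the slide, `predicted` would be the local optimum returned by the search and not necessarily the model's global optimum.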

Page 83: Shuffling Non-Constituents

Benefit from reordering

Learning method             BLEU vs. German′   BLEU vs. English
No reordering               49.65              25.55
Naïve Bayes, POS            49.21
Naïve Bayes, POS+lexical    49.75
Perceptron, POS             50.05              25.92
Perceptron, POS+lexical     51.30              26.34

obviously, not yet unscrambling German: need more features

Page 84: Shuffling Non-Constituents

Alternatively, work back from gold standard

Contrastive estimation (Smith & Eisner 2005)

Maximize the probability of the desired permutation relative to its ITG neighborhood
  Requires summing all permutations in a neighborhood
  Must use normal-form trees here
Stochastic gradient descent

[Figure: the gold standard (*) inside its 1-step very-large-scale neighborhood.]

Page 85: Shuffling Non-Constituents

Alternatively, work back from gold standard

k-best MIRA in the neighborhood

Make gold standard beat its local competitors
Beat the bad ones by a bigger margin
  Good = close to gold in swap distance?
  Good = close to gold using BLEU?
  Good = translates into English that’s close to reference?

[Figure: the gold standard (*) and the current winners in its 1-step very-large-scale neighborhood.]

Page 86: Shuffling Non-Constituents

Alternatively, train each iterate

[Figure: search iterates (0), (1), . . ., (n), each paired with an oracle point *0, *1, . . ., *n, with an update at every iterate; labels: “model best in neighborhood of (0)” and “oracle in neighborhood of (0)”.]

Or could do a k-best MIRA version of this, too; even use a loss measure based on lookahead to (n)

Page 87: Shuffling Non-Constituents

Summary of part II

Local search is fun and easy
  Popular elsewhere in AI
  Closely related to MCMC sampling

Probably useful for translation
  Maybe other NP-hard problems too

Can efficiently use huge local neighborhoods
  Algorithms are closely related to parsing and FSMs
  Our community knows that stuff better than anyone!