Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models


Page 1: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Classroom update

• Mondays: 7 Bristo Square, lecture theatre 3

• Most Thursdays: here, EXCEPT:

• Weeks 3 & 5: Anatomy lecture theatre

• Sign up to Piazza to get these announcements.

Page 2: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

A Brief Tour of Lexical Translation Models

Page 3: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models
Page 4: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models
Page 5: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Last Time ...

p(Translation) = \sum_{Alignment} p(Translation, Alignment)

Page 6: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Last Time ...

p(Translation) = \sum_{Alignment} p(Translation, Alignment)

               = \sum_{Alignment} p(Translation | Alignment) \times p(Alignment)

Page 7: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Last Time ...

p(Translation) = \sum_{Alignment} p(Translation, Alignment)

               = \sum_{Alignment} p(Translation | Alignment) \times p(Alignment)

p(e | f, m) = \sum_{a \in [0,n]^m} p(a | f, m) \times \prod_{i=1}^m p(e_i | f_{a_i})

(here p(a | f, m) plays the role of p(Alignment), and the product over positions plays the role of p(Translation | Alignment))

Page 8: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} p(a | f, m) \times \prod_{i=1}^m p(e_i | f_{a_i})

Page 9: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} p(a | f, m) \times \prod_{i=1}^m p(e_i | f_{a_i})

Some possible refinements of the lexical translation term:

\prod_{i=1}^m p(e_i | f_{a_i}, f_{a_{i-1}})

\prod_{i=1}^m p(e_i | f_{a_i}, e_{i-1})

Page 10: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} p(a | f, m) \times \prod_{i=1}^m p(e_i | f_{a_i})

Some possible refinements of the lexical translation term:

\prod_{i=1}^m p(e_i | f_{a_i}, f_{a_{i-1}})

\prod_{i=1}^m p(e_i | f_{a_i}, e_{i-1})

\prod_{i=1}^m p(e_i, e_{i+1} | f_{a_i})

What is the problem here?

Page 11: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} p(a | f, m) \times \prod_{i=1}^m p(e_i | f_{a_i})

            = \sum_{a \in [0,n]^m} \underbrace{\prod_{i=1}^m \frac{1}{1 + n}}_{p(a | f, m)} \times \prod_{i=1}^m p(e_i | f_{a_i})

Page 12: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} p(a | f, m) \times \prod_{i=1}^m p(e_i | f_{a_i})

            = \sum_{a \in [0,n]^m} \underbrace{\prod_{i=1}^m \frac{1}{1 + n}}_{p(a | f, m)} \times \prod_{i=1}^m p(e_i | f_{a_i})

            = \sum_{a \in [0,n]^m} \prod_{i=1}^m \frac{1}{1 + n} \, p(e_i | f_{a_i})

Page 13: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} p(a | f, m) \times \prod_{i=1}^m p(e_i | f_{a_i})

            = \sum_{a \in [0,n]^m} \underbrace{\prod_{i=1}^m \frac{1}{1 + n}}_{p(a | f, m)} \times \prod_{i=1}^m p(e_i | f_{a_i})

            = \sum_{a \in [0,n]^m} \prod_{i=1}^m \frac{1}{1 + n} \, p(e_i | f_{a_i})

            = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i) \times p(e_i | f_{a_i})

Page 14: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} p(a | f, m) \times \prod_{i=1}^m p(e_i | f_{a_i})

            = \sum_{a \in [0,n]^m} \underbrace{\prod_{i=1}^m \frac{1}{1 + n}}_{p(a | f, m)} \times \prod_{i=1}^m p(e_i | f_{a_i})

            = \sum_{a \in [0,n]^m} \prod_{i=1}^m \frac{1}{1 + n} \, p(e_i | f_{a_i})

            = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i) \times p(e_i | f_{a_i})

Can we do something better here?
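The step that makes Model 1 tractable is implicit in the derivation above: because the alignment probability factorizes across target positions and does not depend on the words, the sum over all (1 + n)^m alignments can be pushed inside the product. Spelled out (added here for clarity, not on the slide):

```latex
\sum_{a \in [0,n]^m} \prod_{i=1}^{m} \frac{1}{1+n}\, p(e_i \mid f_{a_i})
  \;=\; \prod_{i=1}^{m} \sum_{j=0}^{n} \frac{1}{1+n}\, p(e_i \mid f_j)
```

so the marginal likelihood, and the posteriors needed for EM, cost O(mn) per sentence pair rather than O((1 + n)^m).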

Page 15: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i) \times p(e_i | f_{a_i})

Page 16: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i) \times p(e_i | f_{a_i})

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

Page 17: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

• Model alignment with an absolute position distribution

• p(a_i | i, m, n): the probability of translating the foreign word at position a_i to generate the word at position i (with target length m and source length n)

• EM training of this model is almost the same as with Model 1 (the same conditional independencies hold); a minimal sketch of one iteration follows below

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})
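To make "almost the same as Model 1" concrete, here is a minimal sketch of one EM iteration for Model 2. It assumes the translation table t and the position table q are plain Python dicts initialised beforehand (e.g. uniformly); all names are illustrative, not the course's reference implementation.

```python
from collections import defaultdict

def em_iteration_model2(corpus, t, q):
    """corpus: list of (f, e) sentence pairs (lists of words).
    t[(e_word, f_word)] ~ p(e_word | f_word); q[(j, i, m, n)] ~ p(a_i = j | i, m, n)."""
    count_t, total_t = defaultdict(float), defaultdict(float)
    count_q, total_q = defaultdict(float), defaultdict(float)

    for f, e in corpus:
        f = ["<NULL>"] + f                      # source position 0 is the NULL word
        n, m = len(f) - 1, len(e)
        for i, e_word in enumerate(e, start=1):
            # Alignment links are conditionally independent, so the posterior over a_i
            # is just a renormalised product of the position and translation factors.
            scores = [q.get((j, i, m, n), 1.0 / (n + 1)) * t.get((e_word, f[j]), 1e-9)
                      for j in range(n + 1)]
            z = sum(scores)
            for j, s in enumerate(scores):
                p = s / z                       # p(a_i = j | e, f, m)
                count_t[(e_word, f[j])] += p
                total_t[f[j]] += p
                count_q[(j, i, m, n)] += p
                total_q[(i, m, n)] += p

    # M-step: renormalise the expected counts.
    new_t = {(e_w, f_w): c / total_t[f_w] for (e_w, f_w), c in count_t.items()}
    new_q = {(j, i, m, n): c / total_q[(i, m, n)] for (j, i, m, n), c in count_q.items()}
    return new_t, new_q
```

Dropping the q table (or fixing it to the uniform 1/(1 + n)) recovers Model 1 training, which is the sense in which the two are "almost the same".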

Page 18: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

natürlich ist das haus klein

Page 19: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

natürlich ist das haus klein
of course the house is small

(alignment: of ← natürlich, course ← natürlich, the ← das, house ← haus, is ← ist, small ← klein)

Page 20: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

• Pros

  • Non-uniform alignment model

  • Fast EM training / marginal inference

• Cons

  • Absolute position is very naive

  • How many parameters to model p(a_i | i, m, n)?

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

Page 21: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

How much do we know when we only know the source & target lengths and the current position?

Page 22: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6), illustrating p(a_i | i, m, n).]

How much do we know when we only know the source & target lengths and the current position?

Page 23: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6), illustrating p(a_i | i, m, n).]

How much do we know when we only know the source & target lengths and the current position?

How many parameters do we need to model this?

Page 24: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6).]

h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|

Page 25: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6).]

h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|

(i: position in target)

Page 26: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6).]

h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|

(i: position in target, j: position in source)

Page 27: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6).]

h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|

(i: position in target, j: position in source, m: target length)

Page 28: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6).]

h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|

(i: position in target, j: position in source, m: target length, n: source length)

Page 29: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6).]

h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|

(i: position in target, j: position in source, m: target length, n: source length)

b(j | i, m, n) = \frac{\exp \lambda h(j, i, m, n)}{\sum_{j'} \exp \lambda h(j', i, m, n)}

Page 30: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6).]

h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|

(i: position in target, j: position in source, m: target length, n: source length)

b(j | i, m, n) = \frac{\exp \lambda h(j, i, m, n)}{\sum_{j'} \exp \lambda h(j', i, m, n)}

p(a_i | i, m, n) = p_0                              if a_i = 0
                 = (1 - p_0) \, b(a_i | i, m, n)    otherwise

Page 31: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

[Grid of source positions j' = 0 (null), 1, ..., 5 against target positions i = 1, ..., 6 (source length n = 5, target length m = 6).]

h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|

(i: position in target, j: position in source, m: target length, n: source length)

b(j | i, m, n) = \frac{\exp \lambda h(j, i, m, n)}{\sum_{j'} \exp \lambda h(j', i, m, n)}

p(a_i | i, m, n) = p_0                              if a_i = 0
                 = (1 - p_0) \, b(a_i | i, m, n)    otherwise

(Dyer et al. 2013)
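As a concrete illustration of the reparameterisation above (in the spirit of Dyer et al. 2013), here is a small sketch that builds the whole distribution p(a_i = · | i, m, n) from a single diagonal-precision parameter λ and the NULL probability p_0; the numeric defaults are illustrative only.

```python
import math

def distortion(i, m, n, lam=4.0, p0=0.08):
    """Return [p(a_i = j | i, m, n) for j = 0..n], where j = 0 is the NULL word."""
    # h(j, i, m, n) = -|i/m - j/n| rewards alignments near the sentence diagonal.
    h = [-abs(i / m - j / n) for j in range(1, n + 1)]
    z = sum(math.exp(lam * x) for x in h)
    b = [math.exp(lam * x) / z for x in h]           # b(j | i, m, n) for j = 1..n
    return [p0] + [(1.0 - p0) * bj for bj in b]      # prepend the NULL case a_i = 0
```

For the grid drawn above (m = 6, n = 5), distortion(3, 6, 5) gives the column of probabilities for target position i = 3. The only free parameters are λ and p_0, which is the point of the earlier question about parameter count.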

Page 32: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models
Page 33: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models
Page 34: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models
Page 35: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Words reorder in groups. Model this!

Page 36: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i) \times p(e_i | f_{a_i})

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

Page 37: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i) \times p(e_i | f_{a_i})

Model 2:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | i, m, n) \times p(e_i | f_{a_i})

HMM:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | a_{i-1}) \times p(e_i | f_{a_i})

Page 38: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

• Insight: words translate in groups

• Condition on previous alignment position

• p(a_i | a_{i-1}): the probability of translating the foreign word at position a_i given that the previous position translated was a_{i-1}

• EM training of this model uses the forward-backward algorithm (dynamic programming); a minimal sketch follows below

HMM:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | a_{i-1}) \times p(e_i | f_{a_i})
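A minimal sketch of what the forward-backward computation looks like for one sentence pair, using numpy; trans and emit are assumed to be pre-built matrices, NULL alignment is omitted for brevity, and long sentences would need rescaling or log space.

```python
import numpy as np

def hmm_alignment_posteriors(trans, emit):
    """trans[j_prev, j] ~ p(a_i = j | a_{i-1} = j_prev); emit[i, j] = p(e_i | f_j).
    Returns gamma[i, j] = p(a_i = j | e, f), the posterior EM needs."""
    m, n = emit.shape
    alpha = np.zeros((m, n))
    beta = np.zeros((m, n))
    alpha[0] = emit[0] / n                        # assume a uniform start distribution
    for i in range(1, m):                         # forward pass
        alpha[i] = emit[i] * (alpha[i - 1] @ trans)
    beta[-1] = 1.0
    for i in range(m - 2, -1, -1):                # backward pass
        beta[i] = trans @ (emit[i + 1] * beta[i + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Expected transition counts for re-estimating the jump table come from the same quantities (alpha[i-1, j'] * trans[j', j] * emit[i, j] * beta[i, j], renormalised), which is the standard Baum-Welch E-step.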

Page 39: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

• Improvement: model “jumps” through the source sentence

• Relative position model rather than absolute position model

HMM:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | a_{i-1}) \times p(e_i | f_{a_i})

p(a_i | a_{i-1}) = j(a_i - a_{i-1})

Example jump distribution j(Δ):

  Δ = -4: 0.0008
  Δ = -3: 0.0015
  Δ = -2: 0.08
  Δ = -1: 0.18
  Δ =  0: 0.0881
  Δ = +1: 0.4
  Δ = +2: 0.16
  Δ = +3: 0.064
  Δ = +4: 0.0256

Page 40: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

• Be careful: NULLs need special handling. Here is one option (due to Och):

HMM:

p(e | f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^m p(a_i | a_{i-1}) \times p(e_i | f_{a_i})

p(a_i | a_{n_i}) = p_0                              if a_i = 0
                 = (1 - p_0) \, j(a_i - a_{n_i})    otherwise

where n_i is the index of the first non-NULL-aligned word to the left of i.

Page 41: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Fertility Models

• The models we have considered so far have been efficient

• This efficiency has come at a modeling cost:

• What is to stop the model from “translating” a word 0, 1, 2, or 100 times?

• We introduce fertility models to deal with this

Page 42: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

IBM Model 3

Page 43: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Fertility

• Fertility: the number of English words generated by a foreign word

• Modeled by a categorical distribution n(φ | f)

• Examples:

n(φ | f = Unabhaengigkeitserklaerung):
  φ = 0: 0.00008
  φ = 1: 0.1
  φ = 2: 0.0002
  φ = 3: 0.8
  φ = 4: 0.009
  φ = 5: 0

n(φ | f = zum)  [zum = zu + dem]:
  φ = 0: 0.01
  φ = 1: 0
  φ = 2: 0.9
  φ = 3: 0.0009
  φ = 4: 0.0001
  φ = 5: 0

n(φ | f = Haus):
  φ = 0: 0.01
  φ = 1: 0.92
  φ = 2: 0.07
  φ = 3: 0
  φ = 4: 0
  φ = 5: 0
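To see what a fertility table buys, here is an illustrative sketch of the extra factor it contributes to the score of one complete alignment. It ignores the rest of Model 3's generative story (distortion, NULL insertion) and assumes a hypothetical dict fert[(f_word, phi)] holding n(φ | f_word).

```python
from collections import Counter

def fertility_factor(a, f, fert):
    """a[i] = source position generating target word i (0 = NULL); f = source words."""
    phi = Counter(a)                                  # words generated by each source position
    factor = 1.0
    for j, f_word in enumerate(f, start=1):           # NULL fertility is handled separately in Model 3
        factor *= fert.get((f_word, phi.get(j, 0)), 0.0)
    return factor
```

An alignment that uses Haus three times now pays n(3 | Haus) = 0, which is exactly the pressure against "translating" a word 0, 1, 2, or 100 times that the earlier slide asked for.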

Page 44: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Fertility

• Fertility models mean that we can no longer exploit conditional independencies to write p(a | f, m) as a series of local alignment decisions.

• How do we compute the statistics required for EM training?

Page 45: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Fertility

• Fertility models mean that we can no longer exploit conditional independencies to write p(a | f, m) as a series of local alignment decisions.

• How do we compute the statistics required for EM training?

p(e | f, m) = \sum_{a \in [0,n]^m} p(a | f, m) \times \prod_{i=1}^m p(e_i | f_{a_i})

Page 46: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

EM Recipe reminder

• If alignment points were visible, training fertility models would be easy

• We would _______ and ________

• But, alignments are not visible

n(φ = 3 | f = Unabhaengigkeitserklaerung) = count(3, Unabhaengigkeitserklaerung) / count(Unabhaengigkeitserklaerung)

Page 47: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

EM Recipe reminder

• If alignment points were visible, training fertility models would be easy

• We would _______ and ________

• But, alignments are not visible

n(φ = 3 | f = Unabhaengigkeitserklaerung) = count(3, Unabhaengigkeitserklaerung) / count(Unabhaengigkeitserklaerung)

n(φ = 3 | f = Unabhaengigkeitserklaerung) = E[count(3, Unabhaengigkeitserklaerung)] / E[count(Unabhaengigkeitserklaerung)]
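Spelling out what the expectation in the last line is taken over (implicit here, and stated on the next slide): the counts are averaged over the posterior distribution over alignments,

```latex
\mathbb{E}[\mathrm{count}(3, \textit{Unabhaengigkeitserklaerung})]
  = \sum_{a \in [0,n]^m} p(a \mid e, f, m)\,
    \mathrm{count}_a(3, \textit{Unabhaengigkeitserklaerung})
```

where count_a(·) counts how many times, under alignment a, the word generates exactly three English words.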

Page 48: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Expectation & Fertility

• We need to compute expected counts under p(a | f,e,m)

• Unfortunately p(a | f,e,m) doesn’t factorize nicely. :(

• Can we sum exhaustively? How many different a’s are there?

• What to do?

Page 49: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Sample Alignments

• Monte Carlo methods

  • Gibbs sampling

  • Importance sampling

  • Particle filtering

• For historical reasons:

  • Use the Model 2 alignment to start (easy!)

  • Weighted sum over all alignment configurations that are “close” to this alignment configuration (one reading of “close” is sketched below)

  • Is this correct? No! Does it work? Sort of.
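One concrete reading of "close", matching the classic Model 3/4 training approximation: take the alignments reachable from the starting alignment by changing a single link (a move) or exchanging two links (a swap), and compute the expectations over that set only. A small illustrative sketch:

```python
def neighborhood(a, n):
    """All alignments reachable from a (with a[i] in 0..n) by one move or one swap."""
    neighbours = {tuple(a)}
    m = len(a)
    for i in range(m):                       # moves: re-align one target word
        for j in range(n + 1):
            b = list(a); b[i] = j
            neighbours.add(tuple(b))
    for i in range(m):                       # swaps: exchange the links of two target words
        for k in range(i + 1, m):
            b = list(a); b[i], b[k] = b[k], b[i]
            neighbours.add(tuple(b))
    return neighbours
```

This set has only O(mn + m²) members instead of (1 + n)^m, which is why the trick is fast; it is also why it is not correct, since most alignments are simply never counted.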

Page 50: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Pitfalls of Conditional Models

[Figure: example English-Urdu alignment matrices, "IBM Model 4 alignment" (left) vs. "Our model's alignment" (right).]

Figure 2 caption: Example English-Urdu alignment under IBM Model 4 (left) and our discriminative model (right). Model 4 displays two characteristic errors: garbage collection and an overly-strong monotonicity bias, whereas our model does not exhibit these problems and in fact makes no mistakes in the alignment.

Page 51: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models


Page 52: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

A few tricks...

p(f|e)    p(e|f)

Page 53: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

A few tricks...

p(f|e) p(e|f)

Page 54: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

A few tricks...

p(f|e) p(e|f)
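The slides show both directions, p(f|e) and p(e|f), side by side. One common trick this is usually pointing at (stated here as an assumption, since the slide itself only shows the two distributions) is to align the corpus in both directions and combine the results:

```python
def symmetrize(ef_links, fe_links):
    """ef_links, fe_links: sets of (e_pos, f_pos) links from the two alignment
    directions (the reverse direction's links already flipped into this order)."""
    intersection = ef_links & fe_links   # high precision: both models agree on the link
    union = ef_links | fe_links          # high recall: either model proposes the link
    return intersection, union
```

Heuristics such as grow-diag-final, used with the Giza++ toolkit mentioned on the next slide, start from the intersection and selectively add links from the union.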

Page 55: Edinburgh MT Lecture 4: A Brief Tour of Lexical Translation Models

Lexical Translation

• IBM Models 1-5 [Brown et al., 1993]

  • Model 1: lexical translation, uniform alignment

  • Model 2: absolute position model

  • Model 3: fertility

  • Model 4: relative position model (jumps in target string)

  • Model 5: non-deficient model

• HMM translation model [Vogel et al., 1996]

  • Relative position model (jumps in source string)

• Latent variables are useful for much more than translation

• Widely used Giza++ toolkit