
Machine Translation 04: Neural Machine Translation

Rico Sennrich

University of Edinburgh


Overview

last lecture:
how do we represent language in neural networks?

how do we treat language probabilistically (with neural networks)?

today’s lecture:
how do we model translation with a neural network?

how do we generate text from a probabilistic translation model?


Modelling Translation

Suppose that we have:
a source sentence S of length m: $(x_1, \ldots, x_m)$
a target sentence T of length n: $(y_1, \ldots, y_n)$

We can express translation as a probabilistic model

$$T^* = \arg\max_T \; p(T \mid S)$$

Expanding using the chain rule gives

$$p(T \mid S) = p(y_1, \ldots, y_n \mid x_1, \ldots, x_m) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)$$
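As a concrete illustration of this factorisation, the sketch below scores a translation by summing per-token conditional log-probabilities. It assumes a hypothetical interface `model.next_token_logprobs` standing in for whatever network provides the conditional distribution; it is not part of any particular toolkit.

```python
def sentence_logprob(model, source_tokens, target_tokens):
    """Score a translation under the factorisation
    p(T|S) = prod_i p(y_i | y_1..y_{i-1}, x_1..x_m), working in log space.
    `model.next_token_logprobs(source, prefix)` is assumed to return
    a mapping {token: log p(token | prefix, source)}."""
    logprob = 0.0
    prefix = []
    for y in target_tokens:
        logprob += model.next_token_logprobs(source_tokens, prefix)[y]
        prefix.append(y)
    return logprob
```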


Differences Between Translation and Language Model

Target-side language model:

$$p(T) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1})$$

Translation model:

$$p(T \mid S) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)$$

We could just treat the sentence pair as one long sequence, but:
we do not care about p(S)
we may want a different vocabulary and network architecture for the source text

→ Use separate RNNs for source and target.


Encoder-Decoder for Translation

[Figure: encoder-decoder. An encoder RNN reads the source "natürlich hat john spaß" (inputs x_1 ... x_4, hidden states h_1 ... h_4); a decoder RNN, conditioned on the final encoder state, generates the target "of course john has fun" (states s_1 ... s_5, outputs y_1 ... y_5).]
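A minimal numpy sketch of such a summary-vector encoder-decoder (no attention yet). Parameter names loosely follow the notation introduced later (Ex, Ey, Vo, ...); all shapes and the `<bos>` convention are illustrative assumptions, not the configuration of any particular toolkit.

```python
import numpy as np

def rnn_step(W, U, x_emb, h_prev):
    # one plain (Elman) RNN step: h_t = tanh(W x_t + U h_{t-1})
    return np.tanh(W @ x_emb + U @ h_prev)

def encode_decode(params, source_ids, target_ids):
    """Minimal sketch of a summary-vector encoder-decoder (no attention).
    `params` is assumed to hold embedding matrices Ex, Ey, recurrent weights
    Wx, Ux, Wy, Uy, an initialisation matrix Ws and an output projection Vo."""
    p = params
    # encoder: run an RNN over the source and keep the last hidden state
    h = np.zeros(p["Ux"].shape[0])
    for j in source_ids:
        h = rnn_step(p["Wx"], p["Ux"], p["Ex"][j], h)
    # decoder: initialise from the summary vector, score one target word at a time
    s = np.tanh(p["Ws"] @ h)
    logprob, prev = 0.0, target_ids[0]   # assume target_ids[0] is a <bos> symbol
    for y in target_ids[1:]:
        s = rnn_step(p["Wy"], p["Uy"], p["Ey"][prev], s)
        logits = p["Vo"] @ s
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        logprob += np.log(probs[y])
        prev = y
    return logprob
```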


Summary vector

The last encoder hidden state “summarises” the source sentence

With multilingual training, we can potentially learn a language-independent meaning representation

[Sutskever et al., 2014]


Summary vector as information bottleneck

Problem: Sentence Length
A fixed-size representation degrades as sentence length increases [Cho et al., 2014]

Reversing the source brings some improvement [Sutskever et al., 2014]

Solution: Attention
Compute the context vector as a weighted average of the source hidden states

Weights are computed by a feed-forward network with a softmax activation


Encoder-Decoder with Attention

[Figure: encoder-decoder with attention. Encoder states h_1 ... h_4 over "natürlich hat john spaß" are combined into a weighted sum that feeds each decoder step s_1 ... s_5 generating "of course john has fun"; the attention weights (e.g. 0.7, 0.1, 0.1, 0.1) shift towards different source positions as each target word is produced.]


Attentional encoder-decoder: Maths

simplifications of the model by [Bahdanau et al., 2015] (for illustration):
plain RNN instead of GRU

simpler output layer

we do not show bias terms

decoder follows Look, Update, Generate strategy [Sennrich et al., 2017]

Details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

notation:
W, U, E, C, V are weight matrices (of different dimensionality)

E: one-hot to embedding (e.g. 50000 × 512)
W: embedding to hidden (e.g. 512 × 1024)
U: hidden to hidden (e.g. 1024 × 1024)
C: context (2× hidden) to hidden (e.g. 2048 × 1024)
Vo: hidden to one-hot (e.g. 1024 × 50000)

separate weight matrices for encoder and decoder (e.g. Ex and Ey)

input X of length $T_x$; output Y of length $T_y$


Attentional encoder-decoder: Maths

encoder

$$\overrightarrow{h}_j = \begin{cases} 0 & \text{if } j = 0 \\ \tanh\!\left(\overrightarrow{W}_x E_x x_j + \overrightarrow{U}_x \overrightarrow{h}_{j-1}\right) & \text{if } j > 0 \end{cases}$$

$$\overleftarrow{h}_j = \begin{cases} 0 & \text{if } j = T_x + 1 \\ \tanh\!\left(\overleftarrow{W}_x E_x x_j + \overleftarrow{U}_x \overleftarrow{h}_{j+1}\right) & \text{if } j \leq T_x \end{cases}$$

$$h_j = \left(\overrightarrow{h}_j, \overleftarrow{h}_j\right)$$
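A numpy sketch of this bidirectional encoder. The matrix names mirror the slide's notation (the forward/backward weights become Wf, Uf, Wb, Ub), and the shapes are whatever is mutually compatible; this is an illustration, not a prescribed implementation.

```python
import numpy as np

def bidirectional_encoder(Wf, Uf, Wb, Ub, Ex, source_ids):
    """Forward and backward plain RNN over the source, concatenated per position."""
    Tx = len(source_ids)
    hidden = Uf.shape[0]
    fwd = [np.zeros(hidden)]
    for j in range(Tx):                       # left-to-right pass
        fwd.append(np.tanh(Wf @ Ex[source_ids[j]] + Uf @ fwd[-1]))
    bwd = [np.zeros(hidden)]
    for j in reversed(range(Tx)):             # right-to-left pass
        bwd.append(np.tanh(Wb @ Ex[source_ids[j]] + Ub @ bwd[-1]))
    bwd = bwd[:0:-1]                          # reorder so bwd[j] aligns with position j
    # h_j = concatenation of forward and backward states (2x hidden per position)
    return [np.concatenate([fwd[j + 1], bwd[j]]) for j in range(Tx)]
```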


Attentional encoder-decoder: Maths

decoder

$$s_i = \begin{cases} \tanh\!\left(W_s \overleftarrow{h}_1\right) & \text{if } i = 0 \\ \tanh\!\left(W_y E_y y_{i-1} + U_y s_{i-1} + C c_i\right) & \text{if } i > 0 \end{cases}$$

$$t_i = \tanh\!\left(U_o s_i + W_o E_y y_{i-1} + C_o c_i\right)$$

$$y_i = \operatorname{softmax}(V_o t_i)$$

attention model

$$e_{ij} = v_a^\top \tanh\!\left(W_a s_{i-1} + U_a h_j\right)$$

$$\alpha_{ij} = \operatorname{softmax}(e_{ij})$$

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$
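The same equations written as a single decoder step in numpy. The dict `p` is assumed to hold the matrices and vector named above (Wa, Ua, va, Wy, Uy, C, Uo, Wo, Co, Vo, Ey) with compatible shapes; this is an illustrative sketch, not the dl4mt or Nematus implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_decoder_step(p, enc_states, s_prev, y_prev_id):
    """One decoder step with attention, following the equations above."""
    # attention: score each source position against the previous decoder state
    scores = np.array([p["va"] @ np.tanh(p["Wa"] @ s_prev + p["Ua"] @ h)
                       for h in enc_states])
    alpha = softmax(scores)                              # alpha_ij
    c = sum(a * h for a, h in zip(alpha, enc_states))    # context vector c_i
    # update decoder state and generate the output distribution
    y_emb = p["Ey"][y_prev_id]
    s = np.tanh(p["Wy"] @ y_emb + p["Uy"] @ s_prev + p["C"] @ c)
    t = np.tanh(p["Uo"] @ s + p["Wo"] @ y_emb + p["Co"] @ c)
    return s, softmax(p["Vo"] @ t), alpha
```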


Attention model

attention model
side effect: we obtain an alignment between source and target sentence

information can also flow along the recurrent connections, so there is no guarantee that attention corresponds to alignment

applications:
visualisation
replace unknown words with a back-off dictionary [Jean et al., 2015]
...

(image: Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)


Attention model

the attention model also works with images: when generating a caption, the decoder can attend over regions of the input image [Cho et al., 2015]


Application of Encoder-Decoder Model

Scoring (a translation):
p(La, croissance, économique, s’est, ralentie, ces, dernières, années, . | Economic, growth, has, slowed, down, in, recent, years, .) = ?

Decoding (a source sentence):
Generate the most probable translation of a source sentence

$y^* = \arg\max_y \; p(y \mid \text{Economic, growth, has, slowed, down, in, recent, years, .})$


Decoding

exact search:
generate every possible sentence T in the target language

compute the score p(T|S) for each

pick the best one

intractable: there are |vocab|^N translations for output length N (e.g. with a 50,000-word vocabulary and N = 20, roughly 10^94 candidates)
→ we need an approximate search strategy


Decoding

approximate search 1: greedy search
at each time step, compute the probability distribution $p(y_i \mid S, y_{<i})$

select $y_i$ according to some heuristic:

sampling: sample from $p(y_i \mid S, y_{<i})$
greedy search: pick $\arg\max_y p(y_i \mid S, y_{<i})$

continue until we generate <eos>

[Figure: greedy search example generating "hello world ! <eos>", annotating each chosen word with its probability (e.g. 0.946, 0.957, 0.928, 0.999) and the accumulated negative log-probability.]

efficient, but suboptimal
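A sketch of greedy decoding under these definitions. `step_fn(state, prev_id)` is a hypothetical wrapper that runs one decoder step (e.g. the attention step sketched earlier) and returns the updated state together with the distribution over the next word.

```python
import numpy as np

def greedy_decode(step_fn, init_state, eos_id, max_len=100):
    """Greedy search sketch: pick the most probable word at every step."""
    state, prev = init_state, None        # assume step_fn treats prev=None as <bos>
    output = []
    for _ in range(max_len):
        state, probs = step_fn(state, prev)
        prev = int(np.argmax(probs))      # greedy choice: argmax_y p(y_i | S, y_<i)
        if prev == eos_id:
            break
        output.append(prev)
    return output
```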


Decoding

approximate search 2: beam search

maintain a list of K hypotheses (the beam)

at each time step, expand each hypothesis k: $p(y_i^k \mid S, y_{<i}^k)$

select the K hypotheses with the highest total probability: $\prod_i p(y_i^k \mid S, y_{<i}^k)$

[Figure: beam search example with K = 3 generating "hello world ! <eos>"; at each step every hypothesis is annotated with its word probability and accumulated negative log-probability, and all but the K best hypotheses are pruned.]

relatively efficient (beam expansion is parallelisable)

currently the default search strategy in neural machine translation

a small beam (K ≈ 10) offers a good speed-quality trade-off
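A corresponding beam-search sketch, reusing the same hypothetical `step_fn` wrapper as in the greedy sketch; hypotheses are scored by summed log-probability and the K best are kept at every step (no length normalisation, purely illustrative).

```python
import numpy as np

def beam_search(step_fn, init_state, eos_id, beam_size=10, max_len=100):
    """Beam search sketch: keep the beam_size best partial hypotheses."""
    # each hypothesis: (score = summed log-prob, token list, decoder state, prev token)
    beam = [(0.0, [], init_state, None)]      # assume step_fn treats prev=None as <bos>
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, toks, state, prev in beam:
            new_state, probs = step_fn(state, prev)
            # expand with the beam_size locally best continuations
            for y in np.argsort(probs)[-beam_size:]:
                candidates.append((score + np.log(probs[y]),
                                   toks + [int(y)], new_state, int(y)))
        # keep the beam_size globally best hypotheses; move finished ones aside
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            (finished if cand[3] == eos_id else beam).append(cand)
        if not beam:
            break
    best = max(finished + beam, key=lambda c: c[0])
    return best[1]
```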


Ensembles

combine the decisions of multiple classifiers by voting

an ensemble will reduce error if these conditions are met:
base classifiers are accurate
base classifiers are diverse (make different errors)


Ensembles in NMT

vote at each time step to explore the same search space
(better than decoding with one model and reranking the n-best list with the others)

voting mechanism: typically average (log-)probability

$$\log P(y_i \mid S, y_{<i}) = \frac{\sum_{m=1}^{M} \log P_m(y_i \mid S, y_{<i})}{M}$$
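A sketch of this voting step: each model advances its own decoder state, and the per-model log-probabilities are averaged as in the formula above. `step_fns[m]` is the same hypothetical per-model decoder-step wrapper used in the decoding sketches; the averaged distribution can then drive greedy or beam search in place of a single model's output.

```python
import numpy as np

def ensemble_step(step_fns, states, prev_id):
    """Average log-probabilities of M models at one decoding time step."""
    new_states, logprob_sum = [], None
    for step_fn, state in zip(step_fns, states):
        new_state, probs = step_fn(state, prev_id)
        new_states.append(new_state)
        logp = np.log(probs)
        logprob_sum = logp if logprob_sum is None else logprob_sum + logp
    avg_logprob = logprob_sum / len(step_fns)   # average log-probability over M models
    return new_states, avg_logprob
```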

requirements for voting at each time step:
same output vocabulary
same factorization of Y
but: internal network architecture may be different

we still use reranking in some situations
example: combine left-to-right decoding and right-to-left decoding


Further Reading

Required Reading
Koehn, 13.5

Optional Reading
Sequence to Sequence Learning with Neural Networks (Sutskever, Vinyals, Le): https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio): https://arxiv.org/pdf/1409.0473.pdf


Bibliography

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Cho, K., Courville, A., and Bengio, Y. (2015). Describing Multimedia Content using Attention-based Encoder-Decoder Networks.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Junczys-Dowmunt, M. and Grundkiewicz, R. (2016). Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. In Proceedings of the First Conference on Machine Translation, pages 751–758, Berlin, Germany. Association for Computational Linguistics.

Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Miceli Barone, A. V., Mokry, J., and Nadejde, M. (2017). Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain.


Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.
