
Machine Translation 04: Neural Machine Translation

Rico Sennrich

University of Edinburgh


Overview

last lecture:
how do we represent language in neural networks?
how do we treat language probabilistically (with neural networks)?

today's lecture:
how do we model translation with a neural network?
how do we generate text from a probabilistic translation model?


Modelling Translation

Suppose that we have:
a source sentence S of length m: $(x_1, \ldots, x_m)$
a target sentence T of length n: $(y_1, \ldots, y_n)$

We can express translation as a probabilistic model:

$T^* = \operatorname{argmax}_T \; p(T \mid S)$

Expanding using the chain rule gives

$p(T \mid S) = p(y_1, \ldots, y_n \mid x_1, \ldots, x_m) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)$

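To make the factorization concrete, here is a minimal sketch (not from the slides) of how a translation is scored under this model. The per-token conditional probabilities are made-up numbers; in practice we sum log-probabilities rather than multiplying probabilities directly.

```python
import math

# hypothetical per-token conditional probabilities p(y_i | y_1..y_{i-1}, x_1..x_m)
# for a target sentence of six tokens; the values are made up for illustration
token_probs = [0.41, 0.82, 0.67, 0.90, 0.73, 0.95]

# p(T|S) is the product of the conditionals; summing log-probabilities avoids
# numerical underflow on long sentences
log_p = sum(math.log(p) for p in token_probs)
print(f"log p(T|S) = {log_p:.3f}   p(T|S) = {math.exp(log_p):.5f}")
```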

Differences Between Translation and Language Model

Target-side language model:

$p(T) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1})$

Translation model:

$p(T \mid S) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)$

We could just treat the sentence pair as one long sequence, but:
we do not care about p(S)
we may want a different vocabulary, network architecture for the source text

→ Use separate RNNs for source and target.



Encoder-Decoder for Translation

[Figure: encoder-decoder architecture. An encoder RNN reads the source sentence "natürlich hat john spaß" into hidden states h_1..h_4; a decoder RNN, initialised from the final encoder state, generates the target sentence "of course john has fun" one word at a time (states s_1..s_5, outputs y_1..y_5).]


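As a rough illustration of the figure above (not part of the original slides), the following numpy sketch wires up two separate plain tanh RNNs with random, untrained toy weights: the encoder compresses the source into its last hidden state, and the decoder is initialised from that state and predicts one target token at a time. Variable names follow the slide notation, but all dimensions and values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, vocab = 8, 16, 20            # toy sizes; real systems use hundreds/thousands

# separate, randomly initialised weights for encoder (source) and decoder (target)
Ex = rng.normal(0, 0.1, (vocab, d_emb))    # source embeddings
Wx, Ux = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))
Ey = rng.normal(0, 0.1, (vocab, d_emb))    # target embeddings
Wy, Uy = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))
Vo = rng.normal(0, 0.1, (d_hid, vocab))    # hidden -> output vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# encoder: read the source token ids; the last hidden state is the summary vector
source = [3, 7, 1, 12]                     # "natürlich hat john spaß" as toy ids
h = np.zeros(d_hid)
for x in source:
    h = np.tanh(Ex[x] @ Wx + h @ Ux)

# decoder: start from the summary vector and predict one target token at a time
s, y = h, 0                                # y = 0 is a hypothetical <s> token id
for _ in range(5):
    s = np.tanh(Ey[y] @ Wy + s @ Uy)
    y = int(np.argmax(softmax(s @ Vo)))    # greedy choice of the next target token
    print(y, end=" ")
```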

Summary vector

Last encoder hidden-state “summarises” source sentence

With multilingual training, we can potentially learn a language-independent meaning representation

[Sutskever et al., 2014]


Summary vector as information bottleneck

Problem: Sentence Length
A fixed-size representation degrades as sentence length increases.

Reversing the source brings some improvement [Sutskever et al., 2014]
[Cho et al., 2014]

Solution: Attention
Compute the context vector as a weighted average of the source hidden states.

Weights are computed by a feed-forward network with softmax activation.


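A minimal sketch of the attention idea on this slide: the context vector is just a softmax-weighted average of the encoder hidden states. The scores here are arbitrary made-up numbers standing in for the feed-forward network.

```python
import numpy as np

# encoder hidden states for a 4-word source sentence (toy 3-dimensional states)
H = np.array([[ 0.1,  0.3, -0.2],
              [ 0.4, -0.1,  0.2],
              [ 0.0,  0.5,  0.1],
              [-0.3,  0.2,  0.4]])

# the scores would come from a small feed-forward network over (decoder state, h_j);
# here they are arbitrary numbers, just to show the softmax + weighted average
scores = np.array([2.0, 0.1, -0.5, 0.3])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                    # attention weights, sum to 1

context = alpha @ H                     # weighted average of the encoder states
print(alpha.round(2), context.round(3))
```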

Encoder-Decoder with Attention

[Figure: encoder-decoder with attention, shown as an animation over five decoding steps. The encoder states h_1..h_4 for "natürlich hat john spaß" are combined into a context vector as a weighted average; the decoder (states s_1..s_5) generates "of course john has fun". The attention weights shift as decoding proceeds: most of the weight (0.6–0.7) is on "natürlich" while generating "of" and "course", then on "john" for "john", on "hat" for "has", and on "spaß" for "fun", with the remaining weight spread over the other source words.]

Attentional encoder-decoder: Maths

simplifications of the model by [Bahdanau et al., 2015] (for illustration):
plain RNN instead of GRU
simpler output layer
we do not show bias terms
decoder follows the Look, Update, Generate strategy [Sennrich et al., 2017]

Details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

notation:
W, U, E, C, V are weight matrices (of different dimensionality)
E: one-hot to embedding (e.g. 50000 · 512)
W: embedding to hidden (e.g. 512 · 1024)
U: hidden to hidden (e.g. 1024 · 1024)
C: context (2x hidden) to hidden (e.g. 2048 · 1024)
V_o: hidden to one-hot (e.g. 1024 · 50000)

separate weight matrices for encoder and decoder (e.g. $E_x$ and $E_y$)

input X of length $T_x$; output Y of length $T_y$

Attentional encoder-decoder: Maths

encoder

$\overrightarrow{h}_j = \begin{cases} 0 & \text{if } j = 0 \\ \tanh(\overrightarrow{W}_x E_x x_j + \overrightarrow{U}_x \overrightarrow{h}_{j-1}) & \text{if } j > 0 \end{cases}$

$\overleftarrow{h}_j = \begin{cases} 0 & \text{if } j = T_x + 1 \\ \tanh(\overleftarrow{W}_x E_x x_j + \overleftarrow{U}_x \overleftarrow{h}_{j+1}) & \text{if } j \leq T_x \end{cases}$

$h_j = (\overrightarrow{h}_j, \overleftarrow{h}_j)$

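The encoder equations above, written out as a small numpy sketch. Variable names follow the slide's notation, but the dimensions, token ids, and random weights are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
Tx, vocab, d_emb, d_hid = 4, 20, 8, 16

Ex = rng.normal(0, 0.1, (vocab, d_emb))
Wf, Uf = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))  # forward RNN
Wb, Ub = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))  # backward RNN

x = [3, 7, 1, 12]                          # toy source token ids, length Tx

# forward states: h_0 = 0, h_j = tanh(W E_x x_j + U h_{j-1})
fwd = [np.zeros(d_hid)]
for j in range(Tx):
    fwd.append(np.tanh(Ex[x[j]] @ Wf + fwd[-1] @ Uf))

# backward states: h_{Tx+1} = 0, h_j = tanh(W E_x x_j + U h_{j+1})
bwd = [np.zeros(d_hid)]
for j in reversed(range(Tx)):
    bwd.append(np.tanh(Ex[x[j]] @ Wb + bwd[-1] @ Ub))
bwd = bwd[:0:-1]                           # reorder so bwd[j] belongs to position j

# annotation h_j concatenates the forward and backward states (dimension 2 * d_hid)
H = np.stack([np.concatenate([fwd[j + 1], bwd[j]]) for j in range(Tx)])
print(H.shape)                             # (Tx, 2 * d_hid)
```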

Attentional encoder-decoder: Maths

decoder

$s_i = \begin{cases} \tanh(W_s \overleftarrow{h}_1) & \text{if } i = 0 \\ \tanh(W_y E_y y_{i-1} + U_y s_{i-1} + C c_i) & \text{if } i > 0 \end{cases}$

$t_i = \tanh(U_o s_i + W_o E_y y_{i-1} + C_o c_i)$

$y_i = \operatorname{softmax}(V_o t_i)$

attention model

$e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$

$\alpha_{ij} = \operatorname{softmax}(e_{ij})$

$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

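A corresponding sketch of a single decoder step with the attention model above. The weights are random toy values and H is a random stand-in for the encoder annotations (e.g. as produced by the previous sketch); nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(2)
Tx, vocab, d_emb, d_hid = 4, 20, 8, 16

H = rng.normal(0, 0.1, (Tx, 2 * d_hid))    # stand-in for the encoder annotations h_1..h_Tx
Ey = rng.normal(0, 0.1, (vocab, d_emb))
Wy, Uy = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))
C = rng.normal(0, 0.1, (2 * d_hid, d_hid))
Uo, Wo = rng.normal(0, 0.1, (d_hid, d_hid)), rng.normal(0, 0.1, (d_emb, d_hid))
Co = rng.normal(0, 0.1, (2 * d_hid, d_hid))
Vo = rng.normal(0, 0.1, (d_hid, vocab))
Wa, Ua = rng.normal(0, 0.1, (d_hid, d_hid)), rng.normal(0, 0.1, (2 * d_hid, d_hid))
va = rng.normal(0, 0.1, d_hid)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s_prev, y_prev = np.zeros(d_hid), 0        # previous decoder state and target token id

# attention: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j); alpha = softmax; c_i = sum_j alpha_ij h_j
e = np.tanh(s_prev @ Wa + H @ Ua) @ va     # one score per source position
alpha = softmax(e)
c = alpha @ H                              # context vector, dimension 2 * d_hid

# decoder state, intermediate layer, output distribution (Look, Update, Generate)
s = np.tanh(Ey[y_prev] @ Wy + s_prev @ Uy + c @ C)
t = np.tanh(s @ Uo + Ey[y_prev] @ Wo + c @ Co)
p_y = softmax(t @ Vo)
print(alpha.round(2), int(p_y.argmax()))
```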

Attention model

attention model
side effect: we obtain alignment between source and target sentence

information can also flow along recurrent connections, so there is no guarantee that attention corresponds to alignment

applications:
visualisation
replace unknown words with back-off dictionary [Jean et al., 2015]
...

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

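Since the attention weights form a soft alignment matrix, visualisation is straightforward. A small sketch (not from the slides) using matplotlib, with made-up weights matching the earlier toy example:

```python
import matplotlib.pyplot as plt
import numpy as np

source = ["natürlich", "hat", "john", "spaß"]
target = ["of", "course", "john", "has", "fun"]

# attention weights alpha[i, j] of target word i over source word j (made-up values)
alpha = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.6, 0.2, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1],
                  [0.1, 0.7, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])

plt.imshow(alpha, cmap="gray_r")             # darker cell = higher attention weight
plt.xticks(range(len(source)), source, rotation=45)
plt.yticks(range(len(target)), target)
plt.xlabel("source")
plt.ylabel("target")
plt.show()
```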

Attention model

attention model also works with images:

[Cho et al., 2015]


[Figure: image-caption attention example from [Cho et al., 2015].]

Application of Encoder-Decoder Model

Scoring (a translation):
p(La, croissance, économique, s’est, ralentie, ces, dernières, années, . | Economic, growth, has, slowed, down, in, recent, years, .) = ?

Decoding (a source sentence):
generate the most probable translation of a source sentence

y* = argmax_y p(y | Economic, growth, has, slowed, down, in, recent, years, .)


Decoding

exact search:
generate every possible sentence T in the target language

compute the score p(T|S) for each

pick the best one

intractable: |vocab|^N translations for output length N
→ we need an approximate search strategy

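To get a feel for why exact search is intractable, assume (for illustration only) a target vocabulary of 50,000 words and an output length of 20:

```python
# 50,000 possible tokens at each of 20 output positions: far too many to enumerate
print(f"{50_000 ** 20:.2e}")   # ≈ 9.5e+93 candidate translations
```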

Decoding

approximate search (1): greedy search

at each time step, compute the probability distribution $P(y_i \mid S, y_{<i})$

select $y_i$ according to some heuristic:

sampling: sample from $P(y_i \mid S, y_{<i})$
greedy search: pick $\operatorname{argmax}_{y_i} p(y_i \mid S, y_{<i})$

continue until we generate <eos>

[Figure: greedy decoding of "hello world ! <eos>", showing the probability of the chosen token at each step (0.946, 0.957, 0.928, 0.999) and the accumulated score.]

efficient, but suboptimal

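A minimal sketch of greedy search (with sampling as an option). The `predict_next(source, prefix)` interface and the `toy_model` below are hypothetical stand-ins for the model's per-step distribution $P(y_i \mid S, y_{<i})$.

```python
import numpy as np

def greedy_decode(predict_next, source, eos_id, max_len=50, sample=False, rng=None):
    """Decode left to right. `predict_next(source, prefix)` is assumed to return a
    probability distribution over the target vocabulary (hypothetical interface)."""
    rng = rng or np.random.default_rng()
    prefix = []
    for _ in range(max_len):
        p = predict_next(source, prefix)
        y = int(rng.choice(len(p), p=p)) if sample else int(np.argmax(p))
        prefix.append(y)
        if y == eos_id:                       # stop once <eos> is generated
            break
    return prefix

# toy stand-in for the model: prefers token 1, then <eos> (id 3) after three steps
def toy_model(source, prefix):
    p = np.full(4, 0.1)
    p[3 if len(prefix) >= 3 else 1] = 0.7
    return p / p.sum()

print(greedy_decode(toy_model, source=[5, 6], eos_id=3))   # [1, 1, 1, 3]
```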

Decoding

approximate search (2): beam search

maintain a list of K hypotheses (the beam)

at each time step, expand each hypothesis k: $p(y^k_i \mid S, y^k_{<i})$

select the K hypotheses with the highest total probability: $\prod_i p(y^k_i \mid S, y^k_{<i})$

[Figure: beam search with K = 3 decoding "hello world ! <eos>". At each step the three highest-scoring partial hypotheses are kept; lower-probability alternatives such as "World", ".", "...", "HI" and "Hey" accumulate worse scores and drop out of the beam.]

relatively efficient: beam expansion is parallelisable

currently default search strategy in neural machine translation

small beam (K ≈ 10) offers good speed-quality trade-off

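A corresponding sketch of beam search over the same hypothetical `predict_next` interface, scoring each hypothesis by the sum of its token log-probabilities:

```python
import numpy as np

def beam_search(predict_next, source, eos_id, k=3, max_len=50):
    """Keep the K highest-scoring partial hypotheses; score = sum of token log-probs.
    `predict_next(source, prefix)` returns a distribution over the target vocabulary."""
    beam = [([], 0.0)]                        # (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            p = predict_next(source, prefix)
            for y in np.argsort(p)[-k:]:      # expanding each hypothesis's top-k tokens suffices
                candidates.append((prefix + [int(y)], score + float(np.log(p[y]))))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates[:k]:  # keep the K best overall
            (finished if prefix[-1] == eos_id else beam).append((prefix, score))
        if not beam:                          # all surviving hypotheses have ended
            break
    return max(finished + beam, key=lambda c: c[1])

# reuse the toy stand-in model from the greedy-search sketch
def toy_model(source, prefix):
    p = np.full(4, 0.1)
    p[3 if len(prefix) >= 3 else 1] = 0.7
    return p / p.sum()

print(beam_search(toy_model, source=[5, 6], eos_id=3, k=3))
```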

Ensembles

combine decisions of multiple classifiers by voting
an ensemble will reduce error if these conditions are met:

base classifiers are accurate
base classifiers are diverse (make different errors)


Ensembles in NMT

vote at each time step to explore the same search space
(better than decoding with one model and reranking the n-best list with the others)

voting mechanism: typically average (log-)probability

$\log P(y_i \mid S, y_{<i}) = \frac{1}{M} \sum_{m=1}^{M} \log P_m(y_i \mid S, y_{<i})$

requirements for voting at each time step:
same output vocabulary
same factorization of Y
but: the internal network architecture may be different

we still use reranking in some situations
example: combine left-to-right decoding and right-to-left decoding

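A sketch of the voting step: each model is assumed to expose a per-step distribution over the shared output vocabulary (a hypothetical interface; the toy models m1 and m2 are made up), and the ensemble averages the log-probabilities as in the formula above.

```python
import numpy as np

def ensemble_log_probs(models, source, prefix):
    """Average the per-token log-probabilities of M models sharing one output vocabulary."""
    log_ps = [np.log(m(source, prefix)) for m in models]   # each: distribution over the vocab
    return np.mean(log_ps, axis=0)

# two toy models over a 4-word vocabulary that disagree on the best next token
m1 = lambda source, prefix: np.array([0.1, 0.6, 0.2, 0.1])
m2 = lambda source, prefix: np.array([0.1, 0.3, 0.5, 0.1])

avg = ensemble_log_probs([m1, m2], source=[5, 6], prefix=[1])
print(avg.round(2), int(avg.argmax()))       # the ensemble's preferred next token
```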

Further Reading

Required Reading
Koehn, 13.5

Optional Reading
Sequence to Sequence Learning with Neural Networks (Sutskever, Vinyals, Le): https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio): https://arxiv.org/pdf/1409.0473.pdf


Bibliography I

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Cho, K., Courville, A., and Bengio, Y. (2015). Describing Multimedia Content using Attention-based Encoder-Decoder Networks.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Junczys-Dowmunt, M. and Grundkiewicz, R. (2016). Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. In Proceedings of the First Conference on Machine Translation, pages 751–758, Berlin, Germany. Association for Computational Linguistics.

Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Miceli Barone, A. V., Mokry, J., and Nadejde, M. (2017). Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain.


Bibliography II

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.
