Source: homepages.inf.ed.ac.uk/rsennric/mt18/4.pdf
Machine Translation 04: Neural Machine Translation
Rico Sennrich
University of Edinburgh
R. Sennrich MT – 2018 – 04 1 / 20
Overview
last lecture:
how do we represent language in neural networks?
how do we treat language probabilistically (with neural networks)?

today's lecture:
how do we model translation with a neural network?
how do we generate text from a probabilistic translation model?
Modelling Translation
Suppose that we have:
a source sentence S of length m: (x_1, \dots, x_m)
a target sentence T of length n: (y_1, \dots, y_n)

We can express translation as a probabilistic model:

T^* = \argmax_T p(T \mid S)

Expanding using the chain rule gives:

p(T \mid S) = p(y_1, \dots, y_n \mid x_1, \dots, x_m)
            = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_m)
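The chain-rule factorisation above can be sketched in a few lines of code. This is a minimal illustration only: `toy_conditional` is a hypothetical stand-in for the decoder's conditional distribution, here simply uniform over a 4-word vocabulary.

```python
import math

# Hypothetical stand-in for the model's conditional p(y_i | y_<i, x):
# a uniform distribution over a 4-word vocabulary, purely for illustration.
# In a real NMT system this conditional comes from the decoder network.
def toy_conditional(y_i, y_history, x):
    return 0.25

def sentence_log_prob(target, source, conditional=toy_conditional):
    """log p(T|S) = sum_i log p(y_i | y_1..y_{i-1}, x_1..x_m)."""
    log_p = 0.0
    for i, y_i in enumerate(target):
        log_p += math.log(conditional(y_i, target[:i], source))
    return log_p

score = sentence_log_prob(["of", "course", "john", "has", "fun"],
                          ["natürlich", "hat", "john", "spaß"])
```

Working in log space, as here, avoids numerical underflow when many small probabilities are multiplied.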
Differences Between Translation and Language Model
Target-side language model:

p(T) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1})

Translation model:

p(T \mid S) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_m)

We could just treat the sentence pair as one long sequence, but:
we do not care about p(S)
we may want a different vocabulary and network architecture for the source text
→ Use separate RNNs for source and target.
Encoder-Decoder for Translation
[Figure: encoder-decoder. An encoder RNN reads the source "natürlich hat john spaß" (inputs x_1..x_4, hidden states h_1..h_4); its final state initialises a decoder RNN that generates "of course john has fun" (states s_1..s_5, outputs y_1..y_5).]
Summary vector
The last encoder hidden state "summarises" the source sentence.
With multilingual training, we can potentially learn a language-independent meaning representation.
[Sutskever et al., 2014]
Summary vector as information bottleneck
Problem: sentence length
A fixed-size representation degrades as sentence length increases [Cho et al., 2014].
Reversing the source brings some improvement [Sutskever et al., 2014].

Solution: attention
Compute the context vector as a weighted average of the source hidden states.
The weights are computed by a feed-forward network with softmax activation.
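The weighted average can be sketched as follows. This is an illustrative NumPy sketch of Bahdanau-style additive attention, not the lecture's reference implementation; the matrix names mirror the notation used later in these slides, and the toy dimensions are made up.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def context_vector(s_prev, encoder_states, W_a, U_a, v_a):
    """Score each source hidden state with a small feed-forward network,
    normalise with softmax, and return the weighted average (context vector)."""
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in encoder_states])
    alpha = softmax(scores)                                 # attention weights, sum to 1
    c = sum(a * h for a, h in zip(alpha, encoder_states))   # weighted average
    return c, alpha

# toy dimensions, for illustration only
rng = np.random.default_rng(0)
d, da = 4, 3
H = [rng.standard_normal(d) for _ in range(5)]    # five encoder hidden states
c, alpha = context_vector(rng.standard_normal(d), H,
                          rng.standard_normal((da, d)),
                          rng.standard_normal((da, d)),
                          rng.standard_normal(da))
```

Because the weights are a softmax output, the context vector always lies in the convex hull of the encoder states, however long the sentence is.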
Encoder-Decoder with Attention
[Figure: attentional encoder-decoder. A bidirectional encoder reads "natürlich hat john spaß" (states h_1..h_4). At each decoder step, attention weights over the source positions (e.g. 0.7, 0.1, 0.1, 0.1) form a weighted sum of the encoder states, which feeds the decoder generating "of course john has fun"; the weights shift to the relevant source position as decoding proceeds.]
Attentional encoder-decoder: Maths
Simplifications of the model by [Bahdanau et al., 2015] (for illustration):
plain RNN instead of GRU
simpler output layer
we do not show bias terms
decoder follows the Look, Update, Generate strategy [Sennrich et al., 2017]
Details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

Notation: W, U, E, C, V are weight matrices (of different dimensionality):
E: one-hot to embedding (e.g. 50000 × 512)
W: embedding to hidden (e.g. 512 × 1024)
U: hidden to hidden (e.g. 1024 × 1024)
C: context (2× hidden) to hidden (e.g. 2048 × 1024)
V_o: hidden to one-hot (e.g. 1024 × 50000)
separate weight matrices for encoder and decoder (e.g. E_x and E_y)
input X of length T_x; output Y of length T_y
Attentional encoder-decoder: Maths
encoder:

\overrightarrow{h}_j =
\begin{cases}
0 & \text{if } j = 0 \\
\tanh(\overrightarrow{W}_x E_x x_j + \overrightarrow{U}_x \overrightarrow{h}_{j-1}) & \text{if } j > 0
\end{cases}

\overleftarrow{h}_j =
\begin{cases}
0 & \text{if } j = T_x + 1 \\
\tanh(\overleftarrow{W}_x E_x x_j + \overleftarrow{U}_x \overleftarrow{h}_{j+1}) & \text{if } j \le T_x
\end{cases}

h_j = (\overrightarrow{h}_j, \overleftarrow{h}_j)
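The bidirectional encoder equations can be sketched directly in NumPy. This follows the simplified plain-RNN form above (no GRU, no biases); all sizes and the random weights are made-up toy values.

```python
import numpy as np

def bidirectional_encode(X, E, W_f, U_f, W_b, U_b):
    """Run one plain RNN left-to-right and one right-to-left over the embedded
    input, then concatenate the two hidden states at each position j."""
    Tx, d = len(X), U_f.shape[0]
    fwd = np.zeros((Tx + 1, d))      # fwd[0] is the zero initial state
    bwd = np.zeros((Tx + 2, d))      # bwd[Tx + 1] is the zero initial state
    for j in range(1, Tx + 1):       # forward pass, j = 1..Tx
        fwd[j] = np.tanh(W_f @ E[X[j - 1]] + U_f @ fwd[j - 1])
    for j in range(Tx, 0, -1):       # backward pass, j = Tx..1
        bwd[j] = np.tanh(W_b @ E[X[j - 1]] + U_b @ bwd[j + 1])
    return [np.concatenate([fwd[j], bwd[j]]) for j in range(1, Tx + 1)]

# toy sizes: vocabulary 10, embedding 6, hidden 4 (illustrative only)
rng = np.random.default_rng(1)
E = rng.standard_normal((10, 6))
enc = bidirectional_encode([3, 1, 4, 1], E,
                           rng.standard_normal((4, 6)), rng.standard_normal((4, 4)),
                           rng.standard_normal((4, 6)), rng.standard_normal((4, 4)))
```

Each h_j thus has twice the hidden dimensionality, which is why the context matrix C later maps from 2× hidden.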
Attentional encoder-decoder: Maths
decoder:

s_i =
\begin{cases}
\tanh(W_s \overleftarrow{h}_1) & \text{if } i = 0 \\
\tanh(W_y E_y y_{i-1} + U_y s_{i-1} + C c_i) & \text{if } i > 0
\end{cases}

t_i = \tanh(U_o s_i + W_o E_y y_{i-1} + C_o c_i)

y_i = \mathrm{softmax}(V_o t_i)

attention model:

e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)

\alpha_{ij} = \mathrm{softmax}(e_{ij})

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
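One decoder step can be sketched as follows, mirroring the state-update, readout, and output equations above; the toy dimensions and random weights are made up for illustration, and the context vector c_i is assumed to have been computed by the attention model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(s_prev, Ey_prev, c_i, W_y, U_y, C, U_o, W_o, C_o, V_o):
    """One decoder step: update the state s_i, compute the readout t_i,
    then the output distribution over the target vocabulary."""
    s_i = np.tanh(W_y @ Ey_prev + U_y @ s_prev + C @ c_i)   # state update
    t_i = np.tanh(U_o @ s_i + W_o @ Ey_prev + C_o @ c_i)    # readout
    return s_i, softmax(V_o @ t_i)                          # distribution over vocab

# toy dimensions: embedding 6, hidden 4, context 8, vocabulary 10
rng = np.random.default_rng(2)
s_i, p_i = decoder_step(rng.standard_normal(4), rng.standard_normal(6),
                        rng.standard_normal(8),
                        rng.standard_normal((4, 6)), rng.standard_normal((4, 4)),
                        rng.standard_normal((4, 8)),
                        rng.standard_normal((4, 4)), rng.standard_normal((4, 6)),
                        rng.standard_normal((4, 8)), rng.standard_normal((10, 4)))
```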
Attention model
attention model:
side effect: we obtain an alignment between the source and target sentence
information can also flow along recurrent connections, so there is no guarantee that attention corresponds to alignment
applications:
visualisation
replace unknown words with a back-off dictionary [Jean et al., 2015]
...
Image: Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
Attention model
The attention model also works with images [Cho et al., 2015].
Application of Encoder-Decoder Model
Scoring (a translation):
p(La, croissance, économique, s'est, ralentie, ces, dernières, années, . | Economic, growth, has, slowed, down, in, recent, years, .) = ?

Decoding (a source sentence):
Generate the most probable translation of a source sentence:

y^* = \argmax_y p(y \mid \text{Economic, growth, has, slowed, down, in, recent, years, .})
Decoding
exact search:
generate every possible sentence T in the target language
compute the score p(T | S) for each
pick the best one

intractable: |vocab|^N translations for output length N
→ we need an approximate search strategy
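To see why enumeration is hopeless, the count can be worked out directly; the vocabulary size and output length below are hypothetical but typical orders of magnitude.

```python
# Size of the exact search space |vocab|^N for illustrative, made-up numbers:
# a 50k-word vocabulary and output length 20.
vocab_size = 50000
output_length = 20
num_candidates = vocab_size ** output_length
# a 94-digit number: enumeration is impossible in practice
```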
Decoding
approximate search (1): greedy search
at each time step, compute the probability distribution P(y_i | S, y_{<i})
select y_i according to some heuristic:
sampling: sample from P(y_i | S, y_{<i})
greedy search: pick argmax_y P(y_i | S, y_{<i})
continue until we generate <eos>

[Figure: greedy search generating "hello world ! <eos>", with per-token probabilities 0.946, 0.957, 0.928, 0.999 and cumulative negative log-probabilities below each token.]
efficient, but suboptimal
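Greedy search can be sketched in a few lines. `step_fn` is a hypothetical interface standing in for one decoder step (state, previous token) → (new state, probability vector); the toy model below is made up to produce "hello world ! <eos>".

```python
import numpy as np

def greedy_decode(step_fn, init_state, eos_id, max_len=50):
    """At each step pick the most probable token from the model's
    distribution; stop when <eos> is generated or max_len is reached."""
    tokens, state, prev = [], init_state, None
    for _ in range(max_len):
        state, p = step_fn(state, prev)
        prev = int(np.argmax(p))      # greedy choice
        tokens.append(prev)
        if prev == eos_id:
            break
    return tokens

# toy model over a 4-token vocabulary {0: hello, 1: world, 2: "!", 3: <eos>}
def toy_step(state, prev):
    table = {None: [0.90, 0.05, 0.03, 0.02],
             0:    [0.05, 0.90, 0.03, 0.02],
             1:    [0.03, 0.02, 0.90, 0.05],
             2:    [0.01, 0.01, 0.01, 0.97]}
    return state, np.array(table[prev])

out = greedy_decode(toy_step, None, eos_id=3)
```

Replacing the argmax with a draw from the distribution would give the sampling heuristic instead.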
Decoding
approximate search (2): beam search
maintain a list of K hypotheses (the beam)
at each time step, expand each hypothesis k: p(y_i^k | S, y_{<i}^k)
select the K hypotheses with the highest total probability:

\prod_i p(y_i^k \mid S, y_{<i}^k)

[Figure: beam search with K = 3 expanding hypotheses such as "hello world ! <eos>"; each token is labelled with its probability and each partial hypothesis with its cumulative negative log-probability.]
relatively efficient: beam expansion is parallelisable
currently the default search strategy in neural machine translation
a small beam (K ≈ 10) offers a good speed-quality trade-off
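A minimal beam-search sketch, under the same hypothetical `step_fn` interface as the greedy sketch ((state, previous token) → (new state, probability vector)); the toy model and all numbers are made up for illustration.

```python
import numpy as np

def beam_search(step_fn, init_state, eos_id, K=3, max_len=50):
    """Keep the K best partial hypotheses by total log-probability, expand each
    with its K most probable next tokens, collect hypotheses ending in <eos>."""
    beam = [(0.0, [], init_state)]            # (log-prob, tokens, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, toks, state in beam:
            new_state, p = step_fn(state, toks[-1] if toks else None)
            for y in np.argsort(p)[-K:]:      # top-K expansions per hypothesis
                candidates.append((logp + np.log(p[y]), toks + [int(y)], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:K]:           # prune back to K hypotheses
            (finished if cand[1][-1] == eos_id else beam).append(cand)
        if not beam:
            break
    best = max(finished if finished else beam, key=lambda c: c[0])
    return best[1]

# toy model over a 4-token vocabulary {0: hello, 1: world, 2: "!", 3: <eos>}
def toy_step(state, prev):
    table = {None: [0.90, 0.05, 0.03, 0.02],
             0:    [0.05, 0.90, 0.03, 0.02],
             1:    [0.03, 0.02, 0.90, 0.05],
             2:    [0.01, 0.01, 0.01, 0.97]}
    return state, np.array(table[prev])

best = beam_search(toy_step, None, eos_id=3, K=2, max_len=10)
```

Note that this sketch compares hypotheses of different lengths by raw log-probability; real systems often length-normalise the scores to avoid a bias towards short outputs.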
Ensembles
combine the decisions of multiple classifiers by voting
an ensemble will reduce error if these conditions are met:
the base classifiers are accurate
the base classifiers are diverse (make different errors)
Ensembles in NMT
vote at each time step to explore the same search space
(better than decoding with one model and reranking an n-best list with the others)
voting mechanism: typically the average (log-)probability

\log P(y_i \mid S, y_{<i}) = \frac{\sum_{m=1}^{M} \log P_m(y_i \mid S, y_{<i})}{M}

requirements for voting at each time step:
same output vocabulary
same factorization of Y
but: the internal network architecture may be different
we still use reranking in some situations
example: combine left-to-right decoding and right-to-left decoding
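The log-probability averaging at one time step can be sketched as follows; averaging in log space amounts to a (renormalised) geometric mean of the per-model distributions. The example distributions are made up.

```python
import numpy as np

def ensemble_distribution(model_probs):
    """Average the per-model log-probabilities at one time step and
    renormalise. `model_probs` is a list of M probability vectors over
    the shared output vocabulary."""
    avg_logp = np.mean([np.log(p) for p in model_probs], axis=0)
    e = np.exp(avg_logp - avg_logp.max())   # subtract max for stability
    return e / e.sum()

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.5, 0.3, 0.2])
p = ensemble_distribution([p1, p2])
```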
Further Reading
Required reading: Koehn, 13.5

Optional reading:
Sequence to Sequence Learning with Neural Networks (Sutskever, Vinyals, Le):
https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio):
https://arxiv.org/pdf/1409.0473.pdf
Bibliography I
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Cho, K., Courville, A., and Bengio, Y. (2015). Describing Multimedia Content using Attention-based Encoder-Decoder Networks.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Junczys-Dowmunt, M. and Grundkiewicz, R. (2016). Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. In Proceedings of the First Conference on Machine Translation, pages 751–758, Berlin, Germany. Association for Computational Linguistics.

Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Miceli Barone, A. V., Mokry, J., and Nadejde, M. (2017). Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain.
Bibliography II
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.