
Machine Translation 04: Neural Machine Translation

Rico Sennrich

University of Edinburgh


Overview

last lecture:
how do we represent language in neural networks?
how do we treat language probabilistically (with neural networks)?

today's lecture:
how do we model translation with a neural network?
how do we generate text from a probabilistic translation model?


Modelling Translation

Suppose that we have:
a source sentence S of length m: $(x_1, \ldots, x_m)$
a target sentence T of length n: $(y_1, \ldots, y_n)$

We can express translation as a probabilistic model:

$T^* = \operatorname{argmax}_T \; p(T \mid S)$

Expanding using the chain rule gives

$p(T \mid S) = p(y_1, \ldots, y_n \mid x_1, \ldots, x_m) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)$

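To make the factorization concrete, here is a minimal sketch (not from the slides) of how a translation is scored under this model. The per-token conditional probabilities are made-up numbers; in practice we sum log-probabilities rather than multiplying probabilities directly.

```python
import math

# hypothetical per-token conditional probabilities p(y_i | y_1..y_{i-1}, x_1..x_m)
# for a target sentence of six tokens; the values are made up for illustration
token_probs = [0.41, 0.82, 0.67, 0.90, 0.73, 0.95]

# p(T|S) is the product of the conditionals; summing log-probabilities avoids
# numerical underflow on long sentences
log_p = sum(math.log(p) for p in token_probs)
print(f"log p(T|S) = {log_p:.3f}   p(T|S) = {math.exp(log_p):.5f}")
```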

Differences Between Translation and Language Model

Target-side language model:

$p(T) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1})$

Translation model:

$p(T \mid S) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)$

We could just treat the sentence pair as one long sequence, but:
we do not care about p(S)
we may want a different vocabulary, network architecture for the source text

→ Use separate RNNs for source and target.



Encoder-Decoder for Translation

[Figure: encoder-decoder architecture. An encoder RNN reads the source sentence "natürlich hat john spaß" into hidden states h_1..h_4; a decoder RNN, initialised from the final encoder state, generates the target sentence "of course john has fun" one word at a time (states s_1..s_5, outputs y_1..y_5).]


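As a rough illustration of the figure above (not part of the original slides), the following numpy sketch wires up two separate plain tanh RNNs with random, untrained toy weights: the encoder compresses the source into its last hidden state, and the decoder is initialised from that state and predicts one target token at a time. Variable names follow the slide notation, but all dimensions and values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, vocab = 8, 16, 20            # toy sizes; real systems use hundreds/thousands

# separate, randomly initialised weights for encoder (source) and decoder (target)
Ex = rng.normal(0, 0.1, (vocab, d_emb))    # source embeddings
Wx, Ux = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))
Ey = rng.normal(0, 0.1, (vocab, d_emb))    # target embeddings
Wy, Uy = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))
Vo = rng.normal(0, 0.1, (d_hid, vocab))    # hidden -> output vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# encoder: read the source token ids; the last hidden state is the summary vector
source = [3, 7, 1, 12]                     # "natürlich hat john spaß" as toy ids
h = np.zeros(d_hid)
for x in source:
    h = np.tanh(Ex[x] @ Wx + h @ Ux)

# decoder: start from the summary vector and predict one target token at a time
s, y = h, 0                                # y = 0 is a hypothetical <s> token id
for _ in range(5):
    s = np.tanh(Ey[y] @ Wy + s @ Uy)
    y = int(np.argmax(softmax(s @ Vo)))    # greedy choice of the next target token
    print(y, end=" ")
```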

Summary vector

Last encoder hidden-state “summarises” source sentence

With multilingual training, we can potentially learn a language-independent meaning representation

[Sutskever et al., 2014]


Summary vector as information bottleneck

Problem: Sentence Length
A fixed-size representation degrades as sentence length increases.

Reversing the source brings some improvement [Sutskever et al., 2014]
[Cho et al., 2014]

Solution: Attention
Compute the context vector as a weighted average of the source hidden states.

Weights are computed by a feed-forward network with softmax activation.


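A minimal sketch of the attention idea on this slide: the context vector is just a softmax-weighted average of the encoder hidden states. The scores here are arbitrary made-up numbers standing in for the feed-forward network.

```python
import numpy as np

# encoder hidden states for a 4-word source sentence (toy 3-dimensional states)
H = np.array([[ 0.1,  0.3, -0.2],
              [ 0.4, -0.1,  0.2],
              [ 0.0,  0.5,  0.1],
              [-0.3,  0.2,  0.4]])

# the scores would come from a small feed-forward network over (decoder state, h_j);
# here they are arbitrary numbers, just to show the softmax + weighted average
scores = np.array([2.0, 0.1, -0.5, 0.3])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                    # attention weights, sum to 1

context = alpha @ H                     # weighted average of the encoder states
print(alpha.round(2), context.round(3))
```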

Encoder-Decoder with Attention

[Figure: encoder-decoder with attention, shown as an animation over five decoding steps. The encoder states h_1..h_4 for "natürlich hat john spaß" are combined into a context vector as a weighted average; the decoder (states s_1..s_5) generates "of course john has fun". The attention weights shift as decoding proceeds: most of the weight (0.6–0.7) is on "natürlich" while generating "of" and "course", then on "john" for "john", on "hat" for "has", and on "spaß" for "fun", with the remaining weight spread over the other source words.]

Attentional encoder-decoder: Maths

simplifications of the model by [Bahdanau et al., 2015] (for illustration):
plain RNN instead of GRU
simpler output layer
we do not show bias terms
decoder follows the Look, Update, Generate strategy [Sennrich et al., 2017]

Details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

notation:
W, U, E, C, V are weight matrices (of different dimensionality)
E: one-hot to embedding (e.g. 50000 · 512)
W: embedding to hidden (e.g. 512 · 1024)
U: hidden to hidden (e.g. 1024 · 1024)
C: context (2x hidden) to hidden (e.g. 2048 · 1024)
V_o: hidden to one-hot (e.g. 1024 · 50000)

separate weight matrices for encoder and decoder (e.g. $E_x$ and $E_y$)

input X of length $T_x$; output Y of length $T_y$

Attentional encoder-decoder: Maths

encoder

$\overrightarrow{h}_j = \begin{cases} 0 & \text{if } j = 0 \\ \tanh(\overrightarrow{W}_x E_x x_j + \overrightarrow{U}_x \overrightarrow{h}_{j-1}) & \text{if } j > 0 \end{cases}$

$\overleftarrow{h}_j = \begin{cases} 0 & \text{if } j = T_x + 1 \\ \tanh(\overleftarrow{W}_x E_x x_j + \overleftarrow{U}_x \overleftarrow{h}_{j+1}) & \text{if } j \leq T_x \end{cases}$

$h_j = (\overrightarrow{h}_j, \overleftarrow{h}_j)$

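The encoder equations above, written out as a small numpy sketch. Variable names follow the slide's notation, but the dimensions, token ids, and random weights are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
Tx, vocab, d_emb, d_hid = 4, 20, 8, 16

Ex = rng.normal(0, 0.1, (vocab, d_emb))
Wf, Uf = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))  # forward RNN
Wb, Ub = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))  # backward RNN

x = [3, 7, 1, 12]                          # toy source token ids, length Tx

# forward states: h_0 = 0, h_j = tanh(W E_x x_j + U h_{j-1})
fwd = [np.zeros(d_hid)]
for j in range(Tx):
    fwd.append(np.tanh(Ex[x[j]] @ Wf + fwd[-1] @ Uf))

# backward states: h_{Tx+1} = 0, h_j = tanh(W E_x x_j + U h_{j+1})
bwd = [np.zeros(d_hid)]
for j in reversed(range(Tx)):
    bwd.append(np.tanh(Ex[x[j]] @ Wb + bwd[-1] @ Ub))
bwd = bwd[:0:-1]                           # reorder so bwd[j] belongs to position j

# annotation h_j concatenates the forward and backward states (dimension 2 * d_hid)
H = np.stack([np.concatenate([fwd[j + 1], bwd[j]]) for j in range(Tx)])
print(H.shape)                             # (Tx, 2 * d_hid)
```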

Attentional encoder-decoder: Maths

decoder

$s_i = \begin{cases} \tanh(W_s \overleftarrow{h}_1) & \text{if } i = 0 \\ \tanh(W_y E_y y_{i-1} + U_y s_{i-1} + C c_i) & \text{if } i > 0 \end{cases}$

$t_i = \tanh(U_o s_i + W_o E_y y_{i-1} + C_o c_i)$

$y_i = \operatorname{softmax}(V_o t_i)$

attention model

$e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$

$\alpha_{ij} = \operatorname{softmax}(e_{ij})$

$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

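A corresponding sketch of a single decoder step with the attention model above. The weights are random toy values and H is a random stand-in for the encoder annotations (e.g. as produced by the previous sketch); nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(2)
Tx, vocab, d_emb, d_hid = 4, 20, 8, 16

H = rng.normal(0, 0.1, (Tx, 2 * d_hid))    # stand-in for the encoder annotations h_1..h_Tx
Ey = rng.normal(0, 0.1, (vocab, d_emb))
Wy, Uy = rng.normal(0, 0.1, (d_emb, d_hid)), rng.normal(0, 0.1, (d_hid, d_hid))
C = rng.normal(0, 0.1, (2 * d_hid, d_hid))
Uo, Wo = rng.normal(0, 0.1, (d_hid, d_hid)), rng.normal(0, 0.1, (d_emb, d_hid))
Co = rng.normal(0, 0.1, (2 * d_hid, d_hid))
Vo = rng.normal(0, 0.1, (d_hid, vocab))
Wa, Ua = rng.normal(0, 0.1, (d_hid, d_hid)), rng.normal(0, 0.1, (2 * d_hid, d_hid))
va = rng.normal(0, 0.1, d_hid)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s_prev, y_prev = np.zeros(d_hid), 0        # previous decoder state and target token id

# attention: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j); alpha = softmax; c_i = sum_j alpha_ij h_j
e = np.tanh(s_prev @ Wa + H @ Ua) @ va     # one score per source position
alpha = softmax(e)
c = alpha @ H                              # context vector, dimension 2 * d_hid

# decoder state, intermediate layer, output distribution (Look, Update, Generate)
s = np.tanh(Ey[y_prev] @ Wy + s_prev @ Uy + c @ C)
t = np.tanh(s @ Uo + Ey[y_prev] @ Wo + c @ Co)
p_y = softmax(t @ Vo)
print(alpha.round(2), int(p_y.argmax()))
```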

Attention model

attention model
side effect: we obtain alignment between source and target sentence

information can also flow along recurrent connections, so there is no guarantee that attention corresponds to alignment

applications:
visualisation
replace unknown words with back-off dictionary [Jean et al., 2015]
...

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

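Since the attention weights form a soft alignment matrix, visualisation is straightforward. A small sketch (not from the slides) using matplotlib, with made-up weights matching the earlier toy example:

```python
import matplotlib.pyplot as plt
import numpy as np

source = ["natürlich", "hat", "john", "spaß"]
target = ["of", "course", "john", "has", "fun"]

# attention weights alpha[i, j] of target word i over source word j (made-up values)
alpha = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.6, 0.2, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1],
                  [0.1, 0.7, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])

plt.imshow(alpha, cmap="gray_r")             # darker cell = higher attention weight
plt.xticks(range(len(source)), source, rotation=45)
plt.yticks(range(len(target)), target)
plt.xlabel("source")
plt.ylabel("target")
plt.show()
```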

Attention model

attention model also works with images:

[Cho et al., 2015]


[Figure: image-caption attention example from [Cho et al., 2015].]

Application of Encoder-Decoder Model

Scoring (a translation):
p(La, croissance, économique, s’est, ralentie, ces, dernières, années, . | Economic, growth, has, slowed, down, in, recent, years, .) = ?

Decoding (a source sentence):
generate the most probable translation of a source sentence

y* = argmax_y p(y | Economic, growth, has, slowed, down, in, recent, years, .)


Decoding

exact search:
generate every possible sentence T in the target language

compute the score p(T|S) for each

pick the best one

intractable: |vocab|^N translations for output length N
→ we need an approximate search strategy

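To get a feel for why exact search is intractable, assume (for illustration only) a target vocabulary of 50,000 words and an output length of 20:

```python
# 50,000 possible tokens at each of 20 output positions: far too many to enumerate
print(f"{50_000 ** 20:.2e}")   # ≈ 9.5e+93 candidate translations
```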

Decoding

approximate search (1): greedy search

at each time step, compute the probability distribution $P(y_i \mid S, y_{<i})$

select $y_i$ according to some heuristic:

sampling: sample from $P(y_i \mid S, y_{<i})$
greedy search: pick $\operatorname{argmax}_{y_i} p(y_i \mid S, y_{<i})$

continue until we generate <eos>

[Figure: greedy decoding of "hello world ! <eos>", showing the probability of the chosen token at each step (0.946, 0.957, 0.928, 0.999) and the accumulated score.]

efficient, but suboptimal

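A minimal sketch of greedy search (with sampling as an option). The `predict_next(source, prefix)` interface and the `toy_model` below are hypothetical stand-ins for the model's per-step distribution $P(y_i \mid S, y_{<i})$.

```python
import numpy as np

def greedy_decode(predict_next, source, eos_id, max_len=50, sample=False, rng=None):
    """Decode left to right. `predict_next(source, prefix)` is assumed to return a
    probability distribution over the target vocabulary (hypothetical interface)."""
    rng = rng or np.random.default_rng()
    prefix = []
    for _ in range(max_len):
        p = predict_next(source, prefix)
        y = int(rng.choice(len(p), p=p)) if sample else int(np.argmax(p))
        prefix.append(y)
        if y == eos_id:                       # stop once <eos> is generated
            break
    return prefix

# toy stand-in for the model: prefers token 1, then <eos> (id 3) after three steps
def toy_model(source, prefix):
    p = np.full(4, 0.1)
    p[3 if len(prefix) >= 3 else 1] = 0.7
    return p / p.sum()

print(greedy_decode(toy_model, source=[5, 6], eos_id=3))   # [1, 1, 1, 3]
```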

Decoding

approximate search (2): beam search

maintain a list of K hypotheses (the beam)

at each time step, expand each hypothesis k: $p(y^k_i \mid S, y^k_{<i})$

select the K hypotheses with the highest total probability: $\prod_i p(y^k_i \mid S, y^k_{<i})$

[Figure: beam search with K = 3 decoding "hello world ! <eos>". At each step the three highest-scoring partial hypotheses are kept; lower-probability alternatives such as "World", ".", "...", "HI" and "Hey" accumulate worse scores and drop out of the beam.]

relatively efficient: beam expansion is parallelisable

currently default search strategy in neural machine translation

small beam (K ≈ 10) offers good speed-quality trade-off

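A corresponding sketch of beam search over the same hypothetical `predict_next` interface, scoring each hypothesis by the sum of its token log-probabilities:

```python
import numpy as np

def beam_search(predict_next, source, eos_id, k=3, max_len=50):
    """Keep the K highest-scoring partial hypotheses; score = sum of token log-probs.
    `predict_next(source, prefix)` returns a distribution over the target vocabulary."""
    beam = [([], 0.0)]                        # (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            p = predict_next(source, prefix)
            for y in np.argsort(p)[-k:]:      # expanding each hypothesis's top-k tokens suffices
                candidates.append((prefix + [int(y)], score + float(np.log(p[y]))))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates[:k]:  # keep the K best overall
            (finished if prefix[-1] == eos_id else beam).append((prefix, score))
        if not beam:                          # all surviving hypotheses have ended
            break
    return max(finished + beam, key=lambda c: c[1])

# reuse the toy stand-in model from the greedy-search sketch
def toy_model(source, prefix):
    p = np.full(4, 0.1)
    p[3 if len(prefix) >= 3 else 1] = 0.7
    return p / p.sum()

print(beam_search(toy_model, source=[5, 6], eos_id=3, k=3))
```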

Ensembles

combine decisions of multiple classifiers by voting
an ensemble will reduce error if these conditions are met:

base classifiers are accurate
base classifiers are diverse (make different errors)


Ensembles in NMT

vote at each time step to explore the same search space
(better than decoding with one model and reranking the n-best list with the others)

voting mechanism: typically average (log-)probability

$\log P(y_i \mid S, y_{<i}) = \frac{1}{M} \sum_{m=1}^{M} \log P_m(y_i \mid S, y_{<i})$

requirements for voting at each time step:
same output vocabulary
same factorization of Y
but: the internal network architecture may be different

we still use reranking in some situations
example: combine left-to-right decoding and right-to-left decoding

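A sketch of the voting step: each model is assumed to expose a per-step distribution over the shared output vocabulary (a hypothetical interface; the toy models m1 and m2 are made up), and the ensemble averages the log-probabilities as in the formula above.

```python
import numpy as np

def ensemble_log_probs(models, source, prefix):
    """Average the per-token log-probabilities of M models sharing one output vocabulary."""
    log_ps = [np.log(m(source, prefix)) for m in models]   # each: distribution over the vocab
    return np.mean(log_ps, axis=0)

# two toy models over a 4-word vocabulary that disagree on the best next token
m1 = lambda source, prefix: np.array([0.1, 0.6, 0.2, 0.1])
m2 = lambda source, prefix: np.array([0.1, 0.3, 0.5, 0.1])

avg = ensemble_log_probs([m1, m2], source=[5, 6], prefix=[1])
print(avg.round(2), int(avg.argmax()))       # the ensemble's preferred next token
```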

Further Reading

Required Reading
Koehn, 13.5

Optional Reading
Sequence to Sequence Learning with Neural Networks (Sutskever, Vinyals, Le): https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio): https://arxiv.org/pdf/1409.0473.pdf


Bibliography I

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Cho, K., Courville, A., and Bengio, Y. (2015). Describing Multimedia Content using Attention-based Encoder-Decoder Networks.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Junczys-Dowmunt, M. and Grundkiewicz, R. (2016). Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. In Proceedings of the First Conference on Machine Translation, pages 751–758, Berlin, Germany. Association for Computational Linguistics.

Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Miceli Barone, A. V., Mokry, J., and Nadejde, M. (2017). Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain.


Bibliography II

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.
