
Neural Machine Translation

Rico Sennrich

Institute for Language, Cognition and Computation, University of Edinburgh

September 15 2016

Rico Sennrich Neural Machine Translation 1 / 65

Edinburgh’s WMT Results Over the Years

[Bar chart: BLEU on newstest2013 (EN→DE) for Edinburgh's WMT submissions, 2013–2016, comparing phrase-based SMT, syntax-based SMT and neural MT; the 2016 neural system reaches 24.7 BLEU, ahead of the phrase-based (21.5) and syntax-based (22.1) systems.]

(NMT 2015 from U. Montréal: https://sites.google.com/site/acl16nmt/)

Rico Sennrich Neural Machine Translation 1 / 65

Neural Machine Translation

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Rico Sennrich Neural Machine Translation 2 / 65

Neural Machine Translation

1 Attentional encoder-decoder

2 Where are we now? Evaluation, challenges, future directions...
  Evaluation results
  Comparing neural and phrase-based machine translation
  Recent research in neural machine translation

Rico Sennrich Neural Machine Translation 3 / 65

Translation modelling

decomposition of the translation problem (for NMT):
a source sentence S of length m is a sequence x_1, ..., x_m
a target sentence T of length n is a sequence y_1, ..., y_n

T^* = \arg\max_T p(T \mid S)

p(T \mid S) = p(y_1, \ldots, y_n \mid x_1, \ldots, x_m) = \prod_{i=1}^{n} p(y_i \mid y_0, \ldots, y_{i-1}, x_1, \ldots, x_m)

Rico Sennrich Neural Machine Translation 4 / 65

Translation modelling

difference from language model

target-side language model:

p(T) = \prod_{i=1}^{n} p(y_i \mid y_0, \ldots, y_{i-1})

translation model:

p(T \mid S) = \prod_{i=1}^{n} p(y_i \mid y_0, \ldots, y_{i-1}, x_1, \ldots, x_m)

we could just treat the sentence pair as one long sequence, but:
  we do not care about p(S) (S is given)
  we do not want to use the same parameters for S and T
  we may want a different vocabulary, network architecture for the source text

Rico Sennrich Neural Machine Translation 5 / 65


Encoder-decoder [Sutskever et al., 2014, Cho et al., 2014]

two RNNs (LSTM or GRU):
  encoder reads input and produces hidden state representations
  decoder produces output, based on the last encoder hidden state

joint learning (backpropagation through full network)

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

Rico Sennrich Neural Machine Translation 6 / 65

Summary vector

last encoder hidden state “summarizes” source sentence

with multilingual training, we can potentially learn a language-independent meaning representation

[Sutskever et al., 2014]

Rico Sennrich Neural Machine Translation 7 / 65

Summary vector as information bottleneck

can a fixed-size vector represent the meaning of an arbitrarily long sentence?

empirically, quality decreases for long sentences

reversing the source sentence brings some improvement [Sutskever et al., 2014]

[Sutskever et al., 2014]

Rico Sennrich Neural Machine Translation 8 / 65

Attentional encoder-decoder [Bahdanau et al., 2015]

encoder
goal: avoid bottleneck of summary vector

use bidirectional RNN, and concatenate forward and backward states
→ annotation vector h_i
represent source sentence as vector of n annotations
→ variable-length representation

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Rico Sennrich Neural Machine Translation 9 / 65

Attentional encoder-decoder [Bahdanau et al., 2015]

attention
problem: how to incorporate variable-length context into the hidden state?

attention model computes context vector as weighted average of annotations

weights are computed by feedforward neural network with softmax

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Rico Sennrich Neural Machine Translation 10 / 65

Attentional encoder-decoder: math

simplifications of the model by [Bahdanau et al., 2015] (for illustration):
  plain RNN instead of GRU
  simpler output layer
  we do not show bias terms

notation
W, U, E, C, V are weight matrices (of different dimensionality):
  E: one-hot to embedding (e.g. 50000 · 512)
  W: embedding to hidden (e.g. 512 · 1024)
  U: hidden to hidden (e.g. 1024 · 1024)
  C: context (2× hidden) to hidden (e.g. 2048 · 1024)
  V_o: hidden to one-hot (e.g. 1024 · 50000)

separate weight matrices for encoder and decoder (e.g. E_x and E_y)

input X of length T_x; output Y of length T_y

Rico Sennrich Neural Machine Translation 11 / 65

Attentional encoder-decoder: math

encoder

\overrightarrow{h}_j = \begin{cases} 0 & \text{if } j = 0 \\ \tanh(\overrightarrow{W}_x E_x x_j + \overrightarrow{U}_x \overrightarrow{h}_{j-1}) & \text{if } j > 0 \end{cases}

\overleftarrow{h}_j = \begin{cases} 0 & \text{if } j = T_x + 1 \\ \tanh(\overleftarrow{W}_x E_x x_j + \overleftarrow{U}_x \overleftarrow{h}_{j+1}) & \text{if } j \leq T_x \end{cases}

h_j = (\overrightarrow{h}_j, \overleftarrow{h}_j)
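The sketch below is a minimal numpy rendition of these encoder equations (plain tanh RNN, no biases). The toy dimensions, the random initialization, and the example input are illustrative assumptions, not settings from an actual system.

```python
# Minimal numpy sketch of the bidirectional encoder above (plain tanh RNN,
# no biases). Toy sizes and random initialization are illustrative only.
import numpy as np

rng = np.random.RandomState(0)
vocab_size, emb_dim, hid_dim = 20, 8, 16

E_x = rng.randn(vocab_size, emb_dim) * 0.1      # one-hot -> embedding
W_fwd = rng.randn(hid_dim, emb_dim) * 0.1       # embedding -> hidden (forward)
U_fwd = rng.randn(hid_dim, hid_dim) * 0.1       # hidden -> hidden (forward)
W_bwd = rng.randn(hid_dim, emb_dim) * 0.1       # embedding -> hidden (backward)
U_bwd = rng.randn(hid_dim, hid_dim) * 0.1       # hidden -> hidden (backward)

def encode(x):
    """x: list of source token ids; returns annotations h_1 .. h_Tx,
    each the concatenation of forward and backward hidden states."""
    Tx = len(x)
    fwd = [np.zeros(hid_dim)]                   # forward chain, h_0 = 0
    for j in range(Tx):
        fwd.append(np.tanh(W_fwd @ E_x[x[j]] + U_fwd @ fwd[-1]))
    bwd = [np.zeros(hid_dim)]                   # backward chain, h_{Tx+1} = 0
    for j in reversed(range(Tx)):
        bwd.append(np.tanh(W_bwd @ E_x[x[j]] + U_bwd @ bwd[-1]))
    bwd = bwd[1:][::-1]                         # backward states for positions 1..Tx
    return [np.concatenate([fwd[j + 1], bwd[j]]) for j in range(Tx)]

annotations = encode([3, 7, 1, 12])
print(len(annotations), annotations[0].shape)   # 4 (32,)
```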

Rico Sennrich Neural Machine Translation 12 / 65

Attentional encoder-decoder: math

decoder

s_i = \begin{cases} \tanh(W_s \overleftarrow{h}_1) & \text{if } i = 0 \\ \tanh(W_y E_y y_i + U_y s_{i-1} + C c_i) & \text{if } i > 0 \end{cases}

t_i = \tanh(U_o s_{i-1} + W_o E_y y_{i-1} + C_o c_i)

y_i = \mathrm{softmax}(V_o t_i)

attention model

e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)

\alpha_{ij} = \mathrm{softmax}(e_{ij})

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
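The following numpy sketch implements the attention model and a single decoder step from these equations under the same simplifications; all weight shapes, the random initialization, and the stand-in annotations are illustrative assumptions.

```python
# Minimal numpy sketch of the attention model and one decoder step above
# (plain tanh RNN, no biases). All shapes and values are illustrative only.
import numpy as np

rng = np.random.RandomState(1)
vocab_size, emb_dim, hid_dim = 20, 8, 16
ann_dim = 2 * hid_dim                           # annotation h_j = [forward; backward]

E_y = rng.randn(vocab_size, emb_dim) * 0.1
W_y, U_y = rng.randn(hid_dim, emb_dim) * 0.1, rng.randn(hid_dim, hid_dim) * 0.1
C = rng.randn(hid_dim, ann_dim) * 0.1
U_o, W_o = rng.randn(hid_dim, hid_dim) * 0.1, rng.randn(hid_dim, emb_dim) * 0.1
C_o = rng.randn(hid_dim, ann_dim) * 0.1
V_o = rng.randn(vocab_size, hid_dim) * 0.1
W_a, U_a = rng.randn(hid_dim, hid_dim) * 0.1, rng.randn(hid_dim, ann_dim) * 0.1
v_a = rng.randn(hid_dim) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(s_prev, annotations):
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j); alpha = softmax over j; c_i = sum_j alpha_ij h_j
    H = np.stack(annotations)
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    alpha = softmax(e)
    return alpha @ H, alpha

def decoder_step(y_prev, s_prev, annotations):
    c_i, alpha = attend(s_prev, annotations)
    t_i = np.tanh(U_o @ s_prev + W_o @ E_y[y_prev] + C_o @ c_i)
    p_y = softmax(V_o @ t_i)                    # distribution over the next word
    y_i = int(np.argmax(p_y))                   # greedy pick, just for the sketch
    s_i = np.tanh(W_y @ E_y[y_i] + U_y @ s_prev + C @ c_i)
    return y_i, s_i, alpha

annotations = [rng.randn(ann_dim) for _ in range(4)]   # stand-in encoder output
y_1, s_1, alpha_1 = decoder_step(0, np.zeros(hid_dim), annotations)
print(y_1, alpha_1.round(2))
```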

Rico Sennrich Neural Machine Translation 13 / 65

Attention model

attention model
side effect: we obtain an alignment between source and target sentence

information can also flow along recurrent connections, so there is no guarantee that attention corresponds to alignment

applications:
  visualisation
  replace unknown words with back-off dictionary [Jean et al., 2015]
  ...

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Rico Sennrich Neural Machine Translation 14 / 65

Attention model

attention model also works with images:

[Cho et al., 2015]

Rico Sennrich Neural Machine Translation 15 / 65

Attention model

[Cho et al., 2015]

Rico Sennrich Neural Machine Translation 16 / 65

Applications of encoder-decoder neural network

score a translation:
p(La, croissance, économique, s’est, ralentie, ces, dernières, années, . | Economic, growth, has, slowed, down, in, recent, years, .) = ?

generate the most probable translation of a source sentence → decoding:
y^* = \arg\max_y p(y \mid Economic, growth, has, slowed, down, in, recent, years, .)
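A minimal sketch of the first use case, scoring a given translation by summing the per-token conditional log-probabilities; `next_word_logprobs` is a deterministic dummy standing in for the trained network's per-step output distribution.

```python
# Minimal sketch of scoring: log p(T|S) = sum_i log p(y_i | y_<i, x).
# `next_word_logprobs` is a dummy stand-in for the model.
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def next_word_logprobs(src_ids, tgt_prefix, vocab_size=20):
    rng = np.random.RandomState(len(src_ids) + len(tgt_prefix))
    return log_softmax(rng.randn(vocab_size))

def score(src_ids, tgt_ids):
    return sum(next_word_logprobs(src_ids, tgt_ids[:i])[y]
               for i, y in enumerate(tgt_ids))

print(score([1, 2, 3], [4, 5, 6, 0]))           # log-probability of one candidate
```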

Rico Sennrich Neural Machine Translation 17 / 65

Decoding

exact search
generate every possible sentence T in the target language

compute the score p(T \mid S) for each

pick the best one

intractable: |V|^N translations for vocabulary V and output length N
→ we need an approximative search strategy

Rico Sennrich Neural Machine Translation 18 / 65

Decoding

approximative search/1
at each time step, compute the probability distribution P(y_i \mid X, y_{<i})

select y_i according to some heuristic:
  sampling: sample from P(y_i \mid X, y_{<i})
  greedy search: pick \arg\max_y P(y_i \mid X, y_{<i})

continue until we generate <eos>

efficient, but suboptimal
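A minimal sketch of these two step-wise heuristics, assuming a stand-in `next_word_probs` in place of the trained model and token id 0 for <eos>.

```python
# Minimal sketch of sampling and greedy search; `next_word_probs` is a
# dummy stand-in for P(y_i | X, y_<i), and id 0 is assumed to be <eos>.
import numpy as np

EOS = 0

def next_word_probs(src_ids, tgt_prefix, vocab_size=20):
    rng = np.random.RandomState(len(src_ids) + len(tgt_prefix))
    p = np.exp(rng.randn(vocab_size))
    return p / p.sum()

def decode(src_ids, strategy="greedy", max_len=20):
    rng = np.random.RandomState(42)
    out = []
    while len(out) < max_len:
        p = next_word_probs(src_ids, out)
        y = int(np.argmax(p)) if strategy == "greedy" else int(rng.choice(len(p), p=p))
        out.append(y)
        if y == EOS:                            # stop once <eos> is generated
            break
    return out

print(decode([1, 2, 3], "greedy"))
print(decode([1, 2, 3], "sampling"))
```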

Rico Sennrich Neural Machine Translation 19 / 65

Decoding

approximative search/2: beam search
maintain a list of K hypotheses (beam)

at each time step, expand each hypothesis k: p(y_i^k \mid X, y_{<i}^k)

at each time step, we produce |V| \cdot K translation hypotheses
→ prune to the K hypotheses with the highest total probability: \prod_i p(y_i^k \mid X, y_{<i}^k)

relatively efficient

currently default search strategy in neural machine translation

small beam (K ≈ 10) offers good speed-quality trade-off
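A minimal sketch of beam search with pruning by total log-probability, again with a stand-in model and an assumed <eos> id; a real decoder would also normalize or otherwise handle hypothesis length.

```python
# Minimal sketch of beam search: keep K hypotheses, expand each, and prune
# back to the K with the highest total log-probability. The model is a
# dummy stand-in, and id 0 is assumed to be <eos>.
import numpy as np

EOS = 0

def next_word_logprobs(src_ids, tgt_prefix, vocab_size=20):
    rng = np.random.RandomState(hash(tuple(tgt_prefix)) % (2**32))
    z = rng.randn(vocab_size)
    return z - z.max() - np.log(np.exp(z - z.max()).sum())

def beam_search(src_ids, K=10, max_len=20):
    beam = [([], 0.0)]                          # (hypothesis, total log-prob)
    finished = []
    for _ in range(max_len):
        expanded = []
        for hyp, score in beam:
            logp = next_word_logprobs(src_ids, hyp)
            for y in np.argsort(logp)[-K:]:     # top-K continuations per hypothesis
                expanded.append((hyp + [int(y)], score + logp[y]))
        expanded.sort(key=lambda h: h[1], reverse=True)
        beam = []
        for hyp, score in expanded[:K]:         # prune expanded candidates back to K
            (finished if hyp[-1] == EOS else beam).append((hyp, score))
        if not beam:
            break
    return max(finished + beam, key=lambda h: h[1])

print(beam_search([1, 2, 3], K=5))
```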

Rico Sennrich Neural Machine Translation 20 / 65

Ensembles

at each timestep, combine the probability distribution of M different ensemble components

combine operator: typically average (log-)probability:

\log P(y_i \mid X, y_{<i}) = \frac{1}{M} \sum_{m=1}^{M} \log P_m(y_i \mid X, y_{<i})

requirements:
  same output vocabulary
  same factorization of Y

internal network architecture may be different

source representations may be different
(extreme example: ensemble-like model with different source languages [Junczys-Dowmunt and Grundkiewicz, 2016])
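A minimal sketch of the combine operator: averaging the log-probabilities of M components at one time step. The per-model distributions here are random stand-ins, and the average is not renormalized, which is fine for ranking continuations.

```python
# Minimal sketch of ensemble combination: average the per-model
# log-probabilities at one time step (random stand-ins here).
import numpy as np

rng = np.random.RandomState(0)

def random_logprobs(vocab_size=10):             # stand-in for log P_m(y_i | X, y_<i)
    z = rng.randn(vocab_size)
    return z - z.max() - np.log(np.exp(z - z.max()).sum())

def ensemble_logprobs(per_model_logprobs):
    return np.mean(np.stack(per_model_logprobs), axis=0)

components = [random_logprobs() for _ in range(4)]   # M = 4 ensemble components
combined = ensemble_logprobs(components)
print(int(np.argmax(combined)), combined.round(2))
```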

Rico Sennrich Neural Machine Translation 21 / 65

Ensembles

recent ensemble strategies in NMT
  ensemble of 8 independent training runs with different hyperparameters/architectures [Luong et al., 2015a]
  ensemble of 8 independent training runs with different random initializations [Chung et al., 2016]
  ensemble of 4 checkpoints of the same training run [Sennrich et al., 2016a]
  → probably less effective, but only requires one training run

[Bar chart, reconstructed as a table: BLEU of single model vs. checkpoint ensemble on eight WMT16 translation directions]

direction   single model   ensemble
EN→CS       23.7           24.8
EN→DE       31.6           33.1
EN→RO       28.1           28.2
EN→RU       24.3           26.0
CS→EN       30.1           31.4
DE→EN       36.2           37.5
RO→EN       33.3           33.9
RU→EN       26.9           28.0

[Sennrich et al., 2016a]

Rico Sennrich Neural Machine Translation 22 / 65

Neural Machine Translation

1 Attentional encoder-decoder

2 Where are we now? Evaluation, challenges, future directions...
  Evaluation results
  Comparing neural and phrase-based machine translation
  Recent research in neural machine translation

Rico Sennrich Neural Machine Translation 23 / 65

State of Neural MT

attentional encoder-decoder networks have become state of the art on various MT tasks...

...but this usually requires more advanced techniques to handle OOVs, use monolingual data, etc.

your mileage may vary depending on
  language pair and text type
  amount of training data
  type of training resources (monolingual?)
  hyperparameters

very general model: can be applied to other sequence-to-sequence tasks

Rico Sennrich Neural Machine Translation 24 / 65

Attentional encoder-decoders (NMT) are SOTA

system          BLEU   official rank
uedin-nmt       34.2   1
metamind        32.3   2
uedin-syntax    30.6   3
NYU-UMontreal   30.8   4
online-B        29.4   5-10
KIT/LIMSI       29.1   5-10
cambridge       30.6   5-10
online-A        29.9   5-10
promt-rule      23.4   5-10
KIT             29.0   6-10
jhu-syntax      26.6   11-12
jhu-pbmt        28.3   11-12
uedin-pbmt      28.4   13-14
online-F        19.3   13-15
online-G        23.8   14-15

Table: WMT16 results for EN→DE

system          BLEU   official rank
uedin-nmt       38.6   1
online-B        35.0   2-5
online-A        32.8   2-5
uedin-syntax    34.4   2-5
KIT             33.9   2-6
uedin-pbmt      35.1   5-7
jhu-pbmt        34.5   6-7
online-G        30.1   8
jhu-syntax      31.0   9
online-F        20.2   10

Table: WMT16 results for DE→EN

(in the original slide, systems are colour-coded as pure NMT or as containing an NMT component)

Rico Sennrich Neural Machine Translation 25 / 65


Attentional encoder-decoders (NMT) are SOTA

system           BLEU   official rank
uedin-nmt        25.8   1
NYU-UMontreal    23.6   2
jhu-pbmt         23.6   3
cu-chimera       21.0   4-5
cu-tamchyna      20.8   4-5
uedin-cu-syntax  20.9   6-7
online-B         22.7   6-7
online-A         19.5   15
cu-TectoMT       14.7   16
cu-mergedtrees    8.2   18

Table: WMT16 results for EN→CS

system          BLEU   official rank
online-B        39.2   1-2
uedin-nmt       33.9   1-2
uedin-pbmt      35.2   3
uedin-syntax    33.6   4-5
online-A        30.8   4-6
jhu-pbmt        32.2   5-7
LIMSI           31.0   6-7

Table: WMT16 results for RO→EN

system          BLEU   official rank
uedin-nmt       31.4   1
jhu-pbmt        30.4   2
online-B        28.6   3
PJATK           28.3   8-10
online-A        25.7   11
cu-mergedtrees  13.3   12

Table: WMT16 results for CS→EN

system              BLEU   official rank
uedin-nmt           28.1   1-2
QT21-HimL-SysComb   28.9   1-2
KIT                 25.8   3-7
uedin-pbmt          26.8   3-7
online-B            25.4   3-7
uedin-lmu-hiero     25.9   3-7
RWTH-SYSCOMB        27.1   3-7
LIMSI               23.9   8-10
lmu-cuni            24.3   8-10
jhu-pbmt            23.5   8-11
usfd-rescoring      23.1   10-12
online-A            19.2   11-12

Table: WMT16 results for EN→RO

Rico Sennrich Neural Machine Translation 25 / 65

Attentional encoder-decoders (NMT) are SOTA

system            BLEU   official rank
PROMT-rule        22.3   1
amu-uedin         25.3   2-4
online-B          23.8   2-5
uedin-nmt         26.0   2-5
online-G          26.2   3-5
NYU-UMontreal     23.1   6
jhu-pbmt          24.0   7-8
LIMSI             23.6   7-10
online-A          20.2   8-10
AFRL-MITLL-phr    23.5   9-10
AFRL-MITLL-verb   20.9   11
online-F           8.6   12

Table: WMT16 results for EN→RU

system                BLEU   official rank
amu-uedin             29.1   1-2
online-G              28.7   1-3
NRC                   29.1   2-4
online-B              28.1   3-5
uedin-nmt             28.0   4-5
online-A              25.7   6-7
AFRL-MITLL-phr        27.6   6-7
AFRL-MITLL-contrast   27.0   8-9
PROMT-rule            20.4   8-9
online-F              13.5   10

Table: WMT16 results for RU→EN

system         BLEU   official rank
uedin-pbmt     23.4   1-4
online-G       20.6   1-4
online-B       23.6   1-4
UH-opus        23.1   1-4
PROMT-SMT      20.3   5
UH-factored    19.3   6-7
uedin-syntax   20.4   6-7
online-A       19.0   8
jhu-pbmt       19.1   9

Table: WMT16 results for FI→EN

system            BLEU   official rank
online-G          15.4   1-3
abumatran-nmt     17.2   1-4
online-B          14.4   1-4
abumatran-combo   17.4   3-5
UH-opus           16.3   4-5
NYU-UMontreal     15.1   6-8
abumatran-pbsmt   14.6   6-8
online-A          13.0   6-8
jhu-pbmt          13.8   9-10
UH-factored       12.8   9-12
aalto             11.6   10-13
jhu-hltcoe        11.9   10-13
UUT               11.6   11-13

Table: WMT16 results for EN→FI

Rico Sennrich Neural Machine Translation 25 / 65

Neural Machine Translation

1 Attentional encoder-decoder

2 Where are we now? Evaluation, challenges, future directions...
  Evaluation results
  Comparing neural and phrase-based machine translation
  Recent research in neural machine translation

Rico Sennrich Neural Machine Translation 26 / 65

Interlude: why is (machine) translation hard?

ambiguity
words are often polysemous, with different translations for different meanings

system       sentence
source       Dort wurde er von dem Schläger und einer weiteren männlichen Person erneut angegriffen.
reference    There he was attacked again by his original attacker and another male.
uedin-pbsmt  There, he was at the club and another male person attacked again.
uedin-nmt    There he was attacked again by the racket and another male person.

Schläger: attacker / racket / club

image credits:
racket: https://www.flickr.com/photos/128067141@N07/15157111178 / CC BY 2.0
attacker: https://commons.wikimedia.org/wiki/File:Wikibully.jpg
golf club: https://commons.wikimedia.org/wiki/File:Golf_club,_Callawax_X-20_8_iron_-_III.jpg / CC-BY-SA-3.0

Rico Sennrich Neural Machine Translation 27 / 65


Interlude: why is (machine) translation hard?

word order
there are systematic word order differences between languages. We need to generate words in the correct order.

system       sentence
source       Unsere digitalen Leben haben die Notwendigkeit, stark, lebenslustig und erfolgreich zu erscheinen, verdoppelt [...]
reference    Our digital lives have doubled the need to appear strong, fun-loving and successful [...]
uedin-pbsmt  Our digital lives are lively, strong, and to be successful, doubled [...]
uedin-nmt    Our digital lives have doubled the need to appear strong, lifelike and successful [...]

Rico Sennrich Neural Machine Translation 28 / 65

Interlude: why is (machine) translation hard?

grammatical marking system
grammatical distinctions can be marked in different ways, for instance through word order (English), or inflection (German). The translator needs to produce the appropriate marking.

English  ... because the dog chased the man.
German   ... weil der Hund den Mann jagte.

Rico Sennrich Neural Machine Translation 29 / 65

Interlude: why is (machine) translation hard?

multiword expressions
the meaning of non-compositional expressions is lost in a word-to-word translation

system       sentence
source       He bends over backwards for the team, ignoring any pain.
reference    Er zerreißt sich für die Mannschaft, geht über Schmerzen drüber. (lit: he tears himself apart for the team)
uedin-pbsmt  Er macht alles für das Team, den Schmerz zu ignorieren. (lit: he does everything for the team)
uedin-nmt    Er beugt sich rückwärts für die Mannschaft, ignoriert jeden Schmerz. (lit: he bends backwards for the team)

Rico Sennrich Neural Machine Translation 30 / 65

Interlude: why is (machine) translation hard?

subcategorization
words only allow for specific categories of syntactic arguments, that often differ between languages.

English  he remembers his medical appointment.
German   er erinnert sich an seinen Arzttermin.
English  *he remembers himself to his medical appointment.
German   *er erinnert seinen Arzttermin.

agreement
inflected forms may need to agree over long distances to satisfy grammaticality.

English  they can not be found
French   elles ne peuvent pas être trouvées

Rico Sennrich Neural Machine Translation 31 / 65

Interlude: why is (machine) translation hard?

morphological complexity
the translator may need to analyze/generate morphologically complex words that were not seen before.

German   Abwasserbehandlungsanlage
English  waste water treatment plant
French   station d’épuration des eaux résiduaires

system       sentence
source       Titelverteidiger ist Drittligaabsteiger SpVgg Unterhaching.
reference    The defending champions are SpVgg Unterhaching, who have been relegated to the third league.
uedin-pbsmt  Title defender Drittligaabsteiger Week 2.
uedin-nmt    Defending champion is third-round pick SpVgg Underhaching.

Rico Sennrich Neural Machine Translation 32 / 65

Interlude: why is (machine) translation hard?

open vocabulary
languages have an open vocabulary, and we need to learn translations for words that we have only seen rarely (or never)

system       sentence
source       Titelverteidiger ist Drittligaabsteiger SpVgg Unterhaching.
reference    The defending champions are SpVgg Unterhaching, who have been relegated to the third league.
uedin-pbsmt  Title defender Drittligaabsteiger Week 2.
uedin-nmt    Defending champion is third-round pick SpVgg Underhaching.

Rico Sennrich Neural Machine Translation 33 / 65

Interlude: why is (machine) translation hard?

discontinuous structures
a word (sequence) can map to a discontinuous structure in another language.

English  I do not know
French   Je ne sais pas

system       sentence
source       Ein Jahr später machten die Fed-Repräsentanten diese Kürzungen rückgängig.
reference    A year later, Fed officials reversed those cuts.
uedin-pbsmt  A year later, the Fed representatives made these cuts.
uedin-nmt    A year later, FedEx officials reversed those cuts.

Rico Sennrich Neural Machine Translation 34 / 65

Interlude: why is (machine) translation hard?

discourse
the translation of referential expressions depends on discourse context, which sentence-level translators have no access to.

English  I made a decision. Please respect it.
French   J’ai pris une décision. Respectez-la s’il vous plaît.
French   J’ai fait un choix. Respectez-le s’il vous plaît.

Rico Sennrich Neural Machine Translation 35 / 65

Interlude: why is (machine) translation hard?

assorted other difficulties

underspecification

ellipsis

lexical gaps

language change

language variation (dialects, genres, domains)

ill-formed input

Rico Sennrich Neural Machine Translation 36 / 65

Comparison between phrase-based and neural MT

human analysis of NMT (reranking) [Neubig et al., 2015]
NMT is more grammatical:
  word order
  insertion/deletion of function words
  morphological agreement

minor degradation in lexical choice?

Rico Sennrich Neural Machine Translation 37 / 65

Comparison between phrase-based and neural MT

analysis of IWSLT 2015 results [Bentivogli et al., 2016]
human-targeted translation error rate (HTER) based on automatic translation and human post-edit

4 error types: substitution, insertion, deletion, shift

                                  HTER (no shift)            HTER (shift only)
system                            word    lemma    %∆
PBSMT [Ha et al., 2015]           28.3    23.2     -18.0     3.5
NMT [Luong and Manning, 2015]     21.7    18.7     -13.7     1.5

word-level is closer to lemma-level performance: better at inflection/agreement

improvement on lemma-level: better lexical choice

fewer shift errors: better word order

Rico Sennrich Neural Machine Translation 38 / 65


Comparison between phrase-based and neural MT

WMT16 direct assessment [Bojar et al., 2016]
uedin-nmt is most fluent for all 4 evaluated translation directions

in adequacy, ranked:
  1/6 (CS-EN)
  1/10 (DE-EN)
  2/7 (RO-EN)
  6/10 (RU-EN)

relative to other systems, stronger contrast in fluency than adequacy

Rico Sennrich Neural Machine Translation 39 / 65

Why is neural MT output more grammatical?

neural MT
  end-to-end trained model
  generalization via continuous space representation
  output conditioned on full source text and target history

phrase-based SMT
  log-linear combination of many “weak” features
  data sparseness triggers back-off to smaller units
  strong independence assumptions

Rico Sennrich Neural Machine Translation 40 / 65

Neural Machine Translation

1 Attentional encoder-decoder

2 Where are we now? Evaluation, challenges, future directions...
  Evaluation results
  Comparing neural and phrase-based machine translation
  Recent research in neural machine translation

Rico Sennrich Neural Machine Translation 41 / 65

Efficiency

speed bottlenecks
  matrix multiplication → use of highly parallel hardware (GPUs)
  softmax (scales with vocabulary size). Solutions:
    LMs: hierarchical softmax; noise-contrastive estimation; self-normalization
    NMT: approximate softmax through subset of vocabulary [Jean et al., 2015]

NMT training vs. decoding (on fast GPU)
  training: slow (1-3 weeks)
  decoding: fast (100 000–500 000 sentences / day, with NVIDIA Titan X and amuNMT: https://github.com/emjotde/amunmt)

Rico Sennrich Neural Machine Translation 42 / 65

Open-vocabulary translation

Why is vocabulary size a problem?
  size of one-hot input/output vector is linear to vocabulary size
  large vocabularies are space inefficient
  large output vocabularies are time inefficient
  typical network vocabulary size: 30 000–100 000

What about out-of-vocabulary words?
  training set vocabulary typically larger than network vocabulary (1 million words or more)
  at translation time, we regularly encounter novel words:
    names: Barack Obama
    morphologically complex words: Hand|gepäck|gebühr (’carry-on bag fee’)
    numbers, URLs etc.

Rico Sennrich Neural Machine Translation 43 / 65

Open-vocabulary translation

Solutions
  copy unknown words, or translate with back-off dictionary [Jean et al., 2015, Luong et al., 2015b, Gulcehre et al., 2016]
  → works for names (if alphabet is shared), and 1-to-1 aligned words

  use subword units (characters or others) for input/output vocabulary
  → model can learn translation of seen words on subword level
  → model can translate unseen words if translation is transparent

  active research area [Sennrich et al., 2016c, Luong and Manning, 2016, Chung et al., 2016, Ling et al., 2015, Costa-jussà and Fonollosa, 2016]

Rico Sennrich Neural Machine Translation 44 / 65

Core idea: transparent translations

transparent translations
some translations are semantically/phonologically transparent

morphologically complex words (e.g. compounds):
  solar system (English)
  Sonnen|system (German)
  Nap|rendszer (Hungarian)

named entities:
  Obama (English; German)
  Обама (Russian)
  オバマ (o-ba-ma) (Japanese)

cognates and loanwords:
  claustrophobia (English)
  Klaustrophobie (German)
  Клаустрофобия (Russian)

Rico Sennrich Neural Machine Translation 45 / 65

Byte pair encoding [Gage, 1994]

algorithm
iteratively replace the most frequent byte pair in the sequence with an unused byte

aaabdaaabac
ZabdZabac      Z=aa
ZYdZYac        Y=ab
XdXac          X=ZY
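A minimal Python sketch of this compression loop; the pool of replacement symbols is an illustrative choice, and ties between equally frequent pairs may be merged in a different order than on the slide.

```python
# Minimal sketch of the compression loop: repeatedly replace the most
# frequent adjacent pair with a fresh symbol.
from collections import Counter

def bpe_compress(seq, num_merges):
    seq, merges = list(seq), {}
    fresh = iter('ZYXWVUTSRQ')                  # unused replacement symbols (toy pool)
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        new = next(fresh)
        merges[new] = a + b
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return ''.join(seq), merges

compressed, merges = bpe_compress('aaabdaaabac', 3)
print(compressed)                               # XdXac, as above
print(merges)                                   # intermediate merges may differ when counts tie
```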

Rico Sennrich Neural Machine Translation 46 / 65


Byte pair encoding for word segmentation

bottom-up character merging
  iteratively replace the most frequent pair of symbols (’A’,’B’) with ’AB’
  apply on dictionary, not on full text (for efficiency)
  output vocabulary: character vocabulary + one symbol per merge

word                 frequency
’l o w </w>’         5
’l o w e r </w>’     2
’n e w e s t </w>’   6
’w i d e s t </w>’   3

(’e’, ’s’) → ’es’
(’es’, ’t’) → ’est’
(’est’, ’</w>’) → ’est</w>’
(’l’, ’o’) → ’lo’
(’lo’, ’w’) → ’low’
...
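A simplified sketch of learning these merge operations from the word-frequency dictionary above (in the spirit of [Sennrich et al., 2016c], not the reference implementation):

```python
# Simplified sketch of learning BPE merge operations on a word-frequency
# dictionary: count symbol pairs weighted by word frequency, merge the
# most frequent pair, repeat.
from collections import Counter

vocab = {('l', 'o', 'w', '</w>'): 5,
         ('l', 'o', 'w', 'e', 'r', '</w>'): 2,
         ('n', 'e', 'w', 'e', 's', 't', '</w>'): 6,
         ('w', 'i', 'd', 'e', 's', 't', '</w>'): 3}

def pair_counts(vocab):
    counts = Counter()
    for word, freq in vocab.items():
        for pair in zip(word, word[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, vocab):
    a, b = pair
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

merges = []
for _ in range(5):
    best = pair_counts(vocab).most_common(1)[0][0]   # most frequent symbol pair
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)   # [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
```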

Rico Sennrich Neural Machine Translation 47 / 65


Byte pair encoding for word segmentation

why BPE?
  don’t waste time on frequent character sequences
  → trade-off between text length and vocabulary sizes
  open-vocabulary: learned operations can be applied to unknown words
  alternative view: character-level model on compressed text

segmenting an unseen word, ’l o w e s t </w>’, with the learned merges:

(’e’, ’s’) → ’es’
(’es’, ’t’) → ’est’
(’est’, ’</w>’) → ’est</w>’
(’l’, ’o’) → ’lo’
(’lo’, ’w’) → ’low’
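A minimal sketch of applying the learned merge operations, replayed in the order in which they were learned, to a word that was never seen during training:

```python
# Minimal sketch of segmenting an unseen word with the learned merges.
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]

def segment(word, merges):
    symbols = list(word) + ['</w>']
    for a, b in merges:                         # replay merges in learned order
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(segment('lowest', merges))   # ['low', 'est</w>']
```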

Rico Sennrich Neural Machine Translation 48 / 65


Linguistic Features [Sennrich and Haddow, 2016]
a.k.a. Factored Neural Machine Translation

motivation: disambiguate words by POS

English       German
close (verb)  schließen
close (adj)   nah
close (noun)  Ende

system        sentence
source        We thought a win like this might be close (adj).
reference     Wir dachten, dass ein solcher Sieg nah sein könnte.
baseline NMT  *Wir dachten, ein Sieg wie dieser könnte schließen.

Rico Sennrich Neural Machine Translation 49 / 65

Linguistic Features: Architecture

use separate embeddings for each feature, then concatenate

baseline: only word feature

E(close) = [0.5, 0.2, 0.3, 0.1]

|F| input features

E_1(close) = [0.4, 0.1, 0.2]

E_2(adj) = [0.1]

E_1(close) ‖ E_2(adj) = [0.4, 0.1, 0.2, 0.1]
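A minimal numpy sketch of this factored input: one embedding matrix per feature, concatenated per token. Vocabulary sizes, dimensions, and the ids used for close/adj are illustrative assumptions.

```python
# Minimal sketch of factored input embeddings: one embedding matrix per
# feature, concatenated per token. Sizes and ids are illustrative only.
import numpy as np

rng = np.random.RandomState(0)
vocab_sizes = {'word': 50000, 'pos': 50}        # |F| = 2 input features
dims = {'word': 3, 'pos': 1}                    # tiny dimensions, for display

embeddings = {f: rng.randn(n, dims[f]) * 0.1 for f, n in vocab_sizes.items()}

def embed(feature_ids):
    """feature_ids: {feature name: id}; returns the concatenated embedding."""
    return np.concatenate([embeddings[f][i] for f, i in feature_ids.items()])

x = embed({'word': 1234, 'pos': 7})             # hypothetical ids for close / adj
print(x.shape)                                  # (4,), i.e. E1(close) || E2(adj)
```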

Rico Sennrich Neural Machine Translation 50 / 65

Linguistic Features: Results

experimental setup
  WMT 2016 (parallel data only)
  source-side features:
    POS tag
    dependency label
    lemma
    morphological features
    subword tag

[Bar chart, reconstructed as a table: BLEU of baseline vs. +linguistic features]

direction           baseline   +linguistic features
English→German      27.8       28.4
German→English      31.4       32.9
English→Romanian    23.8       24.8

Rico Sennrich Neural Machine Translation 51 / 65

Architecture variants

an incomplete selection:
  convolutional network as encoder [Kalchbrenner and Blunsom, 2013]
  TreeLSTM as encoder [Eriguchi et al., 2016]
  modifications to attention mechanism [Luong et al., 2015a, Feng et al., 2016]
  deeper networks [Zhou et al., 2016]
  coverage model [Mi et al., 2016, Tu et al., 2016b, Tu et al., 2016a]
  reward symmetry between source-to-target and target-to-source attention [Cohn et al., 2016, Cheng et al., 2015]

Rico Sennrich Neural Machine Translation 52 / 65

Sequence-level training

problem: at training time, target-side history is reliable; at test time, it is not.
→ exposure bias

solution: instead of using gold context, sample from the model to obtain target context [Shen et al., 2016, Ranzato et al., 2016, Bengio et al., 2015]

more efficient cross-entropy training remains in use to initialize weights

Rico Sennrich Neural Machine Translation 53 / 65

Trading-off target and source context

system       sentence
source       Ein Jahr später machten die Fed-Repräsentanten diese Kürzungen rückgängig.
reference    A year later, Fed officials reversed those cuts.
uedin-nmt    A year later, FedEx officials reversed those cuts.
uedin-pbsmt  A year later, the Fed representatives made these cuts.

problem
  RNN is locally normalized at each time step
  given "Fed:" as previous (sub)word, "Ex" is very likely in training data: p(Ex | Fed:) = 0.55
  label bias problem: locally-normalized models may ignore input in low-entropy state

potential solutions (speculative)
  sampling at training time

bidirectional decoder [Liu et al., 2016, Sennrich et al., 2016a]

context gates to trade-off source and target context [Tu et al., 2016]

Rico Sennrich Neural Machine Translation 54 / 65

Training data: monolingual

Why train on monolingual data?

cheaper to create/collect

parallel data is scarce for many language pairs

domain adaptation with in-domain monolingual data

Rico Sennrich Neural Machine Translation 55 / 65

Training data: monolingual

Solutions/1 [Gülçehre et al., 2015]
  shallow fusion: rescore beam with language model (sketched below)
  deep fusion: extra, LM-specific hidden layer

[Figure 1 of [Gülçehre et al., 2015]: graphical illustrations of the two fusion methods, (a) shallow fusion and (b) deep fusion.]

[Gülçehre et al., 2015]
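A minimal sketch of shallow fusion as rescoring: combine the translation model score with a weighted language model score for each hypothesis in the beam. Both scorers and the interpolation weight beta are stand-ins, not the method's actual parameterization.

```python
# Minimal sketch of shallow fusion rescoring: for each beam hypothesis,
# combine log p_TM(hyp|src) with beta * log p_LM(hyp). Both scorers below
# are dummy stand-ins for trained models.
def tm_logprob(src, hyp):                       # stand-in for log p_TM(hyp | src)
    return -0.5 * len(hyp) + 0.01 * sum(hyp)

def lm_logprob(hyp):                            # stand-in for log p_LM(hyp)
    return -0.4 * len(hyp) - 0.02 * sum(hyp)

def shallow_fusion_rescore(src, beam, beta=0.1):
    scored = [(hyp, tm_logprob(src, hyp) + beta * lm_logprob(hyp)) for hyp in beam]
    return max(scored, key=lambda h: h[1])

beam = [[4, 7, 2, 0], [4, 9, 0], [5, 7, 2, 3, 0]]    # hypothetical hypotheses (token ids)
print(shallow_fusion_rescore([1, 2, 3], beam))
```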

Rico Sennrich Neural Machine Translation 56 / 65

Training data: monolingual

Solutions/2 [Sennrich et al., 2016b]
  decoder is already a language model → mix monolingual data into the training set
  problem: how to get c_i for monolingual training instances?
    dummy source context c_i (moderately effective)
    produce synthetic source sentence via back-translation → get approximation of c_i (see the sketch below)
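A minimal sketch of the back-translation recipe, assuming a stand-in `reverse_translate` in place of a trained target-to-source system and toy token lists as data:

```python
# Minimal sketch of back-translation: translate target-side monolingual
# sentences with a reverse (target->source) system, pair them with the
# originals, and mix the synthetic pairs into the parallel training data.
def reverse_translate(tgt_tokens):
    # placeholder: a real setup would decode with a trained target->source model
    return ['<bt>'] + tgt_tokens[::-1]

parallel = [(['economic', 'growth'], ['Wirtschaftswachstum']),
            (['in', 'recent', 'years'], ['in', 'den', 'letzten', 'Jahren'])]
monolingual_target = [['das', 'Wachstum', 'verlangsamte', 'sich']]

synthetic = [(reverse_translate(t), t) for t in monolingual_target]
training_data = parallel + synthetic            # train source->target model on the mix
print(training_data[-1])
```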

[Bar chart, reconstructed as a table: BLEU with parallel data only vs. parallel + synthetic (back-translated) data]

direction   parallel   +synthetic
EN→CS       20.9       23.7
EN→DE       26.8       31.6
EN→RO       23.9       28.1
EN→RU       20.3       24.3
CS→EN       25.3       30.1
DE→EN       28.5       36.2
RO→EN       29.2       33.3
RU→EN       22.5       26.9

[Sennrich et al., 2016a]

Rico Sennrich Neural Machine Translation 57 / 65

Training data: multilingual

Multi-source translation [Zoph and Knight, 2016]
we can condition on multiple input sentences

[Figure 2 from the paper: a multi-source encoder-decoder model for MT with two source sentences in different languages; each language has its own encoder, which passes its final hidden and cell state to a set of combiners, and the output of a combiner is a hidden state and cell state of the same dimension.]

benefits:
  one source text may contain information that is unspecified in the other
  → possible quality gains

drawbacks:
  we need multiple source sentences at training and decoding time

Rico Sennrich Neural Machine Translation 58 / 65

Training data: multilingual

Multilingual models [Dong et al., 2015, Firat et al., 2016]
we can share layers of the model across language pairs

[Figures from [Dong et al., 2015]: the multi-task learning framework for multiple-target language translation (Figure 2) and its optimization (Figure 3).]

benefits:
  transfer learning from one language pair to the other
  → possible quality gains, especially for low-resourced language pairs
  scalability: do we need N^2 - N independent models for N languages?
  → sharing of parameters allows linear growth
  → zero-shot translation?

Rico Sennrich Neural Machine Translation 59 / 65

Training data: other tasks

Multi-task models [Luong et al., 2016]
other tasks can be modelled with sequence-to-sequence models

we can share layers between translation and other tasks

[Figure 1 from [Luong et al., 2016]: sequence-to-sequence learning examples, machine translation (left) and constituent parsing (right).]

Rico Sennrich Neural Machine Translation 60 / 65

NMT as a component in log-linear models

Log-linear models
  model ensembling is well-established
  reranking output of phrase-based/syntax-based systems with NMT [Neubig et al., 2015]
  incorporating NMT as a feature function into PBSMT [Junczys-Dowmunt et al., 2016]
  → results depend on relative performance of PBSMT and NMT
  log-linear combination of different neural models:
    left-to-right and right-to-left [Liu et al., 2016]
    source-to-target and target-to-source [Li and Jurafsky, 2016]

Rico Sennrich Neural Machine Translation 61 / 65

Some future directions for (neural) MT research

(better) solutions to new(ish) problems
  OOVs, coverage, efficiency...

work on "hard" translation problems
  consider context beyond the sentence boundary
  reward semantic adequacy of translation
  ...

new opportunities
  one model for many language pairs?
  tight integration with other NLP tasks

Rico Sennrich Neural Machine Translation 62 / 65

Further Reading

secondary literature
  lecture notes by Kyunghyun Cho: [Cho, 2015]
  chapter on Neural Network Models in “Statistical Machine Translation” by Philipp Koehn: http://mt-class.org/jhu/assets/papers/neural-network-models.pdf

Rico Sennrich Neural Machine Translation 63 / 65

(A small selection of) Resources

NMT tools

dl4mt-tutorial (theano) https://github.com/nyu-dl/dl4mt-tutorial

(our branch: nematus https://github.com/rsennrich/nematus)

nmt.matlab https://github.com/lmthang/nmt.matlab

seq2seq (tensorflow) https://www.tensorflow.org/versions/r0.8/tutorials/seq2seq/index.html

neural monkey (tensorflow) https://github.com/ufal/neuralmonkey

Rico Sennrich Neural Machine Translation 64 / 65

Do it yourself

sample files and instructions for training an NMT model: https://github.com/rsennrich/wmt16-scripts

pre-trained models to test decoding (and for further experiments): http://statmt.org/rsennrich/wmt16_systems/

lab session this afternoon
  install Nematus

use Nematus with existing model

adapt existing model to new domain via continued training

Rico Sennrich Neural Machine Translation 65 / 65

Thank you!

Rico Sennrich Neural Machine Translation 66 / 65

Bibliography I

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. CoRR, abs/1506.03099.

Bentivogli, L., Bisazza, A., Cettolo, M., and Federico, M. (2016). Neural versus Phrase-Based Machine Translation Quality: a Case Study. In EMNLP 2016.

Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., Jimeno Yepes, A., Koehn, P., Logacheva, V., Monz, C., Negri, M., Neveol, A., Neves, M., Popel, M., Post, M., Rubino, R., Scarton, C., Specia, L., Turchi, M., Verspoor, K., and Zampieri, M. (2016). Findings of the 2016 Conference on Machine Translation (WMT16). In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers.

Cheng, Y., Shen, S., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. (2015). Agreement-based Joint Training for Bidirectional Attention-based Neural Machine Translation. CoRR, abs/1512.04650.

Cho, K. (2015). Natural Language Understanding with Distributed Representation. CoRR, abs/1511.07916.

Cho, K., Courville, A., and Bengio, Y. (2015). Describing Multimedia Content using Attention-based Encoder-Decoder Networks.

Rico Sennrich Neural Machine Translation 67 / 65

Bibliography II

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chung, J., Cho, K., and Bengio, Y. (2016). A Character-level Decoder without Explicit Segmentation for Neural Machine Translation. CoRR, abs/1603.06147.

Cohn, T., Hoang, C. D. V., Vymolova, E., Yao, K., Dyer, C., and Haffari, G. (2016). Incorporating Structural Alignment Biases into an Attentional Neural Translation Model. CoRR, abs/1601.01085.

Costa-jussà, R. M. and Fonollosa, R. J. A. (2016). Character-based Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Dong, D., Wu, H., He, W., Yu, D., and Wang, H. (2015). Multi-Task Learning for Multiple Language Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Eriguchi, A., Hashimoto, K., and Tsuruoka, Y. (2016). Tree-to-Sequence Attentional Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Rico Sennrich Neural Machine Translation 68 / 65

Bibliography III

Feng, S., Liu, S., Li, M., and Zhou, M. (2016). Implicit Distortion and Fertility Models for Attention-based Encoder-Decoder NMT Model. CoRR, abs/1601.03317.

Firat, O., Cho, K., and Bengio, Y. (2016). Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Gage, P. (1994). A New Algorithm for Data Compression. C Users Journal, 12(2):23–38.

Gülçehre, Ç., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., and Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.

Gulcehre, C., Ahn, S., Nallapati, R., Zhou, B., and Bengio, Y. (2016). Pointing the Unknown Words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Ha, T.-L., Niehues, J., Cho, E., Mediani, M., and Waibel, A. (2015). The KIT translation systems for IWSLT 2015. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).

Rico Sennrich Neural Machine Translation 69 / 65

Bibliography IV

Haddow, B., Huck, M., Birch, A., Bogoychev, N., and Koehn, P. (2015). The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Junczys-Dowmunt, M., Dwojak, T., and Sennrich, R. (2016). The AMU-UEDIN Submission to the WMT16 News Translation Task: Attention-based NMT Models as Feature Functions in Phrase-based SMT. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers.

Junczys-Dowmunt, M. and Grundkiewicz, R. (2016). Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. In Proceedings of the First Conference on Machine Translation.

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Rico Sennrich Neural Machine Translation 70 / 65

Bibliography V

Li, J. and Jurafsky, D. (2016). Mutual Information and Diverse Decoding Improve Neural Machine Translation. CoRR, abs/1601.00372.

Ling, W., Trancoso, I., Dyer, C., and Black, A. W. (2015). Character-based Neural Machine Translation. ArXiv e-prints.

Liu, L., Utiyama, M., Finch, A., and Sumita, E. (2016). Agreement on Target-bidirectional Neural Machine Translation. In NAACL HLT 2016.

Luong, M., Le, Q. V., Sutskever, I., Vinyals, O., and Kaiser, L. (2016). Multi-task Sequence to Sequence Learning. In ICLR 2016.

Luong, M.-T. and Manning, C. D. (2015). Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of the International Workshop on Spoken Language Translation 2015.

Luong, M.-T. and Manning, C. D. (2016). Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Luong, T., Pham, H., and Manning, C. D. (2015a). Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Rico Sennrich Neural Machine Translation 71 / 65

Bibliography VI

Luong, T., Sutskever, I., Le, Q., Vinyals, O., and Zaremba, W. (2015b). Addressing the Rare Word Problem in Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Mi, H., Sankaran, B., Wang, Z., and Ittycheriah, A. (2016). A Coverage Embedding Model for Neural Machine Translation. ArXiv e-prints.

Neubig, G., Morishita, M., and Nakamura, S. (2015). Neural Reranking Improves Subjective Quality of Machine Translation: NAIST at WAT2015. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015).

Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2016). Sequence Level Training with Recurrent Neural Networks. In ICLR 2016.

Sennrich, R. and Haddow, B. (2016). Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, Volume 1: Research Papers.

Sennrich, R., Haddow, B., and Birch, A. (2016a). Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers.

Rico Sennrich Neural Machine Translation 72 / 65

Bibliography VII

Sennrich, R., Haddow, B., and Birch, A. (2016b). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Sennrich, R., Haddow, B., and Birch, A. (2016c). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. (2016). Minimum Risk Training for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014).

Tu, Z., Liu, Y., Lu, Z., Liu, X., and Li, H. (2016). Context Gates for Neural Machine Translation. ArXiv e-prints.

Tu, Z., Lu, Z., Liu, Y., Liu, X., and Li, H. (2016a). Coverage-based Neural Machine Translation. CoRR, abs/1601.04811.

Rico Sennrich Neural Machine Translation 73 / 65

Bibliography VIII

Tu, Z., Lu, Z., Liu, Y., Liu, X., and Li, H. (2016b). Modeling Coverage for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Zhou, J., Cao, Y., Wang, X., Li, P., and Xu, W. (2016). Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation. Transactions of the Association for Computational Linguistics, 4:371–383.

Zoph, B. and Knight, K. (2016). Multi-Source Neural Translation. In NAACL HLT 2016.

Rico Sennrich Neural Machine Translation 74 / 65