
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1198–1212, Florence, Italy, July 28 – August 2, 2019. ©2019 Association for Computational Linguistics


When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion

Elena Voita¹,²  Rico Sennrich³,⁴  Ivan Titov³,²

¹Yandex, Russia  ²University of Amsterdam, Netherlands  ³University of Edinburgh, Scotland  ⁴University of Zurich, Switzerland

[email protected]  [email protected]  [email protected]

Abstract

Though machine translation errors caused by the lack of context beyond one sentence have long been acknowledged, the development of context-aware NMT systems is hampered by several problems. Firstly, standard metrics are not sensitive to improvements in consistency in document-level translations. Secondly, previous work on context-aware NMT assumed that the sentence-aligned parallel data consisted of complete documents while in most practical scenarios such document-level data constitutes only a fraction of the available parallel data. To address the first issue, we perform a human study on an English-Russian subtitles dataset and identify deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We then create test sets targeting these phenomena. To address the second shortcoming, we consider a set-up in which a much larger amount of sentence-level data is available compared to that aligned at the document level. We introduce a model that is suitable for this scenario and demonstrate major gains over a context-agnostic baseline on our new benchmarks without sacrificing performance as measured with BLEU.¹

1 Introduction

With the recent rapid progress of neural machine translation (NMT), translation mistakes and inconsistencies due to the lack of extra-sentential context are becoming more and more noticeable among otherwise adequate translations produced by standard context-agnostic NMT systems (Läubli et al., 2018). Though this problem has recently triggered a lot of attention to context-aware translation (Jean et al., 2017a; Wang et al., 2017; Tiedemann and Scherrer, 2017; Bawden

¹ We release code and data sets at https://github.com/lena-voita/good-translation-wrong-in-context.

et al., 2018; Voita et al., 2018; Maruf and Haffari, 2018; Agrawal et al., 2018; Miculicich et al., 2018; Zhang et al., 2018), the progress and widespread adoption of the new paradigm is hampered by several important problems. Firstly, it is highly non-trivial to design metrics which would reliably trace the progress and guide model design. Standard machine translation metrics (e.g., BLEU) do not appear appropriate as they do not sufficiently differentiate between consistent and inconsistent translations (Wong and Kit, 2012).² For example, if multiple translations of a name are possible, forcing consistency is essentially as likely to make all occurrences of the name match the reference translation as making them all different from the reference. Second, most previous work on context-aware NMT has made the assumption that all the bilingual data is available at the document level. However, isolated parallel sentences are a lot easier to acquire and hence only a fraction of the parallel data will be at the document level in any practical scenario. In other words, a context-aware model trained only on document-level parallel data is highly unlikely to outperform a context-agnostic model estimated from a much larger sentence-level parallel corpus. This work aims to address both these shortcomings.

A context-agnostic NMT system would often produce plausible translations of isolated sentences; however, when put together in a document, these translations end up being inconsistent with each other. We investigate which linguistic phenomena cause the inconsistencies using the OpenSubtitles (Lison et al., 2018) corpus for the English-Russian language pair. We identify deixis, ellipsis and lexical cohesion as three

² We use the term ‘inconsistency’ to refer to any violations causing good translations of isolated sentences not to work together, independently of which linguistic phenomena (e.g., ellipsis or lexical cohesion) impose the violated constraints.


main sources of the violations, together amounting to about 80% of the cases. We create test sets focusing specifically on the three identified phenomena (6000 examples in total).

We show that by using a limited amount of document-level parallel data, we can already achieve substantial improvements on these benchmarks without negatively affecting performance as measured with BLEU. Our approach is inspired by the Deliberation Networks (Xia et al., 2017). In our method, the initial translation produced by a baseline context-agnostic model is refined by a context-aware system which is trained on a small document-level subset of parallel data.

The key contributions are as follows:

• we analyze which phenomena cause context-agnostic translations to be inconsistent with each other;

• we create test sets specifically addressing the most frequent phenomena;

• we consider a novel and realistic set-up where a much larger amount of sentence-level data is available compared to that aligned at the document level;

• we introduce a model suitable for this scenario, and demonstrate that it is effective on our new benchmarks without sacrificing performance as measured with BLEU.

2 Analysis

We begin with a human study, in which we:

1. identify cases when good sentence-level translations are not good when placed in context of each other,

2. categorize these examples according to the phenomena leading to a discrepancy in translations of consecutive sentences.

The test sets introduced in Section 3 will then target the most frequent phenomena.

2.1 Human annotation

To find what makes good context-agnostic translations incorrect when placed in context of each other, we start with pairs of consecutive sentences. We gather data with context from the publicly available OpenSubtitles2018 corpus (Lison et al.,

              all     one/both bad   both good, bad pair   both good, good pair
count         2000    211            140                   1649
percentage    100%    11%            7%                    82%

Table 1: Human annotation statistics of pairs of consecutive translations.

2018) for English and Russian. We train a context-agnostic Transformer on 6m sentence pairs. Then we translate 2000 pairs of consecutive sentences using this model. For more details on model training and data preprocessing, see Section 5.3.

Then we use human annotation to assess the adequacy of the translations without context and in the context of each other. The whole process is two-stage:

1. sentence-level evaluation: we ask if the translation of a given sentence is good,

2. evaluation in context: for pairs of consecutive good translations according to the first stage, we ask if the translations are good in context of each other.

In the first stage, the annotators are instructed to mark as “good” translations which (i) are fluent sentences in the target language (in our case, Russian) and (ii) can be reasonable translations of a source sentence in some context.

For the second stage we only consider pairs of sentences with good sentence-level translations. The annotators are instructed to mark translations as bad in context of each other only if there is no other possible interpretation or additional context which could have made them appropriate. This was done to get more robust results, avoiding the influence of personal preferences of the annotators (for example, for using formal or informal speech), and excluding ambiguous cases that can only be resolved with additional context.

The statistics of answers are provided in Table 1. We find that our annotators labelled 82% of sentence pairs as good translations. In 11% of cases, at least one translation was considered bad at the sentence level, and in another 7%, the sentences were considered individually good, but bad in context of each other. This indicates that in our setting, a substantial proportion of translation errors are only recognized as such in context.


type of phenomena     frequency
deixis                37%
ellipsis              29%
lexical cohesion      14%
ambiguity             9%
anaphora              6%
other                 5%

Table 2: Types of phenomena causing discrepancy in context-agnostic translation of consecutive sentences when placed in the context of each other.

type of discrepancy          frequency
T-V distinction              67%
speaker/addressee gender:
  same speaker               22%
  different speaker          9%
other                        2%

Table 3: Types of discrepancy in context-agnostic translation caused by deixis (excluding anaphora).

2.2 Types of phenomena

From the results of the human annotation, we take all instances of consecutive sentences with good translations which become incorrect when placed in the context of each other. For each, we identify the language phenomenon which caused a discrepancy. The results are provided in Table 2.

Below we discuss these types of phenomena, as well as problems in translation they cause, in more detail. In the scope of the current work, we concentrate only on the three most frequent phenomena.

2.2.1 Deixis

In this category, we group several types of deictic words or phrases, i.e. referential expressions whose denotation depends on context. This includes personal deixis (“I”, “you”), place deixis (“here”, “there”), and discourse deixis, where parts of the discourse are referenced (“that’s a good question.”). Most errors in our annotated corpus are related to person deixis, specifically gender marking in the Russian translation, and the T-V distinction between informal and formal you (Latin “tu” and “vos”).

In many cases, even when having access to neighboring sentences, one cannot make a confident decision which of the forms should be used, as there are no obvious markers pointing to one form or another (e.g., for the T-V distinction, words such as “officer”, “mister” for formal and “honey”, “dude” for informal). However, when

(a) EN: We haven’t really spoken much since your return. Tell me, what’s on your mind these days?
    RU: Мы не разговаривали с тех пор, как вы вернулись. Скажи мне, что у тебя на уме в последнее время?
    RU: My ne razgovarivali s tekh por, kak vy vernulis’. Skazhi mne, chto u tebya na ume v posledneye vremya?

(b) EN: I didn’t come to Simon’s for you. I did that for me.
    RU: Я пришла к Саймону не ради тебя. Я сделал это для себя.
    RU: Ya prishla k Saymonu ne radi tebya. Ya sdelal eto dlya sebya.

Figure 1: Examples of violation of (a) T-V form consistency, (b) speaker gender consistency. In color in the original figure: (a) red – V-form, blue – T-form; (b) red – feminine, blue – masculine.

pronouns refer to the same person, the pronouns, as well as verbs that agree with them, should be translated using the same form. See Figure 1(a) for an example translation that violates T-V consistency. Figure 1(b) shows an example of inconsistent first person gender (marked on the verb), although the speaker is clearly the same.

Anaphora is a form of deixis that has received a lot of attention in MT research, both from the perspective of modelling (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Jean et al., 2017b; Bawden et al., 2018; Voita et al., 2018, among others) and targeted evaluation (Hardmeier et al., 2015; Guillou and Hardmeier, 2016; Müller et al., 2018); we therefore list anaphora errors separately and do not focus on them further.

2.2.2 Ellipsis

Ellipsis is the omission from a clause of one or more words that are nevertheless understood in the context of the remaining elements.

In machine translation, elliptical constructions in the source language pose a problem if the target language does not allow the same types of ellipsis (requiring the elided material to be predicted from context), or if the elided material affects the syntax of the sentence; for example, the grammatical function of a noun phrase and thus its inflection in Russian may depend on the elided verb (Figure 2(a)), or the verb inflection may depend on the


type of discrepancy          frequency
wrong morphological form     66%
wrong verb (VP-ellipsis)     20%
other error                  14%

Table 4: Types of discrepancy in context-agnostic translation caused by ellipsis.

(a) EN: You call her your friend but have you been to her home? Her work?
    RU: Ты называешь её своей подругой, но ты был у неё дома? Её работа?
    RU: Ty nazyvayesh’ yeyo svoyey podrugoy, no ty byl u neye doma? Yeyo rabota?

(b) EN: Veronica, thank you, but you saw what happened. We all did.
    RU: Вероника, спасибо, но ты видела, что произошло. Мы все хотели.
    RU: Veronika, spasibo, no ty videla, chto proizoshlo. My vse khoteli.

Figure 2: Examples of discrepancies caused by ellipsis. (a) wrong morphological form, incorrectly marking the noun phrase as a subject. (b) the correct meaning is “see”, but MT produces хотели khoteli (“want”).

elided subject. Our analysis focuses on ellipses that can only be understood and translated with context beyond the sentence level. This has not been studied extensively in MT research.³

We classified ellipsis examples which lead to errors in sentence-level translations by the type of error they cause. Results are provided in Table 4.

It can be seen that the most frequent problems related to ellipsis that we find in our annotated corpus are wrong morphological forms, followed by wrongly predicted verbs in the case of verb phrase ellipsis in English, which does not exist in Russian, thus requiring the prediction of the verb in the Russian translation (Figure 2(b)).

2.2.3 Lexical cohesion

Lexical cohesion has been studied previously in MT (Tiedemann, 2010; Gong et al., 2011; Wong and Kit, 2012; Kuang et al., 2018; Miculicich et al., 2018, among others).

There are various cohesion devices (Morris and Hirst, 1991), and a good translation should exhibit lexical cohesion beyond the sentence level. We

³ Exceptions include (Yamamoto and Sumita, 1998), and work on the related phenomenon of pronoun dropping (Russo et al., 2012; Wang et al., 2016; Rios and Tuggener, 2017).

(a) EN: Not for Julia. Julia has a taste for taunting her victims.
    RU: Не для Джулии. Юлия умеет дразнить своих жертв.
    RU: Ne dlya Dzhulii. Yuliya umeyet draznit’ svoikh zhertv.

(b) EN: But that’s not what I’m talking about. I’m talking about your future.
    RU: Но я говорю не об этом. Речь о твоём будущем.
    RU: No ya govoryu ne ob etom. Rech’ o tvoyom budushchem.

Figure 3: Examples of lack of lexical cohesion in MT. (a) Name translation inconsistency. (b) Inconsistent translation. Using either of the highlighted translations consistently would be good.

focus on repetition, with two frequent cases in our annotated corpus being reiteration of named entities (Figure 3(a)) and reiteration of more general phrase types for emphasis (Figure 3(b)) or in clarification questions.

3 Test Sets

For the most frequent phenomena from the above analysis we create test sets for targeted evaluation.

Each test set contains contrastive examples. It is specifically designed to test the ability of a system to adapt to contextual information and handle the phenomenon under consideration. Each test instance consists of a true example (sequence of sentences and their reference translation from the data) and several contrastive translations which differ from the true one only in the considered aspect. All contrastive translations we use are correct plausible translations at a sentence level, and only context reveals the errors we introduce. All the test sets are guaranteed to have the necessary context in the provided sequence of 3 sentences. The system is asked to score each candidate example, and we compute the system accuracy as the proportion of times the true translation is preferred over the contrastive ones.
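To make the protocol concrete, here is a minimal sketch (not the released evaluation code) of how accuracy on such a contrastive test set can be computed; score is a hypothetical function returning a model's score for a candidate translation given the source with its context, e.g. the sum of token log-probabilities.

# Minimal sketch of contrastive evaluation (illustrative, not the authors' code).
def contrastive_accuracy(test_instances, score):
    """test_instances: list of dicts with keys
       'src'          -- source sentences (current sentence plus context),
       'true'         -- the reference (consistent) translation,
       'contrastive'  -- list of incorrect alternative translations."""
    correct = 0
    for inst in test_instances:
        true_score = score(inst["src"], inst["true"])
        # The true translation must outscore every contrastive variant.
        if all(true_score > score(inst["src"], c) for c in inst["contrastive"]):
            correct += 1
    return correct / len(test_instances)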

Test set statistics are shown in Table 5.

3.1 Deixis

From Table 3, we see that the most frequent error category related to deixis in our annotated corpus is the inconsistency of T-V forms when translating second person pronouns.


                           latest relevant context
                  total     1st     2nd     3rd
deixis            3000      1000    1000    1000
lex. cohesion     2000      855     630     515
ellipsis (infl.)  500
ellipsis (VP)     500

Table 5: Size of test sets: total number of test instances and with regard to the latest context sentence with politeness indication or with the named entity under consideration. For ellipsis, we distinguish whether the model has to predict the correct noun phrase inflection or the correct verb sense (VP ellipsis).

The test set we construct for this category tests the ability of a machine translation system to produce translations with a consistent level of politeness.

We semi-automatically identify sets of consecutive sentences with consistent politeness markers on pronouns and verbs (but without nominal markers such as “Mr.” or “officer”) and switch T and V forms. Each automatic step was followed by human postprocessing, which ensures the quality of the final test sets.⁴ This gives us two sets of translations for each example, one consistently informal (T), and one consistently formal (V). For each, we create an inconsistent contrastive example by switching the formality of the last sentence. The symmetry of the test set ensures that any context-agnostic model has 50% accuracy on the test set.

3.2 Ellipsis

From Table 4, we see that the two most frequent types of ambiguity caused by the presence of an elliptical structure are of a different nature, hence we construct individual test sets for each of them.

Ambiguity of the first type comes from the inability to predict the correct morphological form of some words. We manually gather examples with such structures in a source sentence and change the morphological inflection of the relevant target phrase to create a contrastive translation. Specifically, we focus on noun phrases where the verb is elided, and the ambiguity lies in how the noun phrase is inflected.

The second type we evaluate are verb phrase ellipses. Mostly these are sentences with an auxiliary verb “do” and an omitted main verb. We manually gather such examples and replace the translation of the verb, which is only present on the target side, with other verbs with different meaning, but

⁴ Details are provided in the appendix.

the same inflection. Verbs which are used to construct such contrastive translations are the top-10 lemmas of translations of the verb “do” which we get from the lexical table of Moses (Koehn et al., 2007) induced from the training data.

3.3 Lexical cohesion

Lexical cohesion can be established for various types of phrases and can involve reiteration or other semantic relations. In the scope of the current work, we focus on the reiteration of entities, since these tend to be non-coincidental, and can be easily detected and transformed.

We identify named entities with alternative translations into Russian, find passages where they are translated consistently, and create contrastive test examples by switching the translation of some instances of the named entity. For more details, please refer to the appendix.

4 Model and Setting

4.1 Setting

Previous work on context-aware neural machine translation used data where all training instances have context. This setting limits the set of available training sets one can use: in a typical scenario, we have a lot of sentence-level parallel data and only a small fraction of document-level data. Since machine translation quality depends heavily on the amount of training data, training a context-aware model is counterproductive if this leads to ignoring the majority of available sentence-level data and sacrificing general quality. We will also show that a naive approach to combining sentence-level and document-level data leads to a drop in performance.

In this work, we argue that it is important to consider an asymmetric setting where the amount of available document-level data is much smaller than that of sentence-level data, and propose an approach specifically targeting this scenario.

4.2 Model

We introduce a two-pass framework: first, the sentence is translated with a context-agnostic model, and then this translation is refined using the context of several previous sentences (the context includes source sentences as well as their translations). We expect this architecture to be suitable in the proposed setting: the baseline context-agnostic model can be trained on a large amount of sentence-level


Figure 4: Model architecture

data, and the second-pass model can be estimated on a smaller subset of parallel data which includes context. As the first-pass translation is produced by a strong model, we expect no loss in general performance when training the second part on a smaller dataset.

The model is close in spirit to the Deliberation networks (Xia et al., 2017). The first part of the model is a context-agnostic model (we refer to it as the base model), and the second one is a context-aware decoder (CADec) which refines context-agnostic translations using context. The base model is trained on sentence-level data and then fixed. It is used only to sample context-agnostic translations and to get vector representations of the source and translated sentences. CADec is trained only on data with context.

Let $D_{sent} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the sentence-level data with $N$ paired sentences and $D_{doc} = \{(x_j, y_j, c_j)\}_{j=1}^{M}$ denote the document-level data, where $(x_j, y_j)$ are the source and target sides of a sentence to be translated, and $c_j$ are several preceding sentences along with their translations.

Base model. For the baseline context-agnostic model we use the original Transformer-base (Vaswani et al., 2017), trained to maximize the sentence-level log-likelihood

$$\frac{1}{N} \sum_{(x_i, y_i) \in D_{sent}} \log P(y_i \mid x_i, \theta_B).$$

Context-aware decoder (CADec). The context-aware decoder is trained to correct translations given by the base model using contextual information. Namely, we maximize the following document-level log-likelihood:

$$\frac{1}{M} \sum_{(x_j, y_j) \in D_{doc}} \log \mathbb{E}_{y_j^{B} \sim P(y \mid x_j, \theta_B)} P(y_j \mid x_j, y_j^{B}, c_j, \theta_C),$$

where $y_j^{B}$ is sampled from $P(y \mid x_j, \theta_B)$.

CADec is composed of a stack of $N = 6$ identical layers and is similar to the decoder of the original Transformer. It has a masked self-attention layer and attention to encoder outputs, and additionally each layer has a block attending over the outputs of the base decoder (Figure 4). We use the states from the last layer of the base model's encoder of the current source sentence and all context sentences as input to the first multi-head attention. For the second multi-head attention we input both the last states of the base decoder and the target-side token embedding layer; this is done for translations of the source and also all context sentences. All sentence representations are produced by the base model. To encode the relative position of each sentence, we concatenate both the encoder and decoder states with one-hot vectors representing their position (0 for the source sentence, 1 for the immediately preceding one, etc.). These distance embeddings are shown in blue in Figure 4.
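To make this description concrete, the following PyTorch sketch shows one CADec layer with the three attention blocks just described. It is an illustration under our own naming and simplifications (normalization details, dropout placement and the concatenated one-hot distance features are omitted), not the authors' implementation.

import torch
import torch.nn as nn

class CADecLayer(nn.Module):
    """Illustrative sketch of one CADec layer: masked self-attention,
    attention over the base encoder states of the current and context
    sentences, and attention over base decoder states / target embeddings
    of the first-pass and context translations."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.src_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.base_dec_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, x, enc_states, base_dec_states, causal_mask):
        # enc_states / base_dec_states: base-model representations of source
        # and translations (in the paper, concatenated with one-hot sentence-
        # distance features before being fed here; omitted in this sketch).
        h, _ = self.self_attn(x, x, x, attn_mask=causal_mask)   # 1) masked self-attention
        x = self.norms[0](x + h)
        h, _ = self.src_attn(x, enc_states, enc_states)         # 2) attend to base encoder states
        x = self.norms[1](x + h)
        h, _ = self.base_dec_attn(x, base_dec_states, base_dec_states)  # 3) attend to base decoder states
        x = self.norms[2](x + h)
        return self.norms[3](x + self.ff(x))                    # position-wise feed-forward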

5 Experiments

5.1 Training

At training time, we use reference translations as translations of the previous sentences.


For the current sentence, we either sample a translation from the base model or use a corrupted version of the reference translation. We propose to stochastically mix objectives corresponding to these versions:

$$\frac{1}{M} \sum_{(x_j, y_j) \in D_{doc}} \log \big[ b_j \cdot P(y_j \mid x_j, \tilde{y}_j, c_j, \theta_C) + (1 - b_j) \cdot P(y_j \mid x_j, y_j^{B}, c_j, \theta_C) \big],$$

where $\tilde{y}_j$ is a corrupted version of the reference translation and $b_j \in \{0, 1\}$ is drawn from a Bernoulli distribution with parameter $p$, $p = 0.5$ in our experiments. Reference translations are corrupted by replacing 20% of their tokens with random tokens.

We discuss the importance of the proposed training strategy, as well as the effect of varying the value of p, in Section 6.5.

5.2 Inference

As input to CADec for the current sentence, we use the translation produced by the base model. Target sides of the previous sentences are produced by our two-stage approach for those sentences which have context and with the base model for those which do not. We use beam search with a beam of 4 for all models.
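The two-pass inference procedure over a document can be summarized with the following sketch; base_translate and cadec_refine are placeholders for decoding with the base model and with CADec, respectively.

def translate_document(sentences, base_translate, cadec_refine, max_ctx=3):
    """Two-pass inference (illustrative sketch).
    base_translate(src) -> context-agnostic first-pass translation;
    cadec_refine(src, first_pass, ctx_src, ctx_tgt) -> refined translation."""
    outputs = []
    for i, src in enumerate(sentences):
        first_pass = base_translate(src)
        ctx_src = sentences[max(0, i - max_ctx):i]
        ctx_tgt = outputs[max(0, i - max_ctx):i]   # already-produced previous translations
        if ctx_src:                                # refine only when context is available
            outputs.append(cadec_refine(src, first_pass, ctx_src, ctx_tgt))
        else:
            outputs.append(first_pass)
    return outputs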

5.3 Data and setting

We use the publicly available OpenSubtitles2018 corpus (Lison et al., 2018) for English and Russian. As described in detail in the appendix, we apply data cleaning after which only a fraction of the data has context of several previous sentences. We use up to 3 context sentences in this work. We randomly choose 6 million training instances from the resulting data, among which 1.5m have context of three sentences. We randomly choose two subsets of 10k instances for development and testing and construct our contrastive test sets from 400k held-out instances from movies not encountered in training. The hyperparameters, preprocessing and training details are provided in the supplementary material.

6 Results

We evaluate in two different ways: using BLEU for general quality and the proposed contrastive test sets for consistency. We show that models indistinguishable with BLEU can be very different in terms of consistency.

We randomly choose 500 out of 2000 examples from the lexical cohesion set and 500 out of 3000 from the deixis test set for validation and leave the rest for final testing. We compute BLEU on the development set as well as scores on the lexical cohesion and deixis development sets. We use convergence in both metrics to decide when to stop training. The importance of using both criteria is discussed in Section 6.4. After the convergence, we average 5 checkpoints and report scores on the final test sets.

6.1 Baselines

We consider three baselines.

baseline: The context-agnostic baseline is Transformer-base trained on all sentence-level data. Recall that it is also used as the base model in our 2-stage approach.

concat: The first context-aware baseline is a simple concatenation model. It is trained on 6m sentence pairs, including 1.5m having 3 context sentences. For the concatenation baseline, we use a special token separating sentences (both on the source and target side).

s-hier-to-2.tied: This is the version of the model s-hier-to-2 introduced by Bawden et al. (2018), where the parameters between encoders are shared (Müller et al., 2018). The model has an additional encoder for source context, whereas the target side of the corpus is concatenated, in the same way as for the concatenation baseline. Since the model is suitable only for one context sentence, it is trained on 6m sentence pairs, including 1.5m having one context sentence. We chose s-hier-to-2.tied as our second context-aware baseline because it also uses context on the target side and performed best in a contrastive evaluation of pronoun translation (Müller et al., 2018).

6.2 General results

BLEU scores for our model and the baselines are given in Table 6.⁵ For context-aware models, all sentences in a group were translated, and then only the current sentence is evaluated. We also report BLEU for the context-agnostic baseline trained only on the 1.5m dataset to show how the performance is influenced by the amount of data.

We observe that our model is no worse in BLEU than the baseline despite the second-pass model

⁵ We use bootstrap resampling (Koehn, 2004) for significance testing.


model               BLEU
baseline (1.5m)     29.10
baseline (6m)       32.40
concat              31.56
s-hier-to-2.tied    26.68
CADec               32.38

Table 6: BLEU scores. CADec trained with p = 0.5. Scores for CADec are not statistically different from the baseline (6m).

being trained only on a fraction of the data. In contrast, the concatenation baseline, trained on a mixture of data with and without context, is about 1 BLEU below the context-agnostic baseline and our model when using all 3 context sentences. CADec's performance remains the same independently of the number of context sentences (1, 2 or 3) as measured with BLEU.

s-hier-to-2.tied performs worst in terms of BLEU, but note that this is a shallow recurrent model, while the others are Transformer-based. It also suffers from the asymmetric data setting, like the concatenation baseline.
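The significance test mentioned in footnote 5 is bootstrap resampling (Koehn, 2004); a minimal paired sketch, with corpus_bleu left as a placeholder for any corpus-level BLEU implementation, is given below.

import random

def paired_bootstrap(sys_a, sys_b, refs, corpus_bleu, n_samples=1000):
    """Paired bootstrap resampling (Koehn, 2004) -- illustrative sketch.
    sys_a, sys_b: lists of hypothesis sentences from two systems;
    refs: list of reference sentences; corpus_bleu(hyps, refs) -> BLEU."""
    ids = list(range(len(refs)))
    a_wins = 0
    for _ in range(n_samples):
        sample = [random.choice(ids) for _ in ids]   # resample sentences with replacement
        bleu_a = corpus_bleu([sys_a[i] for i in sample], [refs[i] for i in sample])
        bleu_b = corpus_bleu([sys_b[i] for i in sample], [refs[i] for i in sample])
        if bleu_a > bleu_b:
            a_wins += 1
    return a_wins / n_samples   # fraction of resamples in which system A scores higher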

6.3 Consistency results

Scores on the deixis, cohesion and ellipsis test sets are provided in Tables 7 and 8. For all tasks, we observe a large improvement from using context. For deixis, the concatenation model (concat) and CADec improve over the baseline by 33.5 and 31.6 percentage points, respectively. On the lexical cohesion test set, CADec shows a large improvement over the context-agnostic baseline (12.2 percentage points), while concat performs similarly to the baseline. For ellipsis, both models improve substantially over the baseline (by 19–51 percentage points), with concat stronger for inflection tasks and CADec stronger for VP-ellipsis. Despite its low BLEU score, s-hier-to-2.tied also shows clear improvements over the context-agnostic baseline in terms of consistency, but underperforms both the concatenation model and CADec, which is unsurprising given that it uses only one context sentence. When looking only at the scores where the latest relevant context is in the model's context window (column 2 in Table 7), s-hier-to-2.tied outperforms the concatenation baseline for lexical cohesion, but remains behind the performance of CADec.

The proposed test sets let us distinguish models

                               latest relevant context
                      total     1st      2nd      3rd
deixis
  baseline            50.0      50.0     50.0     50.0
  concat              83.5      88.8     85.6     76.4
  s-hier-to-2.tied    60.9      83.0     50.1     50.0
  CADec               81.6      84.6     84.4     75.9
lexical cohesion
  baseline            45.9      46.1     45.9     45.4
  concat              47.5      48.6     46.7     46.7
  s-hier-to-2.tied    48.9      53.0     46.1     45.4
  CADec               58.1      63.2     52.0     56.7

Table 7: Accuracy for deixis and lexical cohesion.

                      ellipsis (infl.)   ellipsis (VP)
baseline              53.0               28.4
concat                76.2               76.6
s-hier-to-2.tied      66.4               65.6
CADec                 72.2               80.0

Table 8: Accuracy on ellipsis test set.

Figure 5: BLEU and lexical cohesion accuracy on the development sets during CADec training.

which are otherwise identical in terms of BLEU: the performance of the baseline and CADec is the same when measured with BLEU, but very different in terms of handling contextual phenomena.

6.4 Context-aware stopping criteria

Figure 5 shows that for context-aware models, BLEU is not sufficient as a criterion for stopping: even when a model has converged in terms of BLEU, it continues to improve in terms of consistency. For CADec trained with p = 0.5, the BLEU score has stabilized after 40k batches, but the lexical cohesion score continues to grow.


p         BLEU     deixis    lex. c.    ellipsis
p=0       32.34    84.1      48.7       65 / 75
p=0.25    32.31    83.3      52.4       67 / 78
p=0.5     32.38    81.6      58.1       72 / 80
p=0.75    32.45    80.0      65.0       70 / 80

Table 9: Results for different probabilities of using a corrupted reference at training time. BLEU for 3 context sentences. For ellipsis, we show inflection/VP scores.

6.5 Ablation: using corrupted reference

At training time, CADec uses either a translation sampled from the base model or a corrupted reference translation as the first-pass translation of the current sentence. The purpose of using a corrupted reference instead of just sampling is to teach CADec to rely on the base translation and not to change it much. In this section, we discuss the importance of the proposed training strategy.

Results for different values of p are given in Table 9. All models have about the same BLEU, not statistically significantly different from the baseline, but they are quite different in terms of incorporating context. The denoising positively influences almost all tasks except for deixis, yielding the largest improvement on lexical cohesion.

7 Additional Related Work

In concurrent work, Xiong et al. (2018) also propose a two-pass context-aware translation model inspired by deliberation networks. However, while they consider a symmetric data scenario where all available training data has document-level context, and train all components jointly on this data, we focus on an asymmetric scenario where we have a large amount of sentence-level data, used to train our first-pass model, and a smaller amount of document-level data, used to train our second-pass decoder, keeping the first-pass model fixed.

Automatic evaluation of the discourse phenomena we consider is challenging. For lexical cohesion, Wong and Kit (2012) count the ratio between the number of repeated and lexically similar content words over the total number of content words in a target document. However, Guillou (2013) and Carpuat and Simard (2012) find that translations generated by a machine translation system tend to be similarly or more lexically consistent, as measured by a similar metric, than human ones. This even holds for sentence-level systems, where the increased consistency is not due to improved cohesion, but accidental: Ott et al. (2018) show that beam search introduces a bias towards frequent words, which could be one factor explaining this finding. This means that a higher repetition rate does not mean that a translation system is in fact more cohesive, and we find that even our baseline is more repetitive than the human reference.

8 Conclusions

We analyze which phenomena cause otherwise good context-agnostic translations to be inconsistent when placed in the context of each other. Our human study on an English–Russian dataset identifies deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We create test sets focusing specifically on the identified phenomena.

We consider a novel and realistic set-up where a much larger amount of sentence-level data is available compared to that aligned at the document level and introduce a model suitable for this scenario. We show that our model effectively handles contextual phenomena without sacrificing general quality as measured with BLEU despite using only a small amount of document-level data, while a naive approach to combining sentence-level and document-level data leads to a drop in performance. We show that the proposed test sets allow us to distinguish models (even though identical in BLEU) in terms of their consistency. To build context-aware machine translation systems, such targeted test sets should prove useful, for validation, early stopping and for model selection.

Acknowledgments

We would like to thank the anonymous reviewers for their comments and Ekaterina Enikeeva for the help with initial phenomena classification. The authors also thank the Yandex Machine Translation team for helpful discussions and inspiration. Ivan Titov acknowledges support of the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518). Rico Sennrich acknowledges support from the Swiss National Science Foundation (105212_169888), the European Union's Horizon 2020 research and innovation programme (grant agreement no 825460), and the Royal Society (NAF\R1\180122).


References

Ruchit Agrawal, Marco Turchi, and Matteo Negri. 2018. Contextual Handling in Neural Machine Translation: Look Behind, Ahead and on Both Sides.

Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating Discourse Phenomena in Neural Machine Translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313, New Orleans, USA. Association for Computational Linguistics.

Marine Carpuat and Michel Simard. 2012. The trouble with SMT consistency. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 442–449, Montréal, Canada. Association for Computational Linguistics.

Zhengxian Gong, Min Zhang, and Guodong Zhou. 2011. Cache-based document-level statistical machine translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 909–919, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Liane Guillou. 2013. Analysing lexical consistency in translation. In Proceedings of the Workshop on Discourse in Machine Translation, pages 10–18, Sofia, Bulgaria. Association for Computational Linguistics.

Liane Guillou and Christian Hardmeier. 2016. PROTEST: A test suite for evaluating pronouns in machine translation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Christian Hardmeier and Marcello Federico. 2010. Modelling Pronominal Anaphora in Statistical Machine Translation. In Proceedings of the Seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289.

Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, and Mauro Cettolo. 2015. Pronoun-focused MT and cross-lingual pronoun prediction: Findings of the 2015 DiscoMT shared task on pronoun translation. In Proceedings of the Second Workshop on Discourse in Machine Translation, pages 1–16. Association for Computational Linguistics.

Sebastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. 2017a. Does Neural Machine Translation Benefit from Larger Context? arXiv:1704.05135.

Sébastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. 2017b. Neural machine translation for cross-lingual pronoun prediction. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 54–57. Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR 2015).

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brook Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Mikhail Korobov. 2015. Morphological analyzer and generator for Russian and Ukrainian languages. In Analysis of Images, Social Networks and Texts, volume 542 of Communications in Computer and Information Science, pages 320–332. Springer International Publishing.

Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. 2018. Modeling coherence for neural machine translation with dynamic and topic caches. In Proceedings of the 27th International Conference on Computational Linguistics, pages 596–606. Association for Computational Linguistics.

Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796. Association for Computational Linguistics.

Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden. Association for Computational Linguistics.

Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Sameen Maruf and Gholamreza Haffari. 2018. Document context neural machine translation with memory networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1275–1284, Melbourne, Australia. Association for Computational Linguistics.

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954, Brussels, Belgium. Association for Computational Linguistics.

Jane Morris and Graeme Hirst. 1991. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17:21–48.

Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A Large-Scale Test Set for the Evaluation of Context-Aware Pronoun Translation in Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 61–72, Brussels, Belgium. Association for Computational Linguistics.

Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018. Analyzing uncertainty in neural machine translation. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 3953–3962. JMLR.org.

Martin Popel and Ondrej Bojar. 2018. Training Tips for the Transformer Model. Pages 43–70.

Annette Rios and Don Tuggener. 2017. Co-reference resolution of elided subjects and possessive pronouns in Spanish-English statistical machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 657–662, Valencia, Spain. Association for Computational Linguistics.

Lorenza Russo, Sharid Loáiciga, and Asheesh Gulati. 2012. Improving machine translation of null subjects in Italian and Spanish. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 81–89, Avignon, France. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 8–15, Uppsala, Sweden. Association for Computational Linguistics.

Jörg Tiedemann and Yves Scherrer. 2017. Neural Machine Translation with Extended Context. In Proceedings of the Third Workshop on Discourse in Machine Translation, DISCOMT'17, pages 82–92, Copenhagen, Denmark. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, Los Angeles.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274, Melbourne, Australia. Association for Computational Linguistics.

Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting Cross-Sentence Context for Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP'17, pages 2816–2821, Copenhagen, Denmark. Association for Computational Linguistics.

Longyue Wang, Zhaopeng Tu, Xiaojun Zhang, Hang Li, Andy Way, and Qun Liu. 2016. A novel approach to dropped pronoun translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 983–993, San Diego, California. Association for Computational Linguistics.

Billy T. M. Wong and Chunyu Kit. 2012. Extending machine translation evaluation metrics with lexical cohesion to document level. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1060–1068, Jeju Island, Korea. Association for Computational Linguistics.

Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In NIPS, Los Angeles.

Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. 2018. Modeling Coherence for Discourse Neural Machine Translation. arXiv:1811.05683.

Kazuhide Yamamoto and Eiichiro Sumita. 1998. Feasibility study for ellipsis resolution in dialogues by machine-learning technique. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2.

Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018. Improving the transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 533–542, Brussels, Belgium. Association for Computational Linguistics.

A Protocols for test sets

In this section we describe the process of constructing the test suites.

A.1 Deixis

The English second person pronoun “you” may have three different interpretations important when translating into Russian: the second person singular informal (T form), the second person singular formal (V form) and the second person plural (there is no T-V distinction for the plural form of second person pronouns).

Morphological forms for the second person singular (V form) and the second person plural pronoun are the same; that is why, to automatically identify examples in the second person polite form, we look for morphological forms corresponding to second person plural pronouns.

To derive morphological tags for Russian, we use the publicly available pymorphy2⁶ (Korobov, 2015).

Below, all the steps performed to obtain the test suite are described in detail.

A.1.1 Automatic identification of politeness

For each sentence we try to automatically find indications of the use of the T or V form. The presence of the following words and morphological forms is used as an indication of the usage of T/V forms:

1. second person singular or plural pronoun,

2. verb in a form corresponding to a second person singular/plural pronoun,

3. verbs in imperative form,

4. possessive forms of second person pronouns.

For 1–3 we used morphological tags predicted by pymorphy2; for 4 we used hand-crafted lists of forms of second person pronouns, because pymorphy2 fails to identify them.
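As an illustration of steps 1–3, the following sketch (not the authors' scripts) flags second-person pronouns, second-person verb forms and imperatives with pymorphy2; the grammeme names come from its OpenCorpora tagset.

import pymorphy2  # https://github.com/kmike/pymorphy2

morph = pymorphy2.MorphAnalyzer()

def politeness_indicators(sentence_tokens):
    """Rough sketch of steps 1-3: flag second-person pronouns, second-person
    verb forms and imperatives using pymorphy2 grammemes."""
    hits = []
    for tok in sentence_tokens:
        tag = morph.parse(tok)[0].tag            # most probable analysis
        if tag.POS == 'NPRO' and tag.person == '2per':
            hits.append((tok, 'pronoun', tag.number))    # 'sing' ~ T, 'plur' ~ V/plural
        elif tag.POS == 'VERB' and tag.person == '2per':
            hits.append((tok, 'verb', tag.number))
        elif tag.POS == 'VERB' and tag.mood == 'impr':
            hits.append((tok, 'imperative', tag.number))
    return hits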

⁶ https://github.com/kmike/pymorphy2

A.1.2 Human postprocessing of identification of politeness

After examples with an indication of the usage of the T/V form are extracted automatically, we manually filter out examples where

1. second person plural form corresponds to a plural pronoun, not the V form,

2. there is a clear indication of politeness.

The first rule is needed as morphological forms for second person plural and second person singular V form pronouns and related verbs are the same, and there is no simple and reliable way to distinguish these two automatically.

The second rule is to exclude cases where there is only one appropriate level of politeness according to the relation between the speaker and the listener. Such markers include “Mr.”, “Mrs.”, “officer”, “your honour” and “sir”. For the impolite form, these include terms denoting family relationship (“mom”, “dad”), terms of endearment (“honey”, “sweetie”) and words like “dude” and “pal”.

A.1.3 Automatic change of politeness

To construct contrastive examples aiming to test the ability of a system to produce translations with a consistent level of politeness, we have to produce an alternative translation by switching the formality of the reference translation. First, we do it automatically:

1. change the grammatical number of second person pronouns, verbs, imperative verbs,

2. change the grammatical number of possessive pronouns.

For the first transformation we use pymorphy2; for the second we use manual lists of possessive second person pronouns, because pymorphy2 cannot change them automatically.
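The automatic number switch can be sketched with pymorphy2's inflect method, which returns None when it cannot produce the requested form; these are exactly the cases that required the manual postprocessing described in A.1.4. The helper below is illustrative, not the authors' code.

import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def switch_number(token, target_number):
    """Try to re-inflect a pronoun/verb to 'sing' or 'plur' (sketch).
    Returns the original token when pymorphy2 cannot produce the form."""
    parse = morph.parse(token)[0]
    inflected = parse.inflect({target_number})   # e.g. {'sing'} or {'plur'}
    return inflected.word if inflected is not None else token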

A.1.4 Human postprocessing of automatic change of politeness

We manually correct the translations from the previous step. Mistakes of the described automatic change of politeness happen because of:

1. ambiguity arising when imperative and indicative verb forms are the same,


2. inability of pymorphy2 to inflect the singular number to some verb forms (e.g., to inflect singular number to past tense verbs),

3. presence of related adjectives, which have to agree with the pronoun,

4. ambiguity arising when a plural form of a pronoun may have different singular forms.

A.1.5 Human annotation: are both polite and impolite versions appropriate?

After the four previous steps, we have text fragments of several consecutive sentences with a consistent level of politeness. Each fragment uses second person singular pronouns, either T form or V form, without nominal markers indicating which of the forms is the only one appropriate. For each group we have both the original version, and the version with the switched formality.

To control for the appropriateness of both levels of politeness in the context of a whole text fragment, we conduct a human annotation. Namely, humans are given both versions of the same text fragment corresponding to different levels of politeness, and asked if these versions are natural. The answers they can pick are the following:

1. both appropriate,

2. polite version is not appropriate,

3. impolite version is not appropriate,

4. both versions are bad.

The annotators are not given any specific guidelines, and asked to answer according to their intuition as a native speaker of the language (Russian).

There is a small number of examples (4%) where one of the versions is not appropriate or not as natural as the other one. Cases where annotators claimed both versions to be bad come from mistakes in the target translations: the OpenSubtitles data is not perfect, and target sides contain translations which are not reasonable sentences in Russian. These account for 1.5% of all examples. We do not include these 5.5% of examples in the resulting test sets.

A.2 Lexical cohesion

The process of creating the lexical cohesion test set consists of several stages:

1. find passages where named entities are translated consistently,

2. extract alternative translations for these named entities from the lexical table of Moses (Koehn et al., 2007) induced from the training data,

3. construct alternative translations of each example by switching the translation of instances of the named entity,

4. for each example construct several test instances.

A.2.1 Identification of examples with consistent translations

We look for infrequent words that are translated consistently in a text fragment. Since the target language has rich morphology, to verify that translations are the same we have to use lemmas of the translations. More precisely, we take the following steps (a sketch of the lemma-based check is given after the list):

1. train the Berkeley aligner on about 6.5m sentence pairs from both training and held-out data,

2. find lemmas of all words in the reference translations in the held-out data using pymorphy2,

3. find words in the source which are not among the 5000 most frequent words in our vocabulary and whose translations have the same lemma.
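Assuming the word alignment from step 1 is available as (source word, target word) pairs for each fragment, steps 2–3 can be sketched as follows (an illustration, not the authors' scripts):

import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def lemma(word):
    return morph.parse(word)[0].normal_form

def consistently_translated(aligned_fragment, frequent_source_words):
    """aligned_fragment: list of (source_word, target_word) pairs for one
    multi-sentence fragment, taken from the word alignment (sketch).
    Returns rare source words that occur repeatedly and whose translations
    all share a single lemma."""
    occurrences = {}
    for src, tgt in aligned_fragment:
        if src.lower() not in frequent_source_words:    # skip the 5000 most frequent words
            occurrences.setdefault(src.lower(), []).append(lemma(tgt))
    return [w for w, lemmas in occurrences.items()
            if len(lemmas) > 1 and len(set(lemmas)) == 1]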

A.2.2 Finding alternative translations

For the words under consideration, we find alternative translations which would be (i) equally appropriate in the context of the remaining sentence and text fragment and (ii) possible for the model to produce. To address the first point, we focus on named entities, and we assume that all translations of a given named entity seen in the training data are appropriate. To address the second point, we choose alternative translations from the reference translations encountered in the training data, and pick only ones with a probability of at least 10%.

The sequence of actions is as follows:

1. train Moses on the training data (6m sentence pairs),

2. for each word under consideration (from A.2.1), get possible translations from the lexical table of Moses,


3. group possible translations by their lemma using pymorphy2,

4. if a lemma has a probability of at least 10%, consider this lemma as a possible translation for the word under consideration (see the sketch after this list),

5. leave only examples with the word under consideration having several alternative translations.
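Steps 3–4 can be sketched as follows, assuming the candidate translation probabilities for a given source word have already been read from the Moses lexical table into a dictionary (we leave the table-parsing details aside):

import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def alternative_lemmas(translation_probs, threshold=0.1):
    """translation_probs: {target_word: p(target_word | source_word)} for one
    source word, extracted from the Moses lexical table (illustrative sketch).
    Groups candidates by lemma and keeps lemmas with total probability >= 10%."""
    lemma_probs = {}
    for word, prob in translation_probs.items():
        lem = morph.parse(word)[0].normal_form
        lemma_probs[lem] = lemma_probs.get(lem, 0.0) + prob
    return {lem: p for lem, p in lemma_probs.items() if p >= threshold}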

After that, more than 90% of the examples are translations of named entities (incl. names of geographical objects). We manually filter the examples with named entities.

A.2.3 Constructing a test set

From the two previous steps, we have examples with named entities in context and source sentences and several alternative translations for each named entity. Then we

1. construct alternative translations of each example by switching the translation of instances of the named entity; since the target language has rich morphology, we do it manually,

2. for each example, construct several test instances. For each version of the translation of a named entity, we use this translation in the context, and vary the translation of the entity in the current sentence to create one consistent, and one or more inconsistent (contrastive) translations.

B Experimental setup

B.1 Data preprocessing

We use the publicly available OpenSubtitles2018 corpus (Lison et al., 2018) for English and Russian.⁷ We pick sentence pairs with a relative time overlap of subtitle frames between source and target language subtitles of at least 0.9 to reduce noise in the data. As context, we take the previous sentence if its timestamp differs from the current one by no more than 7 seconds. Each long group of consecutive sentences is split into fragments of 4 sentences, with the first 3 sentences treated as context. More precisely, from a group of consecutive sentences s1, s2, ..., sn we get (s1, ..., s4), (s2, ..., s5), ..., (sn-3, ..., sn). For CADec we also

⁷ http://opus.nlpl.eu/OpenSubtitles2018.php

include (s1, s2) and (s1, s2, s3) as training examples. We do not add these two groups with less context for the concatenation model, because in preliminary experiments, this performed worse both in terms of BLEU and consistency as measured on our test sets.
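The splitting of a group of consecutive sentence pairs into training fragments can be sketched as follows (an illustration of the scheme above, not the released preprocessing code):

def make_fragments(group, max_len=4, include_short_prefixes=True):
    """group: list of consecutive sentence pairs s1..sn (sketch).
    Returns sliding windows of 4 sentences (first 3 are context), plus the
    short prefixes (s1,s2) and (s1,s2,s3) used only for CADec training."""
    fragments = []
    if include_short_prefixes:
        for k in range(2, min(max_len, len(group) + 1)):
            fragments.append(group[:k])            # (s1, s2) and (s1, s2, s3)
    for i in range(len(group) - max_len + 1):
        fragments.append(group[i:i + max_len])     # (s_i, ..., s_{i+3})
    return fragments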

We use the tokenization provided by the corpus and use multi-bleu.perl⁸ on lowercased data to compute the BLEU score. We use beam search with a beam of 4 for both the base model and CADec.

Sentences were encoded using byte-pair encoding (Sennrich et al., 2016), with source and target vocabularies of about 32000 tokens. Translation pairs were batched together by approximate sequence length. For the Transformer models (baselines and concatenation) each training batch contained a set of translation pairs containing approximately 16000⁹ source tokens. It has been shown that the Transformer's performance depends heavily on the batch size (Popel and Bojar, 2018), and we chose a large batch size to ensure that models show their best performance. For CADec, we use a batch size that contains approximately the same number of translation instances as the baseline models.

B.2 Model parameters

We follow the setup of the Transformer base model (Vaswani et al., 2017). More precisely, the number of layers in the base encoder, base decoder and CADec is N = 6. We employ h = 8 parallel attention layers, or heads. The dimensionality of input and output is d_model = 512, and the inner layer of the feed-forward networks has dimensionality d_ff = 2048.

We use regularization as described in (Vaswani et al., 2017).

B.3 Optimizer

The optimizer we use is the same as in (Vaswani et al., 2017). We use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98 and ε = 10^{-9}. We vary the learning rate over the course of training, according to the formula:

$$lrate = scale \cdot \min\big(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\big)$$

⁸ https://github.com/moses-smt/mosesdecoder/tree/master/scripts/generic

⁹ This can be reached by using several GPUs or by accumulating the gradients for several batches and then making an update.


We use warmup_steps = 16000, scale = 4 for the models trained on 6m data (baseline (6m) and concatenation) and scale = 1 for the models trained on 1.5m data (baseline (1.5m) and CADec).
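For reference, the schedule with these constants can be written as a small helper (an illustrative sketch):

def learning_rate(step, warmup_steps=16000, scale=4.0):
    """Learning-rate schedule from Section B.3 (sketch):
    lrate = scale * min(step^-0.5, step * warmup_steps^-1.5).
    scale=4 for the models trained on 6m data, scale=1 for the 1.5m ones."""
    step = max(step, 1)   # avoid division by zero at step 0
    return scale * min(step ** -0.5, step * warmup_steps ** -1.5)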