
Latent Variable Model for Multi-modal Translation

Iacer Calixto    Miguel Rios    Wilker Aziz

ILLC, The University of Amsterdam
{iacer.calixto,m.riosgaona,w.aziz}@uva.nl

Abstract

In this work, we propose to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a multi-modal stochastic embedding of an image and its description in a foreign language. It is used in a target-language decoder and also to predict image features. Importantly, our model formulation utilises visual and textual inputs during training but does not require that images be available at test time. We show that our latent variable MMT formulation improves considerably over strong baselines, including a multi-task learning approach (Elliott and Kádár, 2017) and a conditional variational auto-encoder approach (Toyama et al., 2016). Finally, we show improvements due to (i) predicting image features in addition to only conditioning on them, (ii) imposing a constraint on the KL term to promote models with non-negligible mutual information between inputs and latent variable, and (iii) training on additional target-language image descriptions (i.e. synthetic data).

1 Introduction

Multi-modal machine translation (MMT) is an exciting novel take on machine translation (MT) where we are interested in learning to translate sentences in the presence of visual input (mostly images). In the last three years there have been shared tasks (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018) where many research groups proposed different techniques to integrate images into MT, e.g. Caglayan et al. (2017); Libovický and Helcl (2017).

Most MMT models expand neural machine translation (NMT) architectures (Sutskever et al., 2014; Bahdanau et al., 2015) to additionally condition on an image in order to compute the likelihood of a translation in context. This gives the model a chance to exploit correlations in visual and language data, but also means that images must be available at test time. An exception to this rule is the work of Toyama et al. (2016), who exploit the framework of conditional variational auto-encoders (CVAEs) (Sohn et al., 2015) to decouple the encoder used for posterior inference at training time from the encoder used for generation at test time. Rather than conditioning on image features, the model of Elliott and Kádár (2017) learns to rank image features using language data in a multi-task learning (MTL) framework, therefore sharing parameters between a translation (generative) and a sentence-image ranking model (discriminative). This similarly exploits correlations between the two modalities and has the advantage that images are also not necessary at test time.

In this work, we also aim at translating without images at test time, yet learning a visually grounded translation model. To that end, we resort to probabilistic modelling instead of multi-task learning and estimate a joint distribution over translations and images. In a nutshell, we propose to model the interaction between visual and textual features through a latent variable. This latent variable can be seen as a stochastic embedding which is used in the target-language decoder, as well as to predict image features. Our experiments show that this joint formulation improves over an MTL approach (Elliott and Kádár, 2017), which does model both modalities but not jointly, and over the CVAE of Toyama et al. (2016), which uses image features to condition an inference network but crucially does not model the images.

The main contributions of this paper are:1

• we propose a novel multi-modal NMT model that incorporates image features through latent variables in a deep generative model.

• our latent variable MMT formulation improves considerably over strong baselines, and compares favourably to the state-of-the-art.

• we exploit correlations between both modalities at training time through a joint generative approach and do not require images at prediction time.

1 Code and pre-trained models are available at https://github.com/iacercalixto/variational_mmt.

The remainder of this paper is organised as follows. In §2, we describe our variational MMT models. In §3, we introduce the data sets we used, report experiments, and assess how our models compare to prior work. In §4, we position our approach with respect to the literature. Finally, in §5 we draw conclusions and provide avenues for future work.

2 Variational Multi-modal NMT

Similarly to standard NMT, in MMT we wish to translate a source sequence x_1^m ≜ ⟨x_1, · · · , x_m⟩ into a target sequence y_1^n ≜ ⟨y_1, · · · , y_n⟩. The main difference is the presence of an image v which illustrates the sentence pair ⟨x_1^m, y_1^n⟩. We do not model images directly, but instead a 2048-dimensional vector of pre-activations of a ResNet-50's pool5 layer (He et al., 2015).

In our variational MMT models, image features are assumed to be generated by transforming a stochastic latent embedding z, which is also used to inform the RNN decoder in translating source sentences into a target language.

Generative model We propose a generative model of translation and image generation where both the image v and the target sentence y_1^n are independently generated given a common stochastic embedding z. The generative story is as follows. We observe a source sentence x_1^m and draw an embedding z from a latent Gaussian model,

Z | x_1^m ∼ N(µ, diag(σ^2))
µ = f_µ(x_1^m; θ)
σ = f_σ(x_1^m; θ)    (1)

where f_µ(·) and f_σ(·) map from a source sentence to a vector of locations µ ∈ R^c and a vector of scales σ ∈ R^c_{>0}, respectively.

We then proceed to draw the image features from a Gaussian observation model,

V | z ∼ N(ν, ς^2 I)
ν = f_ν(z; θ)    (2)

where f_ν(·) maps from z to a vector of locations ν ∈ R^o, and ς ∈ R_{>0} is a hyperparameter of the model (we use 1). Conditioned on z and on the source sentence x_1^m, and independently of v, we generate a translation by drawing each target word in context from a Categorical observation model,

Y_j | x_1^m, z, y_{<j} ∼ Cat(π_j)
π_j = f_π(x_1^m, y_{<j}, z; θ)    (3)

where f_π(·) maps z, x_1^m, and a prefix translation y_{<j} to the parameters π_j of a categorical distribution over the target vocabulary. Functions f_µ(·), f_σ(·), f_ν(·), and f_π(·) are implemented as neural networks whose parameters are collectively denoted by θ. In particular, implementing f_π(·) is as simple as augmenting a standard NMT architecture (Bahdanau et al., 2015; Luong et al., 2015), i.e. encoder-decoder with attention, with an additional input z available at every time-step. All other functions are single-layer MLPs that transform the average encoder hidden state to the dimensionality of the corresponding Gaussian variable followed by an appropriate activation.2
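To make this parameterisation concrete, the following is a minimal PyTorch sketch of how the latent Gaussian model (f_µ, f_σ) and the image-feature predictor f_ν could be implemented. Module names, layer sizes and the masked averaging are our own assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentGaussianModel(nn.Module):
    """Maps the average source encoder state to the parameters of p(z | x)."""
    def __init__(self, enc_dim: int, latent_dim: int):
        super().__init__()
        self.hidden = nn.Linear(enc_dim, latent_dim)
        self.to_loc = nn.Linear(latent_dim, latent_dim)    # linear output for locations
        self.to_scale = nn.Linear(latent_dim, latent_dim)  # softplus output for scales

    def forward(self, enc_states, mask):
        # enc_states: [batch, src_len, enc_dim]; mask: [batch, src_len] (1.0 = token, 0.0 = pad)
        avg = (enc_states * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        h = F.relu(self.hidden(avg))
        mu = self.to_loc(h)
        sigma = F.softplus(self.to_scale(h))
        return mu, sigma

class ImagePredictor(nn.Module):
    """f_nu: maps a latent sample z to the mean of the Gaussian over ResNet pool5 features."""
    def __init__(self, latent_dim: int, img_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, img_dim),
        )

    def forward(self, z):
        return self.net(z)  # nu; the observation model is N(nu, I) with unit scale
```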

Note that in effect we model a joint distribution

p_θ(y_1^n, v, z | x_1^m) = p_θ(z | x_1^m) p_θ(v | z) P_θ(y_1^n | x_1^m, z)    (4)

consisting of three components which we parameterise directly. As there are no observations for z, we cannot estimate these components directly. We must instead marginalise z out, which yields the marginal

P_θ(y_1^n, v | x_1^m) = ∫ p_θ(z | x_1^m) p_θ(v | z) P_θ(y_1^n | x_1^m, z) dz .    (5)

An important statistical consideration about this model is that even though y_1^n and v are conditionally independent given z, they are marginally dependent.

2 Locations have support on the entire real space, thus we use linear activations; scales must be strictly positive, thus we use a softplus activation.


Figure 1: Generative model of target text and image features (left), and inference model (right). (a) VMMTC: given the source text x_1^m, we model the joint likelihood of the translation y_1^n, the image (features) v, and a stochastic embedding z sampled from a conditional latent Gaussian model. Note that the stochastic embedding is solely responsible for assigning a probability to the observation v, and it helps assign a probability to the translation. (b) Inference model for VMMTC: to approximate the true posterior we have access to both modalities (text x_1^m, y_1^n and image v).

This means that we have designed a data generating process where our observations y_1^n, v | x_1^m are not assumed to have been independently produced.3 This is in direct contrast with multi-task learning or joint modelling without latent variables; for an extended discussion see (Eikema and Aziz, 2019, §3.1).

Finally, Figure 1 (left) is a graphical depiction of the generative model: shaded circles denote observed random variables, unshaded circles indicate latent random variables, and deterministic quantities are not circled; the internal plate indicates iteration over time-steps, the external plate indicates iteration over the training data. Note that deterministic parameters θ are global to all training instances, while stochastic embeddings z are local to each tuple ⟨x_1^m, y_1^n, v⟩.

Inference Parameter estimation for our model is challenging due to the intractability of the marginal likelihood function (5). We can however employ variational inference (VI) (Jordan et al., 1999), in particular amortised VI (Kingma and Welling, 2014; Rezende et al., 2014), and estimate parameters to maximise a lower bound

E_{q_λ(z | x_1^m, y_1^n, v)} [ log p_θ(v | z) + log P_θ(y_1^n | x_1^m, z) ] − KL( q_λ(z | x_1^m, y_1^n, v) || p_θ(z | x_1^m) )    (6)

on the log-likelihood function. This evidence lower bound (ELBO) is expressed in terms of an inference model q_λ(z | x_1^m, y_1^n, v) which we design with tractability in mind.

3 This is an aspect of the model we aim to explore more explicitly in the near future.

In particular, our approximate posterior is a Gaussian distribution

q_λ(z | x_1^m, y_1^n, v) = N(z | u, diag(s^2))
u = g_u(x_1^m, y_1^n, v; λ)
s = g_s(x_1^m, y_1^n, v; λ)    (7)

parametrised by an inference network, that is, an independently parameterised neural network (whose parameters we denote collectively by λ) which maps from observations, in our case a sentence pair and an image, to a variational location u ∈ R^c and a variational scale s ∈ R^c_{>0}. Figure 1 (right) is a graphical depiction of the inference model.

Location-scale variables (e.g. Gaussians) can be reparametrised, i.e. we can obtain a latent sample via a deterministic transformation of the variational parameters and a sample from the standard Gaussian distribution:

z = u + ε ⊙ s,  where ε ∼ N(0, I) .    (8)

This reparametrisation enables backpropagation through stochastic units (Kingma and Welling, 2014; Titsias and Lázaro-Gredilla, 2014). In addition, for two Gaussians the KL term in the ELBO (6) can be computed in closed form (Kingma and Welling, 2014, Appendix B). Altogether, we can obtain a reparameterised gradient estimate of the ELBO: we use a single-sample estimate of the first term and count on stochastic gradient descent to attain a local optimum of (6).
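For concreteness, the sketch below shows how the single-sample reparameterised ELBO estimate with a closed-form Gaussian KL could be computed in PyTorch. The `decoder_logp` and `image_logp` callables stand in for the translation and image observation models; their interfaces, like all names here, are our assumptions rather than the authors' implementation.

```python
import torch

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over latent dims.

    For the fixed-prior model, pass mu_p = zeros and sigma_p = ones.
    """
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    kl = 0.5 * (torch.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(-1)

def elbo(mu_q, sigma_q, mu_p, sigma_p, decoder_logp, image_logp):
    """Single-sample reparameterised ELBO estimate for one mini-batch.

    decoder_logp(z) -> log P(y | x, z), shape [batch]
    image_logp(z)   -> log p(v | z),    shape [batch]
    """
    eps = torch.randn_like(sigma_q)
    z = mu_q + eps * sigma_q                      # reparameterisation: z = u + eps * s
    ll = decoder_logp(z) + image_logp(z)          # single-sample estimate of the first term
    kl = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
    return (ll - kl).mean()                       # maximise this (or minimise its negative)
```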

Architecture All of our parametric functions are neural network architectures. In particular, f_π is a standard sequence-to-sequence architecture with attention and a softmax output. We build upon OpenNMT (Klein et al., 2017), which we modify slightly by providing z as additional input to the target-language decoder at each time step. Location layers f_µ, f_ν and g_u, and scale layers f_σ and g_s, are feed-forward networks with a single ReLU hidden layer. Furthermore, location layers have a linear output while scale layers have a softplus output. For the generative model, f_µ and f_σ transform the average source-language encoder hidden state. We let the inference model condition on source-language encodings without updating them, and we use a target-language bidirectional LSTM encoder in order to also condition on the complete target sentence. Then g_u and g_s transform a concatenation of the average source-language encoder hidden state, the average target-language bidirectional encoder hidden state, and the image features.

Fixed Gaussian prior We have just presented our variational MMT model in its full generality; we refer to that model as VMMTC. However, keeping in mind that MMT datasets are rather small, it is desirable to simplify some of our model's components. In particular, the estimated latent Gaussian model (1) can be replaced by a fixed standard Gaussian prior, i.e. Z ∼ N(0, I); we refer to this model as VMMTF. Along with this change, it is convenient to modify the inference model to condition on x_1^m alone, which allows us to use the inference model for both training and prediction. Importantly, this also sidesteps the need for a target-language bidirectional LSTM encoder, which leaves us with a smaller set of inference parameters λ to estimate. Interestingly, this model does not rely on features from v, instead only using them as a learning signal through the objective in (6), which is in direct contrast with the model of Toyama et al. (2016).
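Under the fixed prior, the KL term in (6) reduces to the usual divergence against N(0, I). A minimal sketch of that special case (our naming, same register as the sketch above):

```python
import torch

def kl_to_standard_normal(u, s):
    """KL( N(u, diag(s^2)) || N(0, I) ), summed over latent dimensions.

    u: variational locations, shape [batch, latent_dim]
    s: variational scales,    shape [batch, latent_dim]
    """
    return 0.5 * (u ** 2 + s ** 2 - 2.0 * torch.log(s) - 1.0).sum(-1)
```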

3 Experiments

Our encoder is a 2-layer 500D bidirectional RNN with GRU, the source and target word embeddings are 500D, and all are trained jointly with the model. We use OpenNMT to implement all our models (Klein et al., 2017). All model parameters are initialised by sampling from a uniform distribution U(−0.1, +0.1) and bias vectors are initialised to 0.

Visual features are obtained by feeding images to the pre-trained ResNet-50 and using the activations of the pool5 layer (He et al., 2015). We apply dropout with a probability of 0.5 in the encoder bidirectional RNN, the image features, the decoder RNN, and before emitting a target word.

All models are trained using the Adam optimiser (Kingma and Ba, 2014) with an initial learning rate of 0.002 and minibatches of size 40, where each training instance consists of one English sentence, one German sentence and one image (MMT). Models are trained for up to 40 epochs and we perform model selection based on BLEU4, using the best performing model on the validation set to translate test data. Moreover, we halt training if the model does not improve BLEU4 scores on the validation set for 10 epochs or more. We report the mean and standard deviation over 4 independent runs for all models we trained ourselves (NMT, VMMTF, VMMTC); other baseline results are the ones reported in the authors' publications (Toyama et al., 2016; Elliott and Kádár, 2017).
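The BLEU-based model selection with a patience of 10 epochs could look roughly like the sketch below; `train_one_epoch` and `validation_bleu` are hypothetical helpers standing in for the OpenNMT training and evaluation code, and are not part of the released toolkit.

```python
def train_with_bleu_selection(model, max_epochs=40, patience=10):
    """Keep the checkpoint with the best validation BLEU4; stop after `patience` epochs without improvement."""
    best_bleu, best_state, epochs_without_improvement = -1.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)            # hypothetical: one pass over the training data
        bleu = validation_bleu(model)     # hypothetical: BLEU4 on the validation set
        if bleu > best_bleu:
            best_bleu = bleu
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # halt: no BLEU4 improvement for `patience` epochs
    model.load_state_dict(best_state)
    return model, best_bleu
```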

We preprocess our data by tokenizing, lowercasing, and converting words to subword tokens using a bilingual BPE model with 10k merge operations (Sennrich et al., 2016b). We quantitatively evaluate translation quality using case-insensitive and tokenized outputs in terms of BLEU4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), chrF3 (Popović, 2015), and BEER (Stanojević and Sima'an, 2014). By using these, we hope to include word-level metrics which are traditionally used by the MT community (i.e. BLEU and METEOR), as well as more recent metrics which operate at the character level and that better correlate with human judgements of translation quality (i.e. chrF3 and BEER) (Bojar et al., 2017).

3.1 Datasets

The Flickr30k dataset (Young et al., 2014) consists of images from Flickr and their English descriptions. We use the translated Multi30k (M30kT) dataset (Elliott et al., 2016), i.e. an extension of Flickr30k where for each image one of its English descriptions was translated into German by a professional translator. Training, validation and test sets contain 29k, 1014 and 1k images respectively, each accompanied by the original English sentence and its translation into German. In addition to the test set released for the first run of the multi-modal translation shared task (Elliott et al., 2016), henceforth test2016, we also use test2017, released for the next run of this shared task (Elliott et al., 2017).

Since this dataset is very small, we also investigate the effect of including more in-domain data to train our models.


Model        BLEU4↑            METEOR↑           chrF↑             BEER↑
NMT          35.0 (0.4)        54.9 (0.2)        61.0 (0.2)        65.2 (0.1)
Imagination  36.8 (0.8)        55.8 (0.4)        –                 –
Model G      36.5              56.0              –                 –
VMMTF        37.7 (0.4) ↑0.9   56.0 (0.1) ↑0.0   62.1 (0.1) ↑1.1   66.6 (0.1) ↑1.4
VMMTC        37.5 (0.3) ↑0.7   55.7 (0.1) ↓0.3   61.9 (0.1) ↑0.9   66.5 (0.1) ↑1.3

Table 1: Results of applying variational MMT models to translate the Multi30k 2016 test set. For each model, we report the mean and standard deviation over 4 independent runs where models were selected using validation BLEU4 scores. Best mean baseline scores per metric are underlined and best overall results (i.e. means) are in bold. We highlight in green/red the improvement brought by our models compared to the best baseline mean score.

To that purpose, we use 145K additional monolingual German descriptions released as part of the Multi30k dataset for the task of image description generation (Elliott et al., 2016). We refer to this dataset as the comparable Multi30k (M30kC). Descriptions in the comparable Multi30k were collected independently of the existing English descriptions and describe the same 29K images as in the M30kT dataset.

In order to obtain features for images, we use ResNet-50 (He et al., 2015) pre-trained on ImageNet (Russakovsky et al., 2015). We report experiments using pool5 features as our image features, i.e. 2048-dimensional pre-activations of the last layer of the network.

In order to investigate how well our models generalise, we also evaluate our models on the ambiguous MSCOCO test set (Elliott et al., 2017), which was designed with example sentences that are hard to translate without resorting to the visual context available in the accompanying image.

Finally, we use a 50D latent embedding z in our experiments with the translated Multi30k data, whereas in our ablative experiments and experiments with the comparable Multi30k data, we use a 500D stochastic embedding z.

3.2 Baselines

We compare our work against three different baselines. The first one is a standard text-only sequence-to-sequence NMT model with attention (Luong et al., 2015), trained from scratch using the hyper-parameters described above. The second baseline is the variational multi-modal MT model Model G proposed by Toyama et al. (2016), where global image features are used as additional input to condition an inference network. Finally, the third baseline is the Imagination model of Elliott and Kádár (2017), a multi-task MMT model which uses a shared source-language encoder RNN and is trained on two tasks: to translate from English into German and on image-sentence ranking (English↔image).

3.3 Translated Multi30k

We now report on experiments conducted with models trained to translate from English into German using the translated Multi30k data set (M30kT).

In Table 1, we compare our variational MMT models (VMMTC for the general case with a conditional Gaussian latent model, and VMMTF for the simpler case of a fixed Gaussian prior) to the three baselines described above. The general trend is that both formulations of our VMMT improve with respect to all three baselines. We note an improvement in BLEU and METEOR mean scores compared to the Imagination model (Elliott and Kádár, 2017), as well as reduced variance (though note this is based on only 4 independent runs in our case, and 3 independent runs of Imagination). Both models VMMTF and VMMTC outperform Model G according to BLEU and perform comparably according to METEOR, especially since results reported by Toyama et al. (2016) are based on a single run. Moreover, we also note that both our models outperform the text-only NMT baseline according to all four metrics, and by 1%–1.4% according to chrF3 and BEER, both being metrics well-suited to measure the quality of translations into German generated with subword units.

In Table 2, we report results when translating the Multi30k test2017 and the ambiguous MSCOCO test sets. Note that standard deviations for the conditional model VMMTC are considerably higher than those obtained for model VMMTF.


Multi30k 2017 test set
Model    BLEU4↑       METEOR↑      chrF↑        BEER↑
VMMTF    30.1 (0.3)   49.9 (0.3)   57.2 (0.4)   62.2 (0.3)
VMMTC    26.1 (6.6)   45.4 (7.3)   52.2 (8.4)   58.6 (5.8)

Ambiguous MSCOCO 2017 test set
Model    BLEU4↑       METEOR↑      chrF↑        BEER↑
VMMTF    25.5 (0.5)   44.8 (0.2)   52.0 (0.3)   58.3 (0.2)
VMMTC    21.8 (5.6)   41.2 (6.3)   47.4 (7.6)   55.3 (5.2)

Table 2: Results of applying variational MMT models to translate the Multi30k 2017 and the ambiguous MSCOCO test sets. For each model, we report the mean and standard deviation over 4 independent runs where models were selected using validation BLEU4 scores. Best overall results (i.e. means) are in bold. Note that standard deviations for the conditional model VMMTC are considerably higher than those obtained for model VMMTF. This is partly due to the fact that one of the runs of VMMTC underperformed compared to the other three.

We investigated the issue further and found out that one of the runs of VMMTC performed considerably worse than the others; this caused the mean scores to be much lower and also increased the variance significantly.

Finally, one interesting finding is that all four metrics indicate that the fixed-prior model VMMTF performs either slightly better (Table 1) or considerably better (Table 2) than the conditional model VMMTC. We speculate this is partly due to VMMTF's simpler parameterisation; after all, we have just about 29k training instances to estimate two sets of parameters (θ and λ), and the more complex VMMTC requires an additional bidirectional LSTM encoder for the target text.

3.4 Back-translated Comparable Multi30k

Since the translated Multi30k dataset is very small, we also investigate the effect of including more in-domain data to train our models. For that purpose, we use the 145K additional monolingual German descriptions released as part of the comparable Multi30k dataset (M30kC). We train a text-only NMT model to translate from German into English using the original 29K parallel sentences in the translated Multi30k (without images), and apply this model to back-translate the 145K German descriptions into English (Sennrich et al., 2016a).

In this set of experiments, we explore how pre-training models NMT, VMMTF and VMMTC using both the translated and back-translated comparable Multi30k affects results. Models are pre-trained on mini-batches with a one-to-one ratio of translated and back-translated data.4

4 One pre-training epoch corresponds to about 290K examples, i.e. we up-sample the smaller translated Multi30k data set to achieve the one-to-one ratio.
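One simple way to realise this one-to-one ratio is to iterate once over the back-translated data while cycling over (i.e. up-sampling) the smaller translated set, filling each mini-batch half and half. The sketch below is one such scheme under our own assumptions; it is not the authors' data pipeline.

```python
import itertools
import random

def mixed_pretraining_stream(translated, back_translated, batch_size=40):
    """Yield mini-batches with a one-to-one ratio of translated and back-translated examples,
    up-sampling the smaller translated set by cycling over a shuffled copy of it."""
    random.shuffle(back_translated)
    cycled = itertools.cycle(random.sample(translated, len(translated)))
    half = batch_size // 2
    for i in range(0, len(back_translated), half):
        synthetic = back_translated[i:i + half]
        gold = [next(cycled) for _ in range(len(synthetic))]
        batch = gold + synthetic
        random.shuffle(batch)  # mix gold and synthetic examples within the batch
        yield batch
```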

Figure 2: Validation set BLEU scores per number of pre-training epochs for models VMMTC and VMMTF pre-trained using the comparable Multi30k and translated Multi30k data sets. The height of a bar represents the mean and the black vertical lines indicate ±1 std over 4 independent runs.

All three models, NMT, VMMTF and VMMTC, are further fine-tuned on the translated Multi30k until convergence, and model selection using BLEU is only applied during fine-tuning and not at the pre-training stage.

In Figure 2, we inspect for how many epochs a model should be pre-trained using the additional noisy back-translated descriptions, and note that both VMMTF and VMMTC reach the best BLEU scores on the validation set when pre-trained for about 3 epochs. As shown in Figure 2, we note that when using additional noisy data VMMTC, which uses a conditional prior, performs considerably better than its counterpart VMMTF, which has a fixed prior. These results indicate that VMMTC makes better use of additional synthetic data than VMMTF.


Model        BLEU4↑            METEOR↑           # train sents.
NMT          37.7 (0.5)        56.0 (0.3)        145K
VMMTF        38.4 (0.6) ↑0.7   56.0 (0.3) ↑0.0   145K
VMMTC        38.4 (0.2) ↑0.7   56.3 (0.2) ↑0.3   145K
Imagination  37.8 (0.7)        57.1 (0.2)        654K

Table 3: Results for models pre-trained using the translated and comparable Multi30k to translate the Multi30k test set. We report the mean and standard deviation over 4 independent runs. Our best overall results are highlighted in bold, and we highlight in green/red the improvement/decrease brought by our models compared to the baseline mean score. We additionally show results for the Imagination model trained on 4× more data (as reported in the authors' paper).

Some of the reasons that explain these results are: (i) the conditional prior p(z|x) can learn to be sensitive to whether x is gold-standard or synthetic, whereas p(z) cannot; (ii) in the conditional case the posterior approximation q(z|x, y, v) can directly exploit different patterns arising from a gold-standard versus a synthetic ⟨x, y⟩ pair; and finally (iii) our synthetic data is made of target-language gold-standard image descriptions, which help train the inference network's target-language BiLSTM encoder.

In Table 3, we show results when applying VMMTF and VMMTC to translate the Multi30k test set. Both models and the NMT baseline are pre-trained on the translated and the back-translated comparable Multi30k data sets, and are selected according to validation set BLEU scores. For comparison, we also include results for Imagination (Elliott and Kádár, 2017) when trained on the translated Multi30k, the WMT News Commentary English-German dataset (240K parallel sentence pairs) and the MSCOCO image description dataset (414K German descriptions of 83K images, i.e. 5 descriptions for each image). In contrast, our models observe 29K images (i.e. the same as the models evaluated in Section 3.3) plus 145K German descriptions only.5

5 There are no additional images because the comparable Multi30k consists of additional German descriptions for the same 29K images already in the translated Multi30k.

3.5 Ablative experiments

In our ablation we are interested in finding out to what extent the model makes use of the latent space, i.e. how important the latent variable is.

KL free bits A common issue when training latent variable models with a strong decoder is having the true posterior collapse to the prior and the KL term in the ELBO vanish to zero. In practice, that would mean the model has virtually not used the latent variable z to predict image features v, but mostly as a source of stochasticity in the decoder. This can happen because the model has access to informative features from the source bi-LSTM encoder and need not learn a difficult mapping from observations to latent representations predictive of image features.

Model    Number of free bits (KL)    BLEU4↑
VMMTF    0                           38.3 (0.2)
VMMTF    1                           38.1 (0.3)
VMMTF    2                           38.4 (0.4)
VMMTF    4                           38.4 (0.4)
VMMTF    8                           35.7 (3.1)
VMMTC    0                           38.5 (0.2)
VMMTC    1                           38.3 (0.3)
VMMTC    2                           38.2 (0.2)
VMMTC    4                           36.8 (2.6)
VMMTC    8                           38.6 (0.2)

Table 4: Results of applying VMMT models trained with different numbers of free bits in the KL (Kingma et al., 2016) to translate the Multi30k validation set.

For that reason, we wish to measure how well we can train latent variable MMT models while ensuring that the KL term in the loss (Equation (6)) does not vanish to zero. We use the free bits heuristic (Kingma et al., 2016) to impose a constraint on the KL, which in turn promotes models with non-negligible mutual information between inputs and latent variables (Alemi et al., 2018).
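A common way to apply the free-bits heuristic is to clamp the aggregated KL term at a floor of λ nats so that the optimiser stops pushing it below that value; the original formulation in Kingma et al. (2016) applies the maximum per group of latent units rather than to the total. The sketch below shows the aggregate-clamping variant, which is our assumption, not necessarily the exact variant used in the paper.

```python
import torch

def free_bits_kl(kl, free_bits: float):
    """Clamp the KL term at `free_bits` nats so gradients cannot collapse it further.

    kl: tensor of per-example KL values, shape [batch].
    """
    return torch.clamp(kl, min=free_bits)

# Example use inside the loss:
#   loss = -(log_likelihood - free_bits_kl(kl, 2.0)).mean()
```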

In Table 4, we see the results of different models trained using different numbers of free bits in the KL component. We note that including free bits improves translations slightly, but finding the optimal number of free bits requires a hyper-parameter search.

3.6 Discussion

In Table 5 we show how our different models translate two examples of the M30k test set. In the first example (id#801), training on additional back-translated data improves the variational models but not the NMT baseline, whereas in the second example (id#873) differences between the baseline and the variational models still persist even when training on additional back-translated data.

Example #801
source:                                  a man on a bycicle pedals through an archway .
reference:                               ein mann fährt auf einem fahrrad durch einen torbogen .
NMT (M30kT):                             ein mann auf einem fahrrad fährt durch eine scheibe .
VMMTF (M30kT):                           ein mann auf einem fahrrad fährt durch einen torbogen .
VMMTC (M30kT):                           ein mann auf einem fahrrad fährt durch einen bogen .
NMT (M30kT + back-translated M30kC):     ein mann auf einem fahrrad fährt durch einen bogen .
VMMTF (M30kT + back-translated M30kC):   ein mann auf einem fahrrad fährt durch einen torbogen .
VMMTC (M30kT + back-translated M30kC):   ein mann auf einem fahrrad fährt durch einen torbogen .

Example #873
source:                                  a man throws a fishing net into the bay .
reference:                               ein mann wirft ein fischernetz in die bucht .
NMT (M30kT):                             ein mann wirft ein fischernetz in die luft .
VMMTF (M30kT):                           ein mann wirft ein fischernetz in die bucht .
VMMTC (M30kT):                           ein mann wirft ein fischernetz in die bucht .
NMT (M30kT + back-translated M30kC):     ein mann wirft ein fischernetz ins meer .
VMMTF (M30kT + back-translated M30kC):   ein mann wirft ein fischernetz in den wellen .
VMMTC (M30kT + back-translated M30kC):   ein mann wirft ein fischernetz in die bucht .

Table 5: Translations for examples 801 and 873 of the M30k test set. In the first example, neither the NMT baseline (with or without back-translated data) nor model VMMTC (trained on limited data) could translate archway correctly; the NMT baseline translates it as "scheibe" (disk) and "bogen" (bow), and VMMTC also incorrectly translates it as "bogen" (bow). However, VMMTC translates without errors when trained on additional back-translated data, i.e. "torbogen" (archway). In the second example, the NMT baseline translates bay as "luft" (air) or "meer" (sea), whereas VMMTF translates it as "bucht" (bay) or "wellen" (waves) and VMMTC always as "bucht" (bay).

4 Related work

Even though there has been growing interest in variational approaches to machine translation (Zhang et al., 2016; Schulz et al., 2018; Shah and Barber, 2018; Eikema and Aziz, 2019) and to tasks that integrate vision and language, e.g. image description generation (Pu et al., 2016; Wang et al., 2017), relatively little attention has been dedicated to variational models for multi-modal translation. This is partly due to the fact that multi-modal machine translation was only recently addressed by the MT community by means of a shared task (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018). Nevertheless, we now discuss relevant variational and deterministic multi-modal MT models in the literature.

Fully supervised MMT models. All submissions to the three runs of the multi-modal MT shared tasks (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018) model conditional probabilities directly without latent variables.

Perhaps the first MMT model proposed prior to these shared tasks is that of Hitschler et al. (2016), who used image features to re-rank translations of image descriptions generated by a phrase-based statistical MT model (PBSMT) and reported significant improvements. Shah et al. (2016) propose a similar model where image logits are used to re-rank the output of PBSMT. Global image features, i.e. features computed over an entire image (such as the pool5 ResNet-50 features used in this work), have been directly used as "tokens" in the source sentence, to initialise encoder RNN hidden states, or as additional information used to initialise the decoder RNN states (Huang et al., 2016; Libovický et al., 2016; Calixto and Liu, 2017). On the other hand, spatial visual features, i.e. local features that encode different parts of the image separately in different vectors, have been used in doubly-attentive models where there is one attention mechanism over the source RNN hidden states and another one over the image features (Caglayan et al., 2016; Calixto et al., 2017).

Finally, Caglayan et al. (2017) proposed to interact image features with target word embeddings, more specifically to perform an element-wise multiplication of the (projected) global image features and the target word embeddings before feeding the target word embeddings into their decoder GRU. They reported significant improvements by using image features to gate target word embeddings and won the 2017 Multi-modal MT shared task (Elliott et al., 2017).

Multi-task MMT models. Multi-task learning MMT models are easily applicable to translate sentences without images (at test time), which is an advantage over the above-mentioned models.

Luong et al. (2016) proposed a multi-task approach where a model is trained using two tasks and a shared decoder: the main task is to translate from German into English and the secondary task is to generate English descriptions given an image. They show improvements in the main translation task when also training for the secondary image description task. Their model is large, i.e. a 4-layer encoder LSTM and a 4-layer decoder LSTM, and their best setup uses a ratio of 0.05 image description generation training data samples in comparison to translation training data samples. Elliott and Kádár (2017) propose an MTL model trained to do translation (English→German) and sentence-image ranking (English↔image), using a standard word cross-entropy and margin-based losses as its task objectives, respectively. Their model uses the pre-trained GoogleNet v3 CNN (Szegedy et al., 2016) to extract pool5 features, and has a 1-layer source-language bidirectional GRU encoder and a 1-layer GRU decoder.

Variational MMT models. Toyama et al. (2016) proposed a variational MMT model that is likely the most similar model to the one we put forward in this work. They build on the variational neural MT (VNMT) model of Zhang et al. (2016), which is a conditional latent model where a Gaussian-distributed prior of z is parameterised as a function of the source sentence x_1^m, i.e. p(z|x_1^m), and both x_1^m and z are used at each time step in an attentive decoder RNN, P(y_j | x_1^m, z, y_{<j}).

In Toyama et al. (2016), image features are used as input to the inference model q_λ(z | x_1^m, y_1^n, v) that approximates the posterior over the latent variable, but otherwise are not modelled and not used in the generative network. Differently from their work, we use image features in all our generative models, and propose modelling them as random observed outcomes while still being able to use our model to translate without images at test time. In the conditional case, we further use image features for posterior inference. Additionally, we also investigate both conditional and fixed priors, i.e. p(z|x_1^m) and p(z), whereas their model is always conditional. Interestingly, we found in our experiments that fixed-prior models perform slightly better than conditional ones under limited training data.

Toyama et al. (2016) use the pre-trained VGG19 CNN (Simonyan and Zisserman, 2015) to extract FC7 features, and additionally experiment with using additional features from object detections obtained with the Fast R-CNN network (Girshick, 2015). One more difference between their work and ours is that we only use the ResNet-50 network to extract pool5 features, and no additional pre-trained CNN nor object detections.

5 Conclusions and Future work

We have proposed a latent variable model for multi-modal neural machine translation and have shown benefits from both modelling images and promoting use of the latent space. We also show that in the absence of enough data to train a more complex inference network a simple fixed prior suffices, whereas when more training data is available (even noisy data) a conditional prior is preferable. Importantly, our models compare favourably to the state-of-the-art.

In future work we will explore other generative models for multi-modal MT, as well as different ways to directly incorporate images into these models. We are also interested in modelling different views of the image, such as global vs. local image features, and also in using larger image collections and modelling images directly, i.e. pixel intensities.

Acknowledgements

This work is supported by the Dutch Organisation for Scientific Research (NWO) VICI Grant nr. 277-89-002.

References

Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. 2018. Fixing a Broken ELBO. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 159–168.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, ICLR 2015, San Diego, California.

Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the Third Shared Task on Multimodal Machine Translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 304–323, Belgium, Brussels. Association for Computational Linguistics.

Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 Metrics Shared Task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.

Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017. LIUM-CVC Submissions for WMT17 Multimodal Translation Task. In Proceedings of the Second Conference on Machine Translation, pages 432–439, Copenhagen, Denmark. Association for Computational Linguistics.

Ozan Caglayan, Loïc Barrault, and Fethi Bougares. 2016. Multimodal attention for neural machine translation. CoRR, abs/1609.03976.

Iacer Calixto and Qun Liu. 2017. Incorporating Global Visual Features into Attention-based Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 992–1003, Copenhagen, Denmark. Association for Computational Linguistics.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-Attentive Decoder for Multi-modal Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913–1924, Vancouver, Canada. Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Bryan Eikema and Wilker Aziz. 2019. Auto-encoding variational neural machine translation. In 4th Workshop on Representation Learning for NLP.

Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pages 215–233, Copenhagen, Denmark. Association for Computational Linguistics.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the 5th Workshop on Vision and Language, VL@ACL 2016, Berlin, Germany.

Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 130–141, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Ross Girshick. 2015. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 1440–1448, Washington, DC, USA. IEEE Computer Society.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385.

Julian Hitschler, Shigehiko Schamoni, and Stefan Riezler. 2016. Multimodal Pivots for Image Caption Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2399–2409, Berlin, Germany.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.

Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh, and Chris Dyer. 2016. Attention-based Multimodal Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, pages 639–645, Berlin, Germany.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improved variational inference with inverse autoregressive flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4743–4751. Curran Associates, Inc.


Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational bayes. In International Conference on Learning Representations.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.

Jindřich Libovický and Jindřich Helcl. 2017. Attention Strategies for Multi-Source Sequence-to-Sequence Learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196–202, Vancouver, Canada. Association for Computational Linguistics.

Jindřich Libovický, Jindřich Helcl, Marek Tlustý, Ondřej Bojar, and Pavel Pecina. 2016. CUNI System for WMT16 Automatic Post-Editing and Multimodal Translation Tasks. In Proceedings of the First Conference on Machine Translation, pages 646–654, Berlin, Germany.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-Task Sequence to Sequence Learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2016, San Juan, Puerto Rico.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Philadelphia, Pennsylvania.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. 2016. Variational autoencoder for deep learning of images, labels and captions. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2352–2360. Curran Associates, Inc.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1278–1286.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.

Philip Schulz, Wilker Aziz, and Trevor Cohn. 2018. A stochastic decoder for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1243–1252. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Harshil Shah and David Barber. 2018. Generative neural machine translation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1346–1355. Curran Associates, Inc.

Kashif Shah, Josiah Wang, and Lucia Specia. 2016. SHEF-Multimodal: Grounding Machine Translation on Images. In Proceedings of the First Conference on Machine Translation, pages 660–665, Berlin, Germany.

K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3483–3491. Curran Associates, Inc.

Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. A Shared Task on Multimodal Machine Translation and Crosslingual Image Description. In Proceedings of the First Conference on Machine Translation, WMT 2016, colocated with ACL 2016, pages 543–553, Berlin, Germany.

Miloš Stanojević and Khalil Sima'an. 2014. Fitting sentence level translation evaluation with many dense features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 202–206, Doha, Qatar. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.

Michalis Titsias and Miguel Lázaro-Gredilla. 2014. Doubly stochastic variational bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971–1979.

Joji Toyama, Masanori Misono, Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. 2016. Neural machine translation with latent semantic of image and text. CoRR, abs/1611.08459.

Liwei Wang, Alexander Schwing, and Svetlana Lazebnik. 2017. Diverse and accurate image description using a variational auto-encoder with an additive gaussian encoding space. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5756–5766. Curran Associates, Inc.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. 2016. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521–530, Austin, Texas. Association for Computational Linguistics.


A Model Architecture

Once again, we wish to translate a source sequence x_1^m ≜ ⟨x_1, · · · , x_m⟩ into a target sequence y_1^n ≜ ⟨y_1, · · · , y_n⟩, and also predict image features v.

Figure 3: Illustration of multi-modal machine translation generative and inference models. The conditional model VMMTC includes dashed arrows; the fixed prior model VMMTF does not, i.e. its inference network only uses x.

In Figure 3, we illustrate the generative and inference networks for models VMMTC and VMMTF.

A.1 Generative model

Source-language encoder The source-language encoder is deterministic and implemented using a 2-layer bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997):

f_i = emb(x_i; θ_emb-x),
h_0 = 0,
→h_i = LSTM(h_{i−1}, f_i; θ_lstmf-x),
←h_i = LSTM(h_{i+1}, f_i; θ_lstmb-x),
h_i = [→h_i, ←h_i],    (9)

where emb is the source look-up matrix, trained jointly with the model, and h_1^m are the final source hidden states.
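A minimal PyTorch sketch of this deterministic source encoder follows Equation (9); the class name, dimensions and use of batch-first tensors are our own assumptions for illustration.

```python
import torch
import torch.nn as nn

class SourceEncoder(nn.Module):
    """Embeds source tokens and runs a 2-layer bidirectional LSTM (Equation 9)."""
    def __init__(self, vocab_size: int, emb_dim: int = 500, hidden_dim: int = 500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: [batch, src_len] of token ids
        f = self.emb(x)        # f_i = emb(x_i)
        h, _ = self.bilstm(f)  # h_i = [forward h_i, backward h_i]
        return h               # [batch, src_len, 2 * hidden_dim]
```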

Target-language decoder Now we assume that z is given, and will discuss how to compute it later on. The translation model consists of a sequence of draws from a Categorical distribution over the target-language vocabulary (independently from image features v):

Y_j | z, x, y_{<j} ∼ Cat(f_θ(z, x, y_{<j})),

where f_θ parameterises the distribution with an attentive encoder-decoder architecture:

w_j = emb(y_j; θ_emb-y),
s_0 = tanh(affine(h_1^m; θ_init-y)),
s_j = LSTM(s_{j−1}, [w_j, z]; θ_lstm-y),
c_j = attention(h_1^m, s_1^n; θ_attn),
f_θ(z, x, y_{<j}) = softmax(affine([s_j, c_j]; θ_out-y)),

where the attention mechanism is a bi-linear attention (Luong et al., 2015), and the generative parameters are θ = {θ_emb-{x,y}, θ_lstm{f,b}-x, θ_init-y, θ_lstm-y, θ_attn, θ_out-y}.
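One decoder step could be sketched as below: the latent sample z is concatenated with the target word embedding at every time step, and a bi-linear (general) attention scores the source states. Names, dimensions and the single-step interface are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    """One step of the attentive decoder, with z appended to the word embedding."""
    def __init__(self, vocab_size, emb_dim, latent_dim, dec_dim, enc_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim + latent_dim, dec_dim)
        self.bilinear = nn.Linear(dec_dim, enc_dim, bias=False)  # bi-linear attention scores
        self.out = nn.Linear(dec_dim + enc_dim, vocab_size)

    def forward(self, y_prev, z, state, enc_states):
        # y_prev: [batch] previous target token; enc_states: [batch, src_len, enc_dim]
        w = self.emb(y_prev)
        s, c_mem = self.cell(torch.cat([w, z], dim=-1), state)    # s_j = LSTM(s_{j-1}, [w_j, z])
        scores = torch.bmm(enc_states, self.bilinear(s).unsqueeze(-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                         # attention weights over source
        c = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # context vector c_j
        logits = self.out(torch.cat([s, c], dim=-1))              # softmax(affine([s_j, c_j]))
        return logits, (s, c_mem)
```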

Image decoder We do not model images directly, but instead model a 2048-dimensional feature vector v of pre-activations of a ResNet-50's pool5 layer. We simply draw image features from a Gaussian observation model:

V | z ∼ N(ν, ς^2 I),
ν = MLP(z; θ),    (10)

where a multi-layer perceptron (MLP) maps from z to a vector of locations ν ∈ R^o, and ς ∈ R_{>0} is a hyper-parameter of the model (we use 1).

Conditional prior VMMTC Given a source sentence x_1^m, we draw an embedding z from a latent Gaussian model:

Z | x_1^m ∼ N(µ, diag(σ^2)),
µ = MLP(h_1^m; θ_latent),    (11)
σ = softplus(MLP(h_1^m; θ_latent)),    (12)

where Equations (11) and (12) employ two multi-layer perceptrons (MLPs) to map from a source sentence (i.e. source hidden states) to a vector of locations µ ∈ R^c and a vector of scales σ ∈ R^c_{>0}, respectively.

Fixed prior VMMTF In the MMT model VMMTF, we simply have a draw from a standard Normal prior:

Z ∼ N (0, I).

All MLPs have one hidden layer and are implemented as below (eqs. (10) to (12)):

MLP(·) = affine(ReLU(affine( · ; θ)); θ).

A.2 Inference model

The inference network shares the source-language encoder with the generative model and differs depending on the model (VMMTC or VMMTF).


Conditional prior VMMTC Model VMMTC’sapproximate posterior qλ(z|xm1 , yn1 , v) is a Gaus-sian distribution:

Z | x_1^m, y_1^n, v ∼ N(u, diag(s^2); λ).

We use two bidirectional LSTMs, one over source- and the other over target-language words, respectively. To reduce the number of model parameters, we re-use the entire source-language BiLSTM and the target-language embeddings in the generative model but prevent updates to the generative model's parameters by blocking gradients from being back-propagated (Equation 9). Concretely, the inference model is parameterised as below:

h_1^m = detach(BiLSTM(x_1^m; θ_emb-x,lstmf-x,lstmb-x)),
w_1^n = detach(emb(y_1^n; θ_emb-y)),
h_x = avg(affine(h_1^m; λ_x)),
h_y = avg(BiLSTM(w_1^n; λ_y)),
h_v = MLP(v; λ_v),
h_all = [h_x, h_y, h_v],
u = MLP(h_all; λ_mu),
s = softplus(MLP(h_all; λ_sigma)),

where the set of inference network parameters is λ = {λ_x, λ_y, λ_v, λ_mu, λ_sigma}.
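A sketch of this VMMTC inference network in PyTorch follows the equations above: source states and target embeddings are detached so no gradient flows back into the generative model. The module name, hidden sizes and helper layers are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InferenceNetworkC(nn.Module):
    """Produces the variational parameters u and s from (x, y, v) for VMMTC."""
    def __init__(self, enc_dim, emb_dim, img_dim, latent_dim, hidden_dim=500):
        super().__init__()
        self.src_proj = nn.Linear(enc_dim, hidden_dim)                # affine(h_1^m; lambda_x)
        self.tgt_bilstm = nn.LSTM(emb_dim, hidden_dim // 2,
                                  bidirectional=True, batch_first=True)  # outputs hidden_dim total
        self.img_mlp = nn.Sequential(nn.Linear(img_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, hidden_dim))
        self.to_u = nn.Sequential(nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, latent_dim))
        self.to_s = nn.Sequential(nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, latent_dim))

    def forward(self, src_states, tgt_embeddings, image_feats):
        # src_states / tgt_embeddings come from the generative model; detach blocks gradients
        h_x = self.src_proj(src_states.detach()).mean(dim=1)
        h_y, _ = self.tgt_bilstm(tgt_embeddings.detach())
        h_y = h_y.mean(dim=1)
        h_v = self.img_mlp(image_feats)
        h_all = torch.cat([h_x, h_y, h_v], dim=-1)
        u = self.to_u(h_all)
        s = F.softplus(self.to_s(h_all))
        return u, s
```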

Fixed prior VMMTF Model VMMTF’s approx-imate posterior qλ(z|xm1 ) is also a Gaussian:

Z | x_1^m ∼ N(u, diag(s^2); λ),

where we re-use the source-language BiLSTM from the generative model but prevent updates to its parameters by blocking gradients from being back-propagated (Equation 9). Concretely, the inference model is parameterised as below:

h_1^m = detach(BiLSTM(x_1^m; θ_emb-x,lstmf-x,lstmb-x)),
h_x = avg(affine(h_1^m; λ_x)),
u = MLP(h_x; λ_mu),
s = softplus(MLP(h_x; λ_sigma)),

where the set of inference network parameters is λ = {λ_x, λ_mu, λ_sigma}.

Finally, all MLPs are implemented as below:

MLP(·) = affine(ReLU(affine( · ;λ));λ).