


Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1959–1970, Vancouver, Canada, July 30 - August 4, 2017. © 2017 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17-1179


Adversarial Training for Unsupervised Bilingual Lexicon Induction

Meng Zhang†‡   Yang Liu†‡∗   Huanbo Luan†   Maosong Sun†‡

†State Key Laboratory of Intelligent Technology and Systems,
Tsinghua National Laboratory for Information Science and Technology,
Department of Computer Science and Technology, Tsinghua University, Beijing, China
‡Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China

[email protected], [email protected], [email protected], [email protected]

Abstract

Word embeddings are well known to capture linguistic regularities of the language on which they are trained. Researchers also observe that these regularities can transfer across languages. However, previous endeavors to connect separate monolingual word embeddings typically require cross-lingual signals as supervision, either in the form of a parallel corpus or a seed lexicon. In this work, we show that such cross-lingual connection can actually be established without any form of supervision. We achieve this end by formulating the problem as a natural adversarial game, and investigating techniques that are crucial to successful training. We carry out evaluation on the unsupervised bilingual lexicon induction task. Even though this task appears intrinsically cross-lingual, we are able to demonstrate encouraging performance without any cross-lingual clues.

1 Introduction

As the word is the basic unit of a language, improving its representation has a significant impact on various natural language processing tasks. Continuous word representations, commonly known as word embeddings, have formed the basis for numerous neural network models since their advent. Their popularity results from the performance boost they bring, which should in turn be attributed to the linguistic regularities they capture (Mikolov et al., 2013b).

Soon following the success on monolingual tasks, the potential of word embeddings for cross-lingual natural language processing has attracted much attention.

∗Corresponding author.

[Figure 1: Spanish embeddings of caballo (horse), cerdo (pig), gato (cat) alongside English embeddings of horse, pig, cat.]

Figure 1: Illustrative monolingual word embeddings of Spanish and English, adapted from (Mikolov et al., 2013a). Although trained independently, the two sets of embeddings exhibit approximate isomorphism.

In their pioneering work, Mikolov et al. (2013a) observe that word embeddings trained separately on monolingual corpora exhibit isomorphic structure across languages, as illustrated in Figure 1. This interesting finding is in line with research on human cognition (Youn et al., 2016). It also means a linear transformation may be established to connect word embedding spaces, allowing word feature transfer. This has far-reaching implications for low-resource scenarios (Daumé III and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013), because word embeddings only require plain text to train, which is the most abundant form of linguistic resource.

However, connecting separate word embedding spaces typically requires supervision from cross-lingual signals. For example, Mikolov et al. (2013a) use five thousand seed word translation pairs to train the linear transformation. In a recent study, Vulić and Korhonen (2016) show that at least hundreds of seed word translation pairs are needed for the model to generalize. This is unfortunate for low-resource languages and domains,



[Figure 2: schematic of the three models, showing the generator G, its transpose G⊤, the discriminators D, D1, and D2, and the reconstruction loss L_R.]

Figure 2: (a) The unidirectional transformation model directly inspired by the adversarial game: The generator G tries to transform source word embeddings (squares) to make them seem like target ones (dots), while the discriminator D tries to classify whether the input embeddings are generated by G or real samples from the target embedding distribution. (b) The bidirectional transformation model. Two generators with tied weights perform transformation between languages. Two separate discriminators are responsible for each language. (c) The adversarial autoencoder model. The generator aims to make the transformed embeddings not only indistinguishable by the discriminator, but also recoverable as measured by the reconstruction loss L_R.

because data encoding cross-lingual equivalence is often expensive to obtain.

In this work, we aim to entirely eliminate the need for cross-lingual supervision. Our approach draws inspiration from recent advances in generative adversarial networks (Goodfellow et al., 2014). We first formulate our task in a fashion that naturally admits an adversarial game. Then we propose three models that implement the game, and explore techniques to ensure the success of training. Finally, our evaluation on the bilingual lexicon induction task reveals encouraging performance, even though this task appears formidable without any cross-lingual supervision.

2 Models

In order to induce a bilingual lexicon, we start from two sets of monolingual word embeddings with dimensionality d. They are trained separately on two languages. Our goal is to learn a mapping function f : R^d → R^d so that for a source word embedding x, f(x) lies close to the embedding of its target language translation y. The learned mapping function can then be used to translate each source word x by finding the nearest target embedding to f(x).
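To make the induction procedure concrete, here is a minimal NumPy sketch of translating a source word with a learned matrix G (an illustration under our own assumptions, not released code; src_emb is a word-to-vector dictionary and tgt_emb a vocabulary-by-d matrix aligned with tgt_vocab):

import numpy as np

def translate(word, G, src_emb, tgt_emb, tgt_vocab, M=1):
    """Map a source word with G and return its M nearest target words by cosine similarity."""
    fx = G @ src_emb[word]                  # transformed embedding f(x) = Gx
    fx /= np.linalg.norm(fx)
    tgt_norm = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    scores = tgt_norm @ fx                  # cosine similarity to every target word
    return [tgt_vocab[i] for i in np.argsort(-scores)[:M]]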

We consider x to be drawn from a distribution p_x, and similarly y ∼ p_y. The key intuition here is to find the mapping function to make f(x) seem to follow the distribution p_y, for all x ∼ p_x. From this point of view, we design an adversarial game as illustrated in Figure 2(a): The generator G implements the mapping function f, trying to make f(x) passable as target word embeddings, while the discriminator D is a binary classifier striving to distinguish between fake target word embeddings f(x) ∼ p_{f(x)} and real ones y ∼ p_y. This intuition can be formalized as the minimax game min_G max_D V(D, G) with value function

V(D, G) = E_{y∼p_y}[log D(y)] + E_{x∼p_x}[log(1 − D(G(x)))].   (1)

Theoretical analysis reveals that adversarial training tries to minimize the Jensen-Shannon divergence JSD(p_y ‖ p_{f(x)}) (Goodfellow et al., 2014). Importantly, the minimization happens at the distribution level, without requiring word translation pairs to supervise training.

2.1 Model 1: Unidirectional Transformation

The first model directly implements the adversarial game, as shown in Figure 2(a). As hinted by the isomorphism shown in Figure 1, previous works typically choose the mapping function f to be a linear map (Mikolov et al., 2013a; Dinu et al., 2015; Lazaridou et al., 2015). We therefore parametrize the generator as a transformation matrix G ∈ R^{d×d}. We also tried non-linear maps parametrized by neural networks, without success. In fact, if the generator is given sufficient capacity, it can in principle learn a constant mapping function to a target word embedding, which makes it impossible for the discriminator to distinguish, much like the "mode collapse" problem widely observed in the image domain (Radford et al., 2015; Salimans et al., 2016). We therefore believe it is crucial to grant the generator suitable capacity.

As a generic binary classifier, a standard feed-forward neural network with one hidden layer is used to parametrize the discriminator D, and its loss function is the usual cross-entropy loss, as in the value function (1):

L_D = − log D(y) − log(1 − D(Gx)).   (2)

For simplicity, here we write the loss with a minibatch size of 1; in our experiments we use 128.

The generator loss is given by

L_G = − log D(Gx).   (3)

In line with previous work (Goodfellow et al., 2014), we find this loss easier to minimize than the original form log(1 − D(Gx)).
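To illustrate how the losses in Equations (2) and (3) drive alternating updates, here is a minimal PyTorch sketch of Model 1 (our own simplification, not the authors' implementation; the layer sizes follow Section 3.3, but the noise injection of Section 3.1 and other details are omitted, and src / tgt stand for minibatches of source and target embeddings):

import torch
import torch.nn as nn

d, hidden = 50, 500
G = nn.Linear(d, d, bias=False)                        # generator: linear map G
D = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                  nn.Linear(hidden, 1), nn.Sigmoid())  # discriminator with one hidden layer
opt_G, opt_D = torch.optim.Adam(G.parameters()), torch.optim.Adam(D.parameters())
bce = nn.BCELoss()

def train_step(src, tgt):
    # Discriminator step (Eq. 2): real targets labeled 1, transformed sources labeled 0.
    fake = G(src).detach()
    loss_D = bce(D(tgt), torch.ones(len(tgt), 1)) + bce(D(fake), torch.zeros(len(src), 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # Generator step (Eq. 3): minimize -log D(Gx) so that G(src) fools the discriminator.
    loss_G = bce(D(G(src)), torch.ones(len(src), 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_G.item()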

Orthogonal Constraint

The above model is very difficult to train. One possible reason is that the parameter search space R^{d×d} for the generator may still be too large. Previous works have attempted to constrain the transformation matrix to be orthogonal (Xing et al., 2015; Zhang et al., 2016b; Artetxe et al., 2016). An orthogonal transformation is also theoretically appealing for its self-consistency (Smith et al., 2017) and numerical stability. However, using constrained optimization for our purpose is cumbersome, so we opt for an orthogonal parametrization (Mhammedi et al., 2016) of the generator instead.
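One way to realize such a parametrization, in the spirit of the Householder reflections of Mhammedi et al. (2016) (a sketch under our own assumptions, not their implementation), is to build G as a product of reflections, each of which is orthogonal, so the product is orthogonal by construction:

import torch
import torch.nn as nn

class HouseholderOrthogonal(nn.Module):
    """Parametrize a d x d orthogonal matrix as a product of d Householder reflections."""
    def __init__(self, d):
        super().__init__()
        self.vs = nn.Parameter(torch.randn(d, d))      # one reflection vector per row

    def matrix(self):
        d = self.vs.shape[1]
        G = torch.eye(d)
        for v in self.vs:
            H = torch.eye(d) - 2.0 * torch.outer(v, v) / (v @ v)  # reflection about v
            G = H @ G
        return G

    def forward(self, x):
        return x @ self.matrix().T                     # apply G to a batch of embeddings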

2.2 Model 2: Bidirectional Transformation

The orthogonal parametrization is still quite slow. We can relax the orthogonal constraint and only require the transformation to be self-consistent (Smith et al., 2017): If G transforms the source word embedding space into the target language space, its transpose G⊤ should transform the target language space back to the source. This can be implemented by two unidirectional models with a tied generator, as illustrated in Figure 2(b). Two separate discriminators are used, with the same cross-entropy loss as Equation (2) used by Model 1. The generator loss is given by

L_G = − log D1(Gx) − log D2(G⊤y).   (4)

2.3 Model 3: Adversarial Autoencoder

As another way to relax the orthogonal constraint, we introduce the adversarial autoencoder (Makhzani et al., 2015), depicted in Figure 2(c). After the generator G transforms a source word embedding x into a target language representation Gx, we should be able to reconstruct the source word embedding x by mapping back with G⊤. We therefore introduce the reconstruction loss measured by cosine similarity:

L_R = − cos(x, G⊤Gx).   (5)

Note that this loss will be minimized if G is orthogonal. With this term included, the loss function for the generator becomes

L_G = − log D(Gx) − λ cos(x, G⊤Gx),   (6)

where λ is a hyperparameter that balances the two terms. λ = 0 recovers the unidirectional transformation model, while larger λ should enforce a stricter orthogonal constraint.
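For concreteness, the generator objective of Equation (6) can be written as follows (a sketch; G is assumed to be a d x d parameter tensor and D the discriminator from the earlier sketch, with x a minibatch of source embeddings):

import torch
import torch.nn.functional as F

def generator_loss(G, D, x, lam=1.0):
    """Adversarial autoencoder generator loss: -log D(Gx) - lambda * cos(x, G^T G x)."""
    Gx = x @ G.T                                          # transformed source embeddings
    adv = -torch.log(D(Gx) + 1e-8).mean()                 # fool the discriminator
    recon = F.cosine_similarity(x, Gx @ G, dim=1).mean()  # cos(x, G^T G x), maximal when G is orthogonal
    return adv - lam * recon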

3 Training Techniques

Generative adversarial networks are notoriously difficult to train, and investigation into stabler training remains a research frontier (Radford et al., 2015; Salimans et al., 2016; Arjovsky and Bottou, 2017). We contribute in this aspect by reporting techniques that are crucial to successful training for our task.

3.1 Regularizing the Discriminator

Recently, it has been suggested to inject noise into the input to the discriminator (Sønderby et al., 2016; Arjovsky and Bottou, 2017). The noise is typically additive Gaussian. Here we explore more possibilities, with the following types of noise, injected into the input and hidden layer:

• Multiplicative Bernoulli noise (dropout) (Srivastava et al., 2014): ε ∼ Bernoulli(p).

• Additive Gaussian noise: ε ∼ N(0, σ²).

• Multiplicative Gaussian noise: ε ∼ N(1, σ²).

As noise injection is a form of regularization (Bishop, 1995; Van der Maaten et al., 2013; Wager et al., 2013), we also try l2 regularization, and directly restricting the hidden layer size to combat overfitting. Our findings include:

• Without regularization, it is still possible for the optimizer to find a satisfactory parameter configuration, but the hidden layer size has to be tuned carefully. This indicates that a balance of capacity between the generator and discriminator is needed.

• All forms of regularization help training by allowing us to liberally set the hidden layer size to a relatively large value.

• Among the types of regularization, multiplicative Gaussian noise injected into the input is the most effective, and additive Gaussian noise is similarly effective. On top of input noise, hidden layer noise helps slightly.

In the following experiments, we inject multiplicative Gaussian noise into the input and hidden layer of the discriminator with σ = 0.5.
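A minimal sketch of this noise injection (our own illustration; only the σ = 0.5 multiplicative Gaussian noise described above is shown):

import torch
import torch.nn as nn

class NoisyDiscriminator(nn.Module):
    """Feed-forward discriminator with multiplicative Gaussian noise on the input and hidden layer."""
    def __init__(self, d=50, hidden=500, sigma=0.5):
        super().__init__()
        self.sigma = sigma
        self.fc1, self.fc2 = nn.Linear(d, hidden), nn.Linear(hidden, 1)

    def mult_noise(self, h):
        if self.training:
            return h * (1.0 + self.sigma * torch.randn_like(h))   # epsilon ~ N(1, sigma^2)
        return h

    def forward(self, x):
        h = torch.relu(self.fc1(self.mult_noise(x)))
        return torch.sigmoid(self.fc2(self.mult_noise(h)))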

3.2 Model Selection

From a typical training trajectory shown in Figure 3, we observe that training is not convergent. In fact, simply using the model saved at the end of training gives poor performance. Therefore we need a mechanism to select a good model. We observe there are sharp drops of the generator loss L_G, and find they correspond to good models, as the discriminator gets confused at these points with its classification accuracy (D accuracy) dropping simultaneously. Interestingly, the reconstruction loss L_R and the value of ‖G⊤G − I‖_F exhibit synchronous drops, even if we use the unidirectional transformation model (λ = 0).

[Figure 3 plots ‖G⊤G − I‖_F, the generator loss L_G, the discriminator accuracy, and the reconstruction loss L_R against the number of minibatches.]

Figure 3: A typical training trajectory of the adversarial autoencoder model with λ = 1. The values are averages within each minibatch.

This means a good transformation matrix is indeed nearly orthogonal, and justifies our encouragement of G towards orthogonality. With this finding, we can train for a sufficient number of steps and save the model with the lowest generator loss.

As we aim to find the cross-lingual transformation without supervision, it would be ideal to determine hyperparameters without a validation set. The sharp drops can also be indicative in this case. If a hyperparameter configuration is poor, those values will oscillate without a clear drop. Although this criterion is somewhat subjective, we find it to be quite feasible in practice.
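In implementation terms, the selection rule amounts to tracking the minibatch-averaged generator loss and snapshotting the generator whenever a new minimum is reached (a sketch with assumed helper names; train_step is a hypothetical function performing one adversarial update and returning L_G, and generator is an nn.Module):

import copy

def train_with_selection(num_steps, train_step, generator):
    best_loss, best_state = float("inf"), None
    for step in range(num_steps):
        loss_G = train_step()
        if loss_G < best_loss:                  # sharp drops of L_G mark good models
            best_loss = loss_G
            best_state = copy.deepcopy(generator.state_dict())
    return best_state                           # model saved at the lowest generator loss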

3.3 Other Training Details

Our approach takes monolingual word embeddings as input. We train the CBOW model (Mikolov et al., 2013b) with default hyperparameters in word2vec.1 The embedding dimension d is 50 unless stated otherwise. Before feeding them into our system, we normalize the word embeddings to unit length. When sampling words for adversarial training, we penalize frequent words in a way similar to (Mikolov et al., 2013b).

1 https://code.google.com/archive/p/word2vec



G is initialized with a random orthogonal matrix. The hidden layer size of D is 500. Adversarial training involves alternating gradient updates of the generator and discriminator, which we implement with a simpler variant algorithm described in (Nowozin et al., 2016). Adam (Kingma and Ba, 2014) is used as the optimizer, with default hyperparameters. For the adversarial autoencoder model, λ = 1 generally works well, but λ = 10 appears stabler for the low-resource Turkish-English setting.
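The preprocessing described at the start of this subsection can be sketched as follows (our own illustration; the paper only states the penalty is similar to Mikolov et al. (2013b), so the word2vec-style subsampling formula and the threshold t are assumptions):

import numpy as np

def normalize_rows(emb):
    """Normalize every word embedding to unit length before adversarial training."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def keep_probs(counts, t=1e-4):
    """Word2vec-style penalty on frequent words: keep word w with probability min(1, sqrt(t / f(w)))."""
    freqs = counts / counts.sum()
    return np.minimum(1.0, np.sqrt(t / freqs))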

4 Experiments

We evaluate the quality of the cross-lingual embedding transformation on the bilingual lexicon induction task. After a source word embedding is transformed into the target space, its M nearest target embeddings (in terms of cosine similarity) are retrieved, and compared against the entry in a ground truth bilingual lexicon. Performance is measured by top-M accuracy (Vulić and Moens, 2013): If any of the M translations is found in the ground truth bilingual lexicon, the source word is considered to be handled correctly, and the accuracy is calculated as the percentage of correctly translated source words. We generally report the harshest top-1 accuracy, except when comparing with published figures in Section 4.4.
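The evaluation metric can be computed as in the following sketch (hypothetical helper names; lexicon maps a source word to its set of reference translations, and retrieve returns the M nearest target words for a source word):

def top_m_accuracy(test_words, lexicon, retrieve, M=1):
    """Percentage of source words whose top-M retrieved translations hit the ground truth lexicon."""
    correct = sum(1 for w in test_words
                  if any(t in lexicon[w] for t in retrieve(w, M)))
    return 100.0 * correct / len(test_words)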

Baselines

Almost all approaches to bilingual lexicon induction from non-parallel data depend on seed lexica. An exception is decipherment (Dou and Knight, 2012; Dou et al., 2015), and we use it as our baseline. The decipherment approach is not based on distributional semantics, but rather views the source language as a cipher for the target language, and attempts to learn a statistical model to decipher the source language. We run the MonoGiza system as recommended by the toolkit.2 It can also utilize monolingual embeddings (Dou et al., 2015); in this case, we use the same embeddings as the input to our approach.

Sharing the underlying spirit with our approach, related methods also build upon monolingual word embeddings and find transformations to link different languages. Although they need seed word translation pairs to train and are thus not directly comparable, we report their performance with 50 and 100 seeds for reference. These methods are:

2 http://www.isi.edu/natural-language/software/monogiza_release_v1.0.tar.gz

                               # tokens   vocab. size
Wikipedia comparable corpora
  zh-en                 zh     21m        3,349
                        en     53m        5,154
  es-en                 es     61m        4,774
                        en     95m        6,637
  it-en                 it     73m        8,490
                        en     93m        6,597
  ja-zh                 ja     38m        6,043
                        zh     16m        2,814
  tr-en                 tr     6m         7,482
                        en     28m        13,220
Large-scale settings
  zh-en (Wikipedia)     zh     143m       14,686
                        en     1,907m     61,899
  zh-en (Gigaword)      zh     2,148m     45,958
                        en     5,017m     73,504

Table 1: Statistics of the non-parallel corpora. Language codes: zh = Chinese, en = English, es = Spanish, it = Italian, ja = Japanese, tr = Turkish.

• Translation matrix (TM) (Mikolov et al., 2013a): the pioneer of this type of methods mentioned in the introduction, using linear transformation. We use a publicly available implementation.3

• Isometric alignment (IA) (Zhang et al., 2016b): an extension of TM that augments its learning objective with the isometric (orthogonal) constraint. Although Zhang et al. (2016b) had subsequent steps for their POS tagging task, it can be used for bilingual lexicon induction as well.

We ensure the same input embeddings for these methods and ours.

The seed word translation pairs are obtained as follows. First, we ask Google Translate4 to translate the source language vocabulary. Then the target translations are queried again and translated back to the source language, and those that do not match the original source words are discarded. This helps to ensure the translation quality. Finally, the translations are discarded if they fall out of our target language vocabulary.

3 http://clic.cimec.unitn.it/~georgiana.dinu/down
4 https://translate.google.com
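The round-trip filtering described above can be sketched as follows (translate is a hypothetical stand-in for a Google Translate query, not an actual API call):

def build_seed_pairs(src_vocab, tgt_vocab, translate):
    pairs = []
    for w in src_vocab:
        t = translate(w, "src", "tgt")            # forward translation of the source word
        if translate(t, "tgt", "src") != w:       # back-translation must recover the source word
            continue
        if t in tgt_vocab:                        # keep only translations inside the target vocabulary
            pairs.append((w, t))
    return pairs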



method              # seeds   accuracy (%)
MonoGiza w/o emb.   0         0.05
MonoGiza w/ emb.    0         0.09
TM                  50        0.29
                    100       21.79
IA                  50        18.71
                    100       32.29
Model 1             0         39.25
Model 1 + ortho.    0         28.62
Model 2             0         40.28
Model 3             0         43.31

Table 2: Chinese-English top-1 accuracies of the MonoGiza baseline and our models, along with the translation matrix (TM) and isometric alignment (IA) methods that utilize 50 and 100 seeds.

4.1 Experiments on Chinese-English

Data

For this set of experiments, the data for training word embeddings comes from Wikipedia comparable corpora.5 Following (Vulić and Moens, 2013), we retain only nouns with at least 1,000 occurrences. For the Chinese side, we first use OpenCC6 to normalize characters to be simplified, and then perform Chinese word segmentation and POS tagging with THULAC.7 The preprocessing of the English side involves tokenization, POS tagging, lemmatization, and lowercasing, which we carry out with the NLTK toolkit.8 The statistics of the final training data are given in Table 1, along with the other experimental settings.

As the ground truth bilingual lexicon for evaluation, we use Chinese-English Translation Lexicon Version 3.0 (LDC2002L27).

Overall Performance

Table 2 lists the performance of the MonoGiza baseline and our four variants of adversarial training. MonoGiza obtains low performance, likely due to the harsh evaluation protocol (cf. Section 4.4). Providing it with syntactic information can help (Dou and Knight, 2013), but in a low-resource scenario with zero cross-lingual information, parsers are likely to be inaccurate or even unavailable.

5 http://linguatools.org/tools/corpora/wikipedia-comparable-corpora

6 https://github.com/BYVoid/OpenCC
7 http://thulac.thunlp.org
8 http://www.nltk.org

城市 (chengshi)    小行星 (xiaoxingxing)    文学 (wenxue)
city               asteroid                 poetry
town               astronomer               literature
suburb             comet                    prose
area               constellation            poet
proximity          orbit                    writing

Table 3: Top-5 English translation candidates proposed by our approach for some Chinese words. The ground truth is marked in bold.

[Figure 4 plots top-1 accuracy (%) against the number of seeds for Ours, IA, and TM.]

Figure 4: Top-1 accuracies of our approach, isometric alignment (IA), and translation matrix (TM), with the number of seeds varying in {50, 100, 200, 500, 1000, 1280}.

The unidirectional transformation model attains reasonable accuracy if trained successfully, but it is rather sensitive to hyperparameters and initialization. This training difficulty motivates our orthogonal constraint. But imposing a strict orthogonal constraint hurts performance. It is also about 20 times slower, even though we utilize orthogonal parametrization instead of constrained optimization. The last two models represent different relaxations of the orthogonal constraint, and the adversarial autoencoder model achieves the best performance. We therefore use it in our following experiments. Table 3 lists some word translation examples given by the adversarial autoencoder model.

Comparison With Seed-Based Methods

In this section, we investigate how many seeds TM and IA require to attain the performance level of our approach. There are a total of 1,280 seed translation pairs for Chinese-English, which are removed from the test set during the evaluation for this experiment. We use the most frequent S pairs for TM and IA.




method                    # seeds   es-en   it-en   ja-zh   tr-en
MonoGiza w/o embeddings   0         0.35    0.30    0.04    0.00
MonoGiza w/ embeddings    0         1.19    0.27    0.23    0.09
TM                        50        1.24    0.76    0.35    0.09
                          100       48.61   37.95   26.67   11.15
IA                        50        39.89   27.03   19.04   7.58
                          100       60.44   46.52   36.35   17.11
Ours                      0         71.97   58.60   43.02   17.18

Table 4: Top-1 accuracies (%) of the MonoGiza baseline and our approach on Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. The results for translation matrix (TM) and isometric alignment (IA) using 50 and 100 seeds are also listed.

[Figure 5 plots top-1 accuracy (%) against the embedding dimension.]

Figure 5: Top-1 accuracies of our approach with respect to the input embedding dimensions in {20, 50, 100, 200}.

Figure 4 shows the accuracies with respect to S. When the seeds are few, the seed-based methods exhibit clear performance degradation. In this case, we also observe the importance of the orthogonal constraint from the superiority of IA over TM, which supports our introduction of this constraint as we attempt zero supervision. Finally, in line with the finding in (Vulić and Korhonen, 2016), hundreds of seeds are needed for TM to generalize. Only then do seed-based methods catch up with our approach, and the performance difference is marginal even when more seeds are provided.

Effect of Embedding Dimension

As our approach takes monolingual word embeddings as input, it is conceivable that their quality significantly affects how well the two spaces can be connected by a linear map. We look into this aspect by varying the embedding dimension d in Figure 5. As the dimension increases, the accuracy improves and gradually levels off. This indicates that too low a dimension hampers the encoding of linguistic information drawn from the corpus, and it is advisable to use a sufficiently large dimension.

4.2 Experiments on Other Language Pairs

Data

We also induce bilingual lexica from Wikipedia comparable corpora for the following language pairs: Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. For Spanish-English and Italian-English, we choose to use TreeTagger9 for preprocessing, as in (Vulić and Moens, 2013). For the Japanese corpus, we use MeCab10 for word segmentation and POS tagging. For Turkish, we utilize the preprocessing tools (tokenization and POS tagging) provided in LORELEI Language Packs (Strassel and Tracey, 2016), and its English side is preprocessed by NLTK. Unlike the other language pairs, the frequency cutoff threshold for Turkish-English is 100, as the amount of data is relatively small.

The ground truth bilingual lexica for Spanish-English and Italian-English are obtained from Open Multilingual WordNet11 through NLTK. For Japanese-Chinese, we use an in-house lexicon. For Turkish-English, we build a set of ground truth translation pairs in the same way as we obtain seed word translation pairs from Google Translate, described above.

Results

As shown in Table 4, the MonoGiza baseline still does not work well on these language pairs, while our approach achieves much better performance. The accuracies are particularly high for Spanish-English and Italian-English, likely because they are closely related languages, and their embedding spaces may exhibit stronger isomorphism.

9 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger

10 http://taku910.github.io/mecab
11 http://compling.hss.ntu.edu.sg/omw



method   # seeds   Wikipedia   Gigaword
TM       50        0.00        0.01
         100       4.79        2.07
IA       50        3.25        1.68
         100       7.08        4.18
Ours     0         7.92        2.53

Table 5: Top-1 accuracies (%) of our approach to inducing bilingual lexica for Chinese-English from Wikipedia and Gigaword. Also listed are results for translation matrix (TM) and isometric alignment (IA) using 50 and 100 seeds.

The performance on Japanese-Chinese is lower, on a comparable level with Chinese-English (cf. Table 2), and these languages are relatively distantly related. Turkish-English represents a low-resource scenario, and therefore the lexical semantic structure may be insufficiently captured by the embeddings. The agglutinative nature of Turkish can also add to the challenge.

4.3 Large-Scale Settings

We experiment with large-scale Chinese-English data from two sources: the whole Wikipedia dump and Gigaword (LDC2011T13 and LDC2011T07). We also simplify preprocessing by removing the noun restriction and the lemmatization step (cf. preprocessing decisions for the above experiments).

Although large-scale data may benefit the training of embeddings, it poses a greater challenge to bilingual lexicon induction. First, the degree of non-parallelism tends to increase. Second, with cruder preprocessing, the noise in the corpora may take its toll. Finally, but probably most importantly, the vocabularies expand dramatically compared to previous settings (see Table 1). This means a word translation has to be retrieved from a much larger pool of candidates.

For these reasons, we consider the performance of our approach presented in Table 5 to be encouraging. The imbalanced sizes of the Chinese and English Wikipedia do not seem to cause a problem for the structural isomorphism needed by our method. MonoGiza does not scale to such large vocabularies, as it already takes days to train in our Italian-English setting. In contrast, our approach is immune to scalability issues by working with embeddings provided by word2vec, which is well known for its fast speed.

method                    5k      10k
MonoGiza w/o embeddings   13.74   7.80
MonoGiza w/ embeddings    17.98   10.56
(Cao et al., 2016)        23.54   17.82
Ours                      68.59   51.86

Table 6: Top-5 accuracies (%) of 5k and 10k most frequent words in the French-English setting. The figures for the baselines are taken from (Cao et al., 2016).

With the network configuration used in our experiments, the adversarial autoencoder model takes about two hours to train for 500k minibatches on a single CPU.

4.4 Comparison With (Cao et al., 2016)

In order to compare with the recent method by Cao et al. (2016), which also uses zero cross-lingual signal to connect monolingual embeddings, we replicate their French-English experiment to test our approach.12 This experimental setting has important differences from the above ones, mostly in the evaluation protocol. Apart from using top-5 accuracy as the evaluation metric, the ground truth bilingual lexicon is obtained by performing word alignment on a parallel corpus. We find this automatically constructed bilingual lexicon to be noisier than the ones we use for the other language pairs; it often lists tens of translations for a source word. This lenient evaluation protocol should explain MonoGiza's higher numbers in Table 6 than what we report in the other experiments. In this setting, our approach is able to considerably outperform both MonoGiza and the method by Cao et al. (2016).

5 Related Work

5.1 Cross-Lingual Word Embeddings for Bilingual Lexicon Induction

Inducing bilingual lexica from non-parallel data is a long-standing cross-lingual task. Except for the decipherment approach, traditional statistical methods all require cross-lingual signals (Rapp, 1999; Koehn and Knight, 2002; Fung and Cheung, 2004; Gaussier et al., 2004; Haghighi et al., 2008; Vulić et al., 2011; Vulić and Moens, 2013).


12 As a confirmation, we ran MonoGiza in this setting and obtained comparable performance as reported.



Recent advances in cross-lingual word embeddings (Vulić and Korhonen, 2016; Upadhyay et al., 2016) have rekindled interest in bilingual lexicon induction. Like their traditional counterparts, these embedding-based methods require cross-lingual signals encoded in parallel data, aligned at the document level (Vulić and Moens, 2015), sentence level (Zou et al., 2013; Chandar A P et al., 2014; Hermann and Blunsom, 2014; Kocisky et al., 2014; Gouws et al., 2015; Luong et al., 2015; Coulmance et al., 2015; Oshikiri et al., 2016), or word level (i.e. seed lexicon) (Gouws and Søgaard, 2015; Wick et al., 2016; Duong et al., 2016; Shi et al., 2015; Mikolov et al., 2013a; Dinu et al., 2015; Lazaridou et al., 2015; Faruqui and Dyer, 2014; Lu et al., 2015; Ammar et al., 2016; Zhang et al., 2016a, 2017; Smith et al., 2017). In contrast, our work completely removes the need for cross-lingual signals to connect monolingual word embeddings, trained on non-parallel text corpora.

As one of our baselines, the method by Cao et al. (2016) also does not require cross-lingual signals to train bilingual word embeddings. It modifies the objective for training embeddings, whereas our approach uses monolingual embeddings trained beforehand and held fixed. More importantly, its learning mechanism is substantially different from ours. It encourages word embeddings from different languages to lie in the shared semantic space by matching the mean and variance of the hidden states, which are assumed to follow a Gaussian distribution, an assumption that is hard to justify. Our approach does not make any such assumptions and directly matches the mapped source embedding distribution with the target distribution by adversarial training.

A recent work also attempts adversarial training for cross-lingual embedding transformation (Barone, 2016). The model architectures are similar to ours, but the reported results are not positive. We tried the publicly available code on our data, but the results were not positive, either. Therefore, we attribute the outcome to the difference in the loss and training techniques, rather than the model architectures or data.

5.2 Adversarial Training

Generative adversarial networks were originally proposed as an implicit generative model for generating realistic images, but the adversarial training technique for matching distributions generalizes to many more tasks, including natural language processing. For example, Ganin et al. (2016) address domain adaptation by adversarially training features to be domain invariant, and test on sentiment classification. Chen et al. (2016) extend this idea to cross-lingual sentiment classification. Our research deals with unsupervised bilingual lexicon induction based on word embeddings, and therefore works with word embedding distributions, which are more interpretable than the neural feature spaces of the classifiers in the above works.

In the field of neural machine translation, a recent work (He et al., 2016) proposes dual learning, which also involves a two-agent game and therefore bears conceptual resemblance to the adversarial training idea. The framework is carried out with reinforcement learning, and thus differs greatly in implementation from adversarial training.

6 Conclusion

In this work, we demonstrate the feasibility of connecting word embeddings of different languages without any cross-lingual signal. This is achieved by matching the distributions of the transformed source language embeddings and target ones via adversarial training. The success of our approach signifies the existence of universal lexical semantic structure across languages. Our work also opens up opportunities for the processing of extremely low-resource languages and domains that lack parallel data completely.

Our work is likely to benefit from advances in techniques that further stabilize adversarial training. Future work also includes investigating other divergences that adversarial training can minimize (Nowozin et al., 2016), and broader mathematical tools that match distributions (Mohamed and Lakshminarayanan, 2016).

Acknowledgments

We thank the anonymous reviewers for their helpful comments. This work is supported by the National Natural Science Foundation of China (No. 61522204), the 973 Program (2014CB340501), and the National Natural Science Foundation of China (No. 61331013). This research is also supported by the Singapore National Research Foundation under its International Research Centre@Singapore Funding Initiative and administered by the IDM Programme.



References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively Multilingual Word Embeddings. arXiv:1602.01925 [cs]. http://arxiv.org/abs/1602.01925.

Martin Arjovsky and Leon Bottou. 2017. Towards Principled Methods For Training Generative Adversarial Networks. In ICLR. http://arxiv.org/abs/1701.04862.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In EMNLP. http://aclanthology.info/papers/learning-principled-bilingual-mappings-of-word-embeddings-while-preserving-monolingual-invariance.

Antonio Valerio Miceli Barone. 2016. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP. https://doi.org/10.18653/v1/W16-1614.

Chris M. Bishop. 1995. Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation. https://doi.org/10.1162/neco.1995.7.1.108.

Hailong Cao, Tiejun Zhao, Shu Zhang, and Yao Meng. 2016. A Distribution-based Model to Learn Bilingual Word Embeddings. In COLING. http://aclanthology.info/papers/a-distribution-based-model-to-learn-bilingual-word-embeddings.

Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C Raykar, and Amrita Saha. 2014. An Autoencoder Approach to Learning Bilingual Word Representations. In NIPS. http://papers.nips.cc/paper/5270-an-autoencoder-approach-to-learning-bilingual-word-representations.pdf.

Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2016. Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification. arXiv:1606.01614 [cs]. http://arxiv.org/abs/1606.01614.

Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, Fast Cross-lingual Word-embeddings. In EMNLP. http://aclanthology.info/papers/trans-gram-fast-cross-lingual-word-embeddings.

Hal Daumé III and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In ACL-HLT. http://aclweb.org/anthology/P11-2071.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving Zero-Shot Learning by Mitigating the Hubness Problem. In ICLR Workshop. http://arxiv.org/abs/1412.6568.

Qing Dou and Kevin Knight. 2012. Large scale decipherment for out-of-domain machine translation. In EMNLP-CoNLL. http://aclweb.org/anthology/D12-1025.

Qing Dou and Kevin Knight. 2013. Dependency-Based Decipherment for Resource-Limited Machine Translation. In EMNLP. http://aclanthology.info/papers/dependency-based-decipherment-for-resource-limited-machine-translation.

Qing Dou, Ashish Vaswani, Kevin Knight, and Chris Dyer. 2015. Unifying Bayesian Inference and Vector Space Models for Improved Decipherment. In ACL-IJCNLP. http://www.aclweb.org/anthology/P15-1081.

Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning Crosslingual Word Embeddings without Bilingual Corpora. In EMNLP. http://aclanthology.info/papers/learning-crosslingual-word-embeddings-without-bilingual-corpora.

Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual Correlation. In EACL. http://aclanthology.info/papers/improving-vector-space-word-representations-using-multilingual-correlation.

Pascale Fung and Percy Cheung. 2004. Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM. In EMNLP. http://aclweb.org/anthology/W04-3208.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research. http://jmlr.org/papers/v17/15-239.html.

Eric Gaussier, J.M. Renders, I. Matveeva, C. Goutte, and H. Dejean. 2004. A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora. In ACL. https://doi.org/10.3115/1218955.1219022.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NIPS. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. In ICML. http://jmlr.org/proceedings/papers/v37/gouws15.html.

Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In NAACL-HLT. http://www.aclweb.org/anthology/N15-1157.



Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning Bilingual Lexicons from Monolingual Corpora. In ACL-HLT. http://aclanthology.info/papers/learning-bilingual-lexicons-from-monolingual-corpora.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual Learning for Machine Translation. In NIPS. http://papers.nips.cc/paper/6469-dual-learning-for-machine-translation.pdf.

Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual Distributed Representations without Word Alignment. In ICLR. http://arxiv.org/abs/1312.6173.

Ann Irvine and Chris Callison-Burch. 2013. Combining bilingual and comparable corpora for low resource machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation. http://aclweb.org/anthology/W13-2233.

Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs]. http://arxiv.org/abs/1412.6980.

Philipp Koehn and Kevin Knight. 2002. Learning a Translation Lexicon from Monolingual Corpora. In ACL Workshop on Unsupervised Lexical Acquisition. https://doi.org/10.3115/1118627.1118629.

Tomas Kocisky, Karl Moritz Hermann, and Phil Blunsom. 2014. Learning Bilingual Word Representations by Marginalizing Alignments. In ACL. http://aclanthology.info/papers/learning-bilingual-word-representations-by-marginalizing-alignments.

Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and Pollution: Delving into Cross-Space Mapping for Zero-Shot Learning. In ACL-IJCNLP. https://doi.org/10.3115/v1/P15-1027.

Ang Lu, Weiran Wang, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Deep Multilingual Correlation for Improved Word Embeddings. In NAACL-HLT. http://aclanthology.info/papers/deep-multilingual-correlation-for-improved-word-embeddings.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual Word Representations with Monolingual Quality in Mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. http://aclanthology.info/papers/bilingual-word-representations-with-monolingual-quality-in-mind.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial Autoencoders. arXiv:1511.05644 [cs]. http://arxiv.org/abs/1511.05644.

Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. 2016. Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections. arXiv:1612.00188 [cs]. http://arxiv.org/abs/1612.00188.

Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168 [cs]. http://arxiv.org/abs/1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

Shakir Mohamed and Balaji Lakshminarayanan. 2016. Learning in Implicit Generative Models. arXiv:1610.03483 [cs, stat]. http://arxiv.org/abs/1610.03483.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv:1606.00709 [cs, stat]. http://arxiv.org/abs/1606.00709.

Takamasa Oshikiri, Kazuki Fukui, and Hidetoshi Shimodaira. 2016. Cross-Lingual Word Representations via Spectral Graph Embeddings. In ACL. https://doi.org/10.18653/v1/P16-2080.

Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 [cs]. http://arxiv.org/abs/1511.06434.

Reinhard Rapp. 1999. Automatic Identification of Word Translations from Unrelated English and German Corpora. In ACL. https://doi.org/10.3115/1034678.1034756.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved Techniques for Training GANs. In NIPS. http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf.

Tianze Shi, Zhiyuan Liu, Yang Liu, and Maosong Sun. 2015. Learning Cross-lingual Word Embeddings via Matrix Co-factorization. In ACL-IJCNLP. http://aclanthology.info/papers/learning-cross-lingual-word-embeddings-via-matrix-co-factorization.

Samuel Smith, David Turban, Steven Hamblin, and Nils Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In ICLR. http://arxiv.org/abs/1702.03859.

Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszar. 2016. Amortised MAP Inference for Image Super-resolution. arXiv:1610.04490 [cs, stat]. http://arxiv.org/abs/1610.04490.



Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research. http://www.jmlr.org/papers/v15/srivastava14a.html.

Stephanie Strassel and Jennifer Tracey. 2016. LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages. In LREC. http://www.lrec-conf.org/proceedings/lrec2016/pdf/1138_Paper.pdf.

Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual Models of Word Embeddings: An Empirical Comparison. In ACL. http://aclanthology.info/papers/cross-lingual-models-of-word-embeddings-an-empirical-comparison.

Laurens Van der Maaten, Minmin Chen, Stephen Tyree, and Kilian Weinberger. 2013. Learning with Marginalized Corrupted Features. In ICML. http://www.jmlr.org/proceedings/papers/v28/vandermaaten13.html.

Ivan Vulić and Anna Korhonen. 2016. On the Role of Seed Lexicons in Learning Bilingual Word Embeddings. In ACL. http://aclanthology.info/papers/on-the-role-of-seed-lexicons-in-learning-bilingual-word-embeddings.

Ivan Vulić and Marie-Francine Moens. 2013. Cross-Lingual Semantic Similarity of Words as the Similarity of Their Semantic Word Responses. In NAACL-HLT. http://aclanthology.info/papers/cross-lingual-semantic-similarity-of-words-as-the-similarity-of-their-semantic-word-responses.

Ivan Vulić and Marie-Francine Moens. 2015. Bilingual Word Embeddings from Non-Parallel Document-Aligned Data Applied to Bilingual Lexicon Induction. In ACL-IJCNLP. http://aclanthology.info/papers/bilingual-word-embeddings-from-non-parallel-document-aligned-data-applied-to-bilingual-lexicon-induction.

Ivan Vulić, Wim De Smet, and Marie-Francine Moens. 2011. Identifying Word Translations from Comparable Corpora Using Latent Topic Models. In ACL-HLT. http://aclanthology.info/papers/identifying-word-translations-from-comparable-corpora-using-latent-topic-models.

Stefan Wager, Sida Wang, and Percy S Liang. 2013. Dropout Training as Adaptive Regularization. In NIPS. http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf.

Michael Wick, Pallika Kanani, and Adam Pocock. 2016. Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching. In AAAI. http://www.aaai.org/Conferences/AAAI/2016/Papers/15Wick12464.pdf.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. In NAACL-HLT. http://aclanthology.info/papers/normalized-word-embedding-and-orthogonal-transform-for-bilingual-word-translation.

Hyejin Youn, Logan Sutton, Eric Smith, Cristopher Moore, Jon F. Wilkins, Ian Maddieson, William Croft, and Tanmoy Bhattacharya. 2016. On the universal structure of human lexical semantics. Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.1520752113.

Meng Zhang, Yang Liu, Huanbo Luan, Yiqun Liu, and Maosong Sun. 2016a. Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization. In COLING. http://aclanthology.info/papers/inducing-bilingual-lexica-from-non-parallel-data-with-earth-mover-s-distance-regularization.

Meng Zhang, Haoruo Peng, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Bilingual Lexicon Induction From Non-Parallel Data With Minimal Supervision. In AAAI. http://thunlp.org/~zm/publications/aaai2017.pdf.

Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. 2016b. Ten Pairs to Tag – Multilingual POS Tagging via Coarse Mapping between Embeddings. In NAACL-HLT. http://aclanthology.info/papers/ten-pairs-to-tag-multilingual-pos-tagging-via-coarse-mapping-between-embeddings.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In EMNLP. http://aclanthology.info/papers/bilingual-word-embeddings-for-phrase-based-machine-translation.
