
Yoshinari Fujinuma, Michael Paul, and Jordan Boyd-Graber. A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity. Association for Computational Linguistics, 2019, 10 pages.

@inproceedings{Fujinuma:Paul:Boyd-Graber-2019,
  Title = {A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity},
  Author = {Yoshinari Fujinuma and Michael Paul and Jordan Boyd-Graber},
  Booktitle = {Association for Computational Linguistics},
  Year = {2019},
  Location = {Florence, Italy},
  Url = {http://umiacs.umd.edu/~jbg//docs/2019_acl_modularity.pdf}
}

Downloaded from http://umiacs.umd.edu/~jbg/docs/2019_acl_modularity.pdf

Contact Jordan Boyd-Graber ([email protected]) for questions about this paper.



A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

Yoshinari Fujinuma
Computer Science, University of Colorado
[email protected]

Jordan Boyd-Graber
CS, iSchool, UMIACS, LSC, University of Maryland
[email protected]

Michael J. Paul
Information Science, University of Colorado
[email protected]

Abstract

Cross-lingual word embeddings encode the meaning of words from different languages into a shared low-dimensional space. An important requirement for many downstream tasks is that word similarity should be independent of language, i.e., word vectors within one language should not be more similar to each other than to words in another language. We measure this characteristic using modularity, a network measure of the strength of clusters in a graph. Modularity has a moderate to strong correlation with three downstream tasks, even though modularity is based only on the structure of embeddings and does not require any external resources. We show through experiments that modularity can serve as an intrinsic validation metric to improve unsupervised cross-lingual word embeddings, particularly on distant language pairs in low-resource settings.1

1 Introduction

The success of monolingual word embeddings in natural language processing (Mikolov et al., 2013b) has motivated extensions to cross-lingual settings. Cross-lingual word embeddings, where multiple languages share a single distributed representation, work well for classification (Klementiev et al., 2012; Ammar et al., 2016) and machine translation (Lample et al., 2018; Artetxe et al., 2018b), even with few bilingual pairs (Artetxe et al., 2017) or no supervision at all (Zhang et al., 2017; Conneau et al., 2018; Artetxe et al., 2018a).

Typically the quality of cross-lingual word embeddings is measured with respect to how well they improve a downstream task. However, sometimes it is not possible to evaluate embeddings for a specific downstream task, for example a future task that does not yet have data or a rare language that does not have resources to support traditional evaluation.

1 Our code is at https://github.com/akkikiki/modularity_metric

In such settings, it is useful to have an intrinsic evaluation metric: a metric that looks at the embedding space itself to know whether the embedding is good without resorting to an extrinsic task. While extrinsic tasks are the ultimate arbiter of whether cross-lingual word embeddings work, intrinsic metrics are useful for low-resource languages where one often lacks the annotated data that would make an extrinsic evaluation possible.

However, few intrinsic measures exist for cross-lingual word embeddings, and those that do exist require external linguistic resources (e.g., sense-aligned corpora in Ammar et al. (2016)). The requirement of language resources makes this approach limited or impossible for low-resource languages, which are the languages where intrinsic evaluations are most needed. Moreover, requiring language resources can bias the evaluation toward words in the resources rather than evaluating the embedding space as a whole.

Our solution involves a graph-based metric that considers the characteristics of the embedding space without using linguistic resources. To sketch the idea, imagine a cross-lingual word embedding space where it is possible to draw a hyperplane that separates all word vectors in one language from all vectors in another. Without knowing anything about the languages, it is easy to see that this is a problematic embedding: the representations of the two languages lie in distinct parts of the space rather than sharing it. While this example is exaggerated, this characteristic, where vectors are clustered by language, often appears within smaller neighborhoods of the embedding space, and we want to discover these clusters.

To measure how well word embeddings are mixed across languages, we draw on concepts from network science.


Figure 1: An example of a low-modularity (languages mixed) and a high-modularity cross-lingual word embedding lexical graph using k-nearest neighbors of "eat" (left) and "firefox" (right) in English and Japanese.

Specifically, some cross-lingual word embeddings are modular by language: vectors in one language are consistently closer to each other than vectors in another language (Figure 1). When embeddings are modular, they often fail on downstream tasks (Section 2).

Modularity is a concept from network theory (Section 3); because network theory is applied to graphs, we turn our word embeddings into a graph by connecting nearest neighbors, based on vector similarity, to each other. Our hypothesis is that modularity will predict how useful the embedding is in downstream tasks; low-modularity embeddings should work better.

We explore the relationship between modularity and three downstream tasks (Section 4) that use cross-lingual word embeddings differently: (i) cross-lingual document classification; (ii) bilingual lexical induction in Italian, Japanese, Spanish, and Danish; and (iii) low-resource document retrieval in Hungarian and Amharic, finding moderate to strong negative correlations between modularity and performance. Furthermore, using modularity as a validation metric (Section 5) makes MUSE (Conneau et al., 2018), an unsupervised model, more robust on distant language pairs. Compared to other existing intrinsic evaluation metrics, modularity captures complementary properties and is more predictive of downstream performance despite needing no external resources (Section 6).

2 Background: Cross-Lingual Word Embeddings and their Evaluation

There are many approaches to training cross-lingual word embeddings. This section reviews the embeddings we consider in this paper, along with existing work on evaluating those embeddings.

2.1 Cross-Lingual Word Embeddings

We focus on methods that learn a cross-lingual vector space through a post-hoc mapping between independently constructed monolingual embeddings (Mikolov et al., 2013a; Vulic and Korhonen, 2016). Given two separate monolingual embeddings and a bilingual seed lexicon, a projection matrix can map translation pairs in a given bilingual lexicon to be near each other in a shared embedding space. A key assumption is that cross-lingually coherent words have "similar geometric arrangements" (Mikolov et al., 2013a) in the embedding space, enabling "knowledge transfer between languages" (Ruder et al., 2017).

We focus on mapping-based approaches for two reasons. First, these approaches are applicable to low-resource languages because they do not require large bilingual dictionaries or parallel corpora (Artetxe et al., 2017; Conneau et al., 2018).2 Second, this focus separates the word embedding task from the cross-lingual mapping, which allows us to focus on evaluating the specific multilingual component in Section 4.

2.2 Evaluating Cross-Lingual Embeddings

Most work on evaluating cross-lingual embeddings focuses on extrinsic evaluation of downstream tasks (Upadhyay et al., 2016; Glavas et al., 2019). However, intrinsic evaluations are crucial since many low-resource languages lack annotations needed for downstream tasks. Thus, our goal is to develop an intrinsic measure that correlates with downstream tasks without using any external resources. This section summarizes existing work on intrinsic methods of evaluation for cross-lingual embeddings.

One widely used intrinsic measure for evaluating the coherence of monolingual embeddings is QVEC (Tsvetkov et al., 2015). Ammar et al. (2016) extend QVEC by using canonical correlation analysis (QVEC-CCA) to make the scores comparable across embeddings with different dimensions. However, while both QVEC and QVEC-CCA can be extended to cross-lingual word embeddings, they are limited: they require external annotated corpora. This is problematic in cross-lingual settings since this requires annotation to be consistent across languages (Ammar et al., 2016).

Other internal metrics do not require external resources, but they consider only part of the embeddings.

2 Ruder et al. (2017) offers detailed discussion on alternative approaches.


Conneau et al. (2018) and Artetxe et al. (2018a) use a validation metric that calculates similarities of cross-lingual neighbors to conduct model selection. Our approach differs in that we consider whether cross-lingual nearest neighbors are relatively closer than intra-lingual nearest neighbors.

Søgaard et al. (2018) use the similarities of intra-lingual neighbors and compute graph similarity between two monolingual lexical subgraphs built by subsampled words in a bilingual lexicon. They further show that the resulting graph similarity has a high correlation with bilingual lexical induction on MUSE (Conneau et al., 2018). However, their graph similarity still only uses intra-lingual similarities but not cross-lingual similarities.

These existing metrics are limited by either requiring external resources or considering only part of the embedding structure (e.g., intra-lingual but not cross-lingual neighbors). In contrast, our work develops an intrinsic metric which is highly correlated with multiple downstream tasks but does not require external resources, and considers both intra- and cross-lingual neighbors.

Related Work A related line of work is the intrinsic evaluation measures of probabilistic topic models, which are another low-dimensional representation of words similar to word embeddings. Metrics based on word co-occurrences have been developed for measuring the monolingual coherence of topics (Newman et al., 2010; Mimno et al., 2011; Lau et al., 2014). Less work has studied evaluation of cross-lingual topics (Mimno et al., 2009). Some researchers have measured the overlap of direct translations across topics (Boyd-Graber and Blei, 2009), while Hao et al. (2018) propose a metric based on co-occurrences across languages that is more general than direct translations.

3 Approach: Graph-Based Diagnostics for Detecting Clustering by Language

This section describes our graph-based approach to measure the intrinsic quality of a cross-lingual embedding space.

3.1 Embeddings as Lexical Graphs

We posit that we can understand the quality of cross-lingual embeddings by analyzing characteristics of a lexical graph (Pelevina et al., 2016; Hamilton et al., 2016). The lexical graph has words as nodes and edges weighted by their similarity in the embedding space.


Figure 2: Local t-SNE (van der Maaten and Hinton, 2008) of an EN-JA cross-lingual word embedding, which shows an example of "clustering by language".

Given a pair of words (i, j) and associated word vectors (v_i, v_j), we compute the similarity between two words by their vector similarity. We encode this similarity in a weighted adjacency matrix A: A_ij ≡ max(0, cos_sim(v_i, v_j)). However, nodes are only connected to their k-nearest neighbors (Section 6.2 examines the sensitivity to k); all other edges become zero. Finally, each node i has a label g_i indicating the word's language.
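To make the construction concrete, here is a minimal NumPy sketch (not the authors' released code) of the weighted k-NN adjacency matrix described above; the final symmetrization is an assumption of this sketch so that the graph is undirected.

```python
import numpy as np

def knn_lexical_graph(vectors, k=3):
    """Build the weighted adjacency matrix A of the k-NN lexical graph.

    vectors: (n_words, dim) array of cross-lingual word vectors.
    A_ij = max(0, cos_sim(v_i, v_j)) if j is one of i's k nearest neighbors, else 0.
    """
    unit = vectors / np.clip(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-9, None)
    sim = unit @ unit.T                  # cosine similarities
    np.fill_diagonal(sim, -np.inf)       # exclude self-similarity
    A = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nbrs = np.argpartition(-sim[i], k)[:k]        # indices of the k nearest neighbors
        A[i, nbrs] = np.maximum(0.0, sim[i, nbrs])    # keep only non-negative similarities
    return np.maximum(A, A.T)            # symmetrize so the graph is undirected (assumption)
```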

3.2 Clustering by Language

We focus on a phenomenon that we call "clustering by language", when word vectors in the embedding space tend to be more similar to words in the same language than to words in the other. For example, in Figure 2, the intra-lingual nearest neighbors of "slow" have higher similarity in the embedding space than semantically related cross-lingual words. This indicates that words are represented differently across the two languages; thus our hypothesis is that clustering by language degrades the quality of cross-lingual embeddings when used in downstream tasks.

3.3 Modularity of Lexical Graphs

With a labeled graph, we can now ask whether the graph is modular (Newman, 2010). In a cross-lingual lexical graph, modularity is the degree to which words are more similar to words in the same language than to words in a different language. This is undesirable, because the representation of words is not transferred across languages. If the nearest neighbors of the words are instead within the same language, then the languages are not mapped into the cross-lingual space consistently.


In our setting, the language l of each word defines its group, and high modularity indicates embeddings are more similar within languages than across languages (Newman, 2003; Newman and Girvan, 2004). In other words, good embeddings should have low modularity.

Conceptually, the modularity of a lexical graph is the difference between the proportion of edges in the graph that connect two nodes from the same language and the expected proportion of such edges in a randomly connected lexical graph. If edges were random, the number of edges starting from node i within the same language would be the degree of node i, d_i = Σ_j A_ij for a weighted graph, following Newman (2004), times the proportion of words in that language. Summing over all nodes gives the expected number of edges within a language,

a_l = (1 / 2m) Σ_i d_i 1[g_i = l],   (1)

where m is the number of edges, g_i is the label of node i, and 1[·] is an indicator function that evaluates to 1 if the argument is true and 0 otherwise.

Next, we count the fraction of edges e_ll that connect words of the same language:

e_ll = (1 / 2m) Σ_ij A_ij 1[g_i = l] 1[g_j = l].   (2)

Given L different languages, we calculate overall modularity Q by taking the difference between e_ll and a_l² for all languages:

Q = Σ_{l=1}^{L} (e_ll − a_l²).   (3)

Since Q does not necessarily have a maximum value of 1, we normalize modularity:

Q_norm = Q / Q_max, where Q_max = 1 − Σ_{l=1}^{L} a_l².   (4)

The higher the modularity, the more words from the same language appear as nearest neighbors. Figure 1 shows the example of a lexical subgraph with low modularity (left, Q_norm = 0.143) and high modularity (right, Q_norm = 0.672). In Figure 1b, the lexical graph is modular since "firefox" does not encode the same sense in both languages.
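A minimal sketch of Equations (1)-(4), assuming a symmetric weighted adjacency matrix A (as built above) and an array of per-node language labels; this is an illustrative reimplementation, not the authors' released code.

```python
import numpy as np

def normalized_modularity(A, labels):
    """Compute Q_norm (Equations 1-4) for a weighted k-NN lexical graph.

    A: symmetric (n, n) adjacency matrix with A_ij = max(0, cos_sim(v_i, v_j)).
    labels: length-n array of language labels g_i.
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)      # d_i = sum_j A_ij
    two_m = A.sum()              # 2m: every edge weight counted from both endpoints
    Q, Q_max = 0.0, 1.0
    for lang in np.unique(labels):
        mask = labels == lang
        e_ll = A[np.ix_(mask, mask)].sum() / two_m   # Eq. (2): intra-language edge fraction
        a_l = degrees[mask].sum() / two_m            # Eq. (1): expected intra-language fraction
        Q += e_ll - a_l ** 2                         # Eq. (3)
        Q_max -= a_l ** 2                            # Q_max = 1 - sum_l a_l^2
    return Q / Q_max                                 # Eq. (4)
```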

Our hypothesis is that cross-lingual word embeddings with lower modularity will be more successful in downstream tasks. If this hypothesis holds, then modularity could be a useful metric for cross-lingual evaluation.

Language        Corpus    Tokens
English (EN)    News      23M
Spanish (ES)    News      25M
Italian (IT)    News      23M
Danish (DA)     News      20M
Japanese (JA)   News      28M
Hungarian (HU)  News      20M
Amharic (AM)    LORELEI   28M

Table 1: Dataset statistics (source and number of tokens) for each language, including both Indo-European and non-Indo-European languages.

4 Experiments: Correlation of Modularity with Downstream Success

We now investigate whether modularity can predict the effectiveness of cross-lingual word embeddings on three downstream tasks: (i) cross-lingual document classification, (ii) bilingual lexical induction, and (iii) document retrieval in low-resource languages. If modularity correlates with task performance, it can characterize embedding quality.

4.1 Data

To investigate the relationship between embedding effectiveness and modularity, we explore five different cross-lingual word embeddings on six language pairs (Table 1).

Monolingual Word Embeddings All monolingual embeddings are trained using a skip-gram model with negative sampling (Mikolov et al., 2013b). The dimension size is 100 or 200. All other hyperparameters are the defaults in Gensim (Rehurek and Sojka, 2010). News articles except for Amharic are from the Leipzig Corpora (Goldhahn et al., 2012). For Amharic, we use documents from LORELEI (Strassel and Tracey, 2016). MeCab (Kudo et al., 2004) tokenizes Japanese sentences.
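A minimal sketch of that monolingual training setup with Gensim; `corpus` is a placeholder for an iterable of tokenized sentences (e.g., from the Leipzig collections), and `vector_size` is the Gensim 4.x name of the dimension parameter (older releases call it `size`).

```python
from gensim.models import Word2Vec

# Placeholder corpus; in practice this would be the tokenized news text for one language.
corpus = [["a", "tokenized", "sentence"], ["another", "tokenized", "sentence"]]

# Skip-gram (sg=1) with negative sampling, Gensim defaults otherwise; dimension 100 or 200.
# min_count is lowered here only so the toy placeholder corpus builds a vocabulary.
model = Word2Vec(sentences=corpus, vector_size=100, sg=1, negative=5, min_count=1)
model.wv.save_word2vec_format("en.vec")  # word2vec text format for the later mapping step
```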

Bilingual Seed Lexicon For supervised methods, bilingual lexicons from Rolston and Kirchhoff (2016) induce all cross-lingual embeddings except for Danish, which uses Wiktionary.3

4.2 Cross-Lingual Mapping Algorithms

We use three supervised (MSE, MSE+Orth, CCA) and two unsupervised (MUSE, VECMAP) cross-lingual mappings:4

3 https://en.wiktionary.org/
4 We use the implementations from the original authors with default parameters unless otherwise noted.

Page 6: Yoshinari Fujinuma, Michael Paul, and Jordan Boyd-Graber A ...users.umiacs.umd.edu/~jbg/docs/2019_acl_modularity.pdf · similarity between two words by their vector simi-larity. We

Mean-squared error (MSE) Mikolov et al. (2013a) minimize the mean-squared error of bilingual entries in a seed lexicon to learn a projection between two embeddings. We use the implementation by Artetxe et al. (2016).

MSE with orthogonal constraints (MSE+Orth) Xing et al. (2015) add length normalization and orthogonal constraints to preserve the cosine similarities in the original monolingual embeddings. Artetxe et al. (2016) further preprocess monolingual embeddings by mean centering.5
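For illustration, here is a minimal NumPy sketch of the two mapping objectives just described, assuming X and Y are matrices of source- and target-language vectors for the seed translation pairs (length-normalized and mean-centered for the orthogonal variant); it is a sketch, not the implementation of Artetxe et al. (2016).

```python
import numpy as np

def mse_mapping(X, Y):
    """MSE: least-squares projection W minimizing ||X W - Y||^2 over seed pairs."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def orthogonal_mapping(X, Y):
    """MSE+Orth style: orthogonal Procrustes solution of min_W ||X W - Y|| s.t. W^T W = I."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

Applying the learned W to every source vector (X_all @ W) places both vocabularies in the shared space evaluated in the rest of this section.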

Canonical Correlation Analysis (CCA) Faruqui and Dyer (2014) map two monolingual embeddings into a shared space by maximizing the correlation between translation pairs in a seed lexicon.

Conneau et al. (2018, MUSE) use language-adversarial learning (Ganin et al., 2016) to induce the initial bilingual seed lexicon, followed by a refinement step, which iteratively solves the orthogonal Procrustes problem (Schönemann, 1966; Artetxe et al., 2017), aligning embeddings without an external bilingual lexicon. Like MSE+Orth, vectors are unit length and mean centered. Since MUSE is unstable (Artetxe et al., 2018a; Søgaard et al., 2018), we report the best of five runs.

Artetxe et al. (2018a, VECMAP) induce an initial bilingual seed lexicon by aligning intra-lingual similarity matrices computed from each monolingual embedding. We report the best of five runs to address uncertainty from the initial dictionary.

4.3 Modularity Implementation

We implement modularity using random projection trees (Dasgupta and Freund, 2008) to speed up the extraction of k-nearest neighbors,6 tuning k = 3 on the German Rcv2 dataset (Section 6.2).
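A minimal sketch of that approximate k-NN construction with the Annoy library (footnote 6); the conversion from Annoy's angular distance back to cosine similarity and the final symmetrization are assumptions of this sketch rather than details stated in the paper.

```python
import numpy as np
from annoy import AnnoyIndex

def approx_knn_graph(vectors, k=3, n_trees=450):
    """Approximate k-NN lexical graph using random projection trees (Annoy)."""
    n, dim = vectors.shape
    index = AnnoyIndex(dim, "angular")     # angular distance is a monotone function of cosine
    for i, v in enumerate(vectors):
        index.add_item(i, v)
    index.build(n_trees)
    A = np.zeros((n, n))
    for i in range(n):
        nbrs, dists = index.get_nns_by_item(i, k + 1, include_distances=True)
        for j, d in zip(nbrs, dists):
            if j == i:
                continue                   # skip the self match
            cos = 1.0 - d ** 2 / 2.0       # Annoy's angular distance is sqrt(2 * (1 - cos))
            A[i, j] = max(0.0, cos)
    return np.maximum(A, A.T)              # symmetrize (assumption, as in the exact sketch)
```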

4.4 Task 1: Document Classification

We now explore the correlation of modularity and accuracy on cross-lingual document classification. We classify documents from the Reuters Rcv1 and Rcv2 corpora (Lewis et al., 2004). Documents have one of four labels (Corporate/Industrial, Economics, Government/Social, Markets). We follow Klementiev et al. (2012), except we use all EN training documents and documents in each target language (DA, ES, IT, and JA) as tuning and test data.

5 One round of iterative normalization (Zhang et al., 2019).
6 https://github.com/spotify/annoy

[Scatter plot; x-axis: classification accuracy, y-axis: modularity; markers distinguish dimension (100, 200) and language (DA, ES, IT, JA).]

Figure 3: Classification accuracy and modularity of cross-lingual word embeddings (ρ = −0.665): less modular cross-lingual mappings have higher accuracy.

Method                    Acc.   Modularity
Supervised    MSE         0.399  0.529
              CCA         0.502  0.513
              MSE+Orth    0.628  0.452
Unsupervised  MUSE        0.711  0.431
              VECMAP      0.643  0.432

Table 2: Average classification accuracy on (EN → DA, ES, IT, JA) along with the average modularity of five cross-lingual word embeddings. MUSE has the best accuracy, captured by its low modularity.

After removing out-of-vocabulary words, we split documents in target languages into 10% tuning data and 90% test data. Test data are 10,067 documents for DA, 25,566 for IT, 58,950 for JA, and 16,790 for ES. We exclude languages Reuters lacks: HU and AM. We use deep averaging networks (Iyyer et al., 2015, DAN) with three layers, 100 hidden states, and 15 epochs as our classifier. The DAN had better accuracy than the averaged perceptron (Collins, 2002) used in Klementiev et al. (2012).

Results We report the correlation value computed from the data points in Figure 3. Spearman's correlation between modularity and classification accuracy on all languages is ρ = −0.665. Within each language pair, modularity has a strong correlation within EN-ES embeddings (ρ = −0.806), EN-JA (ρ = −0.794), and EN-IT (ρ = −0.784), and a moderate correlation within EN-DA embeddings (ρ = −0.515). MUSE has the best classification accuracy (Table 2), reflected by its low modularity.

Error Analysis A common error in EN → JA classification is predicting Corporate/Industrial for documents labeled Markets. One cause is documents with 終値 "closing price"; this has few market-based English neighbors (Table 3). As a result, the model fails to transfer across languages.


市場 "market"          終値 "closing price"
新興 "new coming"      上げ幅 "gains"
market                 株価 "stock price"
markets                年初来 "yearly"
軟調 "bearish"         続落 "continued fall"
マーケット "market"     月限 "contract month"
活況 "activity"        安値 "low price"
相場 "market price"    続伸 "continuous rise"
底入 "bottoming"       前日 "previous day"
為替 "exchange"        先物 "futures"
ctoc                   小幅 "narrow range"

Table 3: Nearest neighbors in an EN-JA embedding. Unlike the JA word "market", the JA word "closing price" has no EN vector nearby.

[Scatter plot; x-axis: precision@1, y-axis: modularity; markers distinguish dimension (100, 200) and language (DA, ES, IT, JA).]

Figure 4: Bilingual lexical induction results and modularity of cross-lingual word embeddings (ρ = −0.789): lower modularity means higher precision@1.

4.5 Task 2: Bilingual Lexical Induction (BLI)

Our second downstream task explores the correlation between modularity and bilingual lexical induction (BLI). We evaluate on the test set from Conneau et al. (2018), but we remove pairs in the seed lexicon from Rolston and Kirchhoff (2016). The result is 2,099 translation pairs for ES, 1,358 for IT, 450 for DA, and 973 for JA. We report precision@1 (P@1) for retrieving cross-lingual nearest neighbors by cross-domain similarity local scaling (Conneau et al., 2018, CSLS).

Results Although this task ignores intra-lingual nearest neighbors when retrieving translations, modularity still has a high correlation (ρ = −0.785) with P@1 (Figure 4). MUSE and VECMAP, which have the lowest modularity, beat the three supervised methods (Table 4). P@1 is low compared to other work on the MUSE test set (e.g., Conneau et al. (2018)) because we filter out translation pairs which appeared in the large training lexicon compiled by Rolston and Kirchhoff (2016), and the raw corpora used to train the monolingual embeddings (Table 1) are relatively small compared to Wikipedia.

Method                    P@1    Modularity
Supervised    MSE         7.30   0.529
              CCA         3.06   0.513
              MSE+Orth    10.57  0.452
Unsupervised  MUSE        11.83  0.431
              VECMAP      12.92  0.432

Table 4: Average precision@1 on (EN → DA, ES, IT, JA) along with the average modularity of the cross-lingual word embeddings trained with different methods. VECMAP scores the best P@1, which is captured by its low modularity.

4.6 Task 3: Document Retrieval in Low-Resource Languages

As a third downstream task, we turn to an important task for low-resource languages: lexicon expansion (Gupta and Manning, 2015; Hamilton et al., 2016) for document retrieval. Specifically, we start with a set of EN seed words relevant to a particular concept, then find related words in a target language for which a comprehensive bilingual lexicon does not exist. We focus on the disaster domain, where events may require immediate NLP analysis (e.g., sorting SMS messages to first responders).

We induce keywords in a target language by taking the n nearest neighbors of the English seed words in a cross-lingual word embedding. We manually select sixteen disaster-related English seed words from the Wikipedia articles "Natural hazard" and "Anthropogenic hazard". Examples of seed terms include "earthquake" and "flood". Using the extracted terms, we retrieve disaster-related documents by keyword matching and assess the coverage and relevance of terms by area under the precision-recall curve (AUC) with varying n.
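A minimal sketch of this evaluation under one reading of the setup (per-seed nearest neighbors pooled into a keyword set, keyword-match retrieval, precision-recall AUC over varying n); the pooling strategy and helper names are assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.metrics import auc

def expand_keywords(seed_vecs, tgt_vecs, tgt_words, n):
    """Pool the n nearest target-language neighbors of each EN seed vector."""
    tgt_unit = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    keywords = set()
    for s in seed_vecs:
        sims = tgt_unit @ (s / np.linalg.norm(s))
        keywords.update(tgt_words[i] for i in np.argsort(-sims)[:n])
    return keywords

def retrieval_auc(seed_vecs, tgt_vecs, tgt_words, docs, labels, max_n=50):
    """AUC of the precision-recall curve for keyword-match retrieval as n varies.

    docs: list of token sets (one per document); labels: 1 = disaster-related, 0 = not.
    """
    labels = np.asarray(labels)
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        kw = expand_keywords(seed_vecs, tgt_vecs, tgt_words, n)
        retrieved = np.array([bool(kw & d) for d in docs])
        if retrieved.sum() == 0:
            continue
        tp = int((retrieved & (labels == 1)).sum())
        precisions.append(tp / retrieved.sum())
        recalls.append(tp / (labels == 1).sum())
    order = np.argsort(recalls)
    return auc(np.asarray(recalls)[order], np.asarray(precisions)[order])
```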

Test Corpora As positively labeled documents, we use documents from the LORELEI project (Strassel and Tracey, 2016) containing any disaster-related annotation. There are 64 disaster-related documents in Amharic, and 117 in Hungarian. We construct a set of negatively labeled documents from the Bible; because the LORELEI corpus does not include negative documents and the Bible is available in all our languages (Christodouloupoulos and Steedman, 2015), we take the chapters of the gospels (89 documents), which do not discuss disasters, and treat these as non-disaster-related documents.

Results Modularity has a moderate correlation with AUC (ρ = −0.378, Table 5). While modularity focuses on the entire vocabulary of cross-lingual word embeddings, this task focuses on a small, specific subset (disaster-relevant words), which may explain the low correlation compared to BLI or document classification.


Lang.  Method      AUC    Mod.
AM     MSE         0.578  0.628
       CCA         0.345  0.501
       MSE+Orth    0.606  0.480
       MUSE        0.555  0.475
       VECMAP      0.592  0.506
HU     MSE         0.561  0.598
       CCA         0.675  0.506
       MSE+Orth    0.612  0.447
       MUSE        0.664  0.445
       VECMAP      0.612  0.432
Spearman Correlation ρ = −0.378

Table 5: Correlation between modularity and AUC on document retrieval. Modularity shows a moderate correlation with this task.


5 Use Case: Model Selection for MUSE

A common use case of intrinsic measures is model selection. We focus on MUSE (Conneau et al., 2018) since it is unstable, especially on distant language pairs (Artetxe et al., 2018a; Søgaard et al., 2018; Hoshen and Wolf, 2018), and therefore requires an effective metric for model selection. MUSE uses a validation metric in its two steps: (1) the language-adversarial step, and (2) the refinement step. First the algorithm selects an optimal mapping W using a validation metric, obtained from language-adversarial learning (Ganin et al., 2016). Then the selected mapping W from the language-adversarial step is passed on to the refinement step (Artetxe et al., 2017) to re-select the optimal mapping W using the same validation metric after each epoch of solving the orthogonal Procrustes problem (Schönemann, 1966).

Normally, MUSE uses an intrinsic metric, CSLS of the top 10K frequent words (Conneau et al., 2018, CSLS-10K). Given word vectors s, t ∈ R^n from a source and a target embedding, CSLS is a cross-lingual similarity metric,

CSLS(Ws, t) = 2 cos(Ws, t) − r(Ws) − r(t),   (5)

where W is the trained mapping after each epoch, and r(x) is the average cosine similarity of the top 10 cross-lingual nearest neighbors of a word x.
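A minimal sketch of Equation (5) for batches of unit-normalized vectors; this is an illustrative reimplementation, not the MUSE code.

```python
import numpy as np

def csls(src_mapped, tgt, k=10):
    """CSLS scores (Eq. 5) between every mapped source vector Ws and target vector t.

    src_mapped, tgt: row matrices of unit-length word vectors, so dot products are cosines.
    """
    cos = src_mapped @ tgt.T                              # cos(Ws, t)
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)     # r(Ws): mean cosine of top-k target neighbors
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)     # r(t): mean cosine of top-k source neighbors
    return 2 * cos - r_src[:, None] - r_tgt[None, :]
```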

What if we use modularity instead? To test modularity as a validation metric for MUSE, we compute modularity on the lexical graph of the 10K most frequent words (Mod-10K; we use 10K for consistency with CSLS on the same words) after each epoch of the adversarial step and the refinement step, and select the best mapping.

Family        Lang.   CSLS-10K         Mod-10K
                      Avg.    Best     Avg.    Best
Germanic      DA      52.62   60.27    52.18   60.13
              DE      75.27   75.60    75.16   75.53
Romance       ES      74.35   83.00    74.32   83.00
              IT      78.41   78.80    78.43   78.80
Indo-Iranian  FA      27.79   33.40    27.77   33.40
              HI      25.71   33.73    26.39   34.20
              BN      0.00    0.00     0.09    0.87
Others        FI      4.71    47.07    4.71    47.07
              HU      52.55   54.27    52.35   54.73
              JA      18.13   49.69    36.13   49.69
              ZH      5.01    37.20    10.75   37.20
              KO      16.98   20.68    17.34   22.53
              AR      15.43   33.33    15.71   33.67
              ID      67.69   68.40    67.82   68.40
              VI      0.01    0.07     0.01    0.07

Table 6: BLI results (P@1 ×100%) from EN to each target language with different validation metrics for MUSE: default (CSLS-10K) and modularity (Mod-10K). We report the average (Avg.) and the best (Best) from ten runs with ten random seeds for each validation metric. Bold values are mappings that are not shared between the two validation metrics. Mod-10K improves the robustness of MUSE on distant language pairs.


The important difference between these two metrics is that Mod-10K considers the relative similarities between intra- and cross-lingual neighbors, while CSLS-10K only considers the similarities of cross-lingual nearest neighbors.7
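A hypothetical sketch of Mod-10K model selection (choose, among the mappings produced across epochs, the one whose mapped top-10K lexical graph has the lowest normalized modularity); it reuses the `knn_lexical_graph` and `normalized_modularity` sketches from Section 3 and is not the modified MUSE code.

```python
import numpy as np

def select_mapping_by_modularity(candidate_Ws, src_vecs, tgt_vecs, k=3):
    """Return the candidate mapping whose joint top-10K lexical graph is least modular.

    candidate_Ws: one mapping matrix per epoch.
    src_vecs, tgt_vecs: vectors of the 10K most frequent source/target words.
    """
    labels = np.array([0] * len(src_vecs) + [1] * len(tgt_vecs))   # 0 = source, 1 = target
    scores = []
    for W in candidate_Ws:
        joint = np.vstack([src_vecs @ W, tgt_vecs])        # map source words, stack both languages
        A = knn_lexical_graph(joint, k=k)                  # k-NN lexical graph (Section 3 sketch)
        scores.append(normalized_modularity(A, labels))    # Mod-10K: lower is better
    return candidate_Ws[int(np.argmin(scores))]
```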

Experiment Setup We use the pre-trained fastText vectors (Bojanowski et al., 2017) to be comparable with the prior work. Following Artetxe et al. (2018a), all vectors are unit length normalized, mean centered, and then unit length normalized. We use the test lexicon by Conneau et al. (2018). We run ten times with the same random seeds and hyperparameters but with different validation metrics. Since MUSE is unstable on distant language pairs (Artetxe et al., 2018a; Søgaard et al., 2018; Hoshen and Wolf, 2018), we test it on English to languages from diverse language families: Indo-European languages such as Danish (DA), German (DE), Spanish (ES), Farsi (FA), Italian (IT), Hindi (HI), and Bengali (BN), and non-Indo-European languages such as Finnish (FI), Hungarian (HU), Japanese (JA), Chinese (ZH), Korean (KO), Arabic (AR), Indonesian (ID), and Vietnamese (VI).

7 Another difference is that the number of nearest neighbors for CSLS-10K is k = 10, whereas Mod-10K uses k = 3. However, using k = 3 for CSLS-10K leads to worse results; we therefore only report the result on the default metric, i.e., k = 10.


[Figure 5 panels (predicted vs. actual classification accuracy for DA and IT, markers by dimension): omit Avg. cos_sim, R² = 0.770; omit CSLS-10K, R² = 0.797; omit Modularity, R² = 0.554; omit QVEC-CCA, R² = 0.703.]

Figure 5: We predict the cross-lingual document classification results for DA and IT from Figure 3 using three out of four evaluation metrics. Ablating modularity causes by far the largest decrease in R² (R² = 0.814 when using all four features), showing that it captures information complementary to the other metrics.

Results Table 6 shows P@1 on BLI for each target language using English as the source language. Mod-10K improves P@1 over the default validation metric in diverse languages, especially on the average P@1 for non-Germanic languages such as JA (+18.00%) and ZH (+5.74%), and the best P@1 for KO (+1.85%). These language pairs include pairs (EN-JA and EN-HI) which are difficult for MUSE (Hoshen and Wolf, 2018). Improvements in JA come from selecting a better mapping during the refinement step, which the default validation misses. For ZH, HI, and KO, the improvement comes from selecting better mappings during the adversarial step. However, modularity does not improve on all languages (e.g., VI) that are reported to fail by Hoshen and Wolf (2018).

6 Analysis: Understanding Modularity as an Evaluation Metric

The experiments so far show that modularity captures whether an embedding is useful, which suggests that modularity could be used as an intrinsic evaluation or validation metric. Here, we investigate whether modularity can capture distinct information compared to existing evaluation measures: QVEC-CCA (Ammar et al., 2016), CSLS (Conneau et al., 2018), and cosine similarity between translation pairs (Section 6.1). We also analyze the effect of the number of nearest neighbors k (Section 6.2).

6.1 Ablation Study Using Linear Regression

We fit a linear regression model to predict the classification accuracy given four intrinsic measures: QVEC-CCA, CSLS, average cosine similarity of translations, and modularity.

Figure 6: Correlation between modularity and classification performance (EN→DE) with different numbers of neighbors k (absolute Pearson and Spearman correlation on the y-axis, k on the x-axis). Correlations are computed on the same setting as Figure 3 using supervised methods. We use this to set k = 3.

We ablate each of the four measures, fitting linear regression with standardized feature values, for two target languages (IT and DA) on the task of cross-lingual document classification (Figure 3). We limit to IT and DA because aligned supersense annotations to EN ones (Miller et al., 1993), required for QVEC-CCA, are only available in those languages (Montemagni et al., 2003; Martínez Alonso et al., 2015; Martínez Alonso et al., 2016; Ammar et al., 2016). We standardize the values of the four features before training the regression model.

Omitting modularity hurts accuracy prediction on cross-lingual document classification substantially, while omitting the other three measures has smaller effects (Figure 5). Thus, modularity complements the other measures and is more predictive of classification accuracy.
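A minimal sketch of that ablation with scikit-learn, assuming per-embedding arrays of the four metric values and the corresponding classification accuracies; the dictionary-based interface is an illustrative choice, not the authors' setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def ablation_r2(metrics, accuracy, omit=None):
    """Fit linear regression on standardized intrinsic metrics and report R^2.

    metrics: dict of metric name -> 1-D array (one value per embedding).
    accuracy: array of downstream classification accuracies.
    omit: metric name to leave out of the feature set (None = use all four).
    """
    names = [name for name in metrics if name != omit]
    X = StandardScaler().fit_transform(np.column_stack([metrics[name] for name in names]))
    return LinearRegression().fit(X, accuracy).score(X, accuracy)

# Example: compare R^2 with all features vs. with modularity ablated.
# full = ablation_r2(all_metrics, acc)
# without_mod = ablation_r2(all_metrics, acc, omit="modularity")
```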

6.2 Hyperparameter Sensitivity

While modularity itself does not have any adjustable hyperparameters, our approach to constructing the lexical graph has two hyperparameters: the number of nearest neighbors (k) and the number of trees (t) for approximating the k-nearest neighbors using random projection trees. We conduct a grid search for k ∈ {1, 3, 5, 10, 50, 100, 150, 200} and t ∈ {50, 100, 150, 200, 250, 300, 350, 400, 450, 500} using the German Rcv2 corpus as the held-out language to tune hyperparameters.

The nearest neighbor k has a much larger effect on modularity than t, so we focus on analyzing the effect of k, using the optimal t = 450.


Our earlier experiments all use k = 3 since it gives the highest Pearson's and Spearman's correlation on the tuning dataset (Figure 6). The absolute correlation with the downstream task decreases when setting k > 3, indicating that nearest neighbors beyond k = 3 only contribute noise.

7 Discussion: What Modularity Can and Cannot Do

This work focuses on modularity as a diagnostic tool: it is cheap and effective at discovering which embeddings are likely to falter on downstream tasks. Thus, practitioners should consider including it as a metric for evaluating the quality of their embeddings. Additionally, we believe that modularity could serve as a useful prior for the algorithms that learn cross-lingual word embeddings: during learning, prefer updates that avoid increasing modularity if all else is equal.

Nevertheless, we recognize limitations of modularity. Consider the following cross-lingual word embedding "algorithm": for each word, select a random point on the unit hypersphere. This is a horrible distributed representation: the position of a word's embedding has no relationship to the underlying meaning. Nevertheless, this representation will have very low modularity. Thus, while modularity can identify bad embeddings, once vectors are well mixed, this metric, unlike QVEC or QVEC-CCA, cannot identify whether the meanings make sense. Future work should investigate how to combine techniques that use both word meaning and nearest neighbors for a more robust, semi-supervised cross-lingual evaluation.

Acknowledgments

This work was supported by NSF grant IIS-1564275 and by DARPA award HR0011-15-C-0113 under subcontract to Raytheon BBN Technologies. The authors would like to thank Sebastian Ruder, Akiko Aizawa, the members of the CLIP lab at the University of Maryland, the members of the CLEAR lab at the University of Colorado, and the anonymous reviewers for their feedback. The authors would like to also thank Mozhi Zhang for providing the deep averaging network code. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A Smith. 2016. Massively multilingual word embeddings. Computing Research Repository, arXiv:1602.01925. Version 2.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of Empirical Methods in Natural Language Processing.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised neural machine translation. In Proceedings of the International Conference on Learning Representations.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics.

Jordan Boyd-Graber and David M. Blei. 2009. Multilingual topic models for unaligned text. In Proceedings of Uncertainty in Artificial Intelligence.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. In Proceedings of the Language Resources and Evaluation Conference.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of Empirical Methods in Natural Language Processing.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In Proceedings of the International Conference on Learning Representations.

Sanjoy Dasgupta and Yoav Freund. 2008. Random projection trees and low dimensional manifolds. In Proceedings of the Annual ACM Symposium on Theory of Computing.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the European Chapter of the Association for Computational Linguistics.


Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59).

Goran Glavas, Robert Litschko, Sebastian Ruder, and Ivan Vulic. 2019. How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. Computing Research Repository, arXiv:1902.00508. Version 1.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig corpora collection: From 100 to 200 languages. In Proceedings of the Language Resources and Evaluation Conference.

Sonal Gupta and Christopher D. Manning. 2015. Distributed representations of words to guide bootstrapped entity classifiers. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

William L. Hamilton, Kevin Clark, Jure Leskovec, and Dan Jurafsky. 2016. Inducing domain-specific sentiment lexicons from unlabeled corpora. In Proceedings of Empirical Methods in Natural Language Processing.

Shudong Hao, Jordan Boyd-Graber, and Michael J. Paul. 2018. From the Bible to Wikipedia: adapting topic model evaluation to multilingual and low-resource settings. In Conference of the North American Chapter of the Association for Computational Linguistics.

Yedid Hoshen and Lior Wolf. 2018. Non-adversarial unsupervised word translation. In Proceedings of Empirical Methods in Natural Language Processing.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of the International Conference on Computational Linguistics.

Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of Empirical Methods in Natural Language Processing.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In Proceedings of the International Conference on Learning Representations.

Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the European Chapter of the Association for Computational Linguistics.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research.

Héctor Martínez Alonso, Anders Johannsen, Sussi Olsen, Sanni Nimb, Nicolai Hartvig Sørensen, Anna Braasch, Anders Søgaard, and Bolette Sandford Pedersen. 2015. Supersense tagging for Danish. In Proceedings of the Nordic Conference of Computational Linguistics.

Héctor Martínez Alonso, Anders Johannsen, Sussi Olsen, Sanni Nimb, and Bolette Sandford Pedersen. 2016. An empirically grounded expansion of the supersense inventory. In Proceedings of the Global Wordnet Conference.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. Computing Research Repository, arXiv:1309.4168. Version 1.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the Human Language Technology Conference.

David M. Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of Empirical Methods in Natural Language Processing.

David M. Mimno, Hanna M. Wallach, Edmund M. Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of Empirical Methods in Natural Language Processing.

Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta Calzolari, Ornella Corazzari, Alessandro Lenci, Antonio Zampolli, Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto Basili, Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana, Fabio Pianesi, and Rodolfo Delmonte. 2003. Building the Italian syntactic-semantic treebank. In Treebanks: Building and Using Parsed Corpora. Springer.


David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Conference of the North American Chapter of the Association for Computational Linguistics.

Mark E. J. Newman. 2003. Mixing patterns in networks. Physical Review E, 67(2).

Mark E. J. Newman. 2004. Analysis of weighted networks. Physical Review E, 70(5).

Mark E. J. Newman. 2010. Networks: An Introduction. Oxford University Press.

Mark E. J. Newman and Michelle Girvan. 2004. Finding and evaluating community structure in networks. Physical Review E, 69(2).

Maria Pelevina, Nikolay Arefiev, Chris Biemann, and Alexander Panchenko. 2016. Making sense of word embeddings. In Proceedings of the 1st Workshop on Representation Learning for NLP.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.

Leanne Rolston and Katrin Kirchhoff. 2016. Collection of bilingual data for lexicon transfer learning. UWEE Technical Report.

Sebastian Ruder, Ivan Vulic, and Anders Søgaard. 2017. A survey of cross-lingual embedding models. Computing Research Repository, arXiv:1706.04902. Version 2.

Peter H. Schönemann. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1).

Anders Søgaard, Sebastian Ruder, and Ivan Vulic. 2018. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the Association for Computational Linguistics.

Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Language Resources and Evaluation Conference.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of Empirical Methods in Natural Language Processing.

Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of the Association for Computational Linguistics.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Ivan Vulic and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of the Association for Computational Linguistics.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the Association for Computational Linguistics.

Mozhi Zhang, Keyulu Xu, Ken-ichi Kawarabayashi, Stefanie Jegelka, and Jordan Boyd-Graber. 2019. Are girls neko or shojo? Cross-lingual alignment of non-isomorphic embeddings with iterative normalization. In Proceedings of the Association for Computational Linguistics.