
pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference

Mandar Joshi† Eunsol Choi† Omer Levy‡ Daniel S. Weld† Luke Zettlemoyer†‡

† Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA

{mandar90,eunsol,weld,lsz}@cs.washington.edu
‡ Facebook AI Research, Seattle
{omerlevy,lsz}@fb.com

Abstract

Reasoning about implied relationships (e.g. paraphrastic, common sense, encyclopedic) between pairs of words is crucial for many cross-sentence inference problems. This paper proposes new methods for learning and using embeddings of word pairs that implicitly represent background knowledge about such relationships. Our pairwise embeddings are computed as a compositional function on word representations, which is learned by maximizing the pointwise mutual information (PMI) with the contexts in which the two words co-occur. We add these representations to the cross-sentence attention layer of existing inference models (e.g. BiDAF for QA, ESIM for NLI), instead of extending or replacing existing word embeddings. Experiments show a gain of 2.7% on the recently released SQuAD 2.0 and 1.3% on MultiNLI. Our representations also aid in better generalization, with gains of around 6-7% on adversarial SQuAD datasets and 8.8% on the adversarial entailment test set by Glockner et al. (2018).

1 Introduction

Reasoning about relationships between pairs of words is crucial for cross-sentence inference problems such as question answering (QA) and natural language inference (NLI). In NLI, for example, given the premise "golf is prohibitively expensive", inferring that the hypothesis "golf is a cheap pastime" is a contradiction requires one to know that expensive and cheap are antonyms. Recent work (Glockner et al., 2018) has shown that current models, which rely heavily on unsupervised single-word embeddings, struggle to learn such relationships. In this paper, we show that they can be learned with word pair vectors (pair2vec¹),

¹ https://github.com/mandarjoshi90/pair2vec

X        | Y      | Contexts
hot      | cold   | with X and Y baths; too X or too Y; neither X nor Y
Portland | Oregon | in X, Y; the X metropolitan area in Y; X International Airport in Y
crop     | wheat  | food X are maize, Y, etc; dry X, such as Y,; more X circles appeared in Y fields
Android  | Google | X OS comes with Y play; the X team at Y; X is developed by Y

Table 1: Example word pairs and their contexts.

which are trained unsupervised, and which significantly improve performance when added to existing cross-sentence attention mechanisms.

Unlike single-word representations, which typically model the co-occurrence of a target word x with its context c, our word-pair representations are learned by modeling the three-way co-occurrence between words (x, y) and the context c that ties them together, as seen in Table 1. While similar training signals have been used to learn models for ontology construction (Hearst, 1992; Snow et al., 2005; Turney, 2005; Shwartz et al., 2016) and knowledge base completion (Riedel et al., 2013), this paper shows, for the first time, that large-scale learning of pairwise embeddings can be used to directly improve the performance of neural cross-sentence inference models.

More specifically, we train a feedforward network R(x, y) that learns representations for the individual words x and y, as well as how to compose them into a single vector. Training is done by maximizing a generalized notion of the pointwise mutual information (PMI) among x, y, and their context c using a variant of negative sampling (Mikolov et al., 2013a).


Making R(x, y) a compositional function on individual words alleviates the sparsity that necessarily comes with embedding pairs of words, even at a very large scale.

We show that our embeddings can be added to existing cross-sentence inference models, such as BiDAF++ (Seo et al., 2017; Clark and Gardner, 2018) for QA and ESIM (Chen et al., 2017) for NLI. Instead of changing the word embeddings that are fed into the encoder, we add the pretrained pair representations to higher layers in the network where cross-sentence attention mechanisms are used. This allows the model to use the background knowledge that the pair embeddings implicitly encode to reason about the likely relationships between the pairs of words it aligns.

Experiments show that simply adding our word-pair embeddings to existing high-performing models, which already use ELMo (Peters et al., 2018), results in sizable gains. We show a gain of 2.72 F1 points over the BiDAF++ model (Clark and Gardner, 2018) on SQuAD 2.0 (Rajpurkar et al., 2018), as well as a 1.3 point gain over ESIM (Chen et al., 2017) on MultiNLI (Williams et al., 2018). Additionally, our approach generalizes well to adversarial examples, with a 6-7% F1 increase on adversarial SQuAD (Jia and Liang, 2017) and an 8.8% gain on the Glockner et al. (2018) NLI benchmark. An analysis of pair2vec on word analogies suggests that it complements the information in single-word representations, especially for encyclopedic and lexicographic relations.

2 Unsupervised Pretraining

Extending the distributional hypothesis to word pairs, we posit that similar word pairs tend to occur in similar contexts, and that the contexts provide strong clues about the likely relationships that hold between the words (see Table 1). We assume a dataset of (x, y, c) triplets, where each instance depicts a word pair (x, y) and the context c in which they appeared. We learn two compositional representation functions, R(x, y) and C(c), to encode the pair and the context, respectively, as d-dimensional vectors (Section 2.1). The functions are trained using a variant of negative sampling, which tries to embed word pairs (x, y) close to the contexts c with which they appeared (Section 2.2).

2.1 Representation

Our word-pair and context representations are both fixed-length vectors, composed from individual words. The word-pair representation function R(x, y) first embeds and normalizes the individual words with a shared lookup matrix $E_a$:

$$\mathbf{x} = \frac{E_a(x)}{\|E_a(x)\|} \qquad \mathbf{y} = \frac{E_a(y)}{\|E_a(y)\|}$$

These vectors, along with their element-wise product, are fed into a four-layer perceptron:

$$R(x, y) = \mathrm{MLP}^{4}(\mathbf{x}, \mathbf{y}, \mathbf{x} \circ \mathbf{y})$$

The context $c = c_1 ... c_n$ is encoded as a d-dimensional vector using the function C(c). C(c) embeds each token $c_i$ with a lookup matrix $E_c$, contextualizes it with a single-layer Bi-LSTM, and then aggregates the entire context with attentive pooling:

$$\mathbf{c}_i = E_c(c_i)$$
$$\mathbf{h}_1 ... \mathbf{h}_n = \mathrm{BiLSTM}(\mathbf{c}_1 ... \mathbf{c}_n)$$
$$w_i = \mathrm{softmax}_i(\mathbf{k} \cdot \mathbf{h}_i)$$
$$C(c) = \sum_i w_i W \mathbf{h}_i$$

where $W \in \mathbb{R}^{d \times d}$ and $\mathbf{k} \in \mathbb{R}^d$. All parameters, including the lookup tables $E_a$ and $E_c$, are trained.
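For concreteness, the following is a minimal PyTorch sketch of the two encoders described above. The module structure, layer sizes, and names are illustrative assumptions and are not taken from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairEncoder(nn.Module):
    """R(x, y): embed, L2-normalize, and compose two words with a 4-layer MLP."""
    def __init__(self, vocab_size, emb_dim=300, out_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # shared lookup E_a
        layers, d = [], 3 * emb_dim                           # input: [x; y; x * y]
        for _ in range(4):
            layers += [nn.Linear(d, out_dim), nn.ReLU()]
            d = out_dim
        self.mlp = nn.Sequential(*layers)

    def forward(self, x_ids, y_ids):
        x = F.normalize(self.embed(x_ids), dim=-1)
        y = F.normalize(self.embed(y_ids), dim=-1)
        return self.mlp(torch.cat([x, y, x * y], dim=-1))

class ContextEncoder(nn.Module):
    """C(c): embed tokens, run a BiLSTM, then pool with learned attention."""
    def __init__(self, vocab_size, emb_dim=300, hidden=100, out_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # lookup E_c
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.key = nn.Linear(2 * hidden, 1, bias=False)       # attention vector k
        self.proj = nn.Linear(2 * hidden, out_dim, bias=False)  # matrix W

    def forward(self, ctx_ids):                               # (batch, seq_len)
        h, _ = self.lstm(self.embed(ctx_ids))                 # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.key(h), dim=1)                 # attentive pooling weights
        return (w * self.proj(h)).sum(dim=1)                  # (batch, out_dim)
```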

Our representation is similar to two recently-proposed frameworks by Washio and Kato (2018a,b), but differs in that: (1) they use dependency paths as context, while we use surface form; (2) they encode the context as either a lookup table or the last state of a unidirectional LSTM. We also use a different objective, which we discuss next.

2.2 Objective

To optimize our representation functions, we consider two variants of negative sampling (Mikolov et al., 2013a): bivariate and multivariate. The original bivariate objective models the two-way distribution of context and (monolithic) word pair co-occurrences, while our multivariate extension models the three-way distribution of word-word-context co-occurrences. We further augment the multivariate objective with typed sampling to upsample harder negative examples. We discuss the impact of the bivariate and multivariate objectives (and other components) in Section 4.3.

Bivariate Negative Sampling Our objective aspires to make R(x, y) and C(c) similar (have high inner products) for (x, y, c) that were observed together in the data.


Bivariate:
$$J_{2NS}(x, y, c) = \log\sigma\big(R(x, y) \cdot C(c)\big) + \sum_{i=1}^{k_c} \log\sigma\big(-R(x, y) \cdot C(c^N_i)\big)$$

Multivariate:
$$J_{3NS}(x, y, c) = J_{2NS}(x, y, c) + \sum_{i=1}^{k_x} \log\sigma\big(-R(x^N_i, y) \cdot C(c)\big) + \sum_{i=1}^{k_y} \log\sigma\big(-R(x, y^N_i) \cdot C(c)\big)$$

Table 2: The bivariate and multivariate negative sampling objectives. The superscript N marks randomly sampled components, with $k_*$ being the negative sample size per instance. The equations present per-instance objectives.

At the same time, we wish to keep our pair vectors dissimilar from random context vectors. In a straightforward application of the original (bivariate) negative sampling objective, we could generate a negative example from each observed (x, y, c) instance by replacing the original context c with a randomly-sampled context $c^N$ (Table 2, $J_{2NS}$).

Assuming that the negative contexts are sampled from the empirical distribution P(·, ·, c) (with P(x, y, c) being the portion of (x, y, c) instances in the dataset), we can follow Levy and Goldberg (2014) to show that this objective converges to the pointwise mutual information (PMI) between the word pair and the context:

$$R(x, y) \cdot C(c) = \log \frac{P(x, y, c)}{k_c \, P(x, y, \cdot) \, P(\cdot, \cdot, c)}$$

This objective mainly captures co-occurrences of monolithic pairs and contexts, and is limited by the fact that the training data, by construction, only contains pairs occurring within a sentence. For better generalization to cross-sentence tasks, where the pair distribution differs from that of the training data, we need a multivariate objective that captures the full three-way (x, y, c) interaction.

Multivariate Negative Sampling We introduce negative sampling of target words, x and y, in addition to negative sampling of contexts c (Table 2, $J_{3NS}$). Our new objective also converges to a novel multivariate generalization of PMI, different from previous PMI extensions that were inspired by information theory (Van de Cruys, 2011) and heuristics (Jameel et al., 2018).² Following Levy and Goldberg (2014), we can show that when replacing target words in addition to contexts, our objective will converge³ to the optimal value in Equation 1:

$$R(x, y) \cdot C(c) = \log \frac{P(x, y, c)}{Z_{x,y,c}} \qquad (1)$$

² See supplementary material for their exact formulations.
³ A full proof is provided in the supplementary material.

where $Z_{x,y,c}$, the denominator, is:

$$Z_{x,y,c} = k_c P(\cdot, \cdot, c) P(x, y, \cdot) + k_x P(x, \cdot, \cdot) P(\cdot, y, c) + k_y P(\cdot, y, \cdot) P(x, \cdot, c) \qquad (2)$$

This optimal value deviates from previous generalizations of PMI by having a linear mixture of marginal probability products in its denominator. By introducing terms such as P(x, ·, c) and P(·, y, c), the objective penalizes spurious correlations between words and contexts that disregard the other target word. For example, it would assign the pattern "X is a Y" a high score with (banana, fruit), but a lower score with (cat, fruit).
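As a sketch of how the per-instance objective $J_{3NS}$ in Table 2 could be computed, assuming the encoder sketches above and negatives that have already been sampled outside the function (from the unigram distribution, or via the typed sampling described next); the batching and interface are assumptions, not the released code.

```python
import torch.nn.functional as F

def multivariate_ns_loss(pair_enc, ctx_enc, x, y, ctx, neg_c, neg_x, neg_y):
    """Per-batch J_3NS (Table 2), returned as a loss to minimize.
    x, y:  (batch,) word ids       ctx:   (batch, ctx_len) context ids
    neg_c: (batch, k_c, ctx_len)   neg_x, neg_y: (batch, k_x) / (batch, k_y)"""
    r = pair_enc(x, y)                                        # R(x, y): (batch, d)
    c = ctx_enc(ctx)                                          # C(c):    (batch, d)
    pos = F.logsigmoid((r * c).sum(-1))                       # positive term

    # negative contexts: replace c, keep the pair
    b, kc, L = neg_c.shape
    cn = ctx_enc(neg_c.reshape(b * kc, L)).view(b, kc, -1)
    neg_ctx = F.logsigmoid(-(r.unsqueeze(1) * cn).sum(-1)).sum(-1)

    # negative X words: replace x, keep y and c
    kx = neg_x.size(1)
    rn_x = pair_enc(neg_x.reshape(-1),
                    y.unsqueeze(1).expand(-1, kx).reshape(-1)).view(b, kx, -1)
    neg_xw = F.logsigmoid(-(rn_x * c.unsqueeze(1)).sum(-1)).sum(-1)

    # negative Y words: replace y, keep x and c
    ky = neg_y.size(1)
    rn_y = pair_enc(x.unsqueeze(1).expand(-1, ky).reshape(-1),
                    neg_y.reshape(-1)).view(b, ky, -1)
    neg_yw = F.logsigmoid(-(rn_y * c.unsqueeze(1)).sum(-1)).sum(-1)

    return -(pos + neg_ctx + neg_xw + neg_yw).mean()          # maximize J_3NS
```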

Typed Sampling In multivariate negative sampling, we typically replace x and y by sampling from their unigram distributions. In addition to this, we also sample uniformly from the top 100 words according to cosine similarity using distributional word vectors. This is done to encourage the model to learn relations between specific instances as opposed to more general types. For example, using California as a negative sample for Oregon helps the model to learn that the pattern "X is located in Y" fits the pair (Portland, Oregon), but not the pair (Portland, California). Similar adversarial constraints were used in knowledge base completion (Toutanova et al., 2015) and word embeddings (Li et al., 2017).⁴

3 Integrating pair2vec into Models

We first present a general outline for incorporating pair2vec into attention-based architectures, and then discuss changes made to BiDAF++ and ESIM. The key idea is to inject our pairwise representations into the attention layer by reusing the cross-sentence attention weights. In addition to attentive pooling over single word representations, we also pool over cross-sentence word pair embeddings (Figure 1).

⁴ Applying typed sampling also changes the value to which our objective will converge, and will replace the unigram probabilities in Equation (2) to reflect the type-based distribution.


Figure 1: A typical architecture of a cross-sentence inference model (left), and how pair2vec is added to it (right). Given two sequences, a and b, existing models create b-aware representations of words in a. For any word $a_i$, this typically involves the BiLSTM representation of $a_i$ ($\mathbf{a}_i$), and an attention-weighted sum over b's BiLSTM states with $a_i$ as the query ($\mathbf{b}_i$). To these, we add the word-pair representation of $a_i$ and each word in b, weighted by attention ($\mathbf{r}_i$). Thicker attention arrows indicate stronger word pair alignments (e.g. cheap, expensive).

3.1 General Approach

Pair Representation We assume that we are given two sequences $a = a_1 ... a_n$ and $b = b_1 ... b_m$. We represent the word-pair embeddings between a and b using the pretrained pair2vec model as:

$$r_{i,j} = \left[ \frac{R(a_i, b_j)}{\|R(a_i, b_j)\|} \,;\, \frac{R(b_j, a_i)}{\|R(b_j, a_i)\|} \right] \qquad (3)$$

We include embeddings in both directions, $R(a_i, b_j)$ and $R(b_j, a_i)$, because many relations can be expressed in both directions; e.g., hypernymy can be expressed via "X is a type of Y" as well as "Y such as X". We L2-normalize each direction's pair embedding because the heavy-tailed distribution of word pairs results in significant variance of their norms.

Base Model Let $\mathbf{a}_1 ... \mathbf{a}_n$ and $\mathbf{b}_1 ... \mathbf{b}_m$ be the vector representations of sequences a and b, as produced by the input encoder (e.g. ELMo embeddings contextualized with model-specific BiLSTMs). Furthermore, we assume that the base model computes soft word alignments between a and b via co-attention (4, 5), which are then used to compute b-aware representations of a:

$$s_{i,j} = f_{att}(\mathbf{a}_i, \mathbf{b}_j) \qquad (4)$$
$$\alpha_{i,j} = \mathrm{softmax}_j(s_{i,j}) \qquad (5)$$
$$\mathbf{b}_i = \sum_{j} \alpha_{i,j} \mathbf{b}_j \qquad (6)$$
$$\mathbf{a}^{inf}_i = \left[ \mathbf{a}_i ; \mathbf{b}_i \right] \qquad (7)$$

The symmetric term $\mathbf{b}^{inf}_j$ is defined analogously. We refer to $\mathbf{a}^{inf}$ and $\mathbf{b}^{inf}$ as the inputs to the inference layer, since this layer computes some function over aligned word pairs, typically via a feedforward network and LSTMs. The inference layer is followed by aggregation and output layers.

Injecting pair2vec We conjecture that the inference layer effectively learns word-pair relationships from training data, and it should, therefore, help to augment its input with pair2vec. We augment $\mathbf{a}^{inf}_i$ (7) with the pair vectors $r_{i,j}$ (3) by concatenating a weighted average of the pair vectors $r_{i,j}$ involving $a_i$, where the weights are the same $\alpha_{i,j}$ computed via attention in (5):

$$\mathbf{r}_i = \sum_j \alpha_{i,j} r_{i,j} \qquad (8)$$
$$\mathbf{a}^{inf}_i = \left[ \mathbf{a}_i ; \mathbf{b}_i ; \mathbf{r}_i \right] \qquad (9)$$

The symmetric term $\mathbf{b}^{inf}_j$ is defined analogously.
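The following sketch illustrates equations (4)-(9), i.e. how the attention weights that already exist in the base model can be reused to pool the pair2vec vectors. Tensor shapes, the precomputation of pair vectors, and the bilinear attention function are illustrative assumptions.

```python
import torch

def inject_pair2vec(a_enc, b_enc, r_ab, att_fn):
    """Build b-aware representations of a, augmented with pooled pair vectors.
    a_enc: (batch, n, d)    encoder states of sequence a
    b_enc: (batch, m, d)    encoder states of sequence b
    r_ab:  (batch, n, m, dp) precomputed, normalized pair vectors r_{i,j}
    att_fn: scoring function f_att returning (batch, n, m) logits."""
    s = att_fn(a_enc, b_enc)                        # eq. (4)
    alpha = torch.softmax(s, dim=-1)                # eq. (5): weights over b
    b_att = torch.bmm(alpha, b_enc)                 # eq. (6): attended b states
    r_att = (alpha.unsqueeze(-1) * r_ab).sum(2)     # eq. (8): pooled pair vectors
    return torch.cat([a_enc, b_att, r_att], dim=-1) # eq. (9)

# One possible attention function: a simple bilinear scorer (an assumption,
# not the specific f_att used by BiDAF++ or ESIM).
def bilinear_att(W):
    return lambda a, b: torch.einsum('bnd,de,bme->bnm', a, W, b)
```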

3.2 Question Answering

We augment the inference layer in the BiDAF++ model with pair2vec. BiDAF++ is an improved version of BiDAFNoAnswer (Seo et al., 2017; Levy et al., 2017) which includes self-attention and ELMo embeddings from Peters et al. (2018). We found this variant to be stronger than the baselines presented in Rajpurkar et al. (2018) by over 2.5 F1. We use BiDAF++ as a baseline since its architecture is typical for QA systems and, until recently, it was state-of-the-art on SQuAD 2.0 and other benchmarks.

BiDAF++ Let a and b be the outputs of the passage and question encoders respectively (in place of the standard p and q notations). The inference layer's inputs $\mathbf{a}^{inf}_i$ are defined similarly to the generic model's in (7), but also contain an aggregation of the elements in a, with better-aligned elements receiving larger weights:


the generic model’s in (7), but also contain an ag-gregation of the elements in a, with better-alignedelements receiving larger weights:

µ = softmaxi(maxjsi,j) (10)

ai =∑i

µiai (11)

ainfi =[ai;bi;ai ◦ bi; a

](12)

In the later layers, $\mathbf{a}^{inf}$ is recontextualized using a BiGRU and self-attention. Finally, a prediction layer predicts the start and end tokens.

BiDAF++ with pair2vec To add our pair vectors, we simply concatenate $\mathbf{r}_i$ (3) to $\mathbf{a}^{inf}_i$ (12):

$$\mathbf{a}^{inf}_i = \left[ \mathbf{a}_i ; \mathbf{b}_i ; \mathbf{a}_i \circ \mathbf{b}_i ; \bar{\mathbf{a}} ; \mathbf{r}_i \right] \qquad (13)$$
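A sketch of equations (10)-(13), assuming the attention logits and pooled pair vectors have already been computed as in Section 3.1; the shapes and function name are illustrative, not the actual BiDAF++ code.

```python
import torch

def bidafpp_inference_inputs(a_enc, b_att, s, r_att):
    """Inference-layer inputs for BiDAF++ with pair2vec, eqs. (10)-(13).
    a_enc: (batch, n, d)  passage states a_i
    b_att: (batch, n, d)  question-aware vectors b_i from eq. (6)
    s:     (batch, n, m)  attention logits s_{i,j} from eq. (4)
    r_att: (batch, n, dp) attention-pooled pair vectors r_i from eq. (8)"""
    mu = torch.softmax(s.max(dim=-1).values, dim=-1)   # eq. (10): (batch, n)
    a_bar = torch.bmm(mu.unsqueeze(1), a_enc)          # eq. (11): (batch, 1, d)
    a_bar = a_bar.expand(-1, a_enc.size(1), -1)        # broadcast over positions
    return torch.cat([a_enc, b_att, a_enc * b_att, a_bar, r_att], dim=-1)  # eq. (13)
```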

3.3 Natural Language Inference

For NLI, we augment the ESIM model (Chen et al., 2017), which was previously state-of-the-art on both the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) benchmarks.

ESIM Let a and b be the outputs of the premise and hypothesis encoders respectively (in place of the standard p and h notations). The inference layer's inputs $\mathbf{a}^{inf}_i$ (and $\mathbf{b}^{inf}_j$) are defined similarly to the generic model's in (7):

$$\mathbf{a}^{inf}_i = \left[ \mathbf{a}_i ; \mathbf{b}_i ; \mathbf{a}_i \circ \mathbf{b}_i ; \mathbf{a}_i - \mathbf{b}_i \right] \qquad (14)$$

In the later layers, $\mathbf{a}^{inf}$ and $\mathbf{b}^{inf}$ are projected, recontextualized, and converted to a fixed-length vector for each sentence using multiple pooling schemes. These vectors are then passed on to an output layer, which predicts the class.

ESIM with pair2vec To add our pair vectors, we simply concatenate $\mathbf{r}_i$ (3) to $\mathbf{a}^{inf}_i$ (14):

$$\mathbf{a}^{inf}_i = \left[ \mathbf{a}_i ; \mathbf{b}_i ; \mathbf{a}_i \circ \mathbf{b}_i ; \mathbf{a}_i - \mathbf{b}_i ; \mathbf{r}_i \right] \qquad (15)$$

A similar augmentation of ESIM was recently proposed in KIM (Chen et al., 2018). However, their pair vectors are composed of WordNet features, while our pair embeddings are learned directly from text (see further discussion in Section 6).

4 Experiments

For experiments on QA (Section 4.1) and NLI (Section 4.2), we use our full model, which includes multivariate and typed negative sampling. We discuss ablations in Section 4.3.

Benchmark   | Metric | BiDAF | + pair2vec | ∆
SQuAD 2.0   | EM     | 65.66 | 68.02      | +2.36
SQuAD 2.0   | F1     | 68.86 | 71.58      | +2.72
AddSent     | EM     | 37.50 | 44.20      | +6.70
AddSent     | F1     | 42.55 | 49.69      | +7.14
AddOneSent  | EM     | 48.20 | 53.30      | +5.10
AddOneSent  | F1     | 54.02 | 60.13      | +6.11

Table 3: Performance on SQuAD 2.0 and adversarial SQuAD (AddSent and AddOneSent) benchmarks, with and without pair2vec. All models have ELMo.

Benchmark   | ESIM  | + pair2vec | ∆
Matched     | 79.68 | 81.03      | +1.35
Mismatched  | 78.80 | 80.12      | +1.32

Table 4: Performance on MultiNLI, with and without pair2vec. All models have ELMo.

Data We use the January 2018 dump of English Wikipedia, containing 96M sentences, to train pair2vec. We restrict the vocabulary to the 100K most frequent words. Preprocessing removes all out-of-vocabulary words in the corpus. We consider each word pair within a window of 5 in the preprocessed corpus, and subsample⁵ instances based on pair probability with a threshold of 5·10⁻⁷. We define the context as one word each to the left and right, and all the words in between each pair, replacing both target words with placeholders X and Y (see Table 1). More details can be found in the supplementary material.
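A rough sketch of this preprocessing step. Only the window size, threshold, and placeholder scheme are given above; the exact window semantics and the word2vec-style keep probability sqrt(threshold / pair probability) are assumptions.

```python
import random
from itertools import combinations

def extract_pair_contexts(sentence, unigram_prob, window=5, threshold=5e-7):
    """Yield (x, y, context) training triples from one tokenized sentence.
    Pairs within the window are kept; each pair is subsampled based on the
    product of its unigram probabilities (the 'pair probability'). The context
    is one word on each side plus everything in between, with the target words
    replaced by X and Y placeholders."""
    for i, j in combinations(range(len(sentence)), 2):
        if j - i > window:
            continue
        x, y = sentence[i], sentence[j]
        p_pair = unigram_prob.get(x, 0.0) * unigram_prob.get(y, 0.0)
        if p_pair > threshold:
            # assumed word2vec-style rule: keep with prob sqrt(threshold / p)
            if random.random() > (threshold / p_pair) ** 0.5:
                continue
        left, right = max(i - 1, 0), min(j + 1, len(sentence) - 1)
        context = sentence[left:right + 1]
        context[i - left], context[j - left] = 'X', 'Y'
        yield x, y, context
```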

4.1 Question Answering

We experiment on the SQuAD 2.0 QA benchmark (Rajpurkar et al., 2018), as well as the adversarial datasets of SQuAD 1.1 (Rajpurkar et al., 2016; Jia and Liang, 2017). Table 3 shows the performance of BiDAF++, with ELMo, before and after adding pair2vec. Experiments on SQuAD 2.0 show that our pair representations improve performance by 2.72 F1. Moreover, adding pair2vec also results in better generalization on the adversarial SQuAD datasets, with gains of 7.14 and 6.11 F1.

4.2 Natural Language Inference

We report the performance of our model on MultiNLI and the adversarial test set from Glockner et al. (2018) in Table 5.

⁵ As in word2vec, subsampling reduces the size of the dataset and speeds up training. For this, we define the word pair probability as the product of unigram probabilities.


Model                      | Accuracy
Rule-based Models:
WordNet Baseline           | 85.8
Models with GloVe:
ESIM (Chen et al., 2017)   | 77.0
KIM (Chen et al., 2018)    | 87.7
ESIM + pair2vec            | 92.9
Models with ELMo:
ESIM (Peters et al., 2018) | 84.6
ESIM + pair2vec            | 93.4

Table 5: Performance on the adversarial NLI test set of Glockner et al. (2018).

Model                    | EM (∆)         | F1 (∆)
pair2vec (Full Model)    | 69.20          | 72.68
Composition: 2 Layers    | 68.35 (-0.85)  | 71.65 (-1.03)
Composition: Multiply    | 67.10 (-2.20)  | 70.20 (-2.48)
Objective: Bivariate NS  | 68.63 (-0.57)  | 71.98 (-0.70)
Unsupervised: Pair Dist  | 67.07 (-2.13)  | 70.24 (-2.44)
No pair2vec (BiDAF)      | 66.66 (-2.54)  | 69.90 (-2.78)

Table 6: Ablations on the SQuAD 2.0 development set show that argument sampling as well as using a deeper composition function are useful.

We outperform the ESIM + ELMo baseline by 1.3% on the matched and mismatched portions of the dataset.

We also record a gain of 8.8% absolute over ESIM on the Glockner et al. (2018) dataset, setting a new state of the art. Following standard practice (Glockner et al., 2018), we train all models on a combination of SNLI (Bowman et al., 2015) and MultiNLI. Glockner et al. (2018) show that, with the exception of KIM (Chen et al., 2018), which uses WordNet features, several NLI models fail to generalize to this setting, which involves lexical inference. For a fair comparison with KIM on the Glockner test set, we replace ELMo with GloVe embeddings, and still outperform KIM by almost halving the error rate.

4.3 Ablations

Ablating parts of pair2vec shows that all components of the model (Section 2) are useful. We ablate each component and report the EM and F1 on the development set of SQuAD 2.0 (Table 6). The full model, which uses a 4-layer MLP for R(x, y) and trains with multivariate negative sampling, achieves the highest F1 of 72.68.

We experiment with two alternative composition functions, a 2-layer MLP (Composition: 2 Layers) and element-wise multiplication (Composition: Multiply), which yield significantly smaller gains over the baseline BiDAF++ model. This demonstrates the need for a deep composition function.

Figure 2: Accuracy as a function of the interpolation parameter α (see Eq. (16)), shown separately for the derivational, lexicographic, inflectional, and encyclopedic relation groups. The α=0 configuration relies only on fastText (Bojanowski et al., 2017), while α=1 reflects pair2vec.

Eliminating sampling of target words (x, y) from the objective (Objective: Bivariate NS) results in a drop of 0.7 F1, accounting for about a quarter of the overall gain. This suggests that while the bulk of the signal is mined from the pair-context interactions, there is also valuable information in the other interactions.

We also test whether specific pre-training of word pair representations is useful by replacing pair2vec embeddings with the vector offsets of pre-trained word embeddings (Unsupervised: Pair Dist). We follow the PairDistance method for word analogies (Mikolov et al., 2013b), and represent the pair (x, y) as the L2-normalized difference of single-word vectors: $(\mathbf{x} - \mathbf{y})/\|\mathbf{x} - \mathbf{y}\|$. We use the same fastText (Bojanowski et al., 2017) word vectors with which we initialized pair2vec before training. We observe a gain of only 0.34 F1 over the baseline.

5 Analysis

In Section 4, we showed that pair2vec adds information complementary to single-word representations like ELMo. Here, we ask what this extra information is, and try to characterize which word relations are better captured by pair2vec. To that end, we evaluate performance on a word analogy dataset with over 40 different relation types (Section 5.1), and observe how pair2vec fills hand-crafted relation patterns (Section 5.2).


Relation           | 3CosAdd | + pair2vec | α*
Country:Capital    | 1.2     | 86.1       | 0.9
Name:Occupation    | 1.8     | 44.6       | 0.8
Name:Nationality   | 0.1     | 42.0       | 0.9
UK City:County     | 0.7     | 31.7       | 1.0
Country:Language   | 4.0     | 28.4       | 0.8
Verb 3pSg:Ved      | 49.1    | 61.7       | 0.6
Verb Ving:Ved      | 61.1    | 73.3       | 0.5
Verb Inf:Ved       | 58.5    | 70.1       | 0.5
Noun+less          | 4.8     | 16.0       | 0.2
Substance Meronym  | 3.8     | 14.5       | 0.6

Table 7: The top 10 analogy relations for which interpolating with pair2vec improves performance. α* is the optimal interpolation parameter for each relation.

5.1 Quantitative Analysis: Word Analogies

Word Analogy Dataset Given a word pair (a, b) and word x, the word analogy task involves predicting a word y such that a : b :: x : y. We use the Bigger Analogy Test Set (BATS, Gladkova et al., 2016) which contains four groups of relations: encyclopedic semantics (e.g., person-profession as in Einstein-physicist), lexicographic semantics (e.g., antonymy as in cheap-expensive), derivational morphology (e.g., noun forms as in oblige-obligation), and inflectional morphology (e.g., noun-plural as in bird-birds). Each group contains 10 sub-relations.

Method We interpolate pair2vec and 3CosAdd (Mikolov et al., 2013b; Levy et al., 2014) scores on fastText embeddings, as follows:

$$\mathrm{score}(y) = \alpha \cdot \cos(r_{a,b}, r_{x,y}) + (1 - \alpha) \cdot \cos(\mathbf{b} - \mathbf{a} + \mathbf{x}, \mathbf{y}) \qquad (16)$$

where $\mathbf{a}$, $\mathbf{b}$, $\mathbf{x}$, and $\mathbf{y}$ represent fastText embeddings⁶ and $r_{a,b}$, $r_{x,y}$ represent the pair2vec embeddings for the word pairs (a, b) and (x, y), respectively; α is the linear interpolation parameter. Following prior work (Mikolov et al., 2013b), we return the highest-scoring y in the entire vocabulary, excluding the given words a, b, and x.
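A small sketch of the interpolated scoring in Eq. (16); pair_vec and word_vec are assumed lookup callables onto the pair2vec and fastText spaces, not part of the released interface.

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def analogy_score(a, b, x, y, pair_vec, word_vec, alpha=0.5):
    """Eq. (16): pair2vec similarity between (a, b) and (x, y), interpolated
    with 3CosAdd on single-word (fastText) embeddings."""
    pair_term = cos(pair_vec(a, b), pair_vec(x, y))
    cosadd_term = cos(word_vec(b) - word_vec(a) + word_vec(x), word_vec(y))
    return alpha * pair_term + (1 - alpha) * cosadd_term

def predict_y(a, b, x, vocab, pair_vec, word_vec, alpha=0.5):
    """Return the highest-scoring y over the vocabulary, excluding a, b, x."""
    candidates = [w for w in vocab if w not in {a, b, x}]
    return max(candidates,
               key=lambda y: analogy_score(a, b, x, y, pair_vec, word_vec, alpha))
```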

Results Figure 2 shows how the accuracy on each category of relations varies with α. For all four groups, adding pair2vec to 3CosAdd results in significant gains. In particular, the biggest relative improvements are observed for encyclopedic (356%) and lexicographic (51%) relations.

⁶ The fastText embeddings in the analysis were retrained using the same Wikipedia corpus used to train pair2vec, to control for the corpus when comparing the two methods.

Table 7 shows the specific relations in which pair2vec made the largest absolute impact. The gains are particularly significant for relations where fastText embeddings provide limited signal. For example, the accuracy for substance meronyms goes from 3.8% to 14.5%. In some cases, there is also a synergistic effect; for instance, in noun+less, pair2vec alone scored 0% accuracy, but mixing it with 3CosAdd, which got 4.8% on its own, yielded 16% accuracy.

These results, alongside our experiments in Section 4, strongly suggest that pair2vec encodes information complementary to that in single-word embedding methods such as fastText and ELMo.

5.2 Qualitative Analysis: Slot Filling

To further explore how pair2vec encodes such complementary information, we consider a setting similar to that of knowledge base completion: given a Hearst-like context pattern c and a single word x, predict the other word y from the entire vocabulary. We rank candidate words y based on the scoring function in our training objective: R(x, y) · C(c). We use a fixed set of example relations and manually define their predictive context patterns and a small set of candidate words x.
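A sketch of this ranking step, reusing the assumed PairEncoder / ContextEncoder interfaces sketched in Section 2.1; the function name and arguments are illustrative.

```python
import torch

def fill_slot(pattern_ids, x_id, candidate_ids, pair_enc, ctx_enc, top_k=3):
    """Rank candidate words y for a context pattern c and a fixed word x,
    using the pretraining score R(x, y) . C(c).
    pattern_ids:   (ctx_len,) token ids of the pattern with X/Y placeholders
    x_id:          int word id of x
    candidate_ids: (num_cand,) long tensor of candidate y ids"""
    with torch.no_grad():
        c = ctx_enc(pattern_ids.unsqueeze(0)).squeeze(0)   # C(c): (d,)
        xs = torch.full_like(candidate_ids, x_id)          # repeat x per candidate
        scores = pair_enc(xs, candidate_ids) @ c           # R(x, y) . C(c)
        return candidate_ids[scores.topk(top_k).indices]
```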

Table 8 shows the top three y words. The model embeds (x, y) pairs close to contexts that reflect their relationship. For example, when Portland is substituted into the city-state pattern ("in X, Y."), the top two words are Oregon and Maine, both US states with cities named Portland. When it is used with the city-city pattern ("from X to Y."), the top two words are Salem and Astoria, both cities in Oregon. The word-context interaction often captures multiple relations; for example, Monet is used to refer to the painter (profession) as well as his paintings.

As intended, pair2vec captures the three-way word-word-context interaction, and not just the two-way word-context interaction (as in single-word embeddings). This profound difference allows pair2vec to complement single-word embeddings with additional information.

6 Related Work

Pretrained Word Embeddings Many state-of-the-art models initialize their word representations using pretrained embeddings such as word2vec (Mikolov et al., 2013a) or ELMo (Peters et al., 2018). These representations are typically trained using an interpretation of the Distributional Hypothesis (Harris, 1954), in which the bivariate distribution of target words and contexts is modeled.


Relation           | Context                 | X        | Y (Top 3)
Antonymy/Exclusion | either X or Y           | accept   | reject, refuse, recognise
                   |                         | hard     | soft, brittle, polished
Hypernymy          | including X and other Y | copper   | ones, metals, mines
                   |                         | google   | apps, browsers, searches
Hyponymy           | X like Y                | cities   | solaris, speyer, medina
                   |                         | browsers | chrome, firefox, netscape
Co-hyponymy        | , X , Y ,               | copper   | malachite, flint, ivory
                   |                         | google   | microsoft, bing, yahoo
City-State         | in X , Y .              | portland | oregon, maine, dorset
                   |                         | dallas   | tx, texas, va
City-City          | from X to Y .           | portland | salem, astoria, ogdensburg
                   |                         | dallas   | denton, allatoona, addison
Profession         | X , a famous Y ,        | ronaldo  | footballer, portuguese, player
                   |                         | monet    | painter, painting, butterfly

Table 8: Given a context c and a word x, we select the top 3 words y from the entire vocabulary using our scoring function R(x, y) · C(c). The analysis suggests that the model tends to rank correct matches (italics in the original table) over others.

Our work deviates from the word embedding literature in two major aspects. First, our goal is to represent word pairs, not individual words. Second, our new PMI formulation models the trivariate word-word-context distribution. Experiments show that our pair embeddings can complement single-word embeddings.

Mining Textual Patterns There is extensive literature on mining textual patterns to predict relations between words (Hearst, 1992; Snow et al., 2005; Turney, 2005; Riedel et al., 2013; Van de Cruys, 2014; Toutanova et al., 2015; Shwartz and Dagan, 2016). These approaches focus mostly on relations between pairs of nouns (perhaps with the exception of VerbOcean (Chklovski and Pantel, 2004)). More recently, they have been expanded to predict relations between unrestricted pairs of words (Jameel et al., 2018; Espinosa Anke and Schockaert, 2018), assuming that each word pair was observed together during pretraining. Washio and Kato (2018a,b) relax this assumption with a compositional model that can represent any pair, as long as each word appeared (individually) in the corpus.

These methods are evaluated on either intrinsic relation prediction tasks, such as BLESS (Baroni and Lenci, 2011) and CogALex (Santus et al., 2016), or knowledge-base population benchmarks, e.g. FB15k (Bordes et al., 2013). To the best of our knowledge, our work is the first to integrate pattern-based methods into modern high-performing semantic models and evaluate their impact on complex end-tasks like QA and NLI.

Integrating Knowledge in Complex Models Ahn et al. (2016) integrate Freebase facts into a language model using a copying mechanism over fact attributes. Yang and Mitchell (2017) modify the LSTM cell to incorporate WordNet and NELL knowledge for event and entity extraction. For cross-sentence inference tasks, Weissenborn et al. (2017), Bauer et al. (2018), and Mihaylov and Frank (2018) dynamically refine word representations by reading assertions from ConceptNet and Wikipedia abstracts. Our approach, on the other hand, relies on a relatively simple extension of existing cross-sentence inference models. Furthermore, we do not need to dynamically retrieve and process knowledge base facts or Wikipedia texts; we simply pretrain our pair vectors in advance.

KIM (Chen et al., 2018) integrates word-pair vectors into the ESIM model for NLI in a very similar way to ours. However, KIM's word-pair vectors contain only hand-engineered word-relation indicators from WordNet, whereas our word-pair vectors are automatically learned from unlabeled text. Our vectors can therefore reflect relation types that do not exist in WordNet (such as profession) as well as word pairs that do not have a direct link in WordNet (e.g. bronze and statue); see Table 8 for additional examples.


7 Conclusion and Future Work

We presented new methods for training and using word pair embeddings that implicitly represent background knowledge. Our pair embeddings are computed as a compositional function of the individual word representations, which is learned by maximizing a variant of PMI with the contexts in which the two words co-occur. Experiments on cross-sentence inference benchmarks demonstrated that adding these representations to existing models results in sizable improvements for both in-domain and adversarial settings.

Published concurrently with this paper, BERT (Devlin et al., 2018), which uses a masked language model objective, has reported dramatic gains on multiple semantic benchmarks including question answering, natural language inference, and named entity recognition. Potential avenues for future work include multitasking BERT with pair2vec in order to more directly incorporate reasoning about word pair relations into the BERT objective.

Acknowledgments

We would like to thank Anna Rogers (Gladkova), Qian Chen, Koki Washio, Pranav Rajpurkar, and Robin Jia for their help with the evaluation. We are also grateful to members of the UW and FAIR NLP groups, and anonymous reviewers for their thoughtful comments and suggestions.

References

Sungjin Ahn, Heeyoul Choi, Tanel Parnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. arXiv preprint arXiv:1608.00318.

Marco Baroni and Alessandro Lenci. 2011. How we blessed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, GEMS '11, pages 1-10, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4220-4230, Brussels, Belgium. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2787-2795. Curran Associates, Inc.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632-642. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018. Neural natural language inference models enhanced with external knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2406-2417. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657-1668. Association for Computational Linguistics.

Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845-855. Association for Computational Linguistics.

Tim Van de Cruys. 2011. Two multivariate generalizations of pointwise mutual information. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 16-20. Association for Computational Linguistics.

Tim Van de Cruys. 2014. A neural network approach to selectional preference acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 26-35. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Luis Espinosa Anke and Steven Schockaert. 2018. SeVeN: Augmenting word embeddings with unsupervised relation vectors. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2653-2665. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: What works and what doesn't. In Proceedings of the NAACL-HLT SRW, pages 47-54, San Diego, California. ACL.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650-655. Association for Computational Linguistics.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146-162.

Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France.

Shoaib Jameel, Zied Bouraoui, and Steven Schockaert. 2018. Unsupervised learning of distributional relation vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23-33. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021-2031. Association for Computational Linguistics.

Omer Levy, Ido Dagan, and Jacob Goldberger. 2014. Focused entailment graphs for open IE propositions. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 87-97. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2177-2185, Cambridge, MA, USA. MIT Press.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, pages 333-342.

Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, and Xiaoyong Du. 2017. Investigating different syntactic context types and context representations for learning word embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2421-2431. Association for Computational Linguistics.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821-832. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111-3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746-751. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227-2237. Association for Computational Linguistics.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784-789. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Texas. Association for Computational Linguistics.

Sebastian Riedel, Limin Yao, Benjamin M. Marlin, and Andrew McCallum. 2013. Relation extraction with matrix factorization and universal schemas. In Joint Human Language Technology Conference/Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL).

Enrico Santus, Anna Gladkova, Stefan Evert, and Alessandro Lenci. 2016. The CogALex-V shared task on the corpus-based identification of semantic relations. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), pages 69-79. The COLING 2016 Organizing Committee.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of the International Conference on Learning Representations (ICLR).

Vered Shwartz and Ido Dagan. 2016. Path-based vs. distributional information in recognizing lexical semantic relations. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon, CogALex@COLING 2016, Osaka, Japan, pages 24-29.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2389-2398. Association for Computational Linguistics.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1297-1304. MIT Press.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, pages 1499-1509.

Peter D. Turney. 2005. Measuring semantic similarity by latent relational analysis. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI'05, pages 1136-1141, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Koki Washio and Tsuneaki Kato. 2018a. Filling missing paths: Modeling co-occurrences of word pairs and dependency paths for recognizing lexical semantic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1123-1133. Association for Computational Linguistics.

Koki Washio and Tsuneaki Kato. 2018b. Neural latent relational analysis to capture lexical semantic relations in a vector space. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018. Association for Computational Linguistics.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112-1122. Association for Computational Linguistics.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1436-1446. Association for Computational Linguistics.


A Implementation Details

Hyperparameters For both word pairs and contexts, we use 300-dimensional word embeddings initialized with fastText (Bojanowski et al., 2017). The context representation uses a single-layer Bi-LSTM with a hidden layer size of 100. We use 2 negative context samples and 3 negative argument samples for each pair-context tuple.

For pre-training, we used stochastic gradient descent with an initial learning rate of 0.01. We reduce the learning rate by a factor of 0.9 if the loss does not decrease for 300K steps. We use a batch size of 600, and train for 12 epochs.⁷

⁷ On Titan X GPUs, the training takes about a week.

For both end-task models, we use AllenNLP's implementations (Gardner et al., 2018) with default hyperparameters; we did not change any setting before or after injecting pair2vec. We use 0.15 dropout on our pretrained pair embeddings.
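For reference, the pretraining settings listed above can be summarized in a single config; the key names are illustrative and do not correspond to the released code.

```python
# Consolidated view of the hyperparameters described in Appendix A (key names
# are assumptions; values are taken from the text above).
PRETRAIN_CONFIG = {
    "embedding_dim": 300,           # word/context embeddings, fastText-initialized
    "context_lstm_hidden": 100,     # single-layer BiLSTM
    "negative_contexts": 2,         # k_c per (x, y, c) tuple
    "negative_arguments": 3,        # k_x = k_y per tuple
    "optimizer": "sgd",
    "learning_rate": 0.01,
    "lr_decay_factor": 0.9,         # applied if loss stalls for 300K steps
    "batch_size": 600,
    "epochs": 12,
    "pair_embedding_dropout": 0.15, # used when injecting into end-task models
}
```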

B Multivariate Negative Sampling

In this appendix, we elaborate on the mathematical details of multivariate negative sampling to support our claims in Section 2.2.

B.1 Global Objective

The objective $J_{3NS}$ (Table 2) in Section 2.2 characterizes the local objective for each data instance. To understand the mathematical properties of this objective, we must first describe the global objective in terms of the entire dataset. However, this cannot be done by simply summing the local objective for each (x, y, c), since each such example may appear multiple times in our dataset. Moreover, due to the nature of negative sampling, the number of times an (x, y, c) triplet appears as a positive example will almost always be different from the number of times it appears as a negative one. Therefore, we must determine the frequency with which each triplet appears in each role.

We first denote the number of times the example (x, y, c) appears in the dataset as #(x, y, c); this is also the number of times (x, y, c) is used as a positive example. We observe that the expected number of times (x, y, c) is used as a corrupt x example is $k_x P(x, \cdot, \cdot)\,\#(\cdot, y, c)$, since (x, y, c) can only be created as a corrupt x example by randomly sampling x from an example that already contained y and c. The number of times (x, y, c) is used as a corrupt y or c example can be derived analogously. Therefore, the global objective of our ternary negative sampling approach is:

$$J^{Global}_{3NS} = \sum_{(x,y,c)} J^{x,y,c}_{3NS} \qquad (17)$$

$$J^{x,y,c}_{3NS} = \#(x, y, c) \cdot \log\sigma(S_{x,y,c}) + Z'_{x,y,c} \cdot \log\sigma(-S_{x,y,c}) \qquad (18)$$

$$Z'_{x,y,c} = k_x P(x,\cdot,\cdot)\,\#(\cdot,y,c) + k_y P(\cdot,y,\cdot)\,\#(x,\cdot,c) + k_c P(\cdot,\cdot,c)\,\#(x,y,\cdot) \qquad (19)$$

$$S_{x,y,c} = R(x, y) \cdot C(c) \qquad (20)$$

B.2 Relation to Multivariate PMI

With the global objective, we can now ask what the optimal value of $S_{x,y,c}$ (20) is by setting the partial derivative of (17) to zero. This derivative is in fact equal to the partial derivative of (18), since it is the only component of the global objective in which R(x, y) · C(c) appears:

$$0 = \frac{\partial J^{Global}_{3NS}}{\partial S_{x,y,c}} = \frac{\partial J^{x,y,c}_{3NS}}{\partial S_{x,y,c}}$$

The partial derivative of (18) can be expressed as:

$$0 = \#(x, y, c) \cdot \sigma(-S_{x,y,c}) - Z'_{x,y,c} \cdot \sigma(S_{x,y,c})$$

which can be reformulated as:

$$S_{x,y,c} = \log \frac{\#(x, y, c)}{Z'_{x,y,c}}$$

By expanding the fraction by 1/#(·, ·, ·) (i.e. dividing by the size of the dataset), we essentially convert all the frequency counts (e.g. #(x, y, c)) to empirical probabilities (e.g. P(x, y, c)), and arrive at Equation (1) in Section 2.2.

B.3 Other Multivariate PMI Formulations

Previous work has proposed different multivariate formulations of PMI, shown in Table 9. Van de Cruys (2011) presented specific interaction information (SI1) and specific correlation (SI2). In addition to those metrics, Jameel et al. (2018) experimented with SI3, which is the bivariate PMI between (x, y) and c, and with SI4. Our formulation deviates from previous work, and, to the best of our knowledge, cannot be trivially expressed by one of the existing metrics.


(Van de Cruys, 2011):
$$SI_1(x, y, c) = \log \frac{P(x,y,\cdot)\,P(x,\cdot,c)\,P(\cdot,y,c)}{P(x,\cdot,\cdot)\,P(\cdot,y,\cdot)\,P(\cdot,\cdot,c)\,P(x,y,c)}$$
$$SI_2(x, y, c) = \log \frac{P(x,y,c)}{P(x,\cdot,\cdot)\,P(\cdot,y,\cdot)\,P(\cdot,\cdot,c)}$$

(Jameel et al., 2018):
$$SI_3(x, y, c) = \log \frac{P(x,y,c)}{P(x,y,\cdot)\,P(\cdot,\cdot,c)}$$
$$SI_4(x, y, c) = \log \frac{P(x,y,c)\,P(\cdot,\cdot,c)}{P(x,\cdot,c)\,P(\cdot,y,c)}$$

(This work):
$$MVPMI(x, y, c) = \log \frac{P(x,y,c)}{k_c P(\cdot,\cdot,c)P(x,y,\cdot) + k_x P(x,\cdot,\cdot)P(\cdot,y,c) + k_y P(\cdot,y,\cdot)P(x,\cdot,c)}$$

Table 9: Multivariate generalizations of PMI.