
Evaluating the Stability of Embedding-based Word Similarities

Maria Antoniak
Cornell University

[email protected]

David Mimno
Cornell University

[email protected]

Abstract

Word embeddings are increasingly being used as a tool to study word associations in specific corpora. However, it is unclear whether such embeddings reflect enduring properties of language or if they are sensitive to inconsequential variations in the source documents. We find that nearest-neighbor distances are highly sensitive to small changes in the training corpus for a variety of algorithms. For all methods, including specific documents in the training set can result in substantial variations. We show that these effects are more prominent for smaller training corpora. We recommend that users never rely on single embedding models for distance calculations, but rather average over multiple bootstrap samples, especially for small corpora.

1 Introduction

Word embeddings are a popular technique in natural language processing (NLP) in which the words in a vocabulary are mapped to low-dimensional vectors. Embedding models are easily trained—several implementations are publicly available—and relationships between the embedding vectors, often measured via cosine similarity, can be used to reveal latent semantic relationships between pairs of words. Word embeddings are increasingly being used by researchers in unexpected ways and have become popular in fields such as digital humanities and computational social science (Hamilton et al., 2016; Heuser, 2016; Phillips et al., 2017).

Embedding-based analyses of semantic similarity can be a robust and valuable tool, but we find that standard methods dramatically under-represent the variability of these measurements. Embedding algorithms are much more sensitive than they appear to factors such as the presence of specific documents, the size of the documents, the size of the corpus, and even seeds for random number generators. If users do not account for this variability, their conclusions are likely to be invalid. Fortunately, we also find that simply averaging over multiple bootstrap samples is sufficient to produce stable, reliable results in all cases tested.

NLP research in word embeddings has so far focused on a downstream-centered use case, where the end goal is not the embeddings themselves but performance on a more complicated task. For example, word embeddings are often used as the bottom layer in neural network architectures for NLP (Bengio et al., 2003; Goldberg, 2017). The embedding's training corpus, which is selected to be as large as possible, is only of interest insofar as it generalizes to the downstream training corpus.

In contrast, other researchers take a corpus-centered approach and use relationships between embeddings as direct evidence about the language and culture of the authors of a training corpus (Bolukbasi et al., 2016; Hamilton et al., 2016; Heuser, 2016). Embeddings are used as if they were simulations of a survey asking subjects to free-associate words from query terms. Unlike the downstream-centered approach, the corpus-centered approach is based on direct human analysis of nearest neighbors to embedding vectors, and the training corpus is not simply an off-the-shelf convenience but rather the central object of study.


Downstream-centered | Corpus-centered
Big corpus | Small corpus, difficult or impossible to expand
Source is not important | Source is the object of study
Only vectors are important | Specific, fine-grained comparisons are important
Embeddings are used in downstream tasks | Embeddings are used to learn about the mental model of word association for the authors of the corpus

Table 1: Comparison of downstream-centered and corpus-centered approaches to word embeddings.

While word embeddings may appear to measure properties of language, they in fact only measure properties of a curated corpus, which could suffer from several problems. The training corpus is merely a sample of the authors' language model (Shazeer et al., 2016). Sources could be missing or over-represented, typos and other lexical variations could be present, and, as noted by Goodfellow et al. (2016), "Many datasets are most naturally arranged in a way where successive examples are highly correlated." Furthermore, embeddings can vary considerably across random initializations, making lists of "most similar words" unstable.

We hypothesize that training on small and potentially idiosyncratic corpora can exacerbate these problems and lead to highly variable estimates of word similarity. Such small corpora are common in digital humanities and computational social science, and it is often impossible to mitigate these problems simply by expanding the corpus. For example, we cannot create more 18th Century English books or change their topical focus.

We explore causes of this variability, which range from the fundamental stochastic nature of certain algorithms to more troubling sensitivities to properties of the corpus, such as the presence or absence of specific documents. We focus on the training corpus as a source of variation, viewing it as a fragile artifact curated by often arbitrary decisions. We examine four different algorithms and six datasets, and we manipulate the corpus by shuffling the order of the documents and taking bootstrap samples of the documents. Finally, we examine the effects of these manipulations on the cosine similarities between embeddings.

We find that there is considerable variability in embeddings that may not be obvious to users of these methods. Rankings of most similar words are not reliable, and both ordering and membership in such lists are liable to change significantly. Some uncertainty is expected, and there is no clear criterion for "acceptable" levels of variance, but we argue that the amount of variation we observe is sufficient to call into question the whole method. For example, we find cases in which there is zero set overlap in "top 10" lists for the same query word across bootstrap samples. Smaller corpora and larger document sizes increase this variation. Our goal is to provide methods to quantify this variability; to account for it, we recommend that, as the size of a corpus gets smaller, cosine similarities should be averaged over many bootstrap samples.

2 Related Work

Word embeddings are mappings of words to points in a K-dimensional continuous space, where K is much smaller than the size of the vocabulary. Reducing the number of dimensions has two benefits: first, large, sparse vectors are transformed into small, dense vectors, and second, the conflation of features uncovers latent semantic relationships between the words. These semantic relationships are usually measured via cosine similarity, though other metrics such as Euclidean distance and the Dice coefficient are possible (Turney and Pantel, 2010). We focus on four of the most popular training algorithms: LSA (Deerwester et al., 1990), SGNS (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and PPMI (Levy and Goldberg, 2014) (see Section 5 for more detailed descriptions of these algorithms).

In NLP, word embeddings are often used as features for downstream tasks. Dependency parsing (Chen and Manning, 2014), named entity recognition (Turian et al., 2010; Cherry and Guo, 2015), and bilingual lexicon induction (Vulic and Moens, 2015) are just a few examples where the use of embeddings as features has increased performance in recent years.

Increasingly, word embeddings have been used as evidence in studies of language and culture. For example, Hamilton et al. (2016) train separate embeddings on temporal segments of a corpus and then analyze changes in the similarity of words to measure semantic shifts, and Heuser (2016) uses embeddings to characterize discourse about virtues in 18th Century English text. Other studies use cosine similarities between embeddings to measure the variation of language across geographical areas (Kulkarni et al., 2016; Phillips et al., 2017) and time (Kim et al., 2014). Each of these studies seeks to reconstruct the mental model of authors based on documents.

An example that highlights the contrast between the downstream-centered and corpus-centered perspectives is the exploration of implicit bias in word embeddings. Researchers have observed that embedding-based word similarities reflect cultural stereotypes, such as associations between occupations and genders (Bolukbasi et al., 2016). From a downstream-centered perspective, these stereotypical associations represent bias that should be filtered out before using the embeddings as features. In contrast, from a corpus-centered perspective, implicit bias in embeddings is not a problem that must be fixed but a means of measurement, providing quantitative evidence of bias in the training corpus.

Embeddings are usually evaluated on direct use cases, such as word similarity and analogy tasks via cosine similarities (Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015; Shazeer et al., 2016). Intrinsic evaluations like word similarities measure the interpretability of the embeddings rather than their downstream task performance (Gladkova and Drozd, 2016), but while some research does evaluate embedding vectors on their downstream task performance (Pennington et al., 2014; Faruqui et al., 2015), the standard benchmarks remain intrinsic.

There has been some recent work in evaluating the stability of word embeddings. Levy et al. (2015) focus on the hyperparameter settings for each algorithm and show that hyperparameters such as the size of the context window, the number of negative samples, and the level of context distribution smoothing can affect the performance of embeddings on similarity and analogy tasks. Hellrich and Hahn (2016) examine the effects of word frequency, word ambiguity, and the number of training epochs on the reliability of embeddings produced by SGNS and skip-gram hierarchical softmax (SGHS, a variant of SGNS), striving for reproducibility and recommending against sampling the corpus in order to preserve stability. Likewise, Tian et al. (2016) explore the robustness of SGNS and GloVe embeddings trained on large, generic corpora (Wikipedia and news data) and propose methods to align these embeddings across different iterations.

In contrast, our goal is not to produce artificially stable embeddings but to identify the factors that create instability and measure our statistical confidence in the cosine similarities between embeddings trained on small, specific corpora. We focus on the corpus as a fragile artifact and source of variation, considering the corpus itself merely a sample of possible documents produced by the authors. We examine whether the embeddings accurately model those authors, using bootstrap sampling to measure the effects of adding or removing documents from the training corpus.

3 Corpora

We collected two sub-corpora from each of three datasets (see Table 2) to explore how word embeddings are affected by size, vocabulary, and other parameters of the training corpus. In order to better model realistic examples of corpus-centered research, these corpora are deliberately chosen to be publicly available, suggestive of social research questions, varied in corpus parameters (e.g. topic, size, vocabulary), and much smaller than the standard corpora typically used in training word embeddings (e.g. Wikipedia, Gigaword). Each dataset was created organically, over specific time periods, in specific social settings, by specific authors. Thus, it is impossible to expand these datasets without compromising this specificity.

We process each corpus by lowercasing all text, removing words that appear fewer than 20 times in the corpus, and removing all numbers and punctuation. Because our methods rely on bootstrap sampling (see Section 6), which operates by removing or multiplying the presence of documents, we also remove duplicate documents from each corpus.
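A minimal sketch of this preprocessing step (the regex tokenizer, function name, and use of Python are our own illustration; the paper does not prescribe a particular implementation):

```python
import re
from collections import Counter

def preprocess(documents, min_count=20):
    """Lowercase, strip numbers and punctuation, dedupe documents, drop rare words."""
    # Keep only alphabetic tokens; lowercasing happens before tokenization.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]

    # Remove duplicate documents, since bootstrap sampling assumes each appears once.
    seen, unique_docs = set(), []
    for tokens in tokenized:
        key = tuple(tokens)
        if key not in seen:
            seen.add(key)
            unique_docs.append(tokens)

    # Remove words that appear fewer than min_count times in the whole corpus.
    counts = Counter(token for doc in unique_docs for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in unique_docs]
```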

U.S. Federal Courts of Appeals   The U.S. Federal courts of appeals are regional courts that decide appeals from the district courts within their federal judicial circuit. We examine the embeddings of the most recent five years of the 4th and 9th circuits.[1] The 4th circuit contains Washington D.C. and surrounding states, while the 9th circuit contains the entirety of the west coast.

[1] https://www.courtlistener.com/


Corpus | Number of documents | Unique words | Vocabulary density | Words per document
NYT Sports (2000) | 8,786 | 12,475 | 0.0020 | 708
NYT Music (2000) | 3,666 | 9,762 | 0.0037 | 715
AskScience | 331,635 | 16,901 | 0.0012 | 44
AskHistorians | 63,578 | 9,384 | 0.0022 | 66
4th Circuit | 5,368 | 16,639 | 0.0014 | 2,281
9th Circuit | 9,729 | 22,146 | 0.0011 | 2,108

Table 2: Comparison of the number of documents, number of unique words (after removing words that appear fewer than 20 times), vocabulary density (the ratio of unique words to the total number of words), and the average number of words per document for each corpus.

Setting | Method | Tests... | Run 1 | Run 2 | Run 3
FIXED | Documents in consistent order | variability due to algorithm (baseline) | A B C | A B C | A B C
SHUFFLED | Documents in random order | variability due to document order | A C B | B A C | C B A
BOOTSTRAP | Documents sampled with replacement | variability due to document presence | B A A | C A B | B B B

Table 3: The three settings that manipulate the document order and presence in each corpus.

Social science research questions might involve measuring a widely held belief that certain courts have distinct ideological tendencies (Broscheid, 2011). Such bias may result in measurable differences in word association due to framing effects (Card et al., 2015), which could be observable by comparing the words associated with a given query term. We treat each opinion as a single document.

New York Times   The New York Times (NYT) Annotated Corpus (Sandhaus, 2008) contains newspaper articles tagged with additional metadata reflecting their content and publication context. To constrain the size of the corpora and to enhance their specificity, we extract data only for the year 2000 and focus on only two sections of the NYT dataset: sports and music. In the resulting corpora, the sports section is substantially larger than the music section (see Table 2). We treat an article as a single document.

Reddit   Reddit[2] is a social website containing thousands of forums (subreddits) organized by topic. We use a dataset containing all posts for the years 2007-2014 from two subreddits: /r/AskScience and /r/AskHistorians. These two subreddits allow users to post any question in the topics of science and history, respectively. AskScience is more than five times larger than AskHistorians, though the document length is generally longer for AskHistorians (see Table 2). Reddit is a popular data source for computational social science research; for example, subreddits can be used to explore the distinctiveness and dynamicity of communities (Zhang et al., 2017). We treat an original post as a single document.

[2] https://www.reddit.com/

4 Corpus Parameters

Order and presence of documents   We use three different methods to sample the corpus: FIXED, SHUFFLED, and BOOTSTRAP. The FIXED setting includes each document exactly once, and the documents appear in a constant, chronological order across all models. The purpose of this setting is to measure the baseline variability of an algorithm, independent of any change in input data. Algorithmic variability may arise from random initializations of learned parameters, random negative sampling, or randomized subsampling of tokens within documents. The SHUFFLED setting includes each document exactly once, but the order of the documents is randomized for each model. The purpose of this setting is to evaluate the impact of variation in how we present examples to each algorithm. The order of documents could be an important factor for algorithms that use online training such as SGNS. The BOOTSTRAP setting samples N documents randomly with replacement, where N is equal to the number of documents in the FIXED setting. The purpose of this setting is to measure how much variability is due to the presence or absence of specific sequences of tokens in the corpus. See Table 3 for a comparison of these three settings.
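The three settings can be sketched as simple sampling functions over the list of documents (a hedged illustration; the function names and the use of Python's random module are ours, not the authors'):

```python
import random

def fixed(documents):
    """FIXED: every document exactly once, in a constant (chronological) order."""
    return list(documents)

def shuffled(documents, seed=None):
    """SHUFFLED: every document exactly once, with the order randomized per model."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    return docs

def bootstrap(documents, seed=None):
    """BOOTSTRAP: N documents sampled with replacement, where N is the corpus size."""
    rng = random.Random(seed)
    return [rng.choice(documents) for _ in range(len(documents))]
```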


Size of corpus   We expect the stability of embedding-based word similarities to be influenced by the size of the training corpus. As we add more documents, the impact of any specific document should be less significant. At the same time, larger corpora may also tend to be more broad in scope and variable in style and topic, leading to less idiosyncratic patterns in word co-occurrence. Therefore, for each corpus, we curate a smaller sub-corpus that contains 20% of the total corpus documents. These samples are selected using contiguous sequences of documents at the beginning of each training (this ensures that the FIXED setting remains constant).

Length of documents   We use two document segmentation strategies. In the first setting, each training instance is a single document (i.e. an article for the NYT corpus, an opinion from the Courts corpus, and a post from the Reddit corpus). In the second setting, each training instance is a single sentence. We expect this choice of segmentation to have the largest impact on the BOOTSTRAP setting. Documents are often characterized by "bursty" words that are locally frequent but globally rare (Madsen et al., 2005), such as the name of a defendant in a court case. Sampling whole documents with replacement should magnify the effect of bursty words: a rare but locally frequent word will either occur in a bootstrap corpus or not occur. Sampling sentences with replacement should have less effect on bursty words, since the chance that an entire document will be removed from the corpus is much smaller.

5 Algorithms

Evaluating all current embedding algorithms and implementations is beyond the scope of this work, so we select four categories of algorithms that represent distinct optimization strategies. Recall that our goal is to examine how algorithms respond to variation in the corpus, not to maximize the accuracy or effectiveness of the embeddings.

The first category is online stochastic updates, in which the algorithm updates model parameters using stochastic gradients as it proceeds through the training corpus. All methods implemented in the word2vec and fastText packages follow this format, including skip-gram, CBOW, negative sampling, and hierarchical softmax (Mikolov et al., 2013). We focus on SGNS as a popular and representative example. The second category is batch stochastic updates, in which the algorithm first collects a matrix of summary statistics derived from a pass through the training data that takes place before any parameters are set, and then updates model parameters using stochastic optimization. We select the GloVe algorithm (Pennington et al., 2014) as a representative example. The third category is matrix factorization, in which the algorithm makes deterministic updates to model parameters based on a matrix of summary statistics. As a representative example we include PPMI (Levy and Goldberg, 2014). Finally, to test whether word order is a significant factor, we include a document-based embedding method that uses matrix factorization, LSA (Deerwester et al., 1990; Landauer and Dumais, 1997).

These algorithms each include several hyperparameters, which are known to have measurable effects on the resulting embeddings (Levy et al., 2015). We have attempted to choose settings of these parameters that are commonly used and comparable across algorithms, but we emphasize that a full evaluation of the effect of each algorithmic parameter would be beyond the scope of this work. For each of the following algorithms, we set the context window size to 5 and the embedding size to 100. Since we remove words that occur fewer than 20 times during preprocessing of the corpus, we set the frequency threshold for the following algorithms to 0.

For all other hyperparameters, we follow the default or most popular settings for each algorithm, as described in the following sections.

5.1 LSA

Latent semantic analysis (LSA) factorizes a sparse term-document matrix X (Deerwester et al., 1990; Landauer and Dumais, 1997). X is factored using singular value decomposition (SVD), retaining K singular values such that

X \approx X_K = U_K \Sigma_K V_K^T

The elements of the term-document matrix are weighted, often with TF-IDF, which measures the importance of a word to a document in a corpus. The dense, low-rank approximation of the term-document matrix, X_K, can be used to measure the relatedness of terms by calculating the cosine similarity of the relevant rows of the reduced matrix.

We use the scikit-learn[3] package to train our LSA embeddings. We create a term-document matrix with TF-IDF weighting, using the default settings except that we add L2 normalization and sublinear TF scaling, which scales the importance of terms with high frequency within a document. We perform dimensionality reduction via a randomized solver (Halko et al., 2009).

The construction of the term-count matrix and the TF-IDF weighting should introduce no variation to the final word embeddings. However, we expect variation due to the randomized SVD solver, even when all other parameters (training document order, presence, size, etc.) are constant.
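A minimal sketch of this setup with scikit-learn (an illustration under our assumptions: docs holds the preprocessed training instances as strings, the function name is ours, and get_feature_names_out requires scikit-learn 1.0 or later):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def train_lsa(docs, k=100):
    """Train LSA word vectors from preprocessed documents (one string per instance)."""
    # Term-document matrix with TF-IDF weighting, L2 normalization, and sublinear TF scaling.
    vectorizer = TfidfVectorizer(norm="l2", sublinear_tf=True)
    doc_term = vectorizer.fit_transform(docs)      # shape: (n_docs, n_terms)

    # LSA factorizes the term-document matrix, so transpose before the randomized SVD.
    svd = TruncatedSVD(n_components=k, algorithm="randomized")
    term_vectors = svd.fit_transform(doc_term.T)   # shape: (n_terms, k)

    return dict(zip(vectorizer.get_feature_names_out(), term_vectors))
```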

5.2 SGNS

The skip-gram with negative sampling (SGNS) algorithm (Mikolov et al., 2013) is an online algorithm that uses randomized updates to predict words based on their context. In each iteration, the algorithm proceeds through the original documents and, at each word token, updates model parameters based on gradients calculated from the current model parameters. This process maximizes the likelihood of observed word-context pairs and minimizes the likelihood of negative samples.

We use an implementation of the SGNS algorithm included in the Python library gensim[4] (Rehurek and Sojka, 2010). We use the default settings provided with gensim except as described above.

We predict that multiple runs of SGNS on the same corpus will not produce the same results. SGNS randomly initializes all the embeddings before training begins, and it relies on negative samples created by randomly selecting word and context pairs (Mikolov et al., 2013; Levy et al., 2015). We also expect SGNS to be sensitive to the order of documents, as it relies on stochastic gradient descent, which can be disproportionately influenced by the initial documents (Bottou, 2012).

[3] http://scikit-learn.org/
[4] https://radimrehurek.com/gensim/models/word2vec.html
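A minimal sketch of the corresponding gensim call (an illustration, not the authors' script; parameter names follow gensim 4.x, where the embedding size is vector_size rather than size):

```python
from gensim.models import Word2Vec

def train_sgns(docs, seed=1):
    """docs: list of token lists, one per training instance (document or sentence)."""
    model = Word2Vec(
        sentences=docs,
        sg=1,             # skip-gram
        negative=5,       # negative sampling (gensim's default count)
        vector_size=100,  # embedding size used here ('size' in gensim < 4.0)
        window=5,         # context window size
        min_count=0,      # rare words were already removed during preprocessing
        seed=seed,
    )
    return model.wv       # KeyedVectors mapping each word to its 100-dimensional vector
```

Nearest neighbors and their cosine similarities can then be queried with, for example, model.wv.most_similar("marijuana", topn=10).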

5.3 GloVe

Global Vectors for Word Representation (GloVe) uses stochastic gradient updates but operates on a "global" representation of word co-occurrence that is calculated once at the beginning of the algorithm (Pennington et al., 2014). Words and contexts are associated with bias parameters, b_w and b_c, where w is a word and c is a context, learned by minimizing the cost function:

L = \sum_{w,c} f(x_{wc}) \left( \vec{w} \cdot \vec{c} + b_w + b_c - \log x_{wc} \right)^2

We use the GloVe implementation provided by Pennington et al. (2014).[5] We use the default settings provided with GloVe except as described above.

Unlike SGNS, the algorithm does not perform model updates while examining the original documents. As a result, we expect GloVe to be sensitive to random initializations but not sensitive to the order of documents.
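To make the objective concrete, a small NumPy sketch of this weighted least-squares loss (our own illustration; it uses the weighting function f(x) = min((x/x_max)^alpha, 1) from Pennington et al. (2014), while the released C implementation optimizes the loss with AdaGrad rather than evaluating it densely like this):

```python
import numpy as np

def glove_loss(W, C, b_w, b_c, X, x_max=100.0, alpha=0.75):
    """W, C: (V, d) word/context vectors; b_w, b_c: (V,) biases; X: (V, V) co-occurrence counts."""
    observed = X > 0                             # only observed co-occurrences contribute
    f = np.minimum((X / x_max) ** alpha, 1.0)    # GloVe weighting function
    log_x = np.log(np.where(observed, X, 1.0))   # avoid log(0) for unobserved pairs
    residual = W @ C.T + b_w[:, None] + b_c[None, :] - log_x
    return float(np.sum(f * observed * residual ** 2))
```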

5.4 PPMI

The positive pointwise mutual information (PPMI) matrix, whose cells represent the PPMI of each pair of words and contexts, is factored using singular value decomposition (SVD), resulting in low-dimensional embeddings that perform similarly to GloVe and SGNS (Levy and Goldberg, 2014).

\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}

\mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c), 0)

To train our PPMI word embeddings, we use hyperwords,[6] an implementation provided as part of Levy et al. (2015).[7] We follow the authors' recommendations and set the context distribution smoothing (cds) parameter to 0.75, the eigenvalue matrix (eig) to 0.5, the subsampling threshold (sub) to 10^-5, and the context window (win) to 5.

[5] http://nlp.stanford.edu/projects/glove/
[6] https://bitbucket.org/omerlevy/hyperwords/src
[7] We altered the PPMI code to remove a fixed random seed in order to introduce variability given a fixed corpus; no other change was made.


Like GloVe and unlike SGNS, PPMI operates on a pre-computed representation of word co-occurrence, so we do not expect results to vary based on the order of documents. Unlike both GloVe and SGNS, PPMI uses a stable, non-stochastic SVD algorithm that should produce the same result given the same input, regardless of initialization. However, we expect variation due to PPMI's random subsampling of frequent tokens.
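A minimal dense sketch of the PPMI-plus-SVD pipeline with the cds and eig parameters above (our own illustration; the hyperwords implementation additionally subsamples frequent tokens and operates on sparse matrices):

```python
import numpy as np

def ppmi_svd(cooc, k=100, cds=0.75, eig=0.5):
    """cooc: (V, V) word-context co-occurrence counts. Returns (V, k) word vectors."""
    cooc = np.asarray(cooc, dtype=float)
    word_counts = cooc.sum(axis=1, keepdims=True)          # #(w)
    ctx_smoothed = cooc.sum(axis=0, keepdims=True) ** cds  # #(c)^cds: context distribution smoothing

    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * ctx_smoothed.sum() / (word_counts * ctx_smoothed))
    ppmi = np.maximum(pmi, 0.0)       # PPMI = max(PMI, 0)
    ppmi[~np.isfinite(ppmi)] = 0.0    # clamp cells arising from zero counts

    # Truncated SVD; with eig = 0.5 the word vectors are U_k * Sigma_k^0.5.
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    return U[:, :k] * (S[:k] ** eig)
```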

6 Methods

To establish statistical significance bounds for our observations, we train 50 LSA models, 50 SGNS models, 50 GloVe models, and 50 PPMI models for each of the three settings (FIXED, SHUFFLED, and BOOTSTRAP), for each document segmentation size, for each corpus.

For each corpus, we select a set of 20 relevant query words from high-probability words from an LDA topic model (Blei et al., 2003) trained on that corpus with 200 topics. We calculate the cosine similarity of each query word to the other words in the vocabulary, creating a similarity ranking of all the words in the vocabulary. We calculate the mean and standard deviation of the cosine similarities for each pair of query word and vocabulary word across each set of 50 models.

From the lists of queries and cosine similarities, we select the 20 words most closely related to the set of query words and compare the mean and standard deviation of those pairs across settings. We calculate the Jaccard similarity between top-N lists to compare membership change in the lists of most closely related words, and we find average changes in rank within those lists. We examine these metrics across different algorithms and corpus parameters.
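A minimal sketch of this aggregation step, assuming an ensemble of trained models that each map a word to its vector (function and variable names are ours; the paper does not publish this code):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_stats(models, query, word):
    """Mean and standard deviation of cos(query, word) across an ensemble of models."""
    sims = [cosine(m[query], m[word]) for m in models]
    return np.mean(sims), np.std(sims)

def top_n(model, query, vocab, n=10):
    """The n words most similar to the query under one model (vocab excludes the query)."""
    return sorted(vocab, key=lambda w: cosine(model[query], model[w]), reverse=True)[:n]

def mean_jaccard(models, query, vocab, n=10):
    """Average pairwise Jaccard similarity of top-n lists across the ensemble."""
    lists = [set(top_n(m, query, vocab, n)) for m in models]
    pairs = [(a, b) for i, a in enumerate(lists) for b in lists[i + 1:]]
    return float(np.mean([len(a & b) / len(a | b) for a, b in pairs]))
```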

7 Results

We begin with a case study of the framing around the query term marijuana. One might hypothesize that the authors of various corpora (e.g. judges of the 4th Circuit, journalists at the NYT, and users on Reddit) have different perceptions of this drug and that their language might reflect those differences. Indeed, after qualitatively examining the lists of most similar terms (see Table 4), we might come to the conclusion that the allegedly conservative 4th Circuit judges view marijuana as similar to illegal drugs such as heroin and cocaine, while Reddit users view marijuana as closer to legal substances such as nicotine and alcohol.

Figure 1: The mean standard deviations across settings and algorithms for the 10 closest words to the query words in the 9th Circuit and NYT Music corpora, using whole documents. Larger variations indicate less stable embeddings.

However, we observe patterns that cause us to lower our confidence in such conclusions. Table 4 shows that the cosine similarities can vary significantly. We see that the top-ranked words (chosen according to their mean cosine similarity across runs of the FIXED setting) can have widely different mean similarities and standard deviations depending on the algorithm and the three training settings, FIXED, SHUFFLED, and BOOTSTRAP.

As expected, each algorithm has small variation in the FIXED setting. For example, we can see the effect of the random SVD solver for LSA and the effect of random subsampling for PPMI. We do not observe a consistent effect for document order in the SHUFFLED setting.

Most importantly, these figures reveal that the BOOTSTRAP setting causes large increases in variation across all algorithms (with a weaker effect for PPMI) and corpora, with large standard deviations across word rankings. This indicates that the presence of specific documents in the corpus can significantly affect the cosine similarities between embedding vectors.


[Table 4 panels omitted: nearest-neighbor similarity plots for the query word marijuana, by corpus (4th Circuit, NYT Sports, Reddit AskScience) and algorithm (LSA, SGNS, GloVe, PPMI).]

Table 4: The most similar words, with their means and standard deviations, for the cosine similarities between the query word marijuana and its 10 nearest neighbors (highest mean cosine similarity in the FIXED setting). Embeddings are learned from documents segmented by sentence.

GloVe produced very similar embeddings in both the FIXED and SHUFFLED settings, with similar means and small standard deviations, which indicates that GloVe is not sensitive to document order. However, the BOOTSTRAP setting caused a reduction in the mean and widened the standard deviation, indicating that GloVe is sensitive to the presence of specific documents.



Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7
viability | fetus | trimester | surgery | trimester | pregnancies | abdomen
pregnancies | pregnancies | surgery | visit | surgery | occupation | tenure
abortion | gestation | visit | therapy | incarceration | viability | stepfather
abortions | kindergarten | tenure | pain | visit | abortion | wife
fetus | viability | workday | hospitalization | arrival | tenure | groin
gestation | headaches | abortions | neck | pain | visit | throat
surgery | pregnant | hernia | headaches | headaches | abortions | grandmother
expiration | abortion | summer | trimester | birthday | pregnant | daughter
sudden | pain | suicide | experiencing | neck | birthday | panic
fetal | bladder | abortion | medications | tenure | fetus | jaw

Table 5: The 10 closest words to the query term pregnancy are highly variable. None of the words shown appear in every run. Results are shown across runs of the BOOTSTRAP setting for the full corpus of the 9th Circuit, the whole document size, and the SGNS model.

Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7
selection | selection | selection | selection | selection | selection | selection
genetics | process | human | darwinian | convergent | evolutionary | darwinian
convergent | darwinian | humans | theory | darwinian | humans | nature
process | humans | natural | genetics | evolutionary | species | evolutionary
darwinian | convergent | genetics | human | genetics | convergent | convergent
abiogenesis | evolutionary | species | evolutionary | theory | process | process
evolutionary | species | did | humans | natural | natural | natural
natural | human | convergent | natural | humans | did | species
nature | natural | process | convergent | process | human | humans
species | theory | evolutionary | creationism | human | darwinian | favor

Table 6: The order of the 10 closest words to the query term evolution is highly variable. Results are shown across runs of the BOOTSTRAP setting for the full corpus of AskScience, the whole document length, and the GloVe model.

These patterns of larger or smaller variations are generalized in Figure 1, which shows the mean standard deviation for different algorithms and settings. We calculated the standard deviation across the 50 runs for each query word in each corpus, and then we averaged over these standard deviations. The results show the average levels of variation for each algorithm and corpus. We observe that the FIXED and SHUFFLED settings for GloVe and LSA produce the least variable cosine similarities, while PPMI produces the most variable cosine similarities for all settings. The presence of specific documents has a significant effect on all four algorithms (less so for PPMI), consistently increasing the standard deviations.

We turn to the question of how this variation in standard deviation affects the lists of most similar words. Are the top-N words simply re-ordered, or do the words present in the list substantially change? Table 5 shows an example of the top-N word lists for the query word pregnancy in the 9th Circuit corpus. Observing Run 1, we might believe that judges of the 9th Circuit associate pregnancy most with questions of viability and abortion, while observing Run 5, we might believe that pregnancy is most associated with questions of prisons and family visits. Although the lists in this table are all produced from the same corpus and document size, the membership of the lists changes substantially between runs of the BOOTSTRAP setting.

As another example, Table 6 shows results for the query evolution for the GloVe model and the AskScience corpus. Although this query shows less variation between runs, we still find cause for concern. For example, Run 3 ranks the words human and humans highly, while Run 1 includes neither of those words in the top 10.

These changes in top-N rank are shown in Figure 2. For each query word for the AskHistorians corpus, we find the N most similar words using SGNS. We generate new top-N lists for each of the 50 models trained in the BOOTSTRAP setting, and we use Jaccard similarity to compare the 50 lists. We observe similar patterns to the changes in standard deviation in Figure 2; PPMI displays the lowest Jaccard similarity across settings, while the other algorithms have higher similarities in the FIXED and SHUFFLED settings but much lower similarities in the BOOTSTRAP setting. We display results for both N = 2 and N = 10, emphasizing that even very highly ranked words often drop out of the top-N list.


Figure 2: The mean Jaccard similarities across settings and algorithms for the top 2 and top 10 closest words to the query words in the AskHistorians corpus. Larger Jaccard similarity indicates more consistency in top-N membership. Results are shown for the sentence document length.

Even when words do not drop out of the top-N list, they often change in rank, as we observe in Figure 3. We show both a specific example for the query term men and an aggregate of all the terms whose average rank is within the top 10 across runs of the BOOTSTRAP setting. In order to highlight the average changes in rank, we do not show outliers in this figure, but we note that outliers (large falls and jumps in rank) are common. The variability across samples from the BOOTSTRAP setting indicates that the presence of specific documents can significantly affect the top-N rankings.

We also find that document segmentation size affects the cosine similarities. Figure 4 shows that documents segmented at a more fine-grained level produce embeddings with less variability across runs of the BOOTSTRAP setting. Documents segmented at the sentence level have standard deviations clustering closer to the median, while larger documents have standard deviations that are spread more widely. This effect is most significant for the 4th Circuit and 9th Circuit corpora, as these have much larger "documents" than the other corpora. We observe a similar effect for corpus size in Figure 5. The smaller corpus shows a larger spread in standard deviation than the larger corpus, indicating greater variability.

Figure 3: The change in rank across runs of the BOOTSTRAP setting for the top 10 words. We show results for both a single query, men, and an aggregate of all the queries, showing the change in rank of the words whose average ranking falls within the 10 nearest neighbors of those queries. Results are shown for SGNS on the AskHistorians corpus and the sentence document length.

Finally, we find that the variance usually stabilizes at about 25 runs of the BOOTSTRAP setting. Figure 6 shows that variability initially increases with the number of models trained. We observe this pattern across corpora, algorithms, and settings.


Figure 4: Standard deviation of the cosine similarities between all rank-N words and their 10 nearest neighbors. Results are shown for different document sizes (sentence vs. whole document) in the BOOTSTRAP setting for SGNS in the 4th Circuit corpus.

Figure 5: Standard deviation of the cosine similarities between all rank-N words and their 10 nearest neighbors. Results are shown at different corpus sizes (20% vs. 100% of documents) in the BOOTSTRAP setting for SGNS in the 4th Circuit corpus, segmented by sentence.

8 Discussion

The most obvious result of our experiments is to emphasize that embeddings are not even a single objective view of a corpus, much less an objective view of language. The corpus is itself only a sample, and we have shown that the curation of this sample (its size, document length, and inclusion of specific documents) can cause significant variability in the embeddings. Happily, this variability can be quantified by averaging results over multiple bootstrap samples.

We can make several specific observations about algorithm sensitivities. In general, LSA, GloVe, SGNS, and PPMI are not sensitive to document order in the collections we evaluated. This is surprising, as we had expected SGNS to be sensitive to document order; anecdotally, we had observed cases where the embeddings were affected by groups of documents (e.g. in a different language) at the beginning of training. However, all four algorithms are sensitive to the presence of specific documents, though this effect is weaker for PPMI.

Figure 6: The mean of the standard deviation of the cosine similarities between each query term and its 20 nearest neighbors. Results are shown for different numbers of runs of the BOOTSTRAP setting on the 4th Circuit corpus.

Although PPMI appears deterministic (due to its pre-computed word-context matrix), we find that this algorithm produced results under the FIXED ordering whose variability was closest to the BOOTSTRAP setting. We attribute this intrinsic variability to the use of token-level subsampling. This sampling method introduces variation into the source corpus that appears to be comparable to a bootstrap resampling method. Sampling in PPMI is inspired by a similar method in the word2vec implementation of SGNS (Levy et al., 2015). It is therefore surprising that SGNS shows noticeable differentiation between the BOOTSTRAP setting on the one hand and the FIXED and SHUFFLED settings on the other.

The use of embeddings as sources of evidence needs to be tempered with the understanding that fine-grained distinctions between cosine similarities are not reliable and that smaller corpora and longer documents are more susceptible to variation in the cosine similarities between embeddings. When studying the top-N most similar words to a query, it is important to account for variation in these lists, as both rank and membership can significantly change across runs. Therefore, we emphasize that with smaller corpora comes greater variability, and we recommend that practitioners use bootstrap sampling to generate an ensemble of word embeddings for each sub-corpus and present both the mean and variability of any summary statistics such as ordered word similarities.

We leave for future work a full hyperparameter sweep for the three algorithms. While these hyperparameters can substantially impact performance, our goal with this work was not to achieve high performance but to examine how the algorithms respond to changes in the corpus. We make no claim that one algorithm is better than another.

9 Conclusion

We find that there are several sources of variability in cosine similarities between word embedding vectors. The size of the corpus, the length of individual documents, and the presence or absence of specific documents can all affect the resulting embeddings. While differences in word association are measurable and are often significant, small differences in cosine similarity are not reliable, especially for small corpora. If the intention of a study is to learn about a specific corpus, we recommend that practitioners test the statistical confidence of similarities based on word embeddings by training on multiple bootstrap samples.

10 Acknowledgements

This work was supported by NSF 1526155, 1652536, and the Alfred P. Sloan Foundation. We would like to thank Alexandra Schofield, Laure Thompson, and our anonymous reviewers for their helpful comments.

References

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS, pages 4349–4357.

Leon Bottou. 2012. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer.

Andreas Broscheid. 2011. Comparing circuits: Are some U.S. courts of appeals more liberal or conservative than others? Law & Society Review, 45(1), March.

Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith. 2015. The media frames corpus: Annotations of frames across issues. In ACL.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Colin Cherry and Hongyu Guo. 2015. The unreasonable effectiveness of word representations for Twitter named entity recognition. In HLT-NAACL, pages 735–745.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In HLT-NAACL, pages 1606–1615.

Anna Gladkova and Aleksandr Drozd. 2016. Intrinsic evaluations of word embeddings: What can we do better? In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 36–42.

Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2009. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In ACL.

Johannes Hellrich and Udo Hahn. 2016. Bad company—neighborhoods in neural embedding spaces considered harmful. In COLING.

Ryan Heuser. 2016. Word vectors in the eighteenth century. In IPAM Workshop: Cultural Analytics.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science.

Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2016. Freshman or fresher? Quantifying the geographic variation of language in online social media. In ICWSM, pages 615–618.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL, 3:211–225.

Rasmus E. Madsen, David Kauchak, and Charles Elkan. 2005. Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning, pages 545–552. ACM.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Lawrence Phillips, Kyle Shaffer, Dustin Arendt, Nathan Hodas, and Svitlana Volkova. 2017. Intrinsic and extrinsic evaluation of spatiotemporal text representations in Twitter streams. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 201–210.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA.

Evan Sandhaus. 2008. The New York Times Annotated Corpus.

Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016. Swivel: Improving embeddings by noticing what's missing.

Yingtao Tian, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2016. On the convergent properties of word embedding methods. arXiv preprint arXiv:1605.03956.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the ACL, pages 384–394. Association for Computational Linguistics.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Ivan Vulic and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of the ACL, pages 719–725. ACL.

Justine Zhang, William L. Hamilton, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky, and Jure Leskovec. 2017. Community identity and user engagement in a multi-community landscape. In Proceedings of ICWSM.