
Dynamic Word Embeddings

Robert Bamler 1 Stephan Mandt 1

Abstract

We present a probabilistic language model for time-stamped text data which tracks the semantic evolution of individual words over time. The model represents words and contexts by latent trajectories in an embedding space. At each moment in time, the embedding vectors are inferred from a probabilistic version of word2vec (Mikolov et al., 2013b). These embedding vectors are connected in time through a latent diffusion process. We describe two scalable variational inference algorithms—skip-gram smoothing and skip-gram filtering—that allow us to train the model jointly over all times; thus learning on all data while simultaneously allowing word and context vectors to drift. Experimental results on three different corpora demonstrate that our dynamic model infers word embedding trajectories that are more interpretable and lead to higher predictive likelihoods than competing methods that are based on static models trained separately on time slices.

1. Introduction

Language evolves over time and words change their meaning due to cultural shifts, technological inventions, or political events. We consider the problem of detecting shifts in the meaning and usage of words over a given time span based on text data. Capturing these semantic shifts requires a dynamic language model.

Word embeddings are a powerful tool for modeling semantic relations between individual words (Bengio et al., 2003; Mikolov et al., 2013a; Pennington et al., 2014; Mnih & Kavukcuoglu, 2013; Levy & Goldberg, 2014; Vilnis & McCallum, 2014; Rudolph et al., 2016).

¹ Disney Research, 4720 Forbes Avenue, Pittsburgh, PA 15213, USA. Correspondence to: Robert Bamler <[email protected]>, Stephan Mandt <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Word embeddings model the distribution of words based on their surrounding words in a training corpus, and summarize these statistics in terms of low-dimensional vector representations. Geometric distances between word vectors reflect semantic similarity (Mikolov et al., 2013a) and difference vectors encode semantic and syntactic relations (Mikolov et al., 2013c), which shows that they are sensible representations of language. Pre-trained word embeddings are useful for various supervised tasks, including sentiment analysis (Socher et al., 2013b), semantic parsing (Socher et al., 2013a), and computer vision (Fu & Sigal, 2016). As unsupervised models, they have also been used for the exploration of word analogies and linguistics (Mikolov et al., 2013c).

Word embeddings are currently formulated as static models, which assumes that the meaning of any given word is the same across the entire text corpus. In this paper, we propose a generalization of word embeddings to sequential data, such as corpora of historic texts or streams of text in social media.

Current approaches to learning word embeddings in a dynamic context rely on grouping the data into time bins and training the embeddings separately on these bins (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016). This approach, however, raises three fundamental problems. First, since word embedding models are non-convex, training them twice on the same data will lead to different results. Thus, embedding vectors at successive times can only be approximately related to each other, and only if the embedding dimension is large (Hamilton et al., 2016). Second, dividing a corpus into separate time bins may lead to training sets that are too small to train a word embedding model. Hence, one runs the risk of overfitting to the small amount of data per bin whenever the required temporal resolution is fine-grained, as we show in the experimental section. Third, due to the finite corpus size, the learned word embedding vectors are subject to random noise. It is difficult to disambiguate this noise from systematic semantic drifts between subsequent times, in particular over short time spans, where we expect only minor semantic drift.

In this paper, we circumvent these problems by introducing a dynamic word embedding model. Our contributions are as follows:

arXiv:1702.08359v2 [stat.ML] 17 Jul 2017


[Figure 1: ten panels of cosine distance (0.0–1.0) vs. date (1850–2000) for the words 1. marginal, 2. computer, 3. versus, 4. commitment, 5. radio, 6. potential, 7. notably, 8. peer, 9. significantly, and 10. software, each showing the trajectories of the five closest words at the beginning and end of the time span.]

Figure 1. Evolution of the 10 words that changed the most in cosine distance from 1850 to 2008 on Google books, using skip-gram filtering (proposed). Red (blue) curves correspond to the five closest words at the beginning (end) of the time span, respectively.

• We derive a probabilistic state space model where word and context embeddings evolve in time according to a diffusion process. It generalizes the skip-gram model (Mikolov et al., 2013b; Barkan, 2017) to a dynamic setup, which allows end-to-end training. This leads to continuous embedding trajectories, smoothes out noise in the word-context statistics, and allows us to share information across all times.

• We propose two scalable black-box variational inference algorithms (Ranganath et al., 2014; Rezende et al., 2014) for filtering and smoothing. These algorithms find word embeddings that generalize better to held-out data. Our smoothing algorithm carries out efficient black-box variational inference for structured Gaussian variational distributions with tridiagonal precision matrices, and applies more broadly.

• We analyze three massive text corpora that span over long periods of time. Our approach allows us to automatically find the words whose meaning changes the most. It results in smooth word embedding trajectories and therefore allows us to measure and visualize the continuous dynamics of the entire embedding cloud as it deforms over time.

Figure 1 exemplifies our method. The plot shows a fit of our dynamic skip-gram model to Google books (we give details in section 5). We show the ten words whose meaning changed most drastically in terms of cosine distance over the last 150 years. We thereby automatically discover words such as “computer” or “radio” whose meaning changed due to technological advances, but also words like “peer” and “notably” whose semantic shift is less obvious.

Our paper is structured as follows. In section 2 we discuss related work, and we introduce our model in section 3. In section 4 we present two efficient variational inference algorithms for our dynamic model. We show experimental results in section 5. Section 6 summarizes our findings.

2. Related Work

Probabilistic models that have been extended to latent time series models are ubiquitous (Blei & Lafferty, 2006; Wang et al., 2008; Sahoo et al., 2012; Gultekin & Paisley, 2014; Charlin et al., 2015; Ranganath et al., 2015; Jerfel et al., 2017), but none of them relate to word embeddings. The closest of these models is the dynamic topic model (Blei & Lafferty, 2006; Wang et al., 2008), which learns the evolution of latent topics over time. Topic models are based on bag-of-words representations and thus treat words as symbols without modelling their semantic relations. They therefore serve a different purpose.

Mikolov et al. (2013a;b) proposed the skip-gram model with negative sampling (word2vec) as a scalable word embedding approach that relies on stochastic gradient descent. This approach has been formulated in a Bayesian setup (Barkan, 2017), which we discuss separately in section 3.1. These models, however, do not allow the word embedding vectors to change over time.

Several authors have analyzed different statistics of text data to study semantic changes of words over time (Mihalcea & Nastase, 2012; Sagi et al., 2011; Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016). None of them explicitly model a dynamical process; instead, they slice the data into different time bins, fit the model separately on each bin, and further analyze the embedding vectors in post-processing. By construction, these static models can therefore not share statistical strength across time. This limits the applicability of static models to very large corpora.

Figure 2. a) Bayesian skip-gram model (Barkan, 2017). b) The dynamic skip-gram model (proposed) connects T copies of the Bayesian skip-gram model via a latent time series prior on the embeddings.

Most related to our approach are methods based on wordembeddings. Kim et al. (2014) fit word2vec separately ondifferent time bins, where the word vectors obtained forthe previous bin are used to initialize the algorithm for thenext time bin. The bins have to be sufficiently large and thefound trajectories are not as smooth as ours, as we demon-strate in this paper. Hamilton et al. (2016) also trainedword2vec separately on several large corpora from differ-ent decades. If the embedding dimension is large enough(and hence the optimization problem less non-convex), theauthors argue that word embeddings at nearby times ap-proximately differ by a global rotation in addition to a smallsemantic drift, and they approximately compute this ro-tation. As the latter does not exist in a strict sense, it isdifficult to distinguish artifacts of the approximate rotationfrom a true semantic drift. As discussed in this paper, bothvariants result in trajectories which are noisier.1

3. Model

We propose the dynamic skip-gram model, a generalization of the skip-gram model (word2vec) (Mikolov et al., 2013b) to sequential text data. The model finds word embedding vectors that continuously drift over time, allowing us to track changes in language and word usage over short and long periods of time. Dynamic skip-gram is a probabilistic model which combines a Bayesian version of the skip-gram model (Barkan, 2017) with a latent time series. It is jointly trained end-to-end and scales to massive data by means of approximate Bayesian inference.

¹ Rudolph & Blei (2017) independently developed a similar model, using a different likelihood model. Their approach uses a non-Bayesian treatment of the latent embedding trajectories, which makes the approach less robust to noise when the data per time step is small.

The observed data consist of sequences of words from a finite vocabulary of size $L$. In section 3.1, all sequences (sentences from books, articles, or tweets) are considered time-independent; in section 3.2, they will be associated with different time stamps. The goal is to maximize the probability of every word that occurs in the data given its surrounding words within a so-called context window. As detailed below, the model learns two vectors $u_i, v_i \in \mathbb{R}^d$ for each word $i$ in the vocabulary, where $d$ is the embedding dimension. We refer to $u_i$ as the word embedding vector and to $v_i$ as the context embedding vector.

3.1. Bayesian Skip-Gram Model

The Bayesian skip-gram model (Barkan, 2017) is a probabilistic version of word2vec (Mikolov et al., 2013b) and forms the basis of our approach. The graphical model is shown in Figure 2a). For each pair of words $i, j$ in the vocabulary, the model assigns probabilities that word $i$ appears in the context of word $j$. This probability is $\sigma(u_i^\top v_j)$ with the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$. Let $z_{ij} \in \{0, 1\}$ be an indicator variable that denotes a draw from that probability distribution, hence $p(z_{ij} = 1) = \sigma(u_i^\top v_j)$. The generative model assumes that many word-word pairs $(i, j)$ are uniformly drawn from the vocabulary and tested for being a word-context pair; hence a separate random indicator $z_{ij}$ is associated with each drawn pair.

Focusing on words and their neighbors in a context window, we collect evidence of word-word pairs for which $z_{ij} = 1$. These are called the positive examples. Denote by $n^+_{ij}$ the number of times that a word-context pair $(i, j)$ is observed in the corpus. This is a sufficient statistic of the model, and its contribution to the likelihood is $p(n^+_{ij} \mid u_i, v_j) = \sigma(u_i^\top v_j)^{n^+_{ij}}$. However, the generative process also assumes the possibility to reject word-word pairs if $z_{ij} = 0$. Thus, one needs to construct a fictitious second training set of rejected word-word pairs, called negative examples. Let the corresponding counts be $n^-_{ij}$. The total likelihood of both positive and negative examples is then

$$p(n^+, n^- \mid U, V) = \prod_{i,j=1}^{L} \sigma(u_i^\top v_j)^{n^+_{ij}} \, \sigma(-u_i^\top v_j)^{n^-_{ij}}. \qquad (1)$$

Above we used the antisymmetry $\sigma(-x) = 1 - \sigma(x)$. In our notation, dropping the subscript indices of $n^+$ and $n^-$ denotes the entire $L \times L$ matrices, $U = (u_1, \cdots, u_L) \in \mathbb{R}^{d \times L}$ is the matrix of all word embedding vectors, and $V$ is defined analogously for the context vectors. To construct negative examples, one typically chooses $n^-_{ij} \propto P(i)\, P(j)^{3/4}$ (Mikolov et al., 2013b), where $P(i)$ is the frequency of word $i$ in the training corpus. Thus, $n^-$ is well-defined up to a constant factor which has to be tuned.

Defining $n^\pm = (n^+, n^-)$ as the combination of both positive and negative examples, the resulting log likelihood is

$$\log p(n^\pm \mid U, V) = \sum_{i,j=1}^{L} \left( n^+_{ij} \log \sigma(u_i^\top v_j) + n^-_{ij} \log \sigma(-u_i^\top v_j) \right). \qquad (2)$$

This is exactly the objective of the (non-Bayesian) skip-gram model, see (Mikolov et al., 2013b). The count matrices $n^+$ and $n^-$ are either pre-computed for the entire corpus, or estimated based on stochastic subsamples from the data in a sequential way, as done by word2vec. Barkan (2017) gives an approximate Bayesian treatment of the model with Gaussian priors on the embeddings.
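To make the likelihood concrete, here is a minimal NumPy sketch (ours, not the authors' code) that evaluates the log likelihood of Eq. (2) for given embedding and count matrices; the variable names `n_pos`, `n_neg`, `U`, and `V` are our own.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log sigma(x) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def skip_gram_log_likelihood(U, V, n_pos, n_neg):
    """Eq. (2): sum_ij [ n+_ij log sigma(u_i^T v_j) + n-_ij log sigma(-u_i^T v_j) ].

    U, V         : d x L matrices of word and context embeddings.
    n_pos, n_neg : L x L count matrices of positive and negative examples.
    """
    scores = U.T @ V  # L x L matrix of inner products u_i^T v_j
    return np.sum(n_pos * log_sigmoid(scores) + n_neg * log_sigmoid(-scores))

# Toy usage with random counts and embeddings (L = 5 words, d = 3 dimensions).
rng = np.random.default_rng(0)
L, d = 5, 3
U, V = rng.normal(size=(d, L)), rng.normal(size=(d, L))
n_pos = rng.integers(0, 10, size=(L, L)).astype(float)
n_neg = np.full((L, L), n_pos.sum() / L**2)  # roughly eta = 1; illustrative only
print(skip_gram_log_likelihood(U, V, n_pos, n_neg))
```

The same quantity, divided by the total number of word-context pairs, is later used as the held-out predictive score in Eq. 16.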

3.2. Dynamic Skip-Gram Model

The key extension of our approach is to use a Kalman filter as a prior for the time-evolution of the latent embeddings (Welch & Bishop, 1995). This allows us to share information across all times while still allowing the embeddings to drift.

Notation. We consider a corpus of $T$ documents which were written at time stamps $\tau_1 < \ldots < \tau_T$. For each time step $t \in \{1, \ldots, T\}$, the sufficient statistics of word-context pairs are encoded in the $L \times L$ matrices $n^+_t, n^-_t$ of positive and negative counts with matrix elements $n^+_{ij,t}$ and $n^-_{ij,t}$, respectively. Denote by $U_t = (u_{1,t}, \cdots, u_{L,t}) \in \mathbb{R}^{d \times L}$ the matrix of word embeddings at time $t$, and define $V_t$ correspondingly for the context vectors. Let $U, V \in \mathbb{R}^{T \times d \times L}$ denote the tensors of word and context embeddings across all times, respectively.

Model. The graphical model is shown in Figure 2b). We consider a diffusion process of the embedding vectors over time. The variance $\sigma_t^2$ of the transition kernel is

$$\sigma_t^2 = D\,(\tau_{t+1} - \tau_t), \qquad (3)$$

where $D$ is a global diffusion constant and $(\tau_{t+1} - \tau_t)$ is the time between subsequent observations (Welch & Bishop, 1995). At every time step $t$, we add an additional Gaussian prior with zero mean and variance $\sigma_0^2$, which prevents the embedding vectors from growing very large; thus

$$p(U_{t+1} \mid U_t) \propto \mathcal{N}(U_t, \sigma_t^2)\, \mathcal{N}(0, \sigma_0^2). \qquad (4)$$

Computing the normalization, this results in

$$U_{t+1} \mid U_t \sim \mathcal{N}\!\left( \frac{U_t}{1 + \sigma_t^2/\sigma_0^2},\; \frac{1}{\sigma_t^{-2} + \sigma_0^{-2}}\, I \right), \qquad (5)$$

$$V_{t+1} \mid V_t \sim \mathcal{N}\!\left( \frac{V_t}{1 + \sigma_t^2/\sigma_0^2},\; \frac{1}{\sigma_t^{-2} + \sigma_0^{-2}}\, I \right). \qquad (6)$$

In practice, $\sigma_0 \gg \sigma_t$, so the damping to the origin is very weak. This is also called an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930). We recover the Wiener process for $\sigma_0 \to \infty$, but $\sigma_0 < \infty$ prevents the latent time series from diverging to infinity. At time index $t = 1$, we define $p(U_1 \mid U_0) \equiv \mathcal{N}(0, \sigma_0^2 I)$ and do the same for $V_1$.
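As a concrete illustration of the transition kernel, the following sketch (ours) draws one step of Eqs. (3) and (5) for a single embedding vector; the hyperparameter values are only examples and mirror the choices reported in section 5.

```python
import numpy as np

def ou_transition_sample(u_t, dt, D=1e-3, sigma0_sq=1.0, rng=np.random.default_rng()):
    """Sample u_{t+1} | u_t from the Ornstein-Uhlenbeck transition of Eqs. (3)-(5).

    u_t       : current embedding vector, shape (d,)
    dt        : time between observations, tau_{t+1} - tau_t
    D         : global diffusion constant (Eq. 3)
    sigma0_sq : variance sigma_0^2 of the overall zero-mean prior
    """
    sigma_t_sq = D * dt                               # Eq. (3)
    mean = u_t / (1.0 + sigma_t_sq / sigma0_sq)       # weak shrinkage toward the origin
    var = 1.0 / (1.0 / sigma_t_sq + 1.0 / sigma0_sq)  # Eq. (5), diagonal covariance
    return mean + np.sqrt(var) * rng.normal(size=u_t.shape)

# A short prior trajectory for one 2-dimensional embedding with yearly time steps.
u = np.zeros(2)
for year in range(5):
    u = ou_transition_sample(u, dt=1.0)
```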

Our joint distribution factorizes as follows:

$$p(n^\pm, U, V) = \prod_{t=0}^{T-1} p(U_{t+1} \mid U_t)\, p(V_{t+1} \mid V_t) \;\times\; \prod_{t=1}^{T} \prod_{i,j=1}^{L} p(n^\pm_{ij,t} \mid u_{i,t}, v_{j,t}). \qquad (7)$$

The prior model enforces that the model learns embedding vectors which vary smoothly across time. This allows us to associate words unambiguously with each other and to detect semantic changes. The model efficiently shares information across the time domain, which allows us to fit the model even in setups where the data at every given point in time are small, as long as the data in total are large.

4. Inference

We discuss two scalable approximate inference algorithms. Filtering uses only information from the past; it is required in streaming applications where the data are revealed to us sequentially. Smoothing is the other inference method, which learns better embeddings but requires the full sequence of documents ahead of time.

In Bayesian inference, we start by formulating a joint distribution (Eq. 7) over observations $n^\pm$ and parameters $U$ and $V$, and we are interested in the posterior distribution over parameters conditioned on observations,

$$p(U, V \mid n^\pm) = \frac{p(n^\pm, U, V)}{\int p(n^\pm, U, V)\, dU\, dV}. \qquad (8)$$

The problem is that the normalization is intractable. In variational inference (VI) (Jordan et al., 1999; Blei et al., 2016), one sidesteps this problem and approximates the posterior with a simpler variational distribution $q_\lambda(U, V)$ by minimizing the Kullback-Leibler (KL) divergence to the posterior. Here, $\lambda$ summarizes all parameters of the variational distribution, such as the means and variances of a Gaussian, see below. Minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO) (Blei et al., 2016),

$$\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}[\log p(n^\pm, U, V)] - \mathbb{E}_{q_\lambda}[\log q_\lambda(U, V)]. \qquad (9)$$

For a restricted class of models, the ELBO can be computed in closed-form (Hoffman et al., 2013). Our model is non-conjugate and requires instead black-box VI using the reparameterization trick (Rezende et al., 2014; Kingma & Welling, 2014).
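For readers unfamiliar with the reparameterization trick, the following generic sketch (ours, not specific to the model) estimates the gradient of a Gaussian expectation by differentiating through $x = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; the model-specific estimators are given in the supplement.

```python
import numpy as np

def reparam_gradients(f_grad, mu, sigma, n_samples=100, rng=np.random.default_rng(1)):
    """Reparameterization estimate of d/d(mu, sigma) E_{x ~ N(mu, sigma^2)}[f(x)].

    f_grad : callable returning the gradient of f at x (same shape as x).
    With x = mu + sigma * eps and eps ~ N(0, I):
        dE[f]/dmu    ~ mean over samples of f'(x)
        dE[f]/dsigma ~ mean over samples of f'(x) * eps
    """
    eps = rng.normal(size=(n_samples,) + mu.shape)
    x = mu + sigma * eps
    g = np.array([f_grad(xi) for xi in x])
    return g.mean(axis=0), (g * eps).mean(axis=0)

# Example: f(x) = -0.5 * ||x||^2, so f'(x) = -x; the exact gradients are -mu and -sigma.
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 0.5])
grad_mu, grad_sigma = reparam_gradients(lambda x: -x, mu, sigma)
```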

4.1. Skip-Gram Filtering

In many applications such as streaming, the data arrive sequentially. Thus, we can only condition our model on past and not on future observations. We will first describe inference in such a (Kalman) filtering setup (Kalman et al., 1960; Welch & Bishop, 1995).

In the filtering scenario, the inference algorithm iteratively updates the variational distribution $q$ as evidence from each time step $t$ becomes available. We thereby use a variational distribution that factorizes across all times, $q(U, V) = \prod_{t=1}^{T} q(U_t, V_t)$, and we update the variational factor at a given time $t$ based on the evidence at time $t$ and the approximate posterior of the previous time step. Furthermore, at every time $t$ we use a fully factorized distribution:

$$q(U_t, V_t) = \prod_{i=1}^{L} \mathcal{N}(u_{i,t}; \mu_{u_{i,t}}, \Sigma_{u_{i,t}})\, \mathcal{N}(v_{i,t}; \mu_{v_{i,t}}, \Sigma_{v_{i,t}}).$$

The variational parameters are the means $\mu_{u_{i,t}}, \mu_{v_{i,t}} \in \mathbb{R}^d$ and the covariance matrices $\Sigma_{u_{i,t}}$ and $\Sigma_{v_{i,t}}$, which we restrict to be diagonal (mean-field approximation).

We now describe how we sequentially compute $q(U_t, V_t)$ and use the result to proceed to the next time step. As in other Markovian dynamical systems, our model assumes the following recursion,

$$p(U_t, V_t \mid n^\pm_{1:t}) \propto p(n^\pm_t \mid U_t, V_t)\, p(U_t, V_t \mid n^\pm_{1:t-1}). \qquad (10)$$

Within our variational approximation, the ELBO (Eq. 9) therefore separates into a sum of $T$ terms, $\mathcal{L} = \sum_t \mathcal{L}_t$, with

$$\mathcal{L}_t = \mathbb{E}[\log p(n^\pm_t \mid U_t, V_t)] + \mathbb{E}[\log p(U_t, V_t \mid n^\pm_{1:t-1})] - \mathbb{E}[\log q(U_t, V_t)], \qquad (11)$$

where all expectations are taken under $q(U_t, V_t)$. We compute the entropy term $-\mathbb{E}[\log q]$ in Eq. 11 analytically and estimate the gradient of the log likelihood by sampling from the variational distribution and using the reparameterization trick (Kingma & Welling, 2014; Salimans & Kingma, 2016). However, the second term of Eq. 11, containing the prior at time $t$, is still intractable. We approximate the prior as

$$p(U_t, V_t \mid n^\pm_{1:t-1}) \equiv \mathbb{E}_{p(U_{t-1}, V_{t-1} \mid n^\pm_{1:t-1})}\!\left[ p(U_t, V_t \mid U_{t-1}, V_{t-1}) \right] \approx \mathbb{E}_{q(U_{t-1}, V_{t-1})}\!\left[ p(U_t, V_t \mid U_{t-1}, V_{t-1}) \right]. \qquad (12)$$

The remaining expectation involves only Gaussians and can be carried out analytically. The resulting approximate prior is a fully factorized distribution $p(U_t, V_t \mid n^\pm_{1:t-1}) \approx \prod_{i=1}^{L} \mathcal{N}(u_{i,t}; \tilde\mu_{u_{i,t}}, \tilde\Sigma_{u_{i,t}})\, \mathcal{N}(v_{i,t}; \tilde\mu_{v_{i,t}}, \tilde\Sigma_{v_{i,t}})$, where we use tildes to distinguish the propagated prior parameters from the variational parameters, with

$$\tilde\mu_{u_{i,t}} = \tilde\Sigma_{u_{i,t}} \left( \Sigma_{u_{i,t-1}} + \sigma_t^2 I \right)^{-1} \mu_{u_{i,t-1}}; \qquad \tilde\Sigma_{u_{i,t}} = \left[ \left( \Sigma_{u_{i,t-1}} + \sigma_t^2 I \right)^{-1} + (1/\sigma_0^2)\, I \right]^{-1}. \qquad (13)$$

Analogous update equations hold for $\tilde\mu_{v_{i,t}}$ and $\tilde\Sigma_{v_{i,t}}$. Thus, the second contribution in Eq. 11 (the prior) yields a closed-form expression. We can therefore compute its gradient.
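Because all covariances in the filtering algorithm are diagonal, the propagation step of Eq. 13 reduces to elementwise operations. A minimal sketch (ours), with diagonal covariances stored as vectors and the tilde denoting the propagated prior as above:

```python
import numpy as np

def propagate_prior(mu_prev, var_prev, sigma_t_sq, sigma0_sq=1.0):
    """Eq. 13: propagate the variational posterior of time t-1 to a prior for time t.

    mu_prev, var_prev : mean and diagonal covariance of q(u_{i,t-1}), shape (d,)
    sigma_t_sq        : transition variance sigma_t^2 = D * (tau_t - tau_{t-1})
    Returns the mean and diagonal covariance of the approximate prior at time t.
    """
    var_pred = var_prev + sigma_t_sq                        # diffuse the previous posterior
    var_tilde = 1.0 / (1.0 / var_pred + 1.0 / sigma0_sq)    # damping toward the origin
    mu_tilde = var_tilde * mu_prev / var_pred               # Eq. 13, first line
    return mu_tilde, var_tilde

# One propagation step, starting from the initialization used in Algorithm 1.
mu, var = np.zeros(3), np.ones(3)
mu, var = propagate_prior(mu, var, sigma_t_sq=1e-3)
```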

4.2. Skip-Gram Smoothing

In contrast to filtering, where inference is conditioned on past observations until a given time $t$, (Kalman) smoothing performs inference based on the entire sequence of observations $n^\pm_{1:T}$. This approach results in smoother trajectories and typically higher likelihoods than with filtering, because evidence is used from both future and past observations.

Besides the new inference scheme, we also use a different variational distribution. As the model is fitted jointly to all time steps, we are no longer restricted to a variational distribution that factorizes in time. For simplicity, we focus here on the variational distribution for the word embeddings $U$; the context embeddings $V$ are treated identically. We use a factorized distribution over both embedding space and vocabulary space,

$$q(U_{1:T}) = \prod_{i=1}^{L} \prod_{k=1}^{d} q(u_{ik,1:T}). \qquad (14)$$

In the time domain, our variational approximation is structured. To simplify the notation, we now drop the indices for words $i$ and embedding dimension $k$; hence we write $q(u_{1:T})$ for $q(u_{ik,1:T})$, where we focus on a single factor. This factor is a multivariate Gaussian distribution in the time domain with tridiagonal precision matrix $\Lambda$,

$$q(u_{1:T}) = \mathcal{N}(\mu, \Lambda^{-1}). \qquad (15)$$

Both the means $\mu = \mu_{1:T}$ and the entries of the tridiagonal precision matrix $\Lambda \in \mathbb{R}^{T \times T}$ are variational parameters. This gives our variational distribution the interpretation of a posterior of a Kalman filter (Blei & Lafferty, 2006), which captures correlations in time.

We fit the variational parameters by training the model jointly on all time steps, using black-box VI and the reparameterization trick. As the computational complexity of an update step scales as $\Theta(L^2)$, we first pretrain the model by drawing minibatches of $L' < L$ random words and $L'$ random contexts from the vocabulary (Hoffman et al., 2013). We then switch to the full batch to reduce the sampling noise. Since the variational distribution does not factorize in the time domain, we always include all time steps $1, \ldots, T$ in the minibatch.


We also derive an efficient algorithm that allows us to estimate the reparametrization gradient using $\Theta(T)$ time and memory, while a naive implementation of black-box variational inference with our structured variational distribution would require $\Theta(T^2)$ of both resources. The main idea is to parametrize $\Lambda = B^\top B$ in terms of its Cholesky decomposition $B$, which is bidiagonal (Kılıç & Stanica, 2013), and to express gradients of $B^{-1}$ in terms of gradients of $B$. We use mirror ascent (Ben-Tal et al., 2001; Beck & Teboulle, 2003) to enforce positive definiteness of $B$. The algorithm is detailed in our supplementary material.

5. Experiments

We evaluate our method on three time-stamped text corpora. We demonstrate that our algorithms find smoother embedding trajectories than methods based on a static model. This allows us to track semantic changes of individual words by following nearest-neighbor relations over time. In our quantitative analysis, we find higher predictive likelihoods on held-out data compared to our baselines.

Algorithms and Baselines. We report results from our proposed algorithms from section 4 and compare against baselines from section 2:

• SGI denotes the non-Bayesian skip-gram model with independent random initializations of word vectors (Mikolov et al., 2013b). We used our own implementation of the model by dropping the Kalman filtering prior and point-estimating embedding vectors. Word vectors at nearby times are made comparable by approximate orthogonal transformations, which corresponds to Hamilton et al. (2016).

• SGP denotes the same approach as above, but with word and context vectors being pre-initialized with the values from the previous year, as in Kim et al. (2014).

• DSG-F: dynamic skip-gram filtering (proposed).

• DSG-S: dynamic skip-gram smoothing (proposed).

Data and preprocessing. Our three corpora exemplify opposite limits both in the covered time span and in the amount of text per time step.

1. We used data from the Google books corpus² (Michel et al., 2011) from the last two centuries ($T = 209$). This amounts to 5 million digitized books and approximately $10^{10}$ observed words. The corpus consists of $n$-gram tables with $n \in \{1, \ldots, 5\}$, annotated by year of publication. We considered the years from 1800 to 2008 (the latest available). In 1800, the size of the data is approximately $7 \cdot 10^7$ words. We used the 5-gram counts, resulting in a context window size of 4.

² http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

2. We used the “State of the Union” (SoU) addresses of U.S. presidents, which span more than two centuries, resulting in $T = 230$ different time steps and approximately $10^6$ observed words.³ Some presidents gave both a written and an oral address; if these were less than a week apart, we concatenated them and used the average date. We converted all words to lower case and constructed the positive sample counts $n^+_{ij}$ using a context window size of 4.

³ http://www.presidency.ucsb.edu/sou.php

3. We used a Twitter corpus of news tweets for 21 randomly drawn dates from 2010 to 2016. The median number of tweets per day is 722. We converted all tweets to lower case and used a context window size of 4, which we restricted to stay within single tweets.

Hyperparameters. The vocabulary for each corpus was constructed from the 10,000 most frequent words throughout the given time period. In the Google books corpus, the number of words per year grows by a factor of 200 from the year 1800 to 2008. To prevent the vocabulary from being dominated by modern words, we normalized the word frequencies separately for each year before adding them up.

For the Google books corpus, we chose the embedding dimension $d = 200$, which was also used in Kim et al. (2014). We set $d = 100$ for SoU and Twitter, as $d = 200$ resulted in overfitting on these much smaller corpora. The ratio $\eta = \sum_{ij} n^-_{ij,t} / \sum_{ij} n^+_{ij,t}$ of negative to positive word-context pairs was $\eta = 1$. The precise construction of the matrices $n^\pm_t$ is explained in the supplementary material. We used the global prior variance $\sigma_0^2 = 1$ for all corpora and all algorithms, including the baselines. The diffusion constant $D$ controls the time scale on which information is shared between time steps. The optimal value for $D$ depends on the application: a single corpus may exhibit semantic shifts of words on different time scales, and the optimal choice for $D$ depends on the time scale in which one is interested. We used $D = 10^{-3}$ per year for Google books and SoU, and $D = 1$ per year for the Twitter corpus, which spans a much shorter time range. In the supplementary material, we provide details of the optimization procedure.

Qualitative results. We show that our approach results in smooth word embedding trajectories on all three corpora. We can automatically detect words that undergo significant semantic changes over time.

Figure 1 in the introduction shows a fit of the dynamic skip-gram filtering algorithm to the Google books corpus. Here, we show the ten words whose word vectors change most drastically over the last 150 years in terms of cosine distance. Figure 3 visualizes word embedding clouds over four subsequent years of Google books, where we compare DSG-F against SGI. We mapped the normalized embedding vectors to two dimensions using dynamic t-SNE (Rauber et al., 2016) (see supplement for details). Lines indicate shifts of word vectors relative to the preceding year. In our model, only few words change their position in the embedding space rapidly, while embeddings using SGI show strong fluctuations, making the cloud's motion hard to track.

Figure 3. Word embeddings over a sequence of years trained on Google books, using DSG-F (proposed, top row) and compared to the static method by Hamilton et al. (2016) (bottom). We used dynamic t-SNE (Rauber et al., 2016) for dimensionality reduction. Colored lines in the second to fourth column indicate the trajectories from the previous year. Our method infers smoother trajectories with only few words that move quickly. Figure 4 shows that these effects persist in the original embedding space.

Figure 4 visualizes the smoothness of the trajectories directly in the embedding space (without the projection to two dimensions). We consider differences between word vectors in the year 1998 and the subsequent 10 years. In more detail, we compute histograms of the Euclidean distances $\|u_{i,t} - u_{i,t+\delta}\|$ over the word indexes $i$, where $\delta = 1, \ldots, 10$ (as discussed previously, SGI uses a global rotation to optimally align embeddings first). In our model, embedding vectors gradually move away from their original position as time progresses, indicating a directed motion. In contrast, both baseline models show only little directed motion after the first time step, suggesting that most temporal changes are due to finite-size fluctuations of $n^\pm_{ij,t}$. Initialization schemes alone, thus, seem to have a minor effect on smoothness.

Figure 4. Histogram of distances between word vectors in the year 1998 and their positions in subsequent years (colors). DSG-F (top panel) displays a continuous growth of these distances over time, indicating a directed motion. In contrast, in SGP (middle) (Kim et al., 2014) and SGI (bottom) (Hamilton et al., 2016), the distribution of distances jumps from the first to the second year but then remains largely stationary, indicating absence of a directed drift; i.e., almost all motion is random.

Our approach allows us to detect semantic shifts in the usage of specific words. Figures 5 and 1 both show the cosine distance between a given word and its neighboring words (colored lines) as a function of time. Figure 5 shows results on all three corpora and focuses on a comparison across methods. We see that DSG-S and DSG-F (both proposed) result in trajectories which display less noise than the baselines SGP and SGI. The fact that the baselines predict zero cosine distance (no correlation) between the chosen word pairs on the SoU and Twitter corpora suggests that these corpora are too small to successfully fit these models, in contrast to our approach, which shares information in the time domain. Note that as in dynamic topic models, skip-gram smoothing (DSG-S) may diffuse information into the past (see “presidential” to “clinton-trump” in Fig. 5).
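The trajectories in Figures 1 and 5 are obtained by post-processing the fitted means. A sketch (ours) of this step, assuming `U` holds posterior mean word embeddings of shape (T, d, L) and `vocab` maps words to column indices (both names are ours):

```python
import numpy as np

def cosine_similarity_trajectory(U, vocab, word_a, word_b):
    """Cosine similarity between two words' embedding vectors at every time step.

    U     : array of shape (T, d, L) with posterior mean word embeddings
    vocab : dict mapping word -> column index in U
    The plotted curves are based on this quantity (cosine distance is 1 minus it).
    """
    a = U[:, :, vocab[word_a]]  # shape (T, d)
    b = U[:, :, vocab[word_b]]  # shape (T, d)
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Toy usage with random embeddings for T = 4 time steps.
rng = np.random.default_rng(3)
U = rng.normal(size=(4, 5, 2))
print(cosine_similarity_trajectory(U, {"computer": 0, "machine": 1}, "computer", "machine"))
```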

Quantitative results. We show that our approach generalizes better to unseen data. We thereby analyze held-out predictive likelihoods of word-context pairs at a given time $t$, where $t$ is excluded from the training set,

$$\frac{1}{|n^\pm_t|} \log p(n^\pm_t \mid U_t, V_t). \qquad (16)$$

Above, $|n^\pm_t| = \sum_{i,j} (n^+_{ij,t} + n^-_{ij,t})$ denotes the total number of word-context pairs at time $\tau_t$. Since inference is different in all approaches, the definitions of word and context embedding matrices $U_t$ and $V_t$ in Eq. 16 have to be adjusted:

• For SGI and SGP, we did a chronological pass through the time sequence and used the embeddings $U_t = U_{t-1}$ and $V_t = V_{t-1}$ from the previous time step to predict the statistics $n^\pm_{ij,t}$ at time step $t$.

• For DSG-F, we did the same pass to test $n^\pm_{ij,t}$. We thereby set $U_t$ and $V_t$ to be the modes $U_{t-1}$, $V_{t-1}$ of the approximate posterior at the previous time step.

• For DSG-S, we held out 10%, 10%, and 20% of the documents from the Google books, SoU, and Twitter corpora for testing, respectively. After training, we estimated the word (context) embeddings $U_t$ ($V_t$) in Eq. 16 by linear interpolation between the values of $U_{t-1}$ ($V_{t-1}$) and $U_{t+1}$ ($V_{t+1}$) in the mode of the variational distribution, taking into account that the time stamps $\tau_t$ are in general not equally spaced (see the sketch after this list).
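As referenced in the DSG-S item above, the interpolation is linear in the (generally unequally spaced) time stamps. A minimal sketch (ours):

```python
import numpy as np

def interpolate_embeddings(U_prev, U_next, tau_prev, tau, tau_next):
    """Linearly interpolate embeddings at a held-out time stamp tau,
    given the variational modes at the two neighboring training time stamps."""
    w = (tau - tau_prev) / (tau_next - tau_prev)  # interpolation weight in [0, 1]
    return (1.0 - w) * U_prev + w * U_next

# Toy usage: held-out year 1900 between training years 1898 and 1903.
U_prev, U_next = np.zeros((3, 2)), np.ones((3, 2))
U_t = interpolate_embeddings(U_prev, U_next, 1898.0, 1900.0, 1903.0)
```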


[Figure 5: six panels of cosine distance (0.0–1.0) vs. date: “computer” to “accurate” and “computer” to “machine” (Google Books, 1800–2000); “community” to “nature” and “community” to “transportation” (“State of the Union” addresses, 1800–2000); “presidential” to “barack” and “presidential” to “clinton-trump” (Twitter, 2010–2016); each panel shows curves for DSG-S, DSG-F, SGI, and SGP.]

Figure 5. Smoothness of word embedding trajectories, compared across different methods. We plot the cosine distance between two words (see captions) over time. High values indicate similarity. Our methods (DSG-S and DSG-F) find more interpretable trajectories than the baselines (SGI and SGP). The different performance is most pronounced when the corpus is small (SoU and Twitter).

[Figure 6: normalized predictive log-likelihood vs. date for the Google books, “State of the Union,” and Twitter corpora, with curves for DSG-S, DSG-F, SGI, and SGP and zoomed insets.]

Figure 6. Predictive log-likelihoods (Eq. 16) for two proposed versions of the dynamic skip-gram model (DSG-F & DSG-S) and two competing methods, SGI (Hamilton et al., 2016) and SGP (Kim et al., 2014), on three different corpora (high values are better).

The predictive likelihoods as a function of time $\tau_t$ are shown in Figure 6. For the Google Books corpus (left panel in Figure 6), the predictive log-likelihood grows over time with all four methods. This must be an artifact of the corpus since SGI does not carry any information of the past. A possible explanation is the growing number of words per year from 1800 to 2008 in the Google Books corpus. On all three corpora, differences between the two implementations of the static model (SGI and SGP) are small, which suggests that pre-initializing the embeddings with the previous result may improve their continuity but seems to have little impact on the predictive power. Log-likelihoods for the skip-gram filter (DSG-F) grow over the first few time steps as the filter sees more data, and then saturate. The improvement of our approach over the static model is particularly pronounced in the SoU and Twitter corpora, which are much smaller than the massive Google books corpus. There, sharing information across time is crucial because there is little data at every time slice. Skip-gram smoothing outperforms skip-gram filtering as it shares information in both time directions and uses a more flexible variational distribution.

6. Conclusions

We presented the dynamic skip-gram model: a Bayesian probabilistic model that combines word2vec with a latent continuous time series. We showed experimentally that both dynamic skip-gram filtering (which conditions only on past observations) and dynamic skip-gram smoothing (which uses all data) lead to smoothly changing embedding vectors that are better at predicting word-context statistics at held-out time steps. The benefits are most drastic when the data at individual time steps is small, such that fitting a static word embedding model is hard. Our approach may be used as a data mining and anomaly detection tool when streaming text on social media, as well as a tool for historians and social scientists interested in the evolution of language.

Acknowledgements

We would like to thank Marius Kloft, Cheng Zhang, Andreas Lehrmann, Brian McWilliams, Romann Weber, Michael Clements, and Ari Pakman for valuable feedback.


References

Barkan, Oren. Bayesian Neural Word Embedding. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Beck, Amir and Teboulle, Marc. Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization. Operations Research Letters, 31(3):167–175, 2003.

Ben-Tal, Aharon, Margalit, Tamar, and Nemirovski, Arkadi. The Ordered Subsets Mirror Descent Optimization Method with Applications to Tomography. SIAM Journal on Optimization, 12(1):79–108, 2001.

Bengio, Yoshua, Ducharme, Rejean, Vincent, Pascal, and Jauvin, Christian. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155, 2003.

Blei, David M and Lafferty, John D. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120. ACM, 2006.

Blei, David M., Kucukelbir, Alp, and McAuliffe, Jon D. Variational Inference: A Review for Statisticians. arXiv preprint arXiv:1601.00670, 2016.

Charlin, Laurent, Ranganath, Rajesh, McInerney, James, and Blei, David M. Dynamic Poisson Factorization. In Proceedings of the 9th ACM Conference on Recommender Systems, pp. 155–162, 2015.

Fu, Yanwei and Sigal, Leonid. Semi-Supervised Vocabulary-Informed Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5337–5346, 2016.

Gultekin, San and Paisley, John. A Collaborative Kalman Filter for Time-Evolving Dyadic Processes. In Proceedings of the 2nd International Conference on Data Mining, pp. 140–149, 2014.

Hamilton, William L, Leskovec, Jure, and Jurafsky, Dan. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1489–1501, 2016.

Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John William. Stochastic Variational Inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Jerfel, Ghassen, Basbug, Mehmet E, and Engelhardt, Barbara E. Dynamic Compound Poisson Factorization. In Artificial Intelligence and Statistics, 2017.

Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183–233, 1999.

Kalman, Rudolph Emil et al. A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering, 82(1):35–45, 1960.

Kılıç, Emrah and Stanica, Pantelimon. The Inverse of Banded Matrices. Journal of Computational and Applied Mathematics, 237(1):126–135, 2013.

Kim, Yoon, Chiu, Yi-I, Hanaki, Kentaro, Hegde, Darshan, and Petrov, Slav. Temporal Analysis of Language Through Neural Language Models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pp. 61–65, 2014.

Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

Kulkarni, Vivek, Al-Rfou, Rami, Perozzi, Bryan, and Skiena, Steven. Statistically Significant Detection of Linguistic Change. In Proceedings of the 24th International Conference on World Wide Web, pp. 625–635, 2015.

Levy, Omer and Goldberg, Yoav. Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems, pp. 2177–2185, 2014.

Michel, Jean-Baptiste, Shen, Yuan Kui, Aiden, Aviva Presser, Veres, Adrian, Gray, Matthew K, Pickett, Joseph P, Hoiberg, Dale, Clancy, Dan, Norvig, Peter, Orwant, Jon, et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014):176–182, 2011.

Mihalcea, Rada and Nastase, Vivi. Word Epoch Disambiguation: Finding how Words Change Over Time. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pp. 259–263, 2012.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013a.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013b.

Mikolov, Tomas, Yih, Wen-tau, and Zweig, Geoffrey. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013), pp. 746–751, 2013c.

Mnih, Andriy and Kavukcuoglu, Koray. Learning Word Embeddings Efficiently with Noise-Contrastive Estimation. In Advances in Neural Information Processing Systems, pp. 2265–2273, 2013.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global Vectors for Word Representation. In EMNLP, volume 14, pp. 1532–1543, 2014.

Ranganath, Rajesh, Gerrish, Sean, and Blei, David M. Black Box Variational Inference. In AISTATS, pp. 814–822, 2014.

Ranganath, Rajesh, Perotte, Adler J, Elhadad, Noemie, and Blei, David M. The Survival Filter: Joint Survival Analysis with a Latent Time Series. In UAI, pp. 742–751, 2015.

Rauber, Paulo E., Falcao, Alexandre X., and Telea, Alexandru C. Visualizing Time-Dependent Data Using Dynamic t-SNE. In EuroVis 2016 - Short Papers, 2016.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In The 31st International Conference on Machine Learning (ICML), 2014.

Rudolph, Maja and Blei, David. Dynamic Bernoulli Embeddings for Language Evolution. arXiv preprint arXiv:1703.08052, 2017.

Rudolph, Maja, Ruiz, Francisco, Mandt, Stephan, and Blei, David. Exponential Family Embeddings. In Advances in Neural Information Processing Systems, pp. 478–486, 2016.

Sagi, Eyal, Kaufmann, Stefan, and Clark, Brady. Tracing Semantic Change with Latent Semantic Analysis. Current Methods in Historical Semantics, pp. 161–183, 2011.

Sahoo, Nachiketa, Singh, Param Vir, and Mukhopadhyay, Tridas. A Hidden Markov Model for Collaborative Filtering. MIS Quarterly, 36(4):1329–1356, 2012.

Salimans, Tim and Kingma, Diederik P. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Socher, Richard, Bauer, John, Manning, Christopher D, and Ng, Andrew Y. Parsing with Compositional Vector Grammars. In ACL (1), pp. 455–465, 2013a.

Socher, Richard, Perelygin, Alex, Wu, Jean Y, Chuang, Jason, Manning, Christopher D, Ng, Andrew Y, and Potts, Christopher. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1631–1642, 2013b.

Uhlenbeck, George E and Ornstein, Leonard S. On the Theory of the Brownian Motion. Physical Review, 36(5):823, 1930.

Vilnis, Luke and McCallum, Andrew. Word Representations via Gaussian Embedding. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

Wang, Chong, Blei, David, and Heckerman, David. Continuous Time Dynamic Topic Models. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pp. 579–586, 2008.

Welch, Greg and Bishop, Gary. An Introduction to the Kalman Filter. 1995.


Supplementary Material to “Dynamic Word Embeddings”

Robert Bamler 1 Stephan Mandt 1

Table S1. Hyperparameters for skip-gram filtering and skip-gram smoothing.

PARAMETER        COMMENT
L = 10^4         vocabulary size
L' = 10^3        batch size for smoothing
d = 100          embedding dimension for SoU and Twitter
d = 200          embedding dimension for Google books
N_tr = 5000      number of training steps for each t (filtering)
N'_tr = 5000     number of pretraining steps with minibatch sampling (smoothing; see Algorithm 2)
N_tr = 1000      number of training steps without minibatch sampling (smoothing; see Algorithm 2)
c_max = 4        context window size for positive examples
η = 1            ratio of negative to positive examples
γ = 0.75         context exponent for negative examples
D = 10^-3        diffusion constant per year (Google books & SoU)
D = 1            diffusion constant per year (Twitter)
σ_0^2 = 1        variance of overall prior
α = 10^-2        learning rate (filtering)
α' = 10^-2       learning rate during minibatch phase (smoothing)
α = 10^-3        learning rate after minibatch phase (smoothing)
β_1 = 0.9        decay rate of 1st moment estimate
β_2 = 0.99       decay rate of 2nd moment estimate (filtering)
β_2 = 0.999      decay rate of 2nd moment estimate (smoothing)
δ = 10^-8        regularizer of Adam optimizer

1. Dimensionality Reduction in Figure 1

To create the word-clouds in Figure 1 of the main text, we mapped the fitted word embeddings from $\mathbb{R}^d$ to the two-dimensional plane using dynamic t-SNE (Rauber et al., 2016). Dynamic t-SNE is a non-parametric dimensionality reduction algorithm for sequential data. The algorithm finds a projection to a lower dimension by solving a non-convex optimization problem that aims at preserving nearest-neighbor relations at each individual time step. In addition, projections at neighboring time steps are aligned with each other by a quadratic penalty with prefactor $\lambda \geq 0$ for sudden movements.

There is a trade-off between finding good local projections for each individual time step ($\lambda \to 0$) and finding smooth projections (large $\lambda$). Since we want to analyze the smoothness of word embedding trajectories, we want to avoid bias towards smooth projections. Unfortunately, setting $\lambda = 0$ is not an option since, in this limit, the optimization problem is invariant under independent rotations at each time, rendering trajectories in the two-dimensional projection plane meaningless. To still avoid bias towards smooth projections, we anneal $\lambda$ exponentially towards zero over the course of the optimization. We start the optimizer with $\lambda = 0.01$, and we reduce $\lambda$ by 5% with each training step. We run 100 optimization steps in total, so that $\lambda \approx 6 \times 10^{-6}$ at the end of the training procedure. We used the open-source implementation,¹ set the target perplexities to 200, and used default values for all other parameters.

¹ https://github.com/paulorauber/thesne

2. Hyperparameters and Construction of $n^\pm_{1:T}$

Table S1 lists the hyperparameters used in our experiments. For the Google books corpus, we used the same context window size $c_{\max}$ and embedding dimension $d$ as in (Kim et al., 2014). We reduced $d$ for the SoU and Twitter corpora to avoid overfitting to these much smaller data sets.

In contrast to word2vec, we construct our positive and negative count matrices $n^\pm_{ij,t}$ deterministically in a preprocessing step. As detailed below, this is done such that it resembles as closely as possible the stochastic approach in word2vec (Mikolov et al., 2013). In every update step, word2vec stochastically samples a context window size uniformly in an interval $[1, \cdots, c_{\max}]$; thus the context size fluctuates, and nearby words appear more often in the same context than words that are far apart from each other in the sentence. We follow a deterministic scheme that results in similar statistics. For each pair of words $(w_1, w_2)$ in a given sentence, we increase the counts $n^+_{i_{w_1} j_{w_2}}$ by $\max(0, 1 - k/c_{\max})$, where $0 \leq k \leq c_{\max}$ is the number of words that appear between $w_1$ and $w_2$, and $i_{w_1}$ and $j_{w_2}$ are the words' unique indices in the vocabulary.
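A sketch (ours) of this deterministic counting scheme for a single tokenized sentence; `vocab` and `c_max` are our own names, and we count ordered pairs so that every word acts both as target and as context:

```python
import numpy as np

def accumulate_positive_counts(sentence, vocab, n_pos, c_max=4):
    """Add weighted co-occurrence counts for one sentence to the L x L matrix n_pos.

    Each ordered pair of in-vocabulary words with k words between them contributes
    max(0, 1 - k / c_max), mirroring word2vec's random context-window size in expectation.
    """
    ids = [vocab[w] for w in sentence if w in vocab]
    for p1, i in enumerate(ids):
        for p2, j in enumerate(ids):
            if p1 == p2:
                continue
            k = abs(p1 - p2) - 1                  # number of words between the pair
            n_pos[i, j] += max(0.0, 1.0 - k / c_max)
    return n_pos

vocab = {"language": 0, "evolves": 1, "over": 2, "time": 3}
n_pos = np.zeros((len(vocab), len(vocab)))
accumulate_positive_counts(["language", "evolves", "over", "time"], vocab, n_pos)
```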




Algorithm 1 Skip-gram filtering; see section 4 of the main text.

  Remark: All updates are analogous for word and context vectors; we drop their indices for simplicity.
  Input: number of time steps T, time stamps τ_1:T, positive and negative examples n^±_1:T, hyperparameters.
  Initialize prior means μ̃_ik,1 ← 0 and variances Σ̃_i,1 = I_{d×d}
  Initialize variational means μ_ik,1 ← 0 and variances Σ_i,1 = I_{d×d}
  for t = 1 to T do
    if t ≠ 1 then
      Update approximate Gaussian prior with parameters μ̃_ik,t and Σ̃_i,t using μ_ik,t−1 and Σ_i,t−1, see Eq. 13.
    end if
    Compute entropy E_q[log q(·)] analytically.
    Compute expected log Gaussian prior with parameters μ̃_ik,t and Σ̃_i,t analytically.
    Maximize L_t in Eq. 11, using black-box VI with the reparametrization trick.
    Obtain μ_ik,t and Σ_i,t as outcome of the optimization.
  end for

We also used a deterministic variant of word2vec to construct the negative count matrices $n^-_t$. In word2vec, $\eta$ negative samples $(i, j)$ are drawn for each positive sample $(i, j')$ by drawing $\eta$ independent values for $j$ from a distribution $P'_t(j)$ defined below. We define $n^-_{ij,t}$ such that it matches the expectation value of the number of times that word2vec would sample the negative word-context pair $(i, j)$. Specifically, we define

$$P_t(i) = \frac{\sum_{j=1}^{L} n^+_{ij,t}}{\sum_{i',j=1}^{L} n^+_{i'j,t}}, \qquad (S1)$$

$$P'_t(j) = \frac{\left(P_t(j)\right)^\gamma}{\sum_{j'=1}^{L} \left(P_t(j')\right)^\gamma}, \qquad (S2)$$

$$n^-_{ij,t} = \Big( \sum_{i',j'=1}^{L} n^+_{i'j',t} \Big)\, \eta\, P_t(i)\, P'_t(j). \qquad (S3)$$

We chose $\gamma = 0.75$ as proposed in (Mikolov et al., 2013), and we set $\eta = 1$. In practice, it is not necessary to explicitly construct the full matrices $n^-_t$, and it is more efficient to keep only the distributions $P_t(i)$ and $P'_t(j)$ in memory.
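The following sketch (ours) implements Eqs. S1–S3 with NumPy; as noted above, in practice one would keep only the distributions $P_t$ and $P'_t$ in memory rather than the full matrix.

```python
import numpy as np

def negative_counts(n_pos, eta=1.0, gamma=0.75):
    """Deterministic negative counts n-_t from positive counts n+_t (Eqs. S1-S3)."""
    total = n_pos.sum()
    P = n_pos.sum(axis=1) / total            # Eq. S1: unigram distribution P_t(i)
    P_ctx = P**gamma / np.sum(P**gamma)      # Eq. S2: smoothed context distribution P'_t(j)
    return total * eta * np.outer(P, P_ctx)  # Eq. S3: expected negative counts

# Toy usage; with eta = 1 the negatives match the positives in total count.
n_pos = np.array([[0., 2., 1.], [2., 0., 3.], [1., 3., 0.]])
n_neg = negative_counts(n_pos)
assert np.isclose(n_neg.sum(), n_pos.sum())
```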

3. Skip-gram Filtering Algorithm

The skip-gram filtering algorithm is described in section 4 of the main text. We provide a formulation in pseudocode in Algorithm 1.

4. Skip-gram Smoothing Algorithm

In this section, we give details for the skip-gram smoothing algorithm, see section 4 of the main text. A summary is provided in Algorithm 2.

Algorithm 2 Skip-gram smoothing; see section 4. We drop indices i, j, and k for word, context, and embedding dimension, respectively, when they are clear from context.

  Input: number of time steps T, time stamps τ_1:T, word-context counts n^+_1:T, hyperparameters in Table S1
  Obtain n^-_t ∀t using Eqs. S1–S3
  Initialize μ_u,1:T, μ_v,1:T ← 0
  Initialize ν_u,1:T, ν_v,1:T, ω_u,1:T−1, and ω_v,1:T−1 such that B_u^⊤ B_u = B_v^⊤ B_v = Π (see Eqs. S5 and S11)
  for step = 1 to N'_tr do
    Draw I ⊂ {1, …, L} with |I| = L' uniformly
    Draw J ⊂ {1, …, L} with |J| = L' uniformly
    for all i ∈ I do
      Draw ε^[s]_{u_i,1:T} ∼ N(0, I)
      Solve B_{u,i} x_{u_i,1:T} = ε_{u_i,1:T} for x_{u_i,1:T}
    end for
    Obtain x_{v_j,1:T} by repeating last loop ∀ j ∈ J
    Calculate gradient estimates of L for minibatch (I, J) using Eqs. S10, S14, and S15
    Obtain update steps d[·] for all variational parameters using Adam optimizer with parameters in Table S1
    Update μ_u,1:T ← μ_u,1:T + d[μ_u,1:T], and analogously for μ_v,1:T, ω_u,1:T−1, and ω_v,1:T−1
    Update ν_u,1:T and ν_v,1:T according to Eq. S18
  end for
  Repeat above loop for N_tr more steps, this time without minibatch sampling (i.e., setting L' = L)

Variational distribution. For now, we focus on the word embeddings, and we simplify the notation by dropping the indices for the vocabulary and embedding dimensions. The variational distribution for a single embedding dimension of a single word embedding trajectory is

$$q(u_{1:T}) = \mathcal{N}\!\left(\mu_{u,1:T},\, (B_u^\top B_u)^{-1}\right). \qquad (S4)$$

Here, $\mu_{u,1:T}$ is the vector of mean values, and $B_u$ is the Cholesky decomposition of the precision matrix. We restrict the latter to be bidiagonal,

$$B_u = \begin{pmatrix} \nu_{u,1} & \omega_{u,1} & & & \\ & \nu_{u,2} & \omega_{u,2} & & \\ & & \ddots & \ddots & \\ & & & \nu_{u,T-1} & \omega_{u,T-1} \\ & & & & \nu_{u,T} \end{pmatrix} \qquad (S5)$$

with $\nu_{u,t} > 0$ for all $t \in \{1, \ldots, T\}$. The variational parameters are $\mu_{u,1:T}$, $\nu_{u,1:T}$, and $\omega_{u,1:T-1}$. The variational distribution of the context embedding trajectories $v_{1:T}$ has the same structure.


With the above form of $B_u$, the variational distribution is a Gaussian with an arbitrary tridiagonal symmetric precision matrix $B_u^\top B_u$. We chose this variational distribution because it is the exact posterior of a hidden time-series model with a Kalman filtering prior and Gaussian noise (Blei & Lafferty, 2006). Note that our variational distribution is a generalization of a fully factorized (mean-field) distribution, which is obtained for $\omega_{u,t} = 0\ \forall t$. In the general case, $\omega_{u,t} \neq 0$, the variational distribution can capture correlations between all time steps, with a dense covariance matrix $(B_u^\top B_u)^{-1}$.

Gradient estimation. The skip-gram smoothing algorithm uses stochastic gradient ascent to find the variational parameters that maximize the ELBO,

\[
\mathcal{L} = \mathbb{E}_q\big[\log p(U_{1:T}, V_{1:T}, n^\pm_{1:T})\big] - \mathbb{E}_q\big[\log q(U_{1:T}, V_{1:T})\big]. \tag{S6}
\]

Here, the second term is the entropy, which can be evaluated analytically. We obtain for each component in vocabulary and embedding space,

\[
-\mathbb{E}_q[\log q(u_{1:T})] = -\sum_t \log(\nu_{u,t}) + \text{const.} \tag{S7}
\]

and analogously for −E_q[log q(v_{1:T})].

The first term on the right-hand side of Eq. S6 cannot be evaluated analytically. We approximate its gradient by sampling from q using the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014). A naive calculation would require Ω(T²) computing time since the derivatives of L with respect to ν_{u,t} and ω_{u,t} for each t depend on the count matrices n^±_{t′} of all t′. However, as we show next, there is a more efficient way to obtain all gradient estimates in Θ(T) time.

We focus again on a single dimension of a single word embedding trajectory u_{1:T}, and we drop the indices i and k. We draw S independent samples u^[s]_{1:T} with s ∈ {1, ..., S} from q(u_{1:T}) by parameterizing

\[
u^{[s]}_{1:T} = \mu_{u,1:T} + x^{[s]}_{u,1:T} \tag{S8}
\]
with
\[
x^{[s]}_{u,1:T} = B_u^{-1} \varepsilon^{[s]}_{u,1:T} \quad\text{where}\quad \varepsilon^{[s]}_{u,1:T} \sim \mathcal{N}(0, I). \tag{S9}
\]

We obtain x^[s]_{u,1:T} in Θ(T) time by solving the bidiagonal linear system B_u x^[s]_{u,1:T} = ε^[s]_{u,1:T}. Samples v^[s]_{1:T} for the context embedding trajectories are obtained analogously. Our implementation uses S = 1, i.e., we draw only a single sample per training step. Averaging over several samples is done implicitly since we calculate the update steps using the Adam optimizer (Kingma & Ba, 2014), which effectively averages over several gradient estimates in its first moment estimate.
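The reparameterized sampling of Eqs. S8–S9 can be implemented with a banded solver; the sketch below (illustrative names, assuming SciPy's solve_banded for the bidiagonal solve) draws one sample trajectory in Θ(T) time:

```python
import numpy as np
from scipy.linalg import solve_banded

def sample_trajectory(mu, nu, omega, rng):
    """Sketch of Eqs. S8-S9 (illustrative names).

    Draws u^[s]_{1:T} = mu + B_u^{-1} eps by solving the upper-bidiagonal
    system B_u x = eps instead of inverting B_u.
    """
    T = mu.shape[0]
    eps = rng.standard_normal(T)       # eps ~ N(0, I)
    ab = np.zeros((2, T))              # banded storage: superdiagonal + diagonal
    ab[0, 1:] = omega                  # omega_{u,1:T-1} on the superdiagonal
    ab[1, :] = nu                      # nu_{u,1:T} on the diagonal
    x = solve_banded((0, 1), ab, eps)  # x = B_u^{-1} eps, Eq. S9, in O(T)
    return mu + x                      # Eq. S8

rng = np.random.default_rng(0)
u_sample = sample_trajectory(np.zeros(10), np.ones(10), np.zeros(9), rng)
```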

The derivatives of L with respect to μ_{u,1:T} can be obtained using Eq. S8 and the chain rule. We find

\[
\frac{\partial \mathcal{L}}{\partial \mu_{u,1:T}} \approx \frac{1}{S} \sum_{s=1}^S \Big[ \Gamma^{[s]}_{u,1:T} - \Pi\, u^{[s]}_{1:T} \Big]. \tag{S10}
\]

Here, Π ∈ R^{T×T} is the precision matrix of the prior u_{1:T} ∼ N(0, Π^{-1}). It is tridiagonal and therefore the matrix multiplication Π u^[s]_{1:T} can be carried out efficiently. The non-zero matrix elements of Π are

\[
\begin{aligned}
\Pi_{11} &= \sigma_0^{-2} + \sigma_1^{-2} \\
\Pi_{TT} &= \sigma_0^{-2} + \sigma_{T-1}^{-2} \\
\Pi_{tt} &= \sigma_0^{-2} + \sigma_{t-1}^{-2} + \sigma_t^{-2} \quad \forall t \in \{2, \ldots, T-1\} \\
\Pi_{t,t+1} &= \Pi_{t+1,t} = -\sigma_t^{-2}.
\end{aligned} \tag{S11}
\]
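A dense reference implementation of Eq. S11 could look as follows (illustrative names; a practical implementation would store only the three bands of Π):

```python
import numpy as np

def prior_precision(sigma, sigma0):
    """Sketch of Eq. S11 (illustrative names, dense for clarity).

    sigma  : length-(T-1) array of diffusion standard deviations sigma_1, ..., sigma_{T-1}
    sigma0 : overall prior standard deviation sigma_0
    """
    T = len(sigma) + 1
    inv = sigma ** -2.0
    Pi = np.diag(np.full(T, sigma0 ** -2.0))       # sigma_0^{-2} on the diagonal
    Pi[np.arange(T - 1), np.arange(T - 1)] += inv  # sigma_t^{-2} term, t = 1, ..., T-1
    Pi[np.arange(1, T), np.arange(1, T)] += inv    # sigma_{t-1}^{-2} term, t = 2, ..., T
    Pi[np.arange(T - 1), np.arange(1, T)] = -inv   # Pi_{t,t+1}
    Pi[np.arange(1, T), np.arange(T - 1)] = -inv   # Pi_{t+1,t}
    return Pi
```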

The term Γ^[s]_{u,1:T} on the right-hand side of Eq. S10 comes from the expectation value of the log-likelihood under q. It is given by

\[
\Gamma^{[s]}_{u_i,t} = \sum_{j=1}^L \Big[ \big(n^+_{ij,t} + n^-_{ij,t}\big)\, \sigma\big(-u^{[s]\top}_{i,t} v^{[s]}_{j,t}\big) - n^-_{ij,t} \Big]\, v^{[s]}_{j,t} \tag{S12}
\]

where we temporarily restored the indices i and j for words and contexts, respectively. In deriving Eq. S12, we used the relations ∂ log σ(x)/∂x = σ(−x) and σ(−x) = 1 − σ(x).
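In vectorized form, Eq. S12 is a weighted sum over context vectors; a minimal sketch (illustrative names, not the authors' code) is:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gamma_term(n_pos_t, n_neg_t, U_t, V_t):
    """Sketch of Eq. S12 (illustrative names).

    n_pos_t, n_neg_t : (L, L) count matrices n^+_t and n^-_t
    U_t, V_t         : (L, d) sampled word and context embeddings u^[s]_t, v^[s]_t
    Returns the (L, d) array Gamma^[s]_{u,t}, one row per word i.
    """
    scores = sigmoid(-U_t @ V_t.T)                  # sigma(-u_i^T v_j) for all (i, j)
    weights = (n_pos_t + n_neg_t) * scores - n_neg_t
    return weights @ V_t                            # sum over contexts j
```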

The derivatives of L with respect to ν_{u,t} and ω_{u,t} are more intricate. Using the parameterization in Eqs. S8–S9, the derivatives are functions of ∂B_u^{-1}/∂ν_t and ∂B_u^{-1}/∂ω_t, respectively, where B_u^{-1} is a dense (upper triangular) T × T matrix. An efficient way to obtain these derivatives is to use the relation

\[
\frac{\partial B_u^{-1}}{\partial \nu_t} = -B_u^{-1} \frac{\partial B_u}{\partial \nu_t} B_u^{-1} \tag{S13}
\]

and similarly for ∂B_u^{-1}/∂ω_t. Using this relation and Eqs. S8–S9, we obtain the gradient estimates

\[
\frac{\partial \mathcal{L}}{\partial \nu_{u,t}} \approx -\frac{1}{S} \sum_{s=1}^S y^{[s]}_{u,t}\, x^{[s]}_{u,t} - \frac{1}{\nu_{u,t}}, \tag{S14}
\]
\[
\frac{\partial \mathcal{L}}{\partial \omega_{u,t}} \approx -\frac{1}{S} \sum_{s=1}^S y^{[s]}_{u,t}\, x^{[s]}_{u,t+1}. \tag{S15}
\]

The second term on the right-hand side of Eq. S14 is the derivative of the entropy, Eq. S7, and

\[
y^{[s]}_{u,1:T} = (B_u^\top)^{-1} \Big[ \Gamma^{[s]}_{u,1:T} - \Pi\, u^{[s]}_{1:T} \Big]. \tag{S16}
\]


The values y^[s]_{u,1:T} can again be obtained in Θ(T) time by bringing B_u^⊤ to the left-hand side and solving the corresponding bidiagonal linear system of equations.
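Putting Eqs. S14–S16 together for a single sample, the gradient estimates for ν_{u,1:T} and ω_{u,1:T−1} can be sketched as follows (illustrative names; the bidiagonal system for y is again solved with a banded solver):

```python
import numpy as np
from scipy.linalg import solve_banded

def nu_omega_gradients(nu, omega, x, residual):
    """Sketch of Eqs. S14-S16 for a single sample, S = 1 (illustrative names).

    nu, omega : parameters of the upper-bidiagonal factor B_u (Eq. S5)
    x         : x^[s]_{u,1:T} from Eq. S9
    residual  : Gamma^[s]_{u,1:T} - Pi u^[s]_{1:T}, the bracket in Eq. S16
    """
    T = len(nu)
    ab = np.zeros((2, T))                   # banded storage of B_u^T (lower bidiagonal)
    ab[0, :] = nu                           # main diagonal
    ab[1, :-1] = omega                      # subdiagonal of B_u^T
    y = solve_banded((1, 0), ab, residual)  # Eq. S16: y = (B_u^T)^{-1} residual, O(T)
    grad_nu = -y * x - 1.0 / nu             # Eq. S14 (second term: entropy derivative)
    grad_omega = -y[:-1] * x[1:]            # Eq. S15
    return grad_nu, grad_omega
```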

Sampling in vocabulary space. In the above paragraph, we described an efficient strategy to obtain gradient estimates in only Θ(T) time. However, the gradient estimation scales quadratically in the vocabulary size L because all L² elements of the positive count matrices n^+_t contribute to the gradients. In order to speed up the optimization, we pretrain the model using a minibatch of size L′ < L in vocabulary space, as explained below. The computational complexity of a single training step in this setup scales as (L′)² rather than L². After N′_tr = 5000 training steps with minibatch size L′, we switch to the full batch size of L and train the model for another N_tr = 1000 steps.

The subsampling in vocabulary space works as follows. In each training step, we independently draw a set I of L′ random distinct words and a set J of L′ random distinct contexts from a uniform probability over the vocabulary. We then estimate the gradients of L with respect to only the variational parameters that correspond to words i ∈ I and contexts j ∈ J. This is possible because both the prior of our dynamic skip-gram model and the variational distribution factorize in the vocabulary space. The likelihood of the model, however, does not factorize. This affects only the definition of Γ^[s]_{u_ik,t} in Eq. S12. We replace Γ^[s]_{u_ik,t} by an estimate Γ^[s]′_{u_ik,t} based on only the contexts j ∈ J in the current minibatch,

\[
\Gamma^{[s]\prime}_{u_i,t} = \frac{L}{L'} \sum_{j \in \mathcal{J}} \Big[ \big(n^+_{ij,t} + n^-_{ij,t}\big)\, \sigma\big(-u^{[s]\top}_{i,t} v^{[s]}_{j,t}\big) - n^-_{ij,t} \Big]\, v^{[s]}_{j,t}. \tag{S17}
\]

Here, the prefactor L/L′ restores the correct ratio between evidence and prior knowledge (Hoffman et al., 2013).
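A sketch of the minibatched estimate in Eq. S17 (illustrative names, not the authors' code) is:

```python
import numpy as np

def gamma_minibatch(n_pos_t, n_neg_t, U_t, V_t, word_idx, ctx_idx):
    """Sketch of Eq. S17 (illustrative names).

    Estimates Gamma for the words in `word_idx` using only the contexts in `ctx_idx`;
    the prefactor L / L' rescales the evidence relative to the prior
    (Hoffman et al., 2013).
    """
    L = V_t.shape[0]
    L_prime = len(ctx_idx)
    U_sub, V_sub = U_t[word_idx], V_t[ctx_idx]
    n_pos = n_pos_t[np.ix_(word_idx, ctx_idx)]
    n_neg = n_neg_t[np.ix_(word_idx, ctx_idx)]
    scores = 1.0 / (1.0 + np.exp(U_sub @ V_sub.T))  # sigma(-u_i^T v_j)
    weights = (n_pos + n_neg) * scores - n_neg
    return (L / L_prime) * (weights @ V_sub)
```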

Enforcing positive definiteness. We update the variational parameters using stochastic gradient ascent with the Adam optimizer (Kingma & Ba, 2014). The parameters ν_{u,1:T} are the eigenvalues of the matrix B_u, which is the Cholesky decomposition of the precision matrix of q. Therefore, ν_{u,t} has to be positive for all t ∈ {1, ..., T}. We use mirror ascent (Ben-Tal et al., 2001; Beck & Teboulle, 2003) to enforce ν_{u,t} > 0. Specifically, we update ν_{u,t} to a new value ν′_{u,t} defined by

\[
\nu'_{u,t} = \frac{1}{2} \nu_{u,t}\, d[\nu_{u,t}] + \sqrt{\Big(\frac{1}{2} \nu_{u,t}\, d[\nu_{u,t}]\Big)^2 + \nu_{u,t}^2} \tag{S18}
\]

where d[ν_{u,t}] is the step size obtained from the Adam optimizer. Eq. S18 can be derived from the general mirror ascent update rule Φ′(ν′_{u,t}) = Φ′(ν_{u,t}) + d[ν_{u,t}] with the mirror map Φ : x ↦ −c₁ log(x) + c₂ x²/2, where we set the parameters to c₁ = ν_{u,t} and c₂ = 1/ν_{u,t} for dimensional reasons. The update step in Eq. S18 increases (decreases) ν_{u,t} for positive (negative) d[ν_{u,t}], while always keeping its value positive.
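The mirror ascent update of Eq. S18 is a one-line operation per parameter; a sketch (illustrative names) is:

```python
import numpy as np

def mirror_ascent_update(nu, step):
    """Sketch of Eq. S18 (illustrative names).

    nu   : current values nu_{u,1:T} (all positive)
    step : Adam step sizes d[nu_{u,1:T}]
    The result increases nu for positive steps, decreases it for negative steps,
    and always remains strictly positive.
    """
    half = 0.5 * nu * step
    return half + np.sqrt(half ** 2 + nu ** 2)
```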

Natural basis. As a final remark, let us discuss an optional extension to the skip-gram smoothing algorithm that converges in fewer training steps. This extension only increases the efficiency of the algorithm. It does not change the underlying model or the choice of variational distribution. Observe that the prior of the dynamic skip-gram model connects only neighboring time steps with each other. Therefore, the gradient of L with respect to μ_{u,t} depends only on the values of μ_{u,t−1} and μ_{u,t+1}. A naive implementation of gradient ascent would thus require T − 1 update steps until a change of μ_{u,1} affects updates of μ_{u,T}.

This problem can be avoided with a change of basis from μ_{u,1:T} to new parameters ρ_{u,1:T},

\[
\mu_{u,1:T} = A\, \rho_{u,1:T} \tag{S19}
\]

with an appropriately chosen invertible matrix A ∈ R^{T×T}. Derivatives of L with respect to ρ_{u,1:T} are given by the chain rule, ∂L/∂ρ_{u,1:T} = (∂L/∂μ_{u,1:T}) A. A natural (but inefficient) choice for A is to stack the eigenvectors of the prior precision matrix Π (see Eq. S11) into a matrix. The eigenvectors of Π are the Fourier modes of the Kalman filtering prior (with a regularization due to σ₀). Therefore, there is a component ρ_{u,t} that corresponds to the zero mode of Π, and this component learns an average word embedding over all times. Higher modes correspond to changes of the embedding vector over time. A single update to the zero-mode component immediately affects all elements of μ_{u,1:T}, and therefore changes the word embeddings at all time steps. Thus, information propagates quickly along the time dimension. The downside of this choice for A is that the transformation in Eq. S19 has complexity Ω(T²), which makes update steps slow.

As a compromise between efficiency and a natural basis, we propose to set A in Eq. S19 to the Cholesky decomposition of the prior covariance matrix, Π^{-1} ≡ AA^⊤. Thus, A is still a dense (upper triangular) matrix, and, in our experiments, updates to the last component ρ_{u,T} affect all components of μ_{u,1:T} by approximately equal amounts. Since Π is tridiagonal, the inverse of A is bidiagonal, and Eq. S19 can be evaluated in Θ(T) time by solving the bidiagonal system A^{-1} μ_{u,1:T} = ρ_{u,1:T} for μ_{u,1:T}. This is the parameterization we used in our implementation of the skip-gram smoothing algorithm.
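A sketch of this parameterization (illustrative names): since the Cholesky factor of the tridiagonal Π is bidiagonal, μ_{u,1:T} = A ρ_{u,1:T} can be computed with a single banded solve,

```python
import numpy as np
from scipy.linalg import solve_banded

def mu_from_rho(rho, Pi):
    """Sketch of the natural-basis transformation, Eq. S19 (illustrative names).

    With Pi tridiagonal, its Cholesky factor L (Pi = L L^T) is lower bidiagonal.
    Choosing A = L^{-T} gives Pi^{-1} = A A^T with A dense upper triangular and
    A^{-1} = L^T upper bidiagonal, so mu = A rho follows from solving
    L^T mu = rho in O(T) time.
    """
    T = len(rho)
    L = np.linalg.cholesky(Pi)            # lower bidiagonal since Pi is tridiagonal
    ab = np.zeros((2, T))                 # banded storage of L^T
    ab[1, :] = np.diag(L)                 # diagonal
    ab[0, 1:] = np.diag(L, k=-1)          # subdiagonal of L = superdiagonal of L^T
    return solve_banded((0, 1), ab, rho)  # mu such that L^T mu = rho
```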


References

Beck, Amir and Teboulle, Marc. Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization. Operations Research Letters, 31(3):167–175, 2003.

Ben-Tal, Aharon, Margalit, Tamar, and Nemirovski, Arkadi. The Ordered Subsets Mirror Descent Optimization Method with Applications to Tomography. SIAM Journal on Optimization, 12(1):79–108, 2001.

Blei, David M and Lafferty, John D. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120. ACM, 2006.

Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John William. Stochastic Variational Inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Kim, Yoon, Chiu, Yi-I, Hanaki, Kentaro, Hegde, Darshan, and Petrov, Slav. Temporal Analysis of Language Through Neural Language Models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pp. 61–65, 2014.

Kingma, Diederik and Ba, Jimmy. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013.

Rauber, Paulo E., Falcao, Alexandre X., and Telea, Alexandru C. Visualizing Time-Dependent Data Using Dynamic t-SNE. In EuroVis 2016 - Short Papers, 2016.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In The 31st International Conference on Machine Learning (ICML), 2014.