Transcript of “Exponential Family Embeddings” (41 slides)

Page 1:

Exponential Family Embeddings

Maja Rudolph ([email protected])with Francisco Ruiz, and David Blei

Columbia University

September 12, 2017

Page 2:

Exponential Family Embeddings

• class of conditionally specified models

• goal: learn distributed representations of objects

(0, 0, ..., 1, 0, 0, ..., 0, 0)  −→  (1.2, −0.5, ..., −1.7, 4.3)

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Nature 323 (1986), p. 9.

Geoffrey E Hinton. “Learning distributed representations of concepts”. In: Proceedings of the eighth annual conference of the cognitive science society. Vol. 1. Amherst, MA. 1986, p. 12.

Maja Rudolph et al. “Exponential Family Embeddings”. In: Advances in Neural Information Processing Systems. 2016, pp. 478–486.

Page 3:

Exponential Family Embeddings

• goal: learn distributed representations of objects

• objects: words in text, neurons in neuroscience data, or items in a collaborative filtering task

[Excerpt from the paper, shown on the slide:]

Model            single neuron held out           25% of neurons held out
                 K = 10          K = 100          K = 10          K = 100
fa               0.261 ± 0.004   0.251 ± 0.004    0.261 ± 0.004   0.252 ± 0.004
g-emb (c=10)     0.230 ± 0.003   0.230 ± 0.003    0.242 ± 0.004   0.242 ± 0.004
g-emb (c=50)     0.226 ± 0.003   0.222 ± 0.003    0.233 ± 0.003   0.230 ± 0.003
ng-emb (c=10)    0.238 ± 0.004   0.233 ± 0.003    0.258 ± 0.004   0.244 ± 0.004

Table 2: Analysis of neural data: mean squared error and standard errors of neural activity (on the test set) for different models. Both ef-emb models significantly outperform fa; g-emb is more accurate than ng-emb.

Figure 1: Top view of the zebrafish brain, with blue circles at the location of the individual neurons. We zoom in on 3 neurons and their 50 nearest neighbors (small blue dots), visualizing the “synaptic weights” learned by a g-emb model (K = 100). The edge color encodes the inner product of the neural embedding vector and the context vectors ρn⊤αm for each neighbor m. Positive values are green, negative values are red, and the transparency is proportional to the magnitude. With these weights we can form hypotheses about how nearby neurons are connected.

[...] the lagged activity conditional on the simultaneous lags of surrounding neurons. We studied context sizes c ∈ {10, 50} and latent dimension K ∈ {10, 100}.

Models. We compare ef-emb to probabilistic factor analysis (fa), fitting K-dimensional factors for each neuron and K-dimensional factor loadings for each time frame. In fa, each entry of the data matrix is a Gaussian with mean equal to the inner product of the corresponding factor and factor loading.

Evaluation. We train each model on the first 95% of the time frames and hold out the last 5% for testing. With the test set, we use two types of evaluation. (1) Leave one out: for each neuron xi in the test set, we use the measurements of the other neurons to form predictions. For fa this means the other neurons are used to recover the factor loadings; for ef-emb this means the other neurons are used to construct the context. (2) Leave 25% out: we randomly split the neurons into 4 folds. Each neuron is predicted using the three sets of neurons that are out of its fold. (This is a more difficult task.) Note that in ef-emb, the missing data might change the size of the context of some neurons. See Table 5 in Appendix B for the choice of hyperparameters.


David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Nature 323 (1986), p. 9.

Geoffrey E Hinton. “Learning distributed representations of concepts”. In: Proceedings of the eighth annual conference of the cognitive science society. Vol. 1. Amherst, MA. 1986, p. 12.

Maja Rudolph et al. “Exponential Family Embeddings”. In: Advances in Neural Information Processing Systems. 2016, pp. 478–486.

Page 4:

Fitted Poisson Embeddings – Similarity Queries

Maruchan ramen soup       Yoplait strawberry y.      Mountain Dew soda        Dean Foods 1 % milk
Maruchan chicken ramen    Yoplait vanilla yogurt     Pepsi wild cherry soda   Dean Foods 2 % milk
Maruchan ramen, ls.       Yoplait cherry yogurt      Dr Pepper soda           Dean Foods fat free milk
Kemps chocolate milk      Yoplait blueberry yogurt   Martin’s potato chips    Danone fat free yoghurt

[t-SNE visualization of the fitted item embeddings (axes: t-SNE component 1 vs. t-SNE component 2)]

• representations learned from shopping data (counts not text)

Page 5:

Exponential Family Embeddings

• encapsulates the main ideas of neural language models:
  • each observation is modeled conditionally on a context
  • the conditional distributions come from the exponential family
  • each object has two embeddings: an embedding vector ρ and a context vector α

Yoshua Bengio et al. “A neural probabilistic language model”. In: Journal of Machine Learning Research 3.Feb (2003), pp. 1137–1155.

Tomas Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Neural Information Processing Systems. 2013, pp. 3111–3119.

Jeffrey Pennington et al. “GloVe: Global Vectors for Word Representation”. In: Conference on Empirical Methods in Natural Language Processing. Vol. 14. 2014, pp. 1532–1543.

Page 6:

Exponential Family Embeddings

• use exponential families for the conditional of each data point,

xi |xci ∼ ExpFam(ηi(xci), t(xi))

• the natural parameter combines the embedding and context vectors,

ηi(xci) = fi( ρ[i]⊤ ∑_{j∈ci} α[j] xj )

• the exponential family embedding (ef-emb) has latent variables for each index: an embedding ρ[i] and a context vector α[i]
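To make the notation concrete, here is a minimal numerical sketch (not the authors' code) of how the natural parameter ηi is assembled from the embedding ρ[i], the context vectors α[j], and the observed context values; the function and array names, and the identity link, are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: natural parameter eta_i for one data point.
# rho[i]  : embedding vector of object i   (length K)
# alpha[j]: context vector of object j     (length K)
# x       : observed data values           (length N)
# context : indices j in c_i (the context of data point i)
def natural_parameter(i, context, x, rho, alpha, link=lambda z: z):
    context_sum = np.zeros(alpha.shape[1])
    for j in context:
        context_sum += alpha[j] * x[j]        # sum_{j in c_i} alpha[j] * x_j
    return link(rho[i] @ context_sum)         # f_i( rho[i]^T sum ... )

# toy example: 5 objects, K = 3
rng = np.random.default_rng(0)
rho, alpha = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
x = rng.poisson(1.0, size=5).astype(float)
print(natural_parameter(0, context=[1, 2, 3], x=x, rho=rho, alpha=alpha))
```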

Page 7:

Pseudo-likelihood

• combine these ingredients in a “pseudo-likelihood”

L(ρ, α) = ∑_{i=1}^{n} ( ηi⊤ t(xi) − a(ηi) ) + log p(ρ) + log p(α)

• fit with stochastic optimization; exponential families simplify the gradients (a small sketch follows below).

• (Stochastic gradients give justification to NN ideas like “negative sampling.”)

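As a rough illustration (a sketch under assumptions, not the authors' implementation), here is the objective written out for a Gaussian conditional with unit variance, where t(x) = x and a(η) = η²/2, and with Gaussian priors contributing L2 terms; the regularization strength lam is an assumed hyperparameter.

```python
import numpy as np

# Sketch of the pseudo-likelihood for a Gaussian ef-emb with unit variance.
def pseudo_log_likelihood(x, contexts, rho, alpha, lam=1.0):
    total = 0.0
    for i, ctx in enumerate(contexts):
        context_sum = np.zeros(alpha.shape[1])
        for j in ctx:
            context_sum += alpha[j] * x[j]
        eta = rho[i] @ context_sum
        total += eta * x[i] - 0.5 * eta ** 2                            # eta_i^T t(x_i) - a(eta_i)
    total += -0.5 * lam * (np.sum(rho ** 2) + np.sum(alpha ** 2))       # log p(rho) + log p(alpha), up to constants
    return total
```

In practice one would subsample terms of the outer sum (and, for Bernoulli conditionals, the zero entries) and follow noisy gradients of this objective, which is where the connection to negative sampling comes from.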

Barry C Arnold, Enrique Castillo, Jose Maria Sarabia, et al. “Conditionally specified distributions: an introduction”. In: Statistical Science 16.3 (2001), pp. 249–274.

Page 8:

Exponential Family Embeddings

In summary, an EF-Emb has 3 ingredients (context, conditional distribution, parameterization)

• Multinomial embedding for text (similar to CBOW)

• Poisson embeddings for shopping data (or movie ratings data)

• Bernoulli embeddings for text (related to word2vec)

Tomas Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Neural Information Processing Systems. 2013, pp. 3111–3119.

Tomas Mikolov et al. “Efficient estimation of word representations in vector space”. In: ICLR Workshop Proceedings. arXiv:1301.3781 (2013).

Page 9:

Multinomial embeddings for text

• observations xi : one-hot vectors

• context ci : index of words before and after

• each word is associated with two embedding vectors:
  • embedding vector ρv ∈ RK
  • context vector αv ∈ RK

• Categorical distribution on xi |xci

• natural parameter (log probability):

ηiv = ρv⊤ ∑_{j∈ci, w∈V} αw xjw
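A minimal sketch of this conditional (illustrative variable names, not the authors' code): summing the context vectors of the surrounding words gives the natural parameters, and a softmax turns them into a categorical distribution over the vocabulary.

```python
import numpy as np

# rho, alpha: V x K arrays of embedding and context vectors (assumed already fitted).
def word_probabilities(context_word_ids, rho, alpha):
    context_sum = alpha[context_word_ids].sum(axis=0)   # sum of context vectors of the surrounding words
    eta = rho @ context_sum                             # eta_iv = rho_v^T * context_sum (log probabilities)
    eta -= eta.max()                                     # for numerical stability
    p = np.exp(eta)
    return p / p.sum()                                   # softmax over the vocabulary
```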

Page 10:

Poisson embeddings for movie ratings and shopping data

• observations: counts

• context: other movies same user rated, other items same user purchased

• each item is associated with two embedding vectors:
  • embedding vector ρv ∈ RK
  • context vector αv ∈ RK

• Poisson distribution on xui |xci

• natural parameter (log rate):

ηui = ρi⊤ ∑_{j∈ci} αj xuj
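A small sketch for the Poisson case (assumed names and data layout): the natural parameter is the log rate, so exponentiating it gives the expected count of an item given the rest of the basket.

```python
import numpy as np

# basket_counts: dict mapping item id -> purchase count for one shopping trip (assumed structure).
def poisson_rate(item, basket_counts, rho, alpha):
    context_sum = np.zeros(rho.shape[1])
    for j, count in basket_counts.items():
        if j != item:                        # the context is the rest of the basket
            context_sum += alpha[j] * count
    eta = rho[item] @ context_sum            # natural parameter = log rate
    return np.exp(eta)                       # expected count of `item`
```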

Page 11:

Results – Poisson Embeddings – Better Held-out Log Likelihood

• Movie Ratings: MovieLens-100k.

• Market Baskets: IRI dataset. Over 100,000 baskets of 8000 distinct items

Model           Market Baskets            Movie Ratings
                K = 20      K = 100       K = 20      K = 100
Poisson Emb.    −7.11       −6.95         −5.69       −5.73
Poisson Fact.   −7.74       −7.63         −5.80       −5.86
Poisson PCA     −8.31       −11.01        −5.92       −7.50

P. Gopalan, J. Hofman, and D. M. Blei. “Scalable recommendation with hierarchical Poisson factorization”. In: Uncertainty in Artificial Intelligence. 2015.

Michael Collins, Sanjoy Dasgupta, and Robert E Schapire. “A generalization of principal components analysis to the exponential family”. In: Neural Information Processing Systems. 2001, pp. 617–624.

Page 12:

Bernoulli embeddings

• breaks up one-hot constraint of word indicators in text

• instead of a Categorical with its expensive normalization, use conditional Bernoullis to model individual entries of the data matrix

• with biased SGD (modeling ones, subsampling zeros) we get an objective that closely resembles word2vec (see the sketch below)

p(xiv | xci) = Bernoulli(piv)

piv = σ( ρv⊤ ∑_{j∈ci, w∈V} αw xjw )

• rest of the talk: extensions of Bernoulli embeddings
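As an illustration of the bullets above, here is a sketch of the per-position objective with subsampled zeros; the uniform negative-sampling distribution and the variable names are simplifying assumptions, not the exact scheme from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One Bernoulli-embedding term: the observed word v at position i is modeled as a one,
# and a few randomly drawn words are modeled as zeros (subsampled negatives).
def bernoulli_objective(v, context_word_ids, rho, alpha, n_negatives, rng):
    context_sum = alpha[context_word_ids].sum(axis=0)
    obj = np.log(sigmoid(rho[v] @ context_sum))                   # observed entry (a one)
    negatives = rng.integers(0, rho.shape[0], size=n_negatives)   # subsampled zero entries
    obj += np.sum(np.log(sigmoid(-rho[negatives] @ context_sum)))
    return obj
```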

Tomas Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Neural Information Processing Systems. 2013, pp. 3111–3119.

Page 13:

Why Embeddings?

• as input features in downstream NLP tasks

• as output layer in deep models for word prediction tasks

• document classification

Ronan Collobert et al. “Natural language processing (almost) from scratch”. In: Journal of Machine Learning Research 12.Aug (2011), pp. 2493–2537.

Jason Weston et al. “Deep learning via semi-supervised embedding”. In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655.

Matt Taddy. “Document classification by inversion of distributed language representations”. In: arXiv preprint arXiv:1504.07295 (2015).

Page 14:

Why Embeddings?

• This talk: As descriptive statistic of text for computational social science

Page 15:

US Congressional Record (1858 - 2009)

Page 16:

The Meaning of Words Changes - Computer

computer

1858            1986
computer        computer
draftsman       software
draftsmen       computers
copyist         copyright
photographer    technological
computers       innovation
copyists        mechanical
janitor         hardware
accountant      technologies
bookkeeper      vehicles

Maja Rudolph and David Blei. “Dynamic Embeddings for Language Evolution”. In: arXiv:1703.08052 (2017).

Page 17:

Dynamic Bernoulli Embeddings

Page 18:

Dynamic Bernoulli Embeddings

• Divide the corpus into time slices t = 1, …, T

• E.g. divide speeches from 1858 - 2009 into 76 time slices, 2 years each

• Model semantic changes using a Gaussian random walk on the embedding vectors ρ(t)v

• Fit objective using stochastic gradients

Page 19:

Dynamic Bernoulli Embeddings

• ρ(0)v ∼ N(0, (1/λρ(0)) I)

• ρ(t)v ∼ N(ρ(t−1)v, (1/λρ) I)
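Concretely, these priors contribute a random-walk penalty that ties consecutive time slices together; a minimal sketch of the log-prior (up to additive constants), with assumed precision parameters:

```python
import numpy as np

# rho_t: fitted dynamic embeddings, shape (T, V, K); one slice per time step.
def random_walk_log_prior(rho_t, lam0=1.0, lam=1.0):
    logp = -0.5 * lam0 * np.sum(rho_t[0] ** 2)     # rho^(0)_v ~ N(0, I / lam0)
    diffs = rho_t[1:] - rho_t[:-1]                  # rho^(t)_v - rho^(t-1)_v
    logp += -0.5 * lam * np.sum(diffs ** 2)         # rho^(t)_v ~ N(rho^(t-1)_v, I / lam)
    return logp
```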

Page 20:

Results - Held-out Likelihood

Senate speeches

Model                context size 2     context size 8
s-emb                −2.409 ± 0.001     −2.286 ± 0.001
t-emb                −2.444 ± 0.001     −2.458 ± 0.001
d-emb [this work]    −2.340 ± 0.001     −2.282 ± 0.001

N ≈ 14M, V = 25k, K = 100

Maja Rudolph et al. “Exponential Family Embeddings”. In: Advances in Neural Information Processing Systems. 2016, pp. 478–486.

William L Hamilton, Jure Leskovec, and Dan Jurafsky. “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change”. In: arXiv preprint arXiv:1605.09096 (2016).

Page 21:

Dynamic Embeddings — U.S. Senate Speeches (1858 - 2009)

Page 22:

Grouped data

• What if the data is grouped differently?

• How do we share statistical strength when we cannot exploit dynamics?

Page 23:

Structured Embedding Models for Grouped Data

• Goal: uncover how word usage differs between different groups

                    data      embedding of    groups   grouped by          size
ArXiv abstracts     text      15k terms       19       subject areas       15M words
Senate speeches     text      15k terms       83       home state/party    20M words
Shopping data       counts    5.5k items      12       months              0.5M trips

Maja Rudolph, Francisco Ruiz, and David Blei. “Structured Embedding Models for Grouped Data”. In: Advances in Neural Information Processing Systems. 2017.

Page 24:

Hierarchical Embedding Model

[Graphical model: per-group embeddings ρ(s)v and shared context vectors αv generate the grouped data X(s); the global embeddings ρ(0)v sit above the groups; plates over the V terms and the S groups.]

ρ(s)v ∼ N(ρ(0)v, σ²ρ I)
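In other words, each group's embedding is a Gaussian perturbation of a shared global embedding. A minimal sketch of the corresponding log-prior term (array shapes and names are assumptions):

```python
import numpy as np

# rho_groups: (S, V, K) group-specific embeddings; rho_global: (V, K) global embeddings.
def hierarchical_log_prior(rho_groups, rho_global, sigma_rho=1.0):
    diffs = rho_groups - rho_global[None, :, :]          # broadcast the global embedding over groups
    return -0.5 * np.sum(diffs ** 2) / sigma_rho ** 2    # rho^(s)_v ~ N(rho^(0)_v, sigma_rho^2 I), up to constants
```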

Page 25:

Amortized Embedding Model

[Graphical model: same structure as the hierarchical model, but each group-specific embedding is a deterministic function of the global embedding.]

ρ(s)v = fs(ρ(0)v)

Page 26:

Neural Network maps Global Embeddings to Group-Specific Embeddings

[Diagram: a neural network with group-specific weights W(s)1, W(s)2, … maps the input word vector ρv to the group-specific output word vector ρ(s)v.]

• We compare feed-forward NNs and ResNet architectures.

Kaiming He et al. “Deep residual learning for image recognition”. In: arXiv preprint arXiv:1512.03385 (2015).
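A minimal sketch of the amortization map, written as a two-layer network with group-specific weights and an optional residual connection; the layer sizes, tanh nonlinearity, and names are assumptions, not the exact architecture from the paper.

```python
import numpy as np

# Map a global word vector rho_v (length K) to the group-specific vector rho_v^(s).
# W1, W2: group-specific weight matrices, each K x K (assumed shapes).
def amortized_embedding(rho_global_v, W1, W2, residual=True):
    h = np.tanh(W1 @ rho_global_v)       # hidden layer
    out = W2 @ h                          # group-specific output
    if residual:
        out = out + rho_global_v          # ResNet-style variant: a learned perturbation of the global vector
    return out
```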

Page 27:

Results

Model                     ArXiv papers       Senate speeches    Shopping data
Global emb                −2.176 ± 0.005     −2.239 ± 0.002     −0.772 ± 0.000
Separated emb             −2.500 ± 0.012     −2.915 ± 0.004     −0.807 ± 0.002
s-emb                     −2.287 ± 0.007     −2.645 ± 0.002     −0.770 ± 0.001
s-emb (hierarchical)      −2.170 ± 0.003     −2.217 ± 0.001     −0.767 ± 0.000
s-emb (amortiz+feedf)     −2.153 ± 0.004     −2.484 ± 0.002     −0.774 ± 0.000
s-emb (amortiz+resnet)    −2.120 ± 0.004     −2.249 ± 0.002     −0.762 ± 0.000

Page 28:

Amortized Embedding of Intelligence

Page 29:

Which Words Does a Group Use Most Differently?

Amortized embeddings uncover which words are used most differently byRepublican Senators (red) and Democratic Senators (blue) from different states.

Page 30:

Summary

• Exponential family embeddings
  • conditionally specified models
  • learn distributed representations
  • Bernoulli embeddings for text

• Structured embeddings
  • dynamics
  • hierarchical modeling
  • amortization

Page 31:

Discussion: How do these methods relate to embeddings 2.0?

• context vectors are global (slow learning?)

• embeddings specific to each group

• amortization: embeddings are constructed, not retrieved

• predefined partitioning of the data:
  • determines the groups
  • determines the number of representations
  • determines which representation needs to be accessed, and when

• a smarter system should be able to learn group structure

• can CLS theory help us design such models?

Page 32:

Contact info: Maja Rudolph ([email protected])

Page 33:

Multinomial embeddings for text

• observations xi : one-hot vectors

• context ci : index of words before and after

• each word is associated with two embedding vectors:
  • embedding vector ρv ∈ RK
  • context vector αv ∈ RK

xi |xci ∼ Categorical(ηi)

ηiv = ρv⊤ ∑_{j∈ci} α⊤ xj

Page 34:

Exponential family

• Exponential family with natural parameter η, sufficient statistic t(x), and log-normalizer a(η).

x ∼ ExpFam(η, t(x)), with density p(x) = h(x) exp{η⊤ t(x) − a(η)}

• e.g. Gaussian for reals, Poisson for counts, categorical for categorical, Bernoulli for binary...

• nice properties, derive algorithm once
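For a concrete instance, the Bernoulli distribution rewritten in this form (a standard derivation, added here for reference):

```latex
p(x) = p^{x}(1-p)^{1-x}
     = \exp\Big\{ x \,\log\tfrac{p}{1-p} \;-\; \log\tfrac{1}{1-p} \Big\},
\qquad
\eta = \log\tfrac{p}{1-p}, \quad t(x) = x, \quad a(\eta) = \log(1 + e^{\eta}), \quad h(x) = 1 .
```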

Page 35:

Dynamic Bernoulli Embeddings

• xiv | xci ∼ Bern(piv)

• ηiv = log( piv / (1 − piv) )

• ηiv = ρv⊤ ( ∑_{j∈ci} ∑_{v′} αv′ xjv′ )

• αv ∼ N(0, (1/λα) I)

• ρ(0)v ∼ N(0, (1/λρ(0)) I)

• ρ(t)v ∼ N(ρ(t−1)v, (1/λρ) I)

Page 36:

Dynamic Embeddings — ACM abstracts (1951 - 2014)

Page 37:

values

1858               2000
values             values
fluctuations       sacred
value              inalienable
currencies         unique
fluctuation        preserving
depreciation       exemplified
fluctuating        principles
purchasing power   philanthropy
fluctuate          virtues
basis              historical

Page 38:

fine

1858           2004
fine           fine
luxurious      punished
finest         penitentiaries
coarse         imprisonment
beautiful      misdemeanor
imprisonment   punishable
finer          offense
lighter        guilty
weaves         convictions
spun           penitentiary

Page 39:

data (ACM)

1961            1969           1991           2011           2014
data            data           data           data           data
directories     repositories   voluminous     raw data       data streams
files           voluminous     raw data       voluminous     voluminous
bibliographic   lineage        repositories   data sources   raw data
formatted       metadata       data streams   data streams   warehouses
retrieval       snapshots      data sources   dws            dws
publishing      data streams   volumes        repositories   repositories
archival        raw data       dws            warehouses     data sources
archives        cleansing      dsms           marts          data mining
manuscripts     data mining    data access    volumes        marts

Page 40:

Detecting Words with Largest Drift

drift(v) = ‖ρ(T)v − ρ(0)v‖

words with largest drift

iraq          3.09     coin              2.39
tax cuts      2.84     social security   2.38
health care   2.62     fine              2.38
energy        2.55     signal            2.38
medicare      2.55     program           2.36
discipline    2.44     moves             2.35
text          2.41     credit            2.34
values        2.40     unemployment      2.34
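A minimal sketch of how such a ranking could be computed from fitted dynamic embeddings (the array and vocabulary names are assumptions):

```python
import numpy as np

# rho_t: fitted dynamic embeddings, shape (T+1, V, K); vocab: list of V word strings.
def words_with_largest_drift(rho_t, vocab, top=10):
    drift = np.linalg.norm(rho_t[-1] - rho_t[0], axis=1)   # drift(v) = ||rho_v^(T) - rho_v^(0)||
    order = np.argsort(-drift)                              # sort descending
    return [(vocab[i], float(drift[i])) for i in order[:top]]
```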

Page 41:

unemployment

1858           1940           2000
unemployment   unemployment   unemployment
unemployed     unemployed     jobless
depression     depression     rate
acute          alleviating    depression
deplorable     destitution    forecasts
alleviating    acute          crate
destitution    reemployment   upward
urban          deplorable     lag
employment     employment     economists
distressing    distress       predict