Fast and Accurate Inference for Topic Models


Transcript of Fast and Accurate Inference for Topic Models

Fast and Accurate Inference for Topic Models

James Foulds, University of California, Santa Cruz

Presented at eBay Research Labs

2

Motivation

• There is an ever-increasing wealth of digital information available
  – Wikipedia
  – News articles
  – Scientific articles
  – Literature
  – Debates
  – Blogs, social media, …

• We would like automatic methods to help us understand this content

3

Motivation

• Personalized recommender systems
• Social network analysis
• Exploratory tools for scientists
• The digital humanities
• …

4

The Digital Humanities

5

Dimensionality reduction

The quick brown fox jumps over the sly lazy dog

6

Dimensionality reduction

The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 9 22 570 12]

7

Dimensionality reduction

The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 9 22 570 12]

Foxes  Dogs  Jumping
[40%   40%   20%]
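A tiny sketch of the two representations above: the document as a sequence of word IDs (high-dimensional) versus as topic proportions (low-dimensional). The IDs and proportions here are illustrative only, not the ones on the slide.

# Toy sketch: word-ID representation vs. topic-proportion representation.
doc = "the quick brown fox jumps over the sly lazy dog".split()
vocab = {w: i for i, w in enumerate(sorted(set(doc)))}   # word -> integer ID (arbitrary)
word_ids = [vocab[w] for w in doc]                       # one ID per token
topic_proportions = {"Foxes": 0.4, "Dogs": 0.4, "Jumping": 0.2}  # low-dimensional summary
print(word_ids, topic_proportions)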

8

Latent Variable Models

[Diagram: parameters Φ, latent variables Z, observed data X (the data points)]

Dimensionality(X) >> dimensionality(Z). Z is a bottleneck, which finds a compressed, low-dimensional representation of X.

Latent Feature Models for Social Networks

[Network diagram: Alice, Bob, Claire]

Latent Feature Models for Social Networks

[Network diagram: Alice (Cycling, Fishing, Running), Bob (Waltz, Running), Claire (Tango, Salsa)]

Miller, Griffiths, Jordan (2009): Latent Feature Relational Model

[Diagram: the network is summarized by a binary feature matrix Z, with rows Alice, Bob, Claire and columns Cycling, Fishing, Running, Tango, Salsa, Waltz]

14

Latent Representations

• Binary latent feature

           Cycling  Fishing  Running  Tango  Salsa  Waltz
  Alice       1        1        1
  Bob                           1                     1
  Claire                                   1      1

• Latent class (each entity belongs to exactly one class)

  Alice    1
  Bob      1
  Claire   1

• Mixed membership

           Cycling  Fishing  Running  Tango  Salsa  Waltz
  Alice      0.2      0.4      0.4
  Bob                          0.5                   0.5
  Claire                                  0.9    0.1


17

Latent Variable Models as Matrix Factorization

18

Miller, Griffiths, Jordan (2009): Latent Feature Relational Model

[Diagram: binary feature matrix Z with rows Alice, Bob, Claire and columns Cycling, Fishing, Running, Tango, Salsa, Waltz]

E[Y] = σ(Z W Zᵀ), where σ is the logistic function applied elementwise
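A small numerical sketch of this factorization. The matrices and the use of the logistic function follow the standard latent feature relational model formulation; the weight values are illustrative, not from the talk.

import numpy as np

# Z: binary entity-by-feature matrix (rows: Alice, Bob, Claire;
# columns: Cycling, Fishing, Running, Tango, Salsa, Waltz).
Z = np.array([[1, 1, 1, 0, 0, 0],   # Alice
              [0, 0, 1, 0, 0, 1],   # Bob
              [0, 0, 0, 1, 1, 0]])  # Claire

# W: feature-by-feature interaction weights (illustrative random values).
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 6))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Expected link probabilities between entities: E[Y] = sigma(Z W Z^T).
EY = sigmoid(Z @ W @ Z.T)
print(EY.round(2))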

21

Topics

Topic 1: Reinforcement learning
Topic 2: Learning algorithms
Topic 3: Character recognition

A distribution over all words in the dictionary

A vector of discrete probabilities (sums to one)

22

Topics

Topic 1: Reinforcement learning
Topic 2: Learning algorithms
Topic 3: Character recognition

Top 10 words


24

Latent Dirichlet Allocation (Blei et al., 2003)

• For each topic k
  – Draw its distribution over words φ(k) ~ Dirichlet(β)
• For each document d
  – Draw its topic proportions θ(d) ~ Dirichlet(α)
  – For each word w_d,n
    • Draw a topic assignment z_d,n ~ Discrete(θ(d))
    • Draw the word from the chosen topic: w_d,n ~ Discrete(φ(z_d,n))
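A minimal sketch of sampling a toy corpus from this generative process, using numpy only. The corpus sizes and hyper-parameter values below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 1000, 5, 50      # topics, vocabulary size, documents, words per document
alpha, beta = 0.1, 0.01          # Dirichlet hyper-parameters (illustrative values)

# For each topic k, draw its distribution over words: phi_k ~ Dirichlet(beta)
phi = rng.dirichlet(beta * np.ones(V), size=K)           # K x V

docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha * np.ones(K))           # topic proportions for document d
    z = rng.choice(K, size=N, p=theta_d)                   # topic assignment for each word
    w = np.array([rng.choice(V, p=phi[k]) for k in z])     # word drawn from its chosen topic
    docs.append(w)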


31

LDA as Matrix Factorization

[Diagram: the document-word matrix x factorizes as θ × φᵀ, document-topic proportions times topic-word distributions]
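In symbols (a standard LDA identity, added here for reference):

p(w_{d,n} = v \mid d) \;=\; \sum_{k=1}^{K} \theta_{dk}\,\phi_{kv}

i.e. the matrix of per-token word probabilities is the product of the document-topic matrix θ and the (transposed) topic-word matrix φ, which is exactly the low-rank factorization sketched on the slide.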

32

Let’s say we want to build an LDA topic model on Wikipedia

33

LDA on Wikipedia

[Figure: average log-likelihood (y-axis, roughly −780 to −600) vs. training time in seconds (x-axis, log scale from 10^2 to 10^5, i.e. from about 10 minutes to 12+ hours) for VB trained on 10,000 documents]

34

LDA on Wikipedia

[Figure: same axes; VB on 10,000 documents and VB on 100,000 documents]

35

LDA on Wikipedia

[Figure: same axes; VB on 10,000 and 100,000 documents. 1 full iteration = 3.5 days!]

36

LDA on Wikipedia

Stochastic variational inference

[Figure: same axes; adds Stochastic VB (all documents) alongside VB on 10,000 and 100,000 documents]

37

LDA on Wikipedia

Stochastic collapsed variational inference

[Figure: same axes; SCVB0 (all documents), Stochastic VB (all documents), VB (10,000 documents), VB (100,000 documents)]

38

Available tools

• Batch
  – VB: Blei et al. (2003)
  – Collapsed Gibbs sampling: Griffiths and Steyvers (2004)
  – Collapsed VB: Teh et al. (2007), Asuncion et al. (2009)
• Stochastic
  – VB: Hoffman et al. (2010, 2013)
  – Collapsed Gibbs sampling: Mimno et al. (2012) (partially collapsed VB/Gibbs hybrid)
  – Collapsed VB: ???


40

Collapsed Inference for LDA (Griffiths and Steyvers, 2004)

• Marginalize out the parameters, and perform inference on the latent variables only

[Diagram: θ and Φ are marginalized out of the model, leaving only Z]

41

Collapsed Inference for LDA (Griffiths and Steyvers, 2004)

• Marginalize out the parameters, and perform inference on the latent variables only
  – Simpler, faster, and fewer update equations
  – Better mixing for Gibbs sampling

42

Collapsed Inference for LDA (Griffiths and Steyvers, 2004)

• Collapsed Gibbs sampler
  – The conditional distribution for each topic assignment combines word-topic counts, document-topic counts, and overall topic counts (see the equation sketch below)
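The sampling equation itself is an image that did not survive the transcript. For reference, the standard collapsed Gibbs conditional for LDA (Griffiths and Steyvers, 2004) combines exactly these three kinds of counts:

P(z_{ij} = k \mid z^{\neg ij}, w) \;\propto\; \frac{N^{\Phi}_{w_{ij},k} + \beta}{N^{Z}_{k} + V\beta}\,\big(N^{\Theta}_{jk} + \alpha\big)

where N^Φ is the word-topic count, N^Θ the document-topic count, and N^Z the per-topic count (all excluding the current token), and V is the vocabulary size.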

46

Stochastic Optimization for ML

• Stochastic algorithms
  – While (not converged):
    • Process a subset of the dataset to estimate the update (see the sketch below)
    • Update the parameters
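A generic sketch of this pattern, written as plain minibatch stochastic gradient descent. The loss, data, and gradient function here are placeholders, not anything from the talk.

import numpy as np

def stochastic_optimize(X, y, grad_fn, n_iters=1000, batch_size=32):
    """Generic stochastic loop: estimate an update from a subset, then apply it."""
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        idx = rng.choice(len(X), size=batch_size, replace=False)   # subset of the data
        g = grad_fn(theta, X[idx], y[idx])                          # estimated update
        theta -= (1.0 / np.sqrt(t)) * g                             # decreasing step size
    return theta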

47

Stochastic Optimization for ML

• Stochastic gradient descent
  – Estimate the gradient
• Stochastic variational inference (Hoffman et al., 2010, 2013)
  – Estimate the natural gradient of the variational parameters
• Online EM (Cappé and Moulines, 2009)
  – Estimate E-step sufficient statistics

48

Goal: Build a Fast, Accurate, Scalable Algorithm for LDA

• Collapsed LDA
  – Easy to implement
  – Fast
  – Accurate
  – Mixes well / propagates information quickly
• Stochastic algorithms
  – Scalable
  – Quickly forget the random initialization
  – Memory requirements and update time are independent of the size of the data set
  – Can estimate topics before a single pass over the data is complete
• Our contribution: an algorithm which gets the best of both worlds

49

Variational Bayesian Inference

• An optimization strategy for performing posterior inference, i.e. estimating Pr(Z|X)

[Diagram: an approximating distribution Q is fit to the true posterior P by minimizing KL(Q || P)]
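In symbols (the standard variational formulation, stated here for reference):

q^{*}(Z) \;=\; \arg\min_{q \in \mathcal{Q}} \; \mathrm{KL}\big(q(Z)\,\|\,p(Z \mid X)\big)
       \;=\; \arg\max_{q \in \mathcal{Q}} \; \mathbb{E}_{q}[\log p(X, Z)] - \mathbb{E}_{q}[\log q(Z)]

so minimizing KL(Q || P) is equivalent to maximizing the evidence lower bound.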

52

Collapsed Variational Bayes (Teh et al., 2007)

• K-dimensional discrete variational distributions for each token
• Mean field assumption (sketched below)
• Improved variational bound
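Written out, the mean field assumption factorizes the variational distribution over the individual topic assignments (standard notation, assumed here rather than copied from the slides):

q(z) \;=\; \prod_{j}\prod_{i} q(z_{ij} \mid \gamma_{ij}), \qquad \gamma_{ijk} = q(z_{ij} = k)

so each token i in document j carries its own K-dimensional discrete distribution γ_{ij}.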

55

Collapsed VB: Mean field assumption

Each word token (columns) has its own variational distribution over topics (rows):

            The    Quick  Brown  Fox  Jumped  Over
  Foxes     0.33   0.5    0.5    1    0       0.2
  Dogs      0.33   0.3    0.5    0    0       0.2
  Jumping   0.33   0.2    0      0    1       0.6

56

Collapsed Variational Bayes (Teh et al., 2007)

• Collapsed Gibbs sampler: hard (0/1) topic assignments for each token

            The    Quick  Brown  Fox  Jumped  Over
  Foxes     0      1      1      1    0       0
  Dogs      1      0      0      0    0       0
  Jumping   0      0      0      0    1       1

57

Collapsed Variational Bayes (Teh et al., 2007)

• Collapsed Gibbs sampler
• CVB0 (Asuncion et al., 2009)

58

Collapsed Variational Bayes (Teh et al., 2007)

• CVB0 (Asuncion et al., 2009): soft (probabilistic) topic assignments for each token

            The    Quick  Brown  Fox  Jumped  Over
  Foxes     0.33   0.5    0.5    1    0       0.2
  Dogs      0.33   0.3    0.5    0    0       0.2
  Jumping   0.33   0.2    0      0    1       0.6


60

Collapsed Variational Bayes (Teh et al., 2007)

• CVB0 (Asuncion et al., 2009): after updating the variational distribution for one token ("Brown")

            The    Quick  Brown  Fox  Jumped  Over
  Foxes     0.33   0.5    0.9    1    0       0.2
  Dogs      0.33   0.3    0.1    0    0       0.2
  Jumping   0.33   0.2    0      0    1       0.6

61

CVB0 Statistics

• Simple sums over the variational parameters
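Concretely, in the usual CVB0 notation (these are the standard formulas from Asuncion et al., 2009; they are not reproduced in the transcript, and γ_{ijk} denotes the variational probability that token i of document j is assigned to topic k):

N^{\Theta}_{jk} = \sum_{i} \gamma_{ijk}, \qquad
N^{\Phi}_{wk} = \sum_{i,j \,:\, w_{ij} = w} \gamma_{ijk}, \qquad
N^{Z}_{k} = \sum_{i,j} \gamma_{ijk}

and the CVB0 update for a single token reuses the collapsed Gibbs form with these soft counts (with the token's own contribution subtracted out before updating):

\gamma_{ijk} \;\propto\; \frac{N^{\Phi}_{w_{ij},k} + \beta}{N^{Z}_{k} + V\beta}\,\big(N^{\Theta}_{jk} + \alpha\big)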

62

Stochastic Optimization for ML

• Stochastic gradient descent
  – Estimate the gradient
• Stochastic variational inference (Hoffman et al., 2010, 2013)
  – Estimate the natural gradient of the variational parameters
• Online EM (Cappé and Moulines, 2009)
  – Estimate E-step sufficient statistics
• Stochastic CVB0
  – Estimate the CVB0 statistics


64

Estimating CVB0 Statistics

• Pick a random word i from a random document j

• An unbiased estimator of the CVB0 statistics can then be formed from that single token's variational distribution (see the sketch below)
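A sketch of the standard construction (the scaling constants are not shown in the transcript): take the single token's variational distribution and scale it up by the number of tokens it stands in for.

\hat N^{\Theta}_{jk} = C_{j}\,\gamma_{ijk}, \qquad
\hat N^{\Phi}_{wk} = C\,\gamma_{ijk}\,\mathbf{1}[w_{ij} = w], \qquad
\hat N^{Z}_{k} = C\,\gamma_{ijk}

where C_j is the number of tokens in document j and C is the total number of tokens in the corpus.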

67

Stochastic CVB0

• In an online algorithm, we cannot store the variational parameters

• But we can update them!

68

Stochastic CVB0

• Keep an online average of the CVB0 statistics
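In symbols, using the single-token estimates from the previous slide and a step size ρ_t (this shows the form of the update only; the exact step-size scheduling details are not in the transcript):

N^{\Theta}_{j} \leftarrow (1 - \rho_t)\,N^{\Theta}_{j} + \rho_t\,\hat N^{\Theta}_{j}, \qquad
N^{\Phi} \leftarrow (1 - \rho_t)\,N^{\Phi} + \rho_t\,\hat N^{\Phi}, \qquad
N^{Z} \leftarrow (1 - \rho_t)\,N^{Z} + \rho_t\,\hat N^{Z}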

69

Extra Refinements

• Optional burn-in passes per document

• Minibatches

• Operating on sparse counts

70

Stochastic CVB0: Putting it all Together
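The algorithm box on this slide is an image that did not survive the transcript. Below is a simplified sketch of the flavor of SCVB0 described above: single-token "minibatches", one step-size schedule, and none of the burn-in or sparse-count refinements. The function name, defaults, and initialization are illustrative, not the paper's reference implementation.

import numpy as np

def scvb0(docs, K, V, alpha=0.1, beta=0.01, n_updates=200000, seed=0):
    """Simplified stochastic CVB0 for LDA; docs is a list of integer word-id arrays."""
    rng = np.random.default_rng(seed)
    C = sum(len(d) for d in docs)                          # total tokens in the corpus
    n_phi = rng.random((V, K))
    n_phi *= C / n_phi.sum()                               # word-topic statistics, scaled to C
    n_z = n_phi.sum(axis=0)                                # per-topic statistics
    n_theta = [len(d) * rng.dirichlet(np.ones(K)) for d in docs]   # document-topic statistics

    for t in range(1, n_updates + 1):
        rho = 1.0 / (t + 10) ** 0.7                        # step size; one schedule for simplicity
        j = rng.integers(len(docs))                        # random document j
        w = docs[j][rng.integers(len(docs[j]))]            # random token from document j

        # CVB0-style estimate of this token's topic distribution
        gamma = (n_phi[w] + beta) * (n_theta[j] + alpha) / (n_z + V * beta)
        gamma /= gamma.sum()

        # Online averages of the (scaled) single-token estimates of the statistics
        n_theta[j] = (1 - rho) * n_theta[j] + rho * len(docs[j]) * gamma
        n_phi *= (1 - rho)
        n_phi[w] += rho * C * gamma
        n_z = (1 - rho) * n_z + rho * C * gamma

    phi = (n_phi + beta) / (n_phi + beta).sum(axis=0)      # V x K topic-word estimates
    return phi

A real implementation would typically process minibatches of tokens, use separate step-size schedules for the document and topic statistics, and add the burn-in and sparsity refinements listed above.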

71

Experimental Results – Large Scale


73

Experimental Results – Small Scale

• Real-time or near real-time results are important for exploratory data analysis (EDA) applications

• Human participants were shown the top ten words from each topic

74

Experimental Results – Small Scale

[Bar chart: mean number of errors (human evaluation of the top ten topic words), SCVB0 vs. SVB, on NIPS (5 seconds of training) and New York Times (60 seconds of training); y-axis 0 to 4.5. Standard deviations: 1.1, 1.2, 1.0, 2.4]

75

Convergence Analysis

• Theorem: with an appropriate sequence of step sizes, SCVB0 converges to a stationary point of the MAP objective, with adjusted hyper-parameters

76

Convergence Analysis

• Step 1) An alternative derivation of “batch SCVB0” as an EM algorithm for MAP estimation
  – EM statistics: sums of the E-step responsibilities over tokens
  – E-step: equivalent to the SCVB0 update, but with the hyper-parameters adjusted by one
  – M-step: synchronize the parameters (the estimated EM statistics) with the EM statistics
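The equations on these slides are images that did not survive the transcript. As a hedged reconstruction consistent with the note above (hyper-parameters shifted by one), the E-step responsibility for token i of document j would take the form

\gamma_{ijk} \;\propto\; \frac{N^{\Phi}_{w_{ij},k} + \beta - 1}{N^{Z}_{k} + V(\beta - 1)}\,\big(N^{\Theta}_{jk} + \alpha - 1\big)

i.e. the CVB0/SCVB0 update with α and β each reduced by one.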

79

Convergence Analysis

• Step 2) Stochastic CVB0 is a Robbins-Monro stochastic approximation algorithm for finding the fixed points of this EM algorithm
  – Goal: find the roots of a function
  – Observe a noisy measurement of the function
  – Move in the direction of the noisy measurement
  – Here, the function in question is the step that the EM algorithm takes
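For reference, the standard Robbins-Monro recursion for finding a root of a function f is

x_{t+1} = x_t + \rho_t\,\big(f(x_t) + \varepsilon_t\big), \qquad
\sum_t \rho_t = \infty, \quad \sum_t \rho_t^{2} < \infty

where f(x_t) + ε_t is the noisy measurement and ρ_t is the step-size sequence.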

84

Convergence Analysis

• Step 3) Show that the stochastic approximation algorithm converges
  – A Lyapunov function is an “objective function” for a stochastic approximation (SA) algorithm
  – The existence of such a function, with certain properties holding, is sufficient for convergence with an appropriate sequence of step sizes
  – We show that (the negative of the Lagrangian of) the EM lower bound is such a Lyapunov function

87

Future work

• Exploit sparsity

• Parallelization

• Nonparametric extensions

• Generalizations to other models?

88

Probabilistic Soft Logic (Lise Getoor’s research group, see psl.cs.umd.edu)

User-specified logical rules → Probabilistic model → Fast inference

Applications: structured prediction, entity resolution, collective classification, link prediction

92

Publications from my Thesis Work

Algorithm papers
• J. R. Foulds, L. Boyles, C. DuBois, P. Smyth and M. Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. KDD 2013.
• J. R. Foulds and P. Smyth. Annealing paths for the evaluation of topic models. UAI 2014.

Modeling papers
• J. R. Foulds and P. Smyth. Modeling scientific impact with topical influence regression. EMNLP 2013.
• J. R. Foulds, A. Asuncion, C. DuBois, C. T. Butts, P. Smyth. A dynamic relational infinite feature model for longitudinal social networks. AISTATS 2011.

93

Other publications
• C. DuBois, J. R. Foulds, P. Smyth. Latent set models for two-mode network data. ICWSM 2011.
• J. R. Foulds, N. Navaroli, P. Smyth, A. Ihler. Revisiting MAP estimation, message passing and perfect graphs. AISTATS 2011.
• J. R. Foulds and P. Smyth. Multi-instance mixture models and semi-supervised learning. SIAM SDM 2011.
• J. R. Foulds and E. Frank. Speeding up and boosting diverse density learning. Discovery Science, 2010.
• J. R. Foulds and E. Frank. A review of multi-instance learning assumptions. Knowledge Engineering Review, 25(1), 2010.
• J. R. Foulds and E. Frank. Revisiting multiple-instance learning via embedded instance selection. Australasian Joint Conference on Artificial Intelligence, 2008.
• J. R. Foulds and L. R. Foulds. A probabilistic dynamic programming model of rape seed harvesting. International Journal of Operational Research, 1(4), 2006.
• J. R. Foulds and L. R. Foulds. Bridge lane direction specification for sustainable traffic management. Asia-Pacific Journal of Operational Research, 23(2), 2006.

94

Thanks to my Collaborators

• My PhD advisor, Padhraic Smyth

• SCVB0 is also joint work with:
  – Levi Boyles
  – Chris DuBois
  – Max Welling

95

Thank You!

Questions?