Fast and Accurate Inference for Topic Models


Transcript of Fast and Accurate Inference for Topic Models

Fast and Accurate Inference for Topic Models

James Foulds, University of California, Santa Cruz

Presented at eBay Research Labs

2

Motivation

• There is an ever-increasing wealth of digital information available
  – Wikipedia
  – News articles
  – Scientific articles
  – Literature
  – Debates
  – Blogs, social media, …

• We would like automatic methods to help us understand this content

3

Motivation

• Personalized recommender systems
• Social network analysis
• Exploratory tools for scientists
• The digital humanities
• …

4

The Digital Humanities

5

Dimensionality reduction

The quick brown fox jumps over the sly lazy dog

6

Dimensionality reduction

The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 9 22 570 12]

7

Dimensionality reduction

The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 9 22 570 12]

Foxes  Dogs  Jumping
[40%   40%   20%]
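A tiny sketch of the two representations above: the document as a sequence of word IDs (high-dimensional) versus as topic proportions (low-dimensional). The IDs and proportions here are illustrative only, not the ones on the slide.

# Toy sketch: word-ID representation vs. topic-proportion representation.
doc = "the quick brown fox jumps over the sly lazy dog".split()
vocab = {w: i for i, w in enumerate(sorted(set(doc)))}   # word -> integer ID (arbitrary)
word_ids = [vocab[w] for w in doc]                       # one ID per token
topic_proportions = {"Foxes": 0.4, "Dogs": 0.4, "Jumping": 0.2}  # low-dimensional summary
print(word_ids, topic_proportions)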

8

Latent Variable Models

[Diagram: parameters Φ, latent variables Z, observed data X (the data points)]

Dimensionality(X) >> dimensionality(Z). Z is a bottleneck, which finds a compressed, low-dimensional representation of X.

Latent Feature Models for Social Networks

[Network diagram: Alice, Bob, Claire]

Latent Feature Models for Social Networks

[Network diagram: Alice (Cycling, Fishing, Running), Bob (Waltz, Running), Claire (Tango, Salsa)]

Miller, Griffiths, Jordan (2009): Latent Feature Relational Model

[Diagram: the network is summarized by a binary feature matrix Z, with rows Alice, Bob, Claire and columns Cycling, Fishing, Running, Tango, Salsa, Waltz]

14

Latent Representations

• Binary latent feature

           Cycling  Fishing  Running  Tango  Salsa  Waltz
  Alice       1        1        1
  Bob                           1                     1
  Claire                                   1      1

• Latent class (each entity belongs to exactly one class)

  Alice    1
  Bob      1
  Claire   1

• Mixed membership

           Cycling  Fishing  Running  Tango  Salsa  Waltz
  Alice      0.2      0.4      0.4
  Bob                          0.5                   0.5
  Claire                                  0.9    0.1


17

Latent Variable Models as Matrix Factorization

18

Miller, Griffiths, Jordan (2009): Latent Feature Relational Model

[Diagram: binary feature matrix Z with rows Alice, Bob, Claire and columns Cycling, Fishing, Running, Tango, Salsa, Waltz]

E[Y] = σ(Z W Zᵀ), where σ is the logistic function applied elementwise
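A small numerical sketch of this factorization. The matrices and the use of the logistic function follow the standard latent feature relational model formulation; the weight values are illustrative, not from the talk.

import numpy as np

# Z: binary entity-by-feature matrix (rows: Alice, Bob, Claire;
# columns: Cycling, Fishing, Running, Tango, Salsa, Waltz).
Z = np.array([[1, 1, 1, 0, 0, 0],   # Alice
              [0, 0, 1, 0, 0, 1],   # Bob
              [0, 0, 0, 1, 1, 0]])  # Claire

# W: feature-by-feature interaction weights (illustrative random values).
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 6))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Expected link probabilities between entities: E[Y] = sigma(Z W Z^T).
EY = sigmoid(Z @ W @ Z.T)
print(EY.round(2))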

21

Topics

Topic 1: Reinforcement learning
Topic 2: Learning algorithms
Topic 3: Character recognition

A distribution over all words in the dictionary

A vector of discrete probabilities (sums to one)

22

Topics

Topic 1: Reinforcement learning
Topic 2: Learning algorithms
Topic 3: Character recognition

Top 10 words


24

Latent Dirichlet Allocation (Blei et al., 2003)

• For each topic k
  – Draw its distribution over words φ(k) ~ Dirichlet(β)
• For each document d
  – Draw its topic proportions θ(d) ~ Dirichlet(α)
  – For each word w_d,n
    • Draw a topic assignment z_d,n ~ Discrete(θ(d))
    • Draw the word from the chosen topic: w_d,n ~ Discrete(φ(z_d,n))
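A minimal sketch of sampling a toy corpus from this generative process, using numpy only. The corpus sizes and hyper-parameter values below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 1000, 5, 50      # topics, vocabulary size, documents, words per document
alpha, beta = 0.1, 0.01          # Dirichlet hyper-parameters (illustrative values)

# For each topic k, draw its distribution over words: phi_k ~ Dirichlet(beta)
phi = rng.dirichlet(beta * np.ones(V), size=K)           # K x V

docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha * np.ones(K))           # topic proportions for document d
    z = rng.choice(K, size=N, p=theta_d)                   # topic assignment for each word
    w = np.array([rng.choice(V, p=phi[k]) for k in z])     # word drawn from its chosen topic
    docs.append(w)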


31

LDA as Matrix Factorization

[Diagram: the document-word matrix x factorizes as θ × φᵀ, document-topic proportions times topic-word distributions]
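In symbols (a standard LDA identity, added here for reference):

p(w_{d,n} = v \mid d) \;=\; \sum_{k=1}^{K} \theta_{dk}\,\phi_{kv}

i.e. the matrix of per-token word probabilities is the product of the document-topic matrix θ and the (transposed) topic-word matrix φ, which is exactly the low-rank factorization sketched on the slide.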

32

Let’s say we want to build an LDA topic model on Wikipedia

33

LDA on Wikipedia

[Figure: average log-likelihood (y-axis, roughly −780 to −600) vs. training time in seconds (x-axis, log scale from 10^2 to 10^5, i.e. from about 10 minutes to 12+ hours) for VB trained on 10,000 documents]

34

LDA on Wikipedia

[Figure: same axes; VB on 10,000 documents and VB on 100,000 documents]

35

LDA on Wikipedia

[Figure: same axes; VB on 10,000 and 100,000 documents. 1 full iteration = 3.5 days!]

36

LDA on Wikipedia

Stochastic variational inference

[Figure: same axes; adds Stochastic VB (all documents) alongside VB on 10,000 and 100,000 documents]

37

LDA on Wikipedia

Stochastic collapsed variational inference

[Figure: same axes; SCVB0 (all documents), Stochastic VB (all documents), VB (10,000 documents), VB (100,000 documents)]

38

Available tools

• Batch
  – VB: Blei et al. (2003)
  – Collapsed Gibbs sampling: Griffiths and Steyvers (2004)
  – Collapsed VB: Teh et al. (2007), Asuncion et al. (2009)
• Stochastic
  – VB: Hoffman et al. (2010, 2013)
  – Collapsed Gibbs sampling: Mimno et al. (2012) (partially collapsed VB/Gibbs hybrid)
  – Collapsed VB: ???


40

Collapsed Inference for LDA (Griffiths and Steyvers, 2004)

• Marginalize out the parameters, and perform inference on the latent variables only

[Diagram: θ and Φ are marginalized out of the model, leaving only Z]

41

Collapsed Inference for LDA (Griffiths and Steyvers, 2004)

• Marginalize out the parameters, and perform inference on the latent variables only
  – Simpler, faster, and fewer update equations
  – Better mixing for Gibbs sampling

42

Collapsed Inference for LDA (Griffiths and Steyvers, 2004)

• Collapsed Gibbs sampler
  – The conditional distribution for each topic assignment combines word-topic counts, document-topic counts, and overall topic counts (see the equation sketch below)
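The sampling equation itself is an image that did not survive the transcript. For reference, the standard collapsed Gibbs conditional for LDA (Griffiths and Steyvers, 2004) combines exactly these three kinds of counts:

P(z_{ij} = k \mid z^{\neg ij}, w) \;\propto\; \frac{N^{\Phi}_{w_{ij},k} + \beta}{N^{Z}_{k} + V\beta}\,\big(N^{\Theta}_{jk} + \alpha\big)

where N^Φ is the word-topic count, N^Θ the document-topic count, and N^Z the per-topic count (all excluding the current token), and V is the vocabulary size.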

46

Stochastic Optimization for ML

• Stochastic algorithms
  – While (not converged):
    • Process a subset of the dataset to estimate the update (see the sketch below)
    • Update the parameters
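A generic sketch of this pattern, written as plain minibatch stochastic gradient descent. The loss, data, and gradient function here are placeholders, not anything from the talk.

import numpy as np

def stochastic_optimize(X, y, grad_fn, n_iters=1000, batch_size=32):
    """Generic stochastic loop: estimate an update from a subset, then apply it."""
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        idx = rng.choice(len(X), size=batch_size, replace=False)   # subset of the data
        g = grad_fn(theta, X[idx], y[idx])                          # estimated update
        theta -= (1.0 / np.sqrt(t)) * g                             # decreasing step size
    return theta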

47

Stochastic Optimization for ML

• Stochastic gradient descent
  – Estimate the gradient
• Stochastic variational inference (Hoffman et al., 2010, 2013)
  – Estimate the natural gradient of the variational parameters
• Online EM (Cappé and Moulines, 2009)
  – Estimate E-step sufficient statistics

48

Goal: Build a Fast, Accurate, Scalable Algorithm for LDA

• Collapsed LDA
  – Easy to implement
  – Fast
  – Accurate
  – Mixes well / propagates information quickly
• Stochastic algorithms
  – Scalable
  – Quickly forget the random initialization
  – Memory requirements and update time are independent of the size of the data set
  – Can estimate topics before a single pass over the data is complete
• Our contribution: an algorithm which gets the best of both worlds

49

Variational Bayesian Inference

• An optimization strategy for performing posterior inference, i.e. estimating Pr(Z|X)

[Diagram: an approximating distribution Q is fit to the true posterior P by minimizing KL(Q || P)]
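In symbols (the standard variational formulation, stated here for reference):

q^{*}(Z) \;=\; \arg\min_{q \in \mathcal{Q}} \; \mathrm{KL}\big(q(Z)\,\|\,p(Z \mid X)\big)
       \;=\; \arg\max_{q \in \mathcal{Q}} \; \mathbb{E}_{q}[\log p(X, Z)] - \mathbb{E}_{q}[\log q(Z)]

so minimizing KL(Q || P) is equivalent to maximizing the evidence lower bound.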

52

Collapsed Variational Bayes (Teh et al., 2007)

• K-dimensional discrete variational distributions for each token
• Mean field assumption (sketched below)
• Improved variational bound
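Written out, the mean field assumption factorizes the variational distribution over the individual topic assignments (standard notation, assumed here rather than copied from the slides):

q(z) \;=\; \prod_{j}\prod_{i} q(z_{ij} \mid \gamma_{ij}), \qquad \gamma_{ijk} = q(z_{ij} = k)

so each token i in document j carries its own K-dimensional discrete distribution γ_{ij}.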

55

Collapsed VB: Mean field assumption

Each word token (columns) has its own variational distribution over topics (rows):

            The    Quick  Brown  Fox  Jumped  Over
  Foxes     0.33   0.5    0.5    1    0       0.2
  Dogs      0.33   0.3    0.5    0    0       0.2
  Jumping   0.33   0.2    0      0    1       0.6

56

Collapsed Variational Bayes (Teh et al., 2007)

• Collapsed Gibbs sampler: hard (0/1) topic assignments for each token

            The    Quick  Brown  Fox  Jumped  Over
  Foxes     0      1      1      1    0       0
  Dogs      1      0      0      0    0       0
  Jumping   0      0      0      0    1       1

57

Collapsed Variational Bayes (Teh et al., 2007)

• Collapsed Gibbs sampler
• CVB0 (Asuncion et al., 2009)

58

Collapsed Variational Bayes (Teh et al., 2007)

• CVB0 (Asuncion et al., 2009): soft (probabilistic) topic assignments for each token

            The    Quick  Brown  Fox  Jumped  Over
  Foxes     0.33   0.5    0.5    1    0       0.2
  Dogs      0.33   0.3    0.5    0    0       0.2
  Jumping   0.33   0.2    0      0    1       0.6


60

Collapsed Variational Bayes (Teh et al., 2007)

• CVB0 (Asuncion et al., 2009): after updating the variational distribution for one token ("Brown")

            The    Quick  Brown  Fox  Jumped  Over
  Foxes     0.33   0.5    0.9    1    0       0.2
  Dogs      0.33   0.3    0.1    0    0       0.2
  Jumping   0.33   0.2    0      0    1       0.6

61

CVB0 Statistics

• Simple sums over the variational parameters
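Concretely, in the usual CVB0 notation (these are the standard formulas from Asuncion et al., 2009; they are not reproduced in the transcript, and γ_{ijk} denotes the variational probability that token i of document j is assigned to topic k):

N^{\Theta}_{jk} = \sum_{i} \gamma_{ijk}, \qquad
N^{\Phi}_{wk} = \sum_{i,j \,:\, w_{ij} = w} \gamma_{ijk}, \qquad
N^{Z}_{k} = \sum_{i,j} \gamma_{ijk}

and the CVB0 update for a single token reuses the collapsed Gibbs form with these soft counts (with the token's own contribution subtracted out before updating):

\gamma_{ijk} \;\propto\; \frac{N^{\Phi}_{w_{ij},k} + \beta}{N^{Z}_{k} + V\beta}\,\big(N^{\Theta}_{jk} + \alpha\big)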

62

Stochastic Optimization for ML

• Stochastic gradient descent
  – Estimate the gradient
• Stochastic variational inference (Hoffman et al., 2010, 2013)
  – Estimate the natural gradient of the variational parameters
• Online EM (Cappé and Moulines, 2009)
  – Estimate E-step sufficient statistics
• Stochastic CVB0
  – Estimate the CVB0 statistics


64

Estimating CVB0 Statistics

• Pick a random word i from a random document j

• An unbiased estimator of the CVB0 statistics can then be formed from that single token's variational distribution (see the sketch below)
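A sketch of the standard construction (the scaling constants are not shown in the transcript): take the single token's variational distribution and scale it up by the number of tokens it stands in for.

\hat N^{\Theta}_{jk} = C_{j}\,\gamma_{ijk}, \qquad
\hat N^{\Phi}_{wk} = C\,\gamma_{ijk}\,\mathbf{1}[w_{ij} = w], \qquad
\hat N^{Z}_{k} = C\,\gamma_{ijk}

where C_j is the number of tokens in document j and C is the total number of tokens in the corpus.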

67

Stochastic CVB0

• In an online algorithm, we cannot store the variational parameters

• But we can update them!

68

Stochastic CVB0

• Keep an online average of the CVB0 statistics
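In symbols, using the single-token estimates from the previous slide and a step size ρ_t (this shows the form of the update only; the exact step-size scheduling details are not in the transcript):

N^{\Theta}_{j} \leftarrow (1 - \rho_t)\,N^{\Theta}_{j} + \rho_t\,\hat N^{\Theta}_{j}, \qquad
N^{\Phi} \leftarrow (1 - \rho_t)\,N^{\Phi} + \rho_t\,\hat N^{\Phi}, \qquad
N^{Z} \leftarrow (1 - \rho_t)\,N^{Z} + \rho_t\,\hat N^{Z}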

69

Extra Refinements

• Optional burn-in passes per document

• Minibatches

• Operating on sparse counts

70

Stochastic CVB0: Putting it all Together
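The algorithm box on this slide is an image that did not survive the transcript. Below is a simplified sketch of the flavor of SCVB0 described above: single-token "minibatches", one step-size schedule, and none of the burn-in or sparse-count refinements. The function name, defaults, and initialization are illustrative, not the paper's reference implementation.

import numpy as np

def scvb0(docs, K, V, alpha=0.1, beta=0.01, n_updates=200000, seed=0):
    """Simplified stochastic CVB0 for LDA; docs is a list of integer word-id arrays."""
    rng = np.random.default_rng(seed)
    C = sum(len(d) for d in docs)                          # total tokens in the corpus
    n_phi = rng.random((V, K))
    n_phi *= C / n_phi.sum()                               # word-topic statistics, scaled to C
    n_z = n_phi.sum(axis=0)                                # per-topic statistics
    n_theta = [len(d) * rng.dirichlet(np.ones(K)) for d in docs]   # document-topic statistics

    for t in range(1, n_updates + 1):
        rho = 1.0 / (t + 10) ** 0.7                        # step size; one schedule for simplicity
        j = rng.integers(len(docs))                        # random document j
        w = docs[j][rng.integers(len(docs[j]))]            # random token from document j

        # CVB0-style estimate of this token's topic distribution
        gamma = (n_phi[w] + beta) * (n_theta[j] + alpha) / (n_z + V * beta)
        gamma /= gamma.sum()

        # Online averages of the (scaled) single-token estimates of the statistics
        n_theta[j] = (1 - rho) * n_theta[j] + rho * len(docs[j]) * gamma
        n_phi *= (1 - rho)
        n_phi[w] += rho * C * gamma
        n_z = (1 - rho) * n_z + rho * C * gamma

    phi = (n_phi + beta) / (n_phi + beta).sum(axis=0)      # V x K topic-word estimates
    return phi

A real implementation would typically process minibatches of tokens, use separate step-size schedules for the document and topic statistics, and add the burn-in and sparsity refinements listed above.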

71

Experimental Results – Large Scale


73

Experimental Results – Small Scale

• Real-time or near real-time results are important for exploratory data analysis (EDA) applications

• Human participants were shown the top ten words from each topic

74

Experimental Results – Small Scale

[Bar chart: mean number of errors (human evaluation of the top ten topic words), SCVB0 vs. SVB, on NIPS (5 seconds of training) and New York Times (60 seconds of training); y-axis 0 to 4.5. Standard deviations: 1.1, 1.2, 1.0, 2.4]

75

Convergence Analysis

• Theorem: with an appropriate sequence of step sizes, SCVB0 converges to a stationary point of the MAP objective, with adjusted hyper-parameters

76

Convergence Analysis

• Step 1) An alternative derivation of “batch SCVB0” as an EM algorithm for MAP estimation
  – EM statistics: sums of the E-step responsibilities over tokens
  – E-step: equivalent to the SCVB0 update, but with the hyper-parameters adjusted by one
  – M-step: synchronize the parameters (the estimated EM statistics) with the EM statistics
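The equations on these slides are images that did not survive the transcript. As a hedged reconstruction consistent with the note above (hyper-parameters shifted by one), the E-step responsibility for token i of document j would take the form

\gamma_{ijk} \;\propto\; \frac{N^{\Phi}_{w_{ij},k} + \beta - 1}{N^{Z}_{k} + V(\beta - 1)}\,\big(N^{\Theta}_{jk} + \alpha - 1\big)

i.e. the CVB0/SCVB0 update with α and β each reduced by one.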

79

Convergence Analysis

• Step 2) Stochastic CVB0 is a Robbins-Monro stochastic approximation algorithm for finding the fixed points of this EM algorithm
  – Goal: find the roots of a function
  – Observe a noisy measurement of the function
  – Move in the direction of the noisy measurement
  – Here, the function in question is the step that the EM algorithm takes
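For reference, the standard Robbins-Monro recursion for finding a root of a function f is

x_{t+1} = x_t + \rho_t\,\big(f(x_t) + \varepsilon_t\big), \qquad
\sum_t \rho_t = \infty, \quad \sum_t \rho_t^{2} < \infty

where f(x_t) + ε_t is the noisy measurement and ρ_t is the step-size sequence.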

84

Convergence Analysis

• Step 3) Show that the stochastic approximation algorithm converges
  – A Lyapunov function is an “objective function” for a stochastic approximation (SA) algorithm
  – The existence of such a function, with certain properties holding, is sufficient for convergence with an appropriate sequence of step sizes
  – We show that (the negative of the Lagrangian of) the EM lower bound is such a Lyapunov function

87

Future work

• Exploit sparsity

• Parallelization

• Nonparametric extensions

• Generalizations to other models?

88

Probabilistic Soft Logic (Lise Getoor’s research group, see psl.cs.umd.edu)

User-specified logical rules → Probabilistic model → Fast inference

Applications: structured prediction, entity resolution, collective classification, link prediction

92

Publications from my Thesis Work

Algorithm papers
• J. R. Foulds, L. Boyles, C. DuBois, P. Smyth and M. Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. KDD 2013.
• J. R. Foulds and P. Smyth. Annealing paths for the evaluation of topic models. UAI 2014.

Modeling papers
• J. R. Foulds and P. Smyth. Modeling scientific impact with topical influence regression. EMNLP 2013.
• J. R. Foulds, A. Asuncion, C. DuBois, C. T. Butts, P. Smyth. A dynamic relational infinite feature model for longitudinal social networks. AISTATS 2011.

93

Other publications
• C. DuBois, J. R. Foulds, P. Smyth. Latent set models for two-mode network data. ICWSM 2011.
• J. R. Foulds, N. Navaroli, P. Smyth, A. Ihler. Revisiting MAP estimation, message passing and perfect graphs. AISTATS 2011.
• J. R. Foulds and P. Smyth. Multi-instance mixture models and semi-supervised learning. SIAM SDM 2011.
• J. R. Foulds and E. Frank. Speeding up and boosting diverse density learning. Discovery Science, 2010.
• J. R. Foulds and E. Frank. A review of multi-instance learning assumptions. Knowledge Engineering Review, 25(1), 2010.
• J. R. Foulds and E. Frank. Revisiting multiple-instance learning via embedded instance selection. Australasian Joint Conference on Artificial Intelligence, 2008.
• J. R. Foulds and L. R. Foulds. A probabilistic dynamic programming model of rape seed harvesting. International Journal of Operational Research, 1(4), 2006.
• J. R. Foulds and L. R. Foulds. Bridge lane direction specification for sustainable traffic management. Asia-Pacific Journal of Operational Research, 23(2), 2006.

94

Thanks to my Collaborators

• My PhD advisor, Padhraic Smyth

• SCVB0 is also joint work with:
  – Levi Boyles
  – Chris DuBois
  – Max Welling

95

Thank You!

Questions?