Post on 28-Mar-2015
The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling
Sinead Williamson, Chong Wang, Katherine A. Heller, David M. Blei
Presented by Eric Wang, 9/16/2011
Introduction
• Latent Dirichlet Allocation (LDA) is a powerful and ubiquitous topic modeling framework.
• Incorporating the hierarchical Dirichlet process (HDP) into LDA (HDP-LDA) allows more flexible topic modeling by also estimating the global topic proportions.
• A drawback of HDP-LDA is that a topic that is rare globally will also have a low expected proportion within each document.
• The authors propose a model that allows a rare topic to still have large mass within individual documents.
Hierarchical Dirichlet Process
• The hierarchical Dirichlet process (HDP) is a prior for Bayesian nonparametric mixed membership modeling of data groups.
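• Hierarchically, it can be defined as
    $G_0 \sim \mathrm{DP}(\zeta, H), \qquad G_m \sim \mathrm{DP}(\tau, G_0),$
where m indexes the data group, H is the base measure, and $\zeta$ and $\tau$ are concentration parameters.
• In the HDP, the expectation of the mixing weights in $G_m$ is the vector of mixing weights in $G_0$. In practice, the mixing weights in $G_0$ are therefore the global average of the mixture memberships.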
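The coupling between global and group-level weights can be checked numerically. The sketch below uses the standard finite approximation in which one group's weights are drawn from a Dirichlet whose parameter is the concentration times the global weights. The function names, the three-topic weight vector, and the concentration value are hypothetical choices for illustration only.

```python
import random


def dirichlet(shape, rng):
    """Draw from a Dirichlet by normalising independent Gamma(shape_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in shape]
    total = sum(draws)
    return [d / total for d in draws]


def group_weights(tau, global_weights, rng):
    """Finite approximation to one group's weights: Dirichlet(tau * global_weights)."""
    return dirichlet([tau * w for w in global_weights], rng)


def average_group_weights(tau, global_weights, n_groups, rng):
    """Average the per-group weights over n_groups simulated groups."""
    sums = [0.0] * len(global_weights)
    for _ in range(n_groups):
        for k, w in enumerate(group_weights(tau, global_weights, rng)):
            sums[k] += w
    return [s / n_groups for s in sums]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical global weights; the last topic is globally rare.
    global_weights = [0.70, 0.25, 0.05]
    # The averages should land close to the global weights themselves.
    print(average_group_weights(5.0, global_weights, 2000, rng))
```

Under these assumptions the globally rare topic also has a small expected weight in every group, which is the behavior the focused topic model is designed to relax.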
Indian Buffet Process
• The Indian Buffet Process (IBP) defines a distribution over binary matrices with an infinite number of columns and a finite number of non-zero entries.
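• Hierarchically, it is defined as
    $\pi_k \sim \mathrm{Beta}(\alpha/K, 1), \qquad b_{mk} \sim \mathrm{Bernoulli}(\pi_k),$
in the limit $K \to \infty$, where m and k denote the rows and columns of the binary matrix b. It can be represented via a stick-breaking construction
    $\mu_k \sim \mathrm{Beta}(\alpha, 1), \qquad \pi_k = \prod_{j=1}^{k} \mu_j, \qquad b_{mk} \sim \mathrm{Bernoulli}(\pi_k).$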
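A truncated version of the stick-breaking representation is easy to simulate. The sketch below assumes the standard construction (a Beta(alpha, 1) stick per column, column probabilities as running products, Bernoulli entries) and cuts the matrix off at a fixed number of columns. The function name and the truncation are assumptions for illustration.

```python
import random


def ibp_stick_breaking(alpha, n_rows, n_cols, rng):
    """Truncated stick-breaking draw of an IBP binary matrix.

    mu_k ~ Beta(alpha, 1), pi_k = mu_1 * ... * mu_k, b[m][k] ~ Bernoulli(pi_k).
    Returns the column probabilities pi and the n_rows x n_cols matrix b.
    """
    pi, stick = [], 1.0
    for _ in range(n_cols):
        stick *= rng.betavariate(alpha, 1.0)  # running product of the sticks
        pi.append(stick)
    b = [[1 if rng.random() < p else 0 for p in pi] for _ in range(n_rows)]
    return pi, b


if __name__ == "__main__":
    pi, b = ibp_stick_breaking(alpha=3.0, n_rows=5, n_cols=10, rng=random.Random(0))
    for row in b:
        print(row)
```

Because the column probabilities are products of numbers below one, they decrease with the column index, so later columns are switched on in fewer and fewer rows.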
IBP Compound Dirichlet Process
• Combining the HDP and the IBP into a single prior yields an infinite “spike-and-slab” prior (ICD).
• A spike distribution (IBP) determines which variables are drawn from the slab (DP).
• The model assumes the following generative process
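IBP Compound Dirichlet Process
• The atom masses of data group m are Dirichlet distributed as follows
    $\theta_m \sim \mathrm{Dirichlet}(b_m \odot \phi),$
where $\phi$ is the vector of relative masses of the atoms and $\odot$ denotes the elementwise product.
• In this construction, the $\theta_m$ are the topic proportions for document m and $b_m$ is a binary vector indicating usage of the dictionary elements.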
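The effect of masking a Dirichlet with a binary vector can be sketched as follows. The code assumes the proportions are Dirichlet distributed with parameter equal to the elementwise product of the binary vector and a vector of positive relative masses. The names `icd_proportions`, `b_m`, and `phi`, and the example values, are hypothetical.

```python
import random


def icd_proportions(b_m, phi, rng):
    """Draw proportions from a Dirichlet restricted to the atoms with b_m[k] == 1.

    Inactive atoms receive exactly zero mass; the active atoms share the
    mass according to their relative masses phi[k].
    """
    if len(b_m) != len(phi):
        raise ValueError("b_m and phi must have the same length")
    active = [k for k, on in enumerate(b_m) if on]
    if not active:
        raise ValueError("at least one atom must be active")
    # Normalised Gamma(phi_k, 1) draws over the active atoms give a Dirichlet draw.
    draws = {k: rng.gammavariate(phi[k], 1.0) for k in active}
    total = sum(draws.values())
    return [draws[k] / total if k in draws else 0.0 for k in range(len(phi))]


if __name__ == "__main__":
    rng = random.Random(0)
    phi = [2.0, 2.0, 2.0]   # hypothetical relative masses
    b_m = [1, 0, 1]         # this group uses atoms 0 and 2 only
    print(icd_proportions(b_m, phi, rng))
```

In this sketch, how often an atom is switched on across groups plays no role once the binary vector is given: an atom that is rarely active can still take a large share of the mass in a group that uses it.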
Focused Topic Models
• The authors use the ICD to develop the Focused Topic Model (FTM).
• In this framework, a global distribution over topics is drawn and shared over all documents as in HDP-LDA.
• Each document selects a subset of topics from the global menu. The subset is determined by the document's binary vector $b_m$. Since the binary vector is independent of the global topic proportions, topics that are rare globally can still make up a large proportion of individual documents.
Focused Topic Models
• The generative process for the FTM is as follows
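A generative process of this kind can be simulated under a finite truncation. The sketch below is an illustration, not a transcription of the slide: the hyperparameter names (`alpha`, `gamma`, `eta`), the fixed document length, the truncation to `n_topics`, and the handling of documents with no active topic are all assumptions.

```python
import random


def sample_ftm_corpus(n_docs, n_topics, vocab_size, doc_len,
                      alpha, gamma, eta, rng):
    """Simulate documents from a truncated focused-topic-style process."""
    # Global quantities shared by all documents.
    pi, stick = [], 1.0
    for _ in range(n_topics):
        stick *= rng.betavariate(alpha, 1.0)   # IBP stick-breaking probabilities
        pi.append(stick)
    phi = [rng.gammavariate(gamma, 1.0) for _ in range(n_topics)]  # relative masses
    topics = []
    for _ in range(n_topics):
        g = [rng.gammavariate(eta, 1.0) for _ in range(vocab_size)]
        total = sum(g)
        topics.append([x / total for x in g])  # symmetric Dirichlet over words

    docs = []
    for _ in range(n_docs):
        # Binary vector: which topics this document may use.
        b = [1 if rng.random() < p else 0 for p in pi]
        g = [rng.gammavariate(phi[k], 1.0) if b[k] else 0.0
             for k in range(n_topics)]
        total = sum(g)
        if total == 0.0:
            # No active topic (or numerically zero mass): empty document.
            docs.append({"b": b, "theta": [0.0] * n_topics, "words": []})
            continue
        theta = [x / total for x in g]         # proportions over active topics
        z = rng.choices(range(n_topics), weights=theta, k=doc_len)
        words = [rng.choices(range(vocab_size), weights=topics[k])[0] for k in z]
        docs.append({"b": b, "theta": theta, "words": words})
    return pi, phi, topics, docs


if __name__ == "__main__":
    pi, phi, topics, docs = sample_ftm_corpus(
        n_docs=5, n_topics=8, vocab_size=30, doc_len=20,
        alpha=3.0, gamma=2.0, eta=0.5, rng=random.Random(1))
    for d in docs:
        print(d["b"], [round(t, 2) for t in d["theta"]])
```

Printing the binary vectors next to the proportions shows the intended decoupling: which topics a document uses is governed by the stick-breaking probabilities, while how much of the document each active topic occupies is governed by the relative masses.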
Posterior Inference
• To sample the topic indicator for word i in document m,
where the integral
has an analytical form and .
• This is an important point because it suggests a general framework that can be adapted to other applications.
Posterior Inference
• The joint probability of and the total number of words assigned to topic k is
and is log differentiable with respect to and .
• A hybrid Monte Carlo (HMC) algorithm is used to sample from their posteriors.
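Hybrid Monte Carlo only needs a log-density and its gradient, which is why log-differentiability matters here. The sketch below is a generic one-dimensional step with leapfrog integration and a Metropolis correction, run on a toy standard normal target. It is not the authors' sampler or target density, and the step size and number of leapfrog steps are arbitrary illustrative values.

```python
import math
import random


def hmc_step(x, log_p, grad_log_p, step, n_leapfrog, rng):
    """One hybrid Monte Carlo update for a scalar variable x."""
    p = rng.gauss(0.0, 1.0)                    # resample the momentum
    x_new, p_new = x, p
    p_new += 0.5 * step * grad_log_p(x_new)    # initial half step for momentum
    for i in range(n_leapfrog):
        x_new += step * p_new                  # full step for position
        if i < n_leapfrog - 1:
            p_new += step * grad_log_p(x_new)  # full step for momentum
    p_new += 0.5 * step * grad_log_p(x_new)    # final half step for momentum
    # Metropolis correction on the change in total energy.
    h_old = -log_p(x) + 0.5 * p * p
    h_new = -log_p(x_new) + 0.5 * p_new * p_new
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return x_new
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    x, samples = 0.0, []
    for _ in range(4000):
        # Toy target: standard normal, log p(x) = -x^2 / 2 up to a constant.
        x = hmc_step(x, lambda v: -0.5 * v * v, lambda v: -v, 0.15, 10, rng)
        samples.append(x)
    print(sum(samples) / len(samples))
```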
Posterior Inference
• The topic weights are sampled as
• And the binary topic indicators are sampled as
• Notice here that if a topic is used, it is automatically considered “active”, and additional (unused) topics can be activated.
Empirical Results
• The authors considered three different text datasets:
• All models were run for 1000 iterations, with the first 500 iterations discarded as burn-in.
Empirical Results
• Model Perplexity
• Topic Correlation
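Perplexity on held-out text is conventionally the exponential of the negative average per-word log-likelihood, so lower values are better. A minimal sketch, assuming the per-word held-out log-probabilities (natural log) have already been computed by some model; how the authors estimated them is not shown on the slide, and the function name is hypothetical.

```python
import math


def perplexity(word_log_probs):
    """Exponential of the negative mean per-word log-likelihood (natural log)."""
    if not word_log_probs:
        raise ValueError("need at least one word")
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


if __name__ == "__main__":
    # A model that spreads probability uniformly over 4 words.
    print(perplexity([math.log(0.25)] * 8))
```

A model that assigns uniform probability over V words has perplexity V, which gives a useful scale for reading reported numbers.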
Empirical Results
• In (a), the authors compare the number of topics a word appears in. The FTM has more concentrated topics.
• In (b), the authors show the number of documents the topics appear in. The plot illustrates that HDP has many topics that appear in only a few documents, while a significant portion of the FTM topics appear in many documents.
Discussion
• The authors have proposed a novel model called the IBP compound Dirichlet process (ICD) that decouples across-data topic prevalence from within-data topic proportions.
• The Focused Topic Model (FTM), built on the ICD, addresses several key shortcomings of HDP-LDA.
• In HDP-LDA, the global prevalence of a topic limits the proportion it can take up within a document, but in the FTM, globally rare topics can still make up a large proportion of a document.
• FTM shows improved perplexity relative to HDP.