Probabilistic Topic Models - Latent Dirichlet Allocation

Machine Learning Reading Group

March 2014


Related Papers

D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77-84, 2012.

D. Blei, A. Ng, M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.


Motivation

Searching and exploring documents based on the themes that run through them

Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme.

Probabilistic topic modeling: a suite of algorithms that aim to discover and annotate large archives of documents with thematic information.

Note: these algorithms, sometimes under different names, are also used for other data types (audio, images, video, ...)


Topic Models (1)

Unigram model: the words of every document are drawn independently from a single multinomial distribution

p(w) = \prod_n p(w_n)    (1)
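
A tiny worked example of equation (1), as a sketch in Python with NumPy; the vocabulary and probabilities below are made-up toy numbers:

    # Unigram model: every word is an independent draw from a single
    # multinomial over the vocabulary. Toy numbers, for illustration only.
    import numpy as np

    vocab = ["topic", "model", "word", "data"]
    p_word = np.array([0.4, 0.3, 0.2, 0.1])    # the single multinomial p(w)
    doc = ["topic", "model", "model", "data"]

    idx = [vocab.index(w) for w in doc]
    likelihood = np.prod(p_word[idx])          # p(w) = prod_n p(w_n)
    print(likelihood)                          # 0.4 * 0.3 * 0.3 * 0.1 = 0.0036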


Topic Models (2)

Mixture of unigrams: each document is generated by first choosing a topic z and then generating N words independently from the conditional multinomial p(w|z)

p(w) = \sum_z p(z) \prod_n p(w_n \mid z)    (2)
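
A matching sketch for equation (2); note that a single topic z is drawn once per document and then marginalised out (toy numbers again):

    # Mixture of unigrams: one topic per document.
    import numpy as np

    vocab = ["topic", "model", "word", "data"]
    p_z = np.array([0.6, 0.4])                      # topic prior p(z)
    p_w_given_z = np.array([[0.5, 0.3, 0.1, 0.1],   # p(w|z=0)
                            [0.1, 0.1, 0.4, 0.4]])  # p(w|z=1)

    doc = ["topic", "model", "data"]
    idx = [vocab.index(w) for w in doc]

    # p(w) = sum_z p(z) * prod_n p(w_n|z)
    likelihood = sum(p_z[z] * np.prod(p_w_given_z[z, idx]) for z in range(2))
    print(likelihood)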


Topic Models (3)

Probabilistic latent semantic indexing (PLSI): it captures the possibility that a document d may contain multiple topics

p(d, w_n) = p(d) \sum_z p(w_n \mid z)\, p(z \mid d)    (3)
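
A sketch of equation (3) with hypothetical fitted parameters; the key difference from the mixture of unigrams is the document-specific mixing proportions p(z|d), which let topics mix within a single document:

    # PLSI: topics mix *within* a document via p(z|d). Toy parameters.
    import numpy as np

    p_d = np.array([0.5, 0.5])                      # document prior p(d)
    p_z_given_d = np.array([[0.9, 0.1],             # p(z|d=0)
                            [0.2, 0.8]])            # p(z|d=1)
    p_w_given_z = np.array([[0.5, 0.3, 0.1, 0.1],   # p(w|z=0)
                            [0.1, 0.1, 0.4, 0.4]])  # p(w|z=1)

    d, w = 0, 1                                     # document 0, word id 1
    # p(d, w) = p(d) * sum_z p(w|z) * p(z|d)
    p_dw = p_d[d] * np.dot(p_z_given_d[d], p_w_given_z[:, w])
    print(p_dw)                                     # 0.5 * (0.9*0.3 + 0.1*0.1)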


Latent Dirichlet Allocation (1)

Latent Dirichlet Allocation (LDA): extends PLSI by introducing priors on the probability distributions

Better generalisability to new, unseen documents


Latent Dirichlet Allocation (2)

Generative Process

1. Randomly choose a distribution over topics.
2. For each word in the document (see the sketch below):
   - Randomly choose a topic from the distribution over topics chosen in step 1.
   - Randomly choose a word from the corresponding topic's distribution over the vocabulary.
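
A minimal sketch of this generative process with NumPy; the hyperparameter values and sizes are arbitrary choices for illustration:

    # LDA generative process: sample one toy document.
    import numpy as np

    rng = np.random.default_rng(0)
    V, K, N = 4, 2, 10                 # vocabulary size, topics, words per document
    alpha, eta = 0.5, 0.5              # Dirichlet hyperparameters (arbitrary here)

    beta = rng.dirichlet(eta * np.ones(V), size=K)  # K topics: distributions over words
    theta = rng.dirichlet(alpha * np.ones(K))       # step 1: topic proportions for this doc

    doc = []
    for _ in range(N):                 # step 2: for each word position
        z = rng.choice(K, p=theta)     #   2a: choose a topic from theta
        w = rng.choice(V, p=beta[z])   #   2b: choose a word from that topic
        doc.append(w)
    print(doc)                         # word ids of the sampled document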


Latent Dirichlet Allocation (3)

Graphical Model

[Figure: the LDA plate diagram is not reproduced in this transcript.]
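
For reference, the joint distribution that the plate diagram encodes (in the notation of Blei (2012), with topics \beta_k, per-document topic proportions \theta_d, and per-word topic assignments z_{d,n}):

p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})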


Latent Dirichlet Allocation (4)

Posterior Computation

The posterior is intractable, so it must be approximated. Two families of approximate inference algorithms are commonly used:

Variational inference

MCMC, e.g. collapsed Gibbs sampling (see the sketch below)
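
A minimal sketch of collapsed Gibbs sampling for LDA (in the style of Griffiths and Steyvers), not the variational algorithm of the original paper; the corpus, sizes, and hyperparameters below are toy choices:

    # Collapsed Gibbs sampling for LDA: resample each word's topic from
    # p(z=k | rest) proportional to (n_dk + alpha) * (n_kw + eta) / (n_k + V*eta).
    import numpy as np

    def lda_gibbs(docs, V, K, alpha=0.1, eta=0.01, iters=200, seed=0):
        rng = np.random.default_rng(seed)
        ndk = np.zeros((len(docs), K))           # document-topic counts
        nkw = np.zeros((K, V))                   # topic-word counts
        nk = np.zeros(K)                         # total words per topic
        z = [[int(rng.integers(K)) for _ in d] for d in docs]
        for d, doc in enumerate(docs):           # initialise counts from random z
            for n, w in enumerate(doc):
                k = z[d][n]
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for n, w in enumerate(doc):
                    k = z[d][n]                  # remove the current assignment
                    ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                    p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                    k = int(rng.choice(K, p=p / p.sum()))
                    z[d][n] = k                  # record the new assignment
                    ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        return ndk, nkw

    # Toy corpus over a 4-word vocabulary (word ids are made up)
    docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 0, 1]]
    ndk, nkw = lda_gibbs(docs, V=4, K=2)
    print(nkw)                                   # topic-word counts after sampling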


Latent Dirichlet Allocation (5)

Extensions

LDA can be extended by relaxing some of the original assumptions

Bag-of-words: Not suitable for language generation

Solution: Integrating syntax

Bag-of-documents: Not suitable for chronologically ordered documents

Solution: Dynamic topic models

Number of topics: Assumed to be known and fixed

Solution: Bayesian nonparametric topic models


Latent Dirichlet Allocation (6)

Open Issues

Evaluation and model checking: How to evaluate topic models when interpretability matters more than goodness of fit.

Visualization and user interfaces: Intuitive ways to visualise topics.

Data discovery: Seeking the help of domain experts to interpret and validate discovered topics.
