Latent Dirichlet Allocation

Topic Recognition

Algorithmic Methods in the Humanities · June 23, 2016
Florian Becker

KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association

INSTITUTE OF THEORETICAL INFORMATICS · ALGORITHMICS GROUP

www.kit.edu


More and more text

Mass production of text:

> 4000 peer-reviewed papers / day

nearly 3 million blog posts / day

500 million tweets / day

Image source: http://www.passion-estampes.com/npe/newsletter-francois-schuiten.html


(Probabilistic) Topic Models - Intro

Automatically extract topics from documents

Organizing and searching of large collections of text

Algorithm: Corpus → Topics

Input: corpus, int K (number of topics)
Output: K topics

Each topic is a distribution over words, for example:

0.15 * algorithm
0.10 * complexity
0.05 * program
0.05 * turing
...


Topic Simplex

A document is a distribution over topics

[Figure: topic simplex. Labels: Linguistics, Philosophy, Computer Science; Algorithms, Generative Grammar, Logic]


Outline

Latent Dirichlet Allocation: What does it do?
  Assumptions
  Generative Process
  Dirichlet Distribution
  Demo

Latent Dirichlet Allocation: How does it do it?
  Inference
  Gibbs Sampling

Conclusion


Latent Dirichlet Allocation

Unsupervised Learning Model

Finding clusters of similar texts

Generative Model


Latent Dirichlet Allocation

Assumptions

A document is represented as a bag of words

A document is about multiple topics

A topic is a distribution over words

Order of documents in corpus does not matter

Every document is generated by a generative process
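The bag-of-words assumption can be illustrated with a minimal Python sketch. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration and do not come from the slides.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order.

    Lowercasing and splitting on whitespace is a simplifying
    assumption for this sketch.
    """
    return Counter(text.lower().split())


doc_a = "topic models find topics"
doc_b = "find topics topic models"

# Both documents contain the same words, so their bags are identical
# even though the word order differs.
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```

Only the counts survive, which is why the model cannot distinguish documents that differ in word order alone.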


Generative Process

Algorithm: Generative Process

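1. For each document $i \in \{1, \ldots, D\}$, choose $\theta_i \sim \mathrm{Dir}(\alpha)$.
2. For each topic $k \in \{1, \ldots, K\}$, choose $\varphi_k \sim \mathrm{Dir}(\beta)$.
3. For each word position $(i, j)$ with $i \in \{1, \ldots, D\}$ and $j \in \{1, \ldots, N_i\}$:
   (a) Choose a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$.
   (b) Choose a word $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$.

$N_i$: number of words in document $i$
$D$: number of documents
$K$: number of topics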
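The three steps of the generative process can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the toy vocabulary, and the symmetric hyperparameter values are hypothetical, and Dirichlet samples are obtained by normalizing Gamma draws.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(doc_lengths, vocab, num_topics, alpha, beta, seed=0):
    """Generate documents following the generative process."""
    rng = random.Random(seed)
    # Step 1: topic proportions theta_i for every document.
    theta = [sample_dirichlet([alpha] * num_topics, rng) for _ in doc_lengths]
    # Step 2: word distribution phi_k for every topic.
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for i, n_i in enumerate(doc_lengths):
        words = []
        for _ in range(n_i):
            # Step 3(a): choose a topic for this word position.
            z = rng.choices(range(num_topics), weights=theta[i])[0]
            # Step 3(b): choose a word from the chosen topic.
            words.append(rng.choices(vocab, weights=phi[z])[0])
        corpus.append(words)
    return corpus


# Toy vocabulary and hyperparameters, chosen only for illustration.
vocab = ["algorithm", "complexity", "grammar", "syntax", "logic", "truth"]
corpus = generate_corpus([8, 5], vocab, num_topics=3, alpha=0.5, beta=0.5)
for doc in corpus:
    print(" ".join(doc))
```

The generated "documents" are unordered word sequences, which is consistent with the bag-of-words assumption.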


Dirichlet Distribution

Binomial Distribution → Multinomial Distribution → Dirichlet Distribution

Binomial Distribution (PMF)

$$f(k; n, p) = \Pr(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$

Counts the successes in $n$ independent success/failure experiments.

Example: fair coin, 6 tosses. Probability of 5 heads?

$$\Pr(5 \text{ heads}) = f(5; 6, 0.5) = \binom{6}{5}\, 0.5^5\, (1 - 0.5)^{6-5} = 0.09375$$
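The coin example can be checked numerically. The following short sketch evaluates the binomial PMF with Python's standard library; the function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Fair coin, 6 tosses, 5 heads.
print(binomial_pmf(5, 6, 0.5))  # 0.09375
```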


Dirichlet Distribution

Multinomial Distribution (PMF)

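$$f(x_1, \ldots, x_k; n, p_1, \ldots, p_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$$

Some event with 3 outcomes: $X = [x_1, x_2, x_3]$

Example: a coin that can land on heads, tails, or its edge, with

$$\vec{p} = \left[\tfrac{1}{2} - \tfrac{\varepsilon}{2},\; \tfrac{1}{2} - \tfrac{\varepsilon}{2},\; \varepsilon\right]$$

Image source: http://rired.ru/wp-content/uploads/2013/03/851428coin.jpg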
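A small sketch of the multinomial PMF for the three-outcome coin. The value of ε and the outcome counts below are assumptions picked for illustration; the slides leave ε unspecified.

```python
from math import factorial


def multinomial_pmf(counts, probs):
    """Probability of the given outcome counts in sum(counts) trials."""
    coefficient = factorial(sum(counts))
    for x in counts:
        coefficient //= factorial(x)  # exact: multinomial coefficients are integers
    probability = float(coefficient)
    for x, p in zip(counts, probs):
        probability *= p**x
    return probability


eps = 0.01  # assumed probability of landing on the edge
p = [0.5 - eps / 2, 0.5 - eps / 2, eps]

# 10 tosses: 5 heads, 5 tails, 0 edge landings.
print(multinomial_pmf([5, 5, 0], p))
```

With only two outcomes the function reduces to the binomial PMF, which shows the multinomial as its direct generalization.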


Dirichlet Distribution

The Dirichlet distribution $\mathrm{Dir}(\alpha)$ is a distribution over the space of multinomial distributions, that is, over probability vectors $(p_1, \ldots, p_k)$ with $p_i \ge 0$ and $\sum_i p_i = 1$.

[Figures: $\mathrm{Dir}(\alpha)$ for α = [1, 1, 1], [2, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4], [5, 5, 5], [10, 10, 10], and [0.9, 0.9, 0.9]]
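The effect of α can be explored by sampling. The sketch below draws probability vectors from Dir(α) by normalizing Gamma draws and reports how far the first component typically lies from its expected value. The helper names and the sample size are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dir(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def mean_spread(alpha, num_samples=2000, seed=0):
    """Average absolute deviation of the first component from its mean."""
    rng = random.Random(seed)
    expected = alpha[0] / sum(alpha)
    samples = [sample_dirichlet(alpha, rng)[0] for _ in range(num_samples)]
    return sum(abs(s - expected) for s in samples) / num_samples


for alpha in ([0.9, 0.9, 0.9], [1, 1, 1], [2, 2, 2], [10, 10, 10]):
    print(alpha, round(mean_spread(alpha), 3))
```

For symmetric α, larger values concentrate the samples around the uniform vector, while values below 1 push mass toward the corners of the simplex.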


Plate Notation

Vertex ≡ random variable

Edge ≡ dependence


Plate Notation: LDA

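$\alpha$: Dirichlet parameter
$\beta_k$: topics (distributions over words)
$\theta_d$: topic proportions for the $d$-th document
$z_{d,n}$: topic assignment for the $n$-th word in the $d$-th document

Note: the notation differs from the generative process slide, where the topics are written $\varphi_k$ and $\beta$ is the Dirichlet parameter of the topics.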
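The dependencies in the plate diagram correspond to a factorization of the joint distribution. The form below follows the standard presentation of LDA and uses this slide's symbols. $N_d$ (the number of words in document $d$) and $w_{d,n}$ (the $n$-th observed word of document $d$) are notation assumed here.

```latex
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{k=1}^{K} p(\beta_k)
    \prod_{d=1}^{D} \Bigl( p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                      p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \Bigr)
```

Each factor matches one edge pattern of the diagram: a topic assignment depends only on its document's topic proportions, and a word depends on its topic assignment and the topics.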


LDA: Demo

[Demo figure. Labels: Linguistics, Philosophy, Computer Science]

Depending on how the corpus changes...


LDA and Inference

Goal: Automatically discover topics from a collection of documents

Only documents themselves are observed

Topics, per-document topic distributions, and per-document per-word topic assignments are hidden


Inference

How to infer latent variables?

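$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}$$

The denominator $p(w_{1:D})$ is the marginal probability of the observed words.

$\beta_k$: topics (distributions over words)
$\theta_d$: topic proportions for the $d$-th document
$z_{d,n}$: topic assignment for the $n$-th word in the $d$-th document
$w_d$: observed words in the $d$-th document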


Inference

Problem: Marginal probability is intractable to compute

In principle it can be computed by summing the joint distribution over every possible instance of the hidden topic structure.

In other words: sum over all possible ways of assigning each observed word of the collection to one of the topics. With K topics and N words in total, there are K^N such assignments.

⇒ Approximation!
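A quick calculation shows why the exact sum is out of reach. The sketch counts the possible topic assignments for a given number of topics and word tokens. The function name and the corpus sizes are chosen for illustration only.

```python
def num_topic_assignments(num_topics, num_words):
    """Number of ways to assign each word token to one of the topics."""
    return num_topics**num_words


# A single seven-word document with three topics.
print(num_topic_assignments(3, 7))  # 2187

# With only 1,000 word tokens the count already has hundreds of digits.
print(len(str(num_topic_assignments(3, 1000))))
```

Realistic corpora contain far more than a thousand word tokens, so enumerating the assignments is not feasible.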


Gibbs Sampling

used for Bayesian inference

randomized algorithm

Markov Chain Monte Carlo Algorithm

Method to find (good) topics


Gibbs Sampling - Example

Text mining algorithms can be used to find structure in text corpora like Plato’s dialogues

1. Randomly assign words to topics

Topic:  1     3       2           1          2        1      2
Word:   text  mining  algorithms  structure  corpora  Plato  dialogues

2. Do this for all documents in corpus
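Steps 1 and 2 can be sketched as a random initialization that also collects the counts used later when resampling. The function name, the data layout, and the second document are hypothetical. Topics are numbered from 0 in the code, whereas the slides number them from 1.

```python
import random
from collections import defaultdict


def initialize(corpus, num_topics, seed=0):
    """Randomly assign a topic to every word and collect the counts."""
    rng = random.Random(seed)
    assignments = []  # topic of each word, per document
    doc_topic = [[0] * num_topics for _ in corpus]  # counts per document and topic
    word_topic = defaultdict(lambda: [0] * num_topics)  # counts per word and topic
    for d, doc in enumerate(corpus):
        topics = []
        for word in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            word_topic[word][k] += 1
        assignments.append(topics)
    return assignments, doc_topic, word_topic


corpus = [
    ["text", "mining", "algorithms", "structure", "corpora", "Plato", "dialogues"],
    ["algorithms", "structure", "text"],  # hypothetical second document
]
assignments, doc_topic, word_topic = initialize(corpus, num_topics=3)
print(assignments[0])
print(dict(word_topic))
```

The word-topic counts accumulate over all documents, while the document-topic counts are kept separately for each document.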


Gibbs Sampling - Example

Word-topic counts from all documents:

              Topic 1  Topic 2  Topic 3
text               65       54       59
mining             21        4       12
algorithms        100       74      122
structure          20       12       14
corpora             5        2       12
Plato              35       33       42
dialogues          24       27       31

Resample the topic of the word "algorithms":

Topic:  1     3       ???         1          2        1      2
Word:   text  mining  algorithms  structure  corpora  Plato  dialogues


Gibbs Sampling - Example

3. Topic distribution in this document

[Figure: bars labelled Topic 1, Topic 2, Topic 3]

4. Word distribution over topics

5. Sample according to green area

Topic:  1     3       1           1          2        1      2
Word:   text  mining  algorithms  structure  corpora  Plato  dialogues

The word "algorithms" is reassigned to Topic 1.
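Steps 3 to 5 can be sketched with the counts from the table above, using the standard collapsed Gibbs sampling conditional for LDA: the weight of topic k is proportional to (document-topic count + α) × (word-topic count + β) / (topic total + V·β). Several points are assumptions, since the slides do not state them: the hyperparameter values, treating the seven listed words as the whole vocabulary, and treating the table as already excluding the word being resampled.

```python
import random

# Word-topic counts from the table (columns: Topic 1, Topic 2, Topic 3).
word_topic = {
    "text": [65, 54, 59],
    "mining": [21, 4, 12],
    "algorithms": [100, 74, 122],
    "structure": [20, 12, 14],
    "corpora": [5, 2, 12],
    "Plato": [35, 33, 42],
    "dialogues": [24, 27, 31],
}
# Topics of the other six words in this document: 1, 3, 1, 2, 1, 2.
doc_topic = [3, 2, 1]

ALPHA, BETA = 0.1, 0.01  # assumed hyperparameters, not given on the slides


def topic_weights(word, word_topic, doc_topic, alpha, beta):
    """Unnormalized conditional probability of each topic for one word."""
    vocab_size = len(word_topic)  # assumption: the table is the full vocabulary
    # Total number of words assigned to each topic (column sums).
    topic_totals = [sum(col) for col in zip(*word_topic.values())]
    return [
        (doc_topic[k] + alpha)
        * (word_topic[word][k] + beta)
        / (topic_totals[k] + vocab_size * beta)
        for k in range(len(doc_topic))
    ]


weights = topic_weights("algorithms", word_topic, doc_topic, ALPHA, BETA)
probabilities = [w / sum(weights) for w in weights]
print([round(p, 3) for p in probabilities])

# Draw the new topic (numbered from 1, as on the slides) in proportion to the weights.
new_topic = random.Random(1).choices([1, 2, 3], weights=weights)[0]
print("sampled topic:", new_topic)
```

Under these assumptions Topic 1 receives the largest weight, which is consistent with the reassignment shown on the slide. The draw is random, so a different topic can be sampled in any single step.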


Conclusion - Take home message

Wrap up

Topic models find the hidden topical patterns that pervade an unstructured collection of text

Generative process as a model of how texts are composed

Words are allocated to topics according to per-document topic proportions drawn from a Dirichlet distribution

Inference

Gibbs sampling can be used for approximating the posterior over the hidden variables


Resources

http://www.cs.columbia.edu/~blei/

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.

Porteous, Ian, et al. "Fast collapsed Gibbs sampling for latent Dirichlet allocation." Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008.