Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54...

68
23/05/16 1 Topic Model Tutorial Part 1 – The Intuition Christoph Carl Kling [email protected]

Transcript of Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54...

Page 1: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 1

Topic Model TutorialPart 1 – The Intuition

Christoph Carl [email protected]

Page 2: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 2

Conference Dinner

Page 3: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 3

Conference dinner

- I sit at a table with a probability proportional to the number

of people already sitting there

Page 4: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 4

Conference dinner

- I sit at a table with a probability proportional to the number

of people already sitting there

- If everybody does the same and there are more and more

people entering, the probabilities for choosing the tables

converge

Page 5: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 5

Conference dinner

- I sit at a table with a probability proportional to the number

of people already sitting there

- If everybody does the same and there are more and more

people entering, the probabilities for choosing the tables

converge

- The scheme yields a sample of a Dirichlet distribution

Parameters: initial number of participants at each table

Page 6: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 6

Dirichlet Distribution

- The scheme yields a sample of a Dirichlet distribution

Parameters: initial number of participants at each table

- “rich get richer”, preferential attachment

- Initial settings of < 1 participant at each table produce

sparse distributions

Page 7: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 7

In reality, I choose tables based on the number of people AND the topic they talk about!

Page 8: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 8

In reality, I choose tables based on the number of people AND the topic they talk about!

Page 9: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 9

Topics

Page 10: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 10

Page 11: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 11

Articles are labelled with tags(e.g. politics, economy, sports, ...)

Page 12: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 12

Politics: election, party, vote, candidate, ...Economy: dollar, crisis, financial, market, …Sports: soccer, basketball, match, score, ...

Articles are labelled with tags(e.g. politics, economy, sports, ...)

Page 13: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 13

Politics: election, party, vote, candidate, ...Economy: dollar, crisis, financial, market, …Sports: soccer, basketball, match, score, ...

Articles are labelled with tags(e.g. politics, economy, sports, ...)

Topics

Page 14: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 14

Topic Modelling

Page 15: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 15

Topic Modelling

Automatically extract topics from text documents!

Page 16: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 16

Latent Semantic Analysis

Page 17: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 17

Term-document matrix

high occurrencelow occurrence

Page 18: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 18

Term-document matrix

term frequencies document 4

high occurrencelow occurrence

Page 19: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 19

Term-document matrix

how often does document 4contain the word “blood”?

high occurrencelow occurrence

Page 20: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 20

Latent Semantic Analysis (LSA)

- Topic model based on “matrix decomposition”

Page 21: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 21

Latent Semantic Analysis (LSA)

- Topic model based on “matrix decomposition”

- Topics are described by “loadings” over the terms

Topic 1

Page 22: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 22

The Test Dataset

Page 23: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 23

probabilistic topic model probabilistic topic model probabilistic topic model probabilistic topic model probabilistic topic model probabilistic topic model

probabilistic topic model famous fashion model

famous fashion modelfamous fashion model

famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model

document 0: document 1: document 2: document 3: document 4: document 5: document 6: document 7: document 8: document 9: document 10: document 11: document 12: document 13: document 14: document 15: document 16: document 17: document 18: document 19:

Test dataset

Page 24: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 24

Topic 1: famous, fashion, modelTopic 2: model, probabilistic, topic

Expected topics

probabilistic topic model probabilistic topic model …

famous fashion model famous fashion model

...

Test dataset

Page 25: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 25

Test dataset

Term-document matrix

Page 26: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 26

Test dataset

Term-document matrix

Page 27: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 27

Test dataset

Term-document matrix

Page 28: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 28

Topic 1 Topic 2

LSA

Page 29: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 29

Topic 1 Topic 2

LSA

Page 30: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 30

Topic 1 Topic 2

LSA

Page 31: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 31

Topic 1 Topic 2

LSA

Page 32: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 32

Topic 1 Topic 2

LSA

Page 33: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 33

LSA – Weaknesses

- Topic loadings can be negative → hard to interpret!

- LSA has problems to cope with word ambiguities

Page 34: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 34

Probabilistic LSA

Page 35: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 35

Probabilistic LSA (PLSA)

- Based on categorical distributions

Page 36: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 36

Probabilistic LSA (PLSA)

- Based on categorical distributions

- Probabilistic model that explains

the creation of documents

Page 37: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 37

Probabilistic LSA (PLSA)

The PLSA model for the creation of words in documents:

1) Documents have each a categorical distribution

over the topics

Page 38: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 38

Probabilistic LSA (PLSA)

The PLSA model for the creation of words in documents:

1) Documents have each a categorical distribution

over the topics

2) Topics have each a categorical distribution over

all words

Page 39: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 39

Probabilistic LSA (PLSA)

The PLSA model for the creation of words in documents:

1) Documents have each a categorical distribution

over the topics

2) Topics have each a categorical distribution over

all words

3) Creation of a word in document i:

1)Draw a topic from

2)Draw a word from

Page 40: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 40

Topic 1 Topic 2

Probabilistic LSA (PLSA)

Page 41: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 41

Topic 1 Topic 2

Probabilistic LSA (PLSA)

Page 42: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 42

Topic 1 Topic 2

Probabilistic LSA (PLSA)

Page 43: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 43

Document 0 (probabilistic topic model)

Probabilistic LSA (PLSA)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Topic 1 Topic 2

Loadin

g

Page 44: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 44

Document 0 (probabilistic topic model) Document 7 (famous fashion model)

Probabilistic LSA (PLSA)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Topic 1 Topic 2

Loadin

g

Page 45: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 45

PLSA – Strengths & Weaknesses

- Topics are probability distributions and easy to interpret!

- PLSA still has problems to cope with ambiguous words

Page 46: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 46

Latent Dirichlet Allocation

Page 47: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 47

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

Page 48: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 48

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

famous fashion modeldocument 7:

Page 49: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 49

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

famous fashion model

Topic 1 Topic 1 ?

document 7:

Page 50: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 50

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

famous fashion model

Topic 1 Topic 1 → Topic 1

document 7:

Page 51: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 51

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

probabilistic topic modeldocument 0:

Page 52: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 52

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

probabilistic topic model

Topic 2 Topic 2 ?

document 0:

Page 53: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 53

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

probabilistic topic model

Topic 2 Topic 2 → Topic 2

document 0:

Page 54: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 54

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

- We would need some preference for already assigned

topics in a document

Page 55: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 55

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

- We would need some preference for already assigned

topics in a document

→ Dirichlet distribution!

Page 56: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 56

Dirichlet distribution

Page 57: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 57

Dirichlet distribution

Page 58: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 58

Topic 1 Topic 2

Latent Dirichlet Allocation (LDA)

Page 59: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 59

Document 0 (probabilistic topic model) Document 7 (famous fashion model)

Probabilistic topic model (with sparse Dirichlet)

Page 60: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 60

LDA – Strengths

- LDA can cope with ambiguous words!

- Most popular topic model

Page 61: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 61

(Human) Evaluation

Page 62: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 62

PLSA LDATopic 1 family,registered,like,hard,members,… first,network,time,won,week,third,...

Topic 2 high,left,planned,organization,story,… two,house,found,police,car,home,..

Topic 3 normal,predicted,first,chief,health,… cents,futures,cent,lower,higher,...

… … ...

Page 63: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 63

Topic Model Game

- Tests the semantic coherence of topics

- Given the top-5 words of a topic and an intruder word

from a different topic – find the intruder word!

Page 64: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 64

Topic Model Game

Given the top-5 words of a topic and an intruder word from a different topic – find the intruder word!

air pollution power blood environmental nuclear

Page 65: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 65

Topic Model Game

Given the top-5 words of a topic and an intruder word from a different topic – find the intruder word!

air pollution power blood environmental nuclear

Page 66: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 66

Topic Model Game

https://tinyurl.com/tmt16

Page 67: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 67

Summary

Page 68: Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54 Latent Dirichlet Allocation (LDA) - A word in a document is likely to belong to the

23/05/16 68

Summary

- Dirichlet distribution (Polya urn scheme)

- Latent Semantic Analysis (LSA)

- Probabilistic Latent Semantic Analysis (PLSA)

- Latent Dirichlet Allocation (LDA)

- Human evaluation of topic models