Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54...

Post on 22-Jan-2021

7 views 0 download

Transcript of Topic Model Tutorial Part 1 – The IntuitionTopic 2 Topic 2 → Topic 2 document 0: 23/05/16 54...

23/05/16 1

Topic Model TutorialPart 1 – The Intuition

Christoph Carl Klingckling@uni-koblenz.de

23/05/16 2

Conference Dinner

23/05/16 3

Conference dinner

- I sit at a table with a probability proportional to the number

of people already sitting there

23/05/16 4

Conference dinner

- I sit at a table with a probability proportional to the number

of people already sitting there

- If everybody does the same and there are more and more

people entering, the probabilities for choosing the tables

converge

23/05/16 5

Conference dinner

- I sit at a table with a probability proportional to the number

of people already sitting there

- If everybody does the same and there are more and more

people entering, the probabilities for choosing the tables

converge

- The scheme yields a sample of a Dirichlet distribution

Parameters: initial number of participants at each table

23/05/16 6

Dirichlet Distribution

- The scheme yields a sample of a Dirichlet distribution

Parameters: initial number of participants at each table

- “rich get richer”, preferential attachment

- Initial settings of < 1 participant at each table produce

sparse distributions

23/05/16 7

In reality, I choose tables based on the number of people AND the topic they talk about!

23/05/16 8

In reality, I choose tables based on the number of people AND the topic they talk about!

23/05/16 9

Topics

23/05/16 10

23/05/16 11

Articles are labelled with tags(e.g. politics, economy, sports, ...)

23/05/16 12

Politics: election, party, vote, candidate, ...Economy: dollar, crisis, financial, market, …Sports: soccer, basketball, match, score, ...

Articles are labelled with tags(e.g. politics, economy, sports, ...)

23/05/16 13

Politics: election, party, vote, candidate, ...Economy: dollar, crisis, financial, market, …Sports: soccer, basketball, match, score, ...

Articles are labelled with tags(e.g. politics, economy, sports, ...)

Topics

23/05/16 14

Topic Modelling

23/05/16 15

Topic Modelling

Automatically extract topics from text documents!

23/05/16 16

Latent Semantic Analysis

23/05/16 17

Term-document matrix

high occurrencelow occurrence

23/05/16 18

Term-document matrix

term frequencies document 4

high occurrencelow occurrence

23/05/16 19

Term-document matrix

how often does document 4contain the word “blood”?

high occurrencelow occurrence

23/05/16 20

Latent Semantic Analysis (LSA)

- Topic model based on “matrix decomposition”

23/05/16 21

Latent Semantic Analysis (LSA)

- Topic model based on “matrix decomposition”

- Topics are described by “loadings” over the terms

Topic 1

23/05/16 22

The Test Dataset

23/05/16 23

probabilistic topic model probabilistic topic model probabilistic topic model probabilistic topic model probabilistic topic model probabilistic topic model

probabilistic topic model famous fashion model

famous fashion modelfamous fashion model

famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model famous fashion model

document 0: document 1: document 2: document 3: document 4: document 5: document 6: document 7: document 8: document 9: document 10: document 11: document 12: document 13: document 14: document 15: document 16: document 17: document 18: document 19:

Test dataset

23/05/16 24

Topic 1: famous, fashion, modelTopic 2: model, probabilistic, topic

Expected topics

probabilistic topic model probabilistic topic model …

famous fashion model famous fashion model

...

Test dataset

23/05/16 25

Test dataset

Term-document matrix

23/05/16 26

Test dataset

Term-document matrix

23/05/16 27

Test dataset

Term-document matrix

23/05/16 28

Topic 1 Topic 2

LSA

23/05/16 29

Topic 1 Topic 2

LSA

23/05/16 30

Topic 1 Topic 2

LSA

23/05/16 31

Topic 1 Topic 2

LSA

23/05/16 32

Topic 1 Topic 2

LSA

23/05/16 33

LSA – Weaknesses

- Topic loadings can be negative → hard to interpret!

- LSA has problems to cope with word ambiguities

23/05/16 34

Probabilistic LSA

23/05/16 35

Probabilistic LSA (PLSA)

- Based on categorical distributions

23/05/16 36

Probabilistic LSA (PLSA)

- Based on categorical distributions

- Probabilistic model that explains

the creation of documents

23/05/16 37

Probabilistic LSA (PLSA)

The PLSA model for the creation of words in documents:

1) Documents have each a categorical distribution

over the topics

23/05/16 38

Probabilistic LSA (PLSA)

The PLSA model for the creation of words in documents:

1) Documents have each a categorical distribution

over the topics

2) Topics have each a categorical distribution over

all words

23/05/16 39

Probabilistic LSA (PLSA)

The PLSA model for the creation of words in documents:

1) Documents have each a categorical distribution

over the topics

2) Topics have each a categorical distribution over

all words

3) Creation of a word in document i:

1)Draw a topic from

2)Draw a word from

23/05/16 40

Topic 1 Topic 2

Probabilistic LSA (PLSA)

23/05/16 41

Topic 1 Topic 2

Probabilistic LSA (PLSA)

23/05/16 42

Topic 1 Topic 2

Probabilistic LSA (PLSA)

23/05/16 43

Document 0 (probabilistic topic model)

Probabilistic LSA (PLSA)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Topic 1 Topic 2

Loadin

g

23/05/16 44

Document 0 (probabilistic topic model) Document 7 (famous fashion model)

Probabilistic LSA (PLSA)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Topic 1 Topic 2

Loadin

g

23/05/16 45

PLSA – Strengths & Weaknesses

- Topics are probability distributions and easy to interpret!

- PLSA still has problems to cope with ambiguous words

23/05/16 46

Latent Dirichlet Allocation

23/05/16 47

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

23/05/16 48

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

famous fashion modeldocument 7:

23/05/16 49

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

famous fashion model

Topic 1 Topic 1 ?

document 7:

23/05/16 50

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

famous fashion model

Topic 1 Topic 1 → Topic 1

document 7:

23/05/16 51

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

probabilistic topic modeldocument 0:

23/05/16 52

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

probabilistic topic model

Topic 2 Topic 2 ?

document 0:

23/05/16 53

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

probabilistic topic model

Topic 2 Topic 2 → Topic 2

document 0:

23/05/16 54

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

- We would need some preference for already assigned

topics in a document

23/05/16 55

Latent Dirichlet Allocation (LDA)

- A word in a document is likely to belong to the same topic

as the other words of that document

- We would need some preference for already assigned

topics in a document

→ Dirichlet distribution!

23/05/16 56

Dirichlet distribution

23/05/16 57

Dirichlet distribution

23/05/16 58

Topic 1 Topic 2

Latent Dirichlet Allocation (LDA)

23/05/16 59

Document 0 (probabilistic topic model) Document 7 (famous fashion model)

Probabilistic topic model (with sparse Dirichlet)

23/05/16 60

LDA – Strengths

- LDA can cope with ambiguous words!

- Most popular topic model

23/05/16 61

(Human) Evaluation

23/05/16 62

PLSA LDATopic 1 family,registered,like,hard,members,… first,network,time,won,week,third,...

Topic 2 high,left,planned,organization,story,… two,house,found,police,car,home,..

Topic 3 normal,predicted,first,chief,health,… cents,futures,cent,lower,higher,...

… … ...

23/05/16 63

Topic Model Game

- Tests the semantic coherence of topics

- Given the top-5 words of a topic and an intruder word

from a different topic – find the intruder word!

23/05/16 64

Topic Model Game

Given the top-5 words of a topic and an intruder word from a different topic – find the intruder word!

air pollution power blood environmental nuclear

23/05/16 65

Topic Model Game

Given the top-5 words of a topic and an intruder word from a different topic – find the intruder word!

air pollution power blood environmental nuclear

23/05/16 66

Topic Model Game

https://tinyurl.com/tmt16

23/05/16 67

Summary

23/05/16 68

Summary

- Dirichlet distribution (Polya urn scheme)

- Latent Semantic Analysis (LSA)

- Probabilistic Latent Semantic Analysis (PLSA)

- Latent Dirichlet Allocation (LDA)

- Human evaluation of topic models