CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED...

39
CS249: ADVANCED DATA MINING Instructor: Yizhou Sun [email protected] May 10, 2017 Text Data: Topic Models

Transcript of CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED...

Page 1: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

CS249: ADVANCED DATA MINING

Instructor: Yizhou [email protected]

May 10, 2017

Text Data: Topic Models

Page 2: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Announcements•Course project proposal

• Due May 8th (11:59pm)

•Homework 3 out• Due May 10th (11:59pm)

•Midterm Exam• In class May 15th

2

Page 3: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Methods to Learn

3

Vector Data Text Data Recommender System

Graph & Network

Classification Decision Tree; Naïve Bayes; Logistic RegressionSVM; NN

Label Propagation

Clustering K-means; hierarchicalclustering; DBSCAN; Mixture Models; kernel k-means

PLSA;LDA

Matrix Factorization SCAN; Spectral Clustering

Prediction Linear RegressionGLM

Collaborative Filtering

Ranking PageRank

Feature Representation

Word embedding Network embedding

Page 4: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Text Data: Topic Models• Text Data and Topic Models

• Revisit of Multinomial Mixture Model

• Probabilistic Latent Semantic Analysis (pLSA)

• Latent Dirichlet Allocation (LDA)

• Summary

4

Page 5: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Text Data

•Word/term•Document

• A sequence of words

•Corpus• A collection of

documents

5

Page 6: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Represent a Document•Most common way: Bag-of-Words

• Ignore the order of words

• keep the count

6

c1 c2 c3 c4 c5 m1 m2 m3 m4

Vector space model

Page 7: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

More Details• Represent the doc as a vector where each entry

corresponds to a different word and the number at that entry corresponds to how many times that word was present in the document (or some function of it)• Number of words is huge• Select and use a smaller set of words that are of interest• E.g. uninteresting words: ‘and’, ‘the’ ‘at’, ‘is’, etc. These are called stop-

words• Stemming: remove endings. E.g. ‘learn’, ‘learning’, ‘learnable’, ‘learned’

could be substituted by the single stem ‘learn’• Other simplifications can also be invented and used• The set of different remaining words is called dictionary or vocabulary. Fix

an ordering of the terms in the dictionary so that you can operate them by their index.

• Can be extended to bi-gram, tri-gram, or so

7

Page 8: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Limitations of Vector Space Model•Dimensionality

• High dimensionality

•Sparseness• Most of the entries are zero

•Shallow representation• The vector representation does not capture semantic relations between words

8

Page 9: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Topics•Topic

• A topic is represented by a word distribution

• Relate to an issue

9

Page 10: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Topic Models

• Topic modeling• Get topics automatically from a corpus

• Assign documents to topics automatically

• Most frequently used topic models• pLSA• LDA

10

Page 11: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Text Data: Topic Models• Text Data and Topic Models

• Revisit of Multinomial Mixture Model

• Probabilistic Latent Semantic Analysis (pLSA)

• Latent Dirichlet Allocation (LDA)

• Summary

11

Page 12: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Review of Multinomial Distribution•Select n data points from K categories, each with probability 𝑝𝑝𝑘𝑘• n trials of independent categorical distribution

• E.g., get 1-6 from a dice with 1/6

• When n=1, it is also called discrete distribution

•When K=2, binomial distribution• n trials of independent Bernoulli distribution

• E.g., flip a coin to get heads or tails

12

Page 13: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Multinomial Mixture Model• For documents with bag-of-words representation• 𝒙𝒙𝑑𝑑 = (𝑥𝑥𝑑𝑑𝑑, 𝑥𝑥𝑑𝑑𝑑, … , 𝑥𝑥𝑑𝑑𝑑𝑑), 𝑥𝑥𝑑𝑑𝑛𝑛 is the number of words for nth word in the vocabulary

• Generative model • For each document

• Sample its cluster label 𝑧𝑧~𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑(𝝅𝝅)• 𝝅𝝅 = (𝜋𝜋𝑑,𝜋𝜋𝑑, … ,𝜋𝜋𝐾𝐾), 𝜋𝜋𝑘𝑘 is the proportion of jth cluster• 𝑝𝑝 𝑧𝑧 = 𝑘𝑘 = 𝜋𝜋𝑘𝑘

• Sample its word vector 𝒙𝒙𝑑𝑑~𝑚𝑚𝑚𝑚𝑚𝑚𝑑𝑑𝑑𝑑(𝜷𝜷𝑧𝑧)• 𝜷𝜷𝑧𝑧 = 𝛽𝛽𝑧𝑧𝑑,𝛽𝛽𝑧𝑧𝑑, … ,𝛽𝛽𝑧𝑧𝑑𝑑 ,𝛽𝛽𝑧𝑧𝑛𝑛 is the parameter associate with nth word

in the vocabulary

• 𝑝𝑝 𝒙𝒙𝑑𝑑|𝑧𝑧 = 𝑘𝑘 = ∑𝑛𝑛 𝑥𝑥𝑑𝑑𝑛𝑛 !∏𝑛𝑛 𝑥𝑥𝑑𝑑𝑛𝑛!

∏𝑛𝑛𝛽𝛽𝑘𝑘𝑛𝑛𝑥𝑥𝑑𝑑𝑛𝑛 ∝ ∏𝑛𝑛 𝛽𝛽𝑘𝑘𝑛𝑛

𝑥𝑥𝑑𝑑𝑛𝑛

13

Page 14: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Likelihood Function•For a set of M documents

𝐿𝐿 = �𝑑𝑑

𝑝𝑝(𝒙𝒙𝑑𝑑) = �𝑑𝑑

�𝑘𝑘

𝑝𝑝(𝒙𝒙𝑑𝑑 , 𝑧𝑧 = 𝑘𝑘)

= �𝑑𝑑

�𝑘𝑘

𝑝𝑝 𝒙𝒙𝑑𝑑 𝑧𝑧 = 𝑘𝑘 𝑝𝑝(𝑧𝑧 = 𝑘𝑘)

∝�𝑑𝑑

�𝑘𝑘

𝑝𝑝(𝑧𝑧 = 𝑘𝑘)�𝑛𝑛

𝛽𝛽𝑘𝑘𝑛𝑛𝑥𝑥𝑑𝑑𝑛𝑛

14

Page 15: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Mixture of Unigrams• For documents represented by a sequence of words•𝒘𝒘𝑑𝑑 = (𝑤𝑤𝑑𝑑𝑑,𝑤𝑤𝑑𝑑𝑑, … ,𝑤𝑤𝑑𝑑𝑑𝑑𝑑𝑑), 𝑁𝑁𝑑𝑑 is the length of document d, 𝑤𝑤𝑑𝑑𝑛𝑛 is the word at the nth position of the document

• Generative model• For each document

• Sample its cluster label 𝑧𝑧~𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑(𝝅𝝅)• 𝝅𝝅 = (𝜋𝜋𝑑,𝜋𝜋𝑑, … ,𝜋𝜋𝐾𝐾), 𝜋𝜋𝑘𝑘 is the proportion of jth cluster• 𝑝𝑝 𝑧𝑧 = 𝑘𝑘 = 𝜋𝜋𝑘𝑘

• For each word in the sequence• Sample the word 𝑤𝑤𝑑𝑑𝑛𝑛~𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑(𝜷𝜷𝑧𝑧)• 𝑝𝑝 𝑤𝑤𝑑𝑑𝑛𝑛|𝑧𝑧 = 𝑘𝑘 = 𝛽𝛽𝑘𝑘𝑤𝑤𝑑𝑑𝑛𝑛

15

Page 16: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Likelihood Function•For a set of M documents

𝐿𝐿 = �𝑑𝑑

𝑝𝑝(𝒘𝒘𝑑𝑑) = �𝑑𝑑

�𝑘𝑘

𝑝𝑝(𝒘𝒘𝑑𝑑 , 𝑧𝑧 = 𝑘𝑘)

= �𝑑𝑑

�𝑘𝑘

𝑝𝑝 𝒘𝒘𝑑𝑑 𝑧𝑧 = 𝑘𝑘 𝑝𝑝(𝑧𝑧 = 𝑘𝑘)

= �𝑑𝑑

�𝑘𝑘

𝑝𝑝(𝑧𝑧 = 𝑘𝑘)�𝑛𝑛

𝛽𝛽𝑘𝑘𝑤𝑤𝑑𝑑𝑛𝑛

16

Page 17: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Text Data: Topic Models• Text Data and Topic Models

• Revisit of Multinomial Mixture Model

• Probabilistic Latent Semantic Analysis (pLSA)

• Latent Dirichlet Allocation (LDA)

• Summary

18

Page 18: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Notations•Word, document, topic

•𝑤𝑤,𝑑𝑑, 𝑧𝑧•Word count in document

• 𝑑𝑑(𝑤𝑤,𝑑𝑑)•Word distribution for each topic (𝛽𝛽𝑧𝑧)

•𝛽𝛽𝑧𝑧𝑤𝑤: 𝑝𝑝(𝑤𝑤|𝑧𝑧)•Topic distribution for each document (𝜃𝜃𝑑𝑑)

•𝜃𝜃𝑑𝑑𝑧𝑧: 𝑝𝑝(𝑧𝑧|𝑑𝑑) (Yes, soft clustering)

19

Page 19: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Issues of Mixture of Unigrams•All the words in the same documents are sampled from the same topic

• In practice, people switch topics during their writing

20

Page 20: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Illustration of pLSA

21

Page 21: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Generative Model for pLSA

•Describe how a document is generated probabilistically• For each position in d, 𝑛𝑛 = 1, … ,𝑁𝑁𝑑𝑑

• Generate the topic for the position as𝑧𝑧𝑛𝑛~𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑(𝜽𝜽𝑑𝑑), 𝑑𝑑. 𝑑𝑑. ,𝑝𝑝 𝑧𝑧𝑛𝑛 = 𝑘𝑘 = 𝜃𝜃𝑑𝑑𝑘𝑘

(Note, 1 trial multinomial, i.e., discrete distribution)

• Generate the word for the position as𝑤𝑤𝑛𝑛~𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑(𝜷𝜷𝑧𝑧𝑛𝑛), 𝑑𝑑. 𝑑𝑑. ,𝑝𝑝 𝑤𝑤𝑛𝑛 = 𝑤𝑤 = 𝛽𝛽𝑧𝑧𝑛𝑛𝑤𝑤

22

Page 22: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Graphical Model

Note: Sometimes, people add parameters such as 𝜃𝜃 𝑎𝑎𝑛𝑛𝑑𝑑 𝛽𝛽 into the graphical model

23

Page 23: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

The Likelihood Function for a Corpus•Probability of a word𝑝𝑝 𝑤𝑤|𝑑𝑑 = �

𝑘𝑘

𝑝𝑝(𝑤𝑤, 𝑧𝑧 = 𝑘𝑘|𝑑𝑑) = �𝑘𝑘

𝑝𝑝 𝑤𝑤 𝑧𝑧 = 𝑘𝑘 𝑝𝑝 𝑧𝑧 = 𝑘𝑘|𝑑𝑑 = �𝑘𝑘

𝛽𝛽𝑘𝑘𝑤𝑤𝜃𝜃𝑑𝑑𝑘𝑘

•Likelihood of a corpus

24

𝜋𝜋𝑑𝑑 𝑑𝑑𝑑𝑑 𝑚𝑚𝑑𝑑𝑚𝑚𝑎𝑎𝑚𝑚𝑚𝑚𝑢𝑢 𝑑𝑑𝑐𝑐𝑛𝑛𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑 𝑎𝑎𝑑𝑑 𝑚𝑚𝑛𝑛𝑑𝑑𝑢𝑢𝑐𝑐𝑑𝑑𝑚𝑚, i.e., 1/M

Page 24: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Re-arrange the Likelihood Function•Group the same word from different positions together

max 𝑚𝑚𝑐𝑐𝑙𝑙𝐿𝐿 = �𝑑𝑑𝑤𝑤

𝑑𝑑 𝑤𝑤,𝑑𝑑 𝑚𝑚𝑐𝑐𝑙𝑙�𝑧𝑧

𝜃𝜃𝑑𝑑𝑧𝑧 𝛽𝛽𝑧𝑧𝑤𝑤

𝑑𝑑. 𝑑𝑑.�𝑧𝑧

𝜃𝜃𝑑𝑑𝑧𝑧 = 1 𝑎𝑎𝑛𝑛𝑑𝑑 �𝑤𝑤

𝛽𝛽𝑧𝑧𝑤𝑤 = 1

25

Page 25: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Optimization: EM Algorithm• Repeat until converge

• E-step: for each word in each document, calculate is conditional probability belonging to each topic

𝑝𝑝 𝑧𝑧 𝑤𝑤,𝑑𝑑 ∝ 𝑝𝑝 𝑤𝑤 𝑧𝑧,𝑑𝑑 𝑝𝑝 𝑧𝑧 𝑑𝑑 = 𝛽𝛽𝑧𝑧𝑤𝑤𝜃𝜃𝑑𝑑𝑧𝑧 (𝑑𝑑. 𝑑𝑑. ,𝑝𝑝 𝑧𝑧 𝑤𝑤,𝑑𝑑 =𝛽𝛽𝑧𝑧𝑤𝑤𝜃𝜃𝑑𝑑𝑧𝑧

∑𝑧𝑧𝑧 𝛽𝛽𝑧𝑧𝑧𝑤𝑤𝜃𝜃𝑑𝑑𝑧𝑧𝑧)

• M-step: given the conditional distribution, find the parameters that can maximize the expected likelihood

𝛽𝛽𝑧𝑧𝑤𝑤 ∝ ∑𝑑𝑑 𝑝𝑝 𝑧𝑧 𝑤𝑤,𝑑𝑑 𝑑𝑑 𝑤𝑤,𝑑𝑑 (𝑑𝑑. 𝑑𝑑. ,𝛽𝛽𝑧𝑧𝑤𝑤 = ∑𝑑𝑑 𝑝𝑝 𝑧𝑧 𝑤𝑤,𝑑𝑑 𝑐𝑐 𝑤𝑤,𝑑𝑑∑𝑤𝑤′,𝑑𝑑 𝑝𝑝 𝑧𝑧 𝑤𝑤

𝑧,𝑑𝑑 𝑐𝑐 𝑤𝑤′,𝑑𝑑)

𝜃𝜃𝑑𝑑𝑧𝑧 ∝�𝑤𝑤

𝑝𝑝 𝑧𝑧 𝑤𝑤,𝑑𝑑 𝑑𝑑 𝑤𝑤,𝑑𝑑 (𝑑𝑑. 𝑑𝑑. ,𝜃𝜃𝑑𝑑𝑧𝑧 =∑𝑤𝑤 𝑝𝑝 𝑧𝑧 𝑤𝑤,𝑑𝑑 𝑑𝑑 𝑤𝑤,𝑑𝑑

𝑁𝑁𝑑𝑑)

26

Page 26: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Text Data: Topic Models• Text Data and Topic Models

• Revisit of Multinomial Mixture Model

• Probabilistic Latent Semantic Analysis (pLSA)

• Latent Dirichlet Allocation (LDA)

• Summary

28

Page 27: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Limitations of pLSA•Not a proper generative model

•𝜽𝜽𝑑𝑑 is treated as a parameter

• Cannot model new documents

•Solution:• Make it a proper generative model by adding priors to 𝜃𝜃 and 𝛽𝛽

29

Page 28: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Review of Conjugate Prior• Model:

• 𝑝𝑝 𝑥𝑥 𝜃𝜃• Prior:

• 𝑝𝑝 𝜃𝜃 𝛼𝛼• Posterior:

• 𝑝𝑝 𝜃𝜃 𝑥𝑥,𝛼𝛼 ∝ 𝑝𝑝 𝜃𝜃, 𝑥𝑥 𝛼𝛼 = 𝑝𝑝 𝑥𝑥 𝜃𝜃 𝑝𝑝(𝜃𝜃|𝛼𝛼)

• Conjugate prior:• If 𝑝𝑝 𝜃𝜃 𝛼𝛼 and 𝑝𝑝 𝜃𝜃 𝑥𝑥,𝛼𝛼 belong to the same distribution family (with different parameters)

30

Page 29: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Dirichlet-Multinomial Conjugate•Dirichlet distribution: 𝜽𝜽~𝐷𝐷𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝐷𝑚𝑚𝑑𝑑𝑑𝑑(𝜶𝜶)

• 𝑑𝑑. 𝑑𝑑. , 𝑝𝑝 𝜽𝜽 𝜶𝜶 = Γ(∑𝑘𝑘 𝛼𝛼𝑘𝑘)∏𝑘𝑘 Γ(𝛼𝛼𝑘𝑘)

∏𝑘𝑘 𝜃𝜃𝑘𝑘𝛼𝛼𝑘𝑘−𝑑, where 𝛼𝛼𝑘𝑘 > 0

• 𝐸𝐸 𝜃𝜃𝑘𝑘 = 𝛼𝛼𝑘𝑘∑𝑘𝑘′ 𝛼𝛼𝑘𝑘′

• 𝑉𝑉𝑎𝑎𝑑𝑑 𝜃𝜃𝑘𝑘 = 𝛼𝛼𝑘𝑘 𝛼𝛼0−𝛼𝛼𝑘𝑘𝛼𝛼02 𝛼𝛼0+𝑑

, where 𝛼𝛼0 = ∑𝑘𝑘 𝛼𝛼𝑘𝑘

31𝐸𝐸𝑥𝑥𝑎𝑎𝑚𝑚𝑝𝑝𝑚𝑚𝑑𝑑:𝜽𝜽~𝐷𝐷𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝐷𝑚𝑚𝑑𝑑𝑑𝑑 𝜶𝜶 ,𝑤𝑤𝐷𝑑𝑑𝑑𝑑𝑑𝑑 𝜶𝜶/𝛼𝛼0 = (

12

,13

,16

)

Presenter
Presentation Notes
\alpha_k psudo count
Page 30: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

More Examples

32

Dirichlet(1,1,1)

0 0.2 0.4 0.6 0.8 10

2

4

6

8

10

12

14

16

18

20

Dirichlet(2,2,2)

0 0.2 0.4 0.6 0.8 10

2

4

6

8

10

12

14

16

18

20

Dirichlet(10,10,10)

0 0.2 0.4 0.6 0.8 10

2

4

6

8

10

12

14

16

18

20

Dirichlet(2,2,10)

0 0.2 0.4 0.6 0.8 10

2

4

6

8

10

12

14

16

18

20

Dirichlet(2,10,2)

0 0.2 0.4 0.6 0.8 10

2

4

6

8

10

12

14

16

18

20

Dirichlet(0.1,0.1,0.1)

0 0.2 0.4 0.6 0.8 10

2

4

6

8

10

12

14

16

18

20

Page 31: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Simplex View •𝛼𝛼 = (2,3,4)

33

0

x2

0.50

10.8

0.2

x1

0.60.4

0.4

0.2 10

x3

0.6

0.8

1

Page 32: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

More Examples in the Simplex View

34

Page 33: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Dirichlet-Multinomial Conjugate•Posterior

•𝑝𝑝 𝜽𝜽 𝜶𝜶, 𝑧𝑧𝑑, 𝑧𝑧𝑑, … 𝑧𝑧𝑛𝑛 ∝ 𝑝𝑝 𝑧𝑧𝑑, … , 𝑧𝑧𝑛𝑛 𝜽𝜽 𝑝𝑝 𝜽𝜽 𝜶𝜶

∝ ∏𝑘𝑘 𝜃𝜃𝑘𝑘𝛼𝛼𝑘𝑘+𝑐𝑐𝑘𝑘−𝑑

where 𝑑𝑑𝑘𝑘 is the number of z that takes value of k

35

Page 34: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

The Graphical Model of LDA

36

𝜽𝜽𝒅𝒅~𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫 𝜶𝜶𝜷𝜷𝒌𝒌~𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫𝑫(𝜼𝜼)

Page 35: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Posterior Inference for LDA

• Joint distribution of latent variables and documents is

37

Page 36: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Posterior Inference•Posterior of the latent variables

•Solutions• Gibbs sampling• Variational inference

38

=

𝑚𝑚𝑎𝑎𝑑𝑑𝑙𝑙𝑑𝑑𝑛𝑛𝑎𝑎𝑚𝑚 𝑝𝑝 𝒘𝒘 𝑑𝑑𝑑𝑑 𝑑𝑑𝑛𝑛𝑑𝑑𝑑𝑑𝑎𝑎𝑑𝑑𝑑𝑑𝑎𝑎𝑖𝑖𝑚𝑚𝑑𝑑!

Page 37: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

References for LDA•Blei et al., Latent Dirichlet Allocation, JLMR 3 (2003)

•Griffiths et al., Finding Scientific Topics, PNAS (2004)

39

Page 38: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Text Data: Topic Models• Text Data and Topic Models

• Revisit of Multinomial Mixture Model

• Probabilistic Latent Semantic Analysis (pLSA)

• Latent Dirichlet Allocation (LDA)

• Summary

40

Page 39: CS249: ADVANCED DATA MININGweb.cs.ucla.edu/~yzsun/classes/2017Spring_CS249/Slides/...CS249: ADVANCED DATA MINING Instructor: Yizhou Sun yzsun@cs.ucla.edu May 10, 2017 Text Data: Topic

Summary• Basic Concepts

• Word/term, document, corpus, topic• How to represent a document

• Mixture of unigrams• pLSA

• Generative model• Likelihood function• EM algorithm

• LDA• Dirichlet-multinomial conjugate • Posterior inference

41