CS145: INTRODUCTION TO DATA MINING
Instructor: Yizhou [email protected]
December 4, 2017
Text Data: Topic Model
Methods to be Learnt
• Classification
  • Vector data: Logistic Regression; Decision Tree; KNN; SVM; NN
  • Text data: Naïve Bayes for Text
• Clustering
  • Vector data: K-means; hierarchical clustering; DBSCAN; Mixture Models
  • Text data: pLSA
• Prediction
  • Vector data: Linear Regression; GLM*
• Frequent Pattern Mining
  • Set data: Apriori; FP-growth
  • Sequence data: GSP; PrefixSpan
• Similarity Search
  • Sequence data: DTW
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Text Data
• Word/term
• Document
  • A sequence of words
• Corpus
  • A collection of documents
Represent a Document
• Most common way: Bag-of-Words (a small counting sketch follows)
  • Ignore the order of words
  • Keep the counts
[Figure: term-document count matrix over documents c1–c5 and m1–m4 (vector space model)]
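Concretely, building a bag-of-words vector only requires counting. A minimal sketch; the two toy documents and the derived vocabulary below are invented for illustration:

```python
from collections import Counter

# Toy corpus: two hypothetical documents
docs = [
    "data mining mining frequent pattern",
    "web information retrieval data",
]

# Vocabulary = all distinct words; each document becomes a count vector.
# Word order is discarded, only counts are kept.
vocab = sorted({w for doc in docs for w in doc.split()})
vectors = [[Counter(doc.split())[w] for w in vocab] for doc in docs]

print(vocab)
print(vectors)  # "mining" contributes a count of 2 in the first document
```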
Topics
• Topic
  • A topic is represented by a word distribution
  • Relates to an issue
Topic Models
• Topic modeling
  • Get topics automatically from a corpus
  • Assign documents to topics automatically
• Most frequently used topic models
  • pLSA
  • LDA
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Mixture Model-Based Clustering
• A set C of k probabilistic clusters C1, …, Ck
  • Probability density/mass functions: f1, …, fk
  • Cluster prior probabilities: w1, …, wk, with $\sum_j w_j = 1$
• Joint probability of an object i and its cluster Cj:
  • $p(x_i, z_i = C_j) = w_j f_j(x_i)$
  • $z_i$: hidden random variable
• Probability of object i:
  • $p(x_i) = \sum_j w_j f_j(x_i)$
[Figure: two component densities $f_1(x)$ and $f_2(x)$]
Maximum Likelihood Estimation
• Since objects are assumed to be generated independently, for a data set D = {x1, …, xn} we have
  $P(D) = \prod_i p(x_i) = \prod_i \sum_j w_j f_j(x_i)$
  $\Rightarrow \log P(D) = \sum_i \log p(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$
• Task: find a set C of k probabilistic clusters such that P(D) is maximized
Gaussian Mixture Model
• Generative model (a sampling sketch follows)
  • For each object:
    • Pick its cluster, i.e., a distribution component: $Z \sim \mathrm{Multinoulli}(w_1, \dots, w_k)$
    • Sample a value from the selected distribution: $X|Z \sim N(\mu_Z, \sigma_Z^2)$
• Overall likelihood function
  • $L(D|\theta) = \prod_i \sum_j w_j p(x_i|\mu_j, \sigma_j^2)$, s.t. $\sum_j w_j = 1$ and $w_j \geq 0$
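The generative story above is easy to make concrete. Below is a minimal numpy sketch of it; the weights, means, and standard deviations are made-up values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of a 2-component 1-D Gaussian mixture
w = np.array([0.6, 0.4])      # cluster priors w_j, sum to 1
mu = np.array([0.0, 5.0])     # component means
sigma = np.array([1.0, 0.5])  # component standard deviations

def sample_gmm(n):
    # Pick a component Z ~ Multinoulli(w) for each object,
    # then draw X | Z ~ N(mu_Z, sigma_Z^2)
    z = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[z], sigma[z]), z

def mixture_density(x):
    # p(x) = sum_j w_j * N(x | mu_j, sigma_j^2)
    comp = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comp @ w

x, z = sample_gmm(1000)
print(mixture_density(np.array([0.0, 2.5, 5.0])))
```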
Multinomial Mixture Model
• For documents with bag-of-words representation
  • $\mathbf{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the nth word in the vocabulary
• Generative model: for each document
  • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
    • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster, i.e., $p(z = k) = \pi_k$
  • Sample its word vector $\mathbf{x}_d \sim \mathrm{Multinomial}(\boldsymbol{\beta}_z)$
    • $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \dots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the nth word in the vocabulary
    • $p(\mathbf{x}_d|z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$ (a numeric check follows below)
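As a sanity check on the pmf above, the snippet below evaluates the full multinomial probability with scipy and compares it to the unnormalized product $\prod_n \beta_{kn}^{x_{dn}}$; the count vector and β values are invented:

```python
import numpy as np
from scipy.stats import multinomial

# Hypothetical word-count vector for one document (vocabulary of N = 4 words)
x_d = np.array([3, 1, 0, 2])

# Hypothetical word distribution beta_k of cluster k (entries sum to 1)
beta_k = np.array([0.5, 0.2, 0.2, 0.1])

# Full multinomial pmf, including the factorial coefficient
p_full = multinomial.pmf(x_d, n=x_d.sum(), p=beta_k)

# The coefficient does not depend on beta_k, so for comparing clusters
# p(x_d | z = k) is proportional to prod_n beta_kn ** x_dn
p_prop = np.prod(beta_k ** x_d)

print(p_full, p_prop, p_full / p_prop)  # the ratio is the coefficient
```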
Likelihood Function
• For a set of M documents (a log-space evaluation sketch follows):
  $L = \prod_d p(\mathbf{x}_d) = \prod_d \sum_k p(\mathbf{x}_d, z = k) = \prod_d \sum_k p(\mathbf{x}_d|z = k)\, p(z = k) \propto \prod_d \sum_k \pi_k \prod_n \beta_{kn}^{x_{dn}}$
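Evaluating this product directly underflows for realistic documents, so one works with the log-likelihood. A minimal sketch, assuming a toy count matrix and invented parameters (the multinomial coefficient is dropped since it does not depend on the parameters):

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical parameters: K = 2 clusters over a 4-word vocabulary
pi = np.array([0.5, 0.5])
beta = np.array([[0.4, 0.4, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])

# Toy count matrix X: M = 2 documents (rows) over the vocabulary
X = np.array([[5, 4, 1, 0],
              [0, 1, 3, 4]])

# log p(x_d | z = k) up to the coefficient: sum_n x_dn * log beta_kn -> (M, K)
log_px_given_z = X @ np.log(beta).T

# log L = sum_d log sum_k pi_k p(x_d | z = k), via logsumexp for stability
log_L = logsumexp(log_px_given_z + np.log(pi), axis=1).sum()
print(log_L)
```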
Mixture of Unigrams
• For documents represented by a sequence of words
  • $\mathbf{w}_d = (w_{d1}, w_{d2}, \dots, w_{dN_d})$, where $N_d$ is the length of document d and $w_{dn}$ is the word at the nth position of the document
• Generative model: for each document (a sampling sketch follows)
  • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
    • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster, i.e., $p(z = k) = \pi_k$
  • For each word in the sequence
    • Sample the word $w_{dn} \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_z)$
    • $p(w_{dn}|z = k) = \beta_{k w_{dn}}$
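The key property is that one topic is drawn per document and then reused for every word. A minimal sampling sketch; the vocabulary and parameters are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["data", "mining", "web", "retrieval"]  # toy vocabulary
pi = np.array([0.5, 0.5])                       # cluster proportions, K = 2
beta = np.array([
    [0.4, 0.4, 0.1, 0.1],   # cluster 1 leans toward "data mining"
    [0.1, 0.1, 0.4, 0.4],   # cluster 2 leans toward "web retrieval"
])

def sample_document(length):
    z = rng.choice(len(pi), p=pi)                      # one label per document
    words = rng.choice(vocab, size=length, p=beta[z])  # every word uses beta_z
    return z, list(words)

print(sample_document(6))
```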
Likelihood Function
• For a set of M documents:
  $L = \prod_d p(\mathbf{w}_d) = \prod_d \sum_k p(\mathbf{w}_d, z = k) = \prod_d \sum_k p(\mathbf{w}_d|z = k)\, p(z = k) = \prod_d \sum_k \pi_k \prod_n \beta_{k w_{dn}}$
Question
• Are the multinomial mixture model and the mixture-of-unigrams model equivalent? Why?
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Notations
• Word, document, topic
  • $w$, $d$, $z$
• Word count in document
  • $c(w, d)$
• Word distribution for each topic ($\boldsymbol{\beta}_z$)
  • $\beta_{zw}$: $p(w|z)$
• Topic distribution for each document ($\boldsymbol{\theta}_d$)
  • $\theta_{dz}$: $p(z|d)$ (yes, soft clustering)
Issues of Mixture of Unigrams
• All the words in the same document are sampled from the same topic
• In practice, people switch topics during their writing
Illustration of pLSA
Generative Model for pLSA
• Describes how a document is generated probabilistically (see the sketch below)
  • For each position $n = 1, \dots, N_d$ in document d:
    • Generate the topic for the position: $z_n \sim \mathrm{Multinoulli}(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$
      (Note: a 1-trial multinomial, i.e., a categorical distribution)
    • Generate the word for the position: $w_n \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
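Unlike the mixture of unigrams, a fresh topic is drawn at every position. A minimal sketch of the process; $\boldsymbol{\theta}_d$, $\boldsymbol{\beta}$, and the vocabulary are invented values:

```python
import numpy as np

rng = np.random.default_rng(2)

vocab = ["data", "mining", "web", "retrieval"]  # toy vocabulary
theta_d = np.array([0.7, 0.3])                  # hypothetical topic mix of document d
beta = np.array([
    [0.4, 0.4, 0.1, 0.1],   # topic 1 word distribution
    [0.1, 0.1, 0.4, 0.4],   # topic 2 word distribution
])

def generate_document(n_positions):
    words = []
    for _ in range(n_positions):
        z_n = rng.choice(len(theta_d), p=theta_d)  # fresh topic per position
        words.append(rng.choice(vocab, p=beta[z_n]))
    return words

print(generate_document(8))
```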
Graphical Model
[Figure: plate diagram of pLSA]
Note: sometimes people add the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\beta}$ into the graphical model
The Likelihood Function for a Corpus
• Probability of a word:
  $p(w|d) = \sum_k p(w, z = k|d) = \sum_k p(w|z = k)\, p(z = k|d) = \sum_k \beta_{kw} \theta_{dk}$
• Likelihood of a corpus:
  $L = \prod_d \prod_w p(w, d)^{c(w,d)} = \prod_d \prod_w \left(\pi_d \sum_k \beta_{kw} \theta_{dk}\right)^{c(w,d)}$
  where $\pi_d = p(d)$ is usually considered as uniform, i.e., $1/M$
Re-arrange the Likelihood Function
• Group the same word from different positions together:
  $\max \log L = \sum_d \sum_w c(w, d) \log \sum_z \theta_{dz} \beta_{zw}$
  $\text{s.t. } \sum_z \theta_{dz} = 1 \text{ and } \sum_w \beta_{zw} = 1$
Optimization: EM Algorithm
• Repeat until convergence (see the code sketch below):
  • E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
    $p(z|w, d) \propto p(w|z, d)\, p(z|d) = \beta_{zw} \theta_{dz}$, i.e., $p(z|w, d) = \frac{\beta_{zw} \theta_{dz}}{\sum_{z'} \beta_{z'w} \theta_{dz'}}$
  • M-step: given the conditional distribution, find the parameters that maximize the expected likelihood:
    $\beta_{zw} \propto \sum_d p(z|w, d)\, c(w, d)$, i.e., $\beta_{zw} = \frac{\sum_d p(z|w, d)\, c(w, d)}{\sum_{w', d} p(z|w', d)\, c(w', d)}$
    $\theta_{dz} \propto \sum_w p(z|w, d)\, c(w, d)$, i.e., $\theta_{dz} = \frac{\sum_w p(z|w, d)\, c(w, d)}{N_d}$
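A compact numpy sketch of these two steps, driven by the count matrix $c(w, d)$. The random initialization, toy counts, and fixed iteration count are assumptions for illustration, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(3)

def plsa_em(counts, K, n_iters=100):
    """counts: (D, W) array with counts[d, w] = c(w, d)."""
    D, W = counts.shape
    # Random init; rows of theta (D x K) and beta (K x W) sum to 1
    theta = rng.random((D, K)); theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((K, W)); beta /= beta.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: post[d, w, z] = p(z | w, d), proportional to beta_zw * theta_dz
        post = theta[:, None, :] * beta.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12  # normalize over topics
        # Expected counts n[d, w, z] = c(w, d) * p(z | w, d)
        n = counts[:, :, None] * post
        # M-step: beta_zw prop. to sum_d n; theta_dz prop. to sum_w n
        beta = n.sum(axis=0).T
        beta /= beta.sum(axis=1, keepdims=True)
        theta = n.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, beta

# Toy count matrix: 2 documents over a 4-word vocabulary
counts = np.array([[5.0, 4.0, 1.0, 0.0],
                   [0.0, 1.0, 3.0, 4.0]])
theta, beta = plsa_em(counts, K=2)
print(np.round(theta, 2))
print(np.round(beta, 2))
```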
Example
• Two documents, two topics
• Vocabulary: {data, mining, frequent, pattern, web, information, retrieval}
• At some iteration of the EM algorithm, E-step:
[Table: E-step values of $p(z|w, d)$ for each word and document]
Example (Continued)
• M-step (the two fully shown updates are checked below):
  $\beta_{11} = \frac{0.8 \times 5 + 0.5 \times 2}{11.8 + 5.8} = 5/17.6$
  $\beta_{12} = \frac{0.8 \times 4 + 0.5 \times 3}{11.8 + 5.8} = 4.7/17.6$
  $\beta_{13} = 3/17.6$, $\beta_{14} = 1.6/17.6$, $\beta_{15} = 1.3/17.6$, $\beta_{16} = 1.2/17.6$, $\beta_{17} = 0.8/17.6$
  $\theta_{11} = 11.8/17$, $\theta_{12} = 5.2/17$
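A quick arithmetic check of the two β updates written out in full; the posteriors 0.8 and 0.5 and the word counts come from the E-step table referenced on the previous slide:

```python
# beta_zw = sum_d p(z|w,d) c(w,d) / sum_{w',d} p(z|w',d) c(w',d)
beta_11 = (0.8 * 5 + 0.5 * 2) / (11.8 + 5.8)
beta_12 = (0.8 * 4 + 0.5 * 3) / (11.8 + 5.8)
print(beta_11, 5 / 17.6)    # both 0.28409...
print(beta_12, 4.7 / 17.6)  # both 0.26704...
```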
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Summary
• Basic concepts
  • Word/term, document, corpus, topic
• Mixture of unigrams
• pLSA
  • Generative model
  • Likelihood function
  • EM algorithm
Quiz
• Q1: Is Multinomial Naïve Bayes a linear classifier?
• Q2: In pLSA, does the same word at different positions in a document have the same conditional probability $p(z|w, d)$?