CS145: INTRODUCTION TO DATA MINING
Instructor: Yizhou Sun ([email protected])
December 3, 2018
Text Data: Topic Model
Methods to be Learnt
Task                    | Vector Data                                               | Set Data           | Sequence Data   | Text Data
Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN          |                    |                 | Naïve Bayes for Text
Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models  |                    |                 | pLSA
Prediction              | Linear Regression; GLM*                                   |                    |                 |
Frequent Pattern Mining |                                                           | Apriori; FP-growth | GSP; PrefixSpan |
Similarity Search       |                                                           |                    | DTW             |
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Text Data
• Word/term
• Document
  • A sequence of words
• Corpus
  • A collection of documents
Represent a Document
• Most common way: Bag-of-Words
  • Ignore the order of words
  • Keep the count
[Figure: word-count matrix for example documents c1-c5 and m1-m4, i.e., the vector space model]
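To make the bag-of-words idea concrete, here is a minimal sketch in Python (not from the slides; the toy documents and the Counter-based counting are my own illustration):

```python
from collections import Counter

# Two toy documents (hypothetical; not the c1-c5 / m1-m4 documents from the slide)
docs = [
    "data mining finds frequent patterns in data",
    "web information retrieval ranks web pages",
]

# Vocabulary: the set of distinct words in the corpus
vocab = sorted({w for doc in docs for w in doc.split()})

# Bag-of-words: per document, count each vocabulary word, ignoring word order
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[w] for w in vocab])

for doc, vec in zip(docs, bow):
    print(vec, "<-", doc)
```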
Topics
• Topic
  • A topic is represented by a word distribution
  • Relates to an issue
Topic Models
• Topic modeling
  • Get topics automatically from a corpus
  • Assign documents to topics automatically
• Most frequently used topic models
  • pLSA
  • LDA
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Mixture Model-Based Clustering
• A set C of k probabilistic clusters C1, ..., Ck
  • Probability density/mass functions: $f_1, \dots, f_k$
  • Cluster prior probabilities: $w_1, \dots, w_k$, with $\sum_j w_j = 1$
• Joint probability of an object i and its cluster $C_j$:
  • $P(x_i, z_i = C_j) = w_j f_j(x_i)$
  • $z_i$: hidden random variable
• Probability of i:
  • $P(x_i) = \sum_j w_j f_j(x_i)$
[Figure: a mixture of two component densities $f_1(x)$ and $f_2(x)$]
Maximum Likelihood Estimation
• Since objects are assumed to be generated independently, for a data set D = {x1, ..., xn}, we have
  $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
  $\Rightarrow \log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$
• Task: Find a set C of k probabilistic clusters s.t. P(D) is maximized
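As an illustration of this log-likelihood (my own sketch, with made-up weights and 1-D Gaussian components; scipy is assumed to be available), the quantity $\sum_i \log \sum_j w_j f_j(x_i)$ can be evaluated directly:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-component 1-D Gaussian mixture
w = np.array([0.4, 0.6])        # cluster prior probabilities, sum to 1
mu = np.array([0.0, 5.0])       # component means
sigma = np.array([1.0, 2.0])    # component standard deviations

x = np.array([0.3, 4.8, 5.5, -1.0])   # toy data set D

# P(x_i) = sum_j w_j f_j(x_i);  log P(D) = sum_i log P(x_i)
per_point = np.array([np.sum(w * norm.pdf(xi, mu, sigma)) for xi in x])
print(np.sum(np.log(per_point)))
```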
Gaussian Mixture Model
• Generative model
  • For each object:
    • Pick its cluster, i.e., a distribution component: $Z \sim \mathrm{Multinomial}(w_1, \dots, w_k)$
    • Sample a value from the selected distribution: $X \mid Z \sim N(\mu_Z, \sigma_Z^2)$
• Overall likelihood function
  • $L(D \mid \theta) = \prod_i \sum_j w_j p(x_i \mid \mu_j, \sigma_j^2)$, s.t. $\sum_j w_j = 1$ and $w_j \ge 0$
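A minimal sketch of this two-step generative process (pick a component, then sample from it), using hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters
w = np.array([0.4, 0.6])        # component weights: sum to 1, all >= 0
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 2.0])

def sample_gmm(n):
    # Step 1: pick the cluster, Z ~ Multinomial(w) (one trial, i.e., categorical)
    z = rng.choice(len(w), size=n, p=w)
    # Step 2: sample the value, X | Z ~ N(mu_Z, sigma_Z^2)
    return rng.normal(mu[z], sigma[z]), z

x, z = sample_gmm(5)
print(z, x)
```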
Multinomial Mixture Model
• For documents with bag-of-words representation
  • $\mathbf{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the nth vocabulary word in document d
• Generative model: for each document
  • Sample its cluster label $z \sim \mathrm{Multinomial}(\boldsymbol{\pi})$
    • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster, i.e., $p(z = k) = \pi_k$
  • Sample its word vector $\mathbf{x}_d \sim \mathrm{Multinomial}(\boldsymbol{\beta}_z)$
    • $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \dots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the nth vocabulary word
  • $p(\mathbf{x}_d \mid z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$
Likelihood Function
• For a set of M documents:
  $L = \prod_d p(\mathbf{x}_d) = \prod_d \sum_k p(\mathbf{x}_d, z = k)$
  $\;\; = \prod_d \sum_k p(\mathbf{x}_d \mid z = k)\, p(z = k)$
  $\;\; \propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$
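Putting the pieces together, here is an illustrative sketch (hypothetical $\pi$ and $\beta$, a toy count matrix, and scipy's multinomial pmf) that evaluates $p(\mathbf{x}_d \mid z = k)$ and then the corpus likelihood $L = \prod_d \sum_k \pi_k\, p(\mathbf{x}_d \mid z = k)$:

```python
import numpy as np
from scipy.stats import multinomial

# Hypothetical parameters: K = 2 clusters over an N = 3 word vocabulary
pi = np.array([0.5, 0.5])                  # cluster proportions pi_k
beta = np.array([[0.7, 0.2, 0.1],          # beta_k: word distribution of cluster k
                 [0.1, 0.3, 0.6]])

# Bag-of-words count vectors x_d for M = 2 toy documents
X = np.array([[4, 1, 0],
              [0, 2, 3]])

L = 1.0
for x_d in X:
    # p(x_d | z = k): multinomial pmf with n = total word count and p = beta_k
    p_x_given_z = np.array([multinomial.pmf(x_d, n=x_d.sum(), p=b) for b in beta])
    L *= np.sum(pi * p_x_given_z)          # p(x_d) = sum_k pi_k p(x_d | z = k)
print(L)
```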
Mixture of Unigrams
• For documents represented by a sequence of words
  • $\mathbf{d}_d = (w_{d1}, w_{d2}, \dots, w_{dN_d})$, where $N_d$ is the length of document d and $w_{dn}$ is the word at the nth position of the document
• Generative model: for each document
  • Sample its cluster label $z \sim \mathrm{Multinomial}(\boldsymbol{\pi})$
    • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster, i.e., $p(z = k) = \pi_k$
  • For each word in the sequence
    • Sample the word $w_{dn} \sim \mathrm{Multinomial}(\boldsymbol{\beta}_z)$, i.e., $p(w_{dn} \mid z = k) = \beta_{k, w_{dn}}$
Likelihood Function
• For a set of M documents:
  $L = \prod_d p(\mathbf{d}_d) = \prod_d \sum_k p(\mathbf{d}_d, z = k)$
  $\;\; = \prod_d \sum_k p(\mathbf{d}_d \mid z = k)\, p(z = k)$
  $\;\; = \prod_d \sum_k p(z = k) \prod_n \beta_{k, w_{dn}}$
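For comparison with the multinomial mixture, a small sketch (again with made-up parameters) of the mixture-of-unigrams likelihood for one document, treated as a sequence of word indices:

```python
import numpy as np

# Hypothetical parameters: K = 2 clusters over an N = 3 word vocabulary
pi = np.array([0.5, 0.5])                  # cluster proportions pi_k
beta = np.array([[0.7, 0.2, 0.1],          # beta_{k,w}: word distribution of cluster k
                 [0.1, 0.3, 0.6]])

# One toy document as a sequence of word indices w_{d1}, ..., w_{dN_d}
doc = [0, 0, 1, 0, 2]

# p(d | z = k) = prod_n beta_{k, w_dn};  p(d) = sum_k pi_k p(d | z = k)
p_doc_given_z = beta[:, doc].prod(axis=1)
print(p_doc_given_z, np.sum(pi * p_doc_given_z))
```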
Question
• Are the multinomial mixture model and the mixture of unigrams model equivalent? Why?
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Notations
• Word, document, topic
  • $w, d, z$
• Word count in document
  • $c(w, d)$
• Word distribution for each topic ($\boldsymbol{\beta}_z$)
  • $\beta_{zw} = p(w \mid z)$
• Topic distribution for each document ($\boldsymbol{\theta}_d$)
  • $\theta_{dz} = p(z \mid d)$ (yes, soft clustering)
Issues of Mixture of Unigrams
• All the words in the same document are sampled from the same topic
• In practice, people switch topics during their writing
Illustration of pLSA
Generative Model for pLSA
• Describe how a document d is generated probabilistically
  • For each position in d, $n = 1, \dots, N_d$
    • Generate the topic for the position as $z_n \sim \mathrm{Multinomial}(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$
      (Note: 1-trial multinomial, i.e., categorical distribution)
    • Generate the word for the position as $w_n \mid z_n \sim \mathrm{Multinomial}(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w \mid z_n) = \beta_{z_n w}$
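A sketch of this per-position generative process (the vocabulary, $\boldsymbol{\theta}_d$, and $\boldsymbol{\beta}$ below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["data", "mining", "web", "retrieval"]     # hypothetical vocabulary
beta = np.array([[0.5, 0.4, 0.05, 0.05],           # beta_{z,w}: word distribution of topic z
                 [0.05, 0.05, 0.5, 0.4]])
theta_d = np.array([0.7, 0.3])                     # theta_{d,z}: topic distribution of document d

def generate_document(n_positions):
    words = []
    for _ in range(n_positions):
        z = rng.choice(len(theta_d), p=theta_d)    # z_n ~ Categorical(theta_d)
        w = rng.choice(len(vocab), p=beta[z])      # w_n | z_n ~ Categorical(beta_{z_n})
        words.append(vocab[w])
    return words

print(generate_document(8))
```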
Graphical Model
Note: Sometimes, people add parameters such as $\boldsymbol{\theta}$ and $\boldsymbol{\beta}$ into the graphical model
The Likelihood Function for a Corpus
• Probability of a word:
  $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw} \theta_{dk}$
• Likelihood of a corpus:
  $\log L = \sum_d \sum_w c(w, d) \log p(w, d)$
  ($p(d)$ is usually considered as uniform, i.e., 1/M)
Re-arrange the Likelihood Function
• Group the same word from different positions together:
  $\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz} \beta_{zw}$
  $\text{s.t. } \sum_z \theta_{dz} = 1 \text{ and } \sum_w \beta_{zw} = 1$
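In this grouped form the objective is easy to evaluate; a small sketch with hypothetical counts c(w, d) and parameters $\boldsymbol{\theta}$, $\boldsymbol{\beta}$:

```python
import numpy as np

# Hypothetical word counts c(w, d): rows = documents, columns = vocabulary words
C = np.array([[5, 4, 0],
              [0, 2, 3]])

theta = np.array([[0.8, 0.2],         # theta_{d,z}: per-document topic proportions (rows sum to 1)
                  [0.3, 0.7]])
beta = np.array([[0.6, 0.3, 0.1],     # beta_{z,w}: per-topic word distributions (rows sum to 1)
                 [0.1, 0.3, 0.6]])

# log L = sum_{d,w} c(w,d) * log( sum_z theta_{d,z} * beta_{z,w} )
p_w_given_d = theta @ beta            # shape (n_docs, n_words)
print(np.sum(C * np.log(p_w_given_d)))
```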
Optimization: EM Algorithm
• Repeat until convergence
  • E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
    $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw} \theta_{dz}$, i.e., $p(z \mid w, d) = \frac{\beta_{zw} \theta_{dz}}{\sum_{z'} \beta_{z'w} \theta_{dz'}}$
  • M-step: given the conditional distribution, find the parameters that maximize the expected complete log-likelihood:
    $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d)$, i.e., $\beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w'} \sum_d p(z \mid w', d)\, c(w', d)}$
    $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d)$, i.e., $\theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$
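A compact sketch of these E- and M-steps in code (the random initialization, toy count matrix, and fixed iteration count are my own choices, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def plsa_em(C, K, n_iter=100):
    """C: count matrix c(w, d) with shape (n_docs, n_words); K: number of topics."""
    n_docs, n_words = C.shape
    # Random initialization of beta_{z,w} and theta_{d,z}; each row is normalized
    beta = rng.random((K, n_words)); beta /= beta.sum(axis=1, keepdims=True)
    theta = rng.random((n_docs, K)); theta /= theta.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: p(z | w, d) proportional to beta_{z,w} * theta_{d,z}
        resp = theta[:, :, None] * beta[None, :, :]     # shape (n_docs, K, n_words)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: accumulate weighted counts c(w, d) * p(z | w, d)
        weighted = C[:, None, :] * resp                 # shape (n_docs, K, n_words)
        beta = weighted.sum(axis=0)                     # sum over documents
        beta /= beta.sum(axis=1, keepdims=True)         # normalize over words
        theta = weighted.sum(axis=2)                    # sum over words
        theta /= theta.sum(axis=1, keepdims=True)       # normalize over topics (divides by N_d)
    return beta, theta

# Toy corpus: 2 documents over a 7-word vocabulary
C = np.array([[5, 4, 3, 1, 0, 0, 0],
              [0, 0, 0, 1, 3, 3, 2]])
beta, theta = plsa_em(C, K=2)
print(np.round(theta, 2))
```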
Example
• Two documents, two topics
• Vocabulary: {1: data, 2: mining, 3: frequent, 4: pattern, 5: web, 6: information, 7: retrieval}
• At some iteration of the EM algorithm, E-step
Example (Continued)
• M-step
  $\beta_{11} = \frac{0.8 \times 5 + 0.5 \times 2}{11.8 + 5.8} = 5/17.6$
  $\beta_{12} = \frac{0.8 \times 4 + 0.5 \times 3}{11.8 + 5.8} = 4.7/17.6$
  $\beta_{13} = 3/17.6, \; \beta_{14} = 1.6/17.6, \; \beta_{15} = 1.3/17.6, \; \beta_{16} = 1.2/17.6, \; \beta_{17} = 0.8/17.6$
  $\theta_{11} = 11.8/17, \quad \theta_{12} = 5.2/17$
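As a quick arithmetic check of these M-step numbers (the 0.8 and 0.5 are E-step probabilities $p(z = 1 \mid w, d)$ from the previous slide; 17.6 is the total weighted count for topic 1, and the slide's $\theta$ values imply document 1 contains 17 words):

```python
# beta_{1,w}: numerator = sum_d p(z=1 | w, d) * c(w, d); denominator = 11.8 + 5.8 = 17.6
beta_11 = (0.8 * 5 + 0.5 * 2) / (11.8 + 5.8)   # = 5 / 17.6
beta_12 = (0.8 * 4 + 0.5 * 3) / (11.8 + 5.8)   # = 4.7 / 17.6
theta_11 = 11.8 / 17                           # document 1 has N_1 = 17 words in total
print(round(beta_11, 3), round(beta_12, 3), round(theta_11, 3))
```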
Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary
Summary
• Basic concepts
  • Word/term, document, corpus, topic
• Mixture of unigrams
• pLSA
  • Generative model
  • Likelihood function
  • EM algorithm
CS145: INTRODUCTION TO DATA MINING
Instructor: Yizhou Sun ([email protected])
December 5, 2018
Final Review
Learnt Algorithms
Task                    | Vector Data                                               | Set Data           | Sequence Data   | Text Data
Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN          |                    |                 | Naïve Bayes for Text
Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models  |                    |                 | pLSA
Prediction              | Linear Regression; GLM*                                   |                    |                 |
Frequent Pattern Mining |                                                           | Apriori; FP-growth | GSP; PrefixSpan |
Similarity Search       |                                                           |                    | DTW             |
Deadlines ahead
• Homework 6 (optional)
  • Sunday, Dec. 9, 2018, 11:59pm
  • We will pick the 5 highest scores
• Course project
  • Tuesday, Dec. 11, 2018, 11:59pm
  • Don't forget the peer evaluation form:
    • One question is: "Do you think some member in your group should be given a lower score than the group score? If yes, please list the name, and explain why."
Final Exam
• Time
  • 12/13, 11:30am-2:30pm
• Location
  • Franz Hall 1260
• Policy
  • Closed-book exam
  • You can take two "reference sheets" of A4 size, i.e., one in addition to the midterm "reference sheet"
  • You can bring a simple calculator
Content to Cover
• All the content learned so far
  • ~20% before midterm
  • ~80% after midterm

Task                    | Vector Data                                               | Set Data           | Sequence Data   | Text Data
Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN          |                    |                 | Naïve Bayes for Text
Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models  |                    |                 | pLSA
Prediction              | Linear Regression; GLM*                                   |                    |                 |
Frequent Pattern Mining |                                                           | Apriori; FP-growth | GSP; PrefixSpan |
Similarity Search       |                                                           |                    | DTW             |
Type of Questions
• Similar to midterm
  • True or false
  • Conceptual questions
  • Computation questions
Sample question on DTW
• Suppose that we have two sequences S1 and S2 as follows:
  • S1 = <1, 2, 5, 3, 2, 1, 7>
  • S2 = <2, 3, 2, 1, 7, 4, 3, 0, 2, 5>
• Compute the distance between the two sequences according to the dynamic time warping algorithm.
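One way to work such a question (a sketch, assuming the absolute difference |S1[i] - S2[j]| as the local cost and no warping-window constraint; not the official solution) is the standard dynamic-programming recurrence:

```python
def dtw(s1, s2):
    n, m = len(s1), len(s2)
    INF = float("inf")
    # D[i][j]: DTW distance between the first i elements of s1 and the first j elements of s2
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s1[i - 1] - s2[j - 1])   # local distance between aligned elements
            # best of the three allowed predecessor cells
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

S1 = [1, 2, 5, 3, 2, 1, 7]
S2 = [2, 3, 2, 1, 7, 4, 3, 0, 2, 5]
print(dtw(S1, S2))
```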
What's next?
• Follow-up courses
  • CS247: Advanced Data Mining
    • Focus on text, recommender systems, and networks/graphs
    • Will be offered in Spring 2019
  • CS249: Probabilistic Graphical Models for Structured Data
    • Focus on probabilistic models for text and graph data
    • Research seminar: you are expected to read papers and present to the whole class; you are expected to write a survey or conduct a course project
    • Will be offered in Winter 2019
Thank you and good luck!
• Give us feedback by submitting the evaluation form
• 1-2 undergraduate research intern positions are available
  • Please refer to Piazza: https://piazza.com/class/jmpc925zfo92qs?cid=274