
CS145: INTRODUCTION TO DATA MINING

Instructor: Yizhou Sun (yzsun@cs.ucla.edu)

December 3, 2018

Text Data: Topic Model

Methods to be Learnt

• Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
• Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
• Prediction: Linear Regression; GLM* (vector data)
• Frequent Pattern Mining: Apriori; FP growth (set data); GSP; PrefixSpan (sequence data)
• Similarity Search: DTW (sequence data)

Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary

Text Data
• Word/term
• Document
  • A sequence of words
• Corpus
  • A collection of documents

Represent a Document
• Most common way: Bag-of-Words
  • Ignore the order of words
  • Keep the count

[Figure: example document-word count matrix for documents c1, …, c5 and m1, …, m4 (vector space model)]
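As a quick illustration of bag-of-words counting (a minimal sketch; the example sentence is made up, not from the slides):

```python
from collections import Counter

doc = "data mining finds frequent patterns and mining finds topics"
bag = Counter(doc.lower().split())   # word order is discarded, only counts are kept
print(bag)                           # e.g. Counter({'mining': 2, 'finds': 2, 'data': 1, ...})
```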

Topics
• Topic
  • A topic is represented by a word distribution
  • Relates to an issue

Topic Models
• Topic modeling
  • Get topics automatically from a corpus
  • Assign documents to topics automatically
• Most frequently used topic models
  • pLSA
  • LDA

Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary

Mixture Model-Based Clustering
• A set C of k probabilistic clusters $C_1, \ldots, C_k$
  • Probability density/mass functions: $f_1, \ldots, f_k$
  • Cluster prior probabilities: $w_1, \ldots, w_k$, with $\sum_j w_j = 1$
• Joint probability of an object $x_i$ and its cluster $C_j$:
  • $P(x_i, z_i = C_j) = w_j f_j(x_i)$, where $z_i$ is a hidden random variable
• Probability of $x_i$:
  • $P(x_i) = \sum_j w_j f_j(x_i)$

[Figure: two overlapping component densities, $f_1(x)$ and $f_2(x)$]

Maximum Likelihood Estimation
• Since objects are assumed to be generated independently, for a data set $D = \{x_1, \ldots, x_n\}$ we have
$$P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$$
$$\Rightarrow \log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$$
• Task: find a set C of k probabilistic clusters such that P(D) is maximized

Gaussian Mixture Model
• Generative model: for each object
  • Pick its cluster, i.e., a distribution component: $Z \sim \mathrm{Multinoulli}(w_1, \ldots, w_k)$
  • Sample a value from the selected distribution: $X \mid Z \sim N(\mu_Z, \sigma_Z^2)$
• Overall likelihood function
  • $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$, s.t. $\sum_j w_j = 1$ and $w_j \ge 0$
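To make this likelihood concrete, here is a minimal NumPy/SciPy sketch (my own illustration, not course code) that evaluates $\log L(D \mid \theta) = \sum_i \log \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$ for made-up 1-D data and parameters:

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, w, mu, sigma):
    """log L = sum_i log sum_j w_j * N(x_i | mu_j, sigma_j^2)."""
    # densities[i, j] = N(x_i | mu_j, sigma_j^2)
    densities = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.sum(np.log(densities @ w))

# Hypothetical 1-D data and a 2-component mixture
x = np.array([0.1, 0.3, 2.9, 3.2, 3.0])
w = np.array([0.4, 0.6])        # cluster priors, sum to 1
mu = np.array([0.0, 3.0])       # component means
sigma = np.array([0.5, 0.5])    # component standard deviations
print(gmm_log_likelihood(x, w, mu, sigma))
```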

Multinomial Mixture Model
• For documents with bag-of-words representation
  • $\boldsymbol{x}_d = (x_{d1}, x_{d2}, \ldots, x_{dN})$, where $x_{dn}$ is the count of the nth word of the vocabulary in document d
• Generative model: for each document
  • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
    • $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster
    • $p(z = k) = \pi_k$
  • Sample its word vector $\boldsymbol{x}_d \sim \mathrm{Multinomial}(\boldsymbol{\beta}_z)$
    • $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \ldots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the nth word in the vocabulary
    • $p(\boldsymbol{x}_d \mid z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$
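As a concrete sketch of this generative story (my own illustration; the cluster proportions and word distributions below are made up), one can sample a document's entire count vector in two steps:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.4, 0.6])                    # cluster proportions (hypothetical)
beta = np.array([                            # beta[k, n] = prob. of nth vocabulary word in cluster k
    [0.3, 0.3, 0.2, 0.1, 0.05, 0.03, 0.02],
    [0.05, 0.05, 0.05, 0.05, 0.3, 0.2, 0.3],
])

def sample_count_vector(n_words):
    z = rng.choice(len(pi), p=pi)            # step 1: sample the cluster label
    x_d = rng.multinomial(n_words, beta[z])  # step 2: sample the whole count vector at once
    return z, x_d

print(sample_count_vector(20))
```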

Likelihood Function
• For a set of M documents:
$$L = \prod_d p(\boldsymbol{x}_d) = \prod_d \sum_k p(\boldsymbol{x}_d, z = k) = \prod_d \sum_k p(\boldsymbol{x}_d \mid z = k)\, p(z = k) \propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$$
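A minimal sketch of how this likelihood can be evaluated from a word-count matrix (my own illustration; it drops the multinomial coefficient, which does not depend on the parameters, matching the $\propto$ above, and uses log-sum-exp for numerical stability):

```python
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(X, pi, beta):
    """Unnormalized log-likelihood of the multinomial mixture.

    X:    (M, N) word-count matrix, one row per document
    pi:   (K,)   cluster proportions, sums to 1
    beta: (K, N) per-cluster word distributions, rows sum to 1
    """
    # log p(x_d, z=k) up to a constant: log pi_k + sum_n x_dn * log beta_kn
    log_joint = np.log(pi)[None, :] + X @ np.log(beta).T   # shape (M, K)
    # log p(x_d) = logsumexp_k log p(x_d, z=k); then sum over documents
    return np.sum(logsumexp(log_joint, axis=1))
```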

Mixture of Unigrams
• For documents represented by a sequence of words
  • $\boldsymbol{w}_d = (w_{d1}, w_{d2}, \ldots, w_{dN_d})$, where $N_d$ is the length of document d and $w_{dn}$ is the word at the nth position of the document
• Generative model: for each document
  • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
    • $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster
    • $p(z = k) = \pi_k$
  • For each word in the sequence, sample the word $w_{dn} \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_z)$
    • $p(w_{dn} \mid z = k) = \beta_{k w_{dn}}$

Likelihood Function
• For a set of M documents:
$$L = \prod_d p(\boldsymbol{w}_d) = \prod_d \sum_k p(\boldsymbol{w}_d, z = k) = \prod_d \sum_k p(\boldsymbol{w}_d \mid z = k)\, p(z = k) = \prod_d \sum_k p(z = k) \prod_n \beta_{k w_{dn}}$$
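For contrast with the multinomial mixture above, here is a hedged sketch of the mixture-of-unigrams generative process (made-up vocabulary and parameters): a single cluster label is drawn per document, and then every word position is drawn from that one cluster's word distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["data", "mining", "frequent", "pattern", "web", "information", "retrieval"]
pi = np.array([0.5, 0.5])                     # cluster proportions (hypothetical)
beta = np.array([                             # per-cluster word distributions
    [0.3, 0.3, 0.2, 0.1, 0.05, 0.03, 0.02],
    [0.05, 0.05, 0.05, 0.05, 0.3, 0.2, 0.3],
])

def sample_document(length):
    z = rng.choice(len(pi), p=pi)                       # one cluster label for the whole document
    words = rng.choice(vocab, size=length, p=beta[z])   # every word uses the same beta_z
    return z, list(words)

print(sample_document(6))
```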

Question
• Are the multinomial mixture model and the mixture-of-unigrams model equivalent? Why?

Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary

Notations
• Word, document, topic: $w, d, z$
• Word count in document: $c(w, d)$
• Word distribution for each topic ($\boldsymbol{\beta}_z$): $\beta_{zw} = p(w \mid z)$
• Topic distribution for each document ($\boldsymbol{\theta}_d$): $\theta_{dz} = p(z \mid d)$ (yes, this is soft clustering)

Issues of Mixture of Unigrams
• All the words in the same document are sampled from the same topic
• In practice, people switch topics during their writing

Illustration of pLSA


Generative Model for pLSA

• Describe how a document d is generated probabilistically
• For each position in d, $n = 1, \ldots, N_d$:
  • Generate the topic for the position as $z_n \sim \mathrm{Multinoulli}(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$
    (note: a 1-trial multinomial, i.e., a categorical distribution)
  • Generate the word for the position as $w_n \mid z_n \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w \mid z_n) = \beta_{z_n w}$
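The key difference from the mixture of unigrams is that the topic is re-drawn at every position. A minimal sketch of this per-position process (again with made-up $\theta_d$ and $\beta$, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["data", "mining", "frequent", "pattern", "web", "information", "retrieval"]
theta_d = np.array([0.7, 0.3])               # p(z | d): topic proportions of document d (hypothetical)
beta = np.array([                            # beta[k, w] = p(w | z = k)
    [0.3, 0.3, 0.2, 0.1, 0.05, 0.03, 0.02],
    [0.05, 0.05, 0.05, 0.05, 0.3, 0.2, 0.3],
])

def generate_document(n_positions):
    words = []
    for _ in range(n_positions):
        z_n = rng.choice(len(theta_d), p=theta_d)   # topic for this position
        w_n = rng.choice(len(vocab), p=beta[z_n])   # word given that topic
        words.append(vocab[w_n])
    return words

print(generate_document(8))
```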

Graphical Model

Note: Sometimes, people add parameters such as $\theta$ and $\beta$ into the graphical model

The Likelihood Function for a Corpus
• Probability of a word:
$$p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw}\, \theta_{dk}$$
• Likelihood of a corpus:
$$\log L = \sum_d \sum_w c(w, d) \log p(w, d) = \sum_d \sum_w c(w, d) \log \Big[ \pi_d \sum_k \beta_{kw}\, \theta_{dk} \Big]$$
where $\pi_d = p(d)$ is usually considered as uniform, i.e., $1/M$

Re-arrange the Likelihood Function
• Group the same word from different positions together:
$$\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz}\, \beta_{zw}$$
$$\text{s.t. } \sum_z \theta_{dz} = 1 \text{ and } \sum_w \beta_{zw} = 1$$
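A minimal sketch (my own, not course code) of how this re-arranged objective is evaluated from a document-word count matrix:

```python
import numpy as np

def plsa_log_likelihood(counts, theta, beta):
    """log L = sum_{d,w} c(w, d) * log sum_z theta_{dz} * beta_{zw}.

    counts: (M, V) matrix with counts[d, w] = c(w, d)
    theta:  (M, K) document-topic distributions, rows sum to 1
    beta:   (K, V) topic-word distributions, rows sum to 1
    """
    p_w_given_d = theta @ beta                 # entry (d, w) = sum_z theta_dz * beta_zw
    return np.sum(counts * np.log(p_w_given_d))
```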

Optimization: EM Algorithm
• Repeat until convergence:
  • E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
$$p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw}\, \theta_{dz}, \quad \text{i.e., } p(z \mid w, d) = \frac{\beta_{zw}\, \theta_{dz}}{\sum_{z'} \beta_{z'w}\, \theta_{dz'}}$$
  • M-step: given the conditional distribution, find the parameters that maximize the expected complete log-likelihood:
$$\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d), \quad \text{i.e., } \beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w', d} p(z \mid w', d)\, c(w', d)}$$
$$\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d), \quad \text{i.e., } \theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$$
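Putting the two steps together, here is a compact NumPy sketch of these EM updates (my own illustration under the notation above, not the instructor's code; it assumes dense count matrices and ignores numerical corner cases such as zero counts):

```python
import numpy as np

def plsa_em(counts, K, n_iters=100, seed=0):
    """EM for pLSA on a document-word count matrix.

    counts: (M, V) matrix with counts[d, w] = c(w, d)
    Returns theta (M, K) with p(z|d) and beta (K, V) with p(w|z).
    """
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    theta = rng.random((M, K))
    theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((K, V))
    beta /= beta.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: post[d, w, z] = p(z | w, d), proportional to beta_{zw} * theta_{dz}
        post = theta[:, None, :] * beta.T[None, :, :]     # (M, V, K)
        post /= post.sum(axis=2, keepdims=True)

        # M-step: weight the posteriors by the counts c(w, d)
        weighted = counts[:, :, None] * post              # (M, V, K)
        beta = weighted.sum(axis=0).T                     # sum over d -> (K, V)
        beta /= beta.sum(axis=1, keepdims=True)           # normalize over words w
        theta = weighted.sum(axis=1)                      # sum over w -> (M, K)
        theta /= theta.sum(axis=1, keepdims=True)         # equals dividing by N_d
    return theta, beta
```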

Example
• Two documents, two topics
• Vocabulary: {1: data, 2: mining, 3: frequent, 4: pattern, 5: web, 6: information, 7: retrieval}
• At some iteration of the EM algorithm, E-step

Example (Continued)
• M-step:
$$\beta_{11} = \frac{0.8 \times 5 + 0.5 \times 2}{11.8 + 5.8} = 5/17.6$$
$$\beta_{12} = \frac{0.8 \times 4 + 0.5 \times 3}{11.8 + 5.8} = 4.7/17.6$$
$$\beta_{13} = 3/17.6,\ \ \beta_{14} = 1.6/17.6,\ \ \beta_{15} = 1.3/17.6,\ \ \beta_{16} = 1.2/17.6,\ \ \beta_{17} = 0.8/17.6$$
$$\theta_{11} = \frac{11.8}{17}, \quad \theta_{12} = \frac{5.2}{17}$$
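The visible fractions can be checked mechanically; a quick sketch using only the numbers shown above (the E-step posteriors 0.8 and 0.5 and the counts are taken as given from the formulas on this slide):

```python
# M-step arithmetic from the slide, taken at face value
num_beta11 = 0.8 * 5 + 0.5 * 2        # = 5.0
num_beta12 = 0.8 * 4 + 0.5 * 3        # = 4.7
denom = 11.8 + 5.8                    # = 17.6, normalizer of topic 1's beta row
print(num_beta11 / denom, num_beta12 / denom)   # 5/17.6 and 4.7/17.6
print(11.8 / 17 + 5.2 / 17)           # theta_11 + theta_12 should sum to 1 (N_1 = 17)
```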

Text Data: Topic Models
• Text Data and Topic Models
• Revisit of Mixture Model
• Probabilistic Latent Semantic Analysis (pLSA)
• Summary

Summary
• Basic concepts
  • Word/term, document, corpus, topic
• Mixture of unigrams
• pLSA
  • Generative model
  • Likelihood function
  • EM algorithm

CS145: INTRODUCTION TO DATA MINING

Instructor: Yizhou Sun (yzsun@cs.ucla.edu)

December 5, 2018

Final Review

Learnt Algorithms

• Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
• Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
• Prediction: Linear Regression; GLM* (vector data)
• Frequent Pattern Mining: Apriori; FP growth (set data); GSP; PrefixSpan (sequence data)
• Similarity Search: DTW (sequence data)

Deadlines ahead
• Homework 6 (optional)
  • Sunday, Dec. 9, 2018, 11:59pm
  • We will pick the 5 highest scores
• Course project
  • Tuesday, Dec. 11, 2018, 11:59pm
  • Don’t forget the peer evaluation form:
    • One question is: “Do you think some member in your group should be given a lower score than the group score? If yes, please list the name, and explain why.”

Final Exam
• Time
  • 12/13, 11:30am-2:30pm
• Location
  • Franz Hall 1260
• Policy
  • Closed-book exam
  • You can take two “reference sheets” of A4 size, i.e., one in addition to the midterm “reference sheet”
  • You can bring a simple calculator

Content to Cover
• All the content learned so far
  • ~20% before midterm
  • ~80% after midterm

• Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
• Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
• Prediction: Linear Regression; GLM* (vector data)
• Frequent Pattern Mining: Apriori; FP growth (set data); GSP; PrefixSpan (sequence data)
• Similarity Search: DTW (sequence data)

Type of Questions
• Similar to Midterm
  • True or false
  • Conceptual questions
  • Computation questions

Sample question on DTW
• Suppose that we have two sequences S1 and S2 as follows:
  • S1 = <1, 2, 5, 3, 2, 1, 7>
  • S2 = <2, 3, 2, 1, 7, 4, 3, 0, 2, 5>
• Compute the distance between the two sequences according to the dynamic time warping algorithm.
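For self-checking, here is a minimal DTW sketch (my own illustration; it assumes absolute difference as the local cost and the standard three-way recurrence, while other conventions such as squared difference would give a different number):

```python
import numpy as np

def dtw_distance(s1, s2):
    """Cumulative DTW cost with |a - b| as the local cost."""
    n, m = len(s1), len(s2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s1[i - 1] - s2[j - 1])
            # extend the best warping path from the left, lower, or diagonal cell
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

S1 = [1, 2, 5, 3, 2, 1, 7]
S2 = [2, 3, 2, 1, 7, 4, 3, 0, 2, 5]
print(dtw_distance(S1, S2))   # prints 11.0 under these assumptions
```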

What’s next?
• Follow-up courses
  • CS247: Advanced Data Mining
    • Focus on Text, Recommender Systems, and Networks/Graphs
    • Will be offered in Spring 2019
  • CS249: Probabilistic Graphical Models for Structured Data
    • Focus on probabilistic models on text and graph data
    • Research seminar: you are expected to read papers and present to the whole class; you are expected to write a survey or conduct a course project
    • Will be offered in Winter 2019

Thank you and good luck!
• Give us feedback by submitting the evaluation form
• 1-2 undergraduate research intern positions are available
  • Please refer to Piazza: https://piazza.com/class/jmpc925zfo92qs?cid=274