
  • CS249: ADVANCED DATA MINING

    Instructor: Yizhou Sun ([email protected])

    May 10, 2017

    Text Data: Topic Models


  • Announcements

    • Course project proposal
      • Due May 8th (11:59pm)

    • Homework 3 out
      • Due May 10th (11:59pm)

    • Midterm Exam
      • In class May 15th

    2

  • Methods to Learn

    3

    Task                      Vector Data                                                  Text Data        Recommender System        Graph & Network
    Classification            Decision Tree; Naïve Bayes; Logistic Regression; SVM; NN    —                —                         Label Propagation
    Clustering                K-means; hierarchical clustering; DBSCAN;                   PLSA; LDA        Matrix Factorization      SCAN; Spectral Clustering
                              Mixture Models; kernel k-means
    Prediction                Linear Regression; GLM                                      —                Collaborative Filtering   —
    Ranking                   —                                                           —                —                         PageRank
    Feature Representation    —                                                           Word embedding   —                         Network embedding

  • Text Data: Topic Models

    • Text Data and Topic Models

    • Revisit of Multinomial Mixture Model

    • Probabilistic Latent Semantic Analysis (pLSA)

    • Latent Dirichlet Allocation (LDA)

    • Summary

    4

  • Text Data

    • Word/term

    • Document
      • A sequence of words

    • Corpus
      • A collection of documents

    5

  • Represent a Document

    • Most common way: Bag-of-Words
      • Ignore the order of words
      • Keep the count

    6

    [Figure: example term-document count matrix over documents c1–c5 and m1–m4, illustrating the vector space model]

  • More Details

    • Represent the doc as a vector where each entry corresponds to a different word, and the number at that entry corresponds to how many times that word is present in the document (or some function of it)
    • The number of words is huge
      • Select and use a smaller set of words that are of interest
      • E.g., uninteresting words: 'and', 'the', 'at', 'is', etc. These are called stop-words
      • Stemming: remove endings. E.g., 'learn', 'learning', 'learnable', 'learned' could be substituted by the single stem 'learn'
      • Other simplifications can also be invented and used
    • The set of different remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that you can refer to them by their index (see the sketch after this slide)
    • Can be extended to bi-grams, tri-grams, and so on

    7
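A minimal sketch of this bag-of-words pipeline (the toy corpus, stop-word list, and vocabulary are made up for illustration, not course code):

```python
# Build a vocabulary from a toy corpus, drop a few stop-words, and turn each
# document into a vector of word counts indexed by the fixed vocabulary ordering.
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "the cat chased the dog",
]
stop_words = {"the", "on", "a", "is", "and", "at"}

# Fix an ordering of the remaining terms (the "dictionary" / "vocabulary").
vocab = sorted({w for doc in corpus for w in doc.split() if w not in stop_words})
index = {w: i for i, w in enumerate(vocab)}

def to_bow(doc):
    """Map a document to its count vector under the fixed vocabulary ordering."""
    counts = Counter(w for w in doc.split() if w in index)
    return [counts[w] for w in vocab]

for doc in corpus:
    print(to_bow(doc))
```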

  • Limitations of Vector Space Model

    • Dimensionality
      • High dimensionality

    • Sparseness
      • Most of the entries are zero

    • Shallow representation
      • The vector representation does not capture semantic relations between words

    8

  • Topics

    • Topic
      • A topic is represented by a word distribution
      • Relates to an issue

    9

  • Topic Models

    • Topic modeling
      • Get topics automatically from a corpus
      • Assign documents to topics automatically

    • Most frequently used topic models
      • pLSA
      • LDA

    10

  • Text Data: Topic Models

    • Text Data and Topic Models

    • Revisit of Multinomial Mixture Model

    • Probabilistic Latent Semantic Analysis (pLSA)

    • Latent Dirichlet Allocation (LDA)

    • Summary

    11

  • Review of Multinomial Distribution

    • Select n data points from K categories, each with probability $p_k$
      • n trials of independent categorical distribution
      • E.g., roll a die n times, each face 1–6 with probability 1/6

    • When n = 1, it is also called the discrete (categorical) distribution

    • When K = 2, it is the binomial distribution
      • n trials of independent Bernoulli distribution
      • E.g., flip a coin to get heads or tails

    12
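A quick NumPy illustration of the die example (the sampler call is standard NumPy, not course code):

```python
# Sampling from a multinomial: n independent rolls of a fair 6-sided die.
# The result is a length-6 count vector that sums to n.
import numpy as np

rng = np.random.default_rng(0)
p = np.full(6, 1 / 6)            # category probabilities p_k
counts = rng.multinomial(20, p)  # n = 20 trials
print(counts, counts.sum())      # count vector, sums to 20

# n = 1 gives the discrete (categorical) distribution: a one-hot vector.
print(rng.multinomial(1, p))
```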

  • Multinomial Mixture Model

    • For documents with bag-of-words representation
      • $\boldsymbol{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the nth word in the vocabulary

    • Generative model: for each document
      • Sample its cluster label $z \sim discrete(\boldsymbol{\pi})$
        • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster
        • $p(z = k) = \pi_k$
      • Sample its word vector $\boldsymbol{x}_d \sim multinomial(\boldsymbol{\beta}_z)$
        • $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \dots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the nth word in the vocabulary
        • $p(\boldsymbol{x}_d \mid z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$

    13
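A minimal generative sketch of this model in NumPy (K, the vocabulary size, and the parameter values are made up for illustration):

```python
# Generate documents from a multinomial mixture model:
# pick a cluster z ~ discrete(pi), then a count vector x_d ~ multinomial(beta_z).
import numpy as np

rng = np.random.default_rng(1)
K, V, n_words = 2, 5, 30                   # clusters, vocabulary size, words per doc
pi = np.array([0.4, 0.6])                  # cluster proportions
beta = rng.dirichlet(np.ones(V), size=K)   # K word distributions, rows sum to 1

def generate_doc():
    z = rng.choice(K, p=pi)                # one cluster label for the whole document
    x = rng.multinomial(n_words, beta[z])  # bag-of-words count vector
    return z, x

for _ in range(3):
    print(generate_doc())
```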

  • Likelihood Function

    • For a set of M documents

      $L = \prod_d p(\boldsymbol{x}_d) = \prod_d \sum_k p(\boldsymbol{x}_d, z = k) = \prod_d \sum_k p(\boldsymbol{x}_d \mid z = k)\, p(z = k) \propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$

    14
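A small numerical check of this formula (values are illustrative; the multinomial coefficient is dropped, matching the proportionality above):

```python
# Log-likelihood of a multinomial mixture, computed stably in log space:
# log L = sum_d log sum_k [ pi_k * prod_n beta_kn^{x_dn} ].
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(X, pi, beta):
    """X: (M, V) count matrix; pi: (K,) mixing weights; beta: (K, V) word distributions."""
    # log p(x_d | z=k) up to a constant: sum_n x_dn * log beta_kn  -> shape (M, K)
    log_px_given_z = X @ np.log(beta).T
    return logsumexp(np.log(pi) + log_px_given_z, axis=1).sum()

X = np.array([[3, 0, 2], [0, 4, 1]])
pi = np.array([0.5, 0.5])
beta = np.array([[0.6, 0.2, 0.2], [0.1, 0.7, 0.2]])
print(mixture_log_likelihood(X, pi, beta))
```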

  • Mixture of Unigrams

    • For documents represented by a sequence of words
      • $\boldsymbol{w}_d = (w_{d1}, w_{d2}, \dots, w_{dN_d})$, where $N_d$ is the length of document d and $w_{dn}$ is the word at the nth position of the document

    • Generative model: for each document
      • Sample its cluster label $z \sim discrete(\boldsymbol{\pi})$
        • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the kth cluster
        • $p(z = k) = \pi_k$
      • For each word in the sequence
        • Sample the word $w_{dn} \sim discrete(\boldsymbol{\beta}_z)$
        • $p(w_{dn} \mid z = k) = \beta_{k w_{dn}}$

    15

  • Likelihood Function

    • For a set of M documents

      $L = \prod_d p(\boldsymbol{w}_d) = \prod_d \sum_k p(\boldsymbol{w}_d, z = k) = \prod_d \sum_k p(\boldsymbol{w}_d \mid z = k)\, p(z = k) = \prod_d \sum_k p(z = k) \prod_n \beta_{k w_{dn}}$

    16

  • Text Data: Topic Models

    • Text Data and Topic Models

    • Revisit of Multinomial Mixture Model

    • Probabilistic Latent Semantic Analysis (pLSA)

    • Latent Dirichlet Allocation (LDA)

    • Summary

    18

  • Notations

    • Word, document, topic
      • $w, d, z$

    • Word count in document
      • $c(w, d)$

    • Word distribution for each topic ($\boldsymbol{\beta}_z$)
      • $\beta_{zw}$: $p(w \mid z)$

    • Topic distribution for each document ($\boldsymbol{\theta}_d$)
      • $\theta_{dz}$: $p(z \mid d)$ (Yes, soft clustering)

    19

  • Issues of Mixture of Unigrams

    • All the words in the same document are sampled from the same topic

    • In practice, people switch topics during their writing

    20

  • Illustration of pLSA

    21

  • Generative Model for pLSA

    • Describes how a document is generated probabilistically
      • For each position in d, $n = 1, \dots, N_d$
        • Generate the topic for the position as $z_n \sim discrete(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$
          (Note: a 1-trial multinomial, i.e., the discrete distribution)
        • Generate the word for the position as $w_n \sim discrete(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$

    22
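A minimal sketch of this per-position generative process (dimensions and parameter values are illustrative):

```python
# Generate one document under pLSA: each position gets its own topic z_n ~ theta_d,
# then a word w_n ~ beta_{z_n} (contrast with mixture of unigrams: one topic per doc).
import numpy as np

rng = np.random.default_rng(2)
K, V, N_d = 3, 8, 15
theta_d = rng.dirichlet(np.ones(K))        # topic distribution of this document
beta = rng.dirichlet(np.ones(V), size=K)   # one word distribution per topic

z = rng.choice(K, size=N_d, p=theta_d)                  # topic for each position
w = np.array([rng.choice(V, p=beta[zn]) for zn in z])   # word for each position
print(list(zip(z, w)))
```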

  • Graphical Model

    Note: Sometimes, people add parameters such as $\theta$ and $\beta$ into the graphical model

    23

  • The Likelihood Function for a Corpus

    • Probability of a word

      $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw}\theta_{dk}$

    • Likelihood of a corpus

      $\log L = \sum_d \sum_w c(w, d) \log\left(\pi_d \sum_k \beta_{kw}\theta_{dk}\right)$

      where $\pi_d$ is the probability of document d, usually considered as uniform, i.e., $1/M$

    24

  • Re-arrange the Likelihood Function

    • Group the same word from different positions together

      $\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz}\beta_{zw}$

      $s.t.\ \sum_z \theta_{dz} = 1 \ \text{and} \ \sum_w \beta_{zw} = 1$

    25

  • Optimization: EM Algorithm

    • Repeat until convergence (a code sketch follows this slide)

      • E-step: for each word in each document, calculate its conditional probability of belonging to each topic

        $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw}\theta_{dz} \quad \left(i.e.,\ p(z \mid w, d) = \frac{\beta_{zw}\theta_{dz}}{\sum_{z'} \beta_{z'w}\theta_{dz'}}\right)$

      • M-step: given the conditional distribution, find the parameters that maximize the expected complete log-likelihood

        $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d) \quad \left(i.e.,\ \beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w', d} p(z \mid w', d)\, c(w', d)}\right)$

        $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d) \quad \left(i.e.,\ \theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}\right)$

    26
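A compact NumPy sketch of these E- and M-step updates on a document-word count matrix (the initialization, iteration count, and toy counts are illustrative, not the course's reference implementation):

```python
# pLSA via EM on a document-word count matrix C (M x V), with K topics.
# E-step: p(z|w,d) ∝ beta[z,w] * theta[d,z]; M-step: re-estimate beta and theta.
import numpy as np

def plsa_em(C, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    M, V = C.shape
    theta = rng.dirichlet(np.ones(K), size=M)      # p(z|d), shape (M, K)
    beta = rng.dirichlet(np.ones(V), size=K)       # p(w|z), shape (K, V)
    for _ in range(n_iters):
        # E-step: posterior p(z|w,d), shape (M, V, K)
        post = theta[:, None, :] * beta.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: weighted counts c(w,d) * p(z|w,d)
        weighted = C[:, :, None] * post            # (M, V, K)
        beta = weighted.sum(axis=0).T              # (K, V)
        beta /= beta.sum(axis=1, keepdims=True)
        theta = weighted.sum(axis=1)               # (M, K)
        theta /= theta.sum(axis=1, keepdims=True)  # denominator equals N_d per doc
    return theta, beta

C = np.array([[5, 2, 0, 0], [4, 3, 0, 1], [0, 0, 6, 2], [0, 1, 5, 3]])
theta, beta = plsa_em(C, K=2)
print(np.round(theta, 2))
print(np.round(beta, 2))
```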

  • Text Data: Topic Models

    • Text Data and Topic Models

    • Revisit of Multinomial Mixture Model

    • Probabilistic Latent Semantic Analysis (pLSA)

    • Latent Dirichlet Allocation (LDA)

    • Summary

    28

  • Limitations of pLSA

    • Not a proper generative model
      • $\boldsymbol{\theta}_d$ is treated as a parameter
      • Cannot model new documents

    • Solution:
      • Make it a proper generative model by adding priors to $\theta$ and $\beta$

    29

  • Review of Conjugate Prior

    • Model:
      • $p(x \mid \theta)$
    • Prior:
      • $p(\theta \mid \alpha)$
    • Posterior:
      • $p(\theta \mid x, \alpha) \propto p(\theta, x \mid \alpha) = p(x \mid \theta)\, p(\theta \mid \alpha)$

    • Conjugate prior:
      • If $p(\theta \mid \alpha)$ and $p(\theta \mid x, \alpha)$ belong to the same distribution family (with different parameters)

    30

  • Dirichlet-Multinomial Conjugate

    • Dirichlet distribution: $\boldsymbol{\theta} \sim Dirichlet(\boldsymbol{\alpha})$

      • i.e., $p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$, where $\alpha_k > 0$

      • $E[\theta_k] = \frac{\alpha_k}{\sum_{k'} \alpha_{k'}}$

      • $Var[\theta_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)}$, where $\alpha_0 = \sum_k \alpha_k$

    31

    Example: $\boldsymbol{\theta} \sim Dirichlet(\boldsymbol{\alpha})$, with $\boldsymbol{\alpha}/\alpha_0 = (\frac{1}{2}, \frac{1}{3}, \frac{1}{6})$

    (Presenter note: $\alpha_k$ acts as a pseudo-count)
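A quick numerical illustration (the empirical mean check is my own addition; $\alpha$ is chosen so that $\alpha/\alpha_0 = (1/2, 1/3, 1/6)$ as in the example above):

```python
# Draw samples from Dirichlet(alpha) and check that the sample mean of each
# component approaches alpha_k / alpha_0, here (1/2, 1/3, 1/6) for alpha = (3, 2, 1).
import numpy as np

rng = np.random.default_rng(3)
alpha = np.array([3.0, 2.0, 1.0])           # alpha / alpha_0 = (1/2, 1/3, 1/6)
samples = rng.dirichlet(alpha, size=10000)  # each row sums to 1
print(samples.mean(axis=0))                 # ≈ [0.5, 0.333, 0.167]
print(alpha / alpha.sum())
```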

  • More Examples

    32

    [Figure: density plots over the 3-simplex for Dirichlet(1,1,1), Dirichlet(2,2,2), Dirichlet(10,10,10), Dirichlet(2,2,10), Dirichlet(2,10,2), and Dirichlet(0.1,0.1,0.1)]
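If the plots are unavailable, a small sampling sketch (my own addition) conveys the same intuition: larger symmetric α concentrates draws near the center of the simplex, while α < 1 pushes them toward the corners:

```python
# Compare how spread out Dirichlet draws are for different symmetric alpha values.
import numpy as np

rng = np.random.default_rng(4)
for a in (0.1, 1.0, 2.0, 10.0):
    draws = rng.dirichlet(np.full(3, a), size=5000)
    # distance from the uniform point (1/3, 1/3, 1/3) as a rough spread measure
    spread = np.linalg.norm(draws - 1 / 3, axis=1).mean()
    print(f"alpha = ({a}, {a}, {a}): mean distance from center = {spread:.3f}")
```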

  • Simplex View

    • $\alpha = (2, 3, 4)$

    33

    [Figure: density of Dirichlet(2,3,4) shown on the 3-dimensional simplex with axes x1, x2, x3]

  • More Examples in the Simplex View

    34

  • Dirichlet-Multinomial Conjugate

    • Posterior

      • $p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}, z_1, z_2, \dots, z_n) \propto p(z_1, \dots, z_n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \propto \prod_k \theta_k^{\alpha_k + c_k - 1}$

        where $c_k$ is the number of z's that take the value k, i.e., the posterior is $Dirichlet(\boldsymbol{\alpha} + \boldsymbol{c})$

    35
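A one-line illustration of this conjugate update (the prior and counts are made up):

```python
# Dirichlet-multinomial conjugacy: observing counts c simply adds to the prior alpha.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])     # prior Dirichlet(alpha)
counts = np.array([5, 2, 0])          # c_k = number of z's equal to k
alpha_post = alpha + counts           # posterior is Dirichlet(alpha + c)
print(alpha_post, alpha_post / alpha_post.sum())   # posterior parameters and mean E[theta_k]
```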

  • The Graphical Model of LDA

    36

    $\boldsymbol{\theta}_d \sim Dirichlet(\boldsymbol{\alpha})$, $\boldsymbol{\beta}_k \sim Dirichlet(\boldsymbol{\eta})$

  • Posterior Inference for LDA

    • The joint distribution of latent variables and documents is

      $p(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{z}, \boldsymbol{w} \mid \boldsymbol{\alpha}, \boldsymbol{\eta}) = \prod_k p(\boldsymbol{\beta}_k \mid \boldsymbol{\eta}) \prod_d \left( p(\boldsymbol{\theta}_d \mid \boldsymbol{\alpha}) \prod_{n=1}^{N_d} p(z_{dn} \mid \boldsymbol{\theta}_d)\, p(w_{dn} \mid \boldsymbol{\beta}_{z_{dn}}) \right)$

    37

  • Posterior Inference

    • Posterior of the latent variables

      $p(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{z} \mid \boldsymbol{w}, \boldsymbol{\alpha}, \boldsymbol{\eta}) = \frac{p(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{z}, \boldsymbol{w} \mid \boldsymbol{\alpha}, \boldsymbol{\eta})}{p(\boldsymbol{w} \mid \boldsymbol{\alpha}, \boldsymbol{\eta})}$

      The marginal $p(\boldsymbol{w})$ is intractable!

    • Solutions (a sampling sketch follows this slide)
      • Gibbs sampling
      • Variational inference

    38
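As one concrete route, a compact sketch of collapsed Gibbs sampling for LDA in the spirit of Griffiths et al. (2004); the toy corpus, hyperparameter values, and variable names are illustrative, and the update form (with θ and β integrated out) follows the standard collapsed sampler rather than anything specific to these slides:

```python
# Collapsed Gibbs sampling for LDA: theta and beta are integrated out, and each token's
# topic is resampled from p(z=k | rest) ∝ (n_dk + alpha) * (n_kw + eta) / (n_k + V*eta).
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))        # topic counts per document
    n_kw = np.zeros((K, V))                # word counts per topic
    n_k = np.zeros(K)                      # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]   # random initial assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                # remove this token's current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                # add it back under the new assignment
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    beta = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
    return theta, beta

docs = [[0, 1, 0, 2], [1, 0, 0, 1], [3, 4, 3, 4], [4, 3, 3, 4]]   # word ids per doc
theta, beta = lda_gibbs(docs, K=2, V=5)
print(np.round(theta, 2)); print(np.round(beta, 2))
```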

  • References for LDA

    • Blei et al., Latent Dirichlet Allocation, JMLR 3 (2003)
    • Griffiths et al., Finding Scientific Topics, PNAS (2004)

    39

  • Text Data: Topic Models

    • Text Data and Topic Models

    • Revisit of Multinomial Mixture Model

    • Probabilistic Latent Semantic Analysis (pLSA)

    • Latent Dirichlet Allocation (LDA)

    • Summary

    40

  • Summary

    • Basic Concepts
      • Word/term, document, corpus, topic
      • How to represent a document

    • Mixture of unigrams

    • pLSA
      • Generative model
      • Likelihood function
      • EM algorithm

    • LDA
      • Dirichlet-multinomial conjugate
      • Posterior inference

    41
