
CS 6347

Lecture 21

Topic Models and LDA (some slides by David Blei)

Generative vs. Discriminative Models

• Recall that, in Bayesian networks, there can be many different but equivalent models of the same joint distribution

• Although these two models are equivalent (in the sense that they imply the same independence relations), they can differ significantly when it comes to inference/prediction

[Figure: two two-node Bayesian networks over X and Y, one labeled Generative and the other Discriminative]

Generative vs. Discriminative Models

• Generative models: we can think of the observations as being generated by the latent variables

– Start sampling at the top and work downwards

– Examples?


Generative vs. Discriminative Models

• Generative models: we can think of the observations as being generated by the latent variables

– Start sampling at the top and work downwards

– Examples: HMMs, naïve Bayes, LDA


Generative vs. Discriminative Models

• Discriminative models: most useful for discriminating the values of the latent variables

– Almost always used for supervised learning

– Examples?


Generative vs. Discriminative Models

• Discriminative models: most useful for discriminating the values of the latent variables

– Almost always used for supervised learning

– Examples: CRFs


Generative vs. Discriminative Models

• Suppose we are only interested in the prediction task (i.e., estimating p(Y | X))

– Discriminative model: p(X, Y) = p(X) p(Y | X)

– Generative model: p(X, Y) = p(Y) p(X | Y)
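As a concrete sanity check on these two factorizations, here is a minimal sketch (with made-up probability tables for binary X and Y) that recovers p(Y | X) from the generative factorization p(Y) p(X | Y) via Bayes' rule; a discriminative model would instead parameterize p(Y | X) directly:

import numpy as np

# Hypothetical binary example: Y in {0, 1}, X in {0, 1} (made-up numbers).
p_y = np.array([0.6, 0.4])                # p(Y = 0), p(Y = 1)
p_x_given_y = np.array([[0.8, 0.2],       # p(X | Y = 0)
                        [0.3, 0.7]])      # p(X | Y = 1)

# Generative route: form the joint p(X, Y) = p(Y) p(X | Y), then condition on X.
joint = p_y[:, None] * p_x_given_y        # joint[y, x] = p(Y = y, X = x)
p_y_given_x = joint / joint.sum(axis=0, keepdims=True)

print(p_y_given_x[:, 1])                  # p(Y | X = 1) obtained by Bayes' rule

# A discriminative model (e.g., logistic regression) would model p(Y | X)
# directly and never specify p(X) at all.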


Generative Models

• The primary advantage of generative models is that they provide a model of the data generating process

– Could generate “new” data samples by using the model

• Topic models

– Methods for discovering themes (topics) from a collection (e.g., books, newspapers, etc.)

– Annotate the collection according to the discovered themes

– Use the annotations to organize, search, summarize, etc.


Models of Text Documents

• Bag-of-words models: assume that the ordering of words in a document does not matter

– This is typically false, as the words of certain phrases only appear together in a fixed order

• Unigram model: all words in a document are drawn independently at random from a single categorical distribution

• Mixture of unigrams model: for each document, we first choose a topic z and then generate words for the document from the conditional distribution p(w | z)

– Topics are just probability distributions over words
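A minimal sketch of the mixture-of-unigrams generative process; the vocabulary, topic-word probabilities, and topic prior below are made-up placeholders, not quantities estimated from any corpus:

import numpy as np

rng = np.random.default_rng(0)

vocab = ["ball", "game", "vote", "election", "team", "senate"]   # toy vocabulary
topics = np.array([[0.4, 0.3, 0.0, 0.0, 0.3, 0.0],               # p(w | z = 0), a "sports" topic
                   [0.0, 0.0, 0.3, 0.4, 0.0, 0.3]])              # p(w | z = 1), a "politics" topic
topic_prior = np.array([0.5, 0.5])                               # p(z)

def sample_document(num_words):
    # Mixture of unigrams: one topic per document, then i.i.d. words from p(w | z).
    z = rng.choice(len(topic_prior), p=topic_prior)
    words = rng.choice(vocab, size=num_words, p=topics[z])
    return z, list(words)

print(sample_document(8))

Note that the whole document shares a single topic z; LDA relaxes exactly this restriction by letting every word have its own topic.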


Topic Models


Latent Dirichlet Allocation (LDA)

• α and η are parameters of the prior distributions over θ and β, respectively

• θ_d is the distribution of topics for document d (a real vector of length K)

• β_k is the distribution of words for topic k (a real vector of length V)

• z_{d,n} is the topic for the n-th word in the d-th document

• w_{d,n} is the n-th word of the d-th document


Latent Dirichlet Allocation (LDA)

• Plate notation

– There are N · D different variables that represent the observed words in the different documents

– There are K total topics (assumed to be known in advance)

– There are D total documents

Latent Dirichlet Allocation (LDA)

• The only observed variables are the words in the documents

– The topic for each word, the distribution over topics for each document, and the distribution of words per topic are all latent variables in this model


Latent Dirichlet Allocation (LDA)

• The model contains both continuous and discrete random variables

– θ_d and β_k are vectors of probabilities

– z_{d,n} is an integer in {1, …, K} that indicates the topic of the n-th word in the d-th document

– w_{d,n} is an integer in {1, …, V} which indexes over all possible words
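To make these types concrete, here is a hypothetical dimension check of the LDA variables for small, made-up values of D, K, V, and N:

import numpy as np

D, K, V, N = 3, 2, 6, 8            # documents, topics, vocabulary size, words per document (toy values)

theta = np.full((D, K), 1.0 / K)   # theta[d]: topic proportions of document d (each row sums to 1)
beta = np.full((K, V), 1.0 / V)    # beta[k]: word distribution of topic k (each row sums to 1)
z = np.zeros((D, N), dtype=int)    # z[d, n]: topic index in {0, ..., K-1} of the n-th word of document d
w = np.zeros((D, N), dtype=int)    # w[d, n]: word index in {0, ..., V-1}; these are the only observed values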


Latent Dirichlet Allocation (LDA)

• θ_d ~ Dir(α), where Dir(α) is the Dirichlet distribution with parameter vector α > 0

• β_k ~ Dir(η) with parameter vector η > 0

• The Dirichlet distribution is a distribution over (x_1, …, x_K) such that x_1, …, x_K ≥ 0 and ∑_i x_i = 1

f(x_1, …, x_K; α_1, …, α_K) ∝ ∏_i x_i^(α_i − 1)

– The Dirichlet distribution is a distribution over probability distributions over K elements
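One quick way to build intuition for the Dirichlet prior is to draw a few samples and check that each draw is itself a probability vector; the sketch below uses numpy's Dirichlet sampler with arbitrary example parameter vectors:

import numpy as np

rng = np.random.default_rng(0)

# Each draw from Dir(alpha) is a length-K vector with nonnegative entries summing to 1.
alpha_sparse = np.full(5, 0.1)     # entries < 1: draws put most of their mass on a few components
alpha_large = np.full(5, 10.0)     # large entries: draws concentrate near the uniform distribution

print(rng.dirichlet(alpha_sparse))
print(rng.dirichlet(alpha_large))
print(rng.dirichlet(alpha_sparse).sum())   # 1.0 up to floating-point error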


Latent Dirichlet Allocation (LDA)

• The discrete random variables are distributed via the corresponding probability distributions

p(z_{d,n} = k | θ_d) = (θ_d)_k

p(w_{d,n} = v | z_{d,n}, β_1, …, β_K) = (β_{z_{d,n}})_v

– Here, (θ_d)_k is the k-th element of the vector θ_d, which corresponds to the percentage of document d corresponding to topic k

• The joint distribution is then

p(w, z, θ, β | α, η) = ∏_k p(β_k | η) ∏_d p(θ_d | α) ∏_n p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
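Putting the pieces together, here is a minimal sketch of the LDA generative process that this joint distribution describes, using toy sizes and symmetric priors chosen only for illustration:

import numpy as np

rng = np.random.default_rng(0)

D, K, V, N = 4, 3, 10, 20          # documents, topics, vocabulary size, words per document (toy sizes)
alpha, eta = 0.5, 0.1              # symmetric Dirichlet hyperparameters (illustrative values)

beta = rng.dirichlet(np.full(V, eta), size=K)        # beta_k ~ Dir(eta), one word distribution per topic

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))       # theta_d ~ Dir(alpha), topic proportions for document d
    words = []
    for n in range(N):
        z_dn = rng.choice(K, p=theta_d)              # z_{d,n} drawn from theta_d
        w_dn = rng.choice(V, p=beta[z_dn])           # w_{d,n} drawn from beta_{z_{d,n}}
        words.append(w_dn)
    docs.append(words)

print(docs[0])                                       # word indices of the first generated document

Sampling in this top-down order (topics, then per-document proportions, then per-word topics and words) is exactly what makes LDA a generative model.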


Latent Dirichlet Allocation (LDA)

• LDA is a generative model

– We can think of the words as being generated by a probabilistic process defined by the model

– How reasonable is the generative model?


Latent Dirichlet Allocation (LDA)

• Inference in this model is NP-hard

• Given the D documents, we want to find the parameters that maximize the joint probability

– Can use an EM-based approach called variational EM


Variational EM

• Recall that the EM algorithm constructed a lower bound using Jensen’s inequality

ℓ(θ) = ∑_{k=1}^{K} log ∑_{x_mis^k} p(x_obs^k, x_mis^k | θ)

= ∑_{k=1}^{K} log ∑_{x_mis^k} q_k(x_mis^k) · [ p(x_obs^k, x_mis^k | θ) / q_k(x_mis^k) ]

≥ ∑_{k=1}^{K} ∑_{x_mis^k} q_k(x_mis^k) log [ p(x_obs^k, x_mis^k | θ) / q_k(x_mis^k) ]


Variational EM

• Performing the optimization over q is equivalent to computing p(x_mis^k | x_obs^k, θ)

• This can be intractable in practice

– Instead, restrict q to lie in some restricted set of distributions Q

– For example, could make a mean-field assumption

q(x_mis^k) = ∏_{i ∈ mis^k} q_i(x_i)

• The resulting algorithm only yields an approximation to the log-likelihood
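To see the bound concretely, the toy check below compares the exact log-likelihood of a model with one observed variable and one binary latent variable against the Jensen lower bound for an arbitrary q; all numbers are made up for illustration:

import numpy as np

# p(x, z | theta) at a single observed x, for a binary latent z.
p_x_z = np.array([0.3, 0.1])             # p(x, z = 0), p(x, z = 1)

log_lik = np.log(p_x_z.sum())            # log p(x) = log sum_z p(x, z)

q = np.array([0.5, 0.5])                 # an arbitrary distribution over z
bound = np.sum(q * np.log(p_x_z / q))    # sum_z q(z) log [ p(x, z) / q(z) ]
print(log_lik, bound)                    # bound <= log_lik

q_exact = p_x_z / p_x_z.sum()            # the exact posterior p(z | x)
print(np.sum(q_exact * np.log(p_x_z / q_exact)))   # equals log_lik, so the bound is tight at q = p(z | x)

A mean-field q would additionally factor over the latent variables, which is what trades tightness of the bound for tractability.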


EM for Topic Models

p(w | α, η) = ∫ ∏_k p(β_k | η) ∫ ∑_z ∏_d p(θ_d | α) ∏_n p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β) dθ dβ

• To apply variational EM, we write

log p(w | α, η) = log ∫ ∫ ∑_z p(w, z, θ, β | α, η) dθ dβ

≥ ∑_z ∫ ∫ q(z, θ, β) log [ p(w, z, θ, β | α, η) / q(z, θ, β) ] dθ dβ

where we restrict the distribution q to be of the following form

q(z, θ, β) = ∏_k q(β_k | η) ∏_d q(θ_d | α) ∏_n q(z_{d,n})
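In practice these updates are rarely derived by hand; as one example of an off-the-shelf implementation, scikit-learn's LatentDirichletAllocation fits LDA with (online) variational Bayes. A usage sketch on a tiny made-up corpus, where doc_topic_prior and topic_word_prior play the roles of α and η:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the team won the game",
        "the senate passed the vote",
        "the election vote was close",
        "the game went to overtime"]               # tiny made-up corpus

counts = CountVectorizer().fit_transform(docs)     # bag-of-words count matrix (documents x vocabulary)

lda = LatentDirichletAllocation(n_components=2,        # K topics
                                doc_topic_prior=0.5,   # alpha
                                topic_word_prior=0.1,  # eta
                                random_state=0)
doc_topics = lda.fit_transform(counts)             # variational estimate of each document's theta_d

print(doc_topics)                                  # per-document topic proportions
print(lda.components_)                             # topic-word weights; normalize each row to get beta_k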


Example of LDA


Extensions of LDA

• Author-Topic model

– a_d is the group of authors for the d-th document

– x_{d,n} is the author of the n-th word of the d-th document

– θ_a is the topic distribution for author a

– z_{d,n} is the topic for the n-th word of the d-th document

(Rosen-Zvi et al., "The Author-Topic Model for Authors and Documents")
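A minimal sketch of the extra per-word sampling step the Author-Topic model adds, with toy placeholder values for the author set and the author and topic distributions; in this sketch the author of each word is assumed to be picked uniformly from the document's author group:

import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 10                                       # topics and vocabulary size (toy values)
a_d = [0, 2]                                       # authors of document d (indices into the author list)
theta = rng.dirichlet(np.full(K, 0.5), size=3)     # theta[a]: topic distribution of author a (toy draws)
beta = rng.dirichlet(np.full(V, 0.1), size=K)      # beta[k]: word distribution of topic k (toy draws)

x_dn = rng.choice(a_d)                             # x_{d,n}: author responsible for this word (uniform over a_d, an assumption)
z_dn = rng.choice(K, p=theta[x_dn])                # z_{d,n}: topic drawn from that author's distribution
w_dn = rng.choice(V, p=beta[z_dn])                 # w_{d,n}: word drawn from the chosen topic
print(x_dn, z_dn, w_dn)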


Extensions of LDA

• Label Y_d for each document represents a value to be predicted from the document

– E.g., number of stars for each document in a corpus of movie reviews
