Outline
2
Topic Models: An Example of Latent Dirichlet Allocation
Learning : Expectation Maximization Algorithms
Learning: Gibbs sampling
Problem set-up
Given: a collection of documents Want:
Detect key ‘topics’ discussed in the collection For each document detect which ‘topics’ are
discussed there Requirements:
No supervision (documents are not labeled) Probabilistic methods
Motivation Visualization of collections:
what are the topics discussed? which documents discuss a topic?
Opinion mining: what is sentiment towards a product aspects? what are important aspects of a product?
Dimensionality reduction: for information retrieval for document classification
Summarization: ensuring topic coverage in a summary
. . .
Latent Semantic Analysis [Deerwester et al., 1990]
Decomposition of the coccurence matrix Approximate the co-occurence matrix
X = U§ V0
X ¼X̂ = Uk§ V0k
Hope: terms having common meaning are mapped to the
same direction
- Hope: documents discussing similar topics have similar
representation- Non-zero inner products
between documents with non-overlapping terms
Latent Semantic Analysis [Deerwester et al., 1990]
Optimal rank k approximation (Frobenius norm):
X ¼X̂ = Uk§ V0k
X̂ = argminX 0:r ank(X 0)=k kX ¡ X 0k
Latent Semantic Analysis [Deerwester et al., 1990]
Not motivated probabilistically (no clean underlying probability model)
No obvious interpretation of directions
X ¼X̂ = Uk§ V0k
Probabilistic LSA [Hofmann, 99]
Parameters: Distributions of topics in document P(z | d), for every d Distribution of words for every topic, P(w | z), z 2 {1, …K}
Generative story: For each document d
For each word occurrence i in document d Select topic zi for the word from P(zi | d) Generate word wi from P(wi | zi)
Note: Words in the same documents can be generated from different topics
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Generative story:Given parameters generate text
Generative story:Given parameters generate text
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Probabilistic LSA : Example
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
Note: Words in the same documents can be generated
from different topics Does not take into account order, the
following to texts are guaranteed to have the same probability under the model
delays due to the volcanic ash cloud will affect Formula1 teams’ preparations
to ash due Formula1 affect volcanic the will teams’ cloud preparations delays
PLSA: Example
We considered:
Document 1
Document 2
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
?
?
In fact we are solving reverse problem:
Document 1
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
Document 2
… Obama will not attend the ceremony due to delays caused by eruption …
delaysvolcanicvolcanoashcloud…
0.0030.0020.0010.0010.001…
Eruption P(w | z = 1)
footballteamsballpreparationsFormula1…
0.0040.0020.0020.0010.001…
Sport P(w | z = 2)
0.0050.0010.0010.00010.0001…
Politics P(w | z = 3)
ObamaMerkelceremonyattendparty…
?
?
? ? ?
We are going to talk about learning in a moment
We are going to talk about learning in a moment
Directed Graphical Models
… ……
Doc 1 Doc M
N1NM
- Roughly, arrows denote conditional dependencies- A generative story corresponds to a topological order on the graph
- Roughly, arrows denote conditional dependencies- A generative story corresponds to a topological order on the graph
z
w
If they are variables, maybe we can generate them from some variables?
NM
z
wµ '
K
®
¯
Prior distribution for topic distributions in documents
Prior distribution for word distributions
This is essentially the Latent Dirichlet Allocation (LDA) model (Blei et al, 2001): Hierarchical Bayesian generalization of PLSA
This is essentially the Latent Dirichlet Allocation (LDA) model (Blei et al, 2001): Hierarchical Bayesian generalization of PLSA
Latent Dirichlet Allocation Parameters: two scalars: and Generative story:
For draw word distrib-s For each document d
Draw topic distribs For each word occurrence i in document d
Select topic zi for the word from Generate word wi fromz 2 f1;: : : ;K g ' z » Dir(¯)
® ¯
µd » Dir(®)
µd' zi
Dirichlet distribution: think of it as a “distribution of over
distributions”
We will talk about semantics of these parameters later
Dirichlet distribution Defines a distribution over p1,..,pK-1 such as:
- can be regarded as a distribution “Distribution over distributions” (K-1 dimensional simplex)
Form of the Dirichlet distribution (with symmetric prior)
pi > 0P K ¡ 1
i=1 pi < 1
(p1;p2; : : : ;pK ¡ 1;1¡P K ¡ 1
i=1 xi )
f (p1; : : : ;pK ¡ 1j®) = 1B (®)
Q Ki=1 p®¡ 1
i
Normalization (Beta function)
Meaning of hyperparameter
Consider case K= 2 (i.e., Beta distribution), pdf:
® = 0.5 : prefers “biased” distributions
® = 2 : prefers smooth distributions
Often, we want sparse distributions: though the collection may represent
1,000 topics, each document discusses 2-3 topics. In this
case we use small ®
Can be regarded as “pseudo counts”: ®=2
means that you observed 1 example of
each class
Summary so far We defined a generative model of document
collections: Now, first we will consider how to do MAP
estimation, i.e. Then we will consider Bayesian methods
P (W;Z; ' ;µj®;¯ )
( '̂ ; µ̂) = arg max' ;µP (' ;µj®;¯ ;W )
Expectation maximization algorithm
Expectation Maximization
41
EM is a class of algorithms that is used to estimate parameters in the presence of missing attributes.
Non-convex optimization, therefore:
converges to a local maximum of the maximum likelihood function (or posterior distribution if we incorporate the prior distribution).
can be very sensitive to the starting point
In LDA we do not observe from each topic z a word
is generated
Three coin example
42
We observe a series of coin tosses generated in the following way:
A person has three coins.
Coin 0: probability of Head is ®
Coin 1: probability of Head p
Coin 2: probability of Head q
Consider the following coin-tossing scenario
Three coin example
43
Scenario:
Toss coin 0 (do not show it to anyone!).
If Head – toss coin 1 M times;
Else -- toss coin 2 M times.
Only the series of tosses are observed
HHHT, HTHT, HHHT, HTTH
What are the parameters of the coins ? (®, p, q)
Three coin example
44
Scenario:
Toss coin 0 (do not show it to anyone!).
If Head – toss coin 1 M times;
Else -- toss coin 2 M times.
Only the series of tosses are observed
HHHT, HTHT, HHHT, HTTH
What are the parameters of the coins ? (®, p, q)
There is no closed form solution to the problem
Key Intuition
45
If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin1 and which from Coin2, that would be trivial.
Key Intuition
46
If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin1 and which from Coin2, there was no problem.
Instead, use an iterative approach for estimating the parameters:
Guess the probability that a data point came from Coin 1/2
Generate fictional labels, weighted according to this probability.
Re-estimate the initial parameter setting: set them to maximize the likelihood of these augmented data.
This process can be iterated and can be shown to converge to a local maximum of the likelihood function
EM-algorithm: coins (E-step)
47
We will assume (for a minute) that we know the parameters and use it to estimate which Coin it is
Then, we will use the estimation for the tossed Coin, to estimate the most likely parameters and so on...
What is the probability that the ith data point came from Coin1 ?
EM-algorithm: coins (M-step)
48
At this point we would like to compute the likelihood of the data, and find the parameters which maximize it
We will maximize the likelihood of the data (n data points)
But one of the variables: the coin name is hidden:
Instead on each step we maximize expectation of the likelihood over the coin name:
EM-algorithm: coins
50
Now fine the most likely parameters by setting derivatives to zero:
Note that are fixed (computed from the previous estimate of
parameters)
EM-algorithm: coins
51
Now fine the most likely parameters by setting derivatives to zero:
Note that are fixed (computed from the previous estimate of
parameters)
EM for PLSA / LDA
55
NM
z
wµ '
K
®
¯
+®¡ 1+¯ ¡ 1+®
Can be negative… Ignore it for now. Can be
regarded as smoothing
Summary so far We defined a generative model of document
collections: We considered how to do MAP estimation, i.e.
Now we could visualize collections Estimate topic distributions in each document
P (W;Z; ' ;µj®;¯ )
( '̂ ; µ̂) = arg max' ;µP (' ;µj®;¯ ;W )
(P (wjz) = ' z)
(P (zjd) = µd)
Example: topics of a document
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
P (zjd) = µd
Outline
59
Topic Models: An Example of Latent Dirichlet Allocation
Learning : Expectation Maximization Algorithms
Learning: Gibbs sampling
Gibbs Sampling for PLSA / LDA
60
NM
z
wµ '
K
®
¯
Choosing word for topic j for position
i
Choosing topic j for document di
Gibbs Sampling for PLSA / LDA
We do not directly obtain and
Instead we observe which words are assigned to which topics:
We can estimate the parameters from the sample (and take into account pseudo counts priors)
P (zjd) = µd
P (wjz) = ' z(w)
… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …
For details on Gibbs sampling refer to Finding Scientific Topics, Griffiths and
Steyvers, PNAS 2004
Summary
62
We considered the most standard “topic model”: PLSA / LDA
Reviewed two basic inference techniques:
EM
Collapsed Gibbs sampling
These methods are applicable to the majority of models we will consider in class
Formalities
63
Doodle poll: we will stay with Friday, 2pm Paper selection:
Deadline extended Not sure what we will do with the next class
April 30 (watch for announcements) If we do not get all the paper selected, review
selections may be affected These methods are applicable to the majority of
models we will consider in class
References
64
PLSA: Hofmann, Probabilistic Latent Semantic Indexing, SIGIR 99
LDA (they use variational inference though): Blei, Ng, Jordan, and Lafferty, Latent Dirichlet allocation, Journal of Machine Learning Research, 2003
Collapsed Gibbs sampling for LDA: Griffiths and Steyvers, Finding Scientific Topics, PNAS 2004
EM (original paper): Dempster, Laird and Rubin. Maximum Likelihood from Incomplete Data via the EM algorithm. Journal of the Royal Statistical Society B, 1977
EM, a note by Michael Collins: people.csail.mit.edu/mcollins/papers/wpeII.4.ps
Video Tutorial on EM by Chris Bishop: http://videolectures.net/mlss04_bishop_gmvm/ (see part 4)
Top Related