Unsupervised and Weakly-Supervised Probabilistic Modeling of Text Ivan Titov April 23 2010 TexPoint...

64
Unsupervised and Weakly- Supervised Probabilistic Modeling of Text Ivan Titov April 23 2010

Transcript of Unsupervised and Weakly-Supervised Probabilistic Modeling of Text Ivan Titov April 23 2010 TexPoint...

Unsupervised and Weakly-Supervised Probabilistic Modeling of Text

Ivan Titov

April 23 2010

Outline

2

Topic Models: An Example of Latent Dirichlet Allocation

Learning : Expectation Maximization Algorithms

Learning: Gibbs sampling

Problem set-up

Given: a collection of documents Want:

Detect key ‘topics’ discussed in the collection For each document detect which ‘topics’ are

discussed there Requirements:

No supervision (documents are not labeled) Probabilistic methods

Motivation Visualization of collections:

what are the topics discussed? which documents discuss a topic?

Opinion mining: what is sentiment towards a product aspects? what are important aspects of a product?

Dimensionality reduction: for information retrieval for document classification

Summarization: ensuring topic coverage in a summary

. . .

Latent Semantic Analysis [Deerwester et al., 1990]

Decomposition of the coccurence matrix Approximate the co-occurence matrix

X = U§ V0

X ¼X̂ = Uk§ V0k

Hope: terms having common meaning are mapped to the

same direction

- Hope: documents discussing similar topics have similar

representation- Non-zero inner products

between documents with non-overlapping terms

Latent Semantic Analysis [Deerwester et al., 1990]

Optimal rank k approximation (Frobenius norm):

X ¼X̂ = Uk§ V0k

X̂ = argminX 0:r ank(X 0)=k kX ¡ X 0k

Latent Semantic Analysis [Deerwester et al., 1990]

Not motivated probabilistically (no clean underlying probability model)

No obvious interpretation of directions

X ¼X̂ = Uk§ V0k

Probabilistic LSA [Hofmann, 99]

Parameters: Distributions of topics in document P(z | d), for every d Distribution of words for every topic, P(w | z), z 2 {1, …K}

Generative story: For each document d

For each word occurrence i in document d Select topic zi for the word from P(zi | d) Generate word wi from P(wi | zi)

Note: Words in the same documents can be generated from different topics

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Generative story:Given parameters generate text

Generative story:Given parameters generate text

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Probabilistic LSA : Example

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

Note: Words in the same documents can be generated

from different topics Does not take into account order, the

following to texts are guaranteed to have the same probability under the model

delays due to the volcanic ash cloud will affect Formula1 teams’ preparations

to ash due Formula1 affect volcanic the will teams’ cloud preparations delays

PLSA: Example

We considered:

Document 1

Document 2

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

?

?

In fact we are solving reverse problem:

Document 1

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Document 2

… Obama will not attend the ceremony due to delays caused by eruption …

delaysvolcanicvolcanoashcloud…

0.0030.0020.0010.0010.001…

Eruption P(w | z = 1)

footballteamsballpreparationsFormula1…

0.0040.0020.0020.0010.001…

Sport P(w | z = 2)

0.0050.0010.0010.00010.0001…

Politics P(w | z = 3)

ObamaMerkelceremonyattendparty…

?

?

? ? ?

We are going to talk about learning in a moment

We are going to talk about learning in a moment

Directed Graphical Models

… ……

Doc 1 Doc M

N1NM

- Roughly, arrows denote conditional dependencies- A generative story corresponds to a topological order on the graph

- Roughly, arrows denote conditional dependencies- A generative story corresponds to a topological order on the graph

z

w

Plate Notation

… ……

Doc 1 Doc M

N1NM

z

w

NM

z

w

Plate Notation

NM

z

w

Generated from P(z | d)

Generate from P(w | z)

Distributions are also variables…

NM

z

wµ '

K

(' k = P (wjzk))

(µd = P (zjd))

If they are variables, maybe we can generate them from some variables?

NM

z

wµ '

K

®

¯

Prior distribution for topic distributions in documents

Prior distribution for word distributions

This is essentially the Latent Dirichlet Allocation (LDA) model (Blei et al, 2001): Hierarchical Bayesian generalization of PLSA

This is essentially the Latent Dirichlet Allocation (LDA) model (Blei et al, 2001): Hierarchical Bayesian generalization of PLSA

Latent Dirichlet Allocation Parameters: two scalars: and Generative story:

For draw word distrib-s For each document d

Draw topic distribs For each word occurrence i in document d

Select topic zi for the word from Generate word wi fromz 2 f1;: : : ;K g ' z » Dir(¯)

® ¯

µd » Dir(®)

µd' zi

Dirichlet distribution: think of it as a “distribution of over

distributions”

We will talk about semantics of these parameters later

Dirichlet distribution Defines a distribution over p1,..,pK-1 such as:

- can be regarded as a distribution “Distribution over distributions” (K-1 dimensional simplex)

Form of the Dirichlet distribution (with symmetric prior)

pi > 0P K ¡ 1

i=1 pi < 1

(p1;p2; : : : ;pK ¡ 1;1¡P K ¡ 1

i=1 xi )

f (p1; : : : ;pK ¡ 1j®) = 1B (®)

Q Ki=1 p®¡ 1

i

Normalization (Beta function)

Meaning of hyperparameter

Consider case K= 2 (i.e., Beta distribution), pdf:

® = 0.5 : prefers “biased” distributions

® = 2 : prefers smooth distributions

Often, we want sparse distributions: though the collection may represent

1,000 topics, each document discusses 2-3 topics. In this

case we use small ®

Can be regarded as “pseudo counts”: ®=2

means that you observed 1 example of

each class

Summary so far We defined a generative model of document

collections: Now, first we will consider how to do MAP

estimation, i.e. Then we will consider Bayesian methods

P (W;Z; ' ;µj®;¯ )

( '̂ ; µ̂) = arg max' ;µP (' ;µj®;¯ ;W )

Expectation maximization algorithm

Expectation Maximization

41

EM is a class of algorithms that is used to estimate parameters in the presence of missing attributes.

Non-convex optimization, therefore:

converges to a local maximum of the maximum likelihood function (or posterior distribution if we incorporate the prior distribution).

can be very sensitive to the starting point

In LDA we do not observe from each topic z a word

is generated

Three coin example

42

We observe a series of coin tosses generated in the following way:

A person has three coins.

Coin 0: probability of Head is ®

Coin 1: probability of Head p

Coin 2: probability of Head q

Consider the following coin-tossing scenario

Three coin example

43

Scenario:

Toss coin 0 (do not show it to anyone!).

If Head – toss coin 1 M times;

Else -- toss coin 2 M times.

Only the series of tosses are observed

HHHT, HTHT, HHHT, HTTH

What are the parameters of the coins ? (®, p, q)

Three coin example

44

Scenario:

Toss coin 0 (do not show it to anyone!).

If Head – toss coin 1 M times;

Else -- toss coin 2 M times.

Only the series of tosses are observed

HHHT, HTHT, HHHT, HTTH

What are the parameters of the coins ? (®, p, q)

There is no closed form solution to the problem

Key Intuition

45

If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin1 and which from Coin2, that would be trivial.

Key Intuition

46

If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin1 and which from Coin2, there was no problem.

Instead, use an iterative approach for estimating the parameters:

Guess the probability that a data point came from Coin 1/2

Generate fictional labels, weighted according to this probability.

Re-estimate the initial parameter setting: set them to maximize the likelihood of these augmented data.

This process can be iterated and can be shown to converge to a local maximum of the likelihood function

EM-algorithm: coins (E-step)

47

We will assume (for a minute) that we know the parameters and use it to estimate which Coin it is

Then, we will use the estimation for the tossed Coin, to estimate the most likely parameters and so on...

What is the probability that the ith data point came from Coin1 ?

EM-algorithm: coins (M-step)

48

At this point we would like to compute the likelihood of the data, and find the parameters which maximize it

We will maximize the likelihood of the data (n data points)

But one of the variables: the coin name is hidden:

Instead on each step we maximize expectation of the likelihood over the coin name:

EM-algorithm: coins (M-step)

49

Continue:

EM-algorithm: coins

50

Now fine the most likely parameters by setting derivatives to zero:

Note that are fixed (computed from the previous estimate of

parameters)

EM-algorithm: coins

51

Now fine the most likely parameters by setting derivatives to zero:

Note that are fixed (computed from the previous estimate of

parameters)

Summary: EM

52

Summary: EM

53

EM for PLSA / LDA

54

NM

z

wµ '

K

EM for PLSA / LDA

55

NM

z

wµ '

K

®

¯

+®¡ 1+¯ ¡ 1+®

Can be negative… Ignore it for now. Can be

regarded as smoothing

Summary so far We defined a generative model of document

collections: We considered how to do MAP estimation, i.e.

Now we could visualize collections Estimate topic distributions in each document

P (W;Z; ' ;µj®;¯ )

( '̂ ; µ̂) = arg max' ;µP (' ;µj®;¯ ;W )

(P (wjz) = ' z)

(P (zjd) = µd)

Example (Science collection) Top 10 words for 10 topics out of 128

(ordered byP (wjz) = ' z(w)

Example: topics of a document

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

P (zjd) = µd

Outline

59

Topic Models: An Example of Latent Dirichlet Allocation

Learning : Expectation Maximization Algorithms

Learning: Gibbs sampling

Gibbs Sampling for PLSA / LDA

60

NM

z

wµ '

K

®

¯

Choosing word for topic j for position

i

Choosing topic j for document di

Gibbs Sampling for PLSA / LDA

We do not directly obtain and

Instead we observe which words are assigned to which topics:

We can estimate the parameters from the sample (and take into account pseudo counts priors)

P (zjd) = µd

P (wjz) = ' z(w)

… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

For details on Gibbs sampling refer to Finding Scientific Topics, Griffiths and

Steyvers, PNAS 2004

Summary

62

We considered the most standard “topic model”: PLSA / LDA

Reviewed two basic inference techniques:

EM

Collapsed Gibbs sampling

These methods are applicable to the majority of models we will consider in class

Formalities

63

Doodle poll: we will stay with Friday, 2pm Paper selection:

Deadline extended Not sure what we will do with the next class

April 30 (watch for announcements) If we do not get all the paper selected, review

selections may be affected These methods are applicable to the majority of

models we will consider in class

References

64

PLSA: Hofmann, Probabilistic Latent Semantic Indexing, SIGIR 99

LDA (they use variational inference though): Blei, Ng, Jordan, and Lafferty, Latent Dirichlet allocation, Journal of Machine Learning Research, 2003

Collapsed Gibbs sampling for LDA: Griffiths and Steyvers, Finding Scientific Topics, PNAS 2004

EM (original paper): Dempster, Laird and Rubin. Maximum Likelihood from Incomplete Data via the EM algorithm. Journal of the Royal Statistical Society B, 1977

EM, a note by Michael Collins: people.csail.mit.edu/mcollins/papers/wpeII.4.ps

Video Tutorial on EM by Chris Bishop: http://videolectures.net/mlss04_bishop_gmvm/ (see part 4)