Topic Models
Outline
• Review of directed models
  – Independencies, d-separation, and "explaining away"
  – Learning for Bayes nets
• Directed models for text
  – Naïve Bayes models
  – Latent Dirichlet allocation (LDA)
REVIEW OF DIRECTED MODELS (AKA BAYES NETS)
Directed Model = Graph + Conditional Probability Distributions
[Figure: an example graph and some of the pairwise conditional independencies it encodes]
Plate notation lets us denote complex graphs
Directed Models > HMMs
[Figure: an HMM over hidden states s, t, u, v, with transition probabilities on the arcs, a per-state emission distribution over {a, c}, and an observed sequence a a c a aligned with hidden states S1–S4]
Initial state distribution:
  S   P(S)
  s   1.0
  t   0.0
  u   0.0
  v   0.0

Transition probabilities:
  S   S'  P(S'|S)
  s   s   0.1
  s   t   0.9
  s   u   0.0
  …
  t   s   0.5
  t   t   0.5
  …   …   …

Emission probabilities:
  S   X   P(X|S)
  s   a   0.9
  s   c   0.1
  t   a   0.6
  t   c   0.4
  …   …   …
Directed Models > HMMs
[Figure: the same HMM and observed sequence a a c a as on the previous slide]
Important point:
• I can compute Pr(S2=t | aaca)
• So inference does not always "follow the arrows"
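To make that point concrete, here is a minimal brute-force sketch in Python that enumerates every hidden state sequence for the observation aaca and computes the posterior of S2. The slide's tables are only partially shown, so the transition and emission numbers below are assumed placeholders, not the slide's values.

# Brute-force inference in a small HMM by enumerating all hidden state sequences.
# The CPT values below are illustrative assumptions, not the slide's full tables.
from itertools import product

states = ["s", "t", "u", "v"]
init  = {"s": 1.0, "t": 0.0, "u": 0.0, "v": 0.0}                # P(S1)
trans = {"s": {"s": 0.1, "t": 0.9, "u": 0.0, "v": 0.0},         # P(S'|S), assumed
         "t": {"s": 0.5, "t": 0.5, "u": 0.0, "v": 0.0},
         "u": {"s": 0.0, "t": 0.0, "u": 0.2, "v": 0.8},
         "v": {"s": 0.0, "t": 0.0, "u": 0.5, "v": 0.5}}
emit  = {"s": {"a": 0.9, "c": 0.1}, "t": {"a": 0.6, "c": 0.4},  # P(X|S), assumed
         "u": {"a": 0.3, "c": 0.7}, "v": {"a": 0.5, "c": 0.5}}

obs = "aaca"
total = 0.0        # accumulates P(aaca)
s2_is_t = 0.0      # accumulates P(S2=t, aaca)
for seq in product(states, repeat=len(obs)):
    p = init[seq[0]] * emit[seq[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= trans[seq[i - 1]][seq[i]] * emit[seq[i]][obs[i]]
    total += p
    if seq[1] == "t":
        s2_is_t += p

print("Pr(S2=t | aaca) =", s2_is_t / total)

Enumerating all sequences is exponential in the sequence length; it is only meant to show that the query "flows against the arrows" and is still well defined.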
SOME MORE DETAILS ON DIRECTED MODELS
The example police say we're in violation:
• Insufficient use of the "Monty Hall" problem
• Discussing Bayes nets without discussing burglar alarms
Slide 10
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a car. Behind the others, goats.
• You pick one of three doors, say #1
• The host, Monty Hall, opens one door, revealing…a goat!
• You now can either stick with your guess or change doors
[Graph: A = first guess, B = the money (the door hiding the car), C = the revealed goat, D = stick or swap?, E = second guess; A and B point to C, and A, C, D point to E]
A      P(A)
1      0.33
2      0.33
3      0.33

B      P(B)
1      0.33
2      0.33
3      0.33

D      P(D)
Stick  0.5
Swap   0.5

A  B  C  P(C|A,B)
1  1  2  0.5
1  1  3  0.5
1  2  3  1.0
1  3  2  1.0
…  …  …  …
P(C=c | A=a, B=b) = 1.0 if a ≠ b and c ∉ {a, b}
                  = 0.5 if a = b and c ∉ {a, b}
                  = 0   otherwise
A few minutes later, the goat from behind door C drives away in the car.
Slide 12
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
P(C=c | A=a, B=b) = 1.0 if a ≠ b and c ∉ {a, b}
                  = 0.5 if a = b and c ∉ {a, b}
                  = 0   otherwise
A C D P(E|A,C,D)
… … … …
P(E=e | A=a, C=c, D=d) = 1.0 if d = stick and e = a
                       = 1.0 if d = swap and e ∉ {a, c}
                       = 0   otherwise
Slide 13
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
…again by the chain rule:
P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)
We could construct the joint and compute P(E=B|D=swap)
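As a concrete illustration of "construct the joint and query it", here is a small Python sketch that multiplies out the CPTs above over every assignment and conditions on D. The helper names (p_c, p_e, p_win_given) are just for this example, not anything from the slides.

# Build the Monty Hall joint from the CPTs by brute force and query it.
from itertools import product

doors = [1, 2, 3]

def p_c(c, a, b):                      # P(C=c | A=a, B=b): Monty opens a goat door
    if c in (a, b):
        return 0.0
    return 0.5 if a == b else 1.0

def p_e(e, a, c, d):                   # P(E=e | A=a, C=c, D=d)
    if d == "stick":
        return 1.0 if e == a else 0.0
    return 1.0 if e not in (a, c) else 0.0

def p_win_given(d_value):
    num = den = 0.0
    for a, b, c, e in product(doors, repeat=4):
        p = (1 / 3) * (1 / 3) * p_c(c, a, b) * 0.5 * p_e(e, a, c, d_value)
        den += p
        if e == b:                     # the second guess lands on the car
            num += p
    return num / den

print("P(E=B | D=swap)  =", p_win_given("swap"))    # ≈ 2/3
print("P(E=B | D=stick) =", p_win_given("stick"))   # ≈ 1/3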
Slide 14
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
The joint table has…?
3*3*3*2*3 = 162 rows
The conditional probability tables (CPTs) shown have…?
3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows
Big questions:
• Why are the CPTs smaller?
• How much smaller are the CPTs than the joint?
• Can we compute the answers to queries like P(E=B|d) without building the joint probability table, just using the CPTs?
Slide 15
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
Why is the CPTs representation smaller? Follow the money! (B)
P(E=e | A=a, C=c, D=d) = 1.0 if d = stick and e = a
                       = 1.0 if d = swap and e ∉ {a, c}
                       = 0   otherwise

For all a, b, c, d, e:
P(E=e | A=a, B=b, C=c, D=d) = P(E=e | A=a, C=c, D=d)

E is conditionally independent of B given A, C, D:
E ⊥ B | A, C, D, i.e. I<E, {A,C,D}, B>
Slide 16
Conditional Independence (again)
Definition: R and L are conditionally independent given M if, for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally: Let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)
Slide 17
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
What are the conditional independencies?
• I<A, {B}, C> ?
• I<A, {C}, B> ?
• I<E, {A,C}, B> ?
• I<D, {E}, B> ?
• …
Slide 18
What Independencies does a Bayes Net Model?
• In order for a Bayesian network to model a probability distribution, the following must be true by definition: each variable is conditionally independent of all its non-descendants in the graph, given the values of all its parents.
• This implies
  P(X1, …, Xn) = ∏i=1..n P(Xi | parents(Xi))
• But what else does it imply?
Slide 19
What Independencies does a Bayes Net Model?
• Example:
[Graph: a chain Z → Y → X]
Given Y, does learning the value of Z tell us nothing at all new about X?
I.e., is P(X|Y, Z) equal to P(X | Y)?
Yes. Since we know the value of all of X’s parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z.
Also, since independence is symmetric, P(Z|Y, X) = P(Z|Y).
Slide 20
What Independencies does a Bayes Net Model?
• Let I<X,Y,Z> represent X and Z being conditionally independent given Y.
• I<X,Y,Z>? Yes, just as in previous example: All X’s parents given, and Z is not a descendant.
[Graph: X ← Y → Z]
Slide 21
Things get a little more confusing
• X has no parents, so we know all its parents' values trivially
• Z is not a descendant of X
• So, I<X,{},Z>, even though there's an undirected path from X to Z through an unknown variable Y.
• What if we do know the value of Y, though? Or one of its descendants?
[Graph: X → Y ← Z]
Slide 22
The "Burglar Alarm" example
• Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
• Earth arguably doesn't care whether your house is currently being burgled.
• While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!
[Graph: Burglar → Alarm ← Earthquake; Alarm → Phone Call]
Slide 23
Things get a lot more confusing
• But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.
• Earthquake "explains away" the hypothetical burglar.
• But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!
[Graph: Burglar → Alarm ← Earthquake; Alarm → Phone Call]
Slide 24
“Explaining away”
[Graph: X → E ← Y, with P(X=1) = 0.5 and P(Y=1) = 0.5]
Is X ⊥ Y | {} ?  YES
Is X ⊥ Y | E ?   NO
This is "explaining away":
• E is a common symptom of two causes, X and Y
• After observing E=1, both X and Y become more probable
• After observing E=1 and X=1, Y becomes less probable, since X alone is enough to "explain" E
Slide 25
“Explaining away” and common-sense
Historical note:
• Classical logic is monotonic: the more you know, the more you can deduce.
• "Common-sense" reasoning is not monotonic
  • birds fly
  • but not after being cooked for 20 min/lb at 350°F
• This led to numerous "non-monotonic logics" for AI
• This example shows that Bayes nets are not monotonic either
  • If P(Y|E) is "your belief" in Y after observing E,
  • and P(Y|X,E) is "your belief" in Y after observing E and X,
  • your belief in Y decreases after you discover X
Slide 26
How can I make this less confusing?
• But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.
• Earthquake “explains away” the hypothetical burglar.
• But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!
[Graph: Burglar → Alarm ← Earthquake; Alarm → Phone Call]
Slide 27
d-separation to the rescue
• Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.
• Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is “blocked”, where a path is “blocked” iff one or more of the following conditions is true: ...
i.e., X and Z are dependent iff there exists an unblocked path
Slide 28
A path is “blocked” when...
• There exists a variable V on the path such that
  • it is in the evidence set E
  • the arcs putting V on the path are "tail-to-tail"
• Or, there exists a variable V on the path such that
  • it is in the evidence set E
  • the arcs putting V on the path are "tail-to-head"
• Or, ...
[Figures: V as a common cause, X ← V → Z, and V on a causal chain, X → V → Z]
unknown "common causes" of X and Z impose dependency
unknown "causal chains" connecting X and Z impose dependency
Slide 29
A path is “blocked” when… (the funky case)
• … Or, there exists a variable V on the path such that
  • it is NOT in the evidence set E
  • neither are any of its descendants
  • the arcs putting V on the path are "head-to-head"
[Figure: V as a common effect (collider), X → V ← Z]
Known “common symptoms” of X and Z impose dependencies… X may “explain away” Z
Slide 30
Summary: d-separation
[Figure: the three blocked-path configurations between X and Y, each showing where the evidence E sits on the path]
I<X, E, Y> means X ⊥ Y | E, i.e. P(X | E, Y) = P(X | E)
i.e., for all x, y, e: P(X=x | E=e, Y=y) = P(X=x | E=e)
There are three ways paths from X to Y given evidence E can be blocked.
X is d-separated from Y given E iff all paths from X to Y given E are blocked
If X is d-separated from Y given E, then I<X,E,Y>
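Here is a minimal Python sketch of the path-enumeration version of this test, assuming the network is given as a node → children dict. It is fine for toy networks like the ones on these slides; real implementations usually use the more efficient Bayes-ball/reachability algorithm instead.

# d-separation by enumerating undirected simple paths and applying the
# three blocking rules above.  graph: dict mapping each node to its children.
def descendants(graph, v):
    out, stack = set(), list(graph.get(v, []))
    while stack:
        c = stack.pop()
        if c not in out:
            out.add(c)
            stack.extend(graph.get(c, []))
    return out

def d_separated(graph, x, y, evidence):
    evidence = set(evidence)
    parents = {}
    for p, kids in graph.items():
        for k in kids:
            parents.setdefault(k, set()).add(p)

    def neighbors(v):
        return set(graph.get(v, [])) | parents.get(v, set())

    def blocked(path):
        # a path is blocked iff some intermediate variable blocks it
        for i in range(1, len(path) - 1):
            prev, v, nxt = path[i - 1], path[i], path[i + 1]
            arrows_in = (v in graph.get(prev, [])) + (v in graph.get(nxt, []))
            if arrows_in == 2:  # head-to-head (collider)
                if v not in evidence and not (descendants(graph, v) & evidence):
                    return True
            else:               # tail-to-tail or tail-to-head
                if v in evidence:
                    return True
        return False

    def paths(cur, seen):
        if cur == y:
            yield list(seen)
            return
        for n in neighbors(cur):
            if n not in seen:
                yield from paths(n, seen + [n])

    return all(blocked(p) for p in paths(x, [x]))

# The burglar-alarm network from the earlier slides:
g = {"Burglar": ["Alarm"], "Earthquake": ["Alarm"],
     "Alarm": ["PhoneCall"], "PhoneCall": []}
print(d_separated(g, "Burglar", "Earthquake", set()))          # True: I<B, {}, E>
print(d_separated(g, "Burglar", "Earthquake", {"PhoneCall"}))  # False: explaining away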
Slide 31
LEARNING FOR BAYES NETS
Slide 32
(Review) Breaking it down: Learning parameters for the "naïve" HMM
• Training data defines unique path through HMM!
  – Transition probabilities
    • Probability of transitioning from state i to state j =
      (number of transitions from i to j) / (total transitions from state i)
  – Emission probabilities
    • Probability of emitting symbol k from state i =
      (number of times k generated from i) / (number of transitions from i)
with smoothing, of course (see the count-and-normalize sketch below)
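A minimal sketch of these count-and-normalize estimates with add-one (Laplace) smoothing, assuming labeled sequences are given as lists of (state, symbol) pairs; the function and variable names are just for this example.

# Count transitions and emissions from labeled sequences, then normalize.
from collections import Counter

def estimate_hmm(sequences, states, symbols, alpha=1.0):
    trans, emit = Counter(), Counter()
    for seq in sequences:
        for (s, x) in seq:
            emit[(s, x)] += 1
        for (s, _), (s2, _) in zip(seq, seq[1:]):
            trans[(s, s2)] += 1
    # P(s'|s) = (count(s -> s') + alpha) / (count(s -> *) + alpha * |states|)
    P_trans = {(s, s2): (trans[(s, s2)] + alpha) /
               (sum(trans[(s, t)] for t in states) + alpha * len(states))
               for s in states for s2 in states}
    # P(x|s) = (count(s emits x) + alpha) / (count(s emits *) + alpha * |symbols|)
    P_emit = {(s, x): (emit[(s, x)] + alpha) /
              (sum(emit[(s, y)] for y in symbols) + alpha * len(symbols))
              for s in states for x in symbols}
    return P_trans, P_emit

# Example: two tiny labeled sequences over states {s, t} and symbols {a, c}
data = [[("s", "a"), ("t", "a"), ("t", "c")], [("s", "a"), ("s", "c")]]
P_trans, P_emit = estimate_hmm(data, ["s", "t"], ["a", "c"])
print(P_trans[("s", "t")], P_emit[("t", "c")])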
Slide 33
(Review) Breaking it down: NER using the "naïve" HMM
• Define the HMM structure:
  – one state per entity type
• Training data defines unique path through HMM for each labeled example
  – Use this to estimate transition and emission probabilities
• At test time for a sequence x
  – Use Viterbi to find the sequence of states s that maximizes Pr(x|s) (a sketch follows below)
  – Use s to derive labels for the sequence x
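A minimal Viterbi sketch to go with the review bullets above; it reuses the P_trans and P_emit dictionaries from the previous snippet and assumes a uniform initial state distribution P_init, which is not something the slide specifies.

# Viterbi: the most likely state sequence for an observation sequence x.
def viterbi(x, states, P_init, P_trans, P_emit):
    # delta[i][s] = best score of any state sequence ending in state s at position i
    delta = [{s: P_init[s] * P_emit[(s, x[0])] for s in states}]
    back = []
    for i in range(1, len(x)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: delta[i - 1][p] * P_trans[(p, s)])
            back[-1][s] = best_prev
            delta[-1][s] = (delta[i - 1][best_prev] *
                            P_trans[(best_prev, s)] * P_emit[(s, x[i])])
    # backtrace from the best final state
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for b in reversed(back):
        path.append(b[path[-1]])
    return list(reversed(path))

P_init = {"s": 0.5, "t": 0.5}   # assumed uniform start distribution
print(viterbi("aaca", ["s", "t"], P_init, P_trans, P_emit))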
Slide 34
Learning for Bayes nets ~ Learning for HMMs,
if everything is observed
• Input:
  • Sample of the joint: (x1^1, x2^1, …, xn^1), (x1^2, …, xn^2), …, (x1^m, …, xn^m)
  • Graph structure of the variables
    • for i = 1, …, n, you know Xi and parents(Xi)
• Output:
  • Estimated CPTs
[Graph over A, B, C, D, E, with its CPTs below]
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
B P(B)
1 0.33
2 0.33
3 0.33
…
Learning Method (discrete variables):
• Estimate each CPT independently
• Use an MLE or MAP estimate (see the sketch below)
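A minimal sketch of the fully observed case: count each variable's value jointly with its parents' values and normalize, with an optional pseudocount for a MAP-style estimate under a uniform Dirichlet prior. The data layout and helper names are assumptions for this example.

# Estimate one CPT from fully observed samples (list of dicts: variable -> value).
from collections import Counter

def estimate_cpt(data, child, parents, child_values, alpha=0.0):
    joint, ctx = Counter(), Counter()
    for row in data:
        key = tuple(row[p] for p in parents)
        joint[(key, row[child])] += 1
        ctx[key] += 1
    def prob(child_value, parent_values):
        num = joint[(tuple(parent_values), child_value)] + alpha
        den = ctx[tuple(parent_values)] + alpha * len(child_values)
        return num / den
    return prob

# Example with made-up samples of (A, B, C):
data = [{"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 1, "C": 2}]
p_c = estimate_cpt(data, "C", ["A", "B"], child_values=[1, 2, 3], alpha=1.0)
print(p_c(3, [1, 2]))   # estimate of P(C=3 | A=1, B=2)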
Slide 35
Learning for Bayes nets ~ Learning for HMMs,
if some things are not observed
• Input:
  • Sample of the joint: (x1^1, x2^1, …, xn^1), (x1^2, …, xn^2), …, (x1^m, …, xn^m)
  • Graph structure of the variables
    • for i = 1, …, n, you know Xi and parents(Xi)
• Output:
  • Estimated CPTs
[Graph over A, B, C, D, E, with its CPTs below]
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
B P(B)
1 0.33
2 0.33
3 0.33
…
Learning Method (discrete variables):
• Use inference* to estimate distribution of the unobserved values
• Use EM
* The HMM methods generalize to trees. I’ll talk about Gibbs sampling soon
LDA AND OTHER DIRECTED MODELS FOR MODELING TEXT
Supervised Multinomial Naïve Bayes
• Naïve Bayes Model: Compact representation
[Figure: class C with children W1 W2 W3 … WN, and the equivalent plate notation C → W inside a plate of size N, inside a plate of size M, with word-distribution parameters β]
Supervised Multinomial Naïve Bayes
• Naïve Bayes Model: Compact representation
[Figure: the same model in plate notation, now with the K class-specific word distributions β1 … βK drawn as a plate of size K]
Review – supervised Naïve Bayes
• Multinomial Naïve Bayes
[Plate diagram: C → W1 W2 W3 … WN, repeated for M documents]
• For each class k = 1, …, K
  • Construct a multinomial βk
• For each document d = 1, …, M
  • Generate Cd ~ Mult(· | π)
  • For each position n = 1, …, Nd
    • Generate wn ~ Mult(· | βCd) … or if you prefer, wn ~ Pr(w | Cd) (sketch below)
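A minimal numpy sketch of this generative story. The corpus sizes, vocabulary, and the Dirichlet draws used to create the class prior and the per-class word distributions are made up for illustration.

# Sample labeled documents from the multinomial Naive Bayes generative process.
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 10, 5, 8                     # classes, vocab size, docs, words/doc
pi = rng.dirichlet(np.ones(K))               # class prior (assumed)
beta = rng.dirichlet(np.ones(V), size=K)     # one multinomial over words per class

docs, labels = [], []
for d in range(M):
    c = rng.choice(K, p=pi)                  # C_d ~ Mult(. | pi)
    words = rng.choice(V, size=N, p=beta[c]) # w_n ~ Mult(. | beta_{C_d})
    labels.append(c)
    docs.append(words)

print(labels[0], docs[0])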
Review – unsupervised Naïve Bayes
• Mixture model: EM solution
  – E-step: estimate Pr(class | document) under the current parameters
  – M-step: re-estimate the class prior and the per-class word distributions from those posteriors
Key capability: estimate distribution of latent variables given observed variables
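A minimal EM sketch for this mixture-of-multinomials model, assuming the corpus is given as an M×V word-count matrix X; the number of classes K, the smoothing constant, and the random initialization are assumptions for this example.

# EM for a mixture of multinomials (unsupervised multinomial Naive Bayes).
import numpy as np

def em_multinomial_mixture(X, K, iters=50, smooth=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    M, V = X.shape
    pi = np.full(K, 1.0 / K)
    beta = rng.dirichlet(np.ones(V), size=K)            # K x V word distributions
    for _ in range(iters):
        # E-step: log Pr(c) + sum_w count(d,w) * log Pr(w|c), then normalize per doc
        log_post = np.log(pi) + X @ np.log(beta).T       # M x K
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)          # gamma[d,k] = Pr(c=k | doc d)
        # M-step: re-estimate pi and beta from the soft counts
        pi = post.sum(axis=0) / M
        beta = post.T @ X + smooth                       # K x V expected word counts
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, post

# Example with a tiny random count matrix
X = np.random.default_rng(1).integers(0, 5, size=(20, 12))
pi, beta, post = em_multinomial_mixture(X, K=3)
print(pi, post[0])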
Review – unsupervised Naïve Bayes
• Mixture model: unsupervised naïve Bayes model
C
W
NM
b
• Joint probability of words and classes:
• But classes are not visible:Z
Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (PLSI)
• Every document is a mixture of topics
[Plate diagram: per-document topic mixture, latent topic Z → word W, N words per document, M documents, topic-word distributions β]
• For i = 1…K:
  • Let βi be a multinomial over words
• For each document d:
  • Let θd be a distribution over {1, …, K}
  • For each word position in d:
    • Pick a topic z from θd
    • Pick a word w from βz
• Turns out to be hard to fit:
  • Lots of parameters!
  • Also: only applies to the training data
The LDA Topic Model
LDA
• Motivation
[Plate diagram: per-document parameter θd → word w, inside plates of size N (words) and M (documents)]
Assumptions: 1) documents are i.i.d 2) within a document, words are i.i.d. (bag of words)
• For each document d = 1, …, M
  • Generate θd ~ D1(…)
  • For each word n = 1, …, Nd
    • generate wn ~ D2(· | θdn)
Now pick your favorite distributions for D1, D2
• Latent Dirichlet Allocation
[Plate diagram: α → θd → zn → wn, with topic-word multinomials in a plate of size K; words nested in plate N, documents in plate M]
• For each document d = 1, …, M
  • Generate θd ~ Dir(· | α)
  • For each position n = 1, …, Nd
    • generate zn ~ Mult(· | θd)
    • generate wn ~ Mult(· | βzn)
"Mixed membership"
Pr(zj = k | z1, z2, …, zj−1, …) = (nk + αk) / Σj' (nj' + αj')
where nk is the number of earlier z's equal to k
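A minimal numpy sketch of the LDA generative process above; the number of topics, vocabulary size, document lengths, and hyperparameter values below are illustrative assumptions.

# Sample a toy corpus from the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 4, 25, 10                                   # topics, vocab size, documents
alpha, eta = 0.5, 0.1
beta = rng.dirichlet(np.full(V, eta), size=K)         # topic-word multinomials

docs = []
for d in range(M):
    theta = rng.dirichlet(np.full(K, alpha))          # theta_d ~ Dir(alpha)
    n_d = rng.poisson(20) + 1                         # document length (assumed)
    z = rng.choice(K, size=n_d, p=theta)              # z_n ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ Mult(beta_{z_n})
    docs.append(w)

print(docs[0])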
• LDA’s view of a document
• LDA topics
Review - LDA
• Latent Dirichlet Allocation – Parameter learning:
  • Variational EM
    – Numerical approximation using lower-bounds
    – Results in biased solutions
    – Convergence has numerical guarantees
  • Gibbs Sampling
    – Stochastic simulation
    – Unbiased solutions
    – Stochastic convergence
Review - LDA
• Gibbs sampling – works for any directed model!
  – Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  – Sequence of samples comprises a Markov Chain
  – Stationary distribution of the chain is the joint distribution
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
Why does Gibbs sampling work?
• What's the fixed point?
  – Stationary distribution of the chain is the joint distribution
• When will it converge (in the limit)?
  – If the graph defined by the chain is connected
• How long will it take to converge?
  – Depends on the second eigenvalue of that graph's transition matrix
Called “collapsed Gibbs sampling” since you’ve marginalized away some variables
From: "Parameter estimation for text analysis", Gregor Heinrich
LDA
• Latent Dirichlet Allocation
[Plate diagram: α → θd → zn → wn, with topic-word multinomials φ1 … φK and prior β; words in plate N, documents in plate M ("mixed membership" model)]
• Randomly initialize each zm,n
• Repeat for t = 1, …
  • For each doc m, word n
    • Find Pr(zmn = k | other z's)
    • Sample zmn according to that distribution
EVEN MORE DETAIL ON LDA…
Way way more detail
More detail
What gets learned…..
In A Math-ier Notation
[Figure: the count matrices used below — N[d,k], the number of words in document d assigned to topic k; N[*,k], the total number of words assigned to topic k; and M[w,k], the number of times word w is assigned to topic k, over a vocabulary of size V]
Initialization: for each document d and word position j in d
  • z[d,j] = k, a random topic
  • N[d,k]++
  • W[w,k]++ where w = id of the j-th word in d
Then, for each pass t = 1, 2, …
  for each document d and word position j in d
    • z[d,j] = k, a new random topic
    • update N, W to reflect the new assignment of z:
      • N[d,k]++; N[d,k'] -- where k' is the old z[d,j]
      • W[w,k]++; W[w,k'] -- where w is w[d,j]
(a runnable sketch follows below)
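A minimal collapsed Gibbs sketch that follows these count updates. The docs format, number of topics, and the Dirichlet hyperparameters alpha and eta are assumptions for this example; the conditional uses the standard collapsed-LDA form, proportional to (N[d,k]+alpha)·(W[w,k]+eta)/(ΣW[·,k]+V·eta).

# Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids.
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, eta=0.01, passes=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    N = np.zeros((D, K))          # N[d,k]: words in doc d assigned to topic k
    W = np.zeros((V, K))          # W[w,k]: times word w is assigned to topic k
    Wk = np.zeros(K)              # column sums of W
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random init
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            N[d, z[d][j]] += 1; W[w, z[d][j]] += 1; Wk[z[d][j]] += 1
    for _ in range(passes):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]   # remove the current assignment from the counts
                N[d, k_old] -= 1; W[w, k_old] -= 1; Wk[k_old] -= 1
                # Pr(z[d,j]=k | other z's) ∝ (N[d,k]+alpha)*(W[w,k]+eta)/(Wk[k]+V*eta)
                p = (N[d] + alpha) * (W[w] + eta) / (Wk + V * eta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][j] = k_new   # add the new assignment back into the counts
                N[d, k_new] += 1; W[w, k_new] += 1; Wk[k_new] += 1
    return z, N, W

# Tiny example: 4 "documents" over a 6-word vocabulary
docs = [[0, 1, 0, 2], [1, 0, 1], [3, 4, 5, 4], [5, 3, 5]]
z, N, W = lda_gibbs(docs, V=6, K=2)
print(N)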
Some comments on LDA
• Very widely used model
• Also a component of many other models