Topic Models
Outline
• Review of directed models
  – Independencies, d-separation, and "explaining away"
  – Learning for Bayes nets
• Directed models for text
  – Naïve Bayes models
  – Latent Dirichlet allocation (LDA)
REVIEW OF DIRECTED MODELS (AKA BAYES NETS)
Directed Model = Graph + Conditional Probability Distributions
[Figure: an example graph and some of the pairwise conditional independencies it encodes]
Plate notation lets us denote complex graphs
Directed Models > HMMs
[Figure: an HMM over hidden states s, t, u, v, with transition probabilities on the arcs, a per-state emission distribution over {a, c}, and an observed sequence a a c a aligned with hidden states S1–S4]
Initial state distribution:
  S   P(S)
  s   1.0
  t   0.0
  u   0.0
  v   0.0

Transition probabilities:
  S   S'  P(S'|S)
  s   s   0.1
  s   t   0.9
  s   u   0.0
  …
  t   s   0.5
  t   t   0.5
  …   …   …

Emission probabilities:
  S   X   P(X|S)
  s   a   0.9
  s   c   0.1
  t   a   0.6
  t   c   0.4
  …   …   …
Directed Models > HMMs
[Figure: the same HMM and observed sequence a a c a as on the previous slide]
Important point:
• I can compute Pr(S2=t | aaca)
• So inference does not always "follow the arrows"
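To make that point concrete, here is a minimal brute-force sketch in Python that enumerates every hidden state sequence for the observation aaca and computes the posterior of S2. The slide's tables are only partially shown, so the transition and emission numbers below are assumed placeholders, not the slide's values.

# Brute-force inference in a small HMM by enumerating all hidden state sequences.
# The CPT values below are illustrative assumptions, not the slide's full tables.
from itertools import product

states = ["s", "t", "u", "v"]
init  = {"s": 1.0, "t": 0.0, "u": 0.0, "v": 0.0}                # P(S1)
trans = {"s": {"s": 0.1, "t": 0.9, "u": 0.0, "v": 0.0},         # P(S'|S), assumed
         "t": {"s": 0.5, "t": 0.5, "u": 0.0, "v": 0.0},
         "u": {"s": 0.0, "t": 0.0, "u": 0.2, "v": 0.8},
         "v": {"s": 0.0, "t": 0.0, "u": 0.5, "v": 0.5}}
emit  = {"s": {"a": 0.9, "c": 0.1}, "t": {"a": 0.6, "c": 0.4},  # P(X|S), assumed
         "u": {"a": 0.3, "c": 0.7}, "v": {"a": 0.5, "c": 0.5}}

obs = "aaca"
total = 0.0        # accumulates P(aaca)
s2_is_t = 0.0      # accumulates P(S2=t, aaca)
for seq in product(states, repeat=len(obs)):
    p = init[seq[0]] * emit[seq[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= trans[seq[i - 1]][seq[i]] * emit[seq[i]][obs[i]]
    total += p
    if seq[1] == "t":
        s2_is_t += p

print("Pr(S2=t | aaca) =", s2_is_t / total)

Enumerating all sequences is exponential in the sequence length; it is only meant to show that the query "flows against the arrows" and is still well defined.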
SOME MORE DETAILS ON DIRECTED MODELS
The example police say we're in violation:
• Insufficient use of the "Monty Hall" problem
• Discussing Bayes nets without discussing burglar alarms
Slide 10
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a car. Behind the others, goats.
• You pick one of three doors, say #1
• The host, Monty Hall, opens one door, revealing…a goat!
• You now can either stick with your guess or change doors
[Graph: A = first guess, B = the money (the door hiding the car), C = the revealed goat, D = stick or swap?, E = second guess; A and B point to C, and A, C, D point to E]
A      P(A)
1      0.33
2      0.33
3      0.33

B      P(B)
1      0.33
2      0.33
3      0.33

D      P(D)
Stick  0.5
Swap   0.5

A  B  C  P(C|A,B)
1  1  2  0.5
1  1  3  0.5
1  2  3  1.0
1  3  2  1.0
…  …  …  …
P(C=c | A=a, B=b) = 1.0 if a ≠ b and c ∉ {a, b}
                  = 0.5 if a = b and c ∉ {a, b}
                  = 0   otherwise
A few minutes later, the goat from behind door C drives away in the car.
Slide 12
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
P(C=c | A=a, B=b) = 1.0 if a ≠ b and c ∉ {a, b}
                  = 0.5 if a = b and c ∉ {a, b}
                  = 0   otherwise
A C D P(E|A,C,D)
… … … …
P(E=e | A=a, C=c, D=d) = 1.0 if d = stick and e = a
                       = 1.0 if d = swap and e ∉ {a, c}
                       = 0   otherwise
Slide 13
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
…again by the chain rule:
P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)
We could construct the joint and compute P(E=B|D=swap)
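As a concrete illustration of "construct the joint and query it", here is a small Python sketch that multiplies out the CPTs above over every assignment and conditions on D. The helper names (p_c, p_e, p_win_given) are just for this example, not anything from the slides.

# Build the Monty Hall joint from the CPTs by brute force and query it.
from itertools import product

doors = [1, 2, 3]

def p_c(c, a, b):                      # P(C=c | A=a, B=b): Monty opens a goat door
    if c in (a, b):
        return 0.0
    return 0.5 if a == b else 1.0

def p_e(e, a, c, d):                   # P(E=e | A=a, C=c, D=d)
    if d == "stick":
        return 1.0 if e == a else 0.0
    return 1.0 if e not in (a, c) else 0.0

def p_win_given(d_value):
    num = den = 0.0
    for a, b, c, e in product(doors, repeat=4):
        p = (1 / 3) * (1 / 3) * p_c(c, a, b) * 0.5 * p_e(e, a, c, d_value)
        den += p
        if e == b:                     # the second guess lands on the car
            num += p
    return num / den

print("P(E=B | D=swap)  =", p_win_given("swap"))    # ≈ 2/3
print("P(E=B | D=stick) =", p_win_given("stick"))   # ≈ 1/3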
Slide 14
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
The joint table has…?
3*3*3*2*3 = 162 rows
The conditional probability tables (CPTs) shown have…?
3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows
Big questions:
• Why are the CPTs smaller?
• How much smaller are the CPTs than the joint?
• Can we compute the answers to queries like P(E=B|d) without building the joint probability table, just using the CPTs?
Slide 15
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
Why is the CPTs representation smaller? Follow the money! (B)
P(E=e | A=a, C=c, D=d) = 1.0 if d = stick and e = a
                       = 1.0 if d = swap and e ∉ {a, c}
                       = 0   otherwise

For all a, b, c, d, e:
P(E=e | A=a, B=b, C=c, D=d) = P(E=e | A=a, C=c, D=d)

E is conditionally independent of B given A, C, D:
E ⊥ B | A, C, D, i.e. I<E, {A,C,D}, B>
Slide 16
Conditional Independence (again)
Definition: R and L are conditionally independent given M if, for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally: Let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)
Slide 17
The (highly practical) Monty Hall problem
[Graph: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess]
What are the conditional independencies?
• I<A, {B}, C> ?
• I<A, {C}, B> ?
• I<E, {A,C}, B> ?
• I<D, {E}, B> ?
• …
Slide 18
What Independencies does a Bayes Net Model?
• In order for a Bayesian network to model a probability distribution, the following must be true by definition: each variable is conditionally independent of all its non-descendants in the graph, given the values of all its parents.
• This implies
  P(X1, …, Xn) = ∏i=1..n P(Xi | parents(Xi))
• But what else does it imply?
Slide 19
What Independencies does a Bayes Net Model?
• Example:
[Graph: a chain Z → Y → X]
Given Y, does learning the value of Z tell us nothing at all new about X?
I.e., is P(X|Y, Z) equal to P(X | Y)?
Yes. Since we know the value of all of X’s parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z.
Also, since independence is symmetric, P(Z|Y, X) = P(Z|Y).
Slide 20
What Independencies does a Bayes Net Model?
• Let I<X,Y,Z> represent X and Z being conditionally independent given Y.
• I<X,Y,Z>? Yes, just as in previous example: All X’s parents given, and Z is not a descendant.
[Graph: X ← Y → Z]
Slide 21
Things get a little more confusing
• X has no parents, so we know all its parents' values trivially
• Z is not a descendant of X
• So, I<X,{},Z>, even though there's an undirected path from X to Z through an unknown variable Y.
• What if we do know the value of Y, though? Or one of its descendants?
[Graph: X → Y ← Z]
Slide 22
The "Burglar Alarm" example
• Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
• Earth arguably doesn't care whether your house is currently being burgled.
• While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!
[Graph: Burglar → Alarm ← Earthquake; Alarm → Phone Call]
Slide 23
Things get a lot more confusing
• But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.
• Earthquake "explains away" the hypothetical burglar.
• But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!
[Graph: Burglar → Alarm ← Earthquake; Alarm → Phone Call]
Slide 24
“Explaining away”
[Graph: X → E ← Y, with P(X=1) = 0.5 and P(Y=1) = 0.5]
Is X ⊥ Y | {} ?  YES
Is X ⊥ Y | E ?   NO
This is "explaining away":
• E is a common symptom of two causes, X and Y
• After observing E=1, both X and Y become more probable
• After observing E=1 and X=1, Y becomes less probable, since X alone is enough to "explain" E
Slide 25
“Explaining away” and common-sense
Historical note:
• Classical logic is monotonic: the more you know, the more you can deduce.
• "Common-sense" reasoning is not monotonic
  • birds fly
  • but not after being cooked for 20 min/lb at 350°F
• This led to numerous "non-monotonic logics" for AI
• This example shows that Bayes nets are not monotonic either
  • If P(Y|E) is "your belief" in Y after observing E,
  • and P(Y|X,E) is "your belief" in Y after observing E and X,
  • your belief in Y decreases after you discover X
Slide 26
How can I make this less confusing?
• But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.
• Earthquake “explains away” the hypothetical burglar.
• But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!
[Graph: Burglar → Alarm ← Earthquake; Alarm → Phone Call]
Slide 27
d-separation to the rescue
• Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.
• Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is “blocked”, where a path is “blocked” iff one or more of the following conditions is true: ...
i.e., X and Z are dependent iff there exists an unblocked path
Slide 28
A path is “blocked” when...
• There exists a variable V on the path such that
  • it is in the evidence set E
  • the arcs putting V on the path are "tail-to-tail"
• Or, there exists a variable V on the path such that
  • it is in the evidence set E
  • the arcs putting V on the path are "tail-to-head"
• Or, ...
[Figures: V as a common cause, X ← V → Z, and V on a causal chain, X → V → Z]
unknown "common causes" of X and Z impose dependency
unknown "causal chains" connecting X and Z impose dependency
Slide 29
A path is “blocked” when… (the funky case)
• … Or, there exists a variable V on the path such that
  • it is NOT in the evidence set E
  • neither are any of its descendants
  • the arcs putting V on the path are "head-to-head"
[Figure: V as a common effect (collider), X → V ← Z]
Known “common symptoms” of X and Z impose dependencies… X may “explain away” Z
Slide 30
Summary: d-separation
[Figure: the three blocked-path configurations between X and Y, each showing where the evidence E sits on the path]
I<X, E, Y> means X ⊥ Y | E, i.e. P(X | E, Y) = P(X | E)
i.e., for all x, y, e: P(X=x | E=e, Y=y) = P(X=x | E=e)
There are three ways paths from X to Y given evidence E can be blocked.
X is d-separated from Y given E iff all paths from X to Y given E are blocked
If X is d-separated from Y given E, then I<X,E,Y>
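Here is a minimal Python sketch of the path-enumeration version of this test, assuming the network is given as a node → children dict. It is fine for toy networks like the ones on these slides; real implementations usually use the more efficient Bayes-ball/reachability algorithm instead.

# d-separation by enumerating undirected simple paths and applying the
# three blocking rules above.  graph: dict mapping each node to its children.
def descendants(graph, v):
    out, stack = set(), list(graph.get(v, []))
    while stack:
        c = stack.pop()
        if c not in out:
            out.add(c)
            stack.extend(graph.get(c, []))
    return out

def d_separated(graph, x, y, evidence):
    evidence = set(evidence)
    parents = {}
    for p, kids in graph.items():
        for k in kids:
            parents.setdefault(k, set()).add(p)

    def neighbors(v):
        return set(graph.get(v, [])) | parents.get(v, set())

    def blocked(path):
        # a path is blocked iff some intermediate variable blocks it
        for i in range(1, len(path) - 1):
            prev, v, nxt = path[i - 1], path[i], path[i + 1]
            arrows_in = (v in graph.get(prev, [])) + (v in graph.get(nxt, []))
            if arrows_in == 2:  # head-to-head (collider)
                if v not in evidence and not (descendants(graph, v) & evidence):
                    return True
            else:               # tail-to-tail or tail-to-head
                if v in evidence:
                    return True
        return False

    def paths(cur, seen):
        if cur == y:
            yield list(seen)
            return
        for n in neighbors(cur):
            if n not in seen:
                yield from paths(n, seen + [n])

    return all(blocked(p) for p in paths(x, [x]))

# The burglar-alarm network from the earlier slides:
g = {"Burglar": ["Alarm"], "Earthquake": ["Alarm"],
     "Alarm": ["PhoneCall"], "PhoneCall": []}
print(d_separated(g, "Burglar", "Earthquake", set()))          # True: I<B, {}, E>
print(d_separated(g, "Burglar", "Earthquake", {"PhoneCall"}))  # False: explaining away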
Slide 31
LEARNING FOR BAYES NETS
Slide 32
(Review) Breaking it down: Learning parameters for the "naïve" HMM
• Training data defines unique path through HMM!
  – Transition probabilities
    • Probability of transitioning from state i to state j =
      (number of transitions from i to j) / (total transitions from state i)
  – Emission probabilities
    • Probability of emitting symbol k from state i =
      (number of times k generated from i) / (number of transitions from i)
with smoothing, of course (see the count-and-normalize sketch below)
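A minimal sketch of these count-and-normalize estimates with add-one (Laplace) smoothing, assuming labeled sequences are given as lists of (state, symbol) pairs; the function and variable names are just for this example.

# Count transitions and emissions from labeled sequences, then normalize.
from collections import Counter

def estimate_hmm(sequences, states, symbols, alpha=1.0):
    trans, emit = Counter(), Counter()
    for seq in sequences:
        for (s, x) in seq:
            emit[(s, x)] += 1
        for (s, _), (s2, _) in zip(seq, seq[1:]):
            trans[(s, s2)] += 1
    # P(s'|s) = (count(s -> s') + alpha) / (count(s -> *) + alpha * |states|)
    P_trans = {(s, s2): (trans[(s, s2)] + alpha) /
               (sum(trans[(s, t)] for t in states) + alpha * len(states))
               for s in states for s2 in states}
    # P(x|s) = (count(s emits x) + alpha) / (count(s emits *) + alpha * |symbols|)
    P_emit = {(s, x): (emit[(s, x)] + alpha) /
              (sum(emit[(s, y)] for y in symbols) + alpha * len(symbols))
              for s in states for x in symbols}
    return P_trans, P_emit

# Example: two tiny labeled sequences over states {s, t} and symbols {a, c}
data = [[("s", "a"), ("t", "a"), ("t", "c")], [("s", "a"), ("s", "c")]]
P_trans, P_emit = estimate_hmm(data, ["s", "t"], ["a", "c"])
print(P_trans[("s", "t")], P_emit[("t", "c")])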
Slide 33
(Review) Breaking it down: NER using the "naïve" HMM
• Define the HMM structure:
  – one state per entity type
• Training data defines unique path through HMM for each labeled example
  – Use this to estimate transition and emission probabilities
• At test time for a sequence x
  – Use Viterbi to find the sequence of states s that maximizes Pr(x|s) (a sketch follows below)
  – Use s to derive labels for the sequence x
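A minimal Viterbi sketch to go with the review bullets above; it reuses the P_trans and P_emit dictionaries from the previous snippet and assumes a uniform initial state distribution P_init, which is not something the slide specifies.

# Viterbi: the most likely state sequence for an observation sequence x.
def viterbi(x, states, P_init, P_trans, P_emit):
    # delta[i][s] = best score of any state sequence ending in state s at position i
    delta = [{s: P_init[s] * P_emit[(s, x[0])] for s in states}]
    back = []
    for i in range(1, len(x)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: delta[i - 1][p] * P_trans[(p, s)])
            back[-1][s] = best_prev
            delta[-1][s] = (delta[i - 1][best_prev] *
                            P_trans[(best_prev, s)] * P_emit[(s, x[i])])
    # backtrace from the best final state
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for b in reversed(back):
        path.append(b[path[-1]])
    return list(reversed(path))

P_init = {"s": 0.5, "t": 0.5}   # assumed uniform start distribution
print(viterbi("aaca", ["s", "t"], P_init, P_trans, P_emit))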
Slide 34
Learning for Bayes nets ~ Learning for HMMs,
if everything is observed
• Input:
  • Sample of the joint: (x1^1, x2^1, …, xn^1), (x1^2, …, xn^2), …, (x1^m, …, xn^m)
  • Graph structure of the variables
    • for i = 1, …, n, you know Xi and parents(Xi)
• Output:
  • Estimated CPTs
[Graph over A, B, C, D, E, with its CPTs below]
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
B P(B)
1 0.33
2 0.33
3 0.33
…
Learning Method (discrete variables):
• Estimate each CPT independently
• Use an MLE or MAP estimate (see the sketch below)
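A minimal sketch of the fully observed case: count each variable's value jointly with its parents' values and normalize, with an optional pseudocount for a MAP-style estimate under a uniform Dirichlet prior. The data layout and helper names are assumptions for this example.

# Estimate one CPT from fully observed samples (list of dicts: variable -> value).
from collections import Counter

def estimate_cpt(data, child, parents, child_values, alpha=0.0):
    joint, ctx = Counter(), Counter()
    for row in data:
        key = tuple(row[p] for p in parents)
        joint[(key, row[child])] += 1
        ctx[key] += 1
    def prob(child_value, parent_values):
        num = joint[(tuple(parent_values), child_value)] + alpha
        den = ctx[tuple(parent_values)] + alpha * len(child_values)
        return num / den
    return prob

# Example with made-up samples of (A, B, C):
data = [{"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 1, "C": 2}]
p_c = estimate_cpt(data, "C", ["A", "B"], child_values=[1, 2, 3], alpha=1.0)
print(p_c(3, [1, 2]))   # estimate of P(C=3 | A=1, B=2)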
Slide 35
Learning for Bayes nets ~ Learning for HMMs,
if some things are not observed
• Input:
  • Sample of the joint: (x1^1, x2^1, …, xn^1), (x1^2, …, xn^2), …, (x1^m, …, xn^m)
  • Graph structure of the variables
    • for i = 1, …, n, you know Xi and parents(Xi)
• Output:
  • Estimated CPTs
[Graph over A, B, C, D, E, with its CPTs below]
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
B P(B)
1 0.33
2 0.33
3 0.33
…
Learning Method (discrete variables):
• Use inference* to estimate distribution of the unobserved values
• Use EM
* The HMM methods generalize to trees. I’ll talk about Gibbs sampling soon
LDA AND OTHER DIRECTED MODELS FOR MODELING TEXT
Supervised Multinomial Naïve Bayes
• Naïve Bayes Model: Compact representation
[Figure: class C with children W1 W2 W3 … WN, and the equivalent plate notation C → W inside a plate of size N, inside a plate of size M, with word-distribution parameters β]
Supervised Multinomial Naïve Bayes
• Naïve Bayes Model: Compact representation
[Figure: the same model in plate notation, now with the K class-specific word distributions β1 … βK drawn as a plate of size K]
Review – supervised Naïve Bayes
• Multinomial Naïve Bayes
[Plate diagram: C → W1 W2 W3 … WN, repeated for M documents]
• For each class k = 1, …, K
  • Construct a multinomial βk
• For each document d = 1, …, M
  • Generate Cd ~ Mult(· | π)
  • For each position n = 1, …, Nd
    • Generate wn ~ Mult(· | βCd) … or if you prefer, wn ~ Pr(w | Cd) (sketch below)
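A minimal numpy sketch of this generative story. The corpus sizes, vocabulary, and the Dirichlet draws used to create the class prior and the per-class word distributions are made up for illustration.

# Sample labeled documents from the multinomial Naive Bayes generative process.
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 10, 5, 8                     # classes, vocab size, docs, words/doc
pi = rng.dirichlet(np.ones(K))               # class prior (assumed)
beta = rng.dirichlet(np.ones(V), size=K)     # one multinomial over words per class

docs, labels = [], []
for d in range(M):
    c = rng.choice(K, p=pi)                  # C_d ~ Mult(. | pi)
    words = rng.choice(V, size=N, p=beta[c]) # w_n ~ Mult(. | beta_{C_d})
    labels.append(c)
    docs.append(words)

print(labels[0], docs[0])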
Review – unsupervised Naïve Bayes
• Mixture model: EM solution
  – E-step: estimate Pr(class | document) under the current parameters
  – M-step: re-estimate the class prior and the per-class word distributions from those posteriors
Key capability: estimate distribution of latent variables given observed variables
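A minimal EM sketch for this mixture-of-multinomials model, assuming the corpus is given as an M×V word-count matrix X; the number of classes K, the smoothing constant, and the random initialization are assumptions for this example.

# EM for a mixture of multinomials (unsupervised multinomial Naive Bayes).
import numpy as np

def em_multinomial_mixture(X, K, iters=50, smooth=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    M, V = X.shape
    pi = np.full(K, 1.0 / K)
    beta = rng.dirichlet(np.ones(V), size=K)            # K x V word distributions
    for _ in range(iters):
        # E-step: log Pr(c) + sum_w count(d,w) * log Pr(w|c), then normalize per doc
        log_post = np.log(pi) + X @ np.log(beta).T       # M x K
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)          # gamma[d,k] = Pr(c=k | doc d)
        # M-step: re-estimate pi and beta from the soft counts
        pi = post.sum(axis=0) / M
        beta = post.T @ X + smooth                       # K x V expected word counts
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, post

# Example with a tiny random count matrix
X = np.random.default_rng(1).integers(0, 5, size=(20, 12))
pi, beta, post = em_multinomial_mixture(X, K=3)
print(pi, post[0])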
Review – unsupervised Naïve Bayes
• Mixture model: unsupervised naïve Bayes model
C
W
NM
b
• Joint probability of words and classes:
• But classes are not visible:Z
Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (PLSI)
• Every document is a mixture of topics
[Plate diagram: per-document topic mixture, latent topic Z → word W, N words per document, M documents, topic-word distributions β]
• For i = 1…K:
  • Let βi be a multinomial over words
• For each document d:
  • Let θd be a distribution over {1, …, K}
  • For each word position in d:
    • Pick a topic z from θd
    • Pick a word w from βz
• Turns out to be hard to fit:
  • Lots of parameters!
  • Also: only applies to the training data
The LDA Topic Model
LDA
• Motivation
[Plate diagram: per-document parameter θd → word w, inside plates of size N (words) and M (documents)]
Assumptions: 1) documents are i.i.d 2) within a document, words are i.i.d. (bag of words)
• For each document d = 1, …, M
  • Generate θd ~ D1(…)
  • For each word n = 1, …, Nd
    • generate wn ~ D2(· | θdn)
Now pick your favorite distributions for D1, D2
• Latent Dirichlet Allocation
[Plate diagram: α → θd → zn → wn, with topic-word multinomials in a plate of size K; words nested in plate N, documents in plate M]
• For each document d = 1, …, M
  • Generate θd ~ Dir(· | α)
  • For each position n = 1, …, Nd
    • generate zn ~ Mult(· | θd)
    • generate wn ~ Mult(· | βzn)
"Mixed membership"
Pr(zj = k | z1, z2, …, zj−1, …) = (nk + αk) / Σj' (nj' + αj')
where nk is the number of earlier z's equal to k
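A minimal numpy sketch of the LDA generative process above; the number of topics, vocabulary size, document lengths, and hyperparameter values below are illustrative assumptions.

# Sample a toy corpus from the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 4, 25, 10                                   # topics, vocab size, documents
alpha, eta = 0.5, 0.1
beta = rng.dirichlet(np.full(V, eta), size=K)         # topic-word multinomials

docs = []
for d in range(M):
    theta = rng.dirichlet(np.full(K, alpha))          # theta_d ~ Dir(alpha)
    n_d = rng.poisson(20) + 1                         # document length (assumed)
    z = rng.choice(K, size=n_d, p=theta)              # z_n ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ Mult(beta_{z_n})
    docs.append(w)

print(docs[0])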
• LDA’s view of a document
• LDA topics
Review - LDA
• Latent Dirichlet Allocation – Parameter learning:
  • Variational EM
    – Numerical approximation using lower-bounds
    – Results in biased solutions
    – Convergence has numerical guarantees
  • Gibbs Sampling
    – Stochastic simulation
    – Unbiased solutions
    – Stochastic convergence
Review - LDA
• Gibbs sampling – works for any directed model!
  – Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  – Sequence of samples comprises a Markov Chain
  – Stationary distribution of the chain is the joint distribution
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
Why does Gibbs sampling work?
• What's the fixed point?
  – Stationary distribution of the chain is the joint distribution
• When will it converge (in the limit)?
  – If the graph defined by the chain is connected
• How long will it take to converge?
  – Depends on the second eigenvalue of that graph's transition matrix
Called “collapsed Gibbs sampling” since you’ve marginalized away some variables
From: "Parameter estimation for text analysis", Gregor Heinrich
LDA
• Latent Dirichlet Allocation
[Plate diagram: α → θd → zn → wn, with topic-word multinomials φ1 … φK and prior β; words in plate N, documents in plate M ("mixed membership" model)]
• Randomly initialize each zm,n
• Repeat for t = 1, …
  • For each doc m, word n
    • Find Pr(zmn = k | other z's)
    • Sample zmn according to that distribution
EVEN MORE DETAIL ON LDA…
Way way more detail
More detail
What gets learned…..
In A Math-ier Notation
[Figure: the count matrices used below — N[d,k], the number of words in document d assigned to topic k; N[*,k], the total number of words assigned to topic k; and M[w,k], the number of times word w is assigned to topic k, over a vocabulary of size V]
Initialization: for each document d and word position j in d
  • z[d,j] = k, a random topic
  • N[d,k]++
  • W[w,k]++ where w = id of the j-th word in d
Then, for each pass t = 1, 2, …
  for each document d and word position j in d
    • z[d,j] = k, a new random topic
    • update N, W to reflect the new assignment of z:
      • N[d,k]++; N[d,k'] -- where k' is the old z[d,j]
      • W[w,k]++; W[w,k'] -- where w is w[d,j]
(a runnable sketch follows below)
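A minimal collapsed Gibbs sketch that follows these count updates. The docs format, number of topics, and the Dirichlet hyperparameters alpha and eta are assumptions for this example; the conditional uses the standard collapsed-LDA form, proportional to (N[d,k]+alpha)·(W[w,k]+eta)/(ΣW[·,k]+V·eta).

# Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids.
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, eta=0.01, passes=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    N = np.zeros((D, K))          # N[d,k]: words in doc d assigned to topic k
    W = np.zeros((V, K))          # W[w,k]: times word w is assigned to topic k
    Wk = np.zeros(K)              # column sums of W
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random init
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            N[d, z[d][j]] += 1; W[w, z[d][j]] += 1; Wk[z[d][j]] += 1
    for _ in range(passes):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]   # remove the current assignment from the counts
                N[d, k_old] -= 1; W[w, k_old] -= 1; Wk[k_old] -= 1
                # Pr(z[d,j]=k | other z's) ∝ (N[d,k]+alpha)*(W[w,k]+eta)/(Wk[k]+V*eta)
                p = (N[d] + alpha) * (W[w] + eta) / (Wk + V * eta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][j] = k_new   # add the new assignment back into the counts
                N[d, k_new] += 1; W[w, k_new] += 1; Wk[k_new] += 1
    return z, N, W

# Tiny example: 4 "documents" over a 6-word vocabulary
docs = [[0, 1, 0, 2], [1, 0, 1], [3, 4, 5, 4], [5, 3, 5]]
z, N, W = lda_gibbs(docs, V=6, K=2)
print(N)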
Some comments on LDA
• Very widely used model
• Also a component of many other models