600.465 - Intro to NLP - J. Eisner
Topic Modeling
Embedding Documents
A trick from Information Retrieval
Each document in corpus is a length-k vector (or each paragraph, or whatever)
(0, 3, 3, 1, 0, 7, …, 1, 0) = a single document
Pretty sparse, so pretty noisy! Hard to tell which docs are similar.
Wish we could smooth the vector: what would the document look like if it rambled on forever?
Latent Semantic Analysis
A trick from Information Retrieval
Each document in corpus is a length-k vector
Plot all documents in corpus
[Figure: true plot in k dimensions vs. reduced-dimensionality plot]
A clustering algorithm might discover this cluster of “similar” documents – similar thanks to smoothing!
Latent Semantic Analysis
Reduced plot is a perspective drawing of true plot
It projects true plot onto a few axes
a best choice of axes – shows most variation in the data
Found by linear algebra: “Principal Components Analysis” (PCA)
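A minimal numpy sketch of that projection (not from the slides; the toy counts below are invented): center the document-by-term matrix, take its SVD, and keep the top two principal axes as the reduced plot's coordinates.

```python
import numpy as np

# Toy corpus: rows = documents, columns = term counts (made-up numbers).
M = np.array([[0, 3, 3, 1, 0, 7, 1, 0],
              [1, 2, 4, 0, 0, 5, 0, 1],
              [5, 0, 0, 6, 2, 0, 0, 4],
              [4, 1, 0, 5, 3, 1, 0, 3]], dtype=float)

# PCA: center the data, then take the SVD of the centered matrix.
M_centered = M - M.mean(axis=0)
U, S, Vt = np.linalg.svd(M_centered, full_matrices=False)

# Coordinates of each document in the reduced (2-D) plot.
coords = U[:, :2] * S[:2]
print(coords)
```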
Latent Semantic Analysis
SVD plot allows best possible reconstruction of true plot (i.e., can recover 3-D coordinates with minimal distortion)
Ignores variation in the axes that it didn’t pick
Hope that variation’s just noise and we want to ignore it
[Figure: true plot in k dimensions vs. reduced-dimensionality plot, with reduced axes labeled “topic A” and “topic B”]
Latent Semantic Analysis
SVD finds a small number of topic vectors
Approximates each doc as linear combination of topics
Coordinates in reduced plot = linear coefficients: how much of topic A in this document? How much of topic B?
Each topic is a collection of words that tend to appear together
Latent Semantic Analysis
New coordinates might actually be useful for Info Retrieval
To compare 2 documents, or a query and a document: project both into reduced space – do they have topics in common?
Even if they have no words in common!
Latent Semantic Analysis
Topics extracted for IR might help sense disambiguation
Each word is like a tiny document: (0, 0, 0, 1, 0, 0, …)
Express word as a linear combination of topics
Each topic corresponds to a sense? E.g., “Jordan” has Mideast and Sports topics (plus Advertising topic, alas, which is same sense as Sports)
Word’s sense in a document: which of its topics are strongest in the document?
Groups senses as well as splitting them: one word has several topics, and many words have the same topic
Latent Semantic Analysis
A perspective on Principal Components Analysis (PCA)
Imagine an electrical circuit that connects terms to docs …
[Figure: a two-layer network connecting 9 terms to 7 documents]
Matrix of strengths (how strong is each term in each document?)
Each connection has a weight given by the matrix.
Latent Semantic Analysis
Which documents is term 5 strong in?
Docs 2, 5, 6 light up strongest.
Latent Semantic Analysis
Which documents are terms 5 and 8 strong in?
This answers a query consisting of terms 5 and 8!
Really just matrix multiplication: term vector (query) × strength matrix = doc vector.
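A tiny sketch of that matrix-multiplication view (the strength matrix below is randomly invented for illustration): a 0/1 vector over terms, times the term-by-document strength matrix, gives one score per document.

```python
import numpy as np

# Made-up strength matrix: rows = 9 terms, columns = 7 documents.
strengths = np.random.default_rng(0).random((9, 7))

query = np.zeros(9)
query[[4, 7]] = 1.0                 # terms 5 and 8 (0-indexed as 4 and 7)

doc_scores = query @ strengths      # one score per document
print(doc_scores.argsort()[::-1])   # documents ranked by strength in terms 5 and 8

# Conversely, a column of the matrix (e.g., strengths[:, 4]) is the
# term profile of one document -- its coordinates in term space.
```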
Latent Semantic Analysis
Conversely, what terms are strong in document 5?
Gives doc 5’s coordinates!
Latent Semantic Analysis
SVD approximates this network by a smaller 3-layer network
Forces sparse data through a bottleneck, smoothing it
[Figure: the term–document network, and its approximation by a 3-layer network with a small “topics” layer in the middle]
Latent Semantic Analysis
I.e., smooth sparse data by matrix approx: M ≈ A B
A encodes camera angle, B gives each doc’s new coords
Latent Semantic Analysis
Completely symmetric! Regard A, B as projecting terms and docs into a low-dimensional “topic space” where their similarity can be judged.
Cluster documents (helps sparsity problem!)
Cluster words
Compare a word with a doc
Identify a word’s topics with its senses: sense disambiguation by looking at document’s senses
Identify a document’s topics with its topics: topic categorization
If you’ve seen SVD before …
SVD actually decomposes M = A D B′ exactly
A = camera angle (orthonormal); D diagonal; B′ orthonormal
If you’ve seen SVD before …
Keep only the largest j < k diagonal elements of D
This gives the best possible approximation to M using only j blue units (the middle “topics” layer)
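A quick numpy check of that claim on a made-up matrix (a sketch, not part of the slides): zero out all but the j largest singular values and look at the reconstruction; by the Eckart–Young theorem, no other rank-j matrix does better in squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((9, 7))                    # made-up 9-term by 7-document matrix

A, D, Bt = np.linalg.svd(M, full_matrices=False)   # M = A @ diag(D) @ Bt exactly

j = 3                                     # keep only the j largest singular values
M_j = A[:, :j] @ np.diag(D[:j]) @ Bt[:j]

# Eckart-Young: no rank-j matrix comes closer to M in squared error.
print(np.linalg.norm(M - M_j))            # reconstruction error with j "topic" units
```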
If you’ve seen SVD before …
To simplify picture, can write M ≈ A (D B′) = A B
How should you pick j (number of blue units)?
Just like picking number of clusters: how well does system work with each j (on held-out data)?
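One hedged way to make that concrete (synthetic data, not a recipe from the course): for each candidate j, fold held-out documents into the rank-j topic space learned from training documents and measure how well they are reconstructed.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.poisson(2.0, size=(40, 30)).astype(float)    # 40 training docs x 30 terms
heldout = rng.poisson(2.0, size=(10, 30)).astype(float)  # 10 held-out docs

U, S, Vt = np.linalg.svd(train, full_matrices=False)

for j in (1, 2, 5, 10, 20):
    topics = Vt[:j]                       # j topic directions in term space
    coords = heldout @ topics.T           # fold held-out docs into topic space
    reconstruction = coords @ topics      # project back to term space
    err = np.linalg.norm(heldout - reconstruction)
    print(j, err)                         # pick the j that generalizes best
```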
Directed graphical models (Bayes nets)
slide thanks to Zoubin Ghahramani (modified)
Under any model: p(A, B, C, D, E) = p(A) p(B|A) p(C|A,B) p(D|A,B,C) p(E|A,B,C,D)
Model above says: [simpler factorization given by the graph shown on the slide]
Unigram model for generating text
w1 w2 w3…
p(w1) p(w2) p(w3) …
Explicitly show model’s parameters
w1 w2 w3…
p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
“θ is a vector that says which unigrams are likely”
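A small sketch of this parameterized unigram model (the vocabulary and the value of θ are made up): θ is a probability vector over the vocabulary, and each word is drawn independently from it.

```python
import numpy as np

vocab = ["the", "dog", "cat", "barks"]           # toy vocabulary
theta = np.array([0.5, 0.2, 0.2, 0.1])           # "which unigrams are likely"

rng = np.random.default_rng(0)
words = rng.choice(vocab, size=10, p=theta)      # w1, w2, ..., each ~ p(w | theta)
print(" ".join(words))

# Probability of the generated text under the model: product of p(wi | theta).
probs = theta[[vocab.index(w) for w in words]]
print(probs.prod())
```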
“Plate notation” simplifies diagram
p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
[Plate diagram: θ → wi, with the word node inside a plate repeated N1 times]
“θ is a vector that says which unigrams are likely”
Learn θ from observed words (rather than vice-versa)
p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
[Plate diagram: θ → wi, word plate repeated N1 times; the words are now observed and θ is inferred]
Explicitly show prior over θ (e.g., Dirichlet)
p(α) p(θ | α) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
“Even if we didn’t observe word 5, the prior says that θ5 = 0 is a terrible guess”
[Plate diagram: α (given) → θ ~ Dirichlet(α) → wi, word plate repeated N1 times]
Dirichlet Distribution
Each point on a k-dimensional simplex is a multinomial probability distribution: θi ≥ 0 for all i, and Σi θi = 1.
[Figure: the 3-simplex with axes θ1, θ2, θ3, and a bar chart showing one such distribution over the words “dog”, “the”, “cat”]
slide thanks to Nigel Crook
Dirichlet Distribution
A Dirichlet distribution is a distribution over multinomial distributions in the simplex.
[Figure: density plots of Dirichlet distributions over the 3-simplex with axes θ1, θ2, θ3]
slide thanks to Nigel Crook
slide thanks to Percy Liang and Dan Klein
Dirichlet Distribution
Example draws from a Dirichlet distribution over the 3-simplex:
[Figure: sampled points in the 3-simplex for Dirichlet(5, 5, 5), Dirichlet(0.2, 5, 0.2), and Dirichlet(0.5, 0.5, 0.5)]
slide thanks to Nigel Crook
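The same idea in code (a sketch using numpy's Dirichlet sampler, not the slide's figure): draw points from Dirichlets with these three concentration parameters and see how they spread over the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([5, 5, 5], [0.2, 5, 0.2], [0.5, 0.5, 0.5]):
    draws = rng.dirichlet(alpha, size=5)   # each row is a point in the 3-simplex
    print(alpha)
    print(draws.round(2))                  # rows sum to 1; small alphas -> sparse draws
```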
Explicitly show prior over θ (e.g., Dirichlet)
p(α) p(θ | α) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
“Even if we didn’t observe word 5, the prior says that θ5 = 0 is a terrible guess”
Posterior distribution p(θ | α, w) is also a Dirichlet, just like the prior p(θ | α).
Prior = Dirichlet(α); posterior = Dirichlet(α + counts(w)).
Mean of posterior is like the max-likelihood estimate of θ, but smooths the corpus counts by adding “pseudocounts” α.
(But better to use whole posterior, not just the mean.)
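A sketch of that conjugate update with toy counts and an assumed symmetric α: the posterior is Dirichlet(α + counts), and its mean is the add-α smoothed estimate of θ.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0, 1.0])    # prior pseudocounts (assumed, symmetric)
counts = np.array([7, 0, 2, 1])           # observed word counts in the corpus

posterior_alpha = alpha + counts          # posterior is Dirichlet(alpha + counts)

mle = counts / counts.sum()               # max-likelihood estimate (word 2 gets 0!)
posterior_mean = posterior_alpha / posterior_alpha.sum()   # smoothed estimate

print(mle, posterior_mean)
```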
Training and Test Documents
“Learn θ from document 1, use it to predict document 2”
[Plate diagram: one shared θ generates document 1’s words (train, N1 of them) and document 2’s words (test, N2 of them)]
What do good configurations look like if N1 is large? What if N1 is small?
Many Documents
“Each document has its own unigram model”
[Plate diagram: θ1 → document 1’s words (N1 of them), θ2 → document 2’s words (N2), θ3 → document 3’s words (N3)]
Now does observing docs 1 and 3 still help predict doc 2?
Only if we learn that all the θ’s are similar (low variance).
And in that case, why even have separate θ’s?
Many Documents
“Each document has its own unigram model”
[Plate diagram: α → θd → wdi, with the word plate (N) nested inside the document plate (D)]
θd ~ Dirichlet(α); α is given, or tuned to maximize training or dev set likelihood
Bayesian Text Categorization
“Each document chooses one of only K topics (unigram models)”
[Plate diagram: K topic vectors θk; each document’s words wdi are drawn from one of them]
θk ~ Dirichlet(α), with α given; wdi drawn from θk … but which k?
Bayesian Text Categorization
“Each document chooses one of only K topics (unigram models)”
[Plate diagram: each document d has a topic variable zd that selects which of the K topic vectors θk generates its words wdi]
Allows documents to differ considerably while some still share parameters.
And, we can infer the probability that two documents have the same topic z.
Might observe some topics.
wdi drawn from θ_zd; each θk ~ Dirichlet(α), with α given
zd = a topic in 1…K, drawn from a distribution over topics 1…K that itself has a Dirichlet prior
Latent Dirichlet Allocation (Blei, Ng & Jordan 2003)
“Each document chooses a mixture of all K topics; each word gets its own topic”
[Plate diagram: for each document d, a topic mixture; for each word position i, a topic zdi and then a word wdi drawn from topic vector θ_zdi; the K topic vectors are shared across documents]
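A generative sketch of that story (the vocabulary, the hyperparameters, and the sizes K, D, N are all invented): each document draws its own mixture over the K topics, and each word position draws a topic from that mixture and then a word from the chosen topic.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["goal", "match", "bank", "river", "vote", "law"]
K, D, N = 2, 3, 8                        # topics, documents, words per document

# K topic-specific unigram distributions, each drawn from a Dirichlet prior.
topics = rng.dirichlet(np.full(len(vocab), 0.5), size=K)

for d in range(D):
    mixture = rng.dirichlet(np.full(K, 0.5))    # this document's mix of all K topics
    doc = []
    for _ in range(N):
        z = rng.choice(K, p=mixture)            # each word gets its own topic
        doc.append(rng.choice(vocab, p=topics[z]))
    print(f"doc {d}:", " ".join(doc))
```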
(Part of) one assignment to LDA’s variables
slide thanks to Dave Blei
Latent Dirichlet Allocation: Inference?
[Diagram: the LDA network unrolled over word positions – each word w1, w2, w3, … has its own topic variable z1, z2, z3, …, with the K topic vectors and the document’s topic mixture shared]
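As a hedged illustration of one common inference route (scikit-learn's variational implementation, which is not something the slide specifies), here is how one might recover document-topic mixtures from a small count matrix.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up doc-term count matrix: 4 documents over a 6-word vocabulary.
X = np.array([[5, 4, 0, 0, 1, 0],
              [4, 6, 1, 0, 0, 0],
              [0, 0, 5, 4, 0, 1],
              [0, 1, 4, 6, 0, 0]])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)        # per-document topic mixtures
print(doc_topics.round(2))
print(lda.components_.round(2))          # per-topic (unnormalized) word weights
```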
Finite-State Dirichlet Allocation (Cui & Eisner 2006)
“A different HMM for each document”
[Diagram: per-document HMM – topic states z1 → z2 → z3 → … emit words w1, w2, w3, …; K topics, D documents]
Variants of Latent Dirichlet Allocation
Syntactic topic model: A word or its topic is influenced by its syntactic position.
Correlated topic model, hierarchical topic model, …: Some topics resemble other topics.
Polylingual topic model: All versions of the same document use the same topic mixture, even if they’re in different languages. (Why useful?)
Relational topic model: Documents on the same topic are generated separately but tend to link to one another. (Why useful?)
Dynamic topic model: We also observe a year for each document. The k topics used in 2011 have evolved slightly from their counterparts in 2010.
Dynamic Topic Model
slide thanks to Dave Blei
Remember: Finite-State Dirichlet Allocation (Cui & Eisner 2006)
“A different HMM for each document”
[Diagram: per-document HMM – topic states z1 → z2 → z3 → … emit words w1, w2, w3, …; K topics, D documents]
Bayesian HMM
“Shared HMM for all documents”
[Diagram: one HMM whose states z1 → z2 → z3 → … emit words w1, w2, w3, … for every document]
(or just have 1 document)
We have to estimate the transition parameters and the emission parameters.
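A generative sketch of such a shared HMM (the number of states, the priors, and the vocabulary are assumptions): transition and emission distributions drawn from Dirichlet priors, then state and word sequences generated from them for every document.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "dog", "runs", "fast"]
K = 3                                            # number of hidden states/topics

# Parameters drawn from Dirichlet priors (shared by all documents).
transitions = rng.dirichlet(np.ones(K), size=K)          # row z -> next-state distribution
emissions = rng.dirichlet(np.ones(len(vocab)), size=K)   # row z -> word distribution

def generate(n_words):
    z = rng.choice(K)                            # assume a uniform initial state
    words = []
    for _ in range(n_words):
        words.append(rng.choice(vocab, p=emissions[z]))
        z = rng.choice(K, p=transitions[z])
    return words

for d in range(2):                               # "shared HMM for all documents"
    print(f"doc {d}:", " ".join(generate(6)))
```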