![Page 1: Variational Inference for Adaptor Grammarshomepages.inf.ed.ac.uk/scohen/naacl10variadaptor-slides.pdf · (Johnson et al. 2006) Define a distribution over trees New samples depend](https://reader034.fdocuments.in/reader034/viewer/2022050221/5f66708118faf8667d2352ef/html5/thumbnails/1.jpg)
Variational Inference for Adaptor Grammars
Shay Cohen, School of Computer Science, Carnegie Mellon University
David Blei, Computer Science Department, Princeton University
Noah Smith, School of Computer Science, Carnegie Mellon University
Shay Cohen, David Blei, Noah Smith Variational Inference for Adaptor Grammars 1/32
Outline
The lifecycle of unsupervised learning:
Outline
We give a new representation of an existing model (adaptor grammars)
This representation leads to a new variational inference algorithm for adaptor grammars
We do a sanity check on word segmentation, comparing to state-of-the-art results
Our inference algorithm makes unsupervised dependency parsing with adaptor grammars possible
Problem 1 - PP Attachment
I saw the boy with the telescope
(S (NP (N I)) (VP (V saw) (NP the boy) (PP with the telescope)))
I saw the boy with the telescope
(S (NP (N I)) (VP (V saw) (NP (NP the boy) (PP with the telescope))))
Problem 2 - Word Segmentation
Matthewslikeswordfighting → Matthews like sword fighting
Matthewslikeswordfighting → Matthews likes word fighting
What is missing?
Context could resolve this ambiguity
But we want unsupervised learning...
Where do we get the context?
Problem 1 - PP Attachment
(S (NP The boy with the telescope) (V entered) (NP the park))
I saw the boy with the telescope
(S (NP (N I)) (VP (V saw) (NP the boy) (PP with the telescope)))
I saw the boy with the telescope
(S (NP (N I)) (VP (V saw) (NP (NP the boy) (PP with the telescope))))
Problem 2 - Word Segmentation
Word fighting is the new hobby of computational linguists.
Mr. Matthews is a computational linguist.
Matthewslikeswordfighting → Matthews like sword fighting
Matthewslikeswordfighting → Matthews likes word fighting
Dreaming Up Patterns
Context helps. Where do we get it? Adaptor grammars (Johnson et al. 2006)
Define a distribution over trees
New samples depend on the history: “rich get richer” dynamics
Dream up “patterns” as we go along
Adaptor Grammars
Use the Pitman-Yor process with PCFGs as base distribution
To make it fully Bayesian, we also have a Dirichlet prior over the PCFG rules
Originally represented using the Chinese restaurant process (CRP)
CRP is convenient for sampling – not for variational inference
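The “rich get richer” behaviour of the Pitman-Yor process can be seen in its Chinese restaurant representation. The following is a minimal sketch, not the adaptor grammar sampler itself: it simulates only table sizes and ignores the PCFG base distribution. The names `d` (discount) and `alpha` (concentration) are chosen here for illustration.

```python
import random

def pitman_yor_crp(n_customers, d=0.5, alpha=1.0, seed=0):
    """Simulate table sizes under a Pitman-Yor Chinese restaurant process.

    d (discount, 0 <= d < 1) and alpha (concentration, > -d) are
    illustrative parameter names, not taken from the slides.
    """
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for n in range(n_customers):
        # An existing table k is chosen with weight (count_k - d);
        # a new table is opened with weight (alpha + d * number_of_tables).
        weights = [c - d for c in tables] + [alpha + d * len(tables)]
        # The weights always sum to n + alpha.
        r = rng.random() * (n + alpha)
        acc = 0.0
        choice = len(tables)  # default: open a new table
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                choice = k
                break
        if choice == len(tables):
            tables.append(1)
        else:
            tables[choice] += 1
    return tables
```

Large tables attract new customers in proportion to their size, so a few tables (cached subtrees, in the adaptor grammar reading) end up holding most of the mass.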
Variational Inference in a Nutshell
“Posterior inference” requires that we find parse trees $z_1, \dots, z_n$ given raw sentences $x_1, \dots, x_n$
Mean-field approximation: take all hidden variables $z_1, \dots, z_n$ and parameters $\theta$. Find a posterior of the form $q(z_1, \dots, z_n, \theta) = q(\theta) \prod_{i=1}^{n} q(z_i)$ (makes inference tractable)
Makes independence assumptions in the posterior
That’s all! Almost. We need a manageable representation for $z_1, \dots, z_n$ and $\theta$
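For reference, the objective that a mean-field posterior of this form is usually fit to is the standard variational lower bound on the log-likelihood. The slides do not spell it out, so the statement below is the generic textbook form rather than the paper's exact notation:

```latex
\log p(x_1, \dots, x_n)
  \;\ge\;
  \mathbb{E}_{q}\big[\log p(x_1, \dots, x_n, z_1, \dots, z_n, \theta)\big]
  - \mathbb{E}_{q}\big[\log q(z_1, \dots, z_n, \theta)\big],
\qquad
q(z_1, \dots, z_n, \theta) = q(\theta) \prod_{i=1}^{n} q(z_i)
```

The gap between the two sides is the KL divergence from $q$ to the true posterior, so maximizing the bound over $q(\theta)$ and each $q(z_i)$ pulls the factorized approximation toward the posterior.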
Sampling vs. Variational Inference
| | MCMC sampling | variational inference |
| --- | --- | --- |
| convergence | guaranteed | local maximum |
| speed | slow | fast |
| algorithm | randomized | objective optimization |
| parallelization | non-trivial | easy |
Stick Breaking Representation
Sticks are sampled from the GEM distribution
Everything that is a number on this slide belongs to θ
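A minimal sketch of drawing the first few sticks, assuming the usual Pitman-Yor stick-breaking construction with v_k ~ Beta(1 − discount, concentration + k · discount). The function and parameter names are illustrative and not taken from the slides.

```python
import random

def gem_sticks(num_sticks, discount=0.0, concentration=1.0, seed=0):
    """Draw the first `num_sticks` stick weights of a GEM distribution.

    Returns (weights, remaining), where `remaining` is the stick mass
    not yet broken off. Parameter names are illustrative assumptions.
    """
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for k in range(1, num_sticks + 1):
        # Fraction of the remaining stick to break off at step k.
        v = rng.betavariate(1.0 - discount, concentration + k * discount)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining
```

With `discount=0.0` this reduces to the Dirichlet process case. The leftover `remaining` mass is what any finite truncation of the stick has to account for.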
Truncated Stick Approximation
Sanity Check - Word Segmentation
Task is to segment a sequence of phonemes into words
Example: yuwanttulUk&tDIs → yu want tu lUk &t DIs
Models language acquisition in children (using the corpus from Brent and Cartwright, 1996)
The corpus includes 9,790 utterances
Has been used before with adaptor grammars, with three different grammars
Baseline: Sampling method from Johnson and Goldwater, 2009
Word Segmentation - Grammars
Unigram Grammar
(Sentence (Word yu) (Word want))
Sentence → Word+
Word → Char+
“Word” is adapted (hence, if something was a Word constituent previously, it is more likely to appear again)
There are additional grammars: collocation grammar and syllable grammar (take into account more information about language)
Words are segmented according to “Word” constituents
None of the grammars is recursive
Used in Johnson and Goldwater (2009)
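Under the unigram grammar, every way of cutting an utterance into non-empty words corresponds to a parse, so an utterance of n characters has 2^(n−1) candidate segmentations. A small sketch that enumerates them (the function name is hypothetical, and each character stands for one phoneme):

```python
from itertools import combinations

def segmentations(chars):
    """Yield every split of `chars` into one or more non-empty words.

    Each split corresponds to one parse under
    Sentence -> Word+, Word -> Char+.
    """
    n = len(chars)
    for r in range(n):
        # Choose r internal cut points among the n - 1 possible positions.
        for cuts in combinations(range(1, n), r):
            bounds = (0,) + cuts + (n,)
            yield [chars[i:j] for i, j in zip(bounds, bounds[1:])]
```

For example, `list(segmentations("yuwant"))` contains 32 candidates, one of which is `["yu", "want"]`. Adapting “Word” is what lets previously seen words make some of these candidates much more probable than others.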
Word Segmentation - Results
| grammar | our paper | J&G 2009 |
| --- | --- | --- |
| G_Unigram | 0.84 | 0.81 |
| G_Colloc | 0.86 | 0.86 |
| G_Syllable | 0.83 | 0.89 |

J&G 2009: best result reported by Johnson and Goldwater (2009)
Scores reported are F1
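The slides do not say which F1 variant is used. The sketch below assumes token F1, where a predicted word counts as correct only if both of its boundaries match the gold segmentation. The function names are illustrative.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def segmentation_f1(gold, predicted):
    """Token F1 over a corpus; each argument is a list of word lists."""
    correct = n_gold = n_pred = 0
    for g, p in zip(gold, predicted):
        gs, ps = word_spans(g), word_spans(p)
        correct += len(gs & ps)  # words with both boundaries right
        n_gold += len(gs)
        n_pred += len(ps)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)
```

For the example utterance, predicting `yu want tulUk &t DIs` against the gold `yu want tu lUk &t DIs` gets 4 of 5 predicted words and 4 of 6 gold words right, giving F1 = 8/11.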
Variants
Model: Pitman-Yor Process vs. Dirichlet Process (did not have much effect)
Inference: Fixed Stick vs. Dynamic Stick Expansion (fixed stick is better)
Decoding: Minimum Bayes Risk vs. Viterbi (MBR does better)
See paper for details!
Running Time
Running time (clock time) of the sampler and variational inference is approximately the same (note that implementations are different)
However, variational inference can be parallelized
Reduction in clock time by a factor of 2.8 when parallelizing on 20 weaker CPUs
Syntax and Power Law
[Plot: log frequency (y-axis) against log rank (x-axis) of constituents, for English, Chinese, Portuguese and Turkish]
Motivating adaptor grammars for unsupervised parsing, a plot of log rank of constituents vs. their log frequency
Recursive Grammars
Recursive Grammars - Solution
Our finite approximation of the stick zeros out all “bad” events in the variational distribution
Equivalent to inference when assuming the model is:
$$p'(x, z) = \frac{p(x, z)\, I\big((x, z) \notin \text{bad}\big)}{\sum_{(x, z) \notin \text{bad}} p(x, z)}$$
where $p$ is the original adaptor grammar model, which gives non-zero probability to bad events, and $I$ is a 0/1 indicator
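The renormalization can be illustrated on a toy discrete distribution. This is only a sketch of the formula for p′: the events and the `is_bad` predicate are made up for illustration and have nothing to do with actual grammar derivations.

```python
def renormalize_without_bad(p, is_bad):
    """Zero out bad events and renormalize the remaining probability mass.

    p: dict mapping event -> probability under the original model.
    is_bad: predicate marking the events to exclude (hypothetical).
    """
    mass = sum(prob for event, prob in p.items() if not is_bad(event))
    if mass <= 0:
        raise ValueError("all probability mass lies on bad events")
    return {
        event: (0.0 if is_bad(event) else prob / mass)
        for event, prob in p.items()
    }
```

For instance, with `p = {"a": 0.5, "b": 0.3, "c": 0.2}` and `"c"` marked as bad, the result assigns 0.625 to `"a"`, 0.375 to `"b"` and 0 to `"c"`.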
Unsupervised Parsing Setting
Experiments on the English Penn Treebank
Stripped off punctuation and kept only part-of-speech tags
Used adaptor grammars with the dependency model with valence (DMV; Klein and Manning, 2004)
DMV has a PCFG representation (Smith, 2006)
We “adapt” the nonterminals that head noun constituents
Unsupervised Parsing - Results
| model | Viterbi | MBR |
| --- | --- | --- |
| non-Bayesian | 45.8 | 46.1 |
| Dirichlet prior | 45.9 | 46.1 |
| with adaptor grammars | 48.3 | 50.2 |

Results are attachment accuracy: the fraction of parents correctly identified
A gain over vanilla Dirichlet, which is the prior used with adaptor grammars on the PCFG rules
Other priors (instead of the Dirichlet prior) give performance ∼60. They could be used with adaptor grammars (future work)
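Attachment accuracy as defined above can be computed as follows. This is a minimal sketch, assuming each sentence is given as a list of parent indices, one per token, with 0 standing for the root. That encoding is a convention chosen here, not one stated in the slides.

```python
def attachment_accuracy(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted parent matches the gold parent.

    Each argument is a list of sentences; each sentence is a list with
    one parent index per token (0 = root, an assumed convention).
    """
    correct = total = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(1 for g, p in zip(gold, pred) if g == p)
        total += len(gold)
    return correct / total if total else 0.0
```

With gold parents `[[2, 0, 2], [0, 1]]` and predictions `[[2, 0, 1], [0, 1]]`, four of five parents match, so the accuracy is 0.8 (80 on the scale used in the table).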
Summary
We described a variational inference algorithm for adaptor grammars
We showed it can lead to improvement in performance for various grammars
We showed it can be faster than sampling when parallelization is used
We applied adaptor grammars to unsupervised dependency parsing
Thanks! Questions?
Sampling vs. Variational Inference
[Plot: −log bound/likelihood (y-axis) against time step (x-axis)]