Learning Approximate Inference Policies for Fast Prediction
![Page 1: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/1.jpg)
Jason Eisner
ICML “Inferning” Workshop, June 2012
![Page 2: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/2.jpg)
Beware: Bayesians in Roadway
A Bayesian is the person who writes down the function you wish you could optimize
![Page 3: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/3.jpg)
lexicon (word types), semantics
sentences
discourse context
resources
entailment, correlation
inflection, cognates, transliteration, abbreviation, neologism, language evolution
translation, alignment
editing, quotation
speech, misspellings, typos, formatting, entanglement, annotation
N tokens
To recover variables, model and exploit their correlations
![Page 4: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/4.jpg)
![Page 5: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/5.jpg)
![Page 6: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/6.jpg)
Motivating Tasks
• Structured prediction (e.g., for NLP problems)
  – Parsing (trees)
  – Machine translation (word strings)
  – Word variants (letter strings, phylogenies, grids)
• Unsupervised learning via Bayesian generative models
  – Given a few verb conjugation tables and a lot of text: find/organize/impute all verb conjugation tables of the language
  – Given some facts and a lot of text: discover more facts through information extraction and reasoning
![Page 7: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/7.jpg)
Current Methods
• Dynamic programming
  – Exact but slow
• Approximate inference in graphical models
  – Are approximations any good?
  – May use dynamic programming as subroutine (structured BP)
• Sequential classification
![Page 8: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/8.jpg)
Speed-Accuracy Tradeoffs
• Inference requires lots of computation
  – Is some computation going to waste?
    Sometimes the best prediction is overdetermined …
    Quick ad hoc methods sometimes work: how to respond?
  – Is some computation actively harmful?
    In approximate inference, passing a message can hurt
    Frustrating to simplify model just to fix this
• Want to keep improving our models!
• But need good fast approximate inference
• Choose approximations automatically
  – Tuned to data distribution & loss function
  – “Trainable hacks” – more robust
![Page 9: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/9.jpg)
This talk is about “trainable hacks”
Prediction device
(suitable for domain)
training data
likelihood
feedback
![Page 10: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/10.jpg)
This talk is about “trainable hacks”
Prediction device
(suitable for domain)
training data
loss + runtime
feedback
![Page 11: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/11.jpg)
Bayesian Decision Theory
Prediction rule
Loss
Data distribution
Optimized parameters of prediction rule
• What prediction rule? (approximate inference + beyond)
• What loss function? (can include runtime)
• How to optimize? (backprop, RL, …)
• What data distribution? (may have to impute)
![Page 12: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/12.jpg)
This talk is about “trainable hacks”
Prediction device
(suitable for domain)
Probabilistic domain model
Partial data
Complete training data
loss + runtime
feedback
![Page 13: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/13.jpg)
Part 1: Your favorite approximate inference algorithm is a trainable hack
![Page 14: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/14.jpg)
General CRFs: Unrestricted model structure
• Add edges to model the conditional distribution well.
• But exact inference is intractable.
• So use loopy sum-product or max-product BP.
[Figure: a CRF over output variables Y1, Y2, Y3, Y4 and input variables X1, X2, X3]
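As a rough illustration of the sum-product updates that loopy BP iterates, here is a minimal sketch on a hypothetical three-variable binary chain. The potentials, the chain shape, and all function names are assumptions made up for this example, not taken from the talk. On a loop-free chain the beliefs are exact, which lets the sketch check itself against brute-force enumeration; on a loopy graph the same updates would be repeated until (possible) convergence and the beliefs would only be approximate.

```python
from itertools import product

# Hypothetical 3-variable binary chain Y1 - Y2 - Y3 (no loops, so BP is exact here).
unary = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]]   # unary[i][value]
pair = [[2.0, 1.0], [1.0, 2.0]]                # symmetric potential shared by both edges

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def send(incoming, u):
    """Message to a neighbor: sum over the sender's value of unary * incoming * pairwise."""
    return normalize([sum(u[a] * incoming[a] * pair[a][b] for a in range(2))
                      for b in range(2)])

ones = [1.0, 1.0]
m12 = send(ones, unary[0])      # Y1 -> Y2
m23 = send(m12, unary[1])       # Y2 -> Y3
m32 = send(ones, unary[2])      # Y3 -> Y2
m21 = send(m32, unary[1])       # Y2 -> Y1

# Belief at each variable: unary potential times all incoming messages.
beliefs = [
    normalize([unary[0][a] * m21[a] for a in range(2)]),
    normalize([unary[1][a] * m12[a] * m32[a] for a in range(2)]),
    normalize([unary[2][a] * m23[a] for a in range(2)]),
]

def brute_force_marginals():
    """Exact marginals by enumerating all 8 joint assignments."""
    marg = [[0.0, 0.0] for _ in range(3)]
    for y in product(range(2), repeat=3):
        w = (unary[0][y[0]] * unary[1][y[1]] * unary[2][y[2]]
             * pair[y[0]][y[1]] * pair[y[1]][y[2]])
        for i in range(3):
            marg[i][y[i]] += w
    return [normalize(m) for m in marg]
```

Replacing the inner `sum` in `send` with `max` would give the max-product variant.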
![Page 15: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/15.jpg)
General CRFs: Unrestricted model structure
Inference: compute properties of the posterior distribution.

| Token | Tag marginals |
|---|---|
| The | DT .9, NN .05, … |
| cat | NN .8, JJ .1, … |
| sat | VBD .7, VB .1, … |
| on | IN .9, NN .01, … |
| the | DT .9, NN .05, … |
| mat | NN .4, JJ .3, … |
| . | “.” .99, “,” .001, … |
![Page 16: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/16.jpg)
General CRFs: Unrestricted model structure
Decoding: coming up with predictions from the results of inference.
The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./.
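A minimal sketch of one possible decoder, assuming per-token argmax over the tag marginals shown on the inference slide. The slides do not say which decoder is used, so this choice is an assumption. The marginals are copied from the slide (only the tags listed there), and reading the last entry as “.” with .99 and “,” with .001 is also an assumption.

```python
# Per-token tag marginals as listed on the inference slide (truncated to the tags shown).
tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
marginals = [
    {"DT": 0.9, "NN": 0.05},
    {"NN": 0.8, "JJ": 0.1},
    {"VBD": 0.7, "VB": 0.1},
    {"IN": 0.9, "NN": 0.01},
    {"DT": 0.9, "NN": 0.05},
    {"NN": 0.4, "JJ": 0.3},
    {".": 0.99, ",": 0.001},   # assumed reading of the garbled last column
]

def decode(marginals):
    """Pick the highest-probability tag independently for each token."""
    return [max(m, key=m.get) for m in marginals]

prediction = decode(marginals)
```

Choosing the most probable tag at each position minimizes the expected number of per-token errors under the (approximate) posterior; a different loss would generally call for a different decoder.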
![Page 17: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/17.jpg)
General CRFs: Unrestricted model structure
• One uses CRFs with several approximations:
  – Approximate inference.
  – Approximate decoding.
  – Mis-specified model structure.
  – MAP training (vs. Bayesian).
• Could be present in linear-chain CRFs as well.
• Why are we still maximizing data likelihood?
  – Our system is more like a Bayes-inspired neural network that makes predictions.
![Page 18: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/18.jpg)
Train directly to minimize task loss (Stoyanov, Ropson, & Eisner 2011; Stoyanov & Eisner 2012)
“Empirical Risk Minimization under Approximations (ERMA)”
x → p(y|x) → (Appr.) Inference → (Appr.) Decoding → ŷ → L(y*,ŷ)
Black box decision function parameterized by θ
• Adjust θ to (locally) minimize training loss
  – E.g., via back-propagation (+ annealing)
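One way to write this training criterion down is sketched here. The training-set size N, the indexed pairs (x_i, y_i*), and the names infer and decode are notation introduced for this illustration and do not appear on the slide.

```latex
\theta^{*} \;=\; \operatorname*{argmin}_{\theta}\; \frac{1}{N}\sum_{i=1}^{N} L\bigl(y_i^{*},\, \hat{y}_{\theta}(x_i)\bigr),
\qquad
\hat{y}_{\theta}(x) \;=\; \mathrm{decode}\bigl(\mathrm{infer}_{\theta}(x)\bigr)
```

Here the whole pipeline of approximate inference followed by approximate decoding is treated as a single function of θ, so the approximations are part of what gets trained.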
![Page 19: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/19.jpg)
![Page 20: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/20.jpg)
Optimization Criteria
MLE
![Page 21: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/21.jpg)
![Page 22: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/22.jpg)
![Page 23: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/23.jpg)
Experimental Results
• 3 NLP problems; also synthetic data
• We show that:
  – General CRFs work better when they match dependencies in the data.
  – Minimum risk training results in more accurate models.
• ERMA software package available at www.clsp.jhu.edu/~ves/software
![Page 24: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/24.jpg)
ERMA software package
http://www.clsp.jhu.edu/~ves/software
• Includes syntax for describing general CRFs.
• Supports sum-product and max-product BP.
• Can optimize several commonly used loss functions: MSE, Accuracy, F-score.
• The package is generic:
  – Little effort to model new problems.
  – About 1-3 days to express each problem in our formalism.
![Page 25: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/25.jpg)
![Page 26: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/26.jpg)
![Page 27: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/27.jpg)
Modeling Congressional Votes
The ConVote corpus [Thomas et al., 2006]
First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…
Yea
Had it not been for the heroic actions of the passengers of United flight 93 who forced the plane down over Pennsylvania, congress's ability to serve …
Yea
Mr. Sensenbrenner
![Page 28: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/28.jpg)
![Page 29: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/29.jpg)
![Page 30: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/30.jpg)
![Page 31: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/31.jpg)
Modeling Congressional Votes
An example from the ConVote corpus [Thomas et al., 2006]
• Predict representative votes based on debates.
First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…
[Figure labels: Y/N, Text, Y/N, Context, Text]
![Page 32: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/32.jpg)
![Page 33: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/33.jpg)
![Page 34: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/34.jpg)
![Page 35: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/35.jpg)
![Page 36: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/36.jpg)
Modeling Congressional Votes

| Model | Accuracy |
|---|---|
| Non-loopy baseline (2 SVMs + min-cut) | 71.2 |
| Loopy CRF (inference via loopy sum-prod BP): maximum-likelihood training (with approximate inference) | 78.2 |
| Loopy CRF: softmax-margin (loss-aware) | 79.0 |
| Loopy CRF: ERMA (loss- and approximation-aware) | 84.5 |

*Boldfaced results are significantly better than all others (p < 0.05)
![Page 37: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/37.jpg)
![Page 38: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/38.jpg)
Information Extraction from Semi-Structured Text
What: Special Seminar
Who: Prof. Klaus Sutner, Computer Science Department, Stevens Institute of Technology
Topic: "Teaching Automata Theory by Computer"
Date: 12-Nov-93
Time: 12:00 pm
Place: WeH 4623
Host: Dana Scott (Asst: Rebecca Clark x8-6737)
ABSTRACT: We will demonstrate the system "automata" that implements finite state machines… …After the lecture, Prof. Sutner will be glad to demonstrate and discuss the use of MathLink and his "automata" package
Annotated fields: start time, location, speaker, speaker
CMU Seminar Announcement Corpus [Freitag, 2000]
![Page 39: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/39.jpg)
Skip-Chain CRF for Info Extraction
Extract speaker, location, stime, and etime from seminar announcement emails
Token/label pairs: Who:/O Prof./S Klaus/S Sutner/S … will/O Prof./S Sutner/S …
CMU Seminar Announcement Corpus [Freitag, 2000]
Skip-chain CRF [Sutton and McCallum, 2005; Finkel et al., 2005]
![Page 40: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/40.jpg)
![Page 41: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/41.jpg)
![Page 42: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/42.jpg)
![Page 43: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/43.jpg)
Semi-Structured Information Extraction

| Model | F1 |
|---|---|
| Non-loopy baseline (linear-chain CRF) | 86.2 |
| Non-loopy baseline + ERMA (trained for loss instead of likelihood) | 87.1 |
| Loopy CRF (inference via loopy sum-prod BP): maximum-likelihood training (with approximate inference) | 89.5 |
| Loopy CRF: softmax-margin (loss-aware) | 90.2 |
| Loopy CRF: ERMA (loss- and approximation-aware) | 90.9 |

*Boldfaced results are significantly better than all others (p < 0.05).
![Page 44: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/44.jpg)
Collective Multi-Label Classification
Reuters Corpus Version 2 [Lewis et al., 2004]
The collapse of crude oil supplies from Libya has not only lifted petroleum prices, but added a big premium to oil delivered promptly. Before protests began in February against Muammer Gaddafi, the price of benchmark European crude for imminent delivery was $1 a barrel less than supplies to be delivered a year later.…
Labels: Oil, Libya, Sports
![Page 45: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/45.jpg)
![Page 46: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/46.jpg)
![Page 47: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/47.jpg)
Collective Multi-Label Classification
[Ghamrawi and McCallum, 2005; Finley and Joachims, 2008]
![Page 48: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/48.jpg)
![Page 49: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/49.jpg)
![Page 50: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/50.jpg)
![Page 51: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/51.jpg)
Multi-Label Classification

| Model | F1 |
|---|---|
| Non-loopy baseline (logistic regression for each label) | 81.6 |
| Loopy CRF (inference via loopy sum-prod BP): maximum-likelihood training (with approximate inference) | 84.0 |
| Loopy CRF: softmax-margin (loss-aware) | 83.8 |
| Loopy CRF: ERMA (loss- and approximation-aware) | 84.6 |

*Boldfaced results are significantly better than all others (p < 0.05)
![Page 52: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/52.jpg)
Summary

| | Congressional Vote Modeling (Accuracy) | Semi-str. Inf. Extraction (F1) | Multi-label Classification (F1) |
|---|---|---|---|
| Non-loopy baseline | 71.2 | 87.1 | 81.6 |
| Loopy CRF models: maximum-likelihood training | 78.2 | 89.5 | 84.0 |
| Loopy CRF models: ERMA | 84.5 | 90.9 | 84.6 |
![Page 53: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/53.jpg)
Synthetic Data
• Generate a CRF at random
  – Random structure & parameters
• Use Gibbs sampling to generate data
• Forget the parameters
  – Optionally add noise to the structure
• Learn the parameters from the sampled data
• Evaluate using one of four loss functions
• Total of 12 models of different size/connectivity
![Page 54: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/54.jpg)
Synthetic Data: Results

| Test Loss | Train Objective | Δ Loss compared to true model | wins/ties/losses (over 12 models) |
|---|---|---|---|
| MSE | ApprLogL (baseline) | .71 | |
| MSE | MSE | .05 | 12/0/0 |
| Accuracy | ApprLogL (baseline) | .75 | |
| Accuracy | Accuracy | .01 | 11/0/1 |
| F-Score | ApprLogL (baseline) | 1.17 | |
| F-Score | F-Score | .08 | 10/2/0 |
| ApprLogL | ApprLogL (baseline) | -.31 | |
![Page 55: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/55.jpg)
Introducing Structure Mismatch
![Page 56: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/56.jpg)
Back-Propagation of Error for Empirical Risk Minimization
• Back-propagation of error (automatic differentiation in the reverse mode) to compute gradients of the loss with respect to θ.
• Gradient-based local optimization method to find the θ* that (locally) minimizes the training loss.
x → Black box decision function parameterized by θ → L(y*,ŷ)
(independently done by Domke 2010, 2011)
![Page 57: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/57.jpg)
![Page 58: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/58.jpg)
Back-Propagation of Error for Empirical Risk Minimization
x → Neural network → L(y*,ŷ)
![Page 59: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/59.jpg)
![Page 60: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/60.jpg)
Back-Propagation of Error for Empirical Risk Minimization
x → CRF System (Y1, Y2, Y3, Y4; X1, X2, X3) → L(y*,ŷ)
![Page 61: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/61.jpg)
Error Back-Propagation
[Figure labels: VoteReidbill77; P(VoteReidbill77=Yea | x); θ]
m(y1→y2) = m(y3→y1) * m(y4→y1)
![Page 71: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/71.jpg)
Error Back-Propagation
• Applying the differentiation chain rule over and over.
• Forward pass:
  – Regular computation (inference + decoding) in the model (+ remember intermediate quantities).
• Backward pass:
  – Replay the forward pass in reverse, computing gradients.
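The forward/backward recipe can be sketched on a deliberately tiny computation. Everything here is a hypothetical stand-in: a single parameter `theta`, a sigmoid in place of inference, and a squared-error loss in place of the task loss. The real system differentiates through loopy BP and decoding, which this sketch does not attempt.

```python
import math

def forward(theta, x, y_true):
    """Forward pass: compute the loss and remember intermediate quantities."""
    z = theta * x                       # score
    p = 1.0 / (1.0 + math.exp(-z))      # stand-in "belief" for the positive label
    loss = (p - y_true) ** 2            # squared-error loss
    return loss, {"x": x, "p": p, "y_true": y_true}

def backward(tape):
    """Backward pass: replay the forward steps in reverse, applying the chain rule."""
    d_loss = 1.0                                        # dL/dL
    d_p = d_loss * 2.0 * (tape["p"] - tape["y_true"])   # dL/dp
    d_z = d_p * tape["p"] * (1.0 - tape["p"])           # dL/dz (sigmoid derivative)
    d_theta = d_z * tape["x"]                           # dL/dtheta
    return d_theta

theta, x, y_true = 0.3, 2.0, 1.0
loss, tape = forward(theta, x, y_true)
grad = backward(tape)

# Sanity check against a central finite difference.
eps = 1e-6
numeric = (forward(theta + eps, x, y_true)[0]
           - forward(theta - eps, x, y_true)[0]) / (2 * eps)
```

The backward pass touches each remembered quantity once, which is why its cost stays within a small constant factor of the forward pass.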
![Page 72: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/72.jpg)
The Forward Pass
• Run inference and decoding:
θ → [Inference (loopy BP)] → messages → beliefs → [Decoding] → output → [Loss] → L
![Page 73: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/73.jpg)
The Backward Pass
• Replay the computation backward, calculating gradients:
θ → [Inference (loopy BP)] → messages → beliefs → [Decoding] → output → [Loss] → L
ð(L)=1 → ð(output) → ð(beliefs) → ð(messages) → ð(θ), where ð(f) = ∂L/∂f
![Page 74: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/74.jpg)
Gradient-Based Optimization
• Use a local optimizer to find the θ* that minimizes training loss.
• In practice, we use a second-order method, Stochastic Meta Descent (Schraudolph 1999).
  – Some more automatic differentiation magic needed to compute vector-Hessian products (Pearlmutter 1994).
• Both gradient and vector-Hessian computation have the same complexity as the forward pass (up to a small constant factor).
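To make the idea of a Hessian-vector product without forming the Hessian concrete, here is a small sketch. It uses a central finite difference of gradients as a stand-in, which is not the exact automatic-differentiation method of Pearlmutter (1994) cited on the slide. The quadratic objective, the matrix `A`, and all names are assumptions chosen so the result can be checked by hand.

```python
A = [[3.0, 1.0], [1.0, 2.0]]   # symmetric, so the Hessian of the quadratic below is A
b = [1.0, -1.0]

def grad(theta):
    """Gradient of f(theta) = 0.5 * theta^T A theta - b^T theta."""
    return [sum(A[i][j] * theta[j] for j in range(2)) - b[i] for i in range(2)]

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Approximate H v by a central difference of gradients along direction v."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

theta = [0.5, -0.25]
v = [1.0, 2.0]
hv = hessian_vector_product(grad, theta, v)

# For this quadratic the exact answer is simply A v.
exact = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]
```

This version costs two gradient evaluations per product and never builds the full Hessian; the exact reverse-mode approach has a similar cost profile but avoids the choice of `eps`.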
![Page 75: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/75.jpg)
Deterministic Annealing
• Some loss functions are not differentiable (e.g., accuracy).
• Some inference methods are not differentiable (e.g., max-product BP).
• Replace Max with Softmax and anneal.
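A minimal sketch of the max-to-softmax replacement, assuming a temperature parameter that is lowered over time. The scores and the annealing schedule are made up for illustration, and the talk does not specify its schedule. As the temperature drops, the smooth surrogate approaches the hard max.

```python
import math

def softmax(scores, temperature):
    """Temperature-scaled softmax; subtracting the max keeps exp() from overflowing."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_max_value(scores, temperature):
    """Differentiable surrogate for max(scores): the softmax-weighted average."""
    weights = softmax(scores, temperature)
    return sum(w * s for w, s in zip(weights, scores))

scores = [1.0, 2.0, 4.0]
schedule = [4.0, 1.0, 0.25, 0.01]     # hypothetical annealing schedule (high to low)
values = [soft_max_value(scores, t) for t in schedule]
```

At high temperature the surrogate is smooth and its gradients are informative; at low temperature it matches the non-differentiable objective more closely, which is the motivation for annealing rather than fixing one temperature.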
![Page 76: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/76.jpg)
Part 1: Your favorite approximate inference algorithm is a trainable hack
Part 2: What other trainable inference devices can we devise?
Prediction device (suitable for domain)
Preferably can tune for speed-accuracy tradeoff (Horvitz 1989, “flexible computation”)
![Page 77: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/77.jpg)
1. Lookup methods
• Hash tables
• Memory-based learning
• Dual-path models (look up if possible, else do deeper inference)
  (in general, dynamic mixtures of policies: Halpern & Pass 2010)
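The dual-path idea can be sketched as a hash-table lookup with a fallback. This is only one simple instance (plain memoization); the class name, the counter, and the toy "slow" predictor are all hypothetical and stand in for whatever deeper inference a real system would run.

```python
class DualPathPredictor:
    """Answer from a lookup table when possible, else fall back to slower inference."""

    def __init__(self, slow_predict):
        self.slow_predict = slow_predict   # hypothetical expensive inference routine
        self.table = {}                    # memoized input -> prediction
        self.slow_calls = 0

    def predict(self, x):
        if x in self.table:                # fast path: hash-table lookup
            return self.table[x]
        self.slow_calls += 1               # slow path: deeper inference
        y = self.slow_predict(x)
        self.table[x] = y
        return y

def toy_slow_predict(word):
    # Stand-in for real inference: label a word by its final letter.
    return "PLURAL" if word.endswith("s") else "SINGULAR"

predictor = DualPathPredictor(toy_slow_predict)
outputs = [predictor.predict(w) for w in ["cats", "mat", "cats", "cats"]]
```

A trainable version would learn when to trust the table (or a nearest-neighbor match) and when to pay for the slow path, trading accuracy against runtime.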
![Page 78: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/78.jpg)
2. Choose Fast Model Structure
• Static choice of fast model structure (Sebastiani & Ramoni 1998)
  – Learning a low-treewidth model (e.g., Bach & Jordan 2001, Narasimhan & Bilmes 2004)
  – Learning a sparse model (e.g., Lee et al. 2007)
  – Learning an approximate arithmetic circuit (Lowd & Domingos 2010)
• Dynamic choice of fast model structure
  – Dynamic feature selection (Dulac-Arnold et al., 2011; Busa-Fekete et al., 2012; He et al., 2012; Stoyanov & Eisner, 2012)
  – Evidence-specific tree (Chechetka & Guestrin 2010)
  – Data-dependent convex optimization problem (Domke 2008, 2012)
![Page 79: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/79.jpg)
3. Pruning Unlikely Hypotheses
Tune aggressiveness of pruning
  Pipelines, cascades, beam-width selection
  Classifiers or competitive thresholds
  E.g., Taskar & Weiss 2010, Bodenstab et al. 2011
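As one concrete instance of tunable pruning, here is a minimal beam search for sequence labeling in which `beam_width` sets how aggressively hypotheses are discarded. The scoring scheme (per-position label scores plus pairwise transition scores) and all names are assumptions for illustration, not a method taken from the cited papers.

```python
def beam_search(emissions, transition, beam_width):
    """Label a sequence left to right, keeping only the top hypotheses.

    emissions: list of {label: score} dicts, one per position.
    transition: {(prev_label, label): score}; missing pairs score 0.
    Returns (score, labels) for the best hypothesis that survived pruning.
    """
    beam = [(0.0, [])]
    for scores in emissions:
        candidates = []
        for total, labels in beam:
            for label, s in scores.items():
                t = transition.get((labels[-1], label), 0.0) if labels else 0.0
                candidates.append((total + s + t, labels + [label]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]  # prune everything below the beam
    return beam[0]


emissions = [{"A": 1.0, "B": 0.5}, {"A": 0.0, "B": 0.0}]
transition = {("A", "A"): -1.0, ("A", "B"): -1.0, ("B", "B"): 2.0}
print(beam_search(emissions, transition, beam_width=1))  # greedy
print(beam_search(emissions, transition, beam_width=2))  # wider beam
```

With width 1 the locally better label "A" is kept and the better overall sequence is lost; width 2 recovers it at roughly twice the work, which is the speed-accuracy knob one would tune.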
![Page 80: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/80.jpg)
4. Pruning Work During Search
Early stopping
  Message-passing inference (Stoyanov et al. 2011)
![Page 81: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/81.jpg)
ERMA: Increasing Speed by Early Stopping (synthetic data)
![Page 82: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/82.jpg)
4. Pruning Work During Search
Early stopping before convergence
  Message-passing inference (Stoyanov et al. 2011)
  Agenda-based dynamic prog. (Jiang et al. 2012) – approximate A*!
Update some messages more often
  In generalized BP, some messages are more complex
  Order of messages also affects convergence rate
  Cf. residual BP
  Cf. flexible arithmetic circuit computation (Filardo & Eisner 2012)
Coarsen or drop messages selectively
  Value of computation
  Cf. expectation propagation (for likelihood only)
![Page 83: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/83.jpg)
5. Sequential Decisions with Revision
Common to use sequential decision processes for structured prediction
  MaltParser, SEARN, etc.
![Page 84: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/84.jpg)
Algorithm example (from Joakim Nivre)
[Figure: step-by-step transition-based dependency parse of the sentence below, with ROOT at position 0.]
Tokens (1-9): Economic news had little effect on financial markets .
POS tags: JJ NN VBD JJ NN IN JJ NNS .
Parser actions shown: SHIFT, REDUCE, LA(label), RA(label)
Arc labels in the tree: NMOD, SBJ, OBJ, PMOD, P
![Page 85: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/85.jpg)
5. Sequential Decisions with Revision
Common to use sequential decision processes for structured prediction
  MaltParser, SEARN, etc.
Often treated as reinforcement learning
  Cumulative or delayed reward
  Try to avoid “contagious” mistakes
New opportunity: Enhanced agent that can backtrack and fix errors
  The flip side of RL lookahead! (only in a forgiving environment)
  Sometimes can observe such agents (in psych lab)
  Or widen its beam and explore in parallel
![Page 86: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/86.jpg)
Open Questions
Effective algorithm that dynamically assesses value of computation.
Theorems of the following form: If true model comes from distribution P, then with high probability there exists a fast/accurate policy in the policy space. (better yet, find the policy!)
Effective policy learning methods.
![Page 87: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/87.jpg)
On Policy Learning Methods …
Basically large reinforcement learning problems
  But rather strange ones! (Eisner & Daumé 2011)
  Policy → (priorities) → trajectory → reward
  Often, many equivalent trajectories will get the same answer
Search in policy parameter space
  Policy gradient (doesn’t work)
  Direct search (e.g., Nelder-Mead)
Search in priority space
  Need a surrogate objective, like A*
Search in trajectory space
  SEARN (too slow for some controllers)
  Loss-augmented inference (Chiang et al. 2009; McAllester et al. 2010)
  Response surface methodology (really searches in policy space)
  Integer linear programming
![Page 88: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/88.jpg)
Part 1: Your favorite approximate inference algorithm is a trainable hack
Part 2: What other trainable inference devices can we devise?
Part 3: Beyond ERMA to IRMA
ERMA: Empirical Risk Minimization under Approximations
IRMA: Imputed Risk Minimization under Approximations
![Page 89: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/89.jpg)
Where does p(x, y) come from?
Prediction rule
Loss
Data distribution
Optimized parameters of prediction rule
![Page 90: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/90.jpg)
Generative vs. discriminative
training data vs. dev data (Raina et al. 2003)
unsupervised vs. supervised data (McCallum et al. 2006)
regularization vs. empirical loss (Lasserre et al. 2006)
data distribution vs. decision rule (this work; cf. Lacoste-Julien 2011)
engineering
science
Optimized parameters of prediction rule
![Page 91: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/91.jpg)
Data imputation (Little & Rubin 1987)
May need to “complete” missing data
  What are we given?
  How do we need to complete it?
  How do we complete it?
engineering
science
Optimized parameters of prediction rule
![Page 92: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/92.jpg)
1. Have plenty of inputs; impute outputs
“Model compression / uptraining / structure compilation”
  GMM -> VQ (Hedelin & Skoglund 2000)
  ensemble -> single classifier (Bucila et al. 2006)
  sparse coding -> regression or NN (Kavukcuoglu et al., 2008; Jarrett et al., 2009; Gregor & LeCun, 2010)
  CRF or PCFG -> local classifiers (Liang, Daumé & Klein 2008)
  latent-variable PCFG -> deterministic sequential parser (Petrov et al. 2010)
    sampling instead of 1-best
  [stochastic] local search -> regression (Boyan & Moore 2000)
  k-step planning in an MDP -> classification or k'-step planning (e.g., rollout in Bertsekas 2005; Ross et al. 2011, DAgger)
  BN -> arithmetic circuit (Lowd & Domingos, 2010)
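The common pattern in the list above (a slow model labels plentiful inputs, then a fast model is trained on those imputed outputs) can be sketched in a few lines. This toy version compresses a small voting ensemble into a single threshold; the models, data, and names (`ensemble_predict`, `fit_stump`, `fast_predict`) are hypothetical and far simpler than anything in the cited work.

```python
def ensemble_predict(x, thresholds=(0.8, 1.0, 1.2)):
    """Hypothetical slow model: majority vote of several threshold classifiers."""
    votes = sum(1 for t in thresholds if x > t)
    return 1 if votes * 2 > len(thresholds) else 0


def fit_stump(xs, ys):
    """Fast model: the single threshold with the fewest errors on (xs, ys)."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum(1 for x, y in zip(xs, ys) if (1 if x > t else 0) != y)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


unlabeled = [i / 10.0 for i in range(21)]            # plenty of inputs, no gold outputs
imputed = [ensemble_predict(x) for x in unlabeled]   # outputs imputed by the slow model
threshold = fit_stump(unlabeled, imputed)            # train the fast model on them


def fast_predict(x):
    return 1 if x > threshold else 0


print(threshold, all(fast_predict(x) == y for x, y in zip(unlabeled, imputed)))
```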
![Page 93: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/93.jpg)
2. Have good outputs; impute inputs
Distort inputs from input-output pairs
  Abu-Mostafa 1995
  SVMs can be regarded as doing this too!
Structured prediction: Impute possible missing inputs
  Impute many Chinese sentences that should translate into each observed English sentence (Li et al., 2011)
![Page 94: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/94.jpg)
3. Insufficient training data to impute well
Assumed that we have a good slow model at training time
But what if we don’t?
Could sample from posterior over model parameters as well …
![Page 95: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/95.jpg)
4. Statistical Relational Learning
May only have one big incomplete training example!
Sample jointly from (model parameters, completions of the data)
  Need a censorship model to mask data plausibly
  Need a distribution over queries as well – query is part of (x,y) pair
What model should we use here?
  Start with a “base MRF” to allow domain-specific inductive bias
    some variables rarely observed; some values rarely observed
  But try to respect the marginals we can get good estimates of
  Want IRMA → ERMA as we get more and more training data
    Need a high-capacity model to get consistency
  Learn MRF close to base MRF? Use a GP based on the base MRF?
![Page 96: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/96.jpg)
Summary: Speed-Accuracy Tradeoffs
Inference requires lots of computation
  Is some computation going to waste?
    Sometimes the best prediction is overdetermined …
    Quick ad hoc methods sometimes work: how to respond?
  Is some computation actively harmful?
    In approximate inference, passing a message can hurt
    Frustrating to simplify model just to fix this
Want to keep improving our models!
But need good fast approximate inference
Choose approximations automatically
  Tuned to data distribution & loss function
  “Trainable hacks” – more robust
![Page 97: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/97.jpg)
Summary: Bayesian Decision Theory
Prediction rule
Loss
Data distribution
What prediction rule? (approximate inference + beyond)
What loss function? (can include runtime)
How to optimize? (backprop, RL, …)
What data distribution? (may have to impute)
Optimized parameters of prediction rule
![Page 98: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/98.jpg)
FIN
![Page 99: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/99.jpg)
Current Collaborators
Hal Daumé (+ 2 UMD students)
René Vidal (+ students)
Matt Gormley
Katherine Wu, Jay Feldman, Tim Vieira, Adam Teichert, Michael Paul
Nick Andrews, Henry Pao, Wes Filardo, Jason Smith, Ariya Rastrow
Ves Stoyanov, Ben Van Durme, Mark Dredze, Yanif Ahmad (+ student)
Frank Ferraro
Undergrads & junior grad students
Mid to senior grad students
Faculty
![Page 100: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/100.jpg)
NLP Tasks
15-20 years of introducing new formalisms, models & algorithms across NLP
Parsing
  Dependency, constituency, categorial, …
  Deep syntax
  Grammar induction
Word-internal modeling
  Morphology
  Phonology
  Transliteration
  Named entities
Translation
  Syntax-based (synchronous, quasi-synchronous, training, decoding)
Miscellaneous
  Tagging, sentiment, text cat, topics, coreference, web scraping …
Generic algorithms on automata, hypergraphs, graphical models
![Page 101: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/101.jpg)
Current Guiding Themes
1. Principled Bayesian models of various interesting NLP domains.
  Discover underlying structure with little supervision
  Requires new learning and inference algorithms
2. Learn fast, accurate policies for structured prediction and large-scale relational reasoning.
3. Unified computational infrastructure for NLP and AI.
  A declarative programming language that supports modularity
  Backed by a searchable space of strategies & data structures
Machine learning + linguistic structure.
  Fashion statistical models that capture good intuitions about various kinds of linguistic structure.
  Develop efficient algorithms to apply these models to data.
  Be generic.
![Page 102: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/102.jpg)
Fast but Principled Reasoning to Analyze Data
Principled: New models suited to the data + new inference algorithms for those models = draw appropriate conclusions from data
Fast prediction: Inference algorithms + approximations trained to balance speed & accuracy = 80% of the benefit at 20% of the cost
Reusable frameworks for modeling & prediction
![Page 103: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/103.jpg)
Word-Internal Modeling
Variation in a name within and across languages
E step: re-estimate distribution over all spanning trees
  Requires: Corpus model with sequential generation, copying, mutation
M step: re-estimate name mutation model along likely tree edges
  Requires: Trainable parametric model of name mutation
![Page 104: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/104.jpg)
Word-Internal Modeling
Variation in a name within and across languages
![Page 105: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/105.jpg)
Word-Internal Modeling
Spelling of named entities
The “gazetteer problem” in NER systems
  Using gazetteer features helps performance on in-gazetteer names.
  But hurts performance on out-of-gazetteer names!
  Spelling features essentially do not learn from the in-gazetteer names.
Solution: Generate your gazetteer
  Treat the gazetteer itself as training data for a generative model of entity names.
    Includes spelling features.
    Non-parametric model generates good results.
  Include this sub-model within a full NER model.
    Not obvious how, especially for a discriminative NER model.
    Can exploit additional gazetteer data, such as town population.
Problem & solution extend to other dictionary resources in NLP
  Acronyms, IM-speak, cognate translations, …
![Page 106: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/106.jpg)
Word-Internal Modeling
Inference over multiple strings
2011 dissertation by Markus Dreyer
  Organize corpus tokens into morphological paradigms
  Infer missing forms
![Page 107: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/107.jpg)
String and sequence modeling
Optimal inference of strings
2011 dissertation by Markus Dreyer
  Organize corpus types into morphological paradigms
  Infer missing forms
  Cool model – but exact inference is intractable, even undecidable
Dual decomposition to the rescue?
  Will allow MAP inference in such models
  Message passing algorithm
  If it converges, the answer is guaranteed correct
Wasn’t obvious how to infer strings by dual decomposition
  We have one technique and are working on others
So far, we’ve applied it to intersecting many automata
  E.g., exact consensus output of ASR or MT systems
  Usually converges reasonably quickly
![Page 108: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/108.jpg)
String and sequence modeling
Optimal inference of strings
O(100*n*g) per iteration
![Page 109: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/109.jpg)
Grammar Induction
Finding the “best” grammar is a horrible optimization problem
  Even for overly simple definitions of “best”
Two new attacks:
Mathematical programming techniques
  Branch and bound + Dantzig-Wolfe decomposition over the sentences + Stochastic local search
Deep learning
  “Inside” and “outside” strings should depend on each other only through a nonterminal (context-freeness)
  CCA should be able to find that nonterminal (spectral learning)
  But first need vector representations of inside and outside strings
  So use CCA to build up representations recursively (deep learning)
![Page 110: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/110.jpg)
Improved Topic Models
Results improve on the state of the art
What can we learn from distributional properties of words?
Some words group together into “topics.”
  Tend to co-occur in documents; or have similar syntactic arguments.
But are there further hidden variables governing this?
  Try to get closer to underlying meaning or discourse space.
Future: Embed words or phonemes in a structured feature space whose structure must be learned
![Page 111: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/111.jpg)
Applied NLP Tasks
Results improve on the state of the art
Add more global features to the model …
  Need approximate inference, but it’s worth it
  Especially if we train for the approximate inference condition
Within-document coreference
  Build up properties of the underlying entities
    Gender, number, animacy, semantic type, head word
Sentiment polarity
  Exploit cross-document references that signal (dis)agreement between two authors
Multi-label text categorization
  Exploit correlations between labels on same document
Information extraction
  Exploit correlations between labels on faraway words
![Page 112: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/112.jpg)
Database generated websites
Post ID Author
520 Demon
521 Ushi
Author Title
Demon Moderator
Ushi Pink Space Monkey
Author Location
Demon Pennsylvania
Ushi Where else?
Database back-end Web-page code produced by querying DB
(...) (...)
![Page 113: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/113.jpg)
Website generated databases*
Post ID Author
520 Demon
521 Ushi
Author Title
Demon Moderator
Ushi Pink Space Monkey
Author Location
Demon Pennsylvania
Ushi Where else?
Recovered database Given web pages
* Thanks, Bayes!
We state a prior over annotated grammars
And a prior over database schemas
And a prior over database contents
![Page 114: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/114.jpg)
Relational database Webpages
Why isn’t this easy?
  Could write a custom script …
  … for every website in every language?? (and maintain it??)
Why are database-backed websites important?
  1. Vast amounts of useful information are published this way! (most?)
  2. In 2007, Dark Web project @ U. Arizona estimated 50,000 extremist/terrorist websites; fastest growth was in Web 2.0 sites
    Some were transient sites, or subcommunities on larger sites
  3. Our techniques could extend to analyze other semistructured docs
Why are NLP methods relevant?
  Like NL, these webpages are meant to be read by humans
  But they’re a mixture of NL text, tables, semi-structured data, repetitive formatting …
  Harvest NL text + direct facts (including background facts for NLP)
  Helpful that HTML is a tree: we know about those
![Page 115: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/115.jpg)
Shopping & auctions (with user comments)
![Page 116: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/116.jpg)
News articles & blogs...
![Page 117: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/117.jpg)
...with user comments
![Page 118: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/118.jpg)
Crime listings
![Page 119: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/119.jpg)
Social networking
![Page 120: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/120.jpg)
Collaborative encyclopedias
![Page 121: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/121.jpg)
Linguistic resources (monolingual, bilingual)
![Page 122: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/122.jpg)
Classified ads
![Page 123: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/123.jpg)
Catalogs
![Page 124: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/124.jpg)
Public records (in some countries)
Real estate, car ownership, sex offenders, legal judgments, inmate data, death records, census data, map data, genealogy, elected officials, licensed professionals …
http://www.publicrecordcenter.com
![Page 125: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/125.jpg)
Public records (in some countries)
![Page 126: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/126.jpg)
Directories of organizations (e.g., Yellow Pages)
Banks of the World >> South Africa >> Union Bank of Nigeria PLC
![Page 127: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/127.jpg)
Directories of people
![Page 128: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/128.jpg)
Different types of structured fields
Explicit fields
Fields with internal structure
Iterated field
![Page 129: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/129.jpg)
Forums, bulletin boards, etc.
![Page 130: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/130.jpg)
Lots of structured & unstructured content
Date of post
Author
Post
Title (moderator, member, ...)
Geographic location of poster
![Page 131: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/131.jpg)
Fast but Principled Reasoning to Analyze Data
Principled: New models suited to the data + new inference algorithms for those models = draw appropriate conclusions from data
Fast prediction: Inference algorithms + approximations trained to balance speed & accuracy = 80% of the benefit at 20% of the cost
Reusable frameworks for modeling & prediction
![Page 132: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/132.jpg)
ERMA
Empirical Risk Minimization under Approximations
Our pretty models are approximate
Our inference procedures are also approximate
Our decoding procedures are also approximate
Our training procedures are also approximate (non-Bayesian)
So let’s train to minimize loss in the presence of all these approximations
  Striking improvements on several real NLP tasks (as well as a range of synthetic data)
![Page 133: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/133.jpg)
Speed-Aware ERMA
Empirical Risk Minimization under Approximations
So let’s train to minimize loss in the presence of all these approximations
  Striking improvements on several real NLP tasks (as well as a range of synthetic data)
Even better, let’s train to minimize loss + runtime
  Need special parameters to control degree of approximation
    How long to run? Which messages to pass? Which features to use?
  Get substantial speedups at little cost to accuracy
Next extension: Probabilistic relational models
  Learn to do fast approximate probabilistic reasoning about slots and fillers in a knowledge base
  Detect interesting facts, answer queries, improve info extraction
  Generate plausible supervised training data – minimize imputed risk
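To make the "loss + runtime" objective concrete, the sketch below picks how long to run inference by minimizing measured loss plus a weighted runtime cost. It is a simple grid search over a hypothetical table of dev-set losses, not the gradient-based training the talk describes; `pick_iteration_budget`, `dev_loss`, and the numbers in it are invented for illustration.

```python
def pick_iteration_budget(loss_by_iters, cost_per_iter, tradeoff):
    """Return the iteration count k minimizing loss(k) + tradeoff * cost_per_iter * k."""
    return min(
        loss_by_iters,
        key=lambda k: loss_by_iters[k] + tradeoff * cost_per_iter * k,
    )


# Hypothetical dev-set loss after k iterations of message passing.
dev_loss = {1: 0.30, 2: 0.22, 5: 0.18, 10: 0.17, 20: 0.169}

for tradeoff in (0.0, 0.01, 0.1):
    print(tradeoff, pick_iteration_budget(dev_loss, cost_per_iter=1.0, tradeoff=tradeoff))
```

With no runtime penalty the longest run wins; as the penalty grows, the chosen budget shrinks, trading a small loss increase for a large speedup.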
![Page 134: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/134.jpg)
Learned Dynamic Prioritization
More minimization of loss + runtime
Many inference procedures take nondeterministic steps that refine current beliefs.
  Graphical models: Which message to update next?
  Parsing: Which constituent to extend next?
  Parsing: Which build action, or should we backtrack & revise? Should we prune, or narrow or widen the beam?
  Coreference: Which clusters to merge next?
Learn a fast policy that decides what step to take next.
  “Compile” a slow inference procedure into a fast one that is tailored to the specific data distribution and task loss.
  Hard training problem in order to make test fast.
  We’re trying a bunch of different techniques.
![Page 135: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/135.jpg)
Compressed Learning
Sublinear time
How do we do unsupervised learning on many terabytes of data??
Can’t afford to do many passes over the dataset …
Throw away some data?
  Might create bias. How do we know we’re not throwing away the important clues?
Better: Summarize the less relevant data and try to learn from the summary.
  Google N-gram corpus = a compressed version of the English web.
  N-gram counts from 1 trillion words of text
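A minimal sketch of the kind of summary an N-gram corpus provides: every contiguous n-gram in the text, with its count. The helper name `ngram_counts` and the one-sentence corpus are assumptions for illustration; the sentence is assembled from fragments that appear on the later slides.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


corpus = "though most monitor lizards from Africa are carnivores".split()
counts = ngram_counts(corpus, 5)
for gram, c in counts.items():
    print(c, " ".join(gram))
```

The counts discard sentence boundaries and long-range context, which is why the following slides ask how much can still be learned from them.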
![Page 136: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/136.jpg)
Tagging isolated N-grams
N-gram: though most monitor lizards from
Candidate taggings: IN DT NN NNS IN; IN NN VB NNS IN
Topics: Biology, Computers
Oops, ambiguous. For learning, would help to have the whole sentence.
![Page 137: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/137.jpg)
Tagging N-grams in context
… some will eat vegetables though most monitor lizards from Africa are carnivores …
Tagging: IN DT NN NNS IN
Topics: Computers, Biology
Oops, ambiguous. For learning, would help to have the whole sentence.
![Page 138: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/138.jpg)
Tagging N-grams in context
… he watches them up close though most monitor lizards from a distance …
Tagging: IN NN VB NNS IN
Topics: Biology, Computers
Oops, ambiguous. For learning, would help to have the whole sentence.
![Page 139: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/139.jpg)
Extrapolating contexts …
[Figure: the 5-gram “though most monitor lizards from” surrounded by context words extrapolated from overlapping N-grams.]
Context words shown: Africa, Asia, a, distance, are, carnivores, carnivorous, vegetables, some, will, eat, close, he, watches, them, up, watch
Overlapping N-grams:
  though most monitor lizards from
  most monitor lizards from
  most monitor lizards from Africa
  most monitor lizards from Asia
  most monitor lizards from a
  though most monitor lizards
  vegetables though most monitor lizards
  close though most monitor lizards
Candidate taggings: IN DT NN NNS IN; IN NN VB NNS IN
Topics: Biology, Computers
![Page 140: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/140.jpg)
Learning from N-grams
[Figure: the same context words around “though most monitor lizards from”, with the overlapping N-grams annotated with corpus counts.]
Context words shown: Africa, Asia, are, a, distance, carnivores, carnivorous, some, will, eat, vegetables, he, watches, them, up, close, watch
Overlapping N-grams:
  52 though most monitor lizards from
  most monitor lizards from
  most monitor lizards from Africa
  most monitor lizards from Asia
  most monitor lizards from a
  though most monitor lizards
  vegetables though most monitor lizards
  close though most monitor lizards
Candidate taggings: IN DT NN NNS IN; IN NN VB NNS IN
Topics: Computers, Biology
![Page 141: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/141.jpg)
Fast but Principled Reasoning to Analyze Data
Principled: New models suited to the data + new inference algorithms for those models = draw appropriate conclusions from data
Fast prediction: Inference algorithms + approximations trained to balance speed & accuracy = 80% of the benefit at 20% of the cost
Reusable frameworks for modeling & prediction
![Page 142: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/142.jpg)
Dyna
A language for propagating and combining information
Each idea takes a lot of labor to code up.
We spend way too much “research” time building the parts that we already knew how to build.
  Coding natural variants on existing models/algorithms
  Hacking together existing data sources and algorithms
  Extracting outputs
  Tuning data structures, file formats, computation order, parallelization
![Page 143: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/143.jpg)
What’s in a knowledge base?
Types
Observed facts
Derived facts
Inference rules (declarative)
Inference strategies (procedural)
![Page 144: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/144.jpg)
Common architecture?
There’s not a single best way to represent uncertainty or combine knowledge.
What do numeric “weights” represent in a reasoning system?
  Probabilities (perhaps approximations or bounds)
  Intermediate quantities used to compute probabilities (in dynamic programming or message-passing)
  Potentials
  Priorities
  Confidences
  Activation levels
  Event or sample counts
  Losses, risks, rewards
  Feature values
  Feature weights & other parameters
  Distances, similarities
  Margins
  Regularization terms
  Partial derivatives
  . . .
![Page 145: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/145.jpg)
Common architecture?
There’s not a single best way to represent uncertainty or combine knowledge.
  Different underlying probabilistic models
  Different approximate inference/decision algorithms
  Depends on domain properties, special structure, speed needs …
  Heterogeneous data, features, rules, proposal distributions …
  Need ability to experiment, extend, and combine
But all of the methods share the same computational needs.
![Page 146: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/146.jpg)
Common architecture?
There’s not a single best way …
But all of the methods share the same needs.
  Store data and permit it to be queried.
  Fuse data – compute derived data using rules.
  Propagate updates to data, parameters, or hypotheses.
  Encapsulate data sources – both input data & analytics.
  Sensitivity analysis (e.g., back-propagation for training).
  Visualization of facts, changes, and provenance.
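The first three needs on this slide (store and query, fuse via rules, propagate updates) can be illustrated with a very small Python sketch. This is not Dyna and not how Dyna is implemented; the class `FactStore` and its methods `add_rule`, `set`, and `query` are hypothetical names invented for this illustration, and it assumes all rules are registered before any values are queried.

```python
from collections import defaultdict

class FactStore:
    """Toy knowledge base: observed facts plus derived facts computed by rules."""

    def __init__(self):
        self.observed = {}                   # item -> value (input data)
        self.rules = {}                      # derived item -> (function, input items)
        self.dependents = defaultdict(set)   # item -> derived items that read it
        self.derived = {}                    # memoized values of derived items

    def add_rule(self, item, fn, inputs):
        """Declare that `item` is computed as fn(*values of inputs)."""
        self.rules[item] = (fn, tuple(inputs))
        for name in inputs:
            self.dependents[name].add(item)

    def set(self, item, value):
        """Store an observed fact and propagate the update downstream."""
        self.observed[item] = value
        self._invalidate(item)

    def _invalidate(self, item):
        # Drop every memoized value that (transitively) depends on `item`.
        for dep in self.dependents[item]:
            if dep in self.derived:
                del self.derived[dep]
                self._invalidate(dep)

    def query(self, item):
        """Return an observed value, or compute and memoize a derived one."""
        if item in self.observed:
            return self.observed[item]
        if item not in self.derived:
            fn, inputs = self.rules[item]
            self.derived[item] = fn(*(self.query(i) for i in inputs))
        return self.derived[item]

kb = FactStore()
kb.add_rule("total", lambda a, b: a + b, ["count_a", "count_b"])
kb.add_rule("share_a", lambda a, t: a / t, ["count_a", "total"])
kb.set("count_a", 3)
kb.set("count_b", 1)
print(kb.query("share_a"))   # 0.75
kb.set("count_b", 3)         # update propagates: "total" and "share_a" are recomputed
print(kb.query("share_a"))   # 0.5
```

The sketch recomputes lazily on the next query. Whether to recompute eagerly or lazily, and which values to memoize at all, is exactly the kind of strategy choice the talk separates from the declarative rules.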
![Page 147: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/147.jpg)
Common architecture?
2011 paper on encoding AI problems in Dyna:
  2-3 lines: Dijkstra’s algorithm
  4 lines: Feed-forward neural net
  11 lines: Bigram language model (Good-Turing backoff smoothing)
  6 lines: Arc-consistency constraint propagation
    +6 lines: With backtracking search
    +6 lines: With branch-and-bound
  6 lines: Loopy belief propagation
  3 lines: Probabilistic context-free parsing
    +7 lines: PCFG rule weights via feature templates (toy example)
  4 lines: Value computation in a Markov Decision Process
  5 lines: Weighted edit distance
  3 lines: Markov chain Monte Carlo (toy example)
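For contrast with the line counts above, here is a minimal Python sketch of the first item, Dijkstra’s algorithm. It is not the Dyna program from the paper. The function name `shortest_paths` and the adjacency-list input format are assumptions made for this illustration. The declarative content is only "the cost of a node is the minimum over incoming edges of predecessor cost plus edge cost"; everything else in the sketch is agenda and bookkeeping code of the kind the talk argues should not have to be rewritten for each model.

```python
import heapq

def shortest_paths(edges, start):
    """Dijkstra's algorithm with an explicit priority agenda.

    edges: dict mapping node -> list of (neighbor, nonnegative cost).
    Nodes must be mutually comparable (e.g. all strings) so that heap
    entries with equal cost can be ordered.
    Returns a dict mapping each reachable node to its minimum path cost.
    """
    best = {start: 0}        # best known cost per node
    agenda = [(0, start)]    # priority queue of (cost, node) updates
    done = set()             # nodes whose cost is final
    while agenda:
        cost, u = heapq.heappop(agenda)
        if u in done:
            continue         # stale agenda entry; a cheaper one was already processed
        done.add(u)
        for v, w in edges.get(u, []):
            new_cost = cost + w
            if new_cost < best.get(v, float("inf")):
                best[v] = new_cost
                heapq.heappush(agenda, (new_cost, v))
    return best

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(shortest_paths(graph, "a"))   # {'a': 0, 'b': 1, 'c': 3}
```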
![Page 148: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/148.jpg)
Common architecture?
There’s not a single best way …
But all of the methods share the same needs.
  Store data and permit it to be queried.
  Fuse data – compute derived data using rules.
  Propagate updates to data, parameters, or hypotheses.
  Encapsulate data sources – both input data & analytics.
  Sensitivity analysis (e.g., back-propagation for training).
  Visualization.
And benefit from the same optimizations.
  Decide what is worth the time to compute (next).
  Decide where to compute it (parallelism).
  Decide what is worth the space to store (data, memos, indices).
  Decide how to store it.
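The "what is worth the space to store" decision can be made concrete with a small Python sketch. It keeps one recursive definition of (unweighted) edit distance fixed and varies only the memoization policy, using the standard-library `functools.lru_cache`. The helper `make_edit_distance` and its call counter are hypothetical names introduced for this illustration, not anything from the talk or from Dyna.

```python
from functools import lru_cache

def make_edit_distance(memo_size):
    """Build an edit-distance function with a chosen memo policy.

    memo_size=None stores every subproblem; memo_size=0 stores nothing.
    Returns the function and a counter of how many times its body ran.
    """
    calls = {"n": 0}

    @lru_cache(maxsize=memo_size)
    def dist(a, b):
        calls["n"] += 1
        if not a:
            return len(b)            # insert the rest of b
        if not b:
            return len(a)            # delete the rest of a
        sub = 0 if a[0] == b[0] else 1
        return min(dist(a[1:], b) + 1,          # delete a[0]
                   dist(a, b[1:]) + 1,          # insert b[0]
                   dist(a[1:], b[1:]) + sub)    # substitute or match

    return dist, calls

full, full_calls = make_edit_distance(None)   # memoize everything
none, none_calls = make_edit_distance(0)      # memoize nothing

print(full("kitten", "sitting"), none("kitten", "sitting"))   # 3 3
print(full_calls["n"] < none_calls["n"])                      # True
```

Both policies return the same answer; they differ only in time and space. Intermediate values of `memo_size` give a bounded cache, which is one simple point in the space of storage strategies the slide refers to.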
![Page 149: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/149.jpg)
Common architecture?
Dyna is not a probabilistic database, a graphical model inference package, FACTORIE, BLOG, Watson, a homebrew evidence combination system, ...
It provides the common infrastructure for these. That’s where “all” the implementation effort lies.
But does not commit to any specific data model, probabilistic semantics, or inference strategy.
![Page 150: Learning Approximate Inference Policies for Fast Prediction](https://reader036.fdocuments.in/reader036/viewer/2022081604/56815ffa550346895dcef952/html5/thumbnails/150.jpg)
Summary (again)
1. Principled Bayesian models of various interesting NLP domains.
  Discover underlying structure with little supervision
  Requires new learning and inference algorithms
2. Learn fast, accurate policies for structured prediction and large-scale relational reasoning.
3. Unified computational infrastructure for NLP and AI.
  A declarative programming language that supports modularity
  Backed by a searchable space of strategies & data structures
Machine learning + linguistic structure.
Fashion statistical models that capture good intuitions about various kinds of linguistic structure.
Develop efficient algorithms to apply these models to data.
Be generic.