Learning Approximate Inference Policies for Fast Prediction


Jason Eisner

ICML “Inferning” Workshop, June 2012


Beware: Bayesians in Roadway

A Bayesian is the person who writes down the function you wish you could optimize

[Diagram: interrelated linguistic variables]
lexicon (word types), semantics, sentences, discourse context, resources
entailment, correlation
inflection, cognates, transliteration, abbreviation, neologism, language evolution
translation, alignment
editing, quotation
speech, misspellings, typos, formatting, entanglement, annotation
N tokens

To recover variables, model and exploit their correlations

Motivating Tasks
Structured prediction (e.g., for NLP problems)
  Parsing (→ trees)
  Machine translation (→ word strings)
  Word variants (→ letter strings, phylogenies, grids)
Unsupervised learning via Bayesian generative models
  Given a few verb conjugation tables and a lot of text
    Find/organize/impute all verb conjugation tables of the language
  Given some facts and a lot of text
    Discover more facts through information extraction and reasoning

Current Methods
Dynamic programming
  Exact but slow
Approximate inference in graphical models
  Are approximations any good?
  May use dynamic programming as subroutine (structured BP)
Sequential classification

Speed-Accuracy Tradeoffs
Inference requires lots of computation
  Is some computation going to waste?
    Sometimes the best prediction is overdetermined …
    Quick ad hoc methods sometimes work: how to respond?
  Is some computation actively harmful?
    In approximate inference, passing a message can hurt
    Frustrating to simplify model just to fix this
Want to keep improving our models!
But need good fast approximate inference
Choose approximations automatically
  Tuned to data distribution & loss function
  “Trainable hacks” – more robust

This talk is about “trainable hacks”

[Diagram: a prediction device (suitable for the domain) is tuned on training data. First version: the feedback is likelihood. Second version: the feedback is loss + runtime.]

Bayesian Decision Theory
[Diagram: prediction rule, loss, and data distribution → optimized parameters of prediction rule]
What prediction rule? (approximate inference + beyond)
What loss function? (can include runtime)
How to optimize? (backprop, RL, …)
What data distribution? (may have to impute)


[Diagram: prediction device; probabilistic domain model; partial data; complete training data; feedback is loss + runtime]

Part 1: Your favorite approximate inference algorithm is a trainable hack

General CRFs: Unrestricted model structure.

Add edges to model the conditional distribution well. But exact inference is intractable. So use loopy sum-product or max-product BP.
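As a concrete picture of the loopy sum-product BP mentioned above, here is a minimal Python sketch for a pairwise model over binary variables. The function names (`loopy_bp`, `exact_marginals`), the data layout, the parallel update schedule, and the fixed iteration count are all assumptions made for illustration; this is not the ERMA implementation. On a tree the beliefs match the exact marginals, while on a loopy graph they are only approximations.

```python
import itertools
import math

def loopy_bp(n_vars, unary, pairwise, n_iters=20):
    """Sum-product loopy BP on a pairwise model with binary variables.

    unary[i][a]            : potential of variable i taking value a
    pairwise[(i, j)][a][b] : potential of edge (i, j) with i=a, j=b
    Returns approximate marginals (beliefs), one [p0, p1] per variable.
    """
    nbrs = {i: [] for i in range(n_vars)}
    for (i, j) in pairwise:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # One message per directed edge, initialised to uniform.
    msg = {(i, j): [0.5, 0.5] for i in nbrs for j in nbrs[i]}

    def edge_pot(i, j, a, b):
        if (i, j) in pairwise:
            return pairwise[(i, j)][a][b]
        return pairwise[(j, i)][b][a]

    for _ in range(n_iters):
        new = {}
        for (i, j) in msg:
            out = []
            for b in (0, 1):
                total = 0.0
                for a in (0, 1):
                    prod = unary[i][a] * edge_pot(i, j, a, b)
                    for k in nbrs[i]:
                        if k != j:  # exclude the message coming back from j
                            prod *= msg[(k, i)][a]
                    total += prod
                out.append(total)
            z = sum(out)
            new[(i, j)] = [v / z for v in out]
        msg = new  # parallel (synchronous) update

    beliefs = []
    for i in range(n_vars):
        b = [unary[i][a] * math.prod(msg[(k, i)][a] for k in nbrs[i])
             for a in (0, 1)]
        z = sum(b)
        beliefs.append([v / z for v in b])
    return beliefs

def exact_marginals(n_vars, unary, pairwise):
    """Brute-force marginals, feasible only for tiny models."""
    marg = [[0.0, 0.0] for _ in range(n_vars)]
    for assign in itertools.product((0, 1), repeat=n_vars):
        w = 1.0
        for i in range(n_vars):
            w *= unary[i][assign[i]]
        for (i, j), pot in pairwise.items():
            w *= pot[assign[i]][assign[j]]
        for i in range(n_vars):
            marg[i][assign[i]] += w
    z = sum(marg[0])
    return [[v / z for v in m] for m in marg]
```

Comparing `loopy_bp` with `exact_marginals` on a small cycle gives a feel for how far the approximation can drift once edges are added.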

[Diagram: CRF with output variables Y1–Y4 and input variables X1–X3]

Inference: compute properties of the posterior distribution.

Word   Tag marginals
The    DT .9, NN .05, …
cat    NN .8, JJ .1, …
sat    VBD .7, VB .1, …
on     IN .9, NN .01, …
the    DT .9, NN .05, …
mat    NN .4, JJ .3, …
.      . .99; , .001; …

General CRFs: Unrestricted model structure

Decoding: coming up with predictions from the results of inference.

The  cat  sat  on  the  mat  .
DT   NN   VBD  IN  DT   NN   .


One uses CRFs with several approximations:
  Approximate inference.
  Approximate decoding.
  Mis-specified model structure.
  MAP training (vs. Bayesian).
Could be present in linear-chain CRFs as well.

Why are we still maximizing data likelihood?
Our system is more like a Bayes-inspired neural network that makes predictions.


Train directly to minimize task loss (Stoyanov, Ropson, & Eisner 2011; Stoyanov & Eisner 2012)

Adjust θ to (locally) minimize training loss
  E.g., via back-propagation (+ annealing)
“Empirical Risk Minimization under Approximations (ERMA)”

[Diagram: x → p(y|x) → (appr.) inference → (appr.) decoding → ŷ → L(y*, ŷ); everything between x and ŷ is a black-box decision function parameterized by θ]
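Written out, the training criterion described on this slide can be sketched as follows. The symbols $N$ (number of training examples), the index $i$, and the operator names $\mathrm{infer}$ and $\mathrm{decode}$ are notation assumed here for illustration; the slide itself only names the loss $L(y^*, \hat{y})$ and the parameters $\theta$.

```latex
\theta^{*} \;=\; \arg\min_{\theta}\; \frac{1}{N} \sum_{i=1}^{N} L\bigl(y_i^{*},\, \hat{y}_{\theta}(x_i)\bigr),
\qquad
\hat{y}_{\theta}(x) \;=\; \mathrm{decode}\bigl(\mathrm{infer}\bigl(p_{\theta}(\cdot \mid x)\bigr)\bigr)
```

Here both $\mathrm{infer}$ and $\mathrm{decode}$ are the approximate procedures actually run at test time, so the minimization accounts for their errors rather than assuming exact inference.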

Optimization Criteria
[Figure: optimization criteria, including MLE]

Experimental Results
3 NLP problems; also synthetic data
We show that:
  General CRFs work better when they match dependencies in the data.
  Minimum risk training results in more accurate models.

ERMA software package
http://www.clsp.jhu.edu/~ves/software
Includes syntax for describing general CRFs.
Supports sum-product and max-product BP.
Can optimize several commonly used loss functions: MSE, Accuracy, F-score.
The package is generic:
  Little effort to model new problems.
  About 1-3 days to express each problem in our formalism.

Modeling Congressional Votes

The ConVote corpus [Thomas et al., 2006]

First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…
  → Yea

Had it not been for the heroic actions of the passengers of United flight 93 who forced the plane down over Pennsylvania, congress's ability to serve …
  → Yea

Mr. Sensenbrenner

An example from the ConVote corpus [Thomas et al., 2006]

• Predict representative votes based on debates.

[Diagram, built up over several slides: a Y/N vote variable; its Text (“committee on the judiciary, not just for the underlying bill…”); a second Y/N vote variable linked through Context Text]

Modeling Congressional Votes: Accuracy

Non-loopy baseline (2 SVMs + min-cut)                         71.2
Loopy CRF models (inference via loopy sum-prod BP)
  Maximum-likelihood training (with approximate inference)    78.2
  Softmax-margin (loss-aware)                                 79.0
  ERMA (loss- and approximation-aware)                        84.5

*Boldfaced results are significantly better than all others (p < 0.05)

Information Extraction from Semi-Structured Text

What: Special Seminar
Who: Prof. Klaus Sutner
     Computer Science Department, Stevens Institute of Technology
Topic: "Teaching Automata Theory by Computer"
Date: 12-Nov-93
Time: 12:00 pm
Place: WeH 4623
Host: Dana Scott (Asst: Rebecca Clark x8-6737)

ABSTRACT: We will demonstrate the system "automata" that implements finite state machines … After the lecture, Prof. Sutner will be glad to demonstrate and discuss the use of MathLink and his "automata" package

Highlighted fields: start time, location, speaker

CMU Seminar Announcement Corpus [Freitag, 2000]

Skip-Chain CRF for Info Extraction
Extract speaker, location, stime, and etime from seminar announcement emails

Token:  Who:  Prof.  Klaus  Sutner  …  Prof.  Sutner  will  …
Label:  O     S      S      S       …  S      S       O     …

CMU Seminar Announcement Corpus [Freitag, 2000]
Skip-chain CRF [Sutton and McCallum, 2005; Finkel et al., 2005]

Semi-Structured Information Extraction: F1

Non-loopy baseline (linear-chain CRF)                                  86.2
Non-loopy baseline + ERMA (trained for loss instead of likelihood)     87.1
Loopy CRF models (inference via loopy sum-prod BP)
  Maximum-likelihood training (with approximate inference)             89.5
  …                                                                    90.2
  ERMA                                                                 90.9

*Boldfaced results are significantly better than all others (p < 0.05).

Collective Multi-Label Classification

Reuters Corpus Version 2 [Lewis et al, 2004]

The collapse of crude oil supplies from Libya has not only lifted petroleum prices, but added a big premium to oil delivered promptly. Before protests began in February against Muammer Gaddafi, the price of benchmark European crude for imminent delivery was $1 a barrel less than supplies to be delivered a year later. …

Candidate labels: Oil, Libya, Sports

[Ghamrawi and McCallum, 2005; Finley and Joachims, 2008]

Multi-Label Classification: F1

Non-loopy baseline (logistic regression for each label)    81.6
Loopy CRF models
  Maximum-likelihood training                              84.0
  …                                                        83.8
  ERMA                                                     84.6

*Boldfaced results are significantly better than all others (p < 0.05)

Summary

                                Congressional Vote    Semi-str. Inf.     Multi-label
                                Modeling (Accuracy)   Extraction (F1)    Classification (F1)
Non-loopy baseline              71.2                  87.1               81.6
Loopy CRF models
  Maximum-likelihood training   78.2                  89.5               84.0
  ERMA                          84.5                  90.9               84.6

Synthetic Data
Generate a CRF at random
  Random structure & parameters
Use Gibbs sampling to generate data
Forget the parameters
  Optionally add noise to the structure
Learn the parameters from the sampled data
Evaluate using one of four loss functions
Total of 12 models of different size/connectivity
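The data-generation recipe above (random model, then Gibbs sampling) can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: binary variables, Gaussian-distributed unary weights and pairwise couplings, the energy form exp(Σ u_i y_i + Σ w_ij y_i y_j), and the function names `random_mrf` and `gibbs_sample`. The slides do not say how the random CRFs were parameterized.

```python
import math
import random

def random_mrf(n_vars, n_edges, rng):
    """Random structure and parameters for a pairwise binary model.

    Assumes n_edges <= n_vars * (n_vars - 1) / 2, otherwise the loop never ends.
    """
    edges = set()
    while len(edges) < n_edges:
        i, j = rng.sample(range(n_vars), 2)
        edges.add((min(i, j), max(i, j)))
    unary = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
    coupling = {e: rng.gauss(0.0, 1.0) for e in sorted(edges)}
    return unary, coupling

def gibbs_sample(unary, coupling, n_samples, burn_in, rng):
    """Draw samples from p(y) proportional to exp(sum_i u_i y_i + sum_ij w_ij y_i y_j)."""
    n = len(unary)
    nbrs = {i: [] for i in range(n)}
    for (i, j), w in coupling.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    state = [rng.randint(0, 1) for _ in range(n)]
    samples = []
    for sweep in range(burn_in + n_samples):
        for i in range(n):
            # Conditional log-odds of y_i = 1 given the current neighbours.
            log_odds = unary[i] + sum(w * state[j] for j, w in nbrs[i])
            p_one = 1.0 / (1.0 + math.exp(-log_odds))
            state[i] = 1 if rng.random() < p_one else 0
        if sweep >= burn_in:
            samples.append(list(state))
    return samples
```

To mimic the "forget the parameters" step, one would keep only the sampled assignments (and possibly a perturbed edge set) and hand them to the learner.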

Synthetic Data: Results

Test Loss   Train Objective       Δ Loss compared    wins/ties/losses
                                  to true model      (over 12 models)
MSE         ApprLogL (baseline)   .71
MSE         MSE                   .05                12/0/0
Accuracy    ApprLogL (baseline)   .75
Accuracy    Accuracy              .01                11/0/1
F-Score     ApprLogL (baseline)   1.17
F-Score     F-Score               .08                10/2/0
ApprLogL    ApprLogL (baseline)   -.31

Introducing Structure Mismatch


Back-Propagation of Error for Empirical Risk Minimization

• Back propagation of error (automatic differentiation in the reverse mode) to compute gradients of the loss with respect to θ.

• Gradient-based local optimization method to find the θ* that (locally) minimizes the training loss.

(independently done by Domke 2010, 2011)

[Diagram, built up over several slides: x → black-box decision function parameterized by θ → L(y*, ŷ). The black box is drawn first as a neural network, then as the CRF system with variables Y1–Y4 and inputs X1–X3.]

Error Back-Propagation
[Figure: the unrolled CRF computation, from θ through messages such as m(y1→y2) = m(y3→y1) * m(y4→y1) to the belief P(VoteReidbill77 = Yea | x)]

Error Back-Propagation
• Applying the differentiation chain rule over and over.
• Forward pass:
  – Regular computation (inference + decoding) in the model (+ remember intermediate quantities).
• Backward pass:
  – Replay the forward pass in reverse, computing gradients.

The Forward Pass
• Run inference and decoding:
  θ → [Inference (loopy BP)] → messages → beliefs → [Decoding] → output → [Loss] → L

The Backward Pass
• Replay the computation backward, calculating gradients:
  ð(L) = 1 → ð(output) → ð(beliefs) → ð(messages) → ð(θ), where ð(f) = ∂L/∂f
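The forward/backward scheme can be illustrated on a deliberately tiny stand-in for inference. In the Python sketch below, the "inference" is an iterated scalar update m ← sigmoid(θ·m + x) run for a fixed number of steps, and the loss is squared error on the final value. This toy recurrence, the squared-error loss, and the function names are assumptions chosen for brevity; real loopy BP has many messages per step, but the bookkeeping (store intermediates going forward, apply the chain rule in reverse) is the same idea.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta, x, y_true, n_steps):
    """Run the unrolled computation and remember every intermediate value."""
    ms = [0.5]  # initial "message"
    for _ in range(n_steps):
        ms.append(sigmoid(theta * ms[-1] + x))
    loss = (ms[-1] - y_true) ** 2
    return loss, ms

def backward(theta, x, y_true, ms):
    """Replay the forward pass in reverse, accumulating dL/dtheta."""
    d_m = 2.0 * (ms[-1] - y_true)  # adjoint of the final value
    d_theta = 0.0
    for t in range(len(ms) - 1, 0, -1):
        # ms[t] = sigmoid(z) with z = theta * ms[t-1] + x
        d_z = d_m * ms[t] * (1.0 - ms[t])
        d_theta += d_z * ms[t - 1]
        d_m = d_z * theta  # adjoint of ms[t-1]
    return d_theta
```

A quick sanity check is to compare `backward` against a central finite difference of `forward` in θ; the two should agree to several decimal places.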

Gradient-Based Optimization

• Use a local optimizer to find the θ* that minimizes training loss.

• In practice, we use a second-order method, Stochastic Meta Descent (Schraudolph 1999).
  – Some more automatic differentiation magic is needed to compute vector-Hessian products (Pearlmutter 1994).

• Both gradient and vector-Hessian computation have the same complexity as the forward pass (up to a small constant factor).
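To make "vector-Hessian product" concrete without forming the Hessian, the sketch below approximates Hv by a central finite difference of the gradient. Note that this is a simpler stand-in, not the exact automatic-differentiation technique of Pearlmutter (1994) cited on the slide; the function names and the quadratic test function are assumptions for illustration.

```python
def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Approximate H v as (g(theta + eps*v) - g(theta - eps*v)) / (2*eps)."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

def quad_grad(theta):
    """Gradient of f(theta) = 0.5 * theta^T A theta with A = [[2, 1], [1, 3]]."""
    return [2.0 * theta[0] + 1.0 * theta[1],
            1.0 * theta[0] + 3.0 * theta[1]]
```

For the quadratic above the Hessian is A itself, so `hessian_vector_product(quad_grad, theta, v)` should return approximately A·v for any θ. Like the exact method, this costs two gradient evaluations rather than a full Hessian.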

Deterministic Annealing
• Some loss functions are not differentiable (e.g., accuracy).
• Some inference methods are not differentiable (e.g., max-product BP).
• Replace Max with Softmax and anneal.
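A minimal Python sketch of the "replace max with softmax and anneal" idea: a temperature-controlled softmax gives a differentiable surrogate that approaches the hard max as the temperature goes to zero. The function names and the example of scoring candidate labels are assumptions for illustration, and the annealing schedule itself is left out.

```python
import math

def softmax_weights(scores, temperature):
    """Softmax over scores at a given temperature (numerically stabilised)."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def soft_select(scores, values, temperature):
    """Expected value under the softmax; tends to the value of the argmax as temperature -> 0."""
    weights = softmax_weights(scores, temperature)
    return sum(w * v for w, v in zip(weights, values))
```

With `values` set to 1 for the correct label and 0 otherwise, `soft_select` is a smooth stand-in for accuracy: at high temperature it is close to the average of the values, and at low temperature it is close to the accuracy of the hard argmax decision.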

Part 1: Your favorite approximate inference algorithm is a trainable hack
Part 2: What other trainable inference devices can we devise?
  Preferably can tune for speed-accuracy tradeoff (Horvitz 1989, “flexible computation”)
1. Lookup methods
Hash tables
Memory-based learning
Dual-path models (look up if possible, else do deeper inference)
  (in general, dynamic mixtures of policies: Halpern & Pass 2010)
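The dual-path idea (look up if possible, else do deeper inference) can be sketched as a cache in front of a slow predictor. The class name `DualPathPredictor`, its fields, and the policy of memoizing every slow result are assumptions made for illustration; a trained system could instead learn when a lookup is trustworthy.

```python
class DualPathPredictor:
    """Answer from a hash table when possible; otherwise fall back to slow inference."""

    def __init__(self, slow_predict):
        self.slow_predict = slow_predict  # any callable: input -> prediction
        self.table = {}                   # fast path: memorised answers
        self.slow_calls = 0               # how often the slow path was needed

    def predict(self, x):
        if x in self.table:
            return self.table[x]
        self.slow_calls += 1
        y = self.slow_predict(x)
        self.table[x] = y  # remember the answer for next time
        return y
```

For example, `DualPathPredictor(str.upper)` calls the slow function once per distinct input and serves repeats from the table, so `slow_calls` is a crude runtime measure.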

2. Choose Fast Model Structure
Static choice of fast model structure (Sebastiani & Ramoni 1998)
  Learning a low-treewidth model (e.g., Bach & Jordan 2001, Narasimhan & Bilmes 2004)
  Learning a sparse model (e.g., Lee et al. 2007)
  Learning an approximate arithmetic circuit (Lowd & Domingos 2010)
Dynamic choice of fast model structure
  Dynamic feature selection (Dulac-Arnold et al., 2011; Busa-Fekete et al., 2012; He et al., 2012; Stoyanov & Eisner, 2012)
  Evidence-specific tree (Chechetka & Guestrin 2010)
  Data-dependent convex optimization problem (Domke 2008, 2012)

3. Pruning Unlikely Hypotheses
Tune aggressiveness of pruning
  Pipelines, cascades, beam-width selection
  Classifiers or competitive thresholds
  E.g., Taskar & Weiss 2010, Bodenstab et al. 2011
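As one illustration of a tunable pruning knob, the Python sketch below runs beam search for sequence tagging, with `beam_width` controlling how aggressively hypotheses are pruned. The scoring setup (additive emission and transition log-scores) and the function name are assumptions for illustration; a large enough beam recovers the exact best sequence, while a narrow beam trades accuracy for speed.

```python
def beam_search(emit, trans, beam_width):
    """Approximate best tag sequence under additive log-scores.

    emit[t][k]  : score of tag k at position t
    trans[j][k] : score of moving from tag j to tag k
    Returns (score, tags) for the best hypothesis that survives pruning.
    """
    n_tags = len(emit[0])
    beam = [(emit[0][k], [k]) for k in range(n_tags)]
    beam = sorted(beam, reverse=True)[:beam_width]
    for t in range(1, len(emit)):
        candidates = [
            (score + trans[tags[-1]][k] + emit[t][k], tags + [k])
            for score, tags in beam
            for k in range(n_tags)
        ]
        # Prune: keep only the beam_width highest-scoring partial hypotheses.
        beam = sorted(candidates, reverse=True)[:beam_width]
    return beam[0]
```

Treating `beam_width` as a parameter to be tuned against loss + runtime, rather than fixed by hand, is the kind of "trainable hack" this section has in mind.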

4. Pruning Work During Search
Early stopping before convergence
  Message-passing inference (Stoyanov et al. 2011)
  [Figure: ERMA: Increasing Speed by Early Stopping (synthetic data)]
  Agenda-based dynamic prog. (Jiang et al. 2012) – approximate A*!
Update some messages more often
  In generalized BP, some messages are more complex
  Order of messages also affects convergence rate
  Cf. residual BP
  Cf. flexible arithmetic circuit computation (Filardo & Eisner 2012)
Coarsen or drop messages selectively
  Value of computation
  Cf. expectation propagation (for likelihood only)

5. Sequential Decisions with Revision
Common to use sequential decision processes for structured prediction
  MaltParser, SEARN, etc.

[Figure: transition-based dependency parsing; algorithm example from Joakim Nivre]
  Position:   0     1         2     3    4       5       6   7          8        9
  Word:       ROOT  Economic  news  had  little  effect  on  financial  markets  .
  Tag:              JJ        NN    VBD  JJ      NN      IN  JJ         NNS      .
  Arc labels: NMOD, SBJ, NMOD, OBJ, NMOD, NMOD, PMOD, P
  Parser actions shown: REDUCE, LA(NMOD), SHIFT, LA(SBJ), SHIFT, SHIFT, LA(NMOD), RA(OBJ), RA(NMOD), SHIFT, LA(NMOD), RA(PMOD), REDUCE, REDUCE, SHIFT, RA(P)

Often treated as reinforcement learning
  Cumulative or delayed reward
  Try to avoid “contagious” mistakes
New opportunity:
  Enhanced agent that can backtrack and fix errors
    The flip side of RL lookahead! (only in a forgiving environment)
    Sometimes can observe such agents (in psych lab)
  Or widen its beam and explore in parallel

Open Questions

Effective algorithm that dynamically assesses value of computation.

Theorems of the following form: If true model comes from distribution P, then with high probability there exists a fast/accurate policy in the policy space. (better yet, find the policy!)

Effective policy learning methods.

On Policy Learning Methods …
Basically large reinforcement learning problems
  But rather strange ones! (Eisner & Daumé 2011)
Policy (→ priorities) → trajectory → reward
  Often, many equivalent trajectories will get the same answer
Search in policy parameter space
  Policy gradient (doesn’t work)
  Direct search (e.g., Nelder-Mead)
Search in priority space
  Need a surrogate objective, like A*
Search in trajectory space
  SEARN (too slow for some controllers)
  Loss-augmented inference (Chiang et al. 2009; McAllester et al. 2010)
  Response surface methodology (really searches in policy space)
  Integer linear programming

Part 1: Your favorite approximate inference algorithm is a trainable hack
Part 2: What other trainable inference devices can we devise?
Part 3: Beyond ERMA to IRMA
  Empirical → Imputed Risk Minimization under Approximations

Where does p(x, y) come from?
[Diagram: prediction rule, loss, and data distribution → optimized parameters of prediction rule; labels: engineering, science]

Generative vs. discriminative
  training data vs. dev data (Raina et al. 2003)
  unsupervised vs. supervised data (McCallum et al. 2006)
  regularization vs. empirical loss (Lasserre et al. 2006)
  data distribution vs. decision rule (this work; cf. Lacoste-Julien 2011)

Data imputation (Little & Rubin 1987)
  May need to “complete” missing data
  What are we given?
  How do we need to complete it?
  How do we complete it?

1. Have plenty of inputs; impute outputs
“Model compression / uptraining / structure compilation”
  GMM -> VQ (Hedelin & Skoglund 2000)
  ensemble -> single classifier (Bucila et al. 2006)
  sparse coding -> regression or NN (Kavukcuoglu et al., 2008; Jarrett et al., 2009; Gregor & LeCun, 2010)
  CRF or PCFG -> local classifiers (Liang, Daume & Klein 2008)
  latent-variable PCFG -> deterministic sequential parser (Petrov et al. 2010)
    sampling instead of 1-best
  [stochastic] local search -> regression (Boyan & Moore 2000)
  k-step planning in an MDP -> classification or k'-step planning (e.g., rollout in Bertsekas 2005; Ross et al. 2011, DAgger)
  BN -> arithmetic circuit (Lowd & Domingos, 2010)
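The common pattern in the list above is: run a slow but good model on plentiful unlabeled inputs, then train a fast model on the imputed outputs. The Python sketch below shows that pattern in its simplest form. The slow model (a small voting ensemble of thresholds), the fast model (a single threshold), and all names and numbers are hypothetical stand-ins, not any of the cited systems.

```python
def slow_model(x):
    """Stand-in for an expensive, accurate predictor (hypothetical ensemble of three voters)."""
    votes = sum(1 for cut in (1.5, 2.0, 2.5) if x > cut)
    return 1 if votes >= 2 else 0

def fit_threshold(inputs, labels):
    """Fit the fast model: predict 1 iff x > threshold, chosen to minimise disagreements."""
    best = None
    for cut in sorted(inputs):
        errors = sum((1 if x > cut else 0) != y for x, y in zip(inputs, labels))
        if best is None or errors < best[0]:
            best = (errors, cut)
    return best[1]

# Plenty of unlabeled inputs; outputs are imputed by the slow model.
unlabeled = [i / 10.0 for i in range(0, 41)]
imputed = [slow_model(x) for x in unlabeled]

# Train the fast model on the imputed pairs.
threshold = fit_threshold(unlabeled, imputed)

def fast_model(x):
    return 1 if x > threshold else 0
```

In this toy case the fast model reproduces the slow one exactly on the imputed data; in realistic settings the fast model is less expressive, and the training loss decides which of the slow model's behaviours are worth preserving.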

2. Have good outputs; impute inputs
Distort inputs from input-output pairs
  Abu-Mostafa 1995
  SVMs can be regarded as doing this too!
Structured prediction: Impute possible missing inputs
  Impute many Chinese sentences that should translate into each observed English sentence (Li et al., 2011)

3. Insufficient training data to impute well
Assumed that we have a good slow model at training time
But what if we don’t?
Could sample from posterior over model parameters as well …

4. Statistical Relational Learning
May only have one big incomplete training example!
Sample jointly from (model parameters, completions of the data)
  Need a censorship model to mask data plausibly
  Need a distribution over queries as well – query is part of (x,y) pair
What model should we use here?
  Start with a “base MRF” to allow domain-specific inductive bias
    some variables rarely observed; some values rarely observed
  But try to respect the marginals we can get good estimates of
  Want IRMA → ERMA as we get more and more training data
    Need a high-capacity model to get consistency
  Learn MRF close to base MRF? Use a GP based on the base MRF?

Summary: Speed-Accuracy Tradeoffs
Inference requires lots of computation
  Is some computation going to waste?
    Sometimes the best prediction is overdetermined …
    Quick ad hoc methods sometimes work: how to respond?
  Is some computation actively harmful?
    In approximate inference, passing a message can hurt
    Frustrating to simplify model just to fix this
Want to keep improving our models!
But need good fast approximate inference
Choose approximations automatically
  Tuned to data distribution & loss function
  “Trainable hacks” – more robust

Summary: Bayesian Decision Theory
[Diagram: prediction rule, loss, and data distribution]
What prediction rule? (approximate inference + beyond)
What loss function? (can include runtime)
How to optimize? (backprop, RL, …)
What data distribution? (may have to impute)


Current Collaborators
Hal Daumé (+ 2 UMD students), René Vidal (+ students), Matt Gormley
Katherine Wu, Jay Feldman, Tim Vieira, Adam Teichert, Michael Paul
Nick Andrews, Henry Pao, Wes Filardo, Jason Smith, Ariya Rastrow
Ves Stoyanov, Ben Van Durme, Mark Dredze, Yanif Ahmad (+ student)
Frank Ferraro
Groups shown: undergrads & junior grad students; mid to senior grad students; faculty

NLP Tasks
15-20 years of introducing new formalisms, models & algorithms across NLP
Parsing
  Dependency, constituency, categorial, …
  Deep syntax
  Grammar induction
Word-internal modeling
  Morphology
  Phonology
  Transliteration
  Named entities
Translation
  Syntax-based (synchronous, quasi-synchronous, training, decoding)
Miscellaneous
  Tagging, sentiment, text cat, topics, coreference, web scraping …
  Generic algorithms on automata, hypergraphs, graphical models

Current Guiding Themes
Machine learning + linguistic structure.
Fashion statistical models that capture good intuitions about various kinds of linguistic structure. Develop efficient algorithms to apply these models to data. Be generic.

1. Principled Bayesian models of various interesting NLP domains.
   Discover underlying structure with little supervision
   Requires new learning and inference algorithms
2. Learn fast, accurate policies for structured prediction and large-scale relational reasoning.
3. Unified computational infrastructure for NLP and AI.
   A declarative programming language that supports modularity
   Backed by a searchable space of strategies & data structures

Fast but Principled Reasoning to Analyze Data
Principled:
  New models suited to the data
  + new inference algorithms for those models
  = draw appropriate conclusions from data
Fast prediction:
  Inference algorithms
  + approximations trained to balance speed & acc.
  = 80% of the benefit at 20% of the cost
Reusable frameworks for modeling & prediction

Word-Internal Modeling: Variation in a name within and across languages
E step: re-estimate distribution over all spanning trees
  Requires: Corpus model with sequential generation, copying, mutation
M step: re-estimate name mutation model along likely tree edges
  Requires: Trainable parametric model of name mutation

Word-Internal Modeling: Spelling of named entities
The “gazetteer problem” in NER systems
  Using gazetteer features helps performance on in-gazetteer names.
  But hurts performance on out-of-gazetteer names!
  Spelling features essentially do not learn from the in-gazetteer names.
Solution: Generate your gazetteer
  Treat the gazetteer itself as training data for a generative model of entity names.
    Includes spelling features.
    Non-parametric model generates good results.
  Include this sub-model within a full NER model.
    Not obvious how, especially for a discriminative NER model.
    Can exploit additional gazetteer data, such as town population.
Problem & solution extend to other dictionary resources in NLP
  Acronyms, IM-speak, cognate translations, …

Word-Internal Modeling: Inference over multiple strings
2011 dissertation by Markus Dreyer
  Organize corpus tokens into morphological paradigms
  Infer missing forms

String and sequence modeling: Optimal inference of strings
2011 dissertation by Markus Dreyer
  Organize corpus types into morphological paradigms
  Infer missing forms
  Cool model – but exact inference is intractable, even undecidable
Dual decomposition to the rescue?
  Will allow MAP inference in such models
  Message passing algorithm
  If it converges, the answer is guaranteed correct
Wasn’t obvious how to infer strings by dual decomposition
  We have one technique and are working on others
So far, we’ve applied it to intersecting many automata
  E.g., exact consensus output of ASR or MT systems
  Usually converges reasonably quickly
  O(100*n*g) per iteration

Grammar Induction
Finding the “best” grammar is a horrible optimization problem
  Even for overly simple definitions of “best”
Two new attacks:
  Mathematical programming techniques
    Branch and bound + Dantzig-Wolfe decomposition over the sentences + stochastic local search
  Deep learning
    “Inside” and “outside” strings should depend on each other only through a nonterminal (context-freeness)
    CCA should be able to find that nonterminal (spectral learning)
    But first need vector representations of inside and outside strings
    So use CCA to build up representations recursively (deep learning)

Improved Topic Models: Results improve on the state of the art
What can we learn from distributional properties of words?
Some words group together into “topics.”
  Tend to cooccur in documents; or have similar syntactic arguments.
But are there further hidden variables governing this?
  Try to get closer to underlying meaning or discourse space.
Future: Embed words or phonemes in a structured feature space whose structure must be learned

Applied NLP Tasks: Results improve on the state of the art
Add more global features to the model …
  Need approximate inference, but it’s worth it
  Especially if we train for the approximate inference condition
Within-document coreference
  Build up properties of the underlying entities: gender, number, animacy, semantic type, head word
Sentiment polarity
  Exploit cross-document references that signal (dis)agreement between two authors
Multi-label text categorization
  Exploit correlations between labels on same document
Information extraction
  Exploit correlations between labels on faraway words

Database generated websites
Database back-end:

Post ID   Author
520       Demon
521       Ushi

Author    Title
Demon     Moderator
Ushi      Pink Space Monkey

Author    Location
Demon     Pennsylvania
Ushi      Where else?

Web-page code produced by querying DB

Website generated databases*
Recovered database (the same three tables), given web pages

* Thanks, Bayes! We state a prior over annotated grammars, and a prior over database schemas, and a prior over database contents.

Relational database → Webpages
Why isn’t this easy?
  Could write a custom script …
  … for every website in every language?? (and maintain it??)
Why are database-backed websites important?
  1. Vast amounts of useful information are published this way! (most?)
  2. In 2007, Dark Web project @ U. Arizona estimated 50,000 extremist/terrorist websites; fastest growth was in Web 2.0 sites (http://ai.arizona.edu/research/terror)
     Some were transient sites, or subcommunities on larger sites
  3. Our techniques could extend to analyze other semistructured docs
Why are NLP methods relevant?
  Like NL, these webpages are meant to be read by humans
  But they’re a mixture of NL text, tables, semi-structured data, repetitive formatting …
  Harvest NL text + direct facts (including background facts for NLP)
  Helpful that HTML is a tree: we know about those

Shopping & auctions (with user comments)
News articles & blogs ... with user comments
Crime listings
Social networking
Collaborative encyclopedias
Linguistic resources (monolingual, bilingual)
Classified ads
Catalogs
Public records (in some countries)
  Real estate, car ownership, sex offenders, legal judgments, inmate data, death records, census data, map data, genealogy, elected officials, licensed professionals …
  http://www.publicrecordcenter.com
Directories of organizations (e.g., Yellow Pages)
  Banks of the World >> South Africa >> Union Bank of Nigeria PLC
Directories of people
  Different types of structured fields: explicit fields; fields with internal structure; iterated field
Forums, bulletin boards, etc.
  Lots of structured & unstructured content: date of post; author; post; title (moderator, member, ...); geographic location of poster

ERMA: Empirical Risk Minimization under Approximations
Our pretty models are approximate
Our inference procedures are also approximate
Our decoding procedures are also approximate
Our training procedures are also approximate (non-Bayesian)
So let’s train to minimize loss in the presence of all these approximations
  Striking improvements on several real NLP tasks (as well as a range of synthetic data)

Speed-Aware ERMA: Empirical Risk Minimization under Approximations
Even better, let’s train to minimize loss + runtime
  Need special parameters to control degree of approximation
    How long to run? Which messages to pass? Which features to use?
  Get substantial speedups at little cost to accuracy
Next extension: Probabilistic relational models
  Learn to do fast approximate probabilistic reasoning about slots and fillers in a knowledge base
  Detect interesting facts, answer queries, improve info extraction
  Generate plausible supervised training data – minimize imputed risk
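The "loss + runtime" objective can be made concrete with the simplest possible control parameter: how many inference iterations to run. The Python sketch below picks the iteration budget that minimizes average loss plus a runtime penalty. It is a grid search over one discrete knob, assumed here purely for illustration; the speed-aware training described above tunes its parameters differently, and the function name is hypothetical.

```python
def best_iteration_budget(loss_by_iters, runtime_weight):
    """Pick the iteration count t that minimises loss_by_iters[t] + runtime_weight * t.

    loss_by_iters[t] is the average training loss when inference is stopped
    after t iterations (measured elsewhere).
    """
    objective = [loss + runtime_weight * t for t, loss in enumerate(loss_by_iters)]
    best_t = min(range(len(objective)), key=objective.__getitem__)
    return best_t, objective[best_t]
```

With a hypothetical loss curve such as `[0.50, 0.30, 0.22, 0.20, 0.195, 0.194]`, a runtime weight of zero selects the longest run, while larger weights select earlier stopping points, which is the speed-accuracy tradeoff in miniature.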

Learned Dynamic Prioritization: More minimization of loss + runtime
Many inference procedures take nondeterministic steps that refine current beliefs.
  Graphical models: Which message to update next?
  Parsing: Which constituent to extend next?
  Parsing: Which build action, or should we backtrack & revise? Should we prune, or narrow or widen the beam?
  Coreference: Which clusters to merge next?
Learn a fast policy that decides what step to take next.
  “Compile” a slow inference procedure into a fast one that is tailored to the specific data distribution and task loss.
  Hard training problem in order to make test fast.
  We’re trying a bunch of different techniques.

Compressed Learning: Sublinear time
How do we do unsupervised learning on many terabytes of data??
  Can’t afford to do many passes over the dataset …
Throw away some data?
  Might create bias. How do we know we’re not throwing away the important clues?
Better: Summarize the less relevant data and try to learn from the summary.
  Google N-gram corpus = a compressed version of the English web.
  N-gram counts from 1 trillion words of text

Tagging isolated N-grams
  N-gram: though most monitor lizards from
  Tagging 1: IN DT NN NNS IN (topic: Biology)
  Tagging 2: IN NN VB NNS IN (topic: Computers)
  Oops, ambiguous. For learning, would help to have the whole sentence.

Tagging N-grams in context
  … though most monitor lizards from Africa are carnivores … some will eat vegetables
    IN DT NN NNS IN (topics: Computers, Biology)
  … he watches them up close though most monitor lizards from a distance …
    IN NN VB NNS IN (topics: Biology, Computers)

Extrapolating contexts …
  [Figure: overlapping N-grams supply context for the ambiguous 5-gram]
  N-grams: though most monitor lizards from; most monitor lizards from; most monitor lizards from Africa; most monitor lizards from Asia; most monitor lizards from a; though most monitor lizards; vegetables though most monitor lizards; close though most monitor lizards
  Context words: Africa, Asia, a, distance, are, carnivores, carnivorous, vegetables, some, will, eat, close, he, watches, them, up, watch
  Taggings: IN DT NN NNS IN; IN NN VB NNS IN


Africa

Asia

are

adistance

carnivores

carnivorous

somewill

eatvegetableshe

watches

themup

close

watch

IN DT NN NNS IN

IN NN VB NNS IN

Topics: Computers Biology

NV

N

52 though most monitor lizards from2501428325

most monitor lizards frommost monitor lizards from Africamost monitor lizards from Asiamost monitor lizards from a

5213310132

though most monitor lizardsvegetables though most monitor lizards

close though most monitor lizards


Learning from N-gramsV

Dyna: A language for propagating and combining information
Each idea takes a lot of labor to code up.
We spend way too much “research” time building the parts that we already knew how to build.
  Coding natural variants on existing models/algorithms
  Hacking together existing data sources and algorithms
  Extracting outputs
  Tuning data structures, file formats, computation order, parallelization

What’s in a knowledge base? Types

Observed facts

Derived facts

Inference rules (declarative) Inference strategies (procedural)

Common architecture?
There’s not a single best way to represent uncertainty or combine knowledge.
What do numeric “weights” represent in a reasoning system?
  Probabilities (perhaps approximations or bounds)
  Intermediate quantities used to compute probabilities (in dynamic programming or message-passing)
  Potentials
  Priorities
  Confidences
  Activation levels
  Event or sample counts
  Losses, risks, rewards
  Feature values
  Feature weights & other parameters
  Distances, similarities
  Margins
  Regularization terms
  Partial derivatives
  . . .

Common architecture?
There’s not a single best way to represent uncertainty or combine knowledge.
  Different underlying probabilistic models
  Different approximate inference/decision algorithms
    Depends on domain properties, special structure, speed needs …
  Heterogeneous data, features, rules, proposal distributions …
  Need ability to experiment, extend, and combine
But all of the methods share the same computational needs.
  Store data and permit it to be queried.
  Fuse data – compute derived data using rules.
  Propagate updates to data, parameters, or hypotheses.
  Encapsulate data sources – both input data & analytics.
  Sensitivity analysis (e.g., back-propagation for training).
  Visualization of facts, changes, and provenance.
And benefit from the same optimizations.
  Decide what is worth the time to compute (next).
  Decide where to compute it (parallelism).
  Decide what is worth the space to store (data, memos, indices).
  Decide how to store it.

Common architecture?
2011 paper on encoding AI problems in Dyna:
  2-3 lines: Dijkstra’s algorithm
  4 lines: Feed-forward neural net
  11 lines: Bigram language model (Good-Turing backoff smoothing)
  6 lines: Arc-consistency constraint propagation
    +6 lines: With backtracking search
    +6 lines: With branch-and-bound
  6 lines: Loopy belief propagation
  3 lines: Probabilistic context-free parsing
    +7 lines: PCFG rule weights via feature templates (toy example)
  4 lines: Value computation in a Markov Decision Process
  5 lines: Weighted edit distance
  3 lines: Markov chain Monte Carlo (toy example)

Common architecture?

Dyna is not a probabilistic database, a graphical model inference package, FACTORIE, BLOG, Watson, a homebrew evidence combination system, ...

It provides the common infrastructure for these. That’s where “all” the implementation effort lies.

But does not commit to any specific data model, probabilistic semantics, or inference strategy.

Summary (again)
Machine learning + linguistic structure.
Fashion statistical models that capture good intuitions about various kinds of linguistic structure. Develop efficient algorithms to apply these models to data. Be generic.

1. Principled Bayesian models of various interesting NLP domains.
   Discover underlying structure with little supervision
   Requires new learning and inference algorithms
2. Learn fast, accurate policies for structured prediction and large-scale relational reasoning.
3. Unified computational infrastructure for NLP and AI.
   A declarative programming language that supports modularity
   Backed by a searchable space of strategies & data structures
