Probabilistic Programming with Imperative Factor Graphs
Transcript of "Probabilistic Programming with Imperative Factor Graphs"
Source: people.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdf
Joint work with Karl Schultz, Sameer Singh, Michael Wick, Sebastian Riedel. Some slide material from Avi Pfeffer.
Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst
Probabilistic Programming with Imperative Factor Graphs
Uncertainty
• Uncertainty is ubiquitous
- Partial information
- Noisy sensors
- Non-deterministic actions
- Exogenous events
• Reasoning under uncertainty is a central challenge for building intelligent systems.
Probability
• Probability provides a mathematically sound basis for dealing with uncertainty.
• Combined with utilities, it provides a basis for decision-making under uncertainty.
Probabilistic Modeling in the Last Few Years
• Models ever growing in richness and variety
- hierarchical
- spatio-temporal
- relational
- infinite
• Developing the representation, inference and learning for a new model is a significant task.
Conditional Random Fields
Finite state model
Undirected graphical model, trained to maximize conditional probability of output (sequence) given input (sequence)
[Figure: linear-chain graphical model — state sequence y1 … y8 joined by transition factors, each yt connected by an observation factor to the observation sequence x1 … x8]
Graphical model
(Linear-chain) [Lafferty, McCallum, Pereira 2001]
$$p(\vec{y}\,|\,\vec{x}) = \frac{1}{Z_{\vec{x}}} \prod_{t=1}^{|\vec{x}|} \phi(y_t, y_{t-1})\,\phi(x_t, y_t), \qquad \phi(x_t, y_t) = \exp\Big(\sum_k \lambda_k f_k(x_t, y_t)\Big)$$
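The equation above can be made concrete with a tiny sketch, here in Python rather than FACTORIE's Scala. The potentials, words, and labels below are all invented for illustration; they stand in for exp(Σ λ·f) from the slide:

```python
from itertools import product

# Hypothetical toy potentials for a two-label ("B"/"I") linear chain.
phi_trans = {"B": {"B": 0.5, "I": 2.0}, "I": {"B": 1.0, "I": 1.5}}
phi_obs = {("Bill", "B"): 3.0, ("loves", "I"): 2.0, ("skiing", "I"): 2.0}

def unnormalized_score(xs, ys):
    """Product of observation and transition potentials along the chain."""
    score = phi_obs.get((xs[0], ys[0]), 1.0)
    for t in range(1, len(xs)):
        score *= phi_trans[ys[t - 1]][ys[t]]
        score *= phi_obs.get((xs[t], ys[t]), 1.0)
    return score

xs = ["Bill", "loves", "skiing"]
# Z_x normalizes by summing the score over all label sequences.
Z = sum(unnormalized_score(xs, ys) for ys in product("BI", repeat=len(xs)))
p = unnormalized_score(xs, ("B", "I", "I")) / Z
```

Enumerating all label sequences is exponential in general; the point here is only the factorized form of p(y|x).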
Conditional Random Fields
Finite state model
Undirected graphical model, trained to maximize conditional probability of output (sequence) given input (sequence)
Graphical model
(Linear-chain) [Lafferty, McCallum, Pereira 2001]
[Figure: state sequence y1 … y8 over observation sequence x1 … x8]
Wide-spread interest, positive experimental results in many applications:
Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ‘04], …
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HLT’04]
Object classification in images [CVPR ‘04]
Skip-chain CRF
Senator Joe Green said today. Green chairs the ...
Joint NER across sentences
Capture long-distance dependencies
[Sutton, McCallum, 2005]
Factorial CRF
Part-of-speech
Noun-phrase boundaries
Named-entity tag
English words
Those surfers like San Jose
[Sutton, McCallum ’04]
Joint Part-of-speech, NP chunking, NER
Inference by Loopy Belief Propagation
Pairwise Affinity CRF
[Figure: pairwise coreference graph over the mentions Mr. Hill, Amy Hall, Dana Hill, Dana, she; edges labeled C (coreferent) or N (not coreferent)]
[McCallum & Wellner 2003]
Entity Resolution
[Figure: five “mentions” — Mr. Hill, Amy Hall, Dana Hill, Dana, she]
Entity Resolution
[Figure: the mentions Mr. Hill, Amy Hall, Dana Hill, Dana, she grouped into two candidate “entities”]
Entity Resolution
[Figure: an alternative grouping of the mentions Mr. Hill, Amy Hall, Dana Hill, Dana, she into three “entities”]
CRF for Co-reference
[Figure: the mentions Mr. Hill, Amy Hall, Dana Hill, Dana, she]
CRF for Co-reference
[Figure: pairwise coreference graph over the mentions Mr. Hill, Amy Hall, Dana Hill, Dana, she; edges labeled C or N]
[McCallum & Wellner 2003]
$$p(\vec{y}\,|\,\vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big(\sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij})\Big)$$
+ mechanism for preserving transitivity
Make pair-wise merging decisions jointly by:
- calculating a joint probability
- including all edge weights
- enforcing transitivity.
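One standard way to enforce the transitivity constraint is to read clusters off the pairwise "coreferent" decisions with union-find, so that coref(a,b) and coref(b,c) force a, b, c into one entity by construction. A minimal sketch (the mentions and decisions below are the slide's running example, the data structure is an illustrative choice, not the paper's method):

```python
class UnionFind:
    """Disjoint sets over mentions; clusters are transitively closed."""
    def __init__(self, items):
        self.parent = {i: i for i in items}
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i
    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

mentions = ["Mr. Hill", "Dana Hill", "Dana", "she", "Amy Hall"]
coref_pairs = [("Mr. Hill", "Dana Hill"), ("Dana Hill", "Dana")]  # toy decisions
uf = UnionFind(mentions)
for a, b in coref_pairs:
    uf.union(a, b)
entity_of = {m: uf.find(m) for m in mentions}
# Mr. Hill, Dana Hill, and Dana now share one entity; transitivity holds.
```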
Pairwise Affinity is not Enough
[Figure: the same pairwise coreference graph over Mr. Hill, Amy Hall, Dana Hill, Dana, she, with C/N edge labels]
Pairwise Affinity is not Enough
[Figure: the same pairwise graph, but now four of the five mentions are “she”]
Pairwise Comparisons Not Enough
Examples:
• Are all (∀) the mentions pronouns?
• Entities have multiple attributes (name, email, institution, location); need to measure “compatibility” among them.
• Having 2 “given names” is common, but not 4.
– e.g. Howard M. Dean / Martin, Dean / Howard Martin
• Need to measure the size of the clusters of mentions.
• Is there (∃) a pair of last-name strings that differ by more than 5?
We need to ask ∃, ∀ questions about a set of mentions.
We want first-order logic!
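The ∃/∀ questions above are set-level features of a whole candidate entity, something no pairwise factor can express. An illustrative Python sketch (function names, the pronoun list, and the length-based "differ" test are all assumptions for the example; a real system would use something like edit distance):

```python
PRONOUNS = {"she", "he", "it", "they"}

def all_pronouns(mentions):
    """Universal question: are all (forall) mentions pronouns?"""
    return all(m.lower() in PRONOUNS for m in mentions)

def num_given_names(mentions):
    """2 given names in a cluster is common, but 4 is suspicious."""
    return len({m.split()[0] for m in mentions if m.lower() not in PRONOUNS})

def exists_distant_lastnames(mentions, threshold=5):
    """Existential question: is there a pair of last-name strings that
    differ by more than the threshold (here, crudely, in length)?"""
    lasts = [m.split()[-1] for m in mentions if m.lower() not in PRONOUNS]
    return any(abs(len(a) - len(b)) > threshold
               for i, a in enumerate(lasts) for b in lasts[i + 1:])

cluster = ["she", "she", "she", "she"]  # a cluster of only pronouns
```

Each of these is a factor over an entire partition cell, which is exactly what the partition-affinity model on the next slides provides.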
Partition Affinity CRF
[Figure: factors attached to whole clusters of mentions in a partition]
Ask arbitrary questions about all entities in a partition with first-order logic...
How can we perform inference and learning in models that cannot be “unrolled”?
Can’t use belief propagation.
Can’t use standard integer linear programming.
Don’t represent all alternatives...
[Figure: the full space of alternative partitions of the mentions]
Don’t represent all alternatives... just one at a time
[Figure: a stochastic jump from one partition of the mentions to another, drawn from a proposal distribution]
Markov Chain Monte Carlo
SampleRank
Metropolis-Hastings for MAP

Maximum a posteriori (MAP) inference: argmax_{y ∈ F} P(Y = y | x)

. . . over a model

$$P(y\,|\,x) = \frac{1}{Z_x} \prod_{y_i \in y} \psi(x, y_i)$$

. . . using a proposal distribution q(y′|y) : F × F → [0, 1]

MH for MAP:
1. Begin with some initial configuration y0 ∈ F
2. For i = 1, 2, 3, . . . draw a local modification y′ ∈ F from q
3. Probabilistically accept the jump as a Bernoulli draw with parameter α

$$\alpha = \min\left(1,\ \frac{p(y')}{p(y)}\,\frac{q(y\,|\,y')}{q(y'\,|\,y)}\right)$$

(F is the feasible region defined by deterministic constraints, e.g. clustering, non-projective trees.)

(UMass, Amherst) Sample Rank Vs. Contrastive Divergence IESL 3 / 27
Callouts added across the slide builds:
- Given a factor graph with target variables y and observed x.
- q(y′|y) is the proposal distribution.
- F is the feasible region defined by deterministic constraints, e.g. clustering, parse-tree projectivity.
- Can do MAP inference with a decreasing temperature on the ratio of p(y)’s.
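The MH-for-MAP recipe, including the decreasing-temperature trick, can be sketched in a few lines of Python. Everything here is a toy stand-in: the 1-D integer state space, the quadratic `score` (playing the role of the log unnormalized model probability), and the ±1 proposal, which is symmetric so the q-ratio in α cancels:

```python
import math
import random

random.seed(0)

def score(y):
    """Toy log of the unnormalized model probability; unique MAP at y = 3."""
    return -(y - 3) ** 2

def mh_map(steps=2000, t0=2.0):
    """Metropolis-Hastings with a decreasing temperature on the p-ratio."""
    y = 0
    best = y
    for i in range(1, steps + 1):
        temperature = t0 / i                     # annealing schedule
        y_prime = y + random.choice((-1, 1))     # local modification from q
        # alpha = min(1, (p(y')/p(y))**(1/T)); q-ratio cancels (symmetric q)
        alpha = min(1.0, math.exp((score(y_prime) - score(y)) / temperature))
        if random.random() < alpha:              # Bernoulli accept
            y = y_prime
        if score(y) > score(best):
            best = y
    return best
```

With the temperature falling toward zero, downhill moves are accepted ever more rarely and the chain settles on the MAP configuration.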
M-H Natural Efficiencies

1. Partition function cancels

$$\frac{p(y')}{p(y)} = \frac{p(Y = y'\,|\,x;\theta)}{p(Y = y\,|\,x;\theta)} = \frac{\frac{1}{Z_X}\prod_{y'_i \in y'} \psi(x, y'_i)}{\frac{1}{Z_X}\prod_{y_i \in y} \psi(x, y_i)} = \frac{\prod_{y'_i \in y'} \psi(x, y'_i)}{\prod_{y_i \in y} \psi(x, y_i)}$$
2. Unchanged factors cancel

$$= \frac{\prod_{y'_i \in y'} \psi(x, y'_i)}{\prod_{y_i \in y} \psi(x, y_i)} = \frac{\Big(\prod_{y'_i \in \delta y'} \psi(x, y'_i)\Big)\Big(\prod_{y_i \in y' \setminus \delta y'} \psi(x, y_i)\Big)}{\Big(\prod_{y_i \in \delta y} \psi(x, y_i)\Big)\Big(\prod_{y_i \in y \setminus \delta y} \psi(x, y_i)\Big)} = \frac{\prod_{y'_i \in \delta y'} \psi(x, y'_i)}{\prod_{y_i \in \delta y} \psi(x, y_i)}$$

δy is the “diff”, i.e. the variables in y that have changed.
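The cancellation above means the acceptance ratio only needs the factors touching the diff. A small Python demonstration on an invented chain of pairwise log-potentials (the `log_factor` definition is an assumption for the example):

```python
def log_factor(i, y):
    """Toy log-potential touching variables i and i+1 of configuration y."""
    return -abs(y[i] - y[i + 1])

def full_log_score(y):
    """Sum over every factor -- what the ratio lets us avoid."""
    return sum(log_factor(i, y) for i in range(len(y) - 1))

def log_ratio_from_diff(y, y_prime, changed):
    """log p(y')/p(y), re-scoring only factors adjacent to changed positions."""
    touched = {i for c in changed for i in (c - 1, c) if 0 <= i < len(y) - 1}
    return (sum(log_factor(i, y_prime) for i in touched)
            - sum(log_factor(i, y) for i in touched))

y = [0, 1, 1, 0, 2]
y_prime = [0, 1, 3, 0, 2]          # single-variable change at position 2
log_ratio = log_ratio_from_diff(y, y_prime, changed={2})
```

Because every factor not touching the diff appears identically in numerator and denominator, the cheap ratio equals the difference of full scores.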
How to learn the parameters θ?
SampleRank: motivation

QUESTION: how do we learn θ for MH?

Problem with traditional ML: inference sits in the inner-most loop of learning
- maximum likelihood requires inference for marginals
- perceptron requires inference for decoding

Want: push updates (not inference) into the inner-most loop of learning.

Idea: use MH as a guide, exploit its efficiency, and learn to rank neighboring samples during the random walk.
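The "learn to rank neighboring samples" idea can be sketched as a perceptron-style update applied whenever the model mis-ranks a proposed pair. This is a minimal illustration, not FACTORIE's implementation; the feature function, objective, and learning rate are all invented for the example:

```python
def features(y):
    """Toy feature vector of a configuration."""
    return [float(sum(y)), float(y[0])]

def model_score(theta, y):
    return sum(t * f for t, f in zip(theta, features(y)))

def objective(y, truth):
    """Ground-truth performance metric, e.g. per-position accuracy."""
    return sum(a == b for a, b in zip(y, truth))

def samplerank_update(theta, y, y_prime, truth, lr=1.0):
    """Compare the current sample y and proposal y'; if the model ranks
    them opposite to the objective, step theta toward the better one."""
    better, worse = ((y_prime, y)
                     if objective(y_prime, truth) > objective(y, truth)
                     else (y, y_prime))
    if model_score(theta, worse) >= model_score(theta, better):
        fb, fw = features(better), features(worse)
        theta = [t + lr * (b - w) for t, b, w in zip(theta, fb, fw)]
    return theta

theta = [0.0, 0.0]
truth = [1, 1, 0]
theta = samplerank_update(theta, [0, 0, 0], [1, 1, 0], truth)
```

No global inference is needed: each update looks only at the two neighboring samples the MH walk already produced.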
Probabilistic Modeling in the Last Few Years
• Models ever growing in richness and variety
- hierarchical
- spatio-temporal
- relational
- infinite
• Developing the representation, reasoning and learning for a new model is a significant task.
Probabilistic Programming Languages
• Make it easy to represent rich, complex models, using the full power of programming languages
- data structures
- control mechanisms
- abstraction
• Inference and learning come for free (or sort of)
• Give you the language to think of and create new models
Small Sampling of Probabilistic Programming Languages
• Logic-based: Markov logic, BLOG, PRISM
• Functional: IBAL, Church
• Object Oriented: Figaro, Infer.NET
BLOG
#Researcher ~ NumResearchersPrior();
Name(r) ~ NamePrior();
#Paper ~ NumPapersPrior();
FirstAuthor(p) ~ Uniform({Researcher r});
Title(p) ~ TitlePrior();
PubCited(c) ~ Uniform({Paper p});
Text(c) ~ NoisyCitationGrammar(Name(FirstAuthor(PubCited(c))), Title(PubCited(c)));
• Generative model of objects and relations.
• Handles unknown number of objects
• Inference by MCMC.
[Milch et al, 2005]
Church
• Tell generative story-line in Scheme.
• Do MCMC inference over execution paths.

(define (DP alpha proc)
  (let ((sticks (mem (lambda x (beta 1.0 alpha))))
        (atoms (mem (lambda x (proc)))))
    (lambda () (atoms (pick-a-stick sticks 1)))))

(define (pick-a-stick sticks J)
  (if (< (random) (sticks J))
      J
      (pick-a-stick sticks (+ J 1))))

(define (DPmem alpha proc)
  (let ((dps (mem (lambda args (DP alpha (lambda () (apply proc args)))))))
    (lambda argsin ((apply dps argsin)))))
[Goodman, Mansinghka, Roy, Tenenbaum, 2009]
Figaro
• Generative model of objects and relations.
• Object oriented (also in Scala!)
- “Models” are the basic building block, composed of other models, derived by inheritance.
- Models are objects with conditions, constraints and relations to other objects.
- Model = data + factors; they are intertwined.
[Pfeffer, 2009]
Figaro [Pfeffer, 2009]
People smoke with probability 0.6:
  Smoke(x)   1.5
Friends are 3 times as likely to have the same smoking habit as different:
  ¬Friends(x,y) ∨ ¬Smoke(x) ∨ Smoke(y)   3
  ¬Friends(x,y) ∨ Smoke(x) ∨ ¬Smoke(y)   3
class Person { val smokes = Flip(0.6) }
val alice, bob, clara = new Person
alice.smokes.condition(true)
val friends = List((alice, bob), (bob, clara))
def constraint(pair: (Boolean, Boolean)) = if (pair._1 == pair._2) 3.0 else 1.0
for { (p1, p2) <- friends } Pair(p1.smokes, p2.smokes).constrain(constraint)
Markov Logic
First-Order Logic as a Template to Define CRF Parameters
[Richardson & Domingos 2005] [Paskin & Russell 2002] [Taskar et al 2003]
[Figure: ground Markov network]
Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity.
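The O(n^r) blow-up is easy to see concretely: a clause of arity r over n constants yields n^r ground instances. A tiny illustration (the constants and the `Friends` relation are just an example):

```python
from itertools import product

def num_groundings(n_constants, arity):
    """Ground instances of one clause: every arity-length tuple of constants."""
    return n_constants ** arity

constants = ["alice", "bob", "clara"]
# e.g. Friends(x, y) has arity 2, so 3**2 = 9 ground atoms
groundings = list(product(constants, repeat=2))
```

This is the cost the imperative, lazily-unrolled approach on the following slides is designed to avoid.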
My Approach
• I’m going to immediately dismiss the generative models.
- Interesting, but not what performs best in NLP.
Want
• Discriminatively trained factor graphs.
• Best previous example of this: Markov Logic.
Logic + Probability
• Significant interest in this combination
- Poole, Muggleton, De Raedt, Sato, Domingos, ...
• We now hypothesize that in much of this previous work the “logic” aspect is mostly a red herring.
- Power: repeated relational structures and tied parameters.
- Logic is one way to specify these structures, but not the only one, and perhaps not the best.
- In deterministic programming, Prolog was replaced by imperative languages
  ✦ programmers have to keep the imperative solver in mind after all
  ✦ much domain knowledge is procedural anyway
- Logical inference replaced by probabilistic inference.
Declarative Model Specification
• One of the biggest advances in AI & ML
• Gone too far? Much domain knowledge is also procedural.
• Logic + Probability → Imperative + Probability
- Rising interest: Church, Infer.NET, ...
• Our approach
- Preserve the declarative statistical semantics of factor graphs.
- Provide imperative hooks to define structure, parameterization, inference, learning. Efficient. Easy to use.
- “Imperatively-Defined Factor Graphs” (IDFs)
Our Design Goals
• Represent factor graphs
- emphasis on discriminative undirected models
• Scalability
- input data, output configuration, factors, tree-width
- observed data that cannot fit in memory
- super-exponential number of factors
• Efficient discriminative parameter estimation
- sensitive to the expense of inference
• Leverage object-oriented benefits
- modularity, encapsulation, inheritance, ...
• Integrate declarative & procedural knowledge
- natural, easy to use
- upcoming slides: 3 examples of injecting imperativism into factor graphs
FACTORIE
• Factor Graphs, Imperative, Extensible
• Implemented as a library in Scala [Martin Odersky]
- object oriented & functional
- type inference
- lazy evaluation
- everything an object (int, float, ...)
- nice syntax for creating “domain-specific languages”
- runs in JVM (complete interoperation with Java)
- “Haskell++ in a Java style”
• Library, not a new “little language”
- all familiar Java constructs & libraries available to you
- integrate data pre-processing & evaluation with model specification
- Scala makes the syntax not too bad
- but not as compact as a dedicated language (BLOG, MLN)
Stages of FACTORIE programming
1. Define templates for data (i.e. classes)
- Use data structures just like in deterministic programming.
- Only special requirement: provide “undo” capability for changes.
2. Define templates for factors
- Distinct from the data representation above; makes it easy to modify model scores independently.
- Use & transform the data’s natural relations to define the factors’ relations.
3. Optionally, define MCMC proposal functions that leverage domain knowledge.
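The "undo" requirement in step 1 can be sketched outside Scala: a variable records its old value in a diff when assigned, so a rejected MCMC proposal rolls back without copying the whole configuration. The class and function names here are illustrative, not FACTORIE's API:

```python
class Variable:
    """A mutable variable whose assignments can be undone via a diff list."""
    def __init__(self, value):
        self.value = value
    def set(self, new_value, diff):
        diff.append((self, self.value))  # remember the old value for undo
        self.value = new_value

def undo(diff):
    """Roll back all recorded changes, most recent first."""
    for var, old in reversed(diff):
        var.value = old

label = Variable("T")
diff = []
label.set("F", diff)   # tentative change made by a proposal
undone = False
undo(diff)             # proposal rejected: roll the change back
```

Recording changes in a diff is also what makes the "only re-score the changed factors" acceptance ratio from the earlier slides possible.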
Scala
• New variable
var myHometown : String
var myAltitude = 10523.2
• New constant
val myName = "Andrew"
• New method
def climb(increment: Double) = myAltitude += increment
• New class
class Skier extends Person
• New trait (like a Java interface with implementations)
trait FirstAid { def applyBandage = ... }
• New class with trait
class BackcountrySkier extends Skier with FirstAid
• New static object [generic]
object GlobalSkierTable extends ArrayList[Skier]
Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg)
class Token(word: String) extends EnumVariable(word)

[Figure: label sequence T F F T F F over the words “Bill loves skiing Tom loves snowshoeing”]
Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq
class Token(word: String) extends EnumVariable(word) with VarSeq

VarSeq gives each variable its sequence neighbors: label.prev, label.next
Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label : Label
}

Avoid representing relations by indices. Do it directly with members, pointers... arbitrary data structures.
Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
```
Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing
![Page 51: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/51.jpg)
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token]
```
Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing
![Page 52: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/52.jpg)
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token]
object StateTransitionTemplate extends Template2[Label,Label]
```
Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing
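To make the three templates concrete, here is a toy Python sketch of what they compute: a per-label score, a label-token score, and a label-label transition score, summed over the chain. The dictionary representation and all weight values are made up for illustration; this is not FACTORIE code.

```python
# Illustrative toy weights (hypothetical values, not learned parameters).
state_w = {True: 0.5, False: 0.0}                            # StateTemplate
state_token_w = {(True, "Bill"): 2.0, (True, "Tom"): 2.0}    # StateTokenTemplate
trans_w = {(True, False): 1.0, (False, False): 0.5,
           (False, True): 0.8, (True, True): -2.0}           # StateTransitionTemplate

def score(words, labels):
    """Sum of log-linear factor scores over the whole chain."""
    s = sum(state_w[l] for l in labels)
    s += sum(state_token_w.get((l, w), 0.0) for l, w in zip(labels, words))
    s += sum(trans_w[(a, b)] for a, b in zip(labels, labels[1:]))
    return s

words  = ["Bill", "loves", "skiing", "Tom", "loves", "snowshoeing"]
labels = [True, False, False, True, False, False]            # T F F T F F
print(score(words, labels))
```

The total is just the sum of every factor's score for the T F F T F F labeling.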
![Page 53: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/53.jpg)
Imperativ-ism #1: Jump Function
• Proposal “jump function”
  – Make changes to world state
• Sometimes simple, sometimes not
  – Sample Gaussian with mean at old value
  – Sample cluster to split, run stochastic greedy agglomerative clustering
• Gibbs sampling, one variable at a time
  – poor mixing
• Rich jump function
  – Natural place to embed domain knowledge about what variables should change in concert.
![Page 54: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/54.jpg)
Imperativ-ism #1: Jump Function
• Proposal “jump function”
  – Make changes to world state
• Sometimes simple, sometimes not
  – Sample Gaussian with mean at old value
  – Sample cluster to split, run stochastic greedy agglomerative clustering
• Gibbs sampling, one variable at a time
  – poor mixing
• Rich jump function
  – Natural place to embed domain knowledge about what variables should change in concert.
  – Avoid some expensive deterministic factors with property-preserving jump functions (e.g. coref transitivity, dependency-parsing projectivity)
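As a sketch of the idea (a hypothetical Python representation, not the FACTORIE API): below, a minimal single-variable flip sits next to a richer jump that slides a segment boundary, changing two labels in concert and returning the list of changed variables.

```python
import random

def single_flip(labels, rng):
    """Minimal jump: flip one label."""
    i = rng.randrange(len(labels))
    labels[i] = not labels[i]
    return [i]                          # diff: one variable changed

def shift_boundary(labels, rng):
    """Richer jump: move a segment-start label one position left or right."""
    # candidate segment starts (True labels away from the sequence edges)
    starts = [i for i, l in enumerate(labels) if l and 0 < i < len(labels) - 1]
    if not starts:
        return single_flip(labels, rng)
    i = rng.choice(starts)
    j = i + rng.choice([-1, 1])
    labels[i], labels[j] = False, True
    return [i, j]                       # diff: two variables changed together

rng = random.Random(0)
labels = [True, False, False, True, False, False]
diff = shift_boundary(labels, rng)
print(diff, labels)
```

Returning the diff is what makes the next step (scoring only changed factors) cheap.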
![Page 55: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/55.jpg)
Key Operation: Scoring a Proposal
• Acceptance probability ∝ ratio of model scores. Scores of factors that didn’t change cancel.
• To efficiently score:
  – Proposal method runs.
![Page 56: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/56.jpg)
Key Operation: Scoring a Proposal
• Acceptance probability ∝ ratio of model scores. Scores of factors that didn’t change cancel.
• To efficiently score:
  – Proposal method runs.
  – Automatically build a list of variables that changed.
![Page 57: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/57.jpg)
Key Operation: Scoring a Proposal
• Acceptance probability ∝ ratio of model scores. Scores of factors that didn’t change cancel.
• To efficiently score:
  – Proposal method runs.
  – Automatically build a list of variables that changed.
  – Find factors that touch the changed variables.
![Page 58: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/58.jpg)
Key Operation: Scoring a Proposal
• Acceptance probability ∝ ratio of model scores. Scores of factors that didn’t change cancel.
• To efficiently score:
  – Proposal method runs.
  – Automatically build a list of variables that changed.
  – Find factors that touch the changed variables.
  – Find the other (unchanged) variables needed to calculate those factors’ scores.
![Page 59: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/59.jpg)
Key Operation: Scoring a Proposal
• Acceptance probability ∝ ratio of model scores. Scores of factors that didn’t change cancel.
• To efficiently score:
  – Proposal method runs.
  – Automatically build a list of variables that changed.
  – Find factors that touch the changed variables.
  – Find the other (unchanged) variables needed to calculate those factors’ scores.
![Page 60: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/60.jpg)
Key Operation: Scoring a Proposal
• Acceptance probability ∝ ratio of model scores. Scores of factors that didn’t change cancel.
• To efficiently score:
  – Proposal method runs.
  – Automatically build a list of variables that changed.
  – Find factors that touch the changed variables.
  – Find the other (unchanged) variables needed to calculate those factors’ scores.
• How to find factors from variables & vice versa?
  – In BLOG, a rich, highly indexed data structure stores the mapping variables ←→ factors.
  – But it is complex to maintain as the structure changes.
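The procedure above can be sketched on a toy transition-factor model (assumed weights; not FACTORIE code): rescore only the factors touching changed variables, in both the old and new state, and accept with the Metropolis-Hastings ratio of the two local scores.

```python
import math, random

# Hypothetical transition weights (log-space), reused for illustration.
trans_w = {(True, False): 1.0, (False, False): 0.5,
           (False, True): 0.8, (True, True): -2.0}

def local_score(labels, changed):
    """Score only the transition factors that touch a changed index."""
    factors = set()
    for i in changed:
        if i > 0:
            factors.add(i - 1)          # factor over (i-1, i)
        if i < len(labels) - 1:
            factors.add(i)              # factor over (i, i+1)
    return sum(trans_w[(labels[i], labels[i + 1])] for i in factors)

def accept(old, new, changed, rng):
    # diff of local scores equals diff of full scores: unchanged factors cancel
    diff = local_score(new, changed) - local_score(old, changed)
    return rng.random() < min(1.0, math.exp(diff))

old = [True, False, False, True, False, False]
new = old.copy(); new[3] = False        # proposal flipped index 3
print(accept(old, new, [3], random.Random(1)))
```

Only two factors are ever touched per single-flip proposal, regardless of chain length.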
![Page 61: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/61.jpg)
Imperativ-ism #2: Model Structure
• Maintain no map structure between factors and variables.
• Finding factors is easy. Usually # templates < 50.
• Primitive operation: given a factor template and one changed variable, find the other variables.
• In the factor Template object, define imperative methods that do this:
  – unroll1(v1) returns (v1,v2,v3)
  – unroll2(v2) returns (v1,v2,v3)
  – unroll3(v3) returns (v1,v2,v3)
  – I.e., use a Turing-complete language to determine structure on the fly.
  – If you want to use a data structure instead, access it in the method.
  – If you want a higher-level language for specifying structure, write it in terms of this primitive.
• Other nice attribute
  – Easy to do value-conditioned structure. Case Factor Diagrams, etc.
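A minimal Python rendering of the unroll primitive (illustrative only; FACTORIE's actual Template classes differ): given one changed variable, the template computes on the fly which factors touch it, so no variable↔factor map is ever stored.

```python
class TransitionTemplate:
    """Toy stand-in for a Template2[Label,Label] over a chain."""
    def unroll(self, chain, i):
        # a changed label at position i touches factors (i-1, i) and (i, i+1)
        factors = []
        if i > 0:
            factors.append((i - 1, i))
        if i < len(chain) - 1:
            factors.append((i, i + 1))
        return factors

templates = [TransitionTemplate()]

def factors_touching(chain, changed):
    """Dispatch each changed variable to every template's unroll method."""
    out = []
    for t in templates:
        for i in changed:
            out.extend(t.unroll(chain, i))
    return out

print(factors_touching([True, False, False, True], [0, 3]))  # → [(0, 1), (2, 3)]
```

Because unroll is ordinary code, it can consult any data structure or condition on current values.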
![Page 62: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/62.jpg)
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token]
object StateTransitionTemplate extends Template2[Label,Label]
```
![Page 63: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/63.jpg)
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token]
object StateTransitionTemplate extends Template2[Label,Label]
```
![Page 64: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/64.jpg)
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
}
object StateTransitionTemplate extends Template2[Label,Label]
```
![Page 65: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/65.jpg)
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) = new Error // Tokens shouldn’t change
}
object StateTransitionTemplate extends Template2[Label,Label]
```
![Page 66: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/66.jpg)
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) = new Error // Tokens shouldn’t change
}
object StateTransitionTemplate extends Template2[Label,Label] {
  def unroll1(label:Label) = Factor(label, label.next)
}
```
![Page 67: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/67.jpg)
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) = new Error // Tokens shouldn’t change
}
object StateTransitionTemplate extends Template2[Label,Label] {
  def unroll1(label:Label) = Factor(label, label.next)
  def unroll2(label:Label) = Factor(label.prev, label)
}
```
![Page 68: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/68.jpg)
Imperativ-ism #3: Neighbor-Sufficient Map
• “Neighbor variables” of a factor
  – Variables touching the factor
• “Sufficient statistics” of a factor
  – Vector whose dot product with the weights of a log-linear factor → the factor’s score
• Usually confounded. Separate them.
• Skip-chain NER: instead of 5×5 parameters, just 2.
  (label1, label2) → label1 == label2
Labels: PER  O     LOC   PER  O   O
Words:  Bill loves Paris Bill the painter ...
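A toy Python sketch of the separation (not FACTORIE code): the skip-chain factor's neighbors are two NER labels drawn from five values, but its sufficient statistic is just the boolean label1 == label2, so two parameters suffice instead of 5×5 = 25. The weight values are made up.

```python
# One weight per sufficient-statistic value, not per neighbor-value pair.
skip_w = {True: 1.5, False: -0.5}      # hypothetical weights

def skip_score(label1, label2):
    """Neighbors: two 5-valued labels. Statistic: one boolean."""
    stat = (label1 == label2)          # sufficient statistic of the factor
    return skip_w[stat]

print(skip_score("PER", "PER"))        # → 1.5
print(skip_score("PER", "LOC"))        # → -0.5
```

The same two neighbors could instead emit the full (label1, label2) pair as the statistic, which is exactly the confounded 25-parameter case.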
![Page 69: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/69.jpg)
Example: Linear-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) = new Error // Tokens shouldn’t change
}
object StateTransitionTemplate extends Template2[Label,Label] {
  def unroll1(label:Label) = Factor(label, label.next)
  def unroll2(label:Label) = Factor(label.prev, label)
}
```
![Page 70: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/70.jpg)
Example: Skip-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends TemplateWithStatistics1[Label]
object StateTokenTemplate extends TemplateWithStatistics2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) = new Error // Tokens shouldn’t change
}
object StateTransitionTemplate extends TemplateWithStatistics2[Label,Label] {
  def unroll1(label:Label) = Factor(label, label.next)
  def unroll2(label:Label) = Factor(label.prev, label)
}
object SkipTemplate extends Template2[Label,Label] with Statistics1[Bool]
```
![Page 71: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/71.jpg)
Example: Skip-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends TemplateWithStatistics1[Label]
object StateTokenTemplate extends TemplateWithStatistics2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) = new Error // Tokens shouldn’t change
}
object StateTransitionTemplate extends TemplateWithStatistics2[Label,Label] {
  def unroll1(label:Label) = Factor(label, label.next)
  def unroll2(label:Label) = Factor(label.prev, label)
}
object SkipTemplate extends Template2[Label,Label] with Statistics1[Bool] {
  def unroll1(label:Label) =
    for (other <- label.seq; if (label.token == other.token)) yield Factor(label, other)
}
```
![Page 72: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/72.jpg)
Example: Skip-Chain CRF for Segmentation
```scala
class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends TemplateWithStatistics1[Label]
object StateTokenTemplate extends TemplateWithStatistics2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) = new Error // Tokens shouldn’t change
}
object StateTransitionTemplate extends TemplateWithStatistics2[Label,Label] {
  def unroll1(label:Label) = Factor(label, label.next)
  def unroll2(label:Label) = Factor(label.prev, label)
}
object SkipTemplate extends Template2[Label,Label] with Statistics1[Bool] {
  def unroll1(label:Label) =
    for (other <- label.seq; if (label.token == other.token)) yield Factor(label, other)
  def statistics(label1:Label, label2:Label) = Stat(label1 == label2)
}
```
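The SkipTemplate unroll can be rendered as a Python comprehension (illustrative sketch; here tokens are plain strings, whereas the slide compares token variables): when a label changes, yield one skip factor for every other position with an identical token.

```python
def unroll_skip(tokens, i):
    """Skip factors for position i: pair it with every identical token."""
    return [(i, j) for j, w in enumerate(tokens)
            if j != i and w == tokens[i]]

tokens = ["Bill", "loves", "Paris", "Bill", "the", "painter"]
print(unroll_skip(tokens, 0))          # → [(0, 3)]
```

Note the number of factors depends on the data, which is exactly why this structure is easiest to state imperatively.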
![Page 73: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/73.jpg)
Example: Dependency Parsing

```scala
class Word(str:String) extends EnumVariable(str)
class Node(word:Word, parent:Node) extends PrimitiveVariable(parent)

object ChildParentTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n:Node) = Stat(n.word, n.parent.word)
}

object NearestVerbTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n:Node) = Stat(n.word, closestVerb(n).word)
  def closestVerb(n:Node): Node = if (isVerb(n.word)) n else closestVerb(n.parent)
  def unroll1(n:Node) = n.selfAndDescendants
}
```
![Page 74: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/74.jpg)
Extensibility
• Many variable types provided:
  - boolean, int, float, String, categorical, ...
• Create new ones!
  - set-valued variable
  - finite-state machine as a variable [JHU]
• Create new factor types
  - Poisson, Dirichlet, ...
![Page 75: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/75.jpg)
Experimental Results
• Joint segmentation & coreference of research paper citations
  - 1295 mentions, 134 entities, 36,487 tokens
• Compare with MLNs (Alchemy)
  - Same observable features
• FACTORIE results:
  - ~25% reduction in error (segmentation & coref)
  - 3-20x faster
  - coref results shown in the table on the slide
![Page 76: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/76.jpg)
FACTORIE Summary
• Factor graphs,
• ...object-oriented
  – data types and factor template types, with inheritance
• ...scalable
  – factors created on demand, only score diffs
• ...with imperative hooks
  – jump function, override variable.set() for coordination
  – model structure
  – neighbor variables → sufficient statistics
• ...discriminative
  – efficient online training by SampleRank
  – generative modeling also provided (LDA = ~12 lines)
• Combine declarative & procedural knowledge
![Page 77: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/77.jpg)
Conclusion: Some Reasons To Use Probabilistic Programming
• Simple
  - Save time & avoid debugging complex, hand-built ML code.
  - Say exactly what you want in the way you want to say it.
• Flexible
  - Encourage research exploration by making it easier to try new modeling ideas.
  - The language provides the right “hinge-points” to give the flexibility you want, without the underlying cruft.
• Glue that binds many reasoning paradigms together.
• Allows probabilistic modeling to be integrated with all the other traditional deterministic programming.
Some of this text from Avi Pfeffer
![Page 78: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/78.jpg)
Key Questions for Probabilistic Programming
• What are good design patterns for probabilistic programming?
• What are the skills required to be an effective probabilistic programmer?
• How can probabilistic programmers work well with domain experts and end users?
• What kind of tools can we develop to support probabilistic programming (debuggers, profilers etc.)?
• How can probabilistic programs be learned (especially structure)?
• How can we make inference more efficient (especially in memory) to scale up to even larger domains?
Some of this text from Avi Pfeffer
![Page 79: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/79.jpg)
![Page 80: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/80.jpg)
![Page 81: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/81.jpg)
Markov Chain Monte Carlo
• Don’t represent all alternatives... just one at a time.
• Stochastic jump drawn from a proposal distribution.
(figure: two alternative coreference configurations over mentions “she” and “Amy Hall”)
![Page 82: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/82.jpg)
M-H Natural Efficiencies

1. Partition function cancels:

$$
\frac{p(y')}{p(y)}
= \frac{p(Y = y' \mid x;\theta)}{p(Y = y \mid x;\theta)}
= \frac{\frac{1}{Z_x}\prod_{y'_i \in y'} \psi(x, y'_i)}
       {\frac{1}{Z_x}\prod_{y_i \in y} \psi(x, y_i)}
= \frac{\prod_{y'_i \in y'} \psi(x, y'_i)}
       {\prod_{y_i \in y} \psi(x, y_i)}
$$

2. Unchanged factors cancel:

$$
\frac{\prod_{y'_i \in y'} \psi(x, y'_i)}
     {\prod_{y_i \in y} \psi(x, y_i)}
= \frac{\Bigl(\prod_{y'_i \in \delta y'} \psi(x, y'_i)\Bigr)
        \Bigl(\prod_{y'_i \in y' \setminus \delta y'} \psi(x, y'_i)\Bigr)}
       {\Bigl(\prod_{y_i \in \delta y} \psi(x, y_i)\Bigr)
        \Bigl(\prod_{y_i \in y \setminus \delta y} \psi(x, y_i)\Bigr)}
= \frac{\prod_{y'_i \in \delta y'} \psi(x, y'_i)}
       {\prod_{y_i \in \delta y} \psi(x, y_i)}
$$

where $\delta y$ is the “diff”, i.e. the variables in $y$ that have changed.

(UMass, Amherst) SampleRank vs. Contrastive Divergence, IESL 5-6 / 27
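A toy numeric check of the two cancellations (the log-factor values are made up): the ratio computed from all factors agrees with the ratio computed from only the changed factors, and the partition function never appears at all.

```python
import math

def full_score(log_psis):
    """Sum of log-factor scores, i.e. the unnormalized log probability."""
    return sum(log_psis)

old_psis = [0.3, -1.2, 0.7, 2.0]       # log ψ values for the factors of y
new_psis = [0.3, -1.2, 0.9, 1.1]       # only factors 2 and 3 changed in y'

full_ratio  = math.exp(full_score(new_psis) - full_score(old_psis))
local_ratio = math.exp((0.9 + 1.1) - (0.7 + 2.0))   # δy factors only
print(abs(full_ratio - local_ratio) < 1e-12)         # → True
```

The first two factors are identical in y and y', so they cancel exactly from the ratio.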
How to learn parameters for M-H?

SampleRank: motivation
• Question: how do we learn θ for M-H?
• Problem with traditional ML: inference sits in the inner-most loop of learning
  – maximum likelihood requires inference for marginals
  – perceptron requires inference for decoding
• Want: push updates (not inference) into the inner-most loop of learning
• Idea: use M-H as a guide, exploit its efficiency, and learn to rank neighboring samples during the random walk
![Page 83: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/83.jpg)
Parameter Estimation in Large State Spaces
• Most methods require calculating the gradient of the log-likelihood, P(y1, y2, y3,... | x1, x2, x3,...)...
• ...which in turn requires “expectations of marginals,” P(y1 | x1, x2, x3,...)
• But getting marginal distributions by sampling can be inefficient due to the large sample space.
• Alternative: perceptron. Approximate the gradient from the difference between the true output and the model’s predicted best output.
• But even finding the model’s predicted best output is expensive.
• We propose: SampleRank [Culotta, Wick, Hall, McCallum, HLT 2007]. Learn to rank intermediate solutions: P(y1=1, y2=0, y3=1,... | ...) > P(y1=0, y2=0, y3=1,... | ...)
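A simplified sketch of the update rule this implies (an illustration of the ranking idea, not the exact published algorithm): during the sampler's walk, whenever the model's ranking of (current, proposal) disagrees with the ground truth's ranking, apply a perceptron-style update on the difference of their feature vectors.

```python
def dot(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def samplerank_update(w, f_cur, f_prop, truth_cur, truth_prop, lr=1.0):
    """Update weights only when model and truth rank the pair differently."""
    model_prefers_prop = dot(w, f_prop) > dot(w, f_cur)
    truth_prefers_prop = truth_prop > truth_cur
    if model_prefers_prop != truth_prefers_prop:
        sign = 1.0 if truth_prefers_prop else -1.0
        for k in set(f_prop) | set(f_cur):
            w[k] = w.get(k, 0.0) + lr * sign * (f_prop.get(k, 0.0) - f_cur.get(k, 0.0))
    return w

w = {"match": 0.0}
# truth says the proposal is better, but the (all-zero) model does not prefer it:
w = samplerank_update(w, {"match": 0.0}, {"match": 1.0},
                      truth_cur=0.5, truth_prop=0.8)
print(w)                               # → {'match': 1.0}
```

No inference happens inside the loop: each update needs only the two neighboring samples the walk already produced.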
![Page 84: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/84.jpg)
Ranking vs Classification Training
• Instead of training
  [Powell, Mr. Powell, he] → YES
  [Powell, Mr. Powell, she] → NO
• ...rather...
  [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
• In general, the higher-ranked example may contain errors
  [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
![Page 85: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/85.jpg)
Ranking Intermediate Solutions: Example

1. (initial configuration)
2. ∆ Model = -23, ∆ Truth = -0.2
3. ∆ Model = 10, ∆ Truth = -0.1 → UPDATE (model and truth rank the pair differently)
4. ∆ Model = -10, ∆ Truth = -0.1
5. ∆ Model = -3, ∆ Truth = 0.3 → UPDATE (model and truth rank the pair differently)

• Like Perceptron: proof of convergence under Marginal Separability.
• More constrained than Maximum Likelihood: parameters must correctly rank incorrect solutions!
• Very fast to train.
![Page 86: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/86.jpg)
Comparison to Contrastive Divergence
• Contrastive Divergence, n=2 [Hinton 2002]
• Persistent Contrastive Divergence [Tieleman 2008]
• SampleRank
(figure: which states, proposal vs. truth, supply the sufficient statistics for each method’s update)
![Page 87: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/87.jpg)
SampleRank on Coreference
• ACE 2004
• All nouns. 28,122 mentions, 14,047 entities, e.g. he, the President, Clinton, Mrs. Clinton, Washington

| System | B³ |
|---|---|
| 2005 Ng | 69.5% |
| 2007 Culotta, Wick, Hall, McCallum | 79.3% |
| 2008 Bengtson, Roth | 80.8% |
| 2009 Wick, McCallum MCMC+SampleRank | 81.5% |
| Contrastive Divergence | 75.1% |
| Persistent Contrastive Divergence | 74.9% |
| Perceptron | 76.3% |
![Page 88: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings](https://reader031.fdocuments.in/reader031/viewer/2022022013/5b342dc77f8b9aec518bd498/html5/thumbnails/88.jpg)