Transcript of "Probabilistic Programming Practical", Frank Wood and Brooks Paige, MLSS 2015 (slides: fwood/talks/2015/mlss-2015..., 39 pages).

Page 1:

Probabilistic Programming Practical

Frank Wood, Brooks Paige {fwood,brooks}@robots.ox.ac.uk

MLSS 2015

Page 2:

Setup

Page 3:

Java Installation (Java > v. 1.5)

Mac and Windows: download and run the installer from https://www.java.com/en/download/manual.jsp

Linux:

# Debian/Ubuntu
sudo apt-get install default-jre

# Fedora
sudo yum install java-1.7.0-openjdk

Page 4:

Leiningen Installation (Leiningen v. > 2.0)

# Download lein to ~/bin
mkdir ~/bin
cd ~/bin
wget http://git.io/XyijMQ

# Make executable
chmod a+x ~/bin/lein

# Add ~/bin to path
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc

# Run lein
lein

Further details: http://leiningen.org/

Page 5:

Practical Materials

• Download https://bitbucket.org/probprog/mlss2015/get/master.zip
• cd mlss2015
• lein gorilla
• open the url

Page 6:

Schedule

• 15:35 - 16:05 Intro / Hello World!

• 16:05 - 16:30 Gaussian (you code)

• 16:30 - 16:40 Discuss / intro to physics problem

• 16:40 - 16:55 Physics (you code)

• 16:55 - 17:00 Share / discuss solutions

• 17:00 - 17:20 Inference explanation

• 17:20 - 17:45 Poisson (you code)

• 17:45 - 17:50 Inference Q/A

• 17:50 - 18:05 Coordination (you code)

Page 7:

What is probabilistic programming?

Page 8:

An Emerging Field

[Venn diagram: probabilistic programming sits at the intersection of ML (algorithms and applications), statistics (inference and theory), and PL (compilers, semantics, analysis).]

Page 9:

Conceptualization

[Diagram: in CS, parameters go into a program which produces output. In probabilistic programming, a program relates parameters x to observations y, and the statistics view runs this backwards: given the joint p(y|x)p(x), inference computes the posterior p(x|y).]

Page 10:

Operative Definition

"Probabilistic programs are usual functional or imperative programs with two added constructs:

(1) the ability to draw values at random from distributions, and

(2) the ability to condition values of variables in a program via observations.”

Gordon et al, 2014
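To make the two constructs concrete, here is a minimal sketch in Anglican's Clojure-embedded query language, which the practical uses later; the namespace layout and primitive names are assumptions based on Anglican's documentation rather than anything on this slide.

;; Minimal sketch of Gordon et al.'s two constructs in an Anglican-style query.
;; Inside defquery, sample and observe are the two added constructs; namespaces
;; and the normal distribution constructor are assumptions from Anglican's docs.
(ns two-constructs
  (:require [anglican.emit :refer [defquery]]
            [anglican.runtime :refer [normal]]))

(defquery one-latent [y]
  (let [x (sample (normal 0 1))]   ; (1) draw a value at random from a distribution
    (observe (normal x 1) y)       ; (2) condition on the observed value y
    x))                            ; inference characterizes x given y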

Page 11:

What are the goals of probabilistic programming?

Page 12:

Simplify Machine Learning… Start

[Flowchart: the traditional machine learning workflow, with boxes "Identify And Formalize Problem, Gather Data", "Design Model = Read Papers, Do Math", "Implement Using High Level Modeling Tool", "Choose Approximate Inference Algorithm", "Search For Useable Implementation", "Derive Updates And Code Inference Algorithm", "Test", "Scale", "Deploy", "End", and decision nodes "Existing Model Sufficient?", "Simple Model?", "Tool Supports Required Features?", "Exists And Can Use?", "Feasible?", "Performs Well Statistically?", "Performs Well Computationally?". Legend: edge color indicates the skills required to traverse the edge: PhD-level machine learning or statistics, PhD-level machine learning or computer science, or non-specialist.]

Page 13:

To This… Start

[Flowchart: the same workflow with probabilistic programming. "Design Model = Write Probabilistic Program" replaces the modeling step, alongside boxes "Identify And Formalize Problem, Gather Data", "Choose Approximate Inference Algorithm", "Implement Using High Level Modeling Tool", "Search For Useable Implementation", "Derive Updates And Code Inference Algorithm", "Debug, Test, Profile", "Scale", "Deploy", "End", and decision nodes "Existing Model Sufficient?", "Simple Model?", "Tool Supports Required Features?", "Exists And Can Use?", "Feasible?", "Performs Well?", "Performs Well Computationally?". Legend: edge color indicates the skills required to traverse the edge; here every edge is non-specialist.]

Page 14:

Automate Inference

[Layered diagram: models / stochastic simulators are written against a programming language representation / abstraction layer, beneath which sit one or more inference engines.]

[Backdrop: an excerpt from Caron et al. on time-varying Pitman-Yor process mixture models, including Section 3.2 "Stationary Models for Cluster Locations", Section 3.3 "Properties of the Models", and Figure 1: "A representation of the time-varying Pitman-Yor process mixture as a directed graphical model, representing conditional independencies between variables. All assignment variables and observations at time t are denoted ct and zt, respectively."]

Gaussian Mixture Model

Figure: from left to right, graphical models for a finite Gaussian mixture model (GMM), a Bayesian GMM, and an infinite GMM.

$$c_i \mid \vec\pi \sim \mathrm{Discrete}(\vec\pi), \qquad y_i \mid c_i = k;\,\Theta \sim \mathrm{Gaussian}(\cdot \mid \theta_k)$$

$$\vec\pi \mid \alpha \sim \mathrm{Dirichlet}\big(\cdot \mid \tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\big), \qquad \Theta \sim G_0$$

(Wood, University of Oxford, "Unsupervised Machine Learning", January 2014, slide 16/19)

Latent Dirichlet Allocation

Figure 1: graphical model for the LDA model (words w_di and assignments z_di for i = 1 … N_d within documents d = 1 … D; topics φ_k for k = 1 … K with hyperparameter β; topic mixtures θ_d).

LDA is a hierarchical model used to model text documents. Each document is modeled as a mixture of topics. Each topic is defined as a distribution over the words in the vocabulary. Here, we will denote by K the number of topics in the model. We use D to indicate the number of documents, M to denote the number of words in the vocabulary, and N_d to denote the number of words in document d. We will assume that the words have been translated to the set of integers {1, …, M} through the use of a static dictionary. This is for convenience only and the integer mapping will contain no semantic information. The generative model for the D documents can be thought of as sequentially drawing a topic mixture θ_d for each document independently from a Dir_K(α·1) distribution, where Dir_K(β) is a Dirichlet distribution over the K-dimensional simplex with parameters [β_1, β_2, …, β_K]. Each of the K topics {φ_k}, k = 1 … K, is drawn independently from Dir_M(β·1). Then, for each of the i = 1 … N_d words in document d, an assignment variable z_di is drawn from Mult(θ_d). Conditional on the assignment variable z_di, word i in document d, denoted w_di, is drawn independently from Mult(φ_{z_di}). The graphical model for the process can be seen in Figure 1.

The model is parameterized by the vector-valued parameters {θ_d}, d = 1 … D, and {φ_k}, k = 1 … K, the assignment variables {z_di}, and the scalar positive parameters α and β. The model is formally written as:

$$\theta_d \sim \mathrm{Dir}_K(\alpha \vec 1), \qquad \phi_k \sim \mathrm{Dir}_M(\beta \vec 1), \qquad z_{di} \sim \mathrm{Discrete}(\theta_d), \qquad w_{di} \sim \mathrm{Discrete}(\phi_{z_{di}})$$

(Wood, University of Oxford, "Unsupervised Machine Learning", January 2014, slide 15/19)
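To connect this generative story to the talk's theme of models as programs, here is a plain-Clojure sketch of the word-level part of the LDA generative process (drawing z_di and then w_di). It takes θ_d and the topics φ as inputs instead of sampling them from their Dirichlet priors, and all helper names are illustrative only.

;; Plain-Clojure sketch of LDA's word-level generative step: given a document's
;; topic mixture theta-d and the topics phi, draw z_di then w_di.
;; (Sampling theta and phi from their Dirichlet priors is omitted for brevity.)
(defn sample-discrete
  "Draw an index 0..(count probs - 1) with the given probabilities."
  [probs]
  (let [r (rand)]
    (loop [i 0 acc 0.0]
      (let [acc (+ acc (nth probs i))]
        (if (or (< r acc) (= i (dec (count probs))))
          i
          (recur (inc i) acc))))))

(defn sample-word
  "Draw one word: z_di ~ Discrete(theta-d), then w_di ~ Discrete(phi_{z_di})."
  [theta-d phi]
  (let [z (sample-discrete theta-d)]
    (sample-discrete (nth phi z))))

(defn generate-document
  "Draw n-words words (as vocabulary indices) for one document."
  [theta-d phi n-words]
  (vec (repeatedly n-words #(sample-word theta-d phi))))

;; Example: K = 2 topics over a vocabulary of M = 3 word types.
(generate-document [0.7 0.3]
                   [[0.8 0.1 0.1]    ; topic 0
                    [0.1 0.1 0.8]]   ; topic 1
                   10)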

Page 15:

Hello World!

Page 16:

First Exercise: Gaussian Unknown Mean

• Learning objectives
  1. Clojure
  2. Gorilla REPL
  3. Anglican
  4. Automatic inference over generative models expressed as programs via query (a sketch follows after this list)
• Resources
  • https://clojuredocs.org/
  • https://bitbucket.org/probprog/anglican/
  • http://www.robots.ox.ac.uk/~fwood/anglican/index.html
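As a preview of the kind of program this exercise builds, here is a hedged sketch of a Gaussian-with-unknown-mean query and one way to run inference over it with Anglican's doquery. The prior, data, and API details (namespaces, the :result key on returned samples) are assumptions based on Anglican's documentation; the worksheet in the practical materials is the authoritative version.

;; Hedged sketch: Gaussian with unknown mean.  mu has a Normal(1, sqrt 5) prior;
;; two observations, 9 and 8, are modeled as Normal(mu, sqrt 2).
(ns gaussian-sketch
  (:require [anglican.core :refer [doquery]]
            [anglican.emit :refer [defquery]]
            [anglican.runtime :refer [normal]]))

(defquery gaussian-unknown-mean [y1 y2]
  (let [mu (sample (normal 1 (Math/sqrt 5)))]
    (observe (normal mu (Math/sqrt 2)) y1)
    (observe (normal mu (Math/sqrt 2)) y2)
    mu))

;; Run lightweight Metropolis-Hastings and average the returned samples;
;; for this prior and data the exact posterior mean is 7.25.
(def mu-samples
  (->> (doquery :lmh gaussian-unknown-mean [9.0 8.0])
       (map :result)
       (take 10000)))

(double (/ (reduce + (drop 2000 mu-samples)) 8000))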

Page 17:

Simulation

Page 18:

Second Exercise: Learning objectives

1. Develop experience thinking about expressing problems as inference over program executions
2. Understand how to perform inference over a complex deterministic generative process, here a 2D physics simulator

Page 19:

Second Exercise

Use inference to solve a mechanism design optimization task:
• get all balls safely in bin

Page 20:

Inference

Page 21:

Trace Probability

• observe data points y_n
• internal random choices x_n (each grouping the program's draws x_{n,1}, x_{n,2}, …)
• simulate from f(x_n | x_{1:n-1}) by running the program forward
• weight traces by observes g(y_n | x_{1:n})

(A plain-Clojure sketch of this weighting follows below.)

$$p(y_{1:N}, x_{1:N}) = \prod_{n=1}^{N} g(y_n \mid x_{1:n})\, f(x_n \mid x_{1:n-1})$$

[Figure: a graphical model with latent blocks x_1, x_2, x_3, … (internal random choices x_{1,1}, x_{1,2}, x_{1,3}, x_{2,1}, x_{2,2}, etc.) and observations y_1, y_2, y_3, …, with transition density f(x_n | x_{1:n-1}) and observation density g(y_n | x_{1:n}).]
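The bullets above can be read as a recipe for scoring a single weighted trace, sketched here in plain Clojure for a toy model of my choosing (x_n ~ Normal(x_{n-1}, 1), y_n ~ Normal(x_n, 1)): sample each x_n from f by running forward, and add log g(y_n | x_{1:n}) to the weight at each observe.

;; Plain-Clojure sketch of one weighted execution trace.
(def rng (java.util.Random.))
(defn sample-normal [mu sigma] (+ mu (* sigma (.nextGaussian rng))))

(defn normal-logpdf [x mu sigma]
  (- (- (/ (Math/pow (- x mu) 2) (* 2 sigma sigma)))
     (Math/log (* sigma (Math/sqrt (* 2 Math/PI))))))

(defn weighted-trace
  "Run the generative process forward once against observations ys:
   sample each x_n from f(x_n | x_{n-1}) and add log g(y_n | x_n) to the weight."
  [ys]
  (reduce (fn [{:keys [xs log-w]} y]
            (let [x-prev (or (peek xs) 0.0)
                  x      (sample-normal x-prev 1.0)]          ; x_n ~ f
              {:xs    (conj xs x)
               :log-w (+ log-w (normal-logpdf y x 1.0))}))    ; weight by g
          {:xs [] :log-w 0.0}
          ys))

(weighted-trace [0.5 1.1 0.2])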

Page 22:

Trace

(let [x-1-1 3
      x-1-2 (sample (discrete (range x-1-1)))]
  (if (not= x-1-2 1)
    (let [x-2-1 (+ x-1-2 7)]
      (sample (poisson x-2-1)))))

[Figure: the tree of partial execution traces for this program, branching on the sampled values: x_{1,1} = 3; x_{1,2} ∈ {0, 1, 2}; then x_{2,1} = 7 or x_{2,1} = 9; x_{2,2} = 0, 1, …]

Page 23:

Observe

(let [x-1-1 3
      x-1-2 (sample (discrete (range x-1-1)))]
  (if (not= x-1-2 1)
    (let [x-2-1 (+ x-1-2 7)]
      (sample (poisson x-2-1))
      (observe (gaussian x-2-1 0.0001) 7))))

[Figure: the same trace tree as on the previous page, now with the observe constraining the branches: x_{1,1} = 3; x_{1,2} ∈ {0, 1, 2}; x_{2,1} = 7 or 9; x_{2,2} = 0, 1, …]
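To see how the observe reweights the traces in the tree above, here is a plain-Clojure likelihood-weighting sketch of this particular program. It assumes the second argument of gaussian is a standard deviation and that the observe sits inside the inner let, as reconstructed above; the normal-logpdf helper is repeated so the sketch runs on its own.

;; Likelihood weighting for the example program: run it forward and, whenever
;; the observe is reached, add the log density of the observed value 7 under
;; (gaussian x-2-1 0.0001) to the trace's weight.
(defn normal-logpdf [x mu sigma]
  (- (- (/ (Math/pow (- x mu) 2) (* 2 sigma sigma)))
     (Math/log (* sigma (Math/sqrt (* 2 Math/PI))))))

(defn run-once
  "One forward execution with the observe folded into a log weight."
  []
  (let [x-1-1 3
        x-1-2 (rand-int x-1-1)]                   ; (sample (discrete (range 3)))
    (if (not= x-1-2 1)
      (let [x-2-1 (+ x-1-2 7)]
        {:x-1-2 x-1-2
         :log-w (normal-logpdf 7 x-2-1 0.0001)})  ; (observe (gaussian x-2-1 0.0001) 7)
      {:x-1-2 x-1-2 :log-w 0.0})))                ; branch with no observe

;; Weighted totals per value of x-1-2: essentially all posterior mass lands on
;; x-1-2 = 0, the branch where x-2-1 = 7 matches the observed value.
(let [runs (repeatedly 10000 run-once)]
  (->> runs
       (group-by :x-1-2)
       (map (fn [[v rs]] [v (reduce + (map #(Math/exp (:log-w %)) rs))]))
       (into (sorted-map))))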

Page 24:

"Single Site" MCMC = LMH

Posterior distribution of execution traces is proportional to the trace score with observed values plugged in.

Metropolis-Hastings acceptance rule:

$$\min\left(1, \frac{p(y \mid x')\, p(x')\, q(x \mid x')}{p(y \mid x)\, p(x)\, q(x' \mid x)}\right)$$

• Need: a proposal
• Have: likelihoods (via observe statement restrictions) and the prior (sequence of ERP returns, scored in the interpreter)

Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation [Wingate, Stuhlmüller et al, 2011]

[assume poi-1 (beta 7 4)]
[assume poi-2 (+ 1 (poisson 8))]
[assume generative-model-part-1
  (lambda (a b)
    (if (flip a)
        (+ (poisson b) 7)
        (normal a b)))]
[assume generative-model-part-2 (gamma poi-1 poi-2)]
[observe (normal (generative-model-part-1 poi-1 poi-2) 1) 18]
[observe (normal 0 generative-model-part-2) 6]
[predict (list poi-1 poi-2)]

A particular execution trace is scored by plugging its sampled values into the prior and observation densities:

$$\begin{aligned}
x_{1,1} &\sim \mathrm{Beta}(7, 4)\\
x_{1,2} \mid x_{1,1} = 0.4 &\sim \mathrm{Poisson}(8)\\
x_{1,3} \mid x_{1,2} = 6,\, x_{1,1} = 0.4 &\sim \mathrm{Binomial}(0.4)\\
x_{1,4} \mid x_{1,3} = \mathrm{true},\, x_{1,2} = 6,\, x_{1,1} = 0.4 &\sim \mathrm{Poisson}(4)\\
y_1 = 18 \mid x_{1,4} = 7,\, x_{1,3} = \mathrm{true},\, x_{1,2} = 6,\, x_{1,1} = 0.4 &\sim \mathrm{Normal}(7, 1)\\
x_{2,1} \mid \ldots &\sim \mathrm{Gamma}(0.4, 6)\\
y_2 = 6 \mid x_{2,1} = 6.92, \ldots &\sim \mathrm{Normal}(0, 6.92)
\end{aligned}$$

$$p(x \mid y) \propto p(y = \text{observes},\, x)$$

Page 25:

[Backdrop: excerpt from a manuscript under review by AISTATS 2014.]

Note that there are more efficient PMCMC algorithms for probabilistic programming inference. In particular, there is no reason to fork unless an observe has just been interpreted. Alg. 1 is presented in this form for expositional purposes.

3.2 Random Database

We refer to the MH approach to sampling over the space of all traces proposed in [13] as "random database" (RDB). An RDB sampler is an MH sampler where a single variable drawn in the course of a particular interpretation of a probabilistic program is modified via a standard MH proposal, and this modification is accepted by comparing the value of the joint distribution of old and new program traces. For completeness we review RDB here, noting a subtle correction to the acceptance ratio proposed in the original reference which is proper for a larger family of models.

The RDB sampler employs a data structure that holds all random variables x associated with an execution trace, along with the parameters and log probability of each draw. Note that interpretation of a program is deterministic conditioned on x. A new proposal trace is initialized by picking a single variable x_{m,j} from the |x| random draws, and resampling its value using a reversible kernel κ(x'_{m,j} | x_{m,j}). Starting from this initialization, the program is rerun to generate a new set of variables x' that correspond to a new valid execution trace. In each instance where the random procedure type remains the same, we reuse the existing value from the set x, rescoring its log probability conditioned on the preceding variables where necessary. When the random procedure type has changed, or a new random variable is encountered, its value is sampled in the usual manner. Finally, we compute the probability p(y|x') by rescoring each observe as needed, and accept with probability

$$\min\left(1, \frac{p(y \mid x')\, p(x')\, q(x \mid x')}{p(y \mid x)\, p(x)\, q(x' \mid x)}\right). \quad (3)$$

In order to calculate the ratio of the proposal probabilities q(x'|x) and q(x|x'), we need to account for the variables that were resampled in the course of constructing the proposal, as well as the fact that the sets x' and x may have different cardinalities |x'| and |x|. We will use the (slightly inaccurate) notation x'\x to refer to the set of variables that were resampled, and let x' ∩ x represent the set of variables common to both execution traces. The proposal probability is now given by

$$q(x' \mid x) = \frac{\kappa(x'_{m,j} \mid x_{m,j})}{|x|}\, \frac{p(x' \setminus x \mid x' \cap x)}{p(x'_{m,j} \mid x' \cap x)}. \quad (4)$$

In our implementation, the initialization x'_{m,j} is simply resampled conditioned on the preceding variables, such that κ(x'_{m,j} | x_{m,j}) = p(x'_{m,j} | x' ∩ x). The reverse proposal density can now be expressed in a similar fashion in terms of x\x' and x ∩ x' = x' ∩ x, allowing the full acceptance probability to be written as

$$\frac{p(y \mid x')\, p(x')\, |x|\, p(x \setminus x' \mid x \cap x')}{p(y \mid x)\, p(x)\, |x'|\, p(x' \setminus x \mid x' \cap x)}. \quad (5)$$

4 Testing

Programming probabilistic program interpreters is a non-trivial software development effort, involving both the correct implementation of an interpreter and the correct implementation of a general purpose sampler. The methodology we employ to ensure correctness of both involves three levels of testing: 1) unit tests, 2) measure tests, and 3) conditional measure tests.

4.1 Unit and Measure Tests

In the context of probabilistic programming, unit testing includes verifying that the interpreter correctly interprets a comprehensive set of small deterministic programs. Measure testing involves interpreting short revealing programs consisting of assume and predict statements (producing a sequence of ancestral, unconditioned samples, i.e. no observes). Interpreter output is tested relative to ground truth, where ground truth is computed via exhaustive enumeration, analytic derivation, or some combination, and always in a different, well-tested independent computational system like Matlab. Various comparisons of the empirical distribution constructed from the accumulating stream of output predicts and ground truth are computed: Kullback-Leibler (KL) divergences for discrete sample spaces and Kolmogorov-Smirnov (KS) test statistics for continuous sample spaces. While it is possible to construct distribution equality hypothesis tests for some combinations of test statistic and program, we generally are content to accept interpreters for which there is clear evidence of convergence towards zero of all test statistics for all measure tests. Anglican passed all unit and measure tests.

4.2 Conditional Measure Tests

Measure tests involving conditioning provide additional information beyond that provided by measure and unit tests. Conditioning involves endowing programs with observe statements which constrain or weight the set of possible execution traces. Interpreting observe statements engages the full inference machinery. Conditional measure test performance is measured in the same way as measure test performance.

LMH Proposal (annotating equation (4) above):

• single stochastic procedure (SP) output: the resampled x'_{m,j}, via κ(x'_{m,j} | x_{m,j})
• number of SPs in the original trace: |x|
• probability of the new SP return value (sample) given the trace prefix: p(x'_{m,j} | x' ∩ x)
• probability of the new part of the proposed execution trace: p(x'\x | x' ∩ x)

[Wingate, Stuhlmüller et al, 2011]

Page 26:

[Backdrop: the same AISTATS 2014 excerpt as on the previous page.]

LMH Implementation

"Single site update" = sample from the prior = run program forward.

Simplified MH acceptance ratio (equation (5) above):
• |x|: number of SP applications in the original trace
• |x'|: number of SP applications in the new trace
• p(x\x' | x ∩ x'): probability of regenerating the current trace continuation given the proposal trace beginning
• p(x'\x | x' ∩ x): probability of generating the proposal trace continuation given the current trace beginning

(A plain-Clojure sketch of this single-site update follows below.)
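Here is a minimal sketch of the "sample from the prior, rerun forward, accept by a simplified ratio" loop, in plain Clojure for a one-variable trace (the Gaussian-unknown-mean toy from the first exercise). With a single random choice and a proposal drawn from its prior, the prior and proposal terms cancel and only the likelihood ratio remains. This illustrates the idea only; it is not the Anglican implementation.

;; Single-site MH sketch.  Toy model: mu ~ Normal(1, sqrt 5),
;; y_n ~ Normal(mu, sqrt 2) with observations 9 and 8.
(def rng (java.util.Random.))
(defn sample-normal [mu sigma] (+ mu (* sigma (.nextGaussian rng))))

(defn normal-logpdf [x mu sigma]
  (- (- (/ (Math/pow (- x mu) 2) (* 2 sigma sigma)))
     (Math/log (* sigma (Math/sqrt (* 2 Math/PI))))))

(defn log-likelihood
  "Log probability of the two observations under Normal(mu, sqrt 2)."
  [mu]
  (+ (normal-logpdf 9 mu (Math/sqrt 2))
     (normal-logpdf 8 mu (Math/sqrt 2))))

(defn lmh-step
  "One single-site update: resample the only random choice mu from its prior
   (i.e. rerun the program forward) and accept with the likelihood ratio."
  [mu]
  (let [proposal   (sample-normal 1 (Math/sqrt 5))
        log-accept (- (log-likelihood proposal) (log-likelihood mu))]
    (if (< (Math/log (rand)) log-accept) proposal mu)))

;; Run the chain; the exact posterior mean of mu for this model and data is 7.25.
(let [chain (take 20000 (iterate lmh-step 0.0))]
  (/ (reduce + (drop 5000 chain)) 15000.0))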


Page 27:

Introduction: Sequential Monte Carlo

Sequential Monte Carlo targets

$$p(x_{1:N} \mid y_{1:N}) \propto p(y_{1:N}, x_{1:N}) \equiv \prod_{n=1}^{N} g(y_n \mid x_{1:n})\, f(x_n \mid x_{1:n-1})$$

with a weighted set of particles

$$x^{\ell}_{n} \sim f\big(x_n \mid x^{a^{\ell}_{n-1}}_{1:n-1}\big), \qquad w^{\ell}_{n} = g\big(y_{1:n}, x^{\ell}_{1:n}\big), \qquad \bar{w}^{\ell}_{n} = \frac{w^{\ell}_{n}}{\sum_{\ell=1}^{L} w^{\ell}_{n}}.$$

Noting the identity

$$\frac{p(x_{1:n} \mid y_{1:n})}{q(x_{1:n} \mid y_{1:n})} = \frac{g(y_n \mid x_{1:n})\, f(x_n \mid x_{1:n-1})\, p(x_{1:n-1} \mid y_{1:n-1})}{f(x_n \mid x_{1:n-1})\, p(x_{1:n-1} \mid y_{1:n-1})} = g(y_n \mid x_{1:n}),$$

we can use importance sampling with the proposal

$$q(x_{1:n} \mid y_{1:n}) = f(x_n \mid x_{1:n-1})\, p(x_{1:n-1} \mid y_{1:n-1})$$

to generate samples from p(x_{1:n} | y_{1:n}), given a sample-based approximation to p(x_{1:n-1} | y_{1:n-1}).

A Compilation Target for Probabilistic Programming Languages [excerpt]

In this manner, any model with a generative process that can be described in arbitrary C code can be represented in this sequential form in the space of program execution traces.

Each observe statement takes as its argument ln g(y_n|x_{1:n}). Each quantity of interest in a predict statement corresponds to some deterministic function h(·) of all random choices x_{1:N} made during the execution of the program. Given a set of S posterior samples {x^{(s)}_{1:N}}, we can approximate the posterior distribution of the predict value as

$$h(x_{1:N}) \approx \frac{1}{S} \sum_{s=1}^{S} h\big(x^{(s)}_{1:N}\big). \quad (10)$$

3.2. Sequential Monte Carlo

Forward simulation-based algorithms are a natural fit for probabilistic programs: run the program and report executions that match the data. Sequential Monte Carlo (SMC, sequential importance resampling) forms the basic building block of other, more complex particle-based methods, and can itself be used as a simple approach to probabilistic programming inference. SMC approximates a target density p(x_{1:N}|y_{1:N}) as a weighted set of L realized trajectories x^ℓ_{1:N} such that

$$p(x_{1:N} \mid y_{1:N}) \approx \sum_{\ell=1}^{L} \bar{w}^{\ell}_{N}\, \delta_{x^{\ell}_{1:N}}(x_{1:N}). \quad (11)$$

For most probabilistic programs of interest, it will be intractable to sample from p(x_{1:N}|y_{1:N}) directly. Instead, noting that (for n > 1) we have the recursive identity

$$p(x_{1:n} \mid y_{1:n}) = p(x_{1:n-1} \mid y_{1:n-1})\, g(y_n \mid x_{1:n})\, f(x_n \mid x_{1:n-1}), \quad (12)$$

we sample from p(x_{1:N}|y_{1:N}) by iteratively sampling from each p(x_{1:n}|y_{1:n}), in turn, from 1 through N. At each n, we construct an importance sampling distribution by proposing from some distribution q(x_n|x_{1:n-1}, y_{1:n}); in probabilistic programs we find it convenient to propose directly from the executions of the program, i.e. each sequence of random variates x_n is jointly sampled from the program execution state dynamics

$$x^{\ell}_{n} \sim f\big(x_n \mid x^{a^{\ell}_{n-1}}_{1:n-1}\big) \quad (13)$$

where a^ℓ_{n-1} is an "ancestor index," the particle index 1, ..., L of the parent (at time n-1) of x^ℓ_n. The unnormalized particle importance weights at each observation y_n are simply the observe data likelihood

$$w^{\ell}_{n} = g\big(y_{1:n}, x^{\ell}_{1:n}\big) \quad (14)$$

which can be normalized as

$$\bar{w}^{\ell}_{n} = \frac{w^{\ell}_{n}}{\sum_{\ell'=1}^{L} w^{\ell'}_{n}}. \quad (15)$$

After each step n, we now have a weighted set of execution traces which approximate p(x_{1:n}|y_{1:n}). As the program continues, traces which do not correspond well with the data will have weights which become negligibly small, leading in the worst case to all weight concentrated on a single execution trace. To counteract this deficiency, we resample from our current set of L execution traces after each observation y_n, according to their weights w^ℓ_n. This is achieved by sampling a count O^ℓ_n for the number of "offspring" of a given execution trace ℓ to be included at time n+1. Any sampling scheme must ensure E[O^ℓ_n] = w̄^ℓ_n. Sampling offspring counts O^ℓ_n is equivalent to sampling ancestor indices a^ℓ_n. Program execution traces with no offspring are killed; program execution traces with more than one offspring are forked multiple times. After resampling, all weights w^ℓ_n = 1.

We only resample if the effective sample size

$$\mathrm{ESS} \approx \frac{1}{\sum_{\ell} \big(\bar{w}^{\ell}_{n}\big)^{2}} \quad (16)$$

is less than some threshold value τ; we choose τ = L/2.

Algorithm 1 Parallel SMC program execution
Assume: N observations, L particles
  launch L copies of the program (parallel)
  for n = 1 ... N do
    wait until all L reach observe y_n (barrier)
    update unnormalized weights w^{1:L}_n (serial)
    if ESS < τ then
      sample number of offspring O^{1:L}_n (serial)
      set weight w^{1:L}_n = 1 (serial)
      for ℓ = 1 ... L do
        fork or exit (parallel)
      end for
    else
      set all number of offspring O^ℓ_n = 1 (serial)
    end if
    continue program execution (parallel)
  end for
  wait until L program traces terminate (barrier)
  predict from L samples from p(x^{1:L}_{1:N} | y_{1:N}) (serial)

In probabilistic C, each observe statement forms a barrier: parallel execution cannot continue until all particles have arrived at the observe and have reported their current unnormalized weight. As execution traces arrive at the observe barrier, they take the number of particles which have already reached the current observe as a (temporary)

Algorithm 1 Particle Gibbs Prob. Prog. Inference
  L: number of particles
  S: number of sweeps
  {w^(ℓ)_N, x^(ℓ)_{1:N}} ← Run SMC
  for s < S do
    {·, x*_{1:N}} ← r(1, {1/L, x^(ℓ)_{1:N}})
    {·, x^(ℓ)_0 = ∅} ← initialize L−1 interpreters
    for d ∈ ordered lines of program do
      for ℓ < L−1 do
        x̄^(ℓ)_{1:(n−1)} ← fork(x^(ℓ)_{1:(n−1)})
      end for
      if directive(d) == assume then
        for ℓ < L−1 do
          x̄^(ℓ)_{1:n} ← interpret(d, x̄^(ℓ)_{1:(n−1)})
        end for
        {x^(ℓ)_{1:n}} ← {x̄^(ℓ)_{1:n}} ∪ x*_{1:n}
      else if directive(d) == predict then
        for ℓ < L−1 do
          interpret(d, x̄^(ℓ)_{1:(n−1)})
        end for
        interpret(d, x*_{1:(n−1)})
      else if directive(d) == observe then
        for ℓ < L−1 do
          {w^(ℓ)_n, x̄^(ℓ)_{1:n}} ← interpret(d, x̄^(ℓ)_{1:(n−1)})
        end for
        T ← r(L−1, {w^(ℓ)_n, x̄^(ℓ)_{1:n}} ∪ {w*_n, x*_{1:n}})
        {w^(ℓ)_n, x^(ℓ)_{1:n}} ← T ∪ {w*_n, x*_{1:n}}
      end if
    end for
  end for

$$p(x_{1:N} \mid y_{1:N}) \approx \sum_{\ell=1}^{L} \bar{w}^{\ell}_{N}\, \delta_{x^{\ell}_{1:N}}(x_{1:N})$$

$$p(x_{1:n} \mid y_{1:n}) = g(y_n \mid x_{1:n})\, f(x_n \mid x_{1:n-1})\, p(x_{1:n-1} \mid y_{1:n-1})$$

$$p(x_{1:n} \mid y_{1:n}) = \sum_{\ell=1}^{L} \bar{w}^{\ell}_{n-1}\, g(y_n \mid x_{1:n})\, f(x_n \mid x^{\ell}_{1:n-1})\, p(x^{\ell}_{1:n-1} \mid y_{1:n-1})$$


Page 28:

SMC

Iteratively:
- simulate
- weight
- resample

(A code sketch of this loop follows below.)

[Figure: particles advancing from observe to observe (n = 1, n = 2, …), weighted and resampled at each observe.]
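A plain-Clojure sketch of this simulate/weight/resample loop as a bootstrap particle filter for a toy state-space model of my choosing (x_n ~ Normal(x_{n-1}, 1), y_n ~ Normal(x_n, 1)). In a probabilistic programming system the "simulate" step corresponds to running each particle's program forward to its next observe. Real implementations resample only when the effective sample size drops below a threshold such as L/2; this sketch resamples at every observe for brevity.

;; Bootstrap particle filter sketch of simulate / weight / resample.
(def rng (java.util.Random.))
(defn sample-normal [mu sigma] (+ mu (* sigma (.nextGaussian rng))))

(defn normal-pdf [x mu sigma]
  (/ (Math/exp (- (/ (Math/pow (- x mu) 2) (* 2 sigma sigma))))
     (* sigma (Math/sqrt (* 2 Math/PI)))))

(defn resample
  "Multinomial resampling: draw new particles in proportion to their weights."
  [particles weights]
  (let [total (reduce + weights)
        cdf   (vec (reductions + (map #(/ % total) weights)))
        pick  (fn []
                (let [r (rand)]
                  (count (take-while #(< % r) cdf))))]
    (mapv (fn [_] (nth particles (min (pick) (dec (count particles)))))
          particles)))

(defn smc-step
  "Advance all particles to the next observe, weight them, and resample."
  [particles y]
  (let [moved   (mapv #(sample-normal % 1.0) particles)   ; simulate
        weights (mapv #(normal-pdf y % 1.0) moved)]       ; weight by observe likelihood
    (resample moved weights)))                            ; resample

;; Filter a short observation sequence with L = 1000 particles and report the
;; posterior-mean estimate of x_N.
(let [ys        [0.3 0.9 1.4 1.1]
      particles (reduce smc-step (vec (repeat 1000 0.0)) ys)]
  (/ (reduce + particles) (count particles)))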

Page 29:

Proposal: run the program forward until the next observe. The weight of a particle is the observation likelihood.

$$q(x_{1:n} \mid y_{1:n}) = f(x_n \mid x_{1:n-1})\, p(x_{1:n-1} \mid y_{1:n-1})$$

$$E_{p(x_{1:n} \mid y_{1:n})}[h(x_{1:n})] \approx \frac{1}{L} \sum_{\ell=1}^{L} h(x^{\ell}_{1:n})\, w^{\ell}_{n}, \qquad x^{\ell}_{1:n} \sim q(x_{1:n} \mid y_{1:n}), \qquad w^{\ell}_{n} = \frac{p(x^{\ell}_{1:n} \mid y_{1:n})}{q(x^{\ell}_{1:n} \mid y_{1:n})}$$

$$p(x_{1:n-1} \mid y_{1:n-1}) \approx \sum_{\ell=1}^{L} \bar{w}^{\ell}_{n-1}\, \delta_{x^{\ell}_{1:n-1}}(x_{1:n-1})$$

$$p(x_{1:n} \mid y_{1:n}) = g(y_n \mid x_{1:n})\, f(x_n \mid x_{1:n-1})\, p(x_{1:n-1} \mid y_{1:n-1})$$

$$p(x_{1:n} \mid y_{1:n}) \approx \sum_{\ell=1}^{L} g(y_n \mid x^{\ell}_{1:n})\, \delta_{x^{\ell}_{1:n}}(x_{1:n}), \qquad x^{\ell}_{1:n} = \big(x^{a^{\ell}_{n-1}}_{1:n-1}, x^{\ell}_{n}\big), \quad x^{\ell}_{n} \sim f$$

$$p(x_{1:n} \mid y_{1:n}) \approx \sum_{\ell=1}^{L} g(y_n \mid x^{\ell}_{1:n})\, \bar{w}^{\ell}_{n-1}\, \delta_{x^{\ell}_{1:n}}(x_{1:n}), \qquad x^{\ell}_{1:n} = \big(x^{a^{\ell}_{n-1}}_{1:n-1}, x^{\ell}_{n}\big), \quad x^{\ell}_{n} \sim f$$

[The backdrop also shows the corresponding state-space joint density, the particle-marginal Metropolis-Hastings acceptance ratio, and the marginal likelihood estimate $\hat{Z} \equiv p(y_{1:N} \mid \theta) \approx \prod_{n=1}^{N} \big[\tfrac{1}{L} \sum_{\ell=1}^{L} w^{\ell}_{n}\big]$.]

SMC for Probabilistic Programming

Sequence of environments; parallel executions.

Fischer, Kiselyov, and Shan, "Purely functional lazy non-deterministic programming", ACM SIGPLAN 2009.
W., van de Meent, and Mansinghka, "A New Approach to Probabilistic Programming Inference", AISTATS 2014.
Paige and W., "A Compilation Target for Probabilistic Programming Languages", ICML 2014.

Page 30:

SMC Methods Only Require

• Initialization (sample)
• Forward simulation (sample)
• Observation likelihood computation: pointwise evaluation up to normalization


Page 31:

SMC for Probabilistic Programming

Paige and W., "A Compilation Target for Probabilistic Programming Languages", ICML 2014.

[The slide reproduces the excerpt from that paper shown on Page 27: Section 3.2 (Sequential Monte Carlo) and Algorithm 1 (Parallel SMC program execution).]

Page 32:

SMC for Probabilistic Programming

Intuitively:
- run
- wait
- fork

[Figure: program threads running in parallel, synchronizing at each observe delimiter, where their continuations are forked.]

Page 33:

SMC Inner Loop

• Sequential Monte Carlo is now a building block for other inference techniques
• Particle MCMC [Andrieu, Doucet, Holenstein 2010]:
  - PIMH: "particle independent Metropolis-Hastings" (a sketch follows below)
  - iCSMC: "iterated conditional SMC" [W., van de Meent, Mansinghka 2014]

[Figure: repeated SMC sweeps s = 1, 2, 3, each running the inner simulate/weight/resample loop over the observes n.]
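As an illustration of how SMC becomes a building block, here is a hedged plain-Clojure sketch of PIMH for the same toy state-space model used earlier: each MH proposal is an independent SMC sweep, accepted with probability min(1, Ẑ'/Ẑ), where Ẑ is that sweep's marginal likelihood estimate. The helpers repeat those from the SMC sketch on Page 28 so the block runs standalone; none of this is the Anglican implementation.

;; PIMH sketch: propose a fresh SMC sweep, accept by the ratio of marginal
;; likelihood estimates.  Toy model: x_n ~ Normal(x_{n-1}, 1), y_n ~ Normal(x_n, 1).
(def rng (java.util.Random.))
(defn sample-normal [mu sigma] (+ mu (* sigma (.nextGaussian rng))))

(defn normal-pdf [x mu sigma]
  (/ (Math/exp (- (/ (Math/pow (- x mu) 2) (* 2 sigma sigma))))
     (* sigma (Math/sqrt (* 2 Math/PI)))))

(defn resample
  "Multinomial resampling: draw new particles in proportion to their weights."
  [particles weights]
  (let [total (reduce + weights)
        cdf   (vec (reductions + (map #(/ % total) weights)))
        pick  (fn []
                (let [r (rand)]
                  (count (take-while #(< % r) cdf))))]
    (mapv (fn [_] (nth particles (min (pick) (dec (count particles)))))
          particles)))

(defn run-smc
  "One SMC sweep with L particles over observations ys; returns one surviving
   trajectory endpoint and the log marginal likelihood estimate
   sum_n log[(1/L) sum_l w_n^l]."
  [ys L]
  (loop [particles (vec (repeat L 0.0)) log-z 0.0 remaining ys]
    (if (empty? remaining)
      {:x-end (rand-nth particles) :log-z log-z}
      (let [moved (mapv #(sample-normal % 1.0) particles)
            ws    (mapv #(normal-pdf (first remaining) % 1.0) moved)]
        (recur (resample moved ws)
               (+ log-z (Math/log (/ (reduce + ws) L)))
               (rest remaining))))))

(defn pimh-step
  "One PIMH transition: accept a fresh SMC sweep with probability min(1, Z'/Z)."
  [ys L current]
  (let [proposal (run-smc ys L)]
    (if (< (Math/log (rand)) (- (:log-z proposal) (:log-z current)))
      proposal
      current)))

;; Run a short PIMH chain and report the mean of the retained x_N samples.
(let [ys    [0.3 0.9 1.4 1.1]
      chain (take 200 (iterate (partial pimh-step ys 100) (run-smc ys 100)))]
  (/ (reduce + (map :x-end chain)) (count chain)))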

Page 34:

Third Exercise: Poisson

• Learning objectives
  1. Understand the trace
  2. Understand on which variables inference algorithms operate
  3. Develop intuitions about how LMH and SMC inference variants work in probabilistic programming systems

Page 35:

Fourth Exercise: Coordination

• Learning objectives
  1. Learn how to write non-trivial models, including mutually recursive queries
• Resources
  • http://forestdb.org/
  • http://www.robots.ox.ac.uk/~fwood/anglican/examples/index.html


Page 37:

[Timeline figure, 1990-2010, of probabilistic programming systems grouped by community (PL, ML, STATS, AI): HANSEI, IBAL, Figaro, BUGS, WinBUGS, JAGS, STAN, LibBi, infer.NET, Factorie, Prism, Prolog, KMP, BLOG, ProbLog, Church, webChurch, Venture, Anglican, Probabilistic-C, Probabilistic-ML/Haskell/Scheme/…, Simula. Annotations: "Discrete RV's Only", "Bounded Recursion".]

Page 38:

Opportunities

• Parallelism: "Asynchronous Anytime Sequential Monte Carlo" [Paige, W., Doucet, Teh NIPS 2014]
• Backwards passing: "Particle Gibbs with Ancestor Sampling for Probabilistic Programs" [van de Meent, Yang, Mansinghka, W. AISTATS 2015]
• Search: "Maximum a Posteriori Estimation by Search in Probabilistic Models" [Tolpin, W., SOCS, 2015]
• Adaptation: "Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs" [Tolpin, van de Meent, Paige, W.; ECML, 2015]
• Novel proposals: "Neural Adaptive Inference for Probabilistic Programming" [Paige, W.; in submission]

Page 39:

Bubble Up

[Stack diagram: applications sit on models, models on the probabilistic programming language, and the language on inference; the language together with inference forms the probabilistic programming system.]