
Probabilistic Programming

Presentation at the Machine Learning Journal Club

Lawrence Murray, Jan Kudlicka

Department of Information Technology, Uppsala University

March 29, 2017


MODELS AS GRAPHS

[Figure: (a) a directed graphical model and (b) an undirected graphical model, each over the variables A, B, C, D, E.]


MODELS AS GRAPHS

State-Space Model (SSM)

[Figure: a directed chain of latent states X0 → X1 → X2 → X3 → X4, each state Xt emitting an observation Yt.]


MODELS AS GRAPHS

Ising Model

[Figure: an undirected graphical model over nodes A–I arranged as a grid (Ising lattice), with edges between neighbouring nodes.]


MODELS AS GRAPHS

Bayesian Logistic Regression Model


MODELS AS GRAPHS

Latent Dirichlet Allocation (LDA) Model

α

β

z w N

M

θ


MODELS AS GRAPHS

Gaussian Mixture Model


MODELS AS GRAPHS

▶ Not all models can be represented as graphical models, and the graphical language does not necessarily capture all attributes of a model.

▶ Inference methods are often tailored for specific models, e.g. the Kalman filter for a linear-Gaussian SSM, collapsed Gibbs samplers for LDA, Pólya–Gamma samplers for Bayesian logistic regression.

▶ Implementations are often bespoke: a specific inference method for a specific model.


MODELS AS PROGRAMS?

▶ Write a program that simulates from the joint distribution. Let this define the model.

▶ The program is stochastic, so that each time it runs, it may produce different output.

▶ Consider constraining the output of the program, or constraining its execution. This is inference.

▶ Programs are more expressive than graphs, because a program can do stochastic branching. This makes inference difficult.

▶ Ideally the implementation of models is decoupled from the implementation of inference methods.


PROBABILISTIC PROGRAMMING

▶ Probabilistic programming is a programming paradigm, in the same way that object-oriented, functional and logic programming are programming paradigms.

▶ Probabilistic programming languages (PPLs) have ergonomic support for random variables, probability distributions and inference.

▶ The hard bit is getting a correct result.

▶ The really hard bit is getting the best result.


EXAMPLE: TWO DICE

die1 ~ duniform(1, 6)
die2 ~ duniform(1, 6)
sum = die1 + die2
observe sum <= 4
infer die1

Figure generated at webppl.org.
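To make the semantics concrete, here is a minimal plain-Python sketch of what this program asks of an inference engine, using rejection sampling (the strategy, function name and sample count are illustrative assumptions; the snippet above is PPL pseudocode, with duniform a discrete uniform distribution):

import random
from collections import Counter

def two_dice_posterior(num_samples=100_000):
    # Rejection sampling: run the generative program many times and
    # keep only the runs that satisfy the observation.
    accepted = []
    for _ in range(num_samples):
        die1 = random.randint(1, 6)   # die1 ~ duniform(1, 6)
        die2 = random.randint(1, 6)   # die2 ~ duniform(1, 6)
        if die1 + die2 <= 4:          # observe sum <= 4
            accepted.append(die1)     # infer die1
    counts = Counter(accepted)
    total = sum(counts.values())
    return {value: count / total for value, count in sorted(counts.items())}

print(two_dice_posterior())
# Exact posterior: P(die1 = 1) = 3/6, P(die1 = 2) = 2/6, P(die1 = 3) = 1/6.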


EXAMPLE: LINEAR GAUSSIAN STATE SPACE (LGSS) MODEL

x[t+1] = 0.7 x[t] + w
y[t] = 0.5 x[t] + v
x[0] ∼ N(0, 0.1)
w ∼ N(0, 0.1)
v ∼ N(0, 0.1)

[Figure: simulated trajectories of the latent state x and the observations y over t = 0, …, 100, with values roughly between −2 and 2.]

y = read_from_file('measurements.txt', separator='\n')
x[0] ~ normal(0, 0.1)
for t in range(100)
  observe y[t] ~ normal(0.5*x[t], 0.1)
  x[t+1] ~ normal(0.7*x[t], 0.1)
end
infer E(x[100])
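For reference, a short Python sketch that simulates data from this model (the function name is an assumption, and N(0, 0.1) is read as mean 0 and variance 0.1):

import math
import random

def simulate_lgss(T=100, a=0.7, c=0.5, q=0.1, r=0.1, seed=1):
    # Simulates x[t+1] = a*x[t] + w and y[t] = c*x[t] + v,
    # with x[0] ~ N(0, q), w ~ N(0, q) and v ~ N(0, r).
    rng = random.Random(seed)
    x = [rng.gauss(0.0, math.sqrt(q))]
    y = []
    for t in range(T):
        y.append(c * x[t] + rng.gauss(0.0, math.sqrt(r)))
        x.append(a * x[t] + rng.gauss(0.0, math.sqrt(q)))
    return x, y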


PROBABILISTIC PROGRAMS

Probabilistic constructs in a PPL:

▶ Assume – declaring and defining a random variable by specifying its probability distribution.

▶ Observe – conditioning based on an observation.

▶ Infer – calculating / estimating
  ▶ the distribution of a random variable given by an expression, or
  ▶ its expected value, or
  ▶ its mode(s).


COMPARISON OF PPL WITH STD. PROGRAMMING AND ML

“Standard” programming   Machine learning   Probabilistic programming
Parameters               θ                  Parameters
Program                  p(X|θ)             Program
Output                   X                  Observations

Based on a figure by Frank Wood.


ADVANTAGES

▶ Clear separation between model and inference.

▶ Programs might be easier and/or quicker to “write down” than mathematical models.

▶ Less need for experts to work out how to do the inference and to implement it.

▶ Huge scope of applications (compared with, e.g., probabilistic graphical models).


INFERENCE

Inference is, in general, a difficult task.

▶ Exact inference
  ▶ Closed-form posterior distribution cases (e.g. Kalman filtering)
  ▶ Enumeration – discrete models of limited dimension

▶ Approximate inference
  ▶ Monte Carlo inference
  ▶ Variational inference


USING MONTE CARLO FOR ESTIMATING EXPECTED VALUE

Monte Carlo can be used to estimate the expected value of a function of a random variable:

I = E[h(x)] = ∫ h(x) p(x) dx.

Sample L points {xℓ}, ℓ = 1, …, L, from p(x). Then

E[h(x)] ≈ I_L = (1/L) ∑_{ℓ=1}^{L} h(xℓ).

The law of large numbers: lim_{L→∞} I_L = I with probability 1.

The central limit theorem: √L (I_L − I) → N(0, σ²) in distribution, where σ² = var h(x).
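A quick Python illustration of the estimator (the particular h and p are assumptions, chosen so that the true value is known):

import random

def mc_expectation(h, sample_p, L=100_000, seed=1):
    # Plain Monte Carlo: draw L samples from p and average h over them.
    rng = random.Random(seed)
    return sum(h(sample_p(rng)) for _ in range(L)) / L

# Estimate E[x^2] for x ~ N(0, 1); the exact value is 1.
estimate = mc_expectation(h=lambda x: x * x,
                          sample_p=lambda rng: rng.gauss(0.0, 1.0))
print(estimate)  # close to 1, with error of order 1/sqrt(L) by the CLT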


IMPORTANCE SAMPLING

What if we cannot sample from p(x)? Assume that

▶ we can evaluate p̃(x) = Z p(x) for all x, where Z is a (possibly unknown) constant, and

▶ there is another distribution q(x) from which we can sample, and q(x) = 0 ⇒ p(x) = 0.

We can use samples from the proposal distribution q(x) to calculate the expected value w.r.t. p(x).


IMPORTANCE SAMPLING, CONT’D

E[h(x)] = ∫ h(x) p(x) dx = (1/Z) ∫ h(x) w(x) q(x) dx, where w(x) = p̃(x)/q(x) is the importance weight.

Since p(x) is a probability distribution:

Z = ∫ p̃(x) dx = ∫ (p̃(x)/q(x)) q(x) dx = ∫ w(x) q(x) dx.

Both integrals can be estimated using Monte Carlo.
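A minimal sketch of self-normalised importance sampling in Python (the target p̃, proposal q and function h are illustrative assumptions):

import math
import random

def importance_sampling(h, log_p_tilde, sample_q, log_q, L=100_000, seed=1):
    # Estimates E_p[h(x)] using samples from q, where p is known
    # only up to the constant Z (p_tilde = Z * p).
    rng = random.Random(seed)
    xs = [sample_q(rng) for _ in range(L)]
    ws = [math.exp(log_p_tilde(x) - log_q(x)) for x in xs]  # w(x) = p_tilde(x)/q(x)
    return sum(w * h(x) for w, x in zip(ws, xs)) / sum(ws)  # ratio of the two MC estimates

# Example: p_tilde(x) = exp(-x^2/2), i.e. N(0, 1) up to Z; proposal q = N(0, 4).
estimate = importance_sampling(
    h=lambda x: x * x,
    log_p_tilde=lambda x: -0.5 * x * x,
    sample_q=lambda rng: rng.gauss(0.0, 2.0),
    log_q=lambda x: -0.5 * (x / 2.0) ** 2 - math.log(2.0) - 0.5 * math.log(2.0 * math.pi),
)
print(estimate)  # close to 1 = E[x^2] under N(0, 1)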


GRAPHICAL MODEL OF THE EXECUTION

Nomenclature:

▶ N – number of observations,
▶ yn – value of the n-th observation,
▶ xn – the memory state at the n-th observation,
▶ gn(yn|xn) – PDF of seeing the n-th observation yn given the memory state xn,
▶ fn(xn|xn−1) – PDF of the memory state xn given the memory state xn−1 at the previous observation.

We will also use the following notation:

x1:N = {x1, x2, …, xN}
y1:N = {y1, y2, …, yN}


GRAPHICAL MODEL OF THE EXECUTION

[Figure: a chain ∅ → x1 → x2 → ⋯ → xn → ⋯ with transition densities f1(x1|∅), f2(x2|x1), …, fn(xn|xn−1), …, and observation densities g1(y1|x1), g2(y2|x2), …, gn(yn|xn), ….]

p(x1:N, y1:N) = ∏_{n=1}^{N} fn(xn|xn−1) gn(yn|xn),

p(x1:N|y1:N) = p(x1:N, y1:N) / p(y1:N) ∝ p(x1:N, y1:N),

where x0 = ∅. Our interest is the posterior probability p(x1:N|y1:N).


IMPORTANCE SAMPLING REVISITED

The target distribution multiplied by an (unknown) constant:

p̃(x1:N|y1:N) = p(x1:N, y1:N) = ∏_{n=1}^{N} fn(xn|xn−1) gn(yn|xn).

Let’s use the following proposal distribution:

q(x1:N) = ∏_{n=1}^{N} fn(xn|xn−1).

The importance weight:

w = p̃/q = [∏_{n=1}^{N} fn(xn|xn−1) gn(yn|xn)] / [∏_{n=1}^{N} fn(xn|xn−1)] = ∏_{n=1}^{N} gn(yn|xn).


SAMPLING FROM THE PROPOSAL DISTRIBUTION

How to sample from the proposal distribution?

q(x1:N) = ∏_{n=1}^{N} fn(xn|xn−1)

Execute the program as if it were a standard program and

▶ at an assume – sample a value from the given probability distribution,
▶ at an observe – update the weight.
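A minimal Python sketch of this execution strategy: the model is an ordinary function that calls assume/observe helpers on a run object that accumulates the weight (all names here are illustrative assumptions, not a real PPL API):

import math
import random

class WeightedRun:
    # One execution of a probabilistic program, carrying its importance weight.
    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.log_weight = 0.0

    def assume_normal(self, mean, var):
        # assume: draw a value from the prior (the proposal q in this scheme)
        return self.rng.gauss(mean, math.sqrt(var))

    def observe_normal(self, value, mean, var):
        # observe: multiply the weight by g(y|x), i.e. add its log density
        self.log_weight += (-0.5 * (value - mean) ** 2 / var
                            - 0.5 * math.log(2.0 * math.pi * var))

def lgss_program(run, y):
    # The LGSS example from earlier, written against this tiny interface.
    x = run.assume_normal(0.0, 0.1)
    for t in range(len(y)):
        run.observe_normal(y[t], 0.5 * x, 0.1)
        x = run.assume_normal(0.7 * x, 0.1)
    return x  # the inference expression, x[100] when len(y) == 100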


IMPORTANCE SAMPLING REVISITED, CONT’D

Algorithm:

1. Sample L points {xℓ1:N}, ℓ = 1, …, L, from the proposal distribution q(x).

2. Calculate the importance weights

   wℓ = ∏_{n=1}^{N} gn(yn|xℓn)   for ℓ = 1, …, L.

3. Estimate the expected value:

   E[h(x1:N)] ≈ (1 / ∑_{ℓ=1}^{L} wℓ) ∑_{ℓ=1}^{L} wℓ h(xℓ1:N).


SEQUENTIAL IMPORTANCE SAMPLING (SIS)

Algorithm:

for ℓ = 1, …, L
  wℓ = 1
  start the program
  for n = 1, …, N
    continue running the program until observe yn
    wℓ = wℓ ∗ gn(yn|xℓn)
  end
  continue running the program until the end
  hℓ = value of the inference expression
end
w = w / ∑ℓ wℓ
E[h] = ∑ℓ wℓ ∗ hℓ
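Continuing the sketch, SIS for the LGSS example in Python, reusing the assumed WeightedRun, lgss_program and simulate_lgss helpers from above (a robust implementation would normalise the weights in log space rather than exponentiating directly):

import math

def sis(program, y, L=1000):
    # Run L independent weighted executions and self-normalise.
    runs = [WeightedRun(seed=l) for l in range(L)]
    hs = [program(run, y) for run in runs]
    ws = [math.exp(run.log_weight) for run in runs]
    return sum(w * h for w, h in zip(ws, hs)) / sum(ws)

# Usage with data from the assumed simulate_lgss helper:
# x_true, y = simulate_lgss(T=100)
# print(sis(lgss_program, y))  # estimate of E[x[100] | y]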


WEIGHT DEGENERACY

Showstopper: weight degeneracy – in real applications, almost all weights wℓ are zero and the value of interest must be calculated using only a few samples.

[Figure: weights for the LGSS example above, L = 1000; nearly all normalised weights are close to zero, with a few reaching roughly 0.3.]


BOOTSTRAP PARTICLE FILTER

The most basic particle filter / Sequential Monte Carlo (SMC) algorithm.

Run L copies of the program (called particles) in parallel. At each observe we will resample the particles:

1. Wait until all particles have reached the observe.
2. Calculate wℓ = gn(yn|xℓn) for each particle.
3. Sample the offspring counts {oℓ}, ℓ = 1, …, L, from the multinomial distribution with L trials and event probabilities {wℓ / ∑ wℓ}.

▶ If oℓ = 0, kill the particle process.
▶ If oℓ = 1, continue the process.
▶ If oℓ > 1, fork the process oℓ − 1 times and continue.


BOOTSTRAP PARTICLE FILTER, CONT’D

Algorithm:

Start L copies of the program
for n = 1, …, N
  continue running all copies until observe yn
  wait until all copies have calculated wℓ = gn(yn|xℓn)
  if n < N
    sample {oℓ}, ℓ = 1, …, L, as described above
    for ℓ = 1, …, L
      if oℓ = 0
        kill the process
      else if oℓ > 1
        fork the process oℓ − 1 times
      end
    end
  end
end
continue running all copies until the end
wait until all copies have calculated hℓ = value of the inference expression
E[h] = ∑ℓ wℓ ∗ hℓ / ∑ℓ wℓ
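A compact Python sketch of the same algorithm for the LGSS example; multinomial resampling via random.choices stands in for the kill/fork scheme, and the density helper and function names are assumptions:

import math
import random

def normal_pdf(v, mean, var):
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bootstrap_pf(y, L=1000, seed=1):
    # Bootstrap particle filter: propagate from the prior, weight by the
    # likelihood at each observe, and resample between observes.
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, math.sqrt(0.1)) for _ in range(L)]    # x[0] ~ N(0, 0.1)
    for n in range(len(y)):
        ws = [normal_pdf(y[n], 0.5 * x, 0.1) for x in xs]      # wl = gn(yn | xln)
        if n < len(y) - 1:                                     # no resampling after the last observe
            xs = rng.choices(xs, weights=ws, k=L)
        xs = [rng.gauss(0.7 * x, math.sqrt(0.1)) for x in xs]  # x[t+1] ~ N(0.7*x[t], 0.1)
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)        # estimate of E[x[100] | y]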


BOOTSTRAP PARTICLE FILTER, CONT’D

[Figure: timeline of L particle processes running in parallel; each runs until observe y1, is resampled (exit or fork), continues until observe y2, and so on.]


OTHER ALGORITHMS

▶ Metropolis–Hastings algorithm
▶ Hamiltonian Monte Carlo
▶ Gibbs sampling


EXISTING PROGRAMMING LANGUAGES

Birch

Other probabilistic programming languages: Anglican, Church, Stan, Infer.NET, WebPPL, Venture, Turing.jl, Edward.


MORE INFORMATION

probabilistic-programming.org


Questions?
