Bayesian Inference in Intractable Likelihood Models
Krzysztof Łatuszynski (University of Warwick, UK)
(The Alan Turing Institute, London)
WISŁA 2018
Outline

1. Intractable Likelihood
   - Monte Carlo based inference
2. The Bernoulli Factory problem
   - The Bernoulli Factory; Motivation; Bernoulli Factory - what is known?; Reverse time martingale approach to sampling
3. Barkers and more
   - Barkers Algorithm; The two coin algorithm; The s-poly-Barkers; The Dice Enterprise!
4. The Markov switching diffusion model & exact Bayesian inference
   - The model and inference; Designing an exact MCMC algorithm; Example: the SINE model
5. Pseudo-marginal MCMC
Monte Carlo based inference

- A common goal in parametric Bayesian inference is to estimate posterior expectations.
- Given data y, the likelihood l_θ(y) and the prior p(θ), the posterior is

    π(θ) = p(θ) l_θ(y) / ∫ p(θ) l_θ(y) dθ

- We are interested in computing expectations of the form

    π(φ) = ∫ φ(θ) π(θ) dθ

- This integral typically cannot be computed analytically.
- Monte Carlo methods approximate π(φ) with random variables.
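As a toy illustration of the Monte Carlo idea (my own sketch, not from the slides): when we can draw θ_1, ..., θ_N from π directly, the average of φ(θ_i) approximates π(φ). A minimal Python sketch, assuming for concreteness a standard normal posterior and φ(θ) = θ²:

```python
import random

def mc_expectation(sample_posterior, phi, n=100_000):
    """Approximate pi(phi) = E_pi[phi(theta)] by an i.i.d. Monte Carlo average."""
    return sum(phi(sample_posterior()) for _ in range(n)) / n

# Toy example: posterior pi = N(0, 1) and phi(theta) = theta^2, so pi(phi) = 1.
random.seed(0)
est = mc_expectation(lambda: random.gauss(0.0, 1.0), lambda t: t * t)
print(est)  # close to 1 for large n
```

In practice π is only known up to the normalizing constant ∫ p(θ) l_θ(y) dθ, which is why MCMC methods such as Metropolis-Hastings are needed instead of direct sampling.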
Metropolis-Hastings

- The Metropolis-Hastings algorithm generates a Markov chain that is reversible with respect to π. Its transition operator P(·, ·) satisfies

    π(θ) P(θ, θ′) = π(θ′) P(θ′, θ)

- The algorithm: given θ_n,
  1. sample θ′ ∼ Q(θ_n, ·);
  2. with probability α(θ_n, θ′) set θ_{n+1} := θ′, otherwise set θ_{n+1} := θ_n;
  where

    α(θ_n, θ′) = min{ 1, [π(θ′) q(θ′, θ_n)] / [π(θ_n) q(θ_n, θ′)] }
               = min{ 1, [p(θ′) l_{θ′}(y) q(θ′, θ_n)] / [p(θ_n) l_{θ_n}(y) q(θ_n, θ′)] }

- Tractable model: l_θ(y) can be computed pointwise.
- Intractable model: l_θ(y) cannot be computed pointwise.
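The steps above can be sketched as a minimal random-walk Metropolis loop (my own toy example, assuming a tractable model where the log of the unnormalized posterior can be evaluated; with a symmetric proposal the q-terms in α cancel). The target here is a standard normal:

```python
import math
import random

def metropolis_hastings(log_target, theta0, n_iters=50_000, step=1.0, rng=random):
    """Random-walk Metropolis: Q(theta, .) = N(theta, step^2) is symmetric,
    so alpha reduces to min(1, pi(theta')/pi(theta))."""
    chain = [theta0]
    theta = theta0
    for _ in range(n_iters):
        prop = theta + rng.gauss(0.0, step)               # theta' ~ Q(theta_n, .)
        log_alpha = log_target(prop) - log_target(theta)  # log pi(theta')/pi(theta_n)
        if math.log(rng.random()) < log_alpha:            # accept w.p. min(1, ratio)
            theta = prop
        chain.append(theta)
    return chain

# Toy target: pi = N(0, 1) up to a constant, log pi(theta) = -theta^2/2 + const.
random.seed(1)
chain = metropolis_hastings(lambda t: -0.5 * t * t, theta0=0.0)
mean = sum(chain) / len(chain)
```

Note the key point for what follows: each acceptance step evaluates the likelihood ratio pointwise, which is exactly what an intractable model forbids.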
Intractable Likelihood

- Recall the acceptance probability

    α(θ_n, θ′) = min{ 1, [π(θ′) q(θ′, θ_n)] / [π(θ_n) q(θ_n, θ′)] }
               = min{ 1, [p(θ′) l_{θ′}(y) q(θ′, θ_n)] / [p(θ_n) l_{θ_n}(y) q(θ_n, θ′)] }

- Tractable model: l_θ(y) can be computed pointwise.
- Intractable model: l_θ(y) cannot be computed pointwise.
- Intractable models are found in all application areas:
  - physics, biology, chemistry, epidemiology, etc.
  - finance, marketing, manufacturing, etc.
Intractable Likelihood

- There are several types of intractability:
  - latent variable models: l_θ(y) = ∫ l_θ(y, x) dx, where l_θ(y, x) can be computed pointwise;
  - big data: l_θ(y) = ∏_i l_θ(y_i);
  - ABC: one can only sample z ∼ l_θ(·).
- Consider a diffusion

    dY_t = μ_θ(Y_t) dt + σ_θ(Y_t) dB_t

  with discretely observed data y_{t_1}, ..., y_{t_n}.
- Then

    l_θ(y_{t_1}, ..., y_{t_n}) = ∏_i p_θ(y_{t_i}, y_{t_{i+1}})

  where p_θ(y_{t_i}, y_{t_{i+1}}) is the transition density of the diffusion, typically not available in closed form.
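To see the diffusion intractability concretely: approximate paths are easy to simulate (e.g. by the Euler-Maruyama scheme) even though the transition density p_θ has no closed form. A hedged sketch, using an illustrative Ornstein-Uhlenbeck-type drift μ_θ(y) = -θy and constant σ (my own toy choice, not a model from the talk):

```python
import math
import random

def euler_maruyama(y0, theta, sigma, t_len, n_steps, rng=random):
    """Simulate Y_{t_len} from dY = -theta*Y dt + sigma dB via Euler-Maruyama.
    Drawing paths is easy; evaluating the transition density is not (in general)."""
    dt = t_len / n_steps
    y = y0
    for _ in range(n_steps):
        y += -theta * y * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return y

random.seed(2)
samples = [euler_maruyama(1.0, theta=1.0, sigma=0.5, t_len=1.0, n_steps=200)
           for _ in range(2000)]
mean_s = sum(samples) / len(samples)
```

For this particular toy drift the transition density happens to be known (the OU process has E[Y_1 | Y_0 = 1] = e^{-1}), which makes the sketch checkable; for a general nonlinear drift it is not, and that is the source of the intractable likelihood.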
The Bernoulli Factory problem

- Let p ∈ (0, 1) be unknown.
- Given a black box that generates a sequence X_1, X_2, ... of p-coins,
- is it possible to generate an f(p)-coin for a known f?
- For example, f(p) = min{1, 2p}.
Some history

- von Neumann posed and solved (see e.g. Peres 1992) the case f(p) = 1/2.
- Algorithm:
  1. set n = 1;
  2. use the black box to sample X_n, X_{n+1};
  3. if (X_n, X_{n+1}) = (0, 1), output 1 and STOP;
  4. if (X_n, X_{n+1}) = (1, 0), output 0 and STOP;
  5. set n := n + 2 and GOTO 2.
- Asmussen posed an open problem for f(p) = 2p,
- but it turned out to be difficult.
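The von Neumann scheme above, sketched in Python (the p-coin black box is stood in for by a Bernoulli(p) sampler, purely for illustration):

```python
import random

def von_neumann_fair_coin(p_coin):
    """Turn a p-coin with unknown p into an exactly fair coin (f(p) = 1/2):
    toss pairs until they disagree; (0,1) -> output 1, (1,0) -> output 0.
    Both discordant outcomes have probability p(1-p), so the output is fair."""
    while True:
        x, y = p_coin(), p_coin()
        if (x, y) == (0, 1):
            return 1
        if (x, y) == (1, 0):
            return 0
        # (0,0) or (1,1): discard the pair and toss again

random.seed(3)
p_coin = lambda: 1 if random.random() < 0.3 else 0   # black box with p = 0.3
flips = [von_neumann_fair_coin(p_coin) for _ in range(20_000)]
freq = sum(flips) / len(flips)
```

The expected number of tosses is 1/(p(1-p)), which blows up as p approaches 0 or 1 — a first hint that efficiency, not just existence, matters for Bernoulli factories.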
Motivation I - MCMC for jump diffusions

- MCMC for jump diffusions with stochastic jump rate (F. Goncalves, G.O. Roberts, KL).
- Consider the model, for t ∈ [0, T]:

    γ_t ∼ Ornstein-Uhlenbeck(θ_1)
    λ_t = exp(γ_t)
    J_t ∼ JumpProcess(λ_t, d∆)
    dV_t = μ(V_{t-}, θ_2) dt + σ(V_{t-}, θ_2) dB_t + dJ_t

- Gibbs sampling from the full posterior will alternate between

    ((J_t, V_t) | ·); (λ_t | ·); (θ_1 | ·); (θ_2 | ·)

- Let's have a look at updating (λ_t | ·).
Motivation I - MCMC for jump diffusions

- For updating (λ_t | ·) compute

    p(γ_t | ·) = p(γ_t | J_t) ∝ p(γ_t) p(J_t | γ_t)
               = p(γ_t) exp{ -∫_0^T e^{γ_t} dt + Σ_{j=1}^{N_J} γ_{t_j} }
               = p(γ_t) K_γ exp{ -∫_0^T e^{γ_t} dt }
               = p(γ) K_γ I(γ)

- If the proposal is p(γ_t), then the Metropolis acceptance probability is of the form

    α(γ^(i), γ^(i+1)) = min{ 1, K(γ^(i), γ^(i+1)) I(γ^(i), γ^(i+1)) },

  where
  - K(γ^(i), γ^(i+1)) is a known constant;
  - we have a mechanism to generate events of probability I(γ^(i), γ^(i+1));
- so we have the Bernoulli factory problem with f(p) = min{1, Kp}.
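One standard way such a "mechanism to generate events of probability I(γ)" can work is Poisson thinning (an illustrative sketch under my own assumption that e^{γ_t} ≤ M on [0, T]; the construction actually used in the paper may differ): I(γ) = exp(-∫_0^T e^{γ_t} dt) is exactly the probability that a rate-M Poisson process on [0, T], thinned with probability e^{γ(t)}/M, retains no points.

```python
import math
import random

def poisson(lam, rng):
    """Knuth-style Poisson sampler (adequate for moderate lam)."""
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= L:
            return k
        k += 1

def I_event(gamma, T, M, rng=random):
    """Return 1 with probability exp(-int_0^T e^{gamma(t)} dt),
    assuming e^{gamma(t)} <= M on [0, T], via Poisson thinning."""
    n = poisson(M * T, rng)                       # points of the dominating process
    for _ in range(n):
        t = rng.uniform(0.0, T)                   # uniform point location
        if rng.random() < math.exp(gamma(t)) / M:  # thinning: keep this point?
            return 0                              # any retained point kills the event
    return 1

random.seed(4)
# Check with constant gamma(t) = 0 and T = 1: success probability is exp(-1).
hits = sum(I_event(lambda t: 0.0, T=1.0, M=1.5) for _ in range(20_000))
freq = hits / 20_000
```

Crucially, the event is generated exactly without ever computing the integral — which is precisely the p-coin that the Bernoulli factory then has to convert into a min{1, Kp}-coin.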
Motivation II - perfect sampling for Markov chains

- Consider {X_n}_{n≥0}, an ergodic Markov chain with transition kernel P and limiting distribution π.
- Under mild assumptions P can be decomposed as

    P(x, ·) = s(x) ν(·) + (1 - s(x)) Q(x, ·)

- and every time we sample from P we flip a coin with probability s(x) to decide between sampling from ν(·) and Q(x, ·).
- Let τ be the first time the coin points at ν(·).
- Then π(·) admits the decomposition

    π(·) = Σ_{n=0}^∞ p_n R^n(ν, ·),  where  p_n := Pr(τ ≥ n) / E(τ),  R(x, ·) = [P(x, ·) - s(x) ν(·)] / [1 - s(x)].

- It looks like perfect sampling from π is possible using rejection sampling.
  (S. Asmussen, P.W. Glynn, H. Thorisson 1992; J.P. Hobert, C.P. Robert 2004; J. Blanchet, X-L. Meng 2005; J. Blanchet, A.C. Thomas 2007)
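A toy sketch of the decomposition idea (my own illustrative example with a constant s(x) ≡ s, i.e. a Doeblin-type minorization, and arbitrary toy choices of ν and Q): each P-step first flips an s-coin; on heads it regenerates from ν, and τ, the first regeneration time, is then Geometric(s):

```python
import random

def p_step(x, s, sample_nu, sample_q, rng=random):
    """One draw from P(x, .) = s*nu(.) + (1-s)*Q(x, .); also report the coin."""
    if rng.random() < s:
        return sample_nu(), True     # regeneration: sample from nu
    return sample_q(x), False        # residual move: sample from Q(x, .)

def first_regeneration(x0, s, sample_nu, sample_q, rng=random):
    """tau = first time the coin points at nu(.)."""
    x, tau = x0, 0
    while True:
        tau += 1
        x, regen = p_step(x, s, sample_nu, sample_q, rng)
        if regen:
            return tau

random.seed(5)
s = 0.25
# Toy chain: nu = N(0,1), Q(x, .) = N(x/2, 1) (an AR(1)-type residual kernel).
taus = [first_regeneration(0.0, s,
                           lambda: random.gauss(0.0, 1.0),
                           lambda x: random.gauss(x / 2.0, 1.0))
        for _ in range(10_000)]
mean_tau = sum(taus) / len(taus)
```

With constant s the mean regeneration time is E(τ) = 1/s, which is what the weights p_n = Pr(τ ≥ n)/E(τ) in the series above normalize by.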
Motivation III - perfect sampling for Markov chains

- π(·) admits the decomposition

  π(·) = Σ_{n=0}^∞ p_n R^n(ν, ·),  where  p_n := Pr(τ ≥ n)/E(τ).

- Find a probability distribution d(n) s.t. Pr(τ > n) ≤ M d(n)
  (e.g. using drift conditions for geometrically ergodic chains).
- Now we can write

  Pr(τ > n)/E(τ) = [Pr(τ > n) / (E(τ) d(n))] · d(n).

- Goal: accept the d(n) proposal with probability proportional to Pr(τ > n)/(E(τ) d(n)),
- so we can use

  Pr(τ > n)/(M d(n)) =: K · Pr(τ > n) < 1,

- where K is known and we can sample events of probability Pr(τ > n).
- The above was successfully implemented by J. Flegal, R. Herbei 2012.
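The rejection scheme above can be sketched in a toy setting where τ is Geometric(s), so Pr(τ > n) = (1 − s)^n and E(τ) = 1/s are known in closed form; the proposal d(n) = q(1 − q)^n and the constants below are invented for illustration. With q < s, Pr(τ > n) ≤ M d(n) holds for M = 1/q, and the acceptance event of probability ((1 − s)/(1 − q))^n is realised by n independent coin flips, mimicking how a Pr(τ > n)-coin would be obtained by running the chain for n steps.

```python
import random

s, q = 0.5, 0.3   # target weights p_n = s(1-s)^n, proposal d(n) = q(1-q)^n
random.seed(1)

def sample_p():
    while True:
        # propose n ~ d, i.e. Geometric(q) on {0, 1, 2, ...}
        n = 0
        while random.random() >= q:
            n += 1
        # accept with probability Pr(tau > n)/(M d(n)) = ((1-s)/(1-q))^n,
        # realised as n independent Bernoulli flips
        if all(random.random() < (1 - s) / (1 - q) for _ in range(n)):
            return n

draws = [sample_p() for _ in range(100000)]
# accepted draws follow p_n = s(1-s)^n, so Pr(n = 0) should be close to s
print(abs(draws.count(0) / len(draws) - s) < 0.01)  # True
```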
Keane and O’Brien - existence result

- Keane and O’Brien (1994): let f : P ⊆ (0, 1) → [0, 1]. Then it is possible to
  simulate an f(p)-coin ⇐⇒
  - f is constant, or
  - f is continuous and for some n ∈ N and all p ∈ P satisfies

    min{f(p), 1 − f(p)} ≥ min{p, 1 − p}^n.

- However, their proof is not constructive.
- Note that the condition rules out min{1, 2p}, but not min{1 − ε, 2p}.
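The condition is easy to probe numerically; the grid and the exponent n = 5 below are arbitrary choices for illustration. At p = 1/2 the function min{1, 2p} reaches 1, so min{f, 1 − f} = 0 while min{p, 1 − p}^n > 0 and the bound fails for every n, whereas min{1 − ε, 2p} stays bounded away from both 0 and 1 near p = 1/2.

```python
eps = 0.2

def keane_obrien_ok(f, n, grid):
    # check min{f(p), 1-f(p)} >= min{p, 1-p}^n over a grid of p values
    return all(min(f(p), 1 - f(p)) >= min(p, 1 - p) ** n for p in grid)

grid = [i / 1000 for i in range(1, 1000)]
print(keane_obrien_ok(lambda p: min(1.0, 2 * p), 5, grid))      # False
print(keane_obrien_ok(lambda p: min(1 - eps, 2 * p), 5, grid))  # True
```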
Nacu-Peres Theorem - Bernstein polynomial approach

- There exists an algorithm which simulates f ⇐⇒ there exist polynomials

  g_n(x, y) = Σ_{k=0}^n C(n,k) a(n,k) x^k y^{n−k},   h_n(x, y) = Σ_{k=0}^n C(n,k) b(n,k) x^k y^{n−k}

  (C(n,k) denoting the binomial coefficient) such that
  - 0 ≤ a(n,k) ≤ b(n,k) ≤ 1,
  - C(n,k) a(n,k) and C(n,k) b(n,k) are integers,
  - lim_{n→∞} g_n(p, 1−p) = f(p) = lim_{n→∞} h_n(p, 1−p),
  - for all m < n

    a(n,k) ≥ Σ_{i=0}^k [C(n−m, k−i) C(m,i) / C(n,k)] a(m,i),   b(n,k) ≤ Σ_{i=0}^k [C(n−m, k−i) C(m,i) / C(n,k)] b(m,i).   (1)

- Nacu & Peres provide coefficients for f(p) = min{1 − ε, 2p} explicitly.
- Given an algorithm for f(p) = min{1 − ε, 2p}, Nacu & Peres develop a calculus
  that reduces simulating any real analytic g to nesting the algorithm for f.
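The integrality and convergence conditions can be illustrated with a deliberately naive choice of coefficients (this is not the Nacu-Peres construction, and the consistency inequalities (1) are not enforced by it): taking a(n,k) = ⌊C(n,k) f(k/n)⌋ / C(n,k) and b(n,k) = ⌈C(n,k) f(k/n)⌉ / C(n,k) makes C(n,k) a(n,k) and C(n,k) b(n,k) integers with a ≤ b, and the induced g_n, h_n converge to f.

```python
from math import comb, floor, ceil

eps = 0.2

def f(p):
    return min(1 - eps, 2 * p)

def g_h(n, p):
    # evaluate g_n(p, 1-p) and h_n(p, 1-p) for the naive floor/ceil coefficients
    g = h = 0.0
    for k in range(n + 1):
        c = comb(n, k)
        w = p ** k * (1 - p) ** (n - k)
        g += floor(c * f(k / n)) * w
        h += ceil(c * f(k / n)) * w
    return g, h

for n in (10, 100, 1000):
    g, h = g_h(n, 0.3)
    print(n, round(g, 4), round(h, 4))  # both columns approach f(0.3) = 0.6
```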
Summary of theoretical results

- Nacu and Peres show that the random running time of their algorithm has
  exponentially decaying tails for every real analytic function f.
- There are further interesting theoretical results relating the smoothness of f
  to the existence of Bernoulli Factory algorithms with a given running time
  (see O. Holtz, F. Nazarov, Y. Peres, 2011).
- Other results (E. Mossel, Y. Peres, C. Hillar, 2005) concern constructing a
  Bernoulli Factory for f rational over Q by a finite automaton.
Bernstein polynomial approach - too nice to be true?

- At time n the N-P algorithm computes sets A_n and B_n, subsets of all 0-1
  strings of length n.
- The cardinalities of A_n and B_n are precisely C(n,k) a(n,k) and C(n,k) b(n,k).
- The upper polynomial approximation converges slowly to f.
- The length of the 0-1 strings is 2^15 = 32768 and above, e.g. 2^25 = 33554432.
- One has to deal efficiently with the set of 2^(2^25) strings, of length 2^25 each.

- We shall develop a reverse time martingale approach to the problem.
- We will construct reverse time super- and submartingales that perform a random
  walk on the Nacu-Peres polynomial coefficients a(n,k), b(n,k), resulting in a
  black box whose algorithmic cost is linear in the number of original p-coins.
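For a sense of scale, even at the smaller exponent 2^15 quoted above, the mere cardinality of the string set is an integer with thousands of digits, so any enumeration of these sets (let alone at 2^25) is hopeless:

```python
# number of 0-1 strings of length 2**15
n_strings = 2 ** (2 ** 15)
print(len(str(n_strings)))  # 9865, i.e. a 9865-digit count of strings
```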
Reverse time martingale approach to sampling

- Reverse time martingale approach to sampling events of unknown probability
  (KL, I. Kosmidis, O. Papaspiliopoulos, G.O. Roberts, RSA 2011).
- We shall progress gradually from a simple to a general algorithm for sampling
  events of unknown probability, constructively.
- s is the unknown "target" probability ("s = f(p)").
- It is determined uniquely, but it cannot be computed, and increasing the
  knowledge/precision about s is algorithmically expensive.
Algorithm 0 - randomization
- Lemma: Sampling events of probability s ∈ [0, 1] is equivalent to constructing an unbiased estimator of s taking values in [0, 1] with probability 1.
- Proof: Let S, s.t. ES = s and P(S ∈ [0, 1]) = 1, be the estimator. Draw G0 ∼ U(0, 1), obtain S, and define a coin Cs := I(G0 ≤ S). Then

  P(Cs = 1) = E I(G0 ≤ S) = E( E( I(G0 ≤ S) | S ) ) = E S = s.

  The converse is straightforward, since an s-coin is itself an unbiased estimator of s with values in [0, 1].
- Algorithm 0
  1. simulate G0 ∼ U(0, 1);
  2. obtain S;
  3. if G0 ≤ S set Cs := 1, otherwise set Cs := 0;
  4. output Cs.
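Algorithm 0 fits in a few lines of Python. A minimal sketch: the particular estimator below (S ∼ U(0, 2s), which is unbiased for s and supported on [0, 1] whenever s ≤ 1/2) is a hypothetical example chosen for illustration, not from the slides.

```python
import random

def algorithm0_coin(sample_S, rng=random.random):
    """Algorithm 0: one toss of an exact s-coin from an unbiased
    [0,1]-valued estimator S of s."""
    g0 = rng()      # step 1: G0 ~ U(0,1)
    S = sample_S()  # step 2: obtain one draw of the estimator
    return 1 if g0 <= S else 0  # steps 3-4

# Hypothetical estimator: S ~ U(0, 2s) has ES = s and lives in [0,1] for s <= 1/2.
s = 0.4
sample_S = lambda: random.uniform(0.0, 2 * s)
```

Averaging many tosses recovers s, confirming that the randomization turns the estimator into an exact coin.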
Algorithm 1 - monotone deterministic bounds
- Let l1, l2, ... and u1, u2, ... be monotone sequences of lower and upper bounds for s converging to s, i.e. ln ↑ s and un ↓ s.
- Algorithm 1
  1. simulate G0 ∼ U(0, 1); set n = 1;
  2. compute ln and un;
  3. if G0 ≤ ln set Cs := 1;
  4. if G0 > un set Cs := 0;
  5. if ln < G0 ≤ un set n := n + 1 and GOTO 2;
  6. output Cs.
- Remark: P(N > n) = un − ln.
- If (Cln)n≥1 and (Cun)n≥1 are sequences of coins s.t. P(Cln = 1) = ln and P(Cun = 1) = un respectively,
- then Algorithm 1 corresponds to a coupling of (Cln)n≥1 and (Cun)n≥1 s.t. Cln = Cun for all n ≥ N, where N is the random number of iterations needed.
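A minimal Python sketch of Algorithm 1. The target s = e⁻¹, bracketed by partial sums of the alternating series Σ(−1)ᵏ/k!, is an illustrative assumption: odd-order partial sums increase to the limit and even-order sums decrease to it.

```python
import math
import random

def algorithm1_coin(lower, upper, rng=random.random):
    """One toss of an s-coin from deterministic bounds l_n <= s <= u_n,
    with l_n increasing to s and u_n decreasing to s."""
    g0 = rng()  # G0 ~ U(0,1), drawn once and reused at every refinement
    n = 1
    while True:
        ln, un = lower(n), upper(n)
        if g0 <= ln:
            return 1
        if g0 > un:
            return 0
        n += 1  # l_n < G0 <= u_n: refine the bounds and try again

# Bounds for s = exp(-1): partial sums of sum_k (-1)^k / k! alternate
# around the limit, bracketing it from both sides.
def partial_sum(m):
    return sum((-1) ** k / math.factorial(k) for k in range(m + 1))

lower = lambda n: partial_sum(2 * n + 1)  # 1/3, 0.3666..., increasing to s
upper = lambda n: partial_sum(2 * n)      # 0.5, 0.375, decreasing to s
```

Here P(N > n) = un − ln decays factorially, so very few refinements are needed per toss.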
Algorithm 2 - monotone stochastic bounds
- Assume the random bounds (Ln, Un)n≥1 satisfy, almost surely for every n:

  Ln ≤ Un,
  Ln ∈ [0, 1] and Un ∈ [0, 1],
  Ln−1 ≤ Ln and Un−1 ≥ Un,
  E Ln = ln ↑ s and E Un = un ↓ s.

  Let F0 = {∅, Ω}, Fn = σ(Ln, Un), and Fk,n = σ(Fk, Fk+1, ..., Fn) for k ≤ n.
- Algorithm 2
  1. simulate G0 ∼ U(0, 1); set n = 1;
  2. obtain Ln and Un conditionally on F1,n−1;
  3. if G0 ≤ Ln set Cs := 1;
  4. if G0 > Un set Cs := 0;
  5. if Ln < G0 ≤ Un set n := n + 1 and GOTO 2;
  6. output Cs.
- Thm: In the above algorithm, E Cs = s.
Algorithm 3 - reverse time martingales
Recall the conditions on the bounds:

  Ln ≤ Un, (2)
  Ln ∈ [0, 1] and Un ∈ [0, 1], (3)
  Ln−1 ≤ Ln and Un−1 ≥ Un, (4)
  E Ln = ln ↑ s and E Un = un ↓ s, (5)

with F0 = {∅, Ω}, Fn = σ(Ln, Un), Fk,n = σ(Fk, Fk+1, ..., Fn) for k ≤ n.

The final step is to weaken condition (4) and let Ln be a reverse-time supermartingale and Un a reverse-time submartingale with respect to Fn,∞. Precisely, assume that for every n = 1, 2, ... we have

  E (Ln−1 | Fn,∞) = E (Ln−1 | Fn) ≤ Ln a.s. and (6)
  E (Un−1 | Fn,∞) = E (Un−1 | Fn) ≥ Un a.s. (7)
Algorithm 3 - reverse time martingales
- Algorithm 3
  1. simulate G0 ∼ U(0, 1); set n = 1; set L0 ≡ L̃0 ≡ 0 and U0 ≡ Ũ0 ≡ 1;
  2. obtain Ln and Un given F0,n−1;
  3. compute L∗n = E (Ln−1 | Fn) and U∗n = E (Un−1 | Fn);
  4. compute

     L̃n = L̃n−1 + [(Ln − L∗n) / (U∗n − L∗n)] (Ũn−1 − L̃n−1),
     Ũn = Ũn−1 − [(U∗n − Un) / (U∗n − L∗n)] (Ũn−1 − L̃n−1);

  5. if G0 ≤ L̃n set Cs := 1;
  6. if G0 > Ũn set Cs := 0;
  7. if L̃n < G0 ≤ Ũn set n := n + 1 and GOTO 2;
  8. output Cs.
- L̃n and Ũn satisfy the assumptions of Algorithm 2.
Application to the Bernoulli Factory problem
- Let X1, X2, . . . be iid tosses of a p-coin.
- Define (Ln, Un)n≥1 as follows: if

  X1 + X2 + · · · + Xn = k,

  let Ln = a(n, k) and Un = b(n, k).
- Verify the assumptions of Algorithm 3.
- Here (Ln, Un)n≥1 are random walks on the coefficients of the Nacu–Peres polynomials, with dynamics driven by the original p-coins.
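In the special case where f is itself a polynomial with all Bernstein coefficients in [0, 1], the lower and upper coefficients coincide once n reaches the degree, and the factory collapses to a single pass: toss d coins, count heads k, and flip a coin with success probability equal to the k-th coefficient. A hedged sketch; the example f(p) = 1 − (1 − p)², with degree-2 Bernstein coefficients (0, 1, 1), is my illustration, not from the slides.

```python
import random

def bernstein_factory(p_coin, coeffs, rng=random.random):
    """Toss d = len(coeffs) - 1 p-coins; on k heads, return 1 with
    probability coeffs[k].  Exact whenever
    f(p) = sum_k coeffs[k] * C(d, k) * p**k * (1-p)**(d-k)
    and every coeffs[k] lies in [0, 1]."""
    d = len(coeffs) - 1
    k = sum(p_coin() for _ in range(d))
    return 1 if rng() <= coeffs[k] else 0

# f(p) = 2p - p^2 = 1 - (1-p)^2 has degree-2 Bernstein coefficients (0, 1, 1):
p = 0.3
p_coin = lambda: 1 if random.random() < p else 0
```

For general f the coefficients a(n, k), b(n, k) only bracket f, and the full reverse-martingale machinery above is what makes the refinement exact.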
Application to the Bernoulli Factory problem
- The reverse-time martingale approach is the first constructive and practical implementation of a general Bernoulli factory.
- In particular, the Nacu–Peres polynomials can be utilised for f(p) = min{1 − ε, Kp}, yielding a practical algorithm for the Metropolis accept-reject step in the discussed scenarios (and many others; see e.g. work by R. Herbei and M. Berliner).
- J. Flegal and R. Herbei use the reverse martingale approach to implement the Markov chain perfect sampling algorithm discussed above.
Application to Metropolis-Hastings
- Recall that in MH (say with proposal Q) we needed a Bernoulli factory for f(p) = min{1, Kp}.
- f(p) = min{1, Kp} is impossible; f(p) = min{1 − ε, Kp} is possible.
- Consider a lazy version of the chain,

  εI + (1 − ε)P.

- It turns out this is an accept-reject algorithm with proposal Q and acceptance probability

  min{1 − ε, (1 − ε)Kp}.
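The lazy trick rests on a one-line identity: holding with probability ε and otherwise running a Metropolis step simply multiplies the acceptance rate by (1 − ε), which turns the impossible factory target into the solvable truncated one:

```latex
% Acceptance probability of the lazy chain \varepsilon I + (1-\varepsilon)P:
(1-\varepsilon)\,\alpha_M
  = (1-\varepsilon)\min\{1,\,Kp\}
  = \min\{\,1-\varepsilon,\ (1-\varepsilon)Kp\,\}.
```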
Barkers Algorithm
- Recall the Metropolis algorithm: to sample from π we propose from q(x, y) and accept with probability

  αM(x, y) = min{1, π(y)q(y, x) / (π(x)q(x, y))} =: 1 ∧ R(x, y),

- in order to satisfy detailed balance π(x)P(x, y) = π(y)P(y, x).
- Other choices of the acceptance function can yield detailed balance too!
- Any acceptance rate of the form g(R(x, y)) will do, provided

  g(R) = R g(1/R).

- The Barkers acceptance rate is

  αB(x, y) = π(y)q(y, x) / (π(y)q(y, x) + π(x)q(x, y)),

  so for Barkers g(R) = R / (1 + R).
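Barker's rule can be checked numerically. The sketch below computes αB and g, and verifies both the balance identity g(R) = R g(1/R) and detailed balance on a hypothetical two-state target with a symmetric proposal (the numbers 0.3 and 0.7 are my example):

```python
def barker_accept(pi_x, pi_y, q_xy, q_yx):
    """Barker's acceptance probability for a proposed move x -> y:
    pi(y) q(y,x) / (pi(y) q(y,x) + pi(x) q(x,y))."""
    num = pi_y * q_yx
    return num / (num + pi_x * q_xy)

def g_barker(R):
    """Barker's acceptance as a function of the MH ratio R."""
    return R / (1.0 + R)
```

Because the denominator adds the two terms rather than truncating at 1, αB is a genuine ratio of the quantities π(y)q(y, x) and π(x)q(x, y), which is what makes it amenable to the two-coin Bernoulli factory treatment.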
Barkers Algorithm - efficiency
- The Metropolis acceptance function is optimal with respect to the Peskun ordering.
- Suppose we estimate πf := ∫ f(x)π(dx) by π̂f := (1/n) Σi=1..n f(Xi).
- Then, under mild assumptions, the Markov chain CLT holds:

  √n (π̂f − πf) → N(0, σ²as(f, P)).

- By the Peskun ordering,

  σ²as(f, PBarker) ≥ σ²as(f, PMetropolis).

- However:

  σ²as(f, PBarker) ≤ 2 σ²as(f, PMetropolis) + σ²as(f, π)

  (KL, G. O. Roberts, 2013).
- So Barkers is not that much worse than Metropolis!
recall: The Bernoulli Factory for Metropolis-Hastings
I In the intractable likelihood setting the Metropolis-Hastings acceptance rate takes the form
$$f(p_1, p_2) = 1 \wedge \frac{c_1 p_1}{c_2 p_2}$$
and can usually be rewritten as
$$f(p_3) = 1 \wedge c_3 p_3 \quad\text{and then}\quad f(p_3) = (1-\varepsilon) \wedge (1-\varepsilon)\,c_3 p_3.$$
I but this is still a difficult Bernoulli Factory problem - not suitable for many applications.
Barkers and the Bernoulli Factory
I In the scenarios where we need a Bernoulli Factory to execute the Metropolis acceptance rate, we can typically also write the Barkers acceptance rate in the form
$$\alpha_B(x, y) = \frac{Kq}{Mp + Kq},$$
I where K and M are known constants and p and q are probabilities that we can sample.
I An event of probability $\alpha_B(x, y)$ may be obtained more efficiently by the following algorithm:
The two coin algorithm
I Assume there is a black box generating $p$-coins and another black box generating $q$-coins.
I Assume p and q are unknown and, for known K, M, we wish to obtain an event of probability
$$\frac{Kq}{Mp + Kq} = \frac{\frac{K}{K+M}\,q}{\frac{M}{K+M}\,p + \frac{K}{K+M}\,q}$$
I Two coin algorithm
(1) draw $C \sim \frac{K}{K+M}$-coin,
(2) if C = 1 draw $X \sim q$-coin,
    if X = 1, output 1 and STOP
    if X = 0, GOTO (1).
(3) if C = 0 draw $X \sim p$-coin,
    if X = 1, output 0 and STOP
    if X = 0, GOTO (1).
I The number of iterations N needed by the algorithm for a single output has a geometric distribution with parameter $\frac{M}{K+M}\,p + \frac{K}{K+M}\,q$.
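The two coin algorithm above is short enough to simulate directly. A minimal Python sketch (the particular values of p, q, K, M are arbitrary illustrations, not from the talk):

```python
import random

def two_coin(p_coin, q_coin, K, M, rng):
    """One run of the two coin algorithm: returns a single
    Bernoulli(Kq / (Mp + Kq)) draw using only coin flips."""
    while True:
        # (1) draw C ~ K/(K+M)-coin
        if rng.random() < K / (K + M):
            # (2) C = 1: flip a q-coin; success outputs 1
            if q_coin(rng):
                return 1
        else:
            # (3) C = 0: flip a p-coin; success outputs 0
            if p_coin(rng):
                return 0
        # either flip failed: restart from (1)

rng = random.Random(1)
p, q, K, M = 0.3, 0.6, 2.0, 5.0
p_coin = lambda r: r.random() < p
q_coin = lambda r: r.random() < q

est = sum(two_coin(p_coin, q_coin, K, M, rng) for _ in range(20000)) / 20000
target = K * q / (M * p + K * q)
```

With these values the target is $1.2/2.7 \approx 0.444$, and the empirical frequency `est` should match it to Monte Carlo accuracy.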
The s-poly-Barkers acceptance rate
I Recall Barkers:
$$\alpha_B = g(R) = \frac{R}{1+R} \quad\text{where}\quad R(x,y) = \frac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)}$$
I consider [D Vats, GO Roberts, KL, 2018]
$$\alpha_{spB} = g(R) = \frac{\sum_{i=0}^{s} R^i - 1}{\sum_{i=0}^{s} R^i}.$$
I $\alpha_{spB}(x,y) \to \alpha_{MH}(x,y)$ as $s \to \infty$
I Asymptotic variances satisfy
$$\sigma_{MH}(f) \leq \sigma_{spB}(f) \leq \frac{s+1}{s}\,\sigma_{MH}(f) + \frac{1}{s}\,\sigma_\pi(f)$$
I We can extend the two coin algorithm to s-poly-Barkers
I We can do even better
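The convergence $\alpha_{spB} \to \alpha_{MH}$ can be seen concretely with a few lines of code (a sketch; $R = 0.7$ is an arbitrary illustration, and $s = 1$ recovers plain Barkers):

```python
def alpha_spb(R, s):
    """s-poly-Barkers acceptance: (sum_{i=0}^s R^i - 1) / sum_{i=0}^s R^i."""
    denom = sum(R**i for i in range(s + 1))
    return (denom - 1) / denom

def alpha_mh(R):
    """Metropolis-Hastings acceptance: min(1, R)."""
    return min(1.0, R)

R = 0.7
barker = R / (1 + R)  # s = 1 should reduce to this
# gap to the Metropolis-Hastings acceptance shrinks as s grows
gaps = [abs(alpha_spb(R, s) - alpha_mh(R)) for s in (1, 2, 5, 10, 50)]
```

Since $\sum_{i=0}^{s} R^i$ is increasing in $s$, the acceptance rate increases monotonically toward $\min(1, R)$.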
The Dice Enterprise [KL, G Morina, A Wendland, 2018]
I $\Delta_m = \left\{ p = (p_1, \ldots, p_m) \in (0,1)^m : \sum_{i=1}^{m} p_i = 1 \right\}$
I $f : \Delta_m \to \Delta_v$ - a rational function. We have the mapping $p \to f(p)$.
I The strategy:
  I Design a Markov chain that admits f(p) as stationary distribution;
  I Using samples from p is enough to sample the dynamics of the Markov chain;
  I Apply a Markov chain perfect sampling algorithm to sample from the stationary distribution exactly,
    e.g. Coupling From the Past (CFTP) - Propp and Wilson 1995.
2 Examples
I $\Delta_m = \left\{ p = (p_1, \ldots, p_m) \in (0,1)^m : \sum_{i=1}^{m} p_i = 1 \right\}$
I $f : \Delta_m \to \Delta_v$ - a rational function. We have the mapping $p \to f(p)$.
I Design a Markov chain that admits f(p) as stationary distribution; using samples from p is enough to sample the dynamics of the Markov chain.
I $$f(p) = \left( \frac{p_1^{30}}{p_1^{30} + p_2^{50} + p_3^{40}},\; \frac{p_2^{50}}{p_1^{30} + p_2^{50} + p_3^{40}},\; \frac{p_3^{40}}{p_1^{30} + p_2^{50} + p_3^{40}} \right)$$
I $$f(p) = \left( \frac{p_1^{30}}{p_1^{30} + (p_2 - p_3)^{50}},\; \frac{(p_2 - p_3)^{50}}{p_1^{30} + (p_2 - p_3)^{50}} \right)$$
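Both example targets can be evaluated directly. A small sketch, assuming (as the simplex constraint requires) that each denominator is the sum of its numerators, so that every $f(p)$ is a probability vector:

```python
def example1(p1, p2, p3):
    """f(p) proportional to (p1^30, p2^50, p3^40)."""
    num = (p1**30, p2**50, p3**40)
    z = sum(num)
    return tuple(n / z for n in num)

def example2(p1, p2, p3):
    """f(p) proportional to (p1^30, (p2 - p3)^50); the even power keeps
    the second component nonnegative even when p2 < p3."""
    num = (p1**30, (p2 - p3)**50)
    z = sum(num)
    return tuple(n / z for n in num)

f1 = example1(0.5, 0.3, 0.2)
f2 = example2(0.5, 0.3, 0.2)
```

Evaluating $f$ is trivial; the hard part, which the Dice Enterprise solves, is producing exact draws from $f(p)$ when only $p$-coins can be flipped.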
Disaggregation
I We will be doing a lot of this:
I $f: \left(\tfrac{1}{3},\ \tfrac{1}{2},\ \tfrac{1}{6}\right) \qquad \pi: \left(\tfrac{1}{5},\ \tfrac{1}{20},\ \tfrac{1}{5},\ \tfrac{1}{4},\ \tfrac{1}{6},\ \tfrac{2}{15}\right)$
I Given a rational function $f : \Delta_m \to \Delta_v$, we will construct a new discrete probability distribution
$$\pi : \Delta_m \to \Delta_k, \qquad k > v,$$
called a ladder, such that a sample from f(p) can be transformed into a sample from π(p) and vice-versa.
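One way to read the numbers above: each coarse state of $f$ splits into a group of ladder states of $\pi$ whose probabilities sum back to it. The grouping below is an illustrative assumption (the slide does not say which $\pi$-states map to which $f$-states), chosen so that the group sums match exactly:

```python
from fractions import Fraction as F

f = [F(1, 3), F(1, 2), F(1, 6)]
pi = [F(1, 5), F(1, 20), F(1, 5), F(1, 4), F(1, 6), F(2, 15)]

# Hypothetical grouping: indices of pi-states that aggregate
# to each component of f (1/5 + 2/15 = 1/3, etc.)
groups = [[0, 5], [1, 2, 3], [4]]
aggregated = [sum(pi[i] for i in g) for g in groups]
```

Exact rational arithmetic via `Fraction` makes the aggregation check exact rather than up to floating-point error.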
Multivariate ladder over R
I Let $p = (p_1, \ldots, p_m)$ and $\pi(p) = (\pi_1(p), \ldots, \pi_k(p))$ be a probability distribution on $\{1, \ldots, k\}$ for every $p \in \Delta_m$. We say that $\pi(p)$ is a ladder over $\mathbb{R}$ if every $\pi_i$ is of the form
$$\pi_i(p) = \frac{R_i \prod_{j=1}^{m} p_j^{n_{i,j}}}{C(p)} \qquad (8)$$
where
I $C(p)$ is a polynomial with real coefficients that does not admit any root in $\Delta_m$;
I $\forall i, j$: $R_i$ is a strictly positive real constant and $n_{i,j} \in \mathbb{N}_{\geq 0}$;
I Denote $n_i = (n_{i,1}, n_{i,2}, \ldots, n_{i,m})$. Then there exists an integer $d$ such that $\forall i$, $\|n_i\|_1 = d$, where the 1-norm of a vector $a = (a_1, \ldots, a_n)$ is $\|a\|_1 = \sum_{j=1}^{n} |a_j|$. We will refer to $n_i$ as the degree of $\pi_i(p)$ and to $d$ as the degree of $\pi(p)$.
Moreover, we say that $\pi(p)$ is a connected ladder if
I for each $i, j \in \{1, \ldots, k\}$ states $i$ and $j$ are connected, meaning that there exists a sequence $(n^{(1)} = n_i, n^{(2)}, \ldots, n^{(t-1)}, n^{(t)} = n_j)$ such that $\|n^{(h)} - n^{(h-1)}\|_1 \leq 2$ for all $h \in \{2, \ldots, t\}$.
Finally, we say that $\pi(p)$ is a fine ladder if
I $n_i = n_j$ implies $i = j$.
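All three defining properties depend only on the exponent vectors $n_i$, so they can be checked mechanically. A sketch (the helper names are ours, not from the talk), applied to the six degree-2 monomials in three variables:

```python
from itertools import combinations

def has_common_degree(ns):
    """Ladder condition: all exponent vectors share the same 1-norm d."""
    return len({sum(n) for n in ns}) == 1

def is_fine(ns):
    """Fine: distinct states carry distinct exponent vectors."""
    return len(set(ns)) == len(ns)

def is_connected(ns):
    """Connected: the graph with edges ||n_i - n_j||_1 <= 2 has one component."""
    k = len(ns)
    adj = {i: [] for i in range(k)}
    for i, j in combinations(range(k), 2):
        if sum(abs(a - b) for a, b in zip(ns[i], ns[j])) <= 2:
            adj[i].append(j)
            adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == k

# exponent vectors of all degree-2 monomials in p1, p2, p3
ns = [(2, 0, 0), (1, 1, 0), (1, 0, 1), (0, 2, 0), (0, 1, 1), (0, 0, 2)]
```

This full set of degree-2 monomials is both fine and connected; dropping the cross terms, as in the counterexamples on the following slides, breaks connectivity.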
Fine and connected ladder $\pi : \Delta_3 \to \Delta_5$
[Figure: the five ladder states $R_1 p_1^2/C(p)$, $R_2 p_1 p_2/C(p)$, $R_3 p_1 p_3/C(p)$, $R_4 p_2^2/C(p)$, $R_5 p_3^2/C(p)$; the $p_2 p_3$ monomial has coefficient 0.]
Fine, but not connected ladder $\pi : \Delta_3 \to \Delta_4$
[Figure: the four ladder states $R_1 p_1^2/C(p)$, $R_2 p_1 p_2/C(p)$, $R_3 p_2^2/C(p)$, $R_4 p_3^2/C(p)$; the $p_1 p_3$ and $p_2 p_3$ monomials have coefficient 0, leaving the $p_3^2$ state disconnected.]
Connected, but not fine ladder $\pi : \Delta_3 \to \Delta_6$
[Figure: the six ladder states $R_1 p_1^2/C(p)$, $R_2 p_1 p_2/C(p)$, $R_3 p_1 p_3/C(p)$, $R_4 p_1 p_3/C(p)$, $R_5 p_2^2/C(p)$, $R_6 p_3^2/C(p)$; the monomial $p_1 p_3$ appears twice, so the ladder is not fine. The $p_2 p_3$ monomial has coefficient 0.]
Main Theorem
I Let $f : \Delta_m \to \Delta_v$ be a probability distribution such that every $f_i(p)$ is a rational function with real coefficients. Then one can explicitly construct a fine and connected ladder $\pi : \Delta_m \to \Delta_k$ such that sampling from $\pi$ is equivalent to sampling from $f$.
I utilizes the following theorem by Polya:
  I Let $g : \Delta_m \to \mathbb{R}$ be a homogeneous and positive polynomial in the variables $p_1, \ldots, p_m$, i.e. all the monomials of the polynomial have the same degree. Then for all sufficiently large $n$, all the coefficients of $(p_1 + \ldots + p_m)^n\, g(p_1, \ldots, p_m)$ are positive.
I plus some trickery
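Polya's theorem is easy to watch in action. A small sketch with $m = 2$ and $g(p_1, p_2) = p_1^2 - p_1 p_2 + p_2^2$, which has a negative coefficient yet is strictly positive away from the origin (the polynomial-as-dict representation is our own convenience):

```python
def poly_mult(a, b):
    """Multiply two bivariate polynomials stored as {(i, j): coeff} for p1^i p2^j."""
    out = {}
    for (i1, j1), c1 in a.items():
        for (i2, j2), c2 in b.items():
            key = (i1 + i2, j1 + j2)
            out[key] = out.get(key, 0) + c1 * c2
    return out

g = {(2, 0): 1, (1, 1): -1, (0, 2): 1}   # p1^2 - p1*p2 + p2^2, positive on the simplex
s = {(1, 0): 1, (0, 1): 1}               # p1 + p2

# multiply by (p1 + p2)^n until every coefficient is strictly positive
poly, n = dict(g), 0
while any(c <= 0 for c in poly.values()):
    poly = poly_mult(poly, s)
    n += 1
```

For this $g$ the loop stops at $n = 3$, where $(p_1 + p_2)^3 g = p_1^5 + 2p_1^4 p_2 + p_1^3 p_2^2 + p_1^2 p_2^3 + 2p_1 p_2^4 + p_2^5$ has all coefficients positive; in the Dice Enterprise this multiplication is what turns a rational target into a ladder with positive constants $R_i$.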
The Markov chain
[Figure: the ten states of a degree-3 ladder on $\Delta_3$, $R_i \prod_j p_j^{n_{i,j}}/C(p)$ for the monomials $p_1^3$, $p_1^2 p_2$, $p_1^2 p_3$, $p_1 p_2^2$, $p_1 p_2 p_3$, $p_1 p_3^2$, $p_2^3$, $p_2^2 p_3$, $p_2 p_3^2$, $p_3^3$, arranged so that neighbouring states differ by moving one unit of degree between coordinates.]
The Markov chain

[Figure: the local dynamics of the ladder — from a state with weight πi the chain moves to neighbouring states, with transition probabilities built from p1, p2, p3.]

And the second term accounts for the Ri's in such a way that the chain is optimal in the Peskun ordering within the class of chains whose dynamics operate on the same neighbourhood structure and use the p-dynamics.
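As a two-point illustration of the Peskun comparison invoked here (our own toy, not the ladder chain itself): for a proposed move with target ratio r, a Barker-type acceptance r/(1 + r) is dominated by the Metropolis acceptance min(1, r), which is why Peskun-optimality within an implementable class is the relevant criterion:

```python
# Toy Peskun comparison: for any target ratio r = pi(y)/pi(x), Barker-type
# acceptance r/(1+r) never exceeds the Metropolis acceptance min(1, r).
# Larger off-diagonal transition probabilities = better in Peskun's ordering.
ratios = [0.01, 0.5, 1.0, 2.0, 100.0]
barker = [r / (1 + r) for r in ratios]
metropolis = [min(1.0, r) for r in ratios]
dominated = all(b <= m for b, m in zip(barker, metropolis))
```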
From coins to dice: monotone CFTP

[Figure: a birth-death ladder on states R1, . . . , R5 with stationary weights π(Ri) = p^{i−1}(1 − p)^{5−i}/C(p), i.e. (1−p)^4/C(p), p(1−p)^3/C(p), p^2(1−p)^2/C(p), p^3(1−p)/C(p), p^4/C(p), and transition probabilities Pi,j that are nonzero only for |i − j| ≤ 1, the monotone tridiagonal structure that makes coupling from the past applicable.]
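A minimal sketch of monotone CFTP (Propp-Wilson) on a birth-death ladder of this shape; the target weights p^i(1 − p)^{4−i}, the Metropolis update, and all constants are our own illustrative choices, not the talk's construction:

```python
import random

# Monotone CFTP sketch for a 5-state birth-death ladder with stationary
# weights pi(i) proportional to p^i (1-p)^(4-i), i = 0..4.
P_COIN = 0.3
N = 5
RHO = P_COIN / (1 - P_COIN)          # pi(i+1) / pi(i)

def up(i):    # Metropolis probability of the move i -> i+1
    return 0.5 * min(1.0, RHO) if i < N - 1 else 0.0

def down(i):  # Metropolis probability of the move i -> i-1
    return 0.5 * min(1.0, 1.0 / RHO) if i > 0 else 0.0

def phi(i, u):
    """Monotone stochastic update: phi(i,u) <= phi(j,u) whenever i <= j,
    because up(i) < 1 - down(i+1) for this chain."""
    if u < up(i):
        return i + 1
    if u > 1.0 - down(i):
        return i - 1
    return i

def cftp(rng):
    """Coupling from the past with the monotone update phi: run the minimal
    and maximal chains from time -T with shared randomness; on coalescence
    the common value at time 0 is an exact draw from pi."""
    us = []                          # us[k] drives the step at time -(k+1)
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())
        lo, hi = 0, N - 1
        for t in range(T, 0, -1):    # apply steps from time -T up to -1
            lo, hi = phi(lo, us[t - 1]), phi(hi, us[t - 1])
        if lo == hi:
            return lo
        T *= 2                       # restart further back, reusing old u's

draw = cftp(random.Random(1))
```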
The Markov switching diffusion model
I V = Vt, t ∈ [0,T] follows dynamics described by the stochastic differentialequation
dVt = b(Vt,Yt, θ)dt + σ(Vt, θ)γ(Yt, θ)dBt, (9)
whereI Y = Yt, t ∈ [0,T] is a continuous time jump process onY = 1, . . . ,m, m ∈ N ∪ ∞,
I Bt is the Brownian motionI θ ∈ Θ is an unknown parameterI moreover L 3 Λ = λi,j is the intensity matrix for the dynamics of Y
I denote by Ω, F , P the probability spaceI and assume Bt and Yt are independent under P.I we observe V = Vt at discrete time instances
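To make the ingredients of (9) concrete, here is a purely illustrative discrete-time simulation (Euler-Maruyama with a two-state Y; the drift b(v) = −v, the volatilities, and the intensities are our own made-up choices; the talk's methodology is exactly about avoiding such discretization):

```python
import math
import random

# Illustrative-only simulation of a two-regime Markov switching diffusion:
#   dV_t = -V_t dt + gamma(Y_t) dB_t,   Y_t a 2-state jump process.
# Euler-Maruyama discretization; the exact-inference methods discussed in
# the talk deliberately avoid this kind of approximation.
def simulate(T=10.0, dt=1e-3, lam=(0.5, 1.0), gamma=(0.3, 1.2), seed=0):
    rng = random.Random(seed)
    y, v = 0, 0.0                    # current regime Y and diffusion state V
    path = [(0.0, v, y)]
    n = int(round(T / dt))
    for k in range(1, n + 1):
        # regime switch with intensity lam[y] (first-order approximation)
        if rng.random() < lam[y] * dt:
            y = 1 - y
        # Euler-Maruyama step for V
        v += -v * dt + gamma[y] * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append((k * dt, v, y))
    return path

path = simulate()
```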
exact Bayesian inference (L, Palczewski, Roberts)

I dVt = b(Vt,Yt, θ)dt + σ(Vt, θ)γ(Yt, θ)dBt,   t ∈ [0,T],

I Let VD be the observed discrete data from V and VM the missing parts of the trajectory, i.e. V = (VD,VM).
I Bayesian setting: we assume prior distributions on the unknown parameters:
I θ ∼ πθ on Θ and Λ ∼ πΛ on L.
I The goal is to explore, via MCMC, the posterior distribution

π(VM,Y,Λ, θ |VD) ∝ πθ(θ)πΛ(Λ)π(Y |Λ)π(VM,VD |Y, θ)   (10)

I Note that the state space of the target distribution is infinite dimensional;
I in particular, VM is a continuous time diffusion path.
I Nevertheless, the limiting distribution of our MCMC algorithm is the exact full posterior (10).
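The π(Y | Λ) factor in (10) is the familiar path density of a finite-activity jump process. As a sketch (our own minimal helper, with made-up numbers), it can be evaluated from the jump times and visited states:

```python
import math

# Sketch of the pi(Y | Lambda) factor in (10) for a finite-activity jump
# process on {0, ..., m-1}: holding times are exponential with the state's
# total exit intensity, and each jump contributes log lambda_{i,j}.
# Lam is an intensity matrix with rows summing to zero.
def ctmc_path_logdensity(states, jump_times, T, Lam):
    log_p = 0.0
    times = [0.0] + list(jump_times) + [T]
    for k, s in enumerate(states):
        hold = times[k + 1] - times[k]
        rate = -Lam[s][s]                 # total exit intensity from state s
        log_p += -rate * hold             # survival over the holding time
        if k + 1 < len(states):           # a jump occurs at times[k + 1]
            log_p += math.log(Lam[s][states[k + 1]])
    return log_p

Lam = [[-1.0, 1.0], [2.0, -2.0]]
# path: start in 0, jump to 1 at t = 0.4, back to 0 at t = 1.1, observe to T = 2
lp = ctmc_path_logdensity([0, 1, 0], [0.4, 1.1], 2.0, Lam)
```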
properties of our exact MCMC algorithm

I The limiting distribution of our MCMC algorithm is the exact full posterior

π(VM,Y,Λ, θ |VD) ∝ πθ(θ)πΛ(Λ)π(Y |Λ)π(VM,VD |Y, θ)

I We also avoid any discrete time approximation of the diffusion V.
I We employ the Exact Algorithm methodology of Beskos et al. 06, Beskos & Roberts 04, Beskos et al. 05, Beskos et al. 08.
I We work with a random, finite dimensional representation of VM and store it in computer memory while the simulation progresses.
I We can evaluate averages of any finite dimensional functional of (VM,Y,Λ, θ) with Monte Carlo error only. In particular, the exact posterior distribution of any individual variable VM, Y, Λ or θ can be explored by marginalising the full posterior.
Designing an exact MCMC algorithm

I Recall the SDE for V: dVt = b(Vt,Yt, θ)dt + σ(Vt, θ)γ(Yt, θ)dBt.
I We aim at Gibbs sampling from the full posterior

π(VM,Y,Λ, θ |VD) ∝ πθ(θ)πΛ(Λ)π(Y |Λ)π(VM,VD |Y, θ)

I Problem: for different (Y, θ) the measures π(VM,VD |Y, θ) are mutually singular (quadratic variation issue). A naive Gibbs sampler won't mix at all.
I Finding a dominating measure of product form for π(VM,Y,Λ, θ |VD) is an essential step.
I We find a sequence of transformations of the diffusion path V and, respectively, of the diffusion equation for V.
I Let ΩT = C[0,T] and Ω∗ = C[0, 1].
I Given fixed Y, θ, v0, vT we define a 1-1 transformation

HY,θ,v0,vT : ΩT → Ω∗,   (11)

such that the law of HY,θ,v0,vT(V) is absolutely continuous with respect to the law of a Brownian bridge on Ω∗.
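The singularity noted above can be seen numerically (a toy sketch with our own constants): the realized quadratic variation of a finely discretized path recovers σ² in the limit, so the path itself identifies the volatility and laws under different volatilities concentrate on disjoint sets:

```python
import math
import random

# Toy view of the "quadratic variation issue": the realized quadratic
# variation of sigma * B over [0, T] converges to sigma^2 * T as the mesh
# shrinks, so a continuous path reveals sigma exactly, which is why the
# conditional laws for different volatilities are mutually singular.
def realized_qv(sigma, n=200_000, T=1.0, seed=3):
    rng = random.Random(seed)
    dt = T / n
    return sum((sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)) ** 2
               for _ in range(n))

qv1 = realized_qv(1.0)   # close to 1.0
qv2 = realized_qv(2.0)   # close to 4.0
```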
Designing an exact MCMC algorithm... continued

I Recall the SDE for V: dVt = b(Vt,Yt, θ)dt + σ(Vt, θ)γ(Yt, θ)dBt.
I We aim at Gibbs sampling from the full posterior

π(VM,Y,Λ, θ |VD) ∝ πθ(θ)πΛ(Λ)π(Y |Λ)π(VM,VD |Y, θ)   on ΩT × YT × L × Θ.

I The Gibbs sampler we design targets a measure π∗v0,vT(ω∗, y,Λ, θ) on Ω∗ × YT × L × Θ.
I Let the simulation output be

(ω∗(n), y(n),Λ(n), θ(n)),   n = 0, 1, . . .

I Then

(H−1y(n),θ(n),v0,vT(ω∗(n)), y(n),Λ(n), θ(n)),   n = 0, 1, . . .

targets π(VM,Y,Λ, θ |VD) on ΩT × YT × L × Θ.
Designing an exact MCMC algorithm... some details
- We now identify $H_{Y,\theta,v_0,v_T}$
- Start with $V$: $dV_t = b(V_t, Y_t, \theta)\,dt + \sigma(V_t, \theta)\gamma(Y_t, \theta)\,dB_t$
- and use the one-to-one Lamperti transformation
  $$\eta(v, \theta) = \int^v \frac{1}{\sigma(u, \theta)}\,du \quad \text{and define} \quad X_t := \eta(V_t, \theta). \qquad (12)$$
- Then
  $$dX_t = \alpha(X_t, Y_t, \theta)\,dt + \gamma(Y_t, \theta)\,dB_t.$$
- For $X_t$ assume the setting of the EA3 Algorithm of Beskos et al. 08.
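As a small numerical illustration (not part of the talk), the Lamperti transform (12) can be approximated by quadrature. Here `sigma` and the reference point `v_ref` are placeholders: the diffusion coefficient is model-specific and the transform is defined only up to an additive constant.

```python
import math

def lamperti(v, sigma, v_ref=0.0, n=1000):
    """Approximate eta(v) = int_{v_ref}^{v} du / sigma(u) by the trapezoid rule.

    v_ref is an arbitrary reference point: the transform in (12) is only
    defined up to an additive constant.
    """
    h = (v - v_ref) / n
    grid = [v_ref + k * h for k in range(n + 1)]
    vals = [1.0 / sigma(u) for u in grid]
    return h * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])

# Sanity check: sigma(u) = u gives eta(v) = log(v) - log(v_ref), the
# classical transform that removes state-dependent volatility.
x = lamperti(2.0, lambda u: u, v_ref=1.0)
assert abs(x - math.log(2.0)) < 1e-6
```

In the exact-algorithm setting the integral is typically available in closed form; the quadrature above only serves to make the definition concrete.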
Designing an exact MCMC algorithm... some details
- We now work with $dX_t = \alpha(X_t, Y_t, \theta)\,dt + \gamma(Y_t, \theta)\,dB_t$
- Define a speed adjusted Brownian motion by
  $$dB^y_t = \gamma(y_t, \theta)\,dB_t, \quad \text{and denote} \quad h_{y,\theta}(t) = \int_0^t \gamma^2(y_s, \theta)\,ds.$$
- Let $BB^y_t$ be a speed adjusted Brownian bridge with the endpoints $x_0$ and $x_T$
- To obtain a Brownian bridge on $[0, h_{y,\theta}(T)]$ with endpoints $x_0, x_T$, put
  $$BB^1_t = BB^y_{h^{-1}_{y,\theta}(t)}, \quad t \in [0, h_{y,\theta}(T)]. \qquad (13)$$
- Next, define its centred version starting and ending at 0,
  $$BB^{1,c}_t = BB^1_t - \big(1 - t/h_{y,\theta}(T)\big)x_0 - \big(t/h_{y,\theta}(T)\big)x_T, \quad t \in [0, h_{y,\theta}(T)]. \qquad (14)$$
- The process $BB_t$ is a standard centred Brownian bridge on $[0, 1]$:
  $$BB_t = \frac{1}{\sqrt{h_{y,\theta}(T)}}\, BB^{1,c}_{t\, h_{y,\theta}(T)}, \quad t \in [0, 1]. \qquad (15)$$
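A minimal simulation sketch of the chain of transformations (13)-(15), assuming a piecewise-constant regime path $y$ (so $h_{y,\theta}$ is piecewise linear). The regime values, endpoints and grid size below are arbitrary illustrations, and the bridge is sampled directly on the operational clock $s = h_{y,\theta}(t)$, which is what (13) amounts to.

```python
import math, random

random.seed(1)

def h(t, times, gammas):
    """h_y(t) = int_0^t gamma^2(y_s) ds for piecewise-constant y:
    gammas[k] applies on [times[k], times[k+1])."""
    total = 0.0
    for k in range(len(gammas)):
        a, b = times[k], times[k + 1]
        if t <= a:
            break
        total += gammas[k] ** 2 * (min(t, b) - a)
    return total

# Illustrative regime: gamma = 1 on [0, 2), gamma = 2 on [2, 3].
times, gammas, T = [0.0, 2.0, 3.0], [1.0, 2.0], 3.0
hT = h(T, times, gammas)           # = 1*2 + 4*1 = 6

# Bridge from x0 to xT read on the clock s = h(t), i.e. (13): simulate
# Brownian increments on [0, hT] and pin both endpoints.
x0, xT, n = 0.5, -1.0, 600
s = [hT * k / n for k in range(n + 1)]
w = [0.0]
for k in range(n):
    w.append(w[-1] + math.sqrt(s[k + 1] - s[k]) * random.gauss(0.0, 1.0))
bb1 = [x0 + w[k] - (s[k] / hT) * (w[-1] - xT + x0) for k in range(n + 1)]
# (14): subtract the linear endpoint part so the path starts and ends at 0.
bb1c = [bb1[k] - (1 - s[k] / hT) * x0 - (s[k] / hT) * xT for k in range(n + 1)]
# (15): Brownian scaling onto [0, 1] gives a standard centred bridge;
# bb[k] is the value at rescaled time s[k] / hT.
bb = [bb1c[k] / math.sqrt(hT) for k in range(n + 1)]

assert abs(bb[0]) < 1e-9 and abs(bb[-1]) < 1e-9
```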
Designing an exact MCMC algorithm... some details
- By $H_{y,\theta,x_0,x_T}$ denote an operator that maps $\Omega_T$ into $\Omega^*$ by applying transformations (13), (14), (15), i.e.,
  $$BB_t = \big(H_{y,\theta,x_0,x_T}(BB^y)\big)_t.$$
- Define
  $$H_{y,\theta,v_0,v_T}(\cdot) := H_{y,\theta,x_0,x_T}\big(\eta(\cdot, \theta)\big).$$
- Let $Q^y$ and $P^y$ be the measures induced by $X$ and $B^y$, respectively, on $\Omega_T$. For the conditional measures write $Q^{(y;x_0,x_T)}$ and $P^{(y;x_0,x_T)}$, respectively.
- Let $P^{(y;x_0,x_T)}_H$ be the push-forward measure of $P^{(y;x_0,x_T)}$ via the mapping $H_{y,\theta,x_0,x_T}$
- Then $P^{(y;x_0,x_T)}_H = P^*$, the Wiener measure of the standard Brownian bridge.
- In order to identify $\pi^*_{v_0,v_T}(\omega^*, y, \Lambda, \theta)$ on $\Omega^* \times Y_T \times L \times \Theta$,
- we shall find the Radon-Nikodym derivative of $Q^{(y;x_0,x_T)}$ with respect to $P^{(y;x_0,x_T)}$ and consequently of $Q^{(y;x_0,x_T)}_H$ with respect to $P^*$
Designing an exact MCMC algorithm... some details
- From Girsanov, and applying a trick from the EA papers,
  $$\frac{dQ^{(y;x_0,x_T)}}{dP^{(y;x_0,x_T)}}(\omega) \propto \exp\Bigg\{ \sum_{k=1}^{\tau(T)+1} \Big( A(\omega_{t_k}, y_{t_{k-1}}, \theta) - A(\omega_{t_{k-1}}, y_{t_{k-1}}, \theta) \Big) - \frac{1}{2} \int_0^T \Big( \alpha'_x(\omega_s, y_s, \theta) + \frac{\alpha^2(\omega_s, y_s, \theta)}{\gamma^2(y_s, \theta)} \Big)\, ds \Bigg\} =: G(\omega, y, \theta; x_0, x_T),$$
- where
  - $0 = t_0 < t_1 < \dots < t_{\tau(T)} < t_{\tau(T)+1} = T$ are the moments of jumps of $y_t$ and $y_t = y_{t_k}$ for $t \in [t_k, t_{k+1})$,
  - $A(x, y, \theta) = \frac{1}{\gamma^2(y,\theta)} \int^x \alpha(u, y, \theta)\,du$.
- As a consequence we can set
  $$\pi^*_{v_0,v_T}(\omega^*, y, \Lambda, \theta) := \pi_\theta(\theta)\,\pi_\Lambda(\Lambda)\,\pi(y \mid \Lambda)\, G\big(H^{-1}_{y,\theta,x_0,x_T}(\omega^*), y, \theta; x_0, x_T\big) \qquad (16)$$
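For intuition only, $\log G$ can be approximated on a grid once a path skeleton is interpolated. The SINE-style $\alpha$, $A$ and $\gamma$ below (with $\theta$ suppressed) are illustrative stand-ins, and a Riemann sum replaces the exact finite-evaluation machinery of the talk.

```python
import math

def log_G(path, jumps, y_vals, T, A, alpha, alpha_x, gamma, n=2000):
    """Riemann-sum sketch of log G(omega, y; x0, xT): the A-increment sum
    over the inter-jump intervals of y, minus
    (1/2) int_0^T (alpha'_x + alpha^2 / gamma^2) ds.
    path: t -> omega_t; jumps: [0, t_1, ..., t_tau]; y_vals[k] on [t_k, t_{k+1})."""
    tk = list(jumps) + [T]
    jump_term = sum(
        A(path(tk[k + 1]), y_vals[k]) - A(path(tk[k]), y_vals[k])
        for k in range(len(y_vals))
    )

    def y_at(s):  # piecewise-constant regime lookup
        for k in range(len(y_vals) - 1, -1, -1):
            if s >= tk[k]:
                return y_vals[k]

    dt = T / n
    integral = dt * sum(
        alpha_x(path(s), y_at(s)) + alpha(path(s), y_at(s)) ** 2 / gamma(y_at(s)) ** 2
        for s in (dt * (i + 0.5) for i in range(n))
    )
    return jump_term - 0.5 * integral

# Illustrative SINE-style ingredients: alpha(x, y) = sin(x - mu[y]),
# A(x, y) = -cos(x - mu[y]) / gamma(y)^2 (an antiderivative of alpha / gamma^2).
mu, gam = {1: 3.0, 2: 1.0}, {1: 1.0, 2: 2.0}
val = log_G(
    path=lambda t: math.sin(t),
    jumps=[0.0, 2.0], y_vals=[1, 2], T=5.0,
    A=lambda x, y: -math.cos(x - mu[y]) / gam[y] ** 2,
    alpha=lambda x, y: math.sin(x - mu[y]),
    alpha_x=lambda x, y: math.cos(x - mu[y]),
    gamma=lambda y: gam[y],
)
```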
Conditional distributions for the Gibbs sampler
- The conditional distributions are as follows:
  $$\omega^* \propto G\big(H^{-1}_{y,\theta,x_0,x_T}(\omega^*), y, \theta; x_0, x_T\big),$$
  $$y \propto \pi(y \mid \Lambda)\, G\big(H^{-1}_{y,\theta,x_0,x_T}(\omega^*), y, \theta; x_0, x_T\big),$$
  $$\Lambda \propto \pi_\Lambda(\Lambda)\,\pi(y \mid \Lambda),$$
  $$\theta \propto \pi_\theta(\theta)\, G\big(H^{-1}_{y,\theta,x_0,x_T}(\omega^*), y, \theta; x_0, x_T\big).$$
- For $\Lambda$ we can use a conjugate prior $\lambda_{ij} \sim \mathrm{Exp}(\beta_{ij})$ and compute the full conditional Gamma(formula1, formula2).
- For $\omega^*$ we use rejection sampling with reweighted Brownian bridge proposals, using ideas of the Exact Algorithms of Beskos et al. 08.
- A reweighted Brownian bridge proposal $BB$ is accepted as $\omega^*$ with probability obtained from $G\big(H^{-1}_{y,\theta,x_0,x_T}(BB), y, \theta; x_0, x_T\big)$.
- The decision on accepting a Brownian bridge proposal $BB$ as $\omega^*$ is made after evaluating $BB$ at a finite number of randomly chosen points.
- For $y$ and $\theta$ we use Barker's within Gibbs. (Metropolis within Gibbs is also possible, based on Sermaidis et al. 2011.)
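The "finite number of randomly chosen points" idea can be sketched as Poisson thinning, in the spirit of the Exact Algorithms: an event of probability $\exp(-\int_0^T \phi(\omega_s)\,ds)$ with $0 \le \phi \le M$ is decided from finitely many evaluations of the path. This is a generic sketch, not the talk's exact construction.

```python
import math, random

def poisson(lam, rng=random):
    """Knuth's multiplicative Poisson sampler (fine for moderate lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def thinned_accept(path, phi, M, T, rng=random):
    """Return True with probability exp(-int_0^T phi(path(s)) ds), assuming
    0 <= phi <= M: scatter Poisson(M*T) uniform points in [0, T] x [0, M]
    and accept iff none of them falls below the graph of s -> phi(path(s))."""
    n = poisson(M * T, rng)
    return all(M * rng.random() > phi(path(T * rng.random())) for _ in range(n))
```

With a constant $\phi \equiv c$ the acceptance frequency converges to $e^{-cT}$, which is an easy way to sanity-check the sampler.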
Barker's within Gibbs step for y (and θ)
- Recall that Barker's acceptance probability for a move from $y$ to $y'$, for a stationary distribution $\pi$, is
  $$a(y, y') = \frac{\pi(y')q(y', y)}{\pi(y')q(y', y) + \pi(y)q(y, y')}$$
- In our context a Barker's step is applied within the Gibbs sampler step for $y$, and the conditional target distribution is proportional to
  $$\pi(y \mid \Lambda)\, G\big(H^{-1}_{y,\theta,x_0,x_T}(\omega^*), y, \theta; x_0, x_T\big).$$
- If $q(y, y') = \pi(y' \mid \Lambda)$, the acceptance probability $a(y, y')$ simplifies to
  $$a(y, y') = \frac{G\big(H^{-1}_{y',\theta,x_0,x_T}(\omega^*), y', \theta; x_0, x_T\big)}{G\big(H^{-1}_{y',\theta,x_0,x_T}(\omega^*), y', \theta; x_0, x_T\big) + G\big(H^{-1}_{y,\theta,x_0,x_T}(\omega^*), y, \theta; x_0, x_T\big)}.$$
- And the two coin algorithm can be readily applied!
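A sketch of the two coin algorithm from earlier in the talk, applied to a Barker-type ratio. Here `coin_new` and `coin_cur` stand for Bernoulli coins whose success probabilities are proportional to the two $G$-values and are available only through simulation; the constants are known upper bounds.

```python
import random

def two_coin(c_new, coin_new, c_cur, coin_cur, rng=random):
    """Return True with probability
        c_new * p_new / (c_new * p_new + c_cur * p_cur),
    given only the constants c_new, c_cur and coins coin_new(), coin_cur()
    with success probabilities p_new, p_cur (never evaluated numerically)."""
    while True:
        if rng.random() < c_new / (c_new + c_cur):
            if coin_new():
                return True   # accept the Barker move
        else:
            if coin_cur():
                return False  # reject: keep the current state
```

Each round either terminates or restarts, so the number of coin flips is geometric; the output probability follows by summing the geometric series.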
Example: the SINE model
$Y_t$: a 2-state Markov process, $Y \in \{1, 2\}$
$$dV_t = \sin\big(V_t - \mu(Y_t)\big)\,dt + \gamma(Y_t)\,dB_t$$
Parameter: $\theta = \big(\mu(1), \mu(2), \gamma(1), \gamma(2)\big)$
Priors:
- $\mu(1), \mu(2) \sim U(0, 2\pi)$, independent
- $\gamma^2(1), \gamma^2(2) \sim \mathrm{InvGamma}(1, 1)$, independent
Data:
- 1000 samples of $V_t$ at $t = 0, 1, 2, \dots, 999$
- $\mu = [3, 1]$, $\gamma = [1, 2]$
- $Y_t = 1$ for $t \in [0, 250] \cup (750, 1000]$, and $Y_t = 2$ for $t \in (250, 750]$
Stats of MCMC:
- Acceptance probabilities: $BB$ 0.65, $Y$ 0.50, $\theta$ 0.36
- Average number of imputed points (per interval): 1.49
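For context, synthetic data of the kind described above can be generated by an Euler-Maruyama forward simulation with the regime path fixed. This is approximate by construction (the step size `dt` is an arbitrary choice); the point of the talk is precisely that the inference itself avoids any such discretisation.

```python
import math, random

random.seed(7)

# Euler-Maruyama simulation of dV = sin(V - mu(Y)) dt + gamma(Y) dB with
# the regime path Y fixed as in the data description above.
mu, gamma = {1: 3.0, 2: 1.0}, {1: 1.0, 2: 2.0}

def regime(t):
    return 2 if 250 < t <= 750 else 1

dt = 0.01
steps_per_obs = int(round(1.0 / dt))
v, t, obs = 0.0, 0.0, [0.0]          # record V at integer times 0, 1, ..., 999
for i in range(999 * steps_per_obs):
    y = regime(t)
    v += math.sin(v - mu[y]) * dt + gamma[y] * math.sqrt(dt) * random.gauss(0.0, 1.0)
    t += dt
    if (i + 1) % steps_per_obs == 0:
        obs.append(v)

assert len(obs) == 1000
```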
[Figure: marginal posterior densities. Left panel: densities of µ, legend mu_1 (true value 3.0) and mu_2 (true value 1.0). Right panel: densities of γ, legend gamma_1 (true value 1.0) and gamma_2 (true value 2.0).]
Posterior distribution for Y

[Figure: posterior probability of state 1 as a function of time over t ∈ [0, 1000], with a zoomed panel over t ∈ [200, 300].]
Autocorrelation

[Figure: autocorrelation functions up to lag 80 for mu_2 and gamma_2.]
Pseudo-marginal MCMC
The pseudo-marginal approach

- The Metropolis–Hastings acceptance probability is

  αMH(x, y) = 1 ∧ π(y)q(y, x) / [π(x)q(x, y)]

- If the likelihood component of π(·) is not available in closed form, αMH(x, y) cannot be evaluated.
- However, it might be possible to design an unbiased estimator of π(x).
- The pseudo-marginal approach exploits this: it designs an extended-state-space algorithm that targets π as its marginal.
- Pseudo-marginal methods suffer a loss of efficiency through the slow-down in MCMC convergence typical of extended-state-space algorithms. This loss may be drastic, depending on the properties of the unbiased estimator.
- This is in contrast with the Bernoulli-Factory-based methods, which retain the exact MCMC convergence speed but instead pay in the execution time of a single iteration.
- Pseudo-marginal methods are more general, but more difficult to diagnose.
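Under the assumption above — a non-negative unbiased estimator of the unnormalised target — a pseudo-marginal Metropolis–Hastings iteration can be sketched as follows. The random-walk proposal, step size, and noise law are illustrative choices, not part of the talk; the key point is that the estimate at the current state is recycled until the next acceptance.

```python
import math
import random

def pseudo_marginal_mh(pi_hat, x0, n_iter, step=1.0, rng=random):
    """Pseudo-marginal Metropolis-Hastings with a symmetric random-walk
    proposal.  pi_hat(x) returns a non-negative unbiased estimate of the
    unnormalised target density pi(x).  The estimate at the current state
    is kept fixed until the next acceptance; this is what makes the chain
    target the exact pi as its marginal."""
    x, w = x0, pi_hat(x0)                    # state and its retained estimate
    chain = []
    for _ in range(n_iter):
        y = x + step * rng.gauss(0.0, 1.0)   # symmetric proposal: q cancels
        u = pi_hat(y)                        # fresh estimate at the proposal
        if w > 0 and rng.random() < min(1.0, u / w):
            x, w = y, u                      # accept: adopt y and its estimate
        chain.append(x)
    return chain

# Example target: unnormalised N(0, 1) density, estimated with
# multiplicative positive noise W with E(W) = 1 (here W ~ Exp(1)).
def noisy_normal(x, rng=random):
    return rng.expovariate(1.0) * math.exp(-0.5 * x * x)
```

Despite each density evaluation being noisy, the x-marginal of the resulting chain is exactly N(0, 1) in stationarity, though mixing degrades with noisier estimators.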
The pseudo-marginal approach

- The Metropolis–Hastings acceptance probability is

  αMH(x, y) = 1 ∧ π(y)q(y, x) / [π(x)q(x, y)]

- Assume that we have access to an unbiased estimator π̂(x) of π(x):

  π̂(x) = Wx π(x),   Wx ∼ Qx(·) > 0,   E(Wx) = 1

- Then the method can be seen as targeting the extended distribution

  π̄(x, w) = π(x) Qx(w) w   on X × R+

- and using the proposal

  q(x, w; y, u) = q(x, y) Qy(u)

- Convergence to the correct marginal is obtained by integrating out W: since E(Wx) = 1, the x-marginal of π̄ is exactly π.
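The step of integrating out W can be checked numerically: the extended target assigns w-density proportional to Qx(w)·w, so its integral over w is E(Wx), which must equal 1 for the x-marginal to be π. The sketch below verifies the unit-mean property for two noise laws one might plausibly use (both choices are illustrative, not from the talk).

```python
import math
import random

def mean_of_noise(sample_w, n=200_000, rng=random):
    """Monte Carlo estimate of E(W) for a noise law given by a sampler
    sample_w(rng).  The x-marginal of the extended target
    pi_bar(x, w) = pi(x) Q_x(w) w  equals  pi(x) * E(W_x),
    so the pseudo-marginal chain is exact iff this mean is 1."""
    return sum(sample_w(rng) for _ in range(n)) / n

# Two unit-mean noise laws W could follow:
#   log-normal:  W = exp(N(-s^2/2, s^2)), mean exp(-s^2/2 + s^2/2) = 1
#   exponential: W ~ Exp(1), mean 1
```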