Transcript of Particle Gibbs algorithms - Uppsala University

Page 1:

Particle Gibbs algorithms: methodology and analysis

Fredrik Lindsten

Linköping University &

The University of Cambridge

November 18, 2015

Page 2:

Outline

1. Background: Particle Gibbs

2. Uniform ergodicity
F. Lindsten, R. Douc, and E. Moulines, Uniform ergodicity of the Particle Gibbs sampler. Scandinavian Journal of Statistics, 42(3): 775-797, 2015.

3. Blocking strategies and stability
S. S. Singh, F. Lindsten, and E. Moulines, Blocking Strategies and Stability of Particle Gibbs Samplers. arXiv:1509.08362, 2015.

4. Particle Gibbs with Ancestor Sampling
F. Lindsten, M. I. Jordan and T. B. Schön, Particle Gibbs with Ancestor Sampling. Journal of Machine Learning Research, 15: 2145-2184, 2014.

Fredrik Lindsten (LiU & Cambridge) Particle Gibbs algorithms November 18, 2015 2 / 27

Page 3:

Inference in state-space models

Consider a nonlinear discrete-time state-space model,

X_t | X_{t−1} ∼ m_θ(X_{t−1}, ·),
Y_t | X_t ∼ g_θ(X_t, ·),

and X_1 ∼ µ.

We observe Y_{1:T} = (y_1, . . . , y_T) and wish to estimate θ and/or X_{1:T}.
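The kernel m_θ and observation density g_θ above are abstract. As a concrete stand-in, here is a minimal simulation of a linear-Gaussian instance; the specific model (Gaussian AR(1) state, additive Gaussian observation) and the parameter values are illustrative assumptions, not from the slides:

```python
import numpy as np

def simulate_ssm(T, theta=0.9, sigma_v=1.0, sigma_e=0.5, rng=None):
    """Simulate X_{1:T}, Y_{1:T} from a toy linear-Gaussian SSM:
    X_t = theta * X_{t-1} + V_t,  V_t ~ N(0, sigma_v^2)   (the kernel m_theta)
    Y_t = X_t + E_t,              E_t ~ N(0, sigma_e^2)   (the density g_theta)
    """
    rng = rng or np.random.default_rng(0)
    x = np.zeros(T)
    y = np.zeros(T)
    x[0] = rng.normal(0.0, sigma_v)           # X_1 ~ mu (here N(0, sigma_v^2))
    y[0] = x[0] + rng.normal(0.0, sigma_e)
    for t in range(1, T):
        x[t] = theta * x[t - 1] + rng.normal(0.0, sigma_v)
        y[t] = x[t] + rng.normal(0.0, sigma_e)
    return x, y

x, y = simulate_ssm(50)
```

Given such data y, the inference problem on the slide is to recover θ and/or the latent path X_{1:T}.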


Page 4:

Gibbs sampler for SSMs

Let

φ_{T,θ}(dx_{1:T}) = p(x_{1:T} | θ, y_{1:T}) dx_{1:T}

denote the joint smoothing distribution.

MCMC: Gibbs sampling for state-space models. Iterate:

Draw θ[k] ∼ p(θ | X_{1:T}[k − 1], y_{1:T}); OK!

Draw X_{1:T}[k] ∼ φ_{T,θ[k]}(·). Hard!

One-at-a-time: X_t[k] ∼ p(x_t | θ[k], X_{t−1}[k], X_{t+1}[k − 1], y_t)

Particle Gibbs: Approximate φ_{T,θ}(dx_{1:T}) using a particle filter.



Page 7:

The particle filter

The particle filter approximates φ_{t,θ}(dx_{1:t}), t = 1, . . . , T, by

φ̂^N_{t,θ}(dx_{1:t}) := Σ_{i=1}^N ( ω^i_t / Σ_ℓ ω^ℓ_t ) δ_{X^i_{1:t}}(dx_{1:t}).

Resampling: {X^i_{1:t−1}, ω^i_{t−1}}_{i=1}^N → {X̃^i_{1:t−1}, 1/N}_{i=1}^N.

Propagation: X^i_t ∼ q_{t,θ}(X̃^i_{t−1}, ·) and X^i_{1:t} = (X̃^i_{1:t−1}, X^i_t).

Weighting: ω^i_t = W_{t,θ}(X̃^i_{t−1}, X^i_t).

⇒ {X^i_{1:t}, ω^i_t}_{i=1}^N

[Diagram: alternating weighting, resampling, and propagation steps]

Page 8:

The particle filter

The particle filter approximates φ_{t,θ}(dx_{1:t}), t = 1, . . . , T, by

φ̂^N_{t,θ}(dx_{1:t}) := Σ_{i=1}^N ( ω^i_t / Σ_ℓ ω^ℓ_t ) δ_{X^i_{1:t}}(dx_{1:t}).

Resampling: P(A^i_t = j | F^N_{t−1}) = ω^j_{t−1} / Σ_ℓ ω^ℓ_{t−1}.

Propagation: X^i_t ∼ q_{t,θ}(X^{A^i_t}_{t−1}, ·) and X^i_{1:t} = (X^{A^i_t}_{1:t−1}, X^i_t).

Weighting: ω^i_t = W_{t,θ}(X^{A^i_t}_{t−1}, X^i_t).

⇒ {X^i_{1:t}, ω^i_t}_{i=1}^N

[Diagram: alternating weighting, resampling, and propagation steps]
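The weighting/resampling/propagation recursion above can be sketched as a bootstrap particle filter; here the proposal is taken as q = m (an assumption, so the weight reduces to the observation density g), for the illustrative linear-Gaussian toy model:

```python
import numpy as np

def bootstrap_pf(y, N, theta=0.9, sigma_v=1.0, sigma_e=0.5, rng=None):
    """Bootstrap PF sketch for the toy model X_t = theta*X_{t-1} + N(0, sigma_v^2),
    Y_t = X_t + N(0, sigma_e^2). With q = m, the weight is W_t = g(x_t, y_t)."""
    rng = rng or np.random.default_rng(1)
    T = len(y)
    X = np.zeros((N, T))                  # particle histories X^i_{1:t}
    X[:, 0] = rng.normal(0.0, sigma_v, size=N)
    w = np.exp(-0.5 * ((y[0] - X[:, 0]) / sigma_e) ** 2)   # weighting at t = 1
    for t in range(1, T):
        # Resampling: ancestor indices A^i_t with P(A^i_t = j) prop. to w^j
        A = rng.choice(N, size=N, p=w / w.sum())
        X = X[A]                          # X^i_{1:t} inherits the ancestor's history
        # Propagation: X^i_t ~ q(X^{A^i_t}_{t-1}, .)
        X[:, t] = theta * X[:, t - 1] + rng.normal(0.0, sigma_v, size=N)
        # Weighting
        w = np.exp(-0.5 * ((y[t] - X[:, t]) / sigma_e) ** 2)
    return X, w / w.sum()
```

The returned pairs (X^i_{1:T}, normalised ω^i_T) give the empirical approximation φ̂^N_{T,θ} of the slide.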

Page 9:

MCMC using particle filters

In MCMC we need a Markov kernel with invariant distribution φ_T. (From now on we drop θ from the notation.)

Conditional particle filter (CPF)

Let x′_{1:T} = (x′_1, . . . , x′_T) be a fixed reference trajectory.

At each time t, sample only N − 1 particles in the standard way.

Set the Nth particle deterministically: X^N_t = x′_t and A^N_t = N.

C. Andrieu, A. Doucet and R. Holenstein, Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72:269-342, 2010.
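The CPF modification can be isolated to a single time step. A sketch, with hypothetical `propagate` (the proposal q) and `weight` (the weight function W) callables standing in for the model-specific quantities:

```python
import numpy as np

def cpf_step(x_prev, w_prev, x_ref_t, propagate, weight, rng):
    """One CPF step (sketch): resample and propagate N - 1 particles as in a
    standard PF, but clamp particle N to the reference: X^N_t = x'_t, A^N_t = N.
    `propagate` and `weight` are user-supplied stand-ins for q and W."""
    N = len(x_prev)
    a = np.empty(N, dtype=int)
    a[:N - 1] = rng.choice(N, size=N - 1, p=w_prev / w_prev.sum())  # A^i_t, i < N
    a[N - 1] = N - 1                    # the reference particle is its own ancestor
    x_new = np.empty(N)
    x_new[:N - 1] = propagate(x_prev[a[:N - 1]], rng)  # X^i_t ~ q(X^{A^i_t}_{t-1}, .)
    x_new[N - 1] = x_ref_t                              # X^N_t = x'_t
    return x_new, weight(x_new), a
```

For the toy model one could pass, e.g., propagate = lambda x, rng: 0.9 * x + rng.normal(size=x.shape) and weight = lambda x: np.exp(-0.5 * (y_t - x) ** 2).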



Page 11:

The PG Markov kernel (I/II)

Consider the procedure:

1. Run CPF(N, x′_{1:T}) targeting φ_T(dx_{1:T}),

2. Sample X⋆_{1:T} with P(X⋆_{1:T} = X^i_{1:T} | F^N_T) ∝ ω^i_T.
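The two steps can be combined into one draw from the PG kernel. A self-contained sketch for the illustrative linear-Gaussian toy model (the model and parameters are assumptions), with ancestor-index bookkeeping used to trace out the selected trajectory:

```python
import numpy as np

def pg_kernel(x_ref, y, N, theta=0.9, sigma_v=1.0, sigma_e=0.5, rng=None):
    """One draw from P_N(x'_{1:T}, .) (sketch): run a conditional PF with
    reference x_ref for the toy model X_t = theta*X_{t-1} + N(0, sigma_v^2),
    Y_t = X_t + N(0, sigma_e^2), then pick a trajectory with probability
    proportional to the final weights."""
    rng = rng or np.random.default_rng()
    T = len(y)
    X = np.zeros((N, T))                  # particles X^i_t
    A = np.zeros((N, T), dtype=int)       # ancestor indices A^i_t
    X[:N - 1, 0] = rng.normal(0.0, sigma_v, size=N - 1)
    X[N - 1, 0] = x_ref[0]                # clamp particle N to the reference
    w = np.exp(-0.5 * ((y[0] - X[:, 0]) / sigma_e) ** 2)
    for t in range(1, T):
        A[:N - 1, t] = rng.choice(N, size=N - 1, p=w / w.sum())
        A[N - 1, t] = N - 1               # A^N_t = N
        X[:N - 1, t] = (theta * X[A[:N - 1, t], t - 1]
                        + rng.normal(0.0, sigma_v, size=N - 1))
        X[N - 1, t] = x_ref[t]            # X^N_t = x'_t
        w = np.exp(-0.5 * ((y[t] - X[:, t]) / sigma_e) ** 2)
    # Step 2: sample k with prob. prop. to w^k_T, then trace the ancestry back
    k = rng.choice(N, p=w / w.sum())
    x_star = np.zeros(T)
    for t in range(T - 1, -1, -1):
        x_star[t] = X[k, t]
        k = A[k, t]
    return x_star
```

Iterating x_ref = pg_kernel(x_ref, y, N) yields the PG Markov chain on trajectories.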


Page 12:

[Figure: particle trajectories generated by the CPF, state vs. time (t = 1, . . . , 50)]

Page 13:

The PG Markov kernel (II/II)

This procedure:

Maps x′_{1:T} stochastically into X⋆_{1:T}.

Implicitly defines a Markov kernel P_N on (X^T, 𝒳^T) (the PG kernel),

P_N(x′_{1:T}, A) = E[1_A(X⋆_{1:T})].

P_N is φ_T-invariant for any number of particles N ≥ 1.

What about ergodicity?



Page 15:

Outline (repeated; next: 2. Uniform ergodicity)

Page 16:

Minorisation

Assume ‖W_t‖_∞ < ∞ and define {B_{t,T}}_{t=1}^T by

B_{t,T} = sup_{0≤ℓ≤T−t} [ ‖W_t‖_∞ · sup_{x_t} p(y_{t+1:t+ℓ} | x_t) / p(y_{t:t+ℓ} | y_{1:t−1}) ].

Theorem

The PG kernel is minorised by φ_T:

P_N(x′_{1:T}, A) ≥ (1 − ε_{T,N}) φ_T(A),

where

ε_{T,N} := 1 − Π_{t=1}^T (N − 1) / (2B_{t,T} + N − 2) ≤ (1 / (N − 1)) Σ_{t=1}^T (2B_{t,T} − 1) + O(N^{−2}).
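Since the minorising measure is the invariant distribution φ_T itself, the standard Doeblin/coupling argument turns the bound into uniform ergodicity with per-step contraction rate ε_{T,N}; a sketch of the step:

```latex
% Write P_N = (1 - \varepsilon_{T,N})\,\phi_T + \varepsilon_{T,N}\,R for some
% Markov kernel R. Since \phi_T P_N = \phi_T, induction over k gives
\[
  \bigl\| P_N^k(x'_{1:T}, \cdot) - \phi_T \bigr\|_{\mathrm{TV}}
  \;\le\; \varepsilon_{T,N}^{\,k}
  \qquad \text{for all } x'_{1:T},
\]
% so controlling \varepsilon_{T,N} (next slide) controls the convergence rate.
```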


Page 17:

Proof idea

Take A ∈ 𝒳^T. We can write

P_N(x′_{1:T}, A) = Σ_{k=1}^N E[ ω^k_T 1_A(X^k_{1:T}) / Σ_{i=1}^N ω^i_T ]
             ≥ (N − 1) E[ ω^1_T 1_A(X^1_{1:T}) / Σ_{i=1}^N ω^i_T ]
             ≥ (N − 1) E[ E[ ω^1_T 1_A(X^1_{1:T}) / (2‖W_T‖_∞ + Σ_{i=2}^{N−1} ω^i_T) | F^N_{T−1} ] ].

By convexity of x ↦ 1/x and Jensen's inequality:

P_N(x′_{1:T}, A) ≥ (N − 1) E[ E[ω^1_T 1_A(X^1_{1:T}) | F^N_{T−1}] / (2‖W_T‖_∞ + (N − 2) E[ω^2_T | F^N_{T−1}]) ].

Compute the inner conditional expectations (w.r.t. X^1_T and X^2_T, respectively). Repeat for t = T − 1, t = T − 2, etc.



Page 19:

Mixing conditions

Under strong mixing conditions:

1 − ε_{T,N} ≥ (1 − 1/(c(N − 1) + 1))^T for c ∈ (0, 1] (depending on mixing).

Stable as T → ∞ if N ∼ γT.

Under (weaker) moment conditions:

(1 − ε_{T,N})^{−1} is bounded in probability as T → ∞, provided N ∼ T^{1/γ} for γ ∈ (0, 1) (depending on mixing).

Generalised to the case with a misspecified model (unknown θ).

Verifiable conditions (also for non-compact state spaces).



Page 21:

Gibbs sampling for state-space models

Alternative Gibbs sampling strategies:

Particle Gibbs: X⋆_{1:T} ∼ P_N(x_{1:T}, ·).

Samples X_{1:T} in one "block".

Requires N ∝ T as T → ∞ for stability (strong mixing) ⇒ O(T^2) computational cost!

One-at-a-time: X⋆_t ∼ p(x_t | x_{−t}, y_t), t = 1, . . . , T.

Slow mixing/convergence speed!

Stable as T → ∞?


Page 22:

Outline (repeated; next: 3. Blocking strategies and stability)

Page 23:

Blocking strategy

[Diagram: overlapping blocks J_1, J_3, J_5 and J_2, J_4 covering {1, . . . , T}]

Intermediate strategy, blocked Particle Gibbs:

P^J_N(x_{J+}, dx⋆_J): PG kernel for p(x_J | x_{∂J}, y_J).

Trade-off:

(1) Mixing of the ideal blocked Gibbs sampler ↗ as |J| ↗ (how fast? stable?)

(2) "Mixing of P^J_N" = (1 − 1/(c(N − 1) + 1))^{|J|}, i.e., ↘ as |J| ↗.

∂J = {t ∈ J^c : t + 1 ∈ J or t − 1 ∈ J} ("boundary points for block J"); J+ = J ∪ ∂J.
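The block/boundary notation can be made concrete with a small helper (a hypothetical `make_blocks`, not from the paper) that builds an overlapping interval cover of {1, . . . , T} together with each block's boundary ∂J and extension J+, assuming block size L exceeds overlap p:

```python
def make_blocks(T, L, p):
    """Build overlapping interval blocks of size L with overlap p covering
    {1, ..., T}; requires L > p. Returns each block J with its boundary
    dJ = {t in J^c : t-1 in J or t+1 in J} and Jplus = J union dJ."""
    assert L > p, "block size must exceed the overlap"
    step = L - p
    blocks = []
    start = 1
    while True:
        J = list(range(start, min(start + L, T + 1)))
        # For an interval block the boundary is {min(J)-1, max(J)+1} within [1, T]
        dJ = [t for t in (J[0] - 1, J[-1] + 1) if 1 <= t <= T]
        blocks.append({"J": J, "dJ": dJ, "Jplus": sorted(set(J) | set(dJ))})
        if J[-1] >= T:
            break
        start += step
    return blocks
```

For example, make_blocks(10, 4, 2) produces the staggered cover [1..4], [3..6], [5..8], [7..10], matching the J_1, . . . , J_m layout in the diagram.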


Page 24:

Stability of blocked Gibbs sampler

Theorem

Let 𝒥 = {J_1, . . . , J_m} be a cover of {1, . . . , T} and let P = P_{J_1} · · · P_{J_m} be the Gibbs kernel for one complete sweep. Let all blocks have common size L and common overlap p. Then

|µP^k(f) − φ_T(f)| ≤ 2λ^{k−1} Σ_{i=1}^T osc_i(f),

where λ = α^{p+1} + α^{L−p} and α ∈ [0, 1) is a constant depending on the mixing coefficients of the model (assuming strong mixing).

To control the rate λ we need to increase both L and p! With roughly 50% overlapping blocks we get λ < 1 if L > log 4 / log α^{−1} − 1.

For left-to-right and parallel blocking the rate is ∼ λ^2.


Page 25:

Stability of blocked Particle Gibbs sampler

The blocked Particle Gibbs sampler P_N can be seen as a perturbation of the ideal sampler.

Theorem

|µP^k_N(f) − φ_T(f)| ≤ 2λ_N^{k−1} Σ_{i=1}^T osc_i(f), where

λ_N = λ + const. × ε_{L,N},   ε_{L,N} ≤ 1 − (1 − 1/(c(N − 1) + 1))^L.

λ → 0 with increasing block size L and overlap p.

ε_{L,N} ↘ as N ↗; ε_{L,N} ↗ as L ↗.

‖µP^k_N − φ_T‖_TV ≤ 2T λ_N^{k−1}.


Page 26:

Outline (repeated; next: 4. Particle Gibbs with Ancestor Sampling)

Page 27:

The PGAS Markov kernel (I/II)

Standard Particle Gibbs:

At time t: set X^N_t = x′_t and A^N_t = N.

Particle Gibbs with "Ancestor Sampling" (PGAS):

At time t: set X^N_t = x′_t and sample

P(A^N_t = j | F^N_{t−1}) = ω^j_{t−1} m(X^j_{t−1}, x′_t) / Σ_ℓ ω^ℓ_{t−1} m(X^ℓ_{t−1}, x′_t).

Similar to the backward sampling method proposed by Whiteley,

Whiteley, N., Discussion on Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72:306-307, 2010.

but accomplishes the same result using a forward-only implementation.

More appropriate for models with non-Markovian dynamics.
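The ancestor-sampling step can be sketched directly (a hypothetical helper; it assumes the transition density m can be evaluated pointwise, as the weights in the formula require):

```python
import numpy as np

def sample_ancestor(w_prev, x_prev, x_ref_t, m, rng):
    """Ancestor sampling for the reference particle in PGAS:
    P(A^N_t = j) prop. to w^j_{t-1} * m(x^j_{t-1}, x'_t),
    where m(x, x') is the (evaluable) transition density."""
    p = w_prev * np.array([m(xj, x_ref_t) for xj in x_prev])
    return rng.choice(len(w_prev), p=p / p.sum())
```

For the Gaussian toy model one could use, e.g., m = lambda x, xn: np.exp(-0.5 * (xn - 0.9 * x) ** 2). Replacing the fixed assignment A^N_t = N in the CPF with this draw is the only change PGAS makes.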


Page 28:

The PGAS Markov kernel (II/II)

[Figure: particle trajectories generated by PGAS, state vs. time (t = 1, . . . , 50)]

Page 30:

PGAS vs. PG

[Figure: sampled particle trajectories, state vs. time (t = 1, . . . , 50); left panel: PGAS, right panel: PG]

Page 31:

PGAS vs. PG

[Figure: update frequency of x_t vs. time t, for N = 5, 20, 100, 1000; left panel: PGAS, right panel: PG]

Plots of the update rate of X_t versus t, i.e. the proportion of iterations where X_t changes value. (Simulated data from a simple stochastic volatility model.)
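The update-rate diagnostic shown in the figure can be computed from a stored chain; a minimal sketch, where `chain` holds one row X_{1:T}[k] per MCMC iteration k:

```python
import numpy as np

def update_rate(chain):
    """Update rate of X_t: the fraction of MCMC iterations in which X_t
    changes value, given `chain` of shape (num_iters, T)."""
    return (np.diff(chain, axis=0) != 0).mean(axis=0)
```

Rates near 1 for all t (as PGAS exhibits even for small N) indicate good mixing; rates collapsing to 0 for small t reproduce the path-degeneracy pattern of plain PG.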


Page 32:

PGAS and blocking

Coloured regions illustrate the intervals between coalescence points.

PGAS ⇔ stochastic and adaptive blocking?

If yes, then ∃ N_0 such that PGAS is stable as T → ∞ for N ≥ N_0?


Page 33:

Summary

Particle Gibbs: mimics sampling from φ_{T,θ}(dx_{1:T}) in a Gibbs sampler.

Uniformly ergodic under weak conditions.

• Strong mixing conditions: stable if N = γT.
• (Weaker) moment conditions: stable if N = T^{1/γ}.

Blocking ⇒ stable implementation for constant N.

• Find block size L and overlap p to obtain a stable ideal sampler.
• Select N large enough to obtain a stable Particle Gibbs sampler.
• Opens up for parallelisation!
• Requires evaluation of m(x_{t−1}, x_t)!

Ancestor sampling ⇒ much improved empirical performance.

• Can AS be viewed as adaptive and stochastic blocking?
• Stable as T → ∞ for fixed N?
• Requires evaluation of m(x_{t−1}, x_t)!



Page 37:

Wasserstein estimates

Def: For f : X^T → R, the oscillation in the i-th coordinate is

osc_i(f) = sup_{x, z ∈ X^T : x_{−i} = z_{−i}} |f(x) − f(z)|.

Def: W is a Wasserstein matrix for a Markov kernel P if

osc_i(Pf) ≤ Σ_{j=1}^T W_{ij} osc_j(f).


Page 38:

Wasserstein matrix for blocked Gibbs sampler

Under strong mixing,

W^J =
  [ 1                                  ]
  [    ...                             ]
  [       1                            ]
  [ α        0 · · · 0   α^|J|         ]
  [ α^2      0 · · · 0   α^(|J|−1)     ]
  [ ...                  ...           ]
  [ α^|J|    0 · · · 0   α             ]
  [                          1         ]
  [                             ...    ]
  [                                1   ]

(rows with index outside J are unit rows; the i-th row within block J has its only nonzero entries, α^i and α^(|J|+1−i), in the two boundary columns ∂J) is a Wasserstein matrix for the ideal Gibbs kernel updating block J,

P_J(x_{1:T}, dx⋆_{1:T}):  X⋆_J ∼ p(x_J | x_{∂J}, y_J) dx_J,  X⋆_{J^c} = x_{J^c},

where α ∈ [0, 1) is a constant depending on the mixing coefficients.


Page 39:

Stability of blocked Gibbs sampler

Theorem

Let 𝒥 = {J_1, . . . , J_m} be a cover of {1, . . . , T} and let P = P_{J_1} · · · P_{J_m} be the Gibbs kernel for one complete sweep. Let ∂ = ⋃_{J∈𝒥} ∂J. Then, if

sup_{i ∈ J∩∂} Σ_{j=1}^T W^J_{ij} ≤ λ < 1  for all J ∈ 𝒥,   (⋆)

it follows that

|µP^k(f) − φ_T(f)| ≤ 2λ^{k−1} Σ_{i=1}^T osc_i(f).

With roughly 50% overlapping equally sized blocks, (⋆) is satisfied if the block size satisfies |J| > log 4 / log α^{−1} − 1.

For left-to-right and parallel blocking the rate is ∼ λ^2.

