Maximum Likelihood Parameter Estimation in State...

Maximum Likelihood Parameter Estimation inState-Space Models

Arnaud DoucetDepartment of Statistics, Oxford University

University College London

4th October 2012

A. Doucet (UCL Masterclass Oct. 2012) 4th October 2012 1 / 32

State-Space Models

Let {Xt}t≥1 be a latent/hidden X -valued Markov process with

X1 ∼ µ (·) and Xt | (Xt−1 = x) ∼ f ( ·| x) .

Let {Yt}t≥1 be an Y-valued Markov observation process such that

Yt | (Xt = x) ∼ g ( ·| x) .

Particle filters estimate {p (x1:t | y1:t )}t≥1 on-line but only estimatesof {p (xt | y1:t )}t≥1 and {p (y1:t )}t≥1 are reliable.Particle smoothing methods allow us to obtain reliable estimates of{p (xt | y1:T )}Tt=1 .


State-Space Models with Unknown Parameters

In most scenarios of interest, the state-space model contains anunknown static parameter θ ∈ Θ so that

X1 ∼ µθ (·) and Xt | (Xt−1 = x) ∼ fθ ( ·| xt−1) .

The observations {Yt}t≥1 are conditionally independent given{Xt}t≥1 and θ

Yt | (Xt = xt ) ∼ gθ ( ·| x) .Aim: We would like to infer θ either on-line or off-line.


Examples

Stochastic Volatility model

Xt = φXt−1 + σVt , Vti.i.d.∼ N (0, 1)

Yt = β exp (Xt/2)Wt , Wti.i.d.∼ N (0, 1)

where θ =(φ, σ2, β

).

Biochemical Network model

Pr(X 1t+dt=x

1t+1,X

2t+dt=x

2t

∣∣ x1t , x2t ) = α x1t dt + o (dt) ,Pr(X 1t+dt=x

1t−1,X 2t+dt=x2t+1

∣∣ x1t , x2t ) = β x1t x2t dt + o (dt) ,

Pr(X 1t+dt=x

1t ,X

2t+dt=x

2t−1

∣∣ x1t , x2t ) = γ x2t dt + o (dt) ,

withYk = X

1k∆T +Wk with Wk

i.i.d.∼ N(0, σ2

)where θ = (α, β,γ) .


Parameter Inference in State-Space Models

Online Bayesian parameter inference.

Offl ine Maximum Likelihood parameter inference.

Online Maximum Likelihood parameter inference.


Bayesian Parameter Inference in State-Space Models

Set a prior p (θ) on θ so inference relies now on

p ( θ, x1:t | y1:t ) =p (θ, x1:t , y1:t )

p (y1:t )

wherep (θ, x1:t , y1:t ) = p (θ) pθ (x1:t , y1:t )

with

pθ (x1:t , y1:t ) = µθ (x1)t

∏k=2

fθ (xk | xk−1)t

∏k=1

gθ (yk | xk )

We havep ( θ, x1:t | y1:t ) = p ( θ| y1:t ) pθ (x1:t | y1:t )

Standard and more sophisticated particle methods to sample from{p ( θ, x1:t | y1:t )}t≥1 are ALL unreliabe.


Online Bayesian Parameter Inference

At time t = 1(θ(i )1 ,X

(i )1

)∼ p (θ) µθ (x1) then

p ( θ, x1| y1) =N

∑i=1W (i )1 δ(

θ(i )1 ,X

(i )1

) (θ, x1) , W (i )1 ∝ g

θ(i )1

(y1|X

(i )1

).

(θ(i )1 ,X

(i )1

)∼ p ( θ, x1| y1)

and p̂ ( θ, x1| y1) = 1N ∑N

i=1 δθ(i ),X (i )1

(θ, x1) .

At time t ≥ 2

Set θ(i )t = θ

(i )t−1, X

(i )t ∼ fθ(i )t

(xt |X (i )t−1

)and X

(i )1:t =

(X (i )1:t−1,X

(i )t

)p ( θ, x1:t | y1) =

N

∑i=1W (i )t δ(

θ(i )t ,X

(i )1:t

) (θ, x1:t ) , W(i )t ∝ g

θ(i )t

(yt |X

(i )t

).

(θ(i )t ,X

(i )1:t

)∼ p ( θ, x1:t | y1:t ) and

p̂ ( θ, x1:t | y1:t ) =1N ∑N

i=1 δθ(i )t ,X

(i )1:t(θ, x1:t ) .



Provide consistent estimates but remarkably ineffi cient (Chopin,

2002). Particles{

θ(i )1

}in Θ space only sampled at time 1:

degeneracy problem!Consider the extended state Zt = (Xt , θt ) then

ν (z1) = p (θ1) µθ1(x1) ,

f (zt | zt−1) = δθt−1 (θt ) fθt (xt | xt−1) , g (yt | zt ) = gθt (yt | xt ) ;i.e. θt = θ1 for any t with θ1 from the prior. Exponential stabilityassumption on {p (zt | y1:t )}t≥1 cannot be satisfied.Use MCMC steps on θ so as to jitter

{θ(i )t

}; e.g. Andrieu, De Freitas

& D. (1999); Fearnhead (2002); Gilks & Berzuini (2001); Carvalho etal. (2010).When p ( θ| y1:t , x1:t ) = p ( θ| st (x1:t , y1:t )) where st (x1:t , y1:t ) is afixed-dimensional vector, “elegant”but still implicitly relies onp̂ (x1:t | y1:t ) so degeneracy will creep in.



At time t − 1, we have

p̂ ( θ, x1:t−1| y1:t−1) =1N

N

∑i=1

δ(θ(i )t−1,X

(i )1:t−1

) (θ, x1:t−1) ,

Set θ(i )t = θ

(i )t−1, sample X

(i )t ∼ fθ(i )t

(·|X (i )t−1

), set

X(i )1:t =

(X (i )1:t−1,X

(i )t

)and

p ( θ, x1:t | y1:t ) = ∑Ni=1W

(i )t δ(

θ(i )t ,X

(i )1:t

) (θ, x1:t ) ,

W (i )t ∝ g

θ(i )t

(yt |X

(i )t

).

Resample(

θ′(i )t ,X (i )1:t

)∼ p ( θ, x1:t | y1:t ) then sample

θ(i )t ∼ p

(θ| y1:t ,X

(i )1:t

)to obtain

p̂ ( θ, x1:t | y1:t ) =1N ∑N

i=1 δ(θ(i )t ,X

(i )1:t

) (θ, x1:t ).


A Toy Example

Linear Gaussian state-space model

Xt = θXt−1 + σVVt , Vti.i.d.∼ N (0, 1)

Yt = Xt + σWWt , Wti.i.d.∼ N (0, 1) .

We set p (θ) ∝ 1(−1,1) (θ) so

p ( θ| y1:t , x1:t ) ∝ N(θ;mt , σ2t

)1(−1,1) (θ)

whereσ2t = S

−12,t , mt = S

−12,t S1,t

with

S1,t =t

∑k=2

xk−1xk , S2,t =t

∑k=2

x2k−1


Illustration of the Degeneracy Problem

0 500 1000 1500 2000 2500 3000 3500 4000 4500 50000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

SMC estimate of E [ θ| y1:t ], as t increases the degeneracy creeps in.


Another Toy Example

Linear Gaussian state-space model

Xt = ρXt−1 + Vt , Vti.i.d.∼ N (0, 1)

Yt = Xt + σWt , Wti.i.d.∼ N (0, 1) .

We set ρ ∼ U(−1,1) and σ2 ∼ IG (1, 1).We use particle filter with perfect adaptation and Gibbs moves withN = 10000; particle learning (Andrieu, D. & De Freitas, 1999;Carvalho et al., 2010)

50 runs of the particle method vs ground truth obtained usingKalman filter on states and grid on parameters.


Another Illustration of Degeneracy for Particle Learning

0.8 0.9 1 1.1 1.200.010.020.030.04

pdf,

n=1

000

0.8 0.9 1 1.1 1.200.010.020.030.04

pdf,

n=2

000

0.8 0.9 1 1.1 1.200.010.020.030.04

pdf,

n=3

000

0.8 0.9 1 1.1 1.200.010.020.030.04

pdf,

n=4

000

0.8 0.9 1 1.1 1.200.010.020.030.04

σy2

pdf,

n=5

000

0.4 0.5 0.6 0.7 0.8 0.90

0.05

0.1

0.4 0.5 0.6 0.7 0.8 0.90

0.02

0.04

0.06

0.4 0.5 0.6 0.7 0.8 0.90

0.05

0.1

0.4 0.5 0.6 0.7 0.8 0.90

0.05

0.1

0.4 0.5 0.6 0.7 0.8 0.90

0.05

0.1

ρ

Figure: Estimates of p (ρ| y1:t ) and p(

σ2∣∣ y1:t

)over 50 runs (red) vs ground

truth (blue) for t = 103, 2.103, ..., 5.103 for N = 104 (Kantas et al., 2012)


Stepping Back

For fixed θ, V [p̂θ (y1:t ) /pθ (y1:t )] is in O (t/N).In a Bayesian context, p ( θ| y1:t ) ∝ pθ (y1:t ) p (θ) so we implicitlyneed to compute pθ (y1:t ) at each particle location θ(i ).

It appears impossible to obtain uniformly in time stable estimates of{p ( θ| y1:t )}t≥1 for a fixed N.However for a given time horizon T , we can use PF to sampleeffi cently from p ( θ| y1:T ); see Lecture 3.


Likelihood Function Estimation

Let y1:T being given, the log-(marginal) likelihood is given by

`(θ) = log pθ (y1:T ) .

For any θ ∈ Θ, one can estimate `(θ) using particle methods,variance O (T/N) .

Direct maximization of `(θ) diffi cult as estimate ̂̀(θ) is not a smoothfunction of θ even for fixed random seed.

For dim (Xt ) = 1, we can obtain smooth estimate of log-likelihoodfunction by using a smoothed resampling step (e.g. Pitt, 2011); i.e.piecewise linear approximation of Pr (Xt < x | y1:t ) .

For dim (Xt ) > 1, we can obtain estimates of `(θ) highly positivelycorrelated for neigbouring values in Θ (e.g. Lee, 2008).


Gradient Ascent

To maximise `(θ) w.r.t θ, use at iteration k + 1

θk+1 = θk + γk ∇`(θ)|θ=θk

where ∇`(θ)|θ=θkis the so-called score vector.

∇`(θ)|θ=θkcan be estimated using finite differences but more

effi ciently using Fisher’s identity

∇`(θ) =∫∇ log pθ (x1:T , y1:T ) pθ (x1:T | y1:T ) dx1:T

where

∇ log pθ (x1:T , y1:T ) = ∇ log µθ (x1)+∑T

t=2∇ log fθ (xt | xt−1) +∑Tt=1∇ log gθ (yt | xt ) .


Particle Calculation of the Score Vector

We have

∇`(θ) =∫{∇ log µθ (x1) +∇ log gθ (y1| x1)} pθ (x1| y1:T ) dx1

+T

∑t=2

∫{∇ log fθ (xt | xt−1) +∇ log gθ (yt | xt )} pθ (xt−1, xt | y1:T ) dxt−1dxt

To approximate ∇`(θ), we just need particle approximations of{pθ (xt−1, xt | y1:T )}Tt=2 .All the particle smoothing methods detailed before can be applied.

Similar “smoothed additive functionals”have to be computed whenimplementing the Expectation-Maximization.


Comparison Direct Method vs FB

We want to estimateϕT = ∑T

t=1

∫ϕ (xt−1, xt , yt ) p (xt−1, xt | y1:T ) dxt−1dxt .

Method Direct FB# particles N Ncost O (TN) O

(TN2

),O (TN)

Var. O(T 2/N

)O (T/N)

Bias O (T/N) O (T/N)MSE=Bias2+Var O

(T 2/N

)O(T 2/N2

)“Fast” implementations FB of computational complexity O (NT )outperform direct approach as MSE is O

(T 2/N2

)whereas it is

O(T 2/N

)for direct SMC.

“Naive” implementations FB and TF have MSE of same order asdirect method for fixed computational complexity but MSE is biasdominated for FB/TF whereas it is variance dominated for DirectSMC.


Experimental Results

Consider a linear Gaussian model

Xt = φXt−1 + σvVt , Vti.i.d.∼ N (0, 1)

Yt = cXt + σwWt , Wti.i.d.∼ N (0, 1) .

We simulate 10,000 observations and compute particle estimates of∫ϕT (x1:T ) p (x1:T | y1:T ) dx1:T

for 4 different additive functionalsϕt (x1:t ) = ϕt−1 (x1:t−1) + ϕ (xt−1, xt , yt ) includingϕ1 (xt−1, xt , yt ) = xt−1xt , ϕ2 (xt−1, xt , yt ) = x2t . [Ground truth canbe computed using Kalman smoother.]

We use 100 replications on the same dataset to estimate the empiricalvariance.


Boxplots of Direct vs FB Estimates

2500 5000 7500 10000

500

0

500

Sco

re

Algorithm 1

2500 5000 7500 10000

500

0

500

Algorithm 2

2500 5000 7500 10000200

0

200

Sco

re

Time steps

Algorithm 1

2500 5000 7500 10000200

0

200

Time steps

Algorithm 2

score estimates for parameter σv

score estimates for parameter φ

Direct (left) vs FB (right)


Empirical Variance for Direct vs FB Estimates

0 2 5 0 0 5 0 0 0 7 5 0 0 1 0 0 0 00

2

4

6

x 1 04

V a r ia n c e o f s c o r e e s tim a te w .r .t.σv

Va

ria

nc

e

0 2 5 0 0 5 0 0 0 7 5 0 0 1 0 0 0 00

2 0 0 0

4 0 0 0

6 0 0 0

8 0 0 0

1 0 0 0 0

V a r ia n c e o f s c o r e e s tim a te w .r .t.φ

0 2 5 0 0 5 0 0 0 7 5 0 0 1 0 0 0 00

1 0 0 0

2 0 0 0

3 0 0 0

4 0 0 0

V a r ia n c e o f s c o r e e s tim a te w .r .t.σw

Ti m e s te p s

Va

ria

nc

e

0 2 5 0 0 5 0 0 0 7 5 0 0 1 0 0 0 00

1 0 0 0

2 0 0 0

3 0 0 0

4 0 0 0

5 0 0 0

V a r ia n c e o f s c o r e e s tim a te w .r .t. c

Ti m e s te p s

0 2 5 0 0 5 0 0 0 7 5 0 0 1 0 0 0 00

2 0

4 0

6 0

8 0

1 0 0

1 2 0

V a r ia n c e o f s c o r e e s tim a te w .r .t.σv

Var

ianc

e

0 2 5 0 0 5 0 0 0 7 5 0 0 1 0 0 0 00

2 0

4 0

6 0

V a r ia n c e o f s c o r e e s tim a te w .r .t.φ

0 2 5 0 0 5 0 0 0 7 5 0 0 1 0 0 0 00

5

1 0

1 5

2 0

V a r ia n c e o f s c o r e e s tim a te w .r .t.σw

Ti m e s te p s

Var

ianc

e

0 2 5 0 0 5 0 0 0 7 5 0 0 1 0 0 0 00

1 0

2 0

3 0

4 0

V a r ia n c e o f s c o r e e s tim a te w .r .t. c

Ti m e s te p s

Direct (left) vs FB (right); the vertical scale is different


Online ML Parameter Inference

Recursive maximum likelihood (Titterington, 1984; LeGland & Mevel,1997) proceeds as follows

θt+1 = θt + γt ∇ log pθ1:t (yt | y1:t−1)

where pθ1:t (yt | y1:t−1) is computed using θk at time k and∑t γt = ∞, ∑t γ2t < ∞. Under regularity conditions, this convergestowards a local maximum of the (average) log-likelihood.Note that

∇ log pθ1:t (yt | y1:t−1) = ∇ log pθ1:t (y1:t )−∇ log pθ1:t−1 (y1:t−1)

is given by the difference of two pseudo-score vectors where

∇ log pθ1:t (y1:t ) :=∫ (

∑tk=2 ∇ log fθ (xk | xk−1)|θk

+ ∇ log gθ (yk | xk )|θk)pθ1:t (x1:t | y1:t ) dx1:t .


Online ML Particle Parameter Inference

Particle approximation follows

θt+1 = θt + γt ∇̂ log pθ1:t (yt | y1:t−1)

where

∇̂ log pθ1:t (yt | y1:t−1) = ∇̂ log pθ1:t (y1:t )− ∇̂ log pθ1:t−1 (y1:t−1)

is given by the difference of particle estimates of pseudo-score vectors(Poyadjis, D. & Singh, 2011).

Asymptotic variance of ∇̂ log pθ1:t (yt | y1:t−1) is uniformly boundedin O (1/N) for FB estimate whereas it is O (t/N) for direct particlemethod (Del Moral, D. & Singh, 2011). Bias is O (1/N) in bothcases.Major Problem: If we use FB, this is not an online algorithmanymore as it requires a backward pass of order O (t) to approximate∇ log pθ1:t (y1:t ) ...


Variance of the Gradient Estimate for Direct vs FB

1 5000 10000 15000 200000

20

40

60

80

100

120

140

Figure: Empirical variance of the gradient estimate for standard versus FBapproximations (SV model)


Online Particle ML Inference using Direct Approach

0 500 1000 1500 20000.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

x10 3

Figure: N = 10, 000 particles, online parameter estimates for SV model.


Online Particle ML Inference using FB

0 500 1000 1500 20000.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

x10 3

Figure: N = 50 particles, online parameter estimates for SV model.


Forward only Smoothing

Dynamic programming allows us to compute in a single forward passthe FB estimates of

ϕθt =

∫t

ϕt (x1:t ) pθ (x1:t | y1:t ) dx1:t

where

ϕt (x1:t ) =t

∑k=1

ϕ (xk−1, xk , yk )

Forward Backward (FB) decomposition states

pθ (x1:T | y1:T ) = pθ (xT | y1:T )T−1∏t=1

pθ (xt | y1:t , xt+1)

where pθ (xt | y1:t , xt+1) =fθ( xt+1 |xt )pθ( xt |y1:t )

pθ( xt+1 |y1:t ).

Conditioned upon y1:T , {Xt}Tt=1 is a backward Markov chain of initialdistribution p (xT | y1:T ) and inhomogeneous Markov transitions{pθ (xt | y1:t , xt+1)}T−1t=1 independent of T .



We have

ϕθt =

∫ϕt (x1:t ) pθ (x1:t−1| y1:t−1, xt ) dx1:t−1

=∫

∫ϕt (x1:t ) pθ (x1:t−1| y1:t−1, xt ) dx1:t−1︸︷︷︸

V θt (xt )

pθ (xt | y1:t ) dxt

Forward smoothing recursion

V θt (xt ) =

∫ [V θt−1 (xt−1) + ϕ (xt−1:t , yt )

]pθ (xt−1| y1:t−1, xt ) dxt−1

Appears implicitly in Elliott, Aggoun & Moore (1996), Ford (1998)and rediscovered a few times... Presentation follows here (Del Moral,D. & Singh, 2009).



Forward smoothing recursion

V θt (xt ) =

∫ [V θt−1 (xt−1) + ϕ (xt−1:t , yt )

]pθ (xt−1| y1:t−1, xt ) dxt−1

Proof is trivial

V θt (xt ) =

∫ϕt (x1:t ) pθ (x1:t−1| y1:t−1, xt ) dx1:t−1

=∫ [

ϕt−1 (x1:t−1) + ϕ (xt−1:t , yt )]pθ (x1:t−2| y1:t−2, xt−1)×pθ (xt−1| y1:t−1, xt ) dx1:t−1

=∫{∫

ϕt−1 (x1:t−1) pθ (x1:t−2| y1:t−2, xt−1) dx1:t−2︸︷︷︸V θt−1(xt−1)

+ϕ (xt−1:t , yt )} pθ (xt−1| y1:t−1, xt ) dxt−1

Exact implementation possible for finite state-space and linearGaussian models.


Particle Forward only Smoothing

At time t − 1, we have p̂θ (xt−1| y1:t−1) =1N ∑N

i=1 δX (i )t−1

(xt−1) and{V̂ θt−1

(X (i )t−1

)}1≤i≤N

.

At time t, compute p̂θ (xt | y1:t ) = ∑Ni=1W

(i )t δ

X (i )t(xt ) and set

V̂ θt

(X (i )t

)=∫ {

V̂ θt−1 (xt−1) + ϕ (xt−1, xt , yt )

}p̂θ

(xt−1| y1:t−1,X

(i )t

)dxt−1

=∑Nj=1 fθ

(X (i )t |X

(j)t−1

)[V̂ θt−1

(X (j)t−1

)+ϕ(X (j)t−1,X

(i )t ,yt

)]∑Nj=1 fθ

(X (i )t |X

(j)t−1

) ,

ϕ̂θt =

1N ∑N

i=1 V̂θt

(X (i )t

).

This estimate is exactly the same as the Particle FB estimate,computational complexity O

(N2).


Online Particle ML Inference

At time t − 1, we have p̂θ1:t−1 (xt−1| y1:t−1),{V̂ θ1:t−1t−1

(X (i )t−1

)},

∇̂ log pθ1:t−1 (y1:t−1) =∫V̂ θ1:t−1t−1 (xt−1) p̂θ1:t−1 (xt−1| y1:t−1) dxt−1

and get θt .

At time t, use your favourite PF to compute p̂θ1:t (xt | y1:t ) and

V̂ θ1:tt

(X (i )t

)=∫ {

V̂ θ1:t−1t−1 (xt−1) + ϕ (xt−1, xt , yt )

}×p̂θ1:t

(xt−1| y1:t−1,X

(i )t

)dxt−1,

ϕ (xt−1:t , yt ) = ∇ log fθ (xt | xt−1)|θt + ∇ log gθ (yt | xt )|θtand

∇̂ log pθ1:t (y1:t ) =∫V̂ θ1:tt (xt ) p̂θ1:t (xt | y1:t ) dxt

Parameter update

θt+1 = θt + γt

(∇̂ log pθ1:t (y1:t )− ∇̂ log pθ1:t−1 (y1:t−1)

)A. Doucet (UCL Masterclass Oct. 2012) 4th October 2012 31 / 32

Summary

Online Bayesian parameter inference using particle methods is yet anunsolved problem.

Particle smoothing techniques can be used to perform off-line andon-line ML parameter estimation.

Observed information matrix can also be evaluated online in a stablemanner.

For online inference, computational complexity is O(N2)at each

time step and requires evaluating fθ (xt | xt−1) .


Maximum Likelihood Parameter Estimation in State...

Documents

Transcript of Maximum Likelihood Parameter Estimation in State...