
Streaming Variational Monte Carlo

Yuan Zhao*, Josue Nassar*, Ian Jordan, Mónica Bugallo, and Il Memming Park

Abstract—Nonlinear state-space models are powerful tools to describe dynamical structures in complex time series. In a streaming setting where data are processed one sample at a time, simultaneous inference of the state and its nonlinear dynamics has posed significant challenges in practice. We develop a novel online learning framework, leveraging variational inference and sequential Monte Carlo, which enables flexible and accurate Bayesian joint filtering. Our method provides an approximation of the filtering posterior which can be made arbitrarily close to the true filtering distribution for a wide class of dynamics models and observation models. Specifically, the proposed framework can efficiently approximate a posterior over the dynamics using sparse Gaussian processes, allowing for an interpretable model of the latent dynamics. Constant time complexity per sample makes our approach amenable to online learning scenarios and suitable for real-time applications.

Index Terms—Nonlinear state-space modeling, online filtering, Bayesian machine learning


1 INTRODUCTION

Nonlinear state-space models are generative models for complex time series with underlying nonlinear dynamical structure [1], [2], [3]. Specifically, they represent nonlinear dynamics in the latent state-space, $x_t$, that capture the spatiotemporal structure of noisy observations, $y_t$:

$$x_t = f_\theta(x_{t-1}, u_t) + \epsilon_t \quad \text{(state dynamics model)} \tag{1a}$$
$$y_t \sim P(y_t \mid g_\psi(x_t)) \quad \text{(observation model)} \tag{1b}$$

where $f_\theta$ and $g_\psi$ are continuous vector functions, $P$ denotes a probability distribution, and $\epsilon_t$ is intended to capture unobserved perturbations of the state $x_t$. Such state-space models have many applications (e.g., object tracking) where the flow of the latent states is governed by known physical laws and constraints, or where learning an interpretable model of the laws is of great interest, especially in neuroscience [4], [5], [6], [7], [8], [9]. If the parametric form of the model and the parameters are known a priori, then the latent states $x_t$ can be inferred online through the filtering distribution, $p(x_t \mid y_{1:t})$, or offline through the smoothing distribution, $p(x_{1:t} \mid y_{1:t})$ [10], [11]. Otherwise the challenge is in learning the parameters of the state-space model, $\{\theta, \psi\}$, which is known in the literature as the system identification problem.
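To make the generative reading of (1) concrete, the following is a minimal sketch that draws a trajectory from one toy instance of the model. Everything specific here is an assumption chosen for illustration and not taken from the paper: the function name `simulate_ssm`, the two-dimensional `tanh` dynamics, the absence of an input $u_t$, and the Poisson observation model with a linear read-out.

```python
import numpy as np

def simulate_ssm(T, rng, dy=5, noise_std=0.1):
    """Draw (x_{0:T}, y_{1:T}) from a toy instance of model (1) with no input u_t."""
    theta = 0.3
    # Hypothetical dynamics weights: a scaled 2-D rotation fed through tanh.
    W = 1.5 * np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta), np.cos(theta)]])
    C = rng.standard_normal((dy, 2))  # hypothetical read-out weights in g_psi
    x = np.zeros((T + 1, 2))
    y = np.zeros((T, dy), dtype=np.int64)
    x[0] = rng.standard_normal(2)
    for t in range(1, T + 1):
        # (1a): x_t = f_theta(x_{t-1}) + eps_t, with eps_t ~ N(0, noise_std^2 I)
        x[t] = np.tanh(W @ x[t - 1]) + noise_std * rng.standard_normal(2)
        # (1b): y_t ~ Poisson(exp(g_psi(x_t))) with g_psi(x) = C x (count data)
        y[t - 1] = rng.poisson(np.exp(C @ x[t]))
    return x, y

states, counts = simulate_ssm(T=200, rng=np.random.default_rng(0))
```

A count-valued likelihood is used only to stress that $P$ in (1b) need not be Gaussian; any distribution that can be sampled and evaluated would serve the same purpose.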

In a streaming setting where data are processed one sample at a time, joint inference of the state and its nonlinear dynamics has posed significant challenges in practice. In this study, we are interested in online algorithms that can recursively solve the dual estimation problem of learning both the latent trajectory, $x_{1:t}$, in the state-space and the parameters of the model, $\{\theta, \psi\}$, from streaming observations [12].

Popular solutions, such as the extended Kalman filter (EKF) or the unscented Kalman filter (UKF) [13], build an online dual estimator using nonlinear Kalman filtering by augmenting the state-space with its parameters [13], [14]. While powerful, they usually provide coarse approximations to the filtering distribution and involve many hyperparameters to be tuned, which hinders their practical performance. Moreover, they do not take advantage of modern stochastic gradient optimization techniques commonly used throughout machine learning and are not easily applicable to arbitrary observation likelihoods.

* equal contribution

• All authors are with Stony Brook University, Stony Brook, NY, 11794. Y. Zhao and I. M. Park are with the Department of Neurobiology and Behavior. J. Nassar and M. Bugallo are with the Department of Electrical and Computer Engineering. I. Jordan is with the Department of Applied Mathematics and Statistics.

Recursive stochastic variational inference has been proposed for streaming data assuming either independent [15] or temporally-dependent samples [6], [16]. However, the proposed variational distributions are not guaranteed to be good approximations to the true posterior. As opposed to variational inference, sequential Monte Carlo (SMC) leverages importance sampling to build an approximation to the target distribution in a data streaming setting [17], [18]. However, its success heavily depends on the choice of proposal distribution, and the (locally) optimal proposal distribution usually is only available in the simplest cases [17]. While work has been done on learning good proposals for SMC [19], [20], [21], [22], most are designed only for offline scenarios targeting the smoothing distributions instead of the filtering distributions. In [19], the proposal is learned online, but the class of dynamics to which this is applicable is extremely limited.

In this paper, we propose a novel sequential Monte Carlo method for inferring a state-space model in the streaming time series scenario that adapts the proposal distribution on-the-fly by optimizing a surrogate lower bound to the log normalizer of the filtering distribution. Moreover, we choose the sparse Gaussian process (GP) [23] to model the unknown dynamics, which allows for $O(1)$ recursive Bayesian inference. Specifically, our contributions are:

1) We prove that our approximation to the filtering distribution converges to the true filtering distribution.

2) Our objective function allows for unbiased gradients which lead to improved performance.

3) To the best of our knowledge, we are the first to use particles to represent the posterior of inducing variables of the sparse Gaussian processes, which allows for accurate Bayesian inference on the inducing variables rather than the typical variational approximation and closed-form weight updates.

4) Unlike many efficient filtering methods that usually assume Gaussian or continuous observations, our method allows arbitrary observational distributions.

arXiv:1906.01549v3 [stat.ML] 29 Feb 2020

2 STREAMING VARIATIONAL MONTE CARLO

Given the state-space model defined in (1), the goal is to obtain the latent state $x_t \in \mathbb{R}^{d_x}$ given a new observation, $y_t \in \mathcal{X}$, where $\mathcal{X}$ is a measurable space (typically $\mathcal{X} = \mathbb{R}^{d_y}$ or $\mathcal{X} = \mathbb{N}^{d_y}$). Under the Bayesian framework, this corresponds to computing the filtering posterior distribution at time $t$

$$p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t)}{p(y_t \mid y_{1:t-1})}\, p(x_t \mid y_{1:t-1}) \tag{2}$$

which recursively uses the previous filtering posterior distribution, $p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, \mathrm{d}x_{t-1}$.

However, the above posterior is generally intractable except for limited cases [12], and thus we turn to approximate methods. Two popular approaches for approximating (2) are sequential Monte Carlo (SMC) [17] and variational inference (VI) [24], [25], [26]. In this work, we propose to combine sequential Monte Carlo and variational inference, which allows us to utilize modern stochastic optimization while leveraging the flexibility and theoretical guarantees of SMC. We refer to our approach as streaming variational Monte Carlo (SVMC). For clarity, we review SMC and VI in the following sections.

2.1 Sequential Monte Carlo

SMC is a sampling-based approach to approximate Bayesian inference that is designed to recursively approximate a sequence of distributions $p(x_{0:t} \mid y_{1:t})$ for $t = 1, \cdots$, using samples from a proposal distribution, $r(x_{0:t} \mid y_{1:t}; \lambda_{0:t})$, where $\lambda_{0:t}$ are the parameters of the proposal [17]. Due to the Markovian nature of the state-space model in (1), the smoothing distribution, $p(x_{0:t} \mid y_{1:t})$, can be expressed as

$$p(x_{0:t} \mid y_{1:t}) \propto p(x_0) \prod_{j=1}^{t} p(x_j \mid x_{j-1})\, p(y_j \mid x_j). \tag{3}$$

We enforce the same factorization for the proposal, $r(x_{0:t} \mid y_{1:t}; \lambda_{0:t}) = r_0(x_0; \lambda_0) \prod_{j=1}^{t} r_j(x_j \mid x_{j-1}, y_j; \lambda_j)$.

A naive approach to approximating (3) is to use standard importance sampling (IS) [27]. $N$ samples are drawn from the proposal distribution, $x_{0:t}^1, \cdots, x_{0:t}^N \sim r(x_{0:t}; \lambda_{0:t})$, and are given weights according to

$$w_{0:t}^i = \frac{p(x_0^i) \prod_{j=1}^{t} p(x_j^i \mid x_{j-1}^i)\, p(y_j \mid x_j^i)}{r_0(x_0^i; \lambda_0) \prod_{j=1}^{t} r_j(x_j^i \mid x_{j-1}^i, y_j; \lambda_j)}. \tag{4}$$

The importance weights can also be computed recursively

$$w_{0:t}^i = \prod_{s=0}^{t} w_s^i, \tag{5}$$

where

$$w_s^i = \frac{p(y_s \mid x_s^i)\, p(x_s^i \mid x_{s-1}^i)}{r_s(x_s^i \mid x_{s-1}^i, y_s; \lambda_s)}. \tag{6}$$

The samples and their corresponding weights, $\{(x_{0:t}^i, w_{0:t}^i)\}_{i=1}^{N}$, are used to build an approximation to the target distribution

$$p(x_{0:t} \mid y_{1:t}) \approx \hat{p}(x_{0:t} \mid y_{1:t}) = \sum_{i=1}^{N} \frac{w_{0:t}^i}{\sum_{\ell} w_{0:t}^{\ell}}\, \delta_{x_{0:t}^i} \tag{7}$$

where $\delta_x$ is the Dirac-delta function centered at $x$. While straightforward, naive IS suffers from the weight degeneracy issue; as the length of the time series, $T$, increases, all but one of the importance weights will go to 0 [17].
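The weight degeneracy described above can be observed numerically. The sketch below runs naive sequential IS, without resampling, on a scalar linear-Gaussian model and tracks the effective sample size (ESS), $1 / \sum_i (\bar{w}^i)^2$ for normalized weights $\bar{w}^i$. The ESS diagnostic, the model, and the function name `sis_ess` are assumptions introduced here for illustration; the paper does not use them.

```python
import numpy as np

def sis_ess(y, n_particles, rng, a=0.9, q=1.0, r=1.0):
    """Naive sequential IS for x_t = a x_{t-1} + N(0, q), y_t = x_t + N(0, r).

    The transition density is used as the proposal, so the incremental
    weight (6) reduces to p(y_t | x_t). Returns the ESS after each step.
    """
    x = rng.standard_normal(n_particles)   # x_0^i ~ N(0, 1), sampled exactly
    log_w = np.zeros(n_particles)          # running log of the product in (5)
    ess = []
    for y_t in y:
        x = a * x + np.sqrt(q) * rng.standard_normal(n_particles)
        log_w += -0.5 * (np.log(2.0 * np.pi * r) + (y_t - x) ** 2 / r)
        w = np.exp(log_w - log_w.max())    # subtract the max for stability
        w /= w.sum()                       # normalized weights as in (7)
        ess.append(1.0 / np.sum(w ** 2))
    return np.array(ess)
```

An ESS close to $N$ means the weights are nearly uniform, while an ESS close to 1 means a single particle carries almost all of the mass, which is the degenerate regime the text refers to.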

To alleviate this issue, SMC leverages sampling-importance-resampling (SIR). Suppose at time $t-1$, we have the following approximation to the smoothing distribution

$$\hat{p}(x_{0:t-1} \mid y_{1:t-1}) = \frac{1}{N} \sum_{i=1}^{N} \frac{w_{t-1}^i}{\sum_{\ell} w_{t-1}^{\ell}}\, \delta_{x_{0:t-1}^i}, \tag{8}$$

where $w_{t-1}^i$ is computed according to (6). Given a new observation, $y_t$, SMC starts by resampling ancestor variables, $a_t^i \in \{1, \cdots, N\}$, with probability proportional to the importance weights $w_{t-1}^j$. $N$ samples are then drawn from the proposal, $x_t^i \sim r_t(x_t \mid x_{t-1}^{a_t^i}, y_t; \lambda_t)$, and their importance weights, $w_t^i$, are computed according to (6). The introduction of resampling allows for a (greedy) solution to the weight degeneracy problem. Particles with high weights are deemed good candidates and are propagated forward, while the ones with low weights are discarded.

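The updated approximation to $p(x_{0:t} \mid y_{1:t})$ is now

$$\hat{p}(x_{0:t} \mid y_{1:t}) = \frac{1}{N} \sum_{i=1}^{N} \frac{w_t^i}{\sum_{\ell} w_t^{\ell}}\, \delta_{x_{0:t}^i}, \tag{9}$$

where $x_{0:t}^i = (x_t^i, x_{0:t-1}^{a_t^i})$. Marginalizing out $x_{0:t-1}$ in (9) gives an approximation to the filtering distribution:

$$p(x_t \mid y_{1:t}) = \int p(x_{0:t} \mid y_{1:t})\, \mathrm{d}x_{0:t-1} \approx \int \sum_{i=1}^{N} \frac{w_t^i}{\sum_{\ell} w_t^{\ell}}\, \delta_{x_{0:t}^i}\, \mathrm{d}x_{0:t-1} = \sum_{i=1}^{N} \frac{w_t^i}{\sum_{\ell} w_t^{\ell}}\, \delta_{x_t^i}. \tag{10}$$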
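The resample, propose and weigh cycle, together with the particle approximation (10), can be sketched for a scalar linear-Gaussian model as follows. The model, the choice of the transition density as proposal, and the function names `smc_step` and `filtering_mean` are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def smc_step(x_prev, w_prev, y, rng, a=0.9, q=1.0, r=1.0):
    """One SMC step for x_t = a x_{t-1} + N(0, q), y_t = x_t + N(0, r)."""
    n = x_prev.size
    # Resample ancestors a_t^i with probability proportional to w_{t-1}^j.
    anc = rng.choice(n, size=n, p=w_prev / w_prev.sum())
    # Propose from the transition density (an assumed, simple proposal).
    x = a * x_prev[anc] + np.sqrt(q) * rng.standard_normal(n)
    # With this proposal the weight (6) reduces to the likelihood p(y_t | x_t).
    w = np.exp(-0.5 * (y - x) ** 2 / r) / np.sqrt(2.0 * np.pi * r)
    return x, w

def filtering_mean(x, w):
    """Mean of the weighted particle approximation (10)."""
    return np.sum(w * x) / np.sum(w)
```

Any expectation under the filtering distribution can be estimated in the same way, by replacing `x` in `filtering_mean` with a function of the particles.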

As a byproduct, the weights produced in an SMC run yield an unbiased estimate of the marginal likelihood of the smoothing distribution [18]:

$$\mathbb{E}[\hat{p}(y_{1:t})] = \mathbb{E}\left[\prod_{s=1}^{t} \frac{1}{N} \sum_{i=1}^{N} w_s^i\right] = p(y_{1:t}), \tag{11}$$

and a biased but consistent estimate of the marginal likelihood of the filtering distribution [18], [28]

$$\mathbb{E}[\hat{p}(y_t \mid y_{1:t-1})] = \mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N} w_t^i\right]. \tag{12}$$
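For a scalar linear-Gaussian model the marginal likelihood is available exactly from the Kalman filter, which gives a simple way to sanity-check the estimator in (11). The sketch below is written under that assumption; the model, the transition-density proposal, and the function names `kalman_loglik` and `bpf_loglik` are hypothetical.

```python
import numpy as np

def kalman_loglik(y, a, q, r, p0=1.0):
    """Exact log p(y_{1:T}) for x_t = a x_{t-1} + N(0, q), y_t = x_t + N(0, r),
    with x_0 ~ N(0, p0)."""
    m, p, ll = 0.0, p0, 0.0
    for y_t in y:
        m, p = a * m, a * a * p + q                    # predict
        s = p + r                                      # innovation variance
        ll += -0.5 * (np.log(2.0 * np.pi * s) + (y_t - m) ** 2 / s)
        gain = p / s
        m, p = m + gain * (y_t - m), (1.0 - gain) * p  # update
    return ll

def bpf_loglik(y, n, rng, a, q, r):
    """Log of the SMC estimate in (11), using the transition as proposal."""
    x = rng.standard_normal(n)                         # x_0^i ~ N(0, 1)
    w = np.ones(n)
    ll = 0.0
    for y_t in y:
        anc = rng.choice(n, size=n, p=w / w.sum())     # resample
        x = a * x[anc] + np.sqrt(q) * rng.standard_normal(n)
        w = np.exp(-0.5 * (y_t - x) ** 2 / r) / np.sqrt(2.0 * np.pi * r)
        ll += np.log(w.mean())                         # log of (1/N) sum_i w_t^i
    return ll
```

Note that the estimate is unbiased for $p(y_{1:t})$ itself; its logarithm is biased downward by Jensen's inequality, which is the gap that Section 2.3 builds on.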

For completeness, we reproduce the consistency proof of (12) in section A of the appendix. The recursive nature of SMC gives it constant complexity per time step and constant memory, because only the samples and weights generated at time $t$, $\{w_t^i, x_t^i\}_{i=1}^{N}$, are needed, making it a perfect candidate to be used in an online setting [29]. These attractive properties have allowed SMC to enjoy much success in fields such as robotics [30], control engineering [31] and target tracking [32]. The success of an SMC sampler crucially depends on the design of the proposal distribution, $r_t(x_t \mid x_{t-1}, y_t; \lambda_t)$. A common choice for the proposal distribution is the transition distribution, $r_t(x_t \mid x_{t-1}, y_t; \lambda_t) = p(x_t \mid x_{t-1})$, which is known as the bootstrap particle filter (BPF) [33]. While simple, it is well known that BPF needs a large number of particles to perform well and suffers in high dimensions [34].

Designing a proposal is even more difficult in an online setting because a proposal distribution that was optimized for the system at time $t$ may not be the best proposal $K$ steps ahead. For example, if the dynamics were to change abruptly, a phenomenon known as concept drift [35], the previous proposal may fail for the current time step. Thus, we propose to adapt the proposal distribution online using variational inference. This allows us to utilize modern stochastic optimization to adapt the proposal on-the-fly while still leveraging the theoretical guarantees of SMC.

2.2 Variational Inference

In contrast to SMC, VI takes an optimization approach to approximate Bayesian inference. In VI, we approximate the target posterior, $p(x_t \mid y_{1:t})$, by a class of simpler distributions, $q(x_t; \vartheta_t)$, where $\vartheta_t$ are the parameters of the distribution. We then minimize a divergence (which is usually the Kullback-Leibler divergence (KLD)) between the posterior and the approximate distribution in the hopes of making $q(x_t; \vartheta_t)$ closer to $p(x_t \mid y_{1:t})$. If the divergence used is KLD, then minimizing the KLD between these distributions is equivalent to maximizing the so-called evidence lower bound (ELBO) [24], [26]:

$$\begin{aligned} \mathcal{L}(\vartheta_t) &= \mathbb{E}_q[\log p(x_t, y_{1:t}) - \log q(x_t; \vartheta_t)], \\ &= \mathbb{E}_q\left[\log \mathbb{E}_{p(x_{t-1} \mid y_{1:t-1})}[p(x_t, x_{t-1}, y_{1:t})] - \log q(x_t; \vartheta_t)\right]. \end{aligned} \tag{13}$$

For filtering inference, the intractability introduced by marginalizing over $p(x_{t-1} \mid y_{1:t-1})$ in (13) makes the problem much harder to optimize, rendering variational inference impractical in a streaming setting where incoming data are temporally dependent.

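Algorithm 1 Streaming Variational Monte Carlo (Step $t$)

Require: $\{x_{t-1}^i, w_{t-1}^i\}_{i=1}^{N}$, $\Theta_{t-1}$, $y_t$
 1: for $k = 1, \ldots, N_{\mathrm{SGD}}$ do
 2:   for $i = 1, \ldots, L$ do
 3:     $a_t^i \sim \Pr(a_t^i = j) \propto w_{t-1}^j$   ▷ Resample
 4:     $x_t^i \sim r(x_t \mid x_{t-1}^{a_t^i}, y_t; \Theta_{t-1})$   ▷ Propose
 5:     $w_t^i \leftarrow \dfrac{p(x_t^i \mid x_{t-1}^{a_t^i}; \Theta_{t-1})\, p(y_t \mid x_t^i; \Theta_{t-1})}{r(x_t^i \mid x_{t-1}^{a_t^i}, y_t; \Theta_{t-1})}$   ▷ Weigh
 6:   end for
 7:   $\mathcal{L}_t \leftarrow \sum_i \log w_t^i$
 8:   $\Theta_t \leftarrow \Theta_{t-1} + \alpha \nabla_\Theta \mathcal{L}_t$   ▷ SGD
 9: end for
10: Resample, propose and reweigh $N$ particles
11: return $\{x_t^i, w_t^i\}_{i=1}^{N}$, $\Theta_t$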

2.3 A Tight Lower Bound

Due to the intractability of the filtering distribution, the standard ELBO is difficult to optimize, forcing us to define a different objective function. As stated above, we know that the sum of importance weights is an unbiased estimator of $p(y_{1:t})$. Jensen's inequality applied to (11) [22], [36] gives

$$\log p(y_{1:t}) = \log \mathbb{E}[\hat{p}(y_{1:t})] \geq \mathbb{E}[\log \hat{p}(y_{1:t})]. \tag{14}$$

Expanding (14), we obtain

$$\log p(y_t \mid y_{1:t-1}) + \log p(y_{1:t-1}) \geq \mathbb{E}[\log \hat{p}(y_t \mid y_{1:t-1})] + \mathbb{E}[\log \hat{p}(y_{1:t-1})], \tag{15}$$

$$\log p(y_t \mid y_{1:t-1}) \geq \mathbb{E}[\log \hat{p}(y_t \mid y_{1:t-1})] - R_t(N) \tag{16}$$

where $R_t(N) = \log p(y_{1:t-1}) - \mathbb{E}[\log \hat{p}(y_{1:t-1})] \geq 0$ is the variational gap. Leveraging this, we propose to optimize

$$\begin{aligned} \mathcal{L}_t(\Theta_t) &= \mathbb{E}[\log \hat{p}(y_t \mid y_{1:t-1})] - R_t(N), \\ &= \mathbb{E}\left[\sum_{i=1}^{N} \log w_t^i\right] - R_t(N). \end{aligned} \tag{17}$$

We call $\mathcal{L}_t$ the filtering ELBO; it is a lower bound to the log normalization constant (log partition function) of the filtering distribution, where $R_t(N)$ accounts for the bias of the estimator (12).

There exists an implicit posterior distribution that arises from performing SMC, given by [37],

$$\begin{aligned} q(x_t \mid y_{1:t}) &= p(x_t, y_{1:t})\, \mathbb{E}\left[\frac{1}{\hat{p}(y_{1:t})}\right] \\ &= p(x_t, y_t \mid y_{1:t-1})\, \mathbb{E}\left[\hat{p}(y_t \mid y_{1:t-1})^{-1}\, \frac{p(y_{1:t-1})}{\hat{p}(y_{1:t-1})}\right]. \end{aligned} \tag{18}$$

As the number of samples goes to infinity, (17) can be made arbitrarily tight; as a result, the implicit approximation to the filtering distribution (18) will become arbitrarily close to the true posterior, $p(x_t \mid y_{1:t})$, almost everywhere, which allows for a trade-off between accuracy and computational complexity. We note that this result is not applicable in most cases of VI due to the simplicity of the variational families used. As an example, we showcase the accuracy of the implicit distribution in Fig. 1. We summarize this result in the following theorem (proof in section B of the appendix).

Theorem 2.1 (Filtering ELBO). The filtering ELBO (17), $\mathcal{L}_t$, is a lower bound to the logarithm of the normalization constant of the filtering distribution, $p(x_t \mid y_{1:t})$. As the number of samples, $N$, goes to infinity, $\mathcal{L}_t$ will become arbitrarily close to $\log p(y_t \mid y_{1:t-1})$.

Theorem 2.1 leads to the following corollary [38] (proof in section C of the appendix).

Corollary 2.1.1. Theorem 2.1 implies that the implicit filtering distribution, $q(x_t \mid y_{1:t})$, converges to the true posterior $p(x_t \mid y_{1:t})$ as $N \to \infty$.

2.4 Stochastic Optimization

Figure 1. Accurate SVMC posterior (legend: prior, likelihood, posterior, proposal, SVMC, variational). SVMC is able to capture the posterior fairly well (200 samples) even with a unimodal proposal. The unimodal variational approximation fails to capture the structure of the true posterior.

As in variational inference, we fit the parameters of the proposal, dynamics and observation model, $\Theta_t = \{\lambda_t, \theta_t, \psi_t\}$, by maximizing the (filtering) ELBO (Alg. 1). While the expectations in (17) are not in closed form, we can obtain unbiased estimates of $\mathcal{L}_t$ and its gradients with Monte Carlo. Note that, when obtaining gradients with respect to $\Theta_t$, we only need to take gradients of $\mathbb{E}[\log \hat{p}(y_t \mid y_{1:t-1})]$. We also assume that the proposal distribution, $x_t \sim r(x_t \mid x_{t-1}, y_t; \lambda_t)$, is reparameterizable, i.e. we can sample from $r(x_t \mid x_{t-1}, y_t; \lambda_t)$ by setting $x_t = h(x_{t-1}, y_t, \epsilon_t; \lambda_t)$ for some function $h$, where $\epsilon_t \sim s(\epsilon_t)$ and $s$ is a distribution independent of $\lambda_t$. Thus we can express the gradient of (17) using the reparameterization trick [39] as

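$$\begin{aligned} \nabla_{\Theta_t} \mathcal{L}_t &= \nabla_{\Theta_t} \mathbb{E}_{s(\epsilon^{1:L})}[\log \hat{p}(y_t \mid y_{1:t-1})], \\ &= \mathbb{E}_{s(\epsilon^{1:L})}[\nabla_{\Theta_t} \log \hat{p}(y_t \mid y_{1:t-1})], \\ &= \mathbb{E}_{s(\epsilon^{1:L})}\left[\nabla_{\Theta_t} \sum_{i=1}^{L} \log w_t^i\right]. \end{aligned} \tag{19}$$

In Algorithm 1, we perform $N_{\mathrm{SGD}}$ stochastic gradient descent (SGD) updates at each step.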
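The following is a minimal sketch of one step of Algorithm 1 using the reparameterized gradient (19), restricted to a scalar linear-Gaussian model so that the gradient can be written by hand. Several things are assumptions made for illustration and are not from the paper: only the proposal parameters are adapted, the proposal family is $\mathcal{N}(f + k(y_t - f), s^2)$ with $f = a x_{t-1}$ and parameters $(k, \log s)$, and the names `svmc_step` and `log_normal` are hypothetical. In practice an automatic differentiation library would replace the analytic derivatives.

```python
import numpy as np

def log_normal(x, mean, var):
    """Elementwise log density of a scalar Gaussian N(mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def svmc_step(x_prev, w_prev, y, params, rng, a=0.9, q=1.0, r=1.0,
              n_sgd=15, n_grad=4, lr=0.0025):
    """One SVMC step for x_t = a x_{t-1} + N(0, q), y_t = x_t + N(0, r)."""
    k, log_s = params
    n = x_prev.size
    probs = w_prev / w_prev.sum()
    for _ in range(n_sgd):
        anc = rng.choice(n, size=n_grad, p=probs)   # resample L ancestors
        f = a * x_prev[anc]
        eps = rng.standard_normal(n_grad)           # eps ~ s(eps), free of params
        s = np.exp(log_s)
        x = f + k * (y - f) + s * eps               # reparameterized proposal draw
        # d/dx of log p(y|x) + log p(x|f); under reparameterization log r
        # depends on the parameters only through the term -log s.
        dlogw_dx = (y - x) / r - (x - f) / q
        grad_k = np.sum(dlogw_dx * (y - f))
        grad_log_s = np.sum(dlogw_dx * s * eps + 1.0)
        k += lr * grad_k                            # ascent on sum_i log w_t^i
        log_s += lr * grad_log_s
    # Final pass with all N particles using the adapted proposal.
    anc = rng.choice(n, size=n, p=probs)
    f = a * x_prev[anc]
    mean, s = f + k * (y - f), np.exp(log_s)
    x = mean + s * rng.standard_normal(n)
    log_w = log_normal(y, x, r) + log_normal(x, f, q) - log_normal(x, mean, s ** 2)
    return x, np.exp(log_w), np.array([k, log_s])
```

For this toy model with $q = r = 1$, the locally optimal proposal has $k = 0.5$ and $s^2 = 0.5$, so running the loop for many iterations from $k = 0$, $s = 1$ (the bootstrap proposal) should drift toward those values. This gives a convenient check that the gradient has the right sign and scale.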

While using more samples, $N$, will reduce the variational gap between the filtering ELBO, $\mathcal{L}_t$, and $\log p(y_t \mid y_{1:t-1})$, using more samples, $L$, for estimating (19) may be detrimental for optimizing the parameters, as it has been shown to decrease the signal-to-noise ratio (SNR) of the gradient estimator for importance-sampling-based ELBOs [40]. The intuition is as follows: as the number of samples used to compute the gradient increases, the bound gets tighter and tighter, which in turn causes the magnitude of the gradient to become smaller. The rate at which the magnitude decreases is much faster than the rate at which the variance of the estimator decreases, thus driving the SNR to 0. In practice, we found that using a small number of samples to estimate (19), $L < 5$, is enough to obtain good performance.

2.5 Learning Dynamics with Sparse Gaussian Processes

State-space models allow for various time series models to represent the evolution of state and ultimately predict the future [41]. While in some scenarios there exists prior knowledge on the functional form of the latent dynamics, $f_\theta(x)$, this is rarely the case in practice; thus, $f_\theta(x)$ must be learned online as well. While one can assume a parametric form for $f_\theta(x)$, e.g. a recurrent neural network, and learn $\theta$ through SGD, this does not allow uncertainty over the dynamics to be expressed, which is key for many real-time, safety-critical tasks. An attractive alternative to parametric models are Gaussian processes (GPs) [42]. Gaussian processes do not assume a functional form for the latent dynamics; rather, general assumptions, such as continuity or smoothness, are imposed. Gaussian processes also allow for a principled notion of uncertainty, which is key when predicting the future.

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. It is completely specified by its mean and covariance functions. A GP allows one to specify a prior distribution over functions

$$f(x) \sim \mathcal{GP}(m(x), k(x, x')) \tag{20}$$

where $m(\cdot)$ is the mean function and $k(\cdot, \cdot)$ is the covariance function; in this study, we assume that $m(x) = x$. With the GP prior imposed on the dynamics, one can do Bayesian inference with data.
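As an illustration of the prior (20) with the identity mean function, the sketch below draws functions from a one-dimensional GP evaluated on a grid. The squared-exponential kernel, its hyperparameters, the jitter term, and the function names are assumptions made here; the text above does not commit to a particular covariance function.

```python
import numpy as np

def rbf_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance k(a, b) for 1-D inputs (an assumed choice)."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_gp_prior(x, n_samples, rng, jitter=1e-6):
    """Draw f(x) ~ GP(m(x), k(x, x')) on the grid x, with m(x) = x."""
    K = rbf_kernel(x, x) + jitter * np.eye(x.size)  # jitter keeps Cholesky stable
    chol = np.linalg.cholesky(K)
    noise = rng.standard_normal((x.size, n_samples))
    return x[None, :] + (chol @ noise).T            # shape (n_samples, len(x))

grid = np.linspace(-3.0, 3.0, 50)
draws = sample_gp_prior(grid, 3, np.random.default_rng(0))
```

With $m(x) = x$, each draw is a smooth perturbation of the identity map, so a priori the latent state is expected to stay where it is from one step to the next.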

With the current formulation, a GP can be incorporated by augmenting the state-space to $(x_t, f_t)$, where $f_t \equiv f(x_{t-1})$. The importance weights are now computed according to

$$w_t = \frac{p(y_t \mid x_t)\, p(x_t \mid f_t)\, p(f_t \mid f_{1:t-1}, x_{0:t-1})}{r(x_t, f_t \mid f_{t-1}, x_{t-1}, y_t; \lambda_t)}. \tag{21}$$

Examining (21), it is evident that naively using a GP is impractical for online learning because its space and time complexity grow with the number of data points, and hence with time $t$, as $O(t^2)$ and $O(t^3)$ respectively. In other words, the space and time costs increase as more and more observations are processed.

To overcome this limitation, we employ the sparse GP method [23], [43]. We introduce $M$ inducing points, $z = (z_1, \ldots, z_M)$, where $z_i = f(u_i)$ and $u_i$ are pseudo-inputs, and impose that the prior distribution over the inducing points is $p(z) = \mathcal{N}(0, k(u, u'))$. In the experiments, we spread the inducing points uniformly over a finite volume in the latent space. Under the sparse GP framework, we assume that $z$ is a sufficient statistic for $f_t$, i.e.

$$\begin{aligned} p(f_t \mid x_{0:t-1}, f_{1:t-1}, z) &= p(f_t \mid x_{t-1}, z) \\ &= \mathcal{N}\left(f_t \mid m_t + K_{tz} K_{zz}^{-1} z,\; K_{tt} - K_{tz} K_{zz}^{-1} K_{zt}\right), \end{aligned} \tag{22}$$

where $m_t = m(x_{t-1})$. Note that the inclusion of the inducing points in the model reduces the computational complexity to be constant with respect to time. Marginalizing out $f_t$ in (22) gives

$$\begin{aligned} p(x_t \mid x_{t-1}, z) &= \int p(x_t \mid f_t)\, p(f_t \mid x_{t-1}, z)\, \mathrm{d}f_t \\ &= \mathcal{N}\left(x_t \mid m_t + K_{tz} K_{zz}^{-1} z,\; K_{tt} - K_{tz} K_{zz}^{-1} K_{zt} + Q\right). \end{aligned} \tag{23}$$
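The moments in (23) can be computed directly from the kernel matrices. The sketch below does so for a scalar state; the squared-exponential kernel, the jitter added to $K_{zz}$, and the names `rbf` and `transition_moments` are assumptions for illustration.

```python
import numpy as np

def rbf(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance for 1-D inputs (an assumed kernel)."""
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def transition_moments(x_prev, u, z, q, jitter=1e-6):
    """Mean and variance of p(x_t | x_{t-1}, z) in (23) for a scalar state.

    u: pseudo-inputs (M,), z: inducing values (M,), q: state noise variance Q.
    Also returns A_t = K_tz K_zz^{-1}, which reappears in the later updates.
    """
    xp = np.atleast_1d(x_prev).astype(float)
    Kzz = rbf(u, u) + jitter * np.eye(u.size)
    Ktz = rbf(xp, u)                           # shape (1, M)
    A = np.linalg.solve(Kzz, Ktz.T).T          # K_tz K_zz^{-1}, using symmetry of K_zz
    mean = xp + A @ z                          # m_t + A_t z with m(x) = x
    C = rbf(xp, xp) - A @ Ktz.T + q            # K_tt - K_tz K_zz^{-1} K_zt + Q
    return mean[0], C[0, 0], A
```

The cost depends only on the number of inducing points $M$, not on the number of past observations. At a pseudo-input the GP variance collapses and only $Q$ remains, while far from all pseudo-inputs the mean reverts to $m(x) = x$.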

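Equipped with equation (23), we can express the smoothing distribution as

$$p(x_{0:t}, z \mid y_{1:t}) \propto p(x_0)\, p(z) \prod_{j=1}^{t} p(y_j \mid x_j)\, p(x_j \mid x_{j-1}, z), \tag{24}$$

and the importance weights can be computed according to

$$w_t = \frac{p(y_t \mid x_t)\, p(x_t \mid x_{t-1}, z)\, p(z \mid x_{0:t-1})}{r(x_t, z \mid x_{t-1}, y_t; \lambda_t)}. \tag{25}$$

Due to the conjugacy of the model, $p(z \mid x_{0:t-1})$ can be recursively updated efficiently. Let $p(z \mid x_{0:t-1}) = \mathcal{N}(z \mid \mu_{t-1}, \Gamma_{t-1})$. Given $x_t$, and by Bayes' rule

$$p(z \mid x_{0:t}) \propto p(x_t \mid x_{t-1}, z)\, p(z \mid x_{0:t-1}), \tag{26}$$

we obtain the recursive updating rule:

$$\begin{aligned} \Gamma_t &= \left(\Gamma_{t-1}^{-1} + A_t^\top C_t^{-1} A_t\right)^{-1} \\ \mu_t &= \Gamma_t \left[\Gamma_{t-1}^{-1} \mu_{t-1} + A_t^\top C_t^{-1} (x_t - m_t)\right] \end{aligned} \tag{27}$$

where $A_t = K_{tz} K_{zz}^{-1}$ and $C_t = K_{tt} - K_{tz} K_{zz}^{-1} K_{zt} + Q$.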

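To facilitate learning in non-stationary environments, we impose a diffusion process over the inducing variables. Letting $p(z_{t-1} \mid x_{0:t-1}) = \mathcal{N}(\mu_{t-1}, \Gamma_{t-1})$, we impose the following relationship between $z_{t-1}$ and $z_t$

$$z_t = z_{t-1} + \eta_t, \tag{28}$$

where $\eta_t \sim \mathcal{N}(0, \sigma_z^2 I)$. We can rewrite (25) as

$$w_t = \frac{p(y_t \mid x_t)\, p(x_t \mid x_{t-1}, z_t)\, p(z_t \mid x_{0:t-1})}{r(x_t, z_t \mid x_{t-1}, z_{t-1}, y_t; \lambda_t)}, \tag{29}$$

where

$$\begin{aligned} p(z_t \mid x_{0:t-1}) &= \int p(z_t \mid z_{t-1})\, p(z_{t-1} \mid x_{0:t-1})\, \mathrm{d}z_{t-1} \\ &= \mathcal{N}(\mu_{t-1}, \Gamma_{t-1} + \sigma_z^2 I). \end{aligned} \tag{30}$$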

To reduce the computational cost, we marginalize out the inducing points from the model, simplifying (29) to

$$w_t = \frac{p(y_t \mid x_t)\, p(x_t \mid x_{0:t-1})}{r(x_t \mid x_{t-1}, y_t; \lambda_t)}, \tag{31}$$

where

$$\begin{aligned} p(x_t \mid x_{0:t-1}) &= \int p(x_t \mid x_{t-1}, z_t)\, p(z_t \mid x_{0:t-1})\, \mathrm{d}z_t \\ &= \mathcal{N}(v_t, \Sigma_t) \end{aligned} \tag{32}$$

where $v_t = m_t + A_t \mu_{t-1}$ and $\Sigma_t = C_t + A_t(\Gamma_{t-1} + \sigma_z^2 I) A_t^\top$.

For each stream of particles, we store $\mu_t^i$ and $\Gamma_t^i$. Due to the recursive updates (27), maintaining and updating $\mu_t^i$ and $\Gamma_t^i$ is of constant time complexity, making it amenable to online learning. The use of particles also allows for easy sampling of the predictive distribution (details are in section E of the appendix). We call this approach SVMC-GP; the algorithm is summarized in Algorithm 2.
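The predictive moments in (32) follow from pushing the diffused inducing-point posterior (30) through the linear-Gaussian transition (23). A minimal sketch, assuming $A_t$, $C_t$ and $m_t$ have already been computed and using the hypothetical name `predictive_moments`:

```python
import numpy as np

def predictive_moments(m_t, A, C, mu_prev, Gamma_prev, sigma_z):
    """Moments (v_t, Sigma_t) of p(x_t | x_{0:t-1}) in (32).

    m_t: (d,), A: (d, M), C: (d, d), mu_prev: (M,), Gamma_prev: (M, M);
    sigma_z is the diffusion standard deviation from (28).
    """
    # (30): diffusion inflates the covariance of the inducing variables.
    P = Gamma_prev + sigma_z ** 2 * np.eye(mu_prev.size)
    v = m_t + A @ mu_prev
    Sigma = C + A @ P @ A.T
    return v, Sigma
```

Setting `sigma_z` to zero recovers the stationary case, while a larger value keeps $\Sigma_t$ from shrinking and so lets the dynamics estimate keep adapting when the underlying system drifts.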

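Algorithm 2 SVMC-GP (Step $t$)

Require: $\{x_{t-1}^i, \mu_{t-1}^i, \Gamma_{t-1}^i, w_{t-1}^i\}_{i=1}^{N}$, $\Theta_{t-1}$, $y_t$
 1: for $k = 1, \ldots, N_{\mathrm{SGD}}$ do
 2:   for $i = 1, \ldots, L$ do
 3:     $a_t^i \sim \Pr(a_t^i = j) \propto w_{t-1}^j$   ▷ Resample
 4:     $x_t^i \sim r(x_t \mid y_t, x_{t-1}^{a_t^i}; \mu_{t-1}^{a_t^i}, \Gamma_{t-1}^{a_t^i}, \Theta_{t-1})$   ▷ Propose
 5:     $w_t^i \leftarrow \dfrac{p(x_t^i \mid x_{t-1}^{a_t^i})\, p(y_t \mid x_t^i; \Theta_{t-1})}{r(x_t^i \mid x_{t-1}^{a_t^i}, y_t; \mu_{t-1}^{a_t^i}, \Gamma_{t-1}^{a_t^i}, \Theta_{t-1})}$   ▷ Reweigh
 6:   end for
 7:   $\mathcal{L}_t \leftarrow \sum_i \log w_t^i$
 8:   $\Theta_t \leftarrow \Theta_{t-1} + \alpha \nabla_\Theta \mathcal{L}_t$   ▷ SGD
 9: end for
10: Resample, propose and reweigh $N$ particles
11: Compute $\mu_t^i$ and $\Gamma_t^i$
12: return $\{x_t^i, \mu_t^i, \Gamma_t^i, w_t^i\}_{i=1}^{N}$, $\Theta_t$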

2.6 Design of Proposals

As stated previously, the accuracy of SVMC depends crucially on the functional form of the proposal. The (locally) optimal proposal is

$$p(x_t \mid x_{t-1}, y_t) \propto p(y_t \mid x_t)\, p(x_t \mid x_{t-1}) \tag{33}$$

which is a function of $y_t$ and $f_t$ [44]. In general (33) is intractable; to emulate (33) we parameterize the proposal as

$$r(x_t \mid x_{t-1}, y_t) = \mathcal{N}\left(\mu_{\lambda_t}(f_t, y_t),\; \sigma_{\lambda_t}^2(f_t, y_t) I\right) \tag{34}$$

where $\mu_{\lambda_t}$ and $\sigma_{\lambda_t}$ are neural networks with parameter $\lambda_t$.
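A proposal of the form (34) can be sketched as a small multi-layer perceptron that maps $(f_t, y_t)$ to a mean vector and a single log-variance, followed by a reparameterized draw. The shared hidden layer with two output heads, the `tanh` activation, the initialization, and the names `init_proposal` and `propose` are assumptions for illustration; the paper only states that $\mu_{\lambda_t}$ and $\sigma_{\lambda_t}$ are neural networks.

```python
import numpy as np

def init_proposal(d_x, d_y, d_hidden, rng):
    """Randomly initialize a one-hidden-layer network (hypothetical layout)."""
    d_in = d_x + d_y
    return {
        "W1": rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in),
        "b1": np.zeros(d_hidden),
        "W2": rng.standard_normal((d_x + 1, d_hidden)) / np.sqrt(d_hidden),
        "b2": np.zeros(d_x + 1),
    }

def propose(params, f_t, y_t, rng):
    """Sample x_t ~ N(mu(f_t, y_t), sigma^2(f_t, y_t) I) and return log r(x_t)."""
    h = np.tanh(params["W1"] @ np.concatenate([f_t, y_t]) + params["b1"])
    out = params["W2"] @ h + params["b2"]
    mu, log_var = out[:-1], out[-1]        # mean head and scalar log-variance head
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.size)
    x = mu + sigma * eps                   # reparameterized sample
    d = mu.size
    log_r = -0.5 * (d * (np.log(2.0 * np.pi) + log_var) + eps @ eps)
    return x, log_r, mu, sigma
```

Predicting the log-variance rather than the variance keeps $\sigma^2$ positive without a constraint, and returning `log_r` alongside the sample gives the denominator needed for the importance weight.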

3 RELATED WORKS

Much work has been done on learning good proposals for SMC. The method proposed in [21] iteratively adapts its proposal for an auxiliary particle filter. In [19], the proposal is learned online using expectation-maximization, but the class of dynamics to which the approach is applicable is extremely limited. The method proposed in [20] learns the proposal by minimizing the KLD between the smoothing distribution and the proposal, $D_{\mathrm{KL}}[p(x_{0:t} \mid y_{1:t}) \,\|\, r(x_{0:t}; \lambda_{0:t})]$; while this approach can be used to learn the proposal online, biased importance-weighted estimates of the gradient are used, which can suffer from high variance if the initial proposal is bad. Conversely, [22] maximizes $\mathbb{E}[\log \hat{p}(y_{1:t})]$, which can be shown to minimize the KLD between the proposal and the implicit smoothing distribution, $D_{\mathrm{KL}}[q(x_{0:t} \mid y_{1:t}) \,\|\, p(x_{0:t} \mid y_{1:t})]$; biased gradients were used to lower the variance. In contrast, SVMC allows for unbiased and low variance gradients that target the filtering distribution as opposed to the smoothing distribution. In [45], the proposal is parameterized by a Riemannian Hamiltonian Monte Carlo and the algorithm updates the mass matrix by maximizing $\mathbb{E}[\log \hat{p}(y_{1:t})]$. At each time step (and for every stochastic gradient), the Hamiltonian must be computed forward in time using numerical integration, making the method impractical for an online setting.

Many works have focused on learning latent dynamics online using (sparse) GPs [6], [46], [47], [48], [49]. However, the methods proposed in the literature either make an approximation to the posterior of the GP and/or only work for certain emission models, e.g. linear and Gaussian. The use of particles allows SVMC-GP to work for arbitrary emission models without making such simplifying assumptions.

4 EXPERIMENTS

To showcase the power of SVMC, we employ it on a number of simulated and real experiments.

4.1 Synthetic Data

4.1.1 Linear Dynamical System

As a baseline, we use the proposed method on a lineardynamical system (LDS)

xt = Axt−1 + εt, εt ∼ N (0, Q),

yt = Cxt + ξt, ξt ∼ N (0, R).(35)

LDS is the de facto dynamical system for many fields of science and engineering due to its simplicity and efficient exact filtering (i.e., Kalman filtering). An LDS also admits a closed-form expression for the log marginal likelihood. Thus, as a baseline we compare the estimates of the log marginal likelihood, $\log p(y_{1:T})$, produced by SVMC, variational sequential Monte Carlo (VSMC) [22] (run in offline mode) and BPF [33] in an experiment similar to the one used in [22]. We generated data from (35) with $T = 50$, $d_x = d_y = 10$, $(A)_{ij} = \alpha^{|i-j|+1}$ with $\alpha = 0.42$, and $Q = R = I$, where the state and emission parameters are fixed; the true negative log marginal likelihood is 1168.2. For both VSMC and SVMC, 100 particles were used for both computing gradients and computing the lower bound. The proposal for VSMC and SVMC was

$$
r(x_t \mid x_{t-1}; \lambda_t) = \mathcal{N}\big(\mu_t + \mathrm{diag}(\beta_t) A x_{t-1},\ \mathrm{diag}(\sigma_t^2)\big)
\tag{36}
$$

where $\lambda_t = \{\mu_t, \beta_t, \sigma_t^2\}$. 25,000 gradient steps were used for VSMC, and 25,000/50 = 500 gradient steps were done at each time step for SVMC to equate the total number of gradient steps between both methods. To equate the computational complexity between SVMC and BPF, we ran the BPF with 125,000 particles. We fixed the data generated from (35) and ran each method for 100 realizations. The average negative ELBO and its standard error for each of the methods are shown in Table 1.
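The exact reference value for this kind of comparison comes from the Kalman filter's prediction-error decomposition. The sketch below is illustrative only: the emission matrix `C`, the deterministic initial state $x_0 = 0$ and the random seed are assumptions not stated in the text, so the printed number is not expected to reproduce the value reported above.

```python
import numpy as np

def make_lds(dx=10, dy=10, alpha=0.42):
    """Transition matrix (A)_ij = alpha^(|i-j|+1) with Q = R = I, as in (35)."""
    idx = np.arange(dx)
    A = alpha ** (np.abs(idx[:, None] - idx[None, :]) + 1)
    return A, np.eye(dx), np.eye(dy)

def simulate(A, C, Q, R, T, rng):
    """Draw y_{1:T} from (35), assuming (hypothetically) x_0 = 0."""
    dx, dy = A.shape[0], C.shape[0]
    x = np.zeros(dx)
    ys = np.empty((T, dy))
    for t in range(T):
        x = A @ x + rng.multivariate_normal(np.zeros(dx), Q)
        ys[t] = C @ x + rng.multivariate_normal(np.zeros(dy), R)
    return ys

def kalman_loglik(ys, A, C, Q, R, m0, P0):
    """Exact log p(y_{1:T}) as the sum of log p(y_t | y_{1:t-1})."""
    m, P, ll = m0, P0, 0.0
    for y in ys:
        m = A @ m                      # predict mean
        P = A @ P @ A.T + Q            # predict covariance
        S = C @ P @ C.T + R            # innovation covariance
        r = y - C @ m                  # innovation
        _, logdet = np.linalg.slogdet(S)
        ll += -0.5 * (len(y) * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(S, r))
        K = np.linalg.solve(S, C @ P).T  # Kalman gain P C^T S^{-1}
        m = m + K @ r
        P = P - K @ S @ K.T
    return ll

rng = np.random.default_rng(0)
A, Q, R = make_lds()
C = rng.standard_normal((10, 10))      # hypothetical emission matrix
ys = simulate(A, C, Q, R, T=50, rng=rng)
nll = -kalman_loglik(ys, A, C, Q, R, np.zeros(10), np.zeros((10, 10)))
print(f"exact negative log marginal likelihood: {nll:.1f}")
```

Any Monte Carlo estimate of $\log p(y_{1:T})$ on the same data can then be judged against `nll`.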

From Table 1, we see that the use of unbiased gradients in SVMC allows for faster convergence than the biased gradients used in VSMC [22]. Interestingly, SVMC performs worse than a BPF with 125,000 particles. This may be due to the choice of the proposal function; the optimal proposal function for particle filtering is a function of the observation, $y_t$, and the dynamics, $A x_{t-1}$, while (36) is only a function of the dynamics. To support this claim, we ran SVMC with 10,000 particles with the proposal defined by (34) (SVMC-MLP in Table 1). With this improved proposal, SVMC outperforms BPF with a much lower computational budget.
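For reference, the BPF baseline proposes from the dynamics alone and accumulates the log of the mean unnormalized weight at each step, which gives the estimate of $\log p(y_{1:T})$. The following is a minimal sketch rather than the implementation used in the experiments: it assumes isotropic noise ($Q = q I$, $R = r I$), a deterministic $x_0 = 0$ and multinomial resampling at every step, none of which is specified in the text.

```python
import numpy as np

def bpf_loglik(ys, A, C, q_var, r_var, n_particles, rng):
    """Bootstrap particle filter estimate of log p(y_{1:T}) for the LDS (35).

    Assumes Q = q_var * I, R = r_var * I and x_0 = 0 (hypothetical choices).
    """
    dx, dy = A.shape[0], C.shape[0]
    x = np.zeros((n_particles, dx))
    ll = 0.0
    for y in ys:
        # Propose from the dynamics only (the bootstrap proposal).
        x = x @ A.T + np.sqrt(q_var) * rng.standard_normal((n_particles, dx))
        # Log-weights are the Gaussian emission log-densities.
        resid = y - x @ C.T
        logw = -0.5 * (dy * np.log(2 * np.pi * r_var) + np.sum(resid**2, axis=1) / r_var)
        # log of the mean weight, computed stably.
        mx = logw.max()
        w = np.exp(logw - mx)
        ll += mx + np.log(w.mean())
        # Multinomial resampling.
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        x = x[idx]
    return ll

# Small usage example on a scalar system.
rng = np.random.default_rng(0)
A = np.array([[0.5]])
C = np.array([[1.0]])
ys = rng.standard_normal((10, 1))
print(bpf_loglik(ys, A, C, 1.0, 1.0, 1000, rng))
```

Because the proposal ignores $y_t$, many particles can land where the emission density is small, which is the weakness of dynamics-only proposals discussed above.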

4.1.2 Chaotic Recurrent Neural Network

To show the performance of our algorithm in filtering data generated from a complex, nonlinear and high-dimensional dynamical system, we generate data from a continuous-time "vanilla" recurrent neural network (vRNN)

$$
\tau \dot{x}(t) = -x(t) + \gamma W_x \tanh(x(t)) + \sigma(x)\, dW(t),
\tag{37}
$$

where $W(t)$ is Brownian motion. Using Euler integration, (37) can be described as a discrete-time dynamical system

$$
x_{t+1} = x_t + \Delta\,\big(-x_t + \gamma W_x \tanh(x_t)\big)/\tau + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, Q),
\tag{38}
$$

where $\Delta$ is the Euler step. The emission model is

$$
y_t = C x_t + D + \xi_t, \qquad \xi_{t,1}, \dots, \xi_{t,d_y} \overset{\text{i.i.d.}}{\sim} \mathcal{ST}(0, \nu_y, \sigma_y),
\tag{39}
$$

where each dimension of the emission noise, $\xi_t$, is independently and identically distributed (i.i.d.) according to a Student's t distribution, $\mathcal{ST}(0, \nu_y, \sigma_y)$, where $\nu_y$ is the degrees of freedom and $\sigma_y$ is the scale.

We set $d_y = d_x = 10$, and the elements of $W_x$ are drawn i.i.d. from $\mathcal{N}(0, 1/d_x)$. We set $\gamma = 2.5$, which produces chaotic dynamics at a relatively slow timescale compared to $\tau$ [50]. The rest of the parameter values are $\tau = 0.025$, $\Delta = 0.001$, $Q = 0.01 I$, $\nu_y = 2$ and $\sigma_y = 0.1$, which are kept fixed. We generated a single time series of length $T = 500$ and fixed it across multiple realizations. SVMC was run using 200 particles with proposal distribution (34), where the neural network was a multi-layer perceptron (MLP) with 1 hidden layer of 100 hidden units; 15 gradient steps were performed at each time step. For comparison, a BPF with 10,000 particles and an unscented Kalman filter (UKF) were run. Each method was run over 100 realizations. SVMC achieves better performance than BPF with a smaller computational budget and more than an order of magnitude fewer particles (Table 2).
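The data-generating process in (38) and (39) can be sketched in a few lines of NumPy. The parameter values follow the text; the emission matrix `C`, the offset `D`, the initial state and the seed are not given there and are hypothetical choices made only so that the example runs.

```python
import numpy as np

def simulate_vrnn(T=500, dx=10, dy=10, gamma=2.5, tau=0.025, delta=0.001,
                  q_var=0.01, nu_y=2.0, sigma_y=0.1, seed=0):
    """Simulate the Euler-discretized vRNN (38) with Student's t emissions (39)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(1.0 / dx), size=(dx, dx))  # W_x entries ~ N(0, 1/d_x)
    C = rng.standard_normal((dy, dx))   # hypothetical emission matrix
    D = np.zeros(dy)                    # hypothetical emission offset
    x = rng.standard_normal(dx)         # hypothetical initial state
    xs = np.empty((T, dx))
    ys = np.empty((T, dy))
    for t in range(T):
        drift = (-x + gamma * (W @ np.tanh(x))) / tau
        x = x + delta * drift + np.sqrt(q_var) * rng.standard_normal(dx)
        xs[t] = x
        # Each emission dimension gets independent scaled Student's t noise.
        ys[t] = C @ x + D + sigma_y * rng.standard_t(nu_y, size=dy)
    return xs, ys

xs, ys = simulate_vrnn()
print(xs.shape, ys.shape)
```

With $\nu_y = 2$ the emission noise has infinite variance, so occasional large outliers in `ys` are expected; this is what makes the Gaussian assumptions of Kalman-type filters a poor fit here.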

4.1.3 Synthetic NASCAR® Dynamics

We test learning dynamics with a sparse GP on synthetic data whose underlying dynamics follow a recurrent switching linear dynamical system (rSLDS) [51]. The simulated trajectory resembles the NASCAR® track (Fig. 2A). We train the model with 2000 observations simulated from $y_t = C x_t + \xi_t$, where $C$ is a 50-by-2 matrix. The proposal is defined as $\mathcal{N}(\mu_t, \Sigma_t)$, where $\mu_t$ and $\Sigma_t$ are linear maps of the concatenation of the observation $y_t$ and the previous state $x^i_{t-1}$. We use 50 particles and a squared exponential (SE) kernel for the GP. To investigate the learning of dynamics, we control for other factors, i.e., we fix the observation model and other hyperparameters, such as noise variances, at their true values. (See the details in Appendix D.)

Figure 2A shows the true (blue) and inferred (red) latent states. The inference quickly catches up with the true state and stays on the track. As the state oscillates on the track, the sparse GP learns a reasonable limit cycle (Fig. 2F) without seeing much data outside the track. The velocity fields in Figure 2D–F show the gradual improvement in the online learning. The 500-step prediction also shows that the GP captures the dynamics (Fig. 2B). We compare SVMC with a Kalman filter in terms of mean squared error (MSE) (Fig. 2C). The transition matrix of the LDS of the Kalman filter (KF) is learned by expectation-maximization, which is an offline method and hence not truly online. Yet, SVMC performs better than KF after 1000 steps.

Table 1. Experiment 1 (LDS) with 100 replication runs (true negative log-likelihood is 1168.12). The average negative ELBO and runtime are shown with the standard error.

| | SVMC (100) | BPF (125,000) | VSMC (100) | SVMC-MLP |
|---|---|---|---|---|
| −ELBO | 1189.5 ± 0.6 | 1183.0 ± 0.5 | 1198.0 ± 0.7 | 1178.7 ± 0.4 |
| time (s) | 312.0 ± 0.1 | 374.0 ± 2.2 | 7076.7 ± 28.3 | 47.8 ± 0.2 |

Table 2. Experiment 2 (Chaotic RNN) with 100 replication runs. The average RMSE (lower is better), negative ELBO (lower is better) and runtime per step are shown with standard error.

| | SVMC (200) | BPF (10,000) | UKF |
|---|---|---|---|
| RMSE | 3.3 ± 0.04 | 4.0 ± 0.05 | 3.9 ± 0.12 |
| −ELBO (nats) | 20.7 ± 0.02 | 23.9 ± 0.04 | N/A |
| time (s) | 20.0 ± 0.08 | 82.1 ± 0.57 | 0.8 ± 0.004 |

4.1.4 Winner-Take-All Spiking Neural Network

The perceptual decision-making paradigm is a well-established cognitive task in which, typically, a low-dimensional decision variable needs to be integrated over time, and subjects are close to optimal in their performance. To understand how the brain implements such neural computation, many competing theories have been proposed [52], [53], [54], [55], [56]. We test our method on a simulated biophysically realistic cortical network model for a visual discrimination experiment [55]. In the model, there are two excitatory sub-populations that are wired with slow recurrent excitation and feedback inhibition to produce attractor dynamics with two stable fixed points. Each fixed point represents the final perceptual decision, and the network dynamics amplify the difference between conflicting inputs and eventually generate a binary choice.

The simulated data was organized into decision-making trials. Because the original network was designed for a single one-off experiment, we modified it by injecting a 60 Hz Poisson input into the inhibitory sub-population at the end of each trial to "reset" the network, so that trials run uninterrupted and fit the streaming case. In each trial the input to the network consisted of two periods: one 2-sec stimulus signal with a different strength of visual evidence controlled by "coherence", and one 2-sec 60 Hz reset signal, each following a 1-sec spontaneous period (no input). We subsampled 480 selective neurons out of the 1600 excitatory neurons in the simulation to be observed by our algorithm.

Fig. 3 shows that our approach did well at learning the target network. The top-left panel of Fig. 3B shows the inferred latent trajectories of several trials. In each trial the network started at the initial state, eventually reached either choice (indicated by the color) after the stimulus onset, and finally returned to the neighborhood of the initial state after receiving the reset signal. The remaining three panels (Fig. 3B) show the phase portraits of the inferred dynamical system, revealing the bifurcation and how the network dynamics were governed at different phases of the experiment. In the spontaneous phase (when the network receives no external input), the latent state is attracted by the middle sink. After the stimulus onset, the middle sink disappears and the latent state falls to either side, driven by noise, to form a choice. At the reset onset, the latent state is pushed back to the only sink, which is close to the middle sink of the spontaneous phase, and then the network is ready for a new trial. We generated latent trajectories and spike trains by replicating the experiments on the fitted model (Fig. 3C). The result shows that the model can replicate the dynamics of the target network.

The mean-field reduction of the network (Fig. 6, [56]) also confirms that our inference accurately captured the key features of the network. Note that our inference was done without knowing the true underlying dynamics, which means our method can recover the dynamics from data in a bottom-up fashion.

4.2 Real Data: Learning an Analog Circuit

The synthetic experiments above verify that the proposed methodology is capable of learning the underlying dynamics from noisy streaming observations. To test it in the real world, we applied our method to the voltage readout from an analog circuit (Fig. 4). We designed and built this circuit to realize the following system of ordinary differential equations

$$
\begin{aligned}
\dot{x} &= (5z - 5)\,[x - \tanh(\alpha x - \beta y)] \\
\dot{y} &= (5z - 5)\,[y - \tanh(\beta x + \alpha y)] \\
\dot{z} &= -0.5\,(z - \tanh(1.5 z))
\end{aligned}
\tag{40}
$$

where the dot indicates the time derivative and $\alpha = \beta = 1.5 \cos(\pi/5)$. This circuit oscillated at approximately 2 Hz. The sampling rate was 200 Hz.

We fit a 2D SSM specified as $x_t = f(x_{t-1}) + \epsilon_t$; $y_t = C x_t + \xi_t$, where $x_t \in \mathbb{R}^2$, $y_t \in \mathbb{R}^3$, $\epsilon_t \sim \mathcal{N}(0, 10^{-3})$ and $\xi_t \sim \mathcal{N}(0, 10^{-3})$. We trained the model with 3000 time steps (Fig. 5A). The inferred dynamics show that the limit cycle can implement the oscillation (Fig. 5C). The prediction of future observations (500 steps) resembles an oscillating trajectory (Fig. 5B).
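For readers who want to explore the idealized system (40) numerically, the sketch below integrates it with a classical fourth-order Runge-Kutta scheme at the 200 Hz sampling rate. It is only an illustration of the equations as written: the initial condition is a hypothetical choice, the physical circuit includes noise and component tolerances that are not modeled, and no claim is made that this reproduces the recorded waveform.

```python
import numpy as np

ALPHA = BETA = 1.5 * np.cos(np.pi / 5)

def vector_field(s):
    """Right-hand side of the ODE system (40)."""
    x, y, z = s
    gain = 5.0 * z - 5.0
    return np.array([
        gain * (x - np.tanh(ALPHA * x - BETA * y)),
        gain * (y - np.tanh(BETA * x + ALPHA * y)),
        -0.5 * (z - np.tanh(1.5 * z)),
    ])

def rk4(s0, dt, n_steps):
    """Integrate with classical RK4 and return the (n_steps + 1, 3) trajectory."""
    s = np.asarray(s0, dtype=float)
    traj = np.empty((n_steps + 1, 3))
    traj[0] = s
    for k in range(n_steps):
        k1 = vector_field(s)
        k2 = vector_field(s + 0.5 * dt * k1)
        k3 = vector_field(s + 0.5 * dt * k2)
        k4 = vector_field(s + dt * k3)
        s = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj[k + 1] = s
    return traj

# 60 seconds at 200 Hz from a hypothetical initial condition.
traj = rk4([0.1, 0.0, 0.5], dt=1.0 / 200.0, n_steps=12000)
print(traj[-1])
```

The $z$ equation is decoupled from $x$ and $y$: it relaxes to a fixed point of $z = \tanh(1.5 z)$ and then acts as a constant gain $(5z - 5)$ on the planar subsystem.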

5 DISCUSSION

In this study we develop a novel online learning framework, leveraging variational inference and sequential Monte Carlo, which enables flexible and accurate Bayesian joint filtering. Our derivation shows that our filtering posterior can be made arbitrarily close to the true one for a wide class of dynamics models and observation models. Specifically, the proposed framework can efficiently infer a posterior over the dynamics using sparse Gaussian processes by augmenting the state with the inducing variables that follow a diffusion process. By exploiting Bayes' rule, our recursive proposal on the inducing variables does not require gradient-based optimization. Constant time complexity per sample makes our approach amenable to online learning scenarios and suitable for real-time applications. In contrast to previous works, we demonstrate that our approach is able to accurately filter the latent states for linear and nonlinear dynamics, recover complex posteriors, faithfully infer dynamics, and provide long-term predictions. In the future, we want to focus on reducing the computation time per sample, which could allow for real-time application on faster systems. On the GP side, we would like to investigate priors and kernels that characterize the properties of dynamical systems, as well as the hyperparameters.

Figure 2. NASCAR® Dynamics. (A) True and inferred latent trajectory. (B) Filtering and prediction. We show the last 500 steps of filtered states and the following 500 steps of predicted states. (C) Predictive error (axes: prediction horizon versus predictive RMSE; curves: Kalman EM, SVMC at 100, 1000 and 2000 steps). We compare the 500-step predictive MSE (average over 100 realizations) of SVMC (sparse GP dynamics) with the Kalman filter. The transition matrix of the Kalman filter was learned by EM. (D)–(F) Velocity field learned by different time steps (t = 100, t = 1000, t = 2000); color indicates speed from high to low.

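APPENDIX A
PROOF THAT $\hat{p}(y_t \mid y_{1:t-1})$ IS A CONSISTENT ESTIMATOR FOR $p(y_t \mid y_{1:t-1})$

Proof. To prove that $\hat{p}(y_t \mid y_{1:t-1})$ is a consistent estimator, we rely on the delta method [28]. From [57], we know that the central limit theorem (CLT) holds for $\hat{p}(y_{1:t})$ and $\hat{p}(y_{1:t-1})$:

$$
\sqrt{N}\,\big(\hat{p}(y_{1:t-1}) - p(y_{1:t-1})\big) \xrightarrow{d} \mathcal{N}(0, \sigma_{t-1}^2),
\tag{41}
$$

$$
\sqrt{N}\,\big(\hat{p}(y_{1:t}) - p(y_{1:t})\big) \xrightarrow{d} \mathcal{N}(0, \sigma_t^2),
\tag{42}
$$

where we assume that $\sigma_{t-1}^2$ and $\sigma_t^2$ are finite. We can express $\hat{p}(y_t \mid y_{1:t-1})$ as a function of $\hat{p}(y_{1:t})$ and $\hat{p}(y_{1:t-1})$,

$$
\hat{p}(y_t \mid y_{1:t-1}) = g\big(\hat{p}(y_{1:t}), \hat{p}(y_{1:t-1})\big) = \frac{\hat{p}(y_{1:t})}{\hat{p}(y_{1:t-1})}.
\tag{43}
$$

Since $p(y_{1:t}) / p(y_{1:t-1}) = p(y_t \mid y_{1:t-1})$ and $g$ is continuous and differentiable at this point, an application of the delta method gives

$$
\sqrt{N}\,\big(\hat{p}(y_t \mid y_{1:t-1}) - p(y_t \mid y_{1:t-1})\big) \xrightarrow{d} \mathcal{N}(0, \nabla g^\top \Sigma \nabla g),
\tag{44}
$$

where $\Sigma_{1,1} = \sigma_t^2$, $\Sigma_{2,2} = \sigma_{t-1}^2$ and $\Sigma_{1,2} = \Sigma_{2,1} = \sigma_{t,t-1}$, and where, by the Cauchy-Schwarz inequality, $\sigma_{t,t-1}$ is also finite [28]. Thus, as $N \to \infty$, $\hat{p}(y_t \mid y_{1:t-1})$ converges in probability to $p(y_t \mid y_{1:t-1})$, proving the consistency of the estimator.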
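As a worked illustration of the asymptotic variance in the delta-method step, the gradient of the ratio $g(a, b) = a / b$ can be written out explicitly. The shorthand $a = p(y_{1:t})$ and $b = p(y_{1:t-1})$ is introduced here only for readability and does not appear in the original proof; the expression assumes $b > 0$.

```latex
\nabla g(a, b) = \begin{bmatrix} \dfrac{1}{b} \\[2mm] -\dfrac{a}{b^{2}} \end{bmatrix},
\qquad
\nabla g^{\top} \Sigma \nabla g
  = \frac{\sigma_t^{2}}{b^{2}}
  - \frac{2 a\, \sigma_{t,t-1}}{b^{3}}
  + \frac{a^{2} \sigma_{t-1}^{2}}{b^{4}} .
```

Each term is finite whenever $\sigma_t^2$, $\sigma_{t-1}^2$ and $\sigma_{t,t-1}$ are finite and $b > 0$, which is why the limiting variance is well defined.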

APPENDIX B
PROOF OF THEOREM 2.1

Proof. It is well known that the importance weights produced in a run of SMC yield an unbiased estimator of $p(y_{1:t})$ [18]

$$
E[\hat{p}(y_{1:t})] = p(y_{1:t})
\tag{45}
$$

where $\hat{p}(y_{1:t}) = \prod_{j=1}^{t} \frac{1}{N} \sum_{i=1}^{N} w_j^i$. We can apply Jensen's inequality to obtain

$$
\log p(y_{1:t}) \ge E[\log \hat{p}(y_{1:t})].
\tag{46}
$$

Expanding both sides of (46)

$$
\log p(y_t \mid y_{1:t-1}) + \log p(y_{1:t-1}) \ge E[\log \hat{p}(y_t \mid y_{1:t-1})] + E[\log \hat{p}(y_{1:t-1})].
\tag{47}
$$


Figure 3. Winner-Take-All Spiking Neural Network. (A) 5 trials of training data. The neuronal activity was drawn over a 25 sec time window. Each row represents one neuron. Each dot indicates that the neuron fired within that time bin. (B) Inference. The top-left panel shows the inferred latent trajectories of several trials. In each trial the network started at the initial state, eventually reached either choice (indicated by the color) after the stimulus onset, and finally returned to the neighborhood of the initial state after receiving the reset signal. The remaining three panels show the phase portraits of the inferred dynamical system, revealing the bifurcation and how the network dynamics were governed at different phases of the experiment. In the spontaneous phase (when the network receives no external input), the latent state is attracted by the middle sink. After the stimulus onset, the middle sink disappears and the latent state falls to either side, driven by noise, to form a choice. At the reset onset, the latent state is pushed back to the only sink, which is close to the middle sink of the spontaneous phase, and then the network is ready for a new trial. (C) Simulation from the fitted model. We generated latent trajectories and spike trains by replicating the experiments on the fitted model. The result shows that the model can replicate the dynamics of the target network.

Figure 4. Analog Nonlinear Oscillator Circuit.

Figure 5. Analog oscillator. (A) Observations for training (3000 steps). (B) Predicted observations (500 steps). (C) Velocity field of inferred dynamics; color indicates speed from high to low.

Subtracting $\log p(y_{1:t-1})$ from both sides gives

$$
\log p(y_t \mid y_{1:t-1}) \ge E[\log \hat{p}(y_t \mid y_{1:t-1})] + E[\log \hat{p}(y_{1:t-1})] - \log p(y_{1:t-1}).
\tag{48}
$$


Letting $R_t(N) = \log p(y_{1:t-1}) - E[\log \hat{p}(y_{1:t-1})]$, where $N$ is the number of samples, we get

$$
\log p(y_t \mid y_{1:t-1}) \ge E[\log \hat{p}(y_t \mid y_{1:t-1})] - R_t(N),
\tag{49}
$$

where by Jensen's inequality (46), $R_t(N) \ge 0$ for all values of $N$. By the continuous mapping theorem [28],

$$
\lim_{N \to \infty} E[\log \hat{p}(y_{1:t-1})] = \log p(y_{1:t-1}).
\tag{50}
$$

As a consequence, $\lim_{N \to \infty} E[R_t(N)] = 0$. By the same logic, and leveraging that $\hat{p}(y_t \mid y_{1:t-1})$ is a consistent estimator for $p(y_t \mid y_{1:t-1})$, we get that

$$
\lim_{N \to \infty} E[\log \hat{p}(y_t \mid y_{1:t-1})] = \log p(y_t \mid y_{1:t-1}).
\tag{51}
$$

Thus $\mathcal{L}_t$ will get arbitrarily close to $\log p(y_t \mid y_{1:t-1})$ as $N \to \infty$.
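The role of Jensen's inequality in this argument can be checked numerically on a toy problem. The sketch below does not use SMC or the paper's model: it assumes a one-step Gaussian model, $x \sim \mathcal{N}(0, 1)$ and $y \mid x \sim \mathcal{N}(x, 1)$, for which $p(y) = \mathcal{N}(y; 0, 2)$ is known, and uses plain importance sampling from the prior. It estimates the gap $\log p(y) - E[\log \hat{p}(y)]$ for different numbers of samples $N$.

```python
import numpy as np

def mean_log_estimate(y, n_particles, n_reps, rng):
    """Monte Carlo estimate of E[log p_hat(y)] for an N-sample estimator."""
    x = rng.standard_normal((n_reps, n_particles))          # samples from the prior
    w = np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2 * np.pi)    # weights N(y; x, 1)
    return np.mean(np.log(w.mean(axis=1)))

def jensen_gap(y, n_particles, n_reps=2000, seed=0):
    """log p(y) - E[log p_hat(y)], which is nonnegative by Jensen's inequality."""
    rng = np.random.default_rng(seed)
    log_p = -0.5 * np.log(4 * np.pi) - y ** 2 / 4.0          # log N(y; 0, 2)
    return log_p - mean_log_estimate(y, n_particles, n_reps, rng)

for n in (1, 10, 100):
    print(n, jensen_gap(1.0, n))
```

The estimated gap is clearly positive for $N = 1$ and shrinks toward zero as $N$ grows (up to Monte Carlo error in the outer average), which mirrors the behaviour of $R_t(N)$ in the proof.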

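APPENDIX C
PROOF OF COROLLARY 2.1.1

Proof. The implicit smoothing distribution that arises from performing SMC [22] is defined as

$$
q(x_{1:t}) = p(x_{1:t}, y_{1:t})\, E\!\left[\frac{1}{\hat{p}(y_{1:t})}\right].
\tag{52}
$$

We can obtain the implicit filtering distribution by marginalizing out $x_{1:t-1}$ from (52)

$$
\begin{aligned}
q(x_t \mid y_{1:t})
&= \int p(x_{1:t}, y_{1:t})\, E\!\left[\frac{1}{\hat{p}(y_{1:t})}\right] dx_{1:t-1}, \\
&= p(x_t, y_{1:t})\, E\!\left[\frac{1}{\hat{p}(y_{1:t})}\right], \\
&= p(x_t, y_t \mid y_{1:t-1})\, E\!\left[\hat{p}(y_t \mid y_{1:t-1})^{-1}\, \frac{p(y_{1:t-1})}{\hat{p}(y_{1:t-1})}\right].
\end{aligned}
\tag{53}
$$

In [22], [36], it was shown that

$$
\log p(y_{1:t}) \ge E_{q(x_{1:t})}[\log p(x_{1:t}, y_{1:t}) - \log q(x_{1:t})] \ge E[\log \hat{p}(y_{1:t})].
\tag{54}
$$

Rearranging terms in (54), we get

$$
\log p(y_t \mid y_{1:t-1}) \ge \tilde{\mathcal{L}}_t \ge \mathcal{L}_t,
\tag{55}
$$

where

$$
\begin{aligned}
\tilde{\mathcal{L}}_t = {} & E_{q(x_{1:t})}[\log p(x_t, y_t \mid y_{1:t-1}, x_{1:t-1}) - \log q(x_t \mid x_{1:t-1})] \\
& + D_{KL}[q(x_{1:t-1}) \,\|\, p(x_{1:t-1}, y_{1:t-1})] - \log p(y_{1:t-1}).
\end{aligned}
\tag{56}
$$

By Theorem 2.1, we know that $\lim_{N \to \infty} \mathcal{L}_t = \log p(y_t \mid y_{1:t-1})$, and thus

$$
\lim_{N \to \infty} \tilde{\mathcal{L}}_t = \log p(y_t \mid y_{1:t-1}).
\tag{57}
$$

Leveraging Theorem 1 from [22] we have

$$
\lim_{N \to \infty} D_{KL}[q(x_{1:t-1}) \,\|\, p(x_{1:t-1}, y_{1:t-1})] = \log p(y_{1:t-1})
\tag{58}
$$

which implies that

$$
\lim_{N \to \infty} q(x_{1:t-1}) = p(x_{1:t-1}) \quad \text{a.e.}
\tag{59}
$$

Thus, plugging this into (57),

$$
\begin{aligned}
\log p(y_t \mid y_{1:t-1})
&= \int q(x_{1:t}) \log \frac{p(x_t, y_t \mid y_{1:t-1}, x_{1:t-1})}{q(x_t \mid x_{1:t-1})}\, dx_{1:t} \\
&= \int q(x_{1:t}) \log \frac{p(x_{1:t} \mid y_{1:t})\, p(y_t \mid y_{1:t-1})}{q(x_t \mid x_{1:t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}\, dx_{1:t} \\
&= \log p(y_t \mid y_{1:t-1}) + \int q(x_{1:t}) \log \frac{p(x_t \mid x_{1:t-1}, y_{1:t})}{q(x_t \mid x_{1:t-1})}\, dx_{1:t}
\end{aligned}
\tag{60}
$$

which is true iff $q(x_t \mid x_{1:t-1}) = p(x_t \mid x_{1:t-1}, y_{1:t})$ a.e. Thus, by Lebesgue's dominated convergence theorem [28],

$$
\begin{aligned}
\lim_{N \to \infty} \int q(x_t \mid x_{1:t-1})\, dx_{1:t-1}
&= \int \lim_{N \to \infty} q(x_t \mid x_{1:t-1})\, dx_{1:t-1} \\
&= p(x_t \mid y_{1:t}).
\end{aligned}
\tag{61}
$$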

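APPENDIX D
SYNTHETIC NASCAR® DYNAMICS

An rSLDS [51] with 4 discrete states was used to generate the synthetic NASCAR® track. The linear dynamics for each hidden state were

$$
A_1 = \begin{bmatrix} \cos(\theta_1) & -\sin(\theta_1) \\ \sin(\theta_1) & \cos(\theta_1) \end{bmatrix},
\qquad
A_2 = \begin{bmatrix} \cos(\theta_2) & -\sin(\theta_2) \\ \sin(\theta_2) & \cos(\theta_2) \end{bmatrix},
$$

and $A_3 = A_4 = I$. The affine terms were $B_1 = -(A_1 - I) c_1$ ($c_1 = [2, 0]^\top$), $B_2 = -(A_2 - I) c_2$ ($c_2 = [-2, 0]^\top$), $B_3 = [0.1, 0]^\top$ and $B_4 = [-0.35, 0]^\top$. The hyperplanes, $R$, and biases, $r$, were defined as

$$
R = \begin{bmatrix} 100 & 0 \\ -100 & 0 \\ 0 & 100 \end{bmatrix},
\qquad
r = \begin{bmatrix} -200 \\ -200 \\ 0 \end{bmatrix}.
$$

A state noise of $Q = 0.001 I$ was used.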
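The affine term $B_1 = -(A_1 - I) c_1$ makes the first regime a pure rotation about the center $c_1$, which is what produces the curved end of the track. A small sketch checks this property numerically; the rotation angle `theta1` is a hypothetical value, since the text does not state $\theta_1$, and the discrete-state switching and state noise are left out.

```python
import numpy as np

def rotation(theta):
    """2-by-2 rotation matrix with the same layout as A_1 and A_2."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

theta1 = -np.pi / 24            # hypothetical angle, not given in the text
A1 = rotation(theta1)
c1 = np.array([2.0, 0.0])
B1 = -(A1 - np.eye(2)) @ c1     # affine term from the appendix

def step(x):
    """One noiseless step of the first regime: x -> A_1 x + B_1."""
    return A1 @ x + B1

x = np.array([3.0, 1.0])
print(step(c1))                                                  # the center is a fixed point
print(np.linalg.norm(x - c1), np.linalg.norm(step(x) - c1))      # distance to c1 is preserved
```

Since $A_1 x + B_1 = A_1 (x - c_1) + c_1$, the map rotates $x$ around $c_1$ by $\theta_1$ per step; the same reasoning applies to the second regime with $c_2$.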

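APPENDIX E
PREDICTION USING SVMC-GP

Let $\bar{w}_t^i = w_t^i / \sum_{\ell} w_t^{\ell}$ be the self-normalized importance weights. At time $t$, given a test point $x_*$, we can approximately sample from the predictive distribution

$$
\begin{aligned}
p(f_* \mid x_*, y_{1:t})
&= \int p(f_* \mid x_*, z_t)\, p(z_t \mid y_{1:t})\, dz_t \\
&= \int p(f_* \mid x_*, z_t)\, p(z_t \mid x_{0:t})\, p(x_{0:t} \mid y_{1:t})\, dz_t\, dx_{0:t} \\
&= \int p(f_* \mid x_*, x_{0:t})\, p(x_{0:t} \mid y_{1:t})\, dx_{0:t} \\
&\approx \sum_{i=1}^{N} \bar{w}_t^i\, p(f_* \mid x_*, x_{0:t}^i) \\
&= \sum_{i=1}^{N} \bar{w}_t^i\, \mathcal{N}(v_*^i, \Sigma_*^i)
\end{aligned}
\tag{62}
$$

where

$$
v_*^i = m(x_*) + A_* \mu_t^i,
\tag{63}
$$

$$
\Sigma_*^i = C_* + A_* \Gamma_t^i A_*^\top
\tag{64}
$$

where $A_* = K_{*z} K_{zz}^{-1}$ and $C_* = K_{**} - K_{*z} K_{zz}^{-1} K_{z*} + Q$. The approximate predictive distribution is a mixture of SGPs, allowing for a much richer approximation to the predictive distribution. Equipped with (62), we can approximate the mean of the predictive distribution, $\mu_{f_*}$, as

$$
\begin{aligned}
\mu_{f_*}
&= \int f_*\, p(f_* \mid x_*, y_{1:t})\, df_* \\
&\approx \int f_* \sum_{i=1}^{N} \bar{w}_t^i\, p(f_* \mid x_*, x_{0:t}^i)\, df_* \\
&= \sum_{i=1}^{N} \bar{w}_t^i \int f_*\, p(f_* \mid x_*, x_{0:t}^i)\, df_* \\
&= \sum_{i=1}^{N} \bar{w}_t^i\, E_i[f_*] \\
&= \sum_{i=1}^{N} \bar{w}_t^i\, v_*^i = \hat{\mu}_{f_*}
\end{aligned}
\tag{65}
$$

where $E_i[\cdot] = E_{p(f_* \mid x_*, x_{0:t}^i)}[\cdot]$. Similarly, we can also approximate the covariance of the predictive distribution, $\Sigma_{f_*}$:

$$
\begin{aligned}
\Sigma_{f_*}
&= \int (f_* - \mu_{f_*})(f_* - \mu_{f_*})^\top p(f_* \mid x_*, y_{1:t})\, df_* \\
&\approx \sum_{i=1}^{N} \bar{w}_t^i \int (f_* - \mu_{f_*})(f_* - \mu_{f_*})^\top p(f_* \mid x_*, x_{0:t}^i)\, df_* \\
&= \sum_{i=1}^{N} \bar{w}_t^i\, E_i[(f_* - \mu_{f_*})(f_* - \mu_{f_*})^\top] \\
&= \sum_{i=1}^{N} \bar{w}_t^i \big(E_i[f_* f_*^\top] - E_i[f_*]\, \mu_{f_*}^\top - \mu_{f_*} E_i[f_*]^\top + \mu_{f_*} \mu_{f_*}^\top\big) \\
&= \sum_{i=1}^{N} \bar{w}_t^i \big(\Sigma_*^i + v_*^i v_*^{i\top} - v_*^i \mu_{f_*}^\top - \mu_{f_*} v_*^{i\top} + \mu_{f_*} \mu_{f_*}^\top\big) \\
&\approx \sum_{i=1}^{N} \bar{w}_t^i \big(\Sigma_*^i + v_*^i v_*^{i\top} - v_*^i \hat{\mu}_{f_*}^\top - \hat{\mu}_{f_*} v_*^{i\top} + \hat{\mu}_{f_*} \hat{\mu}_{f_*}^\top\big).
\end{aligned}
\tag{66}
$$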
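The moment formulas (65) and (66) amount to the mean and covariance of a weighted Gaussian mixture. A minimal NumPy sketch is given below; the function name `mixture_moments` and the array layout are illustrative choices, and the inputs stand in for the per-particle quantities $v_*^i$ and $\Sigma_*^i$, which would come from (63) and (64). The code uses the algebraically equivalent form $\sum_i \bar{w}_t^i (\Sigma_*^i + v_*^i v_*^{i\top}) - \hat{\mu}_{f_*} \hat{\mu}_{f_*}^\top$, which follows from (66) because the weights sum to one.

```python
import numpy as np

def mixture_moments(weights, means, covs):
    """Mean and covariance of a Gaussian mixture, following (65) and (66).

    weights: (N,) unnormalized importance weights
    means:   (N, d) component means v_*^i
    covs:    (N, d, d) component covariances Sigma_*^i
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # self-normalize the weights
    v = np.asarray(means, dtype=float)
    S = np.asarray(covs, dtype=float)
    mu = w @ v                                       # equation (65)
    outer = np.einsum('ij,ik->ijk', v, v)            # v_i v_i^T for every component
    second = np.einsum('i,ijk->jk', w, S + outer)    # weighted second moment
    cov = second - np.outer(mu, mu)                  # equivalent to the last line of (66)
    return mu, cov

# Two equally weighted 1-D components at -1 and +1 with unit variance.
mu, cov = mixture_moments([1.0, 1.0], [[-1.0], [1.0]], [[[1.0]], [[1.0]]])
print(mu, cov)
```

The between-component spread enters through the $v_*^i v_*^{i\top}$ terms, so the mixture covariance is at least as large as the weighted average of the component covariances.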

APPENDIX F
WINNER-TAKE-ALL SPIKING NEURAL NETWORK

In Figure 6 the mean-field reduction of the spiking network model is shown [56].

Figure 6. Mean field reduction of the Winner-Take-All spiking neural network. Panels: spontaneous phase and stimulus onset; fixed points are marked as sink or saddle.

REFERENCES

[1] S. Haykin and J. Principe, “Making sense of a complex world [chaotic events modeling],” IEEE Signal Processing Magazine, vol. 15, no. 3, pp. 66–81, May 1998.

[2] J. Ko and D. Fox, “GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models,” Autonomous Robots, vol. 27, no. 1, pp. 75–90, May 2009.

[3] C. L. C. Mattos, Z. Dai, A. Damianou, J. Forth, G. A. Barreto, and N. D. Lawrence, “Recurrent Gaussian processes,” International Conference on Learning Representations (ICLR), 2016.

[4] S. Roweis and Z. Ghahramani, Learning nonlinear dynamical systems using the expectation-maximization algorithm. John Wiley & Sons, Inc, 2001, pp. 175–220.

[5] D. Sussillo and O. Barak, “Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks,” Neural Computation, vol. 25, no. 3, pp. 626–649, Mar. 2013.

[6] R. Frigola, Y. Chen, and C. E. Rasmussen, “Variational Gaussian process state-space models,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, Montreal, Canada, 2014, pp. 3680–3688.

[7] B. C. Daniels and I. Nemenman, “Automated adaptive inference of phenomenological dynamical models,” Nature Communications, vol. 6, pp. 8133+, Aug. 2015.

[8] Y. Zhao and I. M. Park, “Interpretable nonlinear dynamic modeling of neural trajectories,” in Advances in Neural Information Processing Systems (NIPS), 2016.

[9] J. Nassar, S. Linderman, M. Bugallo, and I. M. Park, “Tree-structured recurrent switching linear dynamical systems for multi-scale modeling,” in International Conference on Learning Representations, 2019.

[10] Y. Ho and R. Lee, “A Bayesian approach to problems in stochastic estimation and control,” IEEE Transactions on Automatic Control, vol. 9, no. 4, pp. 333–339, Oct. 1964.

[11] S. Särkkä, Bayesian filtering and smoothing. Cambridge University Press, 2013.

[12] S. S. Haykin, Kalman filtering and neural networks. Wiley, 2001.

[13] E. A. Wan and R. Van Der Merwe, “The unscented Kalman filter for nonlinear estimation,” in Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373). IEEE, 2000, pp. 153–158.

[14] E. A. Wan and A. T. Nelson, Dual extended Kalman filter methods. John Wiley & Sons, Inc, 2001, pp. 123–173.

[15] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan, “Streaming variational Bayes,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 1727–1735.

[16] Y. Zhao and I. M. Park, “Variational joint filtering,” arXiv:1707.09049, 2017.

[17] A. Doucet and A. M. Johansen, “A tutorial on particle filtering and smoothing: Fifteen years later,” Handbook of Nonlinear Filtering, vol. 12, pp. 656–704, 2009.

[18] A. Doucet, N. de Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice. Springer Science & Business Media, Mar. 2013.

[19] J. Cornebise, E. Moulines, and J. Olsson, “Adaptive sequential Monte Carlo by means of mixture of experts,” Statistics and Computing, vol. 24, no. 3, pp. 317–337, 2014.

[20] S. S. Gu, Z. Ghahramani, and R. E. Turner, “Neural adaptive sequential Monte Carlo,” in Advances in Neural Information Processing Systems, 2015, pp. 2629–2637.

[21] P. Guarniero, A. M. Johansen, and A. Lee, “The iterated auxiliary particle filter,” Journal of the American Statistical Association, vol. 112, no. 520, pp. 1636–1647, 2017.


[22] C. Naesseth, S. Linderman, R. Ranganath, and D. Blei, “Variational sequential Monte Carlo,” in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Storkey and F. Perez-Cruz, Eds., vol. 84. Playa Blanca, Lanzarote, Canary Islands: PMLR, 09–11 Apr 2018, pp. 968–977.

[23] M. Titsias, “Variational learning of inducing variables in sparse Gaussian processes,” in Artificial Intelligence and Statistics, Apr. 2009, pp. 567–574.

[24] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.

[25] C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt, “Advances in variational inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[26] M. J. Wainwright, M. I. Jordan et al., “Graphical models, exponential families, and variational inference,” Foundations and Trends® in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.

[27] A. B. Owen, Monte Carlo theory, methods and examples, 2013.

[28] A. W. Van der Vaart, Asymptotic statistics. Cambridge University Press, 2000, vol. 3.

[29] T. Adali and S. Haykin, Adaptive signal processing: next generation solutions. John Wiley & Sons, 2010, vol. 55.

[30] S. Thrun, “Particle filters in robotics,” in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2002, pp. 511–518.

[31] A. Greenfield and A. Brockwell, “Adaptive control of nonlinear stochastic systems by particle filtering,” in 2003 4th International Conference on Control and Automation Proceedings. IEEE, 2003, pp. 887–890.

[32] F. Gustafsson, F. Gunnarsson, N. Bergman, U. Forssell, J. Jansson, R. Karlsson, and P.-J. Nordlund, “Particle filters for positioning, navigation, and tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 425–437, 2002.

[33] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” in IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2. IET, 1993, pp. 107–113.

[34] P. Bickel, B. Li, T. Bengtsson et al., “Sharp failure rates for the bootstrap particle filter in high dimensions,” in Pushing the limits of contemporary statistics: Contributions in honor of Jayanta K. Ghosh. Institute of Mathematical Statistics, 2008, pp. 318–329.

[35] I. Žliobaite, M. Pechenizkiy, and J. Gama, “An overview of concept drift applications,” in Big data analysis: new algorithms for a new society. Springer, 2016, pp. 91–114.

[36] T. A. Le, M. Igl, T. Rainforth, T. Jin, and F. Wood, “Auto-encoding sequential Monte Carlo,” in International Conference on Learning Representations, 2018.

[37] C. Cremer, Q. Morris, and D. Duvenaud, “Reinterpreting Importance-Weighted Autoencoders,” arXiv e-prints, Apr 2017.

[38] P. Del Moral, “Non-linear filtering: interacting particle resolution,” Markov Processes and Related Fields, vol. 2, no. 4, pp. 555–581, 1996.

[39] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” arXiv:1312.6114 [cs, stat], May 2014.

[40] T. Rainforth, A. R. Kosiorek, T. A. Le, C. J. Maddison, M. Igl, F. Wood, and Y. W. Teh, “Tighter variational bounds are not necessarily better,” arXiv preprint arXiv:1802.04537, 2018.

[41] R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications: With R Examples (Springer Texts in Statistics). Springer, 2010.

[42] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press Ltd, 2006.

[43] E. Snelson and Z. Ghahramani, “Sparse Gaussian Processes using Pseudo-inputs,” in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. C. Platt, Eds. MIT Press, 2006, pp. 1257–1264.

[44] A. Doucet, S. Godsill, and C. Andrieu, “On sequential Monte Carlo sampling methods for Bayesian filtering,” Statistics and Computing, vol. 10, no. 3, pp. 197–208, 2000.

[45] D. Xu, “Learning nonlinear state space models with Hamiltonian sequential Monte Carlo sampler,” 2019.

[46] J. Kocijan, Modelling and control of dynamic systems using Gaussian process models. Springer, 2016.

[47] H. Bijl, T. B. Schön, J.-W. van Wingerden, and M. Verhaegen, “System identification through online sparse Gaussian process regression with input noise,” IFAC Journal of Systems and Control, vol. 2, pp. 1–11, 2017.

[48] M. F. Huber, “Recursive Gaussian process: On-line regression and learning,” Pattern Recognition Letters, vol. 45, pp. 85–91, 2014.

[49] M. P. Deisenroth, R. D. Turner, M. F. Huber, U. D. Hanebeck, and C. E. Rasmussen, “Robust filtering and smoothing with Gaussian processes,” IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1865–1871, 2011.

[50] D. Sussillo and L. Abbott, “Generating coherent patterns of activity from chaotic neural networks,” Neuron, vol. 63, no. 4, pp. 544–557, 2009.

[51] S. Linderman, M. Johnson, A. Miller, R. Adams, D. Blei, and L. Paninski, “Bayesian learning and inference in recurrent switching linear dynamical systems,” in Artificial Intelligence and Statistics, 2017, pp. 914–922.

[52] O. Barak, D. Sussillo, R. Romo, M. Tsodyks, and L. F. Abbott, “From fixed points to chaos: three models of delayed discrimination,” Progress in Neurobiology, vol. 103, pp. 214–222, Apr. 2013.

[53] V. Mante, D. Sussillo, K. V. Shenoy, and W. T. Newsome, “Context-dependent computation by recurrent dynamics in prefrontal cortex,” Nature, vol. 503, no. 7474, pp. 78–84, Nov. 2013.

[54] S. Ganguli, J. W. Bisley, J. D. Roitman, M. N. Shadlen, M. E. Goldberg, and K. D. Miller, “One-dimensional dynamics of attention and decision making in LIP,” Neuron, vol. 58, no. 1, pp. 15–25, Apr. 2008.

[55] X.-J. Wang, “Probabilistic decision making by slow reverberation in cortical circuits,” Neuron, vol. 36, no. 5, pp. 955–968, Dec. 2002.

[56] K.-F. Wong and X.-J. Wang, “A recurrent network mechanism of time integration in perceptual decisions,” The Journal of Neuroscience, vol. 26, no. 4, pp. 1314–1328, Jan. 2006.

[57] N. Chopin et al., “Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference,” The Annals of Statistics, vol. 32, no. 6, pp. 2385–2411, 2004.