
Clustering Time Series with Nonlinear Dynamics: A Bayesian Non-Parametric and Particle-Based Approach

Alexander Lin* Yingzhuo Zhang* Jeremy Heng*† Stephen A. Allsop‡ Kay M. Tye§ Pierre E. Jacob* Demba Ba*

*Harvard University †ESSEC Business School ‡MIT, Picower Institute §Salk Institute

Abstract

We propose a general statistical framework for clustering multiple time series that exhibit nonlinear dynamics into an a-priori-unknown number of sub-groups. Our motivation comes from neuroscience, where an important problem is to identify, within a large assembly of neurons, subsets that respond similarly to a stimulus or contingency. Upon modeling the multiple time series as the output of a Dirichlet process mixture of nonlinear state-space models, we derive a Metropolis-within-Gibbs algorithm for full Bayesian inference that alternates between sampling cluster assignments and sampling parameter values that form the basis of the clustering. The Metropolis step employs recent innovations in particle-based methods. We apply the framework to clustering time series acquired from the prefrontal cortex of mice in an experiment designed to characterize the neural underpinnings of fear.

1 INTRODUCTION

In a data set comprising hundreds to thousands of neuronal time series (Brown et al., 2004), the ability to automatically identify sub-groups of time series that respond similarly to an exogenous stimulus or contingency can provide insights into how neural computation is implemented at the level of groups of neurons.

Existing methods for clustering multiple time series can be classified broadly into feature-based approaches and model-based ones. The former extract a set of features from each time series, followed by clustering in feature space using standard algorithms, e.g. Humphries (2011). While simple to implement, feature-based approaches cannot be used to perform statistical inference on the parameters of a physical model by which the time series are generated.

Previous model-based approaches for clustering multiple time series typically employ Bayesian mixtures of time series models. Examples have included GARCH models (Bauwens and Rombouts, 2007), INAR models (Roick et al., 2019), and TRCRP models (Saad and Mansinghka, 2018).

State-space models are a well-known, flexible class of models for time series data (Durbin and Koopman, 2012). Many existing model-based approaches for clustering time series use a mixture of linear Gaussian state-space models. Inoue et al. (2006) and Chiappa and Barber (2007) both consider the case of finite mixtures and use Gibbs sampling and variational Bayes respectively for posterior inference. Nieto-Barajas and Contreras-Cristán (2014) and Middleton (2014) use a Dirichlet process mixture to infer the number of clusters and Gibbs sampling for full posterior inference. In all cases, the linear-Gaussian assumption is crucial; it enables exact evaluation of the likelihood using a Kalman filter and the ability to sample exactly from the state sequences underlying each of the time series. For nonlinear and non-Gaussian state-space models, this likelihood cannot be evaluated in closed form and exact sampling is not possible.

We introduce a framework for clustering multiple time series that exhibit nonlinear dynamics into an a-priori-unknown number of clusters, each modeled as a nonlinear state-space model. We derive a Metropolis-within-Gibbs algorithm for inference in a Dirichlet process mixture of state-space models with linear-Gaussian states and binomial observations, a popular model in the analysis of neural spiking activity (Smith and Brown, 2003). The Metropolis step uses particle marginal Metropolis-Hastings (Andrieu et al., 2010), which requires likelihood estimates with small variance. We use controlled sequential Monte Carlo (Heng et al., 2017) to produce such estimates. We apply the framework to the clustering of 33 neural spiking time series acquired from the prefrontal cortex of mice in an experiment designed to characterize the neural underpinnings of fear. The framework produces a clustering of the neurons into groups that represent various degrees of neuronal signal modulation.

2 NONLINEAR TIME SERIES CLUSTERING MODEL

We begin by introducing the Dirichlet Process nonlinear State-Space Mixture (DPnSSM) model for clustering multiple time series exhibiting nonlinear dynamics.

2.1 DPnSSM

Let Y = {y^(1), ..., y^(N)} be a set of observed time series in which each series y^(n) = y^(n)_1, ..., y^(n)_T is a vector of length T. Following the framework of state-space models, we model y^(n) as the output of a latent, autoregressive process x^(n). For all n = 1, ..., N,

x^(n)_1 | θ^(n) ∼ h(x^(n)_1; θ^(n)),   (1)
x^(n)_t | x^(n)_{t-1}, θ^(n) ∼ f(x^(n)_{t-1}, x^(n)_t; θ^(n)), 1 < t,
y^(n)_t | x^(n)_t, θ^(n) ∼ g(x^(n)_t, y^(n)_t; θ^(n)), 1 ≤ t,

where θ^(n) denotes a set of parameters for series n, f is some state transition density, g is some state-dependent likelihood, and h is some initial prior.

The hidden parameters Θ = {θ^(1), ..., θ^(N)} form the basis of clustering Y; that is, if θ^(n) = θ^(n′), then series n and n′ belong to the same cluster. However, in many applications, the number of clusters itself may be unknown and, therefore, we choose to model the parameters as coming from a distribution Q that is sampled from a Dirichlet process (DP) with base distribution G and inverse-variance parameter α. Ferguson (1973) showed that Q is almost surely discrete and that the number of distinct values within N → ∞ draws from Q is random. Thus, the DP serves as a prior for discrete distributions over elements in the support of G. The overall objective is to infer the joint distribution of Θ | Y, α, G.

The Chinese Restaurant Process (CRP) representation of the DP integrates out the intermediary distribution Q used to generate Θ (Neal, 2000). The CRP allows us to nicely separate the process of assigning a cluster (i.e. table) to each y^(n) from the process of choosing a hidden parameter (i.e. table value) for each cluster. This is similar to the finite mixture model, but we do not need to choose K, the number of clusters, a priori.

Under the CRP, we index the parameters by the cluster index k instead of the observation index n. Let z^(n) ∈ {1, ..., K} denote the cluster identity of series n and let θ^(k) denote the hidden parameters for cluster k. We formally define the model as follows,

z^(1), ..., z^(N), K | α ∼ CRP(α, N),   (2)
θ^(k) | G ∼ G, 1 ≤ k ≤ K,

and for all n = 1, ..., N,

x^(n)_1 | z^(n) = k, θ^(k) ∼ h(x^(n)_1; θ^(k)),
x^(n)_t | x^(n)_{t-1}, z^(n) = k, θ^(k) ∼ f(x^(n)_{t-1}, x^(n)_t; θ^(k)), 1 < t,
y^(n)_t | x^(n)_t, z^(n) = k, θ^(k) ∼ g(x^(n)_t, y^(n)_t; θ^(k)), 1 ≤ t.

Let Z = {z^(1), ..., z^(N)} and Θ = {θ^(1), ..., θ^(K)}. Figure 1 shows the graphical model for the DPnSSM. Extensions of the model and inference algorithm to handling multi-dimensional time series and finite mixtures can be found in Appendix A.
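As an illustration of the prior in Equation 2, the short Python sketch below draws cluster assignments from CRP(α, N) and one parameter vector per cluster from a base distribution G. It assumes the two-dimensional parameterization θ^(k) = [μ^(k), log σ^(k)] and the particular G reported later in Section 4.2.2 (with N(0, 2) read as variance 2); it is a toy illustration, not the released implementation.

import numpy as np

def sample_crp(alpha, N, rng):
    # Draw cluster assignments z^(1), ..., z^(N) from CRP(alpha, N)
    z = np.zeros(N, dtype=int)
    counts = []                                  # counts[k] = number of series in cluster k
    for n in range(N):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                     # join an existing cluster or open a new one
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        z[n] = k
    return z, len(counts)

rng = np.random.default_rng(0)
z, K = sample_crp(alpha=1.0, N=25, rng=rng)
# theta^(k) = [mu^(k), log sigma^(k)] per cluster, drawn from the base distribution G
# (here: mu ~ N(0, 2), interpreted as variance 2, and log sigma ~ Unif(-15, 0), cf. Section 4.2.2)
theta = np.column_stack([rng.normal(0.0, np.sqrt(2.0), size=K),
                         rng.uniform(-15.0, 0.0, size=K)])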

Figure 1: The graphical model representation of the DPnSSM. Observations are shown in grey, and states and parameters are shown in white. For simplicity, we omit the dependencies that are assumed in the DPnSSM between Y and (Z, Θ).

2.2 Point Process State-Space Model

While the DPnSSM is defined for generic nonlinear state-space models, in this paper we focus on a state-space model commonly used for neural spike rasters.

Consider an experiment with R successive trials, during which we record the activity of N neuronal spiking units. For each trial, let (0, T̄] be the continuous observation interval following the delivery of an exogenous stimulus at time τ = 0. For each trial r = 1, ..., R and neuron n = 1, ..., N, let S^(n)_r be the total number of spikes from neuron n during trial r, and the sequence 0 < τ^(n)_{r,1} < ... < τ^(n)_{r,S^(n)_r} correspond to the times at which events from the neuronal unit occur. We assume that {τ^(n)_{r,s}}_{s=1}^{S^(n)_r} is the realization in (0, T̄] of a stochastic point process with counting process N^(n)_r(τ) = ∫_0^τ dN^(n)_r(u), where dN^(n)_r(τ) is the indicator function in (0, T̄] of {τ^(n)_{r,s}}_{s=1}^{S^(n)_r}. A point process is fully characterized by its conditional intensity function (CIF) (Vere-Jones, 2003). Assuming all trials are i.i.d. realizations of the same point process, the CIF λ^(n)(τ | H^(n)_τ) of dN^(n)_r(τ) for r = 1, ..., R is

λ^(n)(τ | H^(n)_τ) = lim_{Δ→0} p(N^(n)_r(τ + Δ) − N^(n)_r(τ) = 1 | H^(n)_τ) / Δ,   (3)

where H^(n)_τ is the event history of the point process up to time τ for neuron n. Suppose we sample dN^(n)_r(τ) at a resolution of δ to yield a binary event sequence. We denote by {ΔN^(n)_{t,r}}_{t=1,r=1}^{T,R} the discrete-time process obtained by counting the number of events in T = ⌊T̄/Δ⌋ disjoint bins of width Δ = M · δ, where M ∈ ℕ. Given x^(n)_0 and σ^(n)_0, a popular approach is to encode a discrete-time representation of the CIF {λ^(n)_t}_{t=1}^T within an autoregressive process that underlies a binomial state-space model with observations {ΔN^(n)_{t,r}}_{t=1,r=1}^{T,R} (Smith and Brown, 2003):

x^(n)_1 | μ^(n) ∼ N(x^(n)_0 + μ^(n), σ^(n)_0),   (4)
x^(n)_t | x^(n)_{t-1}, σ^(n) ∼ N(x^(n)_{t-1}, σ^(n)), 1 < t,
p^(n)_t = λ^(n)_t · δ = exp(x^(n)_t) / (1 + exp(x^(n)_t)), 1 ≤ t,
y^(n)_t = Σ_{r=1}^R ΔN^(n)_{t,r} ∼ Bin(R · M, p^(n)_t), 1 ≤ t,

where θ^(n) = [μ^(n), log σ^(n)]^⊤ are the parameters of interest. We can cluster the neuronal units by these parameters by assuming that they arise from the Dirichlet process mixture of Equation 2, in which θ^(k) = [μ^(k), log σ^(k)]^⊤ = θ^(n) for all n such that z^(n) = k. The parameter μ^(k) describes the extent to which the exogenous stimulus modulates the response of the neuron: a positive value of μ^(k) indicates excitation, a negative value indicates inhibition, and a value close to zero indicates no response. The state transition density f imposes a stochastic smoothness constraint on the CIF of neuron n, where σ^(k) controls the degree of smoothness. A small value of σ^(k) suggests that the neurons exhibit a sustained change in response to the stimulus, whereas a large value of σ^(k) indicates that the change is unsustained. With respect to the DPnSSM, the goal is to cluster the neurons according to the extent of the initial response μ^(k) and how sustained the response is, σ^(k).
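As a concrete illustration of Equation 4, the following sketch simulates one neuron's latent random walk and aggregated spike counts for a given (μ, log σ). It is a minimal example under the assumption that σ_0 and σ are the variances of the Gaussian initial and transition densities; the function name and defaults are illustrative only.

import numpy as np

def simulate_neuron(mu, log_sigma, T=300, R=45, M=5, x0=-3.0, sigma0=1e-10, rng=None):
    # Simulate one series from the binomial state-space model of Equation 4,
    # treating sigma0 and sigma as variances of the Gaussian densities.
    rng = rng if rng is not None else np.random.default_rng()
    sigma = np.exp(log_sigma)
    x = np.empty(T)
    x[0] = rng.normal(x0 + mu, np.sqrt(sigma0))      # jump of size mu at stimulus onset
    for t in range(1, T):
        x[t] = rng.normal(x[t - 1], np.sqrt(sigma))  # Gaussian random-walk smoothness prior
    p = 1.0 / (1.0 + np.exp(-x))                     # p_t = lambda_t * delta via the logistic link
    y = rng.binomial(R * M, p)                       # counts aggregated over the R trials
    return x, y

x, y = simulate_neuron(mu=1.0, log_sigma=-6.0)       # e.g. an excited, fairly sustained neuron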

3 INFERENCE ALGORITHM

For conducting posterior inference on the DPnSSM, we introduce a Metropolis-within-Gibbs sampling procedure inspired by Algorithm 8 from Neal (2000). We derive the following process for alternately sampling (1) the cluster assignments Z | Θ, Y, α, G and (2) the cluster parameters Θ | Z, Y, α, G. A summary of the inference algorithm is given in Algorithm 1. Outputs are samples (Z⟨i⟩, Θ⟨i⟩) for iterations i = 1, 2, ..., I.

For any set S = {s^(1), ..., s^(J)}, we denote set S without the j-th element as S^(−j) = S \ {s^(j)}.

3.1 Sampling Cluster Assignments

For a given time series n ∈ {1, ..., N}, we sample its cluster assignment from the distribution:

p(z^(n) | Z^(−n), Θ, Y, α, G)   (5)
  ∝ p(z^(n) | Z^(−n), Θ, α, G) · p(Y | Z, Θ, α, G)
  ∝ p(z^(n) | Z^(−n), α) · p(y^(n) | z^(n), Θ, G).

The first term in Equation 5 can be represented by the categorical distribution,

p(z^(n) = k) = N^(k) / (N − 1 + α),      k = 1, ..., K′,   (6)
p(z^(n) = k) = (α/m) / (N − 1 + α),      k = K′ + 1, ..., K′ + m,

where K′ is the number of unique k in Z^(−n), N^(k) is the number of cluster assignments equal to k in Z^(−n), and m ≥ 1 is some integer algorithmic parameter. For brevity, we drop the conditioning on Z^(−n) and α.

The second term in Equation 5 is equivalent to the parameter likelihood p(y^(n) | θ^(k)), where θ^(k) is known if k ∈ {1, ..., K′}; otherwise, θ^(k) must first be sampled from G if k ∈ {K′ + 1, ..., K′ + m}. Since y^(n) is the output of a nonlinear state-space model, we must use particle methods to approximate this parameter likelihood. We employ a recently proposed method known as controlled sequential Monte Carlo (cSMC) to produce low-variance estimates of this likelihood for a fixed computational cost (Heng et al., 2017). We outline the basic premise behind cSMC in Section 3.3.

3.2 Sampling Cluster Parameters

For a given cluster k ∈ {1, ..., K}, we wish to sample from the distribution:

p(θ^(k) | Θ^(−k), Z, Y, α, G)   (7)
  ∝ p(θ^(k) | Θ^(−k), Z, α, G) · p(Y | Θ, Z, α, G)
  ∝ p(θ^(k) | G) · ∏_{n | z^(n)=k} p(y^(n) | θ^(k)).


The first term of Equation 7 is the probability density function of the base distribution, and the second term is a product of parameter likelihoods. Because the likelihood conditioned on class membership involves integration of the state sequence x^(n), and the prior G is on the parameters of the state sequence, marginalization destroys any conjugacy that might have existed between the state sequence prior and parameter priors.

To sample from the conditional posterior of parameters given cluster assignments, Middleton (2014) re-introduces the state sequence as part of his sampling algorithm for the linear Gaussian state-space case. We use an approach that obviates the need to re-introduce the state sequence and generalizes to scenarios where the prior on parameters and the state sequence may not have any conjugacy relationships. In particular, our sampler uses a Metropolis-Hastings step with proposal distribution r(θ′ | θ) to sample from the class conditional distribution of parameters given cluster assignments. This effectively becomes one iteration of the well-known particle marginal Metropolis-Hastings (PMMH) algorithm (Andrieu et al., 2010). To evaluate the second term of Equation 7 for PMMH, we once again choose to use cSMC (Section 3.3).
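A single PMMH update of θ^(k) can be sketched as follows, assuming the symmetric Gaussian random-walk proposal used later in Section 4.2.2 (so the r-terms in the acceptance ratio cancel) and with loglik_hat again standing in for a cSMC log-likelihood estimate; carrying over the stored estimate for the current value is standard PMMH practice.

import numpy as np

def pmmh_step(theta_k, loglik_cur, ys_cluster, loglik_hat, log_prior, rng, prop_std=0.5):
    # One particle marginal Metropolis-Hastings update of theta^(k) (Equation 7).
    # `loglik_cur` stores the estimate of sum_n log p(y^(n) | theta_k) carried over from
    # the previous iteration; `loglik_hat` is a placeholder for a cSMC estimator.
    theta_prop = theta_k + prop_std * rng.standard_normal(theta_k.shape)   # N(theta^(k), 0.25 I)
    loglik_prop = sum(loglik_hat(y, theta_prop) for y in ys_cluster)
    log_a = (log_prior(theta_prop) + loglik_prop) - (log_prior(theta_k) + loglik_cur)
    if np.log(rng.uniform()) < log_a:                # symmetric proposal, so r-terms cancel
        return theta_prop, loglik_prop               # accept
    return theta_k, loglik_cur                       # reject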

3.3 Controlled Sequential Monte Carlo

Controlled SMC is based on the idea that we can modify a state-space model in such a way that standard bootstrap particle filters (Doucet et al., 2001) give lower variance estimates while the likelihood of interest is kept unchanged. More precisely, the algorithm introduces a collection of positive and bounded functions ψ = {ψ_1, ..., ψ_T}, termed a policy, that alter the transition densities of the model in the following way:

h^ψ(x_1; θ) ∝ h(x_1; θ) · ψ_1(x_1),   (8)
f^ψ_t(x_{t-1}, x_t; θ) ∝ f(x_{t-1}, x_t; θ) · ψ_t(x_t), 1 < t.

To ensure that the likelihood associated with the modified model is the same as the original one, we introduce a modified version of the state-dependent likelihood g, denoted by g^ψ_1, ..., g^ψ_T. On the modified model defined by h^ψ, {f^ψ_t}_{t=2}^T, {g^ψ_t}_{t=1}^T, we can run a bootstrap particle filter and compute the likelihood estimator:

p̂^ψ(y | θ) = ∏_{t=1}^T ( (1/S) Σ_{s=1}^S g^ψ_t(x^s_t, y_t; θ) ),   (9)

where S is the number of particles and x^s_t is the s-th particle at time t. The policy ψ can be chosen so as to minimize the variance of the above likelihood estimator; the optimal policy minimizing that variance is denoted by ψ*.

When h, f are Gaussian and g is log-concave with respect to x_t (such as in the point-process state-space model of Equation 4), we can justify the approximation of ψ* with a series of Gaussian functions. This allows us to solve for h^ψ, {f^ψ_t}_{t=2}^T, {g^ψ_t}_{t=1}^T using an approximate backward recursion method that simply reduces to a sequence of constrained linear regressions. We provide a more rigorous treatment of the exact details in Appendix B.

Starting from an initial policy ψ⟨0⟩, we can thus run a first bootstrap particle filter and obtain an approximation ψ⟨1⟩ of ψ*. One can then iterate L times to obtain refined policies, and consequently, lower variance estimators of the likelihood. Our empirical testing demonstrates that cSMC can significantly outperform the standard BPF in both precision and efficiency, while keeping L very small. This justifies its use in the DPnSSM inference algorithm.
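For orientation, Equation 9 with the trivial policy ψ_t ≡ 1 reduces to the standard bootstrap particle filter. The sketch below implements that special case for the model of Equation 4; cSMC would replace the proposal and weight functions with their twisted counterparts while keeping the same product-form estimator. This is an illustrative stand-in, not the released implementation.

import numpy as np
from scipy.stats import binom

def bpf_loglik(y, mu, log_sigma, x0, n_draws, S=64, sigma0=1e-10, rng=None):
    # Bootstrap-particle-filter estimate of log p(y | theta) for the model of Equation 4,
    # i.e. Equation 9 with the trivial policy psi_t = 1. `n_draws` is the binomial size R*M.
    rng = rng if rng is not None else np.random.default_rng()
    sigma = np.exp(log_sigma)
    x = rng.normal(x0 + mu, np.sqrt(sigma0), size=S)                   # particles at t = 1
    loglik = 0.0
    for t in range(len(y)):
        if t > 0:
            x = rng.normal(x, np.sqrt(sigma))                          # propagate through f
        logw = binom.logpmf(y[t], n_draws, 1.0 / (1.0 + np.exp(-x)))   # weight by g
        c = logw.max()
        w = np.exp(logw - c)
        loglik += c + np.log(w.mean())                                 # log of the t-th factor in Eq. 9
        x = rng.choice(x, size=S, p=w / w.sum())                       # multinomial resampling
    return loglik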

Algorithm 1 InferDPnSSM(Y, α, G, m, r, I, Z⟨0⟩, Θ⟨0⟩)

1: for i = 1, ..., I do
2:   Let Z = Z⟨i−1⟩ and Θ = Θ⟨i−1⟩.
     // Sample cluster assignments.
3:   for n = 1, ..., N do
4:     Let K′ be the number of distinct k in Z^(−n).
5:     for k = 1, ..., K′ + m do
6:       Run cSMC to compute p(y^(n) | θ^(k)).
7:     end for
8:     Sample z^(n) | Z^(−n), Θ, Y, α, G.
9:   end for
10:  Let K be the number of distinct k in Z.
     // Sample cluster parameters.
11:  for k = 1, ..., K do
12:    Sample proposal θ′ ∼ r(θ′ | θ^(k)).
13:    for n ∈ {1, ..., N} such that z^(n) = k do
14:      Run cSMC to compute p(y^(n) | θ′).
15:    end for
16:    Let a = [p(θ′ | Θ^(−k), Z, Y, α, G) · r(θ^(k) | θ′)] / [p(θ^(k) | Θ^(−k), Z, Y, α, G) · r(θ′ | θ^(k))].
17:    Let θ^(k) = θ′ with probability min(a, 1).
18:  end for
19:  Let Z⟨i⟩ = Z and Θ⟨i⟩ = Θ.
20: end for
21: return (Z⟨1⟩, Θ⟨1⟩), ..., (Z⟨I⟩, Θ⟨I⟩)

4 RESULTS

We investigate the ability of the DPnSSM to cluster time series from simulated and real neuronal rasters.¹

4.1 Selecting Clusters

The output of Algorithm 1 is a set of Gibbs samples (Z⟨1⟩, Θ⟨1⟩), ..., (Z⟨I⟩, Θ⟨I⟩). Each sample (Z⟨i⟩, Θ⟨i⟩) may very well use a different number of clusters. The natural question that remains is how to select a single final clustering (Z*, Θ*) of our data from this output. There is a great deal of literature on answering this subjective question. We follow the work of Dahl (2006) and Nieto-Barajas and Contreras-Cristán (2014).

¹Python code for all experiments can be found at https://github.com/ds2p/state-space-mixture.

Each Gibbs sample describes a clustering of the time series; we therefore frame the objective as selecting the most representative sample from our output. To start, we take each Gibbs sample i and construct an N × N co-occurrence matrix Ω⟨i⟩ in which,

Ω⟨i⟩_(n,n′) = 1 if z^(n) = z^(n′) for z^(n), z^(n′) ∈ Z⟨i⟩,   (10)
Ω⟨i⟩_(n,n′) = 0 if z^(n) ≠ z^(n′) for z^(n), z^(n′) ∈ Z⟨i⟩.

This is simply a matrix in which the (n, n′) entry is 1 if series n and series n′ are in the same cluster for the i-th Gibbs sample and 0 otherwise. We then define Ω̄ = (I − B)^{-1} Σ_{i=B+1}^{I} Ω⟨i⟩ as the mean co-occurrence matrix, where B ≥ 1 is the number of pre-burn-in samples. This matrix summarizes information from the entire trace of Gibbs samples. The sample i* that we ultimately select is the one that minimizes the Frobenius distance to this matrix, i.e. i* = arg min_i ‖Ω⟨i⟩ − Ω̄‖_F. We use the corresponding assignments and parameters as the final clustering (Z*, Θ*) = (Z⟨i*⟩, Θ⟨i*⟩). The appeal of this procedure is that it makes use of global information from all the Gibbs samples, yet ultimately selects a single clustering produced by the model. If there are multiple Gibbs samples i_1, ..., i_J such that Ω⟨i*⟩ = Ω⟨i_1⟩ = ... = Ω⟨i_J⟩, then we redefine Θ* as a simple average, as explained in Appendix C.

4.2 Simulated Neural Spiking Data

We conduct a simulated experiment to test the ability of the DPnSSM to yield desired results in a setting in which the ground truth clustering is known.

4.2.1 Data Generation

We simulate N = 25 neuronal rasters that each record data for R = 45 trials over the time interval (−500, T̄] milliseconds (ms) before/after an exogenous stimulus is applied at 0 ms, where T̄ = 1500. For each trial, the resolution of the binary event sequence is δ = 1 ms. We create bins of size Δ = M · δ, where M = 5, and observe neuron n firing ΔN^(n)_{t,r} ≤ M times during the t-th discrete time interval (tΔ − Δ, tΔ] for trial r.

We use the following process for generating the simulated data: For each neuron n, the initial rate is independently drawn as λ^(n) ∼ Uniform(10, 15) Hz. Each neuron's type is determined by the evolution of its discrete-time CIF λ^(n)_t over time. We split the discrete-time intervals into three parts: t_1 = {−99, ..., 0}, t_2 = {1, ..., 50}, and t_3 = {51, ..., 300}. We generate five neurons from each of the following five types:

1. Excited, sustained neurons with rate λ^(n)_t = λ^(n) for t ∈ t_1; rate λ^(n)_t = λ^(n) · exp(1) for t ∈ t_2, t_3.

2. Inhibited, sustained neurons with rate λ^(n)_t = λ^(n) for t ∈ t_1; rate λ^(n)_t = λ^(n) · exp(−1) for t ∈ t_2, t_3.

3. Non-responsive neurons with rate λ^(n)_t = λ^(n) for t ∈ t_1, t_2, t_3.

4. Excited, unsustained neurons with rate λ^(n)_t = λ^(n) for t ∈ t_1; rate λ^(n)_t = λ^(n) · exp(1) for t ∈ t_2; rate λ^(n)_t = λ^(n) for t ∈ t_3.

5. Inhibited, unsustained neurons with rate λ^(n)_t = λ^(n) for t ∈ t_1; rate λ^(n)_t = λ^(n) · exp(−1) for t ∈ t_2; rate λ^(n)_t = λ^(n) for t ∈ t_3.

Following the point-process state-space model of Equation 4, which assumes i.i.d. trials, we simulate,

y^(n)_t = Σ_{r=1}^{R=45} ΔN^(n)_{t,r} ∼ Bin(R · M = 225, p^(n)_t),   (11)

where p^(n)_t = λ^(n)_t · δ for t = −99, ..., 300. These are the observations Y = {y^(1), ..., y^(N)} that are fed to the DPnSSM. The model is then tasked with figuring out the original ground-truth clustering.
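A sketch of this data-generating process is given below; δ is expressed in seconds so that λ^(n)_t · δ is a probability, and the names and structure are illustrative rather than the exact script used for the paper.

import numpy as np

def simulate_rasters(rng, R=45, M=5, delta=1e-3):
    # 25 simulated neurons, 5 per response type, with piecewise-constant rates over
    # bins t = -99, ..., 300 and binomial observations as in Equation 11.
    t = np.arange(-99, 301)
    # multiplicative rate modulation on (t1, t2, t3) for each of the five types
    mods = {
        "excited, sustained":     (1.0, np.e,       np.e),
        "inhibited, sustained":   (1.0, np.exp(-1), np.exp(-1)),
        "non-responsive":         (1.0, 1.0,        1.0),
        "excited, unsustained":   (1.0, np.e,       1.0),
        "inhibited, unsustained": (1.0, np.exp(-1), 1.0),
    }
    Y = []
    for mod1, mod2, mod3 in mods.values():
        for _ in range(5):
            lam0 = rng.uniform(10.0, 15.0)                       # baseline rate in Hz
            lam = np.where(t <= 0, lam0 * mod1,
                           np.where(t <= 50, lam0 * mod2, lam0 * mod3))
            p = lam * delta                                      # per-1ms firing probability
            Y.append(rng.binomial(R * M, p))                     # counts aggregated over trials
    return np.array(Y)                                           # shape (25, 400)

Y = simulate_rasters(np.random.default_rng(0))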

4.2.2 Modeling

In modeling these simulated data as coming from the DPnSSM, we employ the generative process specified by Equation 4; that is,

x^(n)_1 | μ^(k) ∼ N(x^(n)_0 + μ^(k), σ_0),   (12)
x^(n)_t | x^(n)_{t-1}, σ^(k) ∼ N(x^(n)_{t-1}, σ^(k)), 1 < t ≤ T,
y^(n)_t | x^(n)_t ∼ Bin(225, logistic(x^(n)_t)), 1 ≤ t ≤ T,

where the cluster parameters are θ^(k) = [μ^(k), log σ^(k)]^⊤.

The series are fed into Algorithm 1 with hyperparameter values α = 1, G[μ, log σ] = [N(0, 2), Unif(−15, 0)], and m = 5. For every series n, we compute the initial state x^(n)_0 = logit(1/500 · Σ_{t=−99}^{0} y^(n)_t) from the observations before the stimulus in that series. In addition, we let σ_0 = 10^{−10}, a very small value that forces any change in the latent state at t = 1 to be explained by the cluster parameter μ^(k). The initial clustering (Z⟨0⟩, Θ⟨0⟩) = (1, θ_0), where 1 is a vector of N ones denoting that every series begins in the same cluster and θ_0 is sampled from G. For the proposal r(θ′ | θ), we use a N(θ^(k), 0.25 · I) distribution, where I is the 2 × 2 identity matrix. We run the sampling procedure for I = 10,000 iterations and apply a burn-in of B = 1,000 samples. To compute likelihood estimates, we use L = 3 cSMC iterations and S = 64 particles.
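As a small illustration of this setup, the snippet below computes the initial state from pre-stimulus counts, following the 1/500 normalization stated above, and collects the hyperparameter values in one place; the dictionary keys are ad hoc names, not the interface of the released code.

import numpy as np

def initial_state(y_pre, norm=500.0):
    # x_0^(n) = logit(sum of pre-stimulus counts / 500), as stated in Section 4.2.2
    p0 = np.sum(y_pre) / norm
    return np.log(p0) - np.log1p(-p0)            # logit, the inverse of the logistic link

# Hyperparameter values reported in Section 4.2.2; sigma0 pins x_1 - x_0 to mu^(k)
hyper = dict(alpha=1.0, m=5, sigma0=1e-10, prop_cov=0.25 * np.eye(2),
             n_iters=10_000, burn_in=1_000, csmc_iters=3, n_particles=64)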


A heatmap of the resultant mean co-occurrence matrix Ω̄ (Equation 10) and the selected clustering Ω⟨i*⟩ can be found in Figure 2. The rows and columns of this matrix have been reordered to aid visualization of clusters along the diagonal of Ω̄. From this experiment, we can see that the DPnSSM inference algorithm is able to successfully recover the five ground-truth clusters. Appendix D presents some results on the robustness of the model to misspecification of the stimulus onset.

Table 1 summarizes the final cluster parameters Θ*. As one may expect, a highly positive μ*^(k) corresponds to neurons that are excited by the stimulus, while a highly negative μ*^(k) corresponds to neurons that are inhibited. The one cluster with μ*^(k) ≈ 0 corresponds to non-responsive neurons. With μ*^(k), the algorithm is able to approximately recover the true amount by which the stimulus increases or decreases the log of the firing rate/probability, which is stated in Section 4.2.1, i.e. +1 for k ∈ {1, 4}, −1 for k ∈ {2, 5} and 0 for k = 3. This provides an interpretation of the numerical value of μ^(k). Indeed, if exp(x^(n)_0) ≪ 1 and exp(x^(n)_1) ≪ 1, as is often the case when modeling neurons, then the expected increase in the log of the firing probability due to the stimulus is:

E[ log( logistic(x^(n)_1) / logistic(x^(n)_0) ) ] ≈ E[ log( exp(x^(n)_1) / exp(x^(n)_0) ) ] = μ^(k).   (13)

In addition, the values for log σ*^(k) in Table 1 reveal that the algorithm uses this dimension to correctly separate unsustained clusters from sustained ones. For k ∈ {1, 2, 3}, the algorithm infers smaller values of log σ*^(k) because the change in the firing rate is less variable after the stimulus has taken place, whereas the opposite is true for k ∈ {4, 5}. In summary, the DPnSSM is able to recover some of the key properties of the data in an unsupervised fashion, thereby demonstrating its utility on this toy example.

Table 1: Parameters for simulation, where θ*^(k) = [μ*^(k), log σ*^(k)]^⊤ for each θ*^(k) ∈ Θ*.

k    μ*^(k)    log σ*^(k)    Effect
1    +0.96     −10.88        Excited, Sustained
2    −0.92     −12.32        Inhibited, Sustained
3    +0.03     −10.04        Non-responsive
4    +1.04     −5.55         Excited, Unsustained
5    −0.89     −5.89         Inhibited, Unsustained

Figure 2: Heatmap of mean co-occurrence matrix Ω̄ for simulation results. Elements of the selected co-occurrence matrix Ω⟨i*⟩ that are equal to 1 are enclosed in green squares. Each square corresponds to a distinct cluster.

4.3 Real Neural Spiking Data

In addition to simulations, we produce clusterings on real-world neural spiking data collected in a fear-conditioning experiment designed to elucidate the nature of neural circuits that facilitate the associative learning of fear. The detailed experimental paradigm is described in Allsop et al. (2018). In short, an observer mouse observes a demonstrator mouse receive conditioned cue-shock pairings through a perforated transparent divider. The experiment consists of R = 45 trials. During the first 15 trials of the experiment, both the observer and the demonstrator hear an auditory cue at time τ = 0 ms. From trial 16 onwards, the auditory cue is followed by the delivery of a shock to the demonstrator at time τ = 10,000 ms, i.e. 10 seconds after the cue's administration.

The data are recorded from various neurons in the prefrontal cortex of the observer mouse. We apply our analysis to N = 33 neurons from this experiment that form a network hypothesized to be involved in the observational learning of fear. Our time interval of focus is (−500, T̄ = 1500] ms before/after the administration of the cue. The raster data comes in the form of {ΔN^(n)_{t,r}}_{t=1,r=1}^{T,R}, binned at a resolution of Δ = 5 ms with T = 300, where each ΔN^(n)_{t,r} ≤ M = 5.

We apply DPnSSM to identify various groups of responses in reaction to the auditory cue over time and over trials. A group of neurons that respond significantly after trial 16 can be interpreted as one that allows the observer to understand when the demonstrator is in distress.

4.3.1 Clustering Cue Data over Time

To cluster neurons by their cue responses over time, we collapse the raster for all neurons over the R = 45 trials. Thus, for neuron n, define y^(n)_t = Σ_{r=1}^R ΔN^(n)_{t,r}. We apply the exact same model as the one used for the simulations (Equation 12). We also use all of the same hyperparameter values, as detailed in Section 4.2.2.


A heatmap of Ω̄ along with demarcations of Ω⟨i*⟩ for this experiment can be found in Figure 3. Overall, five clusters are selected by the algorithm. Table 2 summarizes the chosen cluster parameters. Figure 4 shows two of the five clusters identified by the algorithm, namely those corresponding to k = 1 (Figure 4a) and k = 5 (Figure 4b). Figures for all other clusters can be found in Appendix E. Each of the figures was created by overlaying the rasters from neurons in the corresponding cluster. The fact that the overlaid rasters resemble the raster from a single unit (as opposed to random noise), with plausible parameter values in Table 2, indicates that the algorithm has identified a sensible clustering of the neurons.

The algorithm is able to successfully differentiate various types of responses to the cue as well as the variability of the responses. One advantage of not restricting the algorithm to a set number of classes a priori is that it can decide what number of classes best characterizes these data. In this case, the inference algorithm identifies five different clusters. We defer a scientific interpretation of this phenomenon to a later study.

Figure 3: Heatmap of mean co-occurrence matrix for cue data over time and selected clusters (green).

Table 2: Cluster parameters for cue data over time.

k    μ*^(k)    log σ*^(k)    # of Neurons
1    +0.54     −6.27         17
2    +0.03     −6.80         1
3    −0.07     −7.25         8
4    −0.70     −6.01         4
5    +1.21     −4.52         3

Figure 4: Overlaid raster plots of neuronal clusters with (a) moderately excited (μ*^(1) = 0.54) and somewhat sustained (log σ*^(1) = −6.27) responses to the cue; and (b) more excited (μ*^(5) = 1.21) and less sustained (log σ*^(5) = −4.52) responses to the cue. A black dot at (τ, r) indicates a spike from one of the neurons in the corresponding cluster at time τ during trial r. The vertical green line indicates cue onset.

4.3.2 Clustering Cue Data over Trials

We also apply DPnSSM to determine if neurons can be classified according to varying degrees of neuronal signal modulation when shock is delivered to another animal, as opposed to when there is no shock delivered. The shock is administered starting from the 16th trial onwards. Thus, to understand the varying levels of shock effect, we collapse the raster across the T = 300 time points (instead of the R = 45 trials, as was done in Section 4.3.1). In this setting, each y^(n)_r = Σ_{t=1}^T ΔN^(n)_{t,r} ∈ {0, 1, ..., 2000} represents the number of firings during the r-th trial. For each neuron n, let the initial state be x^(n)_0 = logit(1/15 · Σ_{r=1}^{15} y^(n)_r).

Then, we use the following state-space model:

x^(n)_16 ∼ N(x^(n)_0 + μ^(k), σ_0),   (14)
x^(n)_r | x^(n)_{r−1} ∼ N(x^(n)_{r−1}, σ^(k)), 16 < r ≤ R,
y^(n)_r | x^(n)_r ∼ Bin(2000, logistic(x^(n)_r)), 16 ≤ r ≤ R,

where once again the cluster parameters are ✓

(k) =[µ(k), log (k)]>. All other hyperparameter values arethe same as those listed in Section 4.2.2.
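A sketch of this trial-domain preprocessing follows. The text defines x^(n)_0 = logit(1/15 · Σ_{r=1}^{15} y^(n)_r); in the snippet the per-trial average is additionally divided by the 2000 Bernoulli draws per trial so that the argument of the logit is a probability, which is our reading of the normalization rather than something stated explicitly.

import numpy as np

def trial_domain_series(delta_N):
    # delta_N: (n_bins, R) counts for one neuron; with the full (-500, 1500] ms window at
    # Delta = 5 ms this is 400 x 45, so each per-trial total lies in {0, ..., 2000}.
    y_r = delta_N.sum(axis=0)                    # one count per trial r = 1, ..., 45
    # the text sets x_0 = logit(1/15 * sum over the first 15 trials); dividing by the 2000
    # draws per trial (an assumption) keeps the argument of the logit in (0, 1)
    p0 = y_r[:15].sum() / (15.0 * 2000.0)
    x0 = np.log(p0) - np.log1p(-p0)
    return y_r[15:], x0                          # observations for trials 16..45, initial state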

The corresponding heatmap, representative raster plots, and clustering results can be found in Figure 5, Figure 6, and Table 3, respectively. We speculate that the results suggest the existence of what we term empathy clusters, namely groups of neurons that allow an observer to understand when the demonstrator is in distress. We will explore the implications of these findings to the neuroscience of observational learning of fear in future work.


Figure 5: Heatmap of mean co-occurrence matrix for cue data over trials and selected clusters (green).

Table 3: Cluster parameters for cue data over trials.

k    μ*^(k)    log σ*^(k)    # of Neurons
1    +0.19     −5.29         8
2    −0.14     −4.40         11
3    −0.08     −2.38         2
4    +0.87     −2.95         10
5    −0.42     −0.41         2

4.4 Controlled SMC Versus BPF

Finally, we present some results on the advantages of using controlled sequential Monte Carlo over the bootstrap particle filter. Computing the parameter likelihood is a key task in Algorithm 1. In each iteration, we perform O(N · K) particle filter computations during the sampling of the cluster assignments and another O(N) particle filter computations during the sampling of the cluster parameters. Thus, for both the efficiency and precision of the algorithm, it is necessary to find a fast way to compute low-variance estimates.

Figure 7 demonstrates the benefits of using cSMC over BPF for likelihood evaluation for a fixed computational cost. Details on this experiment can be found in Appendix F. In some cases, cSMC estimates have variances that are several orders of magnitude lower than those produced by BPF. This is especially true for low values of the variability parameter log σ, which is crucial for this application since these are often the ones that maximize the parameter likelihood.
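The quantity plotted in Figure 7 can be estimated by repeating a likelihood evaluation many times at a fixed θ and taking the sample variance of the log-estimates, as in the sketch below, where loglik_hat may be either a BPF or a cSMC estimator (for example the bpf_loglik sketch from Section 3.3).

import numpy as np

def estimate_loglik_variance(y, theta, loglik_hat, n_reps=50, seed=0):
    # Monte Carlo estimate of v(mu, sigma) = Var[log p_hat(y | theta)], as in Figure 7.
    rngs = [np.random.default_rng(seed + i) for i in range(n_reps)]
    samples = np.array([loglik_hat(y, theta, rng) for rng in rngs])
    return samples.var(ddof=1)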

Figure 6: Overlaid raster plots of neuronal clusters with (a) slightly inhibited, variable responses (k = 2) and (b) very excited, variable responses (k = 4). The red line marks the first trial with shock administration.

Figure 7: Using BPF versus cSMC for parameter log-likelihood computation. For each method, we plot an estimate of the variance v(μ, σ) ≈ Var[log p̂(y | θ)], where θ = [μ, log σ]^⊤, over different values of (μ, σ).

5 CONCLUSION

We proposed a general framework to cluster time series with nonlinear dynamics modeled by nonlinear state-space models. To the best of the authors' knowledge, this is the first Bayesian framework for clustering time series that exhibit nonlinear dynamics. The backbone of the framework is the cSMC algorithm for low-variance evaluation of parameter likelihoods in nonlinear state-space models. We applied the framework to neural data in an experiment designed to elucidate the neural underpinnings of fear. We were able to identify potential clusters of neurons that allow an observer to understand when a demonstrator is in distress.

In future work, we plan to perform detailed analyses of the data from these experiments (Allsop et al., 2018), and the implications of these analyses on the neuroscience of the observational learning of fear in mice. We will also explore applications of our model to data in other application domains such as sports and sleep research (St Hilaire et al., 2017), to name a few.


Acknowledgements

Demba Ba thanks Amazon Web Services (AWS) for their generous support and access to computational resources, and the Harvard Data Science Initiative for their support. Pierre E. Jacob thanks the Harvard Data Science Initiative and the National Science Foundation (Grant DMS-1712872). Kay M. Tye thanks the McKnight Foundation, NIH (Grant R01-MH102441-01), and NCCIH (Pioneer Award DP1-AT009925).

References

Allsop, S. A., Wichmann, R., Mills, F., Burgos-Robles, A., Chang, C.-J., Felix-Ortiz, A. C., Vienne, A., Beyeler, A., Izadmehr, E. M., Glober, G., et al. (2018). Corticoamygdala transfer of socially derived information gates observational learning. Cell, 173(6):1329–1342.

Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342.

Bauwens, L. and Rombouts, J. (2007). Bayesian clustering of many GARCH models. Econometric Reviews, 26(2-4):365–386.

Brown, E. N., Kass, R. E., and Mitra, P. P. (2004). Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience, 7(5):456.

Chiappa, S. and Barber, D. (2007). Output grouping using Dirichlet mixtures of linear Gaussian state-space models. In Image and Signal Processing and Analysis, 2007. ISPA 2007. 5th International Symposium on, pages 446–451. IEEE.

Dahl, D. B. (2006). Model-based clustering for expression data via a Dirichlet process mixture model. Bayesian Inference for Gene Expression and Proteomics, 201:218.

Doucet, A., De Freitas, N., and Gordon, N. (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, pages 3–14. Springer.

Durbin, J. and Koopman, S. J. (2012). Time Series Analysis by State Space Methods, volume 38. Oxford University Press.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.

Guarniero, P., Johansen, A. M., and Lee, A. (2017). The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647.

Heng, J., Bishop, A. N., Deligiannidis, G., and Doucet, A. (2017). Controlled sequential Monte Carlo. arXiv preprint arXiv:1708.08396.

Humphries, M. D. (2011). Spike-train communities: finding groups of similar spike trains. Journal of Neuroscience, 31(6):2321–2336.

Inoue, L. Y., Neira, M., Nelson, C., Gleave, M., and Etzioni, R. (2006). Cluster-based network model for time-course gene expression data. Biostatistics, 8(3):507–525.

Middleton, L. (2014). Clustering time series: a Dirichlet process mixture of linear-Gaussian state-space models. Master's thesis, Oxford University, United Kingdom.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.

Nieto-Barajas, L. E. and Contreras-Cristán, A. (2014). A Bayesian nonparametric approach for time series clustering. Bayesian Analysis, 9(1):147–170.

Roick, T., Karlis, D., and McNicholas, P. D. (2019). Clustering discrete valued time series. arXiv preprint arXiv:1901.09249.

Saad, F. and Mansinghka, V. (2018). Temporally-reweighted Chinese restaurant process mixtures for clustering, imputing, and forecasting multivariate time series. In International Conference on Artificial Intelligence and Statistics, pages 755–764.

Smith, A. C. and Brown, E. N. (2003). Estimating a state-space model from point process observations. Neural Computation, 15(5):965–991.

St Hilaire, M. A., Ruger, M., Fratelli, F., Hull, J. T., Phillips, A. J., and Lockley, S. W. (2017). Modeling neurocognitive decline and recovery during repeated cycles of extended sleep and chronic sleep deficiency. Sleep, 40(1).

Vere-Jones, D. (2003). An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer.