A Bayesian Approach to Change Points Detection in Time Series


Ali Mohammad-Djafari and Olivier Féron
Laboratoire des Signaux et Systèmes, Unité mixte de recherche 8506 (CNRS-Supélec-UPS), Supélec, Plateau de Moulon, 3 rue Joliot-Curie, 91192 Gif-sur-Yvette, France.

Abstract. Change point detection in time series is an important area of research in statistics; it has a long history and many applications. However, very often change point analysis is only focused on changes in the mean value of some quantity in a process. In this work we consider time series with discrete point changes which may contain a finite number of changes of probability density functions (pdf). We focus on the case where the data in all segments are modeled by Gaussian probability density functions with different means, variances and correlation lengths. We put a prior law on the change point occurrences (Poisson process) as well as on these different parameters (conjugate priors) and give the expression of the posterior probability distributions of these change points. The computations are done by using an appropriate Markov Chain Monte Carlo (MCMC) technique.

The problem as we stated it can also be considered as an unsupervised classification and/or segmentation of the time series. This analogy gives us the possibility to propose alternative modelings and computations of the change points.

Keywords. Change point analysis, Bayesian classification and segmentation, time series analysis.

    1 Introduction

Figure 1 shows typical change point problems we consider in this work. Note that very often people consider problems in which there is only one change point [1]. Here we propose to consider more general problems with any number of change points. However, change point analysis problems very often call for online or real-time detection algorithms [2, 3, 4, 5], while here we focus only on offline methods, where we assume that we have gathered all the data and we want to analyze it to detect the change points that have occurred during the observation time. Also, even if we consider here change point estimation for 1-D time series, we can extend the proposed method to multivariate data, for example images, where the change point problem becomes equivalent to segmentation. One more point to position this work: very often the models used in change point problems assume that the model of the signal in each segment is perfectly known, i.e., a linear or nonlinear regression model [5, 6, 7, 8, 9], while here we use a probabilistic model for the signals in each segment, which probably gives more generality and applicability when we do not know those models perfectly.

Fig. 1: Change point problem descriptions. In the first row, only the mean values of the different segments differ. In the second row, only the variances change. In the third row, only the correlation strengths change. In the fourth row, the whole shape of the probability distribution changes (uniform, Gaussian, Gamma). The last row shows the change points $t_n$ on the time axis $t_0, t_1, t_2, \ldots, t_n, t_0+T$.

More specifically, we model the time series by a hierarchical Gauss-Markov model with hidden variables which are themselves modeled by a Markov model. Thus, in each segment, which corresponds to a particular value of the hidden variable, the time series is assumed to follow a stationary Gauss-Markov model. We chose a simple parametric model defined by only three parameters: a mean $\mu$, a variance $\sigma^2 = 1/\lambda$ and a parameter $\rho$ measuring the local correlation strength of neighboring samples.

The choice of the hidden variable is also important. We have studied three different modelings: i) the change point time instants $t_n$, ii) classification labels $z_n$, or iii) a Bernoulli variable $q_n$ which is always equal to zero except when a change point occurs.

The rest of the paper is organized as follows: In the next section we introduce the notations and fix the objectives of the paper. In section 3 we consider the model with explicit change point times as the hidden variables and propose a particular modeling for them and an MCMC algorithm to compute their a posteriori probabilities. In sections 4 and 5 we consider the two other aforementioned models. Finally, we show some simulation results and present our conclusions and perspectives.


    2 Notations, modeling and classical methods

We denote by $x = [x(t_0), \ldots, x(t_0+T)]$ the vector containing the data observed from time $t_0$ to $t_0+T$. We denote by $t = [t_1, \ldots, t_N]$ the unknown change points and write $x = [x_0, x_1, \ldots, x_N]$, where $x_n = [x(t_n), x(t_n+1), \ldots, x(t_{n+1})]$, $n = 0, \ldots, N$ represent the data samples in each segment. In the following we set $t_{N+1} = T$.

We model the data $x_n = [x(t_n), x(t_n+1), \ldots, x(t_{n+1})]$, $n = 0, \ldots, N$ in each segment by a Gauss-Markov chain:

$$p(x(t_n)) = \mathcal{N}(\mu_n, \sigma_n^2),$$
$$p(x(t_n+l) \mid x(t_n+l-1)) = \mathcal{N}\left(\rho_n\, x(t_n+l-1) + (1-\rho_n)\mu_n,\; \sigma_n^2(1-\rho_n^2)\right),$$

with $l = 1, \ldots, l_n - 1$, $l_n = t_{n+1} - t_n + 1 = \dim[x_n]$. (1)
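To make the segment model concrete, here is a minimal sketch of how data following Eq. (1) could be simulated; the function name and the parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_segment(mu, sigma2, rho, length, rng):
    """Draw one segment from the stationary Gauss-Markov model of Eq. (1)."""
    x = np.empty(length)
    x[0] = rng.normal(mu, np.sqrt(sigma2))           # x(t_n) ~ N(mu_n, sigma_n^2)
    for l in range(1, length):
        mean = rho * x[l - 1] + (1.0 - rho) * mu     # AR(1) pull toward the segment mean
        x[l] = rng.normal(mean, np.sqrt(sigma2 * (1.0 - rho**2)))
    return x

rng = np.random.default_rng(0)
# three segments differing only in their means, as in the first row of Fig. 1
data = np.concatenate([simulate_segment(m, 0.01, 0.5, 100, rng)
                       for m in (1.5, 1.7, 1.5)])
```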

Then we have

$$p(x_n) = p(x(t_n)) \prod_{l=1}^{l_n-1} p(x(t_n+l) \mid x(t_n+l-1))$$
$$\propto \exp\left[-\frac{1}{2\sigma_n^2}(x(t_n)-\mu_n)^2\right] \exp\left[-\frac{1}{2\sigma_n^2(1-\rho_n^2)} \sum_{l=1}^{l_n-1} \left(x(t_n+l) - \rho_n x(t_n+l-1) - (1-\rho_n)\mu_n\right)^2\right]$$
$$= \mathcal{N}(\mu_n \mathbf{1}, \Sigma_n) \quad \text{with} \quad \Sigma_n = \sigma_n^2\, \mathrm{Toeplitz}([1, \rho_n, \rho_n^2, \ldots, \rho_n^{l_n}]). \qquad (2)$$

Denoting by $t = [t_1, \ldots, t_N]$ the vector of the change points and assuming that the samples from any two segments are independent, we can write:

$$p(x \mid t, \theta, N) = \prod_{n=0}^{N} \mathcal{N}(\mu_n \mathbf{1}, \Sigma_n) = \prod_{n=0}^{N} |\Sigma_n|^{-1/2} (2\pi)^{-l_n/2} \exp\left[-\frac{1}{2}\sum_{n=0}^{N} (x_n - \mu_n \mathbf{1})' \Sigma_n^{-1} (x_n - \mu_n \mathbf{1})\right] \qquad (3)$$

where we noted $\theta = \{\mu_n, \sigma_n, \rho_n,\; n = 0, \ldots, N\}$.

Note that

$$\ln p(x \mid t, \theta, N) = -\sum_{n=0}^{N} \frac{l_n}{2}\ln(2\pi) - \frac{1}{2}\sum_{n=0}^{N} \ln|\Sigma_n| - \frac{1}{2}\sum_{n=0}^{N} (x_n - \mu_n \mathbf{1})' \Sigma_n^{-1} (x_n - \mu_n \mathbf{1}) \qquad (4)$$

and when the data are i.i.d. ($\Sigma_n = \sigma_n^2 I$) this becomes

$$\ln p(x \mid t, \theta, N) = -\frac{T}{2}\ln(2\pi) - \sum_{n=0}^{N} \frac{l_n}{2}\ln \sigma_n^2 - \sum_{n=0}^{N} \frac{\|x_n - \mu_n \mathbf{1}\|^2}{2\sigma_n^2}. \qquad (5)$$
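As an illustration, the i.i.d. log-likelihood (5) for a candidate change point configuration could be evaluated along the following lines (a sketch; the zero-based segment indexing is our convention):

```python
import numpy as np

def loglik_iid(x, cps, mus, sigma2s):
    """ln p(x | t, theta, N) of Eq. (5) in the i.i.d. case (Sigma_n = sigma_n^2 I).

    cps: interior change points [t_1, ..., t_N]; segment n runs over
    [t_n, t_{n+1}) with t_0 = 0 and t_{N+1} = len(x).
    """
    bounds = [0] + list(cps) + [len(x)]
    ll = -0.5 * len(x) * np.log(2 * np.pi)
    for n in range(len(bounds) - 1):
        seg = x[bounds[n]:bounds[n + 1]]
        ll -= 0.5 * len(seg) * np.log(sigma2s[n])
        ll -= np.sum((seg - mus[n]) ** 2) / (2 * sigma2s[n])
    return ll
```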

Then, the inference problems we will face are the following:

1. Learning: infer $\theta$ given a training set $x$ and $t$;

2. Supervised estimation: infer $t$ given $x$ and $\theta$;

3. Unsupervised estimation: infer $t$, or jointly $t$ and $\theta$, given $x$.


The classical maximum likelihood estimation (MLE) approach for these problems becomes:

- Estimate $\theta$ given $x$ and $t$ by $\hat{\theta} = \arg\max_\theta\, \{p(x \mid t, \theta)\}$;
- Estimate $t$ given $x$ and $\theta$ by $\hat{t} = \arg\max_t\, \{p(x \mid t, \theta)\}$;
- Estimate $t$ and $\theta$ given $x$ by $(\hat{t}, \hat{\theta}) = \arg\max_{(t,\theta)}\, \{p(x \mid t, \theta)\}$;
- Estimate $t$ given $x$ by $\hat{t} = \arg\max_t\, \{p(x \mid t)\}$ with $p(x \mid t) = \int p(x \mid t, \theta)\, d\theta$;
- Estimate $\theta$ given $x$ by $\hat{\theta} = \arg\max_\theta\, \{p(x \mid \theta)\}$ with $p(x \mid \theta) = \int p(x \mid t, \theta)\, dt$.

However, we must be careful to check the boundedness of the likelihood function before using any optimization algorithm. The optimization with respect to $\theta$ when $t$ is known can be done easily, but the optimization with respect to $t$ is very hard and computationally costly.

3 Bayesian estimation of the change point time instants

In the Bayesian approach, one assigns prior probability laws on both $t$ and $\theta$ and uses the posterior probability law $p(t, \theta \mid x)$ as a tool for doing any inference. Choosing a prior pdf for $t$ is also usual in the classical approach. A simple model is the following:

$$t_n = t_{n-1} + \epsilon_n \quad \text{with} \quad \epsilon_n \sim \mathcal{P}(\lambda), \qquad (6)$$

where the $\epsilon_n$ are assumed i.i.d. and $\lambda$ is the a priori mean value of the time intervals $(t_n - t_{n-1})$. If $N$ is the number of change points we can take $\lambda = \frac{T}{N+1}$. With this modeling we have:

$$p(t \mid \lambda) = \prod_{n=1}^{N+1} \mathcal{P}(t_n - t_{n-1} \mid \lambda) = \prod_{n=1}^{N+1} e^{-\lambda}\, \frac{\lambda^{(t_n - t_{n-1})}}{(t_n - t_{n-1})!}$$
$$\ln p(t \mid \lambda) = -(N+1)\lambda + \ln(\lambda) \sum_{n=1}^{N+1} (t_n - t_{n-1}) - \sum_{n=1}^{N+1} \ln\left((t_n - t_{n-1})!\right) \qquad (7)$$
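For illustration, the log-prior (7) could be computed as follows (a sketch; `gammaln(k + 1)` evaluates $\ln k!$):

```python
import numpy as np
from scipy.special import gammaln

def log_prior_t(cps, lam, T):
    """ln p(t | lambda) of Eq. (7): i.i.d. Poisson(lambda) intervals between change points."""
    bounds = np.concatenate(([0], cps, [T]))
    gaps = np.diff(bounds)                      # t_n - t_{n-1}, n = 1, ..., N+1
    return np.sum(-lam + gaps * np.log(lam) - gammaln(gaps + 1))
```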

With this prior selection, we have

$$p(x, t \mid \theta, N) = p(x \mid t, \theta, N)\, p(t \mid \lambda, N) \qquad (8)$$

and

$$p(t \mid x, \theta, N) \propto p(x \mid t, \theta, N)\, p(t \mid \lambda, N). \qquad (9)$$

In the Bayesian approach, one goes one step further by assigning prior probability laws to the hyperparameters $\theta$, i.e., $p(\theta)$, and then one writes the joint a posteriori law:

$$p(t, \theta \mid x, \lambda, N) \propto p(x \mid t, \theta, N)\, p(t \mid \lambda, N)\, p(\theta \mid N) \qquad (10)$$

where here we noted $\theta = \{\mu_n, \sigma_n^2, \rho_n,\; n = 0, \ldots, N\}$.


A classical choice for $p(\theta)$ is the conjugate priors, which in general results in:

- Gaussian pdfs $p(\mu_n) = \mathcal{N}(\mu_0, \sigma_0^2)$ for the position parameters $\mu_n$,
- Inverse Gamma (IG) pdfs $p(\sigma_n^2) = \mathcal{IG}(\alpha_0, \beta_0)$ for the variances $\sigma_n^2$, and
- Inverse Wishart (IW) pdfs $p(\Sigma_n) = \mathcal{IW}(\nu_0, \Lambda_0)$ for the covariance matrices $\Sigma_n$.

When the likelihood $p(x \mid t, \theta)$ and the priors $p(t \mid \lambda)$ and $p(\theta)$ are chosen and the expression of the posterior probability law $p(t, \theta \mid x)$ is obtained, one can do any inference on the unknown parameters $t$ and $\theta$ of the problem, separately or jointly. Two main approaches for the estimation are: the methods based on the computation of the modes (maximum a posteriori, MAP) of the different posterior probability laws, and the methods based on the computation of the means of the different posterior probability laws.

    3.1 MAP optimization based methods

The methods based on the computation of the modes of the different posterior probability laws result in the following optimization problems:

1. Learning: infer $\theta$ given a training set $x$ and $t$:

$$\hat{\theta} = \arg\max_\theta\, \{p(\theta \mid x, t)\} = \arg\max_\theta\, \{p(x \mid t, \theta)\, p(\theta)\} \qquad (11)$$

2. Supervised estimation: infer $t$ given $x$ and $\theta$:

$$\hat{t} = \arg\max_t\, \{p(t \mid x, \theta)\} = \arg\max_t\, \{p(x \mid t, \theta)\, p(t \mid \lambda)\} \qquad (12)$$

3. Unsupervised estimation: infer $t$ and $\theta$ jointly given $x$:

$$(\hat{t}, \hat{\theta}) = \arg\max_{(t,\theta)}\, \{p(t, \theta \mid x)\} = \arg\max_{(t,\theta)}\, \{p(x \mid t, \theta)\, p(t \mid \lambda)\, p(\theta)\} \qquad (13)$$

We can also first focus on the estimation of $\theta$ by integrating $t$ out of $p(t, \theta \mid x)$ to obtain

$$p(\theta \mid x) = \int p(t, \theta \mid x)\, dt = \left[\int p(x \mid t, \theta)\, p(t \mid \lambda)\, dt\right] p(\theta) = p(x \mid \theta)\, p(\theta) \qquad (14)$$

and then estimate $\theta$ by

$$\hat{\theta} = \arg\max_\theta\, \{p(\theta \mid x)\} = \arg\max_\theta\, \{p(x \mid \theta)\, p(\theta)\} \qquad (15)$$

and then use it as in the supervised estimation case:

$$\hat{t} = \arg\max_t\, p(t \mid x, \hat{\theta}) = \arg\max_t\, p(x \mid t, \hat{\theta})\, p(t \mid \lambda).$$


    3.2 Posterior mean and MCMC methods

The methods based on the computation of the posterior means result in integration problems. These integrations can rarely be done analytically, and often they are carried out with MCMC methods.

Here, we propose the following Gibbs sampling MCMC algorithm. Iterate until convergence:

- sample $t$ using $p(t \mid x, \theta, N)$;
- sample $\theta_n$:
  - $\mu_n$ using $p(\mu_n \mid x, t, N)$,
  - $\sigma_n^2$ using $p(\sigma_n^2 \mid x, t, N)$,
  - $\rho_n$ using $p(\rho_n \mid x, t, N)$.
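A minimal sketch of this loop is given below. The four conditional samplers are passed in as functions; their expressions are detailed in the rest of this section, and all names and the initialization are illustrative assumptions.

```python
import numpy as np

def gibbs_sampler(x, N, n_iter, rng, sample_t, sample_mu, sample_sigma2, sample_rho):
    """Skeleton of the proposed Gibbs sampler; the conditional samplers
    for t, mu_n, sigma_n^2 and rho_n are supplied by the caller."""
    T = len(x)
    # initialize with equally spaced change points and neutral parameters
    t = np.linspace(0, T, N + 2, dtype=int)[1:-1]
    theta = {"mu": np.zeros(N + 1), "sigma2": np.ones(N + 1), "rho": np.zeros(N + 1)}
    samples = []
    for _ in range(n_iter):
        t = sample_t(x, theta, N, rng)                           # t ~ p(t | x, theta, N)
        for n in range(N + 1):
            theta["mu"][n] = sample_mu(x, t, theta, n, rng)          # mu_n
            theta["sigma2"][n] = sample_sigma2(x, t, theta, n, rng)  # sigma_n^2
            theta["rho"][n] = sample_rho(x, t, theta, n, rng)        # rho_n
        samples.append((t.copy(), {k: v.copy() for k, v in theta.items()}))
    return samples
```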

In the following, we give some details on the expressions of these posterior laws and the sampling algorithms which we implemented.

First, note that, thanks to the conjugacy, we have:

$$p(\mu_n \mid x, t) = \mathcal{N}(\hat{\mu}_n, \hat{\sigma}_n^2) \quad \text{with} \quad \hat{\mu}_n = \hat{\sigma}_n^2 \left(\frac{\mu_0}{\sigma_0^2} + \mathbf{1}' \Sigma_n^{-1} x_n\right), \quad \hat{\sigma}_n^2 = \left(\mathbf{1}' \Sigma_n^{-1} \mathbf{1} + \frac{1}{\sigma_0^2}\right)^{-1},$$

$$p(\sigma_n^2 \mid x, t) = \mathcal{IG}(\hat{\alpha}_n, \hat{\beta}_n) \quad \text{with} \quad \hat{\alpha}_n = \alpha_0 + \frac{l_n}{2}, \quad \hat{\beta}_n = \beta_0 + \frac{1}{2}(x_n - \mu_n \mathbf{1})' R_n^{-1} (x_n - \mu_n \mathbf{1}),$$

where $R_n = \mathrm{Toeplitz}([1, \rho_n, \rho_n^2, \ldots, \rho_n^{l_n}])$. Thus, these posterior laws are classical ones and generating samples from them is quite simple and easy.
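In code, these two conjugate draws might look as follows (a sketch assuming the segment data `x_seg` and the precomputed matrices $\Sigma_n^{-1}$ and $R_n^{-1}$ are available):

```python
import numpy as np

def draw_mu(x_seg, Sigma_inv, mu0, sigma0_2, rng):
    """mu_n ~ N(mu_hat, sigma_hat^2), the conjugate posterior given above."""
    one = np.ones(len(x_seg))
    sigma_hat2 = 1.0 / (one @ Sigma_inv @ one + 1.0 / sigma0_2)
    mu_hat = sigma_hat2 * (mu0 / sigma0_2 + one @ Sigma_inv @ x_seg)
    return rng.normal(mu_hat, np.sqrt(sigma_hat2))

def draw_sigma2(x_seg, R_inv, mu, alpha0, beta0, rng):
    """sigma_n^2 ~ IG(alpha_hat, beta_hat), sampled via the inverse of a Gamma draw."""
    r = x_seg - mu
    alpha_hat = alpha0 + len(x_seg) / 2.0
    beta_hat = beta0 + 0.5 * r @ R_inv @ r
    return beta_hat / rng.gamma(alpha_hat)   # X ~ IG(a, b)  <=>  b / Gamma(a, 1) ~ X
```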

$p(\rho_n \mid x, t)$ is not a classical law, but we can write its expression, which is given by:

$$p(\rho \mid x, t, N) = \prod_{n=0}^{N} p(\rho_n \mid x_n, t, N)$$
$$\propto \prod_{n=0}^{N} \left[\frac{1}{\sigma_n^2(1-\rho_n^2)}\right]^{l_n/2} \exp\left[-\frac{(x_n - \mu_n \mathbf{1})' R_n^{-1} (x_n - \mu_n \mathbf{1})}{2\sigma_n^2(1-\rho_n^2)}\right]$$
$$\propto \prod_{n=0}^{N} \left[\frac{1}{\sigma_n^2(1-\rho_n^2)}\right]^{l_n/2} \exp\left[-\frac{\sum_{l=1}^{l_n-1} \left(x(t_n+l) - \rho_n x(t_n+l-1) - (1-\rho_n)\mu_n\right)^2}{2\sigma_n^2(1-\rho_n^2)}\right] \qquad (16)$$

As we can see, this is not a classical probability density and we do not have a simple way to generate samples from it. The solution we propose is to use, in this step, a Metropolis-Hastings algorithm for sampling from this density. As an instrumental density we propose a Gaussian approximation of the posterior density, i.e., we estimate the mean $m_{\rho_n}$ and the variance $\sigma_{\rho_n}^2$ of $p(\rho_n \mid x, t, N)$ and use the Gaussian law $\mathcal{N}(m_{\rho_n}, \sigma_{\rho_n}^2)$ to propose a sample. This sample is accepted or rejected according to $p(\rho_n \mid x, t, N)$. In practice we compute $m_{\rho_n}$ and $\sigma_{\rho_n}^2$ by numerical approximation of their definitions:

$$m_{\rho_n} \approx \int_0^1 \rho_n\, p(\rho_n \mid x, t, N)\, d\rho_n, \qquad \sigma_{\rho_n}^2 \approx \int_0^1 \rho_n^2\, p(\rho_n \mid x, t, N)\, d\rho_n - m_{\rho_n}^2.$$
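One possible implementation of this Metropolis-Hastings step is sketched below; the grid-based fit of the instrumental Gaussian and the vectorized `log_target` (the unnormalized log of $p(\rho_n \mid x, t, N)$) are our assumptions.

```python
import numpy as np

def draw_rho(log_target, rho_curr, rng):
    """One Metropolis-Hastings step for rho_n, with the Gaussian instrumental
    density N(m, s2) fitted by normalizing the target on a grid over (0, 1)."""
    grid = np.linspace(0.001, 0.999, 500)
    logp = log_target(grid)                     # log p(rho_n | x, t, N), up to a constant
    w = np.exp(logp - logp.max())
    w /= w.sum()
    m = np.sum(w * grid)                        # approximate posterior mean
    s2 = np.sum(w * grid**2) - m**2             # approximate posterior variance
    prop = rng.normal(m, np.sqrt(s2))
    if not 0.0 < prop < 1.0:
        return rho_curr                         # proposals outside (0, 1) are rejected
    log_q = lambda r: -0.5 * (r - m)**2 / s2    # log of the Gaussian proposal density
    log_a = (log_target(np.array([prop]))[0] - log_target(np.array([rho_curr]))[0]
             + log_q(rho_curr) - log_q(prop))
    return prop if np.log(rng.uniform()) < log_a else rho_curr
```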


Samples from $p(t \mid x, \theta, N)$ can be obtained by a method based on a recursion on the change points. An approximation of this method makes it possible to obtain an algorithm whose computational cost is linear in the number of observations [10].

The main idea behind this algorithm is to compute the conditional probability laws $p(t_j \mid t_{j-1}, x)$, which permit generating samples of the $t_j$ recursively. To give the main relations, we first denote $x_{t:s} = [x(t), x(t+1), \ldots, x(s)]$ and define the following probabilities:

$$R(t, s) = p(x_{t:s} \mid t, s \text{ in the same segment})$$
$$Q(t) = p(x_{t:T} \mid \text{change point at } t-1), \qquad Q(1) = p(x).$$

Let also $g(t_j - t_{j-1})$ denote the a priori density of the interval between two change points, and $G(t)$ the associated distribution function. Then one can show that the posterior distribution of $t_j$ given $t_{j-1}$ is

$$p(t_j \mid t_{j-1}, x) = \frac{R(t_{j-1}, t_j)\, Q(t_j + 1)\, g(t_j - t_{j-1})}{Q(t_{j-1})} \qquad (17)$$

and the posterior probability of no further change point is given by

$$p(t_j = T \mid t_{j-1}, x) = \frac{R(t_{j-1}, T)\left(1 - G(T - t_{j-1} - 1)\right)}{Q(t_{j-1})}. \qquad (18)$$
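Assuming the quantities $R$, $Q$, $g$ and $G$ have been precomputed as in [10], one draw of $t_j$ given $t_{j-1}$ could be sketched as follows (normalizing over the support replaces the explicit division by $Q(t_{j-1})$):

```python
import numpy as np

def sample_next_cp(t_prev, T, R, Q, g, G, rng):
    """Draw t_j from Eqs. (17)-(18), assuming R(t, s), Q(t), g(d) and G(d)
    are available as functions; returns T when no further change point occurs."""
    support = np.arange(t_prev + 1, T)          # candidate positions for t_j
    probs = np.array([R(t_prev, tj) * Q(tj + 1) * g(tj - t_prev) for tj in support])
    p_end = R(t_prev, T) * (1.0 - G(T - t_prev - 1))   # Eq. (18), up to 1/Q(t_{j-1})
    all_p = np.append(probs, p_end)
    all_p /= all_p.sum()                        # normalize instead of dividing by Q(t_prev)
    idx = rng.choice(len(all_p), p=all_p)
    return T if idx == len(support) else support[idx]
```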

Thus, we have all the necessary expressions for generating samples $(t, \theta)^{(1)}, (t, \theta)^{(2)}, \ldots$ from the joint posterior law $p(t, \theta \mid x)$. Note, however, that we need to generate a great number of those samples to achieve the convergence of the Markov chain. When this convergence is achieved, we can use the final samples to compute any statistic such as the mean or the median, or take as the final output the most frequently generated samples. The main advantage, however, is to use these samples to build their histograms, which are good representatives of their marginal posterior probabilities.

    4 Other formulations

Other formulations also exist. We introduce two sets of indicator variables $z = [z(t_0), \ldots, z(t_0+T)]$ and $q = [q(t_0), \ldots, q(t_0+T)]$, where

$$q(t) = \begin{cases} 1 & \text{if } z(t) \neq z(t-1) \\ 0 & \text{elsewhere} \end{cases} = \begin{cases} 1 & \text{if } t = t_n,\; n = 0, \ldots, N \\ 0 & \text{elsewhere.} \end{cases} \qquad (19)$$
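As a small illustration of (19), the indicator sequence $q$ can be derived from the labels $z$ as follows (a sketch):

```python
import numpy as np

def q_from_z(z):
    """Indicator q(t) of Eq. (19): 1 where the label changes, 0 elsewhere."""
    z = np.asarray(z)
    q = np.zeros(len(z), dtype=int)
    q[1:] = (z[1:] != z[:-1]).astype(int)
    return q

print(q_from_z([1, 1, 1, 2, 2, 3, 3]))   # -> [0 0 0 1 0 1 0]
```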

Thus, $q$ can be modeled by a Bernoulli process:

$$P(Q = q) = \lambda^{\sum_j q_j} (1-\lambda)^{\sum_j (1 - q_j)} = \lambda^{\sum_j q_j} (1-\lambda)^{T - \sum_j q_j}$$


and $z$ can be modeled by a Markov chain, i.e., $\{z(t), t = 1, \ldots, T\}$ forms a Markov chain:

$$P(z(t) = k) = p_k, \quad k = 1, \ldots, K, \qquad P(z(t) = k \mid z(t-1) = l) = p_{kl}, \quad \text{with} \quad \sum_k p_{kl} = 1. \qquad (20)$$
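A short sketch of simulating labels from this chain follows; note that we store the transition probabilities row-wise, `P[l, k]` $= P(z(t) = k \mid z(t-1) = l)$, which is our convention, not the paper's.

```python
import numpy as np

def simulate_labels(p0, P, T, rng):
    """Simulate z(1..T) from the Markov chain of Eq. (20).
    p0[k] = P(z(1) = k); P[l, k] = P(z(t) = k | z(t-1) = l), rows summing to 1."""
    z = np.empty(T, dtype=int)
    z[0] = rng.choice(len(p0), p=p0)
    for t in range(1, T):
        z[t] = rng.choice(P.shape[1], p=P[z[t - 1]])
    return z
```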

In the multivariate case, or more precisely in the bivariate case (image processing), $q$ may represent the contours and $z$ the labels of the regions in the image. Then we may also give Markov models for them. For example, if we denote by $r \in S$ the position of a pixel, $S$ the set of pixel positions and $\mathcal{V}(r)$ the set of pixels in the neighborhood of the pixel position $r$, we may use an Ising model for $q$:

$$P(Q = q) \propto \exp\left[\alpha \sum_{r \in S} \sum_{s \in \mathcal{V}(r)} \delta(q(r) - q(s))\right] \qquad (21)$$

or a Potts model for $z$:

$$P(z) \propto \exp\left[\alpha \sum_{r \in S} \sum_{s \in \mathcal{V}(r)} \delta(z(r) - z(s))\right]. \qquad (22)$$

Other more complex modelings are also possible.

With these auxiliary variables, we can write

$$p(x \mid z, \theta) = \prod_{n=1}^{N} P(z_j = n)\, \mathcal{N}(\mu_n \mathbf{1}, \Sigma_n) = \prod_{n=1}^{N} p_n\, \mathcal{N}(\mu_n \mathbf{1}, \Sigma_n) \qquad (23)$$

if we choose $K = N$. Here, $\theta = \{N, \{\mu_n, \Sigma_n, p_n,\; n = 1, \ldots, N\}, (p_{kl},\; k, l = 1, \ldots, N)\}$ and the model is a mixture of Gaussians.

We can again assign appropriate prior laws on $\theta$, give the expression of $p(z, \theta \mid x)$ and do any inference on $(z, \theta)$.

Finally, we can also use $q$ as the auxiliary variable and write

$$p(x \mid q, \theta) = (2\pi)^{-N/2} \prod_{n=1}^{N} \frac{1}{\sigma_n} \exp\left[-\frac{1}{2\sigma_n^2} \sum_{n=1}^{N} (x(t_n) - \mu_n)^2\right] \times (2\pi)^{-(T-N)/2} \prod_{n=1}^{N} \left(\frac{1}{\sigma_n}\right)^{l_n - 1} \exp\left[-\frac{1}{2\sigma_n^2} \sum_{j=1}^{T} (1 - q_j)(x_j - x_{j-1})^2\right]$$
$$= (2\pi)^{-T/2} \prod_{n=1}^{N} \left(\frac{1}{\sigma_n}\right)^{l_n} \exp\left[-\frac{1}{2\sigma_n^2} \sum_{j=1}^{T} \left((1 - q_j)(x_j - x_{j-1})^2 + q_j (x_j - \mu_n)^2\right)\right] \qquad (24)$$

and again assign an appropriate prior law on $\theta$, give the expression of $p(q, \theta \mid x)$ and do any inference on $(q, \theta)$. We are still working on using these auxiliary hidden variables, particularly for applications in data fusion in image processing, and we will report on these works very soon.


    5 Simulation results

To test the feasibility and measure the performance of the proposed algorithms, we generated a few simple cases, each corresponding to changes of only one of the three parameters $\mu_n$, $\sigma_n^2$ and $\rho_n$. In each case we present the data and the histograms of the a posteriori samples of $t$ during the first and the last iterations of the MCMC algorithm. For each case we also give the values of the parameters used to simulate the data, the values estimated when the change points are known and the values estimated by the proposed method.

    5.1 Change of the means

As we can see in Fig. 2, we obtain precise results on the positions of the change points. In the case of a change of means, the algorithm converges very quickly to the right solution: it needs only a few iterations (about 5). The main cause of this result is the importance of the means in the likelihood $p(x \mid t, \theta, N)$. We can also see in Table 1 that the estimates of the means are very precise, particularly when the segment is long.

Fig. 2: Change in the means. Top to bottom: simulated data, histogram at the 50th iteration, histogram at the first iteration, real positions of the change points ($t_0, t_1, t_2, t_3, t_4, t_0+T$).

Table 1: True means $m$ and their estimates (with the corresponding posterior variances) when the change points are known ($\cdot \mid x, t$) and unknown ($\cdot \mid x$).

$m$   | $\hat{m} \mid x, t$ | $\hat{\sigma}^2 \mid x, t$ | $\hat{m} \mid x$ | $\hat{\sigma}^2 \mid x$
1.5   | 1.4966 | 0.0015 | 1.4969 | 0.0013
1.7   | 1.7084 | 0.0017 | 1.7013 | 0.0038
1.5   | 1.4912 | 0.0020 | 1.5015 | 0.0045
1.7   | 1.6940 | 0.0014 | 1.6929 | 0.0016
1.9   | 1.9012 | 0.0015 | 1.8915 | 0.0039


    5.2 Change in the variances

We can see in Fig. 3 that we again have good results on the positions of the change points. However, for small differences between variances, the algorithm gives an uncertainty on the exact position of the change point. This can be justified by the fact that the simulated data themselves carry this uncertainty. In Table 2 we again see good estimates of the variances in each segment.

Fig. 3: Change in the variances. Top to bottom: simulated data, histogram at the 50th iteration, histogram at the first iteration, real positions of the change points ($t_0, t_1, t_2, t_3, t_4, t_0+T$).

Table 2: True variances $\sigma^2$ and their estimates when the change points are known ($\cdot \mid x, t$) and unknown ($\cdot \mid x$).

$\sigma^2$ | $\hat{\sigma}^2 \mid x, t$ | $\hat{\sigma}^2 \mid x$
0.01   | 0.0083 | 0.0081
1      | 0.9918 | 0.9598
0.001  | 0.0007 | 0.0026
0.1    | 0.0945 | 0.0940
0.01   | 0.0079 | 0.0107

    5.3 Change in the correlation coefficient

The results shown in Fig. 4 are worse than in the first two cases. The positions of the change points are less precise, and we can see that a spurious change point appears. This affects the estimation of the correlation coefficient in the third segment, because the algorithm alternates between two positions of the change point. This problem can be explained by the fact that a correlation coefficient near 1 implies locally a change of the mean, which can be interpreted by the algorithm as a change point. This problem also appears when the sizes of the segments are far from the a priori size $\lambda$.


Fig. 4: Change in the correlation coefficient. Top to bottom: simulated data, histogram at the 50th iteration, histogram at the first iteration, real positions of the change points ($t_0, t_1, t_2, t_3, t_4, t_0+T$).

Table 3: True correlation coefficients $\rho$ and their estimates.

$\rho$ | $\hat{\rho} \mid x$
0    | 0.0988
0.9  | 0.7875
0.1  | 0.3737
0.8  | 0.8071
0.2  | 0.1710

    5.4 Influence of the prior law

In this section we study the influence of the a priori value of $\lambda$, i.e., the a priori size of the segments. In the following we fix the number of change points as before and change the a priori size of the segments to $\lambda_0 = \lambda/2$ and $\lambda_1 = 2\lambda$. We then apply our algorithm to the case of a change in the correlation coefficient.


Fig. 5: Different correlation coefficients with $\lambda_0 = \frac{1}{2}\frac{T}{N+1}$. Top to bottom: simulated data, histogram at the 50th iteration, histogram at the first iteration, real positions of the change points ($t_0, t_1, t_2, t_3, t_4, t_0+T$).

Fig. 6: Different correlation coefficients with $\lambda_1 = 2\frac{T}{N+1}$. Top to bottom: simulated data, histogram at the 50th iteration, histogram at the first iteration, real positions of the change points ($t_0, t_1, t_2, t_3, t_4, t_0+T$).


In Fig. 5 we can see that the algorithm has detected additional change points, forming segments whose size is near $\lambda_0$. This result shows the importance of the a priori when the data are not significant enough. We can see the same conclusion in Fig. 6, where only three change points are detected, forming segments whose size is again near $\lambda_1$. We also remark that fixing an a priori segment size amounts to fixing the number of change points. Our algorithm therefore gives good results provided we have a good a priori on the number of change points.

    6 Conclusions

In this paper, we first presented a Bayesian approach for estimating change points in time series. The main advantage of this approach is to give, at each time instant $t$, the probability that a change point has occurred at that point. Then, based on the posterior probabilities, we can not only give the time instants with the highest probabilities, but also give indications on the amount of those changes via the values of the estimated parameters. In the second part, we focused on a piecewise Gaussian model and studied changes in the means, in the variances and in the correlation coefficients of the different segments. As a conclusion, we could show that the detection of change points due to changes in the mean is easier than the detection of those due to changes in the variances or in the correlation coefficients. In this work we assumed the number $N$ of change points to be known, but the proposed Bayesian approach can be extended to estimate this number too; we are investigating the estimation of the number of change points in the same framework. We also studied the role of the a priori parameter $\lambda$ on the results. Finally, we showed that other modelings, using hidden variables other than the change point time instants, are also possible and are under investigation. We are also investigating the extension of this work to image processing (2-D signals), where the change points become contours.

    References

[1] M. Basseville, "Detecting changes in signals and systems: a survey," Automatica, vol. 24, no. 3, pp. 309-326, 1988.

[2] M. Wax, "Detection and localization of multiple sources via the stochastic signals model," IEEE Transactions on Signal Processing, vol. 39, pp. 2450-2456, November 1991.

[3] J. J. Kormylo and J. M. Mendel, "Maximum-likelihood detection and estimation of Bernoulli-Gaussian processes," IEEE Transactions on Information Theory, vol. 28, pp. 482-488, 1982.

[4] C. Y. Chi, J. Goutsias, and J. M. Mendel, "A fast maximum-likelihood estimation and detection algorithm for Bernoulli-Gaussian processes," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, (Tampa, FL), pp. 1297-1300, April 1985.

[5] J. K. Goutsias and J. M. Mendel, "Optimal simultaneous detection and estimation of filtered discrete semi-Markov chains," IEEE Transactions on Information Theory, vol. 34, pp. 551-568, 1988.


[6] J. J. Oliver, R. A. Baxter, and C. S. Wallace, "Unsupervised learning using MML," in Machine Learning: Proceedings of the Thirteenth International Conference (ICML 96), pp. 364-372, Morgan Kaufmann Publishers, 1996.

[7] J. P. Hughes, P. Guttorp, and S. P. Charles, "A non-homogeneous hidden Markov model for precipitation occurrence," Applied Statistics, vol. 48, no. 1, pp. 15-30, 1999.

[8] L. J. Fitzgibbon, L. Allison, and D. L. Dowe, "Minimum message length grouping of ordered data," in Algorithmic Learning Theory, 11th International Conference, ALT 2000, Sydney, Australia, December 2000, Proceedings, vol. 1968, pp. 56-70, Springer, Berlin, 2000.

[9] L. Fitzgibbon, D. L. Dowe, and L. Allison, "Change-point estimation using new minimum message length approximations," in Proceedings of the Seventh Pacific Rim International Conference on Artificial Intelligence (PRICAI-2002) (M. Ishizuka and A. Sattar, eds.), vol. 2417 of LNAI, (Berlin), pp. 244-254, Japanese Society for Artificial Intelligence (JSAI), Springer-Verlag, August 2002.

[10] P. Fearnhead, "Exact and efficient Bayesian inference for multiple changepoint problems," tech. rep., Department of Mathematics and Statistics, Lancaster University.