
Bayesian Change Point Detection for Functional Data

Xiuqi Li and Subhashis Ghosal

Department of Statistics, North Carolina State University

Abstract

We propose a Bayesian method to detect change points in a sequence of functional observations that are signal functions observed with noise. Since functions have unlimited features, it is natural to think that the sequence of signal functions driving the underlying functional observations changes through an evolution process, that is, different features change over time but possibly at different times. A change point is then viewed as the cumulative effect of changes in many features, so that the dissimilarities in the signal functions before and after the change point are at the maximum level. In our setting, features are characterized by the wavelet coefficients in their expansion. We consider a Bayesian approach by putting priors independently on the wavelet coefficients of the underlying functions, allowing a change in their values over time. We then compute the posterior distribution of the change point for each sequence of wavelet coefficients and obtain a measure of overall similarity between two signal functions in this sequence, which leads to the notion of an overall change point that minimizes the similarity across the change point relative to the similarities within each segment. We study the performance of the proposed method through a simulation study and apply it to a dataset on climate change.

Keywords: Change point, Functional data, Wavelet transform, Posterior consistency

1. Introduction

Change point detection is an important aspect of data analysis. In recent years, detecting change points in functional data has received considerable attention. Berkes et al. [8] developed a method by projecting the difference of mean functions on the principal components of the data. Zhang et al. [16] developed a self-normalization based test to identify the potential change points. Aston and Kirch [4] introduced an estimator of a change point in a model for an epidemic, where a change in the state of the epidemic occurs, and then the observations return to baseline at a later time. Sharipov et al. [14] developed a test for structural changes in functional data taking values in a Hilbert space and determined critical values by bootstrap. Aue et al. [6] proposed a method to uncover structural breaks in functional data without using dimension reduction techniques.

Preprint submitted to Journal of Statistical Planning and Inference, October 22, 2020

In this paper, we propose a Bayesian method to detect change points in functional data. We consider the setup where functional observations appear in a sequence, and can be thought of as unknown signal functions observed with a noise process, independent for each observation in the sequence. Each function is viewed as a collection of its features, typically described by its coefficients in a wavelet basis expansion. Our approach fundamentally differs from others in that we do not assume that the true signals suddenly change completely at a change point, being identical before the change point and identical after it. Instead, we let the signals evolve steadily, with some, but not all, features of the signal function possibly changing at a given time, mimicking the process of evolution of species in nature. These changes accumulate, and at some stage their overall effect becomes substantial. Representing functions in a wavelet series, the features of a function can be quantified through their wavelet coefficients in the expansion on a wavelet basis. While any orthogonal basis can be used to expand a function, wavelets are particularly effective in picking up local features. Analyzing functional data through their wavelet coefficients was adopted by Hyndman and Shang [10], Antoniadis et al. [3] and Suarez and Ghosal [15]. Hyndman and Shang [10] represented wavelet coefficients of functional observations as independent variables in forecasting functional time series by functional principal component regression and weighted functional partial least squares regression. Their approach allows assigning higher weights to more recent data, and provides a modeling scheme that is easily adapted to allow for constraints and other information. Representation of functions through wavelet coefficients is a very common and useful approach for putting a prior on an unknown function (Abramovich et al. [1], [2], Lian [13], Suarez and Ghosal [15]).

We consider two settings of functional observations: either observed over a regular grid, or observed continuously over time. We extract the features of the functional data by the discrete wavelet transform (DWT) in the former case, and by inner products with the wavelet functions in the latter case. The signal-plus-noise model for functional data then reduces to the problem of detecting a change in the value of the mean vector in a multivariate (or an infinite-dimensional) normal model. Since there are no functional dependencies between the coefficients, we can treat each sequence of features independently, and hence put independent priors on these coefficients. The prior for each coefficient allows a change in the value of the wavelet coefficient before and after some time point, but this change point need not be the same or related across different wavelet coefficients. Therefore, the change point problem for functional data is reduced to the problem of multiple (or infinitely many) one-dimensional change points. Each of the latter can be solved very efficiently by a Bayesian method based on conjugate priors, allowing explicit posterior computation without the need to sample from the posterior distribution. Finally, the change points for the individual wavelet coefficients need to be synthesized into a single value that represents the most prominent change point for the entire signal. To this end, we introduce a similarity matrix whose (i, i′)th entry measures the fraction of wavelet coefficients of f_i and f_i′ that are tied together when the coefficients are sampled from their posterior distributions. The posterior expected value of this similarity can be obtained in terms of the posterior distributions of the change points for each wavelet coefficient, which are analytically computed in our setting. We define a synthetic change point for the overall change as the value that minimizes the ratio of the mean similarity between groups to the mean similarity within groups, where one group is formed by the time points before and the other by the time points after the given value. This process can also be repeated more than once to detect multiple change points. The number of change points to be found may also be adaptively determined from the data using some stopping criterion involving these similarity measures.

The paper is organized as follows. In the next section, we formally introduce the model and describe the prior. Posterior computational formulas are presented in Section 3. Under a small noise regime, a posterior consistency result on estimation and identification of the change point is presented in Section 4. A simulation study demonstrating the accuracy of the proposed method, and an application to a dataset on climate change over a 250-year period, are presented in Sections 5 and 6, respectively. Proofs of the results are given in Section 7.


2. Setup, model description and prior distribution

We consider two different settings of functional data: observed over a regular discrete grid, or observed continuously over time. In both cases, the domain of the signal functions is taken to be the unit interval [0, 1], without loss of generality. The functional observations are viewed as signal functions corrupted by independent error processes. We follow the formulation of Suarez and Ghosal [15], who used wavelet coefficients of the underlying signal functions to describe their structure. Using these, they quantified the extent of similarity among different functional observations and used the information to cluster these observations into a few groups. Suarez and Ghosal [15] imposed a prior on wavelet coefficients which explicitly encourages forming ties in the values of these wavelet coefficients across different signal functions. We modify their approach for change point detection in functional data. This problem can be regarded as a special case of clustering with the constraint that, for each characteristic, there are at most two clusters and these are linearly ordered. Hence, the technique of Suarez and Ghosal [15] can be modified by imposing the additional restriction in the prior that ties occur over consecutive values only.

2.1. Model description

We model the functional observations Y_i(t) as arising from true signals f_i(t) corrupted by a noise process ε_i(t), t ∈ [0, 1], i = 1,…,n, where n denotes the sample size: Y_i(t) = f_i(t) + ε_i(t). As in Antoniadis et al. [3] and Suarez and Ghosal [15], we shall apply wavelet transforms to these functions to extract the relevant features. Let {φ_0, ψ_jk : k = 0,…,2^j − 1; j = 0, 1,…} stand for a wavelet basis, and let (a_0^{(i)}, b_jk^{(i)} : k = 0,…,2^j − 1, j = 0, 1,…), (α_0^{(i)}, β_jk^{(i)} : k = 0,…,2^j − 1, j = 0, 1,…) and (e_0^{(i)}, e_jk^{(i)} : k = 0,…,2^j − 1, j = 0, 1,…) stand respectively for the wavelet coefficients of Y_i, f_i and ε_i, i = 1,…,n.

In the discrete model, we observe functional data Y_i(T_l), i = 1,…,n, l = 1,…,m, with T_1,…,T_m ∈ [0, 1], given by Y_i(T_l) = f_i(T_l) + ε_il, where ε_il is assumed to follow a normal distribution with mean 0 and variance σ², independently across i and l. Define Y_i = (Y_i(T_1),…,Y_i(T_m))ᵀ, f_i = (f_i(T_1),…,f_i(T_m))ᵀ, and ε_i = (ε_i1,…,ε_im)ᵀ. If further m = 2^J for some J and T_1,…,T_m are equidistant, then the wavelet coefficients up to the level J in the expansion of the functions may be computed much more efficiently by the discrete wavelet transform (DWT), mapping a vector in R^m to another vector in R^m by multiplying by a suitable orthogonal matrix W (see Antoniadis et al. [3]). Then the transformed observations are given by

(a_0^{(i)}, b_jk^{(i)} : k = 0,…,2^j − 1, j = 0,…,J − 1)ᵀ = W (Y_i(T_1),…,Y_i(T_m))ᵀ.  (1)

The orthogonality of the transformation W preserves the independence and the normality of the errors, leading to the model

a_0^{(i)} = α_0^{(i)} + e_0^{(i)},  b_jk^{(i)} = β_jk^{(i)} + e_jk^{(i)},  e_jk^{(i)} ∼ N(0, σ²),  (2)

i = 1,…,n, k = 0,…,2^j − 1, j = 0,…,J − 1. If m is not a power of 2, a practically useful fix is to obtain a function in the continuous domain by a standard smoothing technique, and then evaluate the function at a set of equidistant points whose cardinality is a suitable power of 2.
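As an illustration of the DWT step in (1), here is a minimal pure-Python sketch of the orthonormal Haar transform and its inverse (the Haar basis is one possible wavelet family; the paper does not commit to a particular one). Because the transform is orthogonal, it preserves the squared norm of the vector, which is exactly what makes the noise in (2) remain i.i.d. N(0, σ²).

```python
import math

def haar_dwt(y):
    """Orthonormal Haar DWT of a vector of length 2^J.
    Returns coefficients in coarse-to-fine order:
    [a0, b_{0,0}, b_{1,0}, b_{1,1}, b_{2,0}, ...]."""
    n = len(y)
    assert n & (n - 1) == 0, "length must be a power of 2"
    approx = list(y)
    details = []  # finest level collected first
    while len(approx) > 1:
        a, d = [], []
        for i in range(0, len(approx), 2):
            s, t = approx[i], approx[i + 1]
            a.append((s + t) / math.sqrt(2))  # scaling (local average) part
            d.append((s - t) / math.sqrt(2))  # wavelet (local difference) part
        details.append(d)
        approx = a
    out = approx                               # [a0]
    for d in reversed(details):                # coarsest level j = 0 first
        out = out + d
    return out

def haar_idwt(w):
    """Inverse of haar_dwt (multiply by W-transpose)."""
    n = len(w)
    approx = [w[0]]
    pos = 1
    while len(approx) < n:
        d = w[pos:pos + len(approx)]
        pos += len(approx)
        nxt = []
        for a, b in zip(approx, d):
            nxt.append((a + b) / math.sqrt(2))
            nxt.append((a - b) / math.sqrt(2))
        approx = nxt
    return approx
```

Since W is orthogonal, `sum(v*v for v in haar_dwt(y))` equals `sum(v*v for v in y)` up to rounding, and `haar_idwt(haar_dwt(y))` recovers `y`.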

When the functional data are (essentially) observed continuously in time, we consider a Gaussian white noise model dY_i(t) = f_i(t)dt + σ dB_i(t), where B_i(·) are independent Brownian motions on [0, 1]. Let

a_0^{(i)} = ∫_0^1 φ_0(t) dY_i(t),  α_0^{(i)} = ∫_0^1 φ_0(t) f_i(t) dt,
b_jk^{(i)} = ∫_0^1 ψ_jk(t) dY_i(t),  β_jk^{(i)} = ∫_0^1 ψ_jk(t) f_i(t) dt,
e_0^{(i)} = σ ∫_0^1 φ_0(t) dB_i(t),  e_jk^{(i)} = σ ∫_0^1 ψ_jk(t) dB_i(t),

i = 1,…,n, k = 0,…,2^j − 1, j = 0, 1,…. It follows that e_0^{(i)} and e_jk^{(i)} are independent N(0, σ²). Hence, applying the wavelet transform, we can again express the model by the 'independent normal sequence model' (2), with the only difference that j = 0, 1,… is not restricted by J. The independence across different (j, k) pairs allows us to treat both the discrete and the continuous models within a single framework. Different components in the model stand for different features of the functions captured by the wavelet coefficients.

To detect the change point of this sequence of functional data, we first find the change in each component, that is, we detect the change for each feature β_jk, and then decide the overall change point from these. The procedure for combining the information across different wavelet coefficients is formally described in Section 3.

2.2. Prior Distributions

For a given (j, k), the change point prior on β_jk is regulated by its 'change point parameter' τ_jk through the relations

β_jk^{(1)} = · · · = β_jk^{(τ_jk − 1)} ≠ β_jk^{(τ_jk)} = · · · = β_jk^{(n)},  (3)


τ_jk = 1,…,n. Note that if τ_jk = 1, there is actually no change point, as all values are equal, which is permitted by our model. In principle, multiple change points may be possible in a component, but for simplicity of formulas, we allow at most one change point per component. Typically, this will suffice for applications, since multiple change points can still be obtained for the overall functional observations as the combined effect of component changes, because the values of τ_jk differ across pairs (j, k). The prior on the leading wavelet coefficient α_0^{(i)} will also follow the scheme of (3), using a change point variable τ·.

Once the change point variable is given a prior on {1,…,n}, we can put independent priors on the common value β_jk^< of the 'pre-change block' (β_jk^{(i)} : i = 1,…,τ_jk − 1) and the common value β_jk^> of the 'post-change block' (β_jk^{(i)} : i = τ_jk,…,n). It is tempting to consider independent normal priors on the common values, as this would lead to posterior conjugacy given τ_jk. However, being a wavelet coefficient, it is natural for β_jk to commonly assume the special value 0, as 0 stands for the absence of a feature in a function. Thus it is imperative to allow the value 0 in the prior for a wavelet coefficient by attaching an additional point mass at 0, that is, replacing the normal prior by a spike-and-slab prior, a convex combination of a point mass δ_0 at 0 (a degenerate normal distribution) and a normal density:

β_jk^< ∼ π_j N(0, c_j²σ²) + (1 − π_j)δ_0,  β_jk^> ∼ π_j N(0, c_j²σ²) + (1 − π_j)δ_0,  independently,  (4)
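A single draw of one coefficient sequence (β_jk^{(1)},…,β_jk^{(n)}) under the hierarchical prior (3)–(4) can be sketched as follows. This is an illustrative sketch, not the paper's code; the function name and the hyperparameter values in the usage below are ours.

```python
import random

def draw_beta_sequence(n, pi_j, cj2, sigma2, rho=None, rng=random):
    """One prior draw of (beta^(1), ..., beta^(n)) for a fixed (j, k):
    a change point tau ~ rho on {1, ..., n}, then independent spike-and-slab
    values for the pre-change and post-change blocks, as in (3)-(4)."""
    if rho is None:
        rho = [1.0 / n] * n                  # discrete uniform prior on tau
    tau = rng.choices(range(1, n + 1), weights=rho)[0]

    def spike_slab():
        # with probability pi_j a N(0, cj2 * sigma2) slab, else point mass at 0
        if rng.random() < pi_j:
            return rng.gauss(0.0, (cj2 * sigma2) ** 0.5)
        return 0.0

    pre, post = spike_slab(), spike_slab()
    # tau = 1 gives an empty pre-change block, i.e., no actual change
    return [pre] * (tau - 1) + [post] * (n - tau + 1)
```

By construction a draw is piecewise constant with at most one change; with positive probability both blocks equal 0, the no-feature case discussed below.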

where π_j and c_j are hyperparameters. The choices of these hyperparameters can be motivated by the degree of smoothness of the underlying functions. Suppose that c_j² = ν_1 2^{−γ_1 j} and π_j = min(1, ν_2 2^{−γ_2 j}), with ν_1, ν_2, γ_1, γ_2 > 0. If the mother wavelet function ψ has regularity r, then Abramovich et al. [1] observed that f ∈ B_{p,q}^s almost surely, where B_{p,q}^s denotes the Besov space of index (p, q) and smoothness s with max(0, 1/p − 1/2) < s < r, 1 ≤ p, q ≤ ∞, whenever s + 1/2 − γ_2/p − γ_1/2 < 0, or s + 1/2 − γ_2/p − γ_1/2 = 0 with 0 ≤ γ_2 < 1, 1 ≤ p < ∞, q = ∞.

It may be noted that, with positive probability (which moreover increases to one as j gets larger), the common value of β_jk^{(i)} before the change point τ_jk and the common value after the change point are both 0, which negates an actual change even with a value τ_jk > 1. In such a situation, the posterior distribution of τ_jk is totally prior-driven, as the likelihood does not contain any information about τ_jk. This type of indeterminacy of the value of the change point does not bother us, as we view τ_jk only as a technical device to put priors on two blocks of common values of the coefficients using a hierarchical scheme. The posterior distribution of τ_jk is not of primary interest, although it is used to compute the probability of a tie between β_jk^{(i)} and β_jk^{(i′)} for a given pair of indices i ≠ i′ (see the next section).

A prior on (α_0^{(i)} : i = 1,…,n) is put using a similar mechanism through its change point τ·, except that the common values of the blocks (α_0^{(i)} : i = 1,…,τ· − 1) and (α_0^{(i)} : i = τ·,…,n) are given independent vague priors (normal with a large variance) instead of spike-and-slab priors. All the change points τ· and τ_jk are independently given a prior P(τ = i) = ρ_i, i = 1,…,n, for some distribution (ρ_i : i = 1,…,n) with ρ_i > 0 and ∑_{i=1}^n ρ_i = 1. In numerical calculations, we take (ρ_i : i = 1,…,n) to be the discrete uniform distribution on {1,…,n}. Finally, we put an inverse-gamma prior σ² ∼ IG(θ, λ) on the global variance parameter σ².

3. Posterior computation

As the change points for the wavelet coefficients can potentially be different, a synthetic change point as an indicator of overall change is obtained by considering the degree of similarity between the wavelet coefficients. Following Suarez and Ghosal [15], who considered the problem of clustering functional data, we quantify the similarity between the ith and i′th observations in the discrete model with J levels by the proportion of ties between the wavelet coefficients of the underlying signals. A tie at i and i′ can happen in two ways: either both values are zero, or both values are non-zero but equal. In the latter case, this can happen only if the change point does not lie between i and i′. Thus we can define a similarity matrix S with entries

S(i, i′) = S_J(i, i′) = 2^{−J} [1(α_0^{(i)} = α_0^{(i′)}) + ∑_{j=0}^{J−1} ∑_{k=0}^{2^j−1} 1(β_jk^{(i)} = β_jk^{(i′)})].  (5)

In the continuous model (which corresponds to infinitely many levels), we may modify the measure by a weighted sum with a geometric discount sequence, S(i, i′) = ∑_{J=1}^∞ 2^{−J} S_J(i, i′). Due to the effect of the discounting factor, the infinite sum is easily approximated by a finite sum. The synthetic change point of a sequence of functional data, standing for the aggregate effect of changes in all possible wavelet coefficients of the underlying signals, is the index value where the contrast between the overall similarities of observations


before it and after it is the largest. For any k that divides the data into two groups, we compute the ratio C(k) of the mean similarity between the groups to the mean similarity within the groups:

C(k) = [∑_{1≤i≤k−1, k≤i′≤n} S(i, i′) / ((k − 1)(n − k + 1))] / [(∑_{1≤i<i′≤k−1} S(i, i′) + ∑_{k≤i<i′≤n} S(i, i′)) / ((k−1 choose 2) + (n−k+1 choose 2))].  (6)

The change point is where this ratio is the minimum: τ = arg min{C(k) : k = 2,…,n}. The change point τ thus defined is a functional of (f_1,…,f_n). Bayesian inference can be made on τ by computing its posterior distribution, which may be obtained from posterior samples. Posterior samples can be drawn using the conditional conjugacy of β_jk^{(i)} given τ_jk.

A Bayesian estimator can be obtained more easily without posterior sampling. Define an estimator τ̂ = arg min{Ĉ(k) : k = 2,…,n}, where Ĉ(k) is obtained by replacing S(i, i′) in (6) by its posterior expectation, given by

2^{−J} [P(α_0^{(i)} = α_0^{(i′)} | a_0^{(1)},…,a_0^{(n)}) + ∑_{j=0}^{J−1} ∑_{k=0}^{2^j−1} P(β_jk^{(i)} = β_jk^{(i′)} | b_jk^{(1)},…,b_jk^{(n)})].  (7)

In the actual implementation, we restrict the search to 3 ≤ k ≤ n − 1, which ensures that there are at least two data points in each group.
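Given any n × n similarity matrix (exact, or the posterior-expected one from (7)), the ratio (6) and its minimizer over the restricted range can be computed directly. A minimal pure-Python sketch, taking the within-group sums over distinct pairs i < i′ so that the binomial-coefficient denominators count exactly the summed pairs (the function name is ours):

```python
from math import comb

def change_point(S):
    """Synthetic change point: k in {3, ..., n-1} minimizing the ratio C(k)
    of mean between-group to mean within-group similarity, cf. (6).
    S is an n x n symmetric similarity matrix; k is returned 1-based,
    so the groups are {1, ..., k-1} and {k, ..., n}."""
    n = len(S)
    best_k, best_c = None, float("inf")
    for k in range(3, n):
        # mean similarity between the two groups
        between = sum(S[i][ip] for i in range(k - 1) for ip in range(k - 1, n))
        between /= (k - 1) * (n - k + 1)
        # mean similarity within the groups, over distinct pairs i < i'
        within = sum(S[i][ip] for i in range(k - 1) for ip in range(i + 1, k - 1))
        within += sum(S[i][ip] for i in range(k - 1, n) for ip in range(i + 1, n))
        within /= comb(k - 1, 2) + comb(n - k + 1, 2)
        c = between / within
        if c < best_c:
            best_k, best_c = k, c
    return best_k
```

For a block-structured similarity matrix (high similarity within the first four and the last six observations, none across), the minimizer falls exactly at the block boundary.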

Let i < i′ without loss of generality. Observe that P(β_jk^{(i)} = β_jk^{(i′)} | b_jk^{(1)},…,b_jk^{(n)}) can be obtained from the posterior distribution of τ_jk:

P(β_jk^{(i)} = β_jk^{(i′)} | b_jk^{(1)},…,b_jk^{(n)})
  = P(τ_jk ≤ i | b_jk^{(1)},…,b_jk^{(n)}) + P(τ_jk ≥ i′ + 1 | b_jk^{(1)},…,b_jk^{(n)})
  + P(β_jk^< = β_jk^> = 0, i < τ_jk ≤ i′ | b_jk^{(1)},…,b_jk^{(n)}).  (8)

Similarly, P(α_0^{(i)} = α_0^{(i′)} | a_0^{(1)},…,a_0^{(n)}) can be computed.

The posterior probability of {τ_jk = m}, m = 1,…,n, is

P(τ_jk = m | b_jk^{(1)},…,b_jk^{(n)}) = p(b_jk^{(1)},…,b_jk^{(n)} | τ_jk = m) ρ_m / [∑_{l=1}^n p(b_jk^{(1)},…,b_jk^{(n)} | τ_jk = l) ρ_l],  (9)

where p stands for the joint density after integrating out the mean and the variance parameters. When τ_jk = 1, there is only one block. Considering the possibilities that the common value is 0, which occurs with probability (1 − π_j), and that it is non-zero, which occurs with probability π_j, we obtain

p(b_jk^{(1)},…,b_jk^{(n)} | τ_jk = 1)
  = (1 − π_j) ∫ {∏_{l=1}^n φ(b_jk^{(l)}; 0, σ²)} g(σ²; θ, λ) dσ²
  + π_j ∫∫ {∏_{l=1}^n φ(b_jk^{(l)}; ξ, σ²)} φ(ξ; 0, c_j²σ²) g(σ²; θ, λ) dξ dσ²
  = (1 − π_j) (2π)^{−n/2} (λ^θ/Γ(θ)) Γ(n/2 + θ) / [(1/2)∑_{l=1}^n (b_jk^{(l)})² + λ]^{n/2+θ}
  + π_j (2π)^{−n/2} (λ^θ/Γ(θ)) (c_j² n + 1)^{−1/2} Γ(n/2 + θ) / [(1/2)∑_{l=1}^n (b_jk^{(l)})² − (c_j²/(2(c_j² n + 1))) (∑_{l=1}^n b_jk^{(l)})² + λ]^{n/2+θ}.  (10)

For τ_jk = m, m = 2,…,n, there are two blocks with a pair of common values. Writing ∗ for a generic non-zero value, this pair can be (0, 0), (0, ∗), (∗, 0) or (∗, ∗), leading to

p(b_jk^{(1)},…,b_jk^{(n)} | τ_jk = m)
  = (1 − π_j)² ∫ {∏_{l=1}^n φ(b_jk^{(l)}; 0, σ²)} g(σ²; θ, λ) dσ²
  + (1 − π_j) π_j ∫∫ {∏_{l=1}^{m−1} φ(b_jk^{(l)}; 0, σ²)} {∏_{l=m}^{n} φ(b_jk^{(l)}; ξ, σ²)} φ(ξ; 0, c_j²σ²) g(σ²; θ, λ) dξ dσ²
  + π_j (1 − π_j) ∫∫ {∏_{l=1}^{m−1} φ(b_jk^{(l)}; ξ, σ²)} {∏_{l=m}^{n} φ(b_jk^{(l)}; 0, σ²)} φ(ξ; 0, c_j²σ²) g(σ²; θ, λ) dξ dσ²
  + π_j² ∫∫∫ {∏_{l=1}^{m−1} φ(b_jk^{(l)}; ξ_1, σ²)} {∏_{l=m}^{n} φ(b_jk^{(l)}; ξ_2, σ²)} φ(ξ_1; 0, c_j²σ²) φ(ξ_2; 0, c_j²σ²) g(σ²; θ, λ) dξ_1 dξ_2 dσ²

  = (1 − π_j)² (2π)^{−n/2} (λ^θ/Γ(θ)) Γ(n/2 + θ) / [(1/2)∑_{l=1}^n (b_jk^{(l)})² + λ]^{n/2+θ}
  + (1 − π_j) π_j (2π)^{−n/2} (λ^θ/Γ(θ)) [c_j²(n − m + 1) + 1]^{−1/2} Γ(n/2 + θ) / [B_{n−m+1} + λ]^{n/2+θ}
  + π_j (1 − π_j) (2π)^{−n/2} (λ^θ/Γ(θ)) [c_j²(m − 1) + 1]^{−1/2} Γ(n/2 + θ) / [B_{m−1} + λ]^{n/2+θ}
  + π_j² (2π)^{−n/2} (λ^θ/Γ(θ)) [c_j²(m − 1) + 1]^{−1/2} [c_j²(n − m + 1) + 1]^{−1/2} Γ(n/2 + θ) / [B_{m−1} + B_{n−m+1} − (1/2)∑_{l=1}^n (b_jk^{(l)})² + λ]^{n/2+θ},  (11)

where

B_{m−1} = (1/2) ∑_{l=1}^n (b_jk^{(l)})² − (c_j² / (2(c_j²(m − 1) + 1))) (∑_{l=1}^{m−1} b_jk^{(l)})²,  (12)

and

B_{n−m+1} = (1/2) ∑_{l=1}^n (b_jk^{(l)})² − (c_j² / (2(c_j²(n − m + 1) + 1))) (∑_{l=m}^{n} b_jk^{(l)})².  (13)

Finally, note that

P(β_jk^< = β_jk^> = 0, τ_jk = m | b_jk^{(1)},…,b_jk^{(n)})
  = ρ_m P(β_jk^< = β_jk^> = 0 | τ_jk = m) p(b_jk^{(1)},…,b_jk^{(n)} | β_jk^< = β_jk^> = 0, τ_jk = m) / [∑_{l=1}^n ρ_l p(b_jk^{(1)},…,b_jk^{(n)} | τ_jk = l)].

The terms in the denominator, which is the same as in (9), are obtained from (10) and (11), and the numerator is ρ_m multiplied by the first term on the right-hand side of (11). Putting this information together, the third term in (8) can be computed.
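The closed forms (9)–(13) are cheap to evaluate in log space. The following sketch accumulates log p(b | τ_jk = m) over the spike/slab combinations of the two blocks (one block when m = 1), with the quadratic-form corrections of (12)–(13), and then normalizes per (9) under a uniform ρ. The function names and the default hyperparameter values are illustrative, not from the paper.

```python
import math
from itertools import product

def log_marginal(b, m, c2, pi, theta, lam):
    """log p(b | tau = m) per (10)-(13): sum over spike/slab choices for the
    pre-change block b[:m-1] and post-change block b[m-1:] (one block if m = 1)."""
    n = len(b)
    blocks = [b] if m == 1 else [b[:m - 1], b[m - 1:]]
    total = 0.5 * sum(x * x for x in b)
    const = -(n / 2) * math.log(2 * math.pi) + theta * math.log(lam) - math.lgamma(theta)
    terms = []
    for slab in product([False, True], repeat=len(blocks)):
        logw, logpref, base = 0.0, 0.0, total
        for blk, is_slab in zip(blocks, slab):
            if is_slab:                       # N(0, c2 * sigma^2) slab on this block
                k, s = len(blk), sum(blk)
                logw += math.log(pi)
                logpref -= 0.5 * math.log(c2 * k + 1)
                base -= c2 * s * s / (2 * (c2 * k + 1))   # correction, cf. (12)-(13)
            else:                             # point mass at 0 on this block
                logw += math.log(1 - pi)
        terms.append(logw + logpref + math.lgamma(n / 2 + theta)
                     - (n / 2 + theta) * math.log(base + lam))
    mx = max(terms)                           # log-sum-exp over the combinations
    return const + mx + math.log(sum(math.exp(t - mx) for t in terms))

def tau_posterior(b, c2=1.0, pi=0.5, theta=2.0, lam=1.0):
    """Posterior (9) over tau = 1, ..., n with a uniform prior rho."""
    n = len(b)
    logs = [log_marginal(b, m, c2, pi, theta, lam) for m in range(1, n + 1)]
    mx = max(logs)
    w = [math.exp(v - mx) for v in logs]
    z = sum(w)
    return [v / z for v in w]
```

For a sequence that is near zero for its first four entries and near two afterwards, the posterior mass concentrates on τ = 5, the index of the first post-change observation.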

Similarly, we can compute the marginal likelihood p(a_0^{(1)},…,a_0^{(n)} | τ· = l) and the posterior probability of τ· = i through the Bayes theorem.

4. Posterior Consistency

In this section, we state a posterior consistency result for the collection of underlying signals and the synthetic change point, under the regime that the noise variance σ² → 0 while the number of subjects n remains fixed. This setup is equivalent to letting the number of independent and identically distributed replications r → ∞ while keeping the variance fixed, i.e., σ_r² = σ²/r. Let D_r be the set of all observations. This allows an application of the general theory of posterior contraction (Ghosal and van der Vaart [9]). We shall present the consistency result in terms of the wavelet coefficients in the model (2), covering both the discrete model and the continuous model, respectively for J < ∞ and J = ∞. To simplify notation, we assume that α_0^{(i)} = 0 for i = 1,…,n.

For a vector of functions f = (f_1,…,f_n), the L_2-norm ‖f‖ is defined by

‖f‖² = ∑_{i=1}^n ‖f_i‖_2² = ∑_{i=1}^n ∑_{j=0}^∞ ∑_{k=0}^{2^j−1} |β_jk^{(i)}|².  (14)

For a smoothness level s > 0, we define the product Sobolev space H_n^s as the collection of vectors of functions f = (f_1,…,f_n) with each component s-times weakly differentiable and norm ‖f‖_{H_n^s} defined by

‖f‖_{H_n^s}² = ∑_{i=1}^n ∑_{j=0}^∞ ∑_{k=0}^{2^j−1} 2^{2js} |β_jk^{(i)}|²;  (15)

here a function with wavelet coefficients {β_jk : k = 0,…,2^j − 1, j = 1, 2,…} is said to have s weak derivatives if ∑_{j=1}^∞ ∑_{k=0}^{2^j−1} 2^{2js} |β_jk|² < ∞.
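In terms of the coefficient arrays, the squared norm in (15) is just a level-weighted sum of squares (and (14) is the special case s = 0). A small sketch, with the function name ours:

```python
def sobolev_norm_sq(beta, s):
    """Squared Sobolev-type norm (15) of one function from its wavelet
    coefficients: beta[j] is the list (beta_{jk} : k = 0, ..., 2^j - 1)."""
    total = 0.0
    for j, level in enumerate(beta):
        assert len(level) == 2 ** j, "level j must hold 2^j coefficients"
        total += 4 ** (j * s) * sum(x * x for x in level)   # weight 2^{2js}
    return total
```

For instance, with coefficients 1 at levels j = 0 and j = 1, the s = 1 norm weights the two level-one coefficients by 2² = 4, while s = 0 recovers the plain sum of squares as in (14).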

Let f_0 be the true value of the signal functions f, with wavelet coefficients β_jk,0^{(i)}, i = 1,…,n, k = 0,…,2^j − 1, j = 1, 2,…. Let τ_0 stand for the true value of the synthetic change point τ, obtained as arg min{C(k) : k = 2,…,n} from (5) and (6) by replacing β_jk^{(i)} by β_jk,0^{(i)}, i = 1,…,n, k = 0,…,2^j − 1, j = 1, 2,…. The prior parameters are chosen to be c_j² = ν_1 2^{−γ_1 j} and π_j = min(1, ν_2 2^{−γ_2 j}) for some ν_1, ν_2, γ_1, γ_2 > 0.

The following result shows the consistency of estimation and change point detection.

Theorem 1. Let γ_1 > 2s + 1, and let f_0 ∈ H_n^s be the vector of true functions. Then the posterior is consistent for the estimation of the signals and the detection of the change point, i.e., for any ε > 0, Π(‖f − f_0‖ < ε | D_r) → 1 and Π(τ = τ_0 | D_r) → 1 in probability as r → ∞.

Posterior consistency of the function will be established by applying Schwartz's theory and its extensions (see Ghosal and van der Vaart [9]). To apply Schwartz's posterior consistency theorem in this context, we need to establish that the parameter space is effectively compact. By direct moment calculations, we shall show that there exists a sufficiently large Sobolev norm-bounded set (which is automatically compact in ℓ_2) that contains almost all posterior probability, with high true probability.

After the posterior consistency of the function is established, we show that, with high posterior probability, the similarity matrix agrees identically with the true similarity matrix of the collection (β_jk^{(i)}), in true probability. This analysis comes down to showing that, for every k = 0,…,2^j − 1, j = 1, 2,…, the posterior probability that the blocks formed by ties in the values of β_jk^{(i)}, i = 1,…,n, agree with the blocks formed by the ties in the values of β_jk,0^{(i)}, tends to one in the true probability. This is shown by a combined application of the posterior consistency of the function, established in the first part of the theorem, and a consistency property of Bayes factors testing between possible structures, established by a fine analysis of marginal likelihoods in conjugate normal models. In particular, posterior consistency allows us to rule out certain structures as unlikely under the posterior and to concentrate on structures that are only more liberal than the true structure formed by the ties in the values of the wavelet coefficients. The argument is similar in spirit to that used by Suarez and Ghosal [15] to establish posterior consistency of functional clustering in a similar setting, but it seems that they missed a growing factor. This factor needs to be properly controlled, and hence their proof needs to be modified. We take care of the issue, and our argument can also be used to rectify the proof of the posterior consistency result of Suarez and Ghosal [15].

To describe the model selection property, consider the following possible structures for consecutive values of β_jk:

(i) Change from nonzero to nonzero at τ_jk = t:
S_jk^{∗,∗}(t) = {β_jk^{(1)} = · · · = β_jk^{(t−1)} = ξ_1, β_jk^{(t)} = · · · = β_jk^{(n)} = ξ_2, ξ_1 ≠ ξ_2, ξ_1, ξ_2 ≠ 0}, t = 2,…,n;

(ii) Change from nonzero to zero at τ_jk = t:
S_jk^{∗,0}(t) = {β_jk^{(1)} = · · · = β_jk^{(t−1)} = ξ, β_jk^{(t)} = · · · = β_jk^{(n)} = 0, ξ ≠ 0}, t = 2,…,n;

(iii) Change from zero to nonzero at τ_jk = t:
S_jk^{0,∗}(t) = {β_jk^{(1)} = · · · = β_jk^{(t−1)} = 0, β_jk^{(t)} = · · · = β_jk^{(n)} = ξ, ξ ≠ 0}, t = 2,…,n;

(iv) No change, and the value is nonzero:
S_jk^{∗} = {β_jk^{(1)} = · · · = β_jk^{(n)} = ξ, ξ ≠ 0};

(v) No change, and the value is zero:
S_jk^{0} = {β_jk^{(1)} = · · · = β_jk^{(n)} = 0}.

The structure of a random (β_jk^{(i)} : i = 1,…,n) drawn from its posterior distribution will be denoted by S_jk. The true model S_jk,0 is defined as the structure that (β_jk,0^{(i)} : i = 1,…,n) follows. A model S_jk is called incompatible if its closure does not contain S_jk,0, i.e., (β_jk,0^{(i)} : i = 1,…,n) cannot be approached by a sequence of values within S_jk. For instance, if β_jk,0^{(1)} = · · · = β_jk,0^{(t−1)} = ξ_1 ≠ ξ_2 = β_jk,0^{(t)} = · · · = β_jk,0^{(n)} with ξ_1, ξ_2 ≠ 0, then the models S_jk^{∗,0}(t), S_jk^{0,∗}(t), S_jk^{∗} and S_jk^{0} are all incompatible, and so are the models of types (i)–(iii) for a wrong value of t. By posterior consistency for β_jk, the posterior probability of any incompatible model converges to zero in probability. A compatible model can, however, be strictly bigger than the true model. For instance, all


models are compatible with S^{0}_{jk}, since zero can be approximated by nonzero sequences arbitrarily well. The possibility of the occurrence of a compatible model strictly bigger than the true model cannot be ruled out by posterior consistency. The following lemma, a key ingredient in the proof of posterior consistency of the change point, shows that the true model is indeed selected with high posterior probability under the true distribution.

Lemma 1. Let S_{jk,0} denote the true structure for given j, k. Then Π(S_{jk} = S_{jk,0} | D_r) → 1 in probability as r → ∞.

Proofs of Theorem 1 and Lemma 1 are given in Section 7.

5. Simulation

In order to study the performance of our method, we implement it on a set of simulated data. We generate data following the discrete model through a DWT W given by (1). Since we detect the change point through the features extracted by the DWT, we first generate the vector of wavelet coefficients (a^{(i)}_0, b^{(i)}_{jk} : k = 0, ..., 2^j − 1, j = 0, ..., J − 1)ᵀ and then apply the inverse discrete wavelet transform Wᵀ to get the functional data. We generate 16 features for the first data point from a uniform distribution on [0, 0.5]. To make the change distinguishable, we generate 16 features for the last data point from a uniform distribution on [0.5, 1]. Suppose that we have 100 data points that represent the gradual change from the first data point to the last one, and there is one change in each feature. We randomly sample 16 numbers from 3 to 98 and regard them as the change points for these 16 features. To generate the sequence of 100 data units, we repeat the features of the first data point and change each to that of the last data point after its change point. Thus, we have a sequence of data represented by the true feature values. After applying the inverse discrete wavelet transform Wᵀ, we get a sequence of 100 true signals. To generate data with noise, we sample from the normal distribution with the true feature values as the mean and variance 0.01, 0.1, and 1, respectively. Hence we get three sequences of 100 noisy functional observations after applying the inverse discrete wavelet transform to them. Figure 1 shows the plots of the true signals and the signals with different noise levels.
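As an illustration, the data-generating steps above can be sketched as follows. This is our own minimal sketch: numpy, the seed, the helper `haar_matrix`, and the 0-based indexing convention are our choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 16, 100  # number of features and of functional observations

def haar_matrix(p):
    """Orthonormal Haar DWT matrix for signal length p (a power of 2)."""
    if p == 1:
        return np.array([[1.0]])
    h = haar_matrix(p // 2)
    scaling = np.kron(h, [1.0, 1.0]) / np.sqrt(2.0)               # averages
    detail = np.kron(np.eye(p // 2), [1.0, -1.0]) / np.sqrt(2.0)  # differences
    return np.vstack([scaling, detail])

W = haar_matrix(p)                          # analysis matrix, W @ W.T = I
beta_first = rng.uniform(0.0, 0.5, size=p)  # features of the first curve
beta_last = rng.uniform(0.5, 1.0, size=p)   # features of the last curve
taus = rng.integers(3, 99, size=p)          # one change point per feature

# feature m keeps its initial value before tau_m and its final value after
beta = np.where(np.arange(n)[:, None] < taus[None, :], beta_first, beta_last)
signals = beta @ W                          # inverse DWT, one curve per row
noisy = signals + rng.normal(0.0, np.sqrt(0.1), size=signals.shape)
```

Each row of `signals` is one true signal curve; adding normal noise with variance 0.01, 0.1, or 1 to `beta` before the inverse transform gives the three noisy sequences.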

We apply our method to the observations. For the true signals, we use (5) to compute the similarity. The change point is the value of k at which C(k) in (6) is minimized. Once we detect the first change point, we divide the sequence of data into two subgroups. Furthermore, we can find the change


point in each of these two groups. We can continue the process to divide the data into more subgroups, and stop when any of the following conditions is met:

• the plot of C(k) versus k is relatively flat, which means that there is not much difference within these data;

• the minimum number of data points is reached;

• the maximum depth of the resulting binary tree is reached.
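A minimal sketch of this recursive splitting, with the three stopping rules as parameters; the function name, the toy criterion, and all defaults here are our own illustrative choices (the defaults mirror the thresholds used later in this study):

```python
def detect_changes(C, lo, hi, depth=1, max_depth=3, min_size=10, flat_tol=0.1):
    """Recursively split [lo, hi) at the minimizer of the criterion C(lo, hi, k),
    stopping on any of the three rules described in the text."""
    if depth > max_depth or hi - lo < min_size:
        return []
    scores = {k: C(lo, hi, k) for k in range(lo + 1, hi)}
    if max(scores.values()) - min(scores.values()) < flat_tol:
        return []  # criterion is flat: no clear change in this segment
    k = min(scores, key=scores.get)
    left = detect_changes(C, lo, k, depth + 1, max_depth, min_size, flat_tol)
    right = detect_changes(C, k, hi, depth + 1, max_depth, min_size, flat_tol)
    return left + [(k, depth)] + right

# toy criterion: negative absolute difference of segment means
data = [0.0] * 50 + [1.0] * 50
def mean_gap(lo, hi, k):
    left = sum(data[lo:k]) / (k - lo)
    right = sum(data[k:hi]) / (hi - k)
    return -abs(left - right)

print(detect_changes(mean_gap, 0, 100))  # → [(50, 1)]
```

Each detected change point is returned together with its depth in the binary tree, matching the hierarchical order reported in parentheses in Table 1.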

Figure 1: Plots of true signals and signals with different noises. Panels: (a) true signals; (b) variance = 0.01; (c) variance = 0.1; (d) variance = 1.

In this study, we stop if either max(C(k)) − min(C(k)) < 0.1, or there are fewer than 10 data points in the group, or the resulting binary tree has depth 3. We compare our results with the E-Divisive method (James and Matteson [11]) in the R package ecp, which also estimates multiple change points by iteratively applying a procedure for locating a single change point. We apply the E-Divisive method to the wavelet transform of the observations. Table 1


Table 1: Change point detection results comparison

Method        Variance   Change points
True signal   0.00       26(2), 49(1), 69(2)
Our           0.01       26(2), 49(1), 69(2)
E-Divisive    0.01       25(2), 49(1), 75(2)
Our           0.10       26(2), 49(1), 68(3), 75(2), 92(3)
E-Divisive    0.10       25(2), 49(1), 62(3), 75(2), 88(3)
Our           1.00       5(3), 24(2), 40(3), 46(1), 94(3), 99(2)
E-Divisive    1.00       42(1), 70(2)

shows the change points detected by our method and by the E-Divisive method for the different sequences of observations. The numbers in parentheses denote the hierarchical order of the change points. When the variance is small (σ² = 0.01), the change points our method detects are exactly the true change points. With a larger variance (σ² = 0.1), our method can still detect most change points correctly. When the variance is large (σ² = 1), it is naturally difficult for any method to detect the change points. Note that the E-Divisive method finds fewer change points for data with large variance. That is because it uses a divergence measure based on Euclidean distances (Matteson and James [12]) to find the change points, and the large variance can reduce the observed difference between the functions; our method, in contrast, models the noise and is less sensitive to it than the E-Divisive method.

6. Application to a climate change dataset

On Berkeley Earth (http://berkeleyearth.org/data/), we can find the land-surface monthly average temperatures between 1753 and 2016. These temperatures are in degrees Celsius and reported as anomalies relative to the average temperature from January 1951 to December 1980. We can construct a set of functional data from the 12 monthly average temperatures in each year. We smooth the data by a basis expansion. Thus we get 264 functional data curves ordered by year. We suspect that there is a change in these functional data. We therefore apply our method and find a change point of this sequence of functional data at the year 1914 by minimizing C(k) with respect to k in (6), the mean similarity across the groups on the two sides of the change point, relative to the average values within these groups.
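For concreteness, one way to smooth a year's 12 monthly anomalies by a small basis expansion is sketched below. The Fourier basis, the function name, and the number of harmonics are our own illustrative assumptions, not necessarily the basis used in the paper.

```python
import numpy as np

def smooth_year(anomalies, n_harmonics=2):
    """Least-squares fit of 12 monthly values on an intercept plus
    n_harmonics sine/cosine pairs; returns the fitted 12-point curve."""
    t = np.arange(12) / 12.0
    cols = [np.ones(12)]
    for h in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * h * t))
        cols.append(np.cos(2 * np.pi * h * t))
    X = np.column_stack(cols)                    # 12 x (1 + 2*n_harmonics)
    coef, *_ = np.linalg.lstsq(X, np.asarray(anomalies, float), rcond=None)
    return X @ coef

# a pure annual cycle lies in the basis and is reproduced exactly
year = np.cos(2 * np.pi * np.arange(12) / 12.0)
```

Applying such a smoother year by year turns the 264 × 12 anomaly table into the ordered sequence of functional data described above.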


Figure 2: Plots of the land-surface average temperature curves between given year ranges. Panels: (a) 1753–1838; (b) 1839–1913; (c) 1914–1968; (d) 1969–2016.

We apply the procedure again on the two segments 1753–1913 and 1914–2016. The corresponding change points are found to be in the years 1839 and 1969. The plots of the functional data for each of the blocks 1753–1838, 1839–1913, 1914–1968, and 1969–2016 are shown in Figure 2. Figure 3 shows the mean functions of these 4 segments. The dominant change happened in the year 1914, which is the first change point detected by our method.

The procedure can be continued further to split into more subgroups, stopping the subdivision of a group when max(C(k)) − min(C(k)) < 0.1 in that group. This leads to 15 subgroups formed by the change points 1761, 1767, 1779, 1801, 1812, 1821, 1839, 1858, 1878, 1898, 1914, 1942, 1969, and 1994. We also apply the E-Divisive method to this climate data. E-Divisive finds 7 change points: 1789, 1808, 1818, 1851, 1921, 1987, and 2001. The first change point detected by the E-Divisive method is in the year 1921.

The posterior consistency results in Section 4 require replication of functional data, which we cannot have in this context. A possible way to address this


Figure 3: Mean functions of the land-surface average temperature curves.

issue is to artificially create replication by averaging daily temperatures over a block of years, say over a decade. Change points will then be identified as certain decades. We do not carry out this modification on this dataset.

7. Proof of the theorem

Proof of Theorem 1. We treat only the continuous case J = ∞. For the discrete case, J is an increasing sequence, which in the bounds may be replaced by infinity, and the arguments of the continuous case then apply.

First we show that

lim_{B→∞} sup_{r>0} E_{f₀} Π(H^s_n(B)^c | D_r) = 0.    (16)

This will imply that the parameter space can be effectively taken to be H^s_n(B) for a sufficiently large B > 0, on which Schwartz's theory for posterior consistency can be applied.

To prove (16), we apply Markov's inequality to estimate

Π(H^s_n(B)^c | D_r) ≤ B^{−2} { Σ_{i=1}^{n} Σ_{j=0}^{∞} 2^{2js} Σ_{k=0}^{2^j−1} E(|β^{(i)}_{jk}|² | D_r) }.    (17)

The expectation E(|β^{(i)}_{jk}|² | D_r) above is given by

Σ_{t=1}^{n} E(|β^{(i)}_{jk}|² | τ_{jk} = t, D_r) Π(τ_{jk} = t | D_r) ≤ max_{1≤t≤n} E(|β^{(i)}_{jk}|² | τ_{jk} = t, D_r).


The posterior density of the common value ξ of (β^{(i)}_{jk} : i = 1, ..., t−1) given (β_{jk} ≠ 0, τ_{jk} = t, D_r) is proportional to

exp{ −(r/2σ²) Σ_{l=1}^{t−1} (b^{(l)}_{jk} − ξ)² − ξ²/(2c_j²σ²) },

giving the posterior distribution

N( c_j² Σ_{l=1}^{t−1} b^{(l)}_{jk} / ((t−1)c_j² + 1/r),  c_j²σ² / (1 + (t−1)c_j²r) ).

Thus E(|β^{(i)}_{jk}|² | τ_{jk} = t, D_r) is given by the expressions

c_j²σ² / (1 + (t−1)c_j²r) + ( c_j² / ((t−1)c_j² + 1/r) )² ( Σ_{l=1}^{t−1} b^{(l)}_{jk} )²   if i < t,

c_j²σ² / (1 + (n−t+1)c_j²r) + ( c_j² / ((n−t+1)c_j² + 1/r) )² ( Σ_{l=t}^{n} b^{(l)}_{jk} )²   if i ≥ t.

Note that if t = 1, we only need to consider the latter case. Either expression can be bounded by

c_j²σ² + ( c_j² / (c_j² + 1/r) )² ( Σ_{l=1}^{n} b^{(l)}_{jk} )²,

which leads to the estimate

Π(H^s_n(B)^c | D_r) ≤ B^{−2} { Σ_{i=1}^{n} Σ_{j=0}^{∞} 2^{2js} Σ_{k=0}^{2^j−1} [ c_j²σ² + ( c_j² / (c_j² + 1/r) )² ( Σ_{l=1}^{n} b^{(l)}_{jk} )² ] }.
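The normal-normal update behind these expressions can be checked numerically. The sketch below (the function name, parameter values, and the grid-integration check are ours) compares the displayed closed form for the posterior of ξ with direct numerical integration of the unnormalized posterior density.

```python
import numpy as np

def xi_posterior(b, r, c2, sigma2):
    """Closed-form posterior N(mean, var) of the common value xi given
    m = len(b) observations b_l ~ N(xi, sigma2/r), prior xi ~ N(0, c2*sigma2)."""
    m = len(b)
    mean = c2 * sum(b) / (m * c2 + 1.0 / r)
    var = c2 * sigma2 / (1.0 + m * c2 * r)
    return mean, var

# grid check against the unnormalized posterior density
b, r, c2, sigma2 = [0.3, 0.5, 0.4], 10.0, 2.0, 1.0
xi = np.linspace(-5.0, 5.0, 200001)
logpost = -(r / (2 * sigma2)) * sum((bl - xi) ** 2 for bl in b) \
          - xi ** 2 / (2 * c2 * sigma2)
w = np.exp(logpost - logpost.max())
w /= w.sum()                          # normalize on the grid
mean_num = (w * xi).sum()
var_num = (w * (xi - mean_num) ** 2).sum()
```

Both routes agree to high accuracy, confirming the mean and variance displayed above.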

To prove the claim, it therefore suffices to show that E_{f₀} Π(H^s_n(B)^c | D_r), which is bounded by

B^{−2} { n Σ_{j=0}^{∞} 2^{2js} Σ_{k=0}^{2^j−1} [ c_j²σ² + ( c_j² / (c_j² + 1/r) )² { nσ²/r + ( Σ_{l=1}^{n} β^{(l)}_{jk,0} )² } ] }
≤ B^{−2} { n Σ_{j=0}^{∞} 2^{2js} Σ_{k=0}^{2^j−1} [ c_j²σ² + nc_j²σ² + ( Σ_{l=1}^{n} β^{(l)}_{jk,0} )² ] },

is arbitrarily small for sufficiently large B > 0. Recalling that c_j² = ν₁2^{−γ₁j} and using (Σ_{l=1}^{n} β^{(l)}_{jk,0})² ≤ n Σ_{l=1}^{n} |β^{(l)}_{jk,0}|², the term inside the braces reduces to

n(n+1)σ²ν₁ Σ_{j=0}^{∞} 2^{(2s+1−γ₁)j} + n² Σ_{j=0}^{∞} 2^{2js} Σ_{k=0}^{2^j−1} Σ_{l=1}^{n} |β^{(l)}_{jk,0}|²
= n(n+1)σ²ν₁ Σ_{j=0}^{∞} 2^{(2s+1−γ₁)j} + n² ‖f₀‖²_{H^s_n} < ∞,

provided that γ₁ > 2s + 1, completing the proof of the claim.


In view of (16), for posterior consistency we may assume that the parameter space is H^s_n(B) for some B > 0, which is norm-compact (see, for example, Lemma 6.4 of Belitser and Ghosal [7], from which the finiteness of covering numbers, and hence compactness, follows). Since the weak topology on the corresponding distributions is weaker and is Hausdorff, the weak and norm topologies coincide on norm-compact sets. It therefore suffices to prove posterior consistency with respect to the weak topology only. According to Example 6.20 in Ghosal and van der Vaart [9], if the prior satisfies the Kullback–Leibler property (Definition 6.15 of Ghosal and van der Vaart [9]), then the posterior distribution is consistent in the weak topology. In our setting, the Kullback–Leibler divergence is easily seen to be proportional to ‖f − f₀‖², and hence we need to verify that for all ε > 0,

Π( Σ_{i=1}^{n} Σ_{j=0}^{∞} Σ_{k=0}^{2^j−1} |β^{(i)}_{jk} − β^{(i)}_{jk,0}|² < ε² ) > 0,

which follows from Lemmas 1 and 2 of Lian [13]. This completes the proof of posterior consistency of the function f at its true value f₀.

To show the posterior consistency of the synthetic change point τ at its true value τ₀, as the weights 2^{−j} over the levels are summable, it suffices to prove posterior consistency of

( 1{β^{(i)}_{jk} = β^{(i′)}_{jk}} : k = 0, ..., 2^j − 1, j = 1, ..., J, i ≠ i′, i, i′ = 1, ..., n )

at

( 1{β^{(i)}_{jk,0} = β^{(i′)}_{jk,0}} : k = 0, ..., 2^j − 1, j = 1, ..., J, i ≠ i′, i, i′ = 1, ..., n )

for any fixed J. This is equivalent to showing that for all fixed j, k, Π(S_{jk} = S_{jk,0} | D_r) → 1 in probability, which is given by Lemma 1.

Proof of Lemma 1. Because incompatible structures have asymptotically negligible posterior probabilities in view of posterior consistency, it suffices to show that the ratio of the marginal likelihood of a compatible structure other than the true structure to that of the true structure goes to zero in probability. If the true structure is S^{∗,∗}_{jk}(t), then the result is immediate, as there are no other compatible structures. In this proof, we only deal with the cases when the true structure is S^{∗}_{jk} or S^{0}_{jk}. The proofs for the other cases follow by similar arguments.


The marginal likelihood p(b^{(1)}_{jk}, ..., b^{(n)}_{jk} | S^{∗,∗}_{jk}(t)) of S^{∗,∗}_{jk}(t) is

∫∫ ∏_{l=1}^{t−1} φ(b^{(l)}_{jk}; ξ₁, σ²/r) ∏_{l=t}^{n} φ(b^{(l)}_{jk}; ξ₂, σ²/r) φ(ξ₁; 0, c_j²σ²) φ(ξ₂; 0, c_j²σ²) dξ₁ dξ₂

= ( c_j²r(t−1) + 1 )^{−1/2} ( c_j²r(n−t+1) + 1 )^{−1/2} (2πσ²/r)^{−n/2}
  × exp{ (r/2σ²) [ (c_j²/(c_j²(t−1) + 1/r)) ( Σ_{l=1}^{t−1} b^{(l)}_{jk} )² + (c_j²/(c_j²(n−t+1) + 1/r)) ( Σ_{l=t}^{n} b^{(l)}_{jk} )² − Σ_{l=1}^{n} (b^{(l)}_{jk})² ] }.

The marginal likelihood p(b^{(1)}_{jk}, ..., b^{(n)}_{jk} | S^{∗,0}_{jk}(t)) of S^{∗,0}_{jk}(t) is

∫ ∏_{l=1}^{t−1} φ(b^{(l)}_{jk}; ξ, σ²/r) ∏_{l=t}^{n} φ(b^{(l)}_{jk}; 0, σ²/r) φ(ξ; 0, c_j²σ²) dξ

= ( c_j²r(t−1) + 1 )^{−1/2} (2πσ²/r)^{−n/2}
  × exp{ (r/2σ²) [ (c_j²/(c_j²(t−1) + 1/r)) ( Σ_{l=1}^{t−1} b^{(l)}_{jk} )² − Σ_{l=1}^{n} (b^{(l)}_{jk})² ] }.

The marginal likelihood p(b^{(1)}_{jk}, ..., b^{(n)}_{jk} | S^{0,∗}_{jk}(t)) of S^{0,∗}_{jk}(t) is

∫ ∏_{l=1}^{t−1} φ(b^{(l)}_{jk}; 0, σ²/r) ∏_{l=t}^{n} φ(b^{(l)}_{jk}; ξ, σ²/r) φ(ξ; 0, c_j²σ²) dξ

= ( c_j²r(n−t+1) + 1 )^{−1/2} (2πσ²/r)^{−n/2}
  × exp{ (r/2σ²) [ (c_j²/(c_j²(n−t+1) + 1/r)) ( Σ_{l=t}^{n} b^{(l)}_{jk} )² − Σ_{l=1}^{n} (b^{(l)}_{jk})² ] }.

The marginal likelihood p(b^{(1)}_{jk}, ..., b^{(n)}_{jk} | S^{∗}_{jk}) of S^{∗}_{jk} is

∫ ∏_{l=1}^{n} φ(b^{(l)}_{jk}; ξ, σ²/r) φ(ξ; 0, c_j²σ²) dξ

= [ (c_j²rn + 1)(2πσ²/r)^{n} ]^{−1/2} exp{ (r/2σ²) [ (c_j²/(c_j²n + 1/r)) ( Σ_{l=1}^{n} b^{(l)}_{jk} )² − Σ_{l=1}^{n} (b^{(l)}_{jk})² ] }.


Finally, the marginal likelihood p(b^{(1)}_{jk}, ..., b^{(n)}_{jk} | S^{0}_{jk}) of S^{0}_{jk} is

∏_{l=1}^{n} φ(b^{(l)}_{jk}; 0, σ²/r) = (2πσ²/r)^{−n/2} exp{ −(r/2σ²) Σ_{l=1}^{n} (b^{(l)}_{jk})² }.

Let the true model S_{jk,0} be S^{∗}_{jk}. To prove model selection consistency in this case, it suffices to show that the ratio of the marginal likelihood of each of the models S^{∗,∗}_{jk}(t), S^{∗,0}_{jk}(t), and S^{0,∗}_{jk}(t), t = 2, ..., n, to that of S^{∗}_{jk} converges to 0. In

the first case, the ratio is given by

√( (c_j²rn + 1) / ( (c_j²r(t−1) + 1)(c_j²r(n−t+1) + 1) ) )
  × exp{ (r/2σ²) [ (c_j²/(c_j²(t−1) + 1/r)) ( Σ_{l=1}^{t−1} b^{(l)}_{jk} )² + (c_j²/(c_j²(n−t+1) + 1/r)) ( Σ_{l=t}^{n} b^{(l)}_{jk} )² − (c_j²/(c_j²n + 1/r)) ( Σ_{l=1}^{n} b^{(l)}_{jk} )² ] }.

The term in the square root goes to 0 as r → ∞. Hence it suffices to show that the expression inside the exponential remains bounded. As r → ∞,

(r/2σ²) [ c_j² ( Σ_{l=1}^{t−1} b^{(l)}_{jk} )² / (c_j²(t−1) + 1/r) + c_j² ( Σ_{l=t}^{n} b^{(l)}_{jk} )² / (c_j²(n−t+1) + 1/r) − c_j² ( Σ_{l=1}^{n} b^{(l)}_{jk} )² / (c_j²n + 1/r) ]

= (r/2σ²) [ ( Σ_{l=1}^{t−1} b^{(l)}_{jk} )² / ((t−1) + O(1/r)) + ( Σ_{l=t}^{n} b^{(l)}_{jk} )² / ((n−t+1) + O(1/r)) − ( Σ_{l=1}^{n} b^{(l)}_{jk} )² / (n + O(1/r)) + O(1/r) ]

= (r/2σ²) [ ( Σ_{l=1}^{t−1} b^{(l)}_{jk} )² / (t−1) + ( Σ_{l=t}^{n} b^{(l)}_{jk} )² / (n−t+1) − ( Σ_{l=1}^{n} b^{(l)}_{jk} )² / n ] + O(1).    (18)

Let U = Σ_{l=1}^{t−1} b^{(l)}_{jk} and V = Σ_{l=t}^{n} b^{(l)}_{jk}. Consider a random variable W which equals U/(t−1) with probability (t−1)/n and V/(n−t+1) with probability (n−t+1)/n. Then Jensen's inequality implies that U²/(t−1) + V²/(n−t+1) ≥ (U + V)²/n. Thus, the term in the brackets of the exponential in (18) is nonnegative. Hence it suffices to control its expectation and show that it remains bounded as r → ∞. Suppose that the true value is β^{(i)}_{jk,0} = ξ. Then the expectation of the first term of (18) under the true distribution is

(r/2σ²) [ ( (t−1)²ξ² + (t−1)σ²/r ) / (t−1) + ( (n−t+1)²ξ² + (n−t+1)σ²/r ) / (n−t+1) − ( n²ξ² + nσ²/r ) / n ],


which reduces to (rξ²/2σ²) [ (t−1) + (n−t+1) − n ] + O(1) = O(1), as the first term vanishes. This completes the convergence proof in the present case.
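For completeness, the Jensen step invoked above can be written out; this is our expansion of the one-line argument in the text:

```latex
\mathrm{E}\,W
  = \frac{t-1}{n}\cdot\frac{U}{t-1} + \frac{n-t+1}{n}\cdot\frac{V}{n-t+1}
  = \frac{U+V}{n},
\qquad
\mathrm{E}\,W^{2}
  = \frac{U^{2}}{n(t-1)} + \frac{V^{2}}{n(n-t+1)},
```

so E W² ≥ (E W)², multiplied through by n, yields U²/(t−1) + V²/(n−t+1) ≥ (U + V)²/n.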

If the true structure is S^{0}_{jk} and the compatible structure is S^{∗,∗}_{jk}(t), the marginal likelihood ratio is

( c_j²r(t−1) + 1 )^{−1/2} ( c_j²r(n−t+1) + 1 )^{−1/2}
  × exp{ (r/2σ²) [ (c_j²/(c_j²(t−1) + 1/r)) ( Σ_{l=1}^{t−1} b^{(l)}_{jk} )² + (c_j²/(c_j²(n−t+1) + 1/r)) ( Σ_{l=t}^{n} b^{(l)}_{jk} )² ] }.

The two factors in front go to 0 as r → ∞. The exponential term is

(r/2σ²) [ ( Σ_{l=1}^{t−1} b^{(l)}_{jk} )² / (t−1) + ( Σ_{l=t}^{n} b^{(l)}_{jk} )² / (n−t+1) ] + O(1).

As β^{(i)}_{jk,0} = 0 and the first term is always nonnegative, its expectation is

(r/2σ²) [ (t−1)(σ²/r) / (t−1) + (n−t+1)(σ²/r) / (n−t+1) ] = 1.

This concludes the proof of the present case. Other cases can be handled similarly.
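The vanishing of this ratio can be sanity-checked numerically. The sketch below (the function name and parameter values are our own) evaluates the logarithm of the displayed ratio at the noise-free data b = 0 from the true model S⁰, where the exponential factor is exactly 1 and only the decaying prefactor remains.

```python
import math

def log_ratio_change_vs_null(b, r, c2, sigma2, t):
    """log of p(b | S^{*,*}(t)) / p(b | S^0), following the displayed formula."""
    n = len(b)
    s1 = sum(b[:t - 1])   # sum over the first t-1 coefficients
    s2 = sum(b[t - 1:])   # sum over the remaining n-t+1 coefficients
    pre = -0.5 * (math.log(c2 * r * (t - 1) + 1)
                  + math.log(c2 * r * (n - t + 1) + 1))
    expo = (r / (2 * sigma2)) * (c2 * s1 ** 2 / (c2 * (t - 1) + 1 / r)
                                 + c2 * s2 ** 2 / (c2 * (n - t + 1) + 1 / r))
    return pre + expo

b = [0.0] * 20            # noise-free data under the true model S^0
for r in (1e2, 1e4, 1e6):
    print(r, log_ratio_change_vs_null(b, r, c2=1.0, sigma2=1.0, t=5))
```

The printed log-ratios decrease without bound as r grows, matching the claim that the prefactor drives the ratio to zero.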

References

[1] Abramovich, F., Sapatinas, T. and Silverman, B. W. 1998. Wavelet thresholding via a Bayesian approach. Journal of the Royal Statistical Society, Series B. 60, 725–749.

[2] Abramovich, F., Amato, U. and Angelini, C. 2004. On optimality of Bayesian wavelet estimators. Scandinavian Journal of Statistics. 31, 217–234.

[3] Antoniadis, A., Brossat, X., Cugliari, J. and Poggi, J. 2013. Clustering functional data using wavelets. International Journal of Wavelets, Multiresolution and Information Processing. 11, 1350003–1350032.

[4] Aston, J. and Kirch, C. 2012. Detecting and estimating changes in dependent functional data. Journal of Multivariate Analysis. 109, 204–220.

[5] Aue, A., Gabrys, R., Horvath, L. and Kokoszka, P. 2009. Estimation of a change-point in the mean function of functional data. Journal of Multivariate Analysis. 100, 2254–2269.

[6] Aue, A., Rice, G. and Sonmez, O. 2018. Detecting and dating structural breaks in functional data without dimension reduction. Journal of the Royal Statistical Society, Series B. 80, 509–529.

[7] Belitser, E. N. and Ghosal, S. 2003. Adaptive Bayesian inference on the mean of an infinite-dimensional normal distribution. The Annals of Statistics. 31, 536–559.

[8] Berkes, I., Gabrys, R., Horvath, L. and Kokoszka, P. 2009. Detecting changes in the mean of functional observations. Journal of the Royal Statistical Society, Series B. 71, 927–946.

[9] Ghosal, S. and van der Vaart, A. 2017. Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press, Cambridge, UK.

[10] Hyndman, R. J. and Shang, H. L. 2009. Forecasting functional time series. Journal of the Korean Statistical Society. 38, 199–211.

[11] James, N. A. and Matteson, D. S. 2014. ecp: An R package for nonparametric multiple change point analysis of multivariate data. Journal of Statistical Software. 62.

[12] Matteson, D. S. and James, N. A. 2014. A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association. 109, 334–345.

[13] Lian, H. 2011. On posterior distribution of Bayesian wavelet thresholding. Journal of Statistical Planning and Inference. 141, 318–324.

[14] Sharipov, O., Tewes, J. and Wendler, M. 2016. Sequential block bootstrap in a Hilbert space with application to change point analysis. The Canadian Journal of Statistics. 44, 300–322.

[15] Suarez, A. and Ghosal, S. 2016. Bayesian clustering of functional data using local features. Bayesian Analysis. 11, 71–98.

[16] Zhang, X., Shao, X., Hayhoe, K. and Wuebbles, D. 2011. Testing the structural stability of temporally dependent functional observations and application to climate projections. Electronic Journal of Statistics. 5, 1765–1796.
