
  • TSPS: A Real-Time Time Series Prediction System

Anish Agarwal∗ MIT
[email protected]

Abdullah Alomar MIT
[email protected]

Devavrat Shah MIT

    [email protected]

    Abstract

A major bottleneck of the current Machine Learning (ML) workflow is the time consuming, error prone engineering required to get data from a datastore or a database (DB) to the point an ML algorithm can be applied to it. This is further exacerbated since ML algorithms are now trained on large volumes of data, yet we need predictions in real-time, especially in a variety of time-series applications such as finance and real-time control systems. Hence, we explore the feasibility of directly integrating prediction functionality on top of a data store or DB. Such a system ideally: (i) provides an intuitive prediction query interface which alleviates the unwieldy data engineering; (ii) provides state-of-the-art statistical accuracy while ensuring incremental model update, low model training time and low latency for making predictions. As the main contribution, we explicitly instantiate a proof-of-concept, TSPS†, which directly integrates with PostgreSQL. We rigorously test TSPS's statistical and computational performance against state-of-the-art time series algorithms, including a Long-Short-Term-Memory (LSTM) neural network and DeepAR (an industry standard deep learning library by Amazon). Statistically, on standard time series benchmarks, TSPS outperforms LSTM and DeepAR with 1.1-1.3x higher relative accuracy. Computationally, TSPS is 59-62x and 94-95x faster compared to LSTM and DeepAR in terms of median ML model training time and prediction query latency, respectively. Further, compared to PostgreSQL's bulk insert time and its SELECT query latency, TSPS is slower only by 1.3x and 2.6x, respectively. That is, TSPS is a real-time prediction system in that its model training / prediction query time is similar to just inserting / reading data from a DB. As an algorithmic contribution, we introduce an incremental multivariate matrix factorization based time series method, on which TSPS is built. We show this method allows one to produce reliable prediction intervals, thereby addressing an important problem in time series analysis.

∗ All authors are with the Massachusetts Institute of Technology during the course of this work. Their affiliations include the Department of EECS, LIDS, IDSS, the Statistics and Data Science Center, and CSAIL. Their email addresses: {anish90, aalomar, devavrat}@mit.edu.
† An open source implementation of TSPS is available at tspdb.mit.edu.

    Preprint. Under review.

arXiv:1903.07097v5 [cs.DB] 10 Jul 2020

  • Contents

1 Introduction
  1.1 Contributions
  1.2 Related Work
2 Prediction Interface
3 Incremental Matrix Factorization Based Time Series Algorithm
  3.1 Algorithm Description
  3.2 Incremental Variant of Mean and Variance Estimation Algorithms
  3.3 TSPS Implementation Details
    3.3.1 Prediction Model Storage in DB
    3.3.2 Answering a PREDICT query in TSPS
4 Experiments
  4.1 Setup
  4.2 Statistical Experiments
  4.3 Computational Experiments
5 Conclusion
A Algorithm Justification
  A.1 Justification for Mean Estimation Algorithm
  A.2 Justification for Variance Estimation Algorithm
B Additional Experimental Details
  B.1 Experiment Setup
    B.1.1 Datasets Description
    B.1.2 Machine and DB Configuration
    B.1.3 TSPS Hyperparameters
    B.1.4 Benchmarking Algorithms Parameters and Settings
    B.1.5 Accuracy Score Metrics
  B.2 Additional Statistical Experiments
    B.2.1 Mean Forecasting and Imputation Figures
    B.2.2 Details of Variance Estimation Experiments
    B.2.3 Robustness to Observation Models
    B.2.4 Prediction Gains – Multivariate vs. Univariate Algorithm
  B.3 Additional Computational Experiments
    B.3.1 Scalability of TSPS
    B.3.2 Hyper-parameter Tuning - T0, T′, γ


• 1 Introduction

Data Engineering: Major Bottleneck in Modern ML Workflow. An important goal in the Systems for ML community has been to make ML more broadly accessible [35]. Arguably, the major bottleneck towards this goal is not the lack of access to prediction algorithms, for which many excellent open-source ML libraries exist. Rather, it is the complex engineering and data processing required to take data from a datastore or database (DB) into a particular work environment format (e.g. a Spark data-frame) so that a prediction algorithm can be trained, and to do so in a scalable manner. See Figure 1 for a visual depiction of the ML workflow as it stands, and the ‘unwieldy’ data engineering we aim to alleviate by building TSPS, a system that directly integrates prediction functionality on top of a DB in real-time.

Figure 1: Pictorial depiction of data engineering – data transformation, feature extraction, model training/tuning – as a major bottleneck of the ML workflow. TSPS aims to alleviate it by direct integration of prediction functionality on top of a DB.

The Goal. Towards easing this bottleneck, and increasing the accessibility of ML, our objective is to explore the feasibility of directly integrating a prediction system on top of the DB layer. This is in line with recent efforts to “democratize” ML [41, 40]. Indeed, the authors in [4] make a compelling case for the potential gains to be had by direct integration of predictive functionality on top of DBs – by drawing an analogy with the now scalable, robust and mature data management capabilities of DBs, they argue that a similar approach to the management of predictive models can lead to large improvements in both the computational performance and the accessibility of ML. In this work, we focus on multivariate time series data, that is, data where each unit (e.g. a collection of stocks, temperature readings from a collection of sensors) is associated with a time stamp. We do so as this is one of the primary ways data is structured in a range of applications including finance, healthcare and retail, to name a few; further, real-time predictions are crucial in many of these applications. We consider two types of prediction tasks for time series data: (i) imputing a missing or noisy observation for a time series data point we do observe; (ii) forecasting a time series data point in the future (i.e., data that is as of yet unobserved).

Table 1: Summary of the statistical and computational performance of TSPS vs. other state-of-the-art time series algorithms. We use a weighted variant of the standard Borda Count (WBC) to evaluate statistical performance (precise definition in Appendix B.1.5). WBC takes values in [0, 1]: 0.5 indicates performance equal to all other algorithms on average; 1 (resp. 0) indicates “infinitely” superior (resp. inferior) performance compared to all others.

Metric                                   | TSPS  | LSTM  | DeepAR | TRMF   | Prophet | PostgreSQL
Statistical Accuracy (WBC)
  Mean Imputation                        | 0.572 | N/A   | N/A    | 0.428  | N/A     | N/A
  Mean Forecasting                       | 0.597 | 0.446 | 0.559  | 0.541  | 0.358   | N/A
  Variance Imputation                    | 0.597 | N/A   | N/A    | 0.403‡ | N/A     | N/A
  Variance Forecasting                   | 0.726 | N/A   | 0.274  | N/A    | N/A     | N/A
Computational Performance
  Training/Insert Time (seconds)         | 9.00  | 556.1 | 531.7  | 13.6   | 3445.9  | 6.64
  Prediction/Querying Time (ms)          | 3.45  | 325.2 | 326.2  | 5.28   | 1466.1  | 1.32
Functionalities
  Multivariate Time Series               | Yes   | Yes   | Yes    | Yes    | No      | N/A
  Variance Estimation                    | Yes   | No    | Yes    | No     | No      | N/A
  Working with Missing Values            | Yes   | No    | Yes    | Yes    | Yes     | N/A

    1.1 Contributions.

The main contribution of this work is TSPS, a real-time prediction system for time series data, which aims to alleviate the data engineering bottleneck by providing direct access to predictive functionality for time series data. As explained in detail in Section 2, TSPS abstracts away the ML algorithm from an end user. Therefore, it is essential to ensure that it achieves comparable statistical accuracy to state-of-the-art time series prediction methods for a wide variety of time series generating processes, is robust to noisy, missing data, and works for multivariate time series data. Further, as described in detail in [35], given the growing demand to embed ML functionality in high-performance systems (e.g. financial systems, control systems), it is increasingly vital that we evaluate ML algorithms/systems not just through prediction accuracy, but in addition through: (i) the time it takes to train an ML model; (ii) the time it takes to answer a prediction query, i.e., prediction query latency.

‡ TRMF does not natively support variance imputation. We use an adapted version of TRMF since, to the best of our knowledge, no such method exists in the literature. Refer to Appendix B.1.4 for details.

We find TSPS outperforms all the existing time series libraries with respect to statistical performance, speed of training the ML model and latency in making a prediction. See Table 1 for a summary of the performance of TSPS, and the other time series algorithms we compare against, along these metrics.

Novel Incremental Matrix-Factorization Based Multivariate Time Series Algorithm. We achieve this statistical and computational performance by developing an incremental multivariate matrix factorization based time series method, and build TSPS using a scalable implementation of it. Details of the method can be found in Section 3. The algorithm we implement is a generalization of the method proposed in [2]. In particular, it extends the method in [2]: (i) to provide predictions for multivariate time series data, which performs significantly better than the original univariate algorithm; (ii) to estimate the time-varying variance (a la GARCH-like models) of a multivariate time series, thereby enabling one to produce prediction intervals (i.e., estimating the volatility) for the predictions made; (iii) to be incremental. The method can be viewed as a variant of multivariate Singular Spectrum Analysis (mSSA). Refer to Section 1.2 for a detailed comparison with [2] and mSSA. An important feature of the proposed algorithm is that it is “noise agnostic”, i.e., it effectively imputes and forecasts both the time-varying mean and variance regardless of the observation model (e.g. Gaussian, Poisson, or Binomial). Using this algorithm, one can verify whether or not a set of integer observations is generated from a Poisson process (cf. [13, 26, 8, 24]) by estimating the latent mean and variance, as shown in Figure 2.

Figure 2: With no knowledge of the distribution, TSPS can help verify whether integer observations are generated from a Poisson process (Figure 2a) by checking if the latent mean and variance are equal, or different (e.g. Binomial in Figure 2b).

Theoretical Justification. Given our focus on building a real-time time series prediction system, and comprehensively testing its statistical and computational performance, a rigorous theoretical justification proving why the algorithm we propose works is beyond the scope of this paper. In a recent work, [1], we do a finite-sample analysis of the imputation and forecasting prediction error of the non-incremental version of this algorithm. In Appendix A, we provide an intuitive formal justification which motivates the use of the stacked Page matrix, the core data structure induced from the collection of time series that we utilize (see Figure 4a).

Superior Statistical Accuracy. In Section 4.2, we verify TSPS's statistical accuracy by testing its performance on real-world benchmarks (e.g. those utilized in [47]) and synthetic datasets. We compare TSPS's accuracy against state-of-the-art algorithms, including LSTMs [16], DeepAR (by Amazon) [37], Temporal Regularized Matrix Factorization (TRMF) [47] and Prophet [14] (by Facebook). We measure the imputation and forecasting prediction accuracy (for both mean and variance) of these various algorithms as we vary the level of noise, increase the fraction of missing data, and change the observation model. We use the ‘Weighted Borda Count’ (WBC) as our statistical accuracy metric. This score is based on pairwise comparisons of the normalized root mean squared error (NRMSE) across experiments; we use it as it better captures the relative performance across all experiments and methods, unlike a simple statistic such as the mean or the median.§ Refer to Appendix B.1.5 for the definitions of WBC and NRMSE. On standard time series benchmarks, using WBC as our metric, we find TSPS outperforms all other methods in both mean imputation and forecasting. We highlight that TSPS's mean WBC score is 1.1-1.3x higher compared to LSTM and DeepAR, some of the most commonly used modern neural network based time series algorithms. For variance estimation (which we use to produce prediction intervals as described in Section 3.1), we again find TSPS outperforms all other methods – it is important to note that existing methods do not provide variance estimation; instead, we adapt these other algorithms to be able to make such comparisons, e.g. we use an adapted version of TRMF for imputation and DeepAR's in-built uncertainty quantification functionality for variance forecasting – i.e. despite providing an ‘unfair’ advantage to existing approaches, TSPS still manages to outperform them. In short, TSPS achieves state-of-the-art statistical performance.

Real-time model training, low-latency predictions. As stated earlier, the two metrics we use to measure the computational performance of TSPS are: ML model training time and prediction query latency (in addition to statistical accuracy). In Section 4.3, we test how TSPS compares against the popular, open-source time series prediction libraries stated above – LSTM, DeepAR, TRMF, and Prophet. With respect to both ML model training time and prediction query latency, TSPS outperforms all other prediction algorithms we compare against. We highlight that compared to LSTM and DeepAR, TSPS's median ML model training time and prediction latency are 59-62x and 94-95x quicker, respectively. See Table 1 for a summary of the computational performance of TSPS compared to the other methods. In conclusion, TSPS achieves both state-of-the-art statistical and computational performance.

TSPS vis-a-vis PostgreSQL DB. As a further test, we compare the computational performance of TSPS versus the standard DB index of PostgreSQL. On standard multivariate time series datasets, we find that the time required to build a prediction model in TSPS ranges from 0.58x-1.52x that of PostgreSQL's bulk insert time. With respect to query latency, we find that the time taken to answer a PREDICT query in TSPS is 1.6 to 2.8 times higher compared to answering a standard SELECT query (1.6-2.7x for imputation queries, 1.7-2.8x for forecasting queries). In absolute terms, this translates to about 1.32 milliseconds to answer a SELECT query and 3.36/3.45 milliseconds to answer an imputation/forecasting PREDICT query (see Appendix B.1.2 for the machine configuration used). Hence, TSPS's computational performance is close to the time it takes to just insert and read data from PostgreSQL, making it a real-time prediction system.

    1.2 Related Work

Predictive Modeling and DBs. As stated earlier, our work is in line with exciting recent efforts, especially in industry, exploring the potential of DB management systems to support ML algorithms [32, 30, 22, 33, 28, 21, 20, 29]. In a complementary vein, there has been a line of work to automate the process of choosing an optimal ML workload for a given task [41, 40]. There has also been a line of work to exploit DBs to make certain key computationally expensive steps that are pervasive in training ML algorithms much more efficient [34, 42, 12, 9, 25]. Further, there have been attempts to provide a declarative SQL-like language for prediction [39, 20, 4]. TSPS is very much in line with these efforts. However, crucially, rather than facilitating the use of a variety of ML algorithms or sub-routines, we take the stance of abstracting the entire ML algorithm from the user and providing a single interface to answer both standard DB queries and predictive queries. This is in line with [41, 40]. Our focus is specifically on prediction with time series data, an important part of ML with crucial applications that has not yet been directly tackled. To justify abstracting the ML algorithm from the user, we comprehensively test both the statistical and computational performance of TSPS compared to state-of-the-art alternatives, and show promising results. We note that just the first prediction task, imputation, by itself has received considerable attention in the systems for ML community. For example, [23] conducts a thorough empirical evaluation of twelve prominent imputation techniques, and [10, 5, 27, 36] explore systems which directly integrate imputation of missing values as an in-built functionality. In addition to imputation, TSPS also adds forecasting functionality. And, importantly, it does both with uncertainty quantification.

Modern Multivariate Time Series Algorithms. Given the ubiquity of multivariate time series analysis, we cannot possibly do justice to the entire literature. Hence we focus on a few techniques most relevant to compare against empirically. Recently, with the advent of deep learning, neural network (NN) based approaches have been the most popular, and empirically effective. Some industry standard NN methods include LSTMs and DeepAR (for an industry leading NN library for time series analysis, see [37]).

§ In fact, as evidenced in Table 2, TSPS has even better statistical performance if we use the mean NRMSE score, as is standard practice.

Matrix Factorization Based Time Series Algorithms. The line of time series analysis most closely associated with the algorithm we propose is where a matrix is constructed from a time series, and some form of matrix factorization is done (see [45, 47]). A particularly relevant algorithm is Singular Spectrum Analysis (SSA) (see [17]). The main steps of SSA are: Step 1 - create a Hankel matrix from univariate time series data; Step 2 - do a Singular Value Decomposition (SVD) of it; Step 3 - group the singular values based on user belief of the model that generated the process; Step 4 - perform diagonal averaging to “Hankelize” the grouped rank-1 matrices outputted from the SVD to create a collection of time series; Step 5 - learn a linear model for each “Hankelized” time series for the purpose of forecasting. A generalization of SSA to multivariate time series data is called multivariate SSA (mSSA); here the only change is in Step 1, where a Hankel matrix is constructed for each time series and the various matrices are concatenated column-wise to create a “stacked” Hankel matrix [19, 18].

In [2], the authors show that a simple variant of SSA, where close to no hyper-parameter tuning is required, is both theoretically and empirically effective. They show that using the Page matrix instead of the Hankel matrix (see Section 3.1 for the definition), doing a single SVD on it, and subsequently learning a single linear forecaster suffices. This is in contrast to the user having to specify how the singular values need to be grouped in the original SSA method.

The multivariate time series algorithm we introduce is a variant of mSSA and a generalization of the univariate time series algorithm proposed in [2] – inspired by mSSA, the core data structure for multivariate time series data we analyze is the “stacked” Page matrix (the matrix induced by column-wise concatenation of the Page matrices constructed for each time series); in line with [2], we show that even for multivariate time series data, surprisingly, one only needs to do a single SVD and learn a single linear forecaster to produce effective predictions. See Appendix A for a theoretical justification of why our proposed variant of mSSA works, something which importantly has been absent from the literature.

Another popular method in the literature is TRMF (see [47]), one we directly compare against due to its popularity and empirical effectiveness for both imputation and forecasting.

Time-Varying Variance Estimation. The time-varying variance is a key input parameter in many sequential prediction algorithms. For example, in control systems, the widely used Kalman Filter uses an estimate of the per-step variance for both filtering and smoothing. Similarly, in finance, the time-varying variance of each financial instrument in a portfolio is necessary for risk-adjustment. The key challenge in estimating the variance of a time series (which itself might very well be time-varying) is that, unlike the actual time series itself, we do not get to directly observe the variance. Despite the vast time series literature, existing algorithms to estimate time-varying variance are mostly heuristics and/or make restrictive, parametric assumptions about how the variance (and the underlying mean) evolves; for example, “Auto Regressive Conditional Heteroskedasticity” (ARCH) and “Generalized Auto Regressive Conditional Heteroskedasticity” (GARCH) models. See [6].

2 Prediction Interface

An important goal of this work is to design an ‘easy-to-use’ interface to make predictions, which: (i) alleviates the need for error-prone data engineering; (ii) directly gives access to predictive functionality in real-time for time series data. In particular, we wish to provide access to imputation and forecasting functionality for multivariate time series data.

Towards this goal, there has been considerable recent work, especially in industry, exploring the feasibility of DBs to automate and natively support ML workloads [32, 30, 22, 33, 28, 21]. The focus of these works has been to expose an interface that allows users to select from an array of ML algorithms (e.g. generalized linear models, random forests, neural networks) and train them (plus tune hyper-parameters) in the DB itself.

Prediction Query. In contrast, a significant point of departure of TSPS is how an end user interfaces with the system to produce predictions. In particular, to make ML more broadly accessible, we take a different approach and abstract the ML model from the user, and instead strive for a single interface to answer both standard DB queries and predictive queries. In particular, in TSPS, a predictive query has the same form as a standard SELECT query. The key difference is that in a predictive query, the response is a prediction, rather than a retrieval of available data. For example, consider a relation of the stock price of three companies over 100 days: stock(day:timestamp, company1:float, company2:float, company3:float). An example of a predictive query in this setting is shown in Figure 3a. Note, day 101 is a forecast, i.e., a prediction, since the DB only contains data for the first 100 days. Similarly, if the query is changed to WHERE day = 10, then we get the imputed (or de-noised) value for the stock price on the tenth day. In this case, the PREDICT query response differs from a standard SELECT query: rather than simply retrieving the available data for that day, which may possibly even be missing, a predicted de-noised value is returned.

PREDICT company1
FROM stock
with PREDICTION_INTERVAL = 95%
WHERE day = 101;

(a) A PREDICT query in TSPS.

Create PREDICTION_MODEL
ON stock(company1, company2, company3)
with day as time_column;

(b) Building a prediction model in TSPS.

Figure 3

Building a Prediction Model in DB. To enable PREDICT queries, we need to build a prediction model using the available multivariate time series data. Continuing the example from above, a user can build a prediction model in TSPS as shown in Figure 3b. Note, importantly, the prediction model can be trained over multiple time series columns, i.e., it is built by simultaneously using data from multiple time series, in this case the stock prices for the three companies. Thus, building and making a PREDICT query in TSPS is very much like defining a schema for storing data in a DB and then making a SELECT query.

TSPS: Open-Source Implementation. As our main contribution, we explicitly instantiate a proof-of-concept, TSPS, which directly integrates with PostgreSQL and supports the PREDICT query functionality described earlier. TSPS has an open-source implementation and is an extension of PostgreSQL; it allows users to create prediction queries over both single columns or multiple columns of a time series relation (i.e., it allows for both univariate and multivariate time series prediction). Note, all results presented (e.g., in Table 1) can be reproduced using our open-source implementation with default settings.
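To make the interface concrete, the following is a minimal sketch of how an application might issue the queries of Figure 3 from Python through a standard PostgreSQL driver (psycopg2). The SQL statements are those shown above; the connection parameters and the shape of the returned row are illustrative assumptions, and the sketch presumes a PostgreSQL instance with the TSPS extension installed.

# Minimal sketch (assumptions noted above): issue TSPS queries via psycopg2.
import psycopg2

conn = psycopg2.connect(dbname="mydb", user="user", password="secret", host="localhost")
cur = conn.cursor()

# Build a prediction model over the three stock columns (cf. Figure 3b).
cur.execute("""
    Create PREDICTION_MODEL
    ON stock(company1, company2, company3)
    with day as time_column;
""")

# Ask for a forecast with a 95% prediction interval (cf. Figure 3a).
cur.execute("""
    PREDICT company1
    FROM stock
    with PREDICTION_INTERVAL = 95%
    WHERE day = 101;
""")
print(cur.fetchall())   # e.g. a row with the prediction and its interval bounds

conn.commit()
cur.close()
conn.close()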

    3 Incremental Matrix Factorization Based Time Series Algorithm

In this section, we describe our novel incremental algorithm for multivariate time series. Section 3.1 describes the batch version of the mean and variance estimation algorithms; Section 3.2 describes the incremental variant of the algorithm used in TSPS. In Section 3.3, the system implementation of TSPS's direct integration with PostgreSQL is detailed. In Appendix A, we provide some intuitive formal justification for why such an algorithm should work.

    3.1 Algorithm Description

Multivariate Time Series Notation. We begin with some brief notation necessary to describe the algorithm. Without loss of generality, we consider a discrete time index t ∈ Z.‖ Say we aim to train a prediction model using N time series, and for each time series n ∈ [N] := {1, . . . , N}, we have access to T ∈ N observations (the observations might be missing). Let X_n(t) denote the (potentially noisy, missing) observation of the n-th time series, X_n, at the t-th time step. For any t, s ∈ [T] such that t < s, let X_n(t : s) := [X_n(t), . . . , X_n(s)]. Let X denote the collection of N time series X_1, . . . , X_N.

Page Matrix Data Representation. The key data representation utilized by the proposed algorithm is the “stacked” Page matrix. Let integers L, P ≥ 1 be such that P = ⌊T/L⌋. Let P̄ := N × P. Then the stacked Page matrix Z^X ∈ R^{L×P̄} induced by X is defined as

Z^X_{i, j+P(n−1)} = X_n(i + (j − 1)L),   for i ∈ [L], j ∈ [P], n ∈ [N].

‖ This is not restrictive as TSPS supports prediction tasks on time series observed at different frequencies. It does so by allowing the user to pick an aggregation interval (e.g. second, minute, hour) and an aggregation function (e.g. mean, min, max).


• Further, let Z^{X²} ∈ R^{L×P̄} denote the stacked Page matrix of squared observations,

Z^{X²}_{i, j+P(n−1)} = X_n²(i + (j − 1)L).
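As an illustration of this data representation, the following sketch builds the stacked Page matrix with numpy from N fully observed series of length T; the function name is ours, and each series is truncated to its first P·L entries.

# Sketch: build the stacked Page matrix Z^X (and, from squared data, Z^{X^2}).
import numpy as np

def stacked_page_matrix(X, L):
    # X: array of shape (N, T); returns an L x (N*P) matrix with P = T // L.
    X = np.asarray(X, dtype=float)
    N, T = X.shape
    P = T // L
    Z = np.empty((L, N * P))
    for n in range(N):
        # Column j of series n holds X_n((j-1)L + 1 : jL); reshape performs this slicing.
        Z[:, n * P:(n + 1) * P] = X[n, :P * L].reshape(P, L).T
    return Z

# Usage: Z_X for the observations, Z_X2 for their squares (used for variance estimation).
# Z_X, Z_X2 = stacked_page_matrix(X_obs, L=30), stacked_page_matrix(X_obs**2, L=30)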

Mean Estimation Algorithm. Figure 4a provides a visual depiction of the core steps of the algorithm.

Imputation. The key steps of mean imputation are as follows.

    (a) Key steps of mean estimation algorithm.

    (b) Incremental variant of proposed algorithm.

    Figure 4: Pictorial depiction of batch mean estimation and its incremental variant.

1. (Form Page Matrix) Transform X_1(1 : T), . . . , X_N(1 : T) into the stacked Page matrix Z^X ∈ R^{L×P̄} with L ≤ P̄. Fill all missing entries in the matrix with 0.
2. (Singular Value Thresholding) Let the SVD of Z^X be U S V^T, where U ∈ R^{L×L}, V ∈ R^{P̄×L} hold the left and right singular vectors and S = diag(s_1, . . . , s_L) is the diagonal matrix of singular values s_1 ≥ . . . ≥ s_L ≥ 0. Obtain M̂^f = U S_k V^T by setting all but the top k singular values to 0, i.e. S_k = diag(s_1, . . . , s_k, 0, . . . , 0) for some k ∈ [L].
3. (Output) f̂_n(i + (j − 1)L) := M̂^f_{i, j+P(n−1)}, for i ∈ [L], j ∈ [P].
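A compact numpy sketch of these imputation steps follows; here the rank k is taken as an input, whereas TSPS chooses it automatically (Appendix B.1.3).

# Sketch of mean imputation: zero-fill missing entries, hard-threshold the SVD.
import numpy as np

def impute_mean(Z, k):
    # Z: stacked Page matrix (L x Pbar) with np.nan marking missing entries.
    Z0 = np.nan_to_num(Z, nan=0.0)            # Step 1: fill missing entries with 0
    U, s, Vt = np.linalg.svd(Z0, full_matrices=False)
    s[k:] = 0.0                               # Step 2: keep only the top-k singular values
    M_hat = (U * s) @ Vt                      # de-noised matrix M̂^f
    return M_hat                              # Step 3: read f̂_n off the entries of M̂^f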

Forecasting. Forecasting includes an additional step of fitting a linear model on the de-noised matrix.
1. (Form Sub-Matrices) Let Z̃^X ∈ R^{(L−1)×P̄} be the sub-matrix of Z^X obtained by removing its last row. Let Z^X_L denote the last row.
2. (Singular Value Thresholding) Let the SVD of Z̃^X be Ũ S̃ Ṽ^T, where Ũ ∈ R^{(L−1)×(L−1)}, Ṽ ∈ R^{P̄×(L−1)} hold the left and right singular vectors and S̃ = diag(s̃_1, . . . , s̃_{L−1}) is the diagonal matrix of singular values s̃_1 ≥ . . . ≥ s̃_{L−1} ≥ 0. Obtain the de-noised sub-matrix M̃^f = Ũ S̃_k Ṽ^T by setting all but the top k singular values to 0, i.e. S̃_k = diag(s̃_1, . . . , s̃_k, 0, . . . , 0) for some k ∈ [L − 1].
3. (Linear Regression) β̂ = argmin_{b ∈ R^{L−1}} ‖Z^X_L − (M̃^f)^T b‖²₂.
4. (Output) f̂_n(T + 1) := X_n(T − (L − 1) : T)^T β̂.

Variance Estimation Algorithm. For variance estimation, the mean estimation algorithm is run twice, once on Z^X and once on Z^{X²}. Then, to estimate the time-varying variance (for both forecasting and imputation), a simple post-processing step is done where the square of the estimate produced from running the algorithm on Z^X is subtracted from the estimate produced from Z^{X²}. Precise details follow.
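The forecasting fit can be sketched analogously: drop the last row, threshold the SVD, and regress the last row on the de-noised sub-matrix via a least-squares solve. This is a self-contained sketch, not the exact open-source implementation.

# Sketch of the forecasting model fit (Steps 1-4 above).
import numpy as np

def fit_forecaster(Z, k):
    # Z: stacked Page matrix (L x Pbar), np.nan for missing entries.
    Z0 = np.nan_to_num(Z, nan=0.0)
    Z_sub, z_last = Z0[:-1, :], Z0[-1, :]     # Step 1: remove the last row, keep it as target
    U, s, Vt = np.linalg.svd(Z_sub, full_matrices=False)
    s[k:] = 0.0                               # Step 2: singular value thresholding
    M_tilde = (U * s) @ Vt                    # de-noised (L-1) x Pbar sub-matrix
    beta, *_ = np.linalg.lstsq(M_tilde.T, z_last, rcond=None)   # Step 3: linear regression
    return beta                               # Step 4: f̂_n(T+1) = X_n(T-L+2 : T) · β̂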


• Imputation. Below are the steps of variance imputation.
1. (Impute Z^X, Z^{X²}) Use the mean imputation algorithm on X_1(1 : T), . . . , X_N(1 : T) (i.e., Z^X) and on X_1²(1 : T), . . . , X_N²(1 : T) (i.e., Z^{X²}) to produce the de-noised Page matrices M̂^f and M̂^{f²+σ²}, respectively.
2. (Output) Construct M̂^{f²} ∈ R^{L×P̄}, where M̂^{f²}_{ij} := (M̂^f_{ij})², and produce the estimates σ̂²_n(i + (j − 1)L) := M̂^{f²+σ²}_{i, j+P(n−1)} − M̂^{f²}_{i, j+P(n−1)}, for i ∈ [L], j ∈ [P].

Forecasting. Like mean forecasting, it involves an additional step after imputation.
1. (Forecast with Z̃^X, Z^X_L, Z̃^{X²}, Z^{X²}_L) Use the mean forecasting algorithm on X_1(1 : T), . . . , X_N(1 : T) and on X_1²(1 : T), . . . , X_N²(1 : T) to produce the forecast estimates f̂_n(T + 1) and (f_n² + σ_n²)ˆ(T + 1), respectively, where (f_n² + σ_n²)ˆ denotes the estimate obtained from the squared observations.
2. (Output) Produce the variance estimate σ̂²_n(T + 1) := (f_n² + σ_n²)ˆ(T + 1) − (f̂_n(T + 1))².
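Given the two mean-estimation runs, the post-processing is a one-line subtraction; the sketch below also clips at zero (as is done when answering queries in Section 3.3.2). The helper name is ours.

# Sketch of variance imputation: de-noise Z^X and Z^{X^2}, then subtract.
import numpy as np

def _denoise(Z, k):
    # Truncated-SVD de-noising, as in the mean imputation sketch above.
    U, s, Vt = np.linalg.svd(np.nan_to_num(Z, nan=0.0), full_matrices=False)
    s[k:] = 0.0
    return (U * s) @ Vt

def impute_variance(Z, Z_sq, k1, k2):
    M_f = _denoise(Z, k1)                     # estimate of the latent mean f
    M_f2_sigma2 = _denoise(Z_sq, k2)          # estimate of f^2 + sigma^2
    return np.maximum(M_f2_sigma2 - M_f ** 2, 0.0)   # entrywise variance, clipped at 0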

    3.2 Incremental Variant of Mean and Variance Estimation Algorithms

The algorithms for mean and variance estimation described above, as written, are meant for batch updating (i.e., they get to observe all data at once). However, for computational efficiency in any real-world application, the prediction model needs to be trained and updated incrementally, which in turn requires making these algorithms incremental. In particular, both the computational efficiency and the statistical accuracy should not degrade with the volume of data inserted. The key computationally expensive step in the proposed algorithm is computing the SVD of the relevant Page matrices. We address this by proposing a simple incremental “meta”-algorithm, which has three hyper-parameters: T_0, T′ ∈ Z and γ ∈ (0, 1]. Let t′ := N × t denote the number of observations seen thus far. Depending on t′, there are three scenarios:

Case 1 (t′ < T_0): Given that the minimum observation count has not been met, the model outputs the average of the observations X_n(0 : t) (for both forecasting and imputation).

Case 2 (T_0 ≤ t′ ≤ T′):
1. If t′ = ⌊T_0(1 + γ)^ℓ⌋ for some 0 ≤ ℓ ≤ q_0, where q_0 = ⌊ln(T′/T_0) / ln(1 + γ)⌋: re-train M_0 using X_1(0 : t), . . . , X_N(0 : t) (i.e., do a full batch SVD).
2. Else: incrementally update the model M_0 (i.e., do an incremental SVD using the method in [48]).

Case 3 (t′ > T′):
1. Identify M_i where i = i(t) = max(0, ⌊2t′/T′⌋ − 1), and denote the first time index of M_i as s_i = iT′/2.
2. If (t′ − s_i) = ⌊T_0(1 + γ)^ℓ⌋ for some 0 ≤ ℓ ≤ q, where q = ⌊ln(2) / ln(1 + γ)⌋: re-train M_i with observations X_1(s_i : t), . . . , X_N(s_i : t).
3. Else: incrementally update M_i.
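For concreteness, the retraining schedule above can be sketched as a small predicate that, given t′ and the hyper-parameters, decides between a full re-train and an incremental SVD update; the function name and return convention are ours.

# Sketch of the re-train / incremental-update decision for a new batch of observations.
# Returns None for Case 1, True for "full batch re-train", False for "incremental update".
import math

def should_retrain(t_prime, T0, T_prime, gamma):
    if t_prime < T0:                          # Case 1: too little data, fall back to averaging
        return None
    if t_prime <= T_prime:                    # Case 2: model M_0 over all data so far
        start, q = 0, math.floor(math.log(T_prime / T0) / math.log(1 + gamma))
    else:                                     # Case 3: model M_i over a segment starting at s_i
        i = max(0, math.floor(2 * t_prime / T_prime) - 1)
        start, q = i * T_prime // 2, math.floor(math.log(2) / math.log(1 + gamma))
    # Re-train only at the geometrically spaced checkpoints floor(T0 * (1+gamma)^l).
    checkpoints = {math.floor(T0 * (1 + gamma) ** l) for l in range(q + 1)}
    return (t_prime - start) in checkpoints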

Figure 4b illustrates how the proposed segmentation is carried out, and at what points the model is trained fully and where it is incrementally updated. Note, the incremental SVD method we use was developed in the Latent Semantic Indexing literature (see [48] for details).

TSPS Algorithm Hyper-parameters. The hyper-parameters of TSPS are L, k, k1, k2, T0, T′, γ. For all reported results in Section 4, and in our open-source implementation, these are set in an automated manner, as detailed in Appendix B.1.3.

    3.3 TSPS Implementation Details

    3.3.1 Prediction Model Storage in DB

To directly integrate TSPS with PostgreSQL, we store all model parameters in PostgreSQL itself. For each model, the algorithms require storing: (i) the parameters associated with the appropriate truncated SVD (left/right singular vectors, and singular values); (ii) the regression coefficients. Importantly, we choose to store all of these parameters in the standard relational DB; thus, in response to a query, the DB itself is queried to make predictions. The specific schema used to store model parameters, and the relationships between them, is depicted in Figure 5 (for the case k1 = k2 = 3). Note, the parameters associated with the variance estimation models are stored in an identical manner.


• For a forecasting predictive query, we average the values of the linear regression coefficients stored across models. Hence, we create a standard materialized view of the coefficient table to precompute the average weights of the last few models, for efficient querying.

Figure 5: The schema used to store the prediction models. Note that bold attributes represent the primary keys, while underlined attributes are columns indexed by a B-tree. The columns (u1, v1, s1, . . . , uk1, vk1, sk1) correspond to Z^X and the columns (uf1, vf1, sf1, . . . , ufk2, vfk2, sfk2) correspond to Z̃^X. Refer back to Section 3.1 for the definitions of Z^X and Z̃^X.
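As a rough illustration of this storage choice only: the exact schema is the one in Figure 5, and the column types, key choices and per-row layout below are guesses of ours, not the implemented DDL. The table names U_table, V_table and s_table are those referenced in Section 3.3.2.

# Hypothetical sketch of the singular-vector parameter tables (k1 = 3), issued through
# the psycopg2 cursor `cur` from the Section 2 sketch. Names/types are illustrative.
ddl = """
CREATE TABLE U_table (model_id INT, row_id INT, u1 FLOAT, u2 FLOAT, u3 FLOAT,
                      PRIMARY KEY (model_id, row_id));
CREATE TABLE V_table (model_id INT, col_id INT, v1 FLOAT, v2 FLOAT, v3 FLOAT,
                      PRIMARY KEY (model_id, col_id));
CREATE TABLE s_table (model_id INT, s1 FLOAT, s2 FLOAT, s3 FLOAT,
                      PRIMARY KEY (model_id));
"""
cur.execute(ddl)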

    3.3.2 Answering a PREDICT query in TSPS

Recall the two prediction tasks supported in TSPS are imputation and forecasting. For both tasks, the system needs to provide a response comprising both the estimated mean and the associated prediction interval (upper and lower bound). In particular, the atomic response boils down to providing an estimate of the mean of the time series at a given t with a prediction interval of c%, for c ∈ (0, 100). This response is effectively constructed by estimating the mean f̂_n(t) and the standard deviation σ̂_n(t).

Creating a Prediction Interval. Using a Gaussian approximation, the c% prediction interval is given by

[ f̂_n(t) − σ̂_n(t) Φ⁻¹(1/2 + c/200),  f̂_n(t) + σ̂_n(t) Φ⁻¹(1/2 + c/200) ],

where Φ : R → [0, 1] denotes the cumulative distribution function of the standard Normal distribution with mean 0 and variance 1. One could alternatively utilize Chebyshev's inequality to obtain a more conservative answer:

[ f̂_n(t) − σ̂_n(t)/√(1 − c/100),  f̂_n(t) + σ̂_n(t)/√(1 − c/100) ].

Either can be specified in TSPS.
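Both interval constructions are elementary to compute; a short sketch using scipy's normal quantile function follows (the function name and default confidence are ours).

# Sketch: c% prediction interval around an imputed/forecasted mean.
import math
from scipy.stats import norm

def prediction_interval(f_hat, sigma_hat, c=95.0, method="gaussian"):
    if method == "gaussian":
        z = norm.ppf(0.5 + c / 200.0)          # Phi^{-1}(1/2 + c/200)
        half_width = sigma_hat * z
    else:                                      # Chebyshev: more conservative
        half_width = sigma_hat / math.sqrt(1.0 - c / 100.0)
    return f_hat - half_width, f_hat + half_width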

Answering Predictive Queries. We now describe how a PREDICT query in TSPS for a given t with a c% prediction interval is answered. To start with, we determine whether it is an imputation task, i.e. t ∈ [T], or a forecasting task, i.e. t > T. For each of these cases, we respond as follows.

Imputation: f̂_n(t), t ∈ [T], n ∈ [N].
1. (Find Sub-Model) Let i = i(t) = max(0, ⌊2tN/T′⌋ − 1). If i = 0 and t < T′/(2N), use I(t) = {0}; else I(t) = {i(t), i(t) + 1}.
2. (Find Row, Column Indices) Let t_row(j) = (t − jT′/(2N)) mod L and t_col(j) = N⌊(t − jT′/2)/L⌋ + (n − 1) be the row and column indices of the matrix corresponding to the model with index j ∈ I(t).
3. (Find Truncated SVD for Mean) Query the left and right singular vectors U^j, V^j from U_table and V_table, respectively, for the model corresponding to the mean values with index j ∈ I(t), along with the singular values S^j from s_table.
4. (Produce Mean Estimate) Set f̂_n(t) = (1/|I(t)|) Σ_{j ∈ I(t)} Σ_k U^j_{t_row(j), k} V^j_{t_col(j), k} S^j_k.
5. (Find Truncated SVD for Second Moment) Query the left and right singular vectors Ũ^j, Ṽ^j from U_table and V_table, respectively, for the model corresponding to the second moment with index j ∈ I(t), along with the singular values S̃^j from s_table.
6. (Produce Variance Estimate) Set σ̂²_n(t) = max(0, (f_n² + σ_n²)ˆ(t) − f̂_n(t)²), where (f_n² + σ_n²)ˆ(t) = (1/|I(t)|) Σ_{j ∈ I(t)} Σ_k Ũ^j_{t_row(j), k} Ṽ^j_{t_col(j), k} S̃^j_k.
7. (Output Prediction Interval) Output the prediction interval using f̂_n(t), σ̂_n(t) for the queried confidence c%.

Note that answering an imputation query for a given range {t_1, . . . , t_2} follows a similar procedure, except that it will potentially query multiple rows from V_table, U_table, and s_table.
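Steps 4 and 6 amount to a small dot product over the stored factors; a sketch follows, with the factors assumed to have already been fetched from U_table, V_table and s_table into numpy arrays.

# Sketch: reconstruct one de-noised entry from the stored truncated SVD of sub-model j.
import numpy as np

def reconstruct_entry(U_row, V_row, s):
    # U_row: the k values from U_table at index t_row(j); V_row: the k values from
    # V_table at index t_col(j); s: the k singular values from s_table.
    return float(np.sum(U_row * V_row * s))

# f̂_n(t) is then the average of reconstruct_entry(...) over the sub-models j in I(t).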

Forecasting: f̂_n(t), t > T, n ∈ [N].
1. (Retrieve History) Query the last L − 1 observations X_n(T − L + 1 : T).
2. For t_0 ∈ {T − L + 1, . . . , T}, set g^m_n(t_0) = X_n(t_0) if X_n(t_0) is not missing; otherwise set it to zero. Similarly, set g^v_n(t_0) = X_n²(t_0).
3. (Obtain Coefficients) Obtain the averaged coefficients from the materialized view (e.g. the average of the last 10 models) for the mean, β̂^m, and the variance, β̂^v.
4. (Sequential Forecasting) For τ ∈ {T + 1, . . . , t}, produce estimates of the mean and variance as g^m_n(τ) = Σ_{ℓ=1}^{L−1} g^m_n(τ − ℓ) β̂^m_ℓ and g^v_n(τ) = Σ_{ℓ=1}^{L−1} g^v_n(τ − ℓ) β̂^v_ℓ.
5. (Output Prediction Interval) Output the estimate f̂_n(t) = g^m_n(t) and its prediction interval using σ̂_n(t) = g^v_n(t) for the queried confidence c%.
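Step 4 is a standard linear autoregressive roll-out; a sketch for the mean series follows (the variance series g^v is rolled out identically with β̂^v). It assumes, as in Step 4, that β_ℓ multiplies the lag-ℓ value.

# Sketch of the sequential forecast roll-out with averaged coefficients beta of length L-1.
import numpy as np

def roll_forward(history, beta, steps):
    g = list(history)                         # the last L-1 (zero-filled) observations g^m_n
    for _ in range(steps):
        # g(τ) = Σ_ℓ g(τ-ℓ) β_ℓ, with β_1 applied to the most recent value.
        g.append(float(np.dot(g[-len(beta):], np.asarray(beta)[::-1])))
    return g[len(history):]                   # the forecasts for T+1, ..., T+steps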

    4 Experiments

    4.1 Setup

Datasets. Throughout the experiments, we use three real-world datasets that are standard benchmarks (e.g. used in [47]) in time series analysis, as well as three synthetic datasets. We use these synthetic datasets so we can compare with the latent mean and variance, which are of course not observable in real-world data. The real-world datasets come from three domains: electricity [44], traffic [31] and finance [46]. For further details on the datasets used, refer to Appendix B.1.1.

Machine and DB Configuration. In all experiments, we use an Intel Xeon E5-2683 machine with 16 cores, 132 GB of RAM, and SSD storage.¶ In Table 6 in Appendix B.1.2, we detail the relevant settings used for PostgreSQL 12.1.

Algorithms Setup and Hyper-parameters. Throughout the experiments, we compare with several state-of-the-art algorithms, including LSTM [16], DeepAR [37], TRMF [47], and Prophet [14]. In all reported results, TSPS's hyper-parameters (L, k, k1, k2, T0, T′, γ) are set in an automated manner, as we show and justify in Appendix B.1.3. Refer to Appendix B.1.4 for more information about the implementations and parameters used for all other methods.

Accuracy Metrics. As stated earlier, for each experiment, we use Normalized Root-Mean-Square-Error (NRMSE) as the accuracy metric, where normalization corresponds to rescaling each time series to have zero mean and unit variance before calculating the RMSE. Further, to quantify the statistical accuracy of the different methods used, we use a variant of the standard Borda Count (weighted by the NRMSE in each experiment), which we denote as WBC. As explained in Appendix B.1.5, WBC lies between 0 and 1: for a given algorithm, 0.5 means it does as well as all other algorithms on average; 1 (resp. 0) means it does ‘infinitely’ better (resp. worse) than all other algorithms. Again, we emphasize that by simply looking at the mean NRMSE, as is normally done, TSPS's relative performance is even better (see Table 2). We use WBC as we believe it better summarizes the statistical accuracy of the various algorithms used across the various experiments.
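The NRMSE used here is straightforward to reproduce; a sketch follows, with the normalization computed from the ground-truth series (an assumption on our part, since the rescaling convention is only summarized above).

# Sketch of NRMSE: rescale the series to zero mean / unit variance, then take RMSE.
import numpy as np

def nrmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mu, sd = y_true.mean(), y_true.std()      # normalize using the true series (our choice)
    z_true, z_pred = (y_true - mu) / sd, (y_pred - mu) / sd
    return float(np.sqrt(np.mean((z_true - z_pred) ** 2)))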


• Table 2: TSPS outperforms state-of-the-art algorithms in mean and variance estimation across different datasets. We use WBC (higher is better) and average NRMSE (lower is better) to summarize the results across experiments for each dataset.

              Mean Imputation (WBC/NRMSE)                       | Mean Forecasting (WBC/NRMSE)
          Electricity  Traffic    Synthetic I  Financial        | Electricity  Traffic    Synthetic I  Financial
TSPS      0.61/0.39    0.50/0.49  0.53/0.25    0.65/0.28        | 0.53/0.48    0.50/0.53  0.70/0.20    0.66/0.36
LSTM      NA           NA         NA           NA               | 0.49/0.55    0.54/0.47  0.45/0.44    0.31/1.20
DeepAR    NA           NA         NA           NA               | 0.53/0.48    0.53/0.47  0.53/0.33    0.65/0.40
TRMF      0.39/0.69    0.50/0.51  0.47/0.33    0.35/0.51        | 0.50/0.53    0.48/0.57  0.61/0.27    0.58/0.46
Prophet   NA           NA         NA           NA               | 0.47/0.58    0.45/0.62  0.22/1.01    0.30/1.30

    4.2 Statistical Experiments

Mean Imputation. We compare the imputation performance of TSPS against TRMF, a popular method for imputing (and forecasting) multivariate time series data with state-of-the-art results. In particular, we evaluate both algorithms using the three real-world datasets and one synthetic dataset (Synthetic I). To test the algorithms' robustness, we corrupt the various datasets by artificially masking entries with probability (1 − p), and by adding zero-mean Gaussian noise to each entry with varying standard deviation σ. As we vary p and σ for all four datasets, we find TSPS outperforms TRMF in 80% of experiments. In Table 2, we summarize the statistical performance of TSPS against TRMF in terms of WBC and the average NRMSE for each dataset across all experiments; we find that, using both metrics, TSPS outperforms TRMF on all datasets.
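The corruption used in these robustness experiments can be sketched as follows: drop each entry independently with probability 1 − p and add N(0, σ²) noise. The exact masking scheme used in the benchmarks may differ; this is an illustrative helper of ours.

# Sketch: corrupt a dataset by dropping entries w.p. (1 - p) and adding Gaussian noise.
import numpy as np

def corrupt(X, p=0.8, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(0.0, sigma, size=X.shape)
    mask = rng.random(X.shape) < p            # keep each entry with probability p
    return np.where(mask, X_noisy, np.nan)    # missing entries marked as NaN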

Mean Forecasting. We compare the forecasting performance of TSPS against LSTM, DeepAR, TRMF, and Prophet. Using the same datasets used for mean imputation, we find that TSPS performs competitively with several state-of-the-art deep learning algorithms (DeepAR and LSTM), and outperforms both TRMF and Prophet. Specifically, as we vary the fraction of missing values and the noise added, TSPS is the best performing method in 50% of experiments, and at least the second best performing in 80% of experiments. To summarize the statistical performance of all algorithms, we compute their WBC and average NRMSE for each dataset in Table 2; we find TSPS, using both metrics, outperforms all other algorithms on the electricity, financial and Synthetic I datasets, while performing competitively with DeepAR and LSTM on the traffic dataset.

Variance Estimation. We restrict our analysis to synthetic data as we do not get access to the true underlying time-varying variance in real-world data. We use the dataset Synthetic II, which consists of nine sets of multivariate time series, each with a different additive combination of time series dynamics and a different observation model, as described in Appendix B.1.1. We again vary the fraction of observed data p ∈ {1.0, 0.8, 0.5}. We find TSPS has superior performance in all but one experiment (> 98%) over both an adapted version of TRMF (in imputation) and DeepAR's in-built functionality (in forecasting). See Table 3 for the NRMSE values for each experiment. Further, in Table 3, we summarize the performance of all algorithms using WBC and the mean NRMSE across experiments. Note that, in summary, TSPS outperforms TRMF in variance imputation and DeepAR in variance forecasting across both metrics (WBC, mean NRMSE). Refer to Appendix B.2.2 for more details about the variance estimation experiments.

Robustness to Observation Models & Multivariate vs. Univariate Version of the TSPS Algorithm. We run two additional sets of experiments. In Appendix B.2.3, we test TSPS's robustness against different noise/observation models – pleasingly, we find TSPS effectively imputes and forecasts the latent time-varying mean without knowledge of the underlying noise/observation model (e.g., Gaussian, Poisson, Bernoulli). In Appendix B.2.4, we explore the statistical and computational trade-offs in TSPS's performance when building a multivariate vs. univariate time series prediction model – we find using multiple time series can greatly improve accuracy with a small slowdown in training time and query latency. However, beyond a certain point, the gain in statistical accuracy from adding more time series (∼ N > 50) becomes minimal and can add unnecessary overhead.

¶ Note all experiments were replicated on different machine configurations as well, by replacing PostgreSQL with TimeScaleDB [43]. We find the metrics of interest remain similar. We omit these results due to space constraints.


• Table 3: TSPS outperforms both TRMF (in variance imputation) and DeepAR (in variance forecasting). Each cell reports the NRMSE for the corresponding fraction of observed data p.

Observation  Time Series        TSPS (Imputation)        TRMF (Imputation)        TSPS (Forecasting)       DeepAR (Forecasting)
Model        Dynamics           p=1.0  p=0.8  p=0.5      p=1.0  p=0.8  p=0.5      p=1.0  p=0.8  p=0.5      p=1.0  p=0.8  p=0.5
Gaussian     Har                0.076  0.099  0.118      0.122  0.125  0.141      0.106  0.144  0.156      0.170  0.184  0.289
             Har + trend        0.075  0.091  0.103      0.133  0.135  0.142      0.154  0.155  0.247      0.286  0.232  0.269
             Har + AR + trend   0.074  0.090  0.101      0.134  0.136  0.146      0.173  0.263  0.337      0.214  0.265  0.388
Poisson      Har                0.126  0.132  0.150      0.137  0.138  0.151      0.143  0.152  0.157      13.20  148.5  90.20
             Har + trend        0.086  0.087  0.101      0.176  0.187  0.194      0.093  0.182  0.199      0.491  1.403  2.163
             Har + AR + trend   0.081  0.088  0.104      0.184  0.187  0.204      0.093  0.182  0.199      1.386  34.45  162.5
Bernoulli    Har                0.024  0.027  0.030      0.026  0.027  0.029      0.029  0.050  0.033      0.073  0.072  0.077
             Har + trend        0.022  0.025  0.025      0.030  0.030  0.032      0.029  0.048  0.059      0.049  0.068  0.076
             Har + AR + trend   0.022  0.024  0.025      0.030  0.030  0.032      0.036  0.036  0.056      0.070  0.082  0.111
Summary      WBC                0.597                    0.403                    0.726                    0.264
             Mean NRMSE         0.070                    0.112                    0.132                    16.9

    4.3 Computational Experiments

TSPS vs. Other Prediction Methods. As stated earlier, we evaluate the time to: (i) train a prediction model; (ii) produce a point forecast. In Table 4, we show these metrics across the four datasets for TSPS, LSTM, DeepAR, TRMF, and Prophet. With respect to the time taken to train a model, we find TSPS is 42.4-114.9x faster than LSTM, 39.7-79.2x faster than DeepAR, 1.03-2.92x faster than TRMF, and 86.4-832.7x faster than Prophet. With respect to prediction queries, TSPS's latency is 86.0-99.4x faster than LSTM's, 58.0-110.4x faster than DeepAR's, 0.8-2.2x the latency of TRMF's, and 370.3-456.9x faster than Prophet's.

Table 4: TSPS has a significantly lower training time and prediction query latency compared to state-of-the-art alternatives.

              Training/Insert Time (seconds)                   | Prediction/Querying Time (milliseconds)
          Electricity  Traffic   Synthetic I  Financial        | Electricity  Traffic   Synthetic I  Financial
TSPS      11.8         14.4      6.2          2.3              | 3.3          3.9       3.2          3.6
LSTM      1357.5       661.2     450.9        97.9             | 295.6        358.3     278.7        354.8
DeepAR    907.6        572.2     491.2        144.5            | 366.6        378.5     285.8        207.2
TRMF      34.5         14.8      12.4         3.6              | 7.3          6.0       2.5          4.6
Prophet   6868.8       4726.3    535.9        2165.4           | 1473.4       1458.8    1480.3       1434.0

TSPS vs. PostgreSQL. We compare: (i) TSPS's training time vs. PostgreSQL's bulk insert time; (ii) TSPS's PREDICT query latency against PostgreSQL's SELECT query latency. We find TSPS's model training time ranges from 0.58-1.52x that of the insert time of PostgreSQL. We evaluate the prediction query latency with and without uncertainty quantification (UQ), i.e., producing prediction intervals via variance estimation. Compared with a SELECT query in PostgreSQL, imputation queries are 1.64-2.67x slower, and forecasting queries are 1.67-2.77x slower. If UQ is added, queries are 3.34-5.35x and 3.42-5.48x slower for imputation and forecasting, respectively. This amounts to 1.29-2.36 milliseconds for SELECT queries, 3.22-3.87 milliseconds for imputation queries (5.65-7.88 milliseconds with UQ), and 3.24-3.94 milliseconds for forecasting queries (5.67-8.08 milliseconds with UQ). Refer to Table 5 for a summary and Appendix B.3.1 for details.

Table 5: N and T refer to the number of time series and the number of observations per time series, respectively.

              N / T          Training/Insert Time (seconds)   Query Latency (milliseconds)
                             TSPS     PostgreSQL              TSPS Forecast / with UQ   TSPS Imputation / with UQ   PostgreSQL SELECT
Electricity   370 / 25968    11.81    8.34                    3.32 / 7.02               3.25 / 6.93                 1.34
Traffic       963 / 10392    14.42    9.50                    3.94 / 8.08               3.87 / 7.88                 2.36
Synthetic I   400 / 15000    6.20     4.93                    3.24 / 5.67               3.22 / 5.65                 1.31
Financial     839 / 3993     2.31     4.00                    3.57 / 7.07               3.47 / 6.90                 1.29


• 5 Conclusion

In this paper, we build an open-source, real-time time series prediction system, TSPS, which achieves state-of-the-art statistical and computational performance relative to state-of-the-art deep-learning based time series algorithms. As an important design choice, we directly integrate TSPS on top of PostgreSQL and abstract away the ML workflow from the user, thus striving for a single interface to answer both standard DB queries and predictive queries. We hope this serves as a benchmark for other such integrated prediction systems in the growing and exciting line of work in AutoML.

    References

[1] A. Agarwal, A. Alomar, and D. Shah. On multivariate singular spectrum analysis. Working paper, 2020.
[2] A. Agarwal, M. J. Amjad, D. Shah, and D. Shen. Model agnostic time series analysis via matrix estimation. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3):40, 2018.
[3] A. Agarwal, D. Shah, D. Shen, and D. Song. On robustness of principal component regression. In Advances in Neural Information Processing Systems, pages 9889–9900, 2019.
[4] M. Akdere, U. Cetintemel, M. Riondato, E. Upfal, and S. B. Zdonik. The case for predictive database systems: Opportunities and challenges. In CIDR, pages 167–174, 2011.
[5] I. Arous, M. Khayati, P. Cudré-Mauroux, Y. Zhang, M. Kersten, and S. Stalinlov. RecovDB: accurate and efficient missing blocks recovery for large time series. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1976–1979. IEEE, 2019.
[6] T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327, 1986.
[7] P. J. Brockwell and R. A. Davis. Time series: theory and methods. Springer Science & Business Media, 2013.
[8] L. Brown, N. Gans, A. Mandelbaum, A. Sakov, H. Shen, S. Zeltyn, and L. Zhao. Statistical analysis of a telephone call center: A queueing-science perspective. Journal of the American Statistical Association, 100(469):36–50, 2005.
[9] P. G. Brown. Overview of SciDB: large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 963–968, 2010.
[10] J. Cambronero, J. K. Feser, M. J. Smith, and S. Madden. Query optimization for dynamic imputation. Proceedings of the VLDB Endowment, 10(11):1310–1321, 2017.
[11] F. Chollet. Keras. https://github.com/fchollet/keras, 2015. Online; accessed 25 February 2020.
[12] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD skills: new analysis practices for big data. Proceedings of the VLDB Endowment, 2(2):1481–1492, 2009.
[13] J. Durbin. Some methods of constructing exact tests. Biometrika, 48(1-2):41–65, 1961.
[14] Facebook. Prophet. https://facebook.github.io/prophet/. Online; accessed 25 February 2020.
[15] M. Gavish and D. L. Donoho. The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014.
[16] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. 9th International Conference on Artificial Neural Networks, 1999.
[17] N. Golyandina, V. Nekrutkin, and A. A. Zhigljavsky. Analysis of time series structure: SSA and related techniques. Chapman and Hall/CRC, 2001.
[18] H. Hassani, S. Heravi, and A. Zhigljavsky. Forecasting UK industrial production with multivariate singular spectrum analysis. Journal of Forecasting, 32(5):395–408, 2013.
[19] H. Hassani and R. Mahmoudvand. Multivariate singular spectrum analysis: A general view and new vector forecasting approach. International Journal of Energy and Statistics, 1(01):55–83, 2013.



• [20] J. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, et al. The MADlib analytics library or MAD skills, the SQL. arXiv preprint arXiv:1208.4165, 2012.
[21] IBM. ibmdbR: IBM in-database analytics for R. https://cran.r-project.org/web/packages/ibmdbR/index.html, 2020. Online; accessed 25 February 2020.
[22] IBM. Python support for IBM Db2 and IBM Informix. https://github.com/ibmdb/python-ibmdb, 2020. Online; accessed 25 February 2020.
[23] M. Khayati, A. Lerner, Z. Tymchenko, and P. Cudré-Mauroux. Mind the gap: an experimental evaluation of imputation of missing values techniques in time series. Proceedings of the VLDB Endowment, 13(5):768–782, 2020.
[24] S.-H. Kim and W. Whitt. Choosing arrival process models for service systems: Tests of a nonhomogeneous Poisson process. Naval Research Logistics (NRL), 61(1):66–90, 2014.
[25] A. Kumar, J. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1969–1984, 2015.
[26] P. A. Lewis. Some results on tests for Poisson processes. Biometrika, 52(1-2):67–77, 1965.
[27] C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 75–86, 2010.
[28] Microsoft. RevoScaleR package. https://docs.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/revoscaler, 2020. Online; accessed 25 February 2020.
[29] mldb.ai. MLDB. https://mldb.ai. Online; accessed 25 February 2020.
[30] MySQL. PyMySQL documentation. https://pymysql.readthedocs.io/en/latest/, 2020. Online; accessed 25 February 2020.
[31] C. D. of Transportation. UCI machine learning repository – PEMS-SF data set. https://archive.ics.uci.edu/ml/datasets/PEMS-SF, 2011. Online; accessed 25 February 2020.
[32] Oracle. cx_Oracle. https://oracle.github.io/python-cx_Oracle, 2020. Online; accessed 25 February 2020.
[33] Oracle. Oracle machine learning for R. https://www.oracle.com/database/technologies/datawarehouse-bigdata/oml4r.html, 2020. Online; accessed 25 February 2020.
[34] Y. Park, J. Qing, X. Shen, and B. Mozafari. BlinkML: Efficient maximum likelihood estimation with probabilistic guarantees. In Proceedings of the 2019 International Conference on Management of Data, pages 1135–1152, 2019.
[35] A. Ratner, D. Alistarh, G. Alonso, D. G. Andersen, P. Bailis, S. Bird, N. Carlini, B. Catanzaro, J. Chayes, E. Chung, et al. MLSys: The new frontier of machine learning systems. arXiv preprint arXiv:1904.03257, 2019.
[36] F. Saad and V. K. Mansinghka. A probabilistic programming approach to probabilistic data analysis. In Advances in Neural Information Processing Systems, pages 2011–2019, 2016.
[37] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2019.
[38] R. Sen, H.-F. Yu, and I. S. Dhillon. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. In Advances in Neural Information Processing Systems, pages 4838–4847, 2019.
[39] D. Shah, S. Burle, V. Doshi, Y.-z. Huang, and B. Rengarajan. Prediction query language. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 611–616. IEEE, 2018.
[40] Z. Shang, E. Zgraggen, B. Buratti, F. Kossmann, P. Eichmann, Y. Chung, C. Binnig, E. Upfal, and T. Kraska. Democratizing data science through interactive curation of ML pipelines. In Proceedings of the 2019 International Conference on Management of Data, pages 1171–1188, 2019.


[41] E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska. Automating model search for large scale machine learning. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 368–380, 2015.

[42] E. R. Sparks, S. Venkataraman, T. Kaftan, M. J. Franklin, and B. Recht. KeystoneML: Optimizing pipelines for large-scale advanced analytics. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 535–546. IEEE, 2017.

[43] TimescaleDB. Time-series data: Why (and how) to use a relational database instead of NoSQL.

[44] A. Trindade. UCI machine learning repository - ElectricityLoadDiagrams 2011-2014 data set. https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, 2015. Online; accessed 25 February 2020.

[45] K. W. Wilson, B. Raj, and P. Smaragdis. Regularized non-negative matrix factorization with temporal dependencies for speech denoising. In Ninth Annual Conference of the International Speech Communication Association, 2008.

[46] WRDS. The Trade and Quote (TAQ) database. https://wrds-www.wharton.upenn.edu/pages/support/data-overview/wrds-overview-taq/. Online; accessed 25 February 2020.

[47] H.-F. Yu, N. Rao, and I. S. Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in Neural Information Processing Systems, pages 847–855, 2016.

[48] H. Zha and H. D. Simon. On updating problems in latent semantic indexing. SIAM Journal on Scientific Computing, 21(2):782–791, 1999.


A Algorithm Justification

Notation. Let $f_n(t)$ denote the noiseless version of the time series $X_n(t)$, i.e., $\mathbb{E}[X_n(t)] = f_n(t)$. Further, let $\sigma^2_n(t) = \mathbb{E}[X_n(t) - f_n(t)]^2$ denote the variance of the per-step noise. Let the collection of $N$ time series $f_1, \ldots, f_N$ be denoted by $f$; define $\sigma^2$ analogously with respect to $\sigma^2_1, \ldots, \sigma^2_N$. Let $M^f, M^{\sigma^2} \in \mathbb{R}^{L \times \bar{P}}$ denote the stacked Page matrices induced by $f$ and $\sigma^2$ (analogous to how $Z^X, Z^{X^2}$ are defined). In particular, $M^f_{i,[j+P\times(n-1)N]} := f_n(i+(j-1)L)$ and $M^{\sigma^2}_{i,[j+P\times(n-1)N]} := \sigma^2_n(i+(j-1)L)$.

    A.1 Justification for Mean Estimation Algorithm

The goal of the mean estimation algorithm is to impute and forecast the underlying time-varying mean $f$, i.e., produce an estimate $\hat{f}$.

Linear Recurrent Formulae (LRF). Consider the univariate time series setting, i.e., $N = 1$. Here, $M^{f_1}_{ij} = f_1(i + (j-1)L)$.

Definition A.1. A time series $f_n$ is called an order-$R$ (univariate) LRF if for all $i, j \in \mathbb{N}$ with $i + j \le T$, $f_1(i+j) = \sum_{r=1}^{R} g_r(i) h_r(j)$, where $g_r(\cdot), h_r(\cdot) : \mathbb{Z} \to \mathbb{R}$ for $r \in [R]$.

Proposition A.1. (Proposition 5.2 in [2]) $f_1(t) = \sum_{a=1}^{A} \exp(\alpha_a t) \cdot \cos(2\pi \omega_a t + \phi_a) \cdot P_{m_a}(t)$ admits an order-$R$ LRF representation, where $P_{m_a}$ is any polynomial of degree $m_a$. Further, the order $R$ of the LRF induced by $f_1$ is independent of $T$, the number of observations, and is bounded by $R \le A(m_{\max} + 1)(m_{\max} + 2)$, where $m_{\max} = \max_{a \in [A]} m_a$.

Interpretation. From Proposition A.1, we see that LRFs admit a rich class of time series families, including finite sums of products of harmonics, polynomials, and exponentials. Thus, LRFs well-approximate a large class of time series dynamics (e.g., stationary time series; see [7]).

Imputation and Forecasting for LRFs. It is easy to see that if $f_1$ is an order-$R$ LRF, the Page matrix $M^{f_1}$ induced by it has rank $R$. In particular, $M^{f_1}_{ij} = \sum_{r=1}^{R} U_{ir} V_{jr}$, where $U_{ir} = g_r(i)$ and $V_{ir} = h_r(i)$. Since $M^{f_1}$ has rank $R$ and $\mathbb{E}[Z^{X_1}] = M^{f_1}$, by recent advances in the matrix estimation and high-dimensional linear regression literature, we know that performing HSVT on $Z^{X_1}$ and subsequently fitting a linear model (i.e., doing Principal Component Regression) leads to a consistent estimator for both imputation and forecasting, with error scaling as $\sim \sqrt{R}/T^{1/4}$ and $\sim R/\sqrt{T}$, respectively. Specifically, see Theorem 4.1 in [2] for the imputation result, and Theorem 4.2 and Proposition 4.2 in [3] for the forecasting result. Note the key to these results is that the underlying matrix $M^{f_1}$ is approximately low-rank.
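To make this recipe concrete, the following minimal NumPy sketch builds a Page matrix from a noisy univariate series, de-noises it via HSVT, and fits a linear (PCR-style) forecasting model. The Page matrix convention, the toy series, the fixed rank, and the last-row regression are simplifications for illustration, not TSPS's exact implementation.

    import numpy as np

    def page_matrix(x, L):
        # stack x into an L x (T // L) Page matrix; column j holds x[j*L : (j+1)*L]
        T = (len(x) // L) * L
        return x[:T].reshape(-1, L).T

    def hsvt(Z, rank):
        # hard singular value thresholding: keep the top `rank` singular components
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return (U[:, :rank] * s[:rank]) @ Vt[:rank]

    # toy LRF-like series: a sum of two harmonics observed with additive noise
    T, L, rank = 4000, 50, 4
    t = np.arange(T)
    f = np.sin(2 * np.pi * t / 97) + 0.5 * np.cos(2 * np.pi * t / 31)
    x = f + 0.3 * np.random.randn(T)

    Z = page_matrix(x, L)
    M_hat = hsvt(Z, rank)             # imputation: de-noised estimate of the Page matrix

    # forecasting: regress the last row of the de-noised matrix on the rows above it
    X_feat, y = M_hat[:-1].T, M_hat[-1]
    beta, *_ = np.linalg.lstsq(X_feat, y, rcond=None)
    forecast = x[-(L - 1):] @ beta    # estimate of f one step beyond the last observation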

Multivariate LRF. In multivariate time series data, where $N > 1$, there exist both temporal structure (the relationship within a time series) and "spatial" structure, i.e., the relationship across a collection of related time series. We propose a natural multivariate generalization of the LRF model; under this model, we justify below why our proposed algorithm provides accurate imputation and forecasting.

Definition A.2. $f$ is an order-$(K, R_{\max})$ multivariate LRF if for all $n \in [N]$, $f_n(t) = \sum_{k=1}^{K} (\theta_n)_k \cdot h_k(t)$, where $\theta_n \in \mathbb{R}^K$ and each $h_k : \mathbb{Z} \to \mathbb{R}$ is an order-$R_k$ LRF with $R_k \le R_{\max}$.

Interpretation. The $(K, R_{\max})$ multivariate LRF model can be interpreted as one where there are $K$ "fundamental" time series $h_k(\cdot)$; each $h_k(\cdot)$ is a (univariate) LRF whose order $R_k$ is bounded by $R_{\max}$. Each $f_n(\cdot)$ can thus be seen as a unique additive combination of these $K$ LRFs, with the latent linear parameters given by $\theta_n$.

Proposition A.2. If $f$ is an order-$(K, R_{\max})$ multivariate LRF, then $\mathrm{rank}(M^f) \le K \cdot R_{\max}$.

Proof. $M^f_{i,[j+P\times(n-1)N]} = f_n(i+(j-1)L) = \sum_{k=1}^{K} (\theta_n)_k \cdot h_k(i+(j-1)L) = \sum_{k=1}^{K} (\theta_n)_k \cdot \sum_{r=1}^{R_k} U^{(k)}_{ir} V^{(k)}_{jr}$. The last equality follows from the fact that $h_k(\cdot)$ is an order-$R_k$ LRF, and so the induced Page matrix $M^{h_k}$ has rank $R_k$; here $U^{(k)}, V^{(k)}$ denote the left and right singular vectors of $M^{h_k}$, respectively. Observing that the number of terms in the summation $\sum_{k=1}^{K} (\theta_n)_k \cdot \sum_{r=1}^{R_k} U^{(k)}_{ir} V^{(k)}_{jr}$ is at most $K \cdot R_{\max}$ completes the proof.


Interpretation. The key data transformation, the stacked Page matrix $M^f$, has rank at most $K \cdot R_{\max}$ and satisfies $\mathbb{E}[Z^X] = M^f$. Crucially, under this model, the rank of $M^f$ does not grow with $N$ or $T$, thereby providing some justification for the mean estimation algorithm.
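As a quick numerical illustration of Proposition A.2 (illustrative only; the stacking convention and parameters below are one concrete choice, not TSPS's implementation), one can generate $N$ series as mixtures of $K$ fundamental harmonics and verify that the numerical rank of the stacked Page matrix stays bounded by $K \cdot R_{\max}$ regardless of $N$:

    import numpy as np

    T, L, N, K = 6000, 60, 40, 3
    t = np.arange(T)
    # K "fundamental" LRFs: pure harmonics, each an order-2 LRF (R_max = 2)
    H = np.stack([np.cos(2 * np.pi * t / p) for p in (23, 57, 131)])  # shape (K, T)
    Theta = np.random.randn(N, K)                                     # latent parameters theta_n
    F = Theta @ H                                                     # N noiseless series f_n

    # stack the N per-series Page matrices side by side: L rows, N * (T // L) columns
    P = T // L
    M = np.hstack([F[n, :P * L].reshape(P, L).T for n in range(N)])

    s = np.linalg.svd(M, compute_uv=False)
    print(int((s > 1e-8 * s[0]).sum()))   # numerical rank: at most K * R_max = 6, independent of N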

    A.2 Justification for Variance Estimation Algorithm

The goal of the variance estimation algorithm is to impute and forecast the underlying time-varying variance $\sigma^2$, i.e., produce an estimate $\hat{\sigma}^2$. A key challenge in estimating the time-varying variance of a time series is that it is not directly observed (as opposed to $f$, for which we at least get noisy, sparse observations $X$).

Multivariate LRF model for $\sigma^2$. Analogous to the model we propose for $f$, we justify the variance estimation algorithm for estimating $\sigma^2$ under a multivariate LRF model. In particular, assume $\sigma^2$ is an order-$(K^{(2)}, R^{(2)}_{\max})$ multivariate LRF (and $f$ is an order-$(K, R_{\max})$ multivariate LRF).

Repeating the Mean Estimation Algorithm on $Z^{X^2}$. Let $M^{f^2} \in \mathbb{R}^{L \times \bar{P}}$ be the stacked Page matrix induced by entry-wise squaring of $M^f$, i.e., $M^{f^2}_{ij} := (M^f_{ij})^2$. Let $M^{f^2+\sigma^2} \in \mathbb{R}^{L \times \bar{P}}$ be the stacked Page matrix induced by entry-wise addition of the matrices $M^{f^2}$ and $M^{\sigma^2}$. Then, by a straightforward modification of the proof of Proposition A.2, one can establish that $\mathrm{rank}(M^{f^2+\sigma^2}) \le (K \cdot R_{\max})^2 + K^{(2)} \cdot R^{(2)}_{\max}$. Further, note that $\mathbb{E}[X^2] = f^2 + \sigma^2$ and hence $\mathbb{E}[Z^{X^2}] = M^{f^2+\sigma^2}$. Then, by the same argument as in the previous section, applying the mean estimation algorithm to $Z^{X^2}$ yields accurate estimates $\widehat{f^2+\sigma^2}$ for both imputation and forecasting.

Estimating $\sigma^2$ under the Multivariate LRF Model. It is then easy to see that through the following simple post-processing step we can reliably estimate $\sigma^2$:

$$\hat{\sigma}^2 := \widehat{f^2+\sigma^2} - \hat{f}^2,$$

where $\hat{f}^2$ is the component-wise square of the estimate $\hat{f}$ produced by the mean estimation algorithm. If $\widehat{f^2+\sigma^2}$ is close to $f^2+\sigma^2$ and $\hat{f}$ is close to $f$, then by the triangle inequality $\hat{\sigma}^2$ must be close to $\sigma^2$. Thus, surprisingly, we establish that this simple variance estimation algorithm reliably recovers the underlying (unobserved) time-varying variance under a multivariate LRF model for both $f$ and $\sigma^2$. As further evidence, we find the simple multivariate variance estimation algorithm we propose significantly outperforms alternatives for both imputation and forecasting.
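The post-processing step amounts to a few lines of code. In the sketch below (illustrative only; estimate_mean is a stand-in for the HSVT-based mean estimation routine of Appendix A.1, not TSPS's implementation, and the clipping at zero is a practical safeguard we add), the same mean estimator is applied to the observations and to their squares, and the results are subtracted:

    import numpy as np

    def estimate_mean(x, L=50, rank=8):
        # stand-in mean estimator: HSVT on the Page matrix of x, unfolded back to a series
        P = len(x) // L
        Z = x[:P * L].reshape(P, L).T
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return ((U[:, :rank] * s[:rank]) @ Vt[:rank]).T.ravel()

    # toy series with a time-varying mean f and a time-varying variance sigma^2
    T = 5000
    t = np.arange(T)
    f = np.sin(2 * np.pi * t / 80)
    sigma2 = 0.2 + 0.1 * np.cos(2 * np.pi * t / 200) ** 2
    x = f + np.sqrt(sigma2) * np.random.randn(T)

    f_hat = estimate_mean(x)             # estimate of f
    f2s2_hat = estimate_mean(x ** 2)     # estimate of f^2 + sigma^2, since E[X^2] = f^2 + sigma^2
    sigma2_hat = np.clip(f2s2_hat - f_hat ** 2, 0, None)   # post-processing step, clipped at 0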

    B Additional Experimental Details

    B.1 Experiment Setup

    B.1.1 Datasets Description

Electricity Dataset. Obtained from the UCI repository, it records the 15-minute electricity loads of 370 households [44]. We follow the preprocessing done in [47, 38, 37]: data is aggregated into hourly intervals; the first 25968 time points are used for training; and day-ahead forecasts for the next seven days (i.e., 24-step-ahead forecasts for 7 windows) are made.

Traffic Dataset. Obtained from the UCI repository, it records the occupancy rate of traffic lanes in San Francisco [31]. We follow the preprocessing done in [47, 38, 37]: data is aggregated into hourly intervals; the first 10392 time points are used for training; and day-ahead forecasts for the next seven days (i.e., 24-step-ahead forecasts for 7 windows) are made.

Financial Dataset. Obtained from Wharton Research Data Services (WRDS), it records the average daily stock prices of 839 companies from October 2004 to November 2019 [46]. To limit the number of time series for ease of experimentation, we remove stocks with an average price below $30 across the available period, as well as those with null values. This preprocessing gives 839 time series (i.e., stock prices), each with 3993 daily readings. We use the first 3813 time points for training. For forecasting, we forecast 180 days ahead, one day at a time (i.e., 1-step-ahead forecasts for 180 windows), as this is a standard goal in finance.


Synthetic Dataset I (Mean Estimation). We generate observations of $n \times m$ time series over $T$ time steps, $X \in \mathbb{R}^{n \times m \times T}$, by first randomly generating the two vectors $U = [u_1, \ldots, u_n] \in \mathbb{R}^n$ and $V = [v_1, \ldots, v_m] \in \mathbb{R}^m$. Then, we generate $r$ mixtures of harmonics, where each mixture $g_k(t)$, $k \in [r]$, is generated as $g_k(t) = \sum_{h=1}^{4} \alpha_h \cos(\omega_h t / T)$, with the parameters selected randomly such that $\alpha_h \in [-1, 10]$ and $\omega_h \in [1, 1000]$. Each observation is constructed as $X_{i,j}(t) = \sum_{k=1}^{r} u_i v_j g_k(t)$, where $r$ is the tensor rank, $i \in [n]$, $j \in [m]$. In our experiments, we select $n = 20$, $m = 20$, $T = 15000$, and $r = 4$. This gives us 400 time series, each with 15000 observations. In the forecasting experiments, we use the first 14000 points for training; the goal is to produce 10-step-ahead forecasts for the final 1000 points.
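A sketch of this generator follows. Where the text above only specifies ranges, the sampling choices (uniform draws for the alphas and omegas, standard normal entries for U and V) are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, T, r = 20, 20, 15000, 4

    U = rng.standard_normal(n)   # u_1, ..., u_n (distribution not specified in the text)
    V = rng.standard_normal(m)   # v_1, ..., v_m
    t = np.arange(T)

    # r mixtures of harmonics: g_k(t) = sum_h alpha_h * cos(omega_h * t / T)
    G = np.zeros((r, T))
    for k in range(r):
        alpha = rng.uniform(-1, 10, size=4)
        omega = rng.uniform(1, 1000, size=4)
        G[k] = (alpha[:, None] * np.cos(omega[:, None] * t / T)).sum(axis=0)

    # X_{i,j}(t) = sum_k u_i * v_j * g_k(t): 400 time series, each of length T
    g_sum = G.sum(axis=0)
    X = U[:, None, None] * V[None, :, None] * g_sum[None, None, :]   # shape (n, m, T)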

Synthetic Dataset II (Variance Estimation). With $k \in \{1, 2, 3, 4\}$, we generate three different sets of time series as follows: (i) four harmonics, $g^{\mathrm{har}}_k(t) = \sum_{i=1}^{4} \alpha_i \cos(\omega_i t / T)$, where $\alpha_i \in [-1, 10]$ and $\omega_i \in [1, 1000]$; (ii) four AR processes, $g^{\mathrm{AR}}_k(t) = \sum_{i=1}^{3} \phi_i g^{\mathrm{AR}}_k(t-i) + \epsilon(t)$, where $\epsilon(t) \sim \mathcal{N}(0, 0.1)$ and $\phi_i \in [0.1, 0.4]$; (iii) four trends, $g^{\mathrm{trend}}_k(t) = \eta t$, where $\eta \in [10^{-4}, 10^{-3}]$. Then we generate a tensor by sampling two random vectors $U = [u_1, \ldots, u_{20}] \in \mathbb{R}^{20}$ and $V = [v_1, \ldots, v_{20}] \in \mathbb{R}^{20}$. Next, we generate three $20 \times 20 \times 15000$ tensors as follows: (i) a mixture of harmonics, $F^1_{i,j}(t) = \sum_{k=1}^{4} u_i v_j g^{\mathrm{har}}_k(t)$; (ii) a mixture of harmonics + trend, $F^2_{i,j}(t) = \sum_{k=1}^{4} u_i v_j (g^{\mathrm{har}}_k(t) + g^{\mathrm{trend}}_k(t))$; (iii) a mixture of harmonics + trend + AR, $F^3_{i,j}(t) = \sum_{k=1}^{4} u_i v_j (g^{\mathrm{trend}}_k(t) + g^{\mathrm{har}}_k(t) + g^{\mathrm{AR}}_k(t))$. Here $i, j \in [20]$. For each tensor, we normalize its values to be between 0 and 1 and use three observation models to generate three tensors from each: (i) Gaussian, $X^q_{ij}(t) \sim \mathcal{N}(F^1_{i,j}(t), F^q_{i,j}(t))$; (ii) Bernoulli, $X^q_{ij}(t) \sim \mathrm{Bernoulli}(F^q_{i,j}(t))$; (iii) Poisson, $X^q_{ij}(t) \sim \mathrm{Pois}(F^q_{i,j}(t))$, where $q \in \{1, 2, 3\}$. This gives a total of nine observation tensors for our variance experiments. For the forecasting experiments, we use the first 14000 points for training and the last 1000 time steps for testing.

Synthetic Dataset III (Robustness to Observation Models). We generate $f$ as a sum of harmonics, $f(t) = \sum_{i=1}^{4} \alpha_i \cos(\omega_i t / T)$, where $t \in [T]$, $T = 100000$, $\alpha_i \in [-1.5, 1.5]$, and $\omega_i \in [1, 100]$ (randomly selected). We normalize $f(t)$ to be between 0 and 1, and then generate three different observation models: (i) $X_{\mathrm{Gauss}}(t) = f(t) + \epsilon(t)$, where $\epsilon(t) \sim \mathcal{N}(0, 0.1)$ (float); (ii) $X_{\mathrm{Bernoulli}}(t) \sim \mathrm{Bernoulli}(f(t))$, i.e., each observation at time $t$ follows a Bernoulli distribution with mean $f(t)$ (boolean); (iii) $X_{\mathrm{Pois}}(t) \sim \mathrm{Pois}(f(t))$, i.e., each observation at time $t$ follows a Poisson distribution with mean $\lambda = f(t)$ (integer). The goal is to recover (i.e., impute) and forecast $f(t)$ for each set of observations. We use the first 96000 observations for training and the last 4000 points for testing.
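The three observation models can be generated directly from the normalized mean function, along the following lines. In this sketch, $\mathcal{N}(0, 0.1)$ is taken to specify the noise variance; that convention, and the particular random draws, are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 100_000
    t = np.arange(T)

    alpha = rng.uniform(-1.5, 1.5, size=4)
    omega = rng.uniform(1, 100, size=4)
    f = (alpha[:, None] * np.cos(omega[:, None] * t / T)).sum(axis=0)
    f = (f - f.min()) / (f.max() - f.min())              # normalize f(t) to [0, 1]

    X_gauss = f + rng.normal(0.0, np.sqrt(0.1), size=T)  # float observations
    X_bern = rng.binomial(1, f)                          # boolean observations
    X_pois = rng.poisson(f)                              # integer observations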

    B.1.2 Machine and DB Configuration.

As noted earlier, we use an Intel Xeon E5-2683 machine with 16 cores, 132 GB of RAM, and SSD storage in all experiments. Table 6 details the relevant settings used for PostgreSQL 12.1.

Table 6: The configurations used for PostgreSQL 12.1.

    Parameter                        Value
    shared buffers                   30GB
    maintenance work mem             2GB
    effective cache size             80GB
    default transaction isolation    'read committed'
    wal buffers                      16MB
    max parallel workers             16
    max worker processes             16
    work mem                         54MB

    B.1.3 TSPS Hyperparameters

All experimental results are produced using the default settings in the open-source implementation of TSPS. In particular, the hyperparameters of TSPS are selected as follows:


L. We choose $L$ using guidance from the analysis in [2], which requires $L \le \bar{P}$. Specifically, in TSPS, $L = \bar{P}/10$.

Number of Singular Values ($k$, $k_1$, $k_2$). This choice is made in a data-driven manner. Specifically, we follow the optimal procedure proposed in [15] for choosing a threshold for HSVT. The procedure depends on the median of the singular values and the shape of the Page matrix; for more details, refer to [15].
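For reference, the commonly cited median-based approximation from [15] for an unknown noise level can be written as below. This is a sketch of that rank-selection rule; TSPS's exact procedure may differ in details, and the function name is illustrative.

    import numpy as np

    def optimal_hsvt_rank(Z):
        # rank selection via the median-based optimal hard threshold of Gavish & Donoho [15]
        m, n = sorted(Z.shape)                 # m <= n, aspect ratio beta = m / n
        beta = m / n
        omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43   # approximation to omega(beta)
        s = np.linalg.svd(Z, compute_uv=False)
        tau = omega * np.median(s)             # threshold scaled by the median singular value
        return int((s > tau).sum())

The returned count would play the role of $k$ (respectively $k_1$, $k_2$) when thresholding the corresponding Page matrix.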

"Meta-Algorithm" Parameters ($T_0$, $T'$, $\gamma$). The default parameters we choose are $T_0 = 100$, $T' = 2.5 \times 10^6$, and $\gamma = 0.5$. For details on how these parameters are chosen, refer to Appendix B.3.2.

    B.1.4 Benchmarking Algorithms Parameters and Settings

In this section, we describe the hyper-parameters and implementation used for each method we compare against.

DeepAR. We use the default parameters of the "DeepAREstimator" algorithm provided by the GluonTS package. In variance forecasting, we compare with DeepAR as it natively supports producing prediction intervals: it uses a Monte Carlo approach that samples the estimated posterior to produce multiple sample trajectories of the time series. We take the variance of these samples as an estimate of the forecasted variance.

LSTM. Across all datasets, we use an LSTM network with three hidden layers and 45 neurons per layer, as is done in [38]. We use the Keras implementation of LSTM [11].

Prophet. We use Prophet's Python library with its default parameters [14].

TRMF. We use the implementation provided by the authors of [47] on GitHub. We use $k = 60$ for the electricity data, $k = 40$ for the traffic data, and $k = 20$ for the financial data, synthetic data, and variance experiments (selected via cross-validation); $k$ is the chosen rank of the $T \times N$ time series matrix. Another hyper-parameter, the lag index, is set to include the last day and the same weekday in the previous week for the traffic and electricity data. For the financial and synthetic data, the lag index includes the last 20 points. To perform variance imputation, we adapt TRMF analogously to the extension used in TSPS as follows:

1. Impute the time series $X_n(t)$ and $X^2_n(t)$ to produce estimates $\hat{f}^{\,\mathrm{TRMF}}_n(t)$ and $\widehat{f^2_n + \sigma^2_n}^{\,\mathrm{TRMF}}(t)$, respectively;

2. Output $\hat{\sigma}^{2\,\mathrm{TRMF}}_n(t) = \widehat{f^2_n + \sigma^2_n}^{\,\mathrm{TRMF}}(t) - \hat{f}^{\,\mathrm{TRMF}}_n(t)^2$.

    B.1.5 Accuracy Score Metrics

Normalized Root Mean Squared Error (NRMSE). Throughout all experiments, NRMSE is used as the accuracy metric. In particular, we rescale each time series to have zero mean and unit variance before calculating the RMSE. This is done to avoid over-weighting the error in time series with larger magnitudes or greater variance.
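Concretely, when both series are standardized with the target's statistics, this reduces to dividing the RMSE by the target's standard deviation. The tiny sketch below assumes that convention (taking the rescaling statistics from the ground-truth series), which is one reasonable reading of the description above.

    import numpy as np

    def nrmse(y_true, y_pred):
        # RMSE after standardizing by the target's mean and standard deviation
        return np.sqrt(np.mean((y_true - y_pred) ** 2)) / y_true.std()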

Weighted Borda Count (WBC). In Table 1 and Table 2, we use WBC to report the algorithms' performance across different experiments. This score is inspired by the Borda count, a commonly used method to rank a list of candidates: rather than choosing a single candidate, voters rank all candidates in order of preference, and each candidate is then given a number of points corresponding to the number of candidates ranked lower.

We use a weighted version of the standard Borda count, where we rank algorithms (i.e., candidates) based on pairwise comparisons of the normalized root mean squared error (NRMSE) across experiments (i.e., voters). We use this metric as it better captures the relative performance across all experiments and methods, unlike a simple statistic such as the mean or the median. Particularly, let $A$ represent the set of algorithms used in the experiments; for example, in the mean forecasting experiments, $A = \{\text{TSPS, DeepAR, LSTM, Prophet, TRMF}\}$. Let $X$ represent the set of all experiments across different noise levels ($\sigma$) and fractions of missing values ($p$). Let $e(a, x)$ represent the NRMSE of algorithm $a \in A$ in experiment $x \in X$. Then the Weighted Borda Count of algorithm $a \in A$ is:

$$\mathrm{WBC}(a) = \frac{1}{|A| - 1} \sum_{a' \in A : a' \neq a} \left( \frac{1}{|X|} \sum_{x \in X} \frac{e(a', x)}{e(a, x) + e(a', x)} \right)$$


WBC ranges between 0 and 1 and captures the relative performance against the alternative algorithms. A score of 0.5 indicates performance identical to the average of all other algorithms, while a score of 0 (resp. 1) indicates that the algorithm performs "infinitely" worse (resp. better).
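The metric is straightforward to compute from a table of per-experiment NRMSE values. The sketch below implements the formula above; the error values in the example are made up purely for illustration and are not taken from the paper's results.

    import numpy as np

    def weighted_borda_count(errors):
        # errors: dict mapping algorithm name -> array of NRMSE values over the |X| experiments
        algs = list(errors)
        wbc = {}
        for a in algs:
            pairwise = [np.mean(errors[o] / (errors[a] + errors[o])) for o in algs if o != a]
            wbc[a] = float(np.mean(pairwise))   # 1/(|A|-1) * sum over all a' != a
        return wbc

    errs = {
        "TSPS":   np.array([0.30, 0.25, 0.40, 0.35]),
        "LSTM":   np.array([0.45, 0.30, 0.55, 0.50]),
        "DeepAR": np.array([0.40, 0.35, 0.50, 0.45]),
    }
    print(weighted_borda_count(errs))   # scores above 0.5 indicate better-than-average performance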

    B.2 Additional Statistical Experiments

    B.2.1 Mean Forecasting and Imputation Figures

Figures 6 and 7 show the results of the mean imputation and forecasting experiments as we vary the fraction of missing data and the noise level.

Figure 6: TSPS's imputation performance surpasses TRMF in 80% of the experiments as we vary the amount of missing data (a-d) and the noise level (e-h).

Figure 7: TSPS's forecasting performance is competitive with or surpasses state-of-the-art algorithms on standard multivariate time series datasets as we vary the amount of missing data (a-d) and the noise level (e-h).

    B.2.2 Details of Variance Estimation Experiments

As stated earlier, we use the Synthetic II dataset for the variance estimation experiments. Specifically, we generate nine sets of multivariate time series, each with a different additive combination of time series dynamics and a different observation model, as described in Appendix B.1.1. We then test TSPS's performance as we vary the fraction of observed data $p \in \{1.0, 0.8, 0.5\}$.

For imputation, to the best of our knowledge, there is no algorithm which produces prediction intervals. Hence, we use an adaptation of TRMF (though it does not natively support variance estimation) to benchmark TSPS's performance. For forecasting, we compare with DeepAR. Refer to Appendix B.1.4 for details of how we use these algorithms for variance estimation.

Results. With respect to imputation, we find TSPS outperforms TRMF in all but one experiment, where the ratio of TRMF's error to TSPS's is in the range 0.97-2.27 (see Table 3). With respect to forecasting, we find TSPS outperforms DeepAR in all experiments, where the ratio of DeepAR's error to TSPS's is in the range 1.01-1870.59 (see Table 3). TSPS's performance is notably superior when dealing with integer (e.g., Poisson generated) observations.

    B.2.3 Robustness to Observation Models

In practice, the same latent time series dynamics can lead to very different observation models. A time series prediction system should ideally be able to make accurate predictions despite having access to different data types. We show that TSPS is pleasingly "noise agnostic", i.e., it delivers excellent imputation and forecasting results when given access to different observation models (i.e., data types) for the same underlying time series: for example, Gaussian (i.e., float), Bernoulli (i.e., boolean), and Poisson (i.e., integer) observation models all generated from the same underlying time series, such as a sum of harmonics, as shown in Figures 8a, 8b, and 8c, respectively. Refer to the description of Synthetic III in Appendix B.1.1 for details of how these observations are generated.

In this example, we find that TSPS's imputation error in RMSE/$R^2$ is (0.079/0.854), (0.078/0.858), and (0.102/0.758) for the Gaussian, Bernoulli, and Poisson observation models, respectively. Similarly, the forecasting error in RMSE/$R^2$ is (0.041/0.979), (0.083/0.914), and (0.110/0.851) for the Gaussian, Bernoulli, and Poisson observation models, respectively. Figure 8 gives a visual depiction of how TSPS effectively imputes and forecasts under Gaussian, Bernoulli, and Poisson observation models. In contrast, other methods perform much worse under these noise models. DeepAR, for example, performs rather poorly with the Bernoulli and Poisson observation models: its forecasting error in RMSE/$R^2$ is (0.232/0.648), (0.427/0.306), and (0.463/-1.367) for the Gaussian, Bernoulli, and Poisson observation models, respectively.

Figure 8: Without knowledge of whether the observations come from a time-varying Gaussian process (i.e., floats; see Figure 8a), a Bernoulli process (i.e., binary data; see Figure 8b), or a Poisson process (i.e., integers; see Figure 8c), TSPS successfully imputes and forecasts the underlying mean.

    B.2.4 Prediction Gains – Multivariate vs. Univariate Algorithm.

In our proposed algorithm, a prediction model is trained using data from a collection of time series. If this collection of time series is carefully selected (i.e., the series have high "correlation"), this should lead to better statistical accuracy (see Appendix A for a justification). The drawback is that training on multiple time series leads to higher model training time and prediction query latency. We evaluate this tradeoff in TSPS's performance across the three metrics of interest.

Specifically, we use the Synthetic I and electricity datasets (detailed in Appendix B.1.1). We evaluate TSPS's performance as we vary the number of time series, $N$, used in training the prediction model: $N \in \{1, 10, 40, 80, 160, 400\}$ and $N \in \{1, 10, 40, 80, 150, 370\}$ for the synthetic and electricity datasets, respectively. We report performance across the three metrics relative to the univariate case.


Table 7: How the three metrics change as we train using more time series (N), relative to the univariate case.

    Dataset: Synthetic I
    N                       1      10     40     80     160     400
    Forecasting Accuracy    1.00   0.62   0.11   0.11   0.08    0.06
    Training Time           1.00   6.80   16.2   30.28  55.24   250.2
    Forecasting Latency     1.00   1.00   1.05   1.15   1.28    1.60

    Dataset: Electricity
    N                       1      10     40     80     150     370
    Forecasting Accuracy    1.00   0.49   0.44   0.45   0.45    0.46
    Training Time           1.00   6.53   20.84  39.63  127.13  369.06
    Forecasting Latency     1.00   1.28   1.35   1.60   1.931   2.62

To reduce variance due to randomness in which time series are selected for the statistical accuracy evaluation, we average results over 50 runs.

Results. Table 7 shows how the three metrics change as we train using more time series. The training time increases almost linearly as we train on more time series, for both the synthetic (6.80x-250.2x) and electricity (6.53x-369.06x) datasets. The prediction query latency increases moderately, by 1.00x-1.60x and 1.28x-2.62x for the synthetic and electricity datasets, respectively, while the forecasting error decreases by 1.61x-17.59x and 2.04x-2.27x, respectively. In summary, we find that using multiple time series to train the prediction model can greatly improve accuracy with a reasonable slowdown in training time and query latency. However, beyond a certain point (roughly $N > 50$), the gain in statistical accuracy from adding more time series becomes minimal and thus can add unnecessary overhead.

    B.3 Additional Computational Experiments

    B.3.1 Scalability of TSPS

Setup. We test the computational performance of TSPS as we add more data, and how it compares with standard PostgreSQL metrics. The dataset used is a set of noisy observations of a sum of harmonics. With this data, we create a table with the following simple schema: synthetic(time: timestamp, time_series: float), where the column time is the primary key, indexed by the default PostgreSQL index (a B-tree data structure). Note that indexing the time column is standard practice in time series DBs.
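The two PostgreSQL baselines used below (bulk insert throughput and SELECT point-query latency) can be measured with a short script along the following lines. The connection settings, the use of psycopg2, and the row count are illustrative assumptions, not the exact benchmark harness used for the reported numbers.

    import io
    import time
    import numpy as np
    import psycopg2

    conn = psycopg2.connect("dbname=tsps_bench user=postgres")   # assumed connection settings
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS synthetic (
            time        timestamp PRIMARY KEY,   -- B-tree index via the primary key
            time_series float
        )
    """)
    conn.commit()

    # noisy sum-of-harmonics observations, one row per second (10^4 rows here; the experiment sweeps up to 10^8)
    n_rows = 10_000
    t = np.arange(n_rows)
    values = np.cos(2 * np.pi * t / 97) + 0.3 * np.random.randn(n_rows)
    ts = np.datetime64("2020-01-01") + t.astype("timedelta64[s]")

    # bulk-insert baseline via COPY
    buf = io.StringIO("\n".join(f"{a}\t{b}" for a, b in zip(ts, values)))
    start = time.perf_counter()
    cur.copy_from(buf, "synthetic", columns=("time", "time_series"))
    conn.commit()
    print("bulk insert:", time.perf_counter() - start, "s")

    # SELECT point-query latency baseline
    start = time.perf_counter()
    cur.execute("SELECT time_series FROM synthetic WHERE time = %s", (str(ts[1234]),))
    cur.fetchone()
    print("SELECT latency:", time.perf_counter() - start, "s")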

Throughput. We evaluate the time taken to train the prediction model in TSPS as we vary the number of rows inserted from $10^4$ to $10^8$. We compare it against the time needed to insert the same rows into the aforementioned indexed table. We find that the time required to train TSPS's prediction model is 1.13-2.17x faster than PostgreSQL's insert time as we vary the dataset size; in absolute terms, training takes 2.62-7.78 microseconds per record (see Figure 9a). Since the time required to train the prediction model is lower than PostgreSQL's bulk insert time, and the two operations are independent and so can effectively be run in parallel, integrating TSPS with PostgreSQL will not bottleneck PostgreSQL's throughput as new data points are inserted.

Query Latency. We compare the latency of a standard SELECT point query in PostgreSQL against the time required to produce a prediction in TSPS. We evaluate the latency of both imputation and forecasting predictions, with and without uncertainty quantification. As shown in Figure 9b, we find that imputation queries are 1.77-2.56x slower than SELECT queries as we vary the table size, and forecasting queries are 3.87-6.39x slower. Predictive queries with uncertainty quantification are 3.04-4.60x and 6.99-12.80x slower for imputation and forecasting, respectively. In absolute terms, this amounts to 0.41-0.61 milliseconds for SELECT queries, 0.94-1.23 milliseconds for imputation queries (1.72-2.11 milliseconds with uncertainty quantification), and 1.73-3.04 milliseconds for forecasting queries (3.02-5.85 milliseconds with uncertainty quantification).

B.3.2 Hyper-parameter Tuning - T0, T', γ

We discuss how the default parameters $T_0$, $T'$, and $\gamma$ are selected in TSPS; we quantify the effect of each of these parameters on the three metrics of interest: training time, prediction latency, and statistical accuracy. We find $T_0$ has minimal effect on these three metrics, and for simplicity of exposition we simply choose $T_0 = 100$ throughout the experiments.

Figure 9: (a) Training TSPS's prediction model is 1.13-2.17x faster than PostgreSQL's bulk insert. (b) Predictions using TSPS are only 1.77x-6.39x slower than standard SELECT queries.

Setup. In this experiment, we use a synthetic time series generated from a sum of harmonics with $5 \times 10^7$ data points. We initially train the prediction model with all but the last $10^6$ entries and then incrementally add the remaining data over 1000 batches, with 1000 data points in each batch; we do so to study the effect of $\gamma$ on training time, whose effect is limited to such data insertion patterns. We vary the parameter $T' \in [10^4, 10^7]$ and $\gamma \in \{0.01, 0.2, 0.5, 0.7, 1.0\}$.

Results. Figure 10 shows TSPS's performance on the three metrics of interest as we vary these parameters. The x-axis reflects the time needed to train the prediction model relative to the time needed to insert the same data into a PostgreSQL table. The y-axis reflects the forecast query latency relative to a PostgreSQL SELECT query. The color of each dot corresponds to the forecasting accuracy in normalized RMSE. We see a clear trade-off between query latency and accuracy. Specifically, using small values of $T' \in [10^4, 10^5]$ yields faster but less accurate predictions (3x-4x slower than SELECT queries); further, it has a negative effect on training time (2x-6x the insert time). On the other hand, using high values of $T' \in [5 \times 10^6, 5 \times 10^7]$ yields more accurate but slower predictions (6x-7x slower than SELECT queries), with relatively low training time (70% of the insert time). Choosing $T' = 2.5 \times 10^6$ gives accurate and relatively fast predictions (5.1x slower than SELECT queries) and relatively low training time (1.25x quicker than PostgreSQL insert time).

We find $\gamma$'s effect on accuracy to be minimal, but it has a significant effect on training time for a certain range of parameter choices. For example, choosing $\gamma = 0.01$ slows training by 50% compared to $\gamma = 1.0$. Training time does not vary for $\gamma \in \{0.5, 0.7, 1.0\}$, with $\gamma = 0.5$ performing best; hence we choose $\gamma = 0.5$. Note that $\gamma$ has no effect on prediction latency.

Figure 10: $T'$ and $\gamma$ can significantly affect the three metrics of interest. We choose $T' = 2.5 \times 10^6$ and $\gamma = 0.5$, which give the best trade-off across all three metrics.

