
A Database Model for Time Series: From a Traditional Data Warehouse to a Mathematical Models Warehouse

Cyrille Ponchateau1, Ladjel Bellatreche2, Carlos Ordonez3, and Mickael Baron4

1 ISAE-ENSMA, [email protected]


3 Houston University, [email protected]


Abstract. In scientific research, the results of an experiment commonly take the form of a time series, consisting of measurements collected from a sensor over time. After time series are stored, mathematical models are derived using numerical methods. Even though there exist plenty of tools to store and analyze time series data, there is scarce research aiming at storing and querying derived models, which are the most important mechanism for a scientist to understand data. In this work, we propose to help scientists with a flexible database structure to persist and manage mathematical models with a mathematical models warehouse, adapting and extending traditional data warehouses. This paper has two main contributions: (1) an ER model to store differential equations for a time series; (2) a novel ETL process, considering mathematical models and time series data separately. To validate our proposed ER model and new ETL features, we present an experimental evaluation with synthetic time series data. Our results confirm that our mathematical model database representation is intuitive, flexible and accurate.

1 Introduction

The evolution of computer technologies in terms of computing and storage capacity now allows the processing of large sets of numerical data. In particular, time series processing finds diverse applications in numerous scientific fields, from finance (weekly sales total [13], stock price movements [23]...) to medicine (medical imaging [3], electrocardiogram [22, 13] or medical surveillance in general [10]) and physics (particle tracking [22]). Time series are also used in experimental sciences in general [33].

Our work, in particular, was initiated by an automatic control team. In their domain (as in other experimental sciences), researchers design experiments to study physical systems. Consider an electrical motor as an example. When a voltage is applied at the terminals of the motor, it starts rotating, and the rotating speed depends on that voltage. The aim of the automatic control researcher is then to describe mathematically the dependence between the voltage and the rotating speed. To do so, the motor is tested with different voltages and the evolution of the rotating speed is measured by a sensor. The measurements are taken at a fixed rate, generating a series of chronological values, represented as a time series. The next step consists in analyzing the time series, using numerical tools or software such as Matlab, Octave, R... Finally, the analysis provides a mathematical model (a differential equation) describing how the rotating speed of the motor behaves according to the voltage.

Once the models are identified, the researchers have no standard structure to store them. Therefore, they are stored in different formats, sometimes hidden in a Matlab or Python script, or in a text file... As a result, the retrieval of a particular model requires a long search among different files and folders, and reading through files. In order to provide both standardization and organization, and to avoid manual search and retrieval of models, we propose to store the models in a data warehouse and to add a process specific to time series to the ETL process.

Currently, many mature technologies exist for the massive storage of time series (TokuDB [26, 4], Vertica [21], OpenTSDB [9], SciDB [36]...). However, storing only the time series is not enough. Indeed, the interest of experimental scientists lies in the derivation of mathematical models adapted to the observations they made (models that fit the time series data), especially when the model is not merely a mathematical formula that happens to fit the data. Such models usually carry a concrete explanation of how the observed systems behave and why. This information is more crucial to persist than the observed time series. Also, since mathematics gives very general tools to solve a diversity of problems, one model can be used to describe several different systems.

Therefore, our goal is to provide a standard solution to represent, store and exchange models, as well as to help in the model scanning process. Our proposal is divided into the following two points:

– An ER model to store differential equations in a relational database;

– An ETL-like software layer, to help the end-user exploit the database and enhance its integration in the numerical software environments data scientists are more commonly used to (Python, R, Matlab...).

This paper is organized as follows. Section 2 details the context of our study. Section 3 reviews related work on time series representation techniques, in order to position our representation among existing methods. Section 4 deals with the representation of an equation in a computer program and in a relational database. The design of the mathematical models warehouse is detailed in Section 5. Section 6 presents our prototype of a mathematical models warehouse. Finally, Section 7 concludes this paper with some further suggestions.

2 Context of Study

In automatic control (as in other experimental sciences), researchers design experiments to study physical systems. Consider an electrical motor as an example. When a voltage is applied at the terminals of the motor, it starts rotating, and the rotating speed depends on that voltage. The aim of the automatic control researcher is then to describe mathematically the dependence between the voltage and the rotating speed. To do so, the motor is tested with different voltages and the evolution of the rotating speed is observed by measuring it at a fixed rate. At the end of the process, the measurements form a list of chronological speed values, that is, a time series. The next step consists in analyzing the time series, using numerical tools or software such as Matlab, Octave, R... Finally, the analysis provides a mathematical model (a differential equation) describing how the rotating speed of the motor behaves according to the voltage.

Once the models are identified, the researchers have no standard structure to store them. The result is a disorganized storage of the models, with no management system. The models can be stored in different formats, such as Matlab or Python scripts, or text files... The documents are spread across the file systems of each researcher's machine. As a result, the retrieval of a particular model requires a long search among different files and folders, and reading through files. In order to provide both standardization and organization, and to avoid manual search and retrieval of models, we propose to store the models in a data warehouse and to add a process specific to time series to the classical ETL process. Note that this time series specific process does deal with the quality of the data to load in the database, similarly to the work in [27]. However, unlike that work, our work does not deal with data transformation and loading; it only extracts time series from source files.

Having an organized storage system for equations could be used to automate some particular searches. For instance, some experiments may give similar results (e.g. two different motors from two different producers) and it could be interesting to check this similarity by looking at their equations. However, those experiments are not necessarily made at the same time (one motor may be delivered months after the other). After the second experiment (on the second motor), we need to retrieve an equation found some time ago, starting from the new raw time series data obtained in that second experiment.

In our approach, we deal with linear differential equations. The aim is to develop a data warehouse that will allow the end-user to:

– Populate the warehouse with new equations;

– Make comparison queries: give a time series as an input to the warehouse, so that it finds the model or set of models that fit the input series, or fit it best.

In addition, the user must be able to perform some queries on the equations. Let us consider the following example as a use case:

Ty′(t) + y(t) = Ku(t) (1)

where T and K are real parameters defined as the time constant and the gain, respectively, and u(t) and y(t) are the system input and output, respectively.

An example query could be to select all systems that have a time constant between two defined values. Given the equation, the time constant can be read directly from the T variable. Calculating the time constant from the time series would require computing the asymptotic value (or an approximation) of the series, before being able to approximate the time constant and compare it to the given interval.

Another query could be to sort second-order damped oscillating systems into overdamped, critically damped, underdamped and undamped categories. This classification requires computing several physical parameters of the systems (the damping ratio in particular). Those parameters are directly readable if we know the equations that govern the systems. Another type of possible query is the selection of all the equations of a given order.
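To illustrate why such queries become simple once the parameters are stored, the following minimal Python sketch filters systems by time constant and classifies second-order systems from their damping ratio. The function names, the dictionary layout and the tolerance are assumptions made for this illustration only; they are not part of the proposed warehouse.

```python
def with_time_constant_between(time_constants, low, high):
    """Return the names of the systems whose time constant T lies in [low, high].

    time_constants is a hypothetical mapping: system name -> T.
    """
    return [name for name, t in time_constants.items() if low <= t <= high]


def damping_category(m, tol=1e-9):
    """Classify a second-order system from its damping ratio m (assumed >= 0)."""
    if m < 0:
        raise ValueError("the damping ratio is assumed to be non-negative")
    if m <= tol:
        return "undamped"
    if abs(m - 1.0) <= tol:
        return "critically damped"
    return "underdamped" if m < 1.0 else "overdamped"


if __name__ == "__main__":
    print(with_time_constant_between({"motor_a": 25.0, "motor_b": 2.0}, 10.0, 30.0))
    print(damping_category(0.4))
```

Both operations only read stored parameters; no time series has to be reloaded or analyzed.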

3 Related Work

3.1 Temporal representations

In [13], simple sampling [37] is presented as the simplest representation method. It consists in sampling the series at a fixed rate. Several works then try to select the points to extract more wisely, in order to compress the time series more efficiently. Different strategies were introduced over the last decades. In [25], the authors proposed an algorithm to identify all important peaks and troughs, called control points, and classify them into an ordered multiple-layer structure they called a lattice structure. As more layers are considered, the representation gets more accurate. In [28], a landmark model to identify the important points in the time series is introduced. A landmark, in a time series, is a set of points of great importance. In [8], the authors introduce a method to identify salient points, called Perceptually Important Points (PIP). Through the PIP identification process, the points of the series can be ordered by their importance measure, which ensures that the most meaningful points are kept first. In [11, 12], the authors compress time series by extracting the extrema and giving them an importance rate, which allows them to sort the extrema by importance and select only those with a rate higher than a user-defined threshold. Overall, those methods aim at eliminating some points of the time series, with different selection heuristics. The final goal is to provide a reduced set of points to the algorithm that will analyze the series, and hence to reduce the processing time of that algorithm.

3.2 Segmentation

Segmentation is a particular case of temporal representation. It consists in dividing a time series into segments [24]. A segment (also called a time window) is defined as a set of consecutive points (a subsequence) of the time series. Inside a segment, the data points can be represented by one specific value (such as the mean or median value) or by a function fitted to the data points (generally a straight line).

Several segmentation methods were introduced in the scientific literature. In [15], the Piecewise Aggregate Approximation (PAA) is introduced. The method consists in dividing the series into segments of the same length and keeping only the mean value of each segment. In [6], the authors proposed an enhancement of the PAA method, called Adaptive Piecewise Constant Approximation (APCA). APCA is basically the same as PAA, except that the segment length can vary, in order to adapt to the global shape of the series.
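As an illustration of the PAA idea described above, here is a minimal Python sketch. The function name and the handling of lengths that are not a multiple of the number of segments (segment sizes then differ by at most one point) are assumptions of this sketch, not details taken from [15].

```python
def paa(values, n_segments):
    """Piecewise Aggregate Approximation: keep the mean of each segment.

    The series is cut into n_segments consecutive chunks of (nearly) equal
    length; each chunk is replaced by its mean value.
    """
    n = len(values)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(values)")
    means = []
    for k in range(n_segments):
        start = k * n // n_segments
        end = (k + 1) * n // n_segments
        chunk = values[start:end]
        means.append(sum(chunk) / len(chunk))
    return means


if __name__ == "__main__":
    print(paa([1, 2, 3, 4, 5, 6], 3))
```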

The previous methods use one or two values to approximate a segment, whereas in [33], the authors proposed to use continuous digitalized functions. They chose to use straight lines, which can be seen as first-order polynomials. In [16], the authors state that lines can be computed using either Linear Regression, which corresponds to the first-order polynomial case of [33], or Linear Interpolation, which consists in representing a segment by the line connecting its left bound to its right bound.

Since the segments can have a variable length, each method has to find an optimal way to segment time series. According to Keogh et al. [16], there is no universally optimal algorithm. Therefore, segmentation heuristics are used, usually classified into three categories:

– Bottom-Up algorithms: the time series is divided into a large number of equal-length segments to obtain the finest possible approximation. Then, the segments are merged together, trying to introduce as little error as possible.

– Top-Down algorithms: the whole series is considered as one big segment. Then the segments are split into smaller ones. The splitting positions are chosen to reduce the error as much as possible.

– Sliding-Window algorithms: the series is read chronologically. The point currently read is added to the current segment, until the error reaches a threshold value. Then, the next points are added to a new segment.

The segmentation philosophy is not to eliminate points, but to try to represent all of them in a more compact format.
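The sliding-window heuristic can be sketched as follows in Python. This is only one possible reading of the description above: segments are approximated by linear interpolation between their end points, the error measure is the sum of absolute deviations, and consecutive segments share an end point. These choices and all names are assumptions of the sketch.

```python
def segment_error(values, start, end):
    """Sum of absolute deviations between the points of values[start..end]
    and the straight line joining the two end points."""
    if end - start < 2:
        return 0.0
    y_start, y_end = values[start], values[end]
    slope = (y_end - y_start) / (end - start)
    return sum(
        abs(values[i] - (y_start + slope * (i - start)))
        for i in range(start + 1, end)
    )


def sliding_window(values, max_error):
    """Return a list of (start, end) index pairs (inclusive).

    A segment grows point by point while its error stays within max_error;
    the next segment then starts at the last point of the previous one.
    """
    segments = []
    n = len(values)
    start = 0
    while start < n - 1:
        end = start + 1
        while end + 1 < n and segment_error(values, start, end + 1) <= max_error:
            end += 1
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    print(sliding_window([0, 1, 2, 3, 2, 1, 0], 0.01))
```

On a piecewise-linear series such as the one above, two segments are enough, which is why step-like or ramp-like inputs compress well under this kind of heuristic.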

3.3 Spectral representations

In [32], the author explains that any continuous, periodic function can be represented as a combination of sines and cosines. He then explains how to use the Fourier Transform on a sampled function, leading to the Discrete Fourier Transform (DFT), which maps a discrete periodic sequence to a discrete sequence of coefficients in the frequency domain. In [35], a wave is defined as an oscillating function of time and space. A wavelet oscillates only over a particular time period and is equal to zero everywhere else. Any continuous function can be represented as a combination of wavelets. Unlike with the Fourier Transform, the function does not need to be periodic.

According to [29], the Fast Fourier Transform (FFT) is usually not used for reduction purposes, but to ease some operations, such as convolution, that are simpler to perform in the frequency domain than in the time domain. On the contrary, the wavelet transform does not necessarily make operations easier, but it can be truncated. In the series point list, each individual point carries the same amount of information, whereas a wavelet transform makes some values more significant than others. Therefore, the less significant ones can be ignored.

3.4 Other representations

In [2], the authors used time series to train a Hidden Markov Model (HMM) able to reproduce the time series data. In [34], the author discusses the AutoRegressive Moving Average (ARMA) model. The ARMA model consists in finding a relation (by regression) between a current value and its past values. In [20], the authors mention the Singular Value Decomposition (SVD), which consists in finding similar groups of points corresponding to a "pattern", also called a "principal component". In those representations, the data are represented using a derived mathematical model or compared to some patterns.

3.5 Related Work Conclusion

Among the methods reviewed, several strategies were pointed out:

– reduction of the number of points, by elimination;

– data reduction, by using another representation of the point values;

– using the frequency domain to ease some operations;

– using a transformation that introduces a hierarchy among the points;

– derived mathematical models or pattern comparison.

Our solution is similar to derived mathematical models. Differential equations are a particular kind of mathematical model and it is possible to use them to fit the points of a particular time series. However, in the case of data science, differential equations are more meaningful than other classical mathematical models (more details are given in section 2).

In addition, this related work shows that storing time series, building time series databases, querying those databases and comparing time series with each other are not new issues. Indeed, several works aim at searching or matching a query time series against a database of time series [20, 31, 7, 1, 25, 8, 17–19]. A representation can be used to optimize storage space, but its main interest is usually the optimization of searching and matching queries.

There are also relational time series databases, but a relational table is usually not well suited to storing a time series. First, the database may have to store more than one series in a table, in an asynchronous environment. In [26], the authors mention the Fractal Tree index. It is a tree index structure that keeps data always sorted and allows fast searches and sequential access (similarly to the B-tree index), and it has buffers at each node to keep a record of the changes. The buffers help optimize the scheduling of disk writes, so that each disk write deals with a large block of data at once. In order to store time series in a customized form that meets the needs of the processing method, it is usually preferable to use NoSQL databases.

An equation describes a time series using variables, functions, numbers and operators related to each other. Those mathematical objects can be seen as entities, and the equation defines a relation between those entities. As a result, they appear to be suitable for storage in a relational database. For the automatic control team, the experimental data do not necessarily need to be dealt with in real time, since an experiment consists in measuring values, formatting them for analysis and then performing the analysis. Therefore, the warehouse should deal with finite (though potentially high-dimensional) time series.

4 ER Model to Represent Time Series Equations

4.1 Equations Represented as Parser Binary Trees

To represent the formula, we chose to use a binary tree instead of a simple string. Although the structure is more complicated, features such as the equation order or the variables are easier to recognize, since they are directly readable in the node attributes.

Let us consider the example equation 1. The given expression is the formula of the equation. The formula alone is rather useless. Indeed, while it is possible to solve it symbolically, it is not possible to compute any numerical value of the solution. Consequently, it has to be given with the following set of data: numerical values for the literal variables (T and K); a definition of the input functions (u(t)); a time origin t0; and initial values (y0). This set of data is the parameters set of the equation. A differential equation is defined by one formula and one parameters set.

Figure 1 gives the binary tree of the example equation 1. The leaves of the tree are the different terms of the equation (constants, functions...), whereas non-leaf nodes are operators. The root node always contains the equality operator (=). Figure 2 shows the attributes of a single node. The math_object attribute specifies whether the node is a binary operator (requires two operands), a unary operator (requires only one operand), a function (such as y and u), a variable (such as T and K) or a number (a constant coefficient). The value attribute is either the name of a function or variable, the operator or the number value. The deriv attribute specifies the order of derivation of a function (1 for y′). The last two attributes are the left and right children of the node.

To read the equation from the tree structure, we use a depth-first, in-order traversal. When considering a node, we read its left subtree, then the node itself, then its right subtree.

Fig. 1: Binary tree of the equation 1 (root: =; left subtree: + with children ∗(T, y′) and y; right subtree: ∗(K, u))

Fig. 2: Single node attributes

For instance, if we consider the left node containing the ∗ sign, we have to read its left subtree, which contains the T constant, then the node itself, which indicates a multiplication, and then the right subtree, containing the y′ term. The complete term read is then Ty′. Applying this reading method recursively from the root node retrieves the entire equation 1.
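The tree representation and the in-order reading can be sketched in Python as follows. The attribute names follow the node attributes described above (math_object, value, deriv, left and right children); the class name, the string labels for math_object, the output format and the assumption that a unary operator keeps its operand in the left child are choices of this sketch. Operator precedence and parentheses are deliberately not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    math_object: str            # "binary_operator", "unary_operator", "function", "variable" or "number"
    value: str                  # operator symbol, function or variable name, or number
    deriv: int = 0              # order of derivation (functions only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def to_infix(node):
    """Depth-first, in-order reading: left subtree, node, right subtree."""
    if node is None:
        return ""
    if node.math_object == "function":
        return node.value + "'" * node.deriv + "(t)"
    if node.math_object in ("variable", "number"):
        return node.value
    if node.math_object == "unary_operator":
        return node.value + to_infix(node.left)  # operand assumed in the left child
    left, right = to_infix(node.left), to_infix(node.right)
    if node.value == "*":
        return f"{left}*{right}"
    return f"{left} {node.value} {right}"


def equation_order(node):
    """Highest derivation order found among the function nodes."""
    if node is None:
        return 0
    own = node.deriv if node.math_object == "function" else 0
    return max(own, equation_order(node.left), equation_order(node.right))


# Tree of equation 1: T y'(t) + y(t) = K u(t)
eq1 = Node("binary_operator", "=",
           left=Node("binary_operator", "+",
                     left=Node("binary_operator", "*",
                               left=Node("variable", "T"),
                               right=Node("function", "y", deriv=1)),
                     right=Node("function", "y")),
           right=Node("binary_operator", "*",
                      left=Node("variable", "K"),
                      right=Node("function", "u")))

if __name__ == "__main__":
    print(to_infix(eq1))
    print(equation_order(eq1))
```

The second function shows why the tree form is convenient: the order of the equation is obtained by inspecting node attributes, without parsing a string.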

From Binary Tree to ER Model. We chose a star-like schema (see Figure 3). Having nested objects such as trees and lists (of variables, initial values or input functions) requires the use of associations [14, 5]. Therefore, it was necessary to introduce many-to-many relationships into the star schema, which originally takes only one-to-many relationships. Nevertheless, we kept a schema in which the equations are at the center and play a role similar to that of a fact table in a traditional multidimensional model. In addition, it is still possible to add actual dimensions to the model, as in a classical star schema, in order to store more information about the equations (date of insertion, experiment, name of the experimenter...). For instance, each equation can be associated with the experimenter who added it. The experimenter has a personal identity (name, date of birth...) and may also be part of a team, which can be part of a department. The dimension has the following hierarchy: department < team < experimenter. The star-like schema gives the opportunity to perform operations such as Roll-Up or Drill-Down, in order to group equations that belong to the same experimenter, the same team or the same department. Similarly, adding the date of insertion of the equations would require an additional time dimension. The equations could then be aggregated by month or year, which could help follow the laboratory activity.

The proposed schema has the following tables :

– Differential_Equations: plays the role of the fact table.

– Node_Content: contains the math_object, value and deriv attributes of the node, since they can be shared among several nodes.

– Node: contains the ID of the node. An association table specifies the left and right children of the node, the ID of its parent and the ID of the equation the node belongs to.

– Initial_Value: contains the initial values of the equations. The y0 attribute contains the actual value, while the deriv attribute contains the order of derivation. Therefore, a tuple such as (y0 = 0, deriv = 1) means y′0 = 0.

– Variable: contains the variables such as T or K, saved in the name attribute, associated with their values, in the value attribute.

– Input: contains the input functions, such as u in the example from Figure 1. The name attribute stores the name of the function as it appears in the equation and the file attribute contains the text file in which the function values are stored.

Fig. 3: Star-Like Schema

Note that t0 does not appear in the tables. It is the first time value of the input functions. Therefore, storing t0 in the Variable table is in practice not necessary, but it can optimize access to the value, if needed.
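To make the table list more concrete, the following Python sketch creates a simplified version of such a schema in an in-memory SQLite database and runs the time-constant query from section 2. It is not the schema of Figure 3: the column names and types are assumptions, the association table of Node is folded into plain columns for brevity, the Input table is renamed input_function, and SQLite stands in for the DBMS.

```python
import sqlite3

# Simplified, hypothetical schema inspired by the tables listed above.
SCHEMA = """
CREATE TABLE differential_equation (id INTEGER PRIMARY KEY);
CREATE TABLE node_content (
    id INTEGER PRIMARY KEY,
    math_object TEXT NOT NULL,
    value TEXT NOT NULL,
    deriv INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE node (
    id INTEGER PRIMARY KEY,
    equation_id INTEGER NOT NULL REFERENCES differential_equation(id),
    content_id INTEGER NOT NULL REFERENCES node_content(id),
    parent_id INTEGER REFERENCES node(id),
    left_id INTEGER REFERENCES node(id),
    right_id INTEGER REFERENCES node(id)
);
CREATE TABLE variable (
    equation_id INTEGER NOT NULL REFERENCES differential_equation(id),
    name TEXT NOT NULL,
    value REAL NOT NULL
);
CREATE TABLE initial_value (
    equation_id INTEGER NOT NULL REFERENCES differential_equation(id),
    y0 REAL NOT NULL,
    deriv INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE input_function (
    equation_id INTEGER NOT NULL REFERENCES differential_equation(id),
    name TEXT NOT NULL,
    file TEXT NOT NULL
);
"""


def equations_with_variable_between(conn, name, low, high):
    """IDs of the equations whose variable `name` has a value in [low, high]."""
    rows = conn.execute(
        "SELECT equation_id FROM variable "
        "WHERE name = ? AND value BETWEEN ? AND ? ORDER BY equation_id",
        (name, low, high),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO differential_equation (id) VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO variable (equation_id, name, value) VALUES (?, ?, ?)",
    [(1, "T", 25.0), (1, "K", 2.4), (2, "T", 2.0), (2, "K", 1.3)],
)

if __name__ == "__main__":
    print(equations_with_variable_between(conn, "T", 10.0, 30.0))
```

With the parameters stored as rows, the range query on the time constant reduces to a single SELECT.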

4.2 Comparing Time Series and Equations

Timestamp        Value
0.0000000e+00    0.0000000e+00
1.0000000e+00    9.4105346e-02
2.0000000e+00    1.8452077e-01
3.0000000e+00    2.7139095e-01
4.0000000e+00    3.5485491e-01
5.0000000e+00    4.3504619e-01
6.0000000e+00    5.1209313e-01
7.0000000e+00    5.8611902e-01
8.0000000e+00    6.5724231e-01
9.0000000e+00    7.2557682e-01
...              ...

Table 1: Example of time series (first ten elements)

Iteration    Value
0            0.0000000e+00
1            9.4105346e-02
2            1.8452077e-01
3            2.7139095e-01
4            3.5485491e-01
5            4.3504619e-01
6            5.1209313e-01
7            5.8611902e-01
8            6.5724231e-01
9            7.2557682e-01
...          ...

Table 2: First ten values given by Octave

It is possible to compute the values of the time series from the equation, by solving the latter numerically with algorithms such as the Euler and Runge-Kutta methods (see [30]). Table 2 shows the first ten values given by a Runge-Kutta algorithm, and we can see that they are equal or close to their corresponding values in the series from Table 1. Once the values are generated, they can be compared with the time series given by the user as raw input series.
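As an illustration, the classical fourth-order Runge-Kutta scheme applied to equation 1 can be sketched in Python as below. The function names are hypothetical. The example run uses T = 25, K = 2.4, y0 = 0 and a step of 1, and additionally assumes that the input u stays equal to 1 over the first points; under these assumptions the generated values are close to the first rows of the tables above.

```python
def rk4(f, t0, y0, h, n_steps):
    """Classical 4th-order Runge-Kutta for y'(t) = f(t, y), fixed step h."""
    ts, ys = [t0], [y0]
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y = y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t = t + h
        ts.append(t)
        ys.append(y)
    return ts, ys


def first_order_rhs(T, K, u):
    """Right-hand side of T y'(t) + y(t) = K u(t), solved for y'(t)."""
    return lambda t, y: (K * u(t) - y) / T


# Assumed input: u(t) = 1 over the simulated range.
ts, ys = rk4(first_order_rhs(25.0, 2.4, lambda t: 1.0), 0.0, 0.0, 1.0, 9)

if __name__ == "__main__":
    for t, y in zip(ts, ys):
        print(f"{t:.7e} {y:.7e}")
```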

When comparing an equation to a time series, several cases are to be considered:

– a matching equation exists in the warehouse (referred to as the ModelAccepted case);

– no matching equation exists (referred to as the ModelRejected case);

– some equations give results close to the series, but not close enough to be considered as a match (referred to as the Undetermined case; an example is given in section 6.1).

The comparison of time series is not an easy task: since the values are continuous, the comparison method needs to cope with value approximation and noise. In this work, we propose a comparison technique that consists in computing the errors between the data calculated from the equations and the input data, then calculating the mean error and the standard deviation (it is also possible to add other statistical values, e.g. correlation coefficients). We can then define discriminatory threshold values to decide whether or not an equation fits the data. Both the mean and the standard deviation come with a minimum and a maximum threshold value. Using those values, the ModelRejected, ModelAccepted and Undetermined cases can be defined as follows:

– ModelRejected: the mean error or the standard deviation is greater than its maximum threshold value;

– ModelAccepted: both the mean error and the standard deviation are less than their minimum threshold values;

– Undetermined: neither of the two previous cases is met.
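The three cases above can be sketched as a small Python function. The use of absolute point-wise errors and of the population standard deviation, as well as the function and parameter names, are assumptions of this sketch; the text does not fix these details.

```python
from statistics import mean, pstdev


def classify(generated, observed, mean_min, mean_max, std_min, std_max):
    """Return "ModelAccepted", "ModelRejected" or "Undetermined".

    generated: values computed from the equation
    observed:  values of the input time series (same length, same time grid)
    """
    if not generated or len(generated) != len(observed):
        raise ValueError("series must be non-empty and of equal length")
    errors = [abs(g - o) for g, o in zip(generated, observed)]
    err_mean = mean(errors)
    err_std = pstdev(errors)
    if err_mean > mean_max or err_std > std_max:
        return "ModelRejected"
    if err_mean < mean_min and err_std < std_min:
        return "ModelAccepted"
    return "Undetermined"


if __name__ == "__main__":
    print(classify([1.0, 2.0, 3.0], [1.05, 2.05, 3.05], 0.01, 0.1, 0.01, 0.1))
```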

Figure 4 gives the diagram of the algorithm described above.

Input function storing and continuous interpolation: The input function is itself represented by a time series, which can become an issue if its volume grows too much. Therefore, we first propose to use dimensionality reduction techniques to reduce the memory size necessary to store it. Segmentation methods are particularly interesting, since they also provide a continuous interpolation of the points and make it possible to control the approximation error made on the series points during segmentation. The continuous interpolation of the series is useful, since the numerical solving of the equation may require the evaluation of the input functions at points that are not included in the series point list. Those values then have to be interpolated from the closest available values.
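A minimal Python sketch of such an interpolation is given below, assuming the input function is stored as sorted time stamps with their values (for example the end points of linear segments). The function name and the choice to hold the first or last value outside the covered range are assumptions of this sketch. A Runge-Kutta solver with step h typically needs this kind of evaluation at t + h/2, which is not on the sampling grid.

```python
from bisect import bisect_right


def interpolate(times, values, t):
    """Linearly interpolate the stored input function at time t.

    times must be sorted in increasing order; outside the covered range
    the first or last stored value is returned.
    """
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)          # times[i - 1] <= t < times[i]
    t_left, t_right = times[i - 1], times[i]
    weight = (t - t_left) / (t_right - t_left)
    return values[i - 1] + weight * (values[i] - values[i - 1])


if __name__ == "__main__":
    print(interpolate([0.0, 1.0, 2.0], [0.0, 10.0, 20.0], 0.5))
```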

5 New ETL process for Time Series Models and Data

As shown in Figure 5, the sources of the warehouse can be either raw time series or already built models. Consequently, the mathematical models warehouse has to handle those two kinds of sources. The ETL process is therefore split into two subprocesses, one specific to each type of source.

5.1 ETL for Mathematical Models

As shown in the generic schema of a mathematical models warehouse in Figure 5, the equations go from the data sources through an ETL process, before being loaded into the target schema of the warehouse. This ETL subprocess is thus similar to a classical ETL process. Its role is to extract the equations from the sources. Since they are generated with binary tree semantics, it has to translate this structure into relations. Therefore, a node list is created from the nodes contained in the equation trees. Each node is given a unique ID. For each node, the left and right child ID attributes are set using the structure of the trees. Retrieving the equations from the database requires the inverse procedure, in order to rebuild the trees from the tuples of the database.
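The translation between trees and tuples, and its inverse, can be sketched in Python as follows. For brevity a tree is written here as a nested tuple (value, left, right) and a row as a dictionary; these structures, the field names and the per-equation ID counter are assumptions of the sketch (in a real database the IDs would have to be unique across equations).

```python
from itertools import count


def flatten(tree, equation_id):
    """Turn a nested (value, left, right) tree into a list of node rows."""
    ids = count(1)
    rows = []

    def visit(node, parent_id):
        if node is None:
            return None
        value, left, right = node
        node_id = next(ids)
        row = {"id": node_id, "equation_id": equation_id,
               "parent_id": parent_id, "value": value,
               "left_id": None, "right_id": None}
        rows.append(row)
        row["left_id"] = visit(left, node_id)
        row["right_id"] = visit(right, node_id)
        return node_id

    visit(tree, None)
    return rows


def rebuild(rows):
    """Inverse procedure: rebuild the nested tree from the node rows."""
    by_id = {row["id"]: row for row in rows}

    def build(node_id):
        if node_id is None:
            return None
        row = by_id[node_id]
        return (row["value"], build(row["left_id"]), build(row["right_id"]))

    root = next(row for row in rows if row["parent_id"] is None)
    return build(root["id"])


def leaf(value):
    return (value, None, None)


# Tree of equation 1: T y' + y = K u
tree = ("=",
        ("+", ("*", leaf("T"), leaf("y'")), leaf("y")),
        ("*", leaf("K"), leaf("u")))

if __name__ == "__main__":
    for row in flatten(tree, equation_id=1):
        print(row)
```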

Fig. 4: Comparison algorithm diagram

5.2 ETL for Time Series Data

In the case of time series, the ETL process is more complex. Firstly, the role of the ETL process is not to load the series into the target schema of the warehouse, but to look for an already existing equation in the warehouse to associate the series with, and to notify the user. To do so, time series data are first extracted from the sources. The extraction process may suffer from the same heterogeneity problems encountered in a data warehouse (file formats, structure, unit conversion...). Then, as shown in Figure 5, the data are propagated to the Comparator module, whose role is to find the equation corresponding to the processed input time series. The Comparator requests time series data from the Generator. The latter module gets the equations from the warehouse (DBMS module). The equations are then solved numerically, using methods such as the Runge-Kutta method (see section 4). Solving the equations produces time series objects, which are propagated to the Comparator as well. At this point, the Comparator is able to compare the raw data coming from the ETL module with the regenerated data coming from the Generator module, using the algorithm defined in section 4.2 (Fig. 4).

For a new input series, if an equation matches the series or falls in the undetermined case, the user is notified about that equation and the algorithm keeps looking for other matching or undetermined equations. If there is no match, the user is notified and decides on the next step. The algorithm in Figure 6 summarizes the ETL process for time series.
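The search loop described above can be summarized by the following Python sketch. The Generator and the Comparator are passed in as plain callables; their signatures, the function name and the returned structure are assumptions made for illustration and do not describe the modules of the prototype.

```python
def find_candidate_equations(observed, equations, generate, compare):
    """Scan all stored equations and collect those worth reporting.

    generate(equation) -> regenerated series (role of the Generator)
    compare(generated, observed) -> "ModelAccepted", "ModelRejected"
                                    or "Undetermined" (role of the Comparator)
    Returns a list of (equation, verdict) pairs; an empty list means that
    no equation matched and the user has to decide on the next step.
    """
    notifications = []
    for equation in equations:
        verdict = compare(generate(equation), observed)
        if verdict in ("ModelAccepted", "Undetermined"):
            notifications.append((equation, verdict))
    return notifications
```

The loop does not stop at the first match, mirroring the text: every accepted or undetermined equation is reported.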

6 Experimental Evaluation of Prototype

A prototype of the mathematical models warehouse, implementing the architecture described above, has been developed with the following characteristics: (i) the database is deployed in PostgreSQL; (ii) the time series sources are CSV-formatted text files; (iii) the equation sources are XML files; (iv) the Generator takes the set of equations and generates time series Java objects (each consisting of a set of key-value pairs, where the keys are the time indexes), using the fourth-order Runge-Kutta method mentioned in section 4.2; (v) the Comparator module uses the algorithm of Figure 4; (vi) a first graphical interface has also been implemented and is currently being improved.

Fig. 5: Generic models warehouse schema (components: Source, ETL, models oriented ETL, Comparator, Generator, DBMS, User Applications; the flows between them carry time series (TS) and models)

Fig. 6: Input time series process algorithm diagram

6.1 Models compared to reviewed methods

In order to measure the impact of the input function representation on our solution, the tests have been made using different representations of the input function series:

– An identity representation, in which the input function series are stored without any transformation (denoted Raw).

– Three segmented representations (see section 3.2), where segment values are represented with straight interpolating lines. The time series is segmented using either a bottom-up algorithm (denoted BU), a sliding-window algorithm (denoted SW) or a top-down algorithm (denoted TD). For each segment, the sum of the approximation errors must be under a threshold value of 0.01.

These representations are tested using three series given by the automatic control research team. The first one contains one thousand values and is represented by the following equation:

Ty′(t) + y(t) = Ku(t) (2)

Fig. 7: Generation times in seconds (x-axis: input representation BU, SW, TD, Raw; y-axis: time from 0 to 5 ·10^-2; one bar per series: First, Second, Third)

It comes with the following parameters set: T = 25; K = 2.4; t0 = 0.0; y0 = 0.0; h = 1.0; u is a Heaviside function, whose values are equal to 1 or −1.

The second series also contains one thousand elements and its equation has the same formula as the first one, but not the same parameters set: T = 2; K = 1.3; t0 = 0.0; y0 = 0.0; h = 0.1; u.

The third series contains two thousand values and is represented by the following second-order equation:

$$\frac{1}{\omega_0^2}\,y''(t) + \frac{2m}{\omega_0}\,y(t) = K\,u(t) \qquad (3)$$

Equation (3) comes with the following parameter set: $K = 0.8$; $m = 0.4$; $\omega_0 = 0.1$; $t_0 = 0.0$; $y_0 = 0.0$; $h = 1.0$; $u$ is also a Heaviside function, with values equal to 1 and −1, but its variations differ from those of the input function of the first series.

Generation time: Figure 7 displays the time, in seconds, needed to regenerate the entire series from the different input function series representations. The generation time is on the order of 10 ms for any representation. Using segmentation does not affect the computation cost, because the numerical solving algorithm is the most time-consuming step. However, it is faster to retrieve, solve and compare equations from the database than it is for researchers to do so themselves, particularly when the equations are not centralized and organized.

Fig. 8: Error calculation for the first series, depending on the input representation (Avg, Min, Max and standard deviation for BU, SW, TD and Raw; y-axis scale $10^{-8}$)

Errors: Figures 8, 9 and 10 display the average, minimum and maximum errors between the generated series and the original series, for each equation and for each input function series representation. The fourth column of each group is the standard deviation. For the first and third equations, the errors are approximately $10^{-8}$ and $10^{-7}$ (the values of the series being about $10^{2}$ to 1). For the second series, the error is about $10^{-5}$ with the segmented representations, whereas with the raw input it is about $10^{-8}$, as for the first equation. In addition, for the first and third equations, the minimum compression ratios reached with the segmentation are, respectively, 87.66 % and 89.51 %, whereas the maximum compression ratio reached for the second equation is 31.55 %. This is explained by the fact that the first and third input functions are step functions, which are quite easy to segment with high accuracy. On the contrary, the second input function is continuous, with many changes in its variations. It is therefore more difficult to reach a sufficiently accurate segmented representation. In this case, segmenting the input is not worthwhile, since it introduces many approximation errors for a lower compression ratio.
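The four statistics reported in the error figures can be computed along the following lines. This is a hedged Python sketch, not the Comparator's actual code: the function name `error_summary` is hypothetical, the error at each time index is assumed to be the absolute difference between the original and the generated value, and the standard deviation is assumed to be the population standard deviation.

```python
from statistics import mean, pstdev


def error_summary(original, generated):
    """Summarise pointwise errors between two series.

    Both series are dicts mapping time indexes to values and are assumed
    to share the same time indexes.
    """
    # Assumed error measure: absolute difference at each time index.
    errors = [abs(original[t] - generated[t]) for t in original]
    return {
        "avg": mean(errors),
        "min": min(errors),
        "max": max(errors),
        "std": pstdev(errors),
    }
```

Applied to one equation and one input representation, such a summary would yield one group of four columns in the error figures.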

Fig. 9: Error calculation for the second series, depending on the input representation (BU, SW, TD, Raw; y-axis scale $10^{-5}$)

7 Conclusion and further work

In this article, we detailed the design and implementation of a mathematical data warehouse for differential equations. An ER model to represent differential equations was introduced, along with an ETL process to manage both time series and mathematical equations using numerical algorithms. A prototype has also been described as a proof of concept. Experiments showed that the numerical solving of the equations does not introduce too many approximation errors, as long as the input function series are stored as raw series or using a representation that does not itself introduce too many approximations (which representation is suitable depends on the function series).

Our current work consists in making comparison requests parametrizable (selecting only a part of the equations, passing the comparison method or the generation method as a parameter, overriding initial conditions, input functions or even variable values), which could later lead to an extension of the SQL language that implements those options. Future work will focus on providing an efficient storage solution for the input function series associated with the warehouse. Another perspective is the integration of differential equations into the PMML (Predictive Model Markup Language) standard. This could allow differential equations to be shared with other existing scientific tools (such as Scikit-Learn⁵ with Sklearn⁶ for Python) using a standard language. The aim is to allow integration of the mathematical models warehouse into an existing scientific programming environment. We already propose an XML representation of the equations, in order to have a standard means of communication between several applications. For instance, the database could be useful to select a set of interesting models given the input time series, but if the database does not implement the right numerical tools, the user can then extract those equations in XML and use another numerical tool to go further in the analysis.

Fig. 10: Error calculation for the third series, depending on the input representation (BU, SW, TD, Raw; y-axis scale $10^{-7}$)

5 http://scikit-learn.org/stable/index.html

6 http://pypi.python.org/pypi/sklearn-pmml/0.1.0

References

1. Rakesh Agrawal, Christos Faloutsos, and Arun N. Swami. Efficient similarity search in sequence databases. In Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO), pages 69–84, London, UK, 1993. Springer-Verlag.

2. M. Azzouzi and I. Nabney. Analysing time series structure with hidden Markov models. In Proceedings of the 1998 IEEE Signal Processing Workshop on Neural Networks for Signal Processing, pages 402–408. IEEE, 1998.

3. A. Bagnall, C. Ratanamahatana, E. Keogh, S. Lonardi, and G. Janacek. A bit level representation for time series data mining with shape based similarity. Data Mining Knowledge Discovery (DMKD), 13(1):11–40, 2006.

4. D. Bartholomew. MariaDB Cookbook. PACKT, 2014.

5. J. Celko. Joe Celko’s Trees and Hierarchies in SQL for Smarties. Morgan Kaufmann, 2004.

6. K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In SIGMOD, pages 151–162, 2002.

7. K. Chan and A. Fu. Efficient time series matching by wavelets. In ICDE, pages 126–133, Washington, DC, USA, 1999. IEEE.

8. F. Chung, T. Fu, R. Luk, and V. Ng. Flexible time series pattern matching based on perceptually important points. In International joint conference on artificial intelligence workshop on learning from temporal and spatial data, pages 1–7, 2001.

9. T. Dunning and E. Friedman. Time Series Databases: New Ways to Store and Access Data. O’REILLY, 2015.

10. P. Esling and C. Agón. Time-series data mining. ACM Comput. Surv., 45(1):12:1–12:34, December 2012.

11. E. Fink and H. S. Gandhi. Important extrema of time series. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pages 366–372, 2007.

12. E. Fink and H. S. Gandhi. Compression of time series by extracting major extrema. Journal of Experimental and Theoretical Artificial Intelligence, 23(2):255–270, 2011.

13. T.-C. Fu. A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1):164–181, 2011.

14. B. Karwin. SQL Antipattern. Pragmatic Bookshelf, 2010.

15. E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, 3(3):263–286, 2000.

16. E. Keogh, S. Chu, D. Hart, and M. Pazzani. Segmenting time series: A survey and novel approach. In Proceedings of the 2001 IEEE International Conference on Data Mining, pages 1–21, 2004.

17. E. Keogh and M. Pazzani. Relevance feedback retrieval of time series data. In Proceedings of the 22nd Annual International Conference on Research and Development in Information Retrieval (SIGIR), pages 183–190. ACM, 1999.

18. E. Keogh and M. Pazzani. A simple dimensionality reduction technique for fast similarity search in large time series databases. In Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 122–133, 2000.

19. E. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 24–30, 1997.

20. F. Korn, H. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. SIGMOD Record, 26(2):289–300, 1997.

21. A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, and C. Bear. The Vertica analytic database: C-Store 7 years later. Proc. VLDB Endow., 5(12):1790–1801, 2012.

22. W. Lang, M. Morse, and J.M. Patel. Dictionary-based compression for long time-series similarity. IEEE Transactions on Knowledge and Data Engineering, 22(11):1609–1622, 2010.

23. S. Lee, D. Kwon, and S. Lee. Dimensionality reduction for indexing time series based on the minimum distance. Journal of Information Science and Engineering, 19:697–711, 2003.

24. M. Lovrić, M. Milanović, and M. Stamenković. Algorithmic methods for segmentation of time series: an overview. Journal of Contemporary Economic and Business Issues, 1(1):31–53, 2014.

25. P. Man and M. Wong. Efficient and robust feature extraction and pattern matching of time series by a lattice structure. In Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM), pages 271–278. ACM, 2001.

26. D. Namiot. Time series databases. In Proceedings of the XVII International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL), pages 132–137, 2015.

27. C. Ordonez, S. Maabout, D. Sergio Matusevich, and W. Cabrera. Extending ER models to capture database transformations to build data sets for data mining. Data Knowl. Eng., 89:38–54, 2014.

28. C. Perng, H. Wang, S. Zhang, and D. Stott Parker. Landmarks: A new model for similarity-based pattern querying in time series databases. In ICDE, pages 33–42. IEEE, 2000.

29. W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, second edition, 2002.

30. W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C: The Art of Scientific Computing, pages 707–752. Cambridge University Press, 2002.

31. T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the 18th International Conference on Knowledge Discovery and Data Mining (KDD), pages 262–270. ACM, 2012.

32. H. Shatkay. The Fourier transform - a primer. Technical report, Providence, RI, USA, 1995.

33. H. Shatkay and S. Zdonik. Approximate queries and representations for large data sequences. In ICDE, pages 536–545, 1996.

34. R. Shumway and D. Stoffer. Time Series Analysis and Its Applications. Springer, 2015.

35. C. Sidney, R. Gopinath, and H. Guo. Introduction to Wavelets and Wavelet Transforms: A Primer. Pearson, 1997.

36. M. Stonebraker, P. Brown, A. Poliakov, and S. Raman. The architecture of SciDB. In Proceedings of the 23rd International Conference on Scientific and Statistical Database Management, SSDBM’11, pages 1–16, Berlin, Heidelberg, 2011. Springer-Verlag.

37. K. Åström. On the choice of sampling rates in parametric identification of time series. Information Sciences, 1(3):273–278, 1969.
