
An Overview of Deep Learning Strategies for Time Series Prediction

Rodrigo Neves, Instituto Superior Técnico, Lisboa, Portugal

[email protected]
June 2018

Abstract—Deep learning has been getting a lot of attention in the last few years, mainly due to the state-of-the-art results obtained in different areas like object detection, natural language processing and sequential modeling, among many others. Time series problems are a special case of sequential data to which deep learning models can be applied. The standard option for this type of problem is the Recurrent Neural Network (RNN), but recent results support the idea that Convolutional Neural Networks (CNNs) can also be applied to time series with good results. This raises the following question: which are the best attributes and architectures to apply in time series prediction problems? We assessed the current state of deep learning applied to time series and studied which topologies and characteristics are most promising for sequential tasks and worth exploring further. The study focused on two different time series problems, wind power forecasting and predictive maintenance. Both experiments were conducted under the same conditions across the different models to guarantee a fair basis of comparison. The study showed that different models and architectures can be applied to distinct time series problems with some level of success, demonstrating the value and versatility of deep learning models in distinct areas. The results also showed that CNNs, together with recurrent architectures, are a viable option for time series problems.

I. INTRODUCTION

Time series modeling has always been under intense development and research in statistics and, more recently, in the machine learning area. Time series data often arises in different areas, such as economics, business and engineering, among many others, and can have different applications. It is possible to obtain an understanding of the underlying structure that produced the observed data, or to fit a model in order to make predictions about the future. For example, in the energy sector it is of extreme priority to know in advance how much power will be consumed in the next days, or how much energy will be generated from renewable sources. In the retail sector, every company wants to know in advance what its sales are going to be, or how many pieces of a certain product will be sold, to make decisions on how to manage its portfolio. In heavy industry, a single machine failure can lead to enormous losses, and thereby, if that fault could be predicted in advance, both time and money could be saved.

Several techniques and models can be applied in the field of time series prediction, the best known ones coming from the AutoRegressive Integrated Moving Average (ARIMA) family of models. These models were greatly influenced by the Box-Jenkins methodology [2] and were developed mainly in the area of econometrics and statistics.

Driven by the rise of popularity and attention in the area of deep learning, due to the state-of-the-art results obtained in several areas, Artificial Neural Networks (ANNs), which are non-linear function approximators, have also been receiving increasing attention in the time series field. This rise of popularity is associated with the breakthroughs achieved in this area, with the successful application of CNNs and RNNs to sequential modeling problems with promising results [3]. RNNs are a type of artificial neural network that was specially designed to deal with sequential tasks, and thus can be used on time series. They have shown promising results in the areas of time series forecasting [4] and predictive maintenance [5].

CNNs, which are a different type of neural network originally designed to deal with images, are also being applied to sequence modeling with very promising results [6]. Recent advances in sequence tasks like natural language processing [7], speech recognition [8] and machine translation [9] are being adapted and applied to time series problems [10].

These advances in sequence modeling within the area of deep learning are enormous, which makes it overwhelming and complicated to follow all new advances and research directions. In the area of sequence modeling, the default option has always been to make use of recurrent models like the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) to capture and model the inherent sequential dependencies. Recently, supported by new lines of research, convolutional-based models have been reaching state-of-the-art results in fields like audio synthesis and machine translation. A study [11] was made to assess the performance of a convolutional model on sequential tasks, where the authors found that "a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory".

In this work we present a systematic and fair comparison among different deep learning approaches that can be used in time series problems, with the objective of assessing which characteristics and topologies we should aim to have in a model. This was done on two problems, in the wind power generation forecasting and predictive maintenance areas, where we used and compared different recurrent and convolutional architectures.

The results suggest that convolutional-based algorithms, together with recurrent networks, should be seen as a good option for time series problems. This is supported by the results obtained on both problems, where we could see that convolutional-based models were able to match and even outperform recurrent models.

II. BACKGROUND

Machine learning, and especially deep learning, have been under great development in the last few years. Machine learning is a subfield of artificial intelligence (AI), where systems have the ability to acquire knowledge by extracting patterns from raw data, learning from them, and then making a determination or prediction for the application they were designed for. Deep learning, which in turn is a subfield of machine learning, is mainly based on an algorithm known as the neural network. Neural networks are able to build complex relationships, extracting the information from data without any previous knowledge. Theoretically, neural networks are universal function approximators [1] and can represent any function. CNNs are a type of neural network that was inspired by the convolution operation and was specially designed to be used with data that has a grid-like topology. RNNs are specifically designed to deal with sequential tasks, due to the recurrent connections that they have.

A. Convolutional Neural Network

CNNs are a family of neural networks that use the convolution operation in place of general matrix multiplication in at least one of their layers. The typical convolution operation applied in CNNs is shown in equation 1, where I is the input and K is known as the kernel. The output is referred to as the feature map.

S(i, j) = (I ∗ K)(i, j) = ∑_m ∑_n I(m, n) K(i − m, j − n)   (1)

The convolution operation leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Convolution also provides a way of working with variable input sizes. In shallow neural networks every input interacts with every output, but that does not happen in CNNs. For example, when processing an image, the input might have thousands or millions of pixels, but it is possible to detect small, meaningful features, such as edges, with kernels that occupy just tens or hundreds of pixels. This means fewer parameters need to be stored, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations.

Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural network, each element of the weight matrix is used exactly once when computing the output of a layer. In convolutional networks, each member of the kernel is used at every position of the input. Taking advantage of that, convolution operations learn only one set of parameters, rather than a different set of parameters for every location. Due to parameter sharing, CNNs have a property called equivariance to translation, meaning that if the input shifts, the output shifts in the same way.

A typical layer of a convolutional network consists of three stages. In the first stage the layer performs several convolution operations in parallel, producing distinct feature maps. In the second stage an activation function is applied to each linear activation, introducing a non-linearity. In the third stage, a pooling function is used to further modify the output of the layer. This modification introduced by the pooling function replaces the net's output with a summary statistic of nearby outputs. The most used ones are the max and average pooling operations.
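As a concrete illustration of these three stages, here is a minimal PyTorch sketch (the channel counts, kernel size and input shape are arbitrary assumptions, not values from the experiments):

```python
import torch
import torch.nn as nn

# A typical convolutional "layer" in the three-stage sense described above:
# convolution -> non-linearity -> pooling.
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # stage 1: several feature maps
    nn.ReLU(),                                                           # stage 2: non-linearity
    nn.MaxPool2d(kernel_size=2),                                         # stage 3: summary statistic of nearby outputs
)

x = torch.randn(16, 1, 28, 28)   # batch of 16 single-channel 28x28 inputs
y = conv_block(x)                # -> shape (16, 8, 14, 14)
```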

B. Recurrent Neural Network

RNNs are a type of neural network that has recurrent connections and is specially designed for processing sequential data. One key characteristic of RNNs is the use of shared parameters across the whole sequence. Instead of having a different set of parameters to process each sequence step, the same set is used across the whole sequence, allowing the model to generalize to sequences not seen during training. Parameter sharing is also important when a specific piece of information can occur at multiple positions within the sequence. The parameter sharing used in recurrent networks relies on the assumption that the same parameters can be used for different time steps. RNNs can be formally introduced as a set of operations represented by a directed acyclic computational graph, as in Figure 1. The recursive operation in a RNN is represented by equation 2, where h(t) is the hidden state and x(t) the input at step t. The additional layers seen in Figure 1 add further transformations to the data.

h(t) = f(h(t−1),x(t);θ) (2)

When the recurrent network is trained to perform a task that requires predicting the next step based on previous values, the network typically learns to use the hidden state, h(t), as a summary of the task-relevant aspects of the past sequence input.

(a) RNN structure (b) Unfolded RNN

Fig. 1. RNN operations represented in an unfolded graph.
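A minimal sketch of the recursion in equation 2, assuming f is a tanh of an affine transformation (the dimensions and initialization are arbitrary assumptions):

```python
import torch

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One application of equation 2: h(t) = f(h(t-1), x(t); theta), here with f = tanh."""
    return torch.tanh(h_prev @ W_h + x_t @ W_x + b)

hidden_size, input_size = 32, 8
W_h = torch.randn(hidden_size, hidden_size) * 0.1
W_x = torch.randn(input_size, hidden_size) * 0.1
b = torch.zeros(hidden_size)

h = torch.zeros(1, hidden_size)              # initial hidden state
for x_t in torch.randn(10, 1, input_size):   # unroll over a sequence of length 10
    h = rnn_step(h, x_t, W_h, W_x, b)        # h summarizes the sequence seen so far
```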

This process has two advantages. The first is that, independently of the sequence length, the hidden state will always have the same size, because it is expressed as transitions from one state to another. The second is that it is possible to use the same function f for all transitions. These two factors make it possible to learn a single model that operates across all time steps and sequence lengths and takes advantage of parameters shared across the whole sequence.

There is a wide variety of RNN relationships that can be created with the idea of graph unrolling and parameter sharing across different time steps. It is possible to create a RNN with a one-to-one, one-to-many, many-to-one or many-to-many relation. Those relationships are shown in Figure 2.

(a) One-to-One (b) One-to-Many

(c) Many-to-One (d) Many-to-Many

Fig. 2. Recurrent neural networks architectures.

1) Back Propagation Through Time: Computing the gradient through a RNN is the same as applying the generalized backpropagation algorithm to the unrolled computational graph. The gradients obtained by backpropagation may then be used with any general-purpose gradient-based technique to train the RNN parameters. The steps to calculate the gradient comprise a forward and a backpropagation pass through the computational graph. The complexity of this operation is thereby O(τ), and it cannot be reduced by parallelization, since this is an inherently sequential process where each time step t depends on all previous time steps. This backpropagation step in a recurrent graph is known as BackPropagation Through Time (BPTT).
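The sketch below shows what BPTT looks like in practice with PyTorch: the network is unrolled over the whole sequence in the forward pass and a single backward() call propagates the gradient through all τ steps (the layer sizes, loss and data are placeholders, not the setup used in the experiments):

```python
import torch
import torch.nn as nn

# BPTT in practice: unroll the RNN over the whole sequence, accumulate a loss,
# and call backward() once; autograd backpropagates through all tau time steps.
rnn = nn.RNNCell(input_size=8, hidden_size=32)
readout = nn.Linear(32, 1)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

x = torch.randn(50, 4, 8)        # sequence of length tau = 50, batch of 4
target = torch.randn(50, 4, 1)

h = torch.zeros(4, 32)
loss = 0.0
for t in range(x.size(0)):        # forward pass: O(tau), inherently sequential
    h = rnn(x[t], h)
    loss = loss + ((readout(h) - target[t]) ** 2).mean()

optimizer.zero_grad()
loss.backward()                   # backpropagation through time over the unrolled graph
optimizer.step()
```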

A major problem with BPTT is that the gradient is propagated over many stages, due to the recurrence present in the network. This causes the gradient to either vanish or explode. Even if we assume that the parameters are such that the recurrent network is stable (it can retain information, with gradients not exploding), the difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions compared with short-term ones. This problem is particular to recurrent architectures. In practice, experiments show that as the span of the dependencies that need to be captured increases, gradient-based optimization becomes increasingly difficult, with the probability of successfully training a traditional recurrent neural network via gradient descent rapidly reaching zero for sequences with a length of only 10 or 20 [13].

2) Long-Short Term Memory: LSTMs were introduced in 1997 [12] and were especially designed to deal with long-term dependencies in RNN architectures. The main idea is the introduction of self-loops to produce paths where the gradient can flow, making the weights on these self-loops conditioned on the context rather than fixed. The core idea behind LSTMs is the introduction of the cell state C(t), to which the LSTM can add or from which it can remove information, functioning as a memory. This process is controlled by three gate mechanisms: the forget gate (equation 3), the external input gate (equation 4) and the output gate (equation 6). Equation 5 is the cell state update operation. The introduction of these gate mechanisms attenuates the vanishing gradient problem, helping in the task of learning long-term dependencies.

f(t) = σ(W_f × [h(t−1), x(t)] + b_f)   (3)

i(t) = σ(W_i × [h(t−1), x(t)] + b_i),   C̃(t) = tanh(W_C × [h(t−1), x(t)] + b_C)   (4)

C(t) = f(t) ⊗ C(t−1) + i(t) ⊗ C̃(t)   (5)

o(t) = σ(W_o × [h(t−1), x(t)] + b_o),   h(t) = o(t) ⊗ tanh(C(t))   (6)
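A direct transcription of equations 3-6 as a sketch, with the weights acting on the concatenation [h(t−1), x(t)]; in practice one would use an implementation such as torch.nn.LSTMCell, which fuses these operations (dimensions here are arbitrary assumptions):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step following equations 3-6."""
    z = torch.cat([h_prev, x_t], dim=-1)          # [h(t-1), x(t)]
    f_t = torch.sigmoid(z @ W_f + b_f)            # forget gate, eq. 3
    i_t = torch.sigmoid(z @ W_i + b_i)            # input gate, eq. 4
    c_tilde = torch.tanh(z @ W_c + b_c)           # candidate cell state, eq. 4
    c_t = f_t * c_prev + i_t * c_tilde            # cell state update, eq. 5
    o_t = torch.sigmoid(z @ W_o + b_o)            # output gate, eq. 6
    h_t = o_t * torch.tanh(c_t)                   # new hidden state, eq. 6
    return h_t, c_t

hidden, inp = 16, 4
shape = (hidden + inp, hidden)
params = [torch.randn(*shape) * 0.1 for _ in range(4)] + [torch.zeros(hidden) for _ in range(4)]
h = c = torch.zeros(1, hidden)
h, c = lstm_step(torch.randn(1, inp), h, c, *params)
```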

3) Variants: Ideas similar to the LSTM were developed, the best known being the GRU cell. The GRU combines the forget and input gates into a single gate, and the cell and hidden states are also merged. This is a simpler model than the LSTM, but it has been achieving good results as well. Figure 3 shows the LSTM and GRU cell architectures. Several new cell architectures have also been introduced, like the Lattice Recurrent Unit [14], the Statistical Recurrent Unit [15] or the Independently Recurrent Neural Network [16].

III. MODELS

Despite the developments of recent years, applying deep learning algorithms to sequential problems still remains challenging, mainly due to the vanishing gradient problem, which makes learning long-term dependencies difficult and expensive.


(a) Long Short-Term Memory

(b) Gated Recurrent Unit

Fig. 3. Recurrent neural networks architectures variants.

Recurrent networks were specially designed to deal with sequential problems, and are the default option for problems of this nature. RNNs are capable of encoding the sequential context in their hidden state h(t), which, theoretically, can capture infinitely long-term dependencies. In practice, RNNs have shown problems capturing long-term dependencies, mostly due to the vanishing gradient problem. To overcome that difficulty, the LSTM and other variants, like the GRU, were introduced and gained immense popularity due to their success in applications such as language modeling, machine translation, speech recognition and others. Since there is great flexibility in the design of RNN architectures, many other models and training strategies have been introduced and are being explored. Several empirical studies were developed to explore the performance of different architectures on various sequential tasks, but in [17] it was concluded that, if there were "architectures much better than the LSTM", they were "not trivial to find".

CNNs are mostly known for their applications in tasks involving images, like image classification or face recognition. But CNNs are also being applied to sequential problems, in tasks like speech recognition [18], sentence classification [19] or audio synthesis [20]. The results obtained in sequential tasks with CNNs are notable and promising.

A. Recurrent Architectures

1) RNN Architecture: When dealing with sequential problems we are interested in predicting the next value of a sequence conditioned on all previous time steps. This can be modeled by maximizing the likelihood of our target value given all previous steps. Given a predictor with parameters θ, the task is to maximize y(t) conditioned on the previous steps, as shown in equation 7.

p(y | x) = ∏_{t=0}^{T} p(y(t) | x(0), ..., x(T), θ)   (7)

RNN architectures are specially designed to deal with sequential problems. When a RNN is used to deal with sequential information, the network is trained to condition each output y(t) on the previous inputs (x(0), ..., x(t)). To encode all previous information the RNN architecture uses the hidden state, as shown in equation 8, where h(t−1) encodes all information up until t−1, and the output y(t) is a function of h(t−1) and the input x(t).

y(t) = f(h(t−1),x(t);θ) (8)

These characteristics make RNN architectures a natural solution to learn the probability distribution function introduced in equation 7. Depending on whether we apply a vanilla RNN architecture or an RNN architecture with a GRU or LSTM cell, the function f that is applied is different.

2) Encoder-Decoder Architectures: The encoder-decoder RNN architecture was introduced in [9], and it combines two different RNNs, one for the encoder and another for the decoder. The encoder has the task of encoding a sequence of arbitrary length into a fixed-length vector, whereas the second RNN is used to decode the information in that vector into another sequence, as shown in Figure 4.

Fig. 4. Encoder Decoder architecture.

From a probabilistic perspective, this method learns a probability distribution of a future sequence based on previous values [9], as in equation 9.

p(y | x) = ∏_{t=0}^{T1} p(y(t) | x(0), ..., x(T2), y(0), ..., y(t−1), θ)   (9)

The model encodes all the information about the input sequence in the hidden state, h(t), by applying equation 2. The hidden state is a latent representation of the input sequence (x(0), ..., x(T2)), and is usually referred to as the context vector, C. The decoder is trained to predict the next value y(t) given the previous hidden state h(t−1) and also conditioned on the context vector C, as in equation 10, where the function f applied depends on the cell used.


y(t) = f(h(t−1),y(t−1),C;θ) (10)
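A minimal sketch of the encoder-decoder idea with GRU cells; the dimensions, the forecast horizon and feeding the last prediction back into the decoder are assumptions, not the exact setup used in the experiments:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encode an input sequence into a context vector, then decode a fixed number of steps."""
    def __init__(self, input_size=1, hidden_size=32, horizon=4):
        super().__init__()
        self.encoder = nn.GRU(input_size, hidden_size, batch_first=True)
        self.decoder = nn.GRUCell(input_size, hidden_size)
        self.out = nn.Linear(hidden_size, input_size)
        self.horizon = horizon

    def forward(self, x):                      # x: (batch, seq_len, input_size)
        _, h = self.encoder(x)                 # h: (1, batch, hidden) -> context vector C
        h = h.squeeze(0)
        y_t = x[:, -1, :]                      # start decoding from the last observed value
        outputs = []
        for _ in range(self.horizon):          # decode conditioned on the context vector
            h = self.decoder(y_t, h)
            y_t = self.out(h)
            outputs.append(y_t)
        return torch.stack(outputs, dim=1)     # (batch, horizon, input_size)

model = EncoderDecoder()
pred = model(torch.randn(8, 96, 1))            # e.g. 96 past 15-minute samples -> 4 steps ahead
```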

Attention System

A new approach to encoder-decoder models, introduced in [7], addresses the bottleneck of encoding an arbitrary-length sequence into a fixed-length vector. The objective is to allow the decoder to learn which hidden state (or context vector) from the encoder is most valuable for predicting the next output. This enables the decoder to choose the context vector C based on the set of positions where the input sequence carries the most relevant information.

Fig. 5. Encoder Decoder with attention.

Applying this attention system, the output of the decoder is given by equation 11, where the difference, compared with the encoder-decoder model without attention, is that the output y(t) is now conditioned on a different context vector Ci for each i, as shown in Figure 5.

y(t) = f(h(t−1), y(t−1),Ci;θ) (11)

The context vector is a weighted sum of the encoder hidden states h(i). Each h(i) contains information about the whole sequence up to the i-th input, but with a very strong focus on the i-th input.
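A sketch of the additive attention mechanism of [7], producing a context vector as a weighted sum of the encoder hidden states for each decoder step (the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: score each encoder state against the decoder state."""
    def __init__(self, hidden_size=32):
        super().__init__()
        self.W = nn.Linear(2 * hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, dec_h, enc_hs):                  # dec_h: (B, H), enc_hs: (B, T, H)
        dec_h = dec_h.unsqueeze(1).expand_as(enc_hs)   # (B, T, H)
        scores = self.v(torch.tanh(self.W(torch.cat([dec_h, enc_hs], dim=-1))))  # (B, T, 1)
        alpha = torch.softmax(scores, dim=1)           # attention weights over encoder steps
        context = (alpha * enc_hs).sum(dim=1)          # weighted sum of encoder hidden states
        return context, alpha

attn = AdditiveAttention()
context, weights = attn(torch.randn(8, 32), torch.randn(8, 96, 32))
```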

3) Dilated Recurrent Neural Network: Using RNN architectures to train on long sequences leads to three major problems: complex dependencies, gradient problems (vanishing/exploding) and long training times. Dilated Recurrent Neural Networks (DilatedRNN) were introduced to deal with these limitations [21]. They are characterized by multi-resolution dilated recurrent skip connections, which can decrease the number of parameters necessary in tasks involving long-term dependencies. The use of dilated recurrent connections can alleviate the gradient problems, enabling training on longer sequences and reducing the number of parameters, which increases computational efficiency. The DilatedRNN model is shown in Figure 6.

In a vanilla recurrent connection the hidden state h(t) is a function of the previous hidden state h(t−1) and the current input x(t), as in equation 2. When a dilated recurrent skip connection is used, the hidden state h(t) is a function of the current input x(t) and the hidden state h(t−s), as shown in equation 12, where h(t)_l is the hidden state of layer l at time step t and s is the skip length, or dilation, of the l-th layer. The function f depends on the RNN cell used.

Fig. 6. DilatedRNN model with 3 layers.

h(t)_l = f(x(t)_l, h(t−s)_l; θ)   (12)

Dilated recurrent skip connections allow information to travel along fewer edges and can be computed in a parallel fashion. This can be seen in Figure 7, where the input is processed in parallel instead of as a single sequence. In [21] the authors used exponentially increasing dilation factors to extract complex time dependencies.

(a) Dilated connections. (b) Optimized dilated layer.

Fig. 7. Dilated recurrent skip connections can reduce the number ofparameters used and allow parallel computations.
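A sketch of a single dilated recurrent layer following equation 12, where the hidden state at step t is computed from the state at step t − s rather than t − 1 (the GRU cell and sizes are assumptions):

```python
import torch
import torch.nn as nn

def dilated_rnn_layer(x, cell, dilation):
    """Apply a recurrent cell with skip length `dilation`: h(t) = f(x(t), h(t - dilation))."""
    batch, seq_len, _ = x.shape
    hidden = [torch.zeros(batch, cell.hidden_size)] * dilation   # one running state per phase
    outputs = []
    for t in range(seq_len):
        h = cell(x[:, t, :], hidden[t % dilation])   # state comes from t - dilation, not t - 1
        hidden[t % dilation] = h
        outputs.append(h)
    return torch.stack(outputs, dim=1)

cell = nn.GRUCell(input_size=8, hidden_size=32)
y = dilated_rnn_layer(torch.randn(4, 20, 8), cell, dilation=4)   # (4, 20, 32)
```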

B. Convolutional Architectures

1) Wavenet: Wavenet [20] is a CNN architecture that uses stacked convolution layers to model the conditional probability distribution shown in equation 7. Causal convolutions are used to deal with the time ordering of samples and to guarantee that the model at time step t cannot depend on any future values x(t+1). An example of a causal convolution is depicted in Figure 8.

Fig. 8. Causal convolution.


To increase the receptive field size, thus increasing the amount of information from past samples, the Wavenet model makes use of dilated causal convolutions, as in Figure 9. A dilated convolution is a convolution where the filter is applied over an area larger than its length by skipping input values, with a step equal to the dilation factor. This allows the model to have a larger receptive field size while using the same number of layers and the same number of parameters.

Fig. 9. Dilated causal convolution.

The Wavenet model uses residual and parameterised skip connections to speed up convergence [21].

Z = tanh(W_f,k ∗ x) ⊗ σ(W_g,k ∗ x)   (13)

Gated activation units are used, as in equation 13, where k is the layer index, f and g denote the filter and the gate, respectively, and W represents the convolution filter that is learned during training. The Wavenet model architecture can be seen in Figure 10.

Fig. 10. Wavenet model architecture.
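A sketch of one Wavenet-style layer: a left-padded (causal) dilated convolution followed by the gated activation unit of equation 13. The channel counts are assumptions, and the residual and skip connections of the full model are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCausalConv(nn.Module):
    """Causal dilated convolution with the gated activation unit of equation 13."""
    def __init__(self, channels=32, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # pad only on the left -> causal
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        x_padded = F.pad(x, (self.pad, 0))                 # no access to future samples
        z = torch.tanh(self.filter_conv(x_padded)) * torch.sigmoid(self.gate_conv(x_padded))
        return z                                           # same time length as the input

layer = GatedCausalConv(dilation=2)
out = layer(torch.randn(8, 32, 96))                        # (8, 32, 96)
```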

2) Temporal Convolutional Network: The work introduced in [11] showed a convolutional neural network architecture outperforming several recurrent architectures for sequence modeling, and it was inspired by recent work on sequential tasks. The authors stated that convolutional networks should be regarded as a natural starting point for sequential tasks.

The model introduced is called the Temporal Convolutional Network (TCN). The model was designed to combine simplicity, autoregressive prediction and very long memory. The model architecture can be seen in Figure 11. Similarly to the Wavenet architecture, the TCN uses dilated convolutions and residual connections to build very long and effective receptive fields.

Fig. 11. Temporal convolution block architecture.

The TCN, and convolutional models in general, have several advantages over RNN architectures for sequential tasks. Convolutional models can be parallelized, unlike RNNs, where the output at time step t must wait for all previous time steps to be computed. They have a flexible receptive field size that can be changed in several ways, such as changing the number of layers or the kernel size of the dilated convolutions. Stable gradients are another advantage, because the backpropagation algorithm does not suffer from exploding/vanishing gradients as it does in RNNs due to the recurrent connections. Low memory requirements during training and support for variable-length inputs are two further advantages. A drawback is the memory needed during evaluation, since the whole sequence must be kept in memory; RNNs do not have this problem because they only need to store the last hidden state to compute the next output.
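A simplified sketch of a TCN-style residual block in the spirit of [11]; weight normalization and dropout from the original block are omitted, and the sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBlock(nn.Module):
    """Simplified TCN residual block: two causal dilated convolutions plus a skip path."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):                                   # x: (batch, in_ch, time)
        y = F.relu(self.conv1(F.pad(x, (self.pad, 0))))     # causal: pad on the left only
        y = F.relu(self.conv2(F.pad(y, (self.pad, 0))))
        return F.relu(y + self.downsample(x))               # residual connection

# Stacking blocks with exponentially increasing dilation grows the receptive field quickly.
tcn = nn.Sequential(*[TemporalBlock(32, 32, dilation=2 ** i) for i in range(4)])
out = tcn(torch.randn(8, 32, 96))                           # (8, 32, 96)
```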

3) Quasi Recurrent Neural Network: The Quasi Recurrent Neural Network (QRNN) [22] alternates convolutional layers, which can be applied in parallel, with a lightweight recurrent pooling operation. This can speed up model training when compared to RNNs.

The two main components of a QRNN are the convolution layer and the pooling layer. The first component applies a causal convolution where, given an input X of T n-dimensional vectors (x1, ..., xT), it produces three sequences of m-dimensional vectors (equation 14), where m is the number of filters used.


Z = tanh(W_z ∗ X),   F = σ(W_f ∗ X),   O = σ(W_o ∗ X)   (14)

The recurrent pooling layer operations are inspired by the elementwise operations used in LSTM cells. The objective is to have a function, controlled by gates, that is able to mix states across time steps while acting independently on each channel of the state vector. Three types of pooling operations are introduced: f-pooling (equation 15), fo-pooling (equation 16) and ifo-pooling (equation 17).

h(t) = f(t) ⊗ h(t−1) + (1 − f(t)) ⊗ z(t)   (15)

c(t) = f(t) ⊗ c(t−1) + (1 − f(t)) ⊗ z(t),   h(t) = o(t) ⊗ c(t)   (16)

c(t) = f(t) ⊗ c(t−1) + i(t) ⊗ z(t),   h(t) = o(t) ⊗ c(t)   (17)

An example of a QRNN layer is given in Figure 12.

Fig. 12. Quasi Recurrent Neural Network layer.
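A sketch of a QRNN layer: a causal convolution produces the Z, F and O sequences of equation 14, and the fo-pooling of equation 16 is applied recurrently (the sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QRNNLayer(nn.Module):
    """Causal convolution producing Z, F, O (eq. 14) followed by fo-pooling (eq. 16)."""
    def __init__(self, input_size=8, hidden_size=32, kernel_size=2):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(input_size, 3 * hidden_size, kernel_size)
        self.hidden_size = hidden_size

    def forward(self, x):                                   # x: (batch, time, input_size)
        conv_out = self.conv(F.pad(x.transpose(1, 2), (self.pad, 0)))   # causal convolution
        z, f, o = conv_out.chunk(3, dim=1)                  # each: (batch, hidden, time)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        c = torch.zeros(x.size(0), self.hidden_size)
        outputs = []
        for t in range(x.size(1)):                          # only this cheap pooling loop is sequential
            c = f[:, :, t] * c + (1 - f[:, :, t]) * z[:, :, t]   # fo-pooling cell update
            outputs.append(o[:, :, t] * c)                  # hidden state h(t)
        return torch.stack(outputs, dim=1)                  # (batch, time, hidden)

layer = QRNNLayer()
h = layer(torch.randn(4, 96, 8))                            # (4, 96, 32)
```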

IV. METHODOLOGY

The performance and versatility of the models will be tested on two different time series problems, with the objective of understanding which model characteristics yield better results and are worth investigating further. All experiments were developed in Python and the deep learning models were implemented with PyTorch [23].

A. Wind Power Generation

The data used was collected by Redes Energéticas Nacionais (REN) and consists of the injected wind power in the Portuguese power system. It was sampled at a 15-minute resolution from the first day of 2010 until the last day of 2016. The data collected pertains to all wind farms that are connected to REN's telemetry system. For a review of wind power generation forecasting problems see [24]. The persistence method is used as a baseline against which to compare the algorithms. Short-term and medium-term time horizons will be predicted; more specifically, horizons of one, six and 24 hours ahead will be forecast. This means that three different experiments will be made, where, depending on the horizon, a different number of predictions will be made. For example, for a one hour horizon we will make four predictions, for 15, 30, 45 and 60 minutes ahead. This will be done consecutively, meaning that we will use the 15-minutes-ahead prediction as input to predict the next point, and so on. The training data will range from the first day of 2010 until the last day of 2015, while the full year of 2016 will be predicted.

1) Data Preparation: The data collected can be represented in matrix form as X = [x(1), x(2), ..., x(T)], where each x(t) ∈ R represents the wind power generated at time step t. The data will be transformed into X = [x1, x2, ..., xT−p]^T, where each xt ∈ R^{1×p} represents an ordered sequence of generated wind power of length p. For each training step the label will also be an ordered sequence and, depending on the model used, can be represented as Y = [y1, y2, ..., yT−n]^T, where each yt ∈ R^{1×n}. The data will be standardized to have zero mean and unit variance.
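A sketch of the windowing described above, assuming a one-dimensional array of wind power values; the window lengths p and n shown are illustrative, not the ones used in the experiments:

```python
import numpy as np

def make_windows(series, p, n):
    """Slice a 1-D series into input windows of length p and target windows of length n."""
    X, Y = [], []
    for t in range(len(series) - p - n + 1):
        X.append(series[t:t + p])          # ordered sequence of past wind power values
        Y.append(series[t + p:t + p + n])  # the following n values to be predicted
    return np.array(X), np.array(Y)

power = np.random.rand(10_000)                  # placeholder for the 15-minute REN series
power = (power - power.mean()) / power.std()    # standardize: zero mean, unit variance
X, Y = make_windows(power, p=96, n=4)           # e.g. 24 h of history -> 1 h (4 steps) ahead
```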

2) Metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) will be used as metrics to compare the performance.

MSE = (1/N) ∑_{i=1}^{N} (y_i^true − y_i^predicted)²   (18)

MAE = (1/N) ∑_{i=1}^{N} |y_i^true − y_i^predicted|   (19)

MAPE = (100/N) ∑_{i=1}^{N} |y_i^true − y_i^predicted| / y_i^true   (20)
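The three metrics of equations 18-20 written out as a small numpy sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)            # equation 18

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))           # equation 19

def mape(y_true, y_pred):
    return 100.0 / len(y_true) * np.sum(np.abs(y_true - y_pred) / y_true)   # equation 20

y_true = np.array([100.0, 250.0, 400.0])
y_pred = np.array([110.0, 240.0, 390.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```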

B. C-MAPSS Turbofan

Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) is a turbofan simulation model used to generate a simulated run-to-failure dataset from a turbofan engine, published in NASA's prognostics center of excellence repository. A big bottleneck for data-driven approaches to problems in the predictive maintenance scope is the lack of available run-to-failure datasets. This simulated dataset allows researchers to build, test and benchmark different approaches to this problem.

An overview of the C-MAPSS dataset is given in Table I. The data consists of multiple multivariate time series measurements. Within each dataset, each engine is considered to come from a fleet of engines of the same type, and each time series is from a single engine. Each engine starts with different levels of initial wear and manufacturing variation, which are unknown. This wear and variation are considered normal behavior for each engine. The engine operates normally at the start of each time series and develops a fault at some point.


TABLE I
C-MAPSS DATASET OVERVIEW.

Data set               FD001   FD002   FD003   FD004
Train Units              100     260     100     249
Test Units               100     259     100     248
Operation Conditions       1       6       1       6
Fault Conditions           1       1       2       2

In the training set, the fault grows in magnitude until engine failure. In the test set, the time series ends some time prior to the failure. The objective is to predict the number of remaining operational cycles before failure in the test set, i.e., the number of operational cycles left in the engine. This is known as the Remaining Useful Life (RUL).

1) Data Preparation: It is known that there are six operating conditions, so a clustering algorithm was applied to recover those conditions in an unsupervised way. The three sensors that measure the operational settings were used in the clustering, with the HDBSCAN algorithm [25]. Min-Max normalization was adopted, where each sensor was normalized taking into account the assigned operational setting. A piece-wise RUL label is used, with the limit set at 130 time cycles, similarly to the approach used in [26].
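A sketch of the piece-wise RUL labelling described above, capping the target at 130 cycles; it assumes that the last recorded cycle of a training engine corresponds to its failure:

```python
import numpy as np

def piecewise_rul(cycles, max_rul=130):
    """Piece-wise RUL target: linear countdown to failure, capped at max_rul cycles."""
    failure_cycle = cycles.max()               # last recorded cycle of this training engine
    rul = failure_cycle - cycles               # remaining cycles at each time step
    return np.minimum(rul, max_rul)            # early-life degradation treated as negligible

cycles = np.arange(1, 201)                     # an engine that fails after 200 cycles
labels = piecewise_rul(cycles)                 # 130, 130, ..., 2, 1, 0
```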

Each engine can be represented in matrix form as X_n = [x_1, x_2, ..., x_P] ∈ R^{P×Tn}, where x_p = [x_p(1), x_p(2), ..., x_p(Tn)] ∈ R^{1×Tn} is the measurement of sensor p over the lifetime of engine n. The data is transformed into X = [U_1, U_2, ..., U_K]^T ∈ R^{K×n×p}, where n is the sequence length and U_i = [x_i(1), x_i(2), ..., x_i(n)]^T ∈ R^{n×p} is a matrix containing the measurements of all sensors up to time step n, with x_i(n) being the vector of all sensor measurements at time step n.

2) Metrics: The MSE, the MAE and a score function (equation 21) are used to assess the quality of the models. The score function penalizes overestimation of the RUL more heavily.

S = ∑_{i=1}^{n} (e^{−h_i/13} − 1),  when h_i < 0
S = ∑_{i=1}^{n} (e^{h_i/10} − 1),   when h_i ≥ 0   (21)
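Equation 21 as a numpy sketch, assuming h_i is the prediction error (predicted RUL minus true RUL), so positive values correspond to overestimating the RUL and are penalized more heavily:

```python
import numpy as np

def score(rul_true, rul_pred):
    """C-MAPSS scoring function (equation 21); h > 0 means the RUL was overestimated."""
    h = rul_pred - rul_true
    return float(np.sum(np.where(h < 0,
                                 np.exp(-h / 13.0) - 1.0,
                                 np.exp(h / 10.0) - 1.0)))

print(score(np.array([50.0, 30.0]), np.array([45.0, 40.0])))   # late predictions cost more
```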

V. RESULTS

A. Wind Power Generation Problem

Analyzing the results shown in Table II, it is possible to see that all models achieve a lower error than the baseline, especially for the 1 hour horizon. This means that our models are correctly identifying the short-term wind patterns. The quality of the predictions decreases when we increase the time horizon. This was expected, since the uncertainty becomes higher when we want to predict several steps ahead. This behavior is shown in Figure 13.

RNN architectures (LSTMs and GRUs) achieved a better performance for the 1 hour horizon. When the time horizon is wider, we can see that the best results are obtained with convolutional-based methods, except for the 6 hour horizon in terms of MAPE, where the best result was achieved by the encoder-decoder model with attention and a GRU cell. This suggests that convolution-based methods are capable of learning the long-term dependencies present in the data and are able to generalize to longer sequences.

Fig. 13. Wind power forecasting for a 24 hour horizon.

Since wind speed is a very unstable resource, wind power generation does not have a clear and predictable pattern throughout the days. As a consequence, we can see in the results obtained that the differences between models are slim, meaning that, overall, each model is capable of learning the few patterns present in the data. This is supported by the learning curves presented in Figure 14, where we can see that the models converge to approximately the same training loss value. If more complex relationships were present in the data, the advantage of applying a LSTM cell over a vanilla RNN cell would appear, since it is widely known that LSTM cells are capable of keeping information over longer sequences and modeling more complicated relationships in the data.

Fig. 14. Learning curves for all models for 1 hour horizon.

Based on the results we can see that the TCN and Wavenet models, both convolutional-based methods, can match or even outperform the recurrent-based algorithms, which is a clear indication that this type of architecture is a viable option for learning sequential dependencies.

B. C-MAPSS Dataset Problem

Analyzing the results shown in Table IV, we can see that the TCN model was the one with the best performance. All models are able to learn the mapping between the sensor data and the RUL, as shown in Figure 15.


TABLE II
RESULTS FOR WIND POWER GENERATION FORECAST FOR ONE, SIX AND 24 HOUR HORIZONS.

                         1 hour                        6 hour                        24 hour
Model             MSE        MAE    MAPE      MSE          MAE     MAPE     MSE          MAE     MAPE
Baseline          12.24×10³  75.67  8.55      175.88×10³   285.41  36.87    679.10×10³   592.07   84.35
RNN                6.49×10³  53.18  5.68      139.70×10³   247.73  33.54    678.26×10³   602.38   93.08
GRU                6.49×10³  52.27  5.55      131.79×10³   238.77  29.52    608.25×10³   543.59   70.59
LSTM               6.33×10³  52.83  5.61      129.84×10³   239.97  33.07    547.24×10³   539.32   89.40
DRNN - RNN         6.66×10³  54.07  5.77      137.58×10³   245.88  31.93    601.13×10³   556.37   89.47
DRNN - GRU         6.34×10³  52.46  5.70      131.21×10³   243.14  36.87    547.20×10³   559.63   98.92
DRNN - LSTM        6.37×10³  52.72  5.87      129.64×10³   243.89  38.11    554.35×10³   558.98  103.20
Wavenet            6.67×10³  54.41  6.06      179.40×10³   283.01  30.14    537.86×10³   547.47  101.97
TCN                6.39×10³  52.68  5.68      125.24×10³   234.72  30.18    544.05×10³   526.74   68.57
QRNN               6.40×10³  52.76  5.62      132.21×10³   242.52  32.30    555.63×10³   546.89   90.36
EncDec RNN         7.70×10³  60.37  8.25      155.62×10³   273.29  41.98    613.75×10³   558.80   76.15
EncDec GRU         6.74×10³  54.07  5.70      142.17×10³   249.17  30.92    590.26×10³   560.79   91.45
EncDec LSTM        6.94×10³  56.71  6.59      132.51×10³   243.57  32.92    562.07×10³   566.40  109.94
Attention RNN      7.22×10³  56.85  6.29      140.03×10³   255.58  32.60    608.93×10³   577.95   86.40
Attention GRU      6.71×10³  54.71  5.96      139.95×10³   247.79  28.37    544.83×10³   538.70   91.64
Attention LSTM     7.02×10³  56.82  6.70      142.79×10³   254.25  32.60    571.68×10³   539.70   79.28

We can also see that the noisy sensor data affects the results, since the predictions are also very noisy.

Fig. 15. RUL predictions for different models.

The theoretical advantage presented by the DRNN models is only seen with RNN cells. This suggests that a vanilla RNN cell is not capable of learning long-term dependencies, and that the use of dilated steps helps the model learn those dependencies, improving the score on the test set. In this problem we can see that the LSTM cell yielded better results than the GRU cell, both with the recurrent and the dilated recurrent models. Analyzing the performance of the QRNN model, we can see that it matches the results of a recurrent model with a GRU cell, even though it is a convolutional-based model. As in the wind power generation forecasting problem, we can see that convolutional-based methods can outperform recurrent-based models. More specifically, the TCN model architecture should be seen as a default option for sequential modeling, because it has the capacity to learn long-term dependencies and brings the several other advantages of being a convolutional model.

TABLE III
RESULTS OF THE BEST MODEL (TCN) FOR THE DIFFERENT DATASETS.

Data set   FD001    FD002    FD003    FD004
MSE        153.16   128.83   135.8    141.02
MAE          8.70     8.08     8.32     8.10
Score      268.06   634.95   260.56   730.29

Table III shows the results obtained with the best model for all datasets. The majority of the published work on this problem has focused on only one of the four data sets, which means that those approaches only need to learn the patterns that arise from that specific data set. Building a model that is capable of generalizing across different operating modes is thus more difficult, but also more relevant.

VI. CONCLUSIONS

We have conducted an overview of deep learning methods to apply to time series. The study showed us that recurrent architectures with LSTM or GRU cells are the default option for time series within the deep learning framework. Lately, convolutional-based models have been applied to sequential modeling problems with very good results. To assess the viability of applying convolutional-based models to time series problems, different architectures were chosen to be tested. The performance of both approaches was tested and compared.

The main conclusion is that convolutional-based algorithms, together with recurrent models, should also be seen as a good option for time series problems. This is supported by the results obtained on both problems, where we can see that convolutional-based models were able to match and even outperform recurrent models.


TABLE IV
RESULTS OBTAINED FOR THE C-MAPSS DATASET.

               lr = 0.005                    lr = 0.0075                   lr = 0.01
Model          MSE      MAE    Score         MSE      MAE    Score         MSE      MAE    Score
RNN            417.95   15.37  15930.16      2735.41  40.75  182673.16     449.23   16.70  15233.22
GRU            361.01   13.80   6030.78       243.62  11.19    3303.27     280.79   11.75   4602.78
LSTM           175.62    9.06   2613.54       287.60  11.78    4204.77     337.85   14.38   4886.68
DRNN - RNN     303.92   13.76   5270.08       370.68  13.86    9710.93     431.47   16.66  11101.54
DRNN - GRU     384.83   15.46   7843.49       476.16  16.44    9206.52     323.96   13.03   5196.69
DRNN - LSTM    326.01   12.58   5315.37       642.55  20.04   13167.01     274.43   11.37   4443.00
TCN            137.53    8.21   1893.87       221.62  11.21    3435.50     170.55    9.86   2256.57
QRNN           375.50   13.38  16360.84       331.95  13.71    4700.95     328.18   13.27   5166.03


These are promising results, because a typical challenge in sequential modeling with deep learning is learning long-term dependencies, due to the recurrent gradient in the optimization step. Convolutional-based methods avoid this recurrence and can therefore be faster than recurrent models, while also being more stable in the training phase.

The code developed for this thesis is available at [27] and [28].

REFERENCES

[1] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 1989.

[2] G. E. P. Box and G. M. Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, San Francisco, 1976.

[3] J. C. B. Gamboa. Deep learning for time series analysis. CoRR, abs/1701.01887, 2017.

[4] C. Liu, Z. Jin, J. Gu, and C. Qiu. Short-term load forecasting using a long short-term memory network. In 2017 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), Sept 2017.

[5] S. Zheng, K. Ristovski, A. Farahat, and C. Gupta. Long short-term memory network for remaining useful life estimation. In 2017 IEEE International Conference on Prognostics and Health Management, June 2017.

[6] J. F. Chen, W. L. Chen, C. P. Huang, S. H. Huang, and A. P. Chen. Financial time series data analysis using deep convolutional neural networks. In 2016 7th International Conference on Cloud Computing and Big Data, Nov 2016.

[7] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

[8] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013.

[9] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[10] A. Borovykh, S. Bohte, and C. W. Oosterlee. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.

[11] S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.

[13] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

[14] T. Lei and Y. Zhang. Training RNNs as fast as CNNs. arXiv preprint arXiv:1709.02755, 2017.

[15] J. Su, Z. Tan, D. Xiong, R. Ji, X. Shi, and Y. Liu. Lattice-based recurrent neural network encoders for neural machine translation. In AAAI, 2017.

[16] S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. arXiv preprint arXiv:1803.04831, 2018.

[17] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 2017.

[18] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014.

[19] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[20] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[21] S. Chang, Y. Zhang, W. Han, M. Yu, X. Guo, W. Tan, X. Cui, M. Witbrock, M. A. Hasegawa-Johnson, and T. S. Huang. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems, 2017.

[22] J. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.

[23] PyTorch - Deep learning framework. http://pytorch.org.

[24] S. S. Soman, H. Zareipour, O. Malik, and P. Mandal. A review of wind power and wind speed forecasting methods with different time horizons. In North American Power Symposium (NAPS), 2010. IEEE, 2010.

[25] L. McInnes, J. Healy, and S. Astels. hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2017.

[26] F. O. Heimes. Recurrent neural networks for remaining useful life estimation. In Prognostics and Health Management (PHM), 2008.

[27] Github - Wind Power Generation Code: https://github.com/RodrigoNeves95/DeepLearningTimeSeries.

[28] Github - C-MAPSS Problem Code: https://github.com/RodrigoNeves95/C-MAPSS Problem.