
Machine Learning, 44, 161–183, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Noisy Time Series Prediction using Recurrent Neural Networks and Grammatical Inference

C. LEE GILES∗ [email protected]
School of Information Sciences and Technology, and Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16803

∗ Lee Giles performed this work while he was at NEC Research Institute and is a consulting scientist there.

STEVE LAWRENCE [email protected]
NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA

AH CHUNG TSOI [email protected]
Faculty of Informatics, University of Wollongong, NSW 2522, Australia

Editors: Colin de la Higuera and Vasant Honavar

Abstract. Financial forecasting is an example of a signal processing problem which is challenging due to small sample sizes, high noise, non-stationarity, and non-linearity. Neural networks have been very successful in a number of signal processing applications. We discuss fundamental limitations and inherent difficulties when using neural networks for the processing of high noise, small sample size signals. We introduce a new intelligent signal processing method which addresses the difficulties. The method proposed uses conversion into a symbolic representation with a self-organizing map, and grammatical inference with recurrent neural networks. We apply the method to the prediction of daily foreign exchange rates, addressing difficulties with non-stationarity, overfitting, and unequal a priori class probabilities, and we find significant predictability in comprehensive experiments covering 5 different foreign exchange rates. The method correctly predicts the direction of change for the next day with an error rate of 47.1%. The error rate reduces to around 40% when rejecting examples where the system has low confidence in its prediction. We show that the symbolic representation aids the extraction of symbolic knowledge from the trained recurrent neural networks in the form of deterministic finite state automata. These automata explain the operation of the system and are often relatively simple. Automata rules related to well known behavior such as trend following and mean reversal are extracted.

Keywords: time series prediction, recurrent neural networks, grammatical inference, financial prediction, foreign exchange rates

1. Introduction

1.1. Predicting noisy time series data

The prediction of future events from noisy time series data is commonly done using various forms of statistical models (Granger & Newbold, 1986). Typical neural network models are closely related to statistical models, and estimate Bayesian a posteriori probabilities when given an appropriately formulated problem (Ruck et al., 1990). Neural networks have been very successful in a number of pattern recognition applications. For noisy time series prediction, neural networks typically take a delay embedding of previous inputs¹ which is mapped into a prediction. For examples of these models and the use of neural networks and other machine learning methods applied to finance see CIFEr (1996, 1999), Machine Learning (1995) and Refenes (1995). However, high noise, high non-stationarity time series prediction is fundamentally difficult for these models:

1. The problem of learning from examples is fundamentally ill-posed, i.e. there are infinitely many models which fit the training data well, but few of these generalize well. In order to form a more accurate model, it is desirable to use as large a training set as possible. However, for the case of highly non-stationary data, increasing the size of the training set results in more data with statistics that are less relevant to the task at hand being used in the creation of the model.

2. The high noise and small datasets make the models prone to overfitting. Random correlations between the inputs and outputs can present great difficulty. The models typically do not explicitly address the temporal relationship of the inputs—e.g. they do not distinguish between those correlations which occur in temporal order, and those which do not.

In our approach we consider recurrent neural networks (RNNs). Recurrent neural networks employ feedback connections and have the potential to represent certain computational structures in a more parsimonious fashion (Elman, 1991). RNNs address the temporal relationship of their inputs by maintaining an internal state. RNNs are biased towards learning patterns which occur in temporal order—i.e. they are less prone to learning random correlations which do not occur in temporal order.

However, training RNNs on high noise data tends to be difficult, with a tendency for long-term dependencies to be neglected (experiments reported in Bengio (1996) found a tendency for recurrent networks to take into account short-term dependencies but not long-term dependencies), and for the network to fall into a naive solution such as always predicting the most common output. We address this problem by converting the time series into a sequence of symbols using a self-organizing map (SOM). The problem then becomes one of grammatical inference—the prediction of a given quantity from a sequence of symbols. It is then possible to take advantage of the known capabilities of recurrent neural networks to learn grammars (Giles et al., 1992; Zeng et al., 1994) in order to capture any predictability in the evolution of the series.

The use of a recurrent neural network is significant for two reasons: firstly, the temporal relationship of the series is explicitly modeled via internal states, and secondly, it is possible to extract rules from the trained recurrent networks in the form of deterministic finite state automata (Omlin & Giles, 1996).² The symbolic conversion is significant for a number of reasons: the quantization effectively filters the data or reduces the noise, the RNN training becomes more effective, and the symbolic input facilitates the extraction of rules from the trained networks. Furthermore, it can be argued that the instantiation of rules from predictors not only gives the user confidence in the prediction model but facilitates the use of the model and can lead to a deeper understanding of the process (Tickle et al., 1998).


Financial time series data typically contains very high noise levels and significant non-stationarity. In this paper, the noisy time series prediction problem considered is the prediction of foreign exchange rates. A brief overview of foreign exchange rates is presented in the next section.

1.2. Foreign exchange rates

The foreign exchange market as of 1997 is the world's largest market, with more than US$1.3 trillion changing hands every day. The most important currencies in the market are the US dollar (which acts as a reference currency), the Japanese Yen, the British Pound, the German Mark, and the Swiss Franc (Lequarré, 1993). Foreign exchange rates exhibit very high noise, and significant non-stationarity. Many financial institutions evaluate prediction algorithms using the percentage of times that the algorithm predicts the right trend from today until some time in the future (Lequarré, 1993). Hence, this paper considers the prediction of the direction of change in foreign exchange rates for the next business day.

The remainder of this paper is organized as follows: Section 2 describes the data set we have used, Section 3 discusses fundamental issues such as ill-posed problems, the curse of dimensionality, and vector space assumptions, and Section 4 discusses the efficient market hypothesis and prediction models. Section 5 details the system we have used for prediction. Comprehensive test results are contained in Section 6, and Section 7 considers the extraction of symbolic data. Section 8 provides concluding remarks, and Appendix A provides full simulation details.

2. Exchange rate data

The data we have used is publicly available at http://www.stern.nyu.edu/~aweigend/Time-Series/SantaFe.html. It was first used by Weigend, Huberman & Rumelhart (1992). The data consists of daily closing bids for five currencies (German Mark (DM), Japanese Yen, Swiss Franc, British Pound, and Canadian Dollar) with respect to the US Dollar and is from the Monetary Yearbook of the Chicago Mercantile Exchange. There are 3645 data points for each exchange rate covering the period September 3, 1973 to May 18, 1987. In contrast to Weigend et al., this work considers the prediction of all five exchange rates in the data and prediction for all days of the week instead of just Mondays.

3. Fundamental issues

This section briefly discusses fundamental issues related to learning by example using techniques such as multi-layer perceptrons (MLPs): ill-posed problems, the curse of dimensionality, and vector space assumptions. These help form the motivation for our new method.

The problem of inferring an underlying probability distribution from a finite set of data is fundamentally an ill-posed problem because there are infinitely many solutions (Jordan, 1996), i.e. there are infinitely many functions which fit the training data exactly but differ in other regions of the input space. The problem becomes well-posed only when additional constraints are used. For example, the constraint that a limited size MLP will be used might be imposed. In this case, there may be a unique solution which best fits the data from the range of solutions which the model can represent. The implicit assumption behind this constraint is that the underlying target function is "smooth". Smoothness constraints are commonly used by learning algorithms (Friedman, 1994). Without smoothness or some other constraint, there is no reason to expect a model to perform well on unseen inputs (i.e. generalize well).

Learning by example often operates in a vector space, i.e. a function mapping R^n to R^m is approximated. However, a problem with vector spaces is that the representation does not reflect any additional structure within the data. Measurements that are related (e.g. from neighboring pixels in an image or neighboring time steps in a time series) and measurements that do not have such a relationship are treated equally. These relationships can only be inferred in a statistical manner from the training examples.

The use of a recurrent neural network instead of an MLP with a window of time delayed inputs introduces another assumption by explicitly addressing the temporal relationship of the inputs via the maintenance of an internal state. This can be seen as similar to the use of hints (Abu-Mostafa, 1990) and should help to make the problem less ill-posed.

The curse of dimensionality refers to the exponential growth of hypervolume as a function of dimensionality (Bellman, 1961). Consider x_i ∈ R^n. The regression y = f(x) is a hypersurface in R^(n+1). If f(x) is arbitrarily complex and unknown then dense samples are required to approximate the function accurately. However, it is hard to obtain dense samples in high dimensions.³ This is the "curse of dimensionality" (Bellman, 1961; Friedman, 1994; Friedman, 1995). The sampling density is proportional to N^(1/n) (Friedman, 1995), where n is the dimensionality of the input space and N is the number of points. Thus, if N_1 is the number of points for a given sampling density in 1 dimension, then in order to keep the same density as the dimensionality is increased, the number of points must increase according to N_1^n.

It has been suggested that MLPs do not suffer from the curse of dimensionality (Faragó & Lugosi, 1993; Barron, 1993). However, this is not true in general (although MLPs may cope better than other models). The apparent avoidance of the curse of dimensionality in Barron (1993) is due to the fact that the function spaces considered are more and more constrained as the input dimension increases (Kurkova, 1991). Similarly, smoothness conditions must be satisfied for the results of Faragó and Lugosi (1993).

The use of a recurrent neural network is important from the viewpoint of the curse of dimensionality because the RNN can take into account greater history of the input. Trying to take into account greater history with an MLP by increasing the number of delayed inputs results in an increase in the input dimension. This is problematic, given the small number of data points available. The use of a self-organizing map for the symbolic conversion is also important from the viewpoint of the curse of dimensionality. As will be seen later, the topographical order of the SOM allows encoding a one dimensional SOM into a single input for the RNN.

4. The efficient market hypothesis

The Efficient Market Hypothesis (EMH) was developed in 1965 by Fama (1965, 1970) and found broad acceptance in the financial community (Anthony & Biggs, 1995; Malkiel, 1987; Tsibouris & Zeidenberg, 1995). The EMH, in its weak form, asserts that the price of an asset reflects all of the information that can be obtained from past prices of the asset, i.e. the movement of the price is unpredictable. The best prediction for a price is the current price and the actual prices follow what is called a random walk. The EMH is based on the assumption that all news is promptly incorporated in prices; since news is unpredictable (by definition), prices are unpredictable. Much effort has been expended trying to prove or disprove the EMH. Current opinion is that the theory has been disproved (Ingber, 1996; Taylor, 1994), and much evidence suggests that the capital markets are not efficient (Lo & MacKinlay, 1999; Malkiel, 1996).

If the EMH was true, then a financial time series could be modeled as the addition of a noise component at each step:

x(k + 1) = x(k) + ε(k)    (2)

where ε(k) is a zero mean Gaussian variable with variance σ. The best estimate is:

x(k + 1) = x(k)    (3)

In other words, if the series is truly a random walk, then the best estimate for the next time period is equal to the current estimate. Now, if it is assumed that there is a predictable component of the series then it is possible to use:

x(k + 1) = x(k) + f(x(k), x(k − 1), . . . , x(k − n + 1)) + ε(k)    (4)

where ε(k) is a zero mean Gaussian variable with variance σ, and f(·) is a non-linear function in its arguments. In this case, the best estimate is given by:

x(k + 1) = x(k) + f(x(k), x(k − 1), . . . , x(k − n + 1))    (5)

Prediction using this model is problematic as the series often contains a trend. For example, a neural network trained on section A in figure 1 has little chance of generalizing to the test data in section B, because the model was not trained with data in the range covered by that section.

A common solution to this is to use a model which is based on the first order differences (Eq. (7)) instead of the raw time series (Granger & Newbold, 1986):

δ(k + 1) = f(δ(k), δ(k − 1), . . . , δ(k − n + 1)) + ν(k)    (6)

where

δ(k + 1) ≜ x(k + 1) − x(k)    (7)

and ν(k) is a zero mean Gaussian variable with variance σ. In this case the best estimate is:

δ(k + 1) = f(δ(k), δ(k − 1), . . . , δ(k − n + 1))    (8)
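To make the random walk argument concrete, the short numpy sketch below simulates a pure random walk of the form of Eq. (2), checks that the "no change" predictor of Eq. (3) beats a smoothed alternative, and then forms the first differences of Eq. (7). The simulated series, the moving-average comparison and all variable names are illustrative assumptions, not data or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A pure random walk: x(k+1) = x(k) + eps(k), as in Eq. (2).
eps = rng.normal(0.0, 1.0, size=10_000)
x = np.cumsum(eps)

# Under Eq. (3) the best estimate of x(k+1) is x(k) itself.
rw_error = np.mean((x[1:] - x[:-1]) ** 2)                     # "no change" predictor
ma_error = np.mean((x[2:] - 0.5 * (x[1:-1] + x[:-2])) ** 2)   # a smoothed alternative
print(f"no-change MSE {rw_error:.3f}  vs  moving-average MSE {ma_error:.3f}")

# Working on first differences, Eq. (7), removes the trend in the raw series.
delta = np.diff(x)   # delta(k+1) = x(k+1) - x(k)
print("mean of raw series:", x.mean(), " mean of differences:", delta.mean())
```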


Figure 1. An example of the difficulty in using a model trained on a raw price series. Generalization to section B from the training data in section A is difficult because the model has not been trained with inputs in the range covered by section B.

5. System details

A high-level block diagram of the system we used is shown in figure 2. The raw time series values are y(k), k = 1, 2, . . . , N, where y(k) ∈ R. These denote the daily closing bids of the financial time series. The first difference of the series, y(k), is taken as follows:

δ(k) = y(k) − y(k − 1)    (9)

This produces δ(k), δ(k) ∈ R, k = 1, 2, . . . , N − 1. In order to compress the dynamic range of the series and reduce the effect of outliers, a log transformation of the data is used:

x(k) = sign(δ(k)) (log(|δ(k)| + 1))    (10)

resulting in x(k), k = 1, 2, . . . , N − 1, x(k) ∈ R. As is usual, a delay embedding of this series is then considered (Yule, 1927):

X(k, d_1) = (x(k), x(k − 1), x(k − 2), . . . , x(k − d_1 + 1))    (11)

where d_1 is the delay embedding dimension and is equal to 1 or 2 for the experiments reported here. X(k, d_1) is a state vector. This delay embedding forms the input to the SOM. Hence, the SOM input is a vector of the last d_1 values of the log transformed differenced time series. The output of the SOM is the topographical location of the winning node. Each node represents one symbol in the resulting grammatical inference problem. A brief description of the self-organizing map is contained in the next section.

Figure 2. A high-level block diagram of the system.
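The preprocessing chain of Eqs. (9)–(11) is small enough to sketch directly. The function below is a minimal illustration under our own naming (preprocess, d1); the paper does not provide code, so treat this as one plausible reading of the pipeline rather than the authors' implementation.

```python
import numpy as np

def preprocess(y: np.ndarray, d1: int = 2) -> np.ndarray:
    """Turn a raw price series y(k) into the delay-embedded SOM inputs X(k, d1).

    Follows Eqs. (9)-(11): first difference, then a signed log compression,
    then a delay embedding of the last d1 values.
    """
    delta = np.diff(y)                                   # Eq. (9)
    x = np.sign(delta) * np.log(np.abs(delta) + 1.0)     # Eq. (10)
    # Eq. (11): each row is (x(k), x(k-1), ..., x(k-d1+1))
    rows = [x[k - d1 + 1:k + 1][::-1] for k in range(d1 - 1, len(x))]
    return np.asarray(rows)

# Tiny usage example with made-up closing bids.
y = np.array([1.732, 1.741, 1.738, 1.745, 1.750, 1.744])
print(preprocess(y, d1=2))
```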

The SOM can be represented by the following equation:

S(k) = g(X(k, d))    (12)

where S(k) ∈ [1, 2, 3, . . . , n_s], and n_s is the number of symbols (nodes) for the SOM. Each node in the SOM has been assigned an integer index ranging from 1 to the number of nodes.

An Elman recurrent neural network is then used⁴ which is trained on the sequence of outputs from the SOM. The Elman network was chosen because it is suitable for the problem (a grammatical inference style problem) (Elman, 1991), and because it has been shown to perform well in comparison to other recurrent architectures (e.g. see Lawrence, Giles & Fong, 2000). The Elman neural network has feedback from each of the hidden nodes to all of the hidden nodes, as shown in figure 3. For the Elman network:

O(k + 1) = C^T z_k + c_0    (13)

and

z_k = F_{n_h}(A z_{k−1} + B u_k + b)    (14)

where C is an n_h × n_o matrix representing the weights from the hidden layer to the output nodes, n_h is the number of hidden nodes, n_o is the number of output nodes, c_0 is a constant bias vector, and z_k ∈ R^{n_h} is an n_h × 1 vector denoting the outputs of the hidden layer neurons. u_k is a d_2 × 1 vector as follows, where d_2 is the embedding dimension used for the recurrent neural network:

u_k = (S(k), S(k − 1), S(k − 2), . . . , S(k − d_2 + 1))^T    (15)

A and B are matrices of appropriate dimensions which represent the feedback weights from the hidden nodes to the hidden nodes and the weights from the input layer to the hidden layer respectively. F_{n_h} is an n_h × 1 vector containing the sigmoid functions. b is an n_h × 1 vector denoting the bias of each hidden layer neuron. O(k) is an n_o × 1 vector containing the outputs of the network. n_o is 2 throughout this paper. The first output is trained to predict the probability of a positive change, the second output is trained to predict the probability of a negative change.
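As a rough illustration of Eqs. (13)–(15), the sketch below runs a forward pass of an Elman-style network over a symbol sequence. The weights are random and untrained, and the function name elman_forward and the use of tanh for the hidden nonlinearity are our own assumptions; the paper trains the network with full backpropagation through time, which is not shown here.

```python
import numpy as np

def elman_forward(symbols, n_h=5, n_o=2, d2=3, seed=0):
    """One forward pass of an Elman-style recurrent network, Eqs. (13)-(15).

    `symbols` is the SOM symbol sequence S(k); the last d2 symbols form u_k.
    Weights are random here purely for illustration (an untrained network).
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(n_h, n_h))   # hidden-to-hidden feedback weights
    B = rng.normal(scale=0.1, size=(n_h, d2))    # input-to-hidden weights
    b = np.zeros(n_h)                            # hidden biases
    C = rng.normal(scale=0.1, size=(n_h, n_o))   # hidden-to-output weights
    c0 = np.zeros(n_o)                           # output biases

    z = np.zeros(n_h)                            # hidden state z_k
    outputs = []
    for k in range(d2 - 1, len(symbols)):
        u = np.array(symbols[k - d2 + 1:k + 1][::-1], dtype=float)  # Eq. (15)
        z = np.tanh(A @ z + B @ u + b)           # Eq. (14), tanh as the nonlinearity
        outputs.append(C.T @ z + c0)             # Eq. (13)
    return np.asarray(outputs)

# Example: a toy symbol sequence (SOM symbols already encoded as -1, 0, 1).
print(elman_forward([1, -1, 0, 1, 1, -1, 0]).shape)
```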

Thus, for the complete system:

O(k + 1) = F_1(δ(k), δ(k − 1), δ(k − 2), δ(k − 3), δ(k − 4))    (16)

which can be considered in terms of the original series:

O(k + 1) = F_2(y(k), y(k − 1), y(k − 2), y(k − 3), y(k − 4), y(k − 5))    (17)

Note that this is not a static mapping due to the dependence on the hidden state of the recurrent neural network.

Figure 3. The pre-processed, delay embedded series is converted into symbols using a self-organizing map. An Elman neural network is trained on the sequence of symbols.

The following sections describe the self-organizing map, recurrent network grammatical inference, dealing with non-stationarity, controlling overfitting, a priori class probabilities, and a method for comparison with a random walk.

5.1. The self-organizing map

5.1.1. Introduction. The self-organizing map, or SOM (Kohonen, 1995), is an unsupervised learning process which learns the distribution of a set of patterns without any class information. A pattern is projected from a possibly high dimensional input space S to a position in the map, a low dimensional display space D. The display space D is often divided into a grid and each intersection of the grid is represented in the network by a neuron. The information is encoded as the location of an activated neuron. The SOM is unlike most classification or clustering techniques in that it attempts to preserve the topological ordering of the classes in the input space S in the resulting display space D. In other words, for similarity as measured using a metric in the input space S, the SOM attempts to preserve the similarity in the display space D.

5.1.2. Algorithm. We give a brief description of the SOM algorithm; for more details see Kohonen (1995). The SOM defines a mapping from an input space R^n onto a topologically ordered set of nodes, usually in a lower dimensional space D. A reference vector, m_i ≡ [μ_i1, μ_i2, . . . , μ_in]^T ∈ R^n, is assigned to each node in the SOM. Assume that there are M nodes. During training, each input, x ∈ R^n, is compared to m_i, i = 1, 2, . . . , M, obtaining the location of the closest match (‖x − m_c‖ = min_i {‖x − m_i‖}). The input point is mapped to this location in the SOM. Nodes in the SOM are updated according to:

m_i(t + 1) = m_i(t) + h_ci(t)[x(t) − m_i(t)]    (18)

where t is the time during learning and h_ci(t) is the neighborhood function, a smoothing kernel which is maximum at m_c. Usually, h_ci(t) = h(‖r_c − r_i‖, t), where r_c and r_i represent the locations of nodes in the SOM output space D. r_c is the node with the closest weight vector to the input sample and r_i ranges over all nodes. h_ci(t) approaches 0 as ‖r_c − r_i‖ increases and also as t approaches ∞. A widely applied neighborhood function is:

h_ci = α(t) exp(−‖r_c − r_i‖² / (2σ²(t)))    (19)

where α(t) is a scalar valued learning rate and σ(t) defines the width of the kernel. They are generally both monotonically decreasing with time (Kohonen, 1995). The use of the neighborhood function means that nodes which are topographically close in the SOM structure are moved towards the input pattern along with the winning node. This creates a smoothing effect which leads to a global ordering of the map. Note that σ(t) should not be reduced too far as the map will lose its topographical order if neighboring nodes are not updated along with the closest node. The SOM can be considered a non-linear projection of the probability density, p(x), of the input patterns x onto a (typically) smaller output space (Kohonen, 1995).
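A minimal one-dimensional SOM implementing the update of Eq. (18) with the Gaussian neighborhood of Eq. (19) might look like the sketch below. The decay schedules for α(t) and σ(t), the initialization and the function names are illustrative choices (the schedules actually used are given in Appendix A), not the authors' code.

```python
import numpy as np

def train_som(data, n_nodes=7, n_iter=10_000, seed=0):
    """A minimal 1-D SOM, following Eqs. (18)-(19).

    `data` is an array of input vectors; returns the reference vectors m_i.
    """
    rng = np.random.default_rng(seed)
    m = data[rng.integers(0, len(data), size=n_nodes)].astype(float)  # init m_i
    r = np.arange(n_nodes, dtype=float)          # node positions r_i on a line
    for t in range(n_iter):
        frac = 1.0 - t / n_iter
        alpha = 0.5 * frac                       # learning rate alpha(t)
        sigma = max(0.5, n_nodes / 2.0 * frac)   # kernel width sigma(t)
        x = data[rng.integers(0, len(data))]     # random training sample
        c = np.argmin(np.linalg.norm(m - x, axis=1))             # winning node
        h = alpha * np.exp(-(r - r[c]) ** 2 / (2 * sigma ** 2))  # Eq. (19)
        m += h[:, None] * (x - m)                # Eq. (18)
    return m

def to_symbol(x, m):
    """Map an input vector to its SOM symbol: 1 + index of the closest node."""
    return 1 + int(np.argmin(np.linalg.norm(m - x, axis=1)))

# Usage with random 2-D inputs (e.g. a delay embedding with d1 = 2).
data = np.random.default_rng(1).normal(size=(500, 2))
nodes = train_som(data, n_nodes=3, n_iter=2000)
print(to_symbol(data[0], nodes))
```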

5.1.3. Remarks. The nodes in the display space D encode the information contained in the input space R^n. Since there are M nodes in D, this implies that the input pattern vectors x ∈ R^n are transformed to a set of M symbols, while preserving their original topological ordering in R^n. Thus, if the original input patterns are highly noisy, the quantization into the set of M symbols while preserving the original topological ordering can be understood as a form of filtering. The amount of filtering is controlled by M. If M is large, this implies there is little reduction in the noise content of the resulting symbols. On the other hand, if M is small, this implies that there is a "heavy" filtering effect, resulting in only a small number of symbols.

5.2. Recurrent network prediction

In the past few years several recurrent neural network architectures have emerged which have been used for grammatical inference (Cleeremans et al., 1989; Giles et al., 1992; Giles et al., 1990). The induction of relatively simple grammars has been addressed often—e.g. (Giles et al., 1992; Watrous & Kuhn, 1992a; Watrous & Kuhn, 1992b) on learning Tomita grammars (Tomita, 1982). It has been shown that a particular class of recurrent networks is Turing equivalent (Siegelmann & Sontag, 1995).

The set of M symbols from the output of the SOM are linearly encoded into a single input for the Elman network (e.g. if M = 3, the single input is either −1, 0, or 1). The linear encoding is important and is justified by the topographical order of the symbols. If no order was defined over the symbols then a separate input should be used for each symbol, resulting in a system with more dimensions and increased difficulty due to the curse of dimensionality. In order to facilitate the training of the recurrent network, an input window of 3 was used, i.e. the last three symbols are presented to 3 input neurons of the recurrent neural network.
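A sketch of the linear encoding is given below. The paper only states that, e.g., for M = 3 the single input is −1, 0 or 1; the symmetric scaling over [−1, 1] for a general number of symbols n_s is an assumption made here for illustration.

```python
import numpy as np

def encode_symbols(symbols, n_s):
    """Linearly encode SOM symbols 1..n_s into a single scaled input.

    For n_s = 3 this reproduces the example in the text: symbols map to
    -1, 0 and 1. The exact scaling used in the paper is not specified, so
    a symmetric spread over [-1, 1] is assumed here.
    """
    symbols = np.asarray(symbols, dtype=float)
    return 2.0 * (symbols - 1.0) / (n_s - 1.0) - 1.0

print(encode_symbols([1, 2, 3, 3, 1], n_s=3))   # -> [-1.  0.  1.  1. -1.]
```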

5.3. Dealing with non-stationarity

The approach used to deal with the non-stationarity of the signal in this work is to build models based on a short time period only. This is an intuitively appealing approach because one would expect that any inefficiencies found in the market would typically not last for a long time. The difficulty with this approach is the reduction in the already small number of training data points. The size of the training set controls a noise vs. non-stationarity tradeoff (Moody, 1995). If the training set is too small, the noise makes it harder to estimate the appropriate mapping. If the training set is too large, the non-stationarity of the data will mean more data with statistics that are less relevant for the task at hand will be used for creating the estimator. We ran tests on a segment of data separate from the main tests, from which we chose to use models trained on 100 days of data (see Appendix A for more details). After training, each model is tested on the next 30 days of data. The entire training/test set window is moved forward 30 days and the process is repeated, as depicted in figure 4. In a real-life situation it would be desirable to test only for the next day, and advance the windows by steps of one day. However, 30 days is used here in order to reduce the amount of computation by a factor of 30, and enable a much larger number of tests to be performed (i.e. 30 times as many predictions are made from the same number of training runs), thereby increasing confidence in the results.

Figure 4. A depiction of the training and test sets used.
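The sliding train/test scheme of figure 4 can be expressed as a small generator. The helper below is a hypothetical sketch; with 1,000 usable points, 100 training days and 30 test days it reproduces the 30 test sets per exchange rate mentioned in Appendix A.

```python
def walk_forward_windows(n_points, train_len=100, test_len=30):
    """Yield (train_slice, test_slice) index pairs for the sliding scheme of
    figure 4: train on `train_len` days, test on the next `test_len` days,
    then advance both windows by `test_len` days."""
    start = 0
    while start + train_len + test_len <= n_points:
        yield (slice(start, start + train_len),
               slice(start + train_len, start + train_len + test_len))
        start += test_len

# Example: 30 train/test splits over the 1,050 points used per exchange rate
# (the 50 points used only to initialize the RNN delays are ignored here).
splits = list(walk_forward_windows(1000))
print(len(splits), splits[0], splits[1])
```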

5.4. Controlling overfitting

Highly noisy data has often been addressed using techniques such as weight decay, weight elimination and early stopping to control overfitting (Weigend et al., 1992). For this problem we use early stopping with a difference: the stopping point is chosen using multiple tests on a separate segment of data, rather than using a validation set in the normal manner. Tests showed that using a validation set in the normal manner seriously affected performance. This is not very surprising because there is a dilemma when choosing the validation set. If the validation set is chosen before the training data (in terms of the temporal order of the series), then the non-stationarity of the series means the validation set may be of little use for predicting performance on the test set. If the validation set is chosen after the training data, a problem occurs again with the non-stationarity—it is desirable to make the validation set as small as possible to alleviate this problem, however the set cannot be made too small because it needs to be a certain size in order to make the results statistically valid. Hence, we chose to control overfitting by setting the training time for an appropriate stopping point a priori. Simulations with a separate segment of data not contained in the main data (see Appendix A for more details) were used to select the training time of 500 epochs.

We note that our use of a separate segment of data to select training set size and training time is not expected to be an ideal solution, e.g. the best choice for these parameters may change significantly over time due to the non-stationarity of the series.

Figure 5. A depiction of a simplified error surface. If a naive solution, A, performs well, then a better solution, e.g. B, may be difficult to find.

5.5. A priori data probabilities

For some financial data sets, it is possible to obtain good results by simply predicting the most common change in the training data (e.g., noticing that the change in prices is positive more often than it is negative in the training set and always predicting positive for the test set). Although predictability due to such data bias may be worthwhile, it can prevent the discovery of a better solution (Lawrence et al., 1998). Figure 5 shows a simplified depiction of the problem—if the naive solution, A, has a better mean squared error than a random solution then it may be hard to find a better solution, e.g. B. This may be particularly problematic for high noise data although the true nature of the error surface is not yet well understood.

Hence, in all results reported here the models have been prevented from acting on any such bias by ensuring that the number of training examples for the positive and negative cases is equal. The training set actually remains unchanged so that the temporal order is not disturbed but there is no training signal for the appropriate number of positive or negative examples starting from the beginning of the series. The number of positive and negative examples is typically approximately equal to begin with, therefore the use of this simple technique did not result in the loss of many data points.
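One way to realize this balancing, sketched below, is to build a boolean mask that silences the training signal for the surplus examples of the majority class, counted from the start of the series. The function name and the masking mechanics are assumptions for illustration; the paper describes the effect, not an implementation.

```python
import numpy as np

def balance_training_signal(targets):
    """Return a boolean mask selecting which training examples receive an
    error signal, so that the numbers of positive and negative examples are
    equal. The temporal order is untouched: the surplus examples of the
    majority class are simply masked out, starting from the beginning of
    the series."""
    targets = np.asarray(targets)
    pos, neg = np.sum(targets > 0), np.sum(targets < 0)
    surplus_sign = 1 if pos > neg else -1
    surplus = abs(pos - neg)
    mask = np.ones(len(targets), dtype=bool)
    for i, t in enumerate(targets):
        if surplus == 0:
            break
        if np.sign(t) == surplus_sign:
            mask[i] = False            # no training signal for this example
            surplus -= 1
    return mask

print(balance_training_signal([+1, +1, -1, +1, -1, +1]))
```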

5.6. Comparison with a random walk

The prediction corresponding to the random walk model is equal to the current price, i.e. no change. A percentage of times the model predicts the correct change (apart from no change) cannot be obtained for the random walk model. However, it is possible to use models with data that corresponds to a random walk and compare the results to those obtained with the real data. If equal performance is obtained on the random walk data as compared to the real data then no significant predictability has been found. These tests were performed by randomizing the classification of the test set. For each original test set five randomized versions were created. The randomized versions were created by repeatedly selecting two random positions in time and swapping the two classification values (the input values remain unchanged). An additional benefit of this technique is that if the model is showing predictability based on any data bias (as discussed in the previous section) then the results on the randomized test sets will also show this predictability, i.e. always predicting a positive change will result in the same performance on the original and the randomized test sets.

Table 1. The results as shown in figure 6.

SOM embedding dimension: 1

SOM size            2     3     4     5     6     7
Error              48.5  47.5  48.3  48.3  48.5  47.6
Randomized error   50.1  50.5  50.4  50.2  49.8  50.3

SOM embedding dimension: 2

SOM size            2     3     4     5     6     7
Error              48.8  47.8  47.9  47.7  47.6  47.1
Randomized error   49.8  50.4  50.4  50.1  49.7  50.2
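The label-swapping procedure is easy to sketch. The helper below produces one randomized copy of a test set; the number of swaps is an arbitrary illustrative choice, since the paper only says the swapping is repeated.

```python
import numpy as np

def randomize_labels(labels, n_swaps=1000, seed=0):
    """Create one randomized version of a test set by repeatedly picking two
    random positions and swapping their class labels; the inputs themselves
    are left untouched."""
    rng = np.random.default_rng(seed)
    shuffled = np.array(labels, copy=True)
    for _ in range(n_swaps):
        i, j = rng.integers(0, len(shuffled), size=2)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled

labels = np.array([+1, -1, +1, +1, -1, -1, +1])
print(randomize_labels(labels))
```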

6. Experimental results

We present results using the previously described system and alternative systems for comparison. In all cases, the Elman networks contained 5 hidden units and the SOM had one dimension. Further details of the individual simulations can be found in Appendix A. Figure 6 and Table 1 show the average error rate as the number of nodes in the self-organizing map (SOM) was increased from 2 to 7. Results are shown for embedding dimensions of 1 and 2 for the input to the SOM. Results are shown using both the original data and the randomized data for comparison with a random walk. The results are averaged over the 5 exchange rates with 30 simulations per exchange rate and 30 test points per simulation, for a total of 4500 test points. The best performance was obtained with a SOM size of 7 nodes and an embedding dimension of 2, where the error rate was 47.1%. A t-test indicates that this result is significantly different from the null hypothesis of the randomized test sets at p = 0.0076 or 0.76% (t = 2.67), indicating that the result is significant. We also considered a null hypothesis corresponding to a model which makes completely random predictions (corresponding to a random walk)—the result is significantly different from this null hypothesis at p = 0.0054 or 0.54% (t = 2.80). Figure 7 and Table 2 show the average results for the SOM size of 7 on the individual exchange rates.
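The paper reports t statistics and p-values but does not spell out the exact test construction. The sketch below shows one plausible reading using per-simulation error rates with scipy: a two-sample test against the randomized runs and a one-sample test against the 50% error rate of random guessing. The error-rate arrays are synthetic stand-ins, not the paper's data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-simulation error rates (fractions), standing in for the
# real and randomized simulation results; the paper's per-run numbers are
# not reproduced here.
rng = np.random.default_rng(0)
real_errors = rng.normal(0.471, 0.06, size=150)
randomized_errors = rng.normal(0.502, 0.06, size=150)

# Two-sample t-test against the randomized runs, and a one-sample test
# against the 50% error rate of purely random predictions.
print(stats.ttest_ind(real_errors, randomized_errors))
print(stats.ttest_1samp(real_errors, 0.5))
```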

In order to gauge the effectiveness of the symbolic encoding and recurrent neural network, we investigated the following two systems:

1. The system as presented here without the symbolic encoding, i.e. the preprocessed data is entered directly into the recurrent neural network without the SOM stage.
2. The system as presented here with the recurrent network replaced by a standard MLP network.

Figure 6. The error rate as the number of nodes in the SOM is increased for SOM embedding dimensions of 1 and 2. Each of the results represents the average over the 5 individual series and the 30 simulations per series, i.e. over 150 simulations which tested 30 points each, or 4500 individual classifications. The results for the randomized series are averaged over the same values with the 5 randomized series per simulation, for a total of 22,500 classifications per result (all of the data available was not used due to limited computational resources).

Figure 7. Average results for the five exchange rates using the real test data and the randomized test data. The SOM size is 7 and the results for SOM embedding dimensions of 1 and 2 are shown.

Tables 3 and 4 show the results for systems 1 and 2 above. The performance of these systems can be seen to be in-between the performance of the null hypothesis of a random walk and the performance of the hybrid system, indicating that the hybrid system can result in significant improvement.


Table 2. Results as shown in figure 7.

Exchange rate                      British Pound  Canadian Dollar  German Mark  Japanese Yen  Swiss Franc
Embedding dimension 1                   48.5           46.9            46.0         47.7          48.9
Embedding dimension 2                   48.1           46.8            46.9         46.3          47.4
Randomized, embedding dimension 1       50.1           50.5            50.4         49.8          50.7
Randomized, embedding dimension 2       50.2           50.6            50.5         49.7          49.8

Table 3. The results for the system without the symbolic encoding, i.e. the preprocessed data is entered directly into the recurrent neural network without the SOM stage.

Error              49.5
Randomized error   50.1

Table 4. The results for the system using an MLP instead of the RNN.

SOM embedding dimension: 1

SOM size            2     3     4     5     6     7
Error              49.1  49.4  49.3  49.7  49.9  49.8
Randomized error   49.7  50.1  49.9  50.3  50.0  49.9

6.1. Reject performance

We used the following equation in order to calculate a confidence measure for every prediction: γ = y_max(y_max − y_min), where y_max is the maximum and y_min is the minimum of the two outputs (for outputs which have been transformed using the softmax transformation: y_i = exp(u_i) / Σ_{j=1}^{k} exp(u_j), where u_i are the original outputs, y_i are the transformed outputs, and k is the number of outputs). The confidence measure γ gives an indication of how confident the network model is for each classification. Thus, for a fixed value of γ, the predictions obtained by a particular algorithm can be divided into two sets, one set B for those below the threshold, and another set A for those above the threshold. We can then reject those points which are in set B, i.e. they are not used in the computation of the classification accuracy.

Figure 8 shows the classification performance of the system (percentage of correct sign predictions) as the confidence threshold is increased and more examples are rejected for classification. Figure 9 shows the same graph for the randomized test sets. It can be seen that a significant increase in performance (down to approximately 40% error) can be obtained by only considering the 5–15% of examples with the highest confidence. This benefit cannot be seen for the randomized test sets. It should be noted that by the time 95% of test points are rejected, the total number of remaining test points is only 225.
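The confidence-based rejection can be sketched as follows; the softmax, the γ computation and the thresholding follow the text, while the function names, the example outputs and the convention that output index 0 denotes a positive change are assumptions for illustration.

```python
import numpy as np

def softmax(u):
    """Softmax transformation of the raw network outputs u."""
    e = np.exp(u - np.max(u, axis=-1, keepdims=True))   # shifted for stability
    return e / e.sum(axis=-1, keepdims=True)

def reject_low_confidence(raw_outputs, targets, threshold):
    """Compute gamma = y_max * (y_max - y_min) for each two-output prediction
    and report accuracy only on the examples whose confidence exceeds the
    threshold (set A in the text)."""
    y = softmax(np.asarray(raw_outputs, dtype=float))
    gamma = y.max(axis=1) * (y.max(axis=1) - y.min(axis=1))
    keep = gamma >= threshold
    predictions = y.argmax(axis=1)               # 0 = positive, 1 = negative
    accuracy = np.mean(predictions[keep] == np.asarray(targets)[keep])
    return accuracy, keep.mean()

# Toy example with made-up network outputs and true classes.
outputs = [[1.2, 0.3], [0.1, 0.2], [2.0, -1.0], [0.4, 0.5]]
targets = [0, 1, 0, 0]
print(reject_low_confidence(outputs, targets, threshold=0.3))
```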


Figure 8. Classification performance as the confidence threshold is increased. This graph is the average of the 150 individual results.

Figure 9. Classification performance as the confidence threshold is increased for the randomized test sets. This graph is the average of the individual randomized test sets.

7. Automata extraction

A common complaint with neural networks is that the models are difficult to interpret—i.e. it is not clear how the models arrive at the prediction or classification of a given input pattern (Setiono & Liu, 1996). A number of people have considered the extraction of symbolic knowledge from trained neural networks for both feedforward and recurrent neural networks (Denker et al., 1987; Omlin & Giles, 1996; Towell & Shavlik, 1993b). For recurrent networks, the ordered triples of a discrete Markov process ({state; input → next state}) can be extracted and used to form an equivalent deterministic finite state automaton (DFA). This can be done by clustering the activation values of the state neurons in the network (Omlin & Giles, 1996). The automata extracted in this process can only recognize regular grammars.⁵

The algorithm used for automata extraction is the same as that described in Giles et al. (1992). The algorithm is based on the observation that the activations of the recurrent state neurons in a trained network tend to cluster. The output of each of the N state neurons is divided into q intervals of equal size, yielding q^N partitions in the space of outputs of the state neurons. Starting from an initial state, the input symbols are considered in order. If an input symbol causes a transition to a new partition in the activation space then a new DFA state is created with a corresponding transition on the appropriate symbol. In the event that different activation state vectors belong to the same partition, the state vector which first reached the partition is chosen as the new initial state for subsequent symbols. The extraction would be computationally infeasible if all q^N partitions had to be visited, but the clusters corresponding to DFA states often cover only a small percentage of the input space.

Figure 10. Sample deterministic finite state automata (DFA) extracted from trained financial prediction networks using a quantization level of 5.
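A simplified version of the extraction loop is sketched below: RNN state vectors are quantized into q bins per state neuron, each visited partition becomes a DFA state, and transitions are followed breadth-first from the representative (first-reached) state vector of each partition. The toy two-neuron "RNN" and all names are illustrative; minimization of the resulting DFA is not shown.

```python
from collections import deque

def extract_dfa(rnn_step, initial_state, symbols, q=5, max_states=1000):
    """Breadth-first DFA extraction by quantizing the RNN state space.

    `rnn_step(state, symbol)` must return the next vector of state-neuron
    activations in [0, 1]; each activation is quantized into q equal bins and
    the resulting partition index tuple is treated as a DFA state. This is a
    simplified sketch of the Giles et al. (1992) procedure.
    """
    def partition(state):
        return tuple(min(int(a * q), q - 1) for a in state)

    start = partition(initial_state)
    representative = {start: initial_state}     # first state vector per partition
    transitions = {}                            # (partition, symbol) -> partition
    queue = deque([start])
    while queue and len(representative) < max_states:
        p = queue.popleft()
        for s in symbols:
            nxt_vec = rnn_step(representative[p], s)
            nxt = partition(nxt_vec)
            transitions[(p, s)] = nxt
            if nxt not in representative:       # new DFA state discovered
                representative[nxt] = nxt_vec
                queue.append(nxt)
    return transitions

# Toy "RNN": a hand-made two-neuron update, just to exercise the extraction.
def toy_step(state, symbol):
    return [0.5 * state[0] + (0.4 if symbol > 0 else 0.0),
            0.5 * state[1] + (0.4 if symbol < 0 else 0.0)]

print(extract_dfa(toy_step, [0.0, 0.0], symbols=(-1, 1), q=5))
```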

Sample deterministic finite state automata (DFA) extracted from trained financial prediction networks using a quantization level of 5 can be seen in figure 10. The DFAs have been minimized using standard minimization techniques (Hopcroft & Ullman, 1979). The DFAs were extracted from networks trained on different segments of data with a SOM embedding dimension of 1 and SOM size of 2. As expected, the DFAs extracted in this case are relatively simple. In all cases, the SOM symbols end up representing positive and negative changes in the series—transitions marked with a solid line correspond to positive changes and transitions marked with a dotted line correspond to negative changes. For all DFAs, state 1 is the starting state. Double circled states correspond to prediction of a positive change, and states without the double circle correspond to prediction of a negative change. (a) can be seen as corresponding to mean-reverting behavior—i.e. if the last change in the series was negative then the DFA predicts that the next change will be positive and vice versa. (b) and (c) can be seen as variations on (a) where there are exceptions to the simple rules of (a), e.g. in (b) a negative change corresponds to a positive prediction, however after the input changes from negative to positive, the prediction alternates with successive positive changes. In contrast with (a)–(c), (d)–(f) exhibit behavior related to trend following algorithms, e.g. in (d) a positive change corresponds to a positive prediction, and a negative change corresponds to a negative prediction except for the first negative change after a positive change. Figure 11 shows a sample DFA for a SOM size of 3—more complex behavior can be seen in this case.

Figure 11. A sample DFA extracted from a network where the SOM size was 3 (and the SOM embedding dimension was 1). The SOM symbols represent a large positive (solid lines) or large negative (dotted lines) change, or a change close to zero (gray line). In this case, the rules are more complex than the simple rules in the previous figure.

8. Conclusions

Traditional learning methods have difficulty with high noise, high non-stationarity time series prediction. The intelligent signal processing method presented here uses a combination of symbolic processing and recurrent neural networks to deal with fundamental difficulties. The use of a recurrent neural network is significant for two reasons: the temporal relationship of the series is explicitly modeled via internal states, and it is possible to extract rules from the trained recurrent networks in the form of deterministic finite state automata. The symbolic conversion is significant for a number of reasons: the quantization effectively filters the data or reduces the noise, the RNN training becomes more effective, and the symbolic input facilitates the extraction of rules from the trained networks. Considering foreign exchange rate prediction, we performed a comprehensive set of simulations covering 5 different foreign exchange rates, which showed that significant predictability does exist with a 47.1% error rate in the prediction of the direction of the change. Additionally, the error rate reduced to around 40% when rejecting examples where the system had low confidence in its prediction. The method facilitates the extraction of rules which explain the operation of the system. Results for the problem considered here are positive and show that symbolic knowledge which has meaning to humans can be extracted from noisy time series data. Such information can be very useful and important to the user of the technique.

We note that the predictability found here is on data beginning in 1973, and similar predictability may not be found on current data. Additionally, we note that accurate prediction of the direction of change does not necessarily result in the ability to create a profitable trading system, which may rely on prediction of the magnitude of changes, ability to deal with large but infrequent changes in price, control of trading frequency, and sufficient accuracy to overcome trading costs.

We have shown that a particular recurrent network combined with symbolic methods can generate useful predictions and rules for such predictions on a non-stationary time series. It would be interesting to compare this approach with other neural networks that have been used in time series analysis (Back & Tsoi, 1991; Kehagias & Petridis, 1997; Mukherjee, Osuna & Girosi, 1997; Principe & Kuo, 1995; Saad, Prokhorov, & Wunsch II, 1998), and other machine learning methods (Ghahramani & Jordan, 1997; Weiss & Indurkhya, 1995), some of which can also generate rules. It would also be interesting to incorporate this model into a trading policy.

Appendix A: Simulation details

For the experiments reported in this paper, n_h = 5 and n_o = 2. The RNN learning rate was linearly reduced over the training period (see Darken & Moody, 1991 regarding learning rate schedules) from an initial value of 0.5. All inputs were normalized to zero mean and unit variance. All nodes included a bias input which was part of the optimization process. Weights were initialized as shown in Haykin (1994). Target outputs were −0.8 and 0.8 using the tanh output activation function and we used the quadratic cost function. In order to initialize the delays in the network, the RNN was run without training for 50 time steps prior to the start of the datasets. The number of training examples for the positive and negative cases was made equal by not training on the appropriate number of positive or negative examples starting from the beginning of the training set (i.e. the temporal order of the series is not disturbed but there is no training signal for either a number of positive or a number of negative examples starting at the beginning of the series).

The SOM was trained for 10,000 iterations using the following learning rate schedule: 0.5 × (1 − n_i/n_ti), where n_i is the current iteration number and n_ti is the total number of iterations (10,000). The neighborhood size was altered during training using floor(1 + ((2/3)n_s − 1)(1 − n_i/n_ti) + 0.5). The data for each successive simulation was offset by the size of the test set, 30 points, as shown in figure 4. The number of points used for the selection of the training dataset size and the training time was 750 for each exchange rate (50 + 100 + 20 test sets of size 30). The number of points used in the main tests from each exchange rate was 1,050, which is 50 (used to initialize delays) + 100 (training set size) + 30 × 30 (30 test sets of size 30). There was no overlap between the data used for selection of the training dataset size and the training time, and the data used for the main tests.
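The two SOM schedules quoted above are simple enough to state as code; the functions below just wrap the formulas from this appendix, with names of our own choosing.

```python
import math

def som_learning_rate(ni, nti=10_000):
    """SOM learning rate schedule from Appendix A: 0.5 * (1 - ni/nti)."""
    return 0.5 * (1.0 - ni / nti)

def som_neighborhood_size(ni, ns, nti=10_000):
    """Neighborhood size schedule from Appendix A:
    floor(1 + ((2/3) * ns - 1) * (1 - ni/nti) + 0.5)."""
    return math.floor(1.0 + ((2.0 / 3.0) * ns - 1.0) * (1.0 - ni / nti) + 0.5)

# The schedules at the start, middle and end of training for a 7-node SOM.
for ni in (0, 5_000, 10_000):
    print(ni, som_learning_rate(ni), som_neighborhood_size(ni, ns=7))
```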

Acknowledgment

We would like to thank Doyne Farmer, Christian Omlin, Andreas Weigend, and the anonymous referees for helpful comments.


Notes

1. For example, x(t), x(t − 1), x(t − 2), . . . , x(t − N + 1) form the inputs for a delay embedding of the previous N values of a series (Takens, 1981; Yule, 1927).

2. Rules can also be extracted from feedforward networks (Alexander & Mozer, 1995; Hayashi, 1991; Ishikawa, 1996; McMillan, Mozer & Smolensky, 1992; Setiono & Liu, 1996; Towell & Shavlik, 1993a), and other types of recurrent neural networks (Giles, Horne & Lin, 1995); however, the recurrent network approach and deterministic finite state automata extraction seem particularly suitable for a time series problem.

3. Kolmogorov's theorem shows that any continuous function of n dimensions can be completely characterized by a one dimensional continuous function. Specifically, Kolmogorov's theorem (Friedman, 1995; Kolmogorov, 1957; Kurkova, 1991; Kurkova, 1995) states that for any continuous function:

f(x_1, x_2, . . . , x_n) = Σ_{j=1}^{2n+1} g_f( Σ_{i=1}^{n} λ_i Q_j(x_i) )    (1)

where {λ_i}_{i=1}^{n} are universal constants that do not depend on f, {Q_j}_{j=1}^{2n+1} are universal transformations which do not depend on f, and g_f(u) is a continuous, one-dimensional function which totally characterizes f(x_1, x_2, . . . , x_n) (g_f is typically highly non-smooth). In other words, for any continuous function of n arguments, there is a one dimensional continuous function that completely characterizes the original function. As such, it can be seen that the problem is not so much the dimensionality, but the complexity of the function (Friedman, 1995), i.e. the curse of dimensionality essentially says that in high dimensions, as fewer data points are available, the target function has to be simpler in order to learn it accurately from the given data.

4. Elman refers to the topology of the network—full backpropagation through time was used as opposed to the truncated version used by Elman.

5. A regular grammar G is a 4-tuple G = {S, N, T, P} where S is the start symbol, N and T are non-terminal and terminal symbols, respectively, and P represents productions of the form A → a or A → aB where A, B ∈ N and a ∈ T.

References

Machine Learning. (1995). International Journal of Intelligent Systems in Accounting, Finance and Management, 4:2, Special Issue.

Proceedings of IEEE/IAFE Conference on Computational Intelligence for Financial Engineering (CIFEr), 1996–2001.

Abu-Mostafa, Y. S. (1990). Learning from hints in neural networks. Journal of Complexity, 6, 192.

Alexander, J. A. & Mozer, M. C. (1995). Template-based algorithms for connectionist rule extraction. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in Neural Information Processing Systems (Vol. 7, pp. 609–616), The MIT Press.

Anthony, M. & Biggs, N. L. (1995). A computational learning theory view of economic forecasting with neural nets. In A. Refenes (Ed.), Neural Networks in the Capital Markets. John Wiley and Sons.

Back, A. D. & Tsoi, A. C. (1991). FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3:3, 375–385.

Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39:3, 930–945.

Bellman, R. E. (1961). Adaptive Control Processes. Princeton, NJ: Princeton University Press.

Bengio, Y. (1996). Neural Networks for Speech and Sequence Recognition. Thomson.

Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1:3, 372–381.

Darken, C. & Moody, J. E. (1991). Note on learning rate schedules for stochastic optimization. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems (Vol. 3, pp. 832–838), San Mateo, CA: Morgan Kaufmann.

Denker, J. S., Schwartz, D., Wittner, B., Solla, S. A., Howard, R., Jackel, L., & Hopfield, J. (1987). Large automatic learning, rule extraction, and generalization. Complex Systems, 1, 877–922.

Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:2/3, 195–226.

Fama, E. F. (1965). The behaviour of stock market prices. Journal of Business, January, 34–105.

Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. Journal of Finance, May, 383–417.

Faragó, A. & Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39:4, 1146–1151.

Friedman, J. H. (1994). An overview of predictive learning and function approximation. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From Statistics to Neural Networks, Theory and Pattern Recognition Applications (Vol. 136 of NATO ASI Series F, pp. 1–61), Springer.

Friedman, J. H. (1995). Introduction to computational learning and statistical prediction. Tutorial presented at Neural Information Processing Systems, Denver, CO.

Ghahramani, Z. & Jordan, M. I. (1997). Factorial hidden Markov models. Machine Learning, 29, 245.

Giles, C. L., Horne, B. G., & Lin, T. (1995). Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8:9, 1359–1365.

Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4:3, 393–405.

Giles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H., & Lee, Y. C. (1992). Extracting and learning an unknown grammar with recurrent neural networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems (Vol. 4, pp. 317–324), San Mateo, CA: Morgan Kaufmann Publishers.

Giles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C., & Chen, D. (1990). Higher order recurrent networks and grammatical inference. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems (Vol. 2, pp. 380–387), San Mateo, CA: Morgan Kaufmann Publishers.

Granger, C. W. J. & Newbold, P. (1986). Forecasting Economic Time Series. 2nd edn., San Diego: Academic Press.

Hayashi, Y. (1991). A neural expert system with automated extraction of fuzzy if-then rules. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems (Vol. 3, pp. 578–584), Morgan Kaufmann Publishers.

Haykin, S. (1994). Neural Networks, A Comprehensive Foundation. New York, NY: Macmillan.

Hopcroft, J. E. & Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley Publishing Company.

Ingber, L. (1996). Statistical mechanics of nonlinear nonequilibrium financial markets: Applications to optimized trading. Mathematical Computer Modelling.

Ishikawa, M. (1996). Rule extraction by successive regularization. In International Conference on Neural Networks (pp. 1139–1143), IEEE Press.

Jordan, M. I. (1996). Neural networks. In A. Tucker (Ed.), CRC Handbook of Computer Science. Boca Raton, FL: CRC Press.

Kehagias, A. & Petridis, V. (1997). Time series segmentation using predictive modular neural networks. Neural Computation, 9, 1691–1710.

Kohonen, T. (1995). Self-Organizing Maps. Berlin, Germany: Springer-Verlag.

Kolmogorov, A. N. (1957). On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition. Dokl, 114, 679–681.

Kurkova, V. (1991). Kolmogorov's theorem is relevant. Neural Computation, 3:4, 617–622.

Kurkova, V. (1995). Kolmogorov's theorem. In M. A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks (pp. 501–502), Cambridge, MA: MIT Press.

Lawrence, S., Burns, I., Back, A. D., Tsoi, A. C., & Giles, C. L. (1998). Neural network classification and unequal prior class probabilities. In G. Orr, K.-R. Müller, & R. Caruana (Eds.), Tricks of the Trade (pp. 299–314), Springer Verlag. Lecture Notes in Computer Science State-of-the-Art Surveys.

Lawrence, S., Giles, C. L., & Fong, S. (2000). Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 12:1, 126–140.

Lequarré, J. Y. (1993). Foreign currency dealing: A brief introduction (data set C). In A. S. Weigend & N. A. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley.

Lo, A. W. & MacKinlay, A. C. (1999). A Non-Random Walk Down Wall Street. Princeton University Press.

Malkiel, B. G. (1987). Efficient Market Hypothesis. London: Macmillan.

Malkiel, B. G. (1996). A Random Walk Down Wall Street. New York, NY: Norton.

McMillan, C., Mozer, M. C., & Smolensky, P. (1992). Rule induction through integrated symbolic and subsymbolic processing. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems (Vol. 4, pp. 969–976), Morgan Kaufmann Publishers.

Moody, J. E. (1995). Economic forecasting: Challenges and neural network solutions. In Proceedings of the International Symposium on Artificial Neural Networks, Hsinchu, Taiwan.

Mukherjee, S., Osuna, E., & Girosi, F. (1997). Nonlinear prediction of chaotic time series using support vector machines. In J. Principe, L. Giles, N. Morgan, & E. Wilson (Eds.), IEEE Workshop on Neural Networks for Signal Processing VII (p. 511), IEEE Press.

Omlin, C. W. & Giles, C. L. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9:1, 41–52.

Principe, J. C. & Kuo, J.-M. (1995). Dynamic modelling of chaotic time series with neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in Neural Information Processing Systems (Vol. 7, pp. 311–318), MIT Press.

Refenes, A. (Ed.). (1995). Neural Networks in the Capital Markets. New York, NY: John Wiley and Sons.

Ruck, D. W., Rogers, S. K., Kabrisky, K., Oxley, M. E., & Suter, B. W. (1990). The multilayer perceptron as an approximation to an optimal Bayes estimator. IEEE Transactions on Neural Networks, 1:4, 296–298.

Saad, E. W., Prokhorov, D. V., & Wunsch II, D. C. (1998). Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks. IEEE Transactions on Neural Networks, 9:6, 1456.

Setiono, R. & Liu, H. (1996). Symbolic representation of neural networks. IEEE Computer, 29:3, 71–77.

Siegelmann, H. T. & Sontag, E. D. (1995). On the computational power of neural nets. Journal of Computer and System Sciences, 50:1, 132–150.

Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. Young (Eds.), Dynamical Systems and Turbulence (pp. 366–381), Berlin: Springer-Verlag, Vol. 898 of Lecture Notes in Mathematics.

Taylor, S. J. (Ed.). (1994). Modelling Financial Time Series. Chichester: J. Wiley & Sons.

Tickle, A. B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9:6, 1057.

Tomita, M. (1982). Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference (pp. 105–108), Ann Arbor, MI.

Towell, G. & Shavlik, J. (1993a). The extraction of refined rules from knowledge based neural networks. Machine Learning, 13:1, 71–101.

Towell, G. G. & Shavlik, J. W. (1993b). Extracting refined rules from knowledge-based neural networks. Machine Learning, 13, 71.

Tsibouris, G. & Zeidenberg, M. (1995). Testing the efficient markets hypothesis with gradient descent algorithms. In A. Refenes (Ed.), Neural Networks in the Capital Markets. John Wiley and Sons.

Watrous, R. L. & Kuhn, G. M. (1992a). Induction of finite state languages using second-order recurrent networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems (Vol. 4, pp. 309–316), San Mateo, CA: Morgan Kaufmann Publishers.

Watrous, R. L. & Kuhn, G. M. (1992b). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4:3, 406.

Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (1992). Predicting sunspots and exchange rates with connectionist networks. In M. Casdagli & S. Eubank (Eds.), Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity, Proceedings Vol. XII (pp. 395–432), Addison-Wesley.

Weiss, S. M. & Indurkhya, N. (1995). Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research.

Yule, G. U. (1927). On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London, Series A, 226, 267–298.

Zeng, Z., Goodman, R. M., & Smyth, P. (1994). Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5:2, 320–330.

Received December 12, 1998; Revised April 21, 2000; Accepted May 23, 2000; Final Manuscript May 23, 2000