Omega 32 (2004) 97–100. doi:10.1016/j.omega.2003.09.011

‘Simple’ neural networks for forecasting

Bruce Curry∗

Cardiff Business School, Cardiff University, Aberconway Building, Colum Drive, Cardiff CF10 3EU, UK

Received 7 October 2002; accepted 26 September 2003

Abstract

In a recent article in this journal Hwarng and Ang (HA) introduce what they describe as a ‘simple’ neural network for time-series forecasting. It is argued here that the approach is better described as logistic regression applied in a time-series context. However, the HA model cannot be implemented through the standard LOGIT technique for handling qualitative dependent variables. Nor is it the same as the logistic difference equation used in population biology. In fact, it seems to have no ‘pedigree’ in the time-series literature. The paper explores the dynamic properties of the model. Chaotic behaviour will not arise, and stability, especially in the first-order case, is quite likely.
© 2003 Elsevier Ltd. All rights reserved.

Keywords: Neural network; Forecasting; Perceptron; Logistic; Stability

1. Introduction

In a recent article in this journal Hwarng and Ang (HA) [1] introduce what they describe as a ‘simple’ neural network for time-series forecasting. However, the authors, whom I shall refer to as ‘HA’, do not appear to have recognised that what they are actually doing is a form of logistic regression applied in a time-series context. Strictly speaking, their model may of course be seen as a special case of a feedforward Neural Network (NN), but in exactly the same way even simple linear regression amounts to a special variant of the same general structure. The label NN does not seem appropriate.

What is interesting, however, is that the HA model is not exactly the same as the very familiar logistic regression or LOGIT model. Nor does it appear to have any ‘pedigree’ in a time-series context. It does not appear in the comprehensive review of nonlinear models provided by Tong [2]. The logistic transform has been employed in modern nonlinear time-series models, but rather as a method of incorporating threshold effects and not in this particular way. Equally, the HA model is quite different from the logistic difference equation model, which has been extensively used both as a model of population growth and also as an example of complex chaotic behaviour (see e.g. [3]).

∗ Tel.: +44-1222-875668; fax: +44-1222-874419. E-mail address: [email protected] (B. Curry).

In what follows I discuss the model in NN terms and examine its relationship with conventional logistic regression. Given that the model has no previous TS pedigree, I also consider its dynamic properties, from which some interesting behaviour emerges.

2. Feedforward neural networks

Using similar notation to HA, we can define a feedforward logistic NN, applied to a pth-order autoregressive process, as follows:

z_t = \sum_{i=1}^{h} o_i \, f\!\left( \sum_{j=1}^{p} w_{ji}\, z_{t-j} \right), \qquad (1)

where z_t represents the value of the process at time t, and o_i, w_{ji} are weights.

The terms inside the bracket represent the linear inputs to the hidden nodes. Activation of the hidden nodes is described by the logistic function f(·). HA use the standard term multi-layer perceptron (MLP). The MLP model can include other activation functions and can be employed with a transform to the output.


The so-called ‘universal approximation’ property of feedforward networks, which is the main reason for using them to model nonlinear behaviour, arises from the superposition of logistic/sigmoid functions on the right-hand side (see e.g. [4]).
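A minimal NumPy sketch of the one-step forecast in Eq. (1) may help fix the notation; the function and argument names are illustrative assumptions, not code from HA:

    import numpy as np

    def logistic(x):
        """Logistic activation f(x) = 1 / (1 + exp(-x))."""
        return 1.0 / (1.0 + np.exp(-x))

    def mlp_forecast(z_lags, W, o):
        """One-step forecast from Eq. (1).

        z_lags : array of the p most recent values (z_{t-1}, ..., z_{t-p})
        W      : (h, p) array of input-to-hidden weights w_ji
        o      : length-h array of hidden-to-output weights o_i
        """
        hidden = logistic(W @ z_lags)   # activations of the h hidden nodes
        return o @ hidden               # linear combination at the output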

3. The HA model

The essential feature of the HA model is that it is a simplified version of Eq. (1), with the right-hand side being a simple linear autoregressive model. The hidden nodes on the right-hand side and their concomitant weights disappear. Hence, we have the HA model with weights w_j:

z_t = f\!\left( \sum_{j=1}^{p} w_j\, z_{t-j} \right). \qquad (2)

One can see that Eq. (2) is effectively an application of logistic regression within an autoregressive model. It has very much the same form as the LOGIT model

y_i = f\!\left( \sum_{j=0}^{m} o_j\, x_{ji} \right). \qquad (3)

In this equation x_{ji} denotes the ith value of the jth explanatory variable, and y_i is the probability of the underlying binary variable taking on the value unity. The main difference between Eqs. (2) and (3) is patently that the logit model uses explanatory variables x_j, and not autoregressive terms. Also, y_i is a probability rather than a conventional dependent variable.
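To make the contrast concrete, Eq. (2) can be coded directly as an autoregression pushed through the logistic; the sketch below is illustrative (assumed names, not HA’s implementation):

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def ha_forecast(z_lags, w):
        """HA model, Eq. (2): z_t = f(sum_j w_j * z_{t-j}).

        Unlike the LOGIT model of Eq. (3), the inputs are lagged values
        of the series itself, and the output is the next value of the
        series, not a probability.
        """
        return logistic(np.dot(w, z_lags))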

The key to the HA model is that it uses the logistic function as part of the specified autoregressive model. However, it is notably different from other uses of the logistic in a time-series context. Of these there are two main variants. In the first, the logistic function is used to model transitions over time between parameter regimes. This is the Smooth Transition Auto-Regressive (STAR) model, which provides an alternative to using step functions to describe the transition (see [2]). A second time-series approach using the logistic function is the so-called logistic difference equation. In its first-order form this can be written as

N_t = N_{t-1}\,(r - s\,N_{t-1}), \qquad (4)

where N_t is population size at time t, and r, s are parameters.

The model has a long pedigree in population biology, showing the long-term growth in the population of a given species, subject to an upper limit (see e.g. [3]). Again, however, it is patently quite different from HA’s model.

4. Implementation of the HA model

Interestingly, although the LOGIT model is alternatively titled ‘logistic regression’, its actual implementation is not through simple regression or even least-squares as such. Rather, because the term y_i in Eq. (3) is in fact a probability, the model is usually fitted by direct maximisation of the log-likelihood function (see [5]). Now, in Eq. (2), the variables are not probabilities, even though the effect of the logistic function is to force them into the unit interval. Hence the standard MLE implementation for the LOGIT model is not applicable.

One approach to fitting Eq. (2) is to insert an additive error term, and thence to adopt a nonlinear least-squares method, thereby applying maximum likelihood principles. HA do this implicitly in employing the backpropagation (BP) algorithm, suitably simplified for their model from the standard version. BP has become the orthodox way of applying a least-squares or minimum RMS metric for NNs. However, as a gradient-based method it has been subjected to some criticism; e.g. Curry and Morgan [6], who have advocated the use of other methods.
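A sketch of the nonlinear least-squares route, under the assumption of an additive error in Eq. (2); the design-matrix construction and the choice of SciPy’s least_squares optimiser are mine, not HA’s BP implementation:

    import numpy as np
    from scipy.optimize import least_squares

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fit_ha_nls(z, p):
        """Fit z_t = f(sum_j w_j z_{t-j}) + u_t by nonlinear least squares."""
        # Row t of the design matrix holds the lags (z_{t-1}, ..., z_{t-p}).
        Z = np.column_stack([z[p - j:len(z) - j] for j in range(1, p + 1)])
        target = z[p:]

        def residuals(w):
            return target - logistic(Z @ w)

        result = least_squares(residuals, x0=np.zeros(p))
        return result.x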

An anonymous reviewer has pointed out that Eq. (2) can be fitted conveniently by OLS, using a logit transform to linearise the right-hand side; the result is

\ln\!\left( \frac{1 - z_t}{z_t} \right) = -\sum_{j=1}^{p} w_j\, z_{t-j}. \qquad (5)

Eq. (5) is certainly attractive in practical terms. Strictly speaking, OLS can be applied only if (5) possesses a suitably well-behaved error term (see [2]). This will not be the case if the well-behaved error term is located in the basic HA model in (2). In such a case the transformation produces a model which conflates the original error term and the exponent containing the lagged values of z_t. Specifically, denoting the error term by u and using E to denote the exponential term, the right-hand side takes the form

\ln\!\left( \frac{-(u + uE - E)}{1 + u + uE} \right).

This is a significant departure from the conditions under which one would apply OLS. The choice between the OLS form in (5) and a nonlinear regression applied to (2) depends on the appropriate location of the error term. The HA equation with added error conforms to the standard definition of a pth-order nonlinear AR model and can be extended to include moving average terms. In that case one could argue that such a form should be used to compare with both linear and nonlinear ARMA formulations.
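A sketch of this OLS route, assuming every observed z_t lies strictly inside (0, 1) so the transform is defined (function names are illustrative):

    import numpy as np

    def fit_ha_ols(z, p):
        """Fit Eq. (5): regress ln((1 - z_t)/z_t) on the p lagged values.

        Returns the weights w_j of Eq. (2); note the sign flip, since
        Eq. (5) carries a leading minus on the right-hand side.
        """
        Z = np.column_stack([z[p - j:len(z) - j] for j in range(1, p + 1)])
        y = np.log((1.0 - z[p:]) / z[p:])        # logit-transformed target
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return -coef                             # undo the minus in Eq. (5)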

5. Time-series properties

For this section we adopt the notation that LAR(p) denotes the HA model with p linear autoregressive terms on the right-hand side. For simplicity we may write the simplest case of p = 1, with a single constant parameter b, as

y_t = f(y_{t-1}) = \frac{1}{1 + \exp(-b\,y_{t-1})}. \qquad (6)

Here the error term is omitted, allowing us to consider stability of the deterministic ‘skeleton’ with y_t = y^* for all t, where y^* is a constant. Such an approach is standard (see [2]). Stability of the skeleton, together with suitable properties of the error term, broadly ensures ergodicity. In the case of linear models it immediately provides a simple condition for stability. Here, however, we encounter the obstacle that (6) has no analytic solution for the fixed point y^*. This is no great surprise given the remark of Tong [2] that ‘it is the exception rather than the rule that a solution should exist for a nonlinear difference equation’. On the other hand, it is easy to compute the solution numerically for any given value of b.

Fig. 1. Stability for b = 1.

Interestingly, b^*, the value of b required for stability, depends crucially on y^*. This can be shown by simple inversion of (6), giving

b^* = -\frac{1}{y^*}\,\ln\!\left( \frac{1 - y^*}{y^*} \right). \qquad (7)

Hence, although (6) is useable in one direction only, it does provide useful insights. A stability analysis can be conducted diagrammatically, using ‘cobweb’ methods. For example, Fig. 1 shows what happens for the case b = 1; the axes have different scales in order to display the curvature of the function. For this case y^* = 0.65904607, so that y^* = f(y^*). It is clear that the model is very stable (perhaps even too stable). A starting value of y_0 = 0.1 leads back quite quickly to the intersection point with the 45° line. After three iterations we have a value of 0.65210646.
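The cobweb dynamics are easy to reproduce numerically; a minimal sketch (the helper name is mine; the quoted values match the text):

    import numpy as np

    def iterate_lar1(b, y0, n):
        """Iterate the first-order skeleton y_t = 1/(1 + exp(-b * y_{t-1}))."""
        path = [y0]
        for _ in range(n):
            path.append(1.0 / (1.0 + np.exp(-b * path[-1])))
        return path

    # b = 1, starting from y0 = 0.1: three iterations already reach
    # approx. 0.65210646, close to the fixed point y* = 0.65904607.
    print(iterate_lar1(1.0, 0.1, 3)[-1])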

Since for any random starting point y_0 greater than unity the logistic function automatically produces a value f(y_0) less than unity, there is no loss of generality involved in concentrating on the interval (0, 1). In general, for equations of this type a fixed point is stable if the absolute value of the slope of the function at the fixed point is less than unity (see e.g. [3,2]). Using Eq. (7), the slope at fixed points y^* is

t = f'(y^*) = -\ln\!\left( \frac{1 - y^*}{y^*} \right)(1 - y^*). \qquad (8)

Fig. 2. Allowable slope values at fixed points.

A plot of this function, in Fig. 2, indicates that the slope can never exceed unity at the intersection with the 45° line. Only for suitably small values of y^* can the slope fall below −1. Numeric computation using the MAPLE symbolic manipulation software gives the value at which this happens as y^* = 0.21781171, corresponding to b^* = −5.8695861.

This boundary value for y^* is apparent in Fig. 2. Hence, for values of b greater than −5.8695861 we have stability. Most interestingly, where b ≤ −5.8695861 we have a periodic solution with alternating values of y. Fig. 3 shows the situation for b = −10. The long-run solution is an alternation between y = 0.47950454 and y = 0.0082027815. At y = 0.47950454 the height of f(y) is 0.008202781452, and in turn at y = 0.0082027815 we have f(y) = 0.4795045371.

Fig. 3. Alternating equilibrium for b = −10.
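Reusing the iterate_lar1 sketch above, the alternating solution for b = −10 appears directly:

    # Period-2 behaviour for b = -10: successive iterates settle onto the
    # alternating pair (0.47950454, 0.0082027815) quoted in the text.
    path = iterate_lar1(-10.0, 0.1, 20)
    print(path[-2], path[-1])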

Interestingly, although the function in Fig. 2 appears to be reasonably straightforward, there are no analytical solutions either for the single turning point or for the point at which the function first yields a slope value of −1. However, the MAPLE software returns solutions for both these cases in terms of the Lambert function, which has a history dating back to 1758. The original reference is [7]; the function is more than a mere mathematical curiosity, having numerous applications in physics and other disciplines (see e.g. [8]). Specifically, the Lambert function W(x) is a complex-valued function which satisfies

W(x)\,e^{W(x)} = x. \qquad (9)

As a solution to the equation t(x) = −1, MAPLE gives

\frac{W(e^{-1})}{W(e^{-1}) + 1}.


The importance of this result is that it leads directly to the values b^* and y^*, calculated above, for which the model is stable. To calculate the value one can use MAPLE to generate a Taylor series expansion which yields y^* for any required degree of accuracy.
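The boundary values can also be checked outside MAPLE; a sketch using SciPy’s Lambert W (the choice of library is mine, not the paper’s):

    import numpy as np
    from scipy.special import lambertw

    # Boundary fixed point y* = W(e^-1) / (W(e^-1) + 1), from the MAPLE result.
    w = lambertw(np.exp(-1)).real        # principal branch is real here
    y_star = w / (w + 1.0)               # approx. 0.21781171

    # Corresponding parameter value, from Eq. (7).
    b_star = -np.log((1.0 - y_star) / y_star) / y_star   # approx. -5.8695861
    print(y_star, b_star)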

Finally, it is interesting in this section to consider the speed of convergence of the model, either to a stable fixed point or to an alternating solution. Convergence in some cases is extremely rapid. For example, it was noted above that with b = 1, starting at y_0 = 0.1, we move very rapidly to the stable equilibrium at y^* = 0.65904607. The logistic or sigmoid function is aptly described as a ‘squashing function’, and in this example it has the effect of imposing fast convergence to an equilibrium value. In the other example, for b = −10, we again have fast convergence. Here the alternating solution is approached very closely after just four iterations. On the other hand, with b = −6, very close to the stability boundary, we have extremely slow convergence. Specifically, from y_0 = 0.1 it takes some 555 iterations to reach convergence to eight decimal places at the alternating values of (0.28820998, 0.15068228); even after 50 iterations the successive values of y are merely (0.29622485, 0.14463077).

Both very fast and very slow convergence are interesting from the point of view of time-series modelling. It follows that if the LAR(p) model is to be used for forecasting purposes, the parameter values need to be quoted and convergence examined. This is in contrast with the practice of many of those who use NNs; actual weight values are seldom published.

This section has dealt with the simplest case of the model. We may allow for more than a single lagged value by rewriting Eq. (6) to include up to p autoregressive terms with coefficients b_1, ..., b_p. The stability condition then becomes

y^* = f(y^*) = \frac{1}{1 + \exp(-b_1 y^* - b_2 y^* - \cdots - b_p y^*)} = \frac{1}{1 + \exp(-\beta y^*)}, \qquad (10)

where \beta = \sum_{j=1}^{p} b_j. Hence we can check for fixed points using conditions applied to \beta.
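As a sketch of that check (the helper is mine, and the slope test simply applies the first-order condition to β):

    import numpy as np
    from scipy.optimize import brentq

    def lar_fixed_point(b_coeffs):
        """Locate the constant solution y* of Eq. (10) for given b_1..b_p,
        and report the first-order stability slope beta * y* * (1 - y*)."""
        beta = float(np.sum(b_coeffs))
        g = lambda y: 1.0 / (1.0 + np.exp(-beta * y)) - y
        y_star = brentq(g, 1e-12, 1.0 - 1e-12)  # f maps into (0,1), so a root exists
        slope = beta * y_star * (1.0 - y_star)  # |slope| < 1 suggests stability
        return y_star, slope

    # Example: the first-order case b = 1 recovers y* = 0.65904607.
    print(lar_fixed_point([1.0]))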

6. Summary and discussion

I have argued that the ‘simple NN’ model of Hwarng and Ang [1] is better described as logistic regression applied in a TS context. Although the HA model is technically a special type of NN, it does not possess the strong approximation properties of multi-layer perceptrons. While the HA model is an application of logistic regression in an autoregressive model, it differs from the LOGIT model in that its dependent variable is not a probability. HA is patently not related to other common uses of the logistic function in a difference equation or autoregressive model.

The stability properties of the LAR model are interesting; the ‘squashing’ effect of the logistic function may impose a rather tight form of stability. The first-order case is very likely to be stable. In some cases there may be periodic behaviour in the long run, possibly with very slow convergence. The analysis above suggests that chaotic behaviour does not arise, in contrast to the logistic difference equation [3].

References

[1] Hwarng HB, Ang HT. A simple neural network for ARMA(p, q) time series. OMEGA 2001;29(4):319–33.

[2] Tong H. Non-linear time series analysis: a dynamical system approach. Oxford, UK: Clarendon Press; 1982.

[3] May RM. Simple mathematical models with very complicated dynamics. Nature 1976;261:459–67.

[4] Barron R. Universal approximation bounds for superpositions of a sigmoid function. IEEE Transactions on Information Theory 1993;39:930–45.

[5] Amemiya T. Advanced econometrics. Oxford, UK: Basil Blackwell; 1976.

[6] Curry BP, Morgan P. Neural networks: a need for caution. OMEGA 1997;25(1):123–33.

[7] Lambert JH. Observationes variae in mathesin puram. Acta Helvetica Physico-Mathematico-Anatomico-Botanico-Medica 1758;3:128–68.

[8] Corless RM, Gonnet GH, Hare DEG, Jeffrey DJ, Knuth DE. On the Lambert W function. Advances in Computational Mathematics 1996;5:329–59.