
Pattern Recognition 45 (2012) 4536–4546


Nonlinear single layer neural network training algorithm for incremental, nonstationary and distributed learning scenarios

David Martínez-Rego, Oscar Fontenla-Romero, Amparo Alonso-Betanzos

Department of Computer Science, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain

Article info

Article history:

Received 16 June 2011

Received in revised form

2 April 2012

Accepted 11 May 2012
Available online 22 May 2012

Keywords:

Artificial neural networks

Incremental learning

Nonstationary learning

Distributed learning

0031-3203/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.patcog.2012.05.009
Corresponding author. Tel.: +34 981 167000; fax: +34 981 167160.
E-mail addresses: [email protected] (D. Martínez-Rego), [email protected] (O. Fontenla-Romero), [email protected] (A. Alonso-Betanzos).

Abstract

Incremental learning of neural networks has attracted much interest in recent years due to its wide applicability to large scale data sets and to distributed learning scenarios. Moreover, nonstationary learning paradigms have also emerged as a subarea of study in the Machine Learning literature due to the problems of classical methods when dealing with data set shifts. In this paper we present an algorithm to train single layer neural networks with nonlinear output functions that takes into account incremental, nonstationary and distributed learning scenarios. Moreover, it is demonstrated that introducing a regularization term into the proposed model is equivalent to choosing a particular initialization for the devised training algorithm, which may be suitable for real time systems that have to work under noisy conditions. In addition, the algorithm includes some previous models as special cases and can be used as a block component to build more complex models such as multilayer perceptrons, extending the capacity of these models to incremental, nonstationary and distributed learning paradigms. In this paper, the proposed algorithm is tested with standard data sets and compared with previous approaches, demonstrating its higher accuracy.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Traditionally, most Machine Learning (ML) algorithms rest on the assumption that the data being analyzed were drawn from a stationary distribution. However, in many real-life problems the distribution of the data changes over time, and so what was learned in the past is not accurate or even relevant for present data [1]. Moreover, among the most challenging current ML tasks we find: (a) learning when data is not available all at once or is distributed across remote locations, under circumstances where it is not possible to move data between nodes, and (b) learning from an amount of data so massive that it becomes impossible to tackle learning with classical batch ML approaches. The former problem has been presented under the name of horizontally partitioned learning and has emerged as a subarea of machine learning. Currently, there is a great deal of interest in this area, which continuously generates challenges and projects [2]. Recently, many classical ML methods have been studied from a distributed learning perspective [3,4], and even privacy-preserving issues are being discussed in the current ML research literature [5]. The latter problem has been named Large Scale Learning [6,7] and has attracted considerable interest due to the expansion and exploitation of heavy enterprise databases characterized by large numbers of data points and/or high dimensions. Incremental learning models, such as the one presented in this work, are promising tools in this area. Much research has been devoted to these areas, which have expanded ML applicability to real-life scenarios.

The work in this paper is underpinned by previous work [8], where a new convex objective function for one-layer neural networks was presented which can exactly adjust (up to the first order of a Taylor series) the weight matrix of a neural network without hidden layers and with nonlinear output functions, provided that these have an inverse and a derivative. In that research it was succinctly pointed out that the proposed approach opened up the opportunity of learning incrementally (i.e., without the need for storing previous data). However, this incremental capacity involves the inversion of an I × I matrix for each new data point, leading in some situations to numerical instabilities and with a complexity of O(I^3), I being the dimension of the input space. Afterwards, in [9] the incremental learning capability of the model described above [8] was explored and extended to concept drift scenarios, obtaining good results. This algorithm weights the importance of each data sample taking into account whether it is recent or not, giving exponentially more importance to recent data points. Although it is an effective method for concept drift problems, it still has to solve a system of linear equations for each new data sample


Fig. 1. Architecture of a single-layer feedforward neural network.


and needs to periodically reset the weighting of the data samples, thus leading to a cumbersome algorithm. Although not considered in those previous works, both algorithms share their roots in the classical Recursive Least Squares (RLS) algorithm [10], originally designed for solving least-squares problems. The RLS method is an efficient semi-second-order approach that leads to faster convergence compared with first-order models. It has been extensively studied and applied over the last decades to problems such as real-time signal processing, control, adaptive filtering and noise cancellation, among others [11,12]. The algorithm has the advantage of exhibiting extremely fast convergence in a few learning steps. However, each iteration has a high computational complexity and potentially poor tracking performance when the system to be estimated changes [13]. In addition, it has the extra limitation of only considering linear output functions. The algorithm proposed in this paper presents the following main characteristics:

- It is able to train a single layer neural network with any nonlinear output function that complies with the aforementioned conditions.
- Since most of the output functions used in Artificial Neural Networks [14] comply with these requisites, it can be used as a basic building block for more complex neural models.
- It generalizes previous models. Depending on the values given to the hyper-parameters and the selected output function, it includes as special cases: RLS [10], the model in [8] and the one in [9].
- The relation between its initialization scheme and its regularization capacity is demonstrated; the latter can improve its generalization ability in the case of ill-conditioned problems (high dimensionality, noise, etc.).
- Due to its incremental nature, it is suitable for applications in nonstationary and/or distributed scenarios.

Specifically, compared with the previous models detailed above, the proposed model:

- improves the learning and concept-drift capabilities, with less computational complexity;
- includes the possibility of considering nonlinear output functions, unlike RLS methods.

Finally, experimental results demonstrate that the proposed model can obtain accurate results and fast convergence in stable, nonstationary and distributed scenarios.

2. Background: nonlinear single layer neural network learning algorithm

2.1. Incremental algorithm

In this section we present the derivation of a previous algorithm that obtains the optimal weights of a single layer feedforward neural network with nonlinear output functions, which need to have an inverse and a derivative. These restrictions come from the fact that we use a theorem demonstrated in [8], where an equivalent formulation for minimizing the error of a nonlinear single layer neural network was presented. This derivation follows a very different philosophy in comparison with previous algorithms, since it backpropagates the network's desired output signal instead of the error committed. This process is depicted graphically in Fig. 1. For each pattern x_s, its desired output d_s is propagated backwards using the inverse of the output function of each neuron, f_j^{-1}, and we tackle the minimization of the error between the internal network value z_js and f_j^{-1}(d_js).
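To make this inversion step concrete, the following sketch (our own illustration, assuming a hypothetical logistic output function, which has both an inverse and a derivative) shows how a desired output d is mapped back to the linear domain before fitting:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_inv(d):
    # f^{-1}(d) = log(d / (1 - d)): the desired output propagated
    # backwards through the inverse of the output function
    return np.log(d / (1.0 - d))

d = np.array([0.2, 0.8, 0.95])   # desired outputs in (0, 1)
dbar = logistic_inv(d)           # targets in the linear domain z = Wx
fprime = d * (1.0 - d)           # f'(dbar) for the logistic: sigma(1 - sigma)

print(np.allclose(logistic(dbar), d))  # the inversion recovers d
```

The minimization is then carried out between z_s = Wx_s and these transformed targets, weighted by f'(d̄_s), exactly as stated in Theorem 1 below.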

The theorem presented in that work is the first step in the derivation of the proposed algorithm and states:

Theorem 1. Let x ∈ R^(I+1) be the input of a single-layer feedforward neural network, d, y ∈ R^J be the desired and real outputs, W ∈ R^(J×(I+1)) be the weight matrix, and f, f^{-1}, f' : R^J → R^J be the nonlinear function, its inverse and its derivative. Then, the minimization of the MSE between d and y at the output of the nonlinearity

    min_W (1/S) Σ_{s=1..S} ||y_s − d_s||²    (1)

where S is the number of data points and y = f(Wx), is equivalent, up to first Taylor order, to minimizing the MSE before the nonlinearity, i.e., between z = Wx and d̄ = f^{-1}(d), weighted depending on the value of the derivative of the nonlinearity at the corresponding operating point. Mathematically, this property can be written as

    min_W E[(d − y)^T (d − y)] ≈ min_W E[(f'(d̄) ∘ ē)^T (f'(d̄) ∘ ē)]    (2)

where (∘) denotes the element-wise Hadamard product of the vectors f'(d̄) and ē = d̄ − z.

The details of the proof of this theorem can be consulted in [8]. In the following, we center our attention on a single neuron in order to avoid a cumbersome derivation. To solve a full layer of neurons, as in Fig. 1, the process has to be applied identically for each neuron.

Using this theorem, the weight vector of a neural network such as the one in Fig. 1 has to be a stationary point of the right-hand side of Eq. (2). Thus, taking derivatives of this expression and equating them to 0, we conclude that the optimal model w is the one which solves the following system of linear equations:

    A w = b    (3)

where A and b are defined as

    A = Σ_{t=1..S} x_t x_t^T f'²(d̄_t),    b = Σ_{t=1..S} d̄_t x_t f'²(d̄_t)    (4)

With this model we have a way of tackling both batch and incremental learning scenarios, as we can save the previous A_t and b_t, and when new information is supplied up to time t+p we can incrementally construct A_{t+p} and b_{t+p} using Eq. (4). Although mathematically correct, this approach has the following problem: it needs to solve a system of equations each time new information is provided, and a much simpler numerical algorithm to solve this incremental learning scenario is desirable.
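The batch/incremental duality can be sketched as follows (an illustrative fragment of our own, not from the paper, assuming for simplicity a linear output function so that f'²(d̄_t) = 1): the statistics of Eq. (4) are folded in one pattern at a time and Eq. (3) is solved only on demand.

```python
import numpy as np

def accumulate(A, b, x, dbar, fprime_at_dbar):
    """Fold one pattern into the sufficient statistics of Eq. (4)."""
    fp2 = fprime_at_dbar ** 2
    A = A + fp2 * np.outer(x, x)   # A += x_t x_t^T f'^2(dbar_t)
    b = b + fp2 * dbar * x         # b += dbar_t x_t f'^2(dbar_t)
    return A, b

rng = np.random.default_rng(0)
I = 3
w_true = np.array([1.0, -2.0, 0.5])   # ground truth, for illustration only
A, b = np.zeros((I, I)), np.zeros(I)
for _ in range(200):                  # patterns arrive one by one
    x = rng.normal(size=I)
    A, b = accumulate(A, b, x, w_true @ x, 1.0)   # linear f => f' = 1
w = np.linalg.solve(A, b)             # Eq. (3): A w = b
print(np.allclose(w, w_true))         # True on this noiseless stream
```

Note that the linear solve costs O(I^3) every time it is invoked, which is exactly the inefficiency that the online algorithm of Section 3 removes.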


2.2. Concept-drift learning algorithm

Taking advantage of the incremental learning capacity of the presented model, it can be extended to nonstationary learning scenarios. In [9] an algorithm for tackling incremental learning with forgetting capacity, based on this previous model, was devised. It consists of weighting each pattern in (4) exponentially, depending on how much time has passed since its inclusion in the learning process. This is equivalent to exponentially reducing the importance of the error committed for a past pattern x_t proportionally to the time that has passed since it appeared. If we combine this idea with the result of Theorem 1, we arrive at the following error function to minimize:

    min_w (d̄ − X^T w)^T F Λ (d̄ − X^T w)    (5)

where X is a matrix with the data patterns {x_1, x_2, ..., x_S} as columns, d̄ ∈ R^S (for a single neuron) complies with d̄ = f^{-1}(d), Λ is a diagonal matrix with diagonal elements Λ_ii = λ^(S−i) for i = 1, ..., S, and F is a diagonal matrix with F_ii = f'²(d̄_i). If we take derivatives with respect to w and equate the result to 0, we arrive at the following system of linear equations that yields the time-weighted optimal neural network:

    A_S = Σ_{t=1..S} λ^(S−t) x_t x_t^T f'²(d̄_t),    b_S = Σ_{t=1..S} λ^(S−t) d̄_t x_t f'²(d̄_t)    (6)

The parameter λ controls the ability of the network to forget when the system under identification changes. It can be fixed in advance or changed dynamically based on the network's error history [17–19], controlling the length of the time window considered in order to adjust the weights.

Up to this point, in every scenario (nonstationary, incremental or distributed learning), we had to solve the system in (3) or (6) every time new information was received. This can lead to an inefficient and complex algorithm when incremental or nonstationary learning scenarios are considered, since in these cases the network has to be updated each time a new pattern is received.

3. Diminishing complexity and incrementing efficiency: the new approach

In this work we present and demonstrate two lemmas that yield and clarify a much simpler and more efficient algorithm for the aforementioned scenarios. In the next subsection, the new algorithm is presented and, subsequently, a lemma that demonstrates the relation between its initialization and the regularization capacity of the model is detailed. Finally, in the last subsection the main differences and advantages of the proposed model compared to previous approaches are analyzed.

3.1. Incrementing efficiency: online algorithm

Algorithm 1 is able to solve the same model as the one presented in Section 2, and its equivalence can be demonstrated by the following lemma:

Algorithm 1. Nonlinear single layer neural network training.

Input: Data set that comprises inputs X = {x_1, x_2, ..., x_S} and desired outputs D = {d_1, d_2, ..., d_S}, forgetting factor λ, initial value δ.
Output: Optimal weight vector w_S.

Initialize P_0 = δI and w_0 = 0
For t = 1, ..., S
    k_t = λ^{-1} P_{t−1} x_t / (1 + λ^{-1} x_t^T P_{t−1} x_t f'²(d̄_t))
    P_t = λ^{-1} [P_{t−1} − k_t x_t^T P_{t−1} f'²(d̄_t)]
    α_t = d̄_t − x_t^T w_{t−1}
    w_t = w_{t−1} + k_t α_t f'²(d̄_t)
end

Lemma 1. Algorithm 1 solves the optimal weights of a single layer neural network, such as the one depicted in Fig. 1, up to a first order Taylor approximation.

The proof of this lemma is deferred to the Appendix. As can be seen in the algorithm, the weight vector is updated for each pattern through a vector k_t, weighted by the error committed for that pattern and the derivative of the output function at d̄_t. This vector k_t is proportional to a matrix P_{t−1} which represents A_{t−1}^{-1} (see the Appendix for details). In order to initialize the method, we have to give a value to P_0, before any pattern is presented to the network, with the initialization term P_0 = δI, where I represents the identity matrix. In the next section, the meaning of this initialization is analyzed.
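A direct transcription of Algorithm 1 for a single neuron might look as follows. This is a sketch of our own: the function name and the test with a linear output (so that f'²(d̄_t) = 1, making the recursion coincide with RLS) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def train_online(X, dbar, fp2, lam=1.0, delta=1000.0):
    """Algorithm 1 for one neuron.

    X: (S, I) input patterns; dbar: (S,) targets f^{-1}(d);
    fp2: (S,) squared derivatives f'^2(dbar_t); lam: forgetting
    factor; delta: initialization P_0 = delta * I.
    """
    S, I = X.shape
    P = delta * np.eye(I)                    # P_0 = delta * I
    w = np.zeros(I)                          # w_0 = 0
    for t in range(S):
        x = X[t]
        Px = P @ x
        k = (Px / lam) / (1.0 + (x @ Px / lam) * fp2[t])  # gain k_t
        P = (P - fp2[t] * np.outer(k, x) @ P) / lam       # P_t update
        alpha = dbar[t] - x @ w                           # a priori error
        w = w + fp2[t] * alpha * k                        # weight update
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
w_true = np.array([0.5, -1.0, 2.0, 0.0])
w = train_online(X, X @ w_true, np.ones(500))
print(np.allclose(w, w_true, atol=1e-2))  # converges to the true weights
```

Only matrix–vector products appear in the loop, so each step costs O(I^2) rather than the O(I^3) of a linear solve.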

3.2. Regularization property

It is a well known fact in statistics and machine learning that regularization schemes lead to models with better generalization performance when ill-conditioned parameter estimation problems are faced [14]. One standard way of introducing regularization in linear models is by adding a penalty term to the error function that biases the training phase towards simpler models with a better generalization ability [15]:

    Error' = Error on data + Complexity penalty    (7)

For linear models fitted by the least squares method, Tikhonov regularization [16] has demonstrated its effectiveness and has a Bayesian interpretation. The Tikhonov regularization term has the form

    Complexity penalty = ||Γw||²    (8)

w being the vector of parameters of the linear model, with Γ = μI (I the identity matrix) as the most common choice. The choice of the initialization P_0 = δI indirectly introduces a Tikhonov regularization term into the network error function, as the following lemma states:

Lemma 2. The parameter δ in Algorithm 1 is inversely proportional to the value μ in the following extended error function:

    min_w (d̄ − X^T w)^T F Λ (d̄ − X^T w) + μ w^T w    (9)

where d̄ = f^{-1}(d), X is a matrix with the data patterns {x_1, x_2, ..., x_S} as columns, Λ is a diagonal matrix with diagonal elements Λ_ii = λ^(S−i) for i = 1, ..., S, and F is a diagonal matrix with F_ii = f'²(d̄_i).

Once again, the proof of this lemma is deferred to the Appendix.
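Lemma 2 can also be checked numerically. In the simplest setting (a linear output function and λ = 1, so Eq. (9) reduces to plain ridge regression), running the recursion with P_0 = δI should reproduce the Tikhonov solution with μ = 1/δ. The following sketch, our own illustration rather than material from the paper, performs exactly that comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
S, I, delta = 50, 3, 10.0
X = rng.normal(size=(S, I))
dbar = X @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=S)

# Online pass with P_0 = delta * I (linear output, lambda = 1)
P, w = delta * np.eye(I), np.zeros(I)
for x, db in zip(X, dbar):
    k = P @ x / (1.0 + x @ P @ x)
    P = P - np.outer(k, x) @ P
    w = w + (db - x @ w) * k

# Direct Tikhonov (ridge) solution of Eq. (9) with mu = 1 / delta
mu = 1.0 / delta
w_ridge = np.linalg.solve(X.T @ X + mu * np.eye(I), X.T @ dbar)
print(np.allclose(w, w_ridge))  # True: the initialization acts as a regularizer
```

A large δ therefore means weak regularization (μ → 0), while a small δ biases the solution towards small weights.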

3.3. Main differences and advantages

The results previously detailed in this section allow us to derive an algorithm which can incrementally train a regularized single layer neural network and optionally includes the capacity of forgetting past information in the presence of changes of the system under identification. Both parameters δ and λ control this behavior and can be tuned in order to obtain a quick response under changes in the model being learned and to introduce a bias towards simpler models in noisy situations. The parameter λ takes values in the interval (0, 1] and controls the ability of the network to forget past patterns. If it takes the value 1, this algorithm approximately obtains the same network as the one in [8]. Exact equivalence depends on the value of δ. The difference emerges from the fact that the work in [8] does not have the ability to introduce regularization into the network. In the proposed algorithm, δ controls the importance of a Tikhonov regularization term introduced into the network error function. Both models are equivalent in the limit δ = 1/μ → ∞. It is important to remark that this algorithm is equivalent to the RLS algorithm when the output function is linear.

It can also be observed that the proposed algorithm is able to train a nonlinear model through simple matrix algebra operations, particularly matrix–vector multiplications and vector summations, which makes it suitable for implementing single layer neural networks in embedded and real-time systems without the need for a matrix algebra package. In the previous approach, a system of linear equations has to be solved at each step, the complexity of the algorithm being O(S × I^3), where S is the number of patterns and I the number of inputs of the model. In the new approach the complexity has been reduced by one order of magnitude, to O(S × I^2).

Since it carries all past information in P_t and w_t, it is also suitable for incremental learning and horizontally partitioned distributed learning scenarios. In both scenarios, it is impossible to access the whole data set available for training: in the former case due to real-time or storage restrictions, and in the latter due to the fact that data is distributed in remote nodes and cannot be collected in a central node. In both cases, learning can be suspended and continued afterwards in a different remote node or at a future time thanks to the information carried by the aforementioned P_t and w_t.
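The suspend-and-resume property can be sketched as follows (an illustration of our own under the same simplifying assumptions as before: one neuron, linear output, λ = 1). Each node consumes its local shard and forwards only the state pair (P_t, w_t), never the raw data:

```python
import numpy as np

def partial_fit(P, w, X, dbar):
    """Consume one data shard and return the state (P, w) to hand on."""
    for x, db in zip(X, dbar):
        k = P @ x / (1.0 + x @ P @ x)
        P = P - np.outer(k, x) @ P
        w = w + (db - x @ w) * k
    return P, w

rng = np.random.default_rng(3)
I = 3
X = rng.normal(size=(100, I))
dbar = X @ np.array([2.0, 0.0, -1.0])
P0, w0 = 100.0 * np.eye(I), np.zeros(I)

# Node 1 trains on its shard; node 2 resumes from the transferred state
P1, w1 = partial_fit(P0, w0, X[:60], dbar[:60])
P2, w2 = partial_fit(P1, w1, X[60:], dbar[60:])

# Identical to a single uninterrupted pass over all the data
_, wc = partial_fit(P0, w0, X, dbar)
print(np.allclose(w2, wc))  # True: only (P, w) travels between nodes
```

The transferred state has size O(I^2), independent of the number of patterns already processed, which is what makes the scheme viable when shards cannot be moved.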

In [21], an extension of the RLS algorithm to nonlinear neural networks was presented. Although similar in philosophy to the present work, their derivation obtains a different algorithm through the linearization of the network's output function instead of backpropagating the network's desired output signal. In the experimental section, we show that the proposed derivation leads to an algorithm which obtains more accurate results and a faster convergence to the minimum in comparison with this previous work. In addition, when applied to nonstationary environments, the proposed model obtains more precise and faster system identification.

4. Experimental results

In this section we study the following aspects of the proposed model: (a) its regularization capacity for some real data sets, (b) its convergence for a series of standard data sets in stable conditions, (c) its convergence and system change identification ability in nonstationary learning problems, (d) its behavior in distributed environments, and (e) the interaction between the forgetting factor and regularization.

Fig. 2. Regularization behavior for Ionosphere data set. (a) Test error for Ionosphere data set. (b) Weight sum (in absolute value) of the final model.

4.1. Regularization behavior

As mentioned in the previous sections, when a regularization term is introduced into the objective function of a model, the solution is biased towards simpler models with better generalization performance, avoiding overfitting to the data. However, when this bias is too restrictive, it leads to oversimplified models which cannot learn the underlying function. This behavior is tested experimentally in this section for both a classification and a regression problem. For the classification experiment, we divided the Ionosphere data set [20] into training and test subsets. For the regression experiment, an artificial problem was created by taking a random weight vector of 50 dimensions and a reduced training set of 500 patterns, to which we added random noise. In both situations Algorithm 1 was used to train the network with the whole training partition, and the network was finally tested with the test partition.

Fig. 2 shows how the initialization parameter δ controls the regularization behavior of the network. Fig. 2(a) plots the test error and Fig. 2(b) shows the sum (in absolute value) of the network weights. Both graphics have the different values of the regularization term μ = 1/δ on the abscissa axis. As can be observed, what we demonstrated in Section 3.2 is experimentally confirmed. On the one hand, the sum of the final weights decreases as we increase the regularization term (initial δ). On the other hand, regarding the test error, there is an optimal point where the bias introduced by the regularization term into the training phase improves the generalization of the model, avoiding overfitting to the training set; beyond this point, the bias is too restrictive and it becomes impossible to properly learn the underlying desired function. Analogously, Fig. 3 shows the test error and weight sum for the regression data set. As can be seen, the behavior is the same as in the classification example.

4.2. Standard data sets comparison

In order to illustrate the performance of the proposed algorithm we have applied it to the on-line prediction of 12 time series. Table 1 contains the characteristics of the data sets. The results were also compared with the RLS approach proposed in [21,22] to check the efficiency of the new algorithm. For all data sets, a cross-validation procedure was used to perform




Fig. 3. Regularization behavior for linear data set. (a) Test error for artificial data set. (b) Weight sum (in absolute value) of the final model.

Table 1. Data sets employed in the comparative study.

Data set                           Samples   Inputs
Artificial 1                       20 000    3
Artificial 2                       20 000    4
Artificial 3                       20 000    10
Annulus (a)                        15 000    6
Lorenz (a)                         20 000    6
Kobe earthquake (b)                3000      6
Concrete compressive strength (c)  1030      9
Forest fires (c)                   517       12
Glassfurnace (d)                   1247      3
pH neutralization process (d)      2001      2
Industrial dryer (d)               867       3
Industrial winding process (d)     2500      5

(a) Available at Eric Weeks's Chaotic Time Series repository (http://www.physics.emory.edu/~weeks/research/tseries4.html).
(b) Available at the Time Series Data Library (http://robjhyndman.com/TSDL).
(c) Available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml).
(d) Available at DaISy: Database for the Identification of Systems (http://homes.esat.kuleuven.be/~smc/daisy).


model selection and to set the parameters of each model (δ ∈ {0.1, 1, 10, ..., 100 000} and λ ∈ {0.9, 0.91, 0.92, ..., 1}). In the case of the λ parameter, values under 0.9 were not used as this is a stationary scenario. For each possible combination of the parameters, 20 different simulations with random permutations of the samples were carried out to construct the training and test sets. The selected model was the one that obtained the minimum mean error over all the simulations in each data set. Fig. 4 shows the average test MSE curves of the best model in the 20 simulations for the 12 time series studied. As the learning is accomplished in an on-line fashion, the curves show the error for each iteration of the process until the whole data set is presented to the model. In each subfigure the optimal combination of parameters, obtained using cross-validation, is specified for each algorithm. As can be observed in this figure, the proposed method obtains better results in many of the experiments than the RLS version proposed by Leung et al., also achieving a faster convergence speed.

4.3. Non stationary scenarios

4.3.1. Regression problems

The performance of the proposed algorithm in nonstationary regression problems was checked for both artificial and chaotic time series. In the first experiment we generated an artificial data set with three inputs and one output to be predicted using a linear combination of the inputs. This combination is changed twice during the learning process, and thus three different mixtures of inputs are obtained. The generated training set contains 27 000 samples and, for each data point, a different random test set of 3000 data points was created using the associated parameters of the current context. As in the previous experiments, 20 different simulations were performed and the averaged MSE curves were calculated. The RLS algorithm proposed by Leung et al. [21] was used once again for comparison purposes. However, in this case, and in order to be fair, it was modified to include a forgetting factor, not proposed by the authors in their original formulation. This term is mandatory in this experiment because we are managing nonstationary data.

Fig. 5 includes the results of these simulations. The training MSE curves for the on-line training process, and the test curves obtained using the specific test set for each sample, are shown. Each subfigure contains the results for a different forgetting factor (λ) in the algorithm. As can be observed, the proposed algorithm presents the fastest convergence speed to the optimum when a change is introduced in the function to be learned. Specifically, in the most conservative scenario (λ = 0.99) the proposed method is able to recover its best performance, in the presence of change, at around 3300 data points, while the algorithm by Leung et al. needs many more samples. In the most adaptive scenario shown in the figure (λ = 0.50) the presented method requires approximately 60 samples to obtain the new optimal parameters, whilst the other one achieves the same results only after around 1000–1200 new samples. It is important to remark that as the forgetting factor decreases, and therefore a shorter window of relevant data is used, the variability of the error is higher.

Subsequently, the proposed model was tested for the prediction of the Mackey–Glass [23] and Lorenz [24] chaotic time series. In order to test the ability to adapt to changes in nonstationary environments in complex scenarios, the data sets were generated in the following manner: (a) the parameter τ of the Mackey–Glass equations was changed (every 900 data points) in the following order: τ = {10, 15, 10, 14, 10, 13}; the task was to predict the value 85 steps ahead using an embedding dimension of eight values; and (b) the Rayleigh number r of the Lorenz equations was changed in the following order: r = {13, 14, 20, 28}; the task was to predict the next sample using an embedding dimension of 10 values. In both cases, and for the two methods compared, λ was set to 0.99. Fig. 6 contains the results of these experiments. It can be observed how, also in complex identification tasks, the proposed model presents, in some situations, a fast convergence to the optimum and high accuracy.
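The delay-embedding construction assumed in these experiments can be sketched as follows (the helper name and the sinusoidal stand-in series are ours): each input is a window of m consecutive past values and the target lies a fixed horizon ahead, e.g. m = 8 and horizon = 85 for the Mackey–Glass task.

```python
import numpy as np

# Build (input window, future target) pairs from a scalar time series.
def embed(series, m, horizon):
    T = len(series)
    X = np.array([series[t - m:t] for t in range(m, T - horizon)])
    y = np.array([series[t + horizon] for t in range(m, T - horizon)])
    return X, y

s = np.sin(np.linspace(0.0, 20.0, 1000))   # stand-in for a chaotic series
X, y = embed(s, m=8, horizon=85)
```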

4.3.2. Classification problems

Moreover, the proposed method was applied to the Nebraska Weather Prediction Data and the classical SEA concept problem


Fig. 4. Comparative results for the 12 data sets.

D. Martínez-Rego et al. / Pattern Recognition 45 (2012) 4536–4546

[26,27]. The first one comes from the U.S. National Oceanic and Atmospheric Administration weather measurements from over 9000 weather stations worldwide [28]. Records date back to the 1930s, providing a wide scope of weather trends. Daily measurements include a variety of features (temperature, pressure, wind speed, etc.) and indicators for precipitation and other weather-related events. Following the methodology in [27], the Offutt Air Force Base in Bellevue, Nebraska, was chosen for this experiment due to its extensive range of 50 years (1949–1999) and diverse weather patterns, making it a long-term precipitation classification/prediction drift problem. In this case the experimental setting was as follows: the model is sequentially updated for


[Figure: training and test MSE curves vs. number of data, comparing RLS-proposed and RLS-Leung et al.]
Fig. 5. Results for the first nonstationary data set, for a forgetting parameter (a) λ = 0.99, (b) λ = 0.90, (c) λ = 0.50.


each new pattern and the next 300 patterns are used as test set. Class labels are based on the binary indicator provided for each daily reading of rain or no rain. The first row in Fig. 7 contains the test results using an optimal λ and some values of the δ parameter. The results are comparable to those obtained by the nonlinear approach proposed in [27], and both approaches compared in this study obtained similar results.

The second problem is the classical SEA concept. The data set is characterized by extended periods without any drift and occasional sharp changes in the class boundary. The data set includes two classes and three features. Class labels are assigned based on the sum of the relevant features, by comparing this sum to a threshold in the 2-D space of the relevant features: an instance is assigned to the positive class if the sum of its (relevant) features falls below the threshold, and to the negative class otherwise. At regular intervals, the threshold is changed, creating an abrupt shift in the class boundary. Data are uniformly distributed between 0 and 10, and the threshold is changed three times throughout the experiment with various values. The training procedure is identical to that described in [29]. The second row in Fig. 7 shows the results for this data set. As can be seen, again the results obtained by both approaches in this classification problem are very similar, as was also the case in the prediction scenario above.

4.4. Distributed environments

Due to the incremental nature of the proposed algorithm, learning the optimal weights of a single layer neural network in distributed environments does not pose a further challenge (in this case, patterns are scattered over several processing nodes and



cannot be shared in a central node due to privacy or storage limitations). In order to fulfill a distributed learning task, the learning process has to be paused in a processing node and continued in another node after interchanging the pair {w, P}. Three situations can arise:

1. Distributed batch learning: in this scenario the nodes have all the available data from the beginning. Once all data have been processed in one node, the learning process continues in the next one until no node has any unprocessed data.

2. Distributed online learning: in this scenario the processing nodes receive data online. In this case, if the current node has no unprocessed data and another node receives new data, the learning process is transferred to the latter node.

[Figure: test MSE curves vs. number of data, comparing RLS-proposed and RLS-Leung et al.]
Fig. 6. Results for the chaotic time series. (a) Results for the Mackey–Glass nonstationary data set. (b) Results for the Lorenz nonstationary data set.

[Figure: accuracy rate by year (Nebraska Weather) and classification error rate vs. number of data (SEA), comparing RLS-proposed and RLS-Leung et al. with λ = 0.99/0.999 and δ = 1 or δ = 1000.]
Fig. 7. Results for the Nebraska Weather (first row) and SEA concept data sets (second row) employing different values of the δ parameter.

3. Distributed concept drift learning: in this case, in addition to the distributed nature of the learning, concept drift is present. So, whilst in the first two situations the constant λ is set to 1, in this situation it should be set to a value in (0, 1) in order to adapt to the change. This value should be shared among the processing nodes.

Fig. 8 presents the results of the proposed algorithm in the aforementioned three scenarios for the distributed stairs data set. To the best of the authors' knowledge, this problem has not been discussed elsewhere. In the data set depicted in Fig. 8(a), each step is stored in a processing node and data from different steps cannot be shared between nodes. Although both the local problems and the global problem are easy to solve, there is no information whatsoever in the individual nodes that guides their solutions to an optimal global solution. Only incremental methods that share global statistics, like the one proposed in this paper, can obtain an accurate global solution in an effective manner. The line in Fig. 8(a) represents the global classifier obtained in learning scenarios (1) and (2). Note that they are superimposed due to the equivalence of the two situations for the proposed model. In Fig. 8(b), learning scenario (3) is presented for this data set (with the same distribution among the nodes as in Fig. 8(a)) and Fig. 8(c) presents the classification error of the model over time for this situation with λ = 0.99. It can be observed how the presented model is able to quickly adapt in a concept drift scenario even in a distributed environment.

4.5. Interaction between forgetting factor and regularization

In this section we explore how the forgetting factor λ and the regularization term μ interact in an ill-posed nonstationary problem. On the one hand, we have shown how the forgetting factor controls the ability of the network to adapt to changes in the underlying function to be learned: if too much adaptation is allowed, the network's accuracy becomes unstable in the presence of noise, while if λ is too large, the model is too inflexible when it needs to adapt to changes in the data. On the other hand, the initialization introduced previously allows the network to control the complexity of the model, obtaining better performance when an adequate regularization level is introduced in the initialization of Algorithm 1. Taking these two arguments into account, when we try to tackle



[Figure: scatter plots of the distributed stairs data set and classification error ratio vs. pattern number.]
Fig. 8. Results for the distributed stairs data set. (a) Results for the distributed stairs data set. (b) Distributed stairs data set with concept drift. (c) Classification error for the distributed stairs data set with concept drift.

[Figure: 3-D surface of test MSE over the forgetting factor (0.6–1) and the regularization term (20–100).]
Fig. 9. Test error for an artificial data set as a function of the regularization term and the forgetting factor.


an ill-posed nonstationary learning problem, we should take both parameters into account, since there should be an optimum combination of adaptation and regularization level for each specific problem, depending on its particular properties. In order to test this, we generated 30-dimensional random ill-posed regression problems with three changes of context in each one. Thus, in the problems generated, it is necessary to control both the complexity of the model and its adaptation to changes in the function to be learned. Analogously to the process carried out above, and shown in Fig. 5, the mean error committed by the network during its adaptation was calculated on a test set. Fig. 9 shows the mean test error obtained by the network for different combinations of regularization and adaptation levels: the x-axis represents the value of μ, whilst the y-axis shows the different values of λ. As can be derived from the obtained curves, an optimum combination of μ and λ exists which strikes an equilibrium between adaptation and simplicity of the learned model.

5. Conclusions

In this work we have presented a numerical training algorithm for single layer neural networks with nonlinear output functions. The derivation is theoretically underpinned by previously demonstrated results such as the one in [8]. As special cases, it contains the works in [8,9] and the classical Recursive Least Squares (RLS)



algorithm, which is interesting from a theoretical point of view. For practical purposes, it avoids the necessity of solving a system of linear equations each time a network update is required, as in [8,9], thus making it an easier and more efficient algorithm for distributed and concept drift scenarios. For its application to large scale learning scenarios, the proposed model complies with the property of incremental learning, making it a suitable algorithm for learning from batch data which need to be considered by parts. Finally, an initialization scheme has also been proposed which, as was demonstrated, is equivalent to introducing a Tikhonov regularization term in the training objective function. This last property makes the proposed algorithm suitable for complex high-dimensional or noisy problems, which are typically ill-posed. Experimental results show high accuracy and better performance compared with previous extensions of the RLS algorithm to nonlinear activation functions [21].

6. Acknowledgments

This work was co-supported by Xunta de Galicia under Grant Project Code CN 2011/007, and partly by the Spanish Ministry of Science and Innovation (MICINN), Grant code TIN2009-10748 (partially supported by FEDER funds). David Martínez-Rego is supported by the Spanish Ministry of Education FPU Grant Program. We would like to thank the reviewers for very constructive and detailed comments that have significantly improved the quality of the paper.

Appendix A. Algorithm and regularization term proofs

In this appendix we include the proofs of the lemmas presented in this paper.

Lemma 1. Algorithm 1 solves for the optimal weights of a single layer neural network such as the one depicted in Fig. 1 up to a first order Taylor approximation.

Proof. Following Theorem 1 proved in [8], the optimal weight vector of a single layer neural network (see Fig. 1) for a data set of inputs $X = \{x_1, x_2, \ldots, x_S\}$ and desired outputs $D = \{d_1, d_2, \ldots, d_S\}$ can be obtained by solving the following system of linear equations:

$$A_S w_S = b_S \qquad (A.1)$$

where $A_S$ and $b_S$ are defined as

$$A_S = \sum_{t=1}^{S} \lambda^{S-t} x_t x_t^T f'^2(\bar{d}_t), \qquad b_S = \sum_{t=1}^{S} \lambda^{S-t} \bar{d}_t x_t f'^2(\bar{d}_t)$$

and this solution is given by

$$w_S = A_S^{-1} b_S \qquad (A.2)$$

If we unroll Eq. (A.1) for $A_S$ and $b_S$ we obtain

$$A_S = \lambda \sum_{t=1}^{S-1} \lambda^{S-1-t} x_t x_t^T f'^2(\bar{d}_t) + x_S x_S^T f'^2(\bar{d}_S)$$

$$b_S = \lambda \sum_{t=1}^{S-1} \lambda^{S-1-t} \bar{d}_t x_t f'^2(\bar{d}_t) + x_S \bar{d}_S f'^2(\bar{d}_S)$$

and using the Woodbury identity [25],

$$A = B^{-1} + C D^{-1} C^T \implies A^{-1} = B - B C (D + C^T B C)^{-1} C^T B,$$

with $D = 1$, $B^{-1} = \lambda A_{S-1}$ and $C = f'(\bar{d}_S)\, x_S$, we have that

$$A_S^{-1} = \lambda^{-1} A_{S-1}^{-1} - \frac{\lambda^{-2} f'^2(\bar{d}_S)\, A_{S-1}^{-1} x_S x_S^T A_{S-1}^{-1}}{1 + \lambda^{-1} x_S^T A_{S-1}^{-1} x_S f'^2(\bar{d}_S)} \qquad (A.3)$$

If we rename

$$P_S = A_S^{-1}, \qquad k_S = \frac{\lambda^{-1} P_{S-1} x_S}{1 + \lambda^{-1} x_S^T P_{S-1} x_S f'^2(\bar{d}_S)} \qquad (A.4)$$

we have that

$$P_S = \lambda^{-1} \left[ P_{S-1} - k_S x_S^T P_{S-1} f'^2(\bar{d}_S) \right] \qquad (A.5)$$

$$k_S = P_S x_S \qquad (A.6)$$

since

$$k_S \left[ 1 + \lambda^{-1} x_S^T P_{S-1} x_S f'^2(\bar{d}_S) \right] = \lambda^{-1} P_{S-1} x_S \qquad (A.7)$$

$$k_S + \lambda^{-1} k_S x_S^T P_{S-1} x_S f'^2(\bar{d}_S) = \lambda^{-1} P_{S-1} x_S \qquad (A.8)$$

which using (A.5) leads to

$$k_S = \lambda^{-1} \left[ P_{S-1} - k_S x_S^T P_{S-1} f'^2(\bar{d}_S) \right] x_S = P_S x_S \qquad (A.9)$$

If we plug these results into the solution of the linear system in Eq. (A.2), we finally have that

$$\begin{aligned}
w_S = A_S^{-1} b_S &= P_S \left[ \lambda b_{S-1} + x_S \bar{d}_S f'^2(\bar{d}_S) \right] \\
&= P_S \left[ \lambda A_{S-1} w_{S-1} + x_S \bar{d}_S f'^2(\bar{d}_S) \right] \\
&= P_S \left[ \left( A_S - x_S x_S^T f'^2(\bar{d}_S) \right) w_{S-1} + x_S \bar{d}_S f'^2(\bar{d}_S) \right] \\
&= w_{S-1} - f'^2(\bar{d}_S) P_S x_S x_S^T w_{S-1} + P_S x_S \bar{d}_S f'^2(\bar{d}_S) \\
&= w_{S-1} + f'^2(\bar{d}_S) P_S x_S \left[ \bar{d}_S - x_S^T w_{S-1} \right] \\
&= w_{S-1} + k_S \alpha_S f'^2(\bar{d}_S)
\end{aligned}$$

where

$$\alpha_S = \bar{d}_S - x_S^T w_{S-1} \qquad (A.10)$$

These last two equations complete Algorithm 1 in conjunction with Eqs. (A.4) and (A.5). □
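As a concrete check of this algebra (a sketch in our own notation, not the authors' code), the recursions (A.4), (A.5) and (A.10) can be implemented for a tanh output neuron, assuming the initialization $P_0 = \delta I$, and compared against a direct solve of the system (A.1):

```python
import numpy as np

# Recursions (A.4), (A.5), (A.10) for f = tanh, with P0 = delta * I.
def train(X, d, lam=0.95, delta=100.0):
    n = X.shape[1]
    w, P = np.zeros(n), delta * np.eye(n)
    for x, dt in zip(X, d):
        dbar = np.arctanh(dt)                 # d_bar = f^{-1}(d)
        fp2 = (1.0 - dt ** 2) ** 2            # f'(d_bar)^2 = (1 - tanh^2)^2
        Px = P @ x
        k = (Px / lam) / (1.0 + (x @ Px) * fp2 / lam)   # gain, Eq. (A.4)
        P = (P - fp2 * np.outer(k, Px)) / lam           # Eq. (A.5)
        w = w + fp2 * (dbar - x @ w) * k                # Eqs. (A.10) and update
    return w

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 4))
w_true = np.array([0.5, -0.3, 0.2, 0.1])
d = np.tanh(X @ w_true)                       # noiseless nonlinear targets
w = train(X, d)
```

On noiseless data the recursive weights match both the direct weighted least squares solution and, up to the vanishing contribution of the prior term $\lambda^S P_0^{-1}$, the generating weights.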

Lemma 2. Parameter $\delta$ in Algorithm 1 is inversely proportional to the value $\mu$ in the following extended error function:

$$\min_w \; (\bar{d} - X^T w)^T F \Lambda (\bar{d} - X^T w) + \mu w^T w \qquad (A.11)$$

where $\bar{d} = f^{-1}(d)$, $X$ is a matrix with the data patterns $\{x_1, x_2, \ldots, x_S\}$ as columns, $\Lambda$ is a diagonal matrix with diagonal elements $\Lambda_{ii} = \lambda^{S-i}$ for $i = 1, \ldots, S$, and $F$ is a diagonal matrix with $F_{ii} = f'^2(\bar{d}_i)$.

Proof. In order to find the minimum of expression (A.11), we take the derivative and equate it to zero, arriving at the following expression:

$$(\mu I + X F \Lambda X^T)\, w = X F \Lambda \bar{d} \qquad (A.12)$$

Therefore, in order to obtain the optimal weights, we have to solve the following system:

$$A_S w_S = b_S \qquad (A.13)$$

where $A_S$ and $b_S$ are defined as

$$A_S = \mu I + \sum_{t=1}^{S} \lambda^{S-t} x_t x_t^T f'^2(\bar{d}_t), \qquad b_S = \sum_{t=1}^{S} \lambda^{S-t} \bar{d}_t x_t f'^2(\bar{d}_t)$$



Using these expressions for $A_S$ and $b_S$, the derivation of the previous proof applies unchanged and yields Algorithm 1. Specifically, in this case we have an expression for $P_0$:

$$P_0 = A_0^{-1} = (\mu I)^{-1} = \frac{1}{\mu} I \qquad (A.14)$$

As we can observe, this expression corresponds to the initialization detailed in Algorithm 1 with $\delta = 1/\mu$. □
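Lemma 2 can be verified numerically. The sketch below (our own, with a linear output so that $f' = 1$ and $F = I$) runs the recursion from $P_0 = \delta I$ with $\lambda = 1$ and compares it against the batch Tikhonov-regularized solution of (A.12) with $\mu = 1/\delta$:

```python
import numpy as np

# RLS with no forgetting (lam = 1) started from P0 = delta * I.
def rls(X, d, delta):
    n = X.shape[1]
    w, P = np.zeros(n), delta * np.eye(n)
    for x, dt in zip(X, d):
        Px = P @ x
        k = Px / (1.0 + x @ Px)
        w = w + k * (dt - x @ w)
        P = P - np.outer(k, Px)
    return w

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 5))
d = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(30)

delta = 10.0
w_rec = rls(X, d, delta)
# Batch ridge solution of (A.12) with F = Lambda = I and mu = 1 / delta
w_ridge = np.linalg.solve(np.eye(5) / delta + X.T @ X, X.T @ d)
```

The two weight vectors agree to numerical precision, confirming that the initialization parameter acts as the inverse of a Tikhonov regularization strength.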

References

[1] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence, Dataset Shift in Machine Learning, The MIT Press, 2009.
[2] Distributed data mining project, 2011. URL http://www.distributeddatamining.org/Challenges
[3] J. Predd, S. Kulkarni, H. Poor, A collaborative training algorithm for distributed learning, IEEE Transactions on Information Theory 55 (4) (2009) 1856–1871.
[4] D. Caragea, Learning Classifiers from Distributed, Semantically Heterogeneous, Autonomous Data Sources, Ph.D. Thesis, Department of Computer Science, Iowa State University, Ames, IA, USA, 2004.
[5] C.C. Aggarwal, P.S. Yu, Privacy-Preserving Data Mining: Models and Algorithms, Springer Publishing Company, 2008.
[6] P. Sun, X. Yao, Sparse approximation through boosting for learning large scale kernel machines, IEEE Transactions on Neural Networks 21 (6) (2010) 883–894.
[7] Pascal large scale learning challenge, 2008. URL http://largescale.ml.tu-berlin.de/about/
[8] O. Fontenla-Romero, B. Guijarro-Berdiñas, B. Pérez-Sánchez, A. Alonso-Betanzos, A new convex objective function for the supervised learning of single-layer neural networks, Pattern Recognition 43 (2010) 1984–1992.
[9] D. Martínez-Rego, O. Fontenla-Romero, A. Alonso-Betanzos, B. Pérez-Sánchez, A robust incremental learning method for non-stationary environments, Neurocomputing 74 (11) (2011) 1800–1808.
[10] M.H. Hayes, Recursive least squares, in: Statistical Digital Signal Processing and Modeling, Wiley, 1986, pp. 541–551.
[11] C. Kamali, A. Pashilkar, J. Raol, Evaluation of recursive least squares algorithm for parameter estimation in aircraft real time applications, Aerospace Science and Technology 15 (3) (2011) 165–174.
[12] J. Gomes, V. Barroso, Array-based QR-RLS multichannel lattice filtering, IEEE Transactions on Signal Processing 56 (8) (2008) 3510–3522.
[13] W.-C. Yu, N.-Y. Shih, Bi-loop recursive least squares algorithm with forgetting factors, IEEE Signal Processing Letters 13 (8) (2006) 505–508.
[14] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[15] E. Alpaydin, Introduction to Machine Learning (Adaptive Computation and Machine Learning), The MIT Press, 2004.
[16] A.N. Tikhonov, Solution of incorrectly formulated problems and the regularization method, Soviet Mathematics Doklady 4 (1963) 1035–1038.
[17] J. Jiang, R. Cook, Fast parameter tracking RLS algorithm with high noise immunity, Electronics Letters 28 (22) (1992) 2043–2045.
[18] S.-H. Leung, C.F. So, Gradient-based variable forgetting factor RLS algorithm in time-varying environments, IEEE Transactions on Signal Processing 53 (8) (2005) 3141–3150.
[19] J. Hirayama, J. Yoshimoto, S. Ishii, Balancing plasticity and stability of on-line learning based on hierarchical Bayesian adaptation of forgetting factors, Neurocomputing 69 (2006) 1954–1961.
[20] A. Frank, A. Asuncion, UCI Machine Learning Repository, 2010. URL http://archive.ics.uci.edu/ml
[21] C.-S. Leung, K.-W. Wong, P.-F. Sum, L.-W. Chan, A pruning method for the recursive least squared algorithm, Neural Networks 14 (2001) 147–174.
[22] S. Shah, F. Palmieri, M. Datum, Optimal filtering algorithm for fast learning in feedforward neural networks, Neural Networks 5 (1992) 779–787.
[23] M. Mackey, L. Glass, Oscillation and chaos in physiological control systems, Science 197 (1977) 287.
[24] E.N. Lorenz, Deterministic nonperiodic flow, Journal of the Atmospheric Sciences 20 (2) (1963) 130–141.
[25] G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins University Press, Baltimore, MD, 1996.
[26] T.R. Hoens, N.V. Chawla, R. Polikar, Heuristic updatable weighted random subspaces for non-stationary environments, in: IEEE International Conference on Data Mining (ICDM'11), 2011, pp. 241–250.
[27] R. Elwell, R. Polikar, Incremental learning of concept drift in nonstationary environments, IEEE Transactions on Neural Networks 22 (10) (2011) 1517–1531.
[28] U.S. National Oceanic and Atmospheric Administration, Federal Climate Complex Global Surface Summary of Day Data. Available via FTP: ftp.ncdc.noaa.gov/pub/data/gsod
[29] W.N. Street, Y. Kim, A streaming ensemble algorithm (SEA) for large-scale classification, in: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 377–382.

David Martínez-Rego was born in Ferrol, Spain, in 1984. He received his M.S. degree in computer science from the University of A Coruña in 2008. He has worked as research staff, with an FPU grant of the Ministerio de Ciencia e Innovación, at the Department of Computer Science of the University of A Coruña since 2009. His current research interests include theoretical work on neural networks, predictive maintenance and learning optimization.

Oscar Fontenla-Romero was born in Ferrol, Spain, in 1974. He received his B.S., M.S. and Ph.D. degrees in computer science from the University of A Coruña in 1997, 1998 and 2002, respectively. He has worked as an Associate Professor at the Department of Computer Science of the University of A Coruña since 2004. His current research interests include new linear learning methods and noise immunity for neural networks and functional networks. His main current areas are neural networks, functional networks and nonlinear optimization.

Amparo Alonso-Betanzos (M'88) was born in Vigo, Spain, in 1961. She graduated with a degree in chemical engineering from the University of Santiago, Spain, in 1984. In 1985 she joined the Department of Applied Physics, Santiago de Compostela, Spain, where she received the M.S. degree for work in monitoring and control of biomedical signals. In 1988 she received the Ph.D. (cum laude and premio extraordinario) degree for work in the area of medical expert systems. From 1988 through 1990 she was a postdoctoral fellow in the Department of Biomedical Engineering Research, Medical College of Georgia, Augusta. She is currently a Full Professor in the Department of Computer Science, University of A Coruña. Her main current areas are hybrid intelligent systems, intelligent multi-agent systems, linear optimization methods and entropy-based cost functions for neural networks and functional networks. Dr. Alonso-Betanzos is a member of various scientific societies, including the ACM and IEEE.